Hi Slurm Support,

How are you doing? We have discovered that we lost significant transaction records (as seen via sreport) after these events:

1) https://bugs.schedmd.com/show_bug.cgi?id=8033
2) Upgrade from v18.08.06 to v19.05.04 on 2019-11-20 16:00:54 MST

Some findings: records before 11th Sep 2019 are no longer there.

sreport cluster AccountUtilizationByUser -t hours start=2019-01-01 end=2019-09-10
--------------------------------------------------------------------------------
Cluster/Account/User Utilization 2019-01-01T00:00:00 - 2019-09-09T23:59:59 (21776400 secs)
Usage reported in CPU Hours
--------------------------------------------------------------------------------
  Cluster         Account     Login     Proper Name         Used   Energy
--------- --------------- --------- --------------- ------------ --------

Records after 11th Sep 2019:

sreport cluster AccountUtilizationByUser -t hours start=2019-09-11 end=2019-09-12
--------------------------------------------------------------------------------
Cluster/Account/User Utilization 2019-09-11T00:00:00 - 2019-09-11T23:59:59 (86400 secs)
Usage reported in CPU Hours
--------------------------------------------------------------------------------
  Cluster         Account     Login     Proper Name      Used   Energy
--------- --------------- --------- --------------- --------- --------
       m3            root                              83995        0
       m3            p001                               1774        0
    .....           .....                              .....

[damienl@m3-login1 ~]$ which sreport
/opt/slurm-19.05.4/bin/sreport
[damienl@m3-login1 ~]$ sreport -V
slurm 19.05.4

We understand that from bug 8033 there will be some lost records, but we are not sure how much is lost. Or is this a separate bug from the upgrade?

I will be supplying you with more configs and logs. Kindly investigate and advise.

Many Thanks
Damien
Created attachment 12532 [details] Current slurmdbd.conf
Created attachment 12533 [details] Current slurm.conf
Created attachment 12534 [details] slurmdbd logs
Created attachment 12536 [details] More slurmdbd logs
Created attachment 12537 [details] mariadb.log. from sql server
This is taken at 8th Nov 2019:

MariaDB [slurm_acct_db]> SELECT COUNT(*) FROM m3_job_table;
+----------+
| COUNT(*) |
+----------+
| 11701937 |
+----------+

Kindly let us know if you need clarification or more logs.

Thanks
Damien
Hi Damien,

> We understand that from bug 8033, there will be some lost records, but not
> sure how much is lost. Or is this a separate bug from the upgrade ?

The accounting information lost in bug 8033 is limited to the specific runaway jobs detected there. I'm pretty sure that not ALL the jobs were runaway jobs for such a long period (2019-01-01 to 2019-09-10), so something else happened that looks unrelated to bug 8033.

This looks like a problem related to the rollup (the aggregation process that turns job records into the info shown by sreport), but I still need to look into the logs and config that you provided to be certain. Can you get jobs for that period with sacct?

> MariaDB [slurm_acct_db]> SELECT COUNT(*) FROM m3_job_table;
> +----------+
> | COUNT(*) |
> +----------+
> | 11701937 |

Thanks for that information. Could you also post the output of this other SQL command:

MariaDB [slurm_acct_db_1905]> SELECT COUNT(*) FROM m3_job_table WHERE time_start > 1546300800 && time_start < 1568073600;

This will tell us if the raw job information is still there, so that we can restore the aggregated info usually shown by sreport.

Regards,
Albert
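In case it helps to double-check the bounds, the two epoch values in that query should correspond to 2019-01-01 and 2019-09-10 (in UTC; this sketch assumes UTC, adjust if you prefer local time). A quick way to verify them with the date command:

$ date -u -d @1546300800
Tue Jan  1 00:00:00 UTC 2019
$ date -u -d @1568073600
Tue Sep 10 00:00:00 UTC 2019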
MariaDB [(none)]> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| brick              |
| db_backup          |
| mysql              |
| performance_schema |
| slurm_acct_db      |
| test               |
+--------------------+
7 rows in set (0.02 sec)

MariaDB [(none)]> use slurm_acct_db;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
MariaDB [slurm_acct_db]> SELECT COUNT(*) FROM m3_job_table;
+----------+
| COUNT(*) |
+----------+
| 12082135 |
+----------+
1 row in set (4.75 sec)

MariaDB [slurm_acct_db]> SELECT COUNT(*) FROM m3_job_table WHERE time_start > 1546300800 && time_start < 1568073600;
+----------+
| COUNT(*) |
+----------+
|  3286576 |
+----------+
1 row in set (21.96 sec)

MariaDB [slurm_acct_db]>
Hi Albert,

We seem to have lost last year's records too. Details:

sreport cluster AccountUtilizationByUser -t hours start=2018-01-01 end=2018-09-10
--------------------------------------------------------------------------------
Cluster/Account/User Utilization 2018-01-01T00:00:00 - 2018-09-09T23:59:59 (21776400 secs)
Usage reported in CPU Hours
--------------------------------------------------------------------------------
  Cluster         Account     Login     Proper Name        Used   Energy
--------- --------------- --------- --------------- ----------- --------

[damienl@m3-login1 etc]$

Cheers
Damien
Hi Damien,

The bad news is that I don't see any error in the logs that you sent me that could explain your problem. It probably happened before the dates covered by those logs. The good news is that, from your comment 9, it looks like we will be able to restore that information.

I will focus on restoring the info, but if you send me more slurmdbd logs from dates when you know it was working and from when it started to fail, I will try to find out why it happened.

Before explaining how to restore it, I would need:

1) The oldest date that you want to recover the sreport info from.
2) The output of this command:

MariaDB [slurm_acct_db_1905]> select * from m3_last_ran_table;

Also some collateral comments:

- I've marked your slurmdbd.conf as private. Remember to remove the StoragePass from that file or mark it as private in the future.
- I've seen that you use DebugLevel=9 for slurmdbd. That's really not necessary.
- I've seen that you use DebugFlags=gres. If you have a specific issue, it can make sense. But in general we recommend using "scontrol setdebugflags +gres" to troubleshoot something, and "scontrol setdebugflags -gres" when you are done. Changing slurm.conf for that is not recommended, because it forces you to reconfigure and because the flag will stay there if you forget about it.

Regards,
Albert
Hi Albert

Thanks for your reply. We kept only 7 days' worth of slurmdbd.log, so it might not be sufficient for this investigation. Nevertheless, I will still cloudsend* these files to albert.gil@schedmd.com as they are quite huge.

Ideally, we want to go back as far as possible (this database started in late 2015), but in this case, backtrack to at least 2018-01-01.

The attached slurmdbd.conf has a fake password.

I will attempt to downgrade both debug settings. Which values would be more suitable? Kindly advise.

Thanks
Damien
MariaDB [(none)]> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| brick              |
| db_backup          |
| mysql              |
| performance_schema |
| slurm_acct_db      |
| test               |
+--------------------+
7 rows in set (0.02 sec)

MariaDB [(none)]> use slurm_acct_db;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
MariaDB [slurm_acct_db]> select * from m3_last_ran_table;
+---------------+--------------+----------------+
| hourly_rollup | daily_rollup | monthly_rollup |
+---------------+--------------+----------------+
|    1576284158 |   1576284158 |     1576284158 |
+---------------+--------------+----------------+
1 row in set (0.00 sec)

MariaDB [slurm_acct_db]>
Hi Albert

I have sent the logs to you via cloudsend*. Please see whether this is sufficient.

Thanks
Damien
Hi Albert

You might also want to refer to the attachment "current slurmdbd.log" in https://bugs.schedmd.com/show_bug.cgi?id=8033. Those logs are from when we had just upgraded to v19.05.04.

I hope this is helpful.

Thanks
Damien
Hi Damien,

> Ideally, we want to go back as much as possible (This database started late 2015), but in this case, backtrack to at least 2018-01-01.

I'm pretty sure that the data is there, because you don't have PurgeJobAfter or PurgeResvAfter set, but just in case, can you run this command to confirm:

MariaDB [slurm_acct_db_1905]> SELECT COUNT(*) FROM m3_job_table WHERE time_start > 1514764800 && time_start < 1517443200;

This command will return a number close to the number of jobs started in Jan 2018. If the number makes sense (i.e. is not zero), we will restore the information.

BUT, before that we need to fix an IMPORTANT issue in your setup that could obstruct the restore process: I've noticed that your logrotate for slurmdbd is using SIGHUP to signal the daemon. That has not been recommended since 17.11; you should use SIGUSR2 instead:

https://github.com/SchedMD/slurm/blob/slurm-17-11-13-2/RELEASE_NOTES#L221
https://slurm.schedmd.com/slurm.conf.html#SECTION_LOGGING
https://slurm.schedmd.com/SLUG17/FieldNotes.pdf

This is especially important because the rollup process to restore data from 2018-01 will take several days, and a SIGHUP may interfere with it.

Once we know that the job info to restore the sreport data from is in the DB, and that logrotate won't interfere with the rollup process, we will be ready to start the rollup.

Regards,
Albert
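For reference, a minimal postrotate stanza sending SIGUSR2 could look roughly like the sketch below (the log path and the rotation options are only illustrative assumptions; keep your site's existing options and just change the signal):

/mnt/slurm-logs/slurmdbd.log {
    missingok
    notifempty
    rotate 5
    size=5M
    postrotate
        pkill -x --signal SIGUSR2 slurmdbd
    endscript
}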
MariaDB [(none)]>
MariaDB [(none)]>
MariaDB [(none)]> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| brick              |
| db_backup          |
| mysql              |
| performance_schema |
| slurm_acct_db      |
| test               |
+--------------------+
7 rows in set (0.02 sec)

MariaDB [(none)]> use slurm_acct_db;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
MariaDB [slurm_acct_db]> SELECT COUNT(*) FROM m3_job_table WHERE time_start > 1514764800 && time_start < 1517443200;
+----------+
| COUNT(*) |
+----------+
|   123964 |
+----------+
1 row in set (30.55 sec)

MariaDB [slurm_acct_db]>
$ pwd
/etc/logrotate.d
$ cat slurmdb
/mnt/slurm-logs/slurmdbd.log {
    compress
    missingok
    nocopytruncate
    nocreate
    nodelaycompress
    nomail
    notifempty
    noolddir
    rotate 5
    sharedscripts
    size=5M
    create 640 slurm root
    postrotate
        pkill -x --signal SIGUSR2 slurmdbd
    endscript
}
Hi Damien,

Ok, it seems that we are ready to start the rollup process. These are the steps to follow:

0) Enable DB_USAGE in the DebugFlags of slurmdbd.conf, and leave DebugLevel at Info.

1) Stop the slurmdbd.

2) Update the m3_last_ran_table with the desired <EPOCH_TIME> using the command:

MariaDB [slurmdbd]> UPDATE m3_last_ran_table SET hourly_rollup = <EPOCH_TIME>, daily_rollup = <EPOCH_TIME>, monthly_rollup = <EPOCH_TIME>;

Note that <EPOCH_TIME> must be the integer representing the epoch time that you want to recover your sreport info from. You can use epochconverter.com or a similar tool (e.g. the date command [1]) to obtain the desired number. Since you want to restore data from 2018-01-01, the <EPOCH_TIME> should be 1514761200 (in my timezone).

3) Start slurmdbd.

4) Monitor the slurmdbd:
4.1) Watch the slurmdbd log and check the lines with "m3 curr hour is now <EpochTime1>-<EpochTime2>". These lines will give you the current status/date of the rollup.
4.2) With mytop and similar tools, check that everything is working fine at the SQL level.
4.3) With "sdiag" and "sacctmgr show stats" you can also see if the DB is working normally, or if some user is sending a lot of huge queries to it.

For two years of data it may take several days to complete; it will depend on how many jobs per day you have and on the performance of your DB. Keep an eye on the logs to be sure that the process is going well, and let me know how it works.

Regards,
Albert

[1]
$ date "+%s" -d "01/01/2018 00:00:00"
1514761200
$ date -d @1514761200
Mon Jan 1 00:00:00 CET 2018
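A condensed sketch of steps 1) to 3), assuming slurmdbd is managed through systemd as shown elsewhere in this ticket, and that the mysql client can reach the slurm_acct_db database non-interactively (otherwise run the UPDATE from the interactive MariaDB prompt as above):

# 1) stop the daemon
systemctl stop slurmdbd

# 2) point the rollup back to 2018-01-01 (epoch 1514761200 in my timezone; adjust to yours)
mysql slurm_acct_db -e "UPDATE m3_last_ran_table SET hourly_rollup = 1514761200, daily_rollup = 1514761200, monthly_rollup = 1514761200;"

# 3) start the daemon again and watch the rollup progress in the log
systemctl start slurmdbd
tail -f /mnt/slurm-logs/slurmdbd.log | grep "curr hour is now"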
Hi Albert

Thanks for these instructions and advice. Some questions related to this:

1) When doing these record restorations, how will it affect our daily Slurm operations? srun, sbatch, sacctmgr, sacct?

2) If I do this:
MariaDB [slurmdbd]> UPDATE m3_last_ran_table SET hourly_rollup = <EPOCH_TIME>, daily_rollup = <EPOCH_TIME>, monthly_rollup = <EPOCH_TIME>;
how will it affect the current records in the system? Any effects?

3) If I do the above for start-date=2018-01-01, how will it behave if we then do start-date=2016-01-01 in the future?

4) To what extent are one user's many queries to sacct or squeue going to affect Slurm or its database?

5) What is causing these lost records? Bugs? Database faults? The upgrade procedure?

Kindly advise.

Thanks
Damien
Hi Damien,

> 1) When doing these record restorations, how will it affect our daily Slurm
> operations ? srun, sbatch, sacctmgr, sacct ?

It will have almost no effect. The slurmdbd already runs a rollup process (thread) every hour to update the information that you see in sreport. As it runs every hour, it usually takes just seconds to finish. The main difference is that now we are going to ask slurmdbd to run a rollup from 2018-01-01, and that will take a long time. The only effect you will see is that the sreport info won't be updated until the full, long rollup is completed, unlike in the normal scenario where this information is updated every hour.

> 2) If I do this,
> MariaDB [slurmdbd]> UPDATE m3_last_ran_table SET hourly_rollup = <EPOCH_TIME>,
> daily_rollup = <EPOCH_TIME>, monthly_rollup = <EPOCH_TIME>;
> How will it affects my current records in the system now ? any effects ?

First of all, to avoid any confusion, please note that you should replace <EPOCH_TIME> with the actual epoch time number representing the date that you want to recover from (e.g. 1514761200). This command won't affect any record in the system, but it will trigger a rollup that recomputes the aggregated sreport information from the actual records of each job (which remain untouched). So only the sreport information is updated.

> 3) If I do this,
> MariaDB [slurmdbd]> UPDATE m3_last_ran_table SET hourly_rollup = <EPOCH_TIME>,
> daily_rollup = <EPOCH_TIME>, monthly_rollup = <EPOCH_TIME>;
> For start-date=2018-01-01 , How will it reacts if we are doing
> start-date=2016-01-01 in the future ?

The rollup process recomputes/restores sreport info from the date that we set, up to the current time. So if you want to restore the info from 2016-01-01, you should probably do it now: if you do it later, you will recompute the 2018-01-01-to-now range a second time, and that could take quite a lot of extra time (days).

> 4) In what extent when one user's many queries to sacct or squeue is going
> to affect Slurm or its database ?

That's a general issue. The amount of RPC traffic that slurmctld and slurmdbd receive and respond to can be significant. Sometimes users issue such aggressive queries that it leads to a degradation of the responsiveness of the system (i.e. a kind of unintentional DoS attack). In particular, big and constant sacct queries can degrade the responsiveness of slurmdbd (or the SQL backend), and constant squeue queries the responsiveness of slurmctld. That's why sdiag and "sacctmgr show stats" report the RPCs by user: if responsiveness is being affected, admins can detect those aggressive queries and ask the users to avoid them. I only mention it because we want slurmdbd as healthy as possible while running the long rollup; if it is stopped, we would need to restart the long rollup from the beginning. And you won't like it! ;-)

> 5) What is causing these lost records ? bugs? Database faults? Upgrading
> procedure?

That's the big question that we are working to answer. Your issue is strange, but we've seen it more than once, so we are trying to discover what could lead to it. It looks like a corner case of specific signaling done by logrotate while slurmdbd is doing special rollups due to runaway jobs or similar scenarios. But we are still working on it.

Please keep me updated about how the long rollup goes,
Albert
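A simple way to keep an eye on that RPC load while the long rollup runs might be something like the following (the 60-second interval is just an assumption; any similar periodic check of sdiag and "sacctmgr show stats" works):

# refresh controller and dbd statistics every minute
watch -n 60 'sdiag | head -20; echo; sacctmgr show stats | head -15'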
Hi Albert

How do we know when the rollup process has completed (after a few days)? Is there a special string ("Done"?) in slurmdbd.log? What should we look out for?

Cheers
Damien
Hi Damien,

> How do we know the rollup process has completed (after a few days) ? Is
> there a special string (Done ?) at the slurmdbd.log ?
> Or what should we look out ?

The simplest way is to check the slurmdbd.log lines:

m3 curr hour is now <EpochTime1>-<EpochTime2>

If those epoch times reach the current hour, the long rollup has ended and you can check with sreport whether it finished properly. The actual last log line of a long rollup should be:

(as_mysql_usage.c:372) query update "m3_last_ran_table" set hourly_rollup=<CurrentEpochTime>, daily_rollup=<CurrentEpochTime>, monthly_rollup=<CurrentEpochTime>

You can also check the m3_last_ran_table directly in SQL. But my recommendation is to tail slurmdbd.log and keep an eye on the values in the "m3 curr hour is now <EpochTime1>-<EpochTime2>" lines. This is the best way to see the progress and to see when it's done.

Regards,
Albert
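For example, a quick (purely illustrative) way to see how far the rollup has advanced, assuming the log path from your slurmdbd.conf:

# last reported rollup window
grep "curr hour is now" /mnt/slurm-logs/slurmdbd.log | tail -1
# convert an epoch value from that line into a readable date
date -d @1514761200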
Hi Damien,

Has the rollup started?

Have a nice day,
Albert
Hi Albert

Thanks for the follow-up. We are still working on it.

Cheers
Damien
Hi Slurm Support

How are you doing? Happy New Year 2020. I am following up on this. We started the rollup process a few days ago:

MariaDB [slurm_acct_db]> UPDATE m3_last_ran_table SET hourly_rollup = 1514725200, daily_rollup = 1514725200, monthly_rollup = 1514725200;
Query OK, 1 row affected (0.01 sec)
Rows matched: 1  Changed: 1  Warnings: 0

MariaDB [slurm_acct_db]>

So far, I have failed to notice any "m3 curr hour is now <EpochTime1>" message inside our slurmdbd.logs.

In addition, I can observe these errors:

cat slurmdbd.log | grep error
[2020-01-02T03:21:53.184] error: We have more time than is possible (7624746+1209600+0)(8834346) > 8712000 for cluster m3(2420) from 2018-05-07T02:00:00 - 2018-05-07T03:00:00 tres 1
[2020-01-02T03:22:04.458] error: We have more time than is possible (8048961+1196640+0)(9245601) > 8712000 for cluster m3(2420) from 2018-05-07T03:00:00 - 2018-05-07T04:00:00 tres 1
[2020-01-02T03:22:15.952] error: We have more time than is possible (8115044+1209600+0)(9324644) > 8712000 for cluster m3(2420) from 2018-05-07T04:00:00 - 2018-05-07T05:00:00 tres 1

Any advice on these issues?

Thanks
Damien
Created attachment 12649 [details] Current slurmdbd.log
Created attachment 12650 [details] yesterday's slurmdbd log
Hi Damien,

> How are you doing ? Happy New Year 2020.

Good, thanks! Hope you are doing well too! :-))

> We have started the rollup process for a few days:
>
> MariaDB [slurm_acct_db]> UPDATE m3_last_ran_table SET hourly_rollup =
> 1514725200, daily_rollup = 1514725200, monthly_rollup = 1514725200;
>
> So far, I have fail to notice any "m3 curr hour is now <EpochTime1>" message
> inside our slurmdbd.logs

I think that you skipped step 0) of comment 21:

> 0) Enable DB_USAGE as DebugFlags of slurmdbd.conf, and leave DebugLevel as Info.

The lines "m3 curr hour is now <EpochTime1>" will only appear if DB_USAGE is enabled. Also, I can see a lot of high-debug-level logs, which will be too verbose. Setting DebugLevel to Info will be better.

What is the current output of:

MariaDB [slurm_acct_db_1905]> select * from m3_last_ran_table;

Most probably the rollup is running, but we cannot see it. We can change DebugFlags and DebugLevel and SIGHUP the daemon to see the new output.

> In addition, I can observed these errors:
>
> cat slurmdbd.log |grep error
> [2020-01-02T03:21:53.184] error: We have more time than is possible
> (7624746+1209600+0)(8834346) > 8712000 for cluster m3(2420) from
> 2018-05-07T02:00:00 - 2018-05-07T03:00:00 tres 1
> [2020-01-02T03:22:04.458] error: We have more time than is possible
> (8048961+1196640+0)(9245601) > 8712000 for cluster m3(2420) from
> 2018-05-07T03:00:00 - 2018-05-07T04:00:00 tres 1
> [2020-01-02T03:22:15.952] error: We have more time than is possible
> (8115044+1209600+0)(9324644) > 8712000 for cluster m3(2420) from
> 2018-05-07T04:00:00 - 2018-05-07T05:00:00 tres 1

The dates shown (2018-05) tell me that these errors occurred some time ago and are being shown again because the rollup is re-processing that period. These errors are usually related to reservations, and we are working on them (e.g. bug 6839). At this point, my recommendation is to ignore them while their dates are that old; once the rollup ends, if they still happen, we will focus on them.

Regards,
Albert
MariaDB [(none)]>
MariaDB [(none)]> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| brick              |
| db_backup          |
| mysql              |
| performance_schema |
| slurm_acct_db      |
| test               |
+--------------------+
7 rows in set (0.09 sec)

MariaDB [(none)]> use slurm_acct_db;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
MariaDB [slurm_acct_db]> select * from m3_last_ran_table;
+---------------+--------------+----------------+
| hourly_rollup | daily_rollup | monthly_rollup |
+---------------+--------------+----------------+
|    1568076288 |   1568076288 |     1568076288 |
+---------------+--------------+----------------+
1 row in set (0.00 sec)

MariaDB [slurm_acct_db]>
Hi Albert,

Thanks for your reply. Should I repeat steps 0-4 again?

Cheers
Damien
Hi Damien,

> MariaDB [slurm_acct_db]> select * from m3_last_ran_table;
> +---------------+--------------+----------------+
> | hourly_rollup | daily_rollup | monthly_rollup |
> +---------------+--------------+----------------+
> | 1568076288 | 1568076288 | 1568076288 |
> +---------------+--------------+----------------+

I'm not sure about this output. I expected either the value that you set (1514725200) or the current time. But maybe it makes sense if slurmdbd was signaled in the middle of the rollup process. Have you signaled the slurmdbd process with a SIGHUP as I mentioned before?

> Should I repeat steps 0-4 again?

If you have done the SIGHUP, the logs should now tell us about the progress of the rollup process. Let me see today's log again and then we can decide. Repeating the steps won't hurt, but maybe the rollup is almost done and we would be restarting it from the beginning.

Regards,
Albert
Hi Albert,

These are today's figures:

MariaDB [slurm_acct_db]> select * from m3_last_ran_table;
+---------------+--------------+----------------+
| hourly_rollup | daily_rollup | monthly_rollup |
+---------------+--------------+----------------+
|    1568076288 |   1568076288 |     1568076288 |
+---------------+--------------+----------------+
1 row in set (0.00 sec)

MariaDB [slurm_acct_db]>

It is the same as yesterday.

Cheers
Damien
Hi Damien,

The values of m3_last_ran_table are informative, but not definitive. My recommendation is:

1) Edit your slurmdbd.conf to set DebugFlags=DB_USAGE and DebugLevel=Info
2) Signal the slurmdbd daemon with SIGHUP so that the above changes are loaded, with a command like:

$ kill -s SIGHUP <slurmdbd PID>

3) Wait some minutes to gather some logs with the new configuration, and send me those logs.

With the new logs I will confirm whether the rollup is actually running or not. If it is running, we will only need to wait, and we will be able to monitor it as explained in comment 21. If it is not, we will need to repeat steps 0) to 4) of comment 21.

If you prefer, we can go directly to repeating steps 0) to 4) of comment 21. That would mean restarting the rollup from the beginning. It's not a problem in the sense that data won't be lost, but it may not be necessary and it will take longer to complete. If you follow the three steps recommended above, I will tell you whether it is necessary.

Regards,
Albert
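To avoid looking up the PID by hand, something like this should work for step 2) (the pidof/pkill usage is an assumption based on the logrotate script earlier in this ticket):

# reload slurmdbd's logging configuration without restarting it
kill -s SIGHUP "$(pidof slurmdbd)"
# or, equivalently:
pkill -x --signal SIGHUP slurmdbd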
# cat slurmdbd.conf
#
# Authentication info
AuthType=auth/munge
#AuthInfo=/var/run/munge/munge.socket.2
#
# slurmDBD info
#DbdAddr=
DbdHost=m3-mgmt2
DbdBackupHost=m3-mgmt1
#DbdPort=7031
SlurmUser=slurm
#MessageTimeout=300
#DefaultQOS=normal,standby
DebugLevel=info
DebugFlags=DB_USAGE
Created attachment 12664 [details] Current slurmdbd logs

Please examine this.

Thanks
Hi Albert,

Should we attempt this task again?

UPDATE m3_last_ran_table SET hourly_rollup = 1514725200, daily_rollup = 1514725200, monthly_rollup = 1514725200;

Cheers
Damien
> Should we attempt this task again ?
>
> UPDATE m3_last_ran_table SET hourly_rollup = 1514725200, daily_rollup =
> 1514725200, monthly_rollup = 1514725200;

No. In your new logs I can see that the rollup is almost finished:

[2020-01-04T00:28:24.457] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1575428400-1575432000

That is, the rollup is running and already in Dec-2019 (1575428400). So in some minutes or hours it will be done.

Let's keep an eye on these log lines ("m3 curr hour is now"), and once they reach the current time, let's check whether your sreport is working again.

Regards,
Albert
Hi Damien,

> In your new logs I can see that the rollup is almost finished:
>
> [2020-01-04T00:28:24.457] 0(as_mysql_rollup.c:1072) m3 curr hour is now
> 1575428400-1575432000
>
> That is, the rollup is running and already in Dec-2019 (1575428400).
> So, in some minutes or hours it will be done.
>
> Let's keep an eye on these logs lines "m3 curr hour is now", and once they
> arrive to current time, let's check if your sreport is working again.

Actually, after a closer look at your last logs, I can see this:

[2020-01-04T00:17:45.276] Terminate signal (SIGINT or SIGTERM) received
[2020-01-04T00:18:59.820] slurmdbd version 19.05.4 started
[2020-01-04T00:18:59.822] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1568073600-1568077200

So it seems that slurmdbd was signaled/killed with something other than SIGHUP, right? In that case, I'm not sure the rollup process will complete properly. Let's wait a little bit, and then, if necessary, we will repeat comment 21.

Regards,
Albert
Hi Albert, Thanks for the details. The current rollup is: # grep curr slurmdbd.log .... .... .... [2020-01-04T00:31:00.954] 0(as_mysql_rollup.c:1755) curr day is now 1575464400-1575550800 [2020-01-04T00:31:01.603] 0(as_mysql_rollup.c:1755) curr day is now 1575550800-1575637200 [2020-01-04T00:31:02.251] 0(as_mysql_rollup.c:1755) curr day is now 1575637200-1575723600 [2020-01-04T00:31:02.898] 0(as_mysql_rollup.c:1755) curr day is now 1575723600-1575810000 [2020-01-04T00:31:03.545] 0(as_mysql_rollup.c:1755) curr day is now 1575810000-1575896400 [2020-01-04T00:31:04.193] 0(as_mysql_rollup.c:1755) curr day is now 1575896400-1575982800 [2020-01-04T00:31:04.841] 0(as_mysql_rollup.c:1755) curr day is now 1575982800-1576069200 [2020-01-04T00:31:05.494] 0(as_mysql_rollup.c:1755) curr day is now 1576069200-1576155600 [2020-01-04T00:31:06.145] 0(as_mysql_rollup.c:1755) curr day is now 1576155600-1576242000 [2020-01-04T00:31:06.795] 0(as_mysql_rollup.c:1755) curr day is now 1576242000-1576328400 [2020-01-04T00:31:07.770] 0(as_mysql_rollup.c:1755) curr day is now 1576328400-1576414800 [2020-01-04T00:31:08.417] 0(as_mysql_rollup.c:1755) curr day is now 1576414800-1576501200 [2020-01-04T00:31:09.068] 0(as_mysql_rollup.c:1755) curr day is now 1576501200-1576587600 [2020-01-04T00:31:09.718] 0(as_mysql_rollup.c:1755) curr day is now 1576587600-1576674000 [2020-01-04T00:31:10.371] 0(as_mysql_rollup.c:1755) curr day is now 1576674000-1576760400 [2020-01-04T00:31:11.022] 0(as_mysql_rollup.c:1755) curr day is now 1576760400-1576846800 [2020-01-04T00:31:11.672] 0(as_mysql_rollup.c:1755) curr day is now 1576846800-1576933200 [2020-01-04T00:31:12.320] 0(as_mysql_rollup.c:1755) curr day is now 1576933200-1577019600 [2020-01-04T00:31:12.968] 0(as_mysql_rollup.c:1755) curr day is now 1577019600-1577106000 [2020-01-04T00:31:13.618] 0(as_mysql_rollup.c:1755) curr day is now 1577106000-1577192400 [2020-01-04T00:31:14.266] 0(as_mysql_rollup.c:1755) curr day is now 1577192400-1577278800 [2020-01-04T00:31:14.914] 0(as_mysql_rollup.c:1755) curr day is now 1577278800-1577365200 [2020-01-04T00:31:15.574] 0(as_mysql_rollup.c:1755) curr day is now 1577365200-1577451600 [2020-01-04T00:31:16.222] 0(as_mysql_rollup.c:1755) curr day is now 1577451600-1577538000 [2020-01-04T00:31:16.870] 0(as_mysql_rollup.c:1755) curr day is now 1577538000-1577624400 [2020-01-04T00:31:17.518] 0(as_mysql_rollup.c:1755) curr day is now 1577624400-1577710800 [2020-01-04T00:31:18.167] 0(as_mysql_rollup.c:1755) curr day is now 1577710800-1577797200 [2020-01-04T00:31:18.817] 0(as_mysql_rollup.c:1755) curr day is now 1577797200-1577883600 [2020-01-04T00:31:19.544] 0(as_mysql_rollup.c:1755) curr day is now 1577883600-1577970000 [2020-01-04T00:31:20.198] 0(as_mysql_rollup.c:1755) curr day is now 1577970000-1578056400 [2020-01-04T00:31:20.848] 0(as_mysql_rollup.c:1755) curr month is now 1567260000-1569852000 [2020-01-04T00:31:21.009] 0(as_mysql_rollup.c:1755) curr month is now 1569852000-1572526800 [2020-01-04T00:31:21.072] 0(as_mysql_rollup.c:1755) curr month is now 1572526800-1575118800 [2020-01-04T00:31:21.138] 0(as_mysql_rollup.c:1755) curr month is now 1575118800-1577797200 [2020-01-04T01:00:00.220] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1578056400-1578060000 [2020-01-04T02:00:00.328] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1578060000-1578063600 Is this going to be alright ? Cheers Damien
Hi Damien,

From the dates of your last logs I can see that the rollup has already finished. But I'm not sure it went well because of the SIGTERM that was used (instead of the SIGHUP).

Could you check whether the sreport info is now what you expect? For example, are these commands returning the right information for you?

$ sreport cluster AccountUtilizationByUser -t hours start=2019-01-01 end=2019-09-10
$ sreport cluster AccountUtilizationByUser -t hours start=2018-01-01 end=2018-09-10

Or are they still returning an empty output? If they are still empty, then we will need to repeat steps 0) to 4) of comment 21.

Regards,
Albert
Hi Albert,

Somehow, the rollup isn't progressing at all. See details:

$ sreport cluster AccountUtilizationByUser -t hours start=2018-01-01 end=2018-09-10
--------------------------------------------------------------------------------
Cluster/Account/User Utilization 2018-01-01T00:00:00 - 2018-09-09T23:59:59 (21776400 secs)
Usage reported in CPU Hours
--------------------------------------------------------------------------------
  Cluster         Account     Login     Proper Name        Used   Energy
--------- --------------- --------- --------------- ----------- --------

$ sreport cluster AccountUtilizationByUser -t hours start=2019-01-01 end=2019-05-10
--------------------------------------------------------------------------------
Cluster/Account/User Utilization 2019-01-01T00:00:00 - 2019-05-09T23:59:59 (11149200 secs)
Usage reported in CPU Hours
--------------------------------------------------------------------------------
  Cluster         Account     Login     Proper Name        Used   Energy
--------- --------------- --------- --------------- ----------- --------

Should I repeat steps 0-4 again?

Cheers
Damien
Created attachment 12670 [details] Yesterday's slurmdbd logs
Hi Albert,

I will re-run steps 0-4 again.

Thanks
Damien
Details:

# systemctl stop slurmdbd

MariaDB [(none)]> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| brick              |
| db_backup          |
| mysql              |
| performance_schema |
| slurm_acct_db      |
| test               |
+--------------------+
7 rows in set (0.02 sec)

MariaDB [(none)]> use slurm_acct_db;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
MariaDB [slurm_acct_db]> UPDATE m3_last_ran_table SET hourly_rollup = 1514725200, daily_rollup = 1514725200, monthly_rollup = 1514725200;
Query OK, 1 row affected (0.01 sec)
Rows matched: 1  Changed: 1  Warnings: 0

# systemctl start slurmdbd
# systemctl status slurmdbd
● slurmdbd.service - Slurm DBD accounting daemon
   Loaded: loaded (/etc/systemd/system/slurmdbd.service; disabled; vendor preset: disabled)
   Active: active (running) since Sun 2020-01-05 20:48:55 AEDT; 52s ago
  Process: 23892 ExecStart=/opt/slurm-19.05.4/sbin/slurmdbd (code=exited, status=0/SUCCESS)
 Main PID: 23894 (slurmdbd)
   CGroup: /system.slice/slurmdbd.service
           └─23894 /opt/slurm-19.05.4/sbin/slurmdbd

Jan 05 20:48:55 m3-mgmt2 systemd[1]: Starting Slurm DBD accounting daemon...
Jan 05 20:48:55 m3-mgmt2 systemd[1]: PID file /opt/slurm/var/run/slurmdbd.pid not readable (yet?) after start.
Jan 05 20:48:55 m3-mgmt2 systemd[1]: Started Slurm DBD accounting daemon.

# cat slurmdbd.conf
#
# Example slurmdbd.conf file.
#
# See the slurmdbd.conf man page for more information.
#
# Archive info
#ArchiveJobs=yes
#ArchiveDir="/tmp"
#ArchiveSteps=yes
#ArchiveScript=
#JobPurge=12
#StepPurge=1
#
# Authentication info
AuthType=auth/munge
#AuthInfo=/var/run/munge/munge.socket.2
#
# slurmDBD info
#DbdAddr=
DbdHost=m3-mgmt2
DbdBackupHost=m3-mgmt1
#DbdPort=7031
SlurmUser=slurm
#MessageTimeout=300
#DefaultQOS=normal,standby
DebugLevel=info
DebugFlags=DB_USAGE
LogFile=/mnt/slurm-logs/slurmdbd.log
PidFile=/opt/slurm/var/run/slurmdbd.pid
...
...
...
# sdiag ******************************************************* sdiag output at Sun Jan 05 20:53:48 2020 (1578218028) Data since Sun Jan 05 11:00:00 2020 (1578182400) ******************************************************* Server thread count: 3 Agent queue size: 0 Agent count: 0 DBD Agent queue size: 0 Jobs submitted: 7088 Jobs started: 6353 Jobs completed: 5562 Jobs canceled: 1018 Jobs failed: 0 Job states ts: Sun Jan 05 20:53:42 2020 (1578218022) Jobs pending: 1448 Jobs running: 592 Main schedule statistics (microseconds): Last cycle: 27471 Max cycle: 802957 Total cycles: 1959 Mean cycle: 70358 Mean depth cycle: 408 Cycles per minute: 3 Last queue length: 1408 Backfilling stats Total backfilled jobs (since last slurm start): 202497 Total backfilled jobs (since last stats cycle start): 5039 Total backfilled heterogeneous job components: 0 Total cycles: 1019 Last cycle when: Sun Jan 05 20:53:21 2020 (1578218001) Last cycle: 141285 Max cycle: 145525328 Mean cycle: 2192981 Last depth cycle: 1408 Last depth cycle (try sched): 264 Depth Mean: 1719 Depth Mean (try depth): 307 Last queue length: 1408 Queue length mean: 1729 Latency for 1000 calls to gettimeofday(): 27 microseconds Remote Procedure Call statistics by message type REQUEST_FED_INFO ( 2049) count:5285629 ave_time:109 total_time:577229559 REQUEST_JOB_INFO_SINGLE ( 2021) count:5102192 ave_time:385304 total_time:1965900032724 REQUEST_PARTITION_INFO ( 2009) count:4041568 ave_time:155 total_time:628195478 REQUEST_NODE_INFO_SINGLE ( 2040) count:1035813 ave_time:306177 total_time:317142507075 REQUEST_STEP_COMPLETE ( 5016) count:306999 ave_time:99697 total_time:30606993657 MESSAGE_EPILOG_COMPLETE ( 6012) count:303591 ave_time:259603 total_time:78813185056 REQUEST_COMPLETE_PROLOG ( 6018) count:303531 ave_time:358997 total_time:108966800488 REQUEST_COMPLETE_BATCH_SCRIPT ( 5018) count:300354 ave_time:296373 total_time:89017103008 REQUEST_SUBMIT_BATCH_JOB ( 4003) count:291833 ave_time:208149 total_time:60744983368 REQUEST_JOB_USER_INFO ( 2039) count:183380 ave_time:240325 total_time:44070870638 MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:155714 ave_time:216273 total_time:33676870292 REQUEST_KILL_JOB ( 5032) count:18667 ave_time:31078 total_time:580139572 REQUEST_CONTROL_STATUS ( 2053) count:16582 ave_time:113 total_time:1879345 REQUEST_JOB_READY ( 4019) count:11394 ave_time:320067 total_time:3646846680 REQUEST_UPDATE_NODE ( 3002) count:11050 ave_time:47152 total_time:521040427 REQUEST_JOB_STEP_CREATE ( 5001) count:3674 ave_time:5399 total_time:19837210 REQUEST_RESOURCE_ALLOCATION ( 4001) count:3213 ave_time:99653 total_time:320187814 REQUEST_COMPLETE_JOB_ALLOCATION ( 5017) count:3197 ave_time:310173 total_time:991624330 REQUEST_JOB_INFO ( 2003) count:2940 ave_time:230417 total_time:677428673 REQUEST_SHARE_INFO ( 2022) count:1720 ave_time:4350 total_time:7482128 REQUEST_NODE_INFO ( 2007) count:732 ave_time:162297 total_time:118801883 REQUEST_RESERVATION_INFO ( 2024) count:681 ave_time:112640 total_time:76707947 REQUEST_BUILD_INFO ( 2001) count:496 ave_time:124063 total_time:61535616 REQUEST_JOB_PACK_ALLOC_INFO ( 4027) count:484 ave_time:318084 total_time:153953059 REQUEST_PING ( 1008) count:209 ave_time:98 total_time:20516 REQUEST_CANCEL_JOB_STEP ( 5005) count:149 ave_time:166049 total_time:24741416 REQUEST_JOB_REQUEUE ( 5023) count:41 ave_time:320618 total_time:13145362 REQUEST_UPDATE_JOB ( 3001) count:26 ave_time:241148 total_time:6269871 ACCOUNTING_UPDATE_MSG (10001) count:23 ave_time:267997 total_time:6163948 REQUEST_STATS_INFO ( 2035) 
count:19 ave_time:160 total_time:3049 REQUEST_JOB_STEP_INFO ( 2005) count:18 ave_time:79970 total_time:1439470 REQUEST_UPDATE_RESERVATION ( 3009) count:7 ave_time:339883 total_time:2379187 REQUEST_CREATE_RESERVATION ( 3006) count:6 ave_time:584184 total_time:3505106 REQUEST_JOB_ALLOCATION_INFO ( 4014) count:5 ave_time:294182 total_time:1470914 ACCOUNTING_REGISTER_CTLD (10003) count:4 ave_time:88363 total_time:353453 REQUEST_PRIORITY_FACTORS ( 2026) count:2 ave_time:222624 total_time:445249 REQUEST_DELETE_RESERVATION ( 3008) count:2 ave_time:90494 total_time:180988 REQUEST_TRIGGER_SET ( 2017) count:1 ave_time:143 total_time:143 Remote Procedure Call statistics by user swil0005 ( 10264) count:4776587 ave_time:211373 total_time:1009642698670 root ( 0) count:3454922 ave_time:190785 total_time:659149073667 asegal ( 11199) count:602103 ave_time:116767 total_time:70306296721 vpam0001 ( 11208) count:440336 ave_time:103518 total_time:45582950337 hven0001 ( 10350) count:382377 ave_time:121750 total_time:46554575000 claudiak ( 11247) count:374354 ave_time:97038 total_time:36326902148 joshuamhardy ( 10156) count:340457 ave_time:119542 total_time:40699111831 robe0002 ( 12508) count:322189 ave_time:104570 total_time:33691308597 jinmingz ( 12653) count:307693 ave_time:110626 total_time:34038914724 spiper ( 12815) count:273645 ave_time:54378 total_time:14880367316 ychen ( 11648) count:241669 ave_time:87925 total_time:21248959266 ska565 ( 10255) count:237335 ave_time:98843 total_time:23458969424 aazhar ( 11233) count:208767 ave_time:114387 total_time:23880429936 strikm ( 12849) count:204618 ave_time:112867 total_time:23094801719 scho0011 ( 10572) count:203313 ave_time:107091 total_time:21773130513 cpen0001 ( 10450) count:201477 ave_time:147547 total_time:29727373710 qhao ( 12730) count:170165 ave_time:123615 total_time:21034990075 vieth ( 11159) count:169133 ave_time:158804 total_time:26859114073 xiaoliuz ( 10462) count:165261 ave_time:124182 total_time:20522580031 zoey ( 12871) count:155061 ave_time:162552 total_time:25205622325 anirka ( 12061) count:148696 ave_time:120795 total_time:17961820065 kngu0030 ( 11191) count:147892 ave_time:151416 total_time:22393360967 huiw ( 11363) count:143849 ave_time:164400 total_time:23648810727 jafarl ( 10347) count:143272 ave_time:147580 total_time:21144169452 szhou ( 10465) count:137997 ave_time:123024 total_time:16977057538 sarahsyh ( 11365) count:121744 ave_time:137291 total_time:16714470080 mcfadyen ( 10725) count:120742 ave_time:100926 total_time:12186021144 ttha0011 ( 11105) count:119246 ave_time:133196 total_time:15883199475 zyu ( 12933) count:113759 ave_time:143486 total_time:16322839750 moya0001 ( 12915) count:99689 ave_time:63050 total_time:6285402851 ldill ( 12521) count:76008 ave_time:132145 total_time:10044077828 galisa ( 10716) count:72641 ave_time:156348 total_time:11357326488 rbeare ( 10299) count:58607 ave_time:136209 total_time:7982833426 lindacac ( 11464) count:58196 ave_time:151928 total_time:8841612977 ksabaroe ( 10365) count:56816 ave_time:164156 total_time:9326702165 a1058400 ( 11860) count:55839 ave_time:139722 total_time:7801981104 mfarrell ( 10679) count:50020 ave_time:130276 total_time:6516443014 ctav0001 ( 12340) count:48213 ave_time:162952 total_time:7856434261 pzhao ( 12949) count:47245 ave_time:148631 total_time:7022114217 tnewing ( 10925) count:45732 ave_time:155070 total_time:7091706843 louisaps ( 11398) count:44061 ave_time:92058 total_time:4056168232 yannyinc ( 10505) count:42865 ave_time:149373 total_time:6402875018 aw474 ( 11662) 
count:41634 ave_time:144716 total_time:6025106294 msolovev ( 12837) count:41541 ave_time:146177 total_time:6072372827 mhos0007 ( 10900) count:41360 ave_time:172423 total_time:7131415452 jaszczur ( 11828) count:40177 ave_time:138800 total_time:5576578996 fgag0002 ( 12538) count:38351 ave_time:166825 total_time:6397911747 xzha0043 ( 11476) count:34900 ave_time:155263 total_time:5418679277 zhiqianx ( 11058) count:34607 ave_time:154425 total_time:5344192307 apoz0003 ( 11024) count:33772 ave_time:117764 total_time:3977134722 stuarto ( 10273) count:30187 ave_time:177455 total_time:5356834440 jste0021 ( 10774) count:30128 ave_time:158491 total_time:4775018907 dmpar7 ( 11685) count:29634 ave_time:234738 total_time:6956235225 govindap ( 11549) count:28820 ave_time:175051 total_time:5044973901 qwang1 ( 11135) count:27969 ave_time:97339 total_time:2722490753 jhendrik ( 10515) count:24099 ave_time:182218 total_time:4391274365 pcoo0005 ( 11639) count:23095 ave_time:167659 total_time:3872085939 mkoeda ( 11862) count:22950 ave_time:174783 total_time:4011285397 yaghmaien ( 12664) count:22198 ave_time:161156 total_time:3577361041 megh0001 ( 11997) count:21933 ave_time:47738 total_time:1047050184 nourank ( 11994) count:19776 ave_time:180700 total_time:3573531225 mnak0010 ( 11610) count:19580 ave_time:27441 total_time:537299418 clementeadam ( 10691) count:19172 ave_time:173858 total_time:3333207528 melvinl ( 12907) count:19136 ave_time:108347 total_time:2073334479 tandrill ( 11162) count:18161 ave_time:178073 total_time:3233995756 slurm ( 497) count:16609 ave_time:505 total_time:8396746 hfettke ( 11128) count:15678 ave_time:169441 total_time:2656503506 phoebeimms ( 10690) count:14828 ave_time:157577 total_time:2336565926 bmoffat ( 12008) count:14212 ave_time:133626 total_time:1899101747 earsenau ( 10724) count:13152 ave_time:137014 total_time:1802009458 tdissana ( 10686) count:13007 ave_time:146443 total_time:1904788443 andrewpe ( 10152) count:12822 ave_time:244065 total_time:3129410173 permezelf ( 12929) count:12798 ave_time:151064 total_time:1933327415 julianab ( 12946) count:11901 ave_time:168979 total_time:2011019083 dawhite ( 12881) count:11852 ave_time:125491 total_time:1487323424 rshi0007 ( 10423) count:11698 ave_time:133010 total_time:1555961613 bktan12 ( 11006) count:11396 ave_time:161681 total_time:1842522728 mom008 ( 11465) count:11258 ave_time:72377 total_time:814826789 aburmest1 ( 12975) count:10463 ave_time:141915 total_time:1484861084 kpaw0001 ( 10567) count:9956 ave_time:62541 total_time:622665731 suzanm ( 11205) count:9887 ave_time:163837 total_time:1619862596 jamesr ( 12917) count:9581 ave_time:160528 total_time:1538024920 judominguez ( 10934) count:8887 ave_time:203153 total_time:1805421024 ademarco ( 10256) count:8833 ave_time:177587 total_time:1568630680 kdkan3 ( 12953) count:8647 ave_time:116190 total_time:1004701014 aabu0005 ( 11216) count:8446 ave_time:174502 total_time:1473846180 ymei ( 11123) count:8334 ave_time:130454 total_time:1087204816 nrogasch ( 10517) count:8241 ave_time:177806 total_time:1465303523 gfullers ( 12941) count:7946 ave_time:140812 total_time:1118896543 damienl ( 10005) count:7622 ave_time:164583 total_time:1254456277 tbiczok ( 11180) count:7504 ave_time:131691 total_time:988211100 dspark ( 11831) count:7055 ave_time:181935 total_time:1283555206 adamshephard ( 10901) count:6706 ave_time:32015 total_time:214695781 dapergu ( 11857) count:6370 ave_time:173681 total_time:1106351132 zseeger ( 10027) count:6305 ave_time:28131 total_time:177372258 iharding ( 10381) 
count:5833 ave_time:148085 total_time:863781620 ewilliam ( 12847) count:5709 ave_time:191181 total_time:1091455829 ykha0001 ( 10766) count:5457 ave_time:151603 total_time:827298039 sfwon17 ( 12934) count:5120 ave_time:176767 total_time:905047774 benfulcher ( 10424) count:4486 ave_time:131741 total_time:590991109 simonb ( 10830) count:4435 ave_time:183933 total_time:815743363 sravanfar ( 12509) count:4248 ave_time:169800 total_time:721313291 camillab ( 10501) count:4161 ave_time:159129 total_time:662137685 kmban4 ( 11033) count:4131 ave_time:190332 total_time:786261896 nsingh ( 12834) count:4027 ave_time:184753 total_time:744002623 aurina ( 10385) count:3997 ave_time:126118 total_time:504095710 yoge0001 ( 11113) count:3978 ave_time:27056 total_time:107632094 ljia110 ( 10233) count:3910 ave_time:27307 total_time:106771981 peterl ( 10905) count:3480 ave_time:119417 total_time:415573169 vishalc ( 12719) count:3345 ave_time:32306 total_time:108066322 inas0002 ( 10978) count:3215 ave_time:116009 total_time:372970046 sgwe0001 ( 10173) count:2929 ave_time:47957 total_time:140468963 ctiw0001 ( 12015) count:2895 ave_time:58289 total_time:168747048 bmajor ( 12512) count:2727 ave_time:139026 total_time:379126127 jmton7 ( 12943) count:2196 ave_time:42457 total_time:93237247 xvuthith ( 10699) count:1748 ave_time:76291 total_time:133358371 tsepehri ( 11188) count:1707 ave_time:183171 total_time:312674469 aher0013 ( 11326) count:1642 ave_time:28058 total_time:46071730 tbui ( 12935) count:1587 ave_time:22361 total_time:35487104 scohen1 ( 11404) count:1519 ave_time:123714 total_time:187922032 sbednarek ( 11506) count:1498 ave_time:33230 total_time:49779274 pzaremoo ( 10041) count:1352 ave_time:91206 total_time:123311633 mziemann ( 10442) count:1269 ave_time:83213 total_time:105598546 rkai0001 ( 10719) count:1229 ave_time:45272 total_time:55640022 aceg4000 ( 10727) count:1066 ave_time:123745 total_time:131912834 jzhou ( 11691) count:1035 ave_time:25848 total_time:26753363 taoh ( 11834) count:1030 ave_time:19250 total_time:19827874 amar0047 ( 10989) count:962 ave_time:95813 total_time:92172157 mtang1 ( 12982) count:920 ave_time:90786 total_time:83523830 ganjq ( 10783) count:898 ave_time:54372 total_time:48826402 hbandara ( 10629) count:810 ave_time:133367 total_time:108027292 uqhsun8 ( 12012) count:806 ave_time:113564 total_time:91533045 andrewc ( 10452) count:791 ave_time:153213 total_time:121191798 hhew0002 ( 11689) count:790 ave_time:116917 total_time:92364847 philipc ( 10003) count:738 ave_time:101720 total_time:75069494 tshaw ( 11932) count:735 ave_time:111857 total_time:82215140 creboul ( 10332) count:666 ave_time:240181 total_time:159960962 xuanlih ( 10644) count:658 ave_time:105494 total_time:69415062 esomayeh ( 10379) count:639 ave_time:52879 total_time:33790207 yzha0576 ( 12828) count:628 ave_time:17735 total_time:11137867 lawyl1 ( 10025) count:595 ave_time:69755 total_time:41504246 helenm ( 11051) count:594 ave_time:153806 total_time:91361024 scottk ( 11656) count:544 ave_time:162602 total_time:88455901 pmajka ( 11257) count:535 ave_time:96999 total_time:51894557 ciden1 ( 10709) count:523 ave_time:126679 total_time:66253180 hbor4 ( 12526) count:515 ave_time:39184 total_time:20180243 dsubedi ( 12857) count:483 ave_time:62795 total_time:30330074 wcao0006 ( 10768) count:436 ave_time:153782 total_time:67049126 bangw ( 11947) count:413 ave_time:72297 total_time:29858921 kcorbett ( 12751) count:411 ave_time:238572 total_time:98053421 mchen1 ( 10991) count:408 ave_time:103528 total_time:42239709 sgujjari 
( 12661) count:406 ave_time:147109 total_time:59726647 rfro0003 ( 11455) count:370 ave_time:112172 total_time:41503868 wtsch1 ( 11038) count:363 ave_time:147584 total_time:53573184 mliu153 ( 10667) count:355 ave_time:73361 total_time:26043229 yyer0001 ( 12041) count:353 ave_time:216543 total_time:76439952 yzhang1 ( 11678) count:346 ave_time:68847 total_time:23821374 rjoh0016 ( 12888) count:333 ave_time:192596 total_time:64134786 elam ( 10854) count:324 ave_time:237683 total_time:77009311 gahun3 ( 12820) count:305 ave_time:37434 total_time:11417606 debopris ( 11349) count:304 ave_time:13713 total_time:4168760 vchr0002 ( 11487) count:296 ave_time:80685 total_time:23882886 rwic0002 ( 11371) count:292 ave_time:65437 total_time:19107737 xjia0032 ( 11502) count:285 ave_time:19269 total_time:5491763 hede2 ( 12967) count:274 ave_time:159774 total_time:43778263 mbelouso ( 10091) count:268 ave_time:68789 total_time:18435508 hngu0026 ( 11057) count:257 ave_time:27413 total_time:7045295 rrag0004 ( 12334) count:237 ave_time:33318 total_time:7896424 pubua ( 12833) count:231 ave_time:80494 total_time:18594191 zdyson ( 11463) count:218 ave_time:192859 total_time:42043381 siyuanh ( 10797) count:211 ave_time:54773 total_time:11557109 kvoigt ( 11346) count:206 ave_time:125098 total_time:25770354 zguo61 ( 12968) count:168 ave_time:31954 total_time:5368372 mkha0097 ( 12931) count:165 ave_time:27113 total_time:4473809 dbw ( 10800) count:158 ave_time:181044 total_time:28605027 pleu0005 ( 11794) count:158 ave_time:40353 total_time:6375793 trungn ( 10419) count:150 ave_time:223696 total_time:33554444 michaelr ( 11512) count:118 ave_time:43280 total_time:5107081 amaligg ( 10873) count:111 ave_time:95843 total_time:10638642 kwong ( 12517) count:107 ave_time:158567 total_time:16966704 ssingh ( 11014) count:104 ave_time:153402 total_time:15953858 ec2-user ( 1000) count:90 ave_time:68201 total_time:6138175 bche132 ( 10239) count:86 ave_time:114158 total_time:9817664 klow12 ( 10192) count:85 ave_time:116761 total_time:9924697 pwib0001 ( 12841) count:62 ave_time:37243 total_time:2309079 szhang ( 12694) count:60 ave_time:26926 total_time:1615580 pt5684 ( 11786) count:60 ave_time:219032 total_time:13141977 khodgins ( 11538) count:49 ave_time:209620 total_time:10271421 aazi0007 ( 12038) count:45 ave_time:15090 total_time:679090 sherenan ( 11901) count:22 ave_time:204737 total_time:4504227 xli479 ( 11039) count:21 ave_time:334382 total_time:7022034 nisald ( 12338) count:21 ave_time:2039 total_time:42831 gholamrh ( 10097) count:15 ave_time:5580 total_time:83703 ctan ( 10189) count:11 ave_time:135917 total_time:1495095 swatts ( 11445) count:8 ave_time:173205 total_time:1385641 chines ( 10011) count:8 ave_time:51216 total_time:409728 lper0012 ( 11103) count:7 ave_time:68797 total_time:481585 pab07 ( 10875) count:6 ave_time:950 total_time:5704 rgomi ( 11448) count:3 ave_time:150652 total_time:451957 yliu0291 ( 12964) count:3 ave_time:3896 total_time:11688 Pending RPC statistics No pending RPCs # sacctmgr show stats Rollup statistics Hour count:0 ave_time:0 max_time:0 total_time:0 Day count:0 ave_time:0 max_time:0 total_time:0 Month count:0 ave_time:0 max_time:0 total_time:0 Remote Procedure Call statistics by message type DBD_STEP_COMPLETE ( 1441) count:60 ave_time:29866 total_time:1792012 DBD_JOB_COMPLETE ( 1424) count:30 ave_time:40383 total_time:1211494 DBD_SEND_MULT_MSG ( 1474) count:12 ave_time:145284 total_time:1743411 SLURM_PERSIST_INIT ( 6500) count:3 ave_time:3442 total_time:10328 DBD_FINI ( 1401) count:2 ave_time:226 
total_time:452 DBD_STEP_START ( 1442) count:2 ave_time:77160 total_time:154320 DBD_REGISTER_CTLD ( 1434) count:1 ave_time:23248 total_time:23248 DBD_GET_ASSOCS ( 1410) count:1 ave_time:903980 total_time:903980 DBD_GET_TRES ( 1486) count:1 ave_time:41687 total_time:41687 DBD_GET_QOS ( 1448) count:1 ave_time:736 total_time:736 DBD_JOB_START ( 1425) count:1 ave_time:81920 total_time:81920 DBD_CLUSTER_TRES ( 1407) count:1 ave_time:1376 total_time:1376 Remote Procedure Call statistics by user slurm ( 497) count:108 ave_time:46427 total_time:5014128 root ( 0) count:7 ave_time:135833 total_time:950836
# grep curr slurmdbd.log .... .... .... [2020-01-05T10:12:30.247] 0(as_mysql_rollup.c:1755) curr day is now 1577365200-1577451600 [2020-01-05T10:12:30.900] 0(as_mysql_rollup.c:1755) curr day is now 1577451600-1577538000 [2020-01-05T10:12:31.553] 0(as_mysql_rollup.c:1755) curr day is now 1577538000-1577624400 [2020-01-05T10:12:32.206] 0(as_mysql_rollup.c:1755) curr day is now 1577624400-1577710800 [2020-01-05T10:12:32.860] 0(as_mysql_rollup.c:1755) curr day is now 1577710800-1577797200 [2020-01-05T10:12:33.517] 0(as_mysql_rollup.c:1755) curr day is now 1577797200-1577883600 [2020-01-05T10:12:34.177] 0(as_mysql_rollup.c:1755) curr day is now 1577883600-1577970000 [2020-01-05T10:12:34.835] 0(as_mysql_rollup.c:1755) curr day is now 1577970000-1578056400 [2020-01-05T10:12:35.494] 0(as_mysql_rollup.c:1755) curr day is now 1578056400-1578142800 [2020-01-05T10:12:36.152] 0(as_mysql_rollup.c:1755) curr month is now 1567260000-1569852000 [2020-01-05T10:12:36.323] 0(as_mysql_rollup.c:1755) curr month is now 1569852000-1572526800 [2020-01-05T10:12:36.386] 0(as_mysql_rollup.c:1755) curr month is now 1572526800-1575118800 [2020-01-05T10:12:36.446] 0(as_mysql_rollup.c:1755) curr month is now 1575118800-1577797200 [2020-01-05T11:00:00.536] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1578178800-1578182400 [2020-01-05T12:00:00.651] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1578182400-1578186000 [2020-01-05T13:00:00.770] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1578186000-1578189600 [2020-01-05T14:00:00.879] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1578189600-1578193200 [2020-01-05T15:00:00.996] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1578193200-1578196800 [2020-01-05T16:00:00.117] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1578196800-1578200400 [2020-01-05T17:00:00.241] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1578200400-1578204000 [2020-01-05T18:00:00.422] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1578204000-1578207600 [2020-01-05T19:00:00.538] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1578207600-1578211200 [2020-01-05T20:00:00.687] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1578211200-1578214800 [2020-01-05T20:48:59.824] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1514725200-1514728800 [2020-01-05T20:49:36.231] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1514728800-1514732400 [2020-01-05T20:50:11.603] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1514732400-1514736000 [2020-01-05T20:50:49.505] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1514736000-1514739600 [2020-01-05T20:51:24.940] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1514739600-1514743200 [2020-01-05T20:52:01.426] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1514743200-1514746800 [2020-01-05T20:52:35.893] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1514746800-1514750400 [2020-01-05T20:53:11.876] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1514750400-1514754000 [2020-01-05T20:53:47.179] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1514754000-1514757600 [2020-01-05T20:54:23.453] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1514757600-1514761200 [2020-01-05T20:54:59.562] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1514761200-1514764800 [2020-01-05T20:55:37.539] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1514764800-1514768400 [2020-01-05T20:56:13.236] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1514768400-1514772000 [2020-01-05T20:56:48.166] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1514772000-1514775600 [2020-01-05T20:57:22.783] 0(as_mysql_rollup.c:1072) m3 curr hour is now 
1514775600-1514779200 [2020-01-05T20:57:58.447] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1514779200-1514782800
Hi Albert,

Something doesn't look right. See:

# grep curr slurmdbd.log
....
....
....
[2020-01-06T16:18:40.770] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1523253600-1523257200
[2020-01-06T16:19:20.166] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1523257200-1523260800
[2020-01-06T16:20:04.982] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1523260800-1523264400
[2020-01-06T16:20:50.152] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1523264400-1523268000
[2020-01-06T16:21:39.481] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1523268000-1523271600
[2020-01-06T16:22:20.248] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1523271600-1523275200
[2020-01-06T16:22:58.921] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1523275200-1523278800
[2020-01-06T16:23:37.047] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1523278800-1523282400
[2020-01-06T16:24:16.126] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1523282400-1523286000

1523282400 as an epoch time (assuming seconds) is:
GMT: Monday, 9 April 2018 2:00:00 PM
Your time zone: Tuesday, 10 April 2018 12:00:00 AM GMT+10:00
Relative: 2 years ago

But:

# sreport cluster AccountUtilizationByUser -t hours start=2018-03-01 end=2018-03-10
--------------------------------------------------------------------------------
Cluster/Account/User Utilization 2018-03-01T00:00:00 - 2018-03-09T23:59:59 (777600 secs)
Usage reported in CPU Hours
--------------------------------------------------------------------------------
  Cluster         Account     Login     Proper Name       Used   Energy
--------- --------------- --------- --------------- ---------- --------

Still no values. Kindly advise.

Thanks
Damien
Created attachment 12671 [details] slurmdbd logs

Current slurmdbd log. Kindly review.
Hi Damien,

> I will re-run steps 0-4 again.

That was the right decision. Thanks for the detailed information, it helps me a lot.

> Something doesn't look right,
>
> # grep curr slurmdbd.log
> [2020-01-06T16:24:16.126] 0(as_mysql_rollup.c:1072) m3 curr hour is now
> 1523282400-1523286000
>
> 1523282400 Epoch time
> Assuming that this timestamp is in seconds:
> GMT: Monday, 9 April 2018 2:00:00 PM
> Your time zone: Tuesday, 10 April 2018 12:00:00 AM GMT+10:00
> Relative: 2 years ago
>
> # sreport cluster AccountUtilizationByUser -t hours start=2018-03-01 end=2018-03-10
> --------------------------------------------------------------------------------
> Cluster/Account/User Utilization 2018-03-01T00:00:00 - 2018-03-09T23:59:59 (777600 secs)
> Usage reported in CPU Hours
> --------------------------------------------------------------------------------
>   Cluster         Account     Login     Proper Name       Used   Energy
> --------- --------------- --------- --------------- ---------- --------
>
> Still no value.

I understand your point, but that's expected. The rollup is now working properly, so please don't stop the daemon until it fully ends.

The reason why you still cannot see that information is that the rollup first does all the hourly aggregations/tables up to the current time, then the daily ones, and finally the monthly ones. Your sreport command is internally translated into queries against the daily tables, and those are not done yet. You should check with sreport only once the rollup has fully completed.

It is working now. Keep an eye on the logs, just make sure that slurmdbd is not stopped until the rollup is done, and keep me updated! ;-)

Regards,
Albert
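If you want an extra sanity check before the rollup finishes, you could count rows in the daily usage table directly. The table name below follows Slurm's usual <cluster>_assoc_usage_day_table naming, which is an assumption on my side, so adjust it if your schema differs; a non-zero, growing count for 2018 onwards would indicate the daily aggregation is being repopulated:

MariaDB [slurm_acct_db]> SELECT COUNT(*) FROM m3_assoc_usage_day_table WHERE time_start >= 1514725200;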
Hi Albert,

The progress is still on the slow side. Details:

# grep curr slurmdbd.log
...
...
[2020-01-09T00:24:46.189] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1547226000-1547229600
[2020-01-09T00:24:58.307] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1547229600-1547233200
[2020-01-09T00:25:10.480] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1547233200-1547236800
[2020-01-09T00:25:23.083] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1547236800-1547240400

1547240400 as an epoch time, assuming seconds:
GMT: Friday, 11 January 2019 8:00:00 PM
Your time zone: Saturday, 12 January 2019 7:00:00 AM GMT+11:00 DST
Relative: A year ago

Cheers
Damien
It looks good. Let's see if it finishes by the end of the week.

Thanks for letting me know,
Albert
# grep Error slurmdbd.log
.....
.....
.....
[2020-01-13T09:00:06.350] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1578837600-1578841200
[2020-01-13T09:00:06.438] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1578841200-1578844800
[2020-01-13T09:00:06.526] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1578844800-1578848400
[2020-01-13T09:00:06.613] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1578848400-1578852000
[2020-01-13T09:00:06.704] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1578852000-1578855600
[2020-01-13T09:00:06.792] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1578855600-1578859200
[2020-01-13T09:00:06.881] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1578859200-1578862800
[2020-01-13T09:00:06.969] 0(as_mysql_rollup.c:1072) m3 curr hour is now 1578862800-1578866400
[2020-01-13T09:00:07.081] 0(as_mysql_rollup.c:1755) curr day is now 1578574800-1578661200
[2020-01-13T09:00:10.728] 0(as_mysql_rollup.c:1755) curr day is now 1578661200-1578747600
[2020-01-13T09:00:13.876] 0(as_mysql_rollup.c:1755) curr day is now 1578747600-1578834000
Hi Albert,

The sreports are working now (2018-2020), but I am observing this:

--
$ sacctmgr show reservation
   Cluster            Name     TRES           TimeStart             TimeEnd UnusedWall
---------- --------------- -------- ------------------- ------------------- ----------
        m3  CryoemFacility   cpu=60 2019-10-28T20:16:55 2020-05-27T10:12:23 9.3522e+06
        m3       cryosparc   cpu=28 2019-10-28T20:16:55 2020-11-08T11:33:37 1.2592e+07
        m3       M3-backup   cpu=48 2019-10-28T20:16:55 2020-03-14T08:43:14 1.2571e+07
        m3  mxbeampostproc   cpu=48 2019-10-28T20:16:55 2025-01-08T11:38:57 1.2592e+07
        m3             nfs    cpu=6 2019-10-28T20:16:55 2020-09-22T18:17:40 1.2592e+07
        m3          simple  cpu=120 2019-10-28T20:16:55 2020-06-29T10:41:27 1.1655e+07
        m3             uow   cpu=52 2019-10-28T20:16:55 2025-01-08T10:00:00 1.2393e+07
        m3                          2019-10-30T12:03:48 2020-01-13T09:45:32 1.2376e+07
        m3                          2019-10-30T12:03:48 2025-06-30T09:00:00 1.2449e+07
        m3                          2019-10-30T12:03:48 2025-06-30T09:00:00 1.2449e+07
        m3         highmem   cpu=36 2019-12-02T19:57:29 2028-11-30T11:34:35 8.9936e+06
        m3                  cpu=188 2020-01-06T11:47:03 2020-01-13T12:03:39 5.3497e+06
        m3            DDN2  cpu=360 2020-01-07T22:48:49 2020-01-17T00:00:00 6.3564e+06
        m3                          2020-01-13T09:45:32 2025-06-30T09:00:00 2.2468e+04
        m3                  cpu=188 2020-01-13T12:03:39 2025-06-30T09:00:00 1.1489e+04

Notice the empty entries...

$ scontrol show reservation

ReservationName=mxbeampostproc StartTime=2017-12-18T12:20:35 EndTime=2025-01-08T11:38:57 Duration=2577-23:18:22 Nodes=m3d[006-007] NodeCnt=2 CoreCnt=48 Features=(null) PartitionName=(null) Flags=IGNORE_JOBS,SPEC_NODES TRES=cpu=48 Users=(null) Accounts=-training Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

ReservationName=cryosparc StartTime=2018-11-09T11:33:37 EndTime=2020-11-08T11:33:37 Duration=730-00:00:00 Nodes=m3h015 NodeCnt=1 CoreCnt=28 Features=(null) PartitionName=(null) Flags=SPEC_NODES TRES=cpu=28 Users=jafarl,lancew Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

ReservationName=sexton StartTime=2019-02-07T11:36:44 EndTime=2025-06-30T09:00:00 Duration=2334-22:23:16 Nodes=dgx[003-004],m3g[000,006],m3p010 NodeCnt=5 CoreCnt=188 Features=(null) PartitionName=(null) Flags=IGNORE_JOBS,SPEC_NODES TRES=cpu=188 Users=galisa,lynnliang,mbelouso,xzha0043,ctan,spiper,rjoh0016,trungn,lancew Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

ReservationName=M3-backup StartTime=2019-03-15T08:43:14 EndTime=2025-06-30T09:00:00 Duration=2299-01:16:46 Nodes=m3d[002,008] NodeCnt=2 CoreCnt=48 Features=(null) PartitionName=(null) Flags=SPEC_NODES TRES=cpu=48 Users=ctan,trungn,rchiu Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

ReservationName=CryoemFacility StartTime=2019-05-28T10:12:23 EndTime=2025-06-30T09:00:00 Duration=2224-22:47:37 Nodes=m3c004,m3g014 NodeCnt=2 CoreCnt=60 Features=(null) PartitionName=(null) Flags=IGNORE_JOBS,SPEC_NODES TRES=cpu=60 Users=lancew,hven0001,jafarl Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

ReservationName=AWX StartTime=2019-06-07T14:48:59 EndTime=2025-06-30T09:00:00 Duration=2214-18:11:01 Nodes=m3a[009-010] NodeCnt=2 CoreCnt=48 Features=(null) PartitionName=(null) Flags=SPEC_NODES TRES=cpu=48 Users=trungn,ctan,damienl Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

ReservationName=simple StartTime=2019-06-30T10:41:27 EndTime=2020-06-29T10:41:27 Duration=365-00:00:00 Nodes=m3a[016-020] NodeCnt=5 CoreCnt=120 Features=(null) PartitionName=(null) Flags=SPEC_NODES TRES=cpu=120 Users=hael,creboul,zqi1,sarahle,lancew,marionb,ctan,skie0002,mcaggian Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

ReservationName=uow StartTime=2019-09-13T11:06:48 EndTime=2025-01-08T10:00:00 Duration=1943-21:53:12 Nodes=m3c011,m3h011 NodeCnt=2 CoreCnt=52 Features=(null) PartitionName=(null) Flags=IGNORE_JOBS,SPEC_NODES TRES=cpu=52 Users=lancew,simonb,jafarl,jbouwer Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

ReservationName=highmem StartTime=2019-12-02T15:26:56 EndTime=2028-11-30T11:34:35 Duration=3285-20:07:39 Nodes=m3m000 NodeCnt=1 CoreCnt=36 Features=(null) PartitionName=(null) Flags=IGNORE_JOBS,SPEC_NODES TRES=cpu=36 Users=trungn,ctan,damienl,chines,lancew,jafarl,nwong,kerriw,philipc,smichnow,jchang,mmcg0002,angu0022,ska565,ozgeo,rbeare,zseeger,keving,swil0005,lawyl1,nlkar1,stya0001,kholt Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

ReservationName=DDN2 StartTime=2020-01-06T14:40:54 EndTime=2020-01-17T00:00:00 Duration=10-09:19:06 Nodes=m3i[030-039] NodeCnt=10 CoreCnt=360 Features=(null) PartitionName=(null) Flags=IGNORE_JOBS,SPEC_NODES TRES=cpu=360 Users=(null) Accounts=nq46 Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a
--

Is this problem related?

Cheers
Damien
Created attachment 12711 [details] Current slurmdbd.log Kindly review
Hi Damien,

> The sreports are working now (2018-2020)

Good! I think we can now consider the initial issue related to sreport fixed.

> but I am observing this:
> $ sacctmgr show reservation
> Notice the empty entries...

I see some issues:

- mxbeampostproc started even before you installed 18.08 (2017-12-18).
- All except "simple" and "DDN2" have a Duration of more than one year.

I'm not sure why or how you ended up with such long reservations, is that intentional?

> Is this problem related?

No, the blank values and the long durations are not related to the rollup we have just done. It is probably a good idea to close this one and open a new bug for it. We will need some output from the DB to work on it, but it will probably be easier if you share the DB with us so we can look at it directly.

Regards,
Albert
Hi Albert,

Thanks for your replies.

What is standard practice for the duration of a reservation? Not more than 2 years?

Yes, the empty entries should be treated as a separate ticket.

You want the output of the DB? How? Which SQL commands, or a mysqldump?

Cheers
Damien
Hi Damien,

> What is standard practice for the duration of a reservation? Not more than 2 years?

Well, I would say that such long reservations are unusual. With more information about the goal of your reservations we could recommend some alternatives. For example, flags like DAILY, WEEKLY, WEEKDAY or WEEKEND make sense in some scenarios, or your goal might be better accomplished with a different partition configuration or a QOS. It depends on what you are trying to achieve. Feel free to open a specific bug to explain your goals and how you are using reservations, and we will discuss it further there.

> Yes, the empty entries should be treated as a separate ticket.
> You want the output of the DB? How? Which SQL commands, or a mysqldump?

Yes, a mysqldump is a good option. I guess you will need some kind of file-sharing tool, so mark your comment as private to keep your DB private.

Regards,
Albert
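As a rough sketch of the dump itself (the username and output path here are placeholders, so adapt them to your environment; run it on the database host or anywhere with network access to it):

# Consistent snapshot of the accounting DB without locking the whole server
mysqldump -u slurm -p --single-transaction --hex-blob slurm_acct_db | gzip > slurm_acct_db.sql.gz

--single-transaction keeps the dump consistent for InnoDB tables while slurmdbd stays up; with tables of this size, expect the dump to take a while and to need a fair amount of disk space.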
Hi Damien,

If this is OK with you, I'm closing this ticket as "info given", because the main issue related to the sreport information is fixed. Please don't hesitate to reopen it if you need further support, or to open a new one so we can keep helping you with the reservation issue.

Regards,
Albert
Hi Albert,

Thanks for everything. I am fine with closing this case, but we have not covered any measures to prevent this from happening again.

Cheers
Damien
You are right, Damien. We are tracking internally the investigation of what could lead to your issue, which has also been reported by other users. I'll let you know if we find something.

Albert
Good Afternoon Slurm Support

Sorry for bringing this up again after a while. Are there any follow-ups or preventive measures for this matter? I want to mitigate the risk because we are planning to move from 19.05.04 to 20.02.03 soon.

Cheers
Damien
Hi Damien!

> Sorry for bringing this up again after a while.

We are glad to hear from you again, especially if everything is working and you are just being proactive! ;-)

> Are there any follow-ups or preventive measures for this matter? I want to mitigate the risk because we are planning to move from 19.05.04 to 20.02.03 soon.

So far, most of the similar cases are related to slurmdbd being killed externally in the middle of a rollup process. As mentioned in comment 17, and especially at the end of comment 23, in your case that was done by logrotate, but you have already fixed it. In other cases it can be related to a very large number of runaway jobs, or to very old ones, as those also lead to a long rollup process. Some minor cases are related to very long reservations, which may also lead to long rollups.

Actually, all similar cases were reported on 18.08 or 19.05; none has been reported on 20.02 so far, and plenty of sites are already using 20.02.

Anyway, the specific preventive actions related to this rollup/sreport issue would be:

- Check and fix runaway jobs with "sacctmgr show runaway". If you are clean, they won't trigger long rollups.
- Check that reservations are correctly in sync between slurmctld (with "scontrol show reservations") and slurmdbd (with "sacctmgr show reservations"). If they are in sync, they won't trigger long rollups either.
- Check the slurmdbd logs to confirm it is not being terminated for any unknown reason, and that includes no manual signaling with SIGINT / SIGTERM / SIGHUP. If slurmdbd is not being killed, rollups will work normally.

With these extra preventive actions (a short shell sketch bundling them follows below), you can go ahead and follow the normal upgrade steps, which also include other, more general, preventive actions. As a reference:

https://slurm.schedmd.com/quickstart_admin.html#upgrade
https://slurm.schedmd.com/SLUG19/Field_Notes_3.pdf (slides 5-10)

I hope that helps,
Albert
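A minimal sketch of those three checks, run before the upgrade. It assumes you run it from the slurmdbd log directory (as in the earlier grep commands); the grep pattern is only a starting point, since the exact wording of a termination message can vary between versions.

# 1) Runaway jobs: ideally nothing is listed
sacctmgr show runaway

# 2) Reservations known to slurmctld vs. those stored in slurmdbd; compare by eye
scontrol show reservations
sacctmgr show reservations

# 3) Signs of slurmdbd being stopped or signalled unexpectedly
grep -iE 'terminat|signal' slurmdbd.log | tail -20

Running these right before stopping the daemons for the upgrade should surface any pending long rollup before it can be interrupted.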
Hi Albert,

Thanks for your reply. I will follow your notes and read Tim's rants.

Cheers
Damien
OK Damien. I'm closing this bug again as infogiven. If you face any issue or need any support during the upgrade, please don't hesitate to open a new bug. You can reopen this one too, but if the issue is not related to the original one, opening a new bug and keeping this one focused on the original issue is a much better option.

Thanks,
Albert
Thanks