Hello Team,

On May 14th and 15th there were a lot of messages like the ones below, and while they were appearing, jobs were not committing to slurmdbd and we were unable to query anything from it.

error: There is no reservation by id 562, time_start 1588618629, and cluster 'scp'
error: There is no reservation by id 602, time_start 1589522400, and cluster 'scp'

There were also a lot of runaway jobs. We want to understand why this happened and how to prevent it in the future. The slurmdbd log is attached for your review; let me know if you need any other information to investigate further.

Thanks
Sathish
Created attachment 14286 [details] slurmctld logs
Created attachment 14287 [details] slurmdbd log
It is critical that you apply this fix as soon as possible, or you will lose data.

I need the output of these commands:

$ sacctmgr show reservations format=cluster,name,id,start,end,tres
$ scontrol show reservations
$ sdiag

Note the output of sdiag; you might want to start a watch on it (`watch sdiag`). The "DBD agent queue size" field shows how many RPC messages for the slurmdbd are backed up on the slurmctld. The maximum size of the queue is MaxDBDMsgs, which by default is max(10000, 2*MaxJobCount + 4*NodeCount). If you attach your slurm.conf, I can tell how close you are to that limit.

The queue is not emptying because there is an RPC that the slurmdbd won't accept. As the error message indicates, the bad RPC references a reservation that doesn't exist. Reservations are uniquely identified by resv_id and time_start, and the problem here will be that the time_start in the RPC message is incorrect. To get the queue to clear, we have to create a reservation manually with the matching id and time_start.

The fix: watch sdiag and the slurmdbd.log. Run the following mysql query for each "no reservation by id ..." error until there are no more and the DBD agent queue size in sdiag decreases, eventually reaching 0-100.

mysql> insert into <cluster>_resv_table (id_resv, deleted, time_start, resv_name) values (<id>, 0, <time_start from error>, 'xxbadresv_<resv_id>xx');

e.g.

mysql> insert into <cluster>_resv_table (id_resv, deleted, time_start, resv_name) values (602, 0, 1589522400, 'xxbadresv602xx');

<cluster> is your cluster name, which looks like "scp" for you. These rows can be removed once everything is running smoothly again.

We have seen this a few times before, rarely, but we have had trouble getting any information on how the bad RPC ended up in the queue. I'll follow up with anything I need to investigate the underlying bug. Thanks for already sending your logs.

Note: the runaway jobs appear because the data for those jobs has not reached the database yet. Check again once the queue has emptied.
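If there are many of these errors, writing the INSERT statements by hand gets tedious. A rough helper sketch for generating them from the log; the function name, the "scp" default, and reading the log on stdin are all assumptions for this site, and the generated SQL should be reviewed before running it against the accounting database:

```shell
# gen_resv_inserts: read slurmdbd.log on stdin and print one repair INSERT
# per unique "no reservation by id" error. CLUSTER defaults to scp
# (an assumption; override it for another cluster name).
gen_resv_inserts() {
  grep -o 'no reservation by id [0-9]*, time_start [0-9]*' \
    | sort -u \
    | awk -v c="${CLUSTER:-scp}" '{
        gsub(",", "", $5)   # $5 = reservation id, $7 = time_start
        printf "insert into %s_resv_table (id_resv, deleted, time_start, resv_name) values (%s, 0, %s, '\''xxbadresv%sxx'\'');\n", c, $5, $7, $5
      }'
}
```

Usage would be something like `gen_resv_inserts < /var/log/slurm/slurmdbd.log`, then paste the reviewed statements into the mysql prompt.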
Thanks
Hi Broderick,

We don't see this issue after May 15th; sdiag looks normal and there are no such errors in slurmdbd.log. I have nevertheless attached all the requested command output for your review. Please review and let us know whether we still need to apply the suggested fix. We would also like to know the root cause and how to prevent this in the future.

Also, I see this warning message in slurmdbd.log; what does it mean?

[2020-05-20T16:15:12.435] Warning: Note very large processing time from hourly_rollup for scp: usec=6470762 began=16:15:05.965
[2020-05-20T16:15:22.787] Warning: Note very large processing time from daily_rollup for scp: usec=10351127 began=16:15:12.436

Thanks
Sathish
Created attachment 14321 [details] slurm.conf
Created attachment 14322 [details] scontrol show reservations output
Created attachment 14323 [details] sacctmgr show reservations output
Created attachment 14324 [details] sdiag output
Sorry for the delay. As you noticed, your cluster does not currently have the problem I expected and wrote instructions for. This is concerning, because I don't know of a way the slurmctld can resolve the problem itself, yet it looks like your cluster did have the problem on May 14th and 15th. I am looking into what could have happened. Do you have any more information on any reboots or resets during this time? Are there any problems currently with your cluster? Are there any runaway jobs?

Side note: you sent the same file twice instead of `scontrol show reservations`.

(In reply to Sathishkumar from comment #4)
> Also, I see this warning message on the slurmdbd.log, what does this mean ?
>
> [2020-05-20T16:15:12.435] Warning: Note very large processing time from
> hourly_rollup for scp: usec=6470762 began=16:15:05.965
> [2020-05-20T16:15:22.787] Warning: Note very large processing time from
> daily_rollup for scp: usec=10351127 began=16:15:12.436

This is not a serious problem on its own, but it could indicate that your database is somewhat overloaded during particularly heavy rollup times. In case you don't know, rollup refers to the aggregation of cluster usage for each hour, day, and month. This is used when creating reports via sreport.

Thanks
Yes, at the moment we have no issues, but on the 13th we had a load issue on the control plane node where slurmctld00 (primary) runs. The load went high, the primary slurmctld lost communication, and the backup took over. I believe the backup takeover did not complete for some reason; it got stuck partway, could not process the StateSaveLocation, and we ended up with a split-brain situation. All of this started after that load spike on the control plane, and after the 15th we haven't had any issues.

But now when I try to check for runaway jobs, I get the error below; it was working fine until yesterday.

[kfdv397@seskscpn084 ~]$ sacctmgr show runaway
sacctmgr: error: Slurmctld running on cluster scp is not up, can't check running jobs

All the other commands below work fine:

[kfdv397@seskscpn084 ~]$ scontrol ping
Slurmctld(primary/backup) at slurmctld00/slurmctld01 are UP/UP

[kfdv397@seskscpn084 ~]$ sacctmgr show cluster
   Cluster     ControlHost  ControlPort   RPC     Share GrpJobs       GrpTRES GrpSubmit MaxJobs       MaxTRES MaxSubmit     MaxWall                  QOS   Def QOS
---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- -------------------- ---------
       scp                            0  8192         1                                                                                normal

I get the same "Slurmctld running on cluster scp is not up" error when I run sacctmgr show runaway on the slurmctld and slurmdbd hosts as well. Any idea why I am not able to check runaway jobs?

Attached is the latest sdiag output, and also the scontrol show reservation output taken on the 20th. Please review and let me know if you need any other details from my end to proceed.

Thanks
Sathish
Created attachment 14418 [details] sdiag taken on 28th
Created attachment 14419 [details] scontrol_show_reservations_output Have attached the correct scontrol show reservations output.
Created attachment 14420 [details] slurmdbd.log Latest slurmdbd.log attached for your review.
I see messages like the one below in the log:

DBD_CLUSTER_TRES: cluster not registered

I believe a restart of the slurmdbd service will fix this, but before doing that I wanted you to review it and make sure there is nothing wrong with slurmdbd. Is this happening because slurmctld was not communicating with slurmdbd?
Created attachment 14421 [details] slurmctld log slurmctld.log from the primary controller which is slurmctld00
Created attachment 14423 [details] slurmctld logs slurmctld.log from the backup controller which is slurmctld01
(In reply to Sathishkumar from comment #14)
> I see that there are messages like below on the log,
>
> DBD_CLUSTER_TRES: cluster not registered,
>
> I believe if we restart the slurmdbd service this will get fixed, before
> doing that, i wanted you to review this and make sure there is nothing
> wrong with slurmdbd.
>
> Is this happening because slurmctld was not communicating with slurmdbd ?

This should be fixed by a restart of the slurmdbd and/or the slurmctld, yes.

Do you have any plans to upgrade to a supported version soon? There have been changes and fixes to cluster registration in more recent versions of Slurm, as well as fixes to backup controller takeover.
Hi Broderick,

Yes, we are working on upgrading to version 19.05.7. Before we upgrade, we are trying to slim down the database by enabling archiving and purging. We have a test setup with a replicated production DB to test the archive and purge process, using the parameters below in slurmdbd.conf. But for some reason the archive and purge are not happening on the defined schedule, and there are no errors in the logs.

#Purging
PurgeEventAfter=18852hours
PurgeJobAfter=18852hours
PurgeResvAfter=18852hours
PurgeStepAfter=18852hours
PurgeSuspendAfter=18852hours
PurgeTXNAfter=18852hours
PurgeUsageAfter=18852hours

#Archival
ArchiveDir=/opt/archive
ArchiveEvents=yes
ArchiveJobs=yes
ArchiveResvs=yes

Also, if I run the archive process manually, I get the error below:

[root@slurmdbd1905 archive]# sacctmgr archive dump
sacctmgr: error: slurmdbd: Getting response to message type 1459
sacctmgr: error: slurmdbd: DBD_ARCHIVE_DUMP failure: No error
Problem dumping archive: Unspecified error

We have about 30,900,000+ jobs in the DB. Could you please help us proceed further with this?

Thanks
Have you upgraded anything yet, or is it all still on 17.11? There have been several major fixes to archive/purge in 18.08 and 19.05.
It was on 17.11, but I saw the 18.08 release notes and found several improvements related to archive and purge, so we are now using 18.08 on the test environment. The status is still the same, though; any thoughts?
I also updated to the latest version, but there is still no improvement; the archive and purge do not trigger on the defined schedule.

[root@slurmdbd1905 ~]# slurmdbd -V
slurm 19.05.7

Also, the manual archive still fails with the same error message:

[root@slurmdbd1905 ~]# sacctmgr archive dump
This may result in loss of accounting database records (if Purge* options enabled).
Are you sure you want to continue? (You have 30 seconds to decide)
(N/y): y
sacctmgr: error: slurmdbd: Getting response to message type: DBD_ARCHIVE_DUMP
sacctmgr: error: slurmdbd: DBD_ARCHIVE_DUMP failure: No error
Problem dumping archive: Unspecified error
Hi Broderick,

This is quite critical for us; it would be really helpful if you could look into this as soon as possible.

Thanks
Sathish
I understand, I am working on it. You said previously that there are no errors; the scheduled purge just doesn't happen. Please attach the slurmdbd log from the 19.05 test slurmdbd.

What is the slurmdbd debug level? Please also include the slurmdbd.conf (redact the storage password). I am assuming that we are just working on the test slurmdbd for now, correct?

What is the process you followed to upgrade the slurmdbd on your test cluster? Do you use RPMs? Did you upgrade all the slurm binaries at once, or just the slurmdbd?

What is the status of your production cluster? Did you restart any daemons? Have you been able to check for runaway jobs again? I just want to make sure it is in a functional state at least while we work on the test cluster.

Thanks
Oh I see from the slurmdbd.log I already have that you are probably at debug4. You don't need to be that high; debug2 is plenty. I do still need the slurmdbd.log from the 19.05 test cluster.
Created attachment 14529 [details] slurmdbd.conf from the test system running 19.05
Created attachment 14530 [details] slurmdbd.log from the test system running 19.05
(In reply to Broderick Gardner from comment #23)
> I understand, I am working on it. You said previously that there are no
> errors; the scheduled purge just doesn't happen. Please attach the slurmdbd
> log from the 19.05 test slurmdbd.

Please find the attached slurmdbd.log from the test slurmdbd.

> What is the slurmdbd debug level? Please also include the slurmdbd.conf
> (redact the storage password). I am assuming that we are just working on the
> test slurmdbd for now, correct?

Please find the attached slurmdbd.conf from the test slurmdbd; debug4 is set on the test system as well.

> What is the process you followed to upgrade the slurmdbd on your test
> cluster? Do you use RPMs? Did you upgrade all the slurm binaries at once, or
> just the slurmdbd?

We built the RPMs and then ran:

yum update slurm-slurmdbd slurm-contribs slurm-libpmi slurm-slurmd slurm slurm-pam_slurm slurm-perlapi

First the update was done from the current version (17.11.13) to 18.08.9. The DB conversion took almost 6 hours, and afterwards I checked archiving and purging and noticed it was not happening; a manually triggered archive also failed. Then I built the packages for version 19.05.7 and ran the same yum update command. That DB conversion took 20 minutes, and the archive and purge still do not trigger on the defined schedule. Running the archive manually just gave the error below:

[root@slurmdbd1905 ~]# sacctmgr archive dump
This may result in loss of accounting database records (if Purge* options enabled).
Are you sure you want to continue? (You have 30 seconds to decide)
(N/y): y
sacctmgr: error: slurmdbd: Getting response to message type: DBD_ARCHIVE_DUMP
sacctmgr: error: slurmdbd: DBD_ARCHIVE_DUMP failure: No error
Problem dumping archive: Unspecified error

> What is the status of your production cluster? Did you restart any daemons?
> Have you been able to check for runaway jobs again?
> I just want to make sure it is in a functional state at least while we work
> on the test cluster.

Yes, I did the restart, and after that the "DBD_CLUSTER_TRES: cluster not registered" error disappeared. There were no runaway jobs, and production is functional without any issues.

> Thanks
(In reply to Broderick Gardner from comment #24)
> Oh I see from the slurmdbd.log I already have that you are probably at
> debug4. You don't need to be that high; debug2 is plenty. I do still need
> the slurmdbd.log from the 19.05 test cluster.

Sure, please review the attached logs and configs, and let me know if anything else is needed from my end.
The logs from each start of your slurmdbd are concerning. What is happening there from your side? Here is the timeline I read from the log:

[2020-06-04T06:58:24.953] slurmdbd version 19.05.7 started
...
[2020-06-04T06:58:24.954] debug4: 0(as_mysql_usage.c:118) query
select hourly_rollup, daily_rollup, monthly_rollup from "scp_last_ran_table"
...
(some connections, but no log messages to indicate the init process finished)
[2020-06-04T07:28:56.554] Terminate signal (SIGINT or SIGTERM) received
[2020-06-04T07:28:56.554] debug: Waiting for rollup thread to finish.
[2020-06-04T07:28:56.555] debug: rpc_mgr shutting down
(log messages to indicate the init finished!)
[2020-06-04T07:30:31.001] slurmdbd version 19.05.7 started
(restarted by systemd?)
...

I see this pattern a few times on both 18.08 and 19.05, with the time between slurmdbd start and the SIGTERM ranging from 30 minutes to 12 hours.

My preliminary guess is an issue with mysql, i.e. the slurmdbd is hanging on a mysql query during init. Please enable the slow query log in mysql and attach it after restarting the slurmdbd. In case that's not enough, you might want to roll back the test cluster database to 17.11 (from backup/production) and start slurmdbd 19.05 again as well.

Thanks
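To survey how often this start/SIGTERM pattern occurs across a long log, the timestamps can be paired up mechanically. A rough sketch; the helper name is made up, and the timestamp format is assumed from the excerpt above:

```shell
# survey_restarts: read a slurmdbd.log on stdin and print each
# "version ... started" timestamp paired with the timestamp of the next
# "Terminate signal" line, one pair per daemon lifetime.
survey_restarts() {
  awk '
    /slurmdbd version .* started/ { start = $1 }
    /Terminate signal/ && start   { print start, "->", $1; start = "" }
  '
}
```

Running `survey_restarts < /var/log/slurm/slurmdbd.log` then makes the 30-minute-to-12-hour spread easy to eyeball.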
Created attachment 14551 [details] mariadb_slow.log
(In reply to Broderick Gardner from comment #29)
> Please enable the slow query log in mysql and attach that after restarting
> the slurmdbd.

We already have the slow query log enabled; it is attached for your review.

Thanks
Can you give me any more information about the slurmdbd startup procedure? Are you manually sending the SIGTERM, or is that systemd?
(In reply to Broderick Gardner from comment #32)
> Can you give me any more information about the slurmdbd startup procedure?
> Are you manually sending the SIGTERM, or is that systemd?

What kind of slurmdbd startup procedure are you looking for? Where do you see the SIGTERM? It should be coming from systemd.
I mean: are you doing anything manually during the slurmdbd startup that would cause the SIGTERM? `systemctl restart slurmdbd`?

As I mentioned, the slurmdbd appears to hang, possibly on a mysql query, before it finishes the init phase. It completes the init phase after receiving a SIGTERM, then restarts. Are you aware of this or anything like it?

I need to know if this is happening consistently. Please redo the conversion from 17.11 to 19.05, watching for the stalled init. The slurmdbd config is printed at debug2 during the init, so if you watch the log (`tail -f /var/log/slurm/slurmdbd.log`), you should see whether it is stalled. If it is, I need to know what query it's stalled on:

mysql> show engine innodb status;
Created attachment 14571 [details] show_engine_innodb_status_output
(In reply to Broderick Gardner from comment #34)
> I need to know if this is happening consistently. Please redo the conversion
> from 17.11 to 19.05, watching for the stalled init. If it is, I need to know
> what query it's stalled on.
> mysql> show engine innodb status;

I started the conversion from 17.11 to 19.05 around 11:35 UTC, and it is now 13:05 UTC. See the log below; it has not been updated for almost 1.5 hours. Is this expected, or is it stalled? Attached is the show engine innodb status output; could you please review it and let me know what is going on here?
[2020-06-08T11:28:19.687] slurmdbd version 17.11.13 started
[2020-06-08T11:28:51.913] Warning: Note very large processing time from hourly_rollup for scp: usec=32224147 began=11:28:19.689
[2020-06-08T11:30:02.731] Warning: Note very large processing time from daily_rollup for scp: usec=70817945 began=11:28:51.913
[2020-06-08T11:35:07.302] Terminate signal (SIGINT or SIGTERM) received
[2020-06-08T11:35:07.352] MySQL server version is: 5.5.60-MariaDB
[2020-06-08T11:35:07.379] pre-converting step table for scp
[2020-06-08T11:35:07.406] adding column tres_usage_in_ave after tres_alloc in table "scp_step_table"
[2020-06-08T11:35:07.406] adding column tres_usage_in_max after tres_usage_in_ave in table "scp_step_table"
[2020-06-08T11:35:07.406] adding column tres_usage_in_max_taskid after tres_usage_in_max in table "scp_step_table"
[2020-06-08T11:35:07.406] adding column tres_usage_in_max_nodeid after tres_usage_in_max_taskid in table "scp_step_table"
[2020-06-08T11:35:07.406] adding column tres_usage_in_min after tres_usage_in_max_nodeid in table "scp_step_table"
[2020-06-08T11:35:07.406] adding column tres_usage_in_min_taskid after tres_usage_in_min in table "scp_step_table"
[2020-06-08T11:35:07.406] adding column tres_usage_in_min_nodeid after tres_usage_in_min_taskid in table "scp_step_table"
[2020-06-08T11:35:07.406] adding column tres_usage_in_tot after tres_usage_in_min_nodeid in table "scp_step_table"
[2020-06-08T11:35:07.406] adding column tres_usage_out_ave after tres_usage_in_tot in table "scp_step_table"
[2020-06-08T11:35:07.406] adding column tres_usage_out_max after tres_usage_out_ave in table "scp_step_table"
[2020-06-08T11:35:07.406] adding column tres_usage_out_max_taskid after tres_usage_out_max in table "scp_step_table"
[2020-06-08T11:35:07.406] adding column tres_usage_out_max_nodeid after tres_usage_out_max_taskid in table "scp_step_table"
[2020-06-08T11:35:07.406] adding column tres_usage_out_min after tres_usage_out_max_nodeid in table "scp_step_table"
[2020-06-08T11:35:07.406] adding column tres_usage_out_min_taskid after tres_usage_out_min in table "scp_step_table"
[2020-06-08T11:35:07.406] adding column tres_usage_out_min_nodeid after tres_usage_out_min_taskid in table "scp_step_table"
[2020-06-08T11:35:07.407] adding column tres_usage_out_tot after tres_usage_out_min_nodeid in table "scp_step_table"
[2020-06-08T11:51:58.747] Warning: Note very large processing time from make table current "scp_step_table": usec=1011341345 began=11:35:07.406

[root@slurmdbd1905 tmp]# date
Mon Jun  8 13:05:24 UTC 2020

Based on the process list, it appears that it is updating the table.

MariaDB [slurm_acct_db]> show processlist;
| Id | User  | Host      | db            | Command | Time | State          | Info                                                                                                 | Progress |
| 26 | slurm | localhost | slurm_acct_db | Query   |    0 | Writing to net | update "scp_step_table" set tres_usage_in_max='2=1290240,6=1,7=178241536', tres_usage_in_max_nodeid= |    0.000 |
| 34 | root  | localhost | slurm_acct_db | Query   |    0 | NULL           | show processlist                                                                                     |    0.000 |
2 rows in set (0.00 sec)

MariaDB [slurm_acct_db]> show processlist;
| Id | User  | Host      | db            | Command | Time | State    | Info                                                                                                 | Progress |
| 26 | slurm | localhost | slurm_acct_db | Query   |    0 | Updating | update "scp_step_table" set tres_usage_in_max='2=2553724928,6=40616,7=9574662144', tres_usage_in_max |    0.000 |
| 34 | root  | localhost | slurm_acct_db | Query   |    0 | NULL     | show processlist                                                                                     |    0.000 |
2 rows in set (0.00 sec)
No, that looks normal, but I need more of the slurmdbd.log.
When you run the upgraded slurmdbd the first time, it should not be under systemd. It should be manual:

slurmdbd -D

This is because systemd will kill the daemon if it thinks it's hung while upgrading the database.
(In reply to Broderick Gardner from comment #38)
> When you run the upgraded slurmdbd the first time, it should not be under
> systemd. It should be manual:
> slurmdbd -D
>
> This is because systemd will kill the daemon if it thinks it's hung while
> upgrading the database.

Your earlier comment suggested watching `tail -f /var/log/slurm/slurmdbd.log` to see if it is stalled, so I assumed the conversion should run under systemd. Please find the attached slurmdbd.log; the conversion is still running. Please review and let me know what is next.

Thanks
Sathish
Created attachment 14580 [details] slurmdbd.log on 08062020 slurmdbd.log taken during the conversion from 17.11 to 19.05
Hi Broderick,

I just checked, and it appears that the conversion completed successfully. Below are the log entries that are missing from my earlier slurmdbd.log attachment.

[2020-06-08T16:55:41.614] dropping column max_disk_write_task from table "scp_step_table"
[2020-06-08T16:55:41.615] dropping column max_disk_write_node from table "scp_step_table"
[2020-06-08T16:55:41.615] dropping column ave_disk_write from table "scp_step_table"
[2020-06-08T17:12:25.219] Warning: Note very large processing time from make table current "scp_step_table": usec=1003607405 began=16:55:41.612
[2020-06-08T17:12:25.317] adding column max_jobs_accrue_pa after max_jobs_per_user in table qos_table
[2020-06-08T17:12:25.317] adding column max_jobs_accrue_pu after max_jobs_accrue_pa in table qos_table
[2020-06-08T17:12:25.317] adding column min_prio_thresh after max_jobs_accrue_pu in table qos_table
[2020-06-08T17:12:25.317] adding column grp_jobs_accrue after grp_jobs in table qos_table
[2020-06-08T17:12:25.317] adding column preempt_exempt_time after preempt_mode in table qos_table
[2020-06-08T17:12:25.379] Conversion done: success!
[2020-06-08T17:12:25.387] Accounting storage MYSQL plugin loaded

Please let me know if you need any other logs for review.

Thanks
Sathish
(In reply to Sathishkumar from comment #39) > your last comment suggested to watch tail -f /var/log/slurm/slurmdbd.log to > see if it is stalled. So I assumed, that the conversion should be under > systemd. Yes, I forgot about the systemd issue with long database upgrade times. The log does not show slurmdbd 19.05 starting, only slurmdbd 17.11. Can you verify the procedure? The fact that it did a database conversion indicates a database version mismatch, but it looks like slurmdbd 19.05 was _not_ running.
(In reply to Broderick Gardner from comment #42)
> (In reply to Sathishkumar from comment #39)
> > your last comment suggested to watch tail -f /var/log/slurm/slurmdbd.log to
> > see if it is stalled. So I assumed, that the conversion should be under
> > systemd.
> Yes, I forgot about the systemd issue with long database upgrade times.
>
> The log does not show slurmdbd 19.05 starting, only slurmdbd 17.11. Can you
> verify the procedure? The fact that it did a database conversion indicates a
> database version mismatch, but it looks like slurmdbd 19.05 was _not_
> running.

Please review the output below and suggest the next steps.

[root@slurmdbd1905 ~]# slurmdbd -V
slurm 19.05.7

[root@slurmdbd1905 ~]# systemctl status slurmdbd
● slurmdbd.service - Slurm DBD accounting daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmdbd.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/slurmdbd.service.d
           └─override.conf
   Active: active (running) since Mon 2020-06-08 11:35:07 UTC; 6h ago
 Main PID: 3258 (slurmdbd)
   CGroup: /system.slice/slurmdbd.service
           └─3258 /usr/sbin/slurmdbd

Jun 08 16:55:41 slurmdbd1905 slurmdbd[3258]: dropping column ave_disk_write from table "scp_step_table"
Jun 08 17:12:25 slurmdbd1905 slurmdbd[3258]: Warning: Note very large processing time from make table current "scp_step_table": usec=1003607405 began=16:55:41.612
Jun 08 17:12:25 slurmdbd1905 slurmdbd[3258]: adding column max_jobs_accrue_pa after max_jobs_per_user in table qos_table
Jun 08 17:12:25 slurmdbd1905 slurmdbd[3258]: adding column max_jobs_accrue_pu after max_jobs_accrue_pa in table qos_table
Jun 08 17:12:25 slurmdbd1905 slurmdbd[3258]: adding column min_prio_thresh after max_jobs_accrue_pu in table qos_table
Jun 08 17:12:25 slurmdbd1905 slurmdbd[3258]: adding column grp_jobs_accrue after grp_jobs in table qos_table
Jun 08 17:12:25 slurmdbd1905 slurmdbd[3258]: adding column preempt_exempt_time after preempt_mode in table qos_table
Jun 08 17:12:25 slurmdbd1905 slurmdbd[3258]: Conversion done: success!
Jun 08 17:12:25 slurmdbd1905 slurmdbd[3258]: Accounting storage MYSQL plugin loaded
Jun 08 17:21:29 slurmdbd1905 slurmdbd[3258]: slurmdbd version 19.05.7 started

[root@slurmdbd1905 ~]# tail -n 10 /var/log/slurm/slurmdbd.log
[2020-06-08T16:55:41.615] dropping column ave_disk_write from table "scp_step_table"
[2020-06-08T17:12:25.219] Warning: Note very large processing time from make table current "scp_step_table": usec=1003607405 began=16:55:41.612
[2020-06-08T17:12:25.317] adding column max_jobs_accrue_pa after max_jobs_per_user in table qos_table
[2020-06-08T17:12:25.317] adding column max_jobs_accrue_pu after max_jobs_accrue_pa in table qos_table
[2020-06-08T17:12:25.317] adding column min_prio_thresh after max_jobs_accrue_pu in table qos_table
[2020-06-08T17:12:25.317] adding column grp_jobs_accrue after grp_jobs in table qos_table
[2020-06-08T17:12:25.317] adding column preempt_exempt_time after preempt_mode in table qos_table
[2020-06-08T17:12:25.379] Conversion done: success!
[2020-06-08T17:12:25.387] Accounting storage MYSQL plugin loaded
[2020-06-08T17:21:29.874] slurmdbd version 19.05.7 started

Thanks
Sathish
I have not started the archive/purge process yet; I am waiting for your confirmation that the conversion looks fine based on the logs I shared. I will proceed as soon as I hear from you.

Thanks
Sathish
Go ahead, I'm still investigating the logs, but you can proceed.
(In reply to Broderick Gardner from comment #45)
> Go ahead, I'm still investigating the logs, but you can proceed.

OK, I have enabled the parameters below in slurmdbd.conf and restarted it.

#Purging
PurgeEventAfter=18852hours
PurgeJobAfter=18852hours
PurgeResvAfter=18852hours
PurgeStepAfter=18852hours
PurgeSuspendAfter=18852hours
PurgeTXNAfter=18852hours
PurgeUsageAfter=18852hours

#Archival
ArchiveDir=/opt/archive
ArchiveEvents=yes
ArchiveJobs=yes
ArchiveResvs=yes

I will let you know the status as soon as I see the archive run start.

Thanks
Sathish
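As a sanity check on those values, 18852 hours works out to roughly a 26-month retention window; quick shell arithmetic (integer division, so fractions of a day or month are dropped):

```shell
# rough retention implied by the Purge*=18852hours settings above
hours=18852
days=$((hours / 24))
months=$((days / 30))
echo "$days days (~$months months)"
```

In other words, records older than about 785 days become eligible for purge (and archive, where the corresponding Archive* option is enabled).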
I ran the archive manually, but I got the same error:

[root@slurmdbd1905 opt]# sacctmgr archive dump
This may result in loss of accounting database records (if Purge* options enabled).
Are you sure you want to continue? (You have 30 seconds to decide)
(N/y): y
sacctmgr: error: slurmdbd: Getting response to message type: DBD_ARCHIVE_DUMP
sacctmgr: error: slurmdbd: DBD_ARCHIVE_DUMP failure: No error
Problem dumping archive: Unspecified error

Attached is the slurmdbd.log for your review. Under the archive folder I do see the files below; in the previous attempt these files were not created.

[root@slurmdbd1905 archive]# pwd
/opt/archive
[root@slurmdbd1905 archive]# ls
scp_event_table_archive_2017-09-01T00:00:00_2017-10-01T00:00:00
scp_event_table_archive_2017-10-01T00:00:00_2017-11-01T00:00:00
scp_event_table_archive_2017-11-01T00:00:00_2017-12-01T00:00:00
scp_event_table_archive_2017-12-01T00:00:00_2018-01-01T00:00:00
scp_event_table_archive_2018-01-01T00:00:00_2018-02-01T00:00:00
scp_event_table_archive_2018-02-01T00:00:00_2018-03-01T00:00:00
scp_event_table_archive_2018-03-06T07:00:00_2018-04-15T06:59:59

Thanks
Sathish
Created attachment 14592 [details] slurmctld logs taken after running the manual archive
Created attachment 14593 [details] mariadb_slow.log taken after running the archival manually
Thanks. My confusion before was my mistake, sorry. I was concerned about not seeing the "version 19.05 started" log message even after the conversion completed.

The conversion appears to be error-free. Looking at the archive once again, it is interesting that the event table archived successfully but the jobs (and others) did not. Is there no error on the slurmdbd during the archive? Please post the slurmdbd log again, either all of it or trimmed to everything since:

[2020-06-08T17:21:29.874] slurmdbd version 19.05.7 started

Thanks
(In reply to Broderick Gardner from comment #50)
> Thanks. My confusion before was my mistake, sorry. I was concerned about not
> seeing the "version 19.05 started" log message, even after the conversion
> completed.
>
> The conversion appears to be error-free. Looking at the archive once again,
> it is interesting that the event table successfully archived, but the jobs
> (and others) did not. There is no error on the slurmdbd during the archive?
> Please post the slurmdbd log again, either all of it or trimmed to just since
> [2020-06-08T17:21:29.874] slurmdbd version 19.05.7 started
>
> Thanks

Hi Broderick,

Thanks for the update. As I mentioned, I ran the archival manually; below is the command I used, and it fails with "Unspecified error":

[root@slurmdbd1905 opt]# sacctmgr archive dump
This may result in loss of accounting database records (if Purge* options enabled).
Are you sure you want to continue? (You have 30 seconds to decide) (N/y): y
sacctmgr: error: slurmdbd: Getting response to message type: DBD_ARCHIVE_DUMP
sacctmgr: error: slurmdbd: DBD_ARCHIVE_DUMP failure: No error
 Problem dumping archive: Unspecified error

Why does it give the above error? Please advise. There are no errors in the logs.

Also, I left the configuration as it is, and it looks like the archival ran at the scheduled interval and completed. I say it completed because I see that both the event and job tables have been archived and the files are under the archive folder, but I still need to validate that everything ran successfully.
[root@slurmdbd1905 archive]# ls
scp_event_table_archive_2017-09-01T00:00:00_2017-10-01T00:00:00
scp_event_table_archive_2017-10-01T00:00:00_2017-11-01T00:00:00
scp_event_table_archive_2017-11-01T00:00:00_2017-12-01T00:00:00
scp_event_table_archive_2017-12-01T00:00:00_2018-01-01T00:00:00
scp_event_table_archive_2018-01-01T00:00:00_2018-02-01T00:00:00
scp_event_table_archive_2018-02-01T00:00:00_2018-03-01T00:00:00
scp_event_table_archive_2018-03-06T07:00:00_2018-04-15T06:59:59
scp_job_table_archive_2017-09-01T00:00:00_2017-10-01T00:00:00
scp_job_table_archive_2017-10-01T00:00:00_2017-11-01T00:00:00
scp_job_table_archive_2017-11-01T00:00:00_2017-12-01T00:00:00
scp_job_table_archive_2017-12-01T00:00:00_2018-01-01T00:00:00
scp_job_table_archive_2018-01-01T00:00:00_2018-02-01T00:00:00
scp_job_table_archive_2018-02-01T00:00:00_2018-03-01T00:00:00
scp_job_table_archive_2018-03-01T00:00:00_2018-04-15T06:59:59
scp_job_table_archive_2018-04-06T18:00:00_2018-04-15T06:59:59
scp_job_table_archive_2018-04-15T07:00:00_2018-04-15T07:59:59
scp_job_table_archive_2018-04-15T08:00:00_2018-04-15T08:59:59
scp_job_table_archive_2018-04-15T09:00:00_2018-04-15T09:59:59
scp_job_table_archive_2018-04-15T10:00:00_2018-04-15T10:59:59
scp_job_table_archive_2018-04-15T11:00:00_2018-04-15T11:59:59
scp_job_table_archive_2018-04-15T12:00:00_2018-04-15T12:59:59
scp_job_table_archive_2018-04-15T13:00:00_2018-04-15T13:59:59
scp_job_table_archive_2018-04-15T14:00:00_2018-04-15T14:59:59
scp_job_table_archive_2018-04-15T15:00:00_2018-04-15T15:59:59
scp_job_table_archive_2018-04-15T16:00:00_2018-04-15T16:59:59

I am wondering why I got the "Unspecified error" while running the archive manually but not when the archival is triggered at the scheduled time. Could you please review the slurmdbd.log_09062020 log, which I have attached, and let me know if you see any errors?

Thanks
Sathish
Created attachment 14601 [details] slurmdbd logs taken on 09062020 (taken after the archival ran at the scheduled interval)
Hi Broderick,

I would appreciate it if you could review and comment. As I said earlier, this is a bit critical, and we want to upgrade to a supported version as soon as possible.

Thanks
Sathish
Created attachment 14621 [details] sacctmgr error fix

I found an issue with error reporting; it is the reason the command is "failing" with "No error". Applying this patch and rerunning the manual `sacctmgr archive dump` command will show the true error message.

The method for applying the patch depends on your build and deployment method. Let me know whether you are able to do it.

Thanks
(In reply to Broderick Gardner from comment #54)
> Created attachment 14621 [details]
> sacctmgr error fix
>
> I found an issue with error reporting that is the reason why it is "failing"
> with "no error". Applying this patch and rerunning the manual `sacctmgr
> archive dump` command will show the true error message.
>
> The method for applying the patch depends on your build and deployment
> method. Let me know if you are able to do it or not.
>
> Thanks

We are using RPMs to deploy Slurm in our environment. I applied the patch you provided and rebuilt the RPMs from source with the patch applied:

[root@slurmdbd1905n slurmdbd]# patch -b -V numbered < ../../../../../sacctmgr_error_fix
patching file accounting_storage_slurmdbd.c
[root@slurmdbd1905n slurmdbd]# ls
Makefile.am  Makefile.in  accounting_storage_slurmdbd.c  accounting_storage_slurmdbd.c.~1~  slurmdbd_agent.c  slurmdbd_agent.h

Is this the only difference in the given patch?

[root@slurmdbd1905n slurmdbd]# diff accounting_storage_slurmdbd.c accounting_storage_slurmdbd.c.~1~
2931c2931
< error("slurmdbd: DBD_ARCHIVE_DUMP failure: %s", slurm_strerror(rc));
---
> error("slurmdbd: DBD_ARCHIVE_DUMP failure: %m");

Now, on the test system, `sacctmgr archive dump` is working as expected, because the archive/purge had already run at the scheduled interval per the Purge*After parameters configured in slurmdbd.conf.

But I suspect that if we run `sacctmgr archive dump` with full data in the DB, it will still fail. I could not test this because the data had already been archived/purged from the database.
But when I try to pull more than two months of data using sacct, it gives the error below; anything less than two months works as expected. The memory usage is high when we pull data for more than two months.

[root@slurmdbd1905n slurm-patch]# sacct --starttime=2020-03-01
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
sacct: error: slurmdbd: Getting response to message type: DBD_GET_JOBS_COND
sacct: error: slurmdbd: DBD_GET_JOBS_COND failure: Unspecified error

Sometimes it consumes all the available memory and slurmdbd gets killed by the system. See the status below, which was captured when we ran sacct to pull more than three months of data.

[root@slurmdbd1905n slurm-patch]# systemctl status slurmdbd.service
● slurmdbd.service - Slurm DBD accounting daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmdbd.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/slurmdbd.service.d
           └─override.conf
   Active: activating (auto-restart) (Result: signal) since Mon 2020-06-15 18:09:12 UTC; 26s ago
 Main PID: 78925 (code=killed, signal=KILL)

Jun 15 18:09:12 slurmdbd1905n systemd[1]: Unit slurmdbd.service entered failed state.
Jun 15 18:09:12 slurmdbd1905n systemd[1]: slurmdbd.service failed.

I believe this is also what happens with `sacctmgr archive dump`: when we run it with full data in the DB, it crashes with an unspecified error. Could you please advise a fix for this? Let me know if you need any other info from my end to take this further.

Thanks
Sathish
(In reply to Sathishkumar from comment #55) > Is this the only difference in the given patch ? Yes. > now, on the test system, sacctmgr archive dump is working as expected, > because the archive/purge was already ran by the system as per the scheduled > interval as per the following parameter Purge*After configured in > slurmdbd.conf > > But I feel that if we run sacctmgr archive dump with full data in the DB, it > is still fail, I could not test sacctmgr archive dump, because the data was > already archived/purged from the databases. Yes, it will still fail. The patch fixes error logging, so the correct error will be logged. > > But when I try to pull more than two months data using sacct, it gives the > below error, any thing less than two months it is working as expected, the > memory usage is high when we pull data for more than two months Is this different than usual? If the amount of data to be sent from the slurmdbd is over 3GB, it will fail. > [root@slurmdbd1905n slurm-patch]# sacct --starttime=2020-03-01 > JobID JobName Partition Account AllocCPUS State ExitCode > ------------ ---------- ---------- ---------- ---------- ---------- -------- > sacct: error: slurmdbd: Getting response to message type: DBD_GET_JOBS_COND > sacct: error: slurmdbd: DBD_GET_JOBS_COND failure: Unspecified error > > > sometimes it will consume all the available memory and slurmdbd will get > killed by the system, see the below status which was captured when we ran > sacct to pull more than three months data. Requesting large amounts of data from the slurmdbd can cause the slurmdbd to use a massive amount of memory processing the request before determining that the amount of data is too large. See Bug 5817. No, there is currently not a fix for this. You can mitigate it to some extent using the MaxQueryTimeRange in the slurmdbd.conf. This limits the maximum range of time that can be requested from the slurmdbd. 
> > I believe this is the same status for sacctmgr archive dump as well, when we > run it will full data on the DB, it is getting crashed with unspecified > error. This sacct error is most likely unrelated to the sacctmgr archive dump error. In order to proceed, I need to see the actual error message, which means restoring the database and running slurmdbd-19.05 with the patch. During conversion, comment out the archive and purge options from slurmdbd.conf. Then restore those lines, restart the slurmdbd, and try `sacctmgr archive dump` again. This might not be feasible considering the size of the database, but you may want to backup the database after 19.05 conversion but before the archive/purge. Summary of issues: Automatic archive/purge: working in 19.05 Manual archive/purge: Unknown error. (Note that manual archive/purge is not recommended as a regular operation) error: No reservation by id: On hold until the cluster is fully upgraded to 19.05 Thanks
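An illustrative sketch of the MaxQueryTimeRange mitigation mentioned above (the 60-day value is an example, not a recommendation; pick a range that fits your reporting needs, and note that slurmdbd must be restarted for the change to take effect):

```
# slurmdbd.conf (fragment) -- example value only
# Reject sacct/sacctmgr queries spanning more than 60 days (days-hours format)
MaxQueryTimeRange=60-0
```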
(In reply to Broderick Gardner from comment #56) > (In reply to Sathishkumar from comment #55) > > Is this the only difference in the given patch ? > Yes. > > > now, on the test system, sacctmgr archive dump is working as expected, > > because the archive/purge was already ran by the system as per the scheduled > > interval as per the following parameter Purge*After configured in > > slurmdbd.conf > > > > But I feel that if we run sacctmgr archive dump with full data in the DB, it > > is still fail, I could not test sacctmgr archive dump, because the data was > > already archived/purged from the databases. > Yes, it will still fail. The patch fixes error logging, so the correct error > will be logged. > > > > > But when I try to pull more than two months data using sacct, it gives the > > below error, any thing less than two months it is working as expected, the > > memory usage is high when we pull data for more than two months > Is this different than usual? If the amount of data to be sent from the > slurmdbd is over 3GB, it will fail. > > > [root@slurmdbd1905n slurm-patch]# sacct --starttime=2020-03-01 > > JobID JobName Partition Account AllocCPUS State ExitCode > > ------------ ---------- ---------- ---------- ---------- ---------- -------- > > sacct: error: slurmdbd: Getting response to message type: DBD_GET_JOBS_COND > > sacct: error: slurmdbd: DBD_GET_JOBS_COND failure: Unspecified error > > > > > > sometimes it will consume all the available memory and slurmdbd will get > > killed by the system, see the below status which was captured when we ran > > sacct to pull more than three months data. > Requesting large amounts of data from the slurmdbd can cause the slurmdbd to > use a massive amount of memory processing the request before determining > that the amount of data is too large. See Bug 5817. No, there is currently > not a fix for this. You can mitigate it to some extent using the > MaxQueryTimeRange in the slurmdbd.conf. 
> This limits the maximum range of
> time that can be requested from the slurmdbd.
>
> > I believe this is the same status for sacctmgr archive dump as well, when we
> > run it will full data on the DB, it is getting crashed with unspecified
> > error.
> This sacct error is most likely unrelated to the sacctmgr archive dump error.
>
> In order to proceed, I need to see the actual error message, which means
> restoring the database and running slurmdbd-19.05 with the patch. During
> conversion, comment out the archive and purge options from slurmdbd.conf.
> Then restore those lines, restart the slurmdbd, and try `sacctmgr archive
> dump` again. This might not be feasible considering the size of the
> database, but you may want to backup the database after 19.05 conversion but
> before the archive/purge.
>
> Summary of issues:
> Automatic archive/purge: working in 19.05
> Manual archive/purge: Unknown error.
> (Note that manual archive/purge is not recommended as a regular operation)
> error: No reservation by id: On hold until the cluster is fully upgraded to
> 19.05
>
> Thanks

Thanks for the clarification. I will share the actual error by tomorrow, after restoring the DB and then running 19.05 with the patch.

I have two questions from your input above:
1. Do you think it is not feasible to use `sacctmgr archive dump` given the size of our DB? If so, in which cases should I use `sacctmgr archive dump`?
2. I don't understand what you mean by backing up the database after the 19.05 conversion but before the archive/purge. (FYI, I am testing this on a test system.)

Also, as I said, on the test system the scheduled archive and purge are working as expected. The archive and purge trigger every day, keeping 548 days of data in the DB and purging everything older. After purging, the DB became 1.5 GB smaller.
At the same time, when I checked the size of the archive folder containing all the generated files, it was no more than 102 MB. Is that expected, and how do we confirm that all the data was archived? I ask because the difference between the size reduction in the database (1.5 GB) and the size of the archive folder (102 MB) is so large.

Based on the last generated archive file (scp_resv_table_archive_2018-11-24T00:00:00_2018-12-14T23:59:59), it archived the data up to 549 days back. Is there any way to check that the archived data is intact?

Regards
Sathish
(In reply to Sathishkumar from comment #57) > I have a question from your input above > 1. Do you think that it is not feasible to use sacctmgr archive dump for the > size of our DB ? if yes, on which case should I use sacctmgr archive dump ? I just mean that you should not need to use `sacctmgr archive dump` because it is configured to run automatically. > > 2. I dont understand what do you mean by backup the database after 19.05 > conversion but before the archive/purge. (FYI, i am testing this in a test > system) The conversion of the database from 17.11 to 19.05 takes a long time, and you have had to do it a few times. So you could back up the converted database in case it is needed again. This would just be to potentially save time in the future, not to protect against data loss. > Also, as I said that on the test system the scheduled archive and purging > are working as expected. Archive and purging will trigger everyday and will > keep 548 days of data on the DB and purge everything. Have you changed the configuration? The slurmdbd.conf lines you sent before indicated purging after 18852 hours, which would trigger a purge every hour. In any case, I expect that to be sufficient; you should never have to run `sacctmgr archive dump` outside of testing. > After purging the DB > has become 1.5G less. At the same time when I checked the size of the > archive folder which contains all the generated files are not more than > 102M, is that expected and how do we confirm that all the data archived ? I > ask this because the difference is so huge between the size which got > reduced from the databases(1.5G) and the archive folder size(102M). > Based on the last generated archive file > (scp_resv_table_archive_2018-11-24T00:00:00_2018-12-14T23:59:59) it archived > the data until 549 days. Is there any way available to check if the data > archived are intact? You could reload the data and check the database size again. Also run sacct on a range of time that was purged. 
Thanks
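The reload-and-verify suggestion above can be sketched as follows; the file path is one of the archive files from this ticket, used purely as an example:

```shell
# Load one archive file back into the database (example path from this ticket)
sacctmgr archive load file=/opt/archive/scp_job_table_archive_2018-02-01T00:00:00_2018-03-01T00:00:00

# Then query a time range that had been purged and confirm records come back
sacct --allusers --starttime=2018-02-01 --endtime=2018-02-28 | head
```

As far as this thread establishes, there is no dedicated integrity checker; comparing record counts before the purge and after the reload is the closest equivalent.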
(In reply to Broderick Gardner from comment #58)
> (In reply to Sathishkumar from comment #57)
> > I have a question from your input above
> > 1. Do you think that it is not feasible to use sacctmgr archive dump for the
> > size of our DB ? if yes, on which case should I use sacctmgr archive dump ?
> I just mean that you should not need to use `sacctmgr archive dump` because
> it is configured to run automatically.

Thanks for the clarification.

> > 2. I dont understand what do you mean by backup the database after 19.05
> > conversion but before the archive/purge. (FYI, i am testing this in a test
> > system)
> The conversion of the database from 17.11 to 19.05 takes a long time, and
> you have had to do it a few times. So you could back up the converted
> database in case it is needed again. This would just be to potentially save
> time in the future, not to protect against data loss.

Got your point.

> > Also, as I said that on the test system the scheduled archive and purging
> > are working as expected. Archive and purging will trigger everyday and will
> > keep 548 days of data on the DB and purge everything.
> Have you changed the configuration? The slurmdbd.conf lines you sent before
> indicated purging after 18852 hours, which would trigger a purge every hour.
> In any case, I expect that to be sufficient; you should never have to run
> `sacctmgr archive dump` outside of testing.

Yes, I changed it to 548 days, just to see if the archive runs fine when I convert the setting from hours to days, and I don't see any issues with that at the moment.

> > After purging the DB
> > has become 1.5G less. At the same time when I checked the size of the
> > archive folder which contains all the generated files are not more than
> > 102M, is that expected and how do we confirm that all the data archived ? I
> > ask this because the difference is so huge between the size which got
> > reduced from the databases(1.5G) and the archive folder size(102M).
> > Based on the last generated archive file
> > (scp_resv_table_archive_2018-11-24T00:00:00_2018-12-14T23:59:59) it archived
> > the data until 549 days. Is there any way available to check if the data
> > archived are intact?
> You could reload the data and check the database size again. Also run sacct
> on a range of time that was purged.
>
> Thanks

I randomly loaded an archived file and was able to get the desired result using sacct. But I was curious whether there are any tools or command options available to check the integrity.

I will share the error message as soon as I am done with the testing.

Thanks
Hi Broderick,

I installed the patch you shared and did an upgrade on a new test VM with the production data.

Slurm version
[root@slurmdbd1905b ~]# slurmdbd -V
slurm 19.05.7

Slurmdbd purging and archive details
PurgeStepAfter=548days
PurgeSuspendAfter=548days
PurgeTXNAfter=548days
PurgeUsageAfter=548days
#
##Archival
ArchiveDir=/mnt/backup/slurmdbd/archive
ArchiveEvents=yes
ArchiveJobs=yes
ArchiveResvs=yes

I ran the archive dump but am still getting the same error. Any idea?

[root@slurmdbd1905b ~]# sacctmgr archive dump
This may result in loss of accounting database records (if Purge* options enabled).
Are you sure you want to continue? (You have 30 seconds to decide) (N/y): y
sacctmgr: error: slurmdbd: Getting response to message type: DBD_ARCHIVE_DUMP
sacctmgr: error: slurmdbd: DBD_ARCHIVE_DUMP failure: Unspecified error
 Problem dumping archive: Unspecified error

Also, we have a scheduled maintenance this coming weekend to upgrade Slurm from 17.11 to 19.05. We are also planning to change the statesave location to a different path. Is there any sequence I should follow?

I am planning to follow the sequence below (note: all Slurm services are running in a container). Please advise if I am wrong.

1. Make sure there are no active jobs on the cluster.
2. Upgrade the slurmdbd first; wait 6+ hours for the conversion to complete. Once the conversion is done, make sure the dbd comes up and works as expected by querying sacct and sacctmgr and checking the service status. (Please advise if there is any specific test I should perform to make sure the database is up and running fine after conversion.)
3. Provision a new slurmctld00 (primary) and slurmctld01 (backup) with 19.05 and the new statesave location.
4. Stop the primary and backup slurmctld.
5. Copy all the statesave files from the old location to the new location.
6. Start the primary and backup slurmctld.
7. Validate that the communication between slurmdbd and slurmctld is working as expected. (Please advise if there is any specific test I should perform to make sure the communication between slurmdbd and slurmctld is fine with 19.05.)
8. Proceed with upgrading all slurmd's.

Regards
Sathish
Hi Broderick,

Is there any easy way to verify that I am running the slurmdbd with your patch?

Regards
Sathish
No, there's not. But the error message is different, so I would guess that you are.

sacctmgr: error: slurmdbd: Getting response to message type: DBD_ARCHIVE_DUMP
sacctmgr: error: slurmdbd: DBD_ARCHIVE_DUMP failure: No error
 Problem dumping archive: Unspecified error

vs

sacctmgr: error: slurmdbd: Getting response to message type: DBD_ARCHIVE_DUMP
sacctmgr: error: slurmdbd: DBD_ARCHIVE_DUMP failure: Unspecified error
 Problem dumping archive: Unspecified error
(In reply to Sathishkumar from comment #60) > Also, we have a scheduled maintenance coming weekend to upgrade the slurm > from 17.11 to 19.05. We are also planning to change the statesave location > to different path. Is there any sequence I should follow ? > > I am planning to follow the below sequence (Note: all slurm services are > running on a container) Please advise if there if I am wrong. Here are notes on your procedure: > 1. Make sure there is no active jobs on the cluster. One way to accomplish this is to set a maintenance reservation for the period of the upgrade so that no jobs are scheduled that could interfere. > > 2. Upgrade the slurmdbd first, wait for 6+ hours to complete the conversion, > once the conversion is done, make sure the dbd is coming up and working as > expected by querying sacct and sacctmgr, also checking service status. > (please advise if there is any specific test i should perform to make sure > the database if up and running fine after conversion) Make sure there is a backup of your database prior to running the upgraded slurmdbd. When running the slurmdbd to convert the database, run it in the foreground of a terminal instead of as a service. For example: $ sudo -u <slurm user> slurmdbd -D This prevents any interference from systemd during the conversion. Once it's done and responding to client commands (sacct or sacctmgr), feel free to stop it (SIGTERM/Ctrl-c) and restart it as a systemd service. This process has been heavily tested, so if it completes without error, you can be reasonably sure that it is correct. > 3. Provision a new slurmctld00(primary) and slurmctld01(backup) with 19.05 > with the new statesave location > > 4. Stop the slurmctld primary and backup > > 5. Copy over all the statesave file from old location to the new location > > 6. Start the slurmctld primary and backup > > 7. Validate if the communication between slurmdbd and slrumctld are working > as expected, once it is confirmed. 
(please advise if there is any specific > test i should perform to make sure the communication between slurmdbd and > slurmctld are fine with the 19.05 version ) If the slurmctld cannot talk to the slurmdbd, its log will have errors like: error: slurmdbd: Sending PersistInit msg: Connection refused Also, the ControlHost for the cluster in `sacctmgr show clusters` will not show an ip address. > > 8. Proceed with upgrading all slurmd's Online rolling upgrades are possible, but it seems you don't want to take any chances with that. That is fine. As an FYI, this is possible by increasing the MaxJobCount to make sure that the slurmdbd agent queue on the slurmctld is large enough to hold all accounting updates while the slurmdbd is offline.
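The two checks above can be sketched as a quick validation pass; the log path is an assumption, so adjust it to your setup:

```shell
# 1. Look for slurmdbd connection errors in the slurmctld log
grep "Sending PersistInit" /var/log/slurmctld.log

# 2. Confirm the cluster's ControlHost/ControlPort are registered with the slurmdbd
sacctmgr show clusters format=cluster,controlhost,controlport
# A healthy slurmctld<->slurmdbd link shows an IP address in the ControlHost column.
```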
(In reply to Broderick Gardner from comment #63)
> (In reply to Sathishkumar from comment #60)
> > Also, we have a scheduled maintenance coming weekend to upgrade the slurm
> > from 17.11 to 19.05. We are also planning to change the statesave location
> > to different path. Is there any sequence I should follow ?
> >
> > I am planning to follow the below sequence (Note: all slurm services are
> > running on a container) Please advise if there if I am wrong.
> Here are notes on your procedure:
> > 1. Make sure there is no active jobs on the cluster.
> One way to accomplish this is to set a maintenance reservation for the
> period of the upgrade so that no jobs are scheduled that could interfere.
> >
> > 2. Upgrade the slurmdbd first, wait for 6+ hours to complete the conversion,
> > once the conversion is done, make sure the dbd is coming up and working as
> > expected by querying sacct and sacctmgr, also checking service status.
> > (please advise if there is any specific test i should perform to make sure
> > the database if up and running fine after conversion)
> Make sure there is a backup of your database prior to running the upgraded
> slurmdbd.
> When running the slurmdbd to convert the database, run it in the foreground
> of a terminal instead of as a service. For example:
> $ sudo -u <slurm user> slurmdbd -D
> This prevents any interference from systemd during the conversion. Once it's
> done and responding to client commands (sacct or sacctmgr), feel free to
> stop it (SIGTERM/Ctrl-c) and restart it as a systemd service.
> This process has been heavily tested, so if it completes without error, you
> can be reasonably sure that it is correct.

Thanks for your input. So basically I should run `slurmdbd -Dvv` as root, and on another console do the upgrade by running "yum update slurm", which will update all the installed Slurm packages and start the conversion. Is that right?

> > 3. Provision a new slurmctld00(primary) and slurmctld01(backup) with 19.05
> > with the new statesave location
> >
> > 4. Stop the slurmctld primary and backup
> >
> > 5. Copy over all the statesave file from old location to the new location

Any input on copying the data to the statesave location?

> > 6. Start the slurmctld primary and backup
> >
> > 7. Validate if the communication between slurmdbd and slurmctld are working
> > as expected, once it is confirmed. (please advise if there is any specific
> > test i should perform to make sure the communication between slurmdbd and
> > slurmctld are fine with the 19.05 version )
> If the slurmctld cannot talk to the slurmdbd, its log will have errors like:
> error: slurmdbd: Sending PersistInit msg: Connection refused
> Also, the ControlHost for the cluster in `sacctmgr show clusters` will not
> show an ip address.

Sure, I will use this to validate that the communication between slurmctld and slurmdbd is fine after the upgrade and DB conversion.

> > 8. Proceed with upgrading all slurmd's
> Online rolling upgrades are possible, but it seems you don't want to take
> any chances with that. That is fine. As an FYI, this is possible by
> increasing the MaxJobCount to make sure that the slurmdbd agent queue on the
> slurmctld is large enough to hold all accounting updates while the slurmdbd
> is offline.

Yes, thanks for the note. Since there is a change in the statesave location, we don't want to do rolling updates. Also, we have a few VMs on which slurmd will still be the current version, 17.11. Do you see any issues with slurmd 17.11 communicating with 19.05?

Regards
Sathish
(In reply to Sathishkumar from comment #64) > Thanks for your input, So basically I should run slurmdbd -Dvv as root and > on the other console do the upgrade by running "yum update slurm" this will > basically start updating all the installed slurm packages and start the > conversion. Is this right ? Well, you don't need 2 consoles. For example: $ sudo systemctl stop slurmdbd $ sudo yum update slurm ... $ sudo -u slurm slurmdbd -Dvv ... This simply runs the slurmdbd completely outside systemd. Do you typically run the slurmdbd as root? It should be run as whatever the SlurmUser is in slurmdbd.conf, which could be root. I am not sure what user the systemd service file runs slurmdbd as. From man slurmdbd.conf: SlurmUser The name of the user that the slurmdbd daemon executes as. This user must exist on the machine ex‐ ecuting the Slurm Database Daemon and have the same UID as the hosts on which slurmctld execute. For security purposes, a user other than "root" is recommended. The default value is "root". This name should also be the same SlurmUser on all clusters reporting to the SlurmDBD. NOTE: If this user is different from the one set for slurmctld and is not root, it must be added to accounting with AdminLevel=Admin and slurmctld must be restarted. > Any input coping the data to statesave location ? That is correct, just copy the statesave location while both slurmctld's are not running. > > Also, we have few VMs and on that slurmd will still be the current version > which is 17.11, do u see any issues with slurmd-17.11 communicating with > 19.05 ? That should be fine. This is not quite as frequently tested, but we do support running slurmd up to 2 versions lower than the slurmctld. On that note, the slurmdbd supports talking to a slurmctld up to 2 versions lower than itself as well.
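Since the SlurmUser here is slurm rather than root, the NOTE in the man page excerpt above may apply; a hedged sketch of that accounting step (verify the condition actually applies to your site before running it):

```shell
# Grant the slurm user AdminLevel=Admin in accounting; per the man page this is
# required only when the slurmdbd SlurmUser differs from the slurmctld user and
# is not root
sacctmgr modify user where user=slurm set adminlevel=Admin
# ...then restart slurmctld
```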
(In reply to Broderick Gardner from comment #65)
> (In reply to Sathishkumar from comment #64)
> > Thanks for your input, So basically I should run slurmdbd -Dvv as root and
> > on the other console do the upgrade by running "yum update slurm" this will
> > basically start updating all the installed slurm packages and start the
> > conversion. Is this right ?
> Well, you don't need 2 consoles. For example:
> $ sudo systemctl stop slurmdbd
> $ sudo yum update slurm

Thanks, the above step makes sense.

> $ sudo -u slurm slurmdbd -Dvv

Sure, I will use this to run the dbd in the foreground.

> This simply runs the slurmdbd completely outside systemd. Do you typically
> run the slurmdbd as root? It should be run as whatever the SlurmUser is in
> slurmdbd.conf, which could be root. I am not sure what user the systemd
> service file runs slurmdbd as.

Yes, we run slurmdbd as root, but we have set slurm as the SlurmUser; going forward we will use the slurm user for all Slurm management.

> From man slurmdbd.conf:
> SlurmUser
>     The name of the user that the slurmdbd daemon executes as. This user
>     must exist on the machine executing the Slurm Database Daemon and have
>     the same UID as the hosts on which slurmctld execute. For security
>     purposes, a user other than "root" is recommended. The default value
>     is "root". This name should also be the same SlurmUser on all clusters
>     reporting to the SlurmDBD. NOTE: If this user is different from the one
>     set for slurmctld and is not root, it must be added to accounting with
>     AdminLevel=Admin and slurmctld must be restarted.
>
> > Any input coping the data to statesave location ?
> That is correct, just copy the statesave location while both slurmctld's are
> not running.
>
> > Also, we have few VMs and on that slurmd will still be the current version
> > which is 17.11, do u see any issues with slurmd-17.11 communicating with
> > 19.05 ?
> That should be fine. This is not quite as frequently tested, but we do
> support running slurmd up to 2 versions lower than the slurmctld. On that
> note, the slurmdbd supports talking to a slurmctld up to 2 versions lower
> than itself as well.

OK. I ask because slurm.conf has some changes specific to 19.05, like SlurmctldHost, which cannot be used with 17.11; in that case I believe slurmd will complain "node appears to have a different slurm.conf than the slurmctld". Of course this is expected and can be ignored, but are there any other errors we should expect to ignore while running slurmd 17.11 against slurmctld 19.05?

Regards
Sathish
Hi Broderick,

I see that there is a change in pam_slurm_adopt behavior. With 17.11, adding the line below to /etc/pam.d/sshd prevented users from sshing into nodes on which they do not have a running job. But now, even when we remove the line below from /etc/pam.d/sshd, users are still prevented from sshing into nodes where they have no running job. Is that right?

account required pam_slurm_adopt.so

I see the change below in the 19.05 release notes:

NOTE: Limit pam_slurm_adopt to run only in the sshd context by default, for security reasons. A new module option 'service=<name>' can be used to allow a different PAM applications to work. The option 'service=*' can be used to restore the old behavior of always performing the adopt logic regardless of the PAM application context.

We want to restore the old behavior: if we enable pam_slurm_adopt.so in pam.d/sshd, it should allow sshing only if there is a running job; but if we remove pam_slurm_adopt.so, it should allow sshing regardless of whether jobs are running. Kindly advise.

Thanks
Sathish
(In reply to Sathishkumar from comment #67)
This change does not affect the behavior you are describing; it is a security hardening change only. If pam_slurm_adopt is removed from the PAM configuration, then it will not restrict ssh.

If you observe a problem with this, please file a separate ticket to investigate.
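For reference, the two setups being discussed would look roughly like this in /etc/pam.d/sshd (a sketch only; the surrounding distribution-specific PAM lines are omitted):

```
# /etc/pam.d/sshd (sketch; other stack lines omitted)

# Restrict ssh to nodes where the user has a running job:
account    required     pam_slurm_adopt.so

# Since 19.05 the module only runs in the sshd PAM context by default;
# per the release notes, 'service=*' restores the old always-run behavior:
#   account    required     pam_slurm_adopt.so service=*

# To allow ssh regardless of running jobs, omit the pam_slurm_adopt.so
# line entirely.
```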
(In reply to Broderick Gardner from comment #68)
> If you observe a problem with this, please file a separate ticket to
> investigate.

Sure, I will validate the behavior once again and create a ticket if it is still a problem. I will update you on the status of the upgrade once we are done; we can probably close this ticket by next week.

Thanks
Sathish
Created attachment 14821 [details] slurmdbd.log and sdiag after 19.05 upgrade
Hi Broderick,

We updated to 19.05.7 over the weekend and overall the upgrade went well. However, we see the errors below in the logs, and the archival/purging has not been triggered yet. The logs are attached for your review.

error: We have more time than is possible (20678400+1555200+0)(22233600) > 20678400 for cluster scp(5744) from 2019-12-18T18:00:00 - 2019-12-18T19:00:00 tres 1
error: We have more allocated time than is possible (23624074 > 20678400) for cluster scp(5744) from 2019-12-18T19:00:00 - 2019-12-18T20:00:00 tres 1
error: We have more time than is possible (20678400+1555200+0)(22233600) > 20678400 for cluster scp(5744) from 2019-12-18T19:00:00 - 2019-12-18T20:00:00 tres 1
error: We have more allocated time than is possible (23664697 > 20678400) for cluster scp(5744) from 2019-12-18T20:00:00 - 2019-12-18T21:00:00 tres 1
error: We have more time than is possible (20678400+1555200+0)(22233600) > 20678400 for cluster scp(5744) from 2019-12-18T20:00:00 - 2019-12-18T21:00:00 tres 1
error: We have more allocated time than is possible (23743160 > 20678400) for cluster scp(5744) from 2019-12-18T21:00:00 - 2019-12-18T22:00:00 tres 1
error: We have more time than is possible (20678400+1555200+0)(22233600) > 20678400 for cluster scp(5744) from 2019-12-18T21:00:00 - 2019-12-18T22:00:00 tres 1
error: We have more allocated time than is possible (23664440 > 20678400) for cluster scp(5744) from 2019-12-18T22:00:00 - 2019-12-18T23:00:00 tres 1
error: We have more time than is possible (20678400+1555200+0)(22233600) > 20678400 for cluster scp(5744) from 2019-12-18T22:00:00 - 2019-12-18T23:00:00 tres 1
error: We have more allocated time than is possible (23535521 > 20678400) for cluster scp(5744) from 2019-12-18T23:00:00 - 2019-12-19T00:00:00 tres 1
error: We have more time than is possible (20678400+1555200+0)(22233600) > 20678400 for cluster scp(5744) from 2019-12-18T23:00:00 - 2019-12-19T00:00:00 tres 1
error: We have more allocated time than is possible (23665706 > 20678400) for cluster scp(5744) from 2019-12-19T00:00:00 - 2019-12-19T01:00:00 tres 1
error: We have more time than is possible (20678400+1555200+0)(22233600) > 20678400 for cluster scp(5744) from 2019-12-19T00:00:00 - 2019-12-19T01:00:00 tres 1

Please do let me know if there is anything needed on my end.

Regards
Sathish
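For what it's worth, the numbers in these messages decode consistently if we assume the rollup error format is `(allocated+down+planned_down)(sum) > possible`, where `possible` is the cluster CPU count (the `(5744)` after the cluster name) times the seconds in the rollup period. A quick sanity check (a sketch; the bucket names are assumptions, not Slurm's exact internals):

```python
# Sanity-check the rollup error arithmetic from the log, assuming the
# message format is: (alloc+down+planned_down)(sum) > possible,
# where possible = cluster_cpus * seconds_in_period.

SECONDS_PER_HOUR = 3600

def possible_cpu_seconds(cluster_cpus, period_seconds=SECONDS_PER_HOUR):
    """Maximum cpu-seconds the cluster can account for in one rollup period."""
    return cluster_cpus * period_seconds

# "cluster scp(5744)" -> 5744 CPUs; the log's limit is 20678400
assert possible_cpu_seconds(5744) == 20678400

# alloc + down + planned_down from one log line
alloc, down, planned = 20678400, 1555200, 0
total = alloc + down + planned
assert total == 22233600                     # matches "(22233600)" in the log
assert total > possible_cpu_seconds(5744)    # hence the error

# a down figure of 1555200 cpu-seconds would mean 432 CPUs down for the hour
assert down // SECONDS_PER_HOUR == 432
print("arithmetic consistent")
```

Under this reading, the errors say the accounted cpu-seconds in those hours exceed what the cluster could physically have provided.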
Hi Broderick,

The archive/purge took some time to initiate; it ran last night and everything was archived/purged per the defined timelines. But from time to time we are seeing errors like the following in the slurmdbd logs:

[2020-07-01T15:00:13.659] error: We have more time than is possible (13372857+259200+30700800)(44332857) > 30700800 for cluster scp(8528) from 2020-06-29T07:00:00 - 2020-06-29T08:00:00 tres 1
[2020-07-01T15:00:13.963] error: We have more time than is possible (14480721+259200+30700800)(45440721) > 30700800 for cluster scp(8528) from 2020-06-29T08:00:00 - 2020-06-29T09:00:00 tres 1
[2020-07-01T15:00:14.289] error: We have more time than is possible (14602653+259200+30700800)(45562653) > 30700800 for cluster scp(8528) from 2020-06-29T09:00:00 - 2020-06-29T10:00:00 tres 1
[2020-07-01T15:00:14.607] error: We have more time than is possible (14549595+259200+30700800)(45509595) > 30700800 for cluster scp(8528) from 2020-06-29T10:00:00 - 2020-06-29T11:00:00 tres 1
[2020-07-01T15:00:14.944] error: We have more time than is possible (14695663+259200+30700800)(45655663) > 30700800 for cluster scp(8528) from 2020-06-29T11:00:00 - 2020-06-29T12:00:00 tres 1
[2020-07-01T15:00:15.447] error: We have more time than is possible (17721970+259200+30700800)(48681970) > 30700800 for cluster scp(8528) from 2020-06-29T12:00:00 - 2020-06-29T13:00:00 tres 1

Any thoughts?

Regards
Sathish
Do you have any reservations? Have you had any reservations during the last month?

scontrol show reservations
sacctmgr show reservations

There are a few possible sources for this error, but the most likely is related to Bug 6839. When there are down nodes in a reservation, time is double-counted. This is not limited to 19.05, and the fix will only be in 20.11, released later this year. The reason the fix will not be in a released version of Slurm for several months is that it involves tricky logic that requires significant testing.

The error only occurs when a time range including down nodes in a reservation is "rolled up" into the hourly rollup table _and_ the total allocated time goes over the total possible (meaning that even though time is double-counted, the error will not show until enough nodes are allocated to push the time over the total possible time).

Please also run `sacctmgr show runawayjobs`

Thanks
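The double-counting described above can be sketched with a toy model (made-up numbers, not Slurm's actual rollup code): CPUs that are DOWN while sitting inside a reservation contribute cpu-seconds to both the down bucket and the reserved (planned-down) bucket, so a busy enough hour pushes the sum past the cluster's possible cpu-seconds.

```python
# Toy model of the Bug 6839 double-counting: one hour of rollup on a
# hypothetical 100-CPU cluster with a 40-CPU reservation, 10 of whose
# CPUs are DOWN, and 60 CPUs fully allocated outside the reservation.

HOUR = 3600
cluster_cpus = 100
possible = cluster_cpus * HOUR                 # 360000 cpu-seconds this hour

reserved_cpus = 40
down_in_resv = 10          # DOWN CPUs inside the reservation
alloc_cpus = 60            # allocated CPUs outside the reservation

# Correct accounting: each CPU lands in exactly one bucket.
correct = (alloc_cpus + down_in_resv + (reserved_cpus - down_in_resv)) * HOUR
assert correct == possible                     # can never exceed the limit

# Buggy accounting: down time inside the reservation is counted twice,
# once as down time and once as reserved (planned-down) time.
buggy = (alloc_cpus + down_in_resv + reserved_cpus) * HOUR
assert buggy > possible                        # -> "more time than is possible"
print(f"correct={correct}  buggy={buggy}  possible={possible}")
```

With fewer allocated CPUs the buggy sum would stay under `possible` and no error would be logged, which matches the observation that the error only appears once allocation is high enough.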
(In reply to Broderick Gardner from comment #73)
> There are a few possible sources for this error, but the most likely is
> related to Bug 6839. When there are down nodes in a reservation, time is
> double-counted.
> [...]
> Please also run `sacctmgr show runawayjobs`

We do have reservations and have been using them for a while. The reported job ID 30700800 has not been submitted to any reservation; see the sacct output for job 30700800 along with the other requested outputs below.
$ sacct -j 30700800 --format=Node,submit,Reservation
       NodeList              Submit          Reservation
--------------- ------------------- --------------------
    seskscpn079 2020-06-02T09:39:37
    seskscpn079 2020-06-02T09:39:45
    seskscpn079 2020-06-02T09:39:45

$ sacctmgr show runawayjobs
Runaway Jobs: No runaway jobs found on cluster scp

$ sacctmgr show reservations
   Cluster            Name       TRES           TimeStart             TimeEnd UnusedWall
---------- --------------- ---------- ------------------- ------------------- ----------
       scp             NFS     cpu=36 2019-09-03T11:37:18 2020-07-24T11:30:08 2.8747e+07
       scp          PS-lab     cpu=24 2020-02-29T10:29:02 2021-01-29T12:45:54 1.0710e+07
       scp          CC-FEP    cpu=160 2020-05-21T07:48:14 2021-04-01T10:25:40 3.6158e+06
       scp       easybuild     cpu=64 2020-05-25T10:36:21 2022-05-14T04:22:11 5.5861e+06
       scp            clab     cpu=32 2020-06-08T07:17:52 2021-06-08T07:17:52 4.5545e+06
       scp   reinvent-prod     cpu=40 2020-06-26T11:14:39 2021-03-31T16:00:16 9.2859e+05
       scp        maint052     cpu=40 2020-06-26T11:14:52 2021-06-26T11:14:52 3.0879e+06
       scp    changewindow   cpu=8528 2020-06-27T15:44:45 2021-06-27T15:44:45 2.9853e+06
       scp    icinga-check      cpu=1 2020-06-28T19:14:50 2021-03-31T15:16:33 2.8832e+06
       scp          cryoem     cpu=96 2020-07-01T07:00:00 2020-07-01T19:00:00 3.2056e+04
       scp          cryoem     cpu=96 2020-07-02T07:00:00 2020-07-02T19:00:00 1.6995e+04

$ scontrol show res
ReservationName=PS-lab StartTime=2020-01-30T12:45:54 EndTime=2021-01-29T12:45:54 Duration=365-00:00:00
   Nodes=seskscpn056 NodeCnt=1 CoreCnt=24 Features=broadwell&standard PartitionName=core Flags=
   NodeName=seskscpn056 CoreIDs=1-24
   TRES=cpu=24
   Users=(null) Accounts=cp Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

ReservationName=icinga-check StartTime=2020-03-31T15:16:33 EndTime=2021-03-31T15:16:33 Duration=365-00:00:00
   Nodes=seskscpn056 NodeCnt=1 CoreCnt=1 Features=broadwell&standard PartitionName=core Flags=IGNORE_JOBS
   NodeName=seskscpn056 CoreIDs=0
   TRES=cpu=1
   Users=scp,icinga-user Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

ReservationName=reinvent-prod StartTime=2020-03-31T16:00:16 EndTime=2021-03-31T16:00:16 Duration=365-00:00:00
   Nodes=seskscpg[052-053] NodeCnt=2 CoreCnt=80 Features=cascadelake&dss PartitionName=gpu Flags=SPEC_NODES
   TRES=cpu=80
   Users=(null) Accounts=cc,mai Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

ReservationName=CC-FEP StartTime=2020-04-01T10:25:40 EndTime=2021-04-01T10:25:40 Duration=365-00:00:00
   Nodes=seskscpg[004-005,011,050-051] NodeCnt=5 CoreCnt=160 Features=(null) PartitionName=gpu Flags=SPEC_NODES
   TRES=cpu=160
   Users=(null) Accounts=admins,cc,sc Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

ReservationName=cryoem StartTime=2020-07-02T07:00:00 EndTime=2020-07-02T19:00:00 Duration=12:00:00
   Nodes=seskscpg[033-034,045-046] NodeCnt=4 CoreCnt=96 Features=cascadelake&volta&highmem PartitionName=gpu Flags=FLEX,DAILY,NO_HOLD_JOBS_AFTER_END
   TRES=cpu=96
   Users=(null) Accounts=sc,admins Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

ReservationName=easybuild StartTime=2020-05-14T04:22:11 EndTime=2022-05-14T04:22:11 Duration=730-00:00:00
   Nodes=seskscpg[001,026-027] NodeCnt=3 CoreCnt=64 Features=(null) PartitionName=gpu Flags=FLEX,IGNORE_JOBS,SPEC_NODES,NO_HOLD_JOBS_AFTER_END
   TRES=cpu=64
   Users=(null) Accounts=ops,admins Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

ReservationName=clab StartTime=2020-06-08T07:17:52 EndTime=2021-06-08T07:17:52 Duration=365-00:00:00
   Nodes=seskscpn042 NodeCnt=1 CoreCnt=32 Features=broadwell,standard PartitionName=core Flags=IGNORE_JOBS,SPEC_NODES
   TRES=cpu=32
   Users=clab Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a
From what I can tell, everything is working as expected right now. I will close this ticket.

Thanks