Hello Team,

On May 14th and 15th there were a lot of messages like the ones below, and while they were appearing, jobs were not committing to slurmdbd and we were unable to query anything from it.

error: There is no reservation by id 562, time_start 1588618629, and cluster 'scp'
error: There is no reservation by id 602, time_start 1589522400, and cluster 'scp'

There were also a lot of runaway jobs. We want to understand why this happened and how to prevent it in the future. The slurmdbd log is attached for your review; let me know if you need any other information to investigate further.

Thanks
Sathish
Created attachment 14286 [details] slurmctld logs
Created attachment 14287 [details] slurmdbd log
It is critical that you apply this fix as soon as possible, or you will lose data.

I need the output of these commands:

$ sacctmgr show reservations format=cluster,name,id,start,end,tres
$ scontrol show reservations
$ sdiag

Note the output of sdiag; you might want to start a watch on it (`watch sdiag`). The "DBD agent queue size" field shows how many RPC messages for the slurmdbd are backed up on the slurmctld. The maximum size of the queue is MaxDBDMsgs, which by default is max(10000, 2*MaxJobCount + 4*NodeCount). If you attach your slurm.conf, I can tell how close you are to that limit.

The queue is not emptying because there is an RPC that the slurmdbd won't accept. As the error message indicates, the bad RPC references a reservation that doesn't exist. Reservations are uniquely identified by resv_id and time_start, and the problem here will be that the time_start in the RPC message is incorrect. To get the queue to clear, we have to create a reservation manually with the matching id and time_start.

The fix: watch sdiag and the slurmdbd.log. Run the following mysql query for each "no reservation by id ..." error until there are no more and the DBD agent queue size in sdiag decreases, eventually reaching 0-100.

mysql> insert into <cluster>_resv_table (id_resv, deleted, time_start, resv_name) values (<id>, 0, <time_start from error>, 'xxbadresv_<resv_id>xx');

e.g.

mysql> insert into <cluster>_resv_table (id_resv, deleted, time_start, resv_name) values (602, 0, 1589522400, 'xxbadresv602xx');

<cluster> is your cluster name, which looks like "scp" for you. These rows can be removed once everything is running smoothly again.

We have seen this a few times before, rarely, but we have had trouble getting any information on how the bad RPC ended up in the queue. I'll follow up with anything I need to investigate the underlying bug. Thanks for already sending your logs.

Note: the runaway jobs appear because the data for those jobs has not reached the database yet. Check again once the queue has emptied.
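If there are many of these errors, writing the INSERT statements by hand gets tedious. A rough helper sketch for generating them from the log; the function name, the "scp" default, and reading the log on stdin are all assumptions for this site, and the generated SQL should be reviewed before running it against the accounting database:

```shell
# gen_resv_inserts: read slurmdbd.log on stdin and print one repair INSERT
# per unique "no reservation by id" error. CLUSTER defaults to scp
# (an assumption; override it for another cluster name).
gen_resv_inserts() {
  grep -o 'no reservation by id [0-9]*, time_start [0-9]*' \
    | sort -u \
    | awk -v c="${CLUSTER:-scp}" '{
        gsub(",", "", $5)   # $5 = reservation id, $7 = time_start
        printf "insert into %s_resv_table (id_resv, deleted, time_start, resv_name) values (%s, 0, %s, '\''xxbadresv%sxx'\'');\n", c, $5, $7, $5
      }'
}
```

Usage would be something like `gen_resv_inserts < /var/log/slurm/slurmdbd.log`, then paste the reviewed statements into the mysql prompt.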
Thanks
Hi Broderick,

We don't see this issue after May 15th; sdiag looks normal and there are no such errors in slurmdbd.log. I have nevertheless attached all the requested command output for your review. Please review and let us know whether we still need to apply the suggested fix. We would also like to know the root cause and how to prevent this in the future.

Also, I see this warning message in slurmdbd.log; what does it mean?

[2020-05-20T16:15:12.435] Warning: Note very large processing time from hourly_rollup for scp: usec=6470762 began=16:15:05.965
[2020-05-20T16:15:22.787] Warning: Note very large processing time from daily_rollup for scp: usec=10351127 began=16:15:12.436

Thanks
Sathish
Created attachment 14321 [details] slurm.conf
Created attachment 14322 [details] scontrol show reservations output
Created attachment 14323 [details] sacctmgr show reservations output
Created attachment 14324 [details] sdiag output
Sorry for the delay. As you noticed, your cluster does not currently have the problem I expected and wrote instructions for. This is concerning, because I don't know of a way the slurmctld can resolve the problem itself, yet it looks like your cluster did have the problem on May 14th and 15th. I am looking into what could have happened. Do you have any more information on any reboots or resets during this time? Are there any problems currently with your cluster? Are there any runaway jobs?

Side note: you sent the same file twice instead of `scontrol show reservations`.

(In reply to Sathishkumar from comment #4)
> Also, I see this warning message on the slurmdbd.log, what does this mean ?
>
> [2020-05-20T16:15:12.435] Warning: Note very large processing time from
> hourly_rollup for scp: usec=6470762 began=16:15:05.965
> [2020-05-20T16:15:22.787] Warning: Note very large processing time from
> daily_rollup for scp: usec=10351127 began=16:15:12.436

This is not a serious problem on its own, but it could indicate that your database is somewhat overloaded during particularly heavy rollup times. In case you don't know, rollup refers to the aggregation of cluster usage for each hour, day, and month. This is used when creating reports via sreport.

Thanks
Yes, at the moment we have no issues, but on the 13th we had a load issue on the control plane node where slurmctld00 (primary) runs. The load went high, the primary slurmctld lost communication, and the backup took over. I believe the backup takeover did not complete for some reason; it got stuck partway, could not process the StateSaveLocation, and we ended up with a split-brain situation. All of this started after that load spike on the control plane, and after the 15th we haven't had any issues.

But now when I try to check for runaway jobs, I get the error below; it was working fine until yesterday.

[kfdv397@seskscpn084 ~]$ sacctmgr show runaway
sacctmgr: error: Slurmctld running on cluster scp is not up, can't check running jobs

All the other commands below work fine:

[kfdv397@seskscpn084 ~]$ scontrol ping
Slurmctld(primary/backup) at slurmctld00/slurmctld01 are UP/UP

[kfdv397@seskscpn084 ~]$ sacctmgr show cluster
   Cluster     ControlHost  ControlPort   RPC     Share GrpJobs       GrpTRES GrpSubmit MaxJobs       MaxTRES MaxSubmit     MaxWall                  QOS   Def QOS
---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- -------------------- ---------
       scp                            0  8192         1                                                                                normal

I get the same "Slurmctld running on cluster scp is not up" error when I run sacctmgr show runaway on the slurmctld and slurmdbd hosts as well. Any idea why I am not able to check runaway jobs?

Attached is the latest sdiag output, and also the scontrol show reservation output taken on the 20th. Please review and let me know if you need any other details from my end to proceed.

Thanks
Sathish
Created attachment 14418 [details] sdiag taken on 28th
Created attachment 14419 [details] scontrol_show_reservations_output Have attached the correct scontrol show reservations output.
Created attachment 14420 [details] slurmdbd.log Latest slurmdbd.log attached for your review.
I see messages like the one below in the log:

DBD_CLUSTER_TRES: cluster not registered

I believe a restart of the slurmdbd service will fix this, but before doing that I wanted you to review it and make sure there is nothing wrong with slurmdbd. Is this happening because slurmctld was not communicating with slurmdbd?
Created attachment 14421 [details] slurmctld log slurmctld.log from the primary controller which is slurmctld00
Created attachment 14423 [details] slurmctld logs slurmctld.log from the backup controller which is slurmctld01
(In reply to Sathishkumar from comment #14)
> I see that there are messages like below on the log,
>
> DBD_CLUSTER_TRES: cluster not registered,
>
> I believe if we restart the slurmdbd service this will get fixed, before
> doing that, i wanted you to review this and make sure there is nothing
> wrong with slurmdbd.
>
> Is this happening because slurmctld was not communicating with slurmdbd ?

This should be fixed by a restart of the slurmdbd and/or the slurmctld, yes.

Do you have any plans to upgrade to a supported version soon? There have been changes and fixes to cluster registration in more recent versions of Slurm, as well as fixes to backup controller takeover.
Hi Broderick,

Yes, we are working on upgrading to version 19.05.7. Before we upgrade, we are trying to slim down the database by enabling archiving and purging. We have a test setup with a replicated production DB to test the archive and purge process, using the parameters below in slurmdbd.conf. But for some reason the archive and purge are not happening on the defined schedule, and there are no errors in the logs.

#Purging
PurgeEventAfter=18852hours
PurgeJobAfter=18852hours
PurgeResvAfter=18852hours
PurgeStepAfter=18852hours
PurgeSuspendAfter=18852hours
PurgeTXNAfter=18852hours
PurgeUsageAfter=18852hours

#Archival
ArchiveDir=/opt/archive
ArchiveEvents=yes
ArchiveJobs=yes
ArchiveResvs=yes

Also, if I run the archive process manually, I get the error below:

[root@slurmdbd1905 archive]# sacctmgr archive dump
sacctmgr: error: slurmdbd: Getting response to message type 1459
sacctmgr: error: slurmdbd: DBD_ARCHIVE_DUMP failure: No error
Problem dumping archive: Unspecified error

We have about 30,900,000+ jobs in the DB. Could you please help us proceed further with this?

Thanks
Have you upgraded anything yet, or is it all still on 17.11? There have been several major fixes to archive/purge in 18.08 and 19.05.
It was on 17.11, but I saw the 18.08 release notes and found several improvements related to archive and purge, so we are now using 18.08 on the test environment. The status is still the same, though; any thoughts?
I also updated to the latest version, but there is still no improvement; the archive and purge do not trigger on the defined schedule.

[root@slurmdbd1905 ~]# slurmdbd -V
slurm 19.05.7

Also, the manual archive still fails with the same error message:

[root@slurmdbd1905 ~]# sacctmgr archive dump
This may result in loss of accounting database records (if Purge* options enabled).
Are you sure you want to continue? (You have 30 seconds to decide)
(N/y): y
sacctmgr: error: slurmdbd: Getting response to message type: DBD_ARCHIVE_DUMP
sacctmgr: error: slurmdbd: DBD_ARCHIVE_DUMP failure: No error
Problem dumping archive: Unspecified error
Hi Broderick,

This is quite critical for us; it would be really helpful if you could look into this as soon as possible.

Thanks
Sathish
I understand, I am working on it. You said previously that there are no errors; the scheduled purge just doesn't happen. Please attach the slurmdbd log from the 19.05 test slurmdbd.

What is the slurmdbd debug level? Please also include the slurmdbd.conf (redact the storage password). I am assuming that we are just working on the test slurmdbd for now, correct?

What is the process you followed to upgrade the slurmdbd on your test cluster? Do you use RPMs? Did you upgrade all the slurm binaries at once, or just the slurmdbd?

What is the status of your production cluster? Did you restart any daemons? Have you been able to check for runaway jobs again? I just want to make sure it is in a functional state at least while we work on the test cluster.

Thanks
Oh I see from the slurmdbd.log I already have that you are probably at debug4. You don't need to be that high; debug2 is plenty. I do still need the slurmdbd.log from the 19.05 test cluster.
Created attachment 14529 [details] slurmdbd.conf from the test system running 19.05
Created attachment 14530 [details] slurmdbd.log from the test system running 19.05
(In reply to Broderick Gardner from comment #23)
> I understand, I am working on it. You said previously that there are no
> errors; the scheduled purge just doesn't happen. Please attach the slurmdbd
> log from the 19.05 test slurmdbd.

Please find the attached slurmdbd.log from the test slurmdbd.

> What is the slurmdbd debug level? Please also include the slurmdbd.conf
> (redact the storage password). I am assuming that we are just working on the
> test slurmdbd for now, correct?

Please find the attached slurmdbd.conf from the test slurmdbd; debug4 is set on the test system as well.

> What is the process you followed to upgrade the slurmdbd on your test
> cluster? Do you use RPMs? Did you upgrade all the slurm binaries at once, or
> just the slurmdbd?

We built the RPMs and then ran:

yum update slurm-slurmdbd slurm-contribs slurm-libpmi slurm-slurmd slurm slurm-pam_slurm slurm-perlapi

First the update was done from the current version (17.11.13) to 18.08.9. The DB conversion took almost 6 hours, and afterwards I checked archiving and purging and noticed it was not happening; a manually triggered archive also failed. Then I built the packages for version 19.05.7 and ran the same yum update command. That DB conversion took 20 minutes, and the archive and purge still do not trigger on the defined schedule. Running the archive manually just gave the error below:

[root@slurmdbd1905 ~]# sacctmgr archive dump
This may result in loss of accounting database records (if Purge* options enabled).
Are you sure you want to continue? (You have 30 seconds to decide)
(N/y): y
sacctmgr: error: slurmdbd: Getting response to message type: DBD_ARCHIVE_DUMP
sacctmgr: error: slurmdbd: DBD_ARCHIVE_DUMP failure: No error
Problem dumping archive: Unspecified error

> What is the status of your production cluster? Did you restart any daemons?
> Have you been able to check for runaway jobs again?
> I just want to make sure it is in a functional state at least while we work
> on the test cluster.

Yes, I did the restart, and after that the "DBD_CLUSTER_TRES: cluster not registered" error disappeared. There were no runaway jobs, and production is functional without any issues.

> Thanks
(In reply to Broderick Gardner from comment #24)
> Oh I see from the slurmdbd.log I already have that you are probably at
> debug4. You don't need to be that high; debug2 is plenty. I do still need
> the slurmdbd.log from the 19.05 test cluster.

Sure, please review the attached logs and configs, and let me know if anything else is needed from my end.
The logs from each start of your slurmdbd are concerning. What is happening there from your side? Here is the timeline I read from the log:

[2020-06-04T06:58:24.953] slurmdbd version 19.05.7 started
...
[2020-06-04T06:58:24.954] debug4: 0(as_mysql_usage.c:118) query
select hourly_rollup, daily_rollup, monthly_rollup from "scp_last_ran_table"
...
(some connections, but no log messages to indicate the init process finished)
[2020-06-04T07:28:56.554] Terminate signal (SIGINT or SIGTERM) received
[2020-06-04T07:28:56.554] debug: Waiting for rollup thread to finish.
[2020-06-04T07:28:56.555] debug: rpc_mgr shutting down
(log messages to indicate the init finished!)
[2020-06-04T07:30:31.001] slurmdbd version 19.05.7 started
(restarted by systemd?)
...

I see this pattern a few times on both 18.08 and 19.05, with the time between slurmdbd start and the SIGTERM ranging from 30 minutes to 12 hours.

My preliminary guess is an issue with mysql, i.e. the slurmdbd is hanging on a mysql query during init. Please enable the slow query log in mysql and attach it after restarting the slurmdbd. In case that's not enough, you might want to roll back the test cluster database to 17.11 (from backup/production) and start slurmdbd 19.05 again as well.

Thanks
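To survey how often this start/SIGTERM pattern occurs across a long log, the timestamps can be paired up mechanically. A rough sketch; the helper name is made up, and the timestamp format is assumed from the excerpt above:

```shell
# survey_restarts: read a slurmdbd.log on stdin and print each
# "version ... started" timestamp paired with the timestamp of the next
# "Terminate signal" line, one pair per daemon lifetime.
survey_restarts() {
  awk '
    /slurmdbd version .* started/ { start = $1 }
    /Terminate signal/ && start   { print start, "->", $1; start = "" }
  '
}
```

Running `survey_restarts < /var/log/slurm/slurmdbd.log` then makes the 30-minute-to-12-hour spread easy to eyeball.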
Created attachment 14551 [details] mariadb_slow.log
(In reply to Broderick Gardner from comment #29)
> Please enable the slow query log in mysql and attach that after restarting
> the slurmdbd.

We already have the slow query log enabled; it is attached for your review.

Thanks
Can you give me any more information about the slurmdbd startup procedure? Are you manually sending the SIGTERM, or is that systemd?
(In reply to Broderick Gardner from comment #32)
> Can you give me any more information about the slurmdbd startup procedure?
> Are you manually sending the SIGTERM, or is that systemd?

What kind of slurmdbd startup procedure are you looking for? Where do you see the SIGTERM? It should be coming from systemd.
I mean: are you doing anything manually during the slurmdbd startup that would cause the SIGTERM? `systemctl restart slurmdbd`?

As I mentioned, the slurmdbd appears to hang, possibly on a mysql query, before it finishes the init phase. It completes the init phase after receiving a SIGTERM, then restarts. Are you aware of this or anything like it?

I need to know if this is happening consistently. Please redo the conversion from 17.11 to 19.05, watching for the stalled init. The slurmdbd config is printed at debug2 during the init, so if you watch the log (`tail -f /var/log/slurm/slurmdbd.log`), you should see whether it is stalled. If it is, I need to know what query it's stalled on:

mysql> show engine innodb status;
Created attachment 14571 [details] show_engine_innodb_status_output
(In reply to Broderick Gardner from comment #34)
> I need to know if this is happening consistently. Please redo the conversion
> from 17.11 to 19.05, watching for the stalled init. If it is, I need to know
> what query it's stalled on.
> mysql> show engine innodb status;

I started the conversion from 17.11 to 19.05 around 11:35 UTC, and it is now 13:05 UTC. See the log below; it has not been updated for almost 1.5 hours. Is this expected, or is it stalled? Attached is the show engine innodb status output; could you please review it and let me know what is going on here?
[2020-06-08T11:28:19.687] slurmdbd version 17.11.13 started
[2020-06-08T11:28:51.913] Warning: Note very large processing time from hourly_rollup for scp: usec=32224147 began=11:28:19.689
[2020-06-08T11:30:02.731] Warning: Note very large processing time from daily_rollup for scp: usec=70817945 began=11:28:51.913
[2020-06-08T11:35:07.302] Terminate signal (SIGINT or SIGTERM) received
[2020-06-08T11:35:07.352] MySQL server version is: 5.5.60-MariaDB
[2020-06-08T11:35:07.379] pre-converting step table for scp
[2020-06-08T11:35:07.406] adding column tres_usage_in_ave after tres_alloc in table "scp_step_table"
[2020-06-08T11:35:07.406] adding column tres_usage_in_max after tres_usage_in_ave in table "scp_step_table"
[2020-06-08T11:35:07.406] adding column tres_usage_in_max_taskid after tres_usage_in_max in table "scp_step_table"
[2020-06-08T11:35:07.406] adding column tres_usage_in_max_nodeid after tres_usage_in_max_taskid in table "scp_step_table"
[2020-06-08T11:35:07.406] adding column tres_usage_in_min after tres_usage_in_max_nodeid in table "scp_step_table"
[2020-06-08T11:35:07.406] adding column tres_usage_in_min_taskid after tres_usage_in_min in table "scp_step_table"
[2020-06-08T11:35:07.406] adding column tres_usage_in_min_nodeid after tres_usage_in_min_taskid in table "scp_step_table"
[2020-06-08T11:35:07.406] adding column tres_usage_in_tot after tres_usage_in_min_nodeid in table "scp_step_table"
[2020-06-08T11:35:07.406] adding column tres_usage_out_ave after tres_usage_in_tot in table "scp_step_table"
[2020-06-08T11:35:07.406] adding column tres_usage_out_max after tres_usage_out_ave in table "scp_step_table"
[2020-06-08T11:35:07.406] adding column tres_usage_out_max_taskid after tres_usage_out_max in table "scp_step_table"
[2020-06-08T11:35:07.406] adding column tres_usage_out_max_nodeid after tres_usage_out_max_taskid in table "scp_step_table"
[2020-06-08T11:35:07.406] adding column tres_usage_out_min after tres_usage_out_max_nodeid in table "scp_step_table"
[2020-06-08T11:35:07.406] adding column tres_usage_out_min_taskid after tres_usage_out_min in table "scp_step_table"
[2020-06-08T11:35:07.406] adding column tres_usage_out_min_nodeid after tres_usage_out_min_taskid in table "scp_step_table"
[2020-06-08T11:35:07.407] adding column tres_usage_out_tot after tres_usage_out_min_nodeid in table "scp_step_table"
[2020-06-08T11:51:58.747] Warning: Note very large processing time from make table current "scp_step_table": usec=1011341345 began=11:35:07.406

[root@slurmdbd1905 tmp]# date
Mon Jun  8 13:05:24 UTC 2020

Based on the process list, it appears that it is updating the table.

MariaDB [slurm_acct_db]> show processlist;
| Id | User  | Host      | db            | Command | Time | State          | Info                                                                                                 | Progress |
| 26 | slurm | localhost | slurm_acct_db | Query   |    0 | Writing to net | update "scp_step_table" set tres_usage_in_max='2=1290240,6=1,7=178241536', tres_usage_in_max_nodeid= |    0.000 |
| 34 | root  | localhost | slurm_acct_db | Query   |    0 | NULL           | show processlist                                                                                     |    0.000 |
2 rows in set (0.00 sec)

MariaDB [slurm_acct_db]> show processlist;
| Id | User  | Host      | db            | Command | Time | State    | Info                                                                                                 | Progress |
| 26 | slurm | localhost | slurm_acct_db | Query   |    0 | Updating | update "scp_step_table" set tres_usage_in_max='2=2553724928,6=40616,7=9574662144', tres_usage_in_max |    0.000 |
| 34 | root  | localhost | slurm_acct_db | Query   |    0 | NULL     | show processlist                                                                                     |    0.000 |
2 rows in set (0.00 sec)
No, that looks normal, but I need more of the slurmdbd.log.
When you run the upgraded slurmdbd the first time, it should not be under systemd. It should be manual:

slurmdbd -D

This is because systemd will kill the daemon if it thinks it's hung while upgrading the database.
(In reply to Broderick Gardner from comment #38)
> When you run the upgraded slurmdbd the first time, it should not be under
> systemd. It should be manual:
> slurmdbd -D
>
> This is because systemd will kill the daemon if it thinks it's hung while
> upgrading the database.

Your earlier comment suggested watching `tail -f /var/log/slurm/slurmdbd.log` to see if it is stalled, so I assumed the conversion should run under systemd. Please find the attached slurmdbd.log; the conversion is still running. Please review and let me know what is next.

Thanks
Sathish
Created attachment 14580 [details] slurmdbd.log on 08062020 slurmdbd.log taken during the conversion from 17.11 to 19.05
Hi Broderick,

I just checked, and it appears that the conversion completed successfully. Below are the log entries that are missing from my earlier slurmdbd.log attachment.

[2020-06-08T16:55:41.614] dropping column max_disk_write_task from table "scp_step_table"
[2020-06-08T16:55:41.615] dropping column max_disk_write_node from table "scp_step_table"
[2020-06-08T16:55:41.615] dropping column ave_disk_write from table "scp_step_table"
[2020-06-08T17:12:25.219] Warning: Note very large processing time from make table current "scp_step_table": usec=1003607405 began=16:55:41.612
[2020-06-08T17:12:25.317] adding column max_jobs_accrue_pa after max_jobs_per_user in table qos_table
[2020-06-08T17:12:25.317] adding column max_jobs_accrue_pu after max_jobs_accrue_pa in table qos_table
[2020-06-08T17:12:25.317] adding column min_prio_thresh after max_jobs_accrue_pu in table qos_table
[2020-06-08T17:12:25.317] adding column grp_jobs_accrue after grp_jobs in table qos_table
[2020-06-08T17:12:25.317] adding column preempt_exempt_time after preempt_mode in table qos_table
[2020-06-08T17:12:25.379] Conversion done: success!
[2020-06-08T17:12:25.387] Accounting storage MYSQL plugin loaded

Please let me know if you need any other logs for review.

Thanks
Sathish
(In reply to Sathishkumar from comment #39) > your last comment suggested to watch tail -f /var/log/slurm/slurmdbd.log to > see if it is stalled. So I assumed, that the conversion should be under > systemd. Yes, I forgot about the systemd issue with long database upgrade times. The log does not show slurmdbd 19.05 starting, only slurmdbd 17.11. Can you verify the procedure? The fact that it did a database conversion indicates a database version mismatch, but it looks like slurmdbd 19.05 was _not_ running.
(In reply to Broderick Gardner from comment #42)
> (In reply to Sathishkumar from comment #39)
> > your last comment suggested to watch tail -f /var/log/slurm/slurmdbd.log to
> > see if it is stalled. So I assumed, that the conversion should be under
> > systemd.
> Yes, I forgot about the systemd issue with long database upgrade times.
>
> The log does not show slurmdbd 19.05 starting, only slurmdbd 17.11. Can you
> verify the procedure? The fact that it did a database conversion indicates a
> database version mismatch, but it looks like slurmdbd 19.05 was _not_
> running.

Please review the output below and suggest the next steps.

[root@slurmdbd1905 ~]# slurmdbd -V
slurm 19.05.7

[root@slurmdbd1905 ~]# systemctl status slurmdbd
● slurmdbd.service - Slurm DBD accounting daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmdbd.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/slurmdbd.service.d
           └─override.conf
   Active: active (running) since Mon 2020-06-08 11:35:07 UTC; 6h ago
 Main PID: 3258 (slurmdbd)
   CGroup: /system.slice/slurmdbd.service
           └─3258 /usr/sbin/slurmdbd

Jun 08 16:55:41 slurmdbd1905 slurmdbd[3258]: dropping column ave_disk_write from table "scp_step_table"
Jun 08 17:12:25 slurmdbd1905 slurmdbd[3258]: Warning: Note very large processing time from make table current "scp_step_table": usec=1003607405 began=16:55:41.612
Jun 08 17:12:25 slurmdbd1905 slurmdbd[3258]: adding column max_jobs_accrue_pa after max_jobs_per_user in table qos_table
Jun 08 17:12:25 slurmdbd1905 slurmdbd[3258]: adding column max_jobs_accrue_pu after max_jobs_accrue_pa in table qos_table
Jun 08 17:12:25 slurmdbd1905 slurmdbd[3258]: adding column min_prio_thresh after max_jobs_accrue_pu in table qos_table
Jun 08 17:12:25 slurmdbd1905 slurmdbd[3258]: adding column grp_jobs_accrue after grp_jobs in table qos_table
Jun 08 17:12:25 slurmdbd1905 slurmdbd[3258]: adding column preempt_exempt_time after preempt_mode in table qos_table
Jun 08 17:12:25 slurmdbd1905 slurmdbd[3258]: Conversion done: success!
Jun 08 17:12:25 slurmdbd1905 slurmdbd[3258]: Accounting storage MYSQL plugin loaded
Jun 08 17:21:29 slurmdbd1905 slurmdbd[3258]: slurmdbd version 19.05.7 started

[root@slurmdbd1905 ~]# tail -n 10 /var/log/slurm/slurmdbd.log
[2020-06-08T16:55:41.615] dropping column ave_disk_write from table "scp_step_table"
[2020-06-08T17:12:25.219] Warning: Note very large processing time from make table current "scp_step_table": usec=1003607405 began=16:55:41.612
[2020-06-08T17:12:25.317] adding column max_jobs_accrue_pa after max_jobs_per_user in table qos_table
[2020-06-08T17:12:25.317] adding column max_jobs_accrue_pu after max_jobs_accrue_pa in table qos_table
[2020-06-08T17:12:25.317] adding column min_prio_thresh after max_jobs_accrue_pu in table qos_table
[2020-06-08T17:12:25.317] adding column grp_jobs_accrue after grp_jobs in table qos_table
[2020-06-08T17:12:25.317] adding column preempt_exempt_time after preempt_mode in table qos_table
[2020-06-08T17:12:25.379] Conversion done: success!
[2020-06-08T17:12:25.387] Accounting storage MYSQL plugin loaded
[2020-06-08T17:21:29.874] slurmdbd version 19.05.7 started

Thanks
Sathish
I have not started the archive/purge process yet; I am waiting for your confirmation that the conversion looks fine based on the logs I shared. I will proceed as soon as I hear from you.

Thanks
Sathish
Go ahead, I'm still investigating the logs, but you can proceed.
(In reply to Broderick Gardner from comment #45)
> Go ahead, I'm still investigating the logs, but you can proceed.

OK, I have enabled the parameters below in slurmdbd.conf and restarted it.

#Purging
PurgeEventAfter=18852hours
PurgeJobAfter=18852hours
PurgeResvAfter=18852hours
PurgeStepAfter=18852hours
PurgeSuspendAfter=18852hours
PurgeTXNAfter=18852hours
PurgeUsageAfter=18852hours

#Archival
ArchiveDir=/opt/archive
ArchiveEvents=yes
ArchiveJobs=yes
ArchiveResvs=yes

I will let you know the status as soon as I see the archive run start.

Thanks
Sathish
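As a sanity check on those values, 18852 hours works out to roughly a 26-month retention window; quick shell arithmetic (integer division, so fractions of a day or month are dropped):

```shell
# rough retention implied by the Purge*=18852hours settings above
hours=18852
days=$((hours / 24))
months=$((days / 30))
echo "$days days (~$months months)"
```

In other words, records older than about 785 days become eligible for purge (and archive, where the corresponding Archive* option is enabled).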
I ran the archive manually, but I got the same error:

[root@slurmdbd1905 opt]# sacctmgr archive dump
This may result in loss of accounting database records (if Purge* options enabled).
Are you sure you want to continue? (You have 30 seconds to decide)
(N/y): y
sacctmgr: error: slurmdbd: Getting response to message type: DBD_ARCHIVE_DUMP
sacctmgr: error: slurmdbd: DBD_ARCHIVE_DUMP failure: No error
Problem dumping archive: Unspecified error

Attached is the slurmdbd.log for your review. Under the archive folder I do see the files below; in the previous attempt these files were not created.

[root@slurmdbd1905 archive]# pwd
/opt/archive
[root@slurmdbd1905 archive]# ls
scp_event_table_archive_2017-09-01T00:00:00_2017-10-01T00:00:00
scp_event_table_archive_2017-10-01T00:00:00_2017-11-01T00:00:00
scp_event_table_archive_2017-11-01T00:00:00_2017-12-01T00:00:00
scp_event_table_archive_2017-12-01T00:00:00_2018-01-01T00:00:00
scp_event_table_archive_2018-01-01T00:00:00_2018-02-01T00:00:00
scp_event_table_archive_2018-02-01T00:00:00_2018-03-01T00:00:00
scp_event_table_archive_2018-03-06T07:00:00_2018-04-15T06:59:59

Thanks
Sathish
Created attachment 14592 [details] slurmctld logs taken after running the manual archive
Created attachment 14593 [details] mariadb_slow.log taken after running the archival manually
Thanks. My confusion before was my mistake, sorry. I was concerned about not seeing the "version 19.05 started" log message even after the conversion completed.

The conversion appears to be error-free. Looking at the archive once again, it is interesting that the event table archived successfully but the jobs (and others) did not. Is there no error on the slurmdbd during the archive? Please post the slurmdbd log again, either all of it or trimmed to everything since:

[2020-06-08T17:21:29.874] slurmdbd version 19.05.7 started

Thanks
(In reply to Broderick Gardner from comment #50)
> Thanks. My confusion before was my mistake, sorry. I was concerned about not
> seeing the "version 19.05 started" log message, even after the conversion
> completed.
>
> The conversion appears to be error-free. Looking at the archive once again,
> it is interesting that the event table successfully archived, but the jobs
> (and others) did not. There is no error on the slurmdbd during the archive?
> Please post the slurmdbd log again, either all of it or trimmed to just since
> [2020-06-08T17:21:29.874] slurmdbd version 19.05.7 started
>
> Thanks

Hi Broderick,

Thanks for the update. As I mentioned, I ran the archival manually; below is the command I used, and it fails with "Unspecified error":

[root@slurmdbd1905 opt]# sacctmgr archive dump
This may result in loss of accounting database records (if Purge* options enabled).
Are you sure you want to continue? (You have 30 seconds to decide) (N/y): y
sacctmgr: error: slurmdbd: Getting response to message type: DBD_ARCHIVE_DUMP
sacctmgr: error: slurmdbd: DBD_ARCHIVE_DUMP failure: No error
 Problem dumping archive: Unspecified error

Why does it give the above error? Please advise. There are no errors in the logs.

Also, I left the configuration as it is, and it looks like the archival ran at the scheduled interval and completed. I say it completed because I see that both the event and job tables have been archived and the files are under the archive folder, but I still need to validate that everything ran successfully.
[root@slurmdbd1905 archive]# ls
scp_event_table_archive_2017-09-01T00:00:00_2017-10-01T00:00:00
scp_event_table_archive_2017-10-01T00:00:00_2017-11-01T00:00:00
scp_event_table_archive_2017-11-01T00:00:00_2017-12-01T00:00:00
scp_event_table_archive_2017-12-01T00:00:00_2018-01-01T00:00:00
scp_event_table_archive_2018-01-01T00:00:00_2018-02-01T00:00:00
scp_event_table_archive_2018-02-01T00:00:00_2018-03-01T00:00:00
scp_event_table_archive_2018-03-06T07:00:00_2018-04-15T06:59:59
scp_job_table_archive_2017-09-01T00:00:00_2017-10-01T00:00:00
scp_job_table_archive_2017-10-01T00:00:00_2017-11-01T00:00:00
scp_job_table_archive_2017-11-01T00:00:00_2017-12-01T00:00:00
scp_job_table_archive_2017-12-01T00:00:00_2018-01-01T00:00:00
scp_job_table_archive_2018-01-01T00:00:00_2018-02-01T00:00:00
scp_job_table_archive_2018-02-01T00:00:00_2018-03-01T00:00:00
scp_job_table_archive_2018-03-01T00:00:00_2018-04-15T06:59:59
scp_job_table_archive_2018-04-06T18:00:00_2018-04-15T06:59:59
scp_job_table_archive_2018-04-15T07:00:00_2018-04-15T07:59:59
scp_job_table_archive_2018-04-15T08:00:00_2018-04-15T08:59:59
scp_job_table_archive_2018-04-15T09:00:00_2018-04-15T09:59:59
scp_job_table_archive_2018-04-15T10:00:00_2018-04-15T10:59:59
scp_job_table_archive_2018-04-15T11:00:00_2018-04-15T11:59:59
scp_job_table_archive_2018-04-15T12:00:00_2018-04-15T12:59:59
scp_job_table_archive_2018-04-15T13:00:00_2018-04-15T13:59:59
scp_job_table_archive_2018-04-15T14:00:00_2018-04-15T14:59:59
scp_job_table_archive_2018-04-15T15:00:00_2018-04-15T15:59:59
scp_job_table_archive_2018-04-15T16:00:00_2018-04-15T16:59:59

I am wondering why I got the "Unspecified error" while running the archive manually but not when the archival is triggered at the scheduled time. Could you please review the slurmdbd.log_09062020 log, which I have attached, and let me know if you see any errors?

Thanks
Sathish
Created attachment 14601 [details] slurmdbd logs taken on 09062020 (taken after the archival ran at the scheduled interval)
Hi Broderick,

I would appreciate it if you could review and comment. As I said earlier, this is a bit critical, and we want to upgrade to a supported version as soon as possible.

Thanks
Sathish
Created attachment 14621 [details] sacctmgr error fix

I found an issue with error reporting; it is the reason the command is "failing" with "No error". Applying this patch and rerunning the manual `sacctmgr archive dump` command will show the true error message.

The method for applying the patch depends on your build and deployment method. Let me know whether you are able to do it.

Thanks
(In reply to Broderick Gardner from comment #54)
> Created attachment 14621 [details]
> sacctmgr error fix
>
> I found an issue with error reporting that is the reason why it is "failing"
> with "no error". Applying this patch and rerunning the manual `sacctmgr
> archive dump` command will show the true error message.
>
> The method for applying the patch depends on your build and deployment
> method. Let me know if you are able to do it or not.
>
> Thanks

We are using RPMs to deploy Slurm in our environment. I applied the patch you provided and rebuilt the RPMs from source with the patch applied:

[root@slurmdbd1905n slurmdbd]# patch -b -V numbered < ../../../../../sacctmgr_error_fix
patching file accounting_storage_slurmdbd.c
[root@slurmdbd1905n slurmdbd]# ls
Makefile.am  Makefile.in  accounting_storage_slurmdbd.c  accounting_storage_slurmdbd.c.~1~  slurmdbd_agent.c  slurmdbd_agent.h

Is this the only difference in the given patch?

[root@slurmdbd1905n slurmdbd]# diff accounting_storage_slurmdbd.c accounting_storage_slurmdbd.c.~1~
2931c2931
< error("slurmdbd: DBD_ARCHIVE_DUMP failure: %s", slurm_strerror(rc));
---
> error("slurmdbd: DBD_ARCHIVE_DUMP failure: %m");

Now, on the test system, `sacctmgr archive dump` is working as expected, because the archive/purge had already run at the scheduled interval per the Purge*After parameters configured in slurmdbd.conf.

But I suspect that if we run `sacctmgr archive dump` with full data in the DB, it will still fail. I could not test this because the data had already been archived/purged from the database.
But when I try to pull more than two months of data using sacct, it gives the error below; anything less than two months works as expected. The memory usage is high when we pull data for more than two months.

[root@slurmdbd1905n slurm-patch]# sacct --starttime=2020-03-01
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
sacct: error: slurmdbd: Getting response to message type: DBD_GET_JOBS_COND
sacct: error: slurmdbd: DBD_GET_JOBS_COND failure: Unspecified error

Sometimes it consumes all the available memory and slurmdbd gets killed by the system. See the status below, which was captured when we ran sacct to pull more than three months of data.

[root@slurmdbd1905n slurm-patch]# systemctl status slurmdbd.service
● slurmdbd.service - Slurm DBD accounting daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmdbd.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/slurmdbd.service.d
           └─override.conf
   Active: activating (auto-restart) (Result: signal) since Mon 2020-06-15 18:09:12 UTC; 26s ago
 Main PID: 78925 (code=killed, signal=KILL)

Jun 15 18:09:12 slurmdbd1905n systemd[1]: Unit slurmdbd.service entered failed state.
Jun 15 18:09:12 slurmdbd1905n systemd[1]: slurmdbd.service failed.

I believe this is also what happens with `sacctmgr archive dump`: when we run it with full data in the DB, it crashes with an unspecified error. Could you please advise a fix for this? Let me know if you need any other info from my end to take this further.

Thanks
Sathish
(In reply to Sathishkumar from comment #55) > Is this the only difference in the given patch ? Yes. > now, on the test system, sacctmgr archive dump is working as expected, > because the archive/purge was already ran by the system as per the scheduled > interval as per the following parameter Purge*After configured in > slurmdbd.conf > > But I feel that if we run sacctmgr archive dump with full data in the DB, it > is still fail, I could not test sacctmgr archive dump, because the data was > already archived/purged from the databases. Yes, it will still fail. The patch fixes error logging, so the correct error will be logged. > > But when I try to pull more than two months data using sacct, it gives the > below error, any thing less than two months it is working as expected, the > memory usage is high when we pull data for more than two months Is this different than usual? If the amount of data to be sent from the slurmdbd is over 3GB, it will fail. > [root@slurmdbd1905n slurm-patch]# sacct --starttime=2020-03-01 > JobID JobName Partition Account AllocCPUS State ExitCode > ------------ ---------- ---------- ---------- ---------- ---------- -------- > sacct: error: slurmdbd: Getting response to message type: DBD_GET_JOBS_COND > sacct: error: slurmdbd: DBD_GET_JOBS_COND failure: Unspecified error > > > sometimes it will consume all the available memory and slurmdbd will get > killed by the system, see the below status which was captured when we ran > sacct to pull more than three months data. Requesting large amounts of data from the slurmdbd can cause the slurmdbd to use a massive amount of memory processing the request before determining that the amount of data is too large. See Bug 5817. No, there is currently not a fix for this. You can mitigate it to some extent using the MaxQueryTimeRange in the slurmdbd.conf. This limits the maximum range of time that can be requested from the slurmdbd. 
> > I believe this is the same status for sacctmgr archive dump as well, when we > run it will full data on the DB, it is getting crashed with unspecified > error. This sacct error is most likely unrelated to the sacctmgr archive dump error. In order to proceed, I need to see the actual error message, which means restoring the database and running slurmdbd-19.05 with the patch. During conversion, comment out the archive and purge options from slurmdbd.conf. Then restore those lines, restart the slurmdbd, and try `sacctmgr archive dump` again. This might not be feasible considering the size of the database, but you may want to backup the database after 19.05 conversion but before the archive/purge. Summary of issues: Automatic archive/purge: working in 19.05 Manual archive/purge: Unknown error. (Note that manual archive/purge is not recommended as a regular operation) error: No reservation by id: On hold until the cluster is fully upgraded to 19.05 Thanks
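An illustrative sketch of the MaxQueryTimeRange mitigation mentioned above (the 60-day value is an example, not a recommendation; pick a range that fits your reporting needs, and note that slurmdbd must be restarted for the change to take effect):

```
# slurmdbd.conf (fragment) -- example value only
# Reject sacct/sacctmgr queries spanning more than 60 days (days-hours format)
MaxQueryTimeRange=60-0
```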
(In reply to Broderick Gardner from comment #56) > (In reply to Sathishkumar from comment #55) > > Is this the only difference in the given patch ? > Yes. > > > now, on the test system, sacctmgr archive dump is working as expected, > > because the archive/purge was already ran by the system as per the scheduled > > interval as per the following parameter Purge*After configured in > > slurmdbd.conf > > > > But I feel that if we run sacctmgr archive dump with full data in the DB, it > > is still fail, I could not test sacctmgr archive dump, because the data was > > already archived/purged from the databases. > Yes, it will still fail. The patch fixes error logging, so the correct error > will be logged. > > > > > But when I try to pull more than two months data using sacct, it gives the > > below error, any thing less than two months it is working as expected, the > > memory usage is high when we pull data for more than two months > Is this different than usual? If the amount of data to be sent from the > slurmdbd is over 3GB, it will fail. > > > [root@slurmdbd1905n slurm-patch]# sacct --starttime=2020-03-01 > > JobID JobName Partition Account AllocCPUS State ExitCode > > ------------ ---------- ---------- ---------- ---------- ---------- -------- > > sacct: error: slurmdbd: Getting response to message type: DBD_GET_JOBS_COND > > sacct: error: slurmdbd: DBD_GET_JOBS_COND failure: Unspecified error > > > > > > sometimes it will consume all the available memory and slurmdbd will get > > killed by the system, see the below status which was captured when we ran > > sacct to pull more than three months data. > Requesting large amounts of data from the slurmdbd can cause the slurmdbd to > use a massive amount of memory processing the request before determining > that the amount of data is too large. See Bug 5817. No, there is currently > not a fix for this. You can mitigate it to some extent using the > MaxQueryTimeRange in the slurmdbd.conf. 
> This limits the maximum range of
> time that can be requested from the slurmdbd.
>
> > I believe this is the same status for sacctmgr archive dump as well, when we
> > run it will full data on the DB, it is getting crashed with unspecified
> > error.
> This sacct error is most likely unrelated to the sacctmgr archive dump error.
>
> In order to proceed, I need to see the actual error message, which means
> restoring the database and running slurmdbd-19.05 with the patch. During
> conversion, comment out the archive and purge options from slurmdbd.conf.
> Then restore those lines, restart the slurmdbd, and try `sacctmgr archive
> dump` again. This might not be feasible considering the size of the
> database, but you may want to backup the database after 19.05 conversion but
> before the archive/purge.
>
> Summary of issues:
> Automatic archive/purge: working in 19.05
> Manual archive/purge: Unknown error.
> (Note that manual archive/purge is not recommended as a regular operation)
> error: No reservation by id: On hold until the cluster is fully upgraded to
> 19.05
>
> Thanks

Thanks for the clarification. I will share the actual error by tomorrow, after restoring the DB and then running 19.05 with the patch.

I have two questions from your input above:
1. Do you think it is not feasible to use `sacctmgr archive dump` given the size of our DB? If so, in which cases should I use `sacctmgr archive dump`?
2. I don't understand what you mean by backing up the database after the 19.05 conversion but before the archive/purge. (FYI, I am testing this on a test system.)

Also, as I said, on the test system the scheduled archive and purge are working as expected. The archive and purge trigger every day, keeping 548 days of data in the DB and purging everything older. After purging, the DB became 1.5 GB smaller.
At the same time, when I checked the size of the archive folder containing all the generated files, it was no more than 102 MB. Is that expected, and how do we confirm that all the data was archived? I ask because the difference between the size reduction in the database (1.5 GB) and the size of the archive folder (102 MB) is so large.

Based on the last generated archive file (scp_resv_table_archive_2018-11-24T00:00:00_2018-12-14T23:59:59), it archived the data up to 549 days back. Is there any way to check that the archived data is intact?

Regards
Sathish
(In reply to Sathishkumar from comment #57) > I have a question from your input above > 1. Do you think that it is not feasible to use sacctmgr archive dump for the > size of our DB ? if yes, on which case should I use sacctmgr archive dump ? I just mean that you should not need to use `sacctmgr archive dump` because it is configured to run automatically. > > 2. I dont understand what do you mean by backup the database after 19.05 > conversion but before the archive/purge. (FYI, i am testing this in a test > system) The conversion of the database from 17.11 to 19.05 takes a long time, and you have had to do it a few times. So you could back up the converted database in case it is needed again. This would just be to potentially save time in the future, not to protect against data loss. > Also, as I said that on the test system the scheduled archive and purging > are working as expected. Archive and purging will trigger everyday and will > keep 548 days of data on the DB and purge everything. Have you changed the configuration? The slurmdbd.conf lines you sent before indicated purging after 18852 hours, which would trigger a purge every hour. In any case, I expect that to be sufficient; you should never have to run `sacctmgr archive dump` outside of testing. > After purging the DB > has become 1.5G less. At the same time when I checked the size of the > archive folder which contains all the generated files are not more than > 102M, is that expected and how do we confirm that all the data archived ? I > ask this because the difference is so huge between the size which got > reduced from the databases(1.5G) and the archive folder size(102M). > Based on the last generated archive file > (scp_resv_table_archive_2018-11-24T00:00:00_2018-12-14T23:59:59) it archived > the data until 549 days. Is there any way available to check if the data > archived are intact? You could reload the data and check the database size again. Also run sacct on a range of time that was purged. 
Thanks
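The reload-and-verify suggestion above can be sketched as follows; the file path is one of the archive files from this ticket, used purely as an example:

```shell
# Load one archive file back into the database (example path from this ticket)
sacctmgr archive load file=/opt/archive/scp_job_table_archive_2018-02-01T00:00:00_2018-03-01T00:00:00

# Then query a time range that had been purged and confirm records come back
sacct --allusers --starttime=2018-02-01 --endtime=2018-02-28 | head
```

As far as this thread establishes, there is no dedicated integrity checker; comparing record counts before the purge and after the reload is the closest equivalent.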
(In reply to Broderick Gardner from comment #58)
> (In reply to Sathishkumar from comment #57)
> > I have a question from your input above
> > 1. Do you think that it is not feasible to use sacctmgr archive dump for the
> > size of our DB ? if yes, on which case should I use sacctmgr archive dump ?
> I just mean that you should not need to use `sacctmgr archive dump` because
> it is configured to run automatically.

Thanks for the clarification.

> > 2. I dont understand what do you mean by backup the database after 19.05
> > conversion but before the archive/purge. (FYI, i am testing this in a test
> > system)
> The conversion of the database from 17.11 to 19.05 takes a long time, and
> you have had to do it a few times. So you could back up the converted
> database in case it is needed again. This would just be to potentially save
> time in the future, not to protect against data loss.

Got your point.

> > Also, as I said that on the test system the scheduled archive and purging
> > are working as expected. Archive and purging will trigger everyday and will
> > keep 548 days of data on the DB and purge everything.
> Have you changed the configuration? The slurmdbd.conf lines you sent before
> indicated purging after 18852 hours, which would trigger a purge every hour.
> In any case, I expect that to be sufficient; you should never have to run
> `sacctmgr archive dump` outside of testing.

Yes, I changed it to 548 days, just to see if the archive runs fine when I convert the setting from hours to days, and I don't see any issues with that at the moment.

> > After purging the DB
> > has become 1.5G less. At the same time when I checked the size of the
> > archive folder which contains all the generated files are not more than
> > 102M, is that expected and how do we confirm that all the data archived ? I
> > ask this because the difference is so huge between the size which got
> > reduced from the databases(1.5G) and the archive folder size(102M).
> > Based on the last generated archive file
> > (scp_resv_table_archive_2018-11-24T00:00:00_2018-12-14T23:59:59) it archived
> > the data until 549 days. Is there any way available to check if the data
> > archived are intact?
> You could reload the data and check the database size again. Also run sacct
> on a range of time that was purged.
>
> Thanks

I randomly loaded an archived file and was able to get the desired result using sacct. But I was curious whether there are any tools or command options available to check the integrity.

I will share the error message as soon as I am done with the testing.

Thanks
Hi Broderick,

I installed the patch you shared and did an upgrade on a new test VM with the production data.

Slurm version
[root@slurmdbd1905b ~]# slurmdbd -V
slurm 19.05.7

Slurmdbd purging and archive details
PurgeStepAfter=548days
PurgeSuspendAfter=548days
PurgeTXNAfter=548days
PurgeUsageAfter=548days
#
##Archival
ArchiveDir=/mnt/backup/slurmdbd/archive
ArchiveEvents=yes
ArchiveJobs=yes
ArchiveResvs=yes

I ran the archive dump but am still getting the same error. Any idea?

[root@slurmdbd1905b ~]# sacctmgr archive dump
This may result in loss of accounting database records (if Purge* options enabled).
Are you sure you want to continue? (You have 30 seconds to decide) (N/y): y
sacctmgr: error: slurmdbd: Getting response to message type: DBD_ARCHIVE_DUMP
sacctmgr: error: slurmdbd: DBD_ARCHIVE_DUMP failure: Unspecified error
 Problem dumping archive: Unspecified error

Also, we have a scheduled maintenance this coming weekend to upgrade Slurm from 17.11 to 19.05. We are also planning to change the statesave location to a different path. Is there any sequence I should follow?

I am planning to follow the sequence below (note: all Slurm services are running in a container). Please advise if I am wrong.

1. Make sure there are no active jobs on the cluster.
2. Upgrade the slurmdbd first; wait 6+ hours for the conversion to complete. Once the conversion is done, make sure the dbd comes up and works as expected by querying sacct and sacctmgr and checking the service status. (Please advise if there is any specific test I should perform to make sure the database is up and running fine after conversion.)
3. Provision a new slurmctld00 (primary) and slurmctld01 (backup) with 19.05 and the new statesave location.
4. Stop the primary and backup slurmctld.
5. Copy all the statesave files from the old location to the new location.
6. Start the primary and backup slurmctld.
7. Validate that the communication between slurmdbd and slurmctld is working as expected. (Please advise if there is any specific test I should perform to make sure the communication between slurmdbd and slurmctld is fine with 19.05.)
8. Proceed with upgrading all slurmd's.

Regards
Sathish
Hi Broderick,

Is there any easy way to verify that I am running the slurmdbd with your patch?

Regards
Sathish
No, there's not. But the error message is different, so I would guess that you are.

sacctmgr: error: slurmdbd: Getting response to message type: DBD_ARCHIVE_DUMP
sacctmgr: error: slurmdbd: DBD_ARCHIVE_DUMP failure: No error
 Problem dumping archive: Unspecified error

vs

sacctmgr: error: slurmdbd: Getting response to message type: DBD_ARCHIVE_DUMP
sacctmgr: error: slurmdbd: DBD_ARCHIVE_DUMP failure: Unspecified error
 Problem dumping archive: Unspecified error
(In reply to Sathishkumar from comment #60) > Also, we have a scheduled maintenance coming weekend to upgrade the slurm > from 17.11 to 19.05. We are also planning to change the statesave location > to different path. Is there any sequence I should follow ? > > I am planning to follow the below sequence (Note: all slurm services are > running on a container) Please advise if there if I am wrong. Here are notes on your procedure: > 1. Make sure there is no active jobs on the cluster. One way to accomplish this is to set a maintenance reservation for the period of the upgrade so that no jobs are scheduled that could interfere. > > 2. Upgrade the slurmdbd first, wait for 6+ hours to complete the conversion, > once the conversion is done, make sure the dbd is coming up and working as > expected by querying sacct and sacctmgr, also checking service status. > (please advise if there is any specific test i should perform to make sure > the database if up and running fine after conversion) Make sure there is a backup of your database prior to running the upgraded slurmdbd. When running the slurmdbd to convert the database, run it in the foreground of a terminal instead of as a service. For example: $ sudo -u <slurm user> slurmdbd -D This prevents any interference from systemd during the conversion. Once it's done and responding to client commands (sacct or sacctmgr), feel free to stop it (SIGTERM/Ctrl-c) and restart it as a systemd service. This process has been heavily tested, so if it completes without error, you can be reasonably sure that it is correct. > 3. Provision a new slurmctld00(primary) and slurmctld01(backup) with 19.05 > with the new statesave location > > 4. Stop the slurmctld primary and backup > > 5. Copy over all the statesave file from old location to the new location > > 6. Start the slurmctld primary and backup > > 7. Validate if the communication between slurmdbd and slrumctld are working > as expected, once it is confirmed. 
(please advise if there is any specific > test i should perform to make sure the communication between slurmdbd and > slurmctld are fine with the 19.05 version ) If the slurmctld cannot talk to the slurmdbd, its log will have errors like: error: slurmdbd: Sending PersistInit msg: Connection refused Also, the ControlHost for the cluster in `sacctmgr show clusters` will not show an ip address. > > 8. Proceed with upgrading all slurmd's Online rolling upgrades are possible, but it seems you don't want to take any chances with that. That is fine. As an FYI, this is possible by increasing the MaxJobCount to make sure that the slurmdbd agent queue on the slurmctld is large enough to hold all accounting updates while the slurmdbd is offline.
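The two checks above can be sketched as a quick validation pass; the log path is an assumption, so adjust it to your setup:

```shell
# 1. Look for slurmdbd connection errors in the slurmctld log
grep "Sending PersistInit" /var/log/slurmctld.log

# 2. Confirm the cluster's ControlHost/ControlPort are registered with the slurmdbd
sacctmgr show clusters format=cluster,controlhost,controlport
# A healthy slurmctld<->slurmdbd link shows an IP address in the ControlHost column.
```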
(In reply to Broderick Gardner from comment #63)
> (In reply to Sathishkumar from comment #60)
> > Also, we have a scheduled maintenance coming weekend to upgrade the slurm
> > from 17.11 to 19.05. We are also planning to change the statesave location
> > to different path. Is there any sequence I should follow ?
> >
> > I am planning to follow the below sequence (Note: all slurm services are
> > running on a container) Please advise if there if I am wrong.
> Here are notes on your procedure:
> > 1. Make sure there is no active jobs on the cluster.
> One way to accomplish this is to set a maintenance reservation for the
> period of the upgrade so that no jobs are scheduled that could interfere.
> >
> > 2. Upgrade the slurmdbd first, wait for 6+ hours to complete the conversion,
> > once the conversion is done, make sure the dbd is coming up and working as
> > expected by querying sacct and sacctmgr, also checking service status.
> > (please advise if there is any specific test i should perform to make sure
> > the database if up and running fine after conversion)
> Make sure there is a backup of your database prior to running the upgraded
> slurmdbd.
> When running the slurmdbd to convert the database, run it in the foreground
> of a terminal instead of as a service. For example:
> $ sudo -u <slurm user> slurmdbd -D
> This prevents any interference from systemd during the conversion. Once it's
> done and responding to client commands (sacct or sacctmgr), feel free to
> stop it (SIGTERM/Ctrl-c) and restart it as a systemd service.
> This process has been heavily tested, so if it completes without error, you
> can be reasonably sure that it is correct.

Thanks for your input. So basically I should run `slurmdbd -Dvv` as root, and on another console do the upgrade by running "yum update slurm", which will update all the installed Slurm packages and start the conversion. Is that right?

> > 3. Provision a new slurmctld00(primary) and slurmctld01(backup) with 19.05
> > with the new statesave location
> >
> > 4. Stop the slurmctld primary and backup
> >
> > 5. Copy over all the statesave file from old location to the new location

Any input on copying the data to the statesave location?

> > 6. Start the slurmctld primary and backup
> >
> > 7. Validate if the communication between slurmdbd and slurmctld are working
> > as expected, once it is confirmed. (please advise if there is any specific
> > test i should perform to make sure the communication between slurmdbd and
> > slurmctld are fine with the 19.05 version )
> If the slurmctld cannot talk to the slurmdbd, its log will have errors like:
> error: slurmdbd: Sending PersistInit msg: Connection refused
> Also, the ControlHost for the cluster in `sacctmgr show clusters` will not
> show an ip address.

Sure, I will use this to validate that the communication between slurmctld and slurmdbd is fine after the upgrade and DB conversion.

> > 8. Proceed with upgrading all slurmd's
> Online rolling upgrades are possible, but it seems you don't want to take
> any chances with that. That is fine. As an FYI, this is possible by
> increasing the MaxJobCount to make sure that the slurmdbd agent queue on the
> slurmctld is large enough to hold all accounting updates while the slurmdbd
> is offline.

Yes, thanks for the note. Since there is a change in the statesave location, we don't want to do rolling updates. Also, we have a few VMs on which slurmd will still be the current version, 17.11. Do you see any issues with slurmd 17.11 communicating with 19.05?

Regards
Sathish
(In reply to Sathishkumar from comment #64) > Thanks for your input, So basically I should run slurmdbd -Dvv as root and > on the other console do the upgrade by running "yum update slurm" this will > basically start updating all the installed slurm packages and start the > conversion. Is this right ? Well, you don't need 2 consoles. For example: $ sudo systemctl stop slurmdbd $ sudo yum update slurm ... $ sudo -u slurm slurmdbd -Dvv ... This simply runs the slurmdbd completely outside systemd. Do you typically run the slurmdbd as root? It should be run as whatever the SlurmUser is in slurmdbd.conf, which could be root. I am not sure what user the systemd service file runs slurmdbd as. From man slurmdbd.conf: SlurmUser The name of the user that the slurmdbd daemon executes as. This user must exist on the machine ex‐ ecuting the Slurm Database Daemon and have the same UID as the hosts on which slurmctld execute. For security purposes, a user other than "root" is recommended. The default value is "root". This name should also be the same SlurmUser on all clusters reporting to the SlurmDBD. NOTE: If this user is different from the one set for slurmctld and is not root, it must be added to accounting with AdminLevel=Admin and slurmctld must be restarted. > Any input coping the data to statesave location ? That is correct, just copy the statesave location while both slurmctld's are not running. > > Also, we have few VMs and on that slurmd will still be the current version > which is 17.11, do u see any issues with slurmd-17.11 communicating with > 19.05 ? That should be fine. This is not quite as frequently tested, but we do support running slurmd up to 2 versions lower than the slurmctld. On that note, the slurmdbd supports talking to a slurmctld up to 2 versions lower than itself as well.
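Since the SlurmUser here is slurm rather than root, the NOTE in the man page excerpt above may apply; a hedged sketch of that accounting step (verify the condition actually applies to your site before running it):

```shell
# Grant the slurm user AdminLevel=Admin in accounting; per the man page this is
# required only when the slurmdbd SlurmUser differs from the slurmctld user and
# is not root
sacctmgr modify user where user=slurm set adminlevel=Admin
# ...then restart slurmctld
```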
(In reply to Broderick Gardner from comment #65)
> (In reply to Sathishkumar from comment #64)
> > Thanks for your input, So basically I should run slurmdbd -Dvv as root and
> > on the other console do the upgrade by running "yum update slurm" this will
> > basically start updating all the installed slurm packages and start the
> > conversion. Is this right ?
> Well, you don't need 2 consoles. For example:
> $ sudo systemctl stop slurmdbd
> $ sudo yum update slurm

Thanks, the above step makes sense.

> $ sudo -u slurm slurmdbd -Dvv

Sure, I will use this to run the dbd in the foreground.

> This simply runs the slurmdbd completely outside systemd. Do you typically
> run the slurmdbd as root? It should be run as whatever the SlurmUser is in
> slurmdbd.conf, which could be root. I am not sure what user the systemd
> service file runs slurmdbd as.

Yes, we run slurmdbd as root, but we have set slurm as the SlurmUser; going forward we will use the slurm user for all Slurm management.

> From man slurmdbd.conf:
> SlurmUser
>     The name of the user that the slurmdbd daemon executes as. This user
>     must exist on the machine executing the Slurm Database Daemon and have
>     the same UID as the hosts on which slurmctld execute. For security
>     purposes, a user other than "root" is recommended. The default value
>     is "root". This name should also be the same SlurmUser on all clusters
>     reporting to the SlurmDBD. NOTE: If this user is different from the one
>     set for slurmctld and is not root, it must be added to accounting with
>     AdminLevel=Admin and slurmctld must be restarted.
>
> > Any input coping the data to statesave location ?
> That is correct, just copy the statesave location while both slurmctld's are
> not running.
>
> > Also, we have few VMs and on that slurmd will still be the current version
> > which is 17.11, do u see any issues with slurmd-17.11 communicating with
> > 19.05 ?
> That should be fine. This is not quite as frequently tested, but we do
> support running slurmd up to 2 versions lower than the slurmctld. On that
> note, the slurmdbd supports talking to a slurmctld up to 2 versions lower
> than itself as well.

OK. I ask because slurm.conf has some changes specific to 19.05, like SlurmctldHost, which cannot be used with 17.11; in that case I believe slurmd will complain "node appears to have a different slurm.conf than the slurmctld". Of course this is expected and can be ignored, but are there any other errors we should expect to ignore while running slurmd 17.11 against slurmctld 19.05?

Regards
Sathish
Hi Broderick,

I see that there is a change in pam_slurm_adopt behavior. With 17.11, adding the line below to /etc/pam.d/sshd prevented users from sshing into nodes on which they do not have a running job. But now, even when we remove the line below from /etc/pam.d/sshd, users are still prevented from sshing into nodes where they have no running job. Is that right?

account required pam_slurm_adopt.so

I see the change below in the 19.05 release notes:

NOTE: Limit pam_slurm_adopt to run only in the sshd context by default, for security reasons. A new module option 'service=<name>' can be used to allow a different PAM applications to work. The option 'service=*' can be used to restore the old behavior of always performing the adopt logic regardless of the PAM application context.

We want to restore the old behavior: if we enable pam_slurm_adopt.so in pam.d/sshd, it should allow sshing only if there is a running job; but if we remove pam_slurm_adopt.so, it should allow sshing regardless of whether jobs are running. Kindly advise.

Thanks
Sathish
(In reply to Sathishkumar from comment #67)
This change does not affect the behavior you are describing; it is a security hardening change only. If pam_slurm_adopt is removed from the PAM configuration, then it will not restrict ssh.

If you observe a problem with this, please file a separate ticket to investigate.
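For reference, the two setups being discussed would look roughly like this in /etc/pam.d/sshd (a sketch only; the surrounding distribution-specific PAM lines are omitted):

```
# /etc/pam.d/sshd (sketch; other stack lines omitted)

# Restrict ssh to nodes where the user has a running job:
account    required     pam_slurm_adopt.so

# Since 19.05 the module only runs in the sshd PAM context by default;
# per the release notes, 'service=*' restores the old always-run behavior:
#   account    required     pam_slurm_adopt.so service=*

# To allow ssh regardless of running jobs, omit the pam_slurm_adopt.so
# line entirely.
```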
(In reply to Broderick Gardner from comment #68)
> If you observe a problem with this, please file a separate ticket to
> investigate.

Sure, I will validate the behavior once again and create a ticket if it is still a problem. I will update you on the status of the upgrade once we are done; we can probably close this ticket by next week.

Thanks
Sathish
Created attachment 14821 [details] slurmdbd.log and sdiag after 19.05 upgrade
Hi Broderick,

We updated to 19.05.7 over the weekend and overall the upgrade went well. However, we see the errors below in the logs, and the archival/purging has not been triggered yet. The logs are attached for your review.

error: We have more time than is possible (20678400+1555200+0)(22233600) > 20678400 for cluster scp(5744) from 2019-12-18T18:00:00 - 2019-12-18T19:00:00 tres 1
error: We have more allocated time than is possible (23624074 > 20678400) for cluster scp(5744) from 2019-12-18T19:00:00 - 2019-12-18T20:00:00 tres 1
error: We have more time than is possible (20678400+1555200+0)(22233600) > 20678400 for cluster scp(5744) from 2019-12-18T19:00:00 - 2019-12-18T20:00:00 tres 1
error: We have more allocated time than is possible (23664697 > 20678400) for cluster scp(5744) from 2019-12-18T20:00:00 - 2019-12-18T21:00:00 tres 1
error: We have more time than is possible (20678400+1555200+0)(22233600) > 20678400 for cluster scp(5744) from 2019-12-18T20:00:00 - 2019-12-18T21:00:00 tres 1
error: We have more allocated time than is possible (23743160 > 20678400) for cluster scp(5744) from 2019-12-18T21:00:00 - 2019-12-18T22:00:00 tres 1
error: We have more time than is possible (20678400+1555200+0)(22233600) > 20678400 for cluster scp(5744) from 2019-12-18T21:00:00 - 2019-12-18T22:00:00 tres 1
error: We have more allocated time than is possible (23664440 > 20678400) for cluster scp(5744) from 2019-12-18T22:00:00 - 2019-12-18T23:00:00 tres 1
error: We have more time than is possible (20678400+1555200+0)(22233600) > 20678400 for cluster scp(5744) from 2019-12-18T22:00:00 - 2019-12-18T23:00:00 tres 1
error: We have more allocated time than is possible (23535521 > 20678400) for cluster scp(5744) from 2019-12-18T23:00:00 - 2019-12-19T00:00:00 tres 1
error: We have more time than is possible (20678400+1555200+0)(22233600) > 20678400 for cluster scp(5744) from 2019-12-18T23:00:00 - 2019-12-19T00:00:00 tres 1
error: We have more allocated time than is possible (23665706 > 20678400) for cluster scp(5744) from 2019-12-19T00:00:00 - 2019-12-19T01:00:00 tres 1
error: We have more time than is possible (20678400+1555200+0)(22233600) > 20678400 for cluster scp(5744) from 2019-12-19T00:00:00 - 2019-12-19T01:00:00 tres 1

Please do let me know if there is anything needed on my end.

Regards
Sathish
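For what it's worth, the numbers in these messages decode consistently if we assume the rollup error format is `(allocated+down+planned_down)(sum) > possible`, where `possible` is the cluster CPU count (the `(5744)` after the cluster name) times the seconds in the rollup period. A quick sanity check (a sketch; the bucket names are assumptions, not Slurm's exact internals):

```python
# Sanity-check the rollup error arithmetic from the log, assuming the
# message format is: (alloc+down+planned_down)(sum) > possible,
# where possible = cluster_cpus * seconds_in_period.

SECONDS_PER_HOUR = 3600

def possible_cpu_seconds(cluster_cpus, period_seconds=SECONDS_PER_HOUR):
    """Maximum cpu-seconds the cluster can account for in one rollup period."""
    return cluster_cpus * period_seconds

# "cluster scp(5744)" -> 5744 CPUs; the log's limit is 20678400
assert possible_cpu_seconds(5744) == 20678400

# alloc + down + planned_down from one log line
alloc, down, planned = 20678400, 1555200, 0
total = alloc + down + planned
assert total == 22233600                     # matches "(22233600)" in the log
assert total > possible_cpu_seconds(5744)    # hence the error

# a down figure of 1555200 cpu-seconds would mean 432 CPUs down for the hour
assert down // SECONDS_PER_HOUR == 432
print("arithmetic consistent")
```

Under this reading, the errors say the accounted cpu-seconds in those hours exceed what the cluster could physically have provided.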
Hi Broderick,

The archive/purge took some time to initiate; it ran last night and everything was archived/purged per the defined timelines. But from time to time we are seeing errors like the following in the slurmdbd logs:

[2020-07-01T15:00:13.659] error: We have more time than is possible (13372857+259200+30700800)(44332857) > 30700800 for cluster scp(8528) from 2020-06-29T07:00:00 - 2020-06-29T08:00:00 tres 1
[2020-07-01T15:00:13.963] error: We have more time than is possible (14480721+259200+30700800)(45440721) > 30700800 for cluster scp(8528) from 2020-06-29T08:00:00 - 2020-06-29T09:00:00 tres 1
[2020-07-01T15:00:14.289] error: We have more time than is possible (14602653+259200+30700800)(45562653) > 30700800 for cluster scp(8528) from 2020-06-29T09:00:00 - 2020-06-29T10:00:00 tres 1
[2020-07-01T15:00:14.607] error: We have more time than is possible (14549595+259200+30700800)(45509595) > 30700800 for cluster scp(8528) from 2020-06-29T10:00:00 - 2020-06-29T11:00:00 tres 1
[2020-07-01T15:00:14.944] error: We have more time than is possible (14695663+259200+30700800)(45655663) > 30700800 for cluster scp(8528) from 2020-06-29T11:00:00 - 2020-06-29T12:00:00 tres 1
[2020-07-01T15:00:15.447] error: We have more time than is possible (17721970+259200+30700800)(48681970) > 30700800 for cluster scp(8528) from 2020-06-29T12:00:00 - 2020-06-29T13:00:00 tres 1

Any thoughts?

Regards
Sathish
Do you have any reservations? Have you had any reservations during the last month?

scontrol show reservations
sacctmgr show reservations

There are a few possible sources for this error, but the most likely is related to Bug 6839. When there are down nodes in a reservation, time is double-counted. This is not limited to 19.05, and the fix will only be in 20.11, released later this year. The reason the fix will not be in a released version of Slurm for several months is that it involves tricky logic that requires significant testing.

The error only occurs when a time range including down nodes in a reservation is "rolled up" into the hourly rollup table _and_ the total allocated time goes over the total possible (meaning that even though time is double-counted, the error will not show until enough nodes are allocated to push the time over the total possible time).

Please also run `sacctmgr show runawayjobs`

Thanks
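The double-counting described above can be sketched with a toy model (made-up numbers, not Slurm's actual rollup code): CPUs that are DOWN while sitting inside a reservation contribute cpu-seconds to both the down bucket and the reserved (planned-down) bucket, so a busy enough hour pushes the sum past the cluster's possible cpu-seconds.

```python
# Toy model of the Bug 6839 double-counting: one hour of rollup on a
# hypothetical 100-CPU cluster with a 40-CPU reservation, 10 of whose
# CPUs are DOWN, and 60 CPUs fully allocated outside the reservation.

HOUR = 3600
cluster_cpus = 100
possible = cluster_cpus * HOUR                 # 360000 cpu-seconds this hour

reserved_cpus = 40
down_in_resv = 10          # DOWN CPUs inside the reservation
alloc_cpus = 60            # allocated CPUs outside the reservation

# Correct accounting: each CPU lands in exactly one bucket.
correct = (alloc_cpus + down_in_resv + (reserved_cpus - down_in_resv)) * HOUR
assert correct == possible                     # can never exceed the limit

# Buggy accounting: down time inside the reservation is counted twice,
# once as down time and once as reserved (planned-down) time.
buggy = (alloc_cpus + down_in_resv + reserved_cpus) * HOUR
assert buggy > possible                        # -> "more time than is possible"
print(f"correct={correct}  buggy={buggy}  possible={possible}")
```

With fewer allocated CPUs the buggy sum would stay under `possible` and no error would be logged, which matches the observation that the error only appears once allocation is high enough.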
(In reply to Broderick Gardner from comment #73)
> There are a few possible sources for this error, but the most likely is
> related to Bug 6839. When there are down nodes in a reservation, time is
> double-counted.
> [...]
> Please also run `sacctmgr show runawayjobs`

We do have reservations and have been using them for a while. The reported job ID 30700800 has not been submitted to any reservation; see the sacct output for job 30700800 along with the other requested outputs below.
$ sacct -j 30700800 --format=Node,submit,Reservation
       NodeList              Submit          Reservation
--------------- ------------------- --------------------
    seskscpn079 2020-06-02T09:39:37
    seskscpn079 2020-06-02T09:39:45
    seskscpn079 2020-06-02T09:39:45

$ sacctmgr show runawayjobs
Runaway Jobs: No runaway jobs found on cluster scp

$ sacctmgr show reservations
   Cluster            Name       TRES           TimeStart             TimeEnd UnusedWall
---------- --------------- ---------- ------------------- ------------------- ----------
       scp             NFS     cpu=36 2019-09-03T11:37:18 2020-07-24T11:30:08 2.8747e+07
       scp          PS-lab     cpu=24 2020-02-29T10:29:02 2021-01-29T12:45:54 1.0710e+07
       scp          CC-FEP    cpu=160 2020-05-21T07:48:14 2021-04-01T10:25:40 3.6158e+06
       scp       easybuild     cpu=64 2020-05-25T10:36:21 2022-05-14T04:22:11 5.5861e+06
       scp            clab     cpu=32 2020-06-08T07:17:52 2021-06-08T07:17:52 4.5545e+06
       scp   reinvent-prod     cpu=40 2020-06-26T11:14:39 2021-03-31T16:00:16 9.2859e+05
       scp        maint052     cpu=40 2020-06-26T11:14:52 2021-06-26T11:14:52 3.0879e+06
       scp    changewindow   cpu=8528 2020-06-27T15:44:45 2021-06-27T15:44:45 2.9853e+06
       scp    icinga-check      cpu=1 2020-06-28T19:14:50 2021-03-31T15:16:33 2.8832e+06
       scp          cryoem     cpu=96 2020-07-01T07:00:00 2020-07-01T19:00:00 3.2056e+04
       scp          cryoem     cpu=96 2020-07-02T07:00:00 2020-07-02T19:00:00 1.6995e+04

$ scontrol show res
ReservationName=PS-lab StartTime=2020-01-30T12:45:54 EndTime=2021-01-29T12:45:54 Duration=365-00:00:00
   Nodes=seskscpn056 NodeCnt=1 CoreCnt=24 Features=broadwell&standard PartitionName=core Flags=
   NodeName=seskscpn056 CoreIDs=1-24
   TRES=cpu=24
   Users=(null) Accounts=cp Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

ReservationName=icinga-check StartTime=2020-03-31T15:16:33 EndTime=2021-03-31T15:16:33 Duration=365-00:00:00
   Nodes=seskscpn056 NodeCnt=1 CoreCnt=1 Features=broadwell&standard PartitionName=core Flags=IGNORE_JOBS
   NodeName=seskscpn056 CoreIDs=0
   TRES=cpu=1
   Users=scp,icinga-user Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

ReservationName=reinvent-prod StartTime=2020-03-31T16:00:16 EndTime=2021-03-31T16:00:16 Duration=365-00:00:00
   Nodes=seskscpg[052-053] NodeCnt=2 CoreCnt=80 Features=cascadelake&dss PartitionName=gpu Flags=SPEC_NODES
   TRES=cpu=80
   Users=(null) Accounts=cc,mai Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

ReservationName=CC-FEP StartTime=2020-04-01T10:25:40 EndTime=2021-04-01T10:25:40 Duration=365-00:00:00
   Nodes=seskscpg[004-005,011,050-051] NodeCnt=5 CoreCnt=160 Features=(null) PartitionName=gpu Flags=SPEC_NODES
   TRES=cpu=160
   Users=(null) Accounts=admins,cc,sc Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

ReservationName=cryoem StartTime=2020-07-02T07:00:00 EndTime=2020-07-02T19:00:00 Duration=12:00:00
   Nodes=seskscpg[033-034,045-046] NodeCnt=4 CoreCnt=96 Features=cascadelake&volta&highmem PartitionName=gpu Flags=FLEX,DAILY,NO_HOLD_JOBS_AFTER_END
   TRES=cpu=96
   Users=(null) Accounts=sc,admins Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

ReservationName=easybuild StartTime=2020-05-14T04:22:11 EndTime=2022-05-14T04:22:11 Duration=730-00:00:00
   Nodes=seskscpg[001,026-027] NodeCnt=3 CoreCnt=64 Features=(null) PartitionName=gpu Flags=FLEX,IGNORE_JOBS,SPEC_NODES,NO_HOLD_JOBS_AFTER_END
   TRES=cpu=64
   Users=(null) Accounts=ops,admins Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

ReservationName=clab StartTime=2020-06-08T07:17:52 EndTime=2021-06-08T07:17:52 Duration=365-00:00:00
   Nodes=seskscpn042 NodeCnt=1 CoreCnt=32 Features=broadwell,standard PartitionName=core Flags=IGNORE_JOBS,SPEC_NODES
   TRES=cpu=32
   Users=clab Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a
From what I can tell, everything is working as expected right now. I will close this ticket.

Thanks