Ticket 2888 - Discrepancy between slurmdbd and slurm.job.log
Summary: Discrepancy between slurmdbd and slurm.job.log
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmdbd
Version: 15.08.10
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Tim Wickberg
 
Reported: 2016-07-11 15:38 MDT by David Richardson
Modified: 2017-03-07 18:31 MST

Site: University of Utah


Description David Richardson 2016-07-11 15:38:15 MDT
Hi,

We're running 15.08.10.

We've found several jobs in our records where there's a discrepancy between slurmdbd (reported via sacct) and the slurm.job.log file. Here are three example jobs (1262602, 1262603, and 1262604):

grep slurm.job.log:
JobId=1262602 UserId=u0138544(207845) GroupId=camp(1978) Name=ma.8002.4.slurm JobState=TIMEOUT Partition=kingspeak TimeLimit=4320 StartTime=2016-06-10T02:53:50 EndTime=2016-06-13T02:54:08 NodeList=kp027 NodeCnt=1 ProcCnt=16 WorkDir=/uufs/chpc.utah.edu/common/home/u0138544/runs/autism
JobId=1262603 UserId=u0138544(207845) GroupId=camp(1978) Name=ma.8002.9.slurm JobState=TIMEOUT Partition=kingspeak TimeLimit=4320 StartTime=2016-06-10T09:21:01 EndTime=2016-06-13T09:21:13 NodeList=kp001 NodeCnt=1 ProcCnt=16 WorkDir=/uufs/chpc.utah.edu/common/home/u0138544/runs/autism
JobId=1262604 UserId=u0138544(207845) GroupId=camp(1978) Name=ma.10001.6.slurm JobState=TIMEOUT Partition=kingspeak TimeLimit=4320 StartTime=2016-06-10T09:21:01 EndTime=2016-06-13T09:21:13 NodeList=kp002 NodeCnt=1 ProcCnt=16 WorkDir=/uufs/chpc.utah.edu/common/home/u0138544/runs/autism


sacct:
1262602|1262602|kingspeak|kingspeak|camp|camp|1978|u0138544|207845|2016-06-09T17:42:02|2016-06-09T17:42:02|2016-06-13T02:54:08|2016-06-13T02:54:08|00:00:00|1:0|TIMEOUT|1|1|1|0n|3-00:00:00|None assigned|ma.8002.4.slurm
1262603|1262603|kingspeak|kingspeak|camp|camp|1978|u0138544|207845|2016-06-09T17:42:02|2016-06-09T17:42:02|2016-06-13T09:21:13|2016-06-13T09:21:13|00:00:00|1:0|TIMEOUT|1|1|1|0n|3-00:00:00|None assigned|ma.8002.9.slurm
1262604|1262604|kingspeak|kingspeak|camp|camp|1978|u0138544|207845|2016-06-09T17:42:02|2016-06-09T17:42:02|2016-06-13T09:21:13|2016-06-13T09:21:13|00:00:00|1:0|TIMEOUT|1|1|1|0n|3-00:00:00|None assigned|ma.10001.6.slurm

(The sacct command was:
sacct --parsable2 --noheader --allocations --format jobid,jobidraw,cluster,partition,account,group,gid,user,uid,submit,eligible,start,end,elapsed,exitcode,state,nnodes,ncpus,reqcpus,reqmem,timelimit,nodelist,jobname -j $JOBID
This is the format used by xdmod.)

There are four things wrong with the sacct output:
* The start time is wrong
* The duration is wrong (probably because of the start time)
* The nodelist is empty (says "None assigned" instead of the node)
* The ncpus field is wrong

We've only seen these discrepancies since moving to 15.08.10 (we were on the 14 series beforehand). Is this a known bug? Is there a recommended way to fix the data in the slurmdbd?
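
For reference, here is roughly how we have been spotting these, written up as a hypothetical helper script (the job log path varies per cluster, and the format string is the xdmod one above):

#!/bin/bash
# Print the slurm.job.log record and the slurmdbd (sacct) record for one job
# so the two can be compared by eye.
JOBID=$1
JOBLOG=/path/to/slurm.job.log   # set per cluster
FMT=jobid,jobidraw,cluster,partition,account,group,gid,user,uid,submit,eligible,start,end,elapsed,exitcode,state,nnodes,ncpus,reqcpus,reqmem,timelimit,nodelist,jobname

echo "== slurm.job.log =="
grep "JobId=${JOBID} " "$JOBLOG"

echo "== slurmdbd (sacct) =="
sacct --parsable2 --noheader --allocations --format "$FMT" -j "$JOBID"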

Thanks,
DR
Comment 2 Tim Wickberg 2016-07-11 15:48:49 MDT
Are you only seeing this with jobs in state "TIMEOUT"?

Do you get different results with 'sacct -D'?
Comment 3 David Richardson 2016-07-11 16:17:51 MDT
Tim,

-D returns the same results.

These are the only jobs we've tracked down so far, but we haven't done an exhaustive search.

Thanks,
DR
Comment 4 David Richardson 2016-12-02 13:51:09 MST
Tim,

We've done some more digging, and we have found a lot more discrepancies between the db and the logs. It seems to hit jobs in many different states.

We have wondered if part of the problem comes from times when the slurmctld is unable to reach the dbd (regardless of why; we've had it happen because of a bad reservation [once a name with a special character, once a reservation with nodes that no longer existed], but it's probably more tied to the *fact* of no communication than to the exact cause).

These jobs all happened while we were running 15.08.10 everywhere. As of 2016-11-30, our shared DBD and the lonepeak cluster slurmctld moved to 16.05.6.


Here are some examples:

There are two jobs from May that the db thinks are still running:

$ more /uufs/lonepeak.peaks/sys/var/slurm/log/slurm.job.log | grep 41210

JobId=41210 UserId=u6001661(1115408) GroupId=kapheim(2310) Name=GATKpreprocessing_part2_C33_6.slurm JobState=COMPLETED Partition=lonepeak TimeLimit=4320 StartTime=2016-05-12T13:20:10 EndTime=2016-05-12T16:03:29 NodeList=lp013 NodeCnt=1 ProcCnt=20 WorkDir=/uufs/chpc.utah.edu/common/home/kapheim-group1/halictid_reseq/mgen

 

$ sacct -p --format jobid,cluster,partition,account,group,user,state,submit,eligible,start,end,elapsed,exitcode,nnodes,ncpus,nodelist,jobname --job=41210

JobID|Cluster|Partition|Account|Group|User|State|Submit|Eligible|Start|End|Elapsed|ExitCode|NNodes|NCPUS|NodeList|JobName|

41210|lonepeak|lonepeak|kapheim|kapheim|u6001661|RUNNING|2016-05-12T10:25:07|2016-05-12T10:25:07|2016-05-12T13:20:10|Unknown|138-01:34:18|0:0|1|20|lp013|GATKpreprocessing_part2_C33_6.slurm|


Two other jobs from May 12 (both started and ended that day):

JobId=41219 UserId=u6001661(1115408) GroupId=kapheim(2310) Name=GATKpreprocessing_part2_C49_9.slurm JobState=COMPLETED Partition=lonepeak TimeLimit=4320 StartTime=2016-05-12T16:03:34 EndTime=2016-05-12T18:35:32 NodeList=lp013 NodeCnt=1 ProcCnt=20 WorkDir=/uufs/chpc.utah.edu/common/home/kapheim-group1/halictid_reseq/mgen

 

$ sacct -p --format jobid,cluster,partition,account,group,user,submit,eligible,start,end,elapsed,exitcode,nnodes,ncpus,nodelist,jobname --job=41219

JobID|Cluster|Partition|Account|Group|User|Submit|Eligible|Start|End|Elapsed|ExitCode|NNodes|NCPUS|NodeList|JobName|

41219|lonepeak|lonepeak|kapheim|kapheim|u6001661|2016-05-12T10:35:51|2016-05-12T10:35:51|2016-05-12T18:35:32|2016-05-12T18:35:32|00:00:00|0:0|0|0|None assigned|GATKpreprocessing_part2_C49_9.slurm|

 

$ more /uufs/lonepeak.peaks/sys/var/slurm/log/slurm.job.log | grep 41316

JobId=41316 UserId=u0723235(794864) GroupId=cheatham(1720) Name=600psN1 JobState=CANCELLED Partition=cheatham-lp TimeLimit=4320 StartTime=2016-05-12T17:28:13 EndTime=2016-05-12T17:46:29 NodeList=lp056 NodeCnt=1 ProcCnt=8 WorkDir=/uufs/chpc.utah.edu/common/home/cheatham-group5/hamed/gac/1hc8/us/loop-separation/mix-path/neut-KCl

 

$ sacct -p --format jobid,cluster,partition,account,group,user,submit,eligible,start,end,elapsed,exitcode,nnodes,ncpus,nodelist,jobname --job=41316

JobID|Cluster|Partition|Account|Group|User|Submit|Eligible|Start|End|Elapsed|ExitCode|NNodes|NCPUS|NodeList|JobName|

41316|lonepeak|cheatham-lp|cheatham-lp|cheatham|u0723235|2016-05-12T17:27:50|2016-05-12T17:27:50|2016-05-12T17:28:13|2016-05-12T20:40:43|03:12:30|0:0|1|8|lp056|600psN1|

 

A block of jobs between 63832 and 79558 (not all of them, but about 1600) ran in a window from roughly 8/29 to 9/15. Here are some examples.

 

One example of a completed job is 63832:

JobId=63832 UserId=u0637770(711412) GroupId=noo(2270) Name=C2_10_1300 JobState=COMPLETED Partition=lonepeak TimeLimit=4320 StartTime=2016-08-29T12:27:54 EndTime=2016-08-30T19:01:21 NodeList=lp009 NodeCnt=1 ProcCnt=20 WorkDir=/uufs/chpc.utah.edu/common/home/u0637770/geant4-multi/build/Temp-build/NewFullSimulations/Heart/Fat/Position02/Circle/10keV

 

$ sacct -p --format jobid,cluster,partition,account,group,user,submit,eligible,start,end,elapsed,exitcode,nnodes,ncpus,nodelist,jobname --job=63832

JobID|Cluster|Partition|Account|Group|User|Submit|Eligible|Start|End|Elapsed|ExitCode|NNodes|NCPUS|NodeList|JobName|

63832|lonepeak|lonepeak|noo|noo|u0637770|2016-08-22T09:02:14|2016-08-22T09:02:14|2016-08-30T19:01:21|2016-08-30T19:01:21|00:00:00|0:0|1|1|None assigned|C2_10_1300|

 

 

$ more /uufs/lonepeak.peaks/sys/var/slurm/log/slurm.job.log | grep 71185

JobId=71185 UserId=u0138544(207845) GroupId=camp(1978) Name=1822.14.slurm JobState=TIMEOUT Partition=lonepeak TimeLimit=4320 StartTime=2016-09-09T00:17:18 EndTime=2016-09-12T00:17:47 NodeList=lp009 NodeCnt=1 ProcCnt=20 WorkDir=/uufs/chpc.utah.edu/common/home/u0138544/runs/bc/luminalb/1821

 

$ sacct -p --format jobid,cluster,partition,account,group,user,submit,eligible,start,end,elapsed,exitcode,nnodes,ncpus,nodelist,jobname --job=71185

JobID|Cluster|Partition|Account|Group|User|Submit|Eligible|Start|End|Elapsed|ExitCode|NNodes|NCPUS|NodeList|JobName|

71185|lonepeak|lonepeak|camp|camp|u0138544|2016-09-05T12:28:00|2016-09-05T12:28:00|2016-09-12T00:17:47|2016-09-12T00:17:47|00:00:00|1:0|1|1|None assigned|1822.14.slurm|

 

 

$ more /uufs/lonepeak.peaks/sys/var/slurm/log/slurm.job.log | grep 71194

JobId=71194 UserId=u0138544(207845) GroupId=camp(1978) Name=1822.22.slurm JobState=COMPLETED Partition=lonepeak TimeLimit=4320 StartTime=2016-09-09T13:18:56 EndTime=2016-09-12T07:20:20 NodeList=lp012 NodeCnt=1 ProcCnt=20 WorkDir=/uufs/chpc.utah.edu/common/home/u0138544/runs/bc/luminalb/1821

 

$ sacct -p --format jobid,cluster,partition,account,group,user,submit,eligible,start,end,elapsed,exitcode,nnodes,ncpus,nodelist,jobname --job=71194

JobID|Cluster|Partition|Account|Group|User|Submit|Eligible|Start|End|Elapsed|ExitCode|NNodes|NCPUS|NodeList|JobName|

71194|lonepeak|lonepeak|camp|camp|u0138544|2016-09-05T12:28:00|2016-09-05T12:28:00|2016-09-12T07:20:20|2016-09-12T07:20:20|00:00:00|0:0|1|1|None assigned|1822.22.slurm|


Thanks for your help in fixing this, guys.
Comment 5 Tim Wickberg 2016-12-02 14:13:23 MST
16.05 should make this a bit easier to resolve at least - 'sacctmgr show runawayjobs' should find such occurrences and can optionally clear them up.
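
For example, a minimal sketch of that workflow (the interactive confirmation text varies by version):

$ sacctmgr show runawayjobs
(lists jobs the database still thinks are pending/running but the controller no longer tracks, then asks whether to close out those records)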

I'm hoping the underlying cause has resolved itself since then - there were a few DB related issues back in 15.08 that I think may be what you've seen here.

Given that you've only just upgraded now, I assume there are no new instances?
Comment 6 David Richardson 2016-12-02 14:25:25 MST
We haven't found any new instances yet, but (like you said) it's only been two days. The cluster that's running 16 hasn't been released to users yet, so it's pushed nearly zero jobs, and it's our quietest cluster (which is why it got upgraded first).

Thanks,
DR
Comment 7 Tim Wickberg 2016-12-02 14:30:34 MST
Have you run through and had 'sacctmgr show runawayjobs' do any cleanup on those older records?
Comment 8 Brian Haymore 2016-12-02 14:48:21 MST
Tim,

I wanted to chime in here, as I spoke with Danny about this at SC16. Each of these events seems to come around when something gets the communication between slurmctld and slurmdbd into a state where they won't talk; then we get the messages about the queue filling up and needing to restart slurmdbd. After we finally clear up those issues and let the queued-up content sync, even giving it days or weeks to settle out, we are still seeing bits of info missing from some jobs, as David has noted. When I spoke about this with Danny, he wondered about what job content is sent to slurmdbd at the start versus at the end, and whether more could be sent in both cases instead of the limited set that is sent now. I'm vague on this, as our conversation didn't go deeper than that at SC; mostly I wanted to make sure we got Danny, or whoever is appropriate, into this as well. None of this is meant to derail having David complete the steps you've suggested, just to make sure you were aware of that conversation and that Danny said he had some thoughts on it.
Comment 9 Tim Wickberg 2016-12-02 14:53:06 MST
Ahh... thanks, Brian. I vaguely recall some of that discussion. Dropping Danny in here as a CC to remind him to look into that change to the accounting mechanisms if he hasn't already.
Comment 10 Danny Auble 2016-12-02 15:06:39 MST
Hey Brian,

I looked at the code and much of what we talked about is already in 16.05. I did alter 17.02 (commit 178a929b426ad) to be better about what the slurmctld throws on the floor when the limit is reached: it throws away step messages first, and then Job Start messages after that. This should buy quite a bit more time if/when the dbd is unresponsive. You can also set up triggers (https://slurm.schedmd.com/strigger.html) to notify you when the dbd or the database is down, instead of having to notice it in the logs.
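
A minimal sketch of setting those up (the notify scripts here are hypothetical, and triggers are normally registered as the SlurmUser; see the strigger man page for the exact option names in your release):

$ strigger --set --primary_slurmdbd_failure --program=/path/to/notify_dbd_down
$ strigger --set --primary_database_failure --program=/path/to/notify_db_down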

Hopefully these ideas/changes will ease your fears. For your education, the algorithm that determines the max number of queued messages is MaxJobs * 2 + SystemNodeCnt * 4, with a floor of 10k.
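
As a rough worked example (the numbers here are made up, not your site's):

$ maxjobs=20000; nodes=300
$ echo $(( maxjobs * 2 + nodes * 4 ))
41200

So a cluster of that size could buffer roughly 41k messages before anything is dropped; the 10k floor only matters on very small systems.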
Comment 11 David Richardson 2016-12-02 15:45:15 MST
Would it be possible to queue messages to disk instead of throwing them away?

Thanks,
DR
Comment 12 Danny Auble 2016-12-02 16:03:20 MST
That would take a lot of overhead, as the messages have to remain in order, so you would have to dump all the messages and then read them all back when the connection comes back up. They are kept in state this way when the slurmctld is shut down, but on a busy system this could mean very heavy writing, over and over, multiple times per job (each job has 3 potential messages, and each step has 2). Perhaps it wouldn't be too bad to append them to disk instead of keeping them in memory when the connection goes away, but then you stand to fill up your disk if you don't notice this has happened. Granted, disk is usually bigger than memory, but you still run a risk, plus the aforementioned overhead.

Bug 109 was opened to determine if there was any way to prevent this. As you can see from the number, not much headway has been made on it. Since most people set up something to let them know the database has died, this usually isn't an issue.
Comment 13 Tim Wickberg 2017-03-07 18:31:29 MST
Hey David -

I'm marking this resolved/infogiven (as I should have done a while ago).

If you have any further thoughts on ways to queue / save off messages, we do still have bug 109 open as an enhancement, although, as Danny mentioned, we haven't made much progress on that recently.

- Tim