Created attachment 7324 [details] number of nodes in CG state

Hi!

We'd like to report another case of recurrent accumulation of completing jobs. It looks very similar to what's already been reported in these bugs:

* https://bugs.schedmd.com/show_bug.cgi?id=5401
* https://bugs.schedmd.com/show_bug.cgi?id=5121
* https://bugs.schedmd.com/show_bug.cgi?id=5111
* https://bugs.schedmd.com/show_bug.cgi?id=5177

I'm attaching a visual representation of the effects:
* CG_nodes shows the growing number of nodes stuck in CG state each time this occurs. Vertical blue lines are manual restarts of slurmctld.
* agent_queue shows the increasing size of the agent queue, with highs in the 800k range (left axis), and no sign of slowing down before we restart the controller.

This is extremely concerning and is starting to make things really difficult for us, as we need to periodically restart the controller, with a success rate of about 50%: it often doesn't clear up completing jobs and requires many restarts in a row.

We've ruled out filesystem issues, prolog/epilog issues, stuck processes on the nodes, and networking appears to be fine. When this happens, the jobs are usually completely done on the node, with no more sign of any related activity. Most of the compute nodes have just one slurmd process running, all the slurmstepd and job-related children having correctly exited. So it doesn't look like some unkillable process on the compute nodes, or a communication issue between nodes and controller, but rather like contention or a deadlock within the controller itself.

We even enabled the "throttle" plugin, to limit bursts of job submissions, but it doesn't seem to have had much effect.

We would appreciate any insight or knob we could tweak to avoid those situations and limit the growing impact it has on our users.

Thanks!
-- Kilian
Created attachment 7325 [details] Agent queue size
Hi

I will need more info. Could you provide us with:
* the actual slurm.conf
* slurmctld.log
* slurmd.log from an affected node
* output from: gdb -batch -p <slurmctld pid> -ex 't a a bt'
* output from: gdb -batch -p <slurmd pid of CG node> -ex 't a a bt'

Dominik
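For collecting the requested backtraces repeatedly, the two gdb invocations above can be wrapped in a small helper. This is only a sketch: the output path and the use of pidof are assumptions, not part of the original request.

```shell
#!/bin/sh
# Sketch of a helper wrapping the gdb commands requested above.
# Output location and pidof usage are assumptions.
dump_backtrace() {
    proc="$1"                                   # e.g. slurmctld or slurmd
    out="/tmp/${proc}-bt-$(date +%Y%m%d-%H%M%S).txt"
    # bail out early if the daemon isn't running
    pid=$(pidof "$proc") || { echo "no $proc process found" >&2; return 1; }
    # all-thread backtrace ('t a a bt' = thread apply all backtrace)
    gdb -batch -p "$pid" -ex 't a a bt' > "$out" 2>&1
    echo "$out"                                 # print the file we wrote
}
```

It would be run as root on the controller or an affected compute node, e.g. `dump_backtrace slurmctld`.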
Hi

This command can be useful to localize which nodes are blocking jobs: "scontrol completing"

Dominik
Hi Dominik,

Thanks for getting back to us. I will definitely provide the information you've asked for the next time this happens. Two remarks in the meantime:

1. This seems to be strongly correlated with user activity: we had long periods of fine behavior for weeks running 17.11.6, and all of a sudden, those CG episodes started a few days ago. They seem to occur more frequently during work hours, which is when most job submissions and cancellations happen.

2. When this happens, it's not one node blocking jobs and getting stuck in Completing state, it's literally hundreds. Which surely indicates that the problem is not with the individual compute nodes themselves: that would mean they all somehow fail at the same time, which is quite unlikely. And a restart of the controller also wouldn't resolve the situation in most cases. So I can send logs of a couple of CG compute nodes, but that will only be a partial picture.

That being said, I'm at the ready with gdb for the next time it happens.

Cheers,
-- Kilian
Here we go, it started again, so I'm attaching all of the requested info:
* slurm.conf
* full slurmctld.log
* output of "gdb -batch -p $(pidof slurmctld) -ex 't a a bt'"
* output of "scontrol completing"
* for nodes listed in "scontrol completing":
  - slurmd.log
  - output of "gdb -batch -p $(pidof slurmd) -ex 't a a bt'"

Right now, we have 300 nodes stuck in completing state (out of ~1,300) and a little over 1,000 CG jobs (out of ~15,000 in queue).

Please let me know if you need any more information.

Thanks!
-- Kilian
Created attachment 7345 [details] Logs
Created attachment 7349 [details] Logs

The previous archive was missing the gdb info from slurmctld.
Hi Thanks for all the data. One more thing, could you attach sdiag output as well? Dominik
(In reply to Dominik Bartkiewicz from comment #9)
> Thanks for all the data.
> One more thing, could you attach sdiag output as well?

Sure! It won't be completely consistent with the previous logs, since we've restarted the controller a couple of times already, but that didn't resolve the situation and we still have close to ~1,000 completing jobs, so it will probably still be representative.

Thanks,
-- Kilian
Created attachment 7350 [details] sdiag output
Hi

I see in the slurmctld log that some nodes have a different slurm.conf than slurmctld. Have you made any configuration updates? Maybe you have added new nodes?

Could you also increase the debug level to "debug" for slurmd? I see that slurmctld made multiple attempts to send REQUEST_TERMINATE_JOB, and I want to check whether slurmd got these messages.

Dominik
(In reply to Dominik Bartkiewicz from comment #12)
> I see in slurmctld log that some nodes have different slurmconf than
> slurmctld.
> Have you made any configuration updates?

Yeah, I've been trying to adjust some of the SchedulerParameters options to see if it had any impact on the issue between slurmctld restarts. But I didn't distribute the slurm.conf file to all the compute nodes, as I don't believe slurmd cares about SchedulerParameters, does it?

> Maybe you have added new nodes?

We regularly add new nodes to the cluster, but for this, we make sure that all the nodes in the cluster have the same slurm.conf before restarting slurmctld and the slurmds everywhere.

> Could you also increase debug level to debug for slurmd?
> I see that slurmctld made multiple attempts to send REQUEST_TERMINATE_JOB.
> I want check if slurmd got this messages.

Ok, I'll do this the next time it happens, thanks for the suggestion.

Another somewhat relevant point: for two days in a row, the Slurm controller has been completely fine all night long, but started misbehaving mid-morning, when users started working again. Which seems to indicate some correlation between the CG state pileup and user activity.

If it's of any help, I've made the evolution of the CG jobs/nodes over time, correlated with the sdiag information, available here:
https://snapshot.raintank.io/dashboard/snapshot/3GeFuGAKITvAqnTrLrBt8TZUa7g8s4wr

Cheers,
-- Kilian
Hi

I suspect this is something more tricky, but I just want to be sure. Could you generate sdiag output one more time? The last one showed quite a high RPC rate, but it only covered ~15 min.

Dominik
Created attachment 7356 [details] sdiag output (v2)
(In reply to Dominik Bartkiewicz from comment #14) > I suspect this is something more tricky but I just want to be sure. Could > you generate sdiag one more time? > The last one showed quite high rpc rate but it was taken only for ~15 min. Sure! Here it is. Thanks, -- Kilian
Forgot to ask: is there a way to see a breakdown of RPC types per user? I see total #RPCs/user and #RPCs/message type, but it would be helpful to also see what types of RPCs each user issues.

Cheers,
-- Kilian
Created attachment 7357 [details] slurmd debug logs It started happening again, so here's a new log/gdb bundle, with slurmd's Debug level set to "debug". Thanks, -- Kilian
Just a quick note to mention that we just deployed 17.11.8, and that it doesn't seem to have any specific effect on this: CG jobs started accumulating shortly after the update.
Hi

Thank you for all the data, it has helped a lot.

Currently, there is no way to display such data. In 18.08 we added some extra info about pending RPCs and the agent queue.

Dominik
(In reply to Dominik Bartkiewicz from comment #20) > Thank you for all the data, it has helped a lot. Great! Let me know if you need anything else. > Currently, there is no way to display such data, > In 18.08 we added some extra info about pending RPCs and agent queue. Got it, thanks. Cheers, -- Kilian
Hi!

I'm raising the severity of this issue to Sev2, as the impact on our system is really high: maintaining the scheduler in a half-decent operating state requires constant attention and multiple slurmctld restarts all day long. It stabilizes for a couple of hours, and then the CG jobs start piling up again and render all job submissions pretty much impossible.

This is really impacting our users' ability to work on the cluster, and has made our support load dramatically increase over the last couple of weeks. We would really appreciate any insight you could provide in order to help quickly resolve this problem.

Thanks!
-- Kilian
Hi

I didn't find anything suspicious in the backtraces, and the logs also look fine (except for the resending of REQUEST_TERMINATE_JOB). Maybe we can tune SchedulerParameters and try to make slurmctld more responsive. Could you send me slurm.conf and the sdiag output one more time?

Dominik
Created attachment 7380 [details] slurm.conf
Hi Dominik,

(In reply to Dominik Bartkiewicz from comment #23)
> I didn't find anything suspicious in the backtraces, logs also look fine
> (except resending REQUEST_TERMINATE_JOB).

Thanks for the update. This is quite concerning, though, given the scale of the issue and the impact it has on our system and our users.

> Maybe we can tune
> SchedulerParameters and try to make slurmctld more responsive.
> Could you send me slurm.conf and one more time sdiag?

Yes, definitely! Please see the latest attached bundle, and I'll gladly take any recommendation about SchedulerParameters options.

I feel the need to mention again that those situations, where jobs in CG state accumulate, really seem to be the direct result of user activity: they usually don't happen at night or during weekends, but seem to correlate with periods of higher interactive activity, when most users are connected and active. They seem to be the consequence of large waves of job submission or cancellation.

Please let me know of any information I could provide to help diagnose and resolve this issue; we still have plenty of occasions to see it happening and collect logs. :)

Cheers,
-- Kilian
Created attachment 7381 [details] sdiag output (v3)
Hi Kilian, I hope this email finds you well. Jess mentioned to me that you were experiencing some struggles related to node states. Unfortunately, he was not sure of the specific issue/case number. I believe Jess was referring to this issue, however, I wanted to confirm that indeed this is the issue you were referring to when you spoke with Jess. Best regards, Jason
Hi Jason, (In reply to Jason Booth from comment #27) > I hope this email finds you well. Jess mentioned to me that you were > experiencing some struggles related to node states. Unfortunately, he was > not sure of the specific issue/case number. I believe Jess was referring to > this issue, however, I wanted to confirm that indeed this is the issue you > were referring to when you spoke with Jess. Yes, this is the very issue that my colleague Stéphane discussed with Jess at PEARC, you're absolutely right! Cheers, -- Kilian
Hi

Do you know what generated that RPC rate over the night?
REQUEST_JOB_USER_INFO ~ 3 per second
REQUEST_PARTITION_INFO ~ 3 per second
It looks like some automatic script.

Could you check "System configuration" from:
https://slurm.schedmd.com/high_throughput.html

I can't give you any significant recommendations. You can consider adding "defer" and slightly increasing bf_yield_sleep from the default 500000.

P.S.: It is not really important, but you have "batch_sched_delay=10" repeated in SchedulerParameters.

Dominik
(In reply to Dominik Bartkiewicz from comment #29) > Do you know what generated that RPCs rate over the night? > REQUEST_JOB_USER_INFO ~ 3 per second > REQUEST_PARTITION_INFO ~ 3 per second > It looks like some automatic script. Likely users, yes. I'll check, but given our user population, this has to be expected. > Could you check "System configuration" from: > https://slurm.schedmd.com/high_throughput.html Sure, I know that page well :) > I can't give you any significant recommendations. > You can consider adding "defer" and slightly increase bf_yield_sleep from > default 500000. > It is not really important but you have repeated "batch_sched_delay=10" in > SchedulerParameters. Ok, I'll take a stab at those. What else do you suggest we could do to resolve this issue, then? Thanks, -- Kilian
Hi

I am still analysing the code and logs, and I hope to find something soon. I have asked Alex to have a look at this as well.

Dominik
Hi

Alex prepared some recommendations:

1. Try more aggressive message aggregation: increase WindowMsgs to 50 and WindowTime to 300.

2. You have LaunchParameters=slurmstepd_memlock_all. Perhaps memlock() has some weird interaction with ulimits. Could you check if removing this has any positive effect on CG nodes?

3. The last suggestion is switching from JobAcctGatherParams=UsePss to JobAcctGatherParams=UsePss,NoOverMemoryKill and MemLimitEnforce=no, and seeing if this helps improve things.

We also noticed multiple OOM messages in the logs. Are you sure the system has enough memory to work properly? Maybe you can consider using MemSpecLimit.

Dominik
(In reply to Dominik Bartkiewicz from comment #34)
> Alex prepared some recommendations

Thanks!

> 1. Try to set more agresive messages aggregates :
> Increase WindowMsgs to 50 and WindowTime to 300.

Ok, I'll do this.

> 2. You have LaunchParameters=slurmstepd_memlock_all.
> Perhaps memlock() has some weird interaction with ulimits
> Could you check if removing this has any positive effect on CG nodes?

Will try this too, yes.

> 3. the last suggestion is switching from JobAcctGatherParams=UsePss to
> JobAcctGatherParams=UsePss,NoOverMemoryKill and MemLimitEnforce=no and see if
> this helps improving things.

The idea being that task/cgroup will take care of memory enforcement, got it.

> We also notice multiple oom messages in logs. Are you sure that system has
> enough memory to work properly? Maybe you can consider using
> MemSpecLimit.

You mean on the controller or the nodes? I don't see any sign of memory exhaustion on the Slurm controller, but OOM messages are frequent on the nodes, yes, each time a job exceeds its memory allocation. I take it this is normal, isn't it? I don't see any slurmd process being killed by the OOM killer, and if that were the case, the node would quickly be marked as DOWN, right?

Thanks for the suggestions, I'll give them a try. Interestingly, the situation has been quiet today, no CG episode so far, but I'm not sure if it's a function of lower user activity, or a change in our config. The last significant one was to set bf_yield_interval=1500000 and bf_yield_sleep=750000 in SchedulerParameters.

We'll keep you updated in any case.

Cheers,
-- Kilian
Hi

I'm glad to hear it is better now, but we still need to keep an eye on this. A 500k agent queue size is hard to explain by suboptimal scheduler parameters alone.

Do you still notice REQUEST_TERMINATE_JOB being resent in slurmctld.log?

Dominik
Hi Dominik,

(In reply to Dominik Bartkiewicz from comment #36)
> I'm glad to hear it is better now, but we must still control this. 500k
> agent queue size is hard to explain only by not optimal scheduler parameters.

Unfortunately, that didn't last. It happened again last night around 11pm, triggering a segfault of the primary controller in the process (I'll open a separate bug for this), and the CG jobs have been piling up all night long:

# scontrol completing | wc -l
2067

> Do you still notice resending REQUEST_TERMINATE_JOB in slurmctld.log?

Yes, that's the core of the issue, I guess:

# grep TERMINATE_JOB /var/log/slurm/ctld/sh-sl02.log | grep "Jul 26" | wc -l
429559

430,000 occurrences that night alone, since the backup controller assumed control after the primary one segfaulted.

Either those signals don't reach the compute nodes (but that's unlikely, since most compute nodes don't run anything anymore, and it would be surprising if that many nodes stopped receiving those messages at the same time), or the controller fails to process them correctly (which seems more likely).

I'll attach another graph showing the evolution of the nodes' states in the cluster over night.

Cheers,
-- Kilian
Created attachment 7421 [details] Node states across the cluster
(In reply to Kilian Cavalotti from comment #37) > triggered a segfault of the primary controller in the process (I'll open a > separate bug for this) It's https://bugs.schedmd.com/show_bug.cgi?id=5486
Created attachment 7426 [details] extra debug

This patch should generate extra info about the agent queue in slurmctld.log when you call sdiag. Could you try to apply this patch, and send me the slurmctld log and sdiag output after it happens again?

I'm sorry this is taking so long, but I want to assure you that I'm working on your bug.

Dominik
(In reply to Dominik Bartkiewicz from comment #40) > Created attachment 7426 [details] > extra debug > > This patch should generate extra info about agent queue in slurmctld.log > when you will call sdiag. > Could you try to apply this patch and send me slurmctld log and sdiag output > after it happened again? > I'm sorry it's taking so long but I want to ensure you that I'm working on > your bug. Awesome! Thanks a lot for the patch, I'm applying it now, and will provide the outputs at the next occurrence. Cheers, -- Kilian
Hi

I suspect this can be related to bug 5438 in some way. Did you change EnforcePartLimits, or has it always been set to ANY?

Dominik
Hi Dominik, (In reply to Dominik Bartkiewicz from comment #42) > I have suspicions that this can be related to bug 5438 in some way. Ah interesting. I kind of assumed the corrupted job structures were a consequence of the "deadlock" situation with CG jobs, maybe exacerbated by our many controller restarts. But they may very well be the source of the issue, indeed. > Did you change EnforcePartLimits or did you have always set ANY? Nope, that hasn't changed recently. EnforcePartLimits has been set since our early days with Slurm. I believe it was set to "Yes" before "ANY" became a thing, but that's probably more than a year ago. Thanks, -- Kilian
Hi,

It just happened again. Here's the additional info logged by the "extra debug" patch (full log attached):

Jul 29 15:42:17 sh-sl01 slurmctld[65270]: BUG 5443 queue:
Jul 29 15:42:17 sh-sl01 slurmctld[65270]: type: REQUEST_TERMINATE_JOB count: 73475
Jul 29 15:42:17 sh-sl01 slurmctld[65270]: type: REQUEST_LAUNCH_PROLOG count: 2061
Jul 29 15:42:17 sh-sl01 slurmctld[65270]: type: REQUEST_BATCH_JOB_LAUNCH count: 2042
Jul 29 15:42:17 sh-sl01 slurmctld[65270]: type: SRUN_JOB_COMPLETE count: 33
Jul 29 15:42:17 sh-sl01 slurmctld[65270]: type: REQUEST_KILL_TIMELIMIT count: 35
Jul 29 15:42:17 sh-sl01 slurmctld[65270]: type: REQUEST_SIGNAL_TASKS count: 54
Jul 29 15:42:17 sh-sl01 slurmctld[65270]: type: REQUEST_KILL_PREEMPTED count: 71
Jul 29 15:42:17 sh-sl01 slurmctld[65270]: type: SRUN_STEP_SIGNAL count: 1
Jul 29 15:42:17 sh-sl01 slurmctld[65270]: type: RESPONSE_RESOURCE_ALLOCATION count: 12
Jul 29 15:42:17 sh-sl01 slurmctld[65270]: type: REQUEST_HEALTH_CHECK count: 1
Jul 29 15:42:17 sh-sl01 slurmctld[65270]: BUG 5443 dump end
Created attachment 7450 [details] slurmctld.log with extra debug
Just to add that this last occurrence looks like it could be correlated with waves of multi-partition job submissions, in case that's useful.
Created attachment 7457 [details] extra debug

Hi

I have prepared another patch with an extra log. This patch should catch the situation where select_nodes is called for a running job.

Dominik
(In reply to Dominik Bartkiewicz from comment #47) > I have prepared another patch with an extra log. > This patch should catch situation when select_nodes is called for a running > job. Thanks Dominik, I just deployed the patch and I'll send logs at the next occurrence. Cheers, -- Kilian
Created attachment 7460 [details] slurmctld logs

Issue started again around 13:45 today. Here are the latest logs with the new debug patch applied. It apparently caught some things around 12:41, before the issue started.
Hi

These two issues caught in the log are normal and can't be the root of this problem.

Could you send me the slurmd log from sh-101-58?

Are you sure that all nodes have their time set properly? An unsynchronized node can produce log messages like this: "Zero Bytes were transmitted or received"

Dominik
Created attachment 7468 [details] sh-101-58 slurmd logs

> Could you send my current slurmd.log of those nodes?

Here it is.

> Are you sure that all nodea have set time properly?
> Non synchronized node can produce log like this:
> "Zero Bytes were transmitted or received"

Yes, we use NTP all across the cluster, and all the compute nodes have their clocks set to the same time. Plus, a few unsynchronized nodes wouldn't explain why thousands of jobs suddenly get stuck in CG state, would they?
Kilian - Could you increase the number of MUNGE threads on the slurmctld host considerably?

The default is 2; can you please adjust it to 10 per our tuning guide? https://slurm.schedmd.com/high_throughput.html

I don't think this is directly implicated, but it'd be good to get it out of the way.
Hi Tim, (In reply to Tim Wickberg from comment #53) > Could you increase the number of MUNGE threads on the slurmctld host > considerably? > > The default is 2, can you please adjust to 10 per our tuning guide? > https://slurm.schedmd.com/high_throughput.html > > I don't think this is directly implicated, but it'd be good to get out of > the way. Thanks for the suggestion, but we have this in place already: [root@sh-sl01 ~]# ps aux | grep -i munge munge 936 0.5 0.0 751624 3568 ? Sl Jul19 99:48 /usr/sbin/munged --num-threads 10 Cheers, -- Kilian
Is there a reason you'd originally turned on message aggregation? I can see you've been running with it for quite some time, but I'd be concerned this may be exacerbating some other issues. I think Alex previously overlooked an interaction here when recommending an increased WindowTime - that might be the reason you're seeing more frequent issues around MUNGE credential timeouts if the MUNGE lifetime is now matching that setting. I'd suggest disabling this entirely unless you had a specific reason it was originally enabled. I don't think it's helping you out here, and may be causing some issues and/or masking the real root cause.
(In reply to Tim Wickberg from comment #55) > Is there a reason you'd originally turned on message aggregation? I believe it may have been originally done to help with performance and scalability, when we were reaching the 800+ node level and running our Slurm controllers out of small-ish VMs. > I can see you've been running with it for quite some time, but I'd be > concerned this may be exacerbating some other issues. I think Alex > previously overlooked an interaction here when recommending an increased > WindowTime - that might be the reason you're seeing more frequent issues > around MUNGE credential timeouts if the MUNGE lifetime is now matching that > setting. > > I'd suggest disabling this entirely unless you had a specific reason it was > originally enabled. I don't think it's helping you out here, and may be > causing some issues and/or masking the real root cause. Ok, thanks for the suggestion. I've disabled it and restarted the daemons. Cheers, -- Kilian
(In reply to Kilian Cavalotti from comment #56) > Ok, thanks for the suggestion. I've disabled it and restarted the daemons. And it got put to test quite quickly: the issue just started happening again, with message aggregation disabled. I'll keep it this way for now anyway. Cheers, -- Kilian
Hi

Could you send me the PrologSlurmctld script, or describe what is in it?

Dominik
Created attachment 7479 [details] slurmctld prolog

Hi Dominik,

The PrologSlurmctld script is attached. We tried various variations around it, including removing the scontrol part and disabling the prolog script entirely, but it still led to CG pile-up situations.

Right now, there's a random delay introduced before calling scontrol, to avoid too many simultaneous RPCs when large job series start at the same time, and it's done asynchronously to let the prolog script return immediately and avoid blocking further job processing.
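The staggering described above could look something like the following sketch. The 30-second window and the exact backgrounded scontrol invocation are assumptions; the real attached script only follows this general pattern.

```shell
#!/bin/bash
# Sketch of the PrologSlurmctld staggering described above.
# The 30 s window and the subshell pattern are assumptions.
stagger_delay() {
    # pick a random delay in [0, max) seconds to spread out RPC bursts
    local max="$1"
    echo $(( RANDOM % max ))
}

# In the prolog body, the slow scontrol step would then run asynchronously,
# so the prolog itself returns immediately (illustrative only):
#   ( sleep "$(stagger_delay 30)"; scontrol update ... ) &
```

Backgrounding the subshell is what keeps the prolog from blocking further job processing while the delayed scontrol call runs.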
Hi

After an internal discussion, we agreed that we would like to check the other epilog and prolog scripts. Could you send us the rest of these scripts? Maybe there is some detail in them which induces this.

Is there any chance you could comment out "PrologFlags=contain" and check if this issue still occurs? I understand this can be troublesome.

Dominik
Created attachment 7481 [details] prolog/epilog scripts

> After the internal talk, we agreed that we would like to check other epilogue and prologue. Could you send us the rest of these scripts? Maybe there is some detail which induces this.

Sure, they're attached.

> Is there any chance to comment "PrologFlags=contain" and check if this issue will still occur? I understand this can be troublesome.

Mmmh, I would really prefer not having to do this, as it would prevent proper containerization of tasks via pam_slurm_adopt, and users heavily rely on the ability to ssh to nodes and run things in the context of their jobs.

I'd like to state again that the whole configuration didn't significantly change over the last year (besides adding more nodes), and that the issue suddenly started about a month ago.

Thanks!
-- Kilian
Sorry Kilian, ignore comment 65. I posted that on the wrong bug.
(In reply to Marshall Garey from comment #66) > Sorry Kilian, ignore comment 65. I posted that on the wrong bug. No worries. But while you're here :) is there any update or progress on that problem? Thanks, -- Kilian
Nothing specific right now - Dominik is primarily working on this one, but Tim and I have spent some time on it, too. I think the other Stanford bugs that I own are probably related to this, and I've continued to do research on those.
Hi

You can try to compile slurmctld with a higher value of MAX_SERVER_THREADS (e.g. 512). MAX_SERVER_THREADS is defined in src/slurmctld/slurmctld.h.

If this stops the agent queue from increasing, it will be easier to find what generates so many REQUEST_TERMINATE_JOB RPCs, and why.

Dominik
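The change described above is a one-line edit before rebuilding; a scripted version might look like the following sketch. The exact #define layout matched here is an assumption about src/slurmctld/slurmctld.h.

```shell
#!/bin/sh
# Sketch of the MAX_SERVER_THREADS bump described above; the #define
# format matched by sed is an assumption about the header's contents.
bump_max_server_threads() {
    hdr="$1"    # path to slurmctld.h
    val="$2"    # new limit, e.g. 512
    # rewrite the whole #define line in place with the new value
    sed -i "s/^#define[[:space:]]*MAX_SERVER_THREADS.*/#define MAX_SERVER_THREADS $val/" "$hdr"
}

# Then rebuild and restart, e.g. (illustrative):
#   bump_max_server_threads src/slurmctld/slurmctld.h 512
#   make -C src/slurmctld && systemctl restart slurmctld
```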
Hi Dominik,

(In reply to Dominik Bartkiewicz from comment #70)
> You can try to compile slurmctld with higher value of MAX_SERVER_THREADS
> (eg. 512)
> MAX_SERVER_THREADS is defined in src/slurmctld/slurmctld.h
> If this stops agent queue from increasing it will be easier to find what and
> why generate so many REQUEST_TERMINATE_JOB rpcs.

Ok, I'll try that, but wouldn't it exacerbate the problem, by allowing even more threads to add work to the queue? How does that interact with max_rpc_cnt in SchedulerParameters?

Thanks,
-- Kilian
Hi

MAX_SERVER_THREADS separately limits the "Server thread count" and the number of agents.

max_rpc_cnt is compared with the count of incoming RPCs. From the sdiag output and one of your plots, it seems that slurmctld never hit this limit for "Server thread count".

Increasing MAX_SERVER_THREADS will allow creating more agents, and I hope it will prevent older RPCs from starving in the queue.

Dominik
(In reply to Dominik Bartkiewicz from comment #72)
> Hi
>
> MAX_SERVER_THREADS limits separately "Server thread count" and numbers of
> agents.
>
> max_rpc_cnt is compering with the count of incoming RPCs.

Ok, I asked because the slurm.conf man page says:

max_rpc_cnt=#
If the number of active threads in the slurmctld daemon is equal to or larger than this value, defer scheduling of jobs.

So it sounds pretty similar to a MAX_SERVER_THREADS value.

> From sdiag output and one of your plots, it seems that slurmctld never hit
> this limit for "Server thread count".
>
> Increasing MAX_SERVER_THREADS will allow creating more agents and I hope it
> will prevent hungering older RPCs in the queue.

Ok. So I ignored the comment in slurmctld.h that says "increasing this value is strongly discouraged" :), set the value to 512, recompiled, and restarted the daemons.

As a side note, the issue hasn't really been happening since last Tuesday (7/31). We had a few occurrences of a few dozen nodes in CG state at once, but the situation cleared itself out in about 30 min, so nothing comparable to the previous occurrences.

That is not to say that we consider the issue resolved, though. Since we haven't modified the config or the code at all since the last occurrence, we can only conclude that the problem is a function of a specific job pattern and that it can still happen at any time.

Cheers,
-- Kilian
And we just need to talk about it to make it happen again:

# sinfo -t completing -N -O nodelist -h | sort -u | wc -l
148
# scontrol completing | wc -l
289

This is *with* MAX_SERVER_THREADS=512.

Do you need any dump/trace before I restart the controller?
Hi Kilian, We would like to see the slurmctld logs from today so, please attach those to the ticket. I do not believe a trace will be needed for this instance. -Jason
Created attachment 7521 [details] new slurmctld logs Sure, here they are.
Hi! Did you find anything useful in the logs? Is there any update we can report to our users towards a fix? Thanks, -- Kilian
Hi Kilian,

Unfortunately, the issue is still troublesome to reproduce on our end, even with the additional logging and traces. Finding the root cause has been difficult without a clear way to duplicate this, which is why we have had you modify some code, such as MAX_SERVER_THREADS. We are continuing to look into this issue but do not, as of yet, have a suggested patch/fix.

Kind regards,
Jason
Hi

Can you try to temporarily disable the killing of user processes in the job epilog? Right now it has a race between squeue and pkill/ipcrm.

Could you send me the current slurmd.log of these nodes: sh-16-01, sh-15-02, sh-15-04? What state are those nodes in now?

Can you try to enable the "Route" debug flag, and when this occurs again, could you send me the slurmd log from a couple of the completing nodes?

Do you have the job script/job specification of 23171442? Perhaps this is some unrelated problem, but I will try to find out why this job has overallocated memory.

Dominik
Hi Dominik,

Thanks for following up and for continuing to look at this.

> Can you try to disable temporary killing user processes in the job epilog?
> Now it has some race between squeue and pkill/ipcrm.

I'm curious about that race: would that be between different jobs that could be terminating at the same time on the same nodes? The pkill/ipcrm only happens if no other job is running on the node, so unless another job starts and terminates between the squeue and pkill calls, there shouldn't be any problem, right?

Now, it's true that the local scratch cleanup happens between those calls, and could take some time, so I'll move it after the pkill/ipcrm parts. But for now, I disabled the pkill and ipcrm parts in the epilog script.

> Could you send my current slurmd.log of those nodes?
> sh-16-01, sh-15-02, sh-15-04
> In which state nodes are now?

They're all in COMPLETING state now:

# sinfo -p owners -Nn sh-16-01,sh-15-02,sh-15-04
NODELIST NODES PARTITION STATE
sh-15-02 1 owners comp
sh-15-04 1 owners comp
sh-16-01 1 owners comp

I'll attach the slurmd logs for those nodes.

> Can you try to enable debug flag "Route" and when this occurs again could
> you send me slurmd log from a couple of completed nodes?

Sure, done. Didn't know this was a thing; I don't think the Route DebugFlag is mentioned in the slurm.conf man page :) Do you need an increased general log level too? That's what we have right now:

# scontrol show config | grep -i debug
DebugFlags = Route
SlurmctldDebug = info
SlurmctldSyslogDebug = info
SlurmdDebug = info
SlurmdSyslogDebug = info

> Do you have job_script/job specification of 23171442?
> Perhaps this is some unrelated problem but I will try to find out why this
> job has overallocated memory.

Yes, it looks like this:

#SBATCH -p iric
#SBATCH --time=24:00:00
#SBATCH --nodes=1
#SBATCH --mem-per-cpu=6000
#SBATCH --exclusive
#SBATCH --ntasks-per-core=1

I'll attach the full script too.

Thanks!
-- Kilian
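The epilog reordering discussed above (kill stray processes right after the squeue check, and only run the slow scratch cleanup afterwards) could be sketched like this. The squeue/pkill/ipcrm details are assumptions, since the real epilog script is only attached, not shown inline.

```shell
#!/bin/sh
# Sketch of the epilog ordering discussed above; command details are
# assumptions about the real script.
last_job_on_node() {
    # cleanup is only safe when the user has no other jobs left here
    [ "$1" -eq 0 ]
}

epilog_cleanup() {
    user="$1"; node="$2"
    remaining=$(squeue -h -u "$user" -w "$node" -t RUNNING,COMPLETING | wc -l)
    if last_job_on_node "$remaining"; then
        pkill -9 -u "$user"     # kill strays right after the check...
        # ...ipcrm of leftover SysV IPC segments would go here
    fi
    # slow local-scratch cleanup moved after the kill, shrinking the
    # window between the squeue check and the pkill
}
```

Moving the slow cleanup out of the squeue-to-pkill window is the point: it narrows the race Dominik flagged without disabling the kill entirely.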
Created attachment 7540 [details] slurmd logs for sh-16-01,sh-15-02,sh-15-04
Created attachment 7541 [details] job script for job 23171442
Another note that may be of interest: since July 31st, which is when I disabled message aggregation (https://bugs.schedmd.com/show_bug.cgi?id=5443#c55), the kernel on the Slurm controller has been logging this:

[Tue Jul 31 16:25:47 2018] TCP: request_sock_TCP: Possible SYN flooding on port 6817. Sending cookies. Check SNMP counters.
[Mon Aug 6 11:44:51 2018] TCP: request_sock_TCP: Possible SYN flooding on port 6817. Sending cookies. Check SNMP counters.
[Mon Aug 6 12:00:44 2018] TCP: request_sock_TCP: Possible SYN flooding on port 6817. Sending cookies. Check SNMP counters.
[Mon Aug 6 12:15:43 2018] TCP: request_sock_TCP: Possible SYN flooding on port 6817. Sending cookies. Check SNMP counters.
[Mon Aug 6 12:45:44 2018] TCP: request_sock_TCP: Possible SYN flooding on port 6817. Sending cookies. Check SNMP counters.
[Mon Aug 6 13:00:43 2018] TCP: request_sock_TCP: Possible SYN flooding on port 6817. Sending cookies. Check SNMP counters.
[Mon Aug 6 13:15:50 2018] TCP: request_sock_TCP: Possible SYN flooding on port 6817. Sending cookies. Check SNMP counters.
[Mon Aug 6 13:29:50 2018] TCP: request_sock_TCP: Possible SYN flooding on port 6817. Sending cookies. Check SNMP counters.
[Mon Aug 6 13:44:48 2018] TCP: request_sock_TCP: Possible SYN flooding on port 6817. Sending cookies. Check SNMP counters.
[Mon Aug 6 22:45:05 2018] TCP: request_sock_TCP: Possible SYN flooding on port 6817. Sending cookies. Check SNMP counters.
[Tue Aug 7 11:45:54 2018] TCP: request_sock_TCP: Possible SYN flooding on port 6817. Sending cookies. Check SNMP counters.
[Wed Aug 8 10:15:36 2018] TCP: request_sock_TCP: Possible SYN flooding on port 6817. Sending cookies. Check SNMP counters.

Do you think that may be related? Should we re-enable message aggregation?

Thanks,
-- Kilian
Hi

Because we now know that message aggregation is not directly involved in this issue, I think you can re-enable it.

You can set SlurmdDebug to debug. It would be great if you increased SlurmctldDebug too, but first test whether increasing SlurmctldDebug has any notable impact on performance.

Dominik
(In reply to Dominik Bartkiewicz from comment #86)
> Because now we know that message aggregation is not directly involved in
> this issue I think you can enable it.

Ok, thanks.

> You can set SlurmdDebug to debug.
> It would be great if you increase SlurmctldDebug too,
> but at first, test if increasing SlurmctldDebug has no notable impact on
> performance.

It looks like the DebugFlag=Route option makes srun display debug messages like this:

$ srun -p test --pty bash
srun: ROUTE: split_hostlist: hl=sh-06-34 tree_width 0

Is that expected? It's confusing our users a little more. :)

Cheers,
-- Kilian
Hi

Sorry, I didn't suspect that users can also see these messages. You can disable it now; if we have a problem in this area, it will already be in the log.

Dominik
Hi!

As a quick update, we've just updated to 17.11.9 to see if it makes any difference (and with the hope that https://github.com/SchedMD/slurm/commit/06582da8ff47132a124497f7dbd41dd5ac227c33 may be beneficial to our case).

We also added a cron job that regularly checks the number of completing jobs and restarts the controllers if it gets above a threshold. But that's obviously just a temporary workaround and not something we would like to keep for too long. :)

Cheers,
-- Kilian
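For reference, a minimal sketch of what such a watchdog could look like. The threshold, log tag, and the assumption that systemd manages slurmctld are all hypothetical, not necessarily our exact setup:

```shell
#!/bin/bash
# Hypothetical CG-state watchdog: restart slurmctld when too many jobs
# are stuck in the COMPLETING state. Threshold and service name are
# assumptions for illustration.

THRESHOLD=${THRESHOLD:-500}

# Count jobs currently in the COMPLETING (CG) state.
count_cg_jobs() {
    squeue --noheader --states=COMPLETING --format='%i' | wc -l
}

# Restart the controller if the CG count exceeds the threshold.
restart_if_needed() {
    local cg
    cg=$(count_cg_jobs)
    if [ "$cg" -gt "$THRESHOLD" ]; then
        logger -t cg-watchdog "completing jobs: $cg (> $THRESHOLD); restarting slurmctld"
        systemctl restart slurmctld
    fi
}

# Example crontab entry (every 5 minutes):
# */5 * * * * /usr/local/sbin/cg-watchdog
```

As noted above, this is only a stopgap: it clears the symptom, not the underlying accumulation.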
(In reply to Kilian Cavalotti from comment #91)
> As a quick update, we've just updated to 17.11.9 to see if it makes any
> difference (and with the hope that
> https://github.com/SchedMD/slurm/commit/
> 06582da8ff47132a124497f7dbd41dd5ac227c33 may be beneficial to our case).

Our attempt to upgrade to 17.11.9 was short-lived because of https://bugs.schedmd.com/show_bug.cgi?id=5552, so we're back on 17.11.8.
Hi

Could you send me the logs you grabbed with the "Route" debug flag, from both slurmd and slurmctld? I currently suspect the problem is in other parts of Slurm, but I want to check anyway.

I will attach another patch with additional logging; could you apply it and enable the TraceJobs debug flag?

Could you also send me the node record for one of the completing nodes, e.g.:

gdb -p <slurmctld_pid> --batch -ex 'call find_node_record("<node_name>")' -ex 'p *$1'

Dominik
Created attachment 7600 [details] extra debug
Hi Dominik,

(In reply to Dominik Bartkiewicz from comment #94)
> Could you send me logs that you have grabbed with debug flag "Route" from
> both slurmd and slurmctld? Currently, I expect that problem is in other
> parts of slurm but want to check in any way.

Sure, I'll attach the whole thing. It will be quite big though, so it may take me a couple of tries.

> I will attach another patch with additional logs, could you apply it and
> enable TraceJobs debugflag?

Sure, thanks for the patch. It's applied and deployed, and the TraceJobs debug flag has been activated. I'm a bit worried this will make our already huge logs explode, but I'll keep an eye on them.

> could you also send me nodes record for one of compliting node
> eg.:
> gdb -p <slurmctld_pid> --batch -ex 'call find_node_record("<nod_name>")' -ex
> 'p *$1'

Will do next time this happens, for sure!

Thanks,
-- Kilian
Created attachment 7603 [details] DebugFlag Route logs
Does that mean you don't have nodes or jobs in a permanent completing state? TraceJobs will only be interesting when it starts happening again.

Dominik
(In reply to Dominik Bartkiewicz from comment #98)
> That means you don’t have nodes or jobs in the permanent completing state?

I'm a bit worried we've not been communicating clearly here, so let me try again: we don't have nodes or jobs permanently stuck in the CG state; we have situations where many jobs and nodes accumulate in the CG state and stay that way until no resources are available anymore in the cluster (I've provided a number of graphs that show this). A controller restart usually resolves the situation and the CG states are cleared.

We obviously can't afford to keep our whole 1,300-node, 3,500-user cluster down for weeks while the issue is being diagnosed. So we routinely restart the Slurm controller when the issue happens, which usually clears up the completing nodes, until the problem happens again.

> Tracejobs will be interesting only when it will start again.

Ok, I'll enable it next time the issue happens, then.

Cheers,
-- Kilian
Hi

Maybe I wasn't precise enough in my last comment: I used "permanent" for nodes that stay in this state significantly longer than we expect.

To me, the first symptom of this problem is "Resending TERMINATE_JOB", which, according to the log, starts occurring a couple of minutes after slurmctld restarts. Because TERMINATE_JOB should only be sent for COMPLETING jobs, I assume that you always have some amount of completing jobs, and that you restart the controllers when the number of CG jobs gets above a threshold.

Dominik
(In reply to Dominik Bartkiewicz from comment #100)
> Maybe I wasn't precise enough in the last comment.
> I used "permanent" for nodes that stay in this state significantly longer
> than we expect.

Ah, thanks for the clarification.

> For me, the first symptom of this problem is "Resending TERMINATE_JOB" that
> according to the log starts occur a couple minutes after restarting
> slurmctld. Because TERMINATE_JOB should be sent only for COMPLETING jobs, I
> assume that you always have some amount of completing jobs and you restart
> the controllers when the number of CG job gets above a threshold.

I see. I think the "Resending TERMINATE_JOB" messages that are sent right after slurmctld starts actually work: they seem to be correctly processed and make the number of CG jobs/nodes decrease and go back to ~0. The "Resending TERMINATE_JOB" messages that are sent *before* the slurmctld restart are the likely problematic ones, as they don't seem to be processed at all and cause the accumulation of those CG jobs.

So, to summarize: at times, the number of CG jobs and nodes starts to steadily increase, TERMINATE_JOB messages seem to be ignored and not processed, and CG jobs keep accumulating until we restart the Slurm controller. After the restart, TERMINATE_JOB messages are sent again for all the CG jobs, they are correctly processed, and the number of completing jobs goes back to normal.

Cheers,
-- Kilian
Hi

Do you have any news? Could you send me your slurmctld log one more time?

Dominik
Hi Dominik,

(In reply to Dominik Bartkiewicz from comment #102)
> Do you have any news?
> Could you send one more timemy slurmctld log?

We were kind of waiting on you for updates. :)

The issue hasn't been happening as of late; the last significant occurrence was on Aug 15th around 01:10, but it was mitigated by our slurmctld restart cron script. I'm attaching the logs, which still show repeated "Resending TERMINATE_JOB" messages.

Let me know if you need anything else.

Cheers,
-- Kilian
Created attachment 7647 [details] slurmctld logs for the last Aug 15 occurrence
Created attachment 7676 [details] tracejobs+gdb logs

The issue just happened again. Here are the requested logs:
* slurmctld logs with the TraceJobs debug flag
* scontrol completing output
* for the last 8 nodes listed in that output, the result of gdb 'call find_node_record("<node_name>")'

Please advise.
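For completeness, this is roughly how those per-node records can be collected. A sketch only: it assumes a live slurmctld built with debug symbols (gdb cannot call find_node_record() against a core dump) and uses scontrol to expand hostlist ranges:

```shell
# Sketch: dump the in-memory node record for each node that currently
# has completing jobs. Assumes squeue/scontrol/gdb in PATH and a live
# slurmctld process with debug symbols.

# Expand squeue's node lists (which may be hostlist ranges such as
# sh-101-[01-05]) into individual node names, one per line.
completing_nodes() {
    squeue --noheader --states=COMPLETING --format='%N' | sort -u \
        | while read -r hl; do scontrol show hostnames "$hl"; done \
        | sort -u
}

# Print the slurmctld node record for each completing node.
dump_node_records() {
    local pid
    pid=$(pidof slurmctld)
    completing_nodes | while read -r node; do
        echo "=== $node ==="
        gdb -p "$pid" --batch \
            -ex "call find_node_record(\"$node\")" \
            -ex 'p *$1'
    done
}
```

Note that attaching gdb briefly stops slurmctld, so this is best run sparingly on a loaded controller.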
We've been battling with this situation for about 4 hours now, and it doesn't look like the number of CG jobs is decreasing. We even marked the completing nodes as DOWN to clear up the CG jobs; it worked for a minute, and then the count of completing jobs skyrocketed again.

We're currently at more than 3,800 jobs in the CG state, without any clear way to make that count go down. We've restarted the controllers multiple times, and even rebooted the physical machines, without any significant change.

I'm raising the severity of this issue to Sev1; it's pretty much a site-down situation right now. Would it help to give you access to the machine directly? We really need to find a solution to this problem, it has been going on for way too long. :(

Thanks!
One aggravating factor we noticed is that those completing jobs tend to be requeued when the controller restarts, even though they were in the CG state before, which means they had already terminated (or were in the process of terminating). This obviously adds fuel to the fire and generates even more work for the scheduler, like an avalanche, when it tries to clean up those CG jobs. After the controller has been restarted a few times, they end up being set aside and put in the "JobHoldMaxRequeue" state, which finally allows the scheduler to recover:

# squeue | grep -c JobHoldMaxRequeue
921

What is the reason for those jobs being requeued? Shouldn't they just be marked as COMPLETED (or FAILED) instead?
New fact that we discovered: some users are using snakemake (https://snakemake.readthedocs.io/en/stable/) to submit jobs automatically and keep the number of queued jobs at the maximum permitted level. This seems to be quite harmful to the scheduler when it tries to clean up completing jobs while new submissions keep replacing the terminated ones. Stopping those automated snakemake submission processes instantly brought the number of CG jobs back to regular levels.

I don't think that's the root cause of the problem, but it's definitely an exacerbating factor. Any idea how we could mitigate those situations? We're already using jobs_per_user_per_hour in SchedulerParameters, but is there any other way?
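To spot who is driving an accumulation like this, a per-user tally of completing jobs is handy; a quick sketch (just text processing over squeue output, nothing site-specific assumed beyond squeue being in PATH):

```shell
# Count completing (CG) jobs per submitting user, busiest first.
# Heavy automated submitters such as snakemake pipelines tend to show
# up at the top of the list.
cg_jobs_per_user() {
    squeue --noheader --states=COMPLETING --format='%u' \
        | sort | uniq -c | sort -rn
}
```

The first column is the CG job count, the second the user name.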
Hi

Sorry that I didn't respond yesterday, it was after my normal working hours.

I am still analyzing all the logs. I have only one idea for now: try disabling the slurmctld epilog. This should make the "completing logic" more straightforward.

Marshall noticed that enabling SchedulerParameters=defer can also help.

Dominik
(In reply to Dominik Bartkiewicz from comment #110) > Sorry, that yesterday I didn't respond, it was after my normal working hours. > > I am still analyzing all logs. > I have only one idea and it’s to try to disable slurmctld epilog. > This should make "completing logic" more straightforward. > > Marshall noticed that enables SchedulerParameters=defer also can help. Thanks for the suggestions! I've implemented both. Cheers, -- Kilian
Is the system running again? Are the number of jobs in COMPLETING state decreasing?
(In reply to Marshall Garey from comment #113)
> Is the system running again? Are the number of jobs in COMPLETING state
> decreasing?

Yes, as I noted in https://bugs.schedmd.com/show_bug.cgi?id=5443#c108, we were able to get the system back to normal after killing those snakemake processes. We've been back to a regular number of CG jobs since 8pm last night.

I realize how easy it would have been to miss this among the other comments, and I forgot to set the bug back to Sev2, so I'm doing that now.

Cheers,
-- Kilian
Okay. I did see comment 108, but since it was still at a sev-1, I wasn't sure. Can you continue monitoring the system and let us know if the two changes (removing slurmctld epilog and adding SchedulerParameters=defer) are helping? Just a bit more information about each of those changes: defer works a bit differently than sched_min_interval (which I see you already have), so having both should help. defer in particular can help with large amounts of job submissions or job completions. Basically, it lessens some of the work slurmctld has to do (including creating threads and portions of code inside locks, including the main scheduler), hopefully giving it more time to handle everything else. Removing the slurmctld epilog should also reduce overhead and give the slurmctld more time to respond to everything else.
(In reply to Marshall Garey from comment #115)
> Can you continue monitoring the system and let us know if the two changes
> (removing slurmctld epilog and adding SchedulerParameters=defer) are helping?

Absolutely! Although I feel like they would be most helpful for the recovery phase, to avoid cascading situations, but not so much to prevent the issue from happening in the first place, right?

> Just a bit more information about each of those changes:
>
> defer works a bit differently than sched_min_interval (which I see you
> already have), so having both should help. defer in particular can help with
> large amounts of job submissions or job completions. Basically, it lessens
> some of the work slurmctld has to do (including creating threads and
> portions of code inside locks, including the main scheduler), hopefully
> giving it more time to handle everything else.

Ideally, we'd like to avoid having to use defer, because our users love the reactivity and comfort of getting an interactive shell on a compute node in seconds rather than minutes. But we'll try it for now.

> Removing the slurmctld epilog should also reduce overhead and give the
> slurmctld more time to respond to everything else.

That we can live with without any problem: our slurmctld epilog was mostly dumping a snapshot of "sacct -j jobid" into a central location, to complete the information about the job that is already collected in the prolog. But it's more of a convenience than anything else, because we can still query sacct after the fact.

Thanks!
-- Kilian
Hi

I am glad to hear that these options stabilize slurmctld.

The node record looks suspicious, especially "last_response = 0", but I still don't know whether this is just a result of an overloaded slurmctld or a cause of it. Just to be sure: weren't these gdb outputs generated right after slurmctld was restarted?

I want to look at job records too. Could you generate this for one of the completing jobs that is being resent REQUEST_TERMINATE_JOB:

gdb -p <slurmctld_pid> --batch -ex 'call find_job_record(<job_id>)' -ex 'p *$1'

Dominik
(In reply to Kilian Cavalotti from comment #116)
> (In reply to Marshall Garey from comment #115)
> > Can you continue monitoring the system and let us know if the two changes
> > (removing slurmctld epilog and adding SchedulerParameters=defer) are helping?
>
> Absolutely! Although I feel like they would be most helpful for the recovery
> phase and avoid cascading situations, but not so much to prevent the issue
> from happening in the first place, right?

They should help prevent it too, actually. We're doing more testing with and without the EpilogSlurmctld to see how that affects things.

> Ideally, we'd like to avoid having to use defer, because our users love the
> reactivity and comfort of getting an interactive shell on a compute node in
> seconds rather than minutes. But we'll try it for now.

On my test system (I know, not a real system), it still responds within a few seconds (2-4 seconds on average), rather than (seemingly) instantaneously.
(In reply to Dominik Bartkiewicz from comment #117)
> Node record looks suspicious, especially "last_response = 0" but I still
> don't know if this is just result of overloaded slurmctld or is a cause of
> it.
> Just to be sure weren't these gdb outputs generated just after slurmctld was
> restarted?

Well, that depends on how you define "just". :) The gdb traces were taken at 14:17 on Aug 22, and here are the start times of the controller:

Aug 22 14:15:11 sh-sl01 slurmctld[102853]: slurmctld version 17.11.8 started on cluster sherlock
Aug 22 14:18:11 sh-sl01 slurmctld[111305]: slurmctld version 17.11.8 started on cluster sherlock

So slurmctld started 2 minutes before and restarted again 1 minute after. But I'm afraid that's because it crashed both times, not because I restarted it:

-rw------- 1 slurm slurm 3513364480 Aug 22 14:12 /var/spool/slurm.state/core.64771
-rw------- 1 slurm slurm  902508544 Aug 22 14:18 /var/spool/slurm.state/core.160096

> I want to look at job records too.
>
> Could you generate this for one of "resending REQUEST_TERMINATE_JOB'
> completing job:
> gdb -p slurmctld_pid --batch -ex 'call find_job_record(<job_id>)' -ex 'p
> *$1'?

We don't have any CG jobs right now, but we still have the core dumps from Aug 22, when this last happened, and the logs from right before the crash, with a list of job ids that were being resent REQUEST_TERMINATE_JOB. The thing is that I can't call the find_job_record() function from a core dump; it needs a live process. Is there any way to get that information from the core dump?

Cheers,
-- Kilian
Hi

You can try a gdb script like this:

gdb --batch -x gdb.script <slurmctld_path> <core>

gdb.script:
set $i = job_list->head
while ($i)
  set $job = (struct job_record *)($i->data)
  if $job->job_id == 114
    p *$job
  end
  set $i = $i->next
end

Dominik
(In reply to Dominik Bartkiewicz from comment #122) > Hi > > You can try gdb script like this: > gdb --batch -x gdb.script <slurmctld_path> <core> > > gdb.script: > set $i = job_list->head > while ($i) > set $job = (struct job_record *)($i->data) > if $job->job_id == 114 > p *$job > end > set $i = $i -> next > end Thank you, that works. I just realized though, that the core dumps we have have been generated by segfaults (reported in other bugs) *after* a controller restart, so I'm not sure the information in there would be too meaningful. But anyway, here are the records for a few of the jobids that were being resent TERMINATE_JOB requests: $1 = {account = 0x7fab64361280 "skachru", admin_comment = 0x0, alias_list = 0x0, alloc_node = 0x7fab64315ae0 "sh-ln06", alloc_resp_port = 37716, alloc_sid = 39284, array_job_id = 0, array_task_id = 4294967294, array_recs = 0x0, assoc_id = 14435, assoc_ptr = 0x292b990, batch_flag = 0, batch_host = 0x7fab6441b950 "sh-115-04", billable_tres = 1, bit_flags = 0, burst_buffer = 0x0, burst_buffer_state = 0x0, check_job = 0x0, ckpt_interval = 0, ckpt_time = 0, clusters = 0x0, comment = 0x0, cpu_cnt = 1, cr_enabled = 1, db_index = 463645307, deadline = 0, delay_boot = 0, derived_ec = 0, details = 0x7fab64639100, direct_set_prio = 0, end_time = 1534971236, end_time_exp = 4294967294, epilog_running = false, exit_code = 4294967294, fed_details = 0x0, front_end_ptr = 0x0, gids = 0x0, gres = 0x0, gres_list = 0x0, gres_alloc = 0x7fab643920f0 "", gres_detail_cnt = 0, gres_detail_str = 0x0, gres_req = 0x7fab640cc3c0 "", gres_used = 0x0, group_id = 24086, job_id = 23944668, job_next = 0x0, job_array_next_j = 0x0, job_array_next_t = 0x0, job_resrcs = 0x7fab6463c630, job_state = 32772, kill_on_node_fail = 1, last_sched_eval = 1534971049, licenses = 0x0, license_list = 0x0, limit_set = {qos = 0, time = 0, tres = 0x7fab64638960}, mail_type = 0, mail_user = 0x0, magic = 4038539564, mcs_label = 0x0, name = 0x7fab6438b840 "bash", network = 0x0, 
next_step_id = 0, ngids = 0, nodes = 0x7fab6463c2c0 "sh-115-04", node_addr = 0x7fab646396e0, node_bitmap = 0x7fab6463c2f0, node_bitmap_cg = 0x7fabb402f1c0, node_cnt = 1, node_cnt_wag = 1, nodes_completing = 0x7fabb0154550 "sh-115-04", origin_cluster = 0x7fab646394b0 "sherlock", other_port = 43338, pack_job_id = 0, pack_job_id_set = 0x0, pack_job_offset = 0, pack_job_list = 0x0, partition = 0x7fab640e77b0 "skachru", part_ptr_list = 0x0, part_nodes_missing = false, part_ptr = 0x7fab04482440, power_flags = 0 '\000', pre_sus_time = 0, preempt_time = 0, preempt_in_progress = false, priority = 20428, priority_array = 0x0, prio_factors = 0x7fab646392b0, profile = 0, qos_id = 3, qos_ptr = 0x26c4c60, qos_blocking_ptr = 0x0, reboot = 0 '\000', restart_cnt = 0, resize_time = 0, resv_id = 0, resv_name = 0x0, resv_ptr = 0x0, requid = 333157, resp_host = 0x7fab64639480 "10.10.0.66", sched_nodes = 0x0, select_jobinfo = 0x7fab64639510, spank_job_env = 0x7fab641a5730, spank_job_env_size = 1, start_protocol_ver = 8192, start_time = 1534971049, state_desc = 0x0, state_reason = 35, state_reason_prev = 35, step_list = 0x617af20, suspend_time = 0, time_last_active = 1534972273, time_limit = 120, time_min = 0, tot_sus_time = 3, total_cpus = 1, total_nodes = 1, tres_req_cnt = 0x7fab64638990, tres_req_str = 0x7fab646393c0 "1=1,2=25600,4=1", tres_fmt_req_str = 0x7fab64639420 "cpu=1,mem=25G,node=1", tres_alloc_cnt = 0x7fab6463c550, tres_alloc_str = 0x7fabb4424b20 "1=1,2=25600,3=18446744073709551614,4=1,5=1", tres_fmt_alloc_str = 0x7fabb422f470 "cpu=1,mem=25G,node=1,billing=1", user_id = 333157, user_name = 0x7fab842dc0a0 "bbagley", wait_all_nodes = 1, warn_flags = 0, warn_signal = 0, warn_time = 0, wckey = 0x0, req_switch = 0, wait4switch = 0, best_switch = true, wait4switch_start = 0} $1 = {account = 0x7fab841ae700 "linder", admin_comment = 0x0, alias_list = 0x0, alloc_node = 0x7fab842fee10 "sh-ln03", alloc_resp_port = 0, alloc_sid = 274530, array_job_id = 0, array_task_id = 4294967294, 
array_recs = 0x0, assoc_id = 3988, assoc_ptr = 0x29cd8e0, batch_flag = 1, batch_host = 0x7fab94b40ee0 "sh-105-06", billable_tres = 2, bit_flags = 0, burst_buffer = 0x0, burst_buffer_state = 0x0, check_job = 0x0, ckpt_interval = 0, ckpt_time = 0, clusters = 0x0, comment = 0x0, cpu_cnt = 2, cr_enabled = 1, db_index = 463650236, deadline = 0, delay_boot = 0, derived_ec = 0, details = 0x7fab842580f0, direct_set_prio = 0, end_time = 1534972175, end_time_exp = 4294967294, epilog_running = false, exit_code = 0, fed_details = 0x0, front_end_ptr = 0x0, gids = 0x0, gres = 0x0, gres_list = 0x0, gres_alloc = 0x7fab94b3ed70 "", gres_detail_cnt = 0, gres_detail_str = 0x0, gres_req = 0x7fab94b7ec40 "", gres_used = 0x0, group_id = 6959, job_id = 23945006, job_next = 0x0, job_array_next_j = 0x0, job_array_next_t = 0x0, job_resrcs = 0x7fab94b0a1a0, job_state = 32772, kill_on_node_fail = 1, last_sched_eval = 1534972073, licenses = 0x0, license_list = 0x0, limit_set = {qos = 0, time = 0, tres = 0x7fab84257b80}, mail_type = 0, mail_user = 0x0, magic = 4038539564, mcs_label = 0x0, name = 0x7fab842577e0 "s71x22.sh", network = 0x0, next_step_id = 0, ngids = 0, nodes = 0x7fab94176050 "sh-105-06", node_addr = 0x7fab9414b7d0, node_bitmap = 0x7fab94402360, node_bitmap_cg = 0x7fab4004de10, node_cnt = 1, node_cnt_wag = 1, nodes_completing = 0x7fabb02cbcb0 "sh-105-06", origin_cluster = 0x0, other_port = 0, pack_job_id = 0, pack_job_id_set = 0x0, pack_job_offset = 0, pack_job_list = 0x0, partition = 0x7fab9401ed90 "owners,cee,normal", part_ptr_list = 0x42fbe20, part_nodes_missing = false, part_ptr = 0x7fab044832e0, power_flags = 0 '\000', pre_sus_time = 0, preempt_time = 0, preempt_in_progress = false, priority = 7167, priority_array = 0x7fab842584c0, prio_factors = 0x7fab842582a0, profile = 0, qos_id = 3, qos_ptr = 0x26c4c60, qos_blocking_ptr = 0x0, reboot = 0 '\000', restart_cnt = 0, resize_time = 0, resv_id = 0, resv_name = 0x0, resv_ptr = 0x0, requid = 241285, resp_host = 0x0, sched_nodes = 
0x0, select_jobinfo = 0x7fab84257b50, spank_job_env = 0x0, spank_job_env_size = 0, start_protocol_ver = 8192, start_time = 1534972073, state_desc = 0x0, state_reason = 35, state_reason_prev = 35, step_list = 0x7fabac4da2c0, suspend_time = 0, time_last_active = 1534972273, time_limit = 59, time_min = 0, tot_sus_time = 8, total_cpus = 2, total_nodes = 1, tres_req_cnt = 0x7fab84257860, tres_req_str = 0x7fab84257a40 "1=2,2=8000,4=1", tres_fmt_req_str = 0x7fab84257af0 "cpu=2,mem=8000M,node=1", tres_alloc_cnt = 0x7fab94b3ed20, tres_alloc_str = 0x7fab4004ee60 "1=2,2=8000,3=18446744073709551614,4=1,5=2", tres_fmt_alloc_str = 0x7fab40810fe0 "cpu=2,mem=8000M,node=1,billing=2", user_id = 241285, user_name = 0x7faad880e9a0 "berkin", wait_all_nodes = 0, warn_flags = 0, warn_signal = 0, warn_time = 0, wckey = 0x0, req_switch = 0, wait4switch = 0, best_switch = true, wait4switch_start = 0} $1 = {account = 0x7fab847fd890 "akundaje", admin_comment = 0x0, alias_list = 0x0, alloc_node = 0x7fab84826860 "sh-105-53", alloc_resp_port = 0, alloc_sid = 50361, array_job_id = 0, array_task_id = 4294967294, array_recs = 0x0, assoc_id = 8371, assoc_ptr = 0x29581e0, batch_flag = 1, batch_host = 0x7fab8c038c50 "sh-105-67", billable_tres = 2, bit_flags = 0, burst_buffer = 0x0, burst_buffer_state = 0x0, check_job = 0x0, ckpt_interval = 0, ckpt_time = 0, clusters = 0x0, comment = 0x0, cpu_cnt = 2, cr_enabled = 1, db_index = 463644395, deadline = 0, delay_boot = 0, derived_ec = 0, details = 0x7fab8405d3a0, direct_set_prio = 0, end_time = 1534971537, end_time_exp = 4294967294, epilog_running = false, exit_code = 0, fed_details = 0x0, front_end_ptr = 0x0, gids = 0x0, gres = 0x0, gres_list = 0x0, gres_alloc = 0x7fab8c7e44b0 "", gres_detail_cnt = 0, gres_detail_str = 0x0, gres_req = 0x7fab8c406920 "", gres_used = 0x0, group_id = 48931, job_id = 23944626, job_next = 0x0, job_array_next_j = 0x0, job_array_next_t = 0x0, job_resrcs = 0x7fab8c05ce40, job_state = 32771, kill_on_node_fail = 1, last_sched_eval 
= 1534970961, licenses = 0x0, license_list = 0x0, limit_set = {qos = 0, time = 0, tres = 0x7fab840d73e0}, mail_type = 0, mail_user = 0x0, magic = 4038539564, mcs_label = 0x0, name = 0x7fab847cc6c0 "zN-331", network = 0x0, next_step_id = 0, ngids = 0, nodes = 0x7fab8c3e4250 "sh-105-67", node_addr = 0x7fab8c329a20, node_bitmap = 0x7fab8c713800, node_bitmap_cg = 0x7fab28300800, node_cnt = 1, node_cnt_wag = 1, nodes_completing = 0x7fabb01d6d70 "sh-105-67", origin_cluster = 0x0, other_port = 0, pack_job_id = 0, pack_job_id_set = 0x0, pack_job_offset = 0, pack_job_list = 0x0, partition = 0x7fab8c03a4e0 "sfgf,akundaje,normal,bigmem,owners", part_ptr_list = 0x4608df0, part_nodes_missing = false, part_ptr = 0x7fab040f52e0, power_flags = 0 '\000', pre_sus_time = 0, preempt_time = 0, preempt_in_progress = false, priority = 17066, priority_array = 0x7fab84717470, prio_factors = 0x7fab847fd800, profile = 0, qos_id = 3, qos_ptr = 0x26c4c60, qos_blocking_ptr = 0x0, reboot = 0 '\000', restart_cnt = 0, resize_time = 0, resv_id = 0, resv_name = 0x0, resv_ptr = 0x0, requid = 4294967295, resp_host = 0x0, sched_nodes = 0x0, select_jobinfo = 0x7fab845b3830, spank_job_env = 0x0, spank_job_env_size = 0, start_protocol_ver = 8192, start_time = 1534970961, state_desc = 0x0, state_reason = 0, state_reason_prev = 35, step_list = 0x5aa0f90, suspend_time = 0, time_last_active = 1534972273, time_limit = 240, time_min = 0, tot_sus_time = 5, total_cpus = 2, total_nodes = 1, tres_req_cnt = 0x7fab84826810, tres_req_str = 0x7fab84386850 "1=1,2=40000,4=1", tres_fmt_req_str = 0x7fab843868b0 "cpu=1,mem=40000M,node=1", tres_alloc_cnt = 0x7fab8c286bb0, tres_alloc_str = 0x7fab28233a20 "1=2,2=40000,3=18446744073709551614,4=1,5=2", tres_fmt_alloc_str = 0x7fab284add40 "cpu=2,mem=40000M,node=1,billing=2", user_id = 312359, user_name = 0x7fab584f9030 "marinovg", wait_all_nodes = 0, warn_flags = 0, warn_signal = 0, warn_time = 0, wckey = 0x0, req_switch = 0, wait4switch = 0, best_switch = true, wait4switch_start 
= 0} Cheers, -- Kilian
Hi all,

Just a quick update to note that disabling the slurmctld epilog and enabling defer doesn't make the problem go away:

# scontrol show config | egrep "EpilogSlurmctld|SchedulerParameters"
EpilogSlurmctld         = (null)
SchedulerParameters     = default_queue_depth=5000,partition_job_depth=2000,bf_continue,bf_yield_interval=1500000,bf_yield_sleep=750000,bf_busy_nodes,bf_interval=60,bf_max_time=120,bf_window=10080,bf_job_part_count_reserve=10,bf_max_job_test=5000,bf_max_job_array_resv=10,bf_max_job_user_part=100,bf_resolution=3600,bf_min_prio_reserve=3000,kill_invalid_depend,nohold_on_prolog_fail,pack_serial_at_end,preempt_youngest_first,reduce_completing_frag,batch_sched_delay=20,sched_min_interval=100000,build_queue_timeout=5000000,jobs_per_user_per_hour=2000,max_rpc_cnt=128,defer

# scontrol completing | wc -l
206

Cheers,
-- Kilian
Hi

After analyzing the job and node state data, I was expecting this. Unfortunately, none of my suspicions were confirmed.

I am now working on eliminating some bottlenecks in the rpc/agent code (though so far I haven't found anything "big" enough to explain this); maybe this is just the effect of an overloaded slurmctld.

Dominik
An additional thing I just noticed: we're using smail from contribs/ as MailProg, and it looks like for each mail that smail generates, it gets the cluster name with the following command:

ClusterName=`/usr/bin/scontrol show config | grep ClusterName | awk '{printf("[%s]", $3)}'`

Since emails are generated and sent from the controller nodes, where slurm.conf is present, maybe that potentially load-generating scontrol call could be replaced by something lighter, like this:

ClusterName=$(awk -F= '/ClusterName/ {print $2}' ${SLURM_CONF:-/etc/slurm/slurm.conf})
Hi

Thanks for pointing this out; it can definitely create significant additional load on slurmctld. I will include this in my test environment; maybe this is the key to recreating the problem.

Dominik
Hi

I have some news. Thanks for pointing out that smail uses scontrol to get the cluster name; we addressed this problem in:
https://github.com/SchedMD/slurm/commit/9a36112969ea0519ce70830153e91eeb67a0211d
https://github.com/SchedMD/slurm/commit/d07505a8750112f0c42de314a4b520660efde141

We also identified a few spots where we can increase the performance of sending RPCs. I hope these enhancements will be in 18.08.2.

Dominik
Hi Dominik,

(In reply to Dominik Bartkiewicz from comment #141)
> I have some news:
> thanks for pointing out that smail uses scontrol to get clustername:
> We addressed this problem in:
> https://github.com/SchedMD/slurm/commit/
> 9a36112969ea0519ce70830153e91eeb67a0211d
> https://github.com/SchedMD/slurm/commit/
> d07505a8750112f0c42de314a4b520660efde141

Thank you! That first commit is about the job_submit Lua plugin though, so I assume you meant https://github.com/SchedMD/slurm/commit/0324a04c007943862cf6e7ad0eae43799895b56c instead. But that's great news!

> We also localized a few spots where we can increase the performance of
> sending RPCs, I hope this enhancement will be in 18.08.2.

Great to hear too! On our end, we've mostly managed to mitigate the issue by using `SchedulerParameters=defer`, which is not entirely satisfactory. So we will be glad to try those RPC improvements in 18.08.2 when they're available, once we manage to solve our DB upgrade problems (#5662).

Thanks again!
-- Kilian
Hi

This commit should increase the RPC performance of slurmctld:
https://github.com/SchedMD/slurm/commit/2b25aa4e555

Unfortunately, we decided that this change can only go into 19.05.

Dominik
Hi

Did you manage to upgrade to 18.08? From other customers, we hear that 18.08 seems to be faster and more responsive.

This commit makes another optimization in code related to rpc/agent:
https://github.com/SchedMD/slurm/commit/85e97dc2ac3f31b

Dominik
Hi Dominik,

(In reply to Dominik Bartkiewicz from comment #157)
> Did you manage to upgrade to 18.08?

Not yet, unfortunately, but we're looking forward to it!

> From other customers, we get the info that 18.08 seems to be faster and more
> responsive.
>
> This commit makes another optimization in code that is related to rpcs/agent.
> https://github.com/SchedMD/slurm/commit/85e97dc2ac3f31b

Nice! I assume this will be in 18.08.4?

Cheers,
-- Kilian
Hi

Yes, it will be in 18.08.4, so I'm waiting for your update. I hope that on 18.08 you will be able to remove defer from SchedulerParameters.

Dominik
Hi

While we are waiting for your update to 18.08, I will drop the severity to 3, but if you think it should be higher, feel free to bump it up again.

Dominik
(In reply to Dominik Bartkiewicz from comment #160) > Hi > > While we are waiting for an update to 18.08, I will drop severity to 3, > but if you think this should be higher, feel free to bump it up again. Sounds good to me! FYI, we're planning on migrating to 18.08 the week after Thanksgiving. We'll keep you posted. Cheers, -- Kilian
(In reply to Kilian Cavalotti from comment #161) > FYI, we're planning on migrating to 18.08 the week after Thanksgiving. We'll > keep you posted. We've just finished our migration and are now running 18.08.3. Cheers, -- Kilian
(In reply to Kilian Cavalotti from comment #162)
> (In reply to Kilian Cavalotti from comment #161)
> > FYI, we're planning on migrating to 18.08 the week after Thanksgiving. We'll
> > keep you posted.
>
> We've just finished our migration and are now running 18.08.3.

So far so good. Despite a few initial segfaults, we were able to remove the "defer" scheduling parameter, and the controller seems to be able to handle the load: no CG job accumulation to report so far.

Cheers,
-- Kilian
Hi

That is good news. Let's keep this open for a while to make sure that 18.08 keeps working well.

Dominik
Hi

Let me know when this ticket is ready to be closed.

Dominik
Hi Dominik,

We've been running 18.08 in production for about two weeks now, and it definitely behaves better than 17.11 in this regard, so I think we can close this now. We'll re-open if the issue arises again.

Thanks!
-- Kilian
(In reply to Kilian Cavalotti from comment #166)
> We've been running 18.08 in production for about two weeks now, and it
> behaves definitely better than 17.11 in that matter, so I think we can close
> this now.

We're actually still seeing this happen from time to time, although much less often than on 17.11, for sure. We were able to re-enable our EpilogSlurmctld and remove defer from SchedulerParameters, and it works fine most of the time. But there are still events that make completing jobs accumulate and hang until we restart slurmctld.

Cheers,
-- Kilian
*** Ticket 7635 has been marked as a duplicate of this ticket. ***