Created attachment 427 [details]
last few minutes of slurmctld's syslogs

We're running SLURM 2.6.1 with the backfill scheduler enabled. slurmctld is terminating with a SIGABRT every few hours when _assert_bitstr_valid() fails due to a NULL part_ptr->node_bitmap. The last few minutes of slurmctld's logs are attached.

A backtrace is below, and we have a core dump we can provide if it's useful. I'd rather not expose its memory image by attaching the core here, and it's larger than the attachment size limit, anyway. Let me know if you'd like it.

--
Core was generated by `/usr/sbin/slurmctld'.
Program terminated with signal 6, Aborted.
#0  0x0000003b2a6328a5 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install slurm-2.6.1-1.el6.x86_64
(gdb) bt
#0  0x0000003b2a6328a5 in raise () from /lib64/libc.so.6
#1  0x0000003b2a634085 in abort () from /lib64/libc.so.6
#2  0x0000003b2a62ba1e in __assert_fail_base () from /lib64/libc.so.6
#3  0x0000003b2a62bae0 in __assert_fail () from /lib64/libc.so.6
#4  0x00000000004a31bb in bit_and (b1=<value optimized out>, b2=<value optimized out>) at bitstring.c:546
#5  0x00007f1bbe233488 in _attempt_backfill () at backfill.c:804
#6  0x00007f1bbe2344cb in backfill_agent (args=<value optimized out>) at backfill.c:498
#7  0x0000003b2aa07851 in start_thread () from /lib64/libpthread.so.0
#8  0x0000003b2a6e890d in clone () from /lib64/libc.so.6
--
Created attachment 428 [details]
slurm.conf
While this raises another issue, removing bf_continue from the SchedulerParameters line in slurm.conf will almost certainly make the aborts go away. I have a pretty good idea what is happening and should have a patch later today.
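For reference, a slurm.conf fragment illustrating the workaround. The bf_interval value here is only an example, not taken from the attached configuration:

```
# Before: backfill scheduler resumes after its periodic sleep
SchedulerType=sched/backfill
SchedulerParameters=bf_continue,bf_interval=30

# Workaround until patched: drop bf_continue
SchedulerParameters=bf_interval=30
```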
Also, how frequently are you reconfiguring Slurm (SIGHUP to the slurmctld daemon or running "scontrol reconfig")? Are you using that to delete partitions? From the log:

Oct 8 18:18:47 holy-slurm01 slurmctld[21897]: Processing RPC: REQUEST_RECONFIGURE from uid=0
Oct 8 18:18:48 holy-slurm01 slurmctld[21897]: restoring original state of nodes
Oct 8 18:18:48 holy-slurm01 slurmctld[21897]: cons_res: select_p_node_init
Oct 8 18:18:48 holy-slurm01 slurmctld[21897]: cons_res: preparing for 24 partitions
Oct 8 18:18:48 holy-slurm01 slurmctld[21897]: error: Invalid partition (panstarrs) for job 2046160
Oct 8 18:18:48 holy-slurm01 slurmctld[21897]: error: Invalid partition (panstarrs) for job 2046161
I see what is happening. If the bf_continue configuration parameter is set, the backfill scheduler periodically sleeps to let other operations be processed, then resumes. If the slurmctld daemon is reconfigured during that interval, the partition pointers the backfill scheduler is holding become invalid, resulting in this error. Working on a fix now.
Created attachment 429 [details]
Fix for bf_continue with reconfig

This will fix the invalid partition pointer reported in bug 445.
Created attachment 430 [details]
Additional patch for bf_continue

This patch fixes a bf_continue problem where a job is in a pending state when the backfill scheduling cycle begins, but is later cancelled and its job record purged. This fix is already in version 2.6.3, but you'll want to include it for systems with bf_continue configured. With these two patches, you should be able to resume use of bf_continue.
Please see two attached patches.
(In reply to Moe Jette from comment #3)
> Also, how frequently are you reconfiguring Slurm (SIGHUP to slurmctld daemon
> or running "scontrol reconfig")?

Not all that often in general, but we were turning scheduler knobs yesterday and were reconfig'ing fairly often. There were more crashes after this one, from slurmctld instances that hadn't received a REQUEST_RECONFIGURE, but I don't have cores for them since abrtd seems to keep only one crash dump per daemon. I moved this crash dump aside, so we'll get a core next time.

> Are you using that to delete partitions?
[snip]
> Oct 8 18:18:48 holy-slurm01 slurmctld[21897]: error: Invalid partition
> (panstarrs) for job 2046160

We removed this partition recently (in the last few days). I'm guessing users are still trying to submit jobs to it, or maybe there were some PENDING jobs for that partition still in the queue?
(In reply to Moe Jette from comment #7)
> Please see two attached patches.

Thanks, Moe. We'll plan to try these in the next week or two.
You might consider upgrading to version 2.6.3 and applying just the first patch, the one related to reconfig. Version 2.6.3 contains a fair number of bug fixes relative to v2.6.1 and also includes support for a new configuration option that Harvard requested and may be useful to you:

-- Add support for new SchedulerParameters value of "bf_max_job_part", the
   maximum depth the backfill scheduler should go in any single partition.
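A slurm.conf fragment showing how the new option would be set; the value 50 below is purely illustrative, not a recommendation:

```
SchedulerType=sched/backfill
SchedulerParameters=bf_continue,bf_max_job_part=50
```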
Created attachment 434 [details]
Additional patch

The patch in attachment 430 [details] should be applied first, but only if you do not upgrade to Slurm version 2.6.3 first (it deals with jobs being deleted and purged while backfill scheduling is sleeping with the bf_continue option). Attachment 429 [details] should be applied next (it deals with "scontrol reconfig" being processed while the backfill scheduler is sleeping). This last attachment should be applied last (it deals with "scontrol delete partition" being processed while the backfill scheduler is sleeping).