Created attachment 427 [details]
last few minutes of slurmctld's syslogs

We're running SLURM 2.6.1 with the backfill scheduler enabled. slurmctld is terminating with a SIGABRT every few hours when _assert_bitstr_valid() fails due to a NULL part_ptr->node_bitmap. The last few minutes of slurmctld's logs are attached.

A backtrace is below, and we have a core dump we can provide if it's useful. I'd rather not expose its memory image by attaching the core here, and it's larger than the attachment size limit, anyway. Let me know if you'd like it.

--
Core was generated by `/usr/sbin/slurmctld'.
Program terminated with signal 6, Aborted.
#0  0x0000003b2a6328a5 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install slurm-2.6.1-1.el6.x86_64
(gdb) bt
#0  0x0000003b2a6328a5 in raise () from /lib64/libc.so.6
#1  0x0000003b2a634085 in abort () from /lib64/libc.so.6
#2  0x0000003b2a62ba1e in __assert_fail_base () from /lib64/libc.so.6
#3  0x0000003b2a62bae0 in __assert_fail () from /lib64/libc.so.6
#4  0x00000000004a31bb in bit_and (b1=<value optimized out>, b2=<value optimized out>) at bitstring.c:546
#5  0x00007f1bbe233488 in _attempt_backfill () at backfill.c:804
#6  0x00007f1bbe2344cb in backfill_agent (args=<value optimized out>) at backfill.c:498
#7  0x0000003b2aa07851 in start_thread () from /lib64/libpthread.so.0
#8  0x0000003b2a6e890d in clone () from /lib64/libc.so.6
--
Created attachment 428 [details]
slurm.conf
While this raises another issue, removing bf_continue from the SchedulerParameters line in slurm.conf will almost certainly make the aborts go away. I have a pretty good idea what is happening and should have a patch later today.
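For reference, a slurm.conf fragment illustrating the workaround. The bf_interval value here is only an example, not taken from the attached configuration:

```
# Before: backfill scheduler resumes after its periodic sleep
SchedulerType=sched/backfill
SchedulerParameters=bf_continue,bf_interval=30

# Workaround until patched: drop bf_continue
SchedulerParameters=bf_interval=30
```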
Also, how frequently are you reconfiguring Slurm (SIGHUP to the slurmctld daemon or running "scontrol reconfig")? Are you using that to delete partitions? From the log:

Oct 8 18:18:47 holy-slurm01 slurmctld[21897]: Processing RPC: REQUEST_RECONFIGURE from uid=0
Oct 8 18:18:48 holy-slurm01 slurmctld[21897]: restoring original state of nodes
Oct 8 18:18:48 holy-slurm01 slurmctld[21897]: cons_res: select_p_node_init
Oct 8 18:18:48 holy-slurm01 slurmctld[21897]: cons_res: preparing for 24 partitions
Oct 8 18:18:48 holy-slurm01 slurmctld[21897]: error: Invalid partition (panstarrs) for job 2046160
Oct 8 18:18:48 holy-slurm01 slurmctld[21897]: error: Invalid partition (panstarrs) for job 2046161
I see what is happening. If the bf_continue configuration parameter is set, the backfill scheduler periodically sleeps to let other operations be processed, then resumes. If the slurmctld daemon is reconfigured during that interval, the partition pointers the backfill scheduler is holding become invalid, resulting in this error. Working on a fix now.
Created attachment 429 [details]
Fix for bf_continue with reconfig

This will fix the invalid partition pointer reported in bug 445.
Created attachment 430 [details]
Additional patch for bf_continue

This patch fixes a bf_continue problem where a job is in a pending state when the backfill scheduling cycle begins, but is later cancelled and its job record purged. This fix is already in version 2.6.3, but you'll want to include it for systems with bf_continue configured. With these two patches, you should be able to resume use of bf_continue.
Please see two attached patches.
(In reply to Moe Jette from comment #3)
> Also, how frequently are you reconfiguring Slurm (SIGHUP to slurmctld daemon
> or running "scontrol reconfig")?

Not all that often in general, but we were turning scheduler knobs yesterday and were reconfig'ing fairly often. There were more crashes after this one, from slurmctld instances that hadn't received a REQUEST_RECONFIGURE, but I don't have cores for them since abrtd seems to keep only one crash dump per daemon. I moved this crash dump aside, so we'll get a core next time.

> Are you using that to delete partitions?
[snip]
> Oct 8 18:18:48 holy-slurm01 slurmctld[21897]: error: Invalid partition
> (panstarrs) for job 2046160

We removed this partition recently (in the last few days). I'm guessing users are still trying to submit jobs to it, or maybe there were some PENDING jobs for that partition still in the queue?
(In reply to Moe Jette from comment #7)
> Please see two attached patches.

Thanks, Moe. We'll plan to try these in the next week or two.
You might consider upgrading to version 2.6.3 and applying just the first patch, the one related to reconfig. Version 2.6.3 contains a fair number of bug fixes relative to v2.6.1 and also includes support for a new configuration option that Harvard requested and may be useful to you:

-- Add support for new SchedulerParameters value of "bf_max_job_part", the
   maximum depth the backfill scheduler should go in any single partition.
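A slurm.conf fragment showing how the new option would be set; the value 50 below is purely illustrative, not a recommendation:

```
SchedulerType=sched/backfill
SchedulerParameters=bf_continue,bf_max_job_part=50
```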
Created attachment 434 [details]
Additional patch

The patch in attachment 430 [details] should be applied first, but only if you do not upgrade to Slurm version 2.6.3 first (it deals with jobs being deleted and purged while backfill scheduling is sleeping with the bf_continue option). Attachment 429 [details] should be applied next (it deals with "scontrol reconfig" being processed while the backfill scheduler is sleeping). This last attachment should be applied last (it deals with "scontrol delete partition" being processed while the backfill scheduler is sleeping).