Ticket 1660 - slurmctld crashes after 5 sec of running
Summary: slurmctld crashes after 5 sec of running
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 14.11.6
Hardware: Linux
OS: Linux
Severity: 2 - High Impact
Priority: ---
Assignee: Moe Jette
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2015-05-10 10:05 MDT by Gene Soudlenkov
Modified: 2018-11-09 10:45 MST
CC List: 2 users

See Also:
Site: University of Auckland
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 14.11.7
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
Patch to prevent bitsize mis-match abort (1.13 KB, patch)
2015-05-10 12:00 MDT, Moe Jette
Log file before the crash (89.75 KB, text/plain)
2015-05-10 12:06 MDT, Gene Soudlenkov

Description Gene Soudlenkov 2015-05-10 10:05:07 MDT
We updated our 14.11.4 installation to 14.11.6 and discovered that slurmd started crashing after about an hour of running. The system became unusable since the daemon could not stay up for longer than 5 seconds. After analysing the crash dump we found an existing report (bug 1557) that resembled our problem, and once we took the offending node out of circulation the daemon started running nominally. It looks like this bug was re-introduced in 14.11.6.

Thanks,
Gene
Comment 1 Danny Auble 2015-05-10 11:28:36 MDT
Could you send a backtrace?

Comment 2 Gene Soudlenkov 2015-05-10 11:35:22 MDT
Hi, Danny

Here it is:

#0  0x00000036f3832625 in raise () from /lib64/libc.so.6
#1  0x00000036f3833e05 in abort () from /lib64/libc.so.6
#2  0x00000036f382b74e in __assert_fail_base () from /lib64/libc.so.6
#3  0x00000036f382b810 in __assert_fail () from /lib64/libc.so.6
#4  0x00000000004e5847 in bit_and (b1=0x2f9a750, b2=0x2f9acf0) at bitstring.c:624
#5  0x00000000004b3505 in _step_dealloc_lps (step_ptr=0x2f9aa70) at step_mgr.c:2034
#6  0x00000000004b9145 in post_job_step (step_ptr=0x2f9aa70) at step_mgr.c:4097
#7  0x00000000004aeb22 in _internal_step_complete (job_ptr=0x2f99ed0, step_ptr=0x2f9aa70) at step_mgr.c:226
#8  0x00000000004afae4 in job_step_complete (job_id=13770315, step_id=0, uid=0, requeue=false, job_return_code=15) at step_mgr.c:730
#9  0x0000000000493b3d in _slurm_rpc_step_complete (msg=0x7f990c0008d0) at proc_req.c:2949
#10 0x000000000048beed in slurmctld_req (msg=0x7f990c0008d0, arg=0x7f98bc04c160) at proc_req.c:458
#11 0x00000000004357b3 in _service_connection (arg=0x7f98bc04c160) at controller.c:1070
#12 0x00000036f3c079d1 in start_thread () from /lib64/libpthread.so.0
#13 0x00000036f38e88fd in clone () from /lib64/libc.so.6


Gene
Comment 3 Danny Auble 2015-05-10 11:41:53 MDT
Thanks Gene, this appears to be from the slurmctld, not the slurmd. Is that correct?

Could you send the log from before the abort? 

Was there a configuration change around the time it started? 

Comment 4 Gene Soudlenkov 2015-05-10 11:42:55 MDT
Ah, sorry, my bad - yes, I meant slurmctld

Logs: stand by

Gene
Comment 5 Moe Jette 2015-05-10 11:47:31 MDT
This is a new problem. The allocation sizes for the job and the step don't match, or at least the bitmaps representing the allocated cores don't match, for job 13770315 and its step #0. Did the configuration change when you restarted? If so, it would be helpful to get the before and after slurm.conf files.
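
For reference, the abort in frame #4 of the backtrace is an operand-size assertion: two bitmaps can only be ANDed word-for-word if they describe the same number of bits. A minimal sketch of that pattern in C (illustrative names only, not the actual bitstring.c code):

#include <assert.h>
#include <stdint.h>

typedef struct {
    int64_t   nbits;   /* number of valid bits in this bitmap */
    uint64_t *words;   /* bit storage, (nbits + 63) / 64 words */
} bitmap_sketch_t;

/* AND b2 into b1. If the operands describe different numbers of bits,
 * their word arrays have different lengths and the operation is not
 * well defined, so the library asserts rather than guessing. */
static void bit_and_sketch(bitmap_sketch_t *b1, const bitmap_sketch_t *b2)
{
    assert(b1->nbits == b2->nbits);   /* the kind of check that fired here */
    for (int64_t i = 0; i < (b1->nbits + 63) / 64; i++)
        b1->words[i] &= b2->words[i];
}

In other words, a job core bitmap and a step core bitmap built against different core counts cannot legally be combined, which matches the mismatch described above.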
Comment 6 Gene Soudlenkov 2015-05-10 11:49:58 MDT
Negative, the configuration did not change between the restarts. We try to keep the configuration files as static as possible, with most of the dynamic behaviour handled by our filter plugin, which does partition routing, auth, etc.
Comment 7 Moe Jette 2015-05-10 12:00:39 MDT
Created attachment 1871 [details]
Patch to prevent bitsize mis-match abort

I'm not sure how these bitmaps ended up being different sizes and I would definitely like to review the logs for clues, but this patch should prevent this specific abort. I am concerned about what else might be bad though.

diff --git a/src/slurmctld/step_mgr.c b/src/slurmctld/step_mgr.c
index 16d7147..4d398f3 100644
--- a/src/slurmctld/step_mgr.c
+++ b/src/slurmctld/step_mgr.c
@@ -2030,10 +2030,19 @@ static void _step_dealloc_lps(struct step_record *step_ptr)
        xassert(job_resrcs_ptr->core_bitmap_used);
        if (step_ptr->core_bitmap_job) {
                /* Mark the job's cores as no longer in use */
-               bit_not(step_ptr->core_bitmap_job);
-               bit_and(job_resrcs_ptr->core_bitmap_used,
-                       step_ptr->core_bitmap_job);
-               /* no need for bit_not(step_ptr->core_bitmap_job); */
+               int job_core_size, step_core_size;
+               job_core_size  = bit_size(job_resrcs_ptr->core_bitmap_used);
+               step_core_size = bit_size(step_ptr->core_bitmap_job);
+               if (job_core_size != step_core_size) {
+                       error("%s: %u.%u core_bitmap size mismatch (%d != %d)",
+                             __func__, job_ptr->job_id, step_ptr->step_id,
+                             job_core_size, step_core_size);
+               } else {
+                       bit_not(step_ptr->core_bitmap_job);
+                       bit_and(job_resrcs_ptr->core_bitmap_used,
+                               step_ptr->core_bitmap_job);
+                       /* no need for bit_not(step_ptr->core_bitmap_job); */
+               }
                FREE_NULL_BITMAP(step_ptr->core_bitmap_job);
        }
 #endif
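
For what the guarded branch does on the normal path: marking the job's cores as no longer in use is just masking the step's cores out of the job's used-core set. A toy stand-alone illustration in C, using plain 64-bit words instead of Slurm's bitmaps (the values are invented for the example):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t job_cores_used = 0x0FFF;  /* cores 0-11 marked in use by the job  */
    uint64_t step_cores     = 0x00F0;  /* cores 4-7 were allocated to the step */

    /* Equivalent of bit_not() followed by bit_and(): clear the step's
     * cores from the job's used-core mask when the step completes. */
    job_cores_used &= ~step_cores;

    printf("used after step dealloc: 0x%04llx\n",
           (unsigned long long)job_cores_used);  /* prints 0x0f0f */
    return 0;
}

When the sizes disagree, the patch skips that masking and only logs the error, but core_bitmap_job is still freed either way, so nothing leaks.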
Comment 8 Gene Soudlenkov 2015-05-10 12:06:58 MDT
Created attachment 1872 [details]
Log file before the crash
Comment 9 Gene Soudlenkov 2015-05-10 12:08:45 MDT
Moe, thanks for the patch. If we apply it and recompile, can we update all the running daemons without shutting down the cluster or killing the jobs?

Gene
Comment 10 Moe Jette 2015-05-10 12:18:43 MDT
First priority is getting that patch in place and getting your system up and running. After that, I see several odd things in the log file to investigate right away:
=================================
May  8 16:39:30 slurm-001-p slurmctld[6595]: error: Registered job 13770315.0 on wrong node compute-c1-003

Perhaps you could search for "13770315" in an earlier slurmctld log file. I'm wondering where slurm expects to find the job. It would probably be worthwhile to save the slurmd log(s) from the relevant nodes too.
==============================================
The other thing of note is a bunch of messages of this sort for various nodes:
May  8 16:39:30 slurm-001-p slurmctld[6595]: error: Node compute-a1-017 appears to have a different slurm.conf than the slurmctld.  This could cause issues with communication and functionality.  Please review both files and make sure they are the same.  If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.

Are you expecting different slurm.conf files on the different nodes? That is almost always a bad idea.
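
If differing files really are intended, the error message itself names the workaround; a one-line slurm.conf fragment, shown here purely as an illustration, would be:

DebugFlags=NO_CONF_HASH

Otherwise, distributing an identical slurm.conf to every node is the cleaner fix.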
========================================================
Then it appears there is some problem in your topology.conf file:

May  8 16:45:44 slurm-001-p slurmctld[9983]: TOPOLOGY: warning -- no switch can reach all nodes through its descendants.Do not use route/topology

If it's not clear to you what the issue is, you can send me your topology.conf and slurm.conf files to advise.
==================================
This might indicate a really old version of slurm on node compute-c1-058:
error: _slurm_rpc_node_registration node=compute-c1-058: Invalid argument
Comment 11 Moe Jette 2015-05-10 12:20:38 MDT
(In reply to Gene Soudlenkov from comment #9)
> Moe, thanks for the patch. If we apply and recompile it, can we update all
> daemons running without shutting down the cluster/killing the jobs?
> 
> Gene

This patch only affects the slurmctld, so you just need the updated code on the head node. No jobs should be lost as a result of this patch. It will just log the bitmap size mismatch, but if job 13770315 has more job steps they would probably get stuck, so that specific job may need to be killed manually.
Comment 12 Gene Soudlenkov 2015-05-10 12:24:43 MDT
Job-related logs:

<30>May  8 17:01:33 slurm-001-p slurmctld[14903]: Time limit exhausted for JobId=13770315
<30>May  8 16:35:48 slurm-001-p slurmctld[31383]: job_complete: JobID=13770315 State=0x8004 NodeCnt=1 WIFEXITED 0 WEXITSTATUS 0
<11>May  8 16:35:47 compute-bigmem-001-p slurmstepd[15759]: error: *** JOB 13770315 CANCELLED AT 2015-05-08T16:35:47 *** on compute-bigmem-001
<30>May  8 16:26:20 compute-c1-003-p slurmd[28869]: Warning: revoke on job 13770315 has no expiration
<30>May  8 16:08:12 compute-c1-003-p slurmd[18165]: Warning: revoke on job 13770315 has no expiration
<30>May  8 16:05:45 compute-c1-003-p slurmd[16545]: Warning: revoke on job 13770315 has no expiration
<30>May  8 15:41:51 compute-bigmem-001-p slurmd[9965]: Launching batch job 13770315 for UID 5323
<30>May  8 15:41:51 compute-bigmem-001-p slurmd[9965]: _run_prolog: prolog with lock for job 13770315 ran for 2 seconds
<30>May  8 15:41:48 slurm-001-p slurmctld[9562]: sched: Allocate JobId=13770315 NodeList=compute-bigmem-001 #CPUs=12
<11>May  8 15:41:46 compute-c1-003-p slurmstepd[1945]: error: *** JOB 13770315 CANCELLED AT 2015-05-08T15:41:46 *** on compute-c1-003
<30>May  8 15:41:46 slurm-001-p slurmctld[9562]: job_complete: JobID=13770315 State=0x0 NodeCnt=1 WIFEXITED 0 WEXITSTATUS 0
<30>May  8 15:41:46 slurm-001-p slurmctld[9562]: job_complete: JobID=13770315 State=0x0 NodeCnt=1 cancelled from interactive user or node failure
<30>May  8 15:37:02 slurm-001-p slurmctld[9562]: _slurm_rpc_checkpoint vacate 13770315: Operation already in progress
<30>May  8 15:35:44 slurm-001-p slurmctld[9562]: _slurm_rpc_checkpoint requeue 13770315: Operation already in progress
<30>May  8 15:35:09 slurm-001-p slurmctld[9562]: _slurm_rpc_checkpoint create 13770315: Operation already in progress
<30>May  8 15:34:07 slurm-001-p slurmctld[9562]: _slurm_rpc_checkpoint create 13770315: Operation already in progress
<30>May  8 15:33:31 slurm-001-p slurmctld[9562]: _slurm_rpc_checkpoint requeue for 13770315 usec=11812
<30>May  8 15:28:24 compute-c1-003-p slurmd[4285]: _run_prolog: prolog with lock for job 13770315 ran for 0 seconds
<30>May  8 15:28:24 compute-c1-003-p slurmd[4285]: Launching batch job 13770315 for UID 5323
<30>May  8 15:28:24 slurm-001-p slurmctld[9562]: sched: Allocate JobId=13770315 NodeList=compute-c1-003 #CPUs=12
<85>May  8 15:15:39 compute-gpu-c1-001-p sudo:  jbla572 : TTY=pts/18 ; PWD=/home/jbla572/test ; USER=root ; COMMAND=/usr/bin/scontrol update jobid=13770315 priority=60000
<30>May  8 15:15:38 slurm-001-p slurmctld[9562]: _slurm_rpc_update_job complete JobId=13770315 uid=0 usec=43095
<30>May  8 15:15:38 slurm-001-p slurmctld[9562]: sched: update_job: setting priority to 60000 for job_id 13770315
<30>May  8 15:15:34 slurm-001-p slurmctld[9562]: _slurm_rpc_submit_batch_job JobId=13770315 usec=10996
Comment 13 Moe Jette 2015-05-10 12:35:36 MDT
I'll try to piece this all together on Monday morning, but it looks like the job started on one node (compute-c1-003-p), then got requeued and restarted on another node (compute-bigmem-001-p). If you can capture the slurmd logs from those two compute nodes with respect to job 13770315, that may help. Given this additional information, I'm pretty sure my patch does the right thing.
Comment 14 Moe Jette 2015-05-10 14:05:26 MDT
Is your system running now with the patch?
Comment 15 Gene Soudlenkov 2015-05-10 14:13:47 MDT
The patch has been installed and the service restarted. It seems to be running OK, but we will keep an eye on it.

Thanks a lot, guys
Gene
Comment 16 Moe Jette 2015-05-11 05:29:23 MDT
This is what appears to have happened.

15:28:24  Job 13770315 started on compute-c1-003
15:33:31  Job checkpoint requeue requested
15:41:46  Job cancelled from interactive user or node failure
          It appears not to have fully terminated, with
          job step 0 still active on compute-c1-003
15:41:51  Job 13770315 restarted on compute-bigmem-001-p
16:39:30  Job step 13770315.0 reported on compute-c1-003 from the
          incomplete termination, so slurmctld tries to clean up
          its state for the job. The core counts on compute-c1-003
          and compute-bigmem-001-p differ, so the bit AND operation
          results in an abort due to the difference in bitmap sizes.

I checked two changes into the code:
1. When a job is requeued, make sure that all step state information has been purged (a change you do not have, but probably will not need given the patch you already have).
2. When terminating a job step, check the bitmap sizes and report an error if they differ rather than aborting (this is the patch you already have).

Both changes will be in version 14.11.7 when released and the commit is here:
https://github.com/SchedMD/slurm/commit/beecc7b08e93689c0f26f87d4ad7f00e43a69612
Comment 17 Moe Jette 2015-05-11 05:36:16 MDT
I am closing this bug based upon the patch cited in comment 16. Please re-open if necessary.

I also took the liberty of opening a new bug to pursue the configuration problems identified in comment #10. See http://bugs.schedmd.com/show_bug.cgi?id=1662