Ticket 7757 - production: slurmctld will not start
Summary: production: slurmctld will not start
Status: RESOLVED DUPLICATE of ticket 7641
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 19.05.2
Hardware: Linux Linux
Severity: 4 - Minor Issue
Assignee: Nate Rini
QA Contact:
URL:
Depends on: 7641
Blocks:
Reported: 2019-09-16 11:26 MDT by Jenny Williams
Modified: 2021-03-01 12:07 MST

See Also:
Site: University of North Carolina at Chapel Hill
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
work around patch (3.59 KB, patch)
2019-09-16 11:52 MDT, Nate Rini
Details | Diff
after touching the job script file (108.84 KB, application/gzip)
2019-09-16 12:51 MDT, Jenny Williams
Details
patch from 7499 (1.85 KB, patch)
2019-09-17 09:54 MDT, Nate Rini
Details | Diff

Description Jenny Williams 2019-09-16 11:26:49 MDT
Last messages when running slurmctld -D -vvv


slurmctld: debug2: Processing RPC: MESSAGE_EPILOG_COMPLETE uid=0
slurmctld: error: job_resources_node_inx_to_cpu_inx: no job_resrcs or node_bitmap
slurmctld: error: job_update_tres_cnt: problem getting offset of JobId=33767454_11421(33767454)
Comment 3 Nate Rini 2019-09-16 11:44:56 MDT
Jenny,

Can you please make a tarball of your StateSaveLocation and attach it for analysis?

Thanks,
--Nate
Comment 4 Nate Rini 2019-09-16 11:47:39 MDT
Please also attach your slurm.conf if there have been any changes since your last ticket.
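For reference, a rough sketch of collecting both; the paths here are assumptions, so substitute the StateSaveLocation actually set in slurm.conf and the site's own config location:
> # substitute the directory named by StateSaveLocation in slurm.conf
> tar -czf statesave.tar.gz -C /path/to/statesave .
> cp /etc/slurm/slurm.conf .   # path is an assumption; use the site's slurm.conf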
Comment 8 Nate Rini 2019-09-16 11:52:07 MDT
Created attachment 11596 [details]
work around patch

Jenny

Please apply this patch to your slurm install and restart slurmctld. Please attach your slurmctld logs from before and after the patch.

Thanks,
--Nate
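For reference, a rough sketch of applying the patch; the source-tree location and build steps are assumptions and should follow however Slurm is normally built at the site:
> cd slurm-19.05.2
> patch -p1 < workaround.patch
> # rebuild and reinstall as usual, e.g. ./configure && make && make install,
> # or rebuild the site's RPMs, then:
> systemctl restart slurmctld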
Comment 9 Jenny Williams 2019-09-16 12:17:46 MDT
Applied the patch, recompiled, and reinstalled slurm-slurmctld.

slurmctld -D -vvv

slurmctld: debug2: acct_policy_job_begin: after adding JobId=33767454_11111(33767572), assoc 1(root/(null)/(null)) grp_used_tres_run_secs(pages) is 0
slurmctld: debug2: acct_policy_job_begin: after adding JobId=33767454_11111(33767572), assoc 1(root/(null)/(null)) grp_used_tres_run_secs(gres/gpu) is 10535309
slurmctld: debug2: _group_cache_lookup_internal: no entry found for evanbure
slurmctld: backfill: Started JobId=33767454_11111(33767572) in general on c0510
slurmctld: debug:  create_mmap_buf: Failed to mmap file `/pine/EX/root/slurm-log/slurmctld/hash.4/job.33767454/script`, Invalid argument
slurmctld: error: Could not open script file for JobId=33767454_11111(33767572)
slurmctld: fatal: _build_launch_job_msg: Can not find batch script for batch JobId=33767454_11111(33767572). Check file system serving StateSaveLocation as that directory may be missing or corrupted.
Comment 10 Jenny Williams 2019-09-16 12:19:23 MDT
Created attachment 11597 [details]
slurmctld log

Entire log, including the first restart attempt after the patched slurmctld.
Comment 11 Jenny Williams 2019-09-16 12:38:22 MDT
The only zero-length script file:

# find ./hash* -type f -name "script" -size 0
./hash.4/job.33767454/script
[root@longleaf-sched slurmctld]# pwd
/pine/EX/root/slurm-log/slurmctld
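Slurm also keeps an environment file alongside each script in the statesave job directories, so the check can be widened; a sketch assuming the same StateSaveLocation:
> cd /pine/EX/root/slurm-log/slurmctld
> find ./hash* -type f \( -name script -o -name environment \) -size 0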
Comment 12 Nate Rini 2019-09-16 12:45:25 MDT
Jenny

Please call this as slurmuser:
> touch /pine/EX/root/slurm-log/slurmctld/hash.4/job.33767454/script

Please restart slurmctld and attach logs.

Thanks,
Nate
Comment 13 Jenny Williams 2019-09-16 12:51:36 MDT
Created attachment 11598 [details]
after touching the job script file

Note the job script file existed and still exists, but it is zero bytes in size, as is the environment file.
Comment 14 Nate Rini 2019-09-16 12:59:16 MDT
Jenny


Can you please put a simple script that calls /bin/hostname in there, to see if that will get it to start?

Thanks
Nate
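One way to do that (a sketch; run it as SlurmUser, with the path taken from comment 11):
> cat > /pine/EX/root/slurm-log/slurmctld/hash.4/job.33767454/script <<'EOF'
> #!/bin/bash
> /bin/hostname
> EOF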
Comment 15 Jenny Williams 2019-09-16 13:06:31 MDT
I had to copy an environment file in as well, as that was also empty.

It is started now.
Comment 16 Nate Rini 2019-09-16 13:31:03 MDT
Was there an event with the filesystem serving StateSaveLocation?

Can we lower the severity of this ticket?
Comment 17 Jenny Williams 2019-09-16 13:32:44 MDT
Yes there was an event with the filesystem.
Comment 18 Jenny Williams 2019-09-16 13:37:20 MDT
May I reinstall the original slurmctld and restart using that one, or leave this one in place?
Comment 19 Nate Rini 2019-09-16 13:39:00 MDT
(In reply to Jenny Williams from comment #17)
> Yes there was an event with the filesystem.

Is it possible to get more details?

(In reply to Jenny Williams from comment #18)
> May I reinstall the original slurmctld and restart using that one, or leave
> this one in place?

Yes, please revert at your convenience.
Comment 20 Jenny Williams 2019-09-16 13:42:49 MDT
Yes, the severity can be dropped at this point.


There was a storm of GPFS ejects, which may or may not have been due to user job load -- we are seeing it climb through the day.
Comment 21 Nate Rini 2019-09-16 13:46:37 MDT
(In reply to Jenny Williams from comment #20)
> Yes, the severity can be dropped at this point.
Lowering the severity per your reply.

> There was a storm of GPFS ejects, which may or may not have been due to
> user job load -- we are seeing it climb through the day.

Thanks for the additional details.
Comment 22 Jenny Williams 2019-09-16 16:09:59 MDT
slurmctld core dumped again independent of filesystem issues
Comment 23 Jenny Williams 2019-09-16 16:13:48 MDT
This command triggers the core dump

# scontrol update jobid=33597374 numnodes=1 nodelist=c0832
Unexpected message received for job 33597374


# scontrol show job 33597374
JobId=33597374 JobName=slurm_cpu_job_9.sh
   UserId=mahmoudm(238134) GroupId=its_graduate_psx(203) MCS_label=N/A
   Priority=2408 Nice=0 Account=rc_styner_pi QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=3-05:21:03 TimeLimit=11-00:00:00 TimeMin=N/A
   SubmitTime=2019-09-13T12:51:31 EligibleTime=2019-09-13T12:51:31
   AccrueTime=2019-09-13T12:51:31
   StartTime=2019-09-13T12:51:41 EndTime=2019-09-24T12:51:41 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2019-09-13T12:51:41
   Partition=general AllocNode:Sid=longleaf-login1.its.unc.edu:32347
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=c[0832-0835]
   BatchHost=c0832
   NumNodes=4 NumCPUs=8 NumTasks=8 CPUs/Task=N/A ReqB:S:C:T=0:0:*:*
   TRES=cpu=8,mem=32000M,node=4,billing=8
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=4000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/proj/NIRAL/users/mahmoud/Code/ADNI_Processing_Freesurfer/ADNI1_All/slurm_cpu_job_9.sh
   WorkDir=/proj/NIRAL/users/mahmoud/Code/ADNI_Processing_Freesurfer/ADNI1_All
   StdErr=/proj/NIRAL/users/mahmoud/Code/ADNI_Processing_Freesurfer/ADNI1_All/slurm-33597374.out
   StdIn=/dev/null
   StdOut=/proj/NIRAL/users/mahmoud/Code/ADNI_Processing_Freesurfer/ADNI1_All/slurm-33597374.out
   Power=
Comment 24 Jason Booth 2019-09-16 16:19:02 MDT
Hi Jenny,

Can you send us the backtrace ("thread apply all bt full")?

Please also cancel the job as this has corrupted entities attached to it.

-Jason & Nate
Comment 25 Jenny Williams 2019-09-16 16:21:09 MDT
Please provide the specific command line for the requested backtrace or reference specific instructions.
Comment 26 Jenny Williams 2019-09-16 16:33:35 MDT
Is it possible that this response to the scontrol update jobid request, instead of being an issue with the job record, is due to us being in the midst of an update from v18.08 to 19.05.2? The head node is updated while very few compute nodes have been updated yet.

I tried another random job of the 47 I was going to modify - the slurmctld core dumped on that job as well.  

Is there a consequence to allowing these jobs instead to terminate naturally? Would leaving these jobs as-is potentially cause issues?  

My choices are:
	Probe each such job to determine whether it is a problem job - potentially core dumping slurmctld 45 more times, especially if this is a bug triggered by the current cross-version conditions.
	Outright terminate these 47 jobs without testing - I have no way of knowing whether other jobs are also affected, or whether the issue is actually limited to this user's jobs.
	Leave them as-is - consequences to the scheduler unknown.
Comment 27 Nate Rini 2019-09-16 16:36:30 MDT
(In reply to Jenny Williams from comment #25)
> Please provide the specific command line for the requested backtrace or
> reference specific instructions.

Jenny,

Please try calling this command with gdb:

> gdb $(which slurmctld) $path_to_core
>> set pagination off
>> thread apply bt full 

You can just call this to kill the job:
> scancel 33597374
Please tell us if this command causes slurmctld to crash too.

Thanks,
--Nate
Comment 28 Nate Rini 2019-09-16 16:39:04 MDT
(In reply to Jenny Williams from comment #26)
> Is it possible that this response to the scontrol update jobid request,
> instead of being an issue with the job record, is due to us being in the
> midst of an update from v18.08 to 19.05.2? The head node is updated while
> very few compute nodes have been updated yet.
> 
> I tried another random job of the 47 I was going to modify - the slurmctld
> core dumped on that job as well.  
> 
> Is there a consequence to allowing these jobs instead to terminate
> naturally? Would leaving these jobs as-is potentially cause issues?  
There should not be as these jobs should either die during startup or get rejected (for accounting) when they are done. Either way, they should eventually cleanup naturally. Cancelling the jobs now should result in more accurate accounting information.

> My choices are:
> 	Probe each such job to determine whether it is a problem job - potentially
> core dumping slurmctld 45 more times, especially if this is a bug triggered
> by the current cross-version conditions.
> 	Outright terminate these 47 jobs without testing - I have no way of
> knowing whether other jobs are also affected, or whether the issue is
> actually limited to this user's jobs.
> 	Leave them as-is - consequences to the scheduler unknown.
Once we get the gdb output, we will be able to provide more information.
Comment 29 Jenny Williams 2019-09-16 17:01:05 MDT
I am not getting output from the gdb command.

scancel 33597374 did not cause the scheduler to fail.


# gdb $( which slurmctld) `pwd`/core.36010
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-115.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/sbin/slurmctld...done.
[New LWP 63475]
[New LWP 63476]
[New LWP 63477]
[New LWP 36011]
[New LWP 36019]
[New LWP 36012]
[New LWP 36010]
[New LWP 36017]
[New LWP 36013]
[New LWP 36015]
[New LWP 36021]
[New LWP 36016]
[New LWP 36018]
[New LWP 36023]
[New LWP 36020]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/slurmctld'.
Program terminated with signal 6, Aborted.
#0  0x00007f6a3e17f377 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install slurm-slurmctld-19.05.2-1.el7.x86_64
(gdb) pagination off
Undefined command: "pagination".  Try "help".
(gdb) set pagination off
(gdb) thread apply bt full
(gdb) quit
Comment 30 Nate Rini 2019-09-16 17:47:59 MDT
Please try calling this command with gdb:

> gdb $(which slurmctld) $path_to_core
>> set pagination off
>> thread apply all bt full 
>> info threads
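If capturing the interactive session is awkward, the same output can be written to a file non-interactively; a sketch, with the core path as a placeholder:
> gdb -batch -ex 'set pagination off' -ex 'thread apply all bt full' -ex 'info threads' $(which slurmctld) /path/to/core > slurmctld-bt.txt 2>&1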
Comment 31 Jenny Williams 2019-09-17 07:55:08 MDT
Created attachment 11608 [details]
gdb from second core file core.7464
Comment 32 Jenny Williams 2019-09-17 07:57:53 MDT
Created attachment 11609 [details]
gdb from third core file core.30080 - from second attempt to start
Comment 33 Jenny Williams 2019-09-17 08:02:23 MDT
Created attachment 11610 [details]
gdb of first core file core.57163
Comment 36 Nate Rini 2019-09-17 09:07:41 MDT
Jenny,

How is the system running? The second crash is a dup of bug #7499 which should have no ill effects.

Thanks,
--Nate
Comment 37 Jenny Williams 2019-09-17 09:11:13 MDT
I have to take your word for that - I do not have rights to see bug 7499.

It seems to be running as usual - however, I am concerned that there could be jobs that will crash the system if I attempt to modify them.
Comment 38 Nate Rini 2019-09-17 09:54:10 MDT
Created attachment 11611 [details]
patch from 7499

Jenny,

This patch should stop it crashing while modifying jobs. I have requested that bug#7499 be opened. Please give it a try.

Thanks,
--Nate
Comment 39 Jenny Williams 2019-09-20 08:12:11 MDT
I still cannot see that bug
Comment 40 Jason Booth 2019-09-20 10:06:44 MDT
Hi Jenny

> I still cannot see that bug

We apologize for the confusion here. The ticket where we are actively working on this issue is restricted at the request of that confidential site. We will ask whether they would be willing to open it so you can read its contents; however, this may not be possible given the nature of what is attached and discussed in that ticket.

We will keep you updated regardless, so that you receive the pertinent information and progress from that ticket.

What Nate meant in his comment below was that he has attached a patch and would like to know whether it works for you.
> This patch should stop it crashing while modifying jobs. I have requested that bug#7499 be opened. Please give it a try.
Comment 41 Nate Rini 2019-09-20 11:11:55 MDT
(In reply to Nate Rini from comment #38)
> Created attachment 11611 [details]
> patch from 7499

Jenny,

Did you apply this patch? It is the patch from bug#7499 while we wait on permission to get the bug itself opened.

Thanks,
--Nate
Comment 42 Jenny Williams 2019-09-20 14:31:47 MDT
Yes, the patch was applied and is working. - Jenny
Comment 43 Nate Rini 2019-09-23 18:04:12 MDT
(In reply to Jenny Williams from comment #33)
> Created attachment 11610 [details]
> gdb of first core file core.57163

Jenny,

Do you still have this core? Can you please call the following in gdb:
> t 1
> f 0
> set print pretty on
> print *step_ptr
> f 2
> print *job_ptr
> print *job_ptr->details
> print *job_ptr->job_resrcs

Thanks,
--Nate
Comment 44 Jenny Williams 2019-09-23 18:21:41 MDT
Created attachment 11668 [details]
gdb output from core.57163
Comment 52 Nate Rini 2019-10-16 14:41:48 MDT
Jenny,

Going to reduce the severity for this ticket. We have a patch in bug#7641 waiting for QA review. Once it is upstream, I will update this ticket.

Thanks,
--Nate
Comment 53 Jenny Williams 2019-12-12 08:44:43 MST
This issue is no longer pertinent to us.
Comment 54 Nate Rini 2019-12-12 09:12:12 MST
Jenny

Closing this as a duplicate of bug#7641.

Thanks,
--Nate

*** This ticket has been marked as a duplicate of ticket 7641 ***