Last messages when running slurmctld -D -vvv:

slurmctld: debug2: Processing RPC: MESSAGE_EPILOG_COMPLETE uid=0
slurmctld: error: job_resources_node_inx_to_cpu_inx: no job_resrcs or node_bitmap
slurmctld: error: job_update_tres_cnt: problem getting offset of JobId=33767454_11421(33767454)
Jenny, Can you please make a tarball of your StateSaveLocation for analysis? Thanks, --Nate
Please also attach your slurm.conf if there have been any changes since your last ticket.
Created attachment 11596 [details] work around patch Jenny Please apply this patch to your slurm install and restart slurmctld. Please attach your slurmctld logs from before and after the patch. Thanks, --Nate
Applied patch and recompiled - reinstalled slurm-slurmctld.

slurmctld -D -vvv
slurmctld: debug2: acct_policy_job_begin: after adding JobId=33767454_11111(33767572), assoc 1(root/(null)/(null)) grp_used_tres_run_secs(pages) is 0
slurmctld: debug2: acct_policy_job_begin: after adding JobId=33767454_11111(33767572), assoc 1(root/(null)/(null)) grp_used_tres_run_secs(gres/gpu) is 10535309
slurmctld: debug2: _group_cache_lookup_internal: no entry found for evanbure
slurmctld: backfill: Started JobId=33767454_11111(33767572) in general on c0510
slurmctld: debug: create_mmap_buf: Failed to mmap file `/pine/EX/root/slurm-log/slurmctld/hash.4/job.33767454/script`, Invalid argument
slurmctld: error: Could not open script file for JobId=33767454_11111(33767572)
slurmctld: fatal: _build_launch_job_msg: Can not find batch script for batch JobId=33767454_11111(33767572). Check file system serving StateSaveLocation as that directory may be missing or corrupted.
Created attachment 11597 [details] slurmctld log entire log including first restart attempt after patched slurmctld
The only zero length script file:

# find ./hash* -type f -name "script" -size 0
./hash.4/job.33767454/script
[root@longleaf-sched slurmctld]# pwd
/pine/EX/root/slurm-log/slurmctld
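For completeness, the same scan can also cover the environment files, since slurmctld needs both state files to be intact before it can launch a batch job. A minimal sketch, assuming the StateSaveLocation layout shown above (the STATE_SAVE variable is illustrative):

```shell
# Sketch: scan a StateSaveLocation for truncated job state files.
# Both "script" and "environment" must be non-empty for slurmctld to
# launch the batch job; STATE_SAVE defaults to the path from this ticket.
STATE_SAVE="${STATE_SAVE:-/pine/EX/root/slurm-log/slurmctld}"
find "$STATE_SAVE"/hash.* -type f \
    \( -name script -o -name environment \) -size 0
```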
Jenny,

Please call this as SlurmUser:
> touch /pine/EX/root/slurm-log/slurmctld/hash.4/job.33767454/script

Please restart slurmctld and attach logs.

Thanks, Nate
Created attachment 11598 [details] after touching the job script file Note the job script file existed and exists, but is of zero size, as is the environment file
Jenny, Can you please put a simple script that calls /bin/hostname in there, to see if that will get it to start? Thanks, Nate
Had to copy an environment file in as well, as that was also empty. It is started now.
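For reference, the manual recovery described above can be sketched roughly as follows; the job directory is the one flagged by the earlier find, and the placeholder script body is illustrative (the real environment file was copied from another job's state directory):

```shell
# Sketch of the workaround: give the truncated "script" state file minimal
# valid contents so slurmctld can launch the stuck job. JOBDIR is the
# directory flagged by the earlier find; the script body is a placeholder.
JOBDIR="${JOBDIR:-/pine/EX/root/slurm-log/slurmctld/hash.4/job.33767454}"
printf '#!/bin/sh\n/bin/hostname\n' > "$JOBDIR/script"
# The zero-length "environment" file also had to be replaced, by copying a
# non-empty environment file from another job's state directory.
```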
Was there an event with the filesystem serving StateSaveLocation? Can we lower the severity of this ticket?
Yes there was an event with the filesystem.
May I reinstall the original slurmctld and restart using that one, or leave this one in place?
(In reply to Jenny Williams from comment #17)
> Yes there was an event with the filesystem.

Is it possible to get more details?

(In reply to Jenny Williams from comment #18)
> May i reinstall the original slurmctld and restart using that one? or leave
> this one in place ?

Yes, please revert at your convenience.
Yes, the severity can be dropped at this point. There was a storm of GPFS ejects which may or may not have been due to user job load -- seeing climbs through the day.
(In reply to Jenny Williams from comment #20)
> Yes the severity can be dropped at this point

Lowering the severity per your reply.

> There was a storm of GPFS ejects which may or may not have been due to user
> job load -- seeing climbs through the day

Thanks for the additional details.
slurmctld core dumped again, independent of the filesystem issues.
This command triggers the core dump:

# scontrol update jobid=33597374 numnodes=1 nodelist=c0832
Unexpected message received for job 33597374

# scontrol show job 33597374
JobId=33597374 JobName=slurm_cpu_job_9.sh
   UserId=mahmoudm(238134) GroupId=its_graduate_psx(203) MCS_label=N/A
   Priority=2408 Nice=0 Account=rc_styner_pi QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=3-05:21:03 TimeLimit=11-00:00:00 TimeMin=N/A
   SubmitTime=2019-09-13T12:51:31 EligibleTime=2019-09-13T12:51:31
   AccrueTime=2019-09-13T12:51:31
   StartTime=2019-09-13T12:51:41 EndTime=2019-09-24T12:51:41 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2019-09-13T12:51:41
   Partition=general AllocNode:Sid=longleaf-login1.its.unc.edu:32347
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=c[0832-0835]
   BatchHost=c0832
   NumNodes=4 NumCPUs=8 NumTasks=8 CPUs/Task=N/A ReqB:S:C:T=0:0:*:*
   TRES=cpu=8,mem=32000M,node=4,billing=8
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=4000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/proj/NIRAL/users/mahmoud/Code/ADNI_Processing_Freesurfer/ADNI1_All/slurm_cpu_job_9.sh
   WorkDir=/proj/NIRAL/users/mahmoud/Code/ADNI_Processing_Freesurfer/ADNI1_All
   StdErr=/proj/NIRAL/users/mahmoud/Code/ADNI_Processing_Freesurfer/ADNI1_All/slurm-33597374.out
   StdIn=/dev/null
   StdOut=/proj/NIRAL/users/mahmoud/Code/ADNI_Processing_Freesurfer/ADNI1_All/slurm-33597374.out
   Power=
Hi Jenny, Can you send us the backtrace ("thread apply all bt full")? Please also cancel the job, as it has corrupted entities attached to it. -Jason & Nate
Please provide the specific command line for the requested backtrace or reference specific instructions.
Is it possible that this response to the scontrol update jobid request is not an issue with the job record, but is instead due to us being in the midst of an update from v18.08 to v19.05.2? The head node is updated while very few compute nodes have yet been updated.

I tried another random job of the 47 I was going to modify - the slurmctld core dumped on that job as well.

Is there a consequence to allowing these jobs instead to terminate naturally? Would leaving these jobs as-is potentially cause issues?

My choices are:
1. Probe each such job to determine if it is a problem job - potentially core dumping slurmctld 45 more times, especially if this is a bug triggered by the current cross-version conditions.
2. Outright terminate these 47 jobs without testing - I have no way of knowing if other jobs are also an issue, or if the issue were actually this user's jobs.
3. Leave them as-is - consequences to the scheduler unknown.
(In reply to Jenny Williams from comment #25)
> Please provide the specific command line for the requested backtrace or
> reference specific instructions.

Jenny,

Please try calling this command with gdb:
> gdb $(which slurmctld) $path_to_core
>> set pagination off
>> thread apply all bt full

You can just call this to kill the job:
> scancel 33597374

Please tell us if this command causes slurmctld to crash too.

Thanks, --Nate
(In reply to Jenny Williams from comment #26)
> Is it possible that this response to the scontrol update jobid request,
> instead of this being an issue with the job record it is due to ups being in
> the midst of an update from v.18.08 t 19.05.2 ? The head node is updated
> while very few compute nodes have yet been updated.
>
> I tried another random job of the 47 I was going to modify - the slurmctld
> core dumped on that job as well.
>
> Is there a consequence to allowing these jobs instead to terminate
> naturally? Would leaving these jobs as-is potentially cause issues?

There should not be, as these jobs should either die during startup or get rejected (for accounting) when they are done. Either way, they should eventually clean up naturally. Cancelling the jobs now should result in more accurate accounting information.

> My choices are:
> Probe each such job to determine if it is a problem job - potentially core
> dump of slurmctld 45 more times, especially if this is a bug triggered by
> the current cross-version conditions
> Outright terminate these 47 jobs without testing - I have no way of knowing
> if other jobs are also an issue, or if the issue were actually this users
> jobs.
> Leave them as-is - consequences to scheduler unknown

Once we get the gdb output, we will be able to provide more information.
I am not getting output from the gdb command. scancel 33597374 did not cause the scheduler to fail.

# gdb $(which slurmctld) `pwd`/core.36010
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-115.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/sbin/slurmctld...done.
[New LWP 63475]
[New LWP 63476]
[New LWP 63477]
[New LWP 36011]
[New LWP 36019]
[New LWP 36012]
[New LWP 36010]
[New LWP 36017]
[New LWP 36013]
[New LWP 36015]
[New LWP 36021]
[New LWP 36016]
[New LWP 36018]
[New LWP 36023]
[New LWP 36020]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/slurmctld'.
Program terminated with signal 6, Aborted.
#0  0x00007f6a3e17f377 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install slurm-slurmctld-19.05.2-1.el7.x86_64
(gdb) pagination off
Undefined command: "pagination".  Try "help".
(gdb) set pagination off
(gdb) thread apply bt full
(gdb) quit
Please try calling this command with gdb:
> gdb $(which slurmctld) $path_to_core
>> set pagination off
>> thread apply all bt full
>> info threads
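The same commands can also be run non-interactively, which sidesteps pagination and typos in an interactive session. A sketch, assuming $path_to_core points at the actual core file and that the matching slurm-slurmctld debuginfo package is installed (the transcript above shows gdb complaining it was missing):

```shell
# Sketch: collect the requested backtrace in a single batch gdb run.
# $path_to_core stands in for the actual core file path.
gdb -batch \
    -ex 'set pagination off' \
    -ex 'thread apply all bt full' \
    -ex 'info threads' \
    "$(which slurmctld)" "$path_to_core" > backtrace.txt 2>&1
```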
Created attachment 11608 [details] gdb from second core file core.7464
Created attachment 11609 [details] gdb from third core file core.30080 - from second attempt to start
Created attachment 11610 [details] gdb of first core file core.57163
Jenny, How is the system running? The second crash is a dup of bug #7499 which should have no ill effects. Thanks, --Nate
I have to take your word for that - I do not have rights to see bug 7499. It seems to be running as per usual. I am concerned, however, that there could be jobs that would crash the system if I attempt to modify them.
Created attachment 11611 [details] patch from 7499 Jenny, This patch should stop it crashing while modifying jobs. I have requested that bug#7499 be opened. Please give it a try. Thanks, --Nate
I still cannot see that bug
Hi Jenny,

> I still cannot see that bug

We apologize for the confusion here. The bug where we are actively working this issue is closed at the request of that confidential site. We will talk with them to see if they would be willing to open that bug so you can read its contents; however, this may not be possible given the nature of what is attached and commented in that ticket. We will keep you updated regardless of the above circumstances so that you receive the pertinent information and progress in that ticket.

What Nate meant in his comment below was that he has attached a patch and would like to know if it works for you.

> This patch should stop it crashing while modifying jobs. I have requested
> that bug#7499 be opened. Please give it a try.
(In reply to Nate Rini from comment #38)
> Created attachment 11611 [details]
> patch from 7499

Jenny,

Did you apply this patch? It is the patch from bug#7499 while we wait on permission to get the bug itself opened.

Thanks, --Nate
Yes the patch was applied and is working - Jenny
(In reply to Jenny Williams from comment #33)
> Created attachment 11610 [details]
> gdb of first core file core.57163

Jenny,

Do you still have this core? Can you please call the following in gdb:
> t 1
> f 0
> set print pretty on
> print *step_ptr
> f 2
> print *job_ptr
> print *job_ptr->details
> print *job_ptr->job_resrcs

Thanks, --Nate
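If it is easier, these prints can also be scripted as one batch gdb invocation against that core; a sketch, assuming core.57163 is in the current directory:

```shell
# Sketch: dump the requested structures from core.57163 in one pass.
gdb -batch \
    -ex 'set pagination off' \
    -ex 'set print pretty on' \
    -ex 't 1' -ex 'f 0' -ex 'print *step_ptr' \
    -ex 'f 2' -ex 'print *job_ptr' \
    -ex 'print *job_ptr->details' \
    -ex 'print *job_ptr->job_resrcs' \
    "$(which slurmctld)" core.57163 > structs.txt 2>&1
```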
Created attachment 11668 [details] gdb output from core.57163
Jenny, Going to reduce the severity for this ticket. We have a patch in bug#7641 waiting for QA review. Once it is upstream, I will update this ticket. Thanks, --Nate
This issue is no longer pertinent to us.
Jenny Closing this as a duplicate of bug#7641. Thanks, --Nate *** This ticket has been marked as a duplicate of ticket 7641 ***