We're running slurmctld 18.08.8 and are in the process of scheduling/planning an upgrade. We had a slurmctld crash last night, and restarting slurmctld does not work; it crashes with signal 11. Core files are dumped. I'll attach the output of "slurmctld -D" along with what I believe are the relevant backtraces. I assume this would start OK with slurmctld -c, but it would be great not to lose the job state if there's any way to weed out what's causing the crash.
Created attachment 23518 [details] Output of slurmctld -D
Created attachment 23519 [details] Output of gdb --batch -ex "thread apply all bt full" on the core file
Ryan, I believe this issue was fixed in 20.11.6 with the following commits: https://github.com/SchedMD/slurm/compare/82ce105e18c1...f636c4562a (See bug 10980 comment 41). I think they should still apply fairly cleanly to 18.08 though. -Michael
(In reply to Ryan Novosielski from comment #0) > We had a slurmctld crash last night and restarting slurmctld will not work, > crashing with signal 11. Could you please attach the slurmctld.log leading up to the initial crash?
Created attachment 23520 [details] slurmctld -R -D -vvv output
Created attachment 23521 [details] slurmctld -D -vvv output
Created attachment 23522 [details] slurmctld log with crash
A little more history on this: we started seeing occasional crashes a week or two ago on logrotate, which I believe only sends a kill -HUP. This was discussed a bit in bug 13344. The timing of this crash matched that pattern as well.
Ryan, I sent an email directly to you outside of this bug explaining the status of support and the version you are running. I wanted to post an update via this bug as well.

Michael pointed out that this is most likely resolved in a later version, and he provided the commits containing the fix in comment #5. It is possible that you may also need a bypass patch to start your cluster: https://bugs.schedmd.com/attachment.cgi?id=9856

Once your system is back up, you should start planning an upgrade, since the issues you have reported are resolved in the current release.
Yup, understood re: support. We do what we can in line with the priorities given to us. Michael asked for logs; is there anything else in them we should know about, or should we proceed with applying the patch and get things moving again?
> Michael asked for logs; anything else in there we should know, or we should
> proceed with applying the patch and get things moving again?

Yes, please be aware that there are two patches:

https://bugs.schedmd.com/attachment.cgi?id=9856

> ... the following commits:
> https://github.com/SchedMD/slurm/compare/82ce105e18c1...f636c4562a
> (See bug 10980, comment 41).

We believe those should apply cleanly.
Hi Ryan,

There are actually three commits mentioned in comment 5. The one that will get you back up and running is the middle one: https://github.com/SchedMD/slurm/commit/73bf0a09ba. The other two are follow-up commits. Apply all three if you can, though.
Let me correct something: don't apply the attachment mentioned earlier (https://bugs.schedmd.com/attachment.cgi?id=9856) since that is equivalent to https://github.com/SchedMD/slurm/commit/73bf0a09ba. Just apply commit 73bf0a09ba.
OK, I'll apply that one patch with the three commits and see how it goes. Thanks for your quick help, all.
Sorry, was reading quickly, the one commit. Thanks!
(In reply to Ryan Novosielski from comment #18)
> Sorry, was reading quickly, the one commit. Thanks!

I just verified that commit 73bf0a09ba applies cleanly to 18.08.8. You can use the following command to do that quickly:

curl https://github.com/SchedMD/slurm/commit/73bf0a09ba.patch | git am

Let me know how it goes!

Thanks,
-Michael
We install using OpenHPC, so we went the .src.rpm route, patching via the spec file. Just about wrapping that up now. Can you confirm that slurmctld is the only affected component?
(we also took the opportunity to apply the patch supplied to us for this one: https://github.com/SchedMD/slurm/commit/21d08466c65d – we originally resolved it by asking the user to stop doing what they were doing as an upgrade was already on the horizon)
(In reply to Ryan Novosielski from comment #20) > Can you confirm, slurmctld is the only affected component? Yes, slurmctld is the only affected component, so you can just restart that after applying the patch.
We're back in service. Thank you for the fast turnaround.
Great! I would recommend applying the other two commits in https://github.com/SchedMD/slurm/compare/82ce105e18c1...f636c4562a when you get a chance (commit 1/3 helps prevent the job resources pointer from going null in the first place; 3/3 improves the logs, but it's not that important).

I'll go ahead and close this out as a duplicate, and encourage you to upgrade to a supported version as soon as you can.

Thanks!
-Michael

*** This ticket has been marked as a duplicate of ticket 10980 ***
Hey, a followup question: I only caught far later that I have a lot of jobs that still show as running but which completed on the actual compute node. Here's an example:

[root@amarel1 ~]# squeue -w hal0199
     JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  17631585   p_foran C508.TCG    vm379  R   15:58:38      1 hal0199
  17631586   p_foran C508.TCG    vm379  R   15:58:38      1 hal0199
  17631587   p_foran C508.TCG    vm379  R   15:58:38      1 hal0199
  17631577   p_foran C508.TCG    vm379  R   15:58:43      1 hal0199
  17631580   p_foran C508.TCG    vm379  R   15:58:43      1 hal0199
  17631763   p_foran C508.TCG    vm379  R   15:45:39      1 hal0199
  17631746   p_foran C508.TCG    vm379  R   15:46:42      1 hal0199
  17631730   p_foran C508.TCG    vm379  R   15:48:33      1 hal0199
  17631775   p_foran C508.TCG    vm379  R   15:44:53      1 hal0199
  17631769   p_foran C508.TCG    vm379  R   15:45:25      1 hal0199
  17631949   p_foran C508.TCG    vm379  R   15:36:32      1 hal0199
  17631943   p_foran C508.TCG    vm379  R   15:36:51      1 hal0199
  17631932   p_foran C508.TCG    vm379  R   15:37:11      1 hal0199
  17631920   p_foran C508.TCG    vm379  R   15:37:26      1 hal0199
  17631915   p_foran C508.TCG    vm379  R   15:37:39      1 hal0199
  17631916   p_foran C508.TCG    vm379  R   15:37:39      1 hal0199
  17631900   p_foran C508.TCG    vm379  R   15:38:12      1 hal0199
  17631880   p_foran C508.TCG    vm379  R   15:39:02      1 hal0199
  17632089   p_foran C508.TCG    vm379  R   15:30:36      1 hal0199
  17632081   p_foran C508.TCG    vm379  R   15:30:58      1 hal0199
  17632060   p_foran C508.TCG    vm379  R   15:32:06      1 hal0199
  17632050   p_foran C508.TCG    vm379  R   15:32:44      1 hal0199
  17632047   p_foran C508.TCG    vm379  R   15:32:54      1 hal0199
  17632035   p_foran C508.TCG    vm379  R   15:33:36      1 hal0199
  17632031   p_foran C508.TCG    vm379  R   15:33:54      1 hal0199
  17632016   p_foran C508.TCG    vm379  R   15:34:28      1 hal0199
  17632177   p_foran C508.TCG    vm379  R   15:25:29      1 hal0199
  17632171   p_foran C508.TCG    vm379  R   15:26:06      1 hal0199
  17632164   p_foran C508.TCG    vm379  R   15:26:21      1 hal0199
  17632163   p_foran C508.TCG    vm379  R   15:26:34      1 hal0199
  17632161   p_foran C508.TCG    vm379  R   15:26:41      1 hal0199
  17632158   p_foran C508.TCG    vm379  R   15:27:04      1 hal0199
  17632129   p_foran C508.TCG    vm379  R   15:28:47      1 hal0199
  17632128   p_foran C508.TCG    vm379  R   15:28:50      1 hal0199
  17632112   p_foran C508.TCG    vm379  R   15:29:26      1 hal0199
  17632105   p_foran C508.TCG    vm379  R   15:29:45      1 hal0199
  17632106   p_foran C508.TCG    vm379  R   15:29:45      1 hal0199
  17632252   p_foran C508.TCG    vm379  R   15:20:27      1 hal0199
  17632249   p_foran C508.TCG    vm379  R   15:20:33      1 hal0199
  17632236   p_foran C508.TCG    vm379  R   15:21:44      1 hal0199
  17632228   p_foran C508.TCG    vm379  R   15:21:53      1 hal0199
  17632219   p_foran C508.TCG    vm379  R   15:22:42      1 hal0199
  17632193   p_foran C508.TCG    vm379  R   15:24:25      1 hal0199
  17632191   p_foran C508.TCG    vm379  R   15:24:36      1 hal0199
  17632256   p_foran C508.TCG    vm379  R   15:20:13      1 hal0199
  17632253   p_foran C508.TCG    vm379  R   15:20:20      1 hal0199

There's nothing running on that node:

[root@hal0199 ~]# pgrep -U vm379
[root@hal0199 ~]#

But it's keeping the node partly busy:

[root@hal0199 ~]# scontrol show node hal0199
NodeName=hal0199 Arch=x86_64 CoresPerSocket=26
   CPUAlloc=46 CPUTot=52 CPULoad=0.01
   AvailableFeatures=hal,hdr,cascadelake
   ActiveFeatures=hal,hdr,cascadelake
   Gres=(null)
   NodeAddr=hal0199 NodeHostName=hal0199 Version=18.08
   OS=Linux 3.10.0-1160.42.2.el7.x86_64 #1 SMP Tue Sep 7 14:49:57 UTC 2021
   RealMemory=192000 AllocMem=188416 FreeMem=181010 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=20 Owner=N/A MCS_label=N/A
   Partitions=main,bg,oarc,p_foran
   BootTime=2021-12-21T18:17:35 SlurmdStartTime=2022-02-17T18:26:25
   CfgTRES=cpu=52,mem=187.50G,billing=52
   AllocTRES=cpu=46,mem=184G
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

Is there any way to handle these en-masse? The jobs are hard to tell from still-running jobs. I know they'll time out eventually. I also would have expected them all to fail after 5 mins, since that's what SlurmdTimeout is set to, but apparently not?
(In reply to Ryan Novosielski from comment #25)
> Is there any way to handle these en-masse? The jobs are hard to tell from
> still-running jobs.

Have you tried restarting slurmctld, and if that doesn't work, the slurmd on that node? That may fix the issue.

> I know they'll time out eventually. I also would have expected them all to
> fail after 5 mins, since that's what SlurmdTimeout is set to, but apparently
> not?

Well, I don't think the SlurmdTimeout has anything to do with a job's timeout.
Ryan, setting the node to down should clear those jobs off. I would avoid downing the node with other valid jobs running on that node, or those jobs will be re-queued. > scontrol update nodename=hal0199 state=down reason="Crash recovery"
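To make that recovery repeatable, the down/resume pair can be scripted. This is a dry-run sketch, not something verified on your cluster: the node name and reason string are illustrative, and the `run` wrapper only prints each command (drop the `echo` to actually execute). As noted above, downing the node also requeues any genuinely running jobs, so only use this when everything squeue shows on the node is known to be stale.

```shell
# Dry-run sketch: purge stale jobs from one node by downing it, then
# return it to service. Replace the echo with "$@" to actually run.
NODE=hal0199
run() { echo "$@"; }

run scontrol update nodename="$NODE" state=down reason="Crash recovery"
run scontrol update nodename="$NODE" state=resume
```

The `resume` step matters: a node left in DOWN state will not accept new work even after the stale jobs are cleared.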
>> Is there any way to handle these en-masse? The jobs are hard to tell from
>> still-running jobs.
> Have you tried restarting slurmctld, and if that doesn't work, the slurmd on
> that node? That may fix the issue.

Yeah, both. Neither helps. No big deal if there's no answer; we're not supposed to be in this situation, and I was just a bit surprised. We think the jobs actually completed, but we'd probably have to confer with the user about their job logs to know for sure. They look OK in the slurmd.log. Here's an exit:

[2022-02-17T06:33:33.794] [17632253.batch] error: Unable to send job complete message: Unable to contact slurm controller (connect failure)
[2022-02-17T06:33:33.794] [17632253.batch] done with job
[2022-02-17T13:55:15.709] [17632253.0] Rank 0 sent step completion message directly to slurmctld (0.0.0.0:0)
[2022-02-17T13:55:15.710] [17632253.0] done with job

...this is after the slurmctld crash at ~03:19; it logged connect failures for several hours before it exited.

>> I know they'll time out eventually. I also would have expected them all to
>> fail after 5 mins, since that's what SlurmdTimeout is set to, but apparently
>> not?
> Well, I don't think the SlurmdTimeout has anything to do with a job's timeout.

Oh, you know what, I was remembering that as how long slurmctld can be down before slurmd gets upset about not reaching the controller and kills the jobs on a node, but I guess it's really how long slurmd can be down before the controller marks the node down. Whoops.
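For reference, a hypothetical slurm.conf fragment matching that corrected reading (the 300-second value is only assumed from the "5 mins" mentioned above):

```
# SlurmdTimeout: how long slurmctld tolerates an unresponsive slurmd before
# marking the node DOWN. It does not bound how long a job may run.
SlurmdTimeout=300
```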
Just FYI, you can't apply the first hunk of that patch to 18.08.8. What it patches doesn't appear anywhere in slurm_cred.c, and in fact, there doesn't appear to be anywhere in the entire source code where the thing being patched actually happens (maybe the below references in slurmd.c/req.c/x11_forwarding.c). Again, not expecting that to be fixed, but I primarily bring this up just in case the bug we hit actually isn't the same one/still exists, and it's only the fix for the symptom that applies in both cases. We don't have any "slurm_cred_create failed" error messages in our logfiles.

[root@master src]# grep -r getpwuid
salloc/opt.c:#include <pwd.h>	/* getpwuid */
salloc/opt.c: * NOTE: This function is NOT reentrant (see getpwuid_r if needed) */
salloc/opt.c:	pw_ent_ptr = getpwuid(opt.uid);
sacct/print.c:	if ((pw=getpwuid(job->uid)))
sinfo/print.c:	if ((pw=getpwuid(sinfo_data->reason_uid)))
sinfo/print.c:	if ((pw=getpwuid(sinfo_data->reason_uid)))
sshare/sshare.c:	if ((pwd = getpwuid(getuid()))) {
sshare/sshare.c:	if (!(pwd = getpwuid((uid_t) id))) {
slurmd/slurmd/slurmd.c:	/* since when you do a getpwuid you get a pointer to a
slurmd/slurmd/slurmd.c:	if ((pw = getpwuid(slurmd_uid)))
slurmd/slurmd/slurmd.c:	if ((pw = getpwuid(curr_uid)))
slurmd/slurmd/req.c:	if (slurm_getpwuid_r(req->uid, &pwd, pwd_buf, PW_BUF_SIZE, &pwd_ptr)
slurmd/slurmd/req.c:	error("%s: getpwuid_r(%u):%m", __func__, req->uid);
slurmd/slurmstepd/x11_forwarding.c:	if (slurm_getpwuid_r(uid, &pwd, pwd_buf, PW_BUF_SIZE, &pwd_ptr)
slurmd/slurmstepd/x11_forwarding.c:	error("%s: getpwuid_r(%u):%m", __func__, uid);
sview/node_info.c:	if ((pw=getpwuid(node_ptr->reason_uid)))
sview/front_end_info.c:	if ((pw=getpwuid(front_end_ptr->reason_uid)))
common/uid.h:/* Retry getpwuid_r while return code is EINTR so we always get the
common/uid.h: * info. Return return code of getpwuid_r.
common/uid.h:extern int slurm_getpwuid_r (uid_t uid, struct passwd *pwd, char *buf,
common/proc_args.c:#include <pwd.h>	/* getpwuid */
common/uid.c:extern int slurm_getpwuid_r (uid_t uid, struct passwd *pwd, char *buf,
common/uid.c:	rc = getpwuid_r(uid, pwd, buf, bufsiz, result);
common/uid.c:	if (slurm_getpwuid_r(l, &pwd, buffer, PW_BUF_SIZE, &result) != 0)
common/uid.c:	rc = slurm_getpwuid_r(uid, &pwd, buffer, PW_BUF_SIZE, &result);
common/uid.c:	rc = slurm_getpwuid_r(uid, &pwd, buffer, PW_BUF_SIZE, &result);
slurmctld/partition_mgr.c: * getpwuid_r and getgrgid_r calls should be cached by
slurmctld/partition_mgr.c:	res = getpwuid_r(run_uid, &pwd, buf, buflen, &pwd_result);
strigger/strigger.c:	pw = getpwuid(uid);
Actually, looking more closely at that set of commits, I think maybe you just reversed it: 1/3 looks like it's for the logging, and maybe 3/3 is the consequential patch.

aed9850 Print out error when getpwuid_r() fails
73bf0a0 Avoid segfault in controller when job loses its job resources object
f636c45 Never schedule the last task in a job array twice

(All three were committed on Apr 12, 2021.) The third one was happy to apply, so I'm including that one.
...but it doesn't ultimately build. Anyway, we've only ever hit this situation a single time unpatched, so it's not a major issue for us, especially with the core dump situation remedied and our plans to move off of this version as soon as possible. But let me know if you want anything more from our system, in case we're actually hitting a different bug, since no credential errors were ever logged.

make[5]: Entering directory `/root/rpmbuild/BUILD/slurm-18.08.8/src/plugins/sched/backfill'
/bin/sh ../../../../libtool --tag=CC --mode=compile gcc -DHAVE_CONFIG_H -I. -I../../../.. -I../../../../slurm -I../../../.. -I../../../../src/common -DNUMA_VERSION1_COMPATIBILITY -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -pthread -ggdb3 -Wall -g -O1 -fno-strict-aliasing -c -o backfill_wrapper.lo backfill_wrapper.c
/bin/sh ../../../../libtool --tag=CC --mode=compile gcc -DHAVE_CONFIG_H -I. -I../../../.. -I../../../../slurm -I../../../.. -I../../../../src/common -DNUMA_VERSION1_COMPATIBILITY -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -pthread -ggdb3 -Wall -g -O1 -fno-strict-aliasing -c -o backfill.lo backfill.c
libtool: compile: gcc -DHAVE_CONFIG_H -I. -I../../../.. -I../../../../slurm -I../../../.. -I../../../../src/common -DNUMA_VERSION1_COMPATIBILITY -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -pthread -ggdb3 -Wall -g -O1 -fno-strict-aliasing -c backfill.c -fPIC -DPIC -o .libs/backfill.o
libtool: compile: gcc -DHAVE_CONFIG_H -I. -I../../../.. -I../../../../slurm -I../../../.. -I../../../../src/common -DNUMA_VERSION1_COMPATIBILITY -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -pthread -ggdb3 -Wall -g -O1 -fno-strict-aliasing -c backfill_wrapper.c -fPIC -DPIC -o .libs/backfill_wrapper.o
backfill.c: In function '_attempt_backfill':
backfill.c:2455:6: error: unknown type name 'job_record_t'
      job_record_t *tmp = job_ptr;
      ^
backfill.c:2455:26: warning: initialization from incompatible pointer type [enabled by default]
      job_record_t *tmp = job_ptr;
                          ^
backfill.c:2458:30: warning: comparison of distinct pointer types lacks a cast [enabled by default]
      if (job_ptr && (job_ptr != tmp) &&
                              ^
make[5]: *** [backfill.lo] Error 1
make[5]: *** Waiting for unfinished jobs....
libtool: compile: gcc -DHAVE_CONFIG_H -I. -I../../../.. -I../../../../slurm -I../../../.. -I../../../../src/common -DNUMA_VERSION1_COMPATIBILITY -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -pthread -ggdb3 -Wall -g -O1 -fno-strict-aliasing -c backfill_wrapper.c -o backfill_wrapper.o >/dev/null 2>&1
make[5]: Leaving directory `/root/rpmbuild/BUILD/slurm-18.08.8/src/plugins/sched/backfill'
make[4]: *** [all-recursive] Error 1
make[4]: Leaving directory `/root/rpmbuild/BUILD/slurm-18.08.8/src/plugins/sched'
make[3]: *** [all-recursive] Error 1
make[3]: Leaving directory `/root/rpmbuild/BUILD/slurm-18.08.8/src/plugins'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory `/root/rpmbuild/BUILD/slurm-18.08.8/src'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/root/rpmbuild/BUILD/slurm-18.08.8'
make: *** [all] Error 2
error: Bad exit status from /var/tmp/rpm-tmp.dfdUuv (%build)

RPM build errors:
    Bad exit status from /var/tmp/rpm-tmp.dfdUuv (%build)
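For what it's worth, the unknown-type error suggests the backported hunk uses the `job_record_t` typedef, which later Slurm releases introduced as an alias for `struct job_record`; 18.08 only has the struct spelling. A hedged sketch of adapting the hunk before rebuilding follows (the file path comes from the compiler error above; whether any other symbols in the hunk also differ in 18.08 is untested):

```shell
# Rewrite the newer typedef to the 18.08 spelling in the patched file.
# Assumes job_record_t is the only symbol in the hunk that 18.08 lacks.
SRC=${SRC:-src/plugins/sched/backfill/backfill.c}
if [ -f "$SRC" ]; then
    sed -i 's/\bjob_record_t \*/struct job_record */g' "$SRC"
fi
```

Alternatively, the same edit could be made to the .patch file itself before feeding it to the spec's %patch step.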
(In reply to Ryan Novosielski from comment #31)
> ...but it doesn't ultimately build. Anyway, we've only ever hit this
> situation a single time unpatched, so it's not a major issue for us,
> especially with the core dump situation remedied and our plans to move off
> of this version as soon as possible.

Reducing severity based on this response. I strongly suggest just upgrading and avoiding this entire patching exercise. This issue is fixed in all releases since slurm-20-11-6-1. Please upgrade to the latest patch level of 20.11 or 21.08, as both are currently supported releases.
Ryan, before we move to close this out, I wanted to check on the current status. Also, do you still see the leftover jobs reported in comment #25?
Resolving for now.