We're running slurmctld 18.08.8 and are in the process of scheduling/planning an upgrade. We had a slurmctld crash last night, and restarting slurmctld does not work; it crashes with signal 11. Core files are dumped. I'll attach the output of "slurmctld -D" along with what I believe are the relevant backtraces. I assume this would start OK with slurmctld -c, but it would be great not to lose the job state if there's any way to weed out what's causing the crash.
Created attachment 23518 [details] Output of slurmctld -D
Created attachment 23519 [details] Output of gdb --batch -ex "thread apply all bt full" on the core file
Ryan, I believe this issue was fixed in 20.11.6 with the following commits: https://github.com/SchedMD/slurm/compare/82ce105e18c1...f636c4562a (See bug 10980 comment 41). I think they should still apply fairly cleanly to 18.08 though. -Michael
(In reply to Ryan Novosielski from comment #0) > We had a slurmctld crash last night and restarting slurmctld will not work, > crashing with signal 11. Could you please attach the slurmctld.log leading up to the initial crash?
Created attachment 23520 [details] slurmctld -R -D -vvv output
Created attachment 23521 [details] slurmctld -D -vvv output
Created attachment 23522 [details] slurmctld log with crash
A little more history on this: we started seeing occasional crashes a week or two ago on logrotate, which I believe only sends a kill -HUP. This was discussed a bit in bug 13344. The timing of this crash matched that pattern as well.
Ryan, I sent an email directly to you outside of this bug explaining the status of support and the version you are running. I wanted to post an update via this bug as well.

Michael pointed out that this is most likely resolved in a later version, and he provided the commits containing the fix in comment #5. It is possible that you may also need a bypass patch to start your cluster: https://bugs.schedmd.com/attachment.cgi?id=9856

Once your system is back up, you should start planning an upgrade, since the issues you have reported are resolved in the current release.
Yup, understood re: support. We do what we can in line with the priorities given to us. Michael asked for logs; is there anything else in them we should know about, or should we proceed with applying the patch and get things moving again?
> Michael asked for logs; anything else in there we should know, or we should
> proceed with applying the patch and get things moving again?

Yes, please be aware that there are two patches:

https://bugs.schedmd.com/attachment.cgi?id=9856

> ... the following commits:
> https://github.com/SchedMD/slurm/compare/82ce105e18c1...f636c4562a
> (See bug 10980, comment 41).

We believe those should apply cleanly.
Hi Ryan,

There are actually three commits mentioned in comment 5. The one that will get you back up and running is the middle one: https://github.com/SchedMD/slurm/commit/73bf0a09ba. The other two are follow-up commits. Apply all three if you can, though.
Let me correct something: don't apply the attachment mentioned earlier (https://bugs.schedmd.com/attachment.cgi?id=9856) since that is equivalent to https://github.com/SchedMD/slurm/commit/73bf0a09ba. Just apply commit 73bf0a09ba.
OK, I'll apply that one patch with the three commits and see how it goes. Thanks for your quick help, all.
Sorry, was reading quickly, the one commit. Thanks!
(In reply to Ryan Novosielski from comment #18)
> Sorry, was reading quickly, the one commit. Thanks!

I just verified that commit 73bf0a09ba applies cleanly to 18.08.8. You can use the following command to do that quickly:

curl https://github.com/SchedMD/slurm/commit/73bf0a09ba.patch | git am

Let me know how it goes!

Thanks,
-Michael
We install using OpenHPC, so we went the .src.rpm route, patching via the spec file. Just about wrapping that up now. Can you confirm that slurmctld is the only affected component?
(we also took the opportunity to apply the patch supplied to us for this one: https://github.com/SchedMD/slurm/commit/21d08466c65d – we originally resolved it by asking the user to stop doing what they were doing as an upgrade was already on the horizon)
(In reply to Ryan Novosielski from comment #20) > Can you confirm, slurmctld is the only affected component? Yes, slurmctld is the only affected component, so you can just restart that after applying the patch.
We're back in service. Thank you for the fast turnaround.
Great! I would recommend applying the other two commits in https://github.com/SchedMD/slurm/compare/82ce105e18c1...f636c4562a when you get a chance (commit 1/3 helps prevent the job resources pointer from going null in the first place; 3/3 improves the logs, but it's not that important).

I'll go ahead and close this out as a duplicate, and encourage you to upgrade to a supported version as soon as you can.

Thanks!
-Michael

*** This ticket has been marked as a duplicate of ticket 10980 ***
Hey, a followup question: I only caught far later that I have a lot of jobs that still show as running but which completed on the actual compute node. Here's an example:

[root@amarel1 ~]# squeue -w hal0199
     JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  17631585   p_foran C508.TCG    vm379  R   15:58:38      1 hal0199
  17631586   p_foran C508.TCG    vm379  R   15:58:38      1 hal0199
  17631587   p_foran C508.TCG    vm379  R   15:58:38      1 hal0199
  17631577   p_foran C508.TCG    vm379  R   15:58:43      1 hal0199
  17631580   p_foran C508.TCG    vm379  R   15:58:43      1 hal0199
  17631763   p_foran C508.TCG    vm379  R   15:45:39      1 hal0199
  17631746   p_foran C508.TCG    vm379  R   15:46:42      1 hal0199
  17631730   p_foran C508.TCG    vm379  R   15:48:33      1 hal0199
  17631775   p_foran C508.TCG    vm379  R   15:44:53      1 hal0199
  17631769   p_foran C508.TCG    vm379  R   15:45:25      1 hal0199
  17631949   p_foran C508.TCG    vm379  R   15:36:32      1 hal0199
  17631943   p_foran C508.TCG    vm379  R   15:36:51      1 hal0199
  17631932   p_foran C508.TCG    vm379  R   15:37:11      1 hal0199
  17631920   p_foran C508.TCG    vm379  R   15:37:26      1 hal0199
  17631915   p_foran C508.TCG    vm379  R   15:37:39      1 hal0199
  17631916   p_foran C508.TCG    vm379  R   15:37:39      1 hal0199
  17631900   p_foran C508.TCG    vm379  R   15:38:12      1 hal0199
  17631880   p_foran C508.TCG    vm379  R   15:39:02      1 hal0199
  17632089   p_foran C508.TCG    vm379  R   15:30:36      1 hal0199
  17632081   p_foran C508.TCG    vm379  R   15:30:58      1 hal0199
  17632060   p_foran C508.TCG    vm379  R   15:32:06      1 hal0199
  17632050   p_foran C508.TCG    vm379  R   15:32:44      1 hal0199
  17632047   p_foran C508.TCG    vm379  R   15:32:54      1 hal0199
  17632035   p_foran C508.TCG    vm379  R   15:33:36      1 hal0199
  17632031   p_foran C508.TCG    vm379  R   15:33:54      1 hal0199
  17632016   p_foran C508.TCG    vm379  R   15:34:28      1 hal0199
  17632177   p_foran C508.TCG    vm379  R   15:25:29      1 hal0199
  17632171   p_foran C508.TCG    vm379  R   15:26:06      1 hal0199
  17632164   p_foran C508.TCG    vm379  R   15:26:21      1 hal0199
  17632163   p_foran C508.TCG    vm379  R   15:26:34      1 hal0199
  17632161   p_foran C508.TCG    vm379  R   15:26:41      1 hal0199
  17632158   p_foran C508.TCG    vm379  R   15:27:04      1 hal0199
  17632129   p_foran C508.TCG    vm379  R   15:28:47      1 hal0199
  17632128   p_foran C508.TCG    vm379  R   15:28:50      1 hal0199
  17632112   p_foran C508.TCG    vm379  R   15:29:26      1 hal0199
  17632105   p_foran C508.TCG    vm379  R   15:29:45      1 hal0199
  17632106   p_foran C508.TCG    vm379  R   15:29:45      1 hal0199
  17632252   p_foran C508.TCG    vm379  R   15:20:27      1 hal0199
  17632249   p_foran C508.TCG    vm379  R   15:20:33      1 hal0199
  17632236   p_foran C508.TCG    vm379  R   15:21:44      1 hal0199
  17632228   p_foran C508.TCG    vm379  R   15:21:53      1 hal0199
  17632219   p_foran C508.TCG    vm379  R   15:22:42      1 hal0199
  17632193   p_foran C508.TCG    vm379  R   15:24:25      1 hal0199
  17632191   p_foran C508.TCG    vm379  R   15:24:36      1 hal0199
  17632256   p_foran C508.TCG    vm379  R   15:20:13      1 hal0199
  17632253   p_foran C508.TCG    vm379  R   15:20:20      1 hal0199

There's nothing running on that node:

[root@hal0199 ~]# pgrep -U vm379
[root@hal0199 ~]#

But it's keeping the node partly busy:

[root@hal0199 ~]# scontrol show node hal0199
NodeName=hal0199 Arch=x86_64 CoresPerSocket=26
   CPUAlloc=46 CPUTot=52 CPULoad=0.01
   AvailableFeatures=hal,hdr,cascadelake
   ActiveFeatures=hal,hdr,cascadelake
   Gres=(null)
   NodeAddr=hal0199 NodeHostName=hal0199 Version=18.08
   OS=Linux 3.10.0-1160.42.2.el7.x86_64 #1 SMP Tue Sep 7 14:49:57 UTC 2021
   RealMemory=192000 AllocMem=188416 FreeMem=181010 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=20 Owner=N/A MCS_label=N/A
   Partitions=main,bg,oarc,p_foran
   BootTime=2021-12-21T18:17:35 SlurmdStartTime=2022-02-17T18:26:25
   CfgTRES=cpu=52,mem=187.50G,billing=52
   AllocTRES=cpu=46,mem=184G
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

Is there any way to handle these en-masse? The jobs are hard to tell from still-running jobs. I know they'll time out eventually. I also would have expected them all to fail after 5 mins, since that's what SlurmdTimeout is set to, but apparently not?
(In reply to Ryan Novosielski from comment #25)
> Is there any way to handle these en-masse? The jobs are hard to tell from
> still-running jobs.

Have you tried restarting slurmctld, and if that doesn't work, the slurmd on that node? That may fix the issue.

> I know they'll time out eventually. I also would have expected them all to
> fail after 5 mins, since that's what SlurmdTimeout is set to, but apparently
> not?

Well, I don't think the SlurmdTimeout has anything to do with a job's timeout.
Ryan, setting the node to down should clear those jobs off. I would avoid downing the node with other valid jobs running on that node, or those jobs will be re-queued. > scontrol update nodename=hal0199 state=down reason="Crash recovery"
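To make that recovery repeatable, the down/resume pair can be scripted. This is a dry-run sketch, not something verified on your cluster: the node name and reason string are illustrative, and the `run` wrapper only prints each command (drop the `echo` to actually execute). As noted above, downing the node also requeues any genuinely running jobs, so only use this when everything squeue shows on the node is known to be stale.

```shell
# Dry-run sketch: purge stale jobs from one node by downing it, then
# return it to service. Replace the echo with "$@" to actually run.
NODE=hal0199
run() { echo "$@"; }

run scontrol update nodename="$NODE" state=down reason="Crash recovery"
run scontrol update nodename="$NODE" state=resume
```

The `resume` step matters: a node left in DOWN state will not accept new work even after the stale jobs are cleared.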
>> Is there any way to handle these en-masse? The jobs are hard to tell from
>> still-running jobs.
> Have you tried restarting slurmctld, and if that doesn't work, the slurmd on
> that node? That may fix the issue.

Yeah, both. Neither helps. No big deal if there's no answer; we're not supposed to be in this situation, and I was just a bit surprised. We think the jobs actually completed, but we'd probably have to confer with the user about their job logs to know for sure. They look OK in the slurmd.log. Here's an exit:

[2022-02-17T06:33:33.794] [17632253.batch] error: Unable to send job complete message: Unable to contact slurm controller (connect failure)
[2022-02-17T06:33:33.794] [17632253.batch] done with job
[2022-02-17T13:55:15.709] [17632253.0] Rank 0 sent step completion message directly to slurmctld (0.0.0.0:0)
[2022-02-17T13:55:15.710] [17632253.0] done with job

...this is after the slurmctld crash at ~03:19; it logged connect failures for several hours before it exited.

>> I know they'll time out eventually. I also would have expected them all to
>> fail after 5 mins, since that's what SlurmdTimeout is set to, but apparently
>> not?
> Well, I don't think the SlurmdTimeout has anything to do with a job's timeout.

Oh, you know what, I was remembering that as how long slurmctld can be down before slurmd gets upset about not reaching the controller and kills the jobs on a node, but I guess it's really how long slurmd can be down before the controller marks the node down. Whoops.
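For reference, a hypothetical slurm.conf fragment matching that corrected reading (the 300-second value is only assumed from the "5 mins" mentioned above):

```
# SlurmdTimeout: how long slurmctld tolerates an unresponsive slurmd before
# marking the node DOWN. It does not bound how long a job may run.
SlurmdTimeout=300
```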
Just FYI, you can't apply the first hunk of that patch to 18.08.8. What it patches doesn't appear anywhere in slurm_cred.c, and in fact, there doesn't appear to be anywhere in the entire source code where the thing being patched actually happens (maybe the below references in slurmd.c/req.c/x11_forwarding.c). Again, not expecting that to be fixed, but I primarily bring this up just in case the bug we hit actually isn't the same one/still exists, and it's only the fix for the symptom that applies in both cases. We don't have any "slurm_cred_create failed" error messages in our logfiles.

[root@master src]# grep -r getpwuid
salloc/opt.c:#include <pwd.h>	/* getpwuid */
salloc/opt.c: * NOTE: This function is NOT reentrant (see getpwuid_r if needed) */
salloc/opt.c:	pw_ent_ptr = getpwuid(opt.uid);
sacct/print.c:	if ((pw=getpwuid(job->uid)))
sinfo/print.c:	if ((pw=getpwuid(sinfo_data->reason_uid)))
sinfo/print.c:	if ((pw=getpwuid(sinfo_data->reason_uid)))
sshare/sshare.c:	if ((pwd = getpwuid(getuid()))) {
sshare/sshare.c:	if (!(pwd = getpwuid((uid_t) id))) {
slurmd/slurmd/slurmd.c:	/* since when you do a getpwuid you get a pointer to a
slurmd/slurmd/slurmd.c:	if ((pw = getpwuid(slurmd_uid)))
slurmd/slurmd/slurmd.c:	if ((pw = getpwuid(curr_uid)))
slurmd/slurmd/req.c:	if (slurm_getpwuid_r(req->uid, &pwd, pwd_buf, PW_BUF_SIZE, &pwd_ptr)
slurmd/slurmd/req.c:	error("%s: getpwuid_r(%u):%m", __func__, req->uid);
slurmd/slurmstepd/x11_forwarding.c:	if (slurm_getpwuid_r(uid, &pwd, pwd_buf, PW_BUF_SIZE, &pwd_ptr)
slurmd/slurmstepd/x11_forwarding.c:	error("%s: getpwuid_r(%u):%m", __func__, uid);
sview/node_info.c:	if ((pw=getpwuid(node_ptr->reason_uid)))
sview/front_end_info.c:	if ((pw=getpwuid(front_end_ptr->reason_uid)))
common/uid.h:/* Retry getpwuid_r while return code is EINTR so we always get the
common/uid.h: * info. Return return code of getpwuid_r.
common/uid.h:extern int slurm_getpwuid_r (uid_t uid, struct passwd *pwd, char *buf,
common/proc_args.c:#include <pwd.h>	/* getpwuid */
common/uid.c:extern int slurm_getpwuid_r (uid_t uid, struct passwd *pwd, char *buf,
common/uid.c:	rc = getpwuid_r(uid, pwd, buf, bufsiz, result);
common/uid.c:	if (slurm_getpwuid_r(l, &pwd, buffer, PW_BUF_SIZE, &result) != 0)
common/uid.c:	rc = slurm_getpwuid_r(uid, &pwd, buffer, PW_BUF_SIZE, &result);
common/uid.c:	rc = slurm_getpwuid_r(uid, &pwd, buffer, PW_BUF_SIZE, &result);
slurmctld/partition_mgr.c: * getpwuid_r and getgrgid_r calls should be cached by
slurmctld/partition_mgr.c:	res = getpwuid_r(run_uid, &pwd, buf, buflen, &pwd_result);
strigger/strigger.c:	pw = getpwuid(uid);
Actually, looking more closely at that set of commits, I think maybe you just reversed it: 1/3 looks like it's for the logging, and maybe 3/3 is the consequential patch.

aed9850 Print out error when getpwuid_r() fails
73bf0a0 Avoid segfault in controller when job loses its job resources object
f636c45 Never schedule the last task in a job array twice

(All three were committed on Apr 12, 2021.) The third one was happy to apply, so I'm including that one.
...but it doesn't ultimately build. Anyway, we've only ever hit this situation a single time unpatched, so it's not a major issue for us, especially with the core dump situation remedied and our plans to move off of this version as soon as possible. But let me know if you want anything more from our system, in case we're actually hitting a different bug, since no credential errors were ever logged.

make[5]: Entering directory `/root/rpmbuild/BUILD/slurm-18.08.8/src/plugins/sched/backfill'
/bin/sh ../../../../libtool --tag=CC --mode=compile gcc -DHAVE_CONFIG_H -I. -I../../../.. -I../../../../slurm -I../../../.. -I../../../../src/common -DNUMA_VERSION1_COMPATIBILITY -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -pthread -ggdb3 -Wall -g -O1 -fno-strict-aliasing -c -o backfill_wrapper.lo backfill_wrapper.c
/bin/sh ../../../../libtool --tag=CC --mode=compile gcc -DHAVE_CONFIG_H -I. -I../../../.. -I../../../../slurm -I../../../.. -I../../../../src/common -DNUMA_VERSION1_COMPATIBILITY -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -pthread -ggdb3 -Wall -g -O1 -fno-strict-aliasing -c -o backfill.lo backfill.c
libtool: compile: gcc -DHAVE_CONFIG_H -I. -I../../../.. -I../../../../slurm -I../../../.. -I../../../../src/common -DNUMA_VERSION1_COMPATIBILITY -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -pthread -ggdb3 -Wall -g -O1 -fno-strict-aliasing -c backfill.c -fPIC -DPIC -o .libs/backfill.o
libtool: compile: gcc -DHAVE_CONFIG_H -I. -I../../../.. -I../../../../slurm -I../../../.. -I../../../../src/common -DNUMA_VERSION1_COMPATIBILITY -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -pthread -ggdb3 -Wall -g -O1 -fno-strict-aliasing -c backfill_wrapper.c -fPIC -DPIC -o .libs/backfill_wrapper.o
backfill.c: In function '_attempt_backfill':
backfill.c:2455:6: error: unknown type name 'job_record_t'
      job_record_t *tmp = job_ptr;
      ^
backfill.c:2455:26: warning: initialization from incompatible pointer type [enabled by default]
      job_record_t *tmp = job_ptr;
                          ^
backfill.c:2458:30: warning: comparison of distinct pointer types lacks a cast [enabled by default]
      if (job_ptr && (job_ptr != tmp) &&
                              ^
make[5]: *** [backfill.lo] Error 1
make[5]: *** Waiting for unfinished jobs....
libtool: compile: gcc -DHAVE_CONFIG_H -I. -I../../../.. -I../../../../slurm -I../../../.. -I../../../../src/common -DNUMA_VERSION1_COMPATIBILITY -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -pthread -ggdb3 -Wall -g -O1 -fno-strict-aliasing -c backfill_wrapper.c -o backfill_wrapper.o >/dev/null 2>&1
make[5]: Leaving directory `/root/rpmbuild/BUILD/slurm-18.08.8/src/plugins/sched/backfill'
make[4]: *** [all-recursive] Error 1
make[4]: Leaving directory `/root/rpmbuild/BUILD/slurm-18.08.8/src/plugins/sched'
make[3]: *** [all-recursive] Error 1
make[3]: Leaving directory `/root/rpmbuild/BUILD/slurm-18.08.8/src/plugins'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory `/root/rpmbuild/BUILD/slurm-18.08.8/src'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/root/rpmbuild/BUILD/slurm-18.08.8'
make: *** [all] Error 2
error: Bad exit status from /var/tmp/rpm-tmp.dfdUuv (%build)

RPM build errors:
    Bad exit status from /var/tmp/rpm-tmp.dfdUuv (%build)
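For what it's worth, the unknown-type error suggests the backported hunk uses the `job_record_t` typedef, which later Slurm releases introduced as an alias for `struct job_record`; 18.08 only has the struct spelling. A hedged sketch of adapting the hunk before rebuilding follows (the file path comes from the compiler error above; whether any other symbols in the hunk also differ in 18.08 is untested):

```shell
# Rewrite the newer typedef to the 18.08 spelling in the patched file.
# Assumes job_record_t is the only symbol in the hunk that 18.08 lacks.
SRC=${SRC:-src/plugins/sched/backfill/backfill.c}
if [ -f "$SRC" ]; then
    sed -i 's/\bjob_record_t \*/struct job_record */g' "$SRC"
fi
```

Alternatively, the same edit could be made to the .patch file itself before feeding it to the spec's %patch step.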
(In reply to Ryan Novosielski from comment #31)
> ...but it doesn't ultimately build. Anyway, we've only ever hit this
> situation a single time unpatched, so it's not a major issue for us,
> especially with the core dump situation remedied and our plans to move off
> of this version as soon as possible.

Reducing severity based on this response. I strongly suggest just upgrading and avoiding this entire patching exercise. This issue is fixed in all releases since slurm-20-11-6-1. Please upgrade to the latest patch level of 20.11 or 21.08, as both are currently supported releases.
Ryan, before we move to close this out, I wanted to check on the current status. Also, do you still see the leftover jobs reported in comment #25?
Resolving for now.