Ticket 11567 - slurmctld is dumping core on a job that was stuck in completing phase.
Summary: slurmctld is dumping core on a job that was stuck in completing phase.
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 20.11.5
Hardware: Linux
OS: Linux
Importance: --- 4 - Minor Issue
Assignee: Director of Support
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2021-05-07 16:35 MDT by Geoff
Modified: 2021-08-13 14:42 MDT

See Also:
Site: Johns Hopkins University Applied Physics Laboratory
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Description Geoff 2021-05-07 16:35:22 MDT
We ended up with a job stuck in the completing phase. The scheduler logged that getpwuid() failed for the job's uid and requeued the job:

[2021-05-06T19:01:22.242] sched: Allocate JobId=91357_3420(91357) NodeList=un37 #CPUs=4 Partition=amds-cpu
[2021-05-06T19:01:22.242] error: _fill_cred_gids: getpwuid failed for uid=111653
[2021-05-06T19:01:22.242] error: slurm_cred_create failure for batch job 91357
[2021-05-06T19:01:22.242] error: Can not create job credential, attempting to requeue batch JobId=91357_3420(91357)
[2021-05-06T19:01:22.242] _job_complete: JobId=91357_3420(91357) WEXITSTATUS 0
[2021-05-06T19:01:22.242] _job_complete: requeue JobId=91357_3420(91357) per user/system request
[2021-05-06T19:01:22.242] _job_complete: JobId=91357_3420(91357) done

There was no evidence of the job anywhere on the system, though the user said the job looked like it had run. The job was stuck in the completing state, and scancel would not remove it, yet it returned no errors.
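For reference, the failing lookup can be reproduced outside of slurmctld with a small standalone program along these lines (illustrative only, not part of Slurm; the uid is simply the one from the log above). If this also fails intermittently, the problem is in the name service (nsswitch/sssd/LDAP) rather than in Slurm itself.

/* check_uid.c - hypothetical standalone getpwuid() check */
#include <sys/types.h>
#include <errno.h>
#include <pwd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    /* default to the uid that failed in the slurmctld log */
    uid_t uid = (uid_t) strtoul(argc > 1 ? argv[1] : "111653", NULL, 10);
    struct passwd *pw;

    errno = 0;
    pw = getpwuid(uid);
    if (pw)
        printf("uid %u -> %s\n", (unsigned) uid, pw->pw_name);
    else
        printf("getpwuid(%u) failed: %s\n", (unsigned) uid,
               errno ? strerror(errno) : "no matching passwd entry");
    return pw ? 0 : 1;
}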

# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
amds-cpu*    up   infinite      1   comp un37
<CUT>

#  squeue -lj 91357_3420
Fri May 07 17:07:30 2021
             JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)
        91357_3420  amds-cpu      PSB user0001 COMPLETI       0:00   2:00:00      1 un37

# scontrol show job  91357_3420
JobId=91357 ArrayJobId=91357 ArrayTaskId=3420 JobName=PSB
   UserId=user0001(111653) GroupId=users(100) MCS_label=N/A
   Priority=1314 Nice=0 Account=defacct QOS=normal WCKey=bps15
   JobState=COMPLETING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=1 BatchFlag=2 Reboot=0 ExitCode=0:0
   RunTime=18753-23:01:22 TimeLimit=02:00:00 TimeMin=N/A
   SubmitTime=2021-05-06T09:27:40 EligibleTime=2021-05-06T19:03:22
   AccrueTime=2021-05-06T09:27:40
   StartTime=Unknown EndTime=2021-05-06T19:01:22 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-05-06T19:01:22
   Partition=amds-cpu AllocNode:Sid=amds-usubmit01:28799
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=un37
   BatchHost=un37
   NumNodes=1 NumCPUs=4 NumTasks=1 CPUs/Task=4 ReqB:S:C:T=0:0:*:*
   TRES=cpu=4,mem=16G,node=1,billing=4
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=4 MinMemoryCPU=4G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   <CUT>

When the scheduler is restarted, slurmctld dumps core.

We had a similar issue in the past and were given a patch for 20.11.4 that watches out for a missing job_resrcs_ptr:

     https://bugs.schedmd.com/show_bug.cgi?id=10980

We are now running 20.11.5. I applied that patch to the 20.11.5 source by hand (in case the line numbers did not match 20.11.4), recompiled, and copied the new slurmctld into place.

  # ls -l /no_backup/shared/slurm/slurm-20.11.5/sbin/
  lrwxrwxrwx. 1 root root      15 May  7 17:50 slurmctld -> slurmctld.patch
  -rwxr-xr-x. 1 root root 3906320 Mar 18 07:48 slurmctld.orig
  -rwxr-xr-x. 1 root root 3905712 May  7 17:50 slurmctld.patch
  -rwxr-xr-x. 1 root root 1098744 Mar 18 07:48 slurmd
  -rwxr-xr-x. 1 root root  569112 Mar 18 07:48 slurmdbd
  -rwxr-xr-x. 1 root root 1350920 Mar 18 07:48 slurmstepd

The scheduler is now running.

This ticket started out as a "crit 1" because I thought the patch did not work; I then realized that the slurm-20.11.5/src/slurmctld/.libs/slurmctld file had not actually been rebuilt.

I am going to drop the severity and submit it anyway, in case the bug has not been fully found and the getpwuid error helps in chasing it down.

If this problem is completely fixed in the next Slurm release, this ticket can be closed.

Here is a stack trace from the most recent core file.
# gdb /no_backup/shared/slurm/slurm-current/sbin/slurmctld core.41393 
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-120.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /no_backup/shared/slurm/slurm-20.11.5/sbin/slurmctld.patch...done.

warning: core file may not match specified executable file.
[New LWP 41393]
[New LWP 41396]
[New LWP 41394]
[New LWP 41395]
[New LWP 41398]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/no_backup/shared/slurm/slurm-current/sbin/slurmctld'.
Program terminated with signal 11, Segmentation fault.
#0  _step_dealloc_lps (step_ptr=0x17f6740) at step_mgr.c:2001
2001		xassert(job_resrcs_ptr->cpus);
Missing separate debuginfos, use: debuginfo-install munge-libs-0.5.11-3.el7.x86_64
(gdb) bt
#0  _step_dealloc_lps (step_ptr=0x17f6740) at step_mgr.c:2001
#1  post_job_step (step_ptr=step_ptr@entry=0x17f6740) at step_mgr.c:4644
#2  0x00000000004b79d3 in _internal_step_complete (
    job_ptr=job_ptr@entry=0x17f5480, step_ptr=step_ptr@entry=0x17f6740)
    at step_mgr.c:328
#3  0x00000000004b7a40 in delete_step_records (job_ptr=job_ptr@entry=0x17f5480)
    at step_mgr.c:418
#4  0x000000000046e351 in cleanup_completing (job_ptr=job_ptr@entry=0x17f5480)
    at job_scheduler.c:4958
#5  0x000000000047aaf5 in deallocate_nodes (job_ptr=job_ptr@entry=0x17f5480, 
    timeout=timeout@entry=false, suspended=suspended@entry=false, 
    preempted=preempted@entry=false) at node_scheduler.c:401
#6  0x000000000049c0c0 in _sync_nodes_to_comp_job () at read_config.c:2561
#7  read_slurm_conf (recover=<optimized out>, reconfig=reconfig@entry=false)
    at read_config.c:1385
#8  0x000000000042d0ae in main (argc=<optimized out>, argv=<optimized out>)
    at controller.c:667
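
For context, the crashing line is the xassert() on job_resrcs_ptr->cpus shown above; for this requeued job the job_resources object is gone, so the dereference faults. The bug 10980 v1 patch applied by hand guards that path roughly along these lines (an illustrative sketch only, not the exact diff; type and helper names follow Slurm's src/slurmctld/step_mgr.c):

static void _step_dealloc_lps(step_record_t *step_ptr)
{
    job_record_t *job_ptr = step_ptr->job_ptr;
    job_resources_t *job_resrcs_ptr = job_ptr->job_resrcs;

    /* Sketch of the guard: bail out instead of dereferencing a NULL
     * job_resources pointer when the job has lost its allocation. */
    if (!job_resrcs_ptr) {
        error("%s: job_resrcs is NULL for %pJ, cannot deallocate step",
              __func__, job_ptr);
        return;
    }

    xassert(job_resrcs_ptr->cpus);    /* line 2001 in the trace above */

    /* ... rest of the deallocation unchanged ... */
}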
Comment 1 Michael Hinton 2021-05-07 17:13:56 MDT
Geoff,

Yes, this is the same issue, and it was fixed in 20.11.6 (bug 10980 comment 41). The v1 patch from 10980 prevents the segfault, but it will not prevent the issue completely (i.e., the job will still get requeued and hang in the completing state forever every time getpwuid() fails on the job like that).

The commits that fixed the issue in bug 10980 comment 41 should apply cleanly to 20.11.5, if you want to backport them. A neat trick is to simply add ".patch" to the commit URLs below; each one can then be downloaded directly with wget or curl, giving you a single-commit patch you can apply with `git am`:

* https://github.com/SchedMD/slurm/commit/73bf0a09ba.patch (the 10980 v1 patch you already have)
* https://github.com/SchedMD/slurm/commit/aed98501cc.patch (prints out getpwuid() error to help you debug your system)
* https://github.com/SchedMD/slurm/commit/f636c4562a.patch (the real fix to keep your jobs from completing forever until the next ctld restart)

********************************************************************

I just verified that these patches apply cleanly to 20.11.5. Here's what I did:

$ curl https://github.com/SchedMD/slurm/commit/73bf0a09ba.patch > 1.patch
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1312  100  1312    0     0   5512      0 --:--:-- --:--:-- --:--:--  5512
$ curl https://github.com/SchedMD/slurm/commit/aed98501cc.patch > 2.patch
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   851  100   851    0     0   3017      0 --:--:-- --:--:-- --:--:--  3028
$ curl https://github.com/SchedMD/slurm/commit/f636c4562a.patch > 3.patch
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  2219  100  2219    0     0   6806      0 --:--:-- --:--:-- --:--:--  6806
$ git checkout -b 11567-v1
Switched to a new branch '11567-v1'
$ git am 1.patch
Applying: Avoid segfault in controller when job loses its job resources object
$ git am 2.patch
Applying: Print out error when getpwuid_r() fails
$ git am 3.patch
Applying: Never schedule the last task in a job array twice
$ git log --oneline | head -n 4
21c4d1a269 Never schedule the last task in a job array twice
1d5f3a9957 Print out error when getpwuid_r() fails
dffc9eb5d4 Avoid segfault in controller when job loses its job resources object
97ea81ecb4 Update META for v20.11.5 release.

That should get you up to speed with the fixes. If the problem still occurs after that, then we can investigate it further.

Thanks!
-Michael

P.S. We recommend using Git to download and modify Slurm, but if you are not using Git, you can use `patch -p1` in place of `git am`:

$ cd /path/to/slurm/sourcecode
$ curl https://github.com/SchedMD/slurm/commit/73bf0a09ba.patch | patch -p1
$ curl https://github.com/SchedMD/slurm/commit/aed98501cc.patch | patch -p1
$ curl https://github.com/SchedMD/slurm/commit/f636c4562a.patch | patch -p1

That should change the files in place.
Comment 2 Michael Hinton 2021-08-13 14:41:42 MDT
I'll go ahead and close this out.

Thanks!
-Michael
Comment 3 Michael Hinton 2021-08-13 14:42:52 MDT
Hey Geoff,

I went ahead and closed this ticket. Feel free to reopen it if you have further questions.

Thanks!
-Michael