We ended up with a job stuck in the completing phase. The scheduler log showed that getpwuid failed and the job was requeued:

[2021-05-06T19:01:22.242] sched: Allocate JobId=91357_3420(91357) NodeList=un37 #CPUs=4 Partition=amds-cpu
[2021-05-06T19:01:22.242] error: _fill_cred_gids: getpwuid failed for uid=111653
[2021-05-06T19:01:22.242] error: slurm_cred_create failure for batch job 91357
[2021-05-06T19:01:22.242] error: Can not create job credential, attempting to requeue batch JobId=91357_3420(91357)
[2021-05-06T19:01:22.242] _job_complete: JobId=91357_3420(91357) WEXITSTATUS 0
[2021-05-06T19:01:22.242] _job_complete: requeue JobId=91357_3420(91357) per user/system request
[2021-05-06T19:01:22.242] _job_complete: JobId=91357_3420(91357) done

There was no evidence of the job on the system, although the user said it looked like the job ran. It was stuck in the COMPLETING state; scancel would not remove it, but also returned no errors.

# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
amds-cpu*    up   infinite      1   comp un37
<CUT>

# squeue -lj 91357_3420
Fri May 07 17:07:30 2021
     JOBID PARTITION  NAME     USER    STATE  TIME TIME_LIMI NODES NODELIST(REASON)
91357_3420  amds-cpu   PSB user0001 COMPLETI  0:00   2:00:00     1 un37

# scontrol show job 91357_3420
JobId=91357 ArrayJobId=91357 ArrayTaskId=3420 JobName=PSB
   UserId=user0001(111653) GroupId=users(100) MCS_label=N/A
   Priority=1314 Nice=0 Account=defacct QOS=normal WCKey=bps15
   JobState=COMPLETING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=1 BatchFlag=2 Reboot=0 ExitCode=0:0
   RunTime=18753-23:01:22 TimeLimit=02:00:00 TimeMin=N/A
   SubmitTime=2021-05-06T09:27:40 EligibleTime=2021-05-06T19:03:22
   AccrueTime=2021-05-06T09:27:40
   StartTime=Unknown EndTime=2021-05-06T19:01:22 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-05-06T19:01:22
   Partition=amds-cpu AllocNode:Sid=amds-usubmit01:28799
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=un37
   BatchHost=un37
   NumNodes=1 NumCPUs=4 NumTasks=1 CPUs/Task=4 ReqB:S:C:T=0:0:*:*
   TRES=cpu=4,mem=16G,node=1,billing=4
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=4 MinMemoryCPU=4G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
<CUT>

On scheduler restart, slurmctld is dumping core. We had a similar issue in the past and were given a patch for 20.11.4 to watch out for a missing job_resrcs_ptr: https://bugs.schedmd.com/show_bug.cgi?id=10980

We are now running 20.11.5. I applied that patch to 20.11.5 by hand (in case the line numbers did not match 20.11.4), recompiled, and copied slurmctld into place.

# ls -l /no_backup/shared/slurm/slurm-20.11.5/sbin/
lrwxrwxrwx. 1 root root      15 May  7 17:50 slurmctld -> slurmctld.patch
-rwxr-xr-x. 1 root root 3906320 Mar 18 07:48 slurmctld.orig
-rwxr-xr-x. 1 root root 3905712 May  7 17:50 slurmctld.patch
-rwxr-xr-x. 1 root root 1098744 Mar 18 07:48 slurmd
-rwxr-xr-x. 1 root root  569112 Mar 18 07:48 slurmdbd
-rwxr-xr-x. 1 root root 1350920 Mar 18 07:48 slurmstepd

The scheduler is now running. This ticket was a "crit 1" before I realized that the slurm-20.11.5/src/slurmctld/.libs/slurmctld file had not rebuilt, which made me think the patch did not work. I am going to drop the severity and submit the ticket anyway, in case the bug has not been fully found and the getpwuid error helps in chasing it down. If this problem is completely fixed in the next Slurm release, this ticket can be closed.

Here is a stack trace from the last core file.

# gdb /no_backup/shared/slurm/slurm-current/sbin/slurmctld core.41393
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-120.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /no_backup/shared/slurm/slurm-20.11.5/sbin/slurmctld.patch...done.
warning: core file may not match specified executable file.
[New LWP 41393]
[New LWP 41396]
[New LWP 41394]
[New LWP 41395]
[New LWP 41398]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/no_backup/shared/slurm/slurm-current/sbin/slurmctld'.
Program terminated with signal 11, Segmentation fault.
#0  _step_dealloc_lps (step_ptr=0x17f6740) at step_mgr.c:2001
2001            xassert(job_resrcs_ptr->cpus);
Missing separate debuginfos, use: debuginfo-install munge-libs-0.5.11-3.el7.x86_64
(gdb) bt
#0  _step_dealloc_lps (step_ptr=0x17f6740) at step_mgr.c:2001
#1  post_job_step (step_ptr=step_ptr@entry=0x17f6740) at step_mgr.c:4644
#2  0x00000000004b79d3 in _internal_step_complete (job_ptr=job_ptr@entry=0x17f5480,
    step_ptr=step_ptr@entry=0x17f6740) at step_mgr.c:328
#3  0x00000000004b7a40 in delete_step_records (job_ptr=job_ptr@entry=0x17f5480) at step_mgr.c:418
#4  0x000000000046e351 in cleanup_completing (job_ptr=job_ptr@entry=0x17f5480) at job_scheduler.c:4958
#5  0x000000000047aaf5 in deallocate_nodes (job_ptr=job_ptr@entry=0x17f5480,
    timeout=timeout@entry=false, suspended=suspended@entry=false,
    preempted=preempted@entry=false) at node_scheduler.c:401
#6  0x000000000049c0c0 in _sync_nodes_to_comp_job () at read_config.c:2561
#7  read_slurm_conf (recover=<optimized out>, reconfig=reconfig@entry=false) at read_config.c:1385
#8  0x000000000042d0ae in main (argc=<optimized out>, argv=<optimized out>) at controller.c:667
Geoff,

Yes, this is the same issue; it was fixed in 20.11.6 (bug 10980 comment 41). The v1 patch from 10980 prevents the segfault, but does not prevent the issue completely (i.e. the job will still get requeued and hang in a COMPLETING state forever every time getpwuid() fails on a job like that).

The commits that fixed the issue in bug 10980 comment 41 should apply cleanly to 20.11.5, if you want to backport them. A neat trick is to simply add ".patch" to the commit URLs; each can then be downloaded directly with wget or curl, giving you a single-commit patch you can apply with `git am`:

* https://github.com/SchedMD/slurm/commit/73bf0a09ba.patch (the 10980 v1 patch you already have)
* https://github.com/SchedMD/slurm/commit/aed98501cc.patch (prints the getpwuid() error to help you debug your system)
* https://github.com/SchedMD/slurm/commit/f636c4562a.patch (the real fix, which keeps your jobs from hanging in COMPLETING until the next ctld restart)

********************************************************************

I just verified that these patches apply cleanly to 20.11.5.
Here's what I did:

$ curl https://github.com/SchedMD/slurm/commit/73bf0a09ba.patch > 1.patch
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1312  100  1312    0     0   5512      0 --:--:-- --:--:-- --:--:--  5512
$ curl https://github.com/SchedMD/slurm/commit/aed98501cc.patch > 2.patch
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   851  100   851    0     0   3017      0 --:--:-- --:--:-- --:--:--  3028
$ curl https://github.com/SchedMD/slurm/commit/f636c4562a.patch > 3.patch
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  2219  100  2219    0     0   6806      0 --:--:-- --:--:-- --:--:--  6806
$ git checkout -b 11567-v1
Switched to a new branch '11567-v1'
$ git am 1.patch
Applying: Avoid segfault in controller when job loses its job resources object
$ git am 2.patch
Applying: Print out error when getpwuid_r() fails
$ git am 3.patch
Applying: Never schedule the last task in a job array twice
$ git log --oneline | head -n 4
21c4d1a269 Never schedule the last task in a job array twice
1d5f3a9957 Print out error when getpwuid_r() fails
dffc9eb5d4 Avoid segfault in controller when job loses its job resources object
97ea81ecb4 Update META for v20.11.5 release.

That should get you up to speed with the fixes. If the problem still occurs after that, we can investigate further.

Thanks!
-Michael

P.S. We recommend using Git to download and modify Slurm, but if not, you can use `patch -p1` in place of `git am`:

$ cd /path/to/slurm/sourcecode
$ curl https://github.com/SchedMD/slurm/commit/73bf0a09ba.patch | patch -p1
$ curl https://github.com/SchedMD/slurm/commit/aed98501cc.patch | patch -p1
$ curl https://github.com/SchedMD/slurm/commit/f636c4562a.patch | patch -p1

That should change the files in place.
I'll go ahead and close this out.

Thanks!
-Michael
Hey Geoff,

I went ahead and closed this ticket. Feel free to reopen it if you have further questions.

Thanks!
-Michael