Ticket 10980 - slurmctld dumping core
Summary: slurmctld dumping core
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 20.11.4
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Director of Support
Duplicates: 13463
 
Reported: 2021-03-01 11:33 MST by Geoff
Modified: 2022-02-21 16:21 MST
CC List: 3 users

Site: Johns Hopkins University Applied Physics Laboratory
Version Fixed: 20.11.6


Attachments
slurm.conf file (2.13 KB, text/plain)
2021-03-01 11:35 MST, Geoff
slurmctld.log file for 2021-03-01 (102.05 KB, text/x-log)
2021-03-01 11:45 MST, Geoff
gdb output for comment 9. (15.56 KB, text/x-matlab)
2021-03-01 14:51 MST, Geoff
slurmctld logs back to Feb 1. (94.76 MB, text/x-log)
2021-03-01 15:19 MST, Geoff
patch v1 (815 bytes, patch)
2021-03-01 15:22 MST, Michael Hinton

Description Geoff 2021-03-01 11:33:32 MST
Our non-airgapped Slurm cluster is failing to start. It was upgraded from 20.02.5 to 20.11.4 a little over a week ago. slurmctld is segfaulting on startup.

(This is not the same cluster as the airgapped one we run that I currently have two tickets open on.)

● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
   Active: failed (Result: core-dump) since Mon 2021-03-01 13:19:11 EST; 27s ago
  Process: 23378 ExecStart=/no_backup/shared/slurm/slurm-current/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 23380 (code=dumped, signal=SEGV)

Mar 01 13:19:06 amds-usched01 systemd[1]: Starting Slurm controller daemon...
Mar 01 13:19:06 amds-usched01 systemd[1]: Can't open PID file /var/run/slurmctld.pid (yet?) after start: No such fil...ectory
Mar 01 13:19:06 amds-usched01 systemd[1]: Started Slurm controller daemon.
Mar 01 13:19:11 amds-usched01 systemd[1]: slurmctld.service: main process exited, code=dumped, status=11/SEGV
Mar 01 13:19:11 amds-usched01 systemd[1]: Unit slurmctld.service entered failed state.
Mar 01 13:19:11 amds-usched01 systemd[1]: slurmctld.service failed.




I set logging to "debug" and restarted and got...

[2021-03-01T13:19:06.464] slurmctld version 20.11.4 started on cluster unclass-slurm
[2021-03-01T13:19:06.482] accounting_storage/slurmdbd: clusteracct_storage_p_register_ctld: Registering slurmctld at port 6817 with slurmdbd
[2021-03-01T13:19:07.167] error: Could not open node state file /no_backup/shared/slurm/slurmstate/node_state: No such file or directory
[2021-03-01T13:19:07.167] error: NOTE: Trying backup state save file. Information may be lost!
[2021-03-01T13:19:07.168] Recovered state of 60 nodes
[2021-03-01T13:19:07.168] error: Could not open job state file /no_backup/shared/slurm/slurmstate/job_state: No such file or directory
[2021-03-01T13:19:07.168] error: NOTE: Trying backup state save file. Jobs may be lost!
[2021-03-01T13:19:07.168] recovered JobId=2386071 StepId=batch
[2021-03-01T13:19:07.168] recovered JobId=2386071 StepId=0
[2021-03-01T13:19:07.169] Recovered JobId=2386071 Assoc=127
[2021-03-01T13:19:07.169] recovered JobId=2386112_2091(2386112) StepId=batch
[2021-03-01T13:19:07.171] Recovered JobId=2386112_2091(2386112) Assoc=130
[2021-03-01T13:19:07.171] recovered JobId=2439713 StepId=batch
[2021-03-01T13:19:07.171] Recovered JobId=2439713 Assoc=72
[2021-03-01T13:19:07.171] recovered JobId=2469885 StepId=batch
[2021-03-01T13:19:07.171] recovered JobId=2469885 StepId=0
[2021-03-01T13:19:07.171] Recovered JobId=2469885 Assoc=84
[2021-03-01T13:19:07.171] recovered JobId=2469886 StepId=batch
[2021-03-01T13:19:07.171] Recovered JobId=2469886 Assoc=72
[2021-03-01T13:19:07.171] recovered JobId=2469887 StepId=batch
[2021-03-01T13:19:07.171] Recovered JobId=2469887 Assoc=72
[2021-03-01T13:19:07.171] Recovered JobId=2469888 Assoc=72
[2021-03-01T13:19:07.171] Recovered information about 7 jobs
[2021-03-01T13:19:07.171] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 2 partitions
[2021-03-01T13:19:07.176] _sync_nodes_to_comp_job: JobId=2386112_2091(2386112) in completing state
[2021-03-01T13:19:07.176] select/cons_tres: job_res_rm_job: plugin still initializing
[2021-03-01T13:19:07.176] cleanup_completing: JobId=2386112_2091(2386112) completion process took 495938 seconds





gdb on the core file is giving...

# gdb /no_backup/shared/slurm/slurm-current/sbin/slurmctld ./core.5028
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-120.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /no_backup/shared/slurm/slurm-20.11.4/sbin/slurmctld...done.
[New LWP 5028]
[New LWP 5029]
[New LWP 5030]
[New LWP 5031]
[New LWP 5035]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/no_backup/shared/slurm/slurm-current/sbin/slurmctld status'.
Program terminated with signal 11, Segmentation fault.
#0  _step_dealloc_lps (step_ptr=0xdc0340) at step_mgr.c:2001
2001	step_mgr.c: No such file or directory.
Missing separate debuginfos, use: debuginfo-install glibc-2.17-323.el7_9.x86_64 munge-libs-0.5.11-3.el7.x86_64 sssd-client-1.16.5-10.el7_9.7.x86_64
(gdb) st
Ambiguous command "st": stack, start, status, step, stepi, stepping, stop, strace.
(gdb) stack
Undefined command: "stack".  Try "help".
(gdb) bt
#0  _step_dealloc_lps (step_ptr=0xdc0340) at step_mgr.c:2001
#1  post_job_step (step_ptr=step_ptr@entry=0xdc0340) at step_mgr.c:4644
#2  0x00000000004b7923 in _internal_step_complete (job_ptr=job_ptr@entry=0xdbee50, step_ptr=step_ptr@entry=0xdc0340)
    at step_mgr.c:328
#3  0x00000000004b7990 in delete_step_records (job_ptr=job_ptr@entry=0xdbee50) at step_mgr.c:418
#4  0x000000000046e311 in cleanup_completing (job_ptr=job_ptr@entry=0xdbee50) at job_scheduler.c:4958
#5  0x000000000047aaa5 in deallocate_nodes (job_ptr=job_ptr@entry=0xdbee50, timeout=timeout@entry=false, 
    suspended=suspended@entry=false, preempted=preempted@entry=false) at node_scheduler.c:401
#6  0x000000000049c010 in _sync_nodes_to_comp_job () at read_config.c:2558
#7  read_slurm_conf (recover=<optimized out>, reconfig=reconfig@entry=false) at read_config.c:1385
#8  0x000000000042d0fe in main (argc=<optimized out>, argv=<optimized out>) at controller.c:667
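
For future cores, a fuller trace can be captured in one pass once the debuginfo packages gdb suggests are installed — a sketch, using the binary and core paths shown above:

    debuginfo-install glibc munge-libs sssd-client    # the packages gdb asks for above
    gdb -batch -ex 'thread apply all bt full' \
        /no_backup/shared/slurm/slurm-current/sbin/slurmctld ./core.5028 > slurmctld-bt-full.txt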
Comment 1 Geoff 2021-03-01 11:35:29 MST
Created attachment 18176 [details]
slurm.conf file

slurm.conf for this cluster.
Comment 2 Geoff 2021-03-01 11:37:16 MST
The wrong "info" got changed to "debug"... doing it again.
Comment 3 Geoff 2021-03-01 11:45:38 MST
Created attachment 18177 [details]
slurmctld.log file for 2021-03-01

slurmctld.log file for  2021-03-01. The last restart was with loglevel debug3.
Comment 4 Geoff 2021-03-01 12:00:42 MST
Of note: the version upgrade happened on Feb 23 around 10am, and the crash happened this morning (the first core dump is dated Mar 1 08:31 EST).

The last few minutes of logs before the first crash are...

[2021-03-01T08:28:20.944] error: The modification time of /no_backup/shared/slurm/slurmstate/job_state moved backwards by 31 seconds
[2021-03-01T08:28:20.944] error: The clock of the file system and this computer appear to not be synchronized
[2021-03-01T08:28:30.072] error: Nodes un1 not responding
[2021-03-01T08:30:33.208] error: Nodes un1 not responding, setting DOWN
[2021-03-01T08:31:02.240] error: job_resources_node_inx_to_cpu_inx: no job_resrcs or node_bitmap
[2021-03-01T08:31:02.241] error: job_update_tres_cnt: problem getting offset of JobId=2386112_2091(2386112)
[2021-03-01T08:31:02.241] cleanup_completing: JobId=2386112_2091(2386112) completion process took 478645 seconds

These lines are included in the attached log file.
Comment 8 Geoff 2021-03-01 12:46:54 MST
Restarting with debug5 doesn't seem to add anything useful, as far as I can tell.


[2021-03-01T14:42:08.360] recovered JobId=2469887 StepId=batch
[2021-03-01T14:42:08.360] debug5: assoc_mgr_fill_in_assoc: looking for assoc of user=(null)(0), acct=(null), cluster=(null), partition=(null)
[2021-03-01T14:42:08.360] debug3: assoc_mgr_fill_in_assoc: found correct association of user=(null)(0), acct=(null), cluster=(null), partition=(null) to assoc=72 acct=defacct
[2021-03-01T14:42:08.360] Recovered JobId=2469887 Assoc=72
[2021-03-01T14:42:08.360] debug3: found correct qos
[2021-03-01T14:42:08.360] debug5: assoc_mgr_fill_in_assoc: looking for assoc of user=(null)(0), acct=(null), cluster=(null), partition=(null)
[2021-03-01T14:42:08.360] debug3: assoc_mgr_fill_in_assoc: found correct association of user=(null)(0), acct=(null), cluster=(null), partition=(null) to assoc=72 acct=defacct
[2021-03-01T14:42:08.360] Recovered JobId=2469888 Assoc=72
[2021-03-01T14:42:08.360] debug3: found correct qos
[2021-03-01T14:42:08.360] debug3: Set job_id_sequence to 2469889
[2021-03-01T14:42:08.360] Recovered information about 7 jobs
[2021-03-01T14:42:08.360] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 2 partitions
[2021-03-01T14:42:08.361] debug3: select/cons_tres: job_res_add_job: JobId=2386071 action:normal
[2021-03-01T14:42:08.361] debug3: select/cons_tres: job_res_add_job: adding JobId=2386071 to part amds-cpu row 0
[2021-03-01T14:42:08.361] debug3: select/cons_tres: job_res_add_job: JobId=2439713 action:normal
[2021-03-01T14:42:08.361] debug3: select/cons_tres: job_res_add_job: adding JobId=2439713 to part amds-cpu row 0
[2021-03-01T14:42:08.361] debug3: select/cons_tres: job_res_add_job: JobId=2469885 action:normal
[2021-03-01T14:42:08.361] debug3: select/cons_tres: job_res_add_job: adding JobId=2469885 to part amds-cpu row 0
[2021-03-01T14:42:08.361] debug3: select/cons_tres: job_res_add_job: JobId=2469886 action:normal
[2021-03-01T14:42:08.361] debug3: select/cons_tres: job_res_add_job: adding JobId=2469886 to part amds-cpu row 0
[2021-03-01T14:42:08.361] debug3: select/cons_tres: job_res_add_job: JobId=2469887 action:normal
[2021-03-01T14:42:08.361] debug3: select/cons_tres: job_res_add_job: adding JobId=2469887 to part amds-cpu row 0
[2021-03-01T14:42:08.362] debug3: Found batch directory for JobId=2469885
[2021-03-01T14:42:08.362] debug3: Found batch directory for JobId=2386112
[2021-03-01T14:42:08.362] debug3: Found batch directory for JobId=2469888
[2021-03-01T14:42:08.362] debug3: Found batch directory for JobId=2469886
[2021-03-01T14:42:08.363] debug3: Found batch directory for JobId=2386071
[2021-03-01T14:42:08.363] debug3: Found batch directory for JobId=2439713
[2021-03-01T14:42:08.363] debug3: Found batch directory for JobId=2469887
[2021-03-01T14:42:08.364] _sync_nodes_to_comp_job: JobId=2386112_2091(2386112) in completing state
[2021-03-01T14:42:08.364] debug2: acct_policy_job_begin: after adding JobId=2386112_2091(2386112), qos amds-cpu grp_used_tres_run_secs(cpu) is 900
[2021-03-01T14:42:08.364] debug2: acct_policy_job_begin: after adding JobId=2386112_2091(2386112), qos amds-cpu grp_used_tres_run_secs(mem) is 3686400
[2021-03-01T14:42:08.364] debug2: acct_policy_job_begin: after adding JobId=2386112_2091(2386112), qos amds-cpu grp_used_tres_run_secs(node) is 900
[2021-03-01T14:42:08.364] debug2: acct_policy_job_begin: after adding JobId=2386112_2091(2386112), qos amds-cpu grp_used_tres_run_secs(billing) is 900
[2021-03-01T14:42:08.364] debug2: acct_policy_job_begin: after adding JobId=2386112_2091(2386112), qos amds-cpu grp_used_tres_run_secs(fs/disk) is 0
[2021-03-01T14:42:08.364] debug2: acct_policy_job_begin: after adding JobId=2386112_2091(2386112), qos amds-cpu grp_used_tres_run_secs(vmem) is 0
[2021-03-01T14:42:08.364] debug2: acct_policy_job_begin: after adding JobId=2386112_2091(2386112), qos amds-cpu grp_used_tres_run_secs(pages) is 0
[2021-03-01T14:42:08.364] debug2: acct_policy_job_begin: after adding JobId=2386112_2091(2386112), assoc 130(defacct/schafjd2/(null)) grp_used_tres_run_secs(cpu) is 900
[2021-03-01T14:42:08.364] debug2: acct_policy_job_begin: after adding JobId=2386112_2091(2386112), assoc 130(defacct/schafjd2/(null)) grp_used_tres_run_secs(mem) is 3686400
[2021-03-01T14:42:08.364] debug2: acct_policy_job_begin: after adding JobId=2386112_2091(2386112), assoc 130(defacct/schafjd2/(null)) grp_used_tres_run_secs(node) is 900
[2021-03-01T14:42:08.364] debug2: acct_policy_job_begin: after adding JobId=2386112_2091(2386112), assoc 130(defacct/schafjd2/(null)) grp_used_tres_run_secs(billing) is 900
[2021-03-01T14:42:08.364] debug2: acct_policy_job_begin: after adding JobId=2386112_2091(2386112), assoc 130(defacct/schafjd2/(null)) grp_used_tres_run_secs(fs/disk) is 0
[2021-03-01T14:42:08.364] debug2: acct_policy_job_begin: after adding JobId=2386112_2091(2386112), assoc 130(defacct/schafjd2/(null)) grp_used_tres_run_secs(vmem) is 0
[2021-03-01T14:42:08.364] debug2: acct_policy_job_begin: after adding JobId=2386112_2091(2386112), assoc 130(defacct/schafjd2/(null)) grp_used_tres_run_secs(pages) is 0
[2021-03-01T14:42:08.364] debug2: acct_policy_job_begin: after adding JobId=2386112_2091(2386112), assoc 5(defacct/(null)/(null)) grp_used_tres_run_secs(cpu) is 900
[2021-03-01T14:42:08.364] debug2: acct_policy_job_begin: after adding JobId=2386112_2091(2386112), assoc 5(defacct/(null)/(null)) grp_used_tres_run_secs(mem) is 3686400
[2021-03-01T14:42:08.364] debug2: acct_policy_job_begin: after adding JobId=2386112_2091(2386112), assoc 5(defacct/(null)/(null)) grp_used_tres_run_secs(node) is 900
[2021-03-01T14:42:08.364] debug2: acct_policy_job_begin: after adding JobId=2386112_2091(2386112), assoc 5(defacct/(null)/(null)) grp_used_tres_run_secs(billing) is 900
[2021-03-01T14:42:08.364] debug2: acct_policy_job_begin: after adding JobId=2386112_2091(2386112), assoc 5(defacct/(null)/(null)) grp_used_tres_run_secs(fs/disk) is 0
[2021-03-01T14:42:08.364] debug2: acct_policy_job_begin: after adding JobId=2386112_2091(2386112), assoc 5(defacct/(null)/(null)) grp_used_tres_run_secs(vmem) is 0
[2021-03-01T14:42:08.364] debug2: acct_policy_job_begin: after adding JobId=2386112_2091(2386112), assoc 5(defacct/(null)/(null)) grp_used_tres_run_secs(pages) is 0
[2021-03-01T14:42:08.364] debug2: acct_policy_job_begin: after adding JobId=2386112_2091(2386112), assoc 1(root/(null)/(null)) grp_used_tres_run_secs(cpu) is 900
[2021-03-01T14:42:08.364] debug2: acct_policy_job_begin: after adding JobId=2386112_2091(2386112), assoc 1(root/(null)/(null)) grp_used_tres_run_secs(mem) is 3686400
[2021-03-01T14:42:08.365] debug2: acct_policy_job_begin: after adding JobId=2386112_2091(2386112), assoc 1(root/(null)/(null)) grp_used_tres_run_secs(node) is 900
[2021-03-01T14:42:08.365] debug2: acct_policy_job_begin: after adding JobId=2386112_2091(2386112), assoc 1(root/(null)/(null)) grp_used_tres_run_secs(billing) is 900
[2021-03-01T14:42:08.365] debug2: acct_policy_job_begin: after adding JobId=2386112_2091(2386112), assoc 1(root/(null)/(null)) grp_used_tres_run_secs(fs/disk) is 0
[2021-03-01T14:42:08.365] debug2: acct_policy_job_begin: after adding JobId=2386112_2091(2386112), assoc 1(root/(null)/(null)) grp_used_tres_run_secs(vmem) is 0
[2021-03-01T14:42:08.365] debug2: acct_policy_job_begin: after adding JobId=2386112_2091(2386112), assoc 1(root/(null)/(null)) grp_used_tres_run_secs(pages) is 0
[2021-03-01T14:42:08.365] debug2: We have already ran the job_fini for JobId=2386112_2091(2386112)
[2021-03-01T14:42:08.365] select/cons_tres: job_res_rm_job: plugin still initializing
[2021-03-01T14:42:08.365] cleanup_completing: JobId=2386112_2091(2386112) completion process took 500919 seconds
Comment 9 Michael Hinton 2021-03-01 14:28:04 MST
Hi Geoff,

Could you run the following commands in GDB on that core file?

p *step_ptr
p *step_ptr->job_ptr
p *step_ptr->job_ptr->job_resrcs

f 2
p *job_ptr
p *job_ptr->job_resrcs
p *step_ptr
p *step_ptr->job_ptr
p *step_ptr->job_ptr->job_resrcs
p *step_ptr->job_ptr->step_list
p *step_ptr->job_ptr->step_list->head
p *step_ptr->job_ptr->step_list->tail
p *step_ptr->job_ptr->step_list->tail_ptr
p *((step_record_t *)step_ptr->job_ptr->step_list->head->data)
p *((step_record_t *)step_ptr->job_ptr->step_list->tail_ptr->data)

Thanks,
-Michael
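
Those prints can also be captured non-interactively and attached as a file — a sketch assuming the binary and core paths from the description; if one print aborts on a NULL pointer, run the remaining ones interactively:

    gdb -batch \
        -ex 'p *step_ptr' -ex 'p *step_ptr->job_ptr' -ex 'p *step_ptr->job_ptr->job_resrcs' \
        -ex 'f 2' -ex 'p *job_ptr' -ex 'p *job_ptr->job_resrcs' \
        -ex 'p *((step_record_t *)step_ptr->job_ptr->step_list->head->data)' \
        /no_backup/shared/slurm/slurm-current/sbin/slurmctld ./core.5028 > gdb-comment9.txt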
Comment 10 Geoff 2021-03-01 14:51:03 MST
Created attachment 18182 [details]
gdb output for comment 9.

gdb output per comment 9
Comment 11 Michael Hinton 2021-03-01 15:08:12 MST
Could you attach all the slurmctld logs associated with job 2386112? It looks like it was supposed to finish around 6 days ago, maybe near when you upgraded. At least, the error logs saying its "completion process took X seconds" indicate that the job ended X seconds ago, which puts that right about when you said you upgraded. Was this job running during your upgrade?

Let me also work on a patch to get your controller up and running.

-Michael
Comment 12 Geoff 2021-03-01 15:19:57 MST
Created attachment 18183 [details]
slurmctld logs back to Feb 1.

Slurmctld logs back to Feb 1 per comment 11
Comment 13 Michael Hinton 2021-03-01 15:22:27 MST
Created attachment 18184 [details]
patch v1

Please apply patch v1 and restart your controller. It should complete the job, avoid the segfault, and get your controller up and running.
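
For a from-source install like this one, applying it looks roughly like the following — a sketch only; the build-tree path and patch filename are placeholders, and the -p level depends on how the diff was generated:

    cd /path/to/slurm-20.11.4-build-tree        # placeholder: the tree this install was built from
    patch -p1 < patch_v1.diff                   # attachment 18184 saved locally; adjust -p if needed
    make -j && make install                     # reuses the existing ./configure prefix
    systemctl restart slurmctld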

I think the nodes the job was on will still show as being used, though. So after you get things up and running, restart slurmctld again so that the nodes the job was running on "reset" properly.

Go ahead and do this first and let me know how it goes. Then check the logs from 6 days ago for the job.

-Michael
Comment 14 Geoff 2021-03-01 15:36:19 MST
     Could you attach all the slurmctld logs associated with job 2386112?

Attached all slurmctld logs back to Feb 1.


     Was this job running during your upgrade?

I did not do the upgrade on this cluster, so I am not sure; I normally deal with our airgapped cluster. It looks like the changeover happened shortly after 10:10 AM EST on Feb 23. Feb 23 should have had a maintenance reservation for the monthly patch-and-reboot process, and the logs imply it was removed after 20.11.4 was started.

lrwxrwxrwx.  1 root  root      13 Feb 23 10:10 slurm-current -> slurm-20.11.4

[2021-02-23T10:23:25.593] slurmctld version 20.11.4 started on cluster unclass-slurm
[2021-02-23T10:23:25.693] Recovered state of reservation root_7
[2021-02-23T10:23:25.694] sched: Updated reservation=root_7 users=root nodes=un[1-60] cores=1440 licenses=(null) tres=cpu=1440 watts=4294967294 start=2021-02-22T23:59:00 end=2021-02-23T23:59:00 MaxStartDelay=
[2021-02-23T10:23:25.694] Recovered state of 1 reservations
<CUT>
[2021-02-23T10:27:54.654] _slurm_rpc_delete_reservation complete for root_7 usec=37
Comment 15 Geoff 2021-03-01 15:56:36 MST

      [2021-02-23T14:28:38.002] _slurm_rpc_submit_batch_job: JobId=2386112 InitPrio=4118 usec=16439

It looks like the job was submitted 4 hours after the upgrade.
Comment 16 Michael Hinton 2021-03-01 16:02:46 MST
Just to clarify, is the controller still down, making the system unusable? That would qualify as a sev 1 for us.
Comment 17 Geoff 2021-03-01 16:36:19 MST
Yes, the scheduler will not start and the cluster has been down since 8:30am.

My coworker has just finished recompiling the scheduler and it looks like it is running again. I am checking to see if the listed running jobs are actually running jobs. Is there a quick way to check for this?
Comment 18 Geoff 2021-03-01 16:43:56 MST
It looks like the jobid in question was on node "un1".
That node is showing up as idle in sinfo and I was able to do an "srun -c 24 -w un1 --pty /bin/bash" and get on the system.

Is there anything in particular that we should watch out for?

Was this a legitimate bug in 20.11.4 or did our scheduler do something goofy and this patch worked around a failure case that should not have existed?
Comment 19 Michael Hinton 2021-03-01 17:43:56 MST
(In reply to Geoff from comment #17)
> Yes, the scheduler will not start and the cluster has been down since 8:30am.
Ok. Feel free to set the severity to "1 - System not usable" for this kind of ticket in the future (assuming this is a production cluster rather than a test/dev cluster).

> My coworker has just finished recompiling the scheduler and it looks like it
> is running again.
Excellent! I'll go ahead and reduce the severity to 3 now that things are running.

> I am checking to see if the listed running jobs are
> actually running jobs. Is there a quick way to check for this?
What do you mean by "listed running jobs are actually running jobs"? Are you asking whether there are actual running processes associated with the running job, or whether it is just a shell of a job with no tasks actually running?

(In reply to Geoff from comment #18)
> It looks like the jobid in question was on node "un1".
> That node is showing up as idle in sinfo and I was able to do an "srun -c 24
> -w un1 --pty /bin/bash" and get on the system.
Ok, great.

> Is there anything in particular that we should watch out for?
Can you let us know if you ever see this message in slurmctld.log again?

    select/cons_tres: job_res_rm_job: plugin still initializing

We need to look into the code some more, but I don't think this should ever happen.

I would also investigate this error:

error: The modification time of /no_backup/shared/slurm/slurmstate/job_state moved backwards by 31 seconds
error: The clock of the file system and this computer appear to not be synchronized
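
A quick way to check for that skew is to compare the controller's clock against a timestamp written to the state-save filesystem, and to confirm NTP sync on both ends — a sketch (the path is this site's; chrony is an assumption):

    date
    touch /no_backup/shared/slurm/slurmstate/.clocktest && stat -c '%y' /no_backup/shared/slurm/slurmstate/.clocktest
    chronyc tracking    # or timedatectl; check both the controller and the file server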

Finally, the job in question was part of an array job that started getting these errors halfway through:

[2021-02-23T19:33:29.421] error: _fill_cred_gids: getpwuid failed for uid=108569
[2021-02-23T19:33:29.421] error: slurm_cred_create failure for batch job 2388140
[2021-02-23T19:33:29.421] error: Can not create job credential, attempting to requeue batch JobId=2386112_2027(2388140)

I would keep an eye out for this. Maybe this caused the issue somehow.
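
That getpwuid() failure means the uid could not be resolved to a user on the controller at that moment. A quick check if it recurs (a sketch, using the uid from the log lines above):

    getent passwd 108569     # empty output reproduces the failed lookup
    systemctl status sssd    # sssd-client appears in the gdb debuginfo list above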

> Was this a legitimate bug in 20.11.4 or did our scheduler do something goofy
> and this patch worked around a failure case that should not have existed?
The latter. The job somehow lost its job resources object, which is something we always expect to be there. This has happened, albeit rarely, with other customers (see bug 7757, which is where I derived the patch from). Apparently, though, we have not yet found the root cause or been able to reproduce it.

It is weird, however, that we have not yet incorporated the patch I gave you into the code, especially since this has happened before. I'll look into it.

Thanks!
-Michael
Comment 25 Geoff 2021-03-03 13:03:01 MST
> >  What do you mean by "listed running jobs are actually running jobs"? 

The bit about the node showing as used in comment 13 had me thinking that all nodes running jobs at the time of crash may be left in a busy state.
Comment 26 Michael Hinton 2021-03-03 13:08:54 MST
(In reply to Geoff from comment #25)
> > >  What do you mean by "listed running jobs are actually running jobs"? 
> 
> The bit about the node showing as used in comment 13 had me thinking that
> all nodes running jobs at the time of crash may be left in a busy state.
What I meant was that in my testing, squeue showed no running jobs, but the node still showed as mixed or allocated. It looks like that didn't happen for you, though.
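
One way to cross-check the controller's view against a node itself — a sketch, not an official procedure:

    squeue -t RUNNING -o '%A %u %N'                     # what slurmctld thinks is running
    scontrol show node un1 | grep -E 'State|Alloc'      # node state and allocations
    ssh un1 'ps -eo user,pid,cmd | grep [s]lurmstepd'   # step daemons actually present on the node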
Comment 32 Geoff 2021-03-24 14:55:54 MDT
Since our problem is currently fixed and we don't have a way to figure out how the job got corrupted, this ticket can be closed.

I noticed that this patch did not make it into 20.11.5. Was some other fix applied that handles step_ptr->job_ptr being NULL or did this not make it in time for this version release?
Comment 33 Michael Hinton 2021-03-24 15:27:12 MDT
(In reply to Geoff from comment #32)
> Since our problem is currently fixed and we don't have a way to figure out
> how the job got corrupted this ticket can be closed.
I was actually able to reproduce the issue recently. It happens when a job fails to launch with these errors:

[2021-02-23T19:33:29.421] error: _fill_cred_gids: getpwuid failed for uid=108569
[2021-02-23T19:33:29.421] error: slurm_cred_create failure for batch job 2388140
[2021-02-23T19:33:29.421] error: Can not create job credential, attempting to requeue batch JobId=2386112_2027(2388140)

If the job is a job array, then the last job array task will stay stuck in a completing state forever. If the slurmctld is restarted while the job is completing like that, it will trigger the issue.

> I noticed that this patch did not make it into 20.11.5. Was some other fix
> applied that handles step_ptr->job_ptr being NULL or did this not make it in
> time for this version release?
We have not yet pushed the fix because we want to fix the issue at the source rather than just treating symptoms. I hope to be able to get a true fix in the coming weeks. In the meantime, though, v1 should protect you in case you happen to trigger it again. 

-Michael
Comment 41 Michael Hinton 2021-04-12 13:53:17 MDT
Geoff,

The fixes are now in the code and will land in 20.11.6. You can take a look at the commits here: https://github.com/SchedMD/slurm/compare/82ce105e18c1...f636c4562a.

Commit https://github.com/SchedMD/slurm/commit/f636c4562a actually fixes the underlying issue. Basically, the scheduler was erroneously attempting to schedule the last task of a job array multiple times. The last task would get scheduled, but fail to launch due to a getpwuid() failure. This launch failure would then trigger an automatic requeue of that task and make it and its candidate nodes go into COMPLETING. This process would repeat in quick succession, each time making another candidate node go into COMPLETING, until there were no more IDLE nodes left.

Because there were no IDLE nodes left, the final scheduling attempt would fail, and the scheduler would free the node_bitmap and job_resrcs objects (as it does normally when it fails to schedule a pending job that hasn't run yet). This caused 1) the last task and all potential candidate nodes to stay COMPLETING forever until the ctld was restarted, and 2) the ctld to segfault on startup.

Commit https://github.com/SchedMD/slurm/commit/73bf0a09ba is just the fix that I provided to you in v1, and is a fail safe that prevents the ctld from segfaulting on start up in case the job_resrcs object ever happens to be NULL again. You should never hit that, though, so please open up a bug if you do.

Commit https://github.com/SchedMD/slurm/commit/aed98501cc simply prints out the actual error message that getpwuid() returns. That should help you debug your system a bit better.
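
To see which release tags end up carrying these commits, the upstream repository can be queried directly — a sketch:

    git clone https://github.com/SchedMD/slurm.git && cd slurm
    git tag --contains f636c4562a                   # release tags that include the main fix
    git log --oneline 82ce105e18c1...f636c4562a     # the commit range linked above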

Thanks!
-Michael
Comment 43 Michael Hinton 2022-02-17 12:28:04 MST
*** Ticket 13463 has been marked as a duplicate of this ticket. ***
Comment 44 Michael Hinton 2022-02-21 16:21:37 MST
*** Ticket 13463 has been marked as a duplicate of this ticket. ***