Ticket 18482 - Node keeps changing state to "down" - issue with job_container/tmpfs cleanup (_delete_ns: umount2 PATH failed: Invalid argument)
Status: RESOLVED DUPLICATE of ticket 14284
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: - Unsupported Older Versions
Hardware: Linux
Priority: ---
Severity: 4 - Minor Issue
Assignee: Marcin Stolarek
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2023-12-14 12:27 MST by Matt Kasemer
Modified: 2024-01-02 08:08 MST

See Also:
Site: University of Alabama
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: Ubuntu
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
Slurm configuration file (5.26 KB, text/plain)
2023-12-14 12:27 MST, Matt Kasemer

Description Matt Kasemer 2023-12-14 12:27:28 MST
Created attachment 33723
Slurm configuration file

Running a system with one head node, 16 compute nodes (all homogeneous, named cn1 - cn16), and a NAS with NFSv4 shared folders, on Slurm 21.08.5. I have an issue where one node (cn4) consistently changes state to down; jobs run on all other nodes (seemingly) without error, to completion. The slurmd daemon remains active, however. I've copied log outputs below, as well as cgroup.conf and job_container.conf, and I have attached slurm.conf.

Here is the slurmctld log from the head node around the time a job goes down:

[2023-12-14T11:58:50.646] validate_node_specs: Node cn4 unexpectedly rebooted boot_time=1702576716 last response=1702576597
[2023-12-14T11:58:50.646] requeue job JobId=224 due to failure of node cn4
...
[2023-12-14T12:00:01.486] cleanup_completing: JobId=224 completion process took 71 seconds
[2023-12-14T12:00:01.486] Requeuing JobId=224


Here is the slurmd log from cn4:

[2023-12-14T11:03:20.938] [224.extern] debug2: setup for a launch_task
[2023-12-14T11:03:20.938] [224.extern] debug2: hwloc_topology_init
[2023-12-14T11:03:20.943] [224.extern] debug2: xcpuinfo_hwloc_topo_load: xml file (/var/spool/slurmd/hwloc_topo_whole.xml) found
[2023-12-14T11:03:20.944] [224.extern] debug:  CPUs:64 Boards:1 Sockets:1 CoresPerSocket:64 ThreadsPerCore:1
[2023-12-14T11:03:20.945] [224.extern] debug:  cgroup/v1: init: Cgroup v1 plugin loaded
[2023-12-14T11:03:20.949] [224.extern] debug:  jobacct_gather/cgroup: init: Job accounting gather cgroup plugin loaded
[2023-12-14T11:03:20.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:03:20.950] [224.extern] debug:  Message thread started pid = 2785
[2023-12-14T11:03:20.954] [224.extern] task/affinity: init: task affinity plugin loaded with CPU mask 0xffffffffffffffff
[2023-12-14T11:03:20.959] [224.extern] debug:  task/cgroup: init: core enforcement enabled
[2023-12-14T11:03:20.959] [224.extern] debug:  task/cgroup: task_cgroup_memory_init: task/cgroup/memory: total:257579M allowed:100%(enforced), swap:0%(permissive), max:100%(257579M) max+swap:100%(515158M) min:30M kmem:100%(257579M permissive) min:30M swappiness:0(unset)
[2023-12-14T11:03:20.959] [224.extern] debug:  task/cgroup: init: memory enforcement enabled
[2023-12-14T11:03:20.961] [224.extern] debug:  task/cgroup: task_cgroup_devices_init: unable to open /etc/slurm/cgroup_allowed_devices_file.conf: No such file or directory
[2023-12-14T11:03:20.961] [224.extern] debug:  task/cgroup: init: device enforcement enabled
[2023-12-14T11:03:20.961] [224.extern] debug:  task/cgroup: init: Tasks containment cgroup plugin loaded
[2023-12-14T11:03:20.961] [224.extern] cred/munge: init: Munge credential signature plugin loaded
[2023-12-14T11:03:20.961] [224.extern] debug:  job_container/tmpfs: init: job_container tmpfs plugin loaded
[2023-12-14T11:03:20.961] [224.extern] debug:  job_container/tmpfs: _read_slurm_jc_conf: Reading job_container.conf file /etc/slurm/job_container.conf
[2023-12-14T11:03:20.965] [224.extern] debug2: _spawn_job_container: Before call to spank_init()
[2023-12-14T11:03:20.965] [224.extern] debug:  spank: opening plugin stack /etc/slurm/plugstack.conf
[2023-12-14T11:03:20.965] [224.extern] debug:  /etc/slurm/plugstack.conf: 1: include "/etc/slurm/plugstack.conf.d/*.conf"
[2023-12-14T11:03:20.965] [224.extern] debug2: _spawn_job_container: After call to spank_init()
[2023-12-14T11:03:20.997] [224.extern] debug:  task/cgroup: task_cgroup_cpuset_create: job abstract cores are '0-63'
[2023-12-14T11:03:20.997] [224.extern] debug:  task/cgroup: task_cgroup_cpuset_create: step abstract cores are '0-63'
[2023-12-14T11:03:20.997] [224.extern] debug:  task/cgroup: task_cgroup_cpuset_create: job physical CPUs are '0-63'
[2023-12-14T11:03:20.997] [224.extern] debug:  task/cgroup: task_cgroup_cpuset_create: step physical CPUs are '0-63'
[2023-12-14T11:03:20.997] [224.extern] task/cgroup: _memcg_initialize: job: alloc=0MB mem.limit=257579MB memsw.limit=unlimited
[2023-12-14T11:03:20.997] [224.extern] task/cgroup: _memcg_initialize: step: alloc=0MB mem.limit=257579MB memsw.limit=unlimited
[2023-12-14T11:03:20.997] [224.extern] debug:  cgroup/v1: _oom_event_monitor: started.
[2023-12-14T11:03:21.015] [224.extern] debug2: adding task 3 pid 2794 on node 3 to jobacct
[2023-12-14T11:03:21.018] debug2: Finish processing RPC: REQUEST_LAUNCH_PROLOG
[2023-12-14T11:03:21.286] debug2: Start processing RPC: REQUEST_LAUNCH_TASKS
[2023-12-14T11:03:21.286] debug2: Processing RPC: REQUEST_LAUNCH_TASKS
[2023-12-14T11:03:21.286] launch task StepId=224.0 request from UID:1002 GID:1003 HOST:10.115.79.9 PORT:44110
[2023-12-14T11:03:21.286] debug:  Checking credential with 828 bytes of sig data
[2023-12-14T11:03:21.286] task/affinity: lllp_distribution: JobId=224 manual binding: none,one_thread
[2023-12-14T11:03:21.286] debug:  Waiting for job 224's prolog to complete
[2023-12-14T11:03:21.286] debug:  Finished wait for job 224's prolog to complete
[2023-12-14T11:03:21.295] debug2: debug level read from slurmd is 'debug2'.
[2023-12-14T11:03:21.295] debug2: read_slurmd_conf_lite: slurmd sent 8 TRES.
[2023-12-14T11:03:21.295] debug:  acct_gather_energy/none: init: AcctGatherEnergy NONE plugin loaded
[2023-12-14T11:03:21.295] debug:  acct_gather_Profile/none: init: AcctGatherProfile NONE plugin loaded
[2023-12-14T11:03:21.295] debug:  acct_gather_interconnect/none: init: AcctGatherInterconnect NONE plugin loaded
[2023-12-14T11:03:21.295] debug:  acct_gather_filesystem/none: init: AcctGatherFilesystem NONE plugin loaded
[2023-12-14T11:03:21.295] debug2: Received CPU frequency information for 64 CPUs
[2023-12-14T11:03:21.296] debug:  switch/none: init: switch NONE plugin loaded
[2023-12-14T11:03:21.296] debug:  switch Cray/Aries plugin loaded.
[2023-12-14T11:03:21.296] [224.0] debug2: setup for a launch_task
[2023-12-14T11:03:21.296] [224.0] debug2: hwloc_topology_init
[2023-12-14T11:03:21.301] [224.0] debug2: xcpuinfo_hwloc_topo_load: xml file (/var/spool/slurmd/hwloc_topo_whole.xml) found
[2023-12-14T11:03:21.302] [224.0] debug:  CPUs:64 Boards:1 Sockets:1 CoresPerSocket:64 ThreadsPerCore:1
[2023-12-14T11:03:21.303] [224.0] debug:  cgroup/v1: init: Cgroup v1 plugin loaded
[2023-12-14T11:03:21.307] [224.0] debug:  jobacct_gather/cgroup: init: Job accounting gather cgroup plugin loaded
[2023-12-14T11:03:21.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:03:21.308] [224.0] debug:  Message thread started pid = 2797
[2023-12-14T11:03:21.308] debug2: Finish processing RPC: REQUEST_LAUNCH_TASKS
[2023-12-14T11:03:21.312] [224.0] task/affinity: init: task affinity plugin loaded with CPU mask 0xffffffffffffffff
[2023-12-14T11:03:21.317] [224.0] debug:  task/cgroup: init: core enforcement enabled
[2023-12-14T11:03:21.317] [224.0] debug:  task/cgroup: task_cgroup_memory_init: task/cgroup/memory: total:257579M allowed:100%(enforced), swap:0%(permissive), max:100%(257579M) max+swap:100%(515158M) min:30M kmem:100%(257579M permissive) min:30M swappiness:0(unset)
[2023-12-14T11:03:21.317] [224.0] debug:  task/cgroup: init: memory enforcement enabled
[2023-12-14T11:03:21.319] [224.0] debug:  task/cgroup: task_cgroup_devices_init: unable to open /etc/slurm/cgroup_allowed_devices_file.conf: No such file or directory
[2023-12-14T11:03:21.319] [224.0] debug:  task/cgroup: init: device enforcement enabled
[2023-12-14T11:03:21.319] [224.0] debug:  task/cgroup: init: Tasks containment cgroup plugin loaded
[2023-12-14T11:03:21.319] [224.0] cred/munge: init: Munge credential signature plugin loaded
[2023-12-14T11:03:21.319] [224.0] debug:  job_container/tmpfs: init: job_container tmpfs plugin loaded
[2023-12-14T11:03:21.319] [224.0] debug:  mpi type = none
[2023-12-14T11:03:21.320] [224.0] debug2: Before call to spank_init()
[2023-12-14T11:03:21.320] [224.0] debug:  spank: opening plugin stack /etc/slurm/plugstack.conf
[2023-12-14T11:03:21.320] [224.0] debug:  /etc/slurm/plugstack.conf: 1: include "/etc/slurm/plugstack.conf.d/*.conf"
[2023-12-14T11:03:21.320] [224.0] debug2: After call to spank_init()
[2023-12-14T11:03:21.320] [224.0] debug:  mpi type = (null)
[2023-12-14T11:03:21.320] [224.0] debug:  mpi/none: p_mpi_hook_slurmstepd_prefork: mpi/none: slurmstepd prefork
[2023-12-14T11:03:21.320] [224.0] error: cpu_freq_cpuset_validate: cpu_bind string is null
[2023-12-14T11:03:21.347] [224.0] debug:  task/cgroup: task_cgroup_cpuset_create: job abstract cores are '0-63'
[2023-12-14T11:03:21.347] [224.0] debug:  task/cgroup: task_cgroup_cpuset_create: step abstract cores are '0-63'
[2023-12-14T11:03:21.347] [224.0] debug:  task/cgroup: task_cgroup_cpuset_create: job physical CPUs are '0-63'
[2023-12-14T11:03:21.347] [224.0] debug:  task/cgroup: task_cgroup_cpuset_create: step physical CPUs are '0-63'
[2023-12-14T11:03:21.348] [224.0] task/cgroup: _memcg_initialize: job: alloc=0MB mem.limit=257579MB memsw.limit=unlimited
[2023-12-14T11:03:21.348] [224.0] task/cgroup: _memcg_initialize: step: alloc=0MB mem.limit=257579MB memsw.limit=unlimited
[2023-12-14T11:03:21.348] [224.0] debug:  cgroup/v1: _oom_event_monitor: started.
[2023-12-14T11:03:21.349] [224.0] debug2: hwloc_topology_load
[2023-12-14T11:03:21.370] [224.0] debug2: hwloc_topology_export_xml
[2023-12-14T11:03:21.375] [224.0] debug2: Entering _setup_normal_io
[2023-12-14T11:03:21.376] [224.0] debug2: io_init_msg_write_to_fd: entering
[2023-12-14T11:03:21.376] [224.0] debug2: io_init_msg_write_to_fd: msg->nodeid = 2
[2023-12-14T11:03:21.376] [224.0] debug2: io_init_msg_write_to_fd: leaving
[2023-12-14T11:03:21.376] [224.0] debug2: Leaving  _setup_normal_io
[2023-12-14T11:03:21.376] [224.0] debug levels are stderr='error', logfile='debug2', syslog='quiet'
[2023-12-14T11:03:21.376] [224.0] debug:  IO handler started pid=2797
[2023-12-14T11:03:21.378] [224.0] starting 1 tasks
[2023-12-14T11:03:21.379] [224.0] task 2 (2803) started 2023-12-14T11:03:21
[2023-12-14T11:03:21.379] [224.0] debug:  Setting slurmstepd oom_adj to -1000
[2023-12-14T11:03:21.379] [224.0] debug:  job_container/tmpfs: _read_slurm_jc_conf: Reading job_container.conf file /etc/slurm/job_container.conf
[2023-12-14T11:03:21.411] [224.0] debug2: adding task 2 pid 2803 on node 2 to jobacct
[2023-12-14T11:03:21.413] [224.0] debug:  Sending launch resp rc=0
[2023-12-14T11:03:21.414] [224.0] debug:  mpi type = (null)
[2023-12-14T11:03:21.414] [224.0] debug:  mpi/none: p_mpi_hook_slurmstepd_task: Using mpi/none
[2023-12-14T11:03:21.414] [224.0] debug:  task/affinity: task_p_pre_launch: affinity StepId=224.0, task:2 bind:none,one_thread
[2023-12-14T11:03:21.414] [224.0] debug2: _set_limit: conf setrlimit RLIMIT_CPU no change in value: 18446744073709551615
[2023-12-14T11:03:21.414] [224.0] debug2: _set_limit: conf setrlimit RLIMIT_FSIZE no change in value: 18446744073709551615
[2023-12-14T11:03:21.414] [224.0] debug2: _set_limit: conf setrlimit RLIMIT_DATA no change in value: 18446744073709551615
[2023-12-14T11:03:21.414] [224.0] debug2: _set_limit: RLIMIT_STACK  : max:inf cur:inf req:8388608
[2023-12-14T11:03:21.414] [224.0] debug2: _set_limit: conf setrlimit RLIMIT_STACK succeeded
[2023-12-14T11:03:21.414] [224.0] debug2: _set_limit: RLIMIT_CORE   : max:inf cur:inf req:0
[2023-12-14T11:03:21.414] [224.0] debug2: _set_limit: conf setrlimit RLIMIT_CORE succeeded
[2023-12-14T11:03:21.414] [224.0] debug2: _set_limit: conf setrlimit RLIMIT_RSS no change in value: 18446744073709551615
[2023-12-14T11:03:21.414] [224.0] debug2: _set_limit: RLIMIT_NPROC  : max:1030021 cur:1030021 req:1030020
[2023-12-14T11:03:21.414] [224.0] debug2: _set_limit: conf setrlimit RLIMIT_NPROC succeeded
[2023-12-14T11:03:21.414] [224.0] debug2: _set_limit: RLIMIT_NOFILE : max:131072 cur:4096 req:1024
[2023-12-14T11:03:21.414] [224.0] debug2: _set_limit: conf setrlimit RLIMIT_NOFILE succeeded
[2023-12-14T11:03:21.414] [224.0] debug2: _set_limit: RLIMIT_MEMLOCK: max:inf cur:inf req:33761472512
[2023-12-14T11:03:21.414] [224.0] debug2: _set_limit: conf setrlimit RLIMIT_MEMLOCK succeeded
[2023-12-14T11:03:21.414] [224.0] debug2: _set_limit: conf setrlimit RLIMIT_AS no change in value: 18446744073709551615
[2023-12-14T11:03:50.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:03:51.307] [224.0] debug2: profile signaling type Task
[2023-12-14T11:04:20.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:04:21.307] [224.0] debug2: profile signaling type Task
[2023-12-14T11:04:50.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:04:51.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:05:20.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:05:21.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:05:50.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:05:51.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:06:20.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:06:21.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:06:37.286] debug2: Start processing RPC: REQUEST_PING
[2023-12-14T11:06:37.286] debug2: Processing RPC: REQUEST_PING
[2023-12-14T11:06:37.287] debug2: Finish processing RPC: REQUEST_PING
[2023-12-14T11:06:50.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:06:51.307] [224.0] debug2: profile signaling type Task
[2023-12-14T11:07:20.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:07:21.307] [224.0] debug2: profile signaling type Task
[2023-12-14T11:07:50.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:07:51.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:08:20.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:08:21.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:08:50.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:08:51.307] [224.0] debug2: profile signaling type Task
[2023-12-14T11:09:20.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:09:21.307] [224.0] debug2: profile signaling type Task
[2023-12-14T11:09:50.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:09:51.307] [224.0] debug2: profile signaling type Task
[2023-12-14T11:09:57.650] debug2: Start processing RPC: REQUEST_PING
[2023-12-14T11:09:57.651] debug2: Processing RPC: REQUEST_PING
[2023-12-14T11:09:57.651] debug2: Finish processing RPC: REQUEST_PING
[2023-12-14T11:10:20.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:10:21.307] [224.0] debug2: profile signaling type Task
[2023-12-14T11:10:50.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:10:51.307] [224.0] debug2: profile signaling type Task
[2023-12-14T11:11:20.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:11:21.307] [224.0] debug2: profile signaling type Task
[2023-12-14T11:11:50.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:11:51.307] [224.0] debug2: profile signaling type Task
[2023-12-14T11:12:20.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:12:21.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:12:50.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:12:51.307] [224.0] debug2: profile signaling type Task
[2023-12-14T11:13:17.018] debug2: Start processing RPC: REQUEST_NODE_REGISTRATION_STATUS
[2023-12-14T11:13:17.018] debug2: Processing RPC: REQUEST_NODE_REGISTRATION_STATUS
[2023-12-14T11:13:17.021] [224.0] debug:  Handling REQUEST_STATE
[2023-12-14T11:13:17.021] debug:  _fill_registration_msg: found apparently running StepId=224.0
[2023-12-14T11:13:17.024] [224.extern] debug:  Handling REQUEST_STATE
[2023-12-14T11:13:17.024] debug:  _fill_registration_msg: found apparently running StepId=224.extern
[2023-12-14T11:13:17.040] debug:  _handle_node_reg_resp: slurmctld sent back 8 TRES.
[2023-12-14T11:13:17.040] debug2: Finish processing RPC: REQUEST_NODE_REGISTRATION_STATUS
[2023-12-14T11:13:20.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:13:21.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:13:50.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:13:51.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:14:20.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:14:21.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:14:50.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:14:51.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:15:20.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:15:21.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:15:50.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:15:51.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:16:20.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:16:21.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:16:37.390] debug2: Start processing RPC: REQUEST_PING
[2023-12-14T11:16:37.390] debug2: Processing RPC: REQUEST_PING
[2023-12-14T11:16:37.390] debug2: Finish processing RPC: REQUEST_PING
[2023-12-14T11:16:50.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:16:51.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:17:20.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:17:21.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:17:50.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:17:51.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:18:20.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:18:21.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:18:50.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:18:51.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:19:20.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:19:21.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:19:50.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:19:51.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:19:57.767] debug2: Start processing RPC: REQUEST_PING
[2023-12-14T11:19:57.768] debug2: Processing RPC: REQUEST_PING
[2023-12-14T11:19:57.771] debug2: Finish processing RPC: REQUEST_PING
[2023-12-14T11:20:20.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:20:21.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:20:50.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:20:51.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:21:20.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:21:21.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:21:50.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:21:51.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:22:20.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:22:21.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:22:50.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:22:51.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:23:17.128] debug2: Start processing RPC: REQUEST_PING
[2023-12-14T11:23:17.128] debug2: Processing RPC: REQUEST_PING
[2023-12-14T11:23:17.130] debug2: Finish processing RPC: REQUEST_PING
[2023-12-14T11:23:20.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:23:21.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:23:50.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:23:51.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:24:20.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:24:21.307] [224.0] debug2: profile signaling type Task
[2023-12-14T11:24:50.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:24:51.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:25:20.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:25:21.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:25:50.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:25:51.307] [224.0] debug2: profile signaling type Task
[2023-12-14T11:26:20.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:26:21.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:26:37.503] debug2: Start processing RPC: REQUEST_PING
[2023-12-14T11:26:37.504] debug2: Processing RPC: REQUEST_PING
[2023-12-14T11:26:37.508] debug2: Finish processing RPC: REQUEST_PING
[2023-12-14T11:26:50.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:26:51.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:27:20.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:27:21.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:27:50.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:27:51.307] [224.0] debug2: profile signaling type Task
[2023-12-14T11:28:20.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:28:21.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:28:50.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:28:51.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:29:20.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:29:21.307] [224.0] debug2: profile signaling type Task
[2023-12-14T11:29:50.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:29:51.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:29:57.860] debug2: Start processing RPC: REQUEST_PING
[2023-12-14T11:29:57.860] debug2: Processing RPC: REQUEST_PING
[2023-12-14T11:29:57.860] debug2: Finish processing RPC: REQUEST_PING
[2023-12-14T11:30:20.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:30:21.307] [224.0] debug2: profile signaling type Task
[2023-12-14T11:30:50.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:30:51.307] [224.0] debug2: profile signaling type Task
[2023-12-14T11:31:20.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:31:21.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:31:50.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:31:51.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:32:20.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:32:21.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:32:50.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:32:51.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:33:17.224] debug2: Start processing RPC: REQUEST_PING
[2023-12-14T11:33:17.224] debug2: Processing RPC: REQUEST_PING
[2023-12-14T11:33:17.225] debug2: Finish processing RPC: REQUEST_PING
[2023-12-14T11:33:20.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:33:21.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:33:50.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:33:51.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:34:20.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:34:21.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:34:50.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:34:51.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:35:20.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:35:21.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:35:50.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:35:51.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:36:20.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:36:21.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:36:37.591] debug2: Start processing RPC: REQUEST_PING
[2023-12-14T11:36:37.591] debug2: Processing RPC: REQUEST_PING
[2023-12-14T11:36:37.592] debug2: Finish processing RPC: REQUEST_PING
[2023-12-14T11:36:50.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:36:51.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:37:20.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:37:21.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:37:50.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:37:51.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:38:20.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:38:21.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:38:50.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:38:51.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:39:20.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:39:21.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:39:50.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:39:51.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:39:57.951] debug2: Start processing RPC: REQUEST_PING
[2023-12-14T11:39:57.952] debug2: Processing RPC: REQUEST_PING
[2023-12-14T11:39:57.952] debug2: Finish processing RPC: REQUEST_PING
[2023-12-14T11:40:20.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:40:21.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:40:50.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:40:51.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:41:20.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:41:21.307] [224.0] debug2: profile signaling type Task
[2023-12-14T11:41:50.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:41:51.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:42:20.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:42:21.307] [224.0] debug2: profile signaling type Task
[2023-12-14T11:42:50.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:42:51.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:43:17.314] debug2: Start processing RPC: REQUEST_PING
[2023-12-14T11:43:17.315] debug2: Processing RPC: REQUEST_PING
[2023-12-14T11:43:17.315] debug2: Finish processing RPC: REQUEST_PING
[2023-12-14T11:43:20.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:43:21.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:43:50.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:43:51.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:44:20.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:44:21.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:44:50.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:44:51.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:45:20.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:45:21.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:45:50.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:45:51.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:46:20.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:46:21.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:46:37.679] debug2: Start processing RPC: REQUEST_NODE_REGISTRATION_STATUS
[2023-12-14T11:46:37.679] debug2: Processing RPC: REQUEST_NODE_REGISTRATION_STATUS
[2023-12-14T11:46:37.680] [224.0] debug:  Handling REQUEST_STATE
[2023-12-14T11:46:37.680] debug:  _fill_registration_msg: found apparently running StepId=224.0
[2023-12-14T11:46:37.680] [224.extern] debug:  Handling REQUEST_STATE
[2023-12-14T11:46:37.680] debug:  _fill_registration_msg: found apparently running StepId=224.extern
[2023-12-14T11:46:37.695] debug:  _handle_node_reg_resp: slurmctld sent back 8 TRES.
[2023-12-14T11:46:37.695] debug2: Finish processing RPC: REQUEST_NODE_REGISTRATION_STATUS
[2023-12-14T11:46:50.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:46:51.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:47:20.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:47:21.307] [224.0] debug2: profile signaling type Task
[2023-12-14T11:47:50.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:47:51.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:48:20.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:48:21.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:48:50.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:48:51.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:49:20.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:49:21.307] [224.0] debug2: profile signaling type Task
[2023-12-14T11:49:50.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:49:51.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:49:57.055] debug2: Start processing RPC: REQUEST_PING
[2023-12-14T11:49:57.055] debug2: Processing RPC: REQUEST_PING
[2023-12-14T11:49:57.059] debug2: Finish processing RPC: REQUEST_PING
[2023-12-14T11:50:20.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:50:21.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:50:50.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:50:51.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:51:20.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:51:21.307] [224.0] debug2: profile signaling type Task
[2023-12-14T11:51:50.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:51:51.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:52:20.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:52:21.307] [224.0] debug2: profile signaling type Task
[2023-12-14T11:52:50.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:52:51.307] [224.0] debug2: profile signaling type Task
[2023-12-14T11:53:17.418] debug2: Start processing RPC: REQUEST_PING
[2023-12-14T11:53:17.418] debug2: Processing RPC: REQUEST_PING
[2023-12-14T11:53:17.419] debug2: Finish processing RPC: REQUEST_PING
[2023-12-14T11:53:20.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:53:21.307] [224.0] debug2: profile signaling type Task
[2023-12-14T11:53:50.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:53:51.307] [224.0] debug2: profile signaling type Task
[2023-12-14T11:54:20.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:54:21.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:54:50.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:54:51.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:55:20.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:55:21.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:55:50.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:55:51.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:56:20.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:56:21.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:56:37.790] debug2: Start processing RPC: REQUEST_PING
[2023-12-14T11:56:37.791] debug2: Processing RPC: REQUEST_PING
[2023-12-14T11:56:37.791] debug2: Finish processing RPC: REQUEST_PING
[2023-12-14T11:56:50.950] [224.extern] debug2: profile signaling type Task
[2023-12-14T11:56:51.308] [224.0] debug2: profile signaling type Task
[2023-12-14T11:58:50.526] debug:  Log file re-opened
[2023-12-14T11:58:50.527] debug:  _step_connect: connect() failed for /var/spool/slurmd/cn4_224.0: Connection refused
[2023-12-14T11:58:50.527] debug:  Cleaned up stray socket /var/spool/slurmd/cn4_224.0
[2023-12-14T11:58:50.527] debug:  _step_connect: connect() failed for /var/spool/slurmd/cn4_224.4294967292: Connection refused
[2023-12-14T11:58:50.527] debug:  Cleaned up stray socket /var/spool/slurmd/cn4_224.4294967292
[2023-12-14T11:58:50.527] debug2: hwloc_topology_init
[2023-12-14T11:58:50.532] debug2: hwloc_topology_load
[2023-12-14T11:58:50.551] debug2: hwloc_topology_export_xml
[2023-12-14T11:58:50.553] debug:  CPUs:64 Boards:1 Sockets:1 CoresPerSocket:64 ThreadsPerCore:1
[2023-12-14T11:58:50.553] debug2: hwloc_topology_init
[2023-12-14T11:58:50.555] debug2: xcpuinfo_hwloc_topo_load: xml file (/var/spool/slurmd/hwloc_topo_whole.xml) found
[2023-12-14T11:58:50.555] debug:  CPUs:64 Boards:1 Sockets:1 CoresPerSocket:64 ThreadsPerCore:1
[2023-12-14T11:58:50.556] topology/none: init: topology NONE plugin loaded
[2023-12-14T11:58:50.556] route/default: init: route default plugin loaded
[2023-12-14T11:58:50.556] debug2: Gathering cpu frequency information for 64 cpus
[2023-12-14T11:58:50.557] debug:  Resource spec: No specialized cores configured by default on this node
[2023-12-14T11:58:50.557] debug:  Resource spec: Reserved system memory limit not configured for this node
[2023-12-14T11:58:50.560] task/affinity: init: task affinity plugin loaded with CPU mask 0xffffffffffffffff
[2023-12-14T11:58:50.560] debug:  task/cgroup: init: Tasks containment cgroup plugin loaded
[2023-12-14T11:58:50.561] debug:  auth/munge: init: Munge authentication plugin loaded
[2023-12-14T11:58:50.561] debug:  spank: opening plugin stack /etc/slurm/plugstack.conf
[2023-12-14T11:58:50.561] debug:  /etc/slurm/plugstack.conf: 1: include "/etc/slurm/plugstack.conf.d/*.conf"
[2023-12-14T11:58:50.561] cred/munge: init: Munge credential signature plugin loaded
[2023-12-14T11:58:50.561] slurmd version 21.08.5 started
[2023-12-14T11:58:50.561] debug:  jobacct_gather/cgroup: init: Job accounting gather cgroup plugin loaded
[2023-12-14T11:58:50.562] debug:  job_container/tmpfs: init: job_container tmpfs plugin loaded
[2023-12-14T11:58:50.562] debug:  job_container/tmpfs: _read_slurm_jc_conf: Reading job_container.conf file /etc/slurm/job_container.conf
[2023-12-14T11:58:50.562] debug:  job_container/tmpfs: container_p_restore: job_container.conf read successfully
[2023-12-14T11:58:50.562] debug:  job_container/tmpfs: _restore_ns: _restore_ns: Job 66 not found, deleting the namespace
[2023-12-14T11:58:50.562] error: _delete_ns: umount2 /var/nvme/storage/cn4/66/.ns failed: Invalid argument
[2023-12-14T11:58:50.562] debug:  job_container/tmpfs: _restore_ns: _restore_ns: Job 65 not found, deleting the namespace
[2023-12-14T11:58:50.563] error: _delete_ns: umount2 /var/nvme/storage/cn4/65/.ns failed: Invalid argument
[2023-12-14T11:58:50.563] debug:  job_container/tmpfs: _restore_ns: _restore_ns: Job 58 not found, deleting the namespace
[2023-12-14T11:58:50.563] error: _delete_ns: umount2 /var/nvme/storage/cn4/58/.ns failed: Invalid argument
[2023-12-14T11:58:50.563] debug:  job_container/tmpfs: _restore_ns: _restore_ns: Job 56 not found, deleting the namespace
[2023-12-14T11:58:50.563] error: _delete_ns: umount2 /var/nvme/storage/cn4/56/.ns failed: Invalid argument
[2023-12-14T11:58:50.563] debug:  job_container/tmpfs: _restore_ns: _restore_ns: Job 224 not found, deleting the namespace
[2023-12-14T11:58:50.563] error: _delete_ns: umount2 /var/nvme/storage/cn4/224/.ns failed: Invalid argument
[2023-12-14T11:58:50.563] debug:  job_container/tmpfs: _restore_ns: _restore_ns: Job 64 not found, deleting the namespace
[2023-12-14T11:58:50.563] error: _delete_ns: umount2 /var/nvme/storage/cn4/64/.ns failed: Invalid argument
[2023-12-14T11:58:50.563] debug:  job_container/tmpfs: _restore_ns: _restore_ns: Job 74 not found, deleting the namespace
[2023-12-14T11:58:50.563] error: _delete_ns: umount2 /var/nvme/storage/cn4/74/.ns failed: Invalid argument
[2023-12-14T11:58:50.563] debug:  job_container/tmpfs: _restore_ns: _restore_ns: Job 54 not found, deleting the namespace
[2023-12-14T11:58:50.563] error: _delete_ns: umount2 /var/nvme/storage/cn4/54/.ns failed: Invalid argument
[2023-12-14T11:58:50.563] debug:  job_container/tmpfs: _restore_ns: _restore_ns: Job 59 not found, deleting the namespace
[2023-12-14T11:58:50.563] error: _delete_ns: umount2 /var/nvme/storage/cn4/59/.ns failed: Invalid argument
[2023-12-14T11:58:50.563] debug:  job_container/tmpfs: _restore_ns: _restore_ns: Job 107 not found, deleting the namespace
[2023-12-14T11:58:50.563] error: _delete_ns: umount2 /var/nvme/storage/cn4/107/.ns failed: Invalid argument
[2023-12-14T11:58:50.563] error: Encountered an error while restoring job containers.
[2023-12-14T11:58:50.563] error: Unable to restore job_container state.
[2023-12-14T11:58:50.564] debug:  switch/none: init: switch NONE plugin loaded
[2023-12-14T11:58:50.564] debug:  switch Cray/Aries plugin loaded.
[2023-12-14T11:58:50.564] slurmd started on Thu, 14 Dec 2023 11:58:50 -0600
[2023-12-14T11:58:50.565] CPUs=64 Boards=1 Sockets=1 Cores=64 Threads=1 Memory=257580 TmpDisk=937291 Uptime=14 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2023-12-14T11:58:50.565] debug:  acct_gather_energy/none: init: AcctGatherEnergy NONE plugin loaded
[2023-12-14T11:58:50.565] debug:  acct_gather_Profile/none: init: AcctGatherProfile NONE plugin loaded
[2023-12-14T11:58:50.566] debug:  acct_gather_interconnect/none: init: AcctGatherInterconnect NONE plugin loaded
[2023-12-14T11:58:50.566] debug:  acct_gather_filesystem/none: init: AcctGatherFilesystem NONE plugin loaded
[2023-12-14T11:58:50.566] debug2: No acct_gather.conf file (/etc/slurm/acct_gather.conf)
[2023-12-14T11:58:50.570] debug:  _handle_node_reg_resp: slurmctld sent back 8 TRES.
[2023-12-14T11:58:50.571] debug2: Start processing RPC: REQUEST_TERMINATE_JOB
[2023-12-14T11:58:50.571] debug2: Processing RPC: REQUEST_TERMINATE_JOB
[2023-12-14T11:58:50.571] debug:  _rpc_terminate_job: uid = 64030 JobId=224
[2023-12-14T11:58:50.571] debug:  credential for job 224 revoked
[2023-12-14T11:58:50.571] debug2: No steps in jobid 224 to send signal 998
[2023-12-14T11:58:50.571] debug2: No steps in jobid 224 to send signal 18
[2023-12-14T11:58:50.571] debug2: No steps in jobid 224 to send signal 15
[2023-12-14T11:58:50.571] debug2: set revoke expiration for jobid 224 to 1702576850 UTS
[2023-12-14T11:58:50.571] error: _delete_ns: umount2 /var/nvme/storage/cn4/224/.ns failed: Invalid argument
[2023-12-14T11:58:50.571] error: container_g_delete(224): Invalid argument
[2023-12-14T11:58:50.571] debug2: Finish processing RPC: REQUEST_TERMINATE_JOB


cgroup.conf:
CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/slurm/cgroup"
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes
ConstrainKmemSpace=no
TaskAffinity=no
CgroupMountpoint=/sys/fs/cgroup


job_container.conf:
AutoBasePath=true
BasePath=/var/nvme/storage
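
For context on the .ns paths in the errors above: job_container/tmpfs puts each job in a private mount namespace and pins that namespace by bind-mounting its handle onto BasePath/<node>/<jobid>/.ns; that .ns file is what _delete_ns later tries to umount2. A minimal sketch of the pinning idea, under assumed simplifications (hypothetical helper, not the actual plugin source):

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <sched.h>
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical sketch: pin a child's private mount namespace under
 * base/<jobid>/.ns so it can be entered (setns) and unmounted later.
 * Must run as root; error paths abbreviated. */
static int setup_job_ns(const char *base, uint32_t job_id)
{
    char job_dir[PATH_MAX], ns_path[PATH_MAX], proc_ns[64];
    int pipefd[2], fd;
    pid_t pid;
    char ready;

    snprintf(job_dir, sizeof(job_dir), "%s/%u", base, job_id);
    snprintf(ns_path, sizeof(ns_path), "%s/.ns", job_dir);

    if (mkdir(job_dir, 0700) && errno != EEXIST)
        return -1;
    if ((fd = open(ns_path, O_CREAT | O_RDWR, 0600)) < 0)
        return -1;
    close(fd);

    if (pipe(pipefd))
        return -1;
    if ((pid = fork()) == 0) {            /* child: create the namespace */
        close(pipefd[0]);
        if (unshare(CLONE_NEWNS))
            _exit(1);
        /* the per-job tmpfs /tmp would be mounted here */
        (void)write(pipefd[1], "r", 1);   /* tell parent the ns exists */
        close(pipefd[1]);
        pause();                          /* keep the ns alive until pinned */
        _exit(0);
    }
    close(pipefd[1]);
    if (read(pipefd[0], &ready, 1) != 1)  /* wait for the child's unshare */
        return -1;
    close(pipefd[0]);

    /* Bind the namespace handle onto .ns: this keeps the namespace
     * alive after the child exits, until .ns is unmounted again. */
    snprintf(proc_ns, sizeof(proc_ns), "/proc/%d/ns/mnt", (int)pid);
    if (mount(proc_ns, ns_path, NULL, MS_BIND, NULL))
        return -1;

    kill(pid, SIGTERM);
    waitpid(pid, NULL, 0);
    return 0;
}

Cleanup is then the reverse: umount2() on the .ns path followed by removal of the job directory, which is exactly the step failing with EINVAL in the logs above.
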
Comment 1 Jason Booth 2023-12-14 13:31:52 MST
Which version of Slurm are you currently running?
Comment 2 Matt Kasemer 2023-12-14 13:32:50 MST
The default apt version on Ubuntu 22.04: 21.08.5.
Comment 3 Jason Booth 2023-12-14 14:41:47 MST
The node had some type of event that happened between [2023-12-14T11:56:51.308] 
and [2023-12-14T11:58:50.526]. Can you also attach the syslogs and the output 
from "$ date && uptime"?

Based on the logs it looks like the slurmd stopped suddenly and was started back 
up a few minutes later. This is why the slurmctld interpreted this as an 
unexpected reboot.

> [2023-12-14T11:56:51.308] [224.0] debug2: profile signaling type Task
> [2023-12-14T11:58:50.526] debug:  Log file re-opened
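
For reference, the "unexpectedly rebooted" message is driven by registration data: slurmd reports its boot time, and when that is newer than the last response slurmctld recorded for the node, the controller assumes a reboot. Roughly, with illustrative field names (a paraphrase of the logged logic, not Slurm's actual structs):

#include <stdbool.h>
#include <time.h>

/* Illustrative only: field names are assumptions, not Slurm's real
 * node_record layout. */
struct node_record {
    time_t boot_time;      /* reported in the node's registration message */
    time_t last_response;  /* last time slurmctld heard from the node */
};

static bool unexpectedly_rebooted(const struct node_record *node)
{
    /* The reported boot time is refreshed when slurmd starts, so a
     * slurmd that stops and comes back can look like a reboot to the
     * controller - which is what is suspected to have happened here. */
    return node->boot_time > node->last_response;
}
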
Comment 4 Matt Kasemer 2023-12-14 15:19:28 MST
Will do. Currently running memtest86+ to see if there are any RAM issues, so I don't have immediate access to logs.

As a note, the job that failed in the log was ID=224, but many of the apparent-culprit errors in that log seem to be from previous jobs that failed:

[2023-12-14T11:58:50.562] error: _delete_ns: umount2 /var/nvme/storage/cn4/66/.ns failed: Invalid argument
...
[2023-12-14T11:58:50.563] debug:  job_container/tmpfs: _restore_ns: _restore_ns: Job 107 not found, deleting the namespace
[2023-12-14T11:58:50.563] error: _delete_ns: umount2 /var/nvme/storage/cn4/107/.ns failed: Invalid argument
[2023-12-14T11:58:50.563] error: Encountered an error while restoring job containers.

Again, I can't check until memtest is complete, but if folders for failed jobs 66 ... 107 are in the job_container base path /var/nvme/storage/, why are they still there after a job dies? And is this potentially causing the issue, because the slurmd daemon is trying to restore those jobs? My understanding is naive, so please excuse me if this is entirely inconsequential.
Comment 5 Matt Kasemer 2023-12-15 14:29:49 MST
Okay, I've found two issues here:

First, memtest ran and encountered errors. After reseating sticks and CPU, no more errors.

Second, I deleted all directories left over in /var/nvme/storage/cn4/. At this point jobs seemingly run fine, and there is no longer an issue.

Broadly, however, my question now becomes: why are directories left behind when a job fails?
Comment 6 Marcin Stolarek 2023-12-17 22:09:02 MST
Matt,

I don't see /var/nvme in your slurm.conf, but I see you have JobContainerType=job_container/tmpfs there. Is the directory managed by the plugin?

If you're on a version later than 20.11.6 (after Bug 11093, commits a3557b4cc1a..5e4fd8ebe4ff0), slurmd should clean those up on its startup.
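
Conceptually, that startup cleanup is a scan of the BasePath node directory: for every leftover <jobid> directory whose job is no longer known, unmount the pinned .ns and remove the directory. A rough sketch under those assumptions (hypothetical helper names, not the actual _restore_ns source):

#include <dirent.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mount.h>

/* Sketch: walk node_dir (e.g. /var/nvme/storage/cn4) and clean up
 * directories belonging to jobs that are no longer active.
 * job_is_active() is a hypothetical callback. */
static int restore_job_containers(const char *node_dir,
                                  bool (*job_is_active)(uint32_t))
{
    DIR *dp = opendir(node_dir);
    struct dirent *de;
    int rc = 0;

    if (!dp)
        return -1;
    while ((de = readdir(dp))) {
        char *end;
        uint32_t job_id = (uint32_t)strtoul(de->d_name, &end, 10);

        if (end == de->d_name || *end)
            continue;                /* not a <jobid> directory */
        if (job_is_active(job_id))
            continue;                /* job still known, keep it */

        char ns_path[PATH_MAX];
        snprintf(ns_path, sizeof(ns_path), "%s/%s/.ns",
                 node_dir, de->d_name);
        /* EINVAL here when .ns is no longer a mount point - the
         * "umount2 ... failed: Invalid argument" seen in the logs */
        if (umount2(ns_path, MNT_DETACH))
            rc = -1;
        /* directory removal would follow; in the affected versions
         * it was skipped once umount2 failed */
    }
    closedir(dp);
    return rc;
}
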

A number of issues have been fixed in regards to job_container/tmpfs; Bug 14284 in particular may be of interest - fixed in 22.05.3 (the commit message is "job_container/tmpfs - cleanup container even if .ns isn't mounted.").

We also introduced additional logging in this area of the code in 23.02 (across a few minor releases) and 23.11, to better identify the reason for file/directory removal failures on job completion.

My general recommendation for the issue would be to upgrade to a recent supported version, since we have made many changes in this particular area of the code.

Let me know if I'm correct that the directory is managed by job_container/tmpfs, and your thoughts on upgrading.

cheers,
Marcin
Comment 7 Marcin Stolarek 2023-12-17 22:13:55 MST
PS. I've now seen the logs, and they show that /var/nvme/storage/ is used by job_container/tmpfs - I was writing the reply and looking at the case in parallel.
Comment 8 Matt Kasemer 2023-12-18 08:19:08 MST
An upgrade is not viable in the short term, but I will keep it in mind moving forward.

We are still getting some (similar) errors, though not for the "zombie" directories. This time, job ID = 280, and the errors are associated with it. 

From the slurmctld log:

[2023-12-16T08:30:47.252] _job_complete: JobId=280 WEXITSTATUS 0
[2023-12-16T08:30:47.252] _job_complete: JobId=280 done
[2023-12-16T08:30:48.472] Node cn4 now responding
[2023-12-16T08:30:48.472] validate_node_specs: Node cn4 unexpectedly rebooted boot_time=1702737034 last response=1702736799
[2023-12-16T08:31:31.830] error: Nodes cn4 not responding

...but the node is not rebooting. From the slurmd log on the offending node around the same time:

[2023-12-16T08:30:47.761] debug:  Log file re-opened
[2023-12-16T08:30:47.761] debug:  _step_connect: connect() failed for /var/spool/slurmd/cn4_280.0: Connection refused
[2023-12-16T08:30:47.761] debug:  Cleaned up stray socket /var/spool/slurmd/cn4_280.0
[2023-12-16T08:30:47.761] debug:  _step_connect: connect() failed for /var/spool/slurmd/cn4_280.4294967292: Connection refused
[2023-12-16T08:30:47.761] debug:  Cleaned up stray socket /var/spool/slurmd/cn4_280.4294967292
[2023-12-16T08:30:47.761] debug2: hwloc_topology_init
[2023-12-16T08:30:47.766] debug2: hwloc_topology_load
[2023-12-16T08:30:47.784] debug2: hwloc_topology_export_xml
[2023-12-16T08:30:47.786] debug:  CPUs:64 Boards:1 Sockets:1 CoresPerSocket:64 ThreadsPerCore:1
[2023-12-16T08:30:47.787] debug2: hwloc_topology_init
[2023-12-16T08:30:47.788] debug2: xcpuinfo_hwloc_topo_load: xml file (/var/spool/slurmd/hwloc_topo_whole.xml) found
[2023-12-16T08:30:47.789] debug:  CPUs:64 Boards:1 Sockets:1 CoresPerSocket:64 ThreadsPerCore:1
[2023-12-16T08:30:47.789] topology/none: init: topology NONE plugin loaded
[2023-12-16T08:30:47.790] route/default: init: route default plugin loaded
[2023-12-16T08:30:47.790] debug2: Gathering cpu frequency information for 64 cpus
[2023-12-16T08:30:47.791] debug:  Resource spec: No specialized cores configured by default on this node
[2023-12-16T08:30:47.791] debug:  Resource spec: Reserved system memory limit not configured for this node
[2023-12-16T08:30:47.794] task/affinity: init: task affinity plugin loaded with CPU mask 0xffffffffffffffff
[2023-12-16T08:30:47.794] debug:  task/cgroup: init: Tasks containment cgroup plugin loaded
[2023-12-16T08:30:47.794] debug:  auth/munge: init: Munge authentication plugin loaded
[2023-12-16T08:30:47.794] debug:  spank: opening plugin stack /etc/slurm/plugstack.conf
[2023-12-16T08:30:47.795] debug:  /etc/slurm/plugstack.conf: 1: include "/etc/slurm/plugstack.conf.d/*.conf"
[2023-12-16T08:30:47.795] cred/munge: init: Munge credential signature plugin loaded
[2023-12-16T08:30:47.795] slurmd version 21.08.5 started
[2023-12-16T08:30:47.795] debug:  jobacct_gather/cgroup: init: Job accounting gather cgroup plugin loaded
[2023-12-16T08:30:47.796] debug:  job_container/tmpfs: init: job_container tmpfs plugin loaded
[2023-12-16T08:30:47.796] debug:  job_container/tmpfs: _read_slurm_jc_conf: Reading job_container.conf file /etc/slurm/job_container.conf
[2023-12-16T08:30:47.796] debug:  job_container/tmpfs: container_p_restore: job_container.conf read successfully
[2023-12-16T08:30:47.796] debug:  job_container/tmpfs: _restore_ns: _restore_ns: Job 280 not found, deleting the namespace
[2023-12-16T08:30:47.796] error: _delete_ns: umount2 /var/nvme/storage/cn4/280/.ns failed: Invalid argument
[2023-12-16T08:30:47.796] error: Encountered an error while restoring job containers.
[2023-12-16T08:30:47.796] error: Unable to restore job_container state.
[2023-12-16T08:30:47.797] debug:  switch/none: init: switch NONE plugin loaded
[2023-12-16T08:30:47.797] debug:  switch Cray/Aries plugin loaded.
[2023-12-16T08:30:47.797] slurmd started on Sat, 16 Dec 2023 08:30:47 -0600
[2023-12-16T08:30:47.797] CPUs=64 Boards=1 Sockets=1 Cores=64 Threads=1 Memory=257580 TmpDisk=937291 Uptime=14 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2023-12-16T08:30:47.798] debug:  acct_gather_energy/none: init: AcctGatherEnergy NONE plugin loaded
[2023-12-16T08:30:47.798] debug:  acct_gather_Profile/none: init: AcctGatherProfile NONE plugin loaded
[2023-12-16T08:30:47.798] debug:  acct_gather_interconnect/none: init: AcctGatherInterconnect NONE plugin loaded
[2023-12-16T08:30:47.799] debug:  acct_gather_filesystem/none: init: AcctGatherFilesystem NONE plugin loaded
[2023-12-16T08:30:47.799] debug2: No acct_gather.conf file (/etc/slurm/acct_gather.conf)
[2023-12-16T08:30:47.802] debug:  _handle_node_reg_resp: slurmctld sent back 8 TRES.
[2023-12-16T08:30:48.590] debug2: Start processing RPC: REQUEST_TERMINATE_JOB
[2023-12-16T08:30:48.590] debug2: Processing RPC: REQUEST_TERMINATE_JOB
[2023-12-16T08:30:48.590] debug:  _rpc_terminate_job: uid = 64030 JobId=280
[2023-12-16T08:30:48.590] debug:  credential for job 280 revoked
[2023-12-16T08:30:48.590] debug2: No steps in jobid 280 to send signal 18
[2023-12-16T08:30:48.590] debug2: No steps in jobid 280 to send signal 15
[2023-12-16T08:30:48.591] debug2: set revoke expiration for jobid 280 to 1702737168 UTS
[2023-12-16T08:30:48.591] error: _delete_ns: umount2 /var/nvme/storage/cn4/280/.ns failed: Invalid argument
[2023-12-16T08:30:48.591] error: container_g_delete(280): Invalid argument
[2023-12-16T08:30:48.591] debug2: Finish processing RPC: REQUEST_TERMINATE_JOB
Comment 9 Marcin Stolarek 2023-12-19 01:27:22 MST
Matt,

This looks like a duplicate of Bug 14284. The error message you see is:
>[2023-12-16T08:30:48.591] error: _delete_ns: umount2 /var/nvme/storage/cn4/280/.ns failed: Invalid argument
>[2023-12-16T08:30:48.591] error: container_g_delete(280): Invalid argument

The commit 85ee0cdea94d9 [1] changed _delete_ns so that a failure in umount2 no longer ends the function; it continues on and attempts the directory removal.
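
To make the change concrete, here is a hedged before/after sketch of that control flow (rmdir_recursive is a stand-in helper, not Slurm's; see the linked commit for the real diff):

#define _GNU_SOURCE
#include <errno.h>
#include <ftw.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>
#include <sys/stat.h>

/* Stand-in for Slurm's directory removal. */
static int rm_cb(const char *path, const struct stat *sb, int type,
                 struct FTW *ftw)
{
    (void)sb; (void)type; (void)ftw;
    return remove(path);
}

static int rmdir_recursive(const char *path)
{
    return nftw(path, rm_cb, 16, FTW_DEPTH | FTW_PHYS);
}

/* Before: a umount2() failure (EINVAL when .ns is not mounted, e.g.
 * after a crash or reboot) aborted cleanup, stranding the directory. */
static int delete_ns_old(const char *ns_path, const char *job_dir)
{
    if (umount2(ns_path, MNT_DETACH)) {
        fprintf(stderr, "_delete_ns: umount2 %s failed: %s\n",
                ns_path, strerror(errno));
        return -1;                   /* directory removal never reached */
    }
    return rmdir_recursive(job_dir);
}

/* After: the failure is logged but cleanup continues, so the stale
 * job directory is removed anyway. */
static int delete_ns_new(const char *ns_path, const char *job_dir)
{
    if (umount2(ns_path, MNT_DETACH))
        fprintf(stderr, "_delete_ns: umount2 %s failed: %s\n",
                ns_path, strerror(errno));
    return rmdir_recursive(job_dir); /* attempted regardless */
}
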

I'm going to mark this ticket as a duplicate. If you can't upgrade and the issue is critical for you, the commit should be easy to backport to Slurm 21.08.

Is there anything else I can help you with in the ticket?

cheers,
Marcin
[1]https://github.com/SchedMD/slurm/commit/85ee0cdea94d90679a77d9155e6f4809cee27bd2
Comment 10 Matt Kasemer 2023-12-19 08:23:57 MST
Marcin,

Many thanks for pointing me to that. But why is this happening on only one node? It does not occur on any other node, and this one consistently fails, often mid-job.
Comment 11 Marcin Stolarek 2024-01-01 07:51:11 MST
>Many thanks for pointing me to that. But why is this happening on only one node?

That's hard to guess from my perspective. Looking at the slurm.conf, I see that this node has a different hardware configuration - a lower RealMemory set in slurm.conf. I'd suggest looking into dmesg/OS logs; maybe those can help explain the difference.

cheers,
Marcin
Comment 12 Matt Kasemer 2024-01-01 14:16:04 MST
That was for debugging. Indeed, all nodes are homogeneous.

We figured out the issue of the node rebooting: ipmitool revealed memory issues. So the log was not incorrect - the node was indeed rebooting. However, jobs are not able to pick up after the reboot. Overall, perhaps a future upgrade will take care of that, but at least we now know what the issue is. Many thanks.
Comment 13 Marcin Stolarek 2024-01-01 23:58:11 MST
Thanks for confirming; I've set the ticket as a duplicate again.
Should you have any questions please reopen.

cheers,
Marcin

*** This ticket has been marked as a duplicate of ticket 14284 ***