Created attachment 32724 [details]
slurm.conf for test cluster

I am testing an upgrade from 22.05.4 to 22.05.10 to address some of the recent CVEs. My upgrade procedure was to run "yum update slurm*" on the slurmctld host, then on the slurmd hosts. On the slurmd side the root filesystem is on NFS, so I update the "RW host" (the host responsible for making updates to the shared root filesystem) and then run "systemctl daemon-reload ; systemctl restart slurmd" on the compute nodes that were running jobs.

What I've observed is that after the job completes or is cancelled, slurmstepd throws a bunch of errors and the job won't go away. Here are the logs I saw from slurmstepd:

Oct 12 13:23:52 p0121 slurmstepd[57663]: error: *** JOB 2011536 ON p0121 CANCELLED AT 2023-10-12T13:23:52 ***
Oct 12 13:23:52 p0121 slurmstepd[57731]: error: *** STEP 2011536.0 ON p0121 CANCELLED AT 2023-10-12T13:23:52 ***
Oct 12 13:23:52 p0121 slurmstepd[57656]: plugin_load_from_file: Incompatible Slurm plugin /usr/lib64/slurm/hash_k12.so version (22.05.10)
Oct 12 13:23:52 p0121 slurmstepd[57656]: error: Couldn't load specified plugin name for hash/k12: Incompatible plugin version
Oct 12 13:23:52 p0121 slurmstepd[57656]: error: cannot create hash context for K12
Oct 12 13:23:52 p0121 slurmstepd[57656]: error: slurm_send_node_msg: hash_g_compute: REQUEST_STEP_COMPLETE has error
Oct 12 13:23:52 p0121 slurmstepd[57656]: error: Rank 0 failed sending step completion message directly to slurmctld, retrying
Oct 12 13:23:53 p0121 slurmstepd[57663]: plugin_load_from_file: Incompatible Slurm plugin /usr/lib64/slurm/hash_k12.so version (22.05.10)
Oct 12 13:23:53 p0121 slurmstepd[57663]: error: Couldn't load specified plugin name for hash/k12: Incompatible plugin version
Oct 12 13:23:53 p0121 slurmstepd[57663]: error: cannot create hash context for K12
Oct 12 13:23:53 p0121 slurmstepd[57663]: error: slurm_send_node_msg: hash_g_compute: REQUEST_COMPLETE_BATCH_SCRIPT has error
Oct 12 13:23:53 p0121 slurmstepd[57663]: Retrying job complete RPC for StepId=2011536.batch
Oct 12 13:23:55 p0121 slurmstepd[57731]: error: Failed to send MESSAGE_TASK_EXIT: Connection refused
Oct 12 13:23:55 p0121 slurmstepd[57731]: done with job
Oct 12 13:24:08 p0121 slurmstepd[57663]: error: hash_g_compute: hash plugin with id:0 not exist or is not loaded
Oct 12 13:24:08 p0121 slurmstepd[57663]: error: slurm_send_node_msg: hash_g_compute: REQUEST_COMPLETE_BATCH_SCRIPT has error
Oct 12 13:24:08 p0121 slurmstepd[57663]: Retrying job complete RPC for StepId=2011536.batch

On the scheduler side I am seeing errors like this:

Oct 12 13:27:23 pitzer-slurm01-test slurmctld[25292]: error: slurm_receive_msg [10.4.3.14:52632]: Zero Bytes were transmitted or received

That IP is the IP of p0121, the node which had the slurmstepd issues. This is what the process table looks like on p0121 using "ps auxf":

root     57656  0.0  0.0 274884  3944 ?  Sl   13:19  0:00 slurmstepd: [2011536.extern]
root     57663  0.0  0.0 273352  4548 ?  Sl   13:19  0:00 slurmstepd: [2011536.batch]
root     58054  0.0  0.0 250252 18292 ?  SLsl 13:20  0:00 /usr/sbin/slurmd -D -s --conf-server pitzer-slurm01-test -M

The only way to get this node back into service and to clear the queue is to do this:

[root@p0121 ~]# pkill -9 slurmstepd
[root@p0121 ~]# systemctl restart slurmd

The slurmd restart takes quite a while, which is not normal; slurmd restarts are usually very fast.
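The manual recovery above can be sketched as a small per-node script. This is a hypothetical wrapper around the two commands from the text (the `pkill -9 slurmstepd` and `systemctl restart slurmd` are the commands actually used; the dry-run flag is my addition so the script can be inspected safely):

```shell
#!/bin/sh
# Recover a node whose pre-upgrade slurmstepd processes are stuck after
# the slurmd package upgrade. Sketch of the manual steps above.
# DRY_RUN=1 (the default) only prints the commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Forcibly kill the orphaned slurmstepd processes left over from jobs
# that started before the upgrade and can no longer report completion.
run pkill -9 slurmstepd

# Then restart slurmd so the node rejoins the cluster.
run systemctl restart slurmd
```

Run with `DRY_RUN=0` on the affected node to actually execute the commands; note this kills *all* slurmstepd processes on the node, so it is only safe once every job there is already stuck.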
After the above restart I see things like this:

Oct 12 13:27:08 p0121 slurmstepd[57663]: Retrying job complete RPC for StepId=2011536.batch
Oct 12 13:27:23 p0121 slurmstepd[57656]: error: *** EXTERN STEP FOR 2011536 STEPD TERMINATED ON p0121 AT 2023-10-12T13:27:22 DUE TO JOB NOT ENDING WITH SIGNALS ***
Oct 12 13:27:23 p0121 slurmstepd[57663]: error: *** JOB 2011536 STEPD TERMINATED ON p0121 AT 2023-10-12T13:27:22 DUE TO JOB NOT ENDING WITH SIGNALS ***
Oct 12 13:27:23 p0121 slurmstepd[57656]: error: hash_g_compute: hash plugin with id:0 not exist or is not loaded
Oct 12 13:27:23 p0121 slurmstepd[57656]: error: slurm_send_node_msg: hash_g_compute: REQUEST_UPDATE_NODE has error
Oct 12 13:27:23 p0121 slurmstepd[57663]: error: hash_g_compute: hash plugin with id:0 not exist or is not loaded
Oct 12 13:27:23 p0121 slurmstepd[57663]: error: slurm_send_node_msg: hash_g_compute: REQUEST_UPDATE_NODE has error
Oct 12 13:27:23 p0121 slurmstepd[57663]: error: hash_g_compute: hash plugin with id:0 not exist or is not loaded
Oct 12 13:27:23 p0121 slurmstepd[57663]: error: slurm_send_node_msg: hash_g_compute: REQUEST_COMPLETE_BATCH_SCRIPT has error
Oct 12 13:27:23 p0121 slurmstepd[57663]: Retrying job complete RPC for StepId=2011536.batch

In the past we have done minor Slurm upgrades without doing rolling reboots, but if all jobs become stuck in a completed state, doing the upgrade live will be a huge problem.
Forgot to mention that slurmctld logs entries like this:

Oct 12 13:27:23 pitzer-slurm01-test slurmctld[25292]: error: slurm_receive_msg [10.4.3.14:52632]: Zero Bytes were transmitted or received

Those go away after the slurmd restart. So it seems slurmd may become unresponsive, or otherwise have issues, once a job that was started before the upgrade ends after the upgrade.
This is a duplicate of bug #14981, comment #17, and is mentioned on the slurm-users list here: https://groups.google.com/g/slurm-users/c/5vLhW-oZLJE?pli=1

Unfortunately, the only path forward is to drain the cluster of all jobs before upgrading, or to kill the leftover step processes afterwards.

*** This ticket has been marked as a duplicate of ticket 14981 ***
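The "drain first" workaround could be scripted roughly as below. This is only a sketch: the node range and drain reason are hypothetical placeholders, `scontrol update NodeName=... State=DRAIN/RESUME` is the standard Slurm mechanism for taking nodes out of and back into service, and the dry-run flag is my addition:

```shell
#!/bin/sh
# Sketch of draining nodes before a live Slurm minor upgrade so no job
# that started under the old version ends under the new one.
# DRY_RUN=1 (the default) only prints the commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"
NODES="${NODES:-p[0101-0121]}"   # hypothetical node range for this cluster

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Stop new jobs from being scheduled on the nodes to be upgraded.
run scontrol update NodeName="$NODES" State=DRAIN Reason="slurm upgrade"

# Once all jobs on those nodes have completed, upgrade and restart
# slurmd, then return the nodes to service.
run yum update 'slurm*'
run systemctl restart slurmd
run scontrol update NodeName="$NODES" State=RESUME
```

Waiting for the drained nodes to go idle before the `yum update` step is the part that avoids the stuck-slurmstepd problem described in this ticket.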