Ticket 17893 - Errors doing upgrade from 22.05.4 to 22.05.10
Summary: Errors doing upgrade from 22.05.4 to 22.05.10
Status: RESOLVED DUPLICATE of ticket 14981
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmstepd
Version: 22.05.10
Hardware: Linux
Priority: ---
Severity: 2 - High Impact
Assignee: Director of Support
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2023-10-12 11:32 MDT by Trey Dockendorf
Modified: 2023-10-12 13:26 MDT
CC List: 2 users

See Also:
Site: Ohio State OSC
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
slurm.conf for test cluster (45.65 KB, text/plain)
2023-10-12 11:32 MDT, Trey Dockendorf

Description Trey Dockendorf 2023-10-12 11:32:42 MDT
Created attachment 32724
slurm.conf for test cluster

I am testing an upgrade from 22.05.4 to 22.05.10 to address some of the recent CVEs.  My upgrade procedure was to run "yum update slurm*" on the slurmctld host first, then on the slurmd side.  The compute nodes use an NFS root filesystem, so I update the "RW host" (the host responsible for making changes to the root filesystem) and then run "systemctl daemon-reload ; systemctl restart slurmd" on the compute nodes that were running jobs.  What I've observed is that after a job completes or is cancelled, slurmstepd throws a bunch of errors and the job won't go away.
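
For reference, the rough command sequence is (hosts shown are our test cluster):

# on the slurmctld host (pitzer-slurm01-test)
yum update 'slurm*'

# on the NFS-root "RW host" that writes to the shared root filesystem
yum update 'slurm*'

# then on each compute node that was running jobs
systemctl daemon-reload
systemctl restart slurmd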

Here are the logs I saw for slurmstepd:

Oct 12 13:23:52 p0121 slurmstepd[57663]: error: *** JOB 2011536 ON p0121 CANCELLED AT 2023-10-12T13:23:52 ***
Oct 12 13:23:52 p0121 slurmstepd[57731]: error: *** STEP 2011536.0 ON p0121 CANCELLED AT 2023-10-12T13:23:52 ***
Oct 12 13:23:52 p0121 slurmstepd[57656]: plugin_load_from_file: Incompatible Slurm plugin /usr/lib64/slurm/hash_k12.so version (22.05.10)
Oct 12 13:23:52 p0121 slurmstepd[57656]: error: Couldn't load specified plugin name for hash/k12: Incompatible plugin version
Oct 12 13:23:52 p0121 slurmstepd[57656]: error: cannot create hash context for K12
Oct 12 13:23:52 p0121 slurmstepd[57656]: error: slurm_send_node_msg: hash_g_compute: REQUEST_STEP_COMPLETE has error
Oct 12 13:23:52 p0121 slurmstepd[57656]: error: Rank 0 failed sending step completion message directly to slurmctld, retrying
Oct 12 13:23:53 p0121 slurmstepd[57663]: plugin_load_from_file: Incompatible Slurm plugin /usr/lib64/slurm/hash_k12.so version (22.05.10)
Oct 12 13:23:53 p0121 slurmstepd[57663]: error: Couldn't load specified plugin name for hash/k12: Incompatible plugin version
Oct 12 13:23:53 p0121 slurmstepd[57663]: error: cannot create hash context for K12
Oct 12 13:23:53 p0121 slurmstepd[57663]: error: slurm_send_node_msg: hash_g_compute: REQUEST_COMPLETE_BATCH_SCRIPT has error
Oct 12 13:23:53 p0121 slurmstepd[57663]: Retrying job complete RPC for StepId=2011536.batch
Oct 12 13:23:55 p0121 slurmstepd[57731]: error: Failed to send MESSAGE_TASK_EXIT: Connection refused
Oct 12 13:23:55 p0121 slurmstepd[57731]: done with job
Oct 12 13:24:08 p0121 slurmstepd[57663]: error: hash_g_compute: hash plugin with id:0 not exist or is not loaded
Oct 12 13:24:08 p0121 slurmstepd[57663]: error: slurm_send_node_msg: hash_g_compute: REQUEST_COMPLETE_BATCH_SCRIPT has error
Oct 12 13:24:08 p0121 slurmstepd[57663]: Retrying job complete RPC for StepId=2011536.batch
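
For what it's worth, the 22.05.10 plugin is already on disk while these stepd processes are still the old 22.05.4 binaries; a quick way to confirm what is installed on the node (package names assume SchedMD's usual RPM naming):

rpm -q slurm slurm-slurmd
slurmd -V
ls -l /usr/lib64/slurm/hash_k12.so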

On the scheduler side I am seeing errors like this:

Oct 12 13:27:23 pitzer-slurm01-test slurmctld[25292]: error: slurm_receive_msg [10.4.3.14:52632]: Zero Bytes were transmitted or received

That is the IP of p0121, the node that had the slurmstepd issues.

This is what the process table looks like on p0121 using "ps auxf":

root      57656  0.0  0.0 274884  3944 ?        Sl   13:19   0:00 slurmstepd: [2011536.extern]
root      57663  0.0  0.0 273352  4548 ?        Sl   13:19   0:00 slurmstepd: [2011536.batch]
root      58054  0.0  0.0 250252 18292 ?        SLsl 13:20   0:00 /usr/sbin/slurmd -D -s --conf-server pitzer-slurm01-test -M

The only way to get this node back into service and to clear the queue is to do this:

[root@p0121 ~]# pkill -9 slurmstepd
[root@p0121 ~]# systemctl restart slurmd

The slurmd restart takes quite a while, which is not normal; slurmd restarts are usually very fast.  After the above restart I see things like this:

Oct 12 13:27:08 p0121 slurmstepd[57663]: Retrying job complete RPC for StepId=2011536.batch
Oct 12 13:27:23 p0121 slurmstepd[57656]: error: *** EXTERN STEP FOR 2011536 STEPD TERMINATED ON p0121 AT 2023-10-12T13:27:22 DUE TO JOB NOT ENDING WITH SIGNALS ***
Oct 12 13:27:23 p0121 slurmstepd[57663]: error: *** JOB 2011536 STEPD TERMINATED ON p0121 AT 2023-10-12T13:27:22 DUE TO JOB NOT ENDING WITH SIGNALS ***
Oct 12 13:27:23 p0121 slurmstepd[57656]: error: hash_g_compute: hash plugin with id:0 not exist or is not loaded
Oct 12 13:27:23 p0121 slurmstepd[57656]: error: slurm_send_node_msg: hash_g_compute: REQUEST_UPDATE_NODE has error
Oct 12 13:27:23 p0121 slurmstepd[57663]: error: hash_g_compute: hash plugin with id:0 not exist or is not loaded
Oct 12 13:27:23 p0121 slurmstepd[57663]: error: slurm_send_node_msg: hash_g_compute: REQUEST_UPDATE_NODE has error
Oct 12 13:27:23 p0121 slurmstepd[57663]: error: hash_g_compute: hash plugin with id:0 not exist or is not loaded
Oct 12 13:27:23 p0121 slurmstepd[57663]: error: slurm_send_node_msg: hash_g_compute: REQUEST_COMPLETE_BATCH_SCRIPT has error
Oct 12 13:27:23 p0121 slurmstepd[57663]: Retrying job complete RPC for StepId=2011536.batch

In the past we have done minor Slurm upgrades without rolling reboots, but if all jobs end up stuck in the completing state like this, doing the upgrade live will be a huge problem.
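
If this happened on the production cluster, I assume the stuck jobs would at least be visible cluster-wide with something like:

squeue -t completing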
Comment 1 Trey Dockendorf 2023-10-12 11:34:58 MDT
Forgot to mention that slurmctld logs entries like this:

Oct 12 13:27:23 pitzer-slurm01-test slurmctld[25292]: error: slurm_receive_msg [10.4.3.14:52632]: Zero Bytes were transmitted or received


Those go away after the slurmd restart.  So it seems slurmd may become unresponsive or otherwise have issues once a job that was started before the upgrade finishes after it.
Comment 3 Jason Booth 2023-10-12 13:26:16 MDT
This is a duplicate of ticket 14981, comment 17, and it is also mentioned on the slurm-users list here:

https://groups.google.com/g/slurm-users/c/5vLhW-oZLJE?pli=1

Unfortunately, the only path forward is to drain the cluster of all jobs before upgrading, or to kill the leftover steps afterwards.
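
As a rough sketch (the node list and drain reason are only examples), the drain-first approach would look something like:

scontrol update NodeName=p[0101-0140] State=DRAIN Reason="22.05.10 upgrade"
# wait for running jobs to finish, then upgrade and restart slurmd
# on any node that still has leftover pre-upgrade steps:
pkill -9 slurmstepd
systemctl restart slurmd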

*** This ticket has been marked as a duplicate of ticket 14981 ***