Bug 15470

Summary: Assistance with migration from cons_res to cons_tres
Product: Slurm
Reporter: Ole.H.Nielsen <Ole.H.Nielsen>
Component: Configuration
Assignee: Ben Roberts <ben>
Status: RESOLVED FIXED
Severity: 4 - Minor Issue
Version: 21.08.8   
Hardware: Linux   
OS: Linux   
Site: DTU Physics
Version Fixed: 22.05.7, 23.02.0pre1
Attachments: slurm.conf
gres.conf
slurmctld.log

Description Ole.H.Nielsen@fysik.dtu.dk 2022-11-21 07:05:07 MST
We have been running for years with SelectType=select/cons_res in slurm.conf, but we should migrate this to cons_tres.  I'm not aware of any instructions for this migration on a running cluster.  

Since we're running Configless mode, special care probably needs to be taken for restarting slurmctld as well as all the slurmds.

Can you kindly give us some instructions for the migration?  At SC22 I was talking to Carlos about this, so he may have some ideas.

Best regards,
Ole
Comment 1 Ole.H.Nielsen@fysik.dtu.dk 2022-11-21 07:05:32 MST
Created attachment 27850 [details]
slurm.conf
Comment 2 Ole.H.Nielsen@fysik.dtu.dk 2022-11-21 07:05:47 MST
Created attachment 27851 [details]
gres.conf
Comment 3 Ben Roberts 2022-11-21 11:58:19 MST
Hi Ole,

When making a change from cons_res to cons_tres there isn't much you need to do.  These two select type plugins are very similar; the difference is that the cons_tres plugin adds much more functionality related to GPUs.  If you're moving from cons_res to cons_tres there shouldn't be any effect on the running jobs.  If you were changing from cons_tres to cons_res and you had jobs on the system that used the new syntax that is available for GPUs, then you would run into problems.  For your reference, this is described in the documentation here:
https://slurm.schedmd.com/slurm.conf.html#OPT_SelectType_1

This page shows the new options that are available when using the cons_tres plugin:
https://slurm.schedmd.com/cons_res.html#using_cons_tres
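To make that concrete, here is a quick illustrative sketch (the --wrap job is just a placeholder): a plain GRES request works under either plugin, while the GPU-specific options are only honored by cons_tres:

$ sbatch --gres=gpu:2 --wrap='srun hostname'                          # works with cons_res and cons_tres
$ sbatch --gpus=2 --wrap='srun hostname'                              # cons_tres only
$ sbatch --gpus-per-task=1 --mem-per-gpu=8G --wrap='srun hostname'    # cons_tres only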

I don't foresee a problem with changing this plugin type.  Feel free to let me know if you have any additional questions or concerns though.

Thanks,
Ben
Comment 4 Ole.H.Nielsen@fysik.dtu.dk 2022-11-21 13:11:24 MST
Hi Ben,

(In reply to Ben Roberts from comment #3)
> When making a change from cons_res to cons_tres there isn't much you need to
> do.  These two select type plugins are very similar.  The difference being
> that the cons_tres plugin adds much more functionality related to GPUs.  If
> you're moving from cons_res to cons_tres there shouldn't be any effect on
> the running jobs.  If you were changing from cons_tres to cons_res and you
> had jobs on the system that used the new syntax that is available for GPUs,
> then you would run into problems.  For your reference, this is described in
> the documentation here:
> https://slurm.schedmd.com/slurm.conf.html#OPT_SelectType_1
> 
> This page shows the new options that are available when using the cons_tres
> plugin:
> https://slurm.schedmd.com/cons_res.html#using_cons_tres
> 
> I don't foresee a problem with changing this plugin type.  Feel free to let
> me know if you have any additional questions or concerns though.

Thanks a lot for the reassuring explanation regarding functionality.

What I'm uncertain about is whether 1) slurmctld must be restarted, and 2) all slurmds must be restarted.  Is this fully captured by this section in the above-mentioned manual page:

> A restart of slurmctld is required for changes to this parameter to take effect. When changed, all job information (running and pending) will be lost, since the job state save format used by each plugin is different. The only exception to this is when changing from cons_res to cons_tres or from cons_tres to cons_res. 

So in my case no restarts are needed, even when running in Configless mode, right?

Thanks,
Ole
Comment 5 Ben Roberts 2022-11-21 14:34:01 MST
I'm sorry I glossed over that part of your question.  You would need to restart slurmctld, but not slurmd on the nodes.  This is a scheduler-level parameter, so restarting slurmctld will cause it to pick up the change to cons_tres.  You don't need to restart slurmd on the nodes since they are going to run the jobs as they are passed to them by the scheduler.

Thanks,
Ben
Comment 6 Ole.H.Nielsen@fysik.dtu.dk 2022-11-22 00:35:05 MST
I changed to cons_tres and restarted slurmctld and did an "scontrol reconfig".

Unfortunately, a number of nodes had problems, for example node a015 which says in slurmd.log:

[2022-11-22T08:21:56.326] error: select_g_select_jobinfo_unpack: select plugin cons_tres not found
[2022-11-22T08:21:56.326] error: select_g_select_jobinfo_unpack: unpack error
[2022-11-22T08:21:56.326] error: Malformed RPC of type REQUEST_BATCH_JOB_LAUNCH(4005) received
[2022-11-22T08:21:56.326] error: slurm_receive_msg_and_forward: Header lengths are longer than data received
[2022-11-22T08:21:56.336] error: service_connection: slurm_receive_msg: Header lengths are longer than data received
[2022-11-22T08:21:56.344] error: select_g_select_jobinfo_unpack: select plugin cons_tres not found
[2022-11-22T08:21:56.344] error: select_g_select_jobinfo_unpack: unpack error
[2022-11-22T08:21:56.344] error: Malformed RPC of type REQUEST_TERMINATE_JOB(6011) received
[2022-11-22T08:21:56.344] error: slurm_receive_msg_and_forward: Header lengths are longer than data received
[2022-11-22T08:21:56.354] error: service_connection: slurm_receive_msg: Header lengths are longer than data received
[2022-11-22T08:21:56.376] error: select_g_select_jobinfo_unpack: select plugin cons_res not found
[2022-11-22T08:21:56.376] error: select_g_select_jobinfo_unpack: unpack error
[2022-11-22T08:21:56.376] error: Malformed RPC of type REQUEST_LAUNCH_TASKS(6001) received
[2022-11-22T08:21:56.376] fatal: slurmstepd: we didn't unpack the request correctly
[2022-11-22T08:21:56.377] error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: No such process
[2022-11-22T08:21:56.379] Could not launch job 5669416 and not able to requeue it, cancelling job

Furthermore, all jobs seemed to be crashing and new jobs would not start!

Therefore I reverted to cons_res and restarted slurmctld.  Now the system is behaving sanely again.  Please note that we're running Slurm 21.08.

I'll attach the slurmctld.log file for your analysis.

I'd appreciate any feedback about what went wrong here, and how we can upgrade cons_res to cons_tres without crashing jobs.

Thanks,
Ole
Comment 7 Ole.H.Nielsen@fysik.dtu.dk 2022-11-22 00:45:15 MST
Created attachment 27864 [details]
slurmctld.log
Comment 8 Ben Roberts 2022-11-22 08:37:45 MST
Hi Ole,

It's strange that the logs say the cons_tres plugin is not found. Could I have you send the full slurmd logs from node a015?

Thanks,
Ben
Comment 9 Ben Roberts 2022-11-22 09:53:31 MST
Hi Ole,

I looked into this further and I see the problem.  I told you that you could just restart slurmctld, but you do need to restart slurmd on the nodes as well.  I saw the note in the docs that just mentions that the controller needs to be restarted and I thought I remembered making that change previously without restarting the slurmd's, but I was wrong.  I can reproduce the error messages you're seeing by restarting just the controller with jobs on the system.  When I make the same change and restart slurmd on the nodes as well it works fine.  My apologies for giving you bad instructions and the lost jobs it caused.

Thanks,
Ben
Comment 11 Ole.H.Nielsen@fysik.dtu.dk 2022-11-23 02:01:58 MST
Hi Ben,

(In reply to Ben Roberts from comment #9)
> I looked into this further and I see the problem.  I told you that you could
> just restart slurmctld, but you do need to restart slurmd on the nodes as
> well.  I saw the note in the docs that just mentions that the controller
> needs to be restarted and I thought I remembered making that change
> previously without restarting the slurmd's, but I was wrong.  I can
> reproduce the error messages you're seeing by restarting just the controller
> with jobs on the system.  When I make the same change and restart slurmd on
> the nodes as well it works fine.  My apologies for giving you bad
> instructions and the lost jobs it caused.

Thanks for the update.  It would be great if correct and complete instructions could be added to the slurm.conf manual https://slurm.schedmd.com/slurm.conf.html#OPT_SelectType_1 (if a better place cannot be found).

My guess is that instructions would be:

1. Change SelectType=select/cons_res to SelectType=select/cons_tres in slurm.conf
2. If not running Configless, distribute the updated slurm.conf to all nodes.
3. Restart all services immediately:
   systemctl restart slurmctld; clush -ba systemctl restart slurmd

I expect that one *must not* do "scontrol reconfig" in this process since the slurmds get restarted anyway, right?
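Presumably one could then sanity-check the result with something like this (just a sketch, reusing the clush fan-out from step 3):

$ scontrol show config | grep -i SelectType     # controller should now report select/cons_tres
$ clush -ba 'systemctl is-active slurmd'        # every slurmd should be active again after the restart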

I would like to await your quality assurance response before attempting this procedure.

Thanks,
Ole
Comment 12 Ben Roberts 2022-11-23 09:30:29 MST
Hi Ole,

After my last message I did put together a documentation patch to make it clear that you must restart slurmd along with the controller.  That is awaiting review to be added to the docs.

Your steps are right.  You're also correct that you don't need to do a 'scontrol reconfig' after restarting slurmctld because it's redundant.  The reconfigure is a way to have slurmctld pick up certain changes to the slurm.conf without restarting.  In this case the restart is required so there is no need for the reconfigure.

Here is the proof of concept for the procedure.  I'm running 21.08.8-2 with the cons_res plugin and with configless enabled.  I primarily use a multiple slurmd configuration, so I added an external node (kitt) to verify the configless portion of it.

$ slurmctld -V
slurm 21.08.8-2

$ scontrol show config | egrep -i 'configless|cons_'
SelectType              = select/cons_res
SlurmctldParameters     = enable_configless

$ sinfo -pdebug
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite     10   idle kitt,node[01-09]



I submit 10 jobs that request enough processors to occupy 2 nodes each.  5 jobs start immediately and 5 are queued.

$ sbatch -pdebug -n48 --wrap='srun sleep 60'               
Submitted batch job 67118386
...
$ sbatch -pdebug -n48 --wrap='srun sleep 60'
Submitted batch job 67118395

$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          67118395     debug     wrap      ben PD       0:00      2 (None)
          67118394     debug     wrap      ben PD       0:00      2 (Priority)
          67118393     debug     wrap      ben PD       0:00      2 (Priority)
          67118392     debug     wrap      ben PD       0:00      2 (Priority)
          67118391     debug     wrap      ben PD       0:00      2 (Resources)
          67118387     debug     wrap      ben  R       0:01      2 node[01-02]
          67118388     debug     wrap      ben  R       0:01      2 node[03-04]
          67118389     debug     wrap      ben  R       0:01      2 node[05-06]
          67118390     debug     wrap      ben  R       0:01      2 kitt,node09
          67118386     debug     wrap      ben  R       0:04      2 node[07-08]



I edit the slurm.conf to enable the cons_tres plugin.  I restart the services on the controller with the steps shown below.  I also restarted slurmd on the external node (not shown).

$ vim slurm.conf

$ scontrol shutdown; sacctmgr shutdown; device-delete.sh 3

$ device-create.sh 3; slurmdbd; sudo ~/slurm/src/21-08/knight/etc/nodes.sh; sleep 1; slurmctld -i
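(The device-*.sh and nodes.sh calls above are part of my local test harness. On a production cluster managed by systemd, the equivalent would presumably be closer to your step 3, i.e. something like:)

$ sudo systemctl restart slurmctld
$ clush -ba sudo systemctl restart slurmd       # restart slurmd on every compute node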



You can see the change is picked up correctly.

$ scontrol show config | egrep -i 'configless|cons_'
SelectType              = select/cons_tres
SlurmctldParameters     = enable_configless



I allow the existing jobs to run to completion.  The next 5 jobs start on the nodes as expected.

$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          67118392     debug     wrap      ben PD       0:00      2 (Priority)
          67118391     debug     wrap      ben PD       0:00      2 (Resources)
          67118395     debug     wrap      ben PD       0:00      2 (Priority)
          67118394     debug     wrap      ben PD       0:00      2 (Priority)
          67118393     debug     wrap      ben PD       0:00      2 (Priority)
          67118387     debug     wrap      ben  R       0:57      2 node[01-02]
          67118388     debug     wrap      ben  R       0:57      2 node[03-04]
          67118389     debug     wrap      ben  R       0:57      2 node[05-06]
          67118390     debug     wrap      ben  R       0:57      2 kitt,node09
          67118386     debug     wrap      ben  R       1:00      2 node[07-08]

$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          67118391     debug     wrap      ben  R       0:10      2 node[07-08]
          67118392     debug     wrap      ben  R       0:10      2 node[01,09]
          67118393     debug     wrap      ben  R       0:10      2 node[02-03]
          67118394     debug     wrap      ben  R       0:10      2 node[04-05]
          67118395     debug     wrap      ben  R       0:10      2 kitt,node06



As a verification, the job that ran on the external node completed with a 0 exit code.

$ sacct -X -j 67118390
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
67118390           wrap      debug       sub1         48  COMPLETED      0:0 


Let me know if anything is unclear.

Thanks,
Ben
Comment 13 Ole.H.Nielsen@fysik.dtu.dk 2022-11-24 01:15:50 MST
I have configured cons_tres now as described in Comment 11 and everything seems to be fine!  I suppose this case can be closed now.

Thanks for your support,
Ole
Comment 14 Ben Roberts 2022-11-28 09:04:59 MST
I'm glad to hear that you were able to make this change successfully.  My apologies again for the failed initial attempt.  Since I have a documentation patch waiting for review I'll leave this ticket open until that is finished.  I'll let you know once it's done.

Thanks,
Ben
Comment 16 Ben Roberts 2022-11-30 12:13:09 MST
Hi Ole,

The update to the documentation has been checked in with the following commit:
https://github.com/SchedMD/slurm/commit/664628d7bf062f375baf42bfea280af302f857ee

I'll go ahead and close this ticket but don't hesitate to let us know if you need anything in the future.

Thanks,
Ben
Comment 17 Ole.H.Nielsen@fysik.dtu.dk 2022-12-01 03:41:26 MST
Hi Ben,

(In reply to Ben Roberts from comment #16)
> The update to the documentation has been checked in with the following
> commit:
> https://github.com/SchedMD/slurm/commit/
> 664628d7bf062f375baf42bfea280af302f857ee
> 
> I'll go ahead and close this ticket but don't hesitate to let us know if you
> need anything in the future.

Thanks for the update.  I'm worried about this documentation, however:

> When changed, all job information (running and pending) will be
> lost, since the job state save format used by each plugin is different.

My reading of this statement is that losing all job information will cause running jobs to crash!  That would be a complete showstopper!  But that's not what I experienced after configuring cons_tres: the jobs continued running without interruption and I didn't see any issues.

Could we ask for some clarification of what "job state save format" means here?

Thanks,
Ole
Comment 18 Ben Roberts 2022-12-01 08:24:23 MST
We do try to make it clear that this isn't the case when changing from cons_res to cons_tres.  Further down in the paragraph it has this:
> The only exception to this is when changing from cons_res to
> cons_tres or from cons_tres to cons_res. However, if a job contains
> cons_tres-specific features and then SelectType is changed to
> cons_res, the job will be canceled, since there is no way for
> cons_res to satisfy requirements specific to cons_tres. 

In other words, changes that involve the 'cray_aries' or 'linear' plugins would cause running and pending job information to be lost.  Let me know if that still sounds unclear.
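For instance (purely illustrative), a pending job submitted with a cons_tres-only option such as

$ sbatch --gpus-per-task=1 --wrap='srun hostname'

could not be interpreted by cons_res after switching back, so it would be canceled, while a plain --gres=gpu:N request should survive the change in either direction since cons_res understands it.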

Thanks,
Ben
Comment 19 Ole.H.Nielsen@fysik.dtu.dk 2022-12-01 08:42:28 MST
Hi Ben,

(In reply to Ben Roberts from comment #18)
> We do try to make it clear that this isn't the case when changing from
> cons_res to cons_tres.  Further down in the paragraph it has this:
> > The only exception to this is when changing from cons_res to
> > cons_tres or from cons_tres to cons_res. However, if a job contains
> > cons_tres-specific features and then SelectType is changed to
> > cons_res, the job will be canceled, since there is no way for
> > cons_res to satisfy requirements specific to cons_tres. 
> 
> In other words, changes that involve the 'cray_aries' or 'linear' plugins
> would cause running and pending job information to be lost.  Let me know if
> that still sounds unclear.

Thanks; one obviously has to read the entire context (I had read only the patch).  I'm fine with the docs now.

/Ole