Bug 15857 - Regression from slurm-22.05.2 to slurm-22.05.7 when using "--gpus=N" option
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Regression
Version: 22.05.7
Hardware: Linux Linux
Importance: --- 2 - High Impact
Assignee: Dominik Bartkiewicz
QA Contact:
URL:
Duplicates: 15925
Depends on:
Blocks:
 
Reported: 2023-01-24 07:46 MST by Rigoberto Corujo
Modified: 2023-11-20 10:47 MST
CC List: 3 users

See Also:
Site: HPE AI
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 22.05.9
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Description Rigoberto Corujo 2023-01-24 07:46:05 MST
I have a small 2-compute-node GPU cluster, where each node has 2 GPUs.

$ sinfo -o "%20N  %10c  %10m  %25f  %30G "
NODELIST              CPUS        MEMORY      AVAIL_FEATURES             GRES                          
o186i[126-127]        128         64000       (null)                     gpu:nvidia_a40:2(S:0-1) 
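
For reference, the per-node view of the same GRES information can be pulled with scontrol (a quick sketch; output omitted here):

$ # show the GRES configured on one of the two compute nodes
$ scontrol show node o186i126 | grep -i gres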

In my batch script, I request 4 GPUs and let Slurm automatically decide how many nodes to allocate.  I also tell it I want 1 task per node.


$ cat rig_batch.sh
#!/usr/bin/env bash

#SBATCH --ntasks-per-node=1
#SBATCH --nodes=1-9
#SBATCH --gpus=4
#SBATCH --error=/home/corujor/slurm-error.log
#SBATCH --output=/home/corujor/slurm-output.log

bash -c 'echo $(hostname):SLURM_JOBID=${SLURM_JOBID}:SLURM_PROCID=${SLURM_PROCID}:CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}'
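
For completeness, a roughly equivalent interactive request with srun would look like this (just a sketch; the tests below all use the batch script):

$ srun --ntasks-per-node=1 --nodes=1-9 --gpus=4 \
    bash -c 'echo $(hostname):CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}'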


I submit my batch script on slurm-22.05.2.

$ sbatch rig_batch.sh
Submitted batch job 7


I get the expected results.  That is, since each compute node has 2 GPUs and I requested 4 GPUs, Slurm allocated 2 nodes with 1 task per node.


$ cat slurm-output.log
o186i126:SLURM_JOBID=7:SLURM_PROCID=0:CUDA_VISIBLE_DEVICES=0,1
o186i127:SLURM_JOBID=7:SLURM_PROCID=1:CUDA_VISIBLE_DEVICES=0,1
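
The allocation can also be double-checked after the fact (a sketch, assuming the job record is still in slurmdbd; gres/gpu only shows up in the TRES fields if it is added to AccountingStorageTRES):

$ # NNodes/NodeList should show the two allocated nodes
$ sacct -j 7 --format=JobID,NNodes,NodeList,State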


However, when I try to submit the same batch script on slurm-22.05.7, it fails.

$ sbatch rig_batch.sh
sbatch: error: Batch job submission failed: Requested node configuration is not available
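
A possible workaround sketch (not something I have verified to be equivalent in every case) is to spell out the per-node GPU count instead of the per-job total:

#SBATCH --ntasks-per-node=1
#SBATCH --nodes=2
#SBATCH --gpus-per-node=2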


Here is my configuration.


$ scontrol show config
Configuration data as of 2023-01-12T21:38:55
AccountingStorageBackupHost = (null)
AccountingStorageEnforce = none
AccountingStorageHost   = localhost
AccountingStorageExternalHost = (null)
AccountingStorageParameters = (null)
AccountingStoragePort   = 6819
AccountingStorageTRES   = cpu,mem,energy,node,billing,fs/disk,vmem,pages
AccountingStorageType   = accounting_storage/slurmdbd
AccountingStorageUser   = N/A
AccountingStoreFlags    = (null)
AcctGatherEnergyType    = acct_gather_energy/none
AcctGatherFilesystemType = acct_gather_filesystem/none
AcctGatherInterconnectType = acct_gather_interconnect/none
AcctGatherNodeFreq      = 0 sec
AcctGatherProfileType   = acct_gather_profile/none
AllowSpecResourcesUsage = No
AuthAltTypes            = (null)
AuthAltParameters       = (null)
AuthInfo                = (null)
AuthType                = auth/munge
BatchStartTimeout       = 10 sec
BcastExclude            = /lib,/usr/lib,/lib64,/usr/lib64
BcastParameters         = (null)
BOOT_TIME               = 2023-01-12T17:17:11
BurstBufferType         = (null)
CliFilterPlugins        = (null)
ClusterName             = grenoble_test
CommunicationParameters = (null)
CompleteWait            = 0 sec
CoreSpecPlugin          = core_spec/none
CpuFreqDef              = Unknown
CpuFreqGovernors        = OnDemand,Performance,UserSpace
CredType                = cred/munge
DebugFlags              = Gres
DefMemPerNode           = UNLIMITED
DependencyParameters    = (null)
DisableRootJobs         = Yes
EioTimeout              = 60
EnforcePartLimits       = ANY
Epilog                  = (null)
EpilogMsgTime           = 2000 usec
EpilogSlurmctld         = (null)
ExtSensorsType          = ext_sensors/none
ExtSensorsFreq          = 0 sec
FederationParameters    = (null)
FirstJobId              = 1
GetEnvTimeout           = 2 sec
GresTypes               = gpu
GpuFreqDef              = high,memory=high
GroupUpdateForce        = 1
GroupUpdateTime         = 600 sec
HASH_VAL                = Match
HealthCheckInterval     = 0 sec
HealthCheckNodeState    = ANY
HealthCheckProgram      = (null)
InactiveLimit           = 0 sec
InteractiveStepOptions  = --interactive --preserve-env --pty $SHELL
JobAcctGatherFrequency  = 30
JobAcctGatherType       = jobacct_gather/none
JobAcctGatherParams     = (null)
JobCompHost             = localhost
JobCompLoc              = /var/log/slurm_jobcomp.log
JobCompPort             = 0
JobCompType             = jobcomp/none
JobCompUser             = root
JobContainerType        = job_container/none
JobCredentialPrivateKey = /apps/slurm/etc/.slurm.key
JobCredentialPublicCertificate = /apps/slurm/etc/slurm.cert
JobDefaults             = (null)
JobFileAppend           = 0
JobRequeue              = 1
JobSubmitPlugins        = (null)
KillOnBadExit           = 0
KillWait                = 30 sec
LaunchParameters        = use_interactive_step
LaunchType              = launch/slurm
Licenses                = (null)
LogTimeFormat           = iso8601_ms
MailDomain              = (null)
MailProg                = /bin/mail
MaxArraySize            = 1001
MaxDBDMsgs              = 20008
MaxJobCount             = 10000
MaxJobId                = 67043328
MaxMemPerNode           = UNLIMITED
MaxNodeCount            = 2
MaxStepCount            = 40000
MaxTasksPerNode         = 512
MCSPlugin               = mcs/none
MCSParameters           = (null)
MessageTimeout          = 10 sec
MinJobAge               = 300 sec
MpiDefault              = pmix
MpiParams               = (null)
NEXT_JOB_ID             = 274
NodeFeaturesPlugins     = (null)
OverTimeLimit           = 0 min
PluginDir               = /apps/slurm-22-05-7-1/lib/slurm
PlugStackConfig         = (null)
PowerParameters         = (null)
PowerPlugin             =
PreemptMode             = OFF
PreemptType             = preempt/none
PreemptExemptTime       = 00:00:00
PrEpParameters          = (null)
PrEpPlugins             = prep/script
PriorityParameters      = (null)
PrioritySiteFactorParameters = (null)
PrioritySiteFactorPlugin = (null)
PriorityType            = priority/basic
PrivateData             = none
ProctrackType           = proctrack/linuxproc
Prolog                  = (null)
PrologEpilogTimeout     = 65534
PrologSlurmctld         = (null)
PrologFlags             = (null)
PropagatePrioProcess    = 0
PropagateResourceLimits = ALL
PropagateResourceLimitsExcept = (null)
RebootProgram           = (null)
ReconfigFlags           = (null)
RequeueExit             = (null)
RequeueExitHold         = (null)
ResumeFailProgram       = (null)
ResumeProgram           = (null)
ResumeRate              = 300 nodes/min
ResumeTimeout           = 60 sec
ResvEpilog              = (null)
ResvOverRun             = 0 min
ResvProlog              = (null)
ReturnToService         = 1
RoutePlugin             = route/default
SchedulerParameters     = (null)
SchedulerTimeSlice      = 30 sec
SchedulerType           = sched/backfill
ScronParameters         = (null)
SelectType              = select/cons_tres
SelectTypeParameters    = CR_CPU
SlurmUser               = slurm(1182)
SlurmctldAddr           = (null)
SlurmctldDebug          = debug
SlurmctldHost[0]        = o186i208
SlurmctldLogFile        = /var/log/slurmctld.log
SlurmctldPort           = 6817
SlurmctldSyslogDebug    = (null)
SlurmctldPrimaryOffProg = (null)
SlurmctldPrimaryOnProg  = (null)
SlurmctldTimeout        = 120 sec
SlurmctldParameters     = (null)
SlurmdDebug             = info
SlurmdLogFile           = /var/log/slurmd.log
SlurmdParameters        = (null)
SlurmdPidFile           = /var/run/slurmd.pid
SlurmdPort              = 6818
SlurmdSpoolDir          = /var/spool/slurmd
SlurmdSyslogDebug       = (null)
SlurmdTimeout           = 300 sec
SlurmdUser              = root(0)
SlurmSchedLogFile       = (null)
SlurmSchedLogLevel      = 0
SlurmctldPidFile        = /var/slurm/run/slurmctld.pid
SlurmctldPlugstack      = (null)
SLURM_CONF              = /apps/slurm-22-05-7-1/etc/slurm.conf
SLURM_VERSION           = 22.05.7
SrunEpilog              = (null)
SrunPortRange           = 0-0
SrunProlog              = (null)
StateSaveLocation       = /var/spool/slurmctld
SuspendExcNodes         = (null)
SuspendExcParts         = (null)
SuspendProgram          = (null)
SuspendRate             = 60 nodes/min
SuspendTime             = INFINITE
SuspendTimeout          = 30 sec
SwitchParameters        = (null)
SwitchType              = switch/none
TaskEpilog              = (null)
TaskPlugin              = task/affinity
TaskPluginParam         = (null type)
TaskProlog              = (null)
TCPTimeout              = 2 sec
TmpFS                   = /tmp
TopologyParam           = (null)
TopologyPlugin          = topology/none
TrackWCKey              = No
TreeWidth               = 50
UsePam                  = No
UnkillableStepProgram   = (null)
UnkillableStepTimeout   = 60 sec
VSizeFactor             = 0 percent
WaitTime                = 0 sec
X11Parameters           = (null)
 

MPI Plugins Configuration:

PMIxCliTmpDirBase       = (null)
PMIxCollFence           = (null)
PMIxDebug               = 0
PMIxDirectConn          = yes
PMIxDirectConnEarly     = no
PMIxDirectConnUCX       = no
PMIxDirectSameArch      = no
PMIxEnv                 = (null)
PMIxFenceBarrier        = no
PMIxNetDevicesUCX       = (null)
PMIxTimeout             = 300
PMIxTlsUCX              = (null)

Slurmctld(primary) at o186i208 is UP



The only difference when I run this with slurm-22.05.2 is that I have to make this change or Slurm will complain.  Other than that, the same configuration is used for both slurm-22.05.2 and slurm-22.05.7.  In both cases, I am running on the same cluster using the same compute nodes, just pointing to different versions of Slurm.

#MpiDefault=pmix
MpiDefault=none
Comment 3 Dominik Bartkiewicz 2023-01-25 07:06:57 MST
Hi

I can reproduce this bug, and I know its source.
I will let you know when we have a patch.

Dominik
Comment 7 Rigoberto Corujo 2023-02-02 13:02:21 MST
Can you please let me know in which version of Slurm this bug was introduced?  We have a customer who wants to install a version of Slurm newer than 22.05.2 because of a different bug, and they want to know whether they can safely install it without hitting this issue.

Thank you,

Rigoberto
Comment 8 Dominik Bartkiewicz 2023-02-03 04:47:22 MST
Hi

This regression was added in 22.05.5.
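
If the customer wants to confirm which release a given installation is running before upgrading, any of the client commands reports the version string (a minimal sketch):

$ # prints e.g. "slurm 22.05.x" for the installed client tools
$ sinfo --version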

Dominik
Comment 17 Chad Vizino 2023-02-08 16:46:49 MST
*** Bug 15925 has been marked as a duplicate of this bug. ***
Comment 18 Dominik Bartkiewicz 2023-02-09 04:05:50 MST
Hi

The following commits fix this and a few related issues:
https://github.com/SchedMD/slurm/compare/71119adcf9c...a0a8563f04

The commits are in the 22.05 branch and will be included in the next 22.05 release. Let me know if you have any additional questions; otherwise, I will close this bug.
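
If you want to review the exact change set locally, a rough sketch with git (assuming a clone of the public SchedMD/slurm repository):

$ git clone https://github.com/SchedMD/slurm.git
$ cd slurm
$ # list the commits in the fix range linked above
$ git log --oneline 71119adcf9c..a0a8563f04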

Dominik
Comment 19 Rigoberto Corujo 2023-02-09 08:11:25 MST
Thank you.  Just so we can let our customers know, which specific 22.05 release is this going into?
Comment 20 Dominik Bartkiewicz 2023-02-10 03:47:30 MST
Hi

This fix will be included in 22.05.9 and above.
We don't have an exact release date yet, but I expect it to happen in early March.

Dominik
Comment 21 Dominik Bartkiewicz 2023-02-17 07:25:07 MST
Hi

I'll go ahead and close this ticket out. Feel free to reopen if needed.

Dominik