I have a small GPU cluster with 2 compute nodes, where each node has 2 GPUs.

$ sinfo -o "%20N %10c %10m %25f %30G "
NODELIST             CPUS       MEMORY     AVAIL_FEATURES            GRES
o186i[126-127]       128        64000      (null)                    gpu:nvidia_a40:2(S:0-1)

In my batch script, I request 4 GPUs and let Slurm decide how many nodes to allocate. I also tell it I want 1 task per node.

$ cat rig_batch.sh
#!/usr/bin/env bash
#SBATCH --ntasks-per-node=1
#SBATCH --nodes=1-9
#SBATCH --gpus=4
#SBATCH --error=/home/corujor/slurm-error.log
#SBATCH --output=/home/corujor/slurm-output.log

bash -c 'echo $(hostname):SLURM_JOBID=${SLURM_JOBID}:SLURM_PROCID=${SLURM_PROCID}:CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}'

I submit my batch script on slurm-22.05.2:

$ sbatch rig_batch.sh
Submitted batch job 7

I get the expected result: since each compute node has 2 GPUs and I requested 4 GPUs, Slurm allocated 2 nodes with 1 task per node.

$ cat slurm-output.log
o186i126:SLURM_JOBID=7:SLURM_PROCID=0:CUDA_VISIBLE_DEVICES=0,1
o186i127:SLURM_JOBID=7:SLURM_PROCID=1:CUDA_VISIBLE_DEVICES=0,1

However, when I submit the same batch script on slurm-22.05.7, it fails:

$ sbatch rig_batch.sh
sbatch: error: Batch job submission failed: Requested node configuration is not available

Here is my configuration.

$ scontrol show config
Configuration data as of 2023-01-12T21:38:55
AccountingStorageBackupHost = (null)
AccountingStorageEnforce = none
AccountingStorageHost = localhost
AccountingStorageExternalHost = (null)
AccountingStorageParameters = (null)
AccountingStoragePort = 6819
AccountingStorageTRES = cpu,mem,energy,node,billing,fs/disk,vmem,pages
AccountingStorageType = accounting_storage/slurmdbd
AccountingStorageUser = N/A
AccountingStoreFlags = (null)
AcctGatherEnergyType = acct_gather_energy/none
AcctGatherFilesystemType = acct_gather_filesystem/none
AcctGatherInterconnectType = acct_gather_interconnect/none
AcctGatherNodeFreq = 0 sec
AcctGatherProfileType = acct_gather_profile/none
AllowSpecResourcesUsage = No
AuthAltTypes = (null)
AuthAltParameters = (null)
AuthInfo = (null)
AuthType = auth/munge
BatchStartTimeout = 10 sec
BcastExclude = /lib,/usr/lib,/lib64,/usr/lib64
BcastParameters = (null)
BOOT_TIME = 2023-01-12T17:17:11
BurstBufferType = (null)
CliFilterPlugins = (null)
ClusterName = grenoble_test
CommunicationParameters = (null)
CompleteWait = 0 sec
CoreSpecPlugin = core_spec/none
CpuFreqDef = Unknown
CpuFreqGovernors = OnDemand,Performance,UserSpace
CredType = cred/munge
DebugFlags = Gres
DefMemPerNode = UNLIMITED
DependencyParameters = (null)
DisableRootJobs = Yes
EioTimeout = 60
EnforcePartLimits = ANY
Epilog = (null)
EpilogMsgTime = 2000 usec
EpilogSlurmctld = (null)
ExtSensorsType = ext_sensors/none
ExtSensorsFreq = 0 sec
FederationParameters = (null)
FirstJobId = 1
GetEnvTimeout = 2 sec
GresTypes = gpu
GpuFreqDef = high,memory=high
GroupUpdateForce = 1
GroupUpdateTime = 600 sec
HASH_VAL = Match
HealthCheckInterval = 0 sec
HealthCheckNodeState = ANY
HealthCheckProgram = (null)
InactiveLimit = 0 sec
InteractiveStepOptions = --interactive --preserve-env --pty $SHELL
JobAcctGatherFrequency = 30
JobAcctGatherType = jobacct_gather/none
JobAcctGatherParams = (null)
JobCompHost = localhost
JobCompLoc = /var/log/slurm_jobcomp.log
JobCompPort = 0
JobCompType = jobcomp/none
JobCompUser = root
JobContainerType = job_container/none
JobCredentialPrivateKey = /apps/slurm/etc/.slurm.key
JobCredentialPublicCertificate = /apps/slurm/etc/slurm.cert
JobDefaults = (null)
JobFileAppend = 0
JobRequeue = 1
JobSubmitPlugins = (null)
KillOnBadExit = 0
KillWait = 30 sec
LaunchParameters = use_interactive_step
LaunchType = launch/slurm
Licenses = (null)
LogTimeFormat = iso8601_ms
MailDomain = (null)
MailProg = /bin/mail
MaxArraySize = 1001
MaxDBDMsgs = 20008
MaxJobCount = 10000
MaxJobId = 67043328
MaxMemPerNode = UNLIMITED
MaxNodeCount = 2
MaxStepCount = 40000
MaxTasksPerNode = 512
MCSPlugin = mcs/none
MCSParameters = (null)
MessageTimeout = 10 sec
MinJobAge = 300 sec
MpiDefault = pmix
MpiParams = (null)
NEXT_JOB_ID = 274
NodeFeaturesPlugins = (null)
OverTimeLimit = 0 min
PluginDir = /apps/slurm-22-05-7-1/lib/slurm
PlugStackConfig = (null)
PowerParameters = (null)
PowerPlugin =
PreemptMode = OFF
PreemptType = preempt/none
PreemptExemptTime = 00:00:00
PrEpParameters = (null)
PrEpPlugins = prep/script
PriorityParameters = (null)
PrioritySiteFactorParameters = (null)
PrioritySiteFactorPlugin = (null)
PriorityType = priority/basic
PrivateData = none
ProctrackType = proctrack/linuxproc
Prolog = (null)
PrologEpilogTimeout = 65534
PrologSlurmctld = (null)
PrologFlags = (null)
PropagatePrioProcess = 0
PropagateResourceLimits = ALL
PropagateResourceLimitsExcept = (null)
RebootProgram = (null)
ReconfigFlags = (null)
RequeueExit = (null)
RequeueExitHold = (null)
ResumeFailProgram = (null)
ResumeProgram = (null)
ResumeRate = 300 nodes/min
ResumeTimeout = 60 sec
ResvEpilog = (null)
ResvOverRun = 0 min
ResvProlog = (null)
ReturnToService = 1
RoutePlugin = route/default
SchedulerParameters = (null)
SchedulerTimeSlice = 30 sec
SchedulerType = sched/backfill
ScronParameters = (null)
SelectType = select/cons_tres
SelectTypeParameters = CR_CPU
SlurmUser = slurm(1182)
SlurmctldAddr = (null)
SlurmctldDebug = debug
SlurmctldHost[0] = o186i208
SlurmctldLogFile = /var/log/slurmctld.log
SlurmctldPort = 6817
SlurmctldSyslogDebug = (null)
SlurmctldPrimaryOffProg = (null)
SlurmctldPrimaryOnProg = (null)
SlurmctldTimeout = 120 sec
SlurmctldParameters = (null)
SlurmdDebug = info
SlurmdLogFile = /var/log/slurmd.log
SlurmdParameters = (null)
SlurmdPidFile = /var/run/slurmd.pid
SlurmdPort = 6818
SlurmdSpoolDir = /var/spool/slurmd
SlurmdSyslogDebug = (null)
SlurmdTimeout = 300 sec
SlurmdUser = root(0)
SlurmSchedLogFile = (null)
SlurmSchedLogLevel = 0
SlurmctldPidFile = /var/slurm/run/slurmctld.pid
SlurmctldPlugstack = (null)
SLURM_CONF = /apps/slurm-22-05-7-1/etc/slurm.conf
SLURM_VERSION = 22.05.7
SrunEpilog = (null)
SrunPortRange = 0-0
SrunProlog = (null)
StateSaveLocation = /var/spool/slurmctld
SuspendExcNodes = (null)
SuspendExcParts = (null)
SuspendProgram = (null)
SuspendRate = 60 nodes/min
SuspendTime = INFINITE
SuspendTimeout = 30 sec
SwitchParameters = (null)
SwitchType = switch/none
TaskEpilog = (null)
TaskPlugin = task/affinity
TaskPluginParam = (null type)
TaskProlog = (null)
TCPTimeout = 2 sec
TmpFS = /tmp
TopologyParam = (null)
TopologyPlugin = topology/none
TrackWCKey = No
TreeWidth = 50
UsePam = No
UnkillableStepProgram = (null)
UnkillableStepTimeout = 60 sec
VSizeFactor = 0 percent
WaitTime = 0 sec
X11Parameters = (null)

MPI Plugins Configuration:
PMIxCliTmpDirBase = (null)
PMIxCollFence = (null)
PMIxDebug = 0
PMIxDirectConn = yes
PMIxDirectConnEarly = no
PMIxDirectConnUCX = no
PMIxDirectSameArch = no
PMIxEnv = (null)
PMIxFenceBarrier = no
PMIxNetDevicesUCX = (null)
PMIxTimeout = 300
PMIxTlsUCX = (null)

Slurmctld(primary) at o186i208 is UP

The only difference when I run this with slurm-22.05.2 is that I have to make this change, or Slurm will complain:
#MpiDefault=pmix
MpiDefault=none

Other than that, the same configuration is used for both slurm-22.05.2 and slurm-22.05.7. In both cases, I am running on the same cluster using the same compute nodes, just pointing to different versions of Slurm.
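For comparison, here is a sketch of how the same allocation shape could be requested explicitly instead of being derived from a job-wide --gpus total. This variant was not part of my testing, and it assumes the intent is exactly 2 nodes with 2 GPUs each:

#!/usr/bin/env bash
# Hypothetical alternative to rig_batch.sh: fix the allocation shape
# explicitly instead of letting Slurm derive it from --gpus=4.
#SBATCH --ntasks-per-node=1
#SBATCH --nodes=2
#SBATCH --gpus-per-node=2
#SBATCH --error=/home/corujor/slurm-error.log
#SBATCH --output=/home/corujor/slurm-output.log

bash -c 'echo $(hostname):SLURM_JOBID=${SLURM_JOBID}:SLURM_PROCID=${SLURM_PROCID}:CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}'

The trade-off is that the explicit form no longer lets Slurm choose the node count from the GPU total, which is exactly what the --gpus form above is meant to do.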
Hi,

I can recreate this bug, and I know its source. I will let you know when we have a patch.

Dominik
Can you please let me know in which version of Slurm this bug was introduced? We have a customer who wants to install a version of Slurm newer than 22.05.2 because of a different bug, and they want to know whether they can safely install it without hitting this issue.

Thank you,
Rigoberto
Hi,

This regression was introduced in 22.05.5.

Dominik
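For reference, a minimal shell sketch for checking whether a given installation reports a version at or after 22.05.5. It assumes scontrol can reach the controller, and it does not account for later releases that already contain the fix:

# Minimal sketch: flag a Slurm installation at or after 22.05.5 as
# potentially affected by this regression.
ver=$(scontrol show config | awk '/^SLURM_VERSION/ {print $3}')
if printf '%s\n%s\n' "22.05.5" "$ver" | sort -V -C; then
    echo "Slurm $ver is at or after 22.05.5 (potentially affected)"
else
    echo "Slurm $ver predates 22.05.5"
fi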
*** Bug 15925 has been marked as a duplicate of this bug. ***
Hi,

These commits fix this and a few related issues:

https://github.com/SchedMD/slurm/compare/71119adcf9c...a0a8563f04

The commits are on the 22.05 branch and will be included in the next 22.05 release. Let me know if you have any additional questions; otherwise, I will close this bug.

Dominik
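For anyone building from source, a sketch of one way to check whether a given branch or tag already contains these commits, run inside an up-to-date clone of the repository. Here a0a8563f04 is simply the last commit in the compare URL above, and the assumption is that a ref containing it also contains the rest of the fix:

# Example: test whether the ref you are about to build includes the tip of
# the fix range. Replace origin/slurm-22.05 with the branch or tag you
# actually plan to build.
if git merge-base --is-ancestor a0a8563f04 origin/slurm-22.05; then
    echo "ref contains the fix commits"
else
    echo "ref does not contain the fix commits yet"
fi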
Thank you. Just so we can let our customers know, which specific 22.05 release is this going into?
Hi,

This fix will be included in 22.05.9 and above. We have no exact release date yet, but I expect it to happen at the beginning of March.

Dominik
Hi,

I'll go ahead and close this ticket out. Feel free to reopen it if needed.

Dominik