Ticket 7311 - SLURM_NTASKS isn't consistent across nodes in a job
Summary: SLURM_NTASKS isn't consistent across nodes in a job
Status: RESOLVED TIMEDOUT
Alias: None
Product: Slurm
Classification: Unclassified
Component: User Commands
Version: 20.02.6
Hardware: Linux
OS: Linux
Importance: --- 4 - Minor Issue
Assignee: Nate Rini
QA Contact:
URL:
Duplicates: 6166 6167
Depends on:
Blocks:
 
Reported: 2019-06-27 12:54 MDT by Kaylea Nelson
Modified: 2021-03-10 15:11 MST
CC: 4 users

See Also:
Site: Yale
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
test submission script (114 bytes, text/x-sh)
2019-07-02 12:03 MDT, Kaylea Nelson
example out file (336 bytes, text/plain)
2019-07-02 12:04 MDT, Kaylea Nelson
grace 20.02.6 conf (7.97 KB, application/zip)
2021-02-18 11:58 MST, Kaylea Nelson
grace 20.02.6 conf (7.97 KB, application/zip)
2021-02-18 12:00 MST, Kaylea Nelson
grace cgroup.conf (144 bytes, text/plain)
2021-02-18 12:00 MST, Kaylea Nelson

Description Kaylea Nelson 2019-06-27 12:54:21 MDT
We have noticed an odd behavior where the SLURM_NTASKS environment variable isn't the same on all nodes in a job. Only the master node on the job appears to report the correct value (SLURM_NTASKS = --ntasks). Are we misunderstanding how this environment variable is supposed to be used?

$ srun --pty -n 10 -N 3 bash
$ module load MPI/OpenMPI/2.1.1-gcc
$ mpirun bash -c 'hostname; echo $SLURM_NTASKS'
c35n07.grace.hpc.yale.internal
10
c35n08.grace.hpc.yale.internal
2
c35n08.grace.hpc.yale.internal
2
c35n08.grace.hpc.yale.internal
2
c35n08.grace.hpc.yale.internal
2
c35n08.grace.hpc.yale.internal
2
c35n08.grace.hpc.yale.internal
2
c35n08.grace.hpc.yale.internal
2
c35n08.grace.hpc.yale.internal
2
c36n11.grace.hpc.yale.internal
2
Comment 1 Nate Rini 2019-06-27 13:17:55 MDT
Kaylea,

Can you please provide the 'scontrol show job $JOBID' output for the job in your last message?

Thanks,
--Nate
Comment 2 Kaylea Nelson 2019-07-01 08:11:05 MDT
Similar job:

$ mpirun bash -c 'hostname; echo $SLURM_NTASKS'
c36n03.grace.hpc.yale.internal
10
c36n10.grace.hpc.yale.internal
2
c36n10.grace.hpc.yale.internal
2
c36n10.grace.hpc.yale.internal
2
c36n10.grace.hpc.yale.internal
2
c36n10.grace.hpc.yale.internal
2
c36n10.grace.hpc.yale.internal
2
c36n10.grace.hpc.yale.internal
2
c36n10.grace.hpc.yale.internal
2
c36n11.grace.hpc.yale.internal
2

$ scontrol show job 34698950
JobId=34698950 JobName=bash
   UserId=kln26(11135) GroupId=hpcprog(10042) MCS_label=N/A
   Priority=33420 Nice=0 Account=hpcprog QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:00:46 TimeLimit=01:00:00 TimeMin=N/A
   SubmitTime=2019-07-01T10:09:30 EligibleTime=2019-07-01T10:09:30
   AccrueTime=Unknown
   StartTime=2019-07-01T10:09:30 EndTime=2019-07-01T11:09:30 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2019-07-01T10:09:30
   Partition=admintest AllocNode:Sid=grace2:29049
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=c36n[03,10-11]
   BatchHost=c36n03
   NumNodes=3 NumCPUs=10 NumTasks=10 CPUs/Task=1 ReqB:S:C:T=0:0:*:1
   TRES=cpu=10,mem=50G,node=3
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=5G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=bash
   WorkDir=/gpfs/loomis/home.grace/fas/hpcprog/kln26
   Power=
Comment 3 Nate Rini 2019-07-01 11:05:42 MDT
Kaylea,

I'm working on recreating the issue. I currently suspect it is related to how mpirun is handling the environment.

--Nate
Comment 5 Nate Rini 2019-07-01 18:41:30 MDT
I have confirmed the same issue by matching your software levels. Looking into the cause now.

scaleout1: 10
scaleout1: 10
scaleout2: 2
scaleout3: 2
scaleout2: 2
scaleout2: 2
scaleout3: 2
scaleout3: 2
scaleout1: 10
scaleout1: 10

--Nate
Comment 13 Nate Rini 2019-07-02 10:17:01 MDT
(In reply to Kaylea Nelson from comment #0)
> We have noticed an odd behavior where the SLURM_NTASKS environment variable
> isn't the same on all nodes in a job. Only the master node on the job
> appears to report the correct value (SLURM_NTASKS = --ntasks). Are we
> misunderstanding how this environment variable is supposed to be used?

That is the correct understanding, as documented. It took a little while to replicate, but I managed to reproduce this with OpenMPI 2.1.1 and OpenMPI 4.0.1 on Slurm 19.05 as well.
 
The problem lies in how you are launching the job:
> $ srun --pty -n 10 -N 3 bash
> $ mpirun bash -c 'hostname; echo $SLURM_NTASKS'

This creates a valid job step with ptys linked to task zero. When mpirun is called, it (correctly) detects that it is already inside a job and then only calls srun for the other nodes in the job, not for the node running task zero.

Here is an example of the srun child command mpirun calls:
> srun --ntasks-per-node=1 --kill-on-bad-exit --cpu_bind=none --nodes=2 --nodelist=scaleout2,scaleout3 --ntasks=2 orted -mca orte_debug_daemons "1" -mca ess "slurm" -mca ess_base_jobid "2152267776" -mca ess_base_vpid "1" -mca ess_base_num_procs "3" -mca orte_node_regex "scaleout[1:1-3]@0(3)" -mca orte_hnp_uri "2152267776.0;tcp://10.11.5.1:49981" --mca plm_base_verbose "5" -mca oob_base_verbose "10" -mca rml_base_verbose "10"

Slurm then correctly sets the task count from this child srun command, which differs from the parent srun command the job was started under. This is admittedly confusing, and it would probably be better to take it up with the OpenMPI developers, as Slurm is working as documented.
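
If you want to see this on your side, raising the plm verbosity makes mpirun print the child srun it constructs (a sketch; the --mca flag is an OpenMPI MCA option and the exact output varies by OpenMPI version):
> $ srun --pty -n 10 -N 3 bash
> $ mpirun --mca plm_base_verbose 5 bash -c 'hostname; echo $SLURM_NTASKS'
> # the plm/slurm component logs the srun command line it executes, including
> # the --ntasks value it passes for the nodes other than task zero's node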

I would suggest using `salloc` instead of `srun --pty`, as it will result in the expected job pattern:
> $ salloc -n 9 -N 3 --ntasks-per-node=3 bash 
> $ mpirun bash -c 'echo "$(hostname): $SLURM_NTASKS"'
> scaleout1: 3
> scaleout1: 3
> scaleout1: 3
> scaleout2: 3
> scaleout2: 3
> scaleout2: 3
> scaleout3: 3
> scaleout3: 3
> scaleout3: 3

Is there a specific reason you are using `srun --pty` instead of `salloc`?

Thanks,
--Nate
Comment 14 Kaylea Nelson 2019-07-02 12:03:34 MDT
Hi Nate,

In this example, since SLURM_NTASKS is supposed to equal --ntasks, wouldn't one expect SLURM_NTASKS = 9 (not 3, the number of nodes)?

We actually first encountered this behavior in batch jobs (I've attached a simple batch script example and log), where we also see the incorrect value for SLURM_NTASKS. I included the srun --pty example since it was a straightforward way to reproduce the issue.
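
For reference, here is a minimal sketch of the kind of batch script involved (the attached script may differ slightly; the module name follows the interactive example above):

#!/bin/bash
#SBATCH --ntasks=10
#SBATCH --nodes=3
module load MPI/OpenMPI/2.1.1-gcc
# on our clusters, tasks on the batch host echo 10 here while tasks on the
# other nodes echo the smaller per-node count
mpirun bash -c 'hostname; echo $SLURM_NTASKS'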

Thanks,
Kaylea
Comment 15 Kaylea Nelson 2019-07-02 12:03:49 MDT
Created attachment 10781 [details]
test submission script
Comment 16 Kaylea Nelson 2019-07-02 12:04:03 MDT
Created attachment 10782 [details]
example out file
Comment 22 Nate Rini 2019-07-05 10:59:04 MDT
(In reply to Kaylea Nelson from comment #14)
> We actually first encountered this behavior in batch jobs (I've attached a
> simple batch script example and log), where we also see the incorrect value
> for SLURM_NTASKS. I included the srun --pty example since it was a
> straightforward way to reproduce the issue.

We have confirmed your issue. We are currently looking at how best to proceed and what value SLURM_NTASKS should have in steps. As of right now, it does not agree with the documented value.
Comment 23 Nate Rini 2019-07-16 19:53:37 MDT
*** Ticket 6167 has been marked as a duplicate of this ticket. ***
Comment 31 Nate Rini 2019-08-13 15:01:25 MDT
(In reply to Nate Rini from comment #22)
> (In reply to Kaylea Nelson from comment #14)
> > We actually first encountered this behavior in batch jobs (I've attached a
> > simple batch script example and log), where we also see the incorrect value
> > for SLURM_NTASKS. I included the srun --pty example since it was a
> > straightforward way to reproduce the issue.
> 
> We have confirmed your issue. We are currently looking at how best to
> proceed and what value SLURM_NTASKS should have in steps. As of right now,
> it does not agree with the documented value.

Kaylea,

The environment variable SLURM_NTASKS, when set, is equivalent to passing "srun/salloc/sbatch -n". Since each of these commands also sets SLURM_NTASKS, the next call to srun inherits that count (unless it is explicitly overridden or provided as a command-line argument). I suggest looking at SLURM_STEP_NUM_TASKS instead to find the number of tasks assigned to a given step.
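For example (a sketch; host names and counts are illustrative), launching the step with srun inside an allocation sets the step variable consistently on every node:
> $ salloc -n 10 -N 3
> $ srun bash -c 'echo "$(hostname): $SLURM_STEP_NUM_TASKS"'
> # every task in the step should report 10 here, since SLURM_STEP_NUM_TASKS
> # describes the step itself rather than an inherited -n value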

Do you have any more questions?

Thanks,
--Nate
Comment 33 Nate Rini 2019-08-28 18:12:29 MDT
Kaylea,

We are going to close this ticket as there hasn't been a response in over a week. 

If you have any issues or questions, please respond to this ticket and it will be reopened.

Thanks,
--Nate
Comment 34 Nate Rini 2019-08-28 18:20:30 MDT
*** Ticket 6166 has been marked as a duplicate of this ticket. ***
Comment 35 Kaylea Nelson 2021-02-18 09:51:23 MST
Hi Nate,

Sorry to reopen this ancient ticket, but I just found it in my email backlog and this issue still persists in version 20.02.6.

I tried the suggestion of using SLURM_STEP_NUM_TASKS, but that still reports the incorrect number of tasks on non-main nodes when using "mpirun" instead of "srun".

[kln26@grace1 ~]$ srun --pty -n 10 -N 3 bash
[kln26@c09n01 ~]$ mpirun bash -c 'hostname; echo $SLURM_STEP_NUM_TASKS'

c09n01.grace.hpc.yale.internal
10
c09n01.grace.hpc.yale.internal
10
c09n01.grace.hpc.yale.internal
10
c09n01.grace.hpc.yale.internal
10
c09n03.grace.hpc.yale.internal
2
c09n03.grace.hpc.yale.internal
2
c09n03.grace.hpc.yale.internal
2
c09n04.grace.hpc.yale.internal
2
c09n04.grace.hpc.yale.internal
2
c09n04.grace.hpc.yale.internal
2
Comment 36 Nate Rini 2021-02-18 10:28:50 MST
Kaylea,

Please attach your current slurm.conf and related configuration files, and please note whether your system has any patches applied.

Thanks,
--Nate
Comment 37 Kaylea Nelson 2021-02-18 11:58:43 MST
Created attachment 17995 [details]
grace 20.02.6 conf
Comment 38 Kaylea Nelson 2021-02-18 12:00:01 MST
Created attachment 17996 [details]
grace 20.02.6 conf
Comment 39 Kaylea Nelson 2021-02-18 12:00:31 MST
Created attachment 17997 [details]
grace cgroup.conf
Comment 40 Kaylea Nelson 2021-02-18 12:01:33 MST
I've attached the conf for our Grace cluster (I can reproduce this issue on all four of our clusters). Let me know if you need anything else.

We have this patch applied on Grace: https://bugs.schedmd.com/show_bug.cgi?id=10824#c30
Comment 42 Nate Rini 2021-02-22 15:33:30 MST
My attempt to recreate the issue failed. Can you please attach your slurmctld log from the time of job submission, along with the output of the following for your test job:
> scontrol show job $JOBID
Comment 43 Nate Rini 2021-03-10 15:11:09 MST
Kaylea,

We have not received a response to comment #42 in over a week, so I'm going to time this ticket out. Please respond with the logs when convenient and we can continue debugging.

Thanks,
--Nate