Ticket 7311 - SLURM_NTASKS isn't consistent across nodes in a job
Summary: SLURM_NTASKS isn't consistent across nodes in a job
Status: RESOLVED TIMEDOUT
Alias: None
Product: Slurm
Classification: Unclassified
Component: User Commands
Version: 20.02.6
Hardware: Linux
OS: Linux
Importance: --- 4 - Minor Issue
Assignee: Nate Rini
QA Contact:
URL:
Duplicates: 6166 6167
Depends on:
Blocks:
 
Reported: 2019-06-27 12:54 MDT by Kaylea Nelson
Modified: 2021-03-10 15:11 MST
CC: 4 users

See Also:
Site: Yale
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
test submission script (114 bytes, text/x-sh)
2019-07-02 12:03 MDT, Kaylea Nelson
example out file (336 bytes, text/plain)
2019-07-02 12:04 MDT, Kaylea Nelson
grace 20.02.6 conf (7.97 KB, application/zip)
2021-02-18 11:58 MST, Kaylea Nelson
grace 20.02.6 conf (7.97 KB, application/zip)
2021-02-18 12:00 MST, Kaylea Nelson
grace cgroup.conf (144 bytes, text/plain)
2021-02-18 12:00 MST, Kaylea Nelson

Description Kaylea Nelson 2019-06-27 12:54:21 MDT
We have noticed an odd behavior where the SLURM_NTASKS environment variable isn't the same on all nodes in a job. Only the master node on the job appears to report the correct value (SLURM_NTASKS = --ntasks). Are we misunderstanding how this environment variable is supposed to be used?

$ srun --pty -n 10 -N 3 bash
$ module load MPI/OpenMPI/2.1.1-gcc
$ mpirun bash -c 'hostname; echo $SLURM_NTASKS'
c35n07.grace.hpc.yale.internal
10
c35n08.grace.hpc.yale.internal
2
c35n08.grace.hpc.yale.internal
2
c35n08.grace.hpc.yale.internal
2
c35n08.grace.hpc.yale.internal
2
c35n08.grace.hpc.yale.internal
2
c35n08.grace.hpc.yale.internal
2
c35n08.grace.hpc.yale.internal
2
c35n08.grace.hpc.yale.internal
2
c36n11.grace.hpc.yale.internal
2
Comment 1 Nate Rini 2019-06-27 13:17:55 MDT
Kaylea,

Can you please provide the 'scontrol show job $JOBID' output for the job in your last message?

Thanks,
--Nate
Comment 2 Kaylea Nelson 2019-07-01 08:11:05 MDT
Similar job:

$ mpirun bash -c 'hostname; echo $SLURM_NTASKS'
c36n03.grace.hpc.yale.internal
10
c36n10.grace.hpc.yale.internal
2
c36n10.grace.hpc.yale.internal
2
c36n10.grace.hpc.yale.internal
2
c36n10.grace.hpc.yale.internal
2
c36n10.grace.hpc.yale.internal
2
c36n10.grace.hpc.yale.internal
2
c36n10.grace.hpc.yale.internal
2
c36n10.grace.hpc.yale.internal
2
c36n11.grace.hpc.yale.internal
2

$ scontrol show job 34698950
JobId=34698950 JobName=bash
   UserId=kln26(11135) GroupId=hpcprog(10042) MCS_label=N/A
   Priority=33420 Nice=0 Account=hpcprog QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:00:46 TimeLimit=01:00:00 TimeMin=N/A
   SubmitTime=2019-07-01T10:09:30 EligibleTime=2019-07-01T10:09:30
   AccrueTime=Unknown
   StartTime=2019-07-01T10:09:30 EndTime=2019-07-01T11:09:30 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2019-07-01T10:09:30
   Partition=admintest AllocNode:Sid=grace2:29049
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=c36n[03,10-11]
   BatchHost=c36n03
   NumNodes=3 NumCPUs=10 NumTasks=10 CPUs/Task=1 ReqB:S:C:T=0:0:*:1
   TRES=cpu=10,mem=50G,node=3
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=5G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=bash
   WorkDir=/gpfs/loomis/home.grace/fas/hpcprog/kln26
   Power=
Comment 3 Nate Rini 2019-07-01 11:05:42 MDT
Kaylea,

I'm working on recreating the issue. I currently suspect it is related to how mpirun is handling the environment.

--Nate
Comment 5 Nate Rini 2019-07-01 18:41:30 MDT
I have confirmed the same issue by matching your software levels. Looking into the cause now.

scaleout1: 10
scaleout1: 10
scaleout2: 2
scaleout3: 2
scaleout2: 2
scaleout2: 2
scaleout3: 2
scaleout3: 2
scaleout1: 10
scaleout1: 10

--Nate
Comment 13 Nate Rini 2019-07-02 10:17:01 MDT
(In reply to Kaylea Nelson from comment #0)
> We have noticed an odd behavior where the SLURM_NTASKS environment variable
> isn't the same on all nodes in a job. Only the master node on the job
> appears to report the correct value (SLURM_NTASKS = --ntasks). Are we
> misunderstanding how this environment variable is supposed to be used?

That is the correct understanding, as documented. It took a little while to replicate, but I managed to reproduce this with OpenMPI 2.1.1 and OpenMPI 4.0.1 on Slurm 19.05 as well.
 
The problem lies in how you are launching the job:
> $ srun --pty -n 10 -N 3 bash
> $ mpirun bash -c 'hostname; echo $SLURM_NTASKS'

This creates a valid job step with ptys linked to task zero. When mpirun is called, it (correctly) detects that it is already inside a job and then only calls srun for the other nodes in the job, not for the node running task zero.

Here is an example of the srun child command mpirun calls:
> srun --ntasks-per-node=1 --kill-on-bad-exit --cpu_bind=none --nodes=2 --nodelist=scaleout2,scaleout3 --ntasks=2 orted -mca orte_debug_daemons "1" -mca ess "slurm" -mca ess_base_jobid "2152267776" -mca ess_base_vpid "1" -mca ess_base_num_procs "3" -mca orte_node_regex "scaleout[1:1-3]@0(3)" -mca orte_hnp_uri "2152267776.0;tcp://10.11.5.1:49981" --mca plm_base_verbose "5" -mca oob_base_verbose "10" -mca rml_base_verbose "10"

Slurm then correctly sets the task count from this child srun command, which differs from the parent srun command the job was started under. This is admittedly confusing, and it would probably be better to take it up with the OpenMPI developers, as Slurm is working as documented.
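
If you want to see this on your side, raising the plm verbosity makes mpirun print the child srun it constructs (a sketch; the --mca flag is an OpenMPI MCA option and the exact output varies by OpenMPI version):
> $ srun --pty -n 10 -N 3 bash
> $ mpirun --mca plm_base_verbose 5 bash -c 'hostname; echo $SLURM_NTASKS'
> # the plm/slurm component logs the srun command line it executes, including
> # the --ntasks value it passes for the nodes other than task zero's node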

I would suggest using `salloc` instead of `srun --pty`, as it will result in the expected job pattern:
> $ salloc -n 9 -N 3 --ntasks-per-node=3 bash 
> $ mpirun bash -c 'echo "$(hostname): $SLURM_NTASKS"'
> scaleout1: 3
> scaleout1: 3
> scaleout1: 3
> scaleout2: 3
> scaleout2: 3
> scaleout2: 3
> scaleout3: 3
> scaleout3: 3
> scaleout3: 3

Is there a specific reason you are using `srun --pty` instead of `salloc`?

Thanks,
--Nate
Comment 14 Kaylea Nelson 2019-07-02 12:03:34 MDT
Hi Nate,

In this example, since SLURM_NTASKS is supposed to equal --ntasks, wouldn't one expect SLURM_NTASKS = 9 (not 3, the number of nodes)?

We actually first encountered this behavior in batch jobs (I've attached a simple batch script example and log), where we also see the incorrect value for SLURM_NTASKS. I included the srun --pty example since it was a straightforward way to reproduce the issue.
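
For reference, here is a minimal sketch of the kind of batch script involved (the attached script may differ slightly; the module name follows the interactive example above):

#!/bin/bash
#SBATCH --ntasks=10
#SBATCH --nodes=3
module load MPI/OpenMPI/2.1.1-gcc
# on our clusters, tasks on the batch host echo 10 here while tasks on the
# other nodes echo the smaller per-node count
mpirun bash -c 'hostname; echo $SLURM_NTASKS'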

Thanks,
Kaylea
Comment 15 Kaylea Nelson 2019-07-02 12:03:49 MDT
Created attachment 10781 [details]
test submission script
Comment 16 Kaylea Nelson 2019-07-02 12:04:03 MDT
Created attachment 10782 [details]
example out file
Comment 22 Nate Rini 2019-07-05 10:59:04 MDT
(In reply to Kaylea Nelson from comment #14)
> We actually first encountered this behavior in batch jobs (I've attached a
> simple batch script example and log), where we also see the incorrect value
> for SLURM_NTASKS. I included the srun --pty example since it was a
> straightforward way to reproduce the issue.

We have confirmed your issue. We are currently looking at how best to proceed and what value SLURM_NTASKS should have in steps. As of right now, it does not agree with the documented value.
Comment 23 Nate Rini 2019-07-16 19:53:37 MDT
*** Ticket 6167 has been marked as a duplicate of this ticket. ***
Comment 31 Nate Rini 2019-08-13 15:01:25 MDT
(In reply to Nate Rini from comment #22)
> (In reply to Kaylea Nelson from comment #14)
> > We actually first encountered this behavior in batch jobs (I've attached a
> > simple batch script example and log), where we also see the incorrect value
> > for SLURM_NTASKS. I included the srun --pty example since it was a
> > straightforward way to reproduce the issue.
> 
> We have confirmed your issue. We are currently looking at how best to
> proceed and what value SLURM_NTASKS should have in steps. As of right now,
> it does not agree with the documented value.

Kaylea,

The environment variable SLURM_NTASKS, when set, is equivalent to passing "srun/salloc/sbatch -n". Since each of these commands also sets SLURM_NTASKS, the next call to srun inherits that count (unless it is explicitly overridden or provided as a command-line argument). I suggest looking at SLURM_STEP_NUM_TASKS instead to find the number of tasks assigned to a given step.
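For example (a sketch; host names and counts are illustrative), launching the step with srun inside an allocation sets the step variable consistently on every node:
> $ salloc -n 10 -N 3
> $ srun bash -c 'echo "$(hostname): $SLURM_STEP_NUM_TASKS"'
> # every task in the step should report 10 here, since SLURM_STEP_NUM_TASKS
> # describes the step itself rather than an inherited -n value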

Do you have any more questions?

Thanks,
--Nate
Comment 33 Nate Rini 2019-08-28 18:12:29 MDT
Kaylea,

We are going to close this ticket as there hasn't been a response in over a week. 

If you have any issues or questions, please respond to this ticket and it will be reopened.

Thanks,
--Nate
Comment 34 Nate Rini 2019-08-28 18:20:30 MDT
*** Ticket 6166 has been marked as a duplicate of this ticket. ***
Comment 35 Kaylea Nelson 2021-02-18 09:51:23 MST
Hi Nate,

Sorry to reopen this ancient ticket, but I just found it in my email backlog and this issue still persists in version 20.02.6.

I tried the suggestion of using SLURM_STEP_NUM_TASKS, but that still reports the incorrect number of tasks on non-main nodes when using "mpirun" instead of "srun".

[kln26@grace1 ~]$ srun --pty -n 10 -N 3 bash
[kln26@c09n01 ~]$ mpirun bash -c 'hostname; echo $SLURM_STEP_NUM_TASKS'

c09n01.grace.hpc.yale.internal
10
c09n01.grace.hpc.yale.internal
10
c09n01.grace.hpc.yale.internal
10
c09n01.grace.hpc.yale.internal
10
c09n03.grace.hpc.yale.internal
2
c09n03.grace.hpc.yale.internal
2
c09n03.grace.hpc.yale.internal
2
c09n04.grace.hpc.yale.internal
2
c09n04.grace.hpc.yale.internal
2
c09n04.grace.hpc.yale.internal
2
Comment 36 Nate Rini 2021-02-18 10:28:50 MST
Kaylea,

Please attach your current slurm.conf and related configuration files, and please note whether your system has any patches applied.

Thanks,
--Nate
Comment 37 Kaylea Nelson 2021-02-18 11:58:43 MST
Created attachment 17995 [details]
grace 20.02.6 conf
Comment 38 Kaylea Nelson 2021-02-18 12:00:01 MST
Created attachment 17996 [details]
grace 20.02.6 conf
Comment 39 Kaylea Nelson 2021-02-18 12:00:31 MST
Created attachment 17997 [details]
grace cgroup.conf
Comment 40 Kaylea Nelson 2021-02-18 12:01:33 MST
I've attached the conf for our Grace cluster (I can reproduce this issue on all four of our clusters). Let me know if you need anything else.

We have this patch applied on Grace: https://bugs.schedmd.com/show_bug.cgi?id=10824#c30
Comment 42 Nate Rini 2021-02-22 15:33:30 MST
My attempt to recreate the issue failed. Can you please attach your slurmctld log from the time of job submission, along with the output of the following for your test job:
> scontrol show job $JOBID
Comment 43 Nate Rini 2021-03-10 15:11:09 MST
Kaylea,

We have not received a response to comment #42 in over a week, so I'm going to time this ticket out. Please respond with the logs when convenient and we can continue debugging.

Thanks,
--Nate