We have noticed an odd behavior where the SLURM_NTASKS environment variable isn't the same on all nodes in a job. Only the master node of the job appears to report the correct value (SLURM_NTASKS = --ntasks). Are we misunderstanding how this environment variable is supposed to be used?

$ srun --pty -n 10 -N 3 bash
$ module load MPI/OpenMPI/2.1.1-gcc
$ mpirun bash -c 'hostname; echo $SLURM_NTASKS'
c35n07.grace.hpc.yale.internal 10
c35n08.grace.hpc.yale.internal 2
c35n08.grace.hpc.yale.internal 2
c35n08.grace.hpc.yale.internal 2
c35n08.grace.hpc.yale.internal 2
c35n08.grace.hpc.yale.internal 2
c35n08.grace.hpc.yale.internal 2
c35n08.grace.hpc.yale.internal 2
c35n08.grace.hpc.yale.internal 2
c36n11.grace.hpc.yale.internal 2
Kaylea,

Can you please provide the 'scontrol show job $JOBID' output for the job in your last message?

Thanks,
--Nate
Similar job:

$ mpirun bash -c 'hostname; echo $SLURM_NTASKS'
c36n03.grace.hpc.yale.internal 10
c36n10.grace.hpc.yale.internal 2
c36n10.grace.hpc.yale.internal 2
c36n10.grace.hpc.yale.internal 2
c36n10.grace.hpc.yale.internal 2
c36n10.grace.hpc.yale.internal 2
c36n10.grace.hpc.yale.internal 2
c36n10.grace.hpc.yale.internal 2
c36n10.grace.hpc.yale.internal 2
c36n11.grace.hpc.yale.internal 2

$ scontrol show job 34698950
JobId=34698950 JobName=bash
   UserId=kln26(11135) GroupId=hpcprog(10042) MCS_label=N/A
   Priority=33420 Nice=0 Account=hpcprog QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:00:46 TimeLimit=01:00:00 TimeMin=N/A
   SubmitTime=2019-07-01T10:09:30 EligibleTime=2019-07-01T10:09:30
   AccrueTime=Unknown
   StartTime=2019-07-01T10:09:30 EndTime=2019-07-01T11:09:30 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2019-07-01T10:09:30
   Partition=admintest AllocNode:Sid=grace2:29049
   ReqNodeList=(null) ExcNodeList=(null) NodeList=c36n[03,10-11]
   BatchHost=c36n03
   NumNodes=3 NumCPUs=10 NumTasks=10 CPUs/Task=1 ReqB:S:C:T=0:0:*:1
   TRES=cpu=10,mem=50G,node=3
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=5G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=bash
   WorkDir=/gpfs/loomis/home.grace/fas/hpcprog/kln26
   Power=
Kaylea,

I'm working on recreating the issue. I currently suspect it is related to how mpirun is handling the environment.

--Nate
I have confirmed the same issue by matching your software levels. Looking into the cause now.

scaleout1: 10
scaleout1: 10
scaleout2: 2
scaleout3: 2
scaleout2: 2
scaleout2: 2
scaleout3: 2
scaleout3: 2
scaleout1: 10
scaleout1: 10

--Nate
(In reply to Kaylea Nelson from comment #0)
> We have noticed an odd behavior where the SLURM_NTASKS environment variable
> isn't the same on all nodes in a job. Only the master node on the job
> appears to report the correct value (SLURM_NTASKS = --ntasks). Are we
> misunderstanding how this environment variable is supposed to be used?

That is the correct understanding as documented.

It took a little while, but I managed to replicate this with openmpi 2.1.1 and openmpi 4.0.1 on Slurm 19.05 too. The problem lies in how you're calling the job:

> $ srun --pty -n 10 -N 3 bash
> $ mpirun bash -c 'hostname; echo $SLURM_NTASKS'

This creates a valid job step with ptys linked to task zero. When mpirun is called, it (correctly) detects that it is already inside of a job and then calls srun only for the other nodes in the job, but not the node running task zero. Here is an example of the child srun command mpirun calls:

> srun --ntasks-per-node=1 --kill-on-bad-exit --cpu_bind=none --nodes=2 --nodelist=scaleout2,scaleout3 --ntasks=2 orted -mca orte_debug_daemons "1" -mca ess "slurm" -mca ess_base_jobid "2152267776" -mca ess_base_vpid "1" -mca ess_base_num_procs "3" -mca orte_node_regex "scaleout[1:1-3]@0(3)" -mca orte_hnp_uri "2152267776.0;tcp://10.11.5.1:49981" --mca plm_base_verbose "5" -mca oob_base_verbose "10" -mca rml_base_verbose "10"

Slurm then correctly sets the ntasks from this child srun command, which is different from the parent srun command that the job was started under. This is admittedly confusing, and it would probably be better to take it up with the openmpi developers, as Slurm is working as documented.
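A Slurm-free sketch of the mechanism described above (node names are placeholders, and SLURM_NTASKS is set by hand here to stand in for what each srun exports). Task zero runs under the parent step and inherits its environment, while the remote tasks run under mpirun's child srun, which re-exports its own --ntasks:

```shell
#!/bin/sh
# The parent step (srun --pty -n 10 -N 3) exports SLURM_NTASKS=10.
export SLURM_NTASKS=10

for node in node0 node1 node2; do
  if [ "$node" = "node0" ]; then
    # Task zero stays under the parent step and inherits SLURM_NTASKS=10.
    NODE=$node bash -c 'echo "$NODE: $SLURM_NTASKS"'
  else
    # The remaining tasks are launched by mpirun's child srun, whose
    # --ntasks=2 covers only the other nodes, so those child
    # environments see SLURM_NTASKS=2 instead.
    NODE=$node SLURM_NTASKS=2 bash -c 'echo "$NODE: $SLURM_NTASKS"'
  fi
done
```

This reproduces the same 10-versus-2 split seen in the mpirun output, with nothing but ordinary environment inheritance.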
I would suggest using `salloc` instead of `srun --pty`, as it will result in the expected job pattern:

> $ salloc -n 9 -N 3 --ntasks-per-node=3 bash
> $ mpirun bash -c 'echo "$(hostname): $SLURM_NTASKS"'
> scaleout1: 3
> scaleout1: 3
> scaleout1: 3
> scaleout2: 3
> scaleout2: 3
> scaleout2: 3
> scaleout3: 3
> scaleout3: 3
> scaleout3: 3

Is there a specific reason you are using `srun --pty` instead of `salloc`?

Thanks,
--Nate
Hi Nate,

In this example, since SLURM_NTASKS is supposed to equal --ntasks, wouldn't one expect SLURM_NTASKS = 9 (not 3, the number of nodes)?

We actually first encountered this behavior in batch jobs (I've attached a simple batch script example and log), where we also see the incorrect value for SLURM_NTASKS. I included the srun --pty example since it was a straightforward way to reproduce the issue.

Thanks,
Kaylea
Created attachment 10781 [details] test submission script
Created attachment 10782 [details] example out file
(In reply to Kaylea Nelson from comment #14)
> We actually first encountered this behavior in batch jobs (I've attached a
> simple batch script example and log), where we also see the incorrect value
> for SLURM_NTASKS. I included the srun --pty example since it was a
> straightforward way to reproduce the issue.

We have confirmed your issue. We are currently looking at how best to proceed and what value SLURM_NTASKS should have in steps. As of right now, it does not agree with the documented value.
*** Ticket 6167 has been marked as a duplicate of this ticket. ***
(In reply to Nate Rini from comment #22)
> (In reply to Kaylea Nelson from comment #14)
> > We actually first encountered this behavior in batch jobs (I've attached a
> > simple batch script example and log), where we also see the incorrect value
> > for SLURM_NTASKS. I included the srun --pty example since it was a
> > straightforward way to reproduce the issue.
>
> We have confirmed your issue. We are currently looking at how best to
> proceed and what value SLURM_NTASKS should have in steps. As of right now,
> it does not agree with the documented value.

Kaylea,

Having the environment variable SLURM_NTASKS set is equivalent to passing "-n" to srun/salloc/sbatch. Since each of these calls also sets SLURM_NTASKS, the next call to srun inherits that count (unless it is explicitly overridden or provided as a command-line argument). I suggest looking at SLURM_STEP_NUM_TASKS instead to find the number of tasks assigned to a given step.

Do you have any more questions?

Thanks,
--Nate
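The inheritance rule above can be demonstrated without Slurm at all (a sketch; the variable is set by hand here rather than by srun/salloc):

```shell
#!/bin/sh
# An allocation command such as `salloc -n 10` exports SLURM_NTASKS.
export SLURM_NTASKS=10

# A child process launched with no explicit count simply inherits
# the parent's exported value:
inherited=$(bash -c 'echo "$SLURM_NTASKS"')

# A child whose launcher supplies its own count (like `srun -n 2`,
# or the child srun that mpirun spawns for the remote nodes) sees
# the overriding value instead:
overridden=$(SLURM_NTASKS=2 bash -c 'echo "$SLURM_NTASKS"')

echo "inherited=$inherited overridden=$overridden"   # inherited=10 overridden=2
```

The same inheritance applies to any exported variable, which is why a nested srun that receives an explicit task count re-exports a different SLURM_NTASKS to its children.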
Kaylea,

We are going to close this ticket as there hasn't been a response in over a week. If you have any issues or questions, please respond to this ticket and it will be reopened.

Thanks,
--Nate
*** Ticket 6166 has been marked as a duplicate of this ticket. ***
Hi Nate,

Sorry to reopen this ancient ticket, but I just found it in my email backlog and this issue still persists in version 20.02.6. I tried the suggestion of using SLURM_STEP_NUM_TASKS, but that still reports the incorrect number of tasks on non-main nodes when using "mpirun" instead of "srun".

[kln26@grace1 ~]$ srun --pty -n 10 -N 3 bash
[kln26@c09n01 ~]$ mpirun bash -c 'hostname; echo $SLURM_STEP_NUM_TASKS'
c09n01.grace.hpc.yale.internal 10
c09n01.grace.hpc.yale.internal 10
c09n01.grace.hpc.yale.internal 10
c09n01.grace.hpc.yale.internal 10
c09n03.grace.hpc.yale.internal 2
c09n03.grace.hpc.yale.internal 2
c09n03.grace.hpc.yale.internal 2
c09n04.grace.hpc.yale.internal 2
c09n04.grace.hpc.yale.internal 2
c09n04.grace.hpc.yale.internal 2
Kaylea,

Please attach your current slurm.conf & friends, and please note if your system has any patches applied.

Thanks,
--Nate
Created attachment 17995 [details] grace 20.02.6 conf
Created attachment 17996 [details] grace 20.02.6 conf
Created attachment 17997 [details] grace cgroup.conf
I've attached the conf for our Grace cluster (I can reproduce this issue on all four of our clusters). Let me know if you need anything else. We have this patch applied on Grace: https://bugs.schedmd.com/show_bug.cgi?id=10824#c30
My attempt to recreate failed. Can you please attach your slurmctld log at the time of job submission, along with the following from your test job:

> scontrol show job $JOBID
Kaylea,

We have not gotten a response in over a week to comment #42. I'm going to time this ticket out. Please respond with the logs when convenient and we can continue debugging.

Thanks,
--Nate