Ticket 5051 - Wrong warning on srun options processing: can't honor --ntasks-per-node
Summary: Wrong warning on srun options processing: can't honor --ntasks-per-node
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: User Commands
Version: 17.11.5
Hardware: Linux
OS: Linux
Severity: 4 - Minor Issue
Assignee: Felip Moll
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2018-04-10 10:57 MDT by hpc-cs-hd
Modified: 2019-09-05 08:24 MDT

See Also:
Site: Cineca


Attachments
submission scripts and job outputs (9.58 KB, application/gzip)
2018-04-19 09:35 MDT, hpc-cs-hd
Details

Description hpc-cs-hd 2018-04-10 10:57:12 MDT
Dear Support,

we noticed incorrect behavior in the processing of the --ntasks-per-node option
whenever the requested number of tasks per node is smaller than the number of requested nodes and (Intel) mpirun is used to run the code (with srun everything is OK).
For instance:

----------------------------------
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=3

mpirun  ~/hello_mpi_impi
----------------------------------

results in:

srun: Warning: can't honor --ntasks-per-node set to 3 which doesn't match the requested tasks 4 with the number of requested nodes 4. Ignoring --ntasks-per-node.

However, the tasks are then correctly distributed 3 per node, with the correct pinning. The same result is obtained when adding the --ntasks option:

#SBATCH --ntasks=12

It seems that in src/srun/libsrun/opt.c, in the conditions which trigger the warning, the value of opt.ntasks is not correctly evaluated (4 instead of 12). But maybe I misunderstood the meaning of opt.ntasks?

I found some old posts in the slurm-dev archive referring to previous Slurm releases (16.05.5, 17.02.0), where it is also said that this did not happen in 15.08.12. I also found bugs 4268, 3079 and 1115, but although they refer to the same warning, they appear (to me) to be due to a different problem.

Can you please advise on the matter?

thank you very much,

Isabella
HPC User Support @ CINECA
Comment 1 Felip Moll 2018-04-12 05:13:58 MDT
Hi Isabella,

I've seen the posts you mention in the slurm-dev lists.
i.e. https://www.mail-archive.com/slurm-dev@schedmd.com/msg09063.html

Nevertheless, I am not entirely sure this is a Slurm problem rather than an mpirun problem.

The variables set by Slurm in the sbatch allocation seem correct:
SLURM_NODELIST=knl4,moll[1-3]
SLURM_NTASKS_PER_NODE=3
SLURM_NNODES=4
SLURM_NTASKS=12
SLURM_TASKS_PER_NODE=3(x4)
SLURM_JOB_NUM_NODES=4

And for a srun started in the sbatch allocation:
SLURM_NODELIST=knl4,moll[1-3]
SLURM_NTASKS_PER_NODE=3
SLURM_NNODES=4
SLURM_NTASKS=12
SLURM_TASKS_PER_NODE=3(x4)
SLURM_STEP_TASKS_PER_NODE=3(x4)
SLURM_STEP_NUM_TASKS=12

I am wondering whether Intel mpirun, when it executes srun, picks up the SLURM_STEP_TASKS_PER_NODE=X value and omits the (x4).
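For reference, these variables use a compact count(xreps) notation. Just to illustrate what would be lost if a launcher read only the leading number and dropped the (x4), here is a small throwaway bash helper (purely illustrative, not anything from Slurm or Intel MPI) that expands it:

----------------------------------
#!/bin/bash
# Illustrative helper only: expand the compact count(xreps) notation used by
# SLURM_TASKS_PER_NODE and SLURM_STEP_TASKS_PER_NODE,
# e.g. "3(x4)" -> "3 3 3 3" (3 tasks on each of 4 nodes, 12 in total).
expand_tasks_per_node() {
    local re='^([0-9]+)\(x([0-9]+)\)$' g i
    local -a out=() groups
    IFS=',' read -ra groups <<< "$1"
    for g in "${groups[@]}"; do
        if [[ $g =~ $re ]]; then
            for ((i = 0; i < ${BASH_REMATCH[2]}; i++)); do
                out+=("${BASH_REMATCH[1]}")
            done
        else
            out+=("$g")
        fi
    done
    echo "${out[@]}"
}

expand_tasks_per_node "3(x4)"    # -> 3 3 3 3
expand_tasks_per_node "2,1(x3)"  # -> 2 1 1 1
----------------------------------

If the (x4) suffix were dropped, a launcher would see far fewer tasks than the 12 that were actually allocated.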

I will investigate this a little further, but I have to configure an Intel MPI environment first.

I currently don't know how to do that, but do you know if there is any debug option for mpirun to see what is going on?

We could do a test... in the batch script, before launching mpirun, just modify and export SLURM_STEP_TASKS_PER_NODE=12 and SLURM_TASKS_PER_NODE=12. Then run mpirun and see what happens. This will tell us if it is an mpirun issue with these variables.
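In case it helps, this is a rough sketch of what I mean (the hello binary is just the one from your description, adapt as needed):

----------------------------------
#!/bin/bash
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=3

# For this test only: override the task-count variables before mpirun reads
# them, to see whether mpirun honors them (Slurm itself exports
# SLURM_TASKS_PER_NODE=3(x4) and SLURM_NTASKS=12 for this request).
export SLURM_STEP_TASKS_PER_NODE=12
export SLURM_TASKS_PER_NODE=12

mpirun ~/hello_mpi_impi
----------------------------------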

The commit which introduces the srun check seems correct to me (bug 2520):
72ed146cd2a6facb76e854919fb887faf3fc0c25

That may be the reason why this wasn't seen before 16.05: the commit just uncovered the problem.


Anyway, we usually recommend using srun over mpirun, and Intel MPI also seems to agree with this idea. Using srun instead of mpirun, you get support for process tracking, accounting, task affinity, suspend/resume and other features.

See more here: https://slurm.schedmd.com/mpi_guide.html#intel_mpi

Is there any reason why you are not using srun?
I'll wait for your feedback on the suggested test.
Comment 2 Felip Moll 2018-04-19 03:00:03 MDT
(In reply to Felip Moll from comment #1)
> We could do a test... in the batch script, before launching mpirun, just
> modify and export SLURM_STEP_TASKS_PER_NODE=12 and SLURM_TASKS_PER_NODE=12.
> Then run mpirun and see what happens. This will tell us if it is an mpirun
> issue with these variables.
>...
> I'll wait for your feedback on the suggested test.

Hi Isabella,

Did you have a chance to run the suggested test?

Thanks
Comment 3 hpc-cs-hd 2018-04-19 09:35:35 MDT
Created attachment 6655 [details]
submission scripts and job outputs
Comment 4 hpc-cs-hd 2018-04-19 09:37:16 MDT
Dear Felip,

apologies for my late reply. I did the tests you suggested and some more digging myself. 

------------------------------------------------------------------
* I am wondering whether Intel mpirun, when it executes srun, picks up the 
SLURM_STEP_TASKS_PER_NODE=X value and omits the (x4).
------------------------------------------------------------------

I submitted a job with the following (I report only the relevant lines; the full script, submit1, and the outputs for job 775777 are in the attachment):

#SBATCH --nodes=4
#SBATCH --ntasks-per-node=3

mpirun  env | sort | uniq > env.mpirun_$SLURM_JOBID
srun  env | sort | uniq > env.srun_$SLURM_JOBID

The two full outputs are in the attachment; here I report what seems wrong in the environment seen by mpirun compared to srun:

mpirun:
------

SLURM_NPROCS=4
SLURM_STEP_NUM_TASKS=4
SLURM_STEP_TASKS_PER_NODE=1(x4)
SLURM_TASKS_PER_NODE=1(x4)

srun:
-----

SLURM_NPROCS=12
SLURM_STEP_NUM_TASKS=12
SLURM_STEP_TASKS_PER_NODE=3(x4)
SLURM_TASKS_PER_NODE=3(x4)

Hence, apparently mpirun, when it executes srun, (incorrectly) picks one task per node while keeping the correct (x4).
It nevertheless picks up the correct SLURM_NTASKS=12 and SLURM_NTASKS_PER_NODE=3, and in fact it launches 12 tasks, 3 per node (even though only 4 SLURM_PROCID values appear in the environment instead of 12).

* I currently don't know how to do that but, do you know if there is any 
* debug option for mpirun to see what is going on?

As you can see, I used "mpirun env" to understand what is going on when requesting #SBATCH --nodes=4 and #SBATCH --ntasks-per-node=3.
You can also set the variable I_MPI_DEBUG like in:

mpirun -genv I_MPI_DEBUG=5 exe

to get information on process pinning (this way I verified that the 12 processes in total were pinned to 3 different cores on 4 nodes).

* We could do a test... in the batch script, before launching mpirun, 
* just modify and export SLURM_STEP_TASKS_PER_NODE=12 
* and SLURM_TASKS_PER_NODE=12. Then run mpirun and see what 
* happens. This will tell us if it is an mpirun issue with 
* these variables.

Done (in the attachment: submit2, and the outputs of job 775906): I still obtain the warning message "can't honor --ntasks-per-node", and in the mpirun env I see:

SLURM_NPROCS=4
SLURM_NTASKS=12
SLURM_NTASKS_PER_NODE=3
SLURM_STEP_NUM_TASKS=4
SLURM_STEP_TASKS_PER_NODE=12

but the SLURM_TASKS_PER_NODE value set in the script is ignored:

SLURM_TASKS_PER_NODE=1(x4)

I also tried exporting the variables SLURM_NPROCS=12 and SLURM_STEP_NUM_TASKS=12 (submit3 and the outputs of job 776163), but mpirun's env still reports:

SLURM_NPROCS=4
SLURM_TASKS_PER_NODE=1(x4)

On the master compute node I see:

ibaccare 33556  0.0  0.0 114148  1508 ?        S    16:50   0:00 /bin/bash /var/spool/slurmd/job775399/slurm_script
ibaccare 33571  0.0  0.0 113132  1480 ?        S    16:50   0:00  \_ /bin/sh /cineca/prod/opt/compilers/intel/pe-xe-2018/binary/impi/2018.2.199/bin64/mpirun env
ibaccare 33577  0.2  0.0  18096  2348 ?        S    16:50   0:00  |   \_ mpiexec.hydra env
ibaccare 33578  0.2  0.0 448504  6776 ?        Sl   16:50   0:00  |       \_ /usr/bin/srun --nodelist r066c12s02,r066c12s04,r067c07s01,r067c08s04 -N 4 -n 4 --input none /cineca/prod/opt/compilers/intel/pe-xe-2018/binary/impi/2018.2.199/bin64/pmi_proxy --control-port r066c12s02.marconi.cineca.it:9603 --pmi-connect alltoall --pmi-aggregate -s 0 --rmk slurm --launcher slurm --demux poll --pgid 0 --enable-stdin 1 --retries 10 --control-code 664226096 --usize -2 --proxy-id -1
ibaccare 33579  0.0  0.0  44052   672 ?        S    16:50   0:00  |           \_ /usr/bin/srun --nodelist r066c12s02,r066c12s04,r067c07s01,r067c08s04 -N 4 -n 4 --input none /cineca/prod/opt/compilers/intel/pe-xe-2018/binary/impi/2018.2.199/bin64/pmi_proxy --control-port r066c12s02.marconi.cineca.it:9603 --pmi-connect alltoall --pmi-aggregate -s 0 --rmk slurm --launcher slurm --demux poll --pgid 0 --enable-stdin 1 --retries 10 --control-code 664226096 --usize -2 --proxy-id -1

Hence, mpirun is launching srun with "-n 4" instead of "-n 12" (yet 12 processes are executed, which is puzzling).

mpirun indeed appears to be responsible (and srun correctly produces the warning we obtain). I guess I need to contact Intel support, but any ideas/suggestions on your part will be appreciated.

As for using mpirun versus srun: we told our users that srun is the recommended option, but many of them keep using mpirun and complain about the warning message (and fear incorrect behavior concerning the final number of tasks).



Thank you very much for your assistance,

Isabella
Comment 5 Felip Moll 2018-04-23 10:13:25 MDT
I installed the Intel compilers and did the same tests as you, with the same result. Your diagnosis
is correct: mpirun launches 4 tasks on 4 nodes regardless of what you asked for in the sbatch script, and this
is incorrect.

For instance, the error can easily be reproduced by running 'salloc --nodes=4 --ntasks-per-node=2' and then
launching 'srun -N4 -n4 hostname' inside the allocation.
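Spelled out, in case you want to reproduce it yourself (you should get essentially the same warning as the one in your description, with the numbers of this allocation):

]$ salloc --nodes=4 --ntasks-per-node=2
...
]$ srun -N4 -n4 hostname
srun: Warning: can't honor --ntasks-per-node set to 2 which doesn't match the requested tasks 4 with the number of requested nodes 4. Ignoring --ntasks-per-node.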

I recommend you open a bug with Intel and explain the situation to them. The -n value passed to srun should be derived
from the first value of SLURM_TASKS_PER_NODE or from SLURM_NTASKS_PER_NODE...
currently, it seems to be deduced from the node list instead:

Within a single salloc session:
]$ salloc --nodes=3 --ntasks-per-node=2 
...

]$ mpirun hostname
--> it executes "srun --nodelist gamba1,gamba2,gamba3 -N 3 -n 3 ..."

]$ export SLURM_NODELIST=gamba[1-2]
]$ mpirun hostname
--> it executes "srun --nodelist gamba1,gamba2 -N 2 -n 2 ..."

]$ export SLURM_NODELIST=gamba[1-4]
]$ mpirun hostname
 --> it executes "srun --nodelist gamba1,gamba2,gamba3,gamba4 -N 4 -n 4 ..."

regardless of the other variables.


I also confirmed with the team and they see this as an Intel issue, so I will close this bug and mark it as infogiven, but please feel free to reopen it if you think there is something more we missed.

Best regards,
Felip M
Comment 6 Felip Moll 2019-09-05 08:24:55 MDT
Hi,

I just wanted to comment on this old, closed bug. We've found a workaround for this issue in bug 7097. It doesn't fix the root cause: the problem remains that mpiexec/mpirun doesn't set --ntasks-per-node on the spawned srun when it executes the hydra proxies on every node.

The trick is to force mpiexec to add a parameter to the bootstrap srun; specifically, this fixes the issue:

#!/bin/bash
#SBATCH -J myjob
#SBATCH --nodes=32
#SBATCH --ntasks-per-node=8 

module load intel

export I_MPI_HYDRA_BOOTSTRAP_EXEC_EXTRA_ARGS="--ntasks-per-node=1"
mpiexec hostname


This export makes mpiexec still read the SLURM_* environment variables but pass --ntasks-per-node=1 to srun.
The Hydra launcher will then launch the user application correctly, since we haven't removed the real SLURM_NTASKS_PER_NODE.
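A quick way to double-check the workaround (a rough sketch, still just launching hostname as in the script above) is to count how many tasks land on each node:

----------------------------------
# With the export in place, each of the 32 node names should show up 8 times,
# i.e. --ntasks-per-node is honored for the application tasks:
mpiexec hostname | sort | uniq -c
----------------------------------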
