Hi,

The following script is used to submit a job. At the beginning of the user log file, the following message is printed:

srun: Warning: can't honor --ntasks-per-node set to 48 which doesn't match the requested tasks 81 with the number of requested nodes 81. Ignoring --ntasks-per-node.

I did not understand the reason for this warning; I would appreciate it if you could explain its root cause to me. I have also already gone through some of the past bugs reporting this issue, but I could not draw any conclusion from them.

The submission script is as follows:

#!/bin/bash
#SBATCH --mail-user=christoph.federrath@anu.edu.au
#SBATCH --mail-type=all
#SBATCH --account=pr32lo
#SBATCH --get-user-env
#SBATCH --export=NONE
#SBATCH --no-requeue
#SBATCH -o ./job_%j.out
#SBATCH -D ./
#SBATCH -J MHD_10008
#SBATCH --partition=large
#SBATCH --nodes=1944
#SBATCH --ntasks-per-node=48
#SBATCH --cpus-per-task=1
#SBATCH --time=12:00:00

source /etc/profile.d/modules.sh
module load slurm_setup
module load intel
module unload mpi.intel
module load mpi.intel/2019
module load lrztools fftw/3.3.8-intel-impi hdf5/1.10.2-intel-impi szip
module list

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HDF5_BASE/lib
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$SZIP_BASE/lib

mpiexec -n 93312 ./flash4 &> shell.out12

Kind regards,
Abouzar Ghasemi
HPC Specialist
Lenovo Deutschland GmbH
+4915237708213
aghasemi@lenovo.com
Hi Abouzar,

I'm trying to reproduce the issue. In the meantime, can you run a script like this and show me the output?

#!/bin/bash
#SBATCH --get-user-env
#SBATCH --export=NONE
#SBATCH --partition=large
#SBATCH --nodes=1944
#SBATCH --ntasks-per-node=48
#SBATCH --cpus-per-task=1

/bin/env

I suspect there is some environment variable set that causes the sruns in the allocation to do a bad calculation. Is the failing script really using mpiexec, or srun?
I would like you to test running with srun instead of mpiexec too:

#!/bin/bash
#SBATCH --get-user-env
#SBATCH --export=NONE
#SBATCH --no-requeue
#SBATCH --partition=large
#SBATCH --nodes=1944
#SBATCH --ntasks-per-node=48
#SBATCH --cpus-per-task=1

srun -n 93312 /bin/true

Please show me the output. I think mpiexec may be misinterpreting Slurm's environment variables. I encourage you to read through bug 5051, which seems to be a duplicate of your issue; if you can run the tests done there, we will be certain that this is the same issue.

Thanks
Hi Felip,

Thanks for your reply. I have done a couple of tests using srun instead of mpiexec, and I haven't observed any "can't honor" warning. Please see the attached file for the ENV variables.

Regards,
Abouzar
Created attachment 10394 [details] ENV variables
Hi,

I see the environment, but for a job of 12 nodes instead; I don't know whether you reproduced the problem with that setup or not. That's why I asked for exactly:

#!/bin/bash
#SBATCH --get-user-env
#SBATCH --export=NONE
#SBATCH --partition=large
#SBATCH --nodes=1944
#SBATCH --ntasks-per-node=48
#SBATCH --cpus-per-task=1

/bin/env

(Also, I see the output of a program; there's no need to run a program, I just needed the env output.)

In any case, let's do the following:

a) You need to find a combination that is failing, if you cannot use the original combination:

#SBATCH --nodes=1944
#SBATCH --ntasks-per-node=48
#SBATCH --cpus-per-task=1

mpiexec -n 93312 ...

b) Then try this script:

-----------8<---------------
#SBATCH --nodes=1944
#SBATCH --ntasks-per-node=48
#SBATCH --cpus-per-task=1

mpiexec -n 93312 env | sort | uniq > env.mpiexec_$SLURM_JOBID
srun env | sort | uniq > env.srun_$SLURM_JOBID
------------8<---------------

and upload here the env.mpiexec_* and env.srun_* files.

c) Then run:

-----------8<---------------
#SBATCH --nodes=1944
#SBATCH --ntasks-per-node=48
#SBATCH --cpus-per-task=1

mpiexec -n 93312 sleep 1000
------------8<---------------

Go into the master node and issue a 'pstree' to see exactly what mpiexec is running. We should see something like:

xxxx 33556 0.0 0.0 114148 1508 ? S 16:50 0:00 /bin/bash /var/spool/slurmd/job775399/slurm_script
xxxx 33571 0.0 0.0 113132 1480 ? S 16:50 0:00 \_ /bin/sh /..../impi/2018..../bin64/mpirun env
xxxx 33577 0.2 0.0 18096 2348 ? S 16:50 0:00 |  \_ mpiexec.hydra env
xxxx 33578 0.2 0.0 448504 6776 ? Sl 16:50 0:00 |     \_ /usr/bin/srun --nodelist xxxx -N 81 -n 81 ........
xxxx 33579 0.0 0.0 44052 672 ? S 16:50 0:00 |        \_ /usr/bin/srun --nodelist xxxx -N 81 -n 81 .....

We should see the incorrect srun line, something like:

srun -N 81 -n 81 ...

If that's the case, we confirm there's an issue on Intel's side, and a bug must be opened with them.
Hi, do you have any feedback regarding my last comment? I'm pretty sure this is an Intel issue. If you can confirm it, it would be great to open a bug with them.
Hi,

I haven't seen a response to this issue, and the cause is probably what I explained in my previous comments. I am going to close this issue now. Don't hesitate to mark this bug as OPEN again if you still have questions, or open a new one.

Regards,
Felip M
Hi Felip,

Thanks for your message, and please accept our apologies. The cluster is currently going through some acceptance tests, so we cannot run the requested jobs to collect the required information. According to our schedule, we can provide the requested information by the end of this month. By the way, Intel has also been informed about the issue; they opened an internal ticket for it.
Hi Felip, I would like to reopen this issue on customer request. This message should be somehow suppressed as it is confusing for users. My understanding is that Intel MPI always spawns (using srun) one proxy per node regardless of --ntasks-per-node, and srun would see the conflicting SLURM_NTASKS_PER_NODE value and print the warning. mpiexec cannot override SLURM_NTASKS_PER_NODE because that will change how MPI tasks are spawned.
(In reply to Victor Gamayunov from comment #11)
> Hi Felip, I would like to reopen this issue on customer request. This
> message should be somehow suppressed as it is confusing for users.
> 
> My understanding is that Intel MPI always spawns (using srun) one proxy per
> node regardless of --ntasks-per-node, and srun would see the conflicting
> SLURM_NTASKS_PER_NODE value and print the warning. mpiexec cannot override
> SLURM_NTASKS_PER_NODE because that will change how MPI tasks are spawned.

mpiexec launches srun after translating what it reads in the environment of the allocation granted by Slurm. The issue is probably in mpiexec and how it does that translation to srun; this is why I asked for more details in comment 7, to confirm this hypothesis. This seems to be the same case as bug 5051.

Also, is this the same issue opened by your colleagues in bug 7681? If so, please coordinate and let's close one of the two issues.

In any case, to confirm, we need you to reproduce the issue and get the info requested in comment 7.
(In reply to Felip Moll from comment #12)
> Also, is this the same issue opened by your colleagues in bug 7681? If so,
> please coordinate and let's close one of the two issues.

Bug 7681 is different; it is about affinity settings. The common element is Intel MPI with the Slurm bootstrap.

> In any case to confirm we need you to reproduce the issue and get the info
> requested in comment 7.

I can reproduce this message on a much smaller number of nodes, and indeed there is "srun -N 32 -n 32":

root    3425362       1  0 13:07 ?  00:00:00 slurmstepd: [184804.batch]
di52jap 3425368 3425362  0 13:07 ?  00:00:00 /bin/bash /var/spool/slurm/job184804/slurm_script
di52jap 3425394 3425368  0 13:07 ?  00:00:00 mpiexec sleep 300
di52jap 3425395 3425394  0 13:07 ?  00:00:00 /usr/bin/srun -N 32 -n 32 --nodelist i01r01c04s04,i01r01c04s05,i01r01c04s06,i01r01c04s07,i01r01c04s08,i01r01c04s09,i01r01c04s10,i01r01c04s11,i01r0
di52jap 3425396 3425395  0 13:07 ?  00:00:00 /usr/bin/srun -N 32 -n 32 --nodelist i01r01c04s04,i01r01c04s05,i01r01c04s06,i01r01c04s07,i01r01c04s08,i01r01c04s09,i01r01c04s10,i01r01c04s11,i01
root    3425441       1  0 13:07 ?  00:00:00 slurmstepd: [184804.0]
di52jap 3425449 3425441  0 13:07 ?  00:00:00 /lrz/sys/intel/studio2019_u4/impi/2019.4.243/intel64/bin//hydra_pmi_proxy --usize -1 --auto-cleanup 1 --abort-signal 9
Created attachment 11456 [details] honor.sh
Created attachment 11458 [details] ps -Hef Full output of ps -Hef
> I can reproduce this message on much smaller number of nodes, and indeed
> there is "srun -N 32 -n 32":
> 
> root    3425362       1  0 13:07 ?  00:00:00 slurmstepd: [184804.batch]
> di52jap 3425368 3425362  0 13:07 ?  00:00:00 /bin/bash /var/spool/slurm/job184804/slurm_script
> di52jap 3425394 3425368  0 13:07 ?  00:00:00 mpiexec sleep 300
> di52jap 3425395 3425394  0 13:07 ?  00:00:00 /usr/bin/srun -N 32 -n 32 --nodelist
> i01r01c04s04,i01r01c04s05,i01r01c04s06,i01r01c04s07,i01r01c04s08,
> i01r01c04s09,i01r01c04s10,i01r01c04s11,i01r0

So, do you agree that even if you send an sbatch asking for 8 tasks per node and 32 nodes, Intel's mpiexec is **incorrectly** translating this into srun -N 32 -n 32?

#!/bin/bash
#SBATCH -o %x.%j.%N.out
#SBATCH -e %x.%j.%N.err
#SBATCH -D .
#SBATCH -J honor
#SBATCH --nodes=32
#SBATCH --tasks-per-node=8
> So, do you agree that even if you send an sbatch asking for 8 tasks per node
> and 32 nodes, Intel's mpiexec is **incorrectly** translating this into srun
> -N 32 -n 32?

Yes. AFAIK the process is as follows:

1. mpiexec on the master node spawns one hydra_bstrap_proxy per node (therefore -N and -n are equal).
2. hydra_bstrap_proxy spawns --ntasks-per-node user processes on each node.
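That two-step launch can be illustrated with simple shell arithmetic (a sketch of the numbers only, not Intel's actual code; the values are those of the 32-node reproducer above):

```shell
# Hydra's two-step launch, as arithmetic (illustration, not Intel's code).
nodes=32
ntasks_per_node=8

# Step 1: bootstrap -- one hydra_bstrap_proxy per node, so the
# bootstrap srun requests -N == -n == number of nodes.
echo "bootstrap: srun -N $nodes -n $nodes ... hydra_bstrap_proxy"

# Step 2: each proxy locally forks ntasks-per-node MPI ranks.
total_ranks=$(( nodes * ntasks_per_node ))
echo "ranks started by the proxies: $total_ranks"
```

This is why the bootstrap srun line never reflects the user's --ntasks-per-node: that value is only consumed later, by the proxies.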
(In reply to Victor Gamayunov from comment #17)
> > So, do you agree that even if you send an sbatch asking for 8 tasks per node
> > and 32 nodes, Intel's mpiexec is **incorrectly** translating this into srun
> > -N 32 -n 32?
> 
> Yes, AFAIK the process is following:
> 1. mpiexec on master node spawns 1x hydra_bstrap_proxy proxy per node
> (therefore equal -N and -n)
> 2. hydra_bstrap_proxy spawns --ntasks-per-node user processes on each node

Correct, that matches my understanding. Given that, I don't know what Slurm can do to fix it, since it seems to be an issue on Intel's side during the translation. Do you have any different thoughts about this?
Felip, is there any option to disable this warning message? We can add it to our srun call
(In reply to Rozanov, Anatoliy from comment #19)
> Felip, is there any option to disable this warning message? We can add it to
> our srun call

There's no option to hide this warning when submitting a job with incorrect parameters to srun, and I really think it must not be hidden. In the past there were similar bugs, and this code was already tuned to not throw the warning in incorrect situations; see for example https://github.com/SchedMD/slurm/commit/daacf5afee9 (bug 3079).

This is a problem in the Intel commands that should be addressed there. Some more information can be found in bug 5051 comment 4.

Nevertheless, if you still want to change the code and disable the warning, you must understand that SchedMD cannot support that officially. The code that raises the warning is found in libsrun/opt.c:

if ((opt.ntasks_per_node != NO_VAL) && opt.min_nodes &&
    (opt.ntasks_per_node != (opt.ntasks / opt.min_nodes))) {
        if (opt.ntasks > opt.ntasks_per_node)
                info("Warning: can't honor --ntasks-per-node "
                     "set to %u which doesn't match the "
                     "requested tasks %u with the number of "
                     "requested nodes %u. Ignoring "
                     "--ntasks-per-node.",
                     opt.ntasks_per_node, opt.ntasks,
                     opt.min_nodes);
        opt.ntasks_per_node = NO_VAL;
}

You could change the info() message to a debug(), and then it would be shown only when running srun -vvv. Note that, warning or not, this code path unsets ntasks_per_node, as it does currently when you see the message, so with the message hidden you would not directly notice that.
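For clarity, the condition in that snippet boils down to simple arithmetic. Here it is as shell (an illustration only, not Slurm's actual code), using the values from the 32-node reproducer, where the batch environment carries SLURM_NTASKS_PER_NODE=8 but Hydra's bootstrap srun requests -N 32 -n 32:

```shell
# Sketch of the libsrun/opt.c check as shell arithmetic (illustration only).
ntasks_per_node=8   # inherited from sbatch via SLURM_NTASKS_PER_NODE
ntasks=32           # -n requested by mpiexec's bootstrap srun
min_nodes=32        # -N requested by mpiexec's bootstrap srun

# tasks/nodes = 32/32 = 1, which does not match ntasks_per_node=8,
# so the env var is ignored and the warning fires.
if [ "$ntasks_per_node" -ne $(( ntasks / min_nodes )) ] && \
   [ "$ntasks" -gt "$ntasks_per_node" ]; then
    echo "can't honor --ntasks-per-node=$ntasks_per_node with $ntasks tasks on $min_nodes nodes"
fi
```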
> I really think this must not be hidden. Why must it not be hidden by some option? We just want to run one process per node.
Let me clarify the situation a bit. The fact that Hydra launches one task per node is intended and not a bug. The rationale behind this feature is to reduce the startup time of an application on a large number of nodes with a large PPN (ntasks-per-node). For example, it can start 300,000 ranks in under a minute.

This works with the SSH bootstrap; the warning only appears when the bootstrap is srun. From srun's point of view (as it is now), the arguments are incorrect.

Can I then suggest adding a switch to srun which is not just hiding the warning? Maybe we can call it "hydra" or "proxy" mode. It may be required anyway for bug 7681, to sort out the pinning.

What are your thoughts, does it make sense?
Hi Victor, I was responding just now and we collided.

(In reply to Victor Gamayunov from comment #22)
> Let me clarify the situation a bit. The fact that Hydra launches one task
> per node is intended and not a bug. The rationale behind this feature is to
> reduce the startup time of an application on large number of nodes with
> large PPN (ntasks-per-node). For example, it can start 300,000 ranks in
> under a minute.

I perfectly understand this.

> This works with SSH bootstrap, the warning only appears when bootstrap is
> srun. From srun point of view (as it is now) the arguments are incorrect.

I will explain below.

> Can I suggest then adding a switch to srun which is not just hiding the
> warning, but maybe we can call it "hydra" or "proxy" mode. Which may be
> required anyway for bug 7681 to sort the pinning.
> 
> What are your thoughts, does it make sense?

We cannot remove the warning since it is real. If you request --ntasks-per-node=8 and then run an srun with -n 32 -N 32 while the environment variable SLURM_NTASKS_PER_NODE is set to 8, then 32/32=1 doesn't match SLURM_NTASKS_PER_NODE=8, so we must ignore the env var, and therefore we throw the warning.

In a correct execution, mpiexec/mpirun should really call srun with --ntasks-per-node=1 to override what you defined in sbatch, and not rely only on -n and -N. Otherwise, it should unset SLURM_NTASKS_PER_NODE. You can try unsetting SLURM_NTASKS_PER_NODE in your script, but then I doubt that hydra_bstrap_proxy launches the correct number of tasks:

#!/bin/bash
#SBATCH -o %x.%j.%N.out
#SBATCH -e %x.%j.%N.err
#SBATCH -D .
#SBATCH -J honor
#SBATCH --nodes=32
#SBATCH --ntasks-per-node=8    <---------- There was a mistake in your honor.sh; this is --ntasks-per-node, not --tasks-per-node

module unload devEnv/Intel
module load devEnv/Intel/2019
module load slurm_setup

unset SLURM_NTASKS_PER_NODE

mpiexec sleep 300

To me, it's an inconsistency to request 8 tasks per node but then intend to launch 1 task per node without explicitly saying you are doing that on purpose (by specifying --ntasks-per-node=1 again). I understand that requesting --ntasks-per-node=8 in the batch script is only a way to pass hydra_bstrap_proxy what you really want; we will know for sure after unsetting the env var.

That's the reason we cannot remove the warning: it is legitimate, and mpiexec is the one that should request --ntasks-per-node=1 to overwrite what you requested in the first place.

The options you have are:

1. Fix mpiexec/mpirun, which I think is the correct option.
2. Try unsetting the variable.
3. Modify the code to ignore the warning or add a new option; since it is not our error, we will not just add a flag to ignore a warning, I hope you understand.

What do you think?
Thank you. I will try to set --ntasks-per-node=1. I thought that -n 1 would be enough.
I tested --ntasks-per-node=1 by using the I_MPI_HYDRA_BOOTSTRAP_EXEC_EXTRA_ARGS variable. It worked for the attached case, and the message is gone.

Are there any possible side effects? For example, will this change the SLURM_* environment variables for the proxy/application?
(In reply to Victor Gamayunov from comment #25)
> I tested --ntasks-per-node=1 by using I_MPI_HYDRA_BOOTSTRAP_EXEC_EXTRA_ARGS
> variable. It worked for the attached case and the message is gone.
> 
> Are there any possible side effects, for example will this change SLURM_*
> environment variables for the proxy/application?

I am not sure how Hydra treats this variable, but as long as you see a single PMI proxy and the expected number of processes for the user application, it should be working. Can you show me exactly how you did that?
mpiexec reads the env vars before running srun, so it should not affect the Hydra process manager.
(In reply to Felip Moll from comment #26)
> Can you show me exactly how you did that?

export I_MPI_HYDRA_BOOTSTRAP_EXEC_EXTRA_ARGS="--ntasks-per-node=1"
export I_MPI_HYDRA_DEBUG=1
mpiexec IMB-MPI1 sendrecv

The hydra debug variable generates a lot of output, but the main thing is this:

[mpiexec@i01r01c03s11.sng.lrz.de] Launch arguments: /usr/bin/srun -N 32 -n 32 --ntasks-per-node=1 --nodelist i01r01c03s11,i01r01c03s12,i01r01c04s01,i01r01c04s02,i01r01c04s05,i01r01c04s07,i01r01c04s09,i01r01c04s10,i01r01c04s11,i01r01c04s12,i01r01c05s03,i01r01c05s04,i01r01c05s06,i01r01c05s07,i01r01c05s08,i01r01c05s09,i01r01c05s10,i01r01c05s11,i01r01c05s12,i01r01c06s01,i01r01c06s02,i01r01c06s03,i01r01c06s04,i01r01c06s05,i01r01c06s06,i01r01c06s07,i01r01c06s08,i01r01c06s09,i01r01c06s10,i01r01c06s11,i01r01c06s12,i01r05c04s07 --input none /lrz/sys/intel/studio2019_u4/impi/2019.4.243/intel64/bin//hydra_bstrap_proxy --upstream-host 172.16.192.35 --upstream-port 41562 --pgid 0 --launcher slurm --launcher-number 1 --base-path /lrz/sys/intel/studio2019_u4/impi/2019.4.243/intel64/bin/ --tree-width 128 --tree-level 1 --iface ib0 --time-left -1 --collective-launch 1 --debug /lrz/sys/intel/studio2019_u4/impi/2019.4.243/intel64/bin//hydra_pmi_proxy --usize -1 --auto-cleanup 1 --abort-signal 9
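For reference, a consolidated job script applying this workaround might look like the sketch below. The resource numbers match the 32-node reproducer; the script name and "./app" are placeholders, and the only load-bearing line is the I_MPI_HYDRA_BOOTSTRAP_EXEC_EXTRA_ARGS export tested above:

```shell
# Write a sketch of the workaround job script (names are placeholders).
cat > honor_workaround.sbatch <<'EOF'
#!/bin/bash
#SBATCH --nodes=32
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=1

# Make Hydra's bootstrap srun explicitly request one proxy per node,
# overriding the conflicting SLURM_NTASKS_PER_NODE from the batch env.
export I_MPI_HYDRA_BOOTSTRAP_EXEC_EXTRA_ARGS="--ntasks-per-node=1"

mpiexec ./app
EOF
echo "wrote honor_workaround.sbatch"
```

The proxies still read the sbatch-requested --ntasks-per-node, so the application rank count is unchanged; only the bootstrap srun's arguments are made consistent.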
> export I_MPI_HYDRA_BOOTSTRAP_EXEC_EXTRA_ARGS="--ntasks-per-node=1"
> export I_MPI_HYDRA_DEBUG=1
> mpiexec IMB-MPI1 sendrecv
> 
> the hydra debug variable generates lot of output, but the main thing is this:
> 
> [mpiexec@i01r01c03s11.sng.lrz.de] Launch arguments: /usr/bin/srun -N 32 -n
> 32 --ntasks-per-node=1 --nodelist...

That's interesting; it seems to be a good option to remove the warning, as long as the proxy later launches the expected number of tasks, which I guess it will, since this only affects the bootstrap and is not related to reading the Slurm parameters.

I will take note of this for future situations.

Given that, do you think this issue is resolved?
> That's interesting, it seems to be a good option to remove the error as long
> as the proxy launches later the expected number of tasks, which I guess it
> will do since this only affect the bootstrap and is not related to reading
> slurm parameters.

My concern was exactly that: if we introduce the srun switch or change the environment, will Hydra launch the correct number of tasks? But initial testing shows that it works correctly.

> I will take this note for next situations.
> 
> Given that, do you thing this issue is resolved?

Yes.

Many thanks, Felip
> My concern was exactly that - if we introduce the srun switch or change
> environment, will hydra launch correct number of tasks. But initial testing
> shows that it works correctly.
> 
> > I will take this note for next situations.
> > 
> > Given that, do you thing this issue is resolved?
> 
> Yes
> 
> Many thanks Felip

Good. I still think it is something Intel must improve on its side, but having this workaround helps. Since you are from Intel, maybe you could push them to make these changes; it would be a win-win for all of us :)

Glad everything is clear now. I am marking the bug as infogiven.