Hi,

The following script is used to submit a job. At the beginning of the user log file, the following message is printed:

srun: Warning: can't honor --ntasks-per-node set to 48 which doesn't match the requested tasks 81 with the number of requested nodes 81. Ignoring --ntasks-per-node.

I did not understand the reason for this warning; I would appreciate it if you could explain its root cause to me. I have also already gone through some of the past bugs reporting this issue, but I could not draw any conclusion from them.

The submission script is as follows:

#!/bin/bash
#SBATCH --mail-user=christoph.federrath@anu.edu.au
#SBATCH --mail-type=all
#SBATCH --account=pr32lo
#SBATCH --get-user-env
#SBATCH --export=NONE
#SBATCH --no-requeue
#SBATCH -o ./job_%j.out
#SBATCH -D ./
#SBATCH -J MHD_10008
#SBATCH --partition=large
#SBATCH --nodes=1944
#SBATCH --ntasks-per-node=48
#SBATCH --cpus-per-task=1
#SBATCH --time=12:00:00

source /etc/profile.d/modules.sh
module load slurm_setup
module load intel
module unload mpi.intel
module load mpi.intel/2019
module load lrztools fftw/3.3.8-intel-impi hdf5/1.10.2-intel-impi szip
module list

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HDF5_BASE/lib
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$SZIP_BASE/lib

mpiexec -n 93312 ./flash4 &> shell.out12

Kind regards,
Abouzar Ghasemi
HPC Specialist
Lenovo Deutschland GmbH
+4915237708213
aghasemi@lenovo.com
Hi Abouzar,

I'm trying to reproduce the issue. In the meantime, can you run a script like this and show me the output?

#!/bin/bash
#SBATCH --get-user-env
#SBATCH --export=NONE
#SBATCH --partition=large
#SBATCH --nodes=1944
#SBATCH --ntasks-per-node=48
#SBATCH --cpus-per-task=1

/bin/env

I suspect there is some environment variable set that causes the sruns in the allocation to do a bad calculation. Is the failing script really using mpiexec, or srun?
I would like you to test running with srun instead of mpiexec too:

#!/bin/bash
#SBATCH --get-user-env
#SBATCH --export=NONE
#SBATCH --no-requeue
#SBATCH --partition=large
#SBATCH --nodes=1944
#SBATCH --ntasks-per-node=48
#SBATCH --cpus-per-task=1

srun -n 93312 /bin/true

Please show me the output. I think mpiexec may be misinterpreting Slurm's environment variables. I encourage you to read through bug 5051, which seems to be a duplicate of your issue; if you can run the tests done there, we will be certain that this is the same issue.

Thanks
Hi Felip,

Thanks for your reply. I have done a couple of tests using srun instead of mpiexec, and I haven't observed any "can't honor" warning. Please see the attached file for the ENV variables.

Regards,
Abouzar
Created attachment 10394 [details] ENV variables
Hi,

I see the environment, but for a job of 12 nodes instead; I don't know whether you reproduced the problem with that setup or not. That's why I asked for exactly:

#!/bin/bash
#SBATCH --get-user-env
#SBATCH --export=NONE
#SBATCH --partition=large
#SBATCH --nodes=1944
#SBATCH --ntasks-per-node=48
#SBATCH --cpus-per-task=1

/bin/env

(Also, I see the output of a program; there's no need to run a program, I just needed the env output.)

In any case, let's do the following:

a) You need to find a combination that is failing, if you cannot use the original combination:

#SBATCH --nodes=1944
#SBATCH --ntasks-per-node=48
#SBATCH --cpus-per-task=1

mpiexec -n 93312 ...

b) Then try this script:

-----------8<---------------
#SBATCH --nodes=1944
#SBATCH --ntasks-per-node=48
#SBATCH --cpus-per-task=1

mpiexec -n 93312 env | sort | uniq > env.mpiexec_$SLURM_JOBID
srun env | sort | uniq > env.srun_$SLURM_JOBID
------------8<---------------

and upload here the env.mpiexec_* and env.srun_* files.

c) Then run:

-----------8<---------------
#SBATCH --nodes=1944
#SBATCH --ntasks-per-node=48
#SBATCH --cpus-per-task=1

mpiexec -n 93312 sleep 1000
------------8<---------------

Go into the master node and issue a 'pstree' to see exactly what mpiexec is running. We should see something like:

xxxx 33556 0.0 0.0 114148 1508 ? S 16:50 0:00 /bin/bash /var/spool/slurmd/job775399/slurm_script
xxxx 33571 0.0 0.0 113132 1480 ? S 16:50 0:00 \_ /bin/sh /..../impi/2018..../bin64/mpirun env
xxxx 33577 0.2 0.0 18096 2348 ? S 16:50 0:00 |  \_ mpiexec.hydra env
xxxx 33578 0.2 0.0 448504 6776 ? Sl 16:50 0:00 |     \_ /usr/bin/srun --nodelist xxxx -N 81 -n 81 ........
xxxx 33579 0.0 0.0 44052 672 ? S 16:50 0:00 |        \_ /usr/bin/srun --nodelist xxxx -N 81 -n 81 .....

We should see the incorrect srun line, something like:

srun -N 81 -n 81 ...

If that's the case, we confirm there's an issue on Intel's side, and a bug must be opened with them.
Hi, do you have any feedback regarding my last comment? I'm pretty sure this is an Intel issue. If you can confirm it, it would be great to open a bug with them.
Hi,

I haven't seen a response to this issue, and the cause is probably what I explained in my previous comments. I am going to close this issue now. Don't hesitate to mark this bug as OPEN again if you still have questions, or open a new one.

Regards,
Felip M
Hi Felip,

Thanks for your message, and please accept our apologies. The cluster is currently going through some acceptance tests, so we cannot run the requested jobs to collect the required information. According to our schedule, we can provide the requested information by the end of this month. By the way, Intel has also been informed about the issue; they opened an internal ticket for it.
Hi Felip, I would like to reopen this issue on customer request. This message should be somehow suppressed as it is confusing for users. My understanding is that Intel MPI always spawns (using srun) one proxy per node regardless of --ntasks-per-node, and srun would see the conflicting SLURM_NTASKS_PER_NODE value and print the warning. mpiexec cannot override SLURM_NTASKS_PER_NODE because that will change how MPI tasks are spawned.
(In reply to Victor Gamayunov from comment #11)
> Hi Felip, I would like to reopen this issue on customer request. This
> message should be somehow suppressed as it is confusing for users.
> 
> My understanding is that Intel MPI always spawns (using srun) one proxy per
> node regardless of --ntasks-per-node, and srun would see the conflicting
> SLURM_NTASKS_PER_NODE value and print the warning. mpiexec cannot override
> SLURM_NTASKS_PER_NODE because that will change how MPI tasks are spawned.

mpiexec launches srun after translating what it reads in the environment of the allocation granted by Slurm. The issue is probably in mpiexec and how it does that translation to srun; this is why I asked for more details in comment 7, to confirm this hypothesis. This seems to be the same case as bug 5051.

Also, is this the same issue opened by your colleagues in bug 7681? If so, please coordinate and let's close one of the two issues.

In any case, to confirm, we need you to reproduce the issue and get the info requested in comment 7.
(In reply to Felip Moll from comment #12)
> Also, is this the same issue opened by your colleagues in bug 7681? If so,
> please coordinate and let's close one of the two issues.

Bug 7681 is different; it is about affinity settings. The common element is Intel MPI with the Slurm bootstrap.

> In any case to confirm we need you to reproduce the issue and get the info
> requested in comment 7.

I can reproduce this message on a much smaller number of nodes, and indeed there is "srun -N 32 -n 32":

root    3425362       1  0 13:07 ?  00:00:00 slurmstepd: [184804.batch]
di52jap 3425368 3425362  0 13:07 ?  00:00:00 /bin/bash /var/spool/slurm/job184804/slurm_script
di52jap 3425394 3425368  0 13:07 ?  00:00:00 mpiexec sleep 300
di52jap 3425395 3425394  0 13:07 ?  00:00:00 /usr/bin/srun -N 32 -n 32 --nodelist i01r01c04s04,i01r01c04s05,i01r01c04s06,i01r01c04s07,i01r01c04s08,i01r01c04s09,i01r01c04s10,i01r01c04s11,i01r0
di52jap 3425396 3425395  0 13:07 ?  00:00:00 /usr/bin/srun -N 32 -n 32 --nodelist i01r01c04s04,i01r01c04s05,i01r01c04s06,i01r01c04s07,i01r01c04s08,i01r01c04s09,i01r01c04s10,i01r01c04s11,i01
root    3425441       1  0 13:07 ?  00:00:00 slurmstepd: [184804.0]
di52jap 3425449 3425441  0 13:07 ?  00:00:00 /lrz/sys/intel/studio2019_u4/impi/2019.4.243/intel64/bin//hydra_pmi_proxy --usize -1 --auto-cleanup 1 --abort-signal 9
Created attachment 11456 [details] honor.sh
Created attachment 11458 [details] ps -Hef Full output of ps -Hef
> I can reproduce this message on much smaller number of nodes, and indeed
> there is "srun -N 32 -n 32":
> 
> root    3425362       1  0 13:07 ?  00:00:00 slurmstepd: [184804.batch]
> di52jap 3425368 3425362  0 13:07 ?  00:00:00 /bin/bash /var/spool/slurm/job184804/slurm_script
> di52jap 3425394 3425368  0 13:07 ?  00:00:00 mpiexec sleep 300
> di52jap 3425395 3425394  0 13:07 ?  00:00:00 /usr/bin/srun -N 32 -n 32 --nodelist
> i01r01c04s04,i01r01c04s05,i01r01c04s06,i01r01c04s07,i01r01c04s08,
> i01r01c04s09,i01r01c04s10,i01r01c04s11,i01r0

So, do you agree that even if you send an sbatch asking for 8 tasks per node and 32 nodes, Intel's mpiexec is **incorrectly** translating this into srun -N 32 -n 32?

#!/bin/bash
#SBATCH -o %x.%j.%N.out
#SBATCH -e %x.%j.%N.err
#SBATCH -D .
#SBATCH -J honor
#SBATCH --nodes=32
#SBATCH --tasks-per-node=8
> So, do you agree that even if you send an sbatch asking for 8 tasks per node
> and 32 nodes, Intel's mpiexec is **incorrectly** translating this into srun
> -N 32 -n 32?

Yes. AFAIK the process is as follows:

1. mpiexec on the master node spawns one hydra_bstrap_proxy per node (therefore -N and -n are equal).
2. hydra_bstrap_proxy spawns --ntasks-per-node user processes on each node.
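That two-step launch can be illustrated with simple shell arithmetic (a sketch of the numbers only, not Intel's actual code; the values are those of the 32-node reproducer above):

```shell
# Hydra's two-step launch, as arithmetic (illustration, not Intel's code).
nodes=32
ntasks_per_node=8

# Step 1: bootstrap -- one hydra_bstrap_proxy per node, so the
# bootstrap srun requests -N == -n == number of nodes.
echo "bootstrap: srun -N $nodes -n $nodes ... hydra_bstrap_proxy"

# Step 2: each proxy locally forks ntasks-per-node MPI ranks.
total_ranks=$(( nodes * ntasks_per_node ))
echo "ranks started by the proxies: $total_ranks"
```

This is why the bootstrap srun line never reflects the user's --ntasks-per-node: that value is only consumed later, by the proxies.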
(In reply to Victor Gamayunov from comment #17)
> > So, do you agree that even if you send an sbatch asking for 8 tasks per node
> > and 32 nodes, Intel's mpiexec is **incorrectly** translating this into srun
> > -N 32 -n 32?
> 
> Yes, AFAIK the process is following:
> 1. mpiexec on master node spawns 1x hydra_bstrap_proxy proxy per node
> (therefore equal -N and -n)
> 2. hydra_bstrap_proxy spawns --ntasks-per-node user processes on each node

Correct, that matches my understanding. Given that, I don't know what Slurm can do to fix it, since it seems to be an issue on Intel's side during the translation. Do you have any different thoughts about this?
Felip, is there any option to disable this warning message? We can add it to our srun call
(In reply to Rozanov, Anatoliy from comment #19)
> Felip, is there any option to disable this warning message? We can add it to
> our srun call

There's no option to hide this warning when submitting a job with incorrect parameters to srun, and I really think it must not be hidden. In the past there were similar bugs, and this code was already tuned to not throw the warning in incorrect situations; see for example https://github.com/SchedMD/slurm/commit/daacf5afee9 (bug 3079).

This is a problem in the Intel commands that should be addressed there. Some more information can be found in bug 5051 comment 4.

Nevertheless, if you still want to change the code and disable the warning, you must understand that SchedMD cannot support that officially. The code that raises the warning is found in libsrun/opt.c:

if ((opt.ntasks_per_node != NO_VAL) && opt.min_nodes &&
    (opt.ntasks_per_node != (opt.ntasks / opt.min_nodes))) {
        if (opt.ntasks > opt.ntasks_per_node)
                info("Warning: can't honor --ntasks-per-node "
                     "set to %u which doesn't match the "
                     "requested tasks %u with the number of "
                     "requested nodes %u. Ignoring "
                     "--ntasks-per-node.",
                     opt.ntasks_per_node, opt.ntasks,
                     opt.min_nodes);
        opt.ntasks_per_node = NO_VAL;
}

You could change the info() message to a debug(), and then it would be shown only when running srun -vvv. Note that, warning or not, this code path unsets ntasks_per_node, as it does currently when you see the message, so with the message hidden you would not directly notice that.
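For clarity, the condition in that snippet boils down to simple arithmetic. Here it is as shell (an illustration only, not Slurm's actual code), using the values from the 32-node reproducer, where the batch environment carries SLURM_NTASKS_PER_NODE=8 but Hydra's bootstrap srun requests -N 32 -n 32:

```shell
# Sketch of the libsrun/opt.c check as shell arithmetic (illustration only).
ntasks_per_node=8   # inherited from sbatch via SLURM_NTASKS_PER_NODE
ntasks=32           # -n requested by mpiexec's bootstrap srun
min_nodes=32        # -N requested by mpiexec's bootstrap srun

# tasks/nodes = 32/32 = 1, which does not match ntasks_per_node=8,
# so the env var is ignored and the warning fires.
if [ "$ntasks_per_node" -ne $(( ntasks / min_nodes )) ] && \
   [ "$ntasks" -gt "$ntasks_per_node" ]; then
    echo "can't honor --ntasks-per-node=$ntasks_per_node with $ntasks tasks on $min_nodes nodes"
fi
```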
> I really think this must not be hidden. Why must it not be hidden by some option? We just want to run one process per node.
Let me clarify the situation a bit. The fact that Hydra launches one task per node is intended and not a bug. The rationale behind this feature is to reduce the startup time of an application on a large number of nodes with a large PPN (ntasks-per-node). For example, it can start 300,000 ranks in under a minute.

This works with the SSH bootstrap; the warning only appears when the bootstrap is srun. From srun's point of view (as it is now), the arguments are incorrect.

Can I then suggest adding a switch to srun which is not just hiding the warning? Maybe we can call it "hydra" or "proxy" mode. It may be required anyway for bug 7681, to sort out the pinning.

What are your thoughts, does it make sense?
Hi Victor, I was responding just now and we collided.

(In reply to Victor Gamayunov from comment #22)
> Let me clarify the situation a bit. The fact that Hydra launches one task
> per node is intended and not a bug. The rationale behind this feature is to
> reduce the startup time of an application on large number of nodes with
> large PPN (ntasks-per-node). For example, it can start 300,000 ranks in
> under a minute.

I perfectly understand this.

> This works with SSH bootstrap, the warning only appears when bootstrap is
> srun. From srun point of view (as it is now) the arguments are incorrect.

I will explain below.

> Can I suggest then adding a switch to srun which is not just hiding the
> warning, but maybe we can call it "hydra" or "proxy" mode. Which may be
> required anyway for bug 7681 to sort the pinning.
> 
> What are your thoughts, does it make sense?

We cannot remove the warning since it is real. If you request --ntasks-per-node=8 and then run an srun with -n 32 -N 32 while the environment variable SLURM_NTASKS_PER_NODE is set to 8, then 32/32=1 doesn't match SLURM_NTASKS_PER_NODE=8, so we must ignore the env var, and therefore we throw the warning.

In a correct execution, mpiexec/mpirun should really call srun with --ntasks-per-node=1 to override what you defined in sbatch, and not rely only on -n and -N. Otherwise, it should unset SLURM_NTASKS_PER_NODE. You can try unsetting SLURM_NTASKS_PER_NODE in your script, but then I doubt that hydra_bstrap_proxy launches the correct number of tasks:

#!/bin/bash
#SBATCH -o %x.%j.%N.out
#SBATCH -e %x.%j.%N.err
#SBATCH -D .
#SBATCH -J honor
#SBATCH --nodes=32
#SBATCH --ntasks-per-node=8    <---------- There was a mistake in your honor.sh; this is --ntasks-per-node, not --tasks-per-node

module unload devEnv/Intel
module load devEnv/Intel/2019
module load slurm_setup

unset SLURM_NTASKS_PER_NODE

mpiexec sleep 300

To me, it's an inconsistency to request 8 tasks per node but then intend to launch 1 task per node without explicitly saying you are doing that on purpose (by specifying --ntasks-per-node=1 again). I understand that requesting --ntasks-per-node=8 in the batch script is only a way to pass hydra_bstrap_proxy what you really want; we will know for sure after unsetting the env var.

That's the reason we cannot remove the warning: it is legitimate, and mpiexec is the one that should request --ntasks-per-node=1 to overwrite what you requested in the first place.

The options you have are:

1. Fix mpiexec/mpirun, which I think is the correct option.
2. Try unsetting the variable.
3. Modify the code to ignore the warning or add a new option; since it is not our error, we will not just add a flag to ignore a warning, I hope you understand.

What do you think?
Thank you. I will try to set --ntasks-per-node=1. I thought that -n 1 would be enough.
I tested --ntasks-per-node=1 by using the I_MPI_HYDRA_BOOTSTRAP_EXEC_EXTRA_ARGS variable. It worked for the attached case, and the message is gone.

Are there any possible side effects? For example, will this change the SLURM_* environment variables for the proxy/application?
(In reply to Victor Gamayunov from comment #25)
> I tested --ntasks-per-node=1 by using I_MPI_HYDRA_BOOTSTRAP_EXEC_EXTRA_ARGS
> variable. It worked for the attached case and the message is gone.
> 
> Are there any possible side effects, for example will this change SLURM_*
> environment variables for the proxy/application?

I am not sure how Hydra treats this variable, but as long as you see a single PMI proxy and the expected number of processes for the user application, it should be working. Can you show me exactly how you did that?
mpiexec reads the env vars before running srun, so it should not affect the Hydra process manager.
(In reply to Felip Moll from comment #26)
> Can you show me exactly how you did that?

export I_MPI_HYDRA_BOOTSTRAP_EXEC_EXTRA_ARGS="--ntasks-per-node=1"
export I_MPI_HYDRA_DEBUG=1
mpiexec IMB-MPI1 sendrecv

The hydra debug variable generates a lot of output, but the main thing is this:

[mpiexec@i01r01c03s11.sng.lrz.de] Launch arguments: /usr/bin/srun -N 32 -n 32 --ntasks-per-node=1 --nodelist i01r01c03s11,i01r01c03s12,i01r01c04s01,i01r01c04s02,i01r01c04s05,i01r01c04s07,i01r01c04s09,i01r01c04s10,i01r01c04s11,i01r01c04s12,i01r01c05s03,i01r01c05s04,i01r01c05s06,i01r01c05s07,i01r01c05s08,i01r01c05s09,i01r01c05s10,i01r01c05s11,i01r01c05s12,i01r01c06s01,i01r01c06s02,i01r01c06s03,i01r01c06s04,i01r01c06s05,i01r01c06s06,i01r01c06s07,i01r01c06s08,i01r01c06s09,i01r01c06s10,i01r01c06s11,i01r01c06s12,i01r05c04s07 --input none /lrz/sys/intel/studio2019_u4/impi/2019.4.243/intel64/bin//hydra_bstrap_proxy --upstream-host 172.16.192.35 --upstream-port 41562 --pgid 0 --launcher slurm --launcher-number 1 --base-path /lrz/sys/intel/studio2019_u4/impi/2019.4.243/intel64/bin/ --tree-width 128 --tree-level 1 --iface ib0 --time-left -1 --collective-launch 1 --debug /lrz/sys/intel/studio2019_u4/impi/2019.4.243/intel64/bin//hydra_pmi_proxy --usize -1 --auto-cleanup 1 --abort-signal 9
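For reference, a consolidated job script applying this workaround might look like the sketch below. The resource numbers match the 32-node reproducer; the script name and "./app" are placeholders, and the only load-bearing line is the I_MPI_HYDRA_BOOTSTRAP_EXEC_EXTRA_ARGS export tested above:

```shell
# Write a sketch of the workaround job script (names are placeholders).
cat > honor_workaround.sbatch <<'EOF'
#!/bin/bash
#SBATCH --nodes=32
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=1

# Make Hydra's bootstrap srun explicitly request one proxy per node,
# overriding the conflicting SLURM_NTASKS_PER_NODE from the batch env.
export I_MPI_HYDRA_BOOTSTRAP_EXEC_EXTRA_ARGS="--ntasks-per-node=1"

mpiexec ./app
EOF
echo "wrote honor_workaround.sbatch"
```

The proxies still read the sbatch-requested --ntasks-per-node, so the application rank count is unchanged; only the bootstrap srun's arguments are made consistent.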
> export I_MPI_HYDRA_BOOTSTRAP_EXEC_EXTRA_ARGS="--ntasks-per-node=1"
> export I_MPI_HYDRA_DEBUG=1
> mpiexec IMB-MPI1 sendrecv
> 
> the hydra debug variable generates lot of output, but the main thing is this:
> 
> [mpiexec@i01r01c03s11.sng.lrz.de] Launch arguments: /usr/bin/srun -N 32 -n
> 32 --ntasks-per-node=1 --nodelist...

That's interesting; it seems to be a good option to remove the warning, as long as the proxy later launches the expected number of tasks, which I guess it will, since this only affects the bootstrap and is not related to reading the Slurm parameters.

I will take note of this for future situations.

Given that, do you think this issue is resolved?
> That's interesting, it seems to be a good option to remove the error as long
> as the proxy launches later the expected number of tasks, which I guess it
> will do since this only affect the bootstrap and is not related to reading
> slurm parameters.

My concern was exactly that: if we introduce the srun switch or change the environment, will Hydra launch the correct number of tasks? But initial testing shows that it works correctly.

> I will take this note for next situations.
> 
> Given that, do you thing this issue is resolved?

Yes.

Many thanks, Felip
> My concern was exactly that - if we introduce the srun switch or change
> environment, will hydra launch correct number of tasks. But initial testing
> shows that it works correctly.
> 
> > I will take this note for next situations.
> > 
> > Given that, do you thing this issue is resolved?
> 
> Yes
> 
> Many thanks Felip

Good. I still think it is something Intel must improve on its side, but having this workaround helps. Since you are from Intel, maybe you could push them to make these changes; it would be a win-win for all of us :)

Glad everything is clear now. I am marking the bug as infogiven.