Hi,

We are seeing this error with the following batch script:

    srun: Warning: can't honor --ntasks-per-node set to 252 which doesn't match the
    requested tasks 252 with the number of requested nodes 21. Ignoring --ntasks-per-node.

    #SBATCH --nodes=21
    #SBATCH --ntasks-per-node=12
    #SBATCH --constraint=CPU-E5645
    #SBATCH --mem=48000

The nodes with that constraint do have 12 cores.

Searching around, I think this is the same issue as here:

https://bugs.schedmd.com/show_bug.cgi?format=multiple&id=3032
and
https://groups.google.com/forum/#!msg/slurm-devel/zeuBOXcPJUM/qZ5wCPBYCAAJ

I saw the following note in the 16.05.4 release notes, but it looks to be a slightly different problem, so I wanted to check if that would fix this issue before we updated:

    -- Correct documented configurations where --ntasks-per-core and
       --ntasks-per-socket are supported.

Thanks for any insight.

Martins
Martins, could you please show the whole batch script, including any srun requests inside it? Since it's an srun error, that will make it easier to reproduce.

Anyhow, we are able to reproduce something similar with a simpler request:

    $ salloc --ntasks-per-node=8 -n 8
    salloc: Granted job allocation 20004
    srun: Warning: can't honor --ntasks-per-node set to 8 which doesn't match the
    requested tasks 1 with the number of requested nodes 1. Ignoring --ntasks-per-node.

    $ scontrol show config | grep Salloc
    SallocDefaultCommand    = srun -n1 -N1 --mem-per-cpu=0 --pty --preserve-env --gres=craynetwork:0 --mpi=none $SHELL

So there's definitely an issue going on there.
Created attachment 3495 [details] batch script
OK, attached. This is a much simplified script from the original report, but it still shows the same problem. It just runs an MPI hello world.
I see in your script you have:

    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=12

and then:

    NPROCS=`expr $SLURM_NTASKS_PER_NODE \* $SLURM_NNODES`
    srun --ntasks-per-node=$NPROCS ./helloworld

Why are you overriding --ntasks-per-node from 12 to 12*2? Maybe I'm wrong, but it sounds a bit strange to me.
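For reference, a minimal sketch of what that `expr` line computes, assuming the values Slurm would export for the attached script (SLURM_NTASKS_PER_NODE=12, SLURM_NNODES=2; hardcoded here for illustration):

```shell
#!/bin/sh
# Assumed values matching the attached script's allocation.
SLURM_NTASKS_PER_NODE=12
SLURM_NNODES=2

# This reproduces the script's arithmetic: it yields the TOTAL task
# count across the allocation, not a per-node count, so passing it
# to srun's --ntasks-per-node triggers the warning.
NPROCS=`expr $SLURM_NTASKS_PER_NODE \* $SLURM_NNODES`
echo $NPROCS
```

Running it prints 24, which is why `--ntasks-per-node=$NPROCS` asks for twice as many tasks per node as each node can hold.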
Created attachment 3499 [details] script2
Created attachment 3500 [details] output2
Created attachment 3501 [details] error2
OK, sorry. In attempting to simplify the script, there was an error. I uploaded a new script and the corresponding output and error files.

The easiest way to reproduce the error is to have one part of the job script with an srun invocation that uses fewer cores than the overall job script requests.

We think we are hitting this same error with other combinations of nodes/cores, but this was the easiest to simplify. Let me know if it is actually an error in the job script, and we can work with the user to fix it.

Thanks,

Martins
(In reply to Martins Innus from comment #17)
> OK, sorry. In attempting to simplify the script, there was an error.
>
> I uploaded a new script and the corresponding output and error.
>
> The easiest way to reproduce the error is to have one part of the job script
> that has an srun invocation that uses fewer cores than the overall job
> script.

Yes, we also managed to reproduce it locally this way.

> We think we are having this same error with other combinations of
> nodes/cores, but this was the easiest to simplify.
>
> Let me know if it is actually an error in the job script and we can work
> with the user to fix it.

We have a patch ready for this; once it is pushed, we'll get back to you.
Martins,

The following commit silences the warning you see when the number of tasks is less than the given/inherited number of tasks per node:

https://github.com/SchedMD/slurm/commit/daacf5afee9

Anyhow, as the documentation states, --ntasks-per-node is "Meant to be used with the --ntasks option". Slurm has to figure out how many tasks can run in an allocation based on what the allocation requests, and it does so from whatever information it is given. Slurm always wants to fill an allocation, so ntasks is ALWAYS inherited from the environment when running inside one. Any time you are in an allocation, you will ALWAYS default to the allocation's task count. If you expect a certain number of tasks, you should ask for it explicitly. The options you specified only tell Slurm how to lay out tasks, not how many to run; Slurm will default to filling the allocated resources unless told otherwise.
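To illustrate the inheritance behavior, here is a hypothetical minimal script (not one of the attachments in this ticket) showing a step that inherits the allocation's task count versus a step that overrides it:

```shell
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=12

# No options given: srun inherits the allocation's task count
# (2 nodes x 12 tasks per node = 24 tasks).
srun hostname

# To run fewer tasks than the allocation holds, request the count
# explicitly; a bare -n with an inherited --ntasks-per-node is what
# produces the "can't honor --ntasks-per-node" warning.
srun --ntasks=12 --ntasks-per-node=6 hostname
```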
Alejandro,

OK. So if we have a multistep job with multiple sruns that require different --ntasks-per-node values, we need to use --ntasks for the srun? Like this:

    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=12
    #SBATCH --ntasks=24

    # This inherits
    srun ./foo.exe

    # This needs all new params
    srun --nodes=2 --ntasks=12 --ntasks-per-node=6 ./bar.exe

    # End sbatch

Thanks for the clarification.

Martins
(In reply to Martins Innus from comment #32)
> #SBATCH --nodes=2
> #SBATCH --ntasks-per-node=12
> #SBATCH --ntasks=24
>
> # This inherits
> srun ./foo.exe

Slurm will default to fill the allocated resources unless told otherwise, so this first srun will try to fill 24 tasks across the 2 nodes.

> # This needs all new params
> srun --nodes=2 --ntasks=12 --ntasks-per-node=6 ./bar.exe

Exactly: if you want to consume less than what is allocated to the job, you have to tell Slurm explicitly, as you do in this second srun.

> Thanks for the clarification.

No problem. If you don't have any more questions, let me know whether we can close the bug. Thanks.
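To make the accounting concrete, a small sketch with assumed values mirroring the script above (the variable names are illustrative, not Slurm output): the first srun inherits the allocation's 24 tasks, while the second explicitly requests 12 tasks laid out 6 per node.

```shell
#!/bin/sh
# Assumed allocation values from the sbatch header above.
SLURM_NNODES=2
SLURM_NTASKS=24

# Step 1: no options given, so srun inherits the full allocation.
STEP1_TASKS=$SLURM_NTASKS

# Step 2: explicit --ntasks=12 --ntasks-per-node=6 overrides the
# inherited count; 12 tasks over 2 nodes is 6 tasks per node.
STEP2_TASKS=12
STEP2_PER_NODE=`expr $STEP2_TASKS / $SLURM_NNODES`

echo "step1=$STEP1_TASKS step2=$STEP2_TASKS per_node=$STEP2_PER_NODE"
```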
OK thanks! I will ask the researcher to resubmit his job and confirm it works.
Hi Martins, any progress with this? Thanks.
Marking as resolved/timedout. Please reopen if the customer feedback uncovers any further issue.