When submitting a job with:

```
#SBATCH -n 720
#SBATCH --exclusive
#SBATCH --ntasks-per-node=24
```

this warning is printed:

```
srun: Warning: can't honor --ntasks-per-node set to 24 which doesn't match the
requested tasks 30 with the number of requested nodes 30. Ignoring --ntasks-per-node.
```

This warning looks erroneous to me. First of all, it is simply not true: the job ends up running 24 tasks per node, which is the correct core count. Checking the source code in `slurm/src/srun/libsrun/opt.c`, I find:

```c
if (opt.ntasks > opt.ntasks_per_node)
        info("Warning: can't honor --ntasks-per-node "
             "set to %u which doesn't match the "
             "requested tasks %u with the number of "
             "requested nodes %u. Ignoring "
             "--ntasks-per-node.", opt.ntasks_per_node,
             opt.ntasks, opt.min_nodes);
opt.ntasks_per_node = NO_VAL;
```

As the output shows, `opt.ntasks` is 30 while `opt.ntasks_per_node` is 24. How the `opt.ntasks` variable gets set is not entirely clear, but it is apparently set to the number of nodes:

```c
if (((opt.distribution & SLURM_DIST_STATE_BASE) ==
     SLURM_DIST_ARBITRARY) && !opt.ntasks_set) {
        opt.ntasks = hostlist_count(hl);
        opt.ntasks_set = true;
}
```

That makes the `if` condition effectively "more nodes than tasks per node", which is nonsensical and does not correspond to the warning text. This confirms that the warning has no bearing on the number of tasks actually started per node. So this bug is nothing more than an erroneous warning being printed; it does not affect functionality.
Hi Thomas,

Are those the only arguments being passed to sbatch? Do you have an environment variable `SLURM_HOSTFILE` defined? Could you supply `-v` next time this is run, so we can see what the input parameters are?

From https://slurm.schedmd.com/sbatch.html under `--ntasks-per-node`:

> If used with the --ntasks option, the --ntasks option will take precedence and the --ntasks-per-node will be treated as a maximum count of tasks per node. Meant to be used with the --nodes option.

Also, under `-N, --nodes`:

> If -N is not specified, the default behavior is to allocate enough nodes to satisfy the requirements of the -n and -c options

Is there a reason you are specifying `--ntasks-per-node=24` in the first place? It seems unnecessary, because each node has a maximum of 24 cores anyway.

> As the output shows, opt.ntasks is 30, while the opt.ntasks_per_node is 24.
> How the opt.ntasks variable is set, is not this clear, but it is apparently
> set to the number of nodes:
>
>     if (((opt.distribution & SLURM_DIST_STATE_BASE) ==
>          SLURM_DIST_ARBITRARY) && !opt.ntasks_set) {
>             opt.ntasks = hostlist_count(hl);
>             opt.ntasks_set = true;
>     }

`opt.ntasks` can't be set here, or else you would never have seen the warning message, which is in the corresponding `else` branch. Plus, this code only runs if you specified `-m arbitrary`, which seems unlikely. I think it's taking a different code path.

These are just my first thoughts; I'll keep looking into it.

-Michael
Hello Michael,

There is no hostfile and no `-N` specified. The only other options are name, time, and partition, none of which should have any bearing on this issue.

Even if the `--ntasks-per-node` option is superfluous in this scenario, the warning is still wrong. For one thing, the numbers it prints don't add up: 30 tasks on 30 nodes has nothing to do with the 720 tasks actually requested. For another, the number of tasks eventually started per node is still 24. I'm not sure where `opt.ntasks` gets set, but it does end up holding the number of nodes.

Thanks for looking into this.

Thomas
I can't see how `opt.ntasks` can be set to 30 if it's already set to 720 via `-n 720`. If it isn't already set, then it can only be set there if `SBATCH_DISTRIBUTION=arbitrary` or `-m arbitrary` is specified, and either `SLURM_HOSTFILE` or `-w` is specified with a valid nodelist. Very strange.

Could you please show the whole set of commands / the entire sbatch file being run that produces this error? Something isn't adding up. Thanks.
Obviously, I can see that the warning is wrong in this instance, but I want to dig deeper to find the underlying cause and to reproduce the issue. Simply hiding the warning message because it doesn't make sense wouldn't really help anybody.
We have a similar use case exhibiting this behavior. Specifically, when the sbatch script specifies:

```
#SBATCH --tasks-per-node=36
#SBATCH -N 4
```

and the individual sruns read:

```
srun --ntasks-per-node=1 --overcommit --cpu_bind=none ...
```

previously, we would get one srun per node. Presently, we see 36 sruns per node. Our present work-around is to change all of these scripts to:

```
srun --ntasks=<# hosts#> --overcommit --distribution=cyclic --cpu_bind=none ...
```

This change was introduced between Slurm versions 17.02 and 17.11.
Hey Thomas, sorry for taking so long on this. What you describe seems like a bug, but I haven't been able to reproduce it. Could you give me some more information about your system, like the slurm.conf and the output of `scontrol show nodes <relevant_node>`? Does S Senator's comment help you? Thanks, Michael
Feel free to reopen if you want to continue looking into this. Thanks! Michael