Ticket 18217 - Behavior doesn't match man page for srun --ntasks and --ntasks-per-node
Summary: Behavior doesn't match man page for srun --ntasks and --ntasks-per-node
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Regression
Version: 23.02.5
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Ricard Zarco Badia
QA Contact:
URL:
Duplicates: 18251
Depends on:
Blocks:
 
Reported: 2023-11-16 10:27 MST by David Gloe
Modified: 2024-04-19 10:06 MDT
CC: 7 users

See Also:
Site: CRAY
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: Cray Internal
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 24.05
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
slurm.conf file (5.86 KB, text/plain)
2023-11-16 10:27 MST, David Gloe

Description David Gloe 2023-11-16 10:27:46 MST
Created attachment 33350 [details]
slurm.conf file

According to the srun man page:

--ntasks-per-node=<ntasks>
    Request that ntasks be invoked on each node. If used with the --ntasks option, the --ntasks option will take precedence and the --ntasks-per-node will be treated as a maximum count of tasks per node.

However, it seems to me like --ntasks-per-node takes precedence, since the following srun runs 8 tasks instead of 5.

[dgloe@pea2k ~]$ srun -l --ntasks=5 --ntasks-per-node=4 hostname
1: n022
0: n022
2: n022
3: n022
4: n023
6: n023
5: n023
7: n023
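
For comparison, under the documented precedence the same command would launch only 5 tasks in total (ranks 0-4), with --ntasks-per-node=4 acting only as an upper bound per node. The output below is purely illustrative (the exact rank-to-node mapping depends on the task distribution), not captured from a real run:

[dgloe@pea2k ~]$ srun -l --ntasks=5 --ntasks-per-node=4 hostname
0: n022
1: n022
2: n022
3: n022
4: n023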
Comment 10 Tyler Connel 2023-11-21 15:00:01 MST
*** Ticket 18251 has been marked as a duplicate of this ticket. ***
Comment 11 Ricard Zarco Badia 2023-11-22 00:24:16 MST
Hi David,

Thanks for reporting this issue. We have been taking a look at it and have more or less identified the source. We are currently discussing how to address it, and I will write back as soon as we reach a conclusion.

Best regards, Ricard.
Comment 12 Trey Dockendorf 2023-11-30 12:50:38 MST
Would it be possible to get this fixed in 23.02.7?  We had opened 18251, which was a duplicate of this, and we will be attempting to upgrade to 23.02.x during our December 19th downtime. Having even a patch available before then would be good, but having this as part of 23.02.7 would be even better.
Comment 15 Ricard Zarco Badia 2023-12-05 02:15:34 MST
Hello Trey,

Since 23.11 was released last month, 23.02 is now only eligible for security patches and segfault fixes. Right now we are reviewing a proposed solution for this bug, but since it is a change in behavior, it will most likely be patched in 23.11.1.

Best regards, Ricard.
Comment 16 Trey Dockendorf 2023-12-05 05:58:43 MST
Would it be possible to get a patch for 23.02 that just doesn't make it into a maintenance release?  This would allow OSC to change the behavior locally without changing the behavior in the 23.02 release for all customers. Without a 23.02 patch, our options are either to live with the behavior or to upgrade to 23.11, but we are not comfortable going to 23.11 yet, and we have limited time windows for major upgrades, i.e., either December 19th, 2023 or May 2024.
Comment 19 Ricard Zarco Badia 2023-12-06 09:08:47 MST
Hello Trey,

We can send a local patch for 23.02 if needed, but please keep in mind that it would be a best-effort patch that hasn't gone through our regular QA regressions and won't be maintained. Our initial solution for your case seems to produce a regression in another feature, so I'm trying to have a "stable" patch ready before December 19th.

Best regards, Ricard.
Comment 21 Trey Dockendorf 2023-12-06 11:06:29 MST
After some internal discussion, OSC has decided to delay our upgrade and go directly to the 23.11 release that has a fix for this bug.  The hope is that the upgrade will be done in early 2024, so a patch for 23.11 would be useful so we can validate the fix.
Comment 26 Trey Dockendorf 2024-01-09 12:45:53 MST
Any update on this getting fixed in the 23.11 release?
Comment 27 Ricard Zarco Badia 2024-01-10 00:56:00 MST
Hello Trey,

We are still iterating on proposed fixes for this bug. Right now we have a solution that already fixes your initial issue, but it clashes with the node range (-N <min_nodes>-<max_nodes>) feature in specific circumstances. This is taking a bit of time because the allocation logic has a lot of possible parameter combinations, all of which need to be tested properly.

When are you planning to do the upgrade?

Best regards, Ricard.
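
As an illustration of the interaction mentioned above, the hypothetical command below (not taken from the ticket) combines a node range with an explicit task count and a per-node cap, which is the kind of combination that has to stay consistent:

# Hypothetical example: node range plus explicit task count and per-node cap.
# Under the documented precedence, Slurm must choose a node count within 1-2
# that can still place 5 tasks with at most 4 per node, which requires 2 nodes.
srun -N 1-2 --ntasks=5 --ntasks-per-node=4 hostname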
Comment 28 Trey Dockendorf 2024-01-10 14:06:51 MST
Our upgrade timeline depends on this getting resolved, so we haven't scheduled the upgrade.  We will be doing a rolling reboot to handle the upgrade so we don't have to wait for our May 2024 downtime.
Comment 33 Ricard Zarco Badia 2024-04-19 10:06:52 MDT
Hello Trey,

Sorry for the delay; this took some review iterations, but the final solution for this issue has been reviewed and pushed to master. It will be included in the upcoming 24.05 release. The related commits are the following:

* bd70f4df7a NEWS for the ntasks precedence fix
* 7b00991bf9 srun - Removed "can't honor" warning on job allocations
* 57c1685f16 salloc - Disable calculation of ntasks if set by the user in the cli
* 9cfff7276e srun - Disable recalculation of ntasks if set by the user in the cli
* ef04c57c9d Add ntasks_opt_set parameter for explicitly set ntasks

Thanks for reporting it!

Best regards, Ricard.
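
Once a site is running 24.05, the original reproduction can be rerun to validate the fix; the expected result noted below is inferred from the man page wording rather than taken from the ticket:

srun -l --ntasks=5 --ntasks-per-node=4 hostname
# Expected after the fix: exactly 5 tasks (ranks 0-4), with
# --ntasks-per-node=4 acting only as a maximum per-node task count.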