Ticket 18251 - Change in behavior with 23.02 and Slurm not assuming >1 nodes for ntasks and ntasks-per-node
Summary: Change in behavior with 23.02 and Slurm not assuming >1 nodes for ntasks and ntasks-per-node
Status: RESOLVED DUPLICATE of ticket 18217
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 23.02.6
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Tyler Connel
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2023-11-21 08:38 MST by Trey Dockendorf
Modified: 2023-11-21 15:00 MST

See Also:
Site: Ohio State OSC


Attachments
slurm.conf (17.44 KB, text/plain)
2023-11-21 08:38 MST, Trey Dockendorf

Description Trey Dockendorf 2023-11-21 08:38:03 MST
Created attachment 33404
slurm.conf

We are testing 23.02.6 after running 22.05.x for a while, and we have noticed that "--ntasks=4 --ntasks-per-node=2" no longer requests a 2-node job, which is causing issues on partitions where we have MinNodes=2.

Example:


$ salloc --ntasks=4 --ntasks-per-node=2 -A PZS0708 -p parallel srun --pty /bin/bash
salloc: error: Job submit/allocate failed: Node count specification invalid

The debug log shows:

Nov 21 10:33:18 owens-slurm01-test slurmctld[82880]: debug2: _part_access_check: Job requested for nodes (1) smaller than partition parallel(2) min nodes

The partition:

PartitionName=parallel DefaultTime=01:00:00 DefMemPerCPU=4315 DenyAccounts=<OMIT LONG LIST> MaxCPUsPerNode=28 MaxMemPerCPU=4315 MaxNodes=81 MaxTime=4-00:00:00 MinNodes=2 Nodes=cpu OverSubscribe=EXCLUSIVE PriorityJobFactor=2000 State=UP
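
For comparison, spelling out the node count explicitly should satisfy the partition's MinNodes=2 and sidestep the implicit calculation (a sketch of a possible workaround, not something tested here; it reuses the same account and partition as above):

$ salloc --nodes=2 --ntasks=4 --ntasks-per-node=2 -A PZS0708 -p parallel srun --pty /bin/bash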

Is this change in behavior expected? I wasn't sure if we had run into a bug or if this is just a change in behavior. I took a quick look through the release notes and nothing jumped out at me.
Comment 1 Tyler Connel 2023-11-21 15:00:01 MST
Hello Trey,

I suspect the behavior you're experiencing is a duplicate of the linked ticket (18217).

The issue is that when --ntasks-per-node is provided, the node count is recalculated and supersedes the value implied by --ntasks. In that ticket, the issue was also found on 23.02.
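
To put numbers on it (my own sketch of the expected arithmetic, not output captured from the controller): --ntasks=4 combined with --ntasks-per-node=2 should imply ceil(4 / 2) = 2 nodes, so the reported command

$ salloc --ntasks=4 --ntasks-per-node=2 -A PZS0708 -p parallel srun --pty /bin/bash

should be planned as a 2-node request; on the affected 23.02 releases the controller instead derives 1 node, and _part_access_check rejects that against the partition's MinNodes=2.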

I'll resolve this as a duplicate for now, but please do reach out if the fix provided through ticket 18217 does not resolve your issue as well.
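
Once you are running a release with that fix, one quick way to confirm the behavior (a suggested check, not a required step) is to request the same options and verify that the allocation spans two nodes:

$ salloc --ntasks=4 --ntasks-per-node=2 -A PZS0708 -p parallel
$ srun hostname | sort -u
# expect two distinct hostnames listed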

Best,
Tyler Connel

*** This ticket has been marked as a duplicate of ticket 18217 ***