Ticket 18251 - Change in behavior with 23.02 and Slurm not assuming >1 nodes for ntasks and ntasks-per-node
Summary: Change in behavior with 23.02 and Slurm not assuming >1 nodes for ntasks and ntasks-per-node
Status: RESOLVED DUPLICATE of ticket 18217
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 23.02.6
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Tyler Connel
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2023-11-21 08:38 MST by Trey Dockendorf
Modified: 2023-11-21 15:00 MST

See Also:
Site: Ohio State OSC


Attachments
slurm.conf (17.44 KB, text/plain)
2023-11-21 08:38 MST, Trey Dockendorf

Description Trey Dockendorf 2023-11-21 08:38:03 MST
Created attachment 33404
slurm.conf

We are testing 23.02.6 after running 22.05.x for a while, and we have noticed that "--ntasks=4 --ntasks-per-node=2" no longer requests a 2-node job, which is causing issues on partitions where we have MinNodes=2.

Example:


$ salloc --ntasks=4 --ntasks-per-node=2 -A PZS0708 -p parallel srun --pty /bin/bash
salloc: error: Job submit/allocate failed: Node count specification invalid

The debug log shows:

Nov 21 10:33:18 owens-slurm01-test slurmctld[82880]: debug2: _part_access_check: Job requested for nodes (1) smaller than partition parallel(2) min nodes

The partition:

PartitionName=parallel DefaultTime=01:00:00 DefMemPerCPU=4315 DenyAccounts=<OMIT LONG LIST> MaxCPUsPerNode=28 MaxMemPerCPU=4315 MaxNodes=81 MaxTime=4-00:00:00 MinNodes=2 Nodes=cpu OverSubscribe=EXCLUSIVE PriorityJobFactor=2000 State=UP
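
For comparison, spelling out the node count explicitly should satisfy the partition's MinNodes=2 and sidestep the implicit calculation (a sketch of a possible workaround, not something tested here; it reuses the same account and partition as above):

$ salloc --nodes=2 --ntasks=4 --ntasks-per-node=2 -A PZS0708 -p parallel srun --pty /bin/bash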

Is this change in behavior expected? I wasn't sure if we had run into a bug or if this is just a change in behavior. I took a quick look through the release notes and nothing jumped out at me.
Comment 1 Tyler Connel 2023-11-21 15:00:01 MST
Hello Trey,

I suspect the behavior you're experiencing is a duplicate of the linked ticket (18217).

The issue is that when --ntasks-per-node is provided, the node count is recalculated and supersedes the value implied by --ntasks. In that ticket, the issue was also found on 23.02.
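
To put numbers on it (my own sketch of the expected arithmetic, not output captured from the controller): --ntasks=4 combined with --ntasks-per-node=2 should imply ceil(4 / 2) = 2 nodes, so the reported command

$ salloc --ntasks=4 --ntasks-per-node=2 -A PZS0708 -p parallel srun --pty /bin/bash

should be planned as a 2-node request; on the affected 23.02 releases the controller instead derives 1 node, and _part_access_check rejects that against the partition's MinNodes=2.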

I'll resolve this as a duplicate for now, but please do reach out if the fix provided through ticket 18217 does not resolve your issue as well.
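
Once you are running a release with that fix, one quick way to confirm the behavior (a suggested check, not a required step) is to request the same options and verify that the allocation spans two nodes:

$ salloc --ntasks=4 --ntasks-per-node=2 -A PZS0708 -p parallel
$ srun hostname | sort -u
# expect two distinct hostnames listed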

Best,
Tyler Connel

*** This ticket has been marked as a duplicate of ticket 18217 ***