Ticket 14015 - cores_per_socket being ignored (Add MaxCPUsPerSocket to partition configuration)
Summary: cores_per_socket being ignored (Add MaxCPUsPerSocket to partition configuration)
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 21.08.7
Hardware: Linux
Importance: --- 4 - Minor Issue
Assignee: Marcin Stolarek
QA Contact:
URL:
Duplicates: 4717
Depends on:
Blocks:
 
Reported: 2022-05-05 22:41 MDT by Robin Humble
Modified: 2022-06-17 22:59 MDT
CC List: 2 users

See Also:
Site: Swinburne
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 23.02pre1
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Description Robin Humble 2022-05-05 22:41:31 MDT
Hi,

the below used to work in 20.02 and it's broken in 21.08.

in job_submit.lua we set job_desc.ntasks_per_node and job_desc.cores_per_socket for all multi-node jobs.
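
for reference, the relevant bit of our job_submit.lua is roughly of this shape (a simplified sketch; the min_nodes guard and the hard-coded 32/16 numbers here are illustrative, not our exact plugin logic):

   -- sketch only: the real plugin has more checks and derives the numbers from the job
   function slurm_job_submit(job_desc, part_list, submit_uid)
      if job_desc.min_nodes ~= slurm.NO_VAL and job_desc.min_nodes > 1 then
         job_desc.ntasks_per_node = 32      -- tasks per node (illustrative)
         job_desc.cores_per_socket = 16     -- leave 2 of the 18 cores per socket free
      end
      return slurm.SUCCESS
   end

   function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
      return slurm.SUCCESS
   end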

cores_per_socket seems to be ignored in 21.08. instead, as many cores as possible are being packed onto one socket. i.e.

lscpu on our cpus shows
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35

so for a 32 core job we want jobs to use, say, cores 0-31, leaving 32,34 free on one socket and 33,35 free on the other.

instead, 21.08 is doing this
# cat /sys/fs/cgroup/cpuset/slurm/uid_10853/job_27083761/cpuset.cpus 
0-28,30,32,34

so it's filling up one socket entirely.

'scontrol show job' shows eg.
   ...
   NumNodes=15 NumCPUs=480 NumTasks=480 CPUs/Task=1 ReqB:S:C:T=0:0:16:*
   TRES=cpu=480,mem=1125G,node=15,billing=480,gres/tmp=1572864000
   Socks/Node=* NtasksPerN:B:S:C=32:0:*:* CoreSpec=*
   MinCPUsNode=32 MinMemoryNode=75G MinTmpDiskNode=100M
   ...

so there is a 16 being set in 'ReqB:S:C:T', but then slurmctld or slurmd is ignoring it?

please note: we leave these 2 cores/socket free for gpu jobs to use, so by filling a socket the gpu attached to that socket becomes either inaccessible or slow. hence the significant impact this has on us.

we changed from cons_res to cons_tres when we went to slurm 21.08. maybe that broke it? can we go back?
or should we be setting something instead of cores_per_socket?

any help would be much appreciated. thanks.

cheers,
robin
Comment 1 Robin Humble 2022-05-06 00:39:09 MDT
Hi,

it looks like it's not just job_submit.lua: cores_per_socket can be ignored for any job.

the below job doesn't get any of its node/core/memory params modified by job_submit.lua (we leave gpu jobs unmodified except for gres:tmp), but it still gets the wrong core allocation on nodes.

 $ srun -N2 -n8 --ntasks-per-node=4 --cores-per-socket=2 --time=0:05:00 --tmp=1g --gres=gpu:2 --pty -u /bin/bash -l
srun: job 27109946 queued and waiting for resources
srun: job 27109946 has been allocated resources

[john40 ]$ scontrol show job 27109946
JobId=27109946 JobName=bash
...
   NodeList=john[40,65]
   BatchHost=john40
   NumNodes=2 NumCPUs=8 NumTasks=8 CPUs/Task=1 ReqB:S:C:T=0:0:2:*
   TRES=cpu=8,mem=800M,node=2,billing=8,gres/gpu=4,gres/tmp=2147483648
   Socks/Node=* NtasksPerN:B:S:C=4:0:*:* CoreSpec=*
   MinCPUsNode=4 MinMemoryCPU=100M MinTmpDiskNode=1G
..

[john40 ]$  cat /sys/fs/cgroup/cpuset/slurm/uid_1040/job_*/cpuset.cpus
29,31,33,35

all of those are odd numbers, so they're all on 1 socket,
whereas the job asked for 2 cores per socket.

cheers,
robin
Comment 4 Marcin Stolarek 2022-05-09 05:20:12 MDT
Robin,

The way --cores-per-socket works (and it didn't change) is that it eliminates nodes with fewer cores per socket than requested from the list of nodes considered for the job allocation[1,2]. Looking at the behavior you noticed, I can't confirm it's a bug/regression.

Maybe the way you can achieve the desired behavior (at least partially) is to configure two partitions and use MaxCPUsPerNode[3]:
>Maximum number of CPUs on any node available to all jobs from this partition.
>This can be especially useful to schedule GPUs. For example a node can be
>associated with two Slurm partitions (e.g. "cpu" and "gpu") and the
>partition/queue "cpu" could be limited to only a subset of the node's CPUs,
>ensuring that one or more CPUs would be available to jobs in the "gpu"
>partition/queue.
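
For illustration, a two-partition setup of that kind could look like this in slurm.conf (the partition names, node list and the 32-CPU cap are only an example, not your actual configuration):

# example only: cap the "cpu" partition at 32 of the 36 CPUs per node,
# so jobs in the overlapping "gpu" partition always have CPUs left
PartitionName=cpu Nodes=john[1-99] MaxCPUsPerNode=32
PartitionName=gpu Nodes=john[1-99]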

>we changed from cons_res to cons_tres when we went to slurm 21.08. maybe that broke it? can we go back?
Changing the SelectType plugin is a serious reconfiguration and requires caution; please take a look at Bug 8954 comment 2, where it's fully explained.

cheers,
Marcin

[1]https://slurm.schedmd.com/sbatch.html#OPT_cores-per-socket
[2]https://slurm.schedmd.com/sbatch.html#OPT_extra-node-info
[3]https://slurm.schedmd.com/slurm.conf.html#OPT_MaxCPUsPerNode
Comment 5 Robin Humble 2022-05-09 08:46:58 MDT
Hi Marcin,

thanks. we already set MaxCPUsPerNode.

please see https://bugs.schedmd.com/show_bug.cgi?id=4717 (which we've had open for wow, looks like it's 4 years now...) to see why that is insufficient, and for a few suggestions on how to fix it.

now that you have cons_tres, perhaps that enhancement can be looked at again?
ie. MaxCPUsPerSocket or ReservedCores in gres.conf... or something like that.


thanks also for pointing out that cores-per-socket is probably not the right thing to do according to the man page. however it's worked since slurm 17/18 'til 20.02, so clearly man page != code. this would not be the first time :) lol.

I've been experimenting with ntasks-per-socket in job_submit.lua as well as cores-per-socket. setting both of them gets us closer to the previous desired behaviour.

contrary to the man page, setting cores-per-socket definitely does change the scheduling. it would be interesting to know why.

eg. 100 jobs run with -n4 -c4 ->

ideally we'd like that job geometry to be forcefully packed onto one numa node using 16 of the 18 cores there, (often) leaving 2 cores free for gpu jobs, which is the goal.

if job_submit.lua sets

job_desc.ntasks_per_socket=job_desc.num_tasks
job_desc.cores_per_socket=job_desc.num_tasks*job_desc.cpus_per_task

this results in

   NumNodes=1 NumCPUs=16 NumTasks=4 CPUs/Task=4 ReqB:S:C:T=0:0:16:*
   TRES=cpu=16,mem=1600M,energy=485,node=1,billing=16,gres/tmp=104857600
   Socks/Node=* NtasksPerN:B:S:C=4:0:4:* CoreSpec=*
   MinCPUsNode=16 MinMemoryCPU=100M MinTmpDiskNode=100M

and scheduling then seems fine - the 16 core job is always on either socket0 or socket1. never across both sockets.

however if I run another 100 identical jobs, but only set

job_desc.ntasks_per_socket=job_desc.num_tasks

and not cores_per_socket, that results in

   NumNodes=1 NumCPUs=16 NumTasks=4 CPUs/Task=4 ReqB:S:C:T=0:0:*:*
   TRES=cpu=16,mem=1600M,energy=421,node=1,billing=16,gres/tmp=104857600
   Socks/Node=* NtasksPerN:B:S:C=4:0:4:* CoreSpec=*
   MinCPUsNode=16 MinMemoryCPU=100M MinTmpDiskNode=100M

and around half the jobs run with cgroups like
     1,3,5,7,9,11,13,15,17,19,21,23,25,27,32,34
ie. a mix of odd (socket1) and even (socket0)
which is wrong.

so cores-per-socket has a measurable effect.

if you don't know why then that's fine. we would like this behaviour to remain though, as we need a mechanism to pack and align jobs with numa nodes efficiently.

please leave this bug open for a few weeks whilst I check more jobs work with the settings I've found. thanks.

re: tres: we moved to it 'cos I thought cons_tres was "the future" and we would eventually have no choice. is that not the case? will cons_res remain?

cheers,
robin
Comment 6 Marcin Stolarek 2022-05-10 05:47:14 MDT
Robin,

I see I missed that when looking at the code yesterday. I believe the difference you noticed is caused by commit 6de8c831ae9, from Bug 4995 (Swinburne). This commit fixed an issue that doesn't exist in cons_tres because of differences in _allocate_sc.

I see that in this scenario it was not only a fix for Bug 4995, but also a "feature" in itself. I've tested that and can confirm it's a cons_res vs. cons_tres difference. This additional meaning of --cores-per-socket is only active when MaxCPUsPerNode is set on the partition.

>now that you have cons_tres, perhaps that enhancement can be looked at again?
I think it's a very good question. I'll discuss the approach we can take on that with our senior developers and will let you know.

>as we need a mechanism to pack and align jobs with numa nodes efficiently.[...]
Could you please share your slurm.conf and run test commands like:

srun -l  [YOUR OPTIONS]
Comment 7 Marcin Stolarek 2022-05-10 05:54:13 MDT
[sorry, I pressed enter by accident; continuing here]

srun -l  [YOUR OPTIONS] /bin/bash -c 'taskset -pc $$ && scontrol show job -d $SLURM_JOB_ID | grep Nodes='

This will show us the abstract (slurmctld) mapping of CPUs in addition to the physical one. The default task distribution is cyclic, which means that tasks collect CPUs from subsequent sockets as long as possible, so you should be able to achieve block allocation by adding `-m block:block` to the options. (Full control over task distribution is obviously possible only with an --exclusive node allocation.)
You don't have to modify that in the submit plugin; setting CR_CORE_DEFAULT_DIST_BLOCK[1] in SelectTypeParameters should be enough.
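
For example, in slurm.conf (the CR_Core_Memory value here is only an assumption about your current SelectTypeParameters; keep whatever you already have and just add the extra flag):

SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory,CR_CORE_DEFAULT_DIST_BLOCK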

cheers,
Marcin

[1]https://slurm.schedmd.com/slurm.conf.html#OPT_CR_CORE_DEFAULT_DIST_BLOCK
Comment 10 Marcin Stolarek 2022-05-11 06:38:03 MDT
Robin,

We decided that we'll add a MaxCPUsPerSocket partition configuration parameter. This requires changes in the Slurm protocol, which can only happen in major releases. Since we're very close to 22.05, the changes are targeted to Slurm 23.02.

cheers,
Marcin
Comment 11 Robin Humble 2022-05-11 23:28:47 MDT
Hi Marcin,

cool that you folks will look at MaxCPUsPerSocket. thanks!

actually, I thought I'd posted a comment to https://bugs.schedmd.com/show_bug.cgi?id=4717 the other day about this, but I forgot to click submit. I've done that now. it has some other ideas around MaxCPUsPerSocket etc.

cheers,
robin
Comment 12 Marcin Stolarek 2022-05-12 03:18:11 MDT
Robin,

Since the improvement can't be addressed in 22.05, I think the best workaround for you today is to switch back to cons_res and use the same logic you've been using before. Once we have MaxCPUsPerSocket (23.02) you'll be able to switch back to cons_tres and remove the job_submit logic in favor of the new option.
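
Configuration-wise that is only the SelectType line in slurm.conf, for example (the CR_Core_Memory value is just an assumption about your existing parameters):

SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory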

What do you think?

cheers,
Marcin
Comment 13 Robin Humble 2022-05-12 03:58:30 MDT
Hi Marcin,

with a combo of MaxCPUsPerNode and setting both ntasks-per-socket and cores-per-socket in job_submit.lua, our jobs seem to be doing well. they're packing onto 16 of the 18 cores on each socket nicely, even though we're still using cons_tres. it's not perfect, but it's as good as it was in 20.02.

most importantly there are now 2 cores/socket free for gpu jobs in our common case of large parallel many-node 32 cores-per-node jobs. in case you're interested, the snippet of lua that works for these multi-node jobs (that are a multiple of 32 cores, plus some other checks and constraints) is

   job_desc.ntasks_per_socket=16/job_desc.cpus_per_task
   job_desc.cores_per_socket=16

so it kinda makes sense I think.
both of these things seem to be needed to keep 2 cores/socket free on these 36 core machines.
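
for completeness, those two lines sit inside slurm_job_submit() behind a guard roughly like the following; the real checks (multi-node, total cores a multiple of 32, and a few other constraints) are simplified here:

   -- simplified sketch of the guard around the two assignments above;
   -- the real plugin also checks partition, total core count, etc.
   if job_desc.min_nodes ~= slurm.NO_VAL and job_desc.min_nodes > 1 then
      local cpt = job_desc.cpus_per_task
      if cpt ~= nil and cpt >= 1 and cpt <= 16 then
         job_desc.ntasks_per_socket = 16/cpt
         job_desc.cores_per_socket = 16
      end
   end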

but yes, once we have MaxCPUsPerSocket or a PerNuma vector or whatever you decide to do in the other ticket, then that will be a great improvement.

at the moment a lot of small jobs, or weird sized jobs like 29 or 31 cores (that cannot be aligned to anything) can block access to a gpu, but we've just been living with that.

anyway, now we're back to as good as 20.02, so I'm happy.
please feel free to close this ticket.

thanks for all your help and good explanations.
very much appreciated :)

cheers,
robin
Comment 14 Marcin Stolarek 2022-05-19 05:39:19 MDT
Robin,

I'm lowering the ticket severity to 4 for now, since the solution is going to be targeted to Slurm 23.02.

cheers,
Marcin
Comment 23 Marcin Stolarek 2022-06-14 03:19:25 MDT
Robin,

We've added a MaxCPUsPerSocket configuration parameter to partition options in our master branch, commits: 6e887117e7..a7df2eb2f5. It will be released as part of the next major release - Slurm 23.02.
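
Once you're running 23.02, the job_submit logic can then be replaced by a single partition option, along these lines (the partition definition here is illustrative, not your configuration):

PartitionName=cpu Nodes=john[1-99] MaxCPUsPerSocket=16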

I'm marking this bug report as fixed now; should you have any questions, please don't hesitate to reopen.

cheers,
Marcin
Comment 24 Marcin Stolarek 2022-06-14 03:21:04 MDT
*** Ticket 4717 has been marked as a duplicate of this ticket. ***
Comment 25 Robin Humble 2022-06-17 22:59:53 MDT
thanks Marcin and Danny! I look forward to using this in 23.02