Ticket 4717

Summary:	Advice needed: keeping cores free on sockets for GPUs when using both MaxCPUsPerNode for partitions and Cores= in gres.conf
Product:	Slurm	Reporter:	Christopher Samuel <chris>
Component:	Scheduling	Assignee:	Marcin Stolarek <cinek>
Status:	RESOLVED DUPLICATE	QA Contact:
Severity:	5 - Enhancement
Priority:	---	CC:	altenkort, bart, cinek, robin.humble+slurm, rsmith, tanderson
Version:	17.11.2
Hardware:	Linux
OS:	Linux
See Also:	https://bugs.schedmd.com/show_bug.cgi?id=14015
Site:	Swinburne

Description Christopher Samuel 2018-01-31 22:50:36 MST

Hi there,

As part of the bring up of Slurm on this new system I've been asked to achieve two things:

1) Have NUMA locality of cores to GPUs (which can be done via Cores= in gres.conf)
2) Keep some cores free on nodes for GPU jobs (which can be done with overlapping partitions and MaxCPUsPerNode to limit the non-GPU partition).

The documentation for gres.conf in 17.11.2 says:

# If specified, then only the identified cores can be allocated with
# each generic resource; an attempt to use other cores will not be honored.

The complication I wonder about arises when you combine the two. For instance, if a non-GPU job grabs all the cores on the first socket, you are stuck with a free GPU but no cores available to use it.
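
A minimal sketch of that scenario, as a toy model rather than Slurm code. It assumes a 16-core, two-socket node with one GPU per socket, a MaxCPUsPerNode cap of 12 on the non-GPU partition, and a placement policy that fills the lowest-numbered cores first; all three are illustrative assumptions, not values from this site.

```python
# Toy model, not Slurm code: 16-core node, two sockets, one GPU per socket.
# Core ranges mirror gres.conf entries of the form Cores=0-7 and Cores=8-15.
GPU_CORES = {"gpu0": set(range(0, 8)), "gpu1": set(range(8, 16))}
MAX_CPUS_PER_NODE = 12  # assumed cap on the non-GPU partition


def allocate_lowest_first(free, count):
    """Take `count` cores from `free`, lowest ID first (an assumed policy)."""
    return set(sorted(free)[:count])


free = set(range(16))
cpu_job = allocate_lowest_first(free, MAX_CPUS_PER_NODE)
free -= cpu_job

# Cores still idle next to each GPU once the non-GPU job is running.
local_free = {gpu: sorted(cores & free) for gpu, cores in GPU_CORES.items()}
print(local_free)
```

In this model four cores stay free on the node, but all of them sit on the second socket, so gpu0 has no local core left. If Cores= were strictly enforced, gpu0 would be unusable until the non-GPU job ends.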

Is there a way around that, or would it be possible to have a MaxCPUsPerSocket setting for a partition instead that would ensure that we could keep 1 or 2 cores per socket to access their local GPU please?

Observation: whilst reading the gres.conf manual page I noticed that it talks about Cores= but all the examples use CPUs=. I thought Cores= was a typo, but checking the source code it appears that CPUs= is now deprecated. It might be handy to note that and change the examples in the man page to show Cores= instead, to be consistent.

Thanks!
Chris

Comment 2 Dominik Bartkiewicz 2018-02-09 10:17:06 MST

Hi

Sorry for the late response.
Currently we have a couple of gres-related bugs and I didn't want to give you an obsolete answer.

For now I don't see any available options to make this work properly.
I will try to implement MaxCPUsPerSocket, but this won't be available until 18.08.

Dominik

Comment 3 Christopher Samuel 2018-02-10 04:03:55 MST

On 10/02/18 04:17, bugs@schedmd.com wrote:

> Hi

Hi Dominik,

> Sorry for the late response.
> Currently we have a couple of gres-related bugs and I didn't want to give you
> an obsolete answer.

Not a problem, thanks for getting back to me.

> For now I don't see any available options to make this work properly.
> I will try to implement MaxCPUsPerSocket, but this won't be available until
> 18.08.

Understood, best of luck with that work!

All the best,
Chris

Comment 4 Christopher Samuel 2018-02-11 16:29:24 MST

On 10/02/18 04:17, bugs@schedmd.com wrote:

> For now I don't see any available options to make this work properly.
> I will try to implement MaxCPUsPerSocket, but this won't be available until
> 18.08.

The only alternative I can think of is perhaps a ReserveNumCores=
option in gres.conf that sets aside that number of cores for use
only by jobs that request the particular gres.

It might be cleaner to keep that in the gres.conf as then you don't
need to muck around with multiple partitions & submit filters as you
currently must to achieve this.

So consider this for a 16 core node with 2 P100 GPUs:

Name=gpu Type=p100 File=/dev/nvidia0 Cores=0-7 ReserveNumCores=2
Name=gpu Type=p100 File=/dev/nvidia1 Cores=8-15 ReserveNumCores=2

GPU and non-GPU jobs could use the first 6 cores on either socket,
but the last 2 would be kept free for jobs that request the gres.
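
A toy model of how the proposed ReserveNumCores= option might behave for the two gres.conf lines above. ReserveNumCores= is only the suggestion made in this comment, not an existing Slurm option, and the function name and data layout are hypothetical.

```python
# Toy model of the proposed (hypothetical) ReserveNumCores= semantics.
# Not Slurm code; the entries mirror the two gres.conf lines in this comment.
GRES = [
    {"file": "/dev/nvidia0", "cores": range(0, 8), "reserve_num_cores": 2},
    {"file": "/dev/nvidia1", "cores": range(8, 16), "reserve_num_cores": 2},
]


def usable_cores(job_requests_gpu):
    """Return the core IDs a job may use under the proposal."""
    usable = []
    for gres in GRES:
        cores = list(gres["cores"])
        if not job_requests_gpu:
            # The last cores of each range are the ones kept back,
            # matching "the last 2 would be kept free" above.
            cores = cores[: len(cores) - gres["reserve_num_cores"]]
        usable.extend(cores)
    return usable


print(usable_cores(job_requests_gpu=False))
print(usable_cores(job_requests_gpu=True))
```

A job without the gres sees cores 0-5 and 8-13, while a job requesting the gres sees all 16.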

I was thinking that you could have ReserveCores= but then it's
ambiguous whether the cores you reserve have to intersect with
the Cores= directive or not.

Plus if a site didn't care about locality they could drop the
Cores= directive and just have:

Name=gpu Type=p100 File=/dev/nvidia0 ReserveNumCores=2
Name=gpu Type=p100 File=/dev/nvidia1 ReserveNumCores=2

...and then the code would just keep 2 cores free for each GPU
but not care about topology.

I think this might be a better way of doing it as it satisfies
what we want without having to grow multiple partitions and thus
is transparent to the user.

How does that sound?

All the best,
Chris

Comment 5 Dominik Bartkiewicz 2018-02-12 04:12:24 MST

Hi

I will discuss this idea with the rest of the team.
MaxCPUsPerSocket looks straightforward, but it requires changes in the API/protocol. This can be done only in 18.08.

Dominik

Comment 6 Christopher Samuel 2018-02-12 04:27:56 MST

On Monday, 12 February 2018 10:12:24 PM AEDT bugs@schedmd.com wrote:

> I will discuss this idea with the rest of the team.
> MaxCPUsPerSocket looks straightforward, but it requires changes in the
> API/protocol. This can be done only in 18.08.

Thanks Dominik, I appreciate you checking on this, and I understand that it
will need to wait for the next major release.

Comment 13 Dominik Bartkiewicz 2018-02-19 09:01:41 MST

Hi

After talking with the development team, we decided that
we won’t add new features to cons_res.

We can change the severity of this bug to sev-5 and we will keep this issue in mind for future design/development.

Dominik

Comment 14 Christopher Samuel 2018-02-20 00:12:24 MST

On Tuesday, 20 February 2018 3:01:41 AM AEDT bugs@schedmd.com wrote:

> Hi

Hi Dominik

> After talking with the development team, we decided that
> we won’t add new features to cons_res.
>
> We can change the severity of this bug to sev-5 and we will keep this issue
> in mind for future design/development.

Oh that's disappointing.  Thanks for letting me know.

Comment 15 Moe Jette 2018-02-20 07:17:41 MST

Chris,

Here is a bit more background information.

Our current plan is to develop a completely new select/cons_res plugin in 2018. This new version will provide a great deal more support for managing GPUs. We plan to release a beta-version of the plugin in November at SC18 (compatible with version 18.08) and include it in version 19.05 (in parallel with the current select/cons_res plugin). We can include the MaxCPUsPerSocket functionality in that release.

Comment 17 Christopher Samuel 2018-02-20 17:31:42 MST

On 21/02/18 01:17, bugs@schedmd.com wrote:

> Chris,

Hi Moe!

> Here is a bit more background information.

Thanks for that, that sounds a lot more promising!

I really appreciate that.

All the best,
Chris

Comment 18 Christopher Samuel 2018-03-13 21:46:07 MDT

One related question based on what appears to be conflicting information in the manual pages.

The manual page for gres.conf says:


# If specified, then only the identified cores can be allocated with
# each generic resource; an attempt to use other cores will not be honored.

but the manual page for sbatch says:

#  --gres-flags=enforce-binding
#         If set, the only CPUs available to the job will be those bound
#         to the selected GRES (i.e. the CPUs identified in the gres.conf
#         file will be strictly enforced rather than advisory).

So is the Cores= spec in gres.conf enforced by default or not?

The reason I raise it in this bug is that if it is just advisory by default then we can specify both for now and hopefully get the best of both worlds (until the machine gets fragmented by jobs).

All the best,
Chris

Comment 19 Moe Jette 2018-03-14 09:09:08 MDT

(In reply to Christopher Samuel from comment #18)
> One related question based on what appears to be conflicting information in
> the manual pages.
> 
> The manual page for gres.conf says:
> 
> 
> # If specified, then only the identified cores can be allocated with
> # each generic resource; an attempt to use other cores will not be honored.
> 
> but the manual page for sbatch says:
> 
> #  --gres-flags=enforce-binding
> #         If set, the only CPUs available to the job will be those bound
> #         to the selected GRES (i.e. the CPUs identified in the gres.conf
> #         file will be strictly enforced rather than advisory).
> 
> So is the Cores= spec in gres.conf enforced by default or not?
> 
> The reason I raise it in this bug is that if it is just advisory by default
> then we can specify both for now and hopefully get the best of both worlds
> (until the machine gets fragmented by jobs).

Thanks for reporting this inconsistency.

The information in gres.conf is advisory unless the job specifies --gres-flags=enforce-binding, which requires matching cores/GPUs. I've just updated the documentation:
https://github.com/SchedMD/slurm/commit/f85659560193b918e439356982acf4806930a9bc
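
The difference between advisory and enforced binding can be sketched with a toy model. This is not Slurm's selection algorithm: the function name pick_cores is hypothetical, and the assumption that the advisory case prefers local cores and then falls back to any other free core is only an illustration of what "advisory" could mean.

```python
def pick_cores(free, gpu_cores, count, enforce_binding=False):
    """Toy model of advisory vs. enforced Cores= binding; not Slurm's algorithm."""
    local = sorted(free & gpu_cores)
    remote = sorted(free - gpu_cores)
    if enforce_binding:
        # Only cores listed for the GPU are eligible.
        return local[:count] if len(local) >= count else None
    # Advisory (assumed behaviour): prefer local cores, then any other free core.
    candidates = local + remote
    return candidates[:count] if len(candidates) >= count else None


free = {6, 7, 12, 13, 14, 15}  # cores left idle on a fragmented node
gpu0_cores = set(range(0, 8))  # mirrors Cores=0-7 in gres.conf

print(pick_cores(free, gpu0_cores, 4))
print(pick_cores(free, gpu0_cores, 4, enforce_binding=True))
```

With only two local cores idle, the advisory request is satisfied with cores 6, 7, 12 and 13, while the enforced request cannot be placed and returns None.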

Comment 20 Christopher Samuel 2018-03-14 16:54:31 MDT

Hi Moe,

On 15/03/18 02:09, bugs@schedmd.com wrote:

> Thanks for reporting this inconsistency.
> 
> The information in gres.conf is advisory unless the job specifies 
> --gres-flags=enforce-binding which requires matching cores/GPUs. I've
> just updated the documentation:

That's great, thanks so much for the clarification!

That helps us a lot. I was reading it as preventing access from other
cores, which was the root cause of this bug report: we wanted to specify
Cores= and also keep some cores reserved on a partition, and I thought
we could end up with reserved cores that couldn't be used because of
the gres.conf config.

All the best,
Chris

Comment 29 Moe Jette 2019-04-05 09:59:57 MDT

(In reply to Moe Jette from comment #15)
> Chris,
> 
> Here is a bit more background information.
> 
> Our current plan is to develop a completely new select/cons_res plugin in
> 2018. This new version will provide a great deal more support for managing
> GPUs. We plan to release a beta-version of the plugin in November at SC18
> (compatible with version 18.08) and include it in version 19.05 (in parallel
> with the current select/cons_res plugin). We can include the
> MaxCPUsPerSocket functionality in that release.

I wanted to let you know that MaxCPUsPerSocket has not been implemented yet and will not be available in version 19.05. Sorry...

Comment 30 Luis 2020-05-28 06:47:35 MDT

Hello,
unfortunately it seems that in 20.02 this is also not yet implemented. Do you have any update on when this feature could become available? It is highly requested by our users.
Thanks & best regards
Luis

Comment 31 Robin Humble 2022-05-11 23:25:11 MDT

Hi Moe and Dominik,

as an update, we've been going ok with MaxCPUsPerNode plus some hackery in job_submit.lua.

but now that we have cons_tres, is it possible to revisit the possibility of implementing MaxCPUsPerSocket for a partition or a QOS?

or some other solution that would let us keep a few optimal cores free for gpu jobs. we're not tied to MaxCPUsPerSocket - it's just an idea. 

indeed with nodes that have 2 sockets and 1 gpu, even MaxCPUsPerSocket wouldn't do the right thing. it almost needs to be a vector, e.g.

MaxCPUsPerSocket=16,18

might keep 2 cores free on socket0 (where the gpu is attached), and 0 free on socket1 (no gpu)

or perhaps even more generally (as cpus seem to be increasing in numa complexity) it could be a numa node vector

MaxCPUsPerNumaNode=6,8,8,8,8,8,8,8

or something like that.
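
a small sketch of the arithmetic behind the vector idea. both MaxCPUsPerSocket as a vector and MaxCPUsPerNumaNode are only the suggestions made in this comment, not existing Slurm options. the 18 cores per socket and 8 cores per numa node are inferred from the example values, and the function name is hypothetical.

```python
def cores_kept_free(cores_per_domain, caps):
    """Cores left for gpu jobs in each socket or numa node under a per-domain cap."""
    return [cores_per_domain - cap for cap in caps]


# Scalar cap applied to both sockets: 2 cores are held back on each socket,
# including the socket that has no gpu attached.
print(cores_kept_free(18, [16, 16]))

# Vector cap, as in MaxCPUsPerSocket=16,18: only socket0 holds cores back.
print(cores_kept_free(18, [16, 18]))

# Numa node vector, as in MaxCPUsPerNumaNode=6,8,8,8,8,8,8,8.
print(cores_kept_free(8, [6, 8, 8, 8, 8, 8, 8, 8]))
```

the scalar cap idles two cores on the gpu-less socket, while the vector forms hold back cores only in the domain next to the gpu.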

what do you think?

cheers,
robin