Bug 11247 - Allow --cpu-bind:map_cpu and mask_cpu for partial node allocation
Summary: Allow --cpu-bind:map_cpu and mask_cpu for partial node allocation
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other
Version: 20.11.5
Hardware: Linux
OS: Linux
Importance: --- 4 - Minor Issue
Assignee: Marcin Stolarek
QA Contact: Brian Christiansen
URL:
Depends on:
Blocks: 11227
Reported: 2021-03-29 19:39 MDT by Kilian Cavalotti
Modified: 2021-11-12 09:27 MST

See Also:
Site: Stanford
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 21.08pre1
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Description Kilian Cavalotti 2021-03-29 19:39:15 MDT
Hi!

I've allocated 2 CPUs on a node; their CPU ids are 0 and 16:

$ salloc -p test -N 1 -n 2
salloc: Pending job allocation 21271191
salloc: job 21271191 queued and waiting for resources
salloc: job 21271191 has been allocated resources
salloc: Granted job allocation 21271191
salloc: Waiting for resource configuration
salloc: Nodes sh03-01n71 are ready for job

$ srun bash -c 'printf "CPU id: %s (pid: %s)\n" $(ps -h -o psr,pid $$)'
CPU id: 16 (pid: 29452)
CPU id: 0 (pid: 29451)

I'm trying to control which CPU is allocated to which task with the --cpu-bind=map_cpu:<cpuid,...> option, but the results are pretty unexpected:

* Requesting CPU 0 for the 1st task and CPU 16 for the 2nd task produces the opposite binding:
$ srun  -l -n 2 --cpu-bind=map_cpu:0,16 bash -c 'printf "CPU id: %s (pid: %s)\n" $(ps -h -o psr,pid $$)'
1: CPU id: 0 (pid: 31422)
0: CPU id: 16 (pid: 31421)

* Requesting CPU id 16 for both tasks works:
$ srun  -l -n 2 --cpu-bind=map_cpu:16,16 bash -c 'printf "CPU id: %s (pid: %s)\n" $(ps -h -o psr,pid $$)'
0: CPU id: 16 (pid: 31487)
1: CPU id: 16 (pid: 31488)

* Requesting CPU 16 for the 1st task and CPU 0 for the 2nd task produces the opposite binding:
$ srun  -l -n 2 --cpu-bind=map_cpu:16,0 bash -c 'printf "CPU id: %s (pid: %s)\n" $(ps -h -o psr,pid $$)'
0: CPU id: 0 (pid: 31553)
1: CPU id: 16 (pid: 31554)

* Requesting CPU id 0 for both tasks... well no:
$ srun  -l -n 2 --cpu-bind=map_cpu:0,0 bash -c 'printf "CPU id: %s (pid: %s)\n" $(ps -h -o psr,pid $$)'
0: CPU id: 16 (pid: 31639)
1: CPU id: 0 (pid: 31640)


We're using TaskPlugin=task/cgroup,task/affinity
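For reference, the active task plugins can be confirmed with something like:

$ scontrol show config | grep TaskPlugin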

What are we doing wrong?

Thanks!
--
Kilian
Comment 2 Marcin Stolarek 2021-04-02 06:08:03 MDT
Kilian,

Could you please check the slurmstepd logs as suggested in Bug 11227?

My guess is that because the allocation is not for the whole node, --cpu-bind=map_cpu gets ignored. When you print the psr of a process you'll always get a single number (since a process can only run on one CPU at a time), but the seemingly random behavior probably comes from there being no binding at all. I'd expect that simply repeating (a few times in a row) any of the srun commands you executed may give a different stepid<->psr result.

I just want to make sure whether this is a separate case; for potential improvements (despite the area naming), Bug 11227 is the place.
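For example (a sketch only - assuming slurmd logs to /var/log/slurm/slurmd.log; adjust for your SlurmdLogFile setting), something like:

$ ssh sh03-01n71 "grep -E 'affinity|cpu.bind' /var/log/slurm/slurmd.log | tail -n 5"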

Let me know your thoughts.

cheers,
Marcin
Comment 3 Kilian Cavalotti 2021-04-02 12:08:15 MDT
Hi Marcin, 

(In reply to Marcin Stolarek from comment #2)
> My guess is that because the allocation is not for the whole node, than
> --cpu-bind=map_cpu gets ignored. When you print psr from of the process
> you'll always get one number (since one process can run only on one CPU at a
> time), but the random behavior probably comes from no binding. I'd expect
> that simply repeating(few times in a row)  any of the srun commands you
> executed may give a different stepid<->psr result.

Ah you're right, indeed: the results are pretty random when repeating the same command:


$ for i in {1..5}; do srun  -l -n 2 --cpu-bind=map_cpu:0,16 bash -c 'printf "CPU id: %s (pid: %s)\n" $(ps -h -o psr,pid $$)'; echo '----'; done
0: CPU id: 16 (pid: 32134)
1: CPU id: 0 (pid: 32135)
----
0: CPU id: 0 (pid: 32196)
1: CPU id: 16 (pid: 32197)
----
0: CPU id: 16 (pid: 32259)
1: CPU id: 16 (pid: 32260)
----
0: CPU id: 0 (pid: 32321)
1: CPU id: 0 (pid: 32322)
----
1: CPU id: 0 (pid: 32384)
0: CPU id: 0 (pid: 32383)
----

And slurmd indeed logs this for each srun:

slurmd[17115]: task/affinity: lllp_distribution: entire node must be allocated, disabling affinity

> I just want to make sure if it's a separate case, for potential improvements
> (despite the area naming) Bug 11227 is the place.

So that seems to be the same as bug#11227, indeed.

But beyond the fact that the "disabling affinity" message should really be surfaced to the user rather than just logged by slurmd, why is full-node allocation necessary for the --cpu-bind option to work? Affinity masks could be generated the same way within a cgroup, whether the node is fully allocated or not, right?

The CPU ids are known to Slurm:
$ scontrol -dd show job $SLURM_JOBID | grep CPU_IDs
     Nodes=sh03-01n71 CPU_IDs=0,16 Mem=8000 GRES=
so what would prevent it from generating the requested CPU binding for the step?

Cheers,
--
Kilian
Comment 4 Marcin Stolarek 2021-04-05 03:15:32 MDT
> why is full-node allocation necessary for the --cpu-bind option to work? Affinity masks could be generated the same way within a cgroup, whether the node is fully allocated or not, right? 

To be precise, it's only a requirement for map_cpu and mask_cpu (and map/mask/rank _ldom) use in --cpu-bind. Other options like sockets, cores, and threads don't require whole-node allocation.
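For example (an illustrative sketch, not from this ticket), core-level binding works in a partial allocation:

$ srun -n 2 --cpu-bind=verbose,cores bash -c 'taskset -cp $$'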

>The CPU ids are known to Slurm:
>$ scontrol -dd show job $SLURM_JOBID | grep CPU_IDs
>     Nodes=sh03-01n71 CPU_IDs=0,16 Mem=8000 GRES=
Yes and no. The numbers displayed by `scontrol show job -d` are CPU IDs enumerated in slurmctld's abstract form - with the assumption that consecutive IDs are placed on the same socket, which is the case for the majority of architectures, but not always. Off the top of my head, the Opteron 6200 series had a slightly different PU numbering[1]; in such a case Slurm's "abstract CPU IDs" will translate to the so-called (in Slurm code) "machine representation" like:
Slurm=0,1,2,3 -> Machine=0,4,8,12.
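You can check how the OS actually enumerates PUs per socket with something like (illustrative only):

$ lscpu -p=CPU,CORE,SOCKET | grep -v '^#' | head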


>Affinity masks could be generated the same way within a cgroup, whether the node is fully allocated or not, right? 

I'm not sure I understood. There is an affinity mechanism built into the task/cgroup plugin, which is generally not recommended since the task/affinity plugin supports a wider range of binding options (that's why we recommend the task/affinity,task/cgroup stack for TaskPlugin). If you're talking about the way sched_setaffinity gets called - I think you're right, CPU IDs remain unchanged, so CPU #5 remains #5 even if it's the only one in the cpuset.

The rationale that comes to my mind is a guess that if one wants the performance benefit of such detailed control over task placement, then they probably want the whole node anyway, to be in charge of every detail - don't you agree?

The requirement came into the code in 4a48ae7a85e (Slurm 2.0, more than 10 years ago). Personally, I think that allowing map/mask binding to CPUs when not all of them are allocated to the user will mostly result in undesired behavior.


Does that make sense to you?

cheers,
Marcin


[1] https://www.open-mpi.org/projects/hwloc/lstopo/images/4Opteron6200.v1.11.png
Comment 5 Kilian Cavalotti 2021-04-06 10:48:34 MDT
Hi Marcin, 

(In reply to Marcin Stolarek from comment #4)
> To be precise, it's only a requirement for map_cpu and mask_cpu (and
> map/mask/rank _ldom) use in --cpu-bind. Other options like sockets, cores,
> threads don't require whole node allocation.

Yep, I noticed that. Those options offer a little less control to the user, though.

> >The CPU ids are known to Slurm:
> >$ scontrol -dd show job $SLURM_JOBID | grep CPU_IDs
> >     Nodes=sh03-01n71 CPU_IDs=0,16 Mem=8000 GRES=
> Yes and no. The numbers displayed by `scontrol show job -d` are CPU IDs
> enumerated in Slurmctld abstraction form - with the assumption that
> subsequent IDs are placed on the same sockets, which is the case for
> majority of architectures, but it's not always the case. Off the top of my
> head, Opteron 6200 series had slightly different PUs numeration[1] in this
> case Slurm "abstract CPU IDs" will translate to so-called(in Slurm code)
> "machine representation" like:
> Slurm=0,1,2,3 -> Machine=0,4,8,12.

I'm not sure it entirely matters, since in the end, the "real", OS-determined CPU ids are the ones that are used to set up cgroups, right?

For instance, we have nodes where the CPU enumeration is round-robin across sockets:

[root@sh02-01n01 ~]# lscpu | grep -E 'NUMA|Model name' | sort
Model name:            Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19
NUMA node(s):          2

On such a node, the Slurm numbering scheme doesn't match the system indices.
Yet, if I allocate the whole node, the CPU ids I request with --cpu-bind:map_cpu are the ones that are used by the system as well:

$ salloc --exclusive -p test -w sh02-01n59

$ srun  -l -n 2 --cpu-bind=map_cpu:0,11 bash -c 'printf "PID: %s | CPU id: %2s | sched_getaffinity: %2s | cgroup cpuset: %s\n" $(ps -h -o pid,psr $$) $(taskset -cp $$ | awk "{print \$NF}") $(cat /sys/fs/cgroup/cpuset/$(cat /proc/$$/cpuset)/cpuset.cpus)'
0: PID: 132547 | CPU id:  0 | sched_getaffinity:  0 | cgroup cpuset: 0-19
1: PID: 132548 | CPU id: 11 | sched_getaffinity: 11 | cgroup cpuset: 0-19


If I only allocate 2 CPUs on that same node, the CPU IDs reported by scontrol are indeed different from the system CPU ids:

$ salloc -n 2 -p test -w sh02-01n59

$ scontrol -dd show job $SLURM_JOB_ID | grep CPU_IDs
     Nodes=sh02-01n59 CPU_IDs=0,10 Mem=8000 GRES=

$ srun  -l -n 2 --cpu-bind=map_cpu:0,10 bash -c 'printf "PID: %s | CPU id: %2s | sched_getaffinity: %2s | cgroup cpuset: %s\n" $(ps -h -o pid,psr $$) $(taskset -cp $$ | awk "{print \$NF}") $(cat /sys/fs/cgroup/cpuset/$(cat /proc/$$/cpuset)/cpuset.cpus)'
0: PID: 133952 | CPU id:  0 | sched_getaffinity: 0,1 | cgroup cpuset: 0-1
1: PID: 133953 | CPU id:  1 | sched_getaffinity: 0,1 | cgroup cpuset: 0-1

CPU 10 for Slurm is CPU 1 for the system.

This double-enumeration scheme is pretty confusing, and I'm wondering why Slurm uses an internal numbering system that is separate from the system one, which should be the reference after all, shouldn't it?

It seems like there's an extra abstraction layer that requires translating CPU ids in both directions, for no visible benefit? Is it to keep compatibility with platforms that don't expose CPU ids the same way Linux does? Are there still real use cases for this? Wouldn't it be much simpler if the only CPU numbering scheme were the system one?


> The rational argument that comes to my mind is a guess that if one wants to
> get a performance boost of such detailed control over task placement than he
> probably wants to have the whole node, either way, to be in charge of every
> detail, don't you agree?

But then, if users can't request precisely the resources they want from the resource scheduler, then what's the point of having a scheduler in the first place? :)

Requesting a full node when you only need to run tasks on 2 CPU cores but need to make sure they're correctly placed wrt. the node topology and the associated devices (GPUs, NICs...) seems like a waste of resources, doesn't it? Especially these days when nodes are becoming larger and larger with 100+ core nodes being mainstream.

> The requirement came into the code in 4a48ae7a85e (Slurm 2.0, more than 10
> years ago). Personally, I think that allowing map/mask binding to CPUs when
> not all are allocated to users will mostly result in non-desired behavior.

I understand that "binding" options may not be the best for this, but they still  seem the closest way I found to request specific resources on a node. 

But I guess the more general question here is along those lines: given a particular node topology, how can a user request specific CPU (and GPU, cf. #11226) IDs, to make sure that her tasks will run on the proper resources, and get optimal performance, without wasting resources (by requesting full nodes)?


Thanks!
--
Kilian
Comment 8 Marcin Stolarek 2021-04-13 03:00:32 MDT
>This double-enumeration scheme is pretty confusing, and I'm wondering why Slurm uses an internal numbering system that is separate from the system one, which should be the reference after all, shouldn't it?

The main reason for that is SelectPlugin (the computational core of slurmctld) performance. If, every time we check whether a specific core should/can be selected for a job, we also had to check which socket it is on, we'd end up with many more memory lookups. Today we simply assume in a number of places that a core range from x to x+CoresPerSocket is on the same socket.
It's so deeply built in that I don't expect anyone to be willing to rewrite it.
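As an illustration only (not actual Slurm source), the assumption boils down to simple index arithmetic:

# hypothetical sketch: abstract core index -> (socket, core-on-socket)
$ CORES_PER_SOCKET=10; abstract_id=13; echo "socket=$((abstract_id / CORES_PER_SOCKET)) core=$((abstract_id % CORES_PER_SOCKET))"
socket=1 core=3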

>But then, if users can't request precisely the resources they want from the resource scheduler, then what's the point of having a scheduler in the first place? :)
Slurm, like nearly everything, has its limitations. There are a number of options already available for describing a job, but I'm sure there will always be yet another way of describing user needs.
We're always open to suggestions on how to improve; however, we want to make sure the direction is something that may be interesting for a wider group of Slurm users and consistent with Slurm's general approach.

>[...]placed wrt. the node topology and the associated devices (GPUs, NICs...) seems like a waste of resources, doesn't it? 
I agree, and we have a number of binding options to support users' needs, like the following (see the sketch after this list):
--gres-flags=enforce-binding
--mem-bind
--cpu-bind
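For instance (an illustrative sketch only; ./my_app is a placeholder application):

$ srun -n 2 --gres=gpu:1 --gres-flags=enforce-binding --cpu-bind=cores --mem-bind=local ./my_app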

>But I guess the more general question here is along those lines: given a particular node topology, how can a user request specific CPU (and GPU, cf. #11226) IDs, to make sure that her tasks will run on the proper resources, and get optimal performance, without wasting resources (by requesting full nodes)?
Maybe we should take a step back. Could you please explain the use case where you see the need for specific CPU allocation? I mean, there must be a feature/property of the GPU/CPU that makes it unique and hence interesting for the user. It can't just be the fact that it got a specific ID from the OS.
For instance, a user may be interested in getting only the cores on the socket that serves the GPU's PCIe lanes, but I don't expect any user to want GPU ID 3 and socket 2 just because those are her favorite numbers. Keep in mind that, in general, those ID assignments may change with a reconfiguration or a driver/OS upgrade.

cheers,
Marcin
Comment 9 Kilian Cavalotti 2021-04-13 10:18:56 MDT
(In reply to Marcin Stolarek from comment #8)
> The main reason for that is SelectPlugin (a computational core of slurmctld)
> performance. If every time we're checking if a specific core should/can be
> selected for the job we have to check on which socket it is we'll end up
> with much more memory lookups. 

But couldn't the real ID map be determined once and for all when slurmd starts? Once the CPU topology map is in memory, I don't know whether it's more costly to update an in-memory table or to compute an arbitrary value. The list of in-use and available CPU ids still has to be maintained somewhere, right? I'm not sure I understand how using real CPU ids would make a difference in terms of performance, compared to using made-up CPU ids.

> Today we just assume in a number of places
> that a core range from x to x+CoresPerSocket is on the same socket.

There was a time when this definitely was a reasonable assumption, but simple socket-core topologies are mostly behind us, I'm afraid. And going forward, nodes' topologies will certainly not become simpler: they will very likely add more layers, more hierarchy levels, and there will undoubtedly be a point where this CPU.id = socket.id + core.id + thread.id formula won't work very well anymore. We're already there today with CPUs that can provide multiple NUMA nodes per socket, and where this scheme can be changed with BIOS settings.

So why not just take the actual topology information provided by the system itself and use it to schedule those resources, rather than forcing an assumed numbering scheme that may not hold up in the future?

I believe this was the whole idea behind the NVML GRES plugin, and maybe we're at a point where the same concept should be applied to CPU topologies as well?

> It's so much built-in that I don't expect anyone willing to rewrite it.

That I understand. :)

This is certainly a daunting task, but I'd say that the obvious upsides are clarity for the users, and more importantly future-proofing and relevance with future topologies.


> >[...]placed wrt. the node topology and the associated devices (GPUs, NICs...) seems like a waste of resources, doesn't it? 
> I agree and we have a number of binding options to support users needs like:
> --gres-flags=enforce-binding
> --mem-bind
> --cpu-bind


> >But I guess the more general question here is along those lines: given a particular node topology, how can a user request specific CPU (and GPU, cf. #11226) IDs, to make sure that her tasks will run on the proper resources, and get optimal performance, without wasting resources (by requesting full nodes)?
> Maybe we should take a step back. Could you please explain the use case
> where you see the need of specific CPU allocation? I mean there must be a
> feature/property of a the GPU/CPU that makes it unique hence interesting for
> the user. It can't just be the fact that it got a specific ID from OS.
> For instance, user may be interested in getting only the cores on the socket
> where the GPUs PCIe is served, but I don't expect any user to be willing to
> use GPU ID 3 and socket 2 just because those are her favorite numbers. Keep
> in mind that in general those IDs assignments may be changed by
> reconfiguration or drivers/OS upgrade.

Yes, you're right. Taking a step back, the underlying need is application performance, as described in https://bugs.schedmd.com/show_bug.cgi?id=11226#c22. Those two bugs/requests are actually motivated by the same need.

Ideally, the scheduler should be able to provide optimal task placement to ensure the best possible application performance. Sometimes, "optimal" may be loosely defined, or subjective, or too difficult to determine automatically. And in those cases, having a mechanism to let the user specify manual placement/binding options (in a way that is relevant to the system ids) would be extremely useful.

Does that make sense?

Thanks!
--
Kilian
Comment 13 Marcin Stolarek 2021-05-11 10:01:59 MDT
Kilian,

 I just wanted to let you know that we had an internal discussion on that. We'll try to remove the requirement for --exclusive for the discussed binding types.

 The slight uncertainty in the statement above comes from the fact that the requirement has been in the code for a very long time, so it may be that we'll find an internal reason making it more difficult than we think today.

cheers,
Marcin
Comment 14 Kilian Cavalotti 2021-05-11 10:28:10 MDT
Hi Marcin, 

(In reply to Marcin Stolarek from comment #13)
>  I just wanted to let you know that we had an internal discussion on that.
> We'll try to remove the requirement for --exclusive for the discussed
> binding types.
> 
>  The slight uncertainty in the statement above comes from the fact that the
> requirement has been in the code for a very long time, so it may be that
> we'll find an internal reason making it more difficult than we think today.

Thanks for the update!

Cheers,
--
Kilian
Comment 25 Marcin Stolarek 2021-07-14 01:57:47 MDT
Kilian,

The requested behavior will be available in Slurm 21.08; we implemented it together with a code cleanup in commits 8f5c832d47..5eec1ff0e6.
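For illustration only (not a test log from this ticket), in 21.08 the map_cpu request from the original report should be honored within a partial allocation:

$ salloc -p test -N 1 -n 2
$ srun -l -n 2 --cpu-bind=verbose,map_cpu:0,16 bash -c 'taskset -cp $$'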

cheers,
Marcin
Comment 26 Kilian Cavalotti 2021-07-14 01:57:56 MDT
Hi,

I am currently out of office, returning on July 29th. 

If you need to reach Stanford Research Computing, please email srcc-support@stanford.edu

Cheers,