Bug 11226 - --gpu-bind=verbose,... doesn't display GPU binding information
Summary: --gpu-bind=verbose,... doesn't display GPU binding information
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other
Version: 20.11.5
Hardware: Linux
Importance: --- 4 - Minor Issue
Assignee: Director of Support
 
Reported: 2021-03-25 18:44 MDT by Kilian Cavalotti
Modified: 2021-10-27 15:23 MDT
Site: Stanford


Description Kilian Cavalotti 2021-03-25 18:44:30 MDT
Hi SchedMD!

It doesn't look like specifying "verbose" in `--gpu-bind` displays any information about GPU binding:

$ srun -N 1 -n 1 -G 1 --gpu-bind=closest nvidia-smi -L
GPU 0: A100-SXM4-40GB (UUID: GPU-f514e4b7-d3df-3343-3dfd-f4b5dbc01113)
$ srun -N 1 -n 1 -G 1 --gpu-bind=verbose,closest nvidia-smi -L
GPU 0: A100-SXM4-40GB (UUID: GPU-f514e4b7-d3df-3343-3dfd-f4b5dbc01113)

The man page says:
 --gpu-bind=[verbose,]<type>
      Bind tasks to specific GPUs.  By default every spawned task can access every GPU allocated to the job.  If "verbose," is specified before <type>, then print out GPU binding information.

What am I missing?

Thanks!
--
Kilian
Comment 1 Michael Hinton 2021-03-26 10:13:31 MDT
Hey Kilian,

Uh oh. I am able to reproduce this, but it's not just verbose that's missing... Looks like the entire --gpu-bind option is being ignored. I'll look into it and get back to you. I know --gpu-bind worked when 20.11 was released, so something we did in a minor release must have screwed this up.

Thanks!
-Michael
Comment 2 Kilian Cavalotti 2021-03-26 10:45:24 MDT
Hi Michael, 

(In reply to Michael Hinton from comment #1)
> Uh oh. I am able to reproduce this, but it's not just verbose that's
> missing... Looks like the entire --gpu-bind option is being ignored. I'll
> look into it and get back to you. I know --gpu-bind worked when 20.11 was
> released, so something we did in a minor release must have screwed this up.

Oh good, thanks for confirming! I was actually in the middle of writing another bug report about how I couldn't get --gpu-bind to work, so I guess I won't submit it now. :)

Thanks!
--
Kilian
Comment 3 Michael Hinton 2021-03-26 11:32:11 MDT
(In reply to Michael Hinton from comment #1)
> I know --gpu-bind worked when 20.11 was released,
Correction: it was at least working pre-20.11 when I added the verbose option :) But it looks like it may have been broken soon after that, before 20.11.0 was even released...
Comment 4 Michael Hinton 2021-03-26 11:33:31 MDT
Also, it looks like this is only broken on single-task jobs. Could you confirm that an -n >= 2 shows verbose binding?
Comment 5 Michael Hinton 2021-03-26 11:39:42 MDT
That's odd: it looks like --gpu-bind has always been cleared out when there is only one task. That may actually make sense, though: why would you need to specify GPU task binding if there is only one task? So I think this may not be an issue (though we might want to document this, or enable it anyway for debugging).
Comment 6 Kilian Cavalotti 2021-03-26 11:48:45 MDT
Yes, I can confirm that verbose works for jobs with more than one task:

$ srun -l -N 1 -n 2 --gpus-per-task=1 --gpu-bind=verbose,map_gpu:0,4 bash -c 'printf "%s | CPU: %s (pid: %s)\n" $(hostname) $(ps -h -o psr,pid $$); nvidia-smi -L'
0: gpu-bind: usable_gres=0x1; bit_alloc=0x3; local_inx=2; global_list=0; local_list=0
0: sh03-14n03.int | CPU: 0 (pid: 76579)
0: GPU 0: A100-SXM4-40GB (UUID: GPU-f514e4b7-d3df-3343-3dfd-f4b5dbc01113)
0: GPU 1: A100-SXM4-40GB (UUID: GPU-401facc6-3a48-b3f4-cd3c-a132b723e456)

1: slurmstepd: error: Bind request 4 (0x10) does not specify any devices within the allocation. Binding to the first device in the allocation instead.
1: gpu-bind: usable_gres=0x10; bit_alloc=0x3; local_inx=0; global_list=0; local_list=0
1: sh03-14n03.int | CPU: 32 (pid: 76580)
1: GPU 0: A100-SXM4-40GB (UUID: GPU-f514e4b7-d3df-3343-3dfd-f4b5dbc01113)
1: GPU 1: A100-SXM4-40GB (UUID: GPU-401facc6-3a48-b3f4-cd3c-a132b723e456)

(lines manually reordered for clarity)

I don't understand the slurmstepd error ("bind request does not specify any devices within the allocation"), though. Those srun commands are started within a salloc job that has all 8 GPUs on the node allocated.


I guess my question becomes (and that was the point of my other, unsubmitted bug report): how do I specify the GPU id I want to run on? It works well for CPUs, but --gpu-bind=map_gpu always gives the same GPU.

Example with --cpu-bind=map_cpu:

$ for i in {0..7}; do srun -N 1 -n 1 --cpu-bind=map_cpu:$i bash -c 'printf "CPU: %s (pid: %s)\n" $(ps -h -o psr,pid $$)'; done
CPU: 0 (pid: 81939)
CPU: 1 (pid: 138250)
CPU: 2 (pid: 81973)
CPU: 3 (pid: 138277)
CPU: 4 (pid: 82005)
CPU: 5 (pid: 82033)
CPU: 6 (pid: 82059)
CPU: 7 (pid: 138300)

Each job step runs on the requested CPU.

For --gpu-bind:

$ for i in {0..7}; do srun -N 1 -n 1 -G 1 --gpu-bind=map_gpu:$i nvidia-smi -L; done
GPU 0: A100-SXM4-40GB (UUID: GPU-f514e4b7-d3df-3343-3dfd-f4b5dbc01113)
GPU 0: A100-SXM4-40GB (UUID: GPU-f514e4b7-d3df-3343-3dfd-f4b5dbc01113)
GPU 0: A100-SXM4-40GB (UUID: GPU-f514e4b7-d3df-3343-3dfd-f4b5dbc01113)
GPU 0: A100-SXM4-40GB (UUID: GPU-f514e4b7-d3df-3343-3dfd-f4b5dbc01113)
GPU 0: A100-SXM4-40GB (UUID: GPU-f514e4b7-d3df-3343-3dfd-f4b5dbc01113)
GPU 0: A100-SXM4-40GB (UUID: GPU-f514e4b7-d3df-3343-3dfd-f4b5dbc01113)
GPU 0: A100-SXM4-40GB (UUID: GPU-f514e4b7-d3df-3343-3dfd-f4b5dbc01113)
GPU 0: A100-SXM4-40GB (UUID: GPU-f514e4b7-d3df-3343-3dfd-f4b5dbc01113)

I'd expect each step to give me a different GPU.

What am I missing here?

Thanks,
--
Kilian
Comment 7 Michael Hinton 2021-03-26 14:21:45 MDT
(In reply to Kilian Cavalotti from comment #6)
> What am I missing here?
I think what you are missing is the fact that steps (i.e. sruns) are only given the number of GPUs they request, not the total number of GPUs in the job allocation. That's because steps set step-specific GPU restrictions via cgroups on top of the job-level cgroup restrictions of the allocation. So

$ srun -l -N 1 -n 2 --gpus-per-task=1 --gpu-bind=verbose,map_gpu:0,4 bash -c 'printf "%s | CPU: %s (pid: %s)\n" $(hostname) $(ps -h -o psr,pid $$); nvidia-smi -L'
0: gpu-bind: usable_gres=0x1; bit_alloc=0x3; local_inx=2; global_list=0; local_list=0
0: sh03-14n03.int | CPU: 0 (pid: 76579)
0: GPU 0: A100-SXM4-40GB (UUID: GPU-f514e4b7-d3df-3343-3dfd-f4b5dbc01113)
0: GPU 1: A100-SXM4-40GB (UUID: GPU-401facc6-3a48-b3f4-cd3c-a132b723e456)

1: slurmstepd: error: Bind request 4 (0x10) does not specify any devices within the allocation. Binding to the first device in the allocation instead.
1: gpu-bind: usable_gres=0x10; bit_alloc=0x3; local_inx=0; global_list=0; local_list=0
1: sh03-14n03.int | CPU: 32 (pid: 76580)
1: GPU 0: A100-SXM4-40GB (UUID: GPU-f514e4b7-d3df-3343-3dfd-f4b5dbc01113)
1: GPU 1: A100-SXM4-40GB (UUID: GPU-401facc6-3a48-b3f4-cd3c-a132b723e456)

won't work for task 1, because the step was only given 2 GPUs, so the 'cgroup-local' bit_alloc contains only 2 GPUs and looks like bit_alloc=0x3 (a 'global' bit_alloc would look like bit_alloc=0xff if the step really had access to all 8 GPUs; try running nvidia-smi in the step to verify). Your index of 4 for task 1 therefore doesn't index into the bit_alloc=0x3 given to the step, so it falls back to the first GPU in the step allocation.

So I think you should instead do:

    --gpu-bind=verbose,map_gpu:0,1

and I think that will make task 0 bind to the first GPU given to the job, and task 1 bind to the second GPU given to the job, regardless of what those global GPU IDs actually are. In other words, I don't believe map_gpu or mask_gpu specify global GPU IDs, but rather local GPU indexes into the local bit_alloc array for the step. I'm speaking from memory; I'll try to play around with it myself to confirm. But see if that helps.
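
As a rough sketch of what I mean (hypothetical commands, assuming an 8-GPU node):

$ salloc -N 1 --gres=gpu:8                  # job allocation sees global GPUs 0-7
$ srun -n 2 --gpus=2 --gpu-bind=map_gpu:0,1 printenv CUDA_VISIBLE_DEVICES
  # here 0 and 1 index into the two GPUs handed to the step,
  # whichever global GPU IDs those happen to be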

-Michael
Comment 8 Michael Hinton 2021-03-26 14:28:27 MDT
(In reply to Michael Hinton from comment #7)
> ...try running nvidia-smi in the step to verify)
Well, you already did in the example (oops) - you can see that it only prints out 2 GPUs, so the step is properly constraining the GPUs
Comment 9 Michael Hinton 2021-03-26 14:33:29 MDT
I guess it's weird because --cpu-bind=map_cpu:$i seems to take IDs of the CPUs relative to the job allocation, rather than the step, which is different from --gpu-bind. I haven't played around with CPU binding much, just GPU binding. Maybe Slurm does not constrain CPUs per step via cgroups like it does with GPUs. I haven't really looked at the CPU binding code much.
Comment 10 Michael Hinton 2021-03-26 14:35:47 MDT
(In reply to Michael Hinton from comment #9)
> Maybe Slurm does not constrain CPUs per step via cgroups
> like it does with GPUs. I haven't really looked at the CPU binding code much.
I bet we could look at the cgroup structures generated for some steps to prove this. I bet the step cpu cgroup stuff matches the job's CPU cgroup stuff, even when a step requests only 1 of many CPUs.
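
For instance, something like this (a sketch; the exact paths depend on the distro, on cgroup v1 vs. v2, and on the cgroup.conf settings) would show whether a step's device cgroup is narrower than the job's:

$ cat /sys/fs/cgroup/devices/slurm/uid_$(id -u)/job_$SLURM_JOB_ID/devices.list          # job-level allowed devices
$ cat /sys/fs/cgroup/devices/slurm/uid_$(id -u)/job_$SLURM_JOB_ID/step_0/devices.list   # step-level allowed devices
$ cat /sys/fs/cgroup/cpuset/slurm/uid_$(id -u)/job_$SLURM_JOB_ID/step_0/cpuset.cpus     # CPUs the step is confined to, if any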
Comment 11 Michael Hinton 2021-03-26 14:51:17 MDT
So here's a quick example that verifies what I'm talking about:

$ salloc -n4 --gres=gpu:4
salloc: Granted job allocation 14491
[salloc job=14491]$ srun -l -n2 --gpus=2 --gpu-bind=map_gpu:0,1 sleep 3000 &
[1] 3023912
[salloc job=14492]$ srun -l -n2 --gpus=2 --gpu-bind=map_gpu:0,1 printenv CUDA_VISIBLE_DEVICES
0: 0
1: 1
[salloc job=14492]$ srun -l -n2 --gpus=2 --gpu-bind=map_gpu:2,3 printenv CUDA_VISIBLE_DEVICES
0: slurmstepd-test1: error: Bind request 2 (0x4) does not specify any devices within the allocation. Binding to the first device in the allocation instead.
1: slurmstepd-test1: error: Bind request 3 (0x8) does not specify any devices within the allocation. Binding to the first device in the allocation instead.
0: 0
1: 0

I run a long-running step on the first two GPUs, so the next two sruns both land on the last two GPUs: the one using map_gpu:0,1 succeeds, while the one using map_gpu:2,3 fails. Note also that CUDA_VISIBLE_DEVICES is 'step-local' and indexes into the GPUs allowed for the step via cgroups.
Comment 12 Michael Hinton 2021-03-26 15:14:09 MDT
(In reply to Michael Hinton from comment #11)
> [salloc job=14491]$ ...
Sorry, that should be:
> [salloc job=14492]$ ...
Comment 17 Kilian Cavalotti 2021-03-26 17:24:19 MDT
Yeah, you're right: GPU indexes being relative to the step allocation, 0,4 only makes sense if at least 5 GPUs are allocated to the step. map_gpu:0,1 works better:

$ srun -l -N 1 -n 2 --gpus-per-task=1 --gpu-bind=verbose,map_gpu:0,1 bash -c 'printf "%s | CPU: %s (pid: %s)\n" $(hostname) $(ps -h -o psr,pid $$); nvidia-smi -L'
0: gpu-bind: usable_gres=0x1; bit_alloc=0x3; local_inx=2; global_list=0; local_list=0
1: gpu-bind: usable_gres=0x2; bit_alloc=0x3; local_inx=2; global_list=1; local_list=1
1: sh03-14n03.int | CPU: 32 (pid: 70962)
0: sh03-14n03.int | CPU: 0 (pid: 70961)
1: GPU 0: A100-SXM4-40GB (UUID: GPU-f514e4b7-d3df-3343-3dfd-f4b5dbc01113)
1: GPU 1: A100-SXM4-40GB (UUID: GPU-401facc6-3a48-b3f4-cd3c-a132b723e456)
0: GPU 0: A100-SXM4-40GB (UUID: GPU-f514e4b7-d3df-3343-3dfd-f4b5dbc01113)
0: GPU 1: A100-SXM4-40GB (UUID: GPU-401facc6-3a48-b3f4-cd3c-a132b723e456)


Now, as you mentioned, the behavior is different from --cpu-bind.

So, given that I'm in a salloc with all the 8 GPUs of the nodes allocated to the job, how can I run a srun step requesting a specific GPU?

Thanks!
--
Kilian
Comment 18 Michael Hinton 2021-03-29 18:34:21 MDT
(In reply to Kilian Cavalotti from comment #17)
> So, given that I'm in a salloc with all the 8 GPUs of the nodes allocated to
> the job, how can I run a srun step requesting a specific GPU?
Unfortunately, I'm not sure that this is possible. I'll consult with my colleagues to see if there is a way, however. What is the use case, besides just poking around and testing?
Comment 19 Michael Hinton 2021-03-29 18:41:19 MDT
(In reply to Kilian Cavalotti from comment #17)
> So, given that I'm in a salloc with all the 8 GPUs of the nodes allocated to
> the job, how can I run a srun step requesting a specific GPU?
If you want to use GPU X, the only way I can think of is to just run a dummy srun in the background (`srun --gres=gpu:X`) to take up the first X GPUs (GPUs 0 through X-1), then another `srun --gres=gpu:1` to get GPU X.
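
Roughly (a sketch, assuming an 8-GPU allocation and that you want the third GPU, i.e. X=2):

$ srun -n 1 --gres=gpu:2 sleep infinity &   # placeholder step pins down GPUs 0-1
$ srun -n 1 --gres=gpu:1 nvidia-smi -L      # this step should land on GPU 2
$ kill %1                                   # release the placeholder GPUs when done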
Comment 20 Kilian Cavalotti 2021-03-29 19:29:51 MDT
Hi Michael,

The use case is to control resource allocation on nodes with complex topologies, to ensure proper process pinning and optimal performance when accessing GPUs and NICs. Right now, the NVML AutoDetect plugin doesn't work very well on some NUMA architectures, so we're trying to find workarounds to target particular GPUs.

Naturally, running empty jobs to artificially occupy GPU0 and GPU1 when somebody wants to run on GPU2 is neither very efficient in terms of resource utilization nor very reliable, as there's no guarantee that another job won't come in and occupy a GPU before the next srun can be started.

But in any case, there seems to be an inconsistency between the way the --cpu-bind and --gpu-bind options work, despite their very similar names and options. So I feel that at least this discrepancy should be documented. And right now, I'm not sure how to use --gpu-bind: if there's no control over which specific GPUs are allocated to a job (other than the first ones that happen to be available), then I'm not sure there's much interest in "controlling" binding within the allocation either (if we don't know what we're binding to).

From my 10,000 ft perspective, the GPU ids seem to be available to Slurm the same way the CPU ids are, so I'm not sure I understand why the two options behave differently.

Does that make sense?


Thanks!
--
Kilian
Comment 21 Michael Hinton 2021-03-30 11:09:30 MDT
(In reply to Kilian Cavalotti from comment #20)
> The use case is to control resource allocation on nodes with complex
> topologies, to ensure proper process pinning, and optimal performance when
> accessing GPUs and NICs. Right now, the NVML AutoDetect plugin doesn't work
> very well on some NUMA architectures, so we're trying to find  workarounds
> to target particular GPUs.
Could you elaborate on this use case? I am a bit confused about what you are trying to work around. If the GPUs on a node are all homogeneous, then why do you need to select individual ones? If they are heterogeneous, then maybe each GPU should be given a unique type so they can be selected individually.
Comment 22 Kilian Cavalotti 2021-03-30 11:44:34 MDT
(In reply to Michael Hinton from comment #21)
> (In reply to Kilian Cavalotti from comment #20)
> > The use case is to control resource allocation on nodes with complex
> > topologies, to ensure proper process pinning, and optimal performance when
> > accessing GPUs and NICs. Right now, the NVML AutoDetect plugin doesn't work
> > very well on some NUMA architectures, so we're trying to find  workarounds
> > to target particular GPUs.
> Could you elaborate on this use case? I am a bit confused about what you are
> trying to work around. If the GPUs on a node are all homogeneous, then why
> do you need to select individual ones? If they are all heterogeneous, then
> maybe each GPU should be given a unique type so they can be selected
> individually.

Sure: it's not really about the GPUs themselves, it's about node topology.
Consider the following one:

-- 8< ------------------------------------------------------------------
[root@sh03-14n15 ~]# nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    mlx5_0  mlx5_1  CPU Affinity    NUMA Affinity
GPU0     X      NV12    NV12    NV12    NV12    NV12    NV12    NV12    PXB     SYS     32-63           1
GPU1    NV12     X      NV12    NV12    NV12    NV12    NV12    NV12    PXB     SYS     32-63           1
GPU2    NV12    NV12     X      NV12    NV12    NV12    NV12    NV12    SYS     SYS     0-31            0
GPU3    NV12    NV12    NV12     X      NV12    NV12    NV12    NV12    SYS     SYS     0-31            0
GPU4    NV12    NV12    NV12    NV12     X      NV12    NV12    NV12    SYS     PXB     96-127          3
GPU5    NV12    NV12    NV12    NV12    NV12     X      NV12    NV12    SYS     PXB     96-127          3
GPU6    NV12    NV12    NV12    NV12    NV12    NV12     X      NV12    SYS     SYS     64-95           2
GPU7    NV12    NV12    NV12    NV12    NV12    NV12    NV12     X      SYS     SYS     64-95           2
mlx5_0  PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS      X      SYS
mlx5_1  SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     SYS      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
-- 8< ------------------------------------------------------------------

Although the 8 GPUs on that node are identical, their CPU affinity varies, and each pair of GPUs has a different preferred set of CPUs to work with. More importantly, on multi-rail nodes like this one, each pair of GPUs has a strong affinity with a specific IB interface.

So here, for multi-node GPU-to-GPU communication, you'll want to use GPU[0-1] and CPU[32-63] together with mlx5_0, for instance. Using, say, mlx5_1 with GPU0 will result in disastrous performance, as the data will need to take unnecessary trips through the SMP interconnect to go from the GPU to the IB interface, instead of going straight from the IB HCA to a GPU that's directly connected to it.

And the performance hit could easily reach 80%.

Here's a concrete example running osu_bw, the OSU MPI-CUDA bandwidth test (http://mvapich.cse.ohio-state.edu/benchmarks), between two machines with the topology mentioned above. All the sruns below are done within a 2-node exclusive allocation (full nodes):

- without particular consideration for pinning, we get 5GB/s:

$ srun -N 2 --ntasks-per-node=1 --gpus-per-node=1 get_local_rank osu_bw -m 4194304:8388608 D D
# OSU MPI-CUDA Bandwidth Test v5.7
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size      Bandwidth (MB/s)
4194304              4984.95
8388608              4987.90

- with explicit pinning, making sure to use GPU 0 and mlx5_0, we get close to line-speed: over 24GB/s (on an IB HDR link):

$ UCX_NET_DEVICES=mlx5_0:1 srun -N 2 --ntasks-per-node=1 --gpus-per-node=8 bash -c 'CUDA_VISIBLE_DEVICES=0 get_local_rank osu_bw -m 4194304:8388608 D D'
# OSU MPI-CUDA Bandwidth Test v5.7
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size      Bandwidth (MB/s)
4194304             24373.33
8388608             24412.20

(CPU pinning doesn't really matter here, as it's a pure GPU-to-GPU RDMA transfer, so there's no data transfer to the CPUs)

The actual GPU pinning above is done by requesting all the GPUs on each node and explicitly choosing the right one with CUDA_VISIBLE_DEVICES, which is what --gpu-bind should do, I believe.
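
To give an idea of the kind of glue this forces on users, a per-rank wrapper along these lines is what we would have to maintain instead (purely illustrative; the script name and the GPU/HCA mapping are just made up from the topology table above):

#!/bin/bash
# topo_bind.sh (hypothetical): pick a GPU/HCA pair for this rank based on the
# node-local task ID, with the mapping hard-coded from `nvidia-smi topo -m`
case "$SLURM_LOCALID" in
  0) export CUDA_VISIBLE_DEVICES=0 UCX_NET_DEVICES=mlx5_0:1 ;;
  1) export CUDA_VISIBLE_DEVICES=1 UCX_NET_DEVICES=mlx5_0:1 ;;
  2) export CUDA_VISIBLE_DEVICES=4 UCX_NET_DEVICES=mlx5_1:1 ;;
  3) export CUDA_VISIBLE_DEVICES=5 UCX_NET_DEVICES=mlx5_1:1 ;;
  *) : ;;  # GPUs 2,3,6,7 have no close HCA on this topology; leave defaults
esac
exec "$@"

used as:

$ srun -N 2 --ntasks-per-node=1 --gpus-per-node=8 ./topo_bind.sh get_local_rank osu_bw -m 4194304:8388608 D D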

If we request GPU2 (which is what happens by default, in the first example), or if we use the 2nd IB interface and GPU0, we get the same huge performance hit:

$ UCX_NET_DEVICES=mlx5_0:1 srun -N 2 --ntasks-per-node=1 --gpus-per-node=8 bash -c 'CUDA_VISIBLE_DEVICES=2 get_local_rank osu_bw -m 4194304:8388608 D D'
# OSU MPI-CUDA Bandwidth Test v5.7
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size      Bandwidth (MB/s)
4194304              4989.13
8388608              4990.53

$ UCX_NET_DEVICES=mlx5_1:1 srun -N 2 --ntasks-per-node=1 --gpus-per-node=8 bash -c 'CUDA_VISIBLE_DEVICES=0 get_local_rank osu_bw -m 4194304:8388608 D D'
# OSU MPI-CUDA Bandwidth Test v5.7
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size      Bandwidth (MB/s)
4194304              5407.02
8388608              5408.89


I hope this helps make the case that the ability to request and select particular GPUs matters. A lot. :)

Cheers,
--
Kilian
Comment 23 Michael Hinton 2021-03-30 11:57:55 MDT
(In reply to Kilian Cavalotti from comment #22)
> If hope this can help make the case that the ability to request and select
> particular GPUs matters. A lot. :)
It seems to me like you want to use the `--gres-flags=enforce-binding` option. That turns the "Cores=..." option in gres.conf from a suggestion to a requirement. Is this something you have looked into before? I haven't tested using this within an existing allocation, but I would still expect it to automatically pair your step to the appropriate GPUs within your allocation.
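
For reference, a hypothetical invocation (just a sketch):

$ srun -n 1 -c 8 --gres=gpu:1 --gres-flags=enforce-binding nvidia-smi -L
  # with enforce-binding, the selected GPU must be one whose gres.conf
  # Cores= entry matches the cores allocated to the task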
Comment 24 Kilian Cavalotti 2021-03-30 12:11:30 MDT
(In reply to Michael Hinton from comment #23)
> (In reply to Kilian Cavalotti from comment #22)
> > If hope this can help make the case that the ability to request and select
> > particular GPUs matters. A lot. :)
> It seems to me like you want to use the `--gres-flags=enforce-binding`
> option. 

I have, yes :)

https://bugs.schedmd.com/show_bug.cgi?id=1725
https://bugs.schedmd.com/show_bug.cgi?id=5189


> That turns the "Cores=..." option in gres.conf from a suggestion to
> a requirement. Is this something you have looked into before? I haven't
> tested using this within an existing allocation, but I would still expect it
> to automatically pair your step to the appropriate GPUs within your
> allocation.

I'm not sure it works within an allocation either, but in either case, as I noted in my previous comment, the CPU-to-GPU binding is not the most significant thing here, since CPUs are not involved in inter-node GPU-to-GPU RDMA communication.

What matters is the GPU-to-NIC (IB interface, in this case) binding, and as far as I know, there's no mechanism to enforce this kind of automatic binding in Slurm right now.

Hence the request for --gpu-bind to work like --cpu-bind and let users specify the particular GPU ids they need their tasks to be using.

Does that make sense?

Cheers,
--
Kilian
Comment 25 Michael Hinton 2021-03-30 13:06:21 MDT
(In reply to Kilian Cavalotti from comment #24)
> I'm not sure it works within an allocation either, but in either case, as I
> noted in my previous comment, the CPU-to-GPU binding it not the most
> significant here, since CPUs are not involved in inter-node GPU-to-GPU RDMA
> communications. 
> 
> What matter is the GPU to NIC (IB interface in that case) binding, and as
> far as I know, there's no mechanism to enforce this kind of automatic
> binding in Slurm right now.
> 
> Hence the request for --gpu-bind to work like --cpu-bind and let users
> specify the particular GPU ids they need their tasks to be using.
> 
> Does that make sense?
Ah, ok, so because Slurm can't schedule the IB interface, you need to match it up manually. And enforce-binding doesn't help here because it's not aware of IB affinity and you can't change UCX_NET_DEVICES after the fact if your srun lands on a GPU with the wrong IB.

I wonder if we could add a slurm.conf option to disable cgroup device constraints on the step level, which I think should allow --gpu-bind to act in relation to the whole job allocation rather than to what was given to the step. Disabling cgroup device constraints for the step shouldn't reduce security because it's already constrained at the job level.

I think a long-term goal would be to add support for co-scheduling things like IB.

At any rate, I think these are both enhancements. I'll go ahead and look into the first one, since that should be easy to do if it's possible. The second one would be quite extensive, I think. Do you think you could open up an enhancement request for these in separate tickets?

As for the original issue of this ticket, I put together a doc patch saying that --gpu-bind is ignored for a single task as well as a patch to emit a warning whenever --gpu-bind is used with only one task.
Comment 26 Michael Hinton 2021-03-30 13:13:54 MDT
In the meantime, as a workaround, what if you defined 3 GPU types: a100_sys, a100_mlx5_0, and a100_mlx5_1, then set your gres.conf to something like this:

Name=gpu Type=a100_mlx5_0 File=/dev/nvidia[0-1] Cores=32-63
Name=gpu Type=a100_sys File=/dev/nvidia[2-3] Cores=0-31
Name=gpu Type=a100_mlx5_1 File=/dev/nvidia[4-5] Cores=96-127
Name=gpu Type=a100_sys File=/dev/nvidia[6-7] Cores=64-95

And added this to your slurm.conf node definition:

Gres=gpu:a100_sys:4,gpu:a100_mlx5_0:2,gpu:a100_mlx5_1:2

Then you could request which types of GPUs you want based on IB. Do you think something like this might work for you?
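
For example (hypothetical requests against the typed GRES above):

$ UCX_NET_DEVICES=mlx5_0:1 srun -N 2 --ntasks-per-node=1 --gres=gpu:a100_mlx5_0:1 nvidia-smi -L   # GPUs close to mlx5_0
$ UCX_NET_DEVICES=mlx5_1:1 srun -N 2 --ntasks-per-node=1 --gres=gpu:a100_mlx5_1:1 nvidia-smi -L   # GPUs close to mlx5_1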
Comment 27 Kilian Cavalotti 2021-03-30 13:47:56 MDT
(In reply to Michael Hinton from comment #26)
> In the meantime, as a workaround, what if you defined 3 GPU types: a100_sys,
> a100_mlx5_0, and a100_mlx5_1, then set your gres.conf to something like this:
> 
> Name=gpu Type=a100_mlx5_0 File=/dev/nvidia[0-1] Cores=32-63
> Name=gpu Type=a100_sys File=/dev/nvidia[2-3] Cores=0-31
> Name=gpu Type=a100_mlx5_1 File=/dev/nvidia[4-5] Cores=96-127
> Name=gpu Type=a100_sys File=/dev/nvidia[6-7] Cores=64-95
> 
> And added this to your slurm.conf node definition:
> 
> Gres=gpu:a100_sys:4,gpu:a100_mlx5_0:2,gpu:a100_mlx5_1:2
> 
> Then you could request which types of GPUs you want based on IB. Do you
> think something like this might work for you?

Yes, that workaround sounds like it could work, but it's not really scalable on large clusters with a lot of different node topologies. And it goes in the opposite direction from the one the NVML AutoDetect plugin has taken, so it would be a step back.

Disabling cgroup enforcement doesn't sound appealing at all either, because it would require full-node allocations for even single-GPU jobs, and thus lead to under-utilization of resources.

Finding a way to schedule and allocate NICs such as IB interfaces seems interesting, but probably much longer-term.

What we have right now is the NVML AutoDetect plugin, and Slurm knows about GPU ids. So what we would really like is for the --gpu-bind option to behave like the --cpu-bind option. I think all the information is already there to make it work, and as it is today, it's not working as the documentation indicates.

So it's not an enhancement request, really, it's more of a broken functionality fix, IMHO. Workarounds are nice, but they don't really make up for existing features that are not working as intended or documented. :)

Thanks!
--
Kilian
Comment 29 Michael Hinton 2021-06-09 13:49:56 MDT
Kilian,

(In reply to Kilian Cavalotti from comment #27)
> So it's not an enhancement request, really, it's more of a broken
> functionality fix, IMHO. Workarounds are nice, but they don't really make up
> for existing features that are not working as intended or documented. :)
Could I ask that you open up a separate ticket to address this, and to summarize specifically what you would like to be changed/fixed? (maybe just repost/summarize comment 20 and comment 22 and demonstrate what the proposed change might look like in practice with an example).

I agree that it would be nice for --gpu-bind and --cpu-bind to be analogous to each other, but we also need to think about backwards compatibility, or at least be very careful to document the change and possibly have an option to use the original behavior.

(In reply to Michael Hinton from comment #25)
> As for the original issue of this ticket, I put together a doc patch saying
> that --gpu-bind is ignored for a single task as well as a patch to emit a
> warning whenever --gpu-bind is used with only one task.
This has landed in https://github.com/SchedMD/slurm/commit/3146217a5890bbfc7657f4c7d716c6fcbce17059.

I'll go ahead and close the original ticket out.

Thanks!
-Michael
Comment 31 Kilian Cavalotti 2021-06-11 18:00:28 MDT
(In reply to Michael Hinton from comment #29)
> Kilian,
> 
> (In reply to Kilian Cavalotti from comment #27)
> > So it's not an enhancement request, really, it's more of a broken
> > functionality fix, IMHO. Workarounds are nice, but they don't really make up
> > for existing features that are not working as intended or documented. :)
> Could I ask that you open up a separate ticket to address this, and to
> summarize specifically what you would like to be changed/fixed? (maybe just
> repost/summarize comment 20 and comment 22 and demonstrate what the proposed
> change might look like in practice with an example).
> 
> I agree that it would be nice for --gpu-bind and --cpu-bind to be analogous
> to each other, but we also need to think about backwards compatibility, or
> at least be very careful to document the change and possibly have an option
> to use the original behavior.

Thanks for the suggestion! I created #11819.

> (In reply to Michael Hinton from comment #25)
> > As for the original issue of this ticket, I put together a doc patch saying
> > that --gpu-bind is ignored for a single task as well as a patch to emit a
> > warning whenever --gpu-bind is used with only one task.
> This has landed in
> https://github.com/SchedMD/slurm/commit/
> 3146217a5890bbfc7657f4c7d716c6fcbce17059.
> 
> I'll go ahead and close the original ticket out.

Thank you!

Cheers,
--
Kilian
Comment 34 Michael Hinton 2021-10-27 15:19:07 MDT
Marking this as closed.

Thanks!
-Michael