Ticket 1718

Summary: Support for GPU and NIC affinity
Product: Slurm
Reporter: Matthieu Ospici <matthieu.ospici>
Component: Scheduling
Assignee: Moe Jette <jette>
Status: RESOLVED FIXED
Severity: 5 - Enhancement
Priority: ---
CC: brian, da, yiannis.georgiou
Version: 15.08.x
Hardware: Linux
OS: Linux
Site: Atos/Eviden
Version Fixed: 15.08.1-pre6
Target Release: ---
Attachments: Patch that adds GPU/NIC affinity
local binding patch

Description Matthieu Ospici 2015-06-02 01:54:49 MDT
Created attachment 1938 [details]
Patch that adds GPU/NIC affinity

Hi,

Please find enclosed a patch that adds support for GPU and NIC affinity to Slurm.

This is a preliminary implementation. We are still working on it, but we are releasing the patch now to get your feedback on our approach.

* Rationale

Choosing the right affinity is important for using GPUDirect RDMA efficiently. For the best GPUDirect RDMA performance, the GPU and the NIC must be attached to the same PCI Express controller.

Even without GPUDirect RDMA, binding a GPU to its closest CPU can improve data-transfer performance on some hardware platforms.


* Implementation

In our proposal, to choose the right GPU/NIC affinity we first determine which socket a task runs on. We then restrict the task to the GPU and the NIC connected to that socket by setting two environment variables (see the sketch below).
Therefore, on a hardware platform in which each socket has its own PCI Express controller, we ensure two properties:
	the CPU and the GPU are on the same PCI Express controller
	the GPU and the NIC are on the same PCI Express controller
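
Purely as an illustration (the exact values depend on the node's PCIe topology; the variable names below are taken from the examples later in this ticket, and the device layout is assumed): on a two-socket node with one GPU and one mlx4 HCA per socket, the per-task environment would end up roughly as follows:

# task bound to socket 0 (assumed: GPU 0 and mlx4_0 attached to socket 0)
CUDA_VISIBLE_DEVICES=0
OMPI_MCA_btl_openib_if_include=mlx4_0

# task bound to socket 1 (assumed: GPU 1 and mlx4_1 attached to socket 1)
CUDA_VISIBLE_DEVICES=1
OMPI_MCA_btl_openib_if_include=mlx4_1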


* Usage

A new srun option with two possible values has been implemented:

--accel-bind=2 sets both GPU and NIC affinity (useful for GPUDirect RDMA)
--accel-bind=1 sets only the GPU affinity


If you launch:

srun --gres=gpu:2 --ntasks-per-socket=2 --accel-bind=2 -n 4 ./MPI_app

The two instances of MPI_app running on the first socket will use GPU 0 and NIC 0, and the two other instances of MPI_app will use GPU 1 and NIC 1.

It is important to bind each task to a single socket: if a task can use cores from two or more sockets, our implementation cannot determine a GPU/NIC affinity (one way to ensure socket binding is shown below).
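
For example, one possible way to keep each task on a single socket (an illustrative variation of the command above, not taken from the patch) is to add explicit socket-level CPU binding:

srun --gres=gpu:2 --ntasks-per-socket=2 --cpu_bind=sockets --accel-bind=2 -n 4 ./MPI_app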


* Limitations

- We assume exactly one NIC and one GPU per socket; the selection algorithm supports only this configuration. We are considering enhancing that.

- You need to use TaskPlugin=affinity (the current version does not support the task/cgroup plugin).

- The code works with Open MPI >= 1.8 (it sets OMPI_MCA_btl_openib_if_include to select the right IB interface), InfiniBand cards named mlx4_X, and NVIDIA GPUs.

Thanks,
Comment 1 Moe Jette 2015-06-02 04:01:33 MDT
This seems to assume that NICs are not a Slurm GRES. In that case, there would be no need to add these new fields to the job_descriptor or batch_job_launch_msg data structures. However, this may not be the best option. A more flexible option might be to define both GPUs and NICs as GRES, then allocate and bind to the closest resources. For example:
# in slurm.conf
GresTypes=gpu,nic
NodeName=tux[1-32] gres=gpu:4,nic:no_consume:2 ...

# in gres.conf
Name=gpu File=/dev/nvidia0 cpus=0-7
Name=gpu File=/dev/nvidia1 cpus=8-15
Name=gpu File=/dev/nvidia2 cpus=16-23
Name=gpu File=/dev/nvidia3 cpus=24-31
Name=nic File=/dev/mlx4_0 cpus=0-15
Name=nic File=/dev/mlx4_1 cpus=16-31

With something like the above, jobs will automatically be bound to specific GPUs and NICs given an appropriate execute line ("sbatch --gres=gpu:2,nic:2 ..."). Then it is simply a matter of binding individual tasks to the specific GPUs and NICs that align.
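
For illustration only, a hypothetical batch script matching the configuration sketched above (MPI_app is a placeholder application name):

#!/bin/bash
#SBATCH --gres=gpu:2,nic:2
#SBATCH --ntasks=2

srun ./MPI_app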

Run "configure" with the "--enable-developer" option and it will report some errors in the code that should be fixed.
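
For example (standard build steps, nothing specific to this patch; the extra warnings show up during the build):

$ ./configure --enable-developer
$ make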

I see blocks of spaces used in many places rather than tabs. Please use tabs.
Also limit line length to 80 characters. See the "Linux Kernel Coding Style" document here:
http://slurm.schedmd.com/coding_style.pdf
Comment 2 Moe Jette 2015-06-02 10:58:29 MDT
I have studied the patch some more. Since more than one job can run on each node, the resource selection all needs to happen in slurmctld. The new logic to select GPUs and NICs for a job step is in slurmstepd, which duplicates logic already in src/common/gres.c used by slurmctld. This patch cannot work if you want to support more than one job per compute node or use the task/cgroup plugin, both of which are essential capabilities.

Let me work on adding this functionality and perhaps you can work on testing it.
Comment 3 Moe Jette 2015-06-04 05:39:22 MDT
Created attachment 1947 [details]
local binding patch

Commit:
https://github.com/SchedMD/slurm/commit/1ae5ad6d7672bba25009eb322a7ce8fe8f869bbd

Examples of use:

$ cat /etc/slurm/gres.conf
Name=gpu File=/dev/nvidia0 CPUs=0
Name=gpu File=/dev/nvidia1 CPUs=1
Name=gpu File=/dev/nvidia2 CPUs=2
Name=gpu File=/dev/nvidia3 CPUs=3
Name=nic File=/dev/mlx4_0 CPUs=0-1
Name=nic File=/dev/mlx4_1 CPUs=2-3

$ srun --gres=gpu:4,nic:2 -n4 --cpu_bind=v --accel-bind=gn -l tmp1
0: cpu_bind=MASK - jette, task  0  0 [14127]: mask 0x1 set
1: cpu_bind=MASK - jette, task  1  1 [14128]: mask 0x2 set
2: cpu_bind=MASK - jette, task  2  2 [14129]: mask 0x4 set
3: cpu_bind=MASK - jette, task  3  3 [14130]: mask 0x8 set
0: CUDA_VISIBLE_DEVICES=0
0: OMPI_MCA_btl_openib_if_include=mlx4_0
1: CUDA_VISIBLE_DEVICES=1
1: OMPI_MCA_btl_openib_if_include=mlx4_0
2: CUDA_VISIBLE_DEVICES=2
2: OMPI_MCA_btl_openib_if_include=mlx4_1
3: CUDA_VISIBLE_DEVICES=3
3: OMPI_MCA_btl_openib_if_include=mlx4_1

(Without binding to local devices)
$ srun --gres=gpu:4,nic:2 -n4 --cpu_bind=v -l tmp1
0: cpu_bind=MASK - jette, task  0  0 [14367]: mask 0x1 set
1: cpu_bind=MASK - jette, task  1  1 [14368]: mask 0x2 set
2: cpu_bind=MASK - jette, task  2  2 [14369]: mask 0x4 set
3: cpu_bind=MASK - jette, task  3  3 [14370]: mask 0x8 set
0: CUDA_VISIBLE_DEVICES=0,1,2,3
0: OMPI_MCA_btl_openib_if_include=mlx4_0,mlx4_1
1: CUDA_VISIBLE_DEVICES=0,1,2,3
1: OMPI_MCA_btl_openib_if_include=mlx4_0,mlx4_1
2: CUDA_VISIBLE_DEVICES=0,1,2,3
2: OMPI_MCA_btl_openib_if_include=mlx4_0,mlx4_1
3: CUDA_VISIBLE_DEVICES=0,1,2,3
3: OMPI_MCA_btl_openib_if_include=mlx4_0,mlx4_1
Comment 4 Moe Jette 2015-06-04 05:43:38 MDT
The code is done. Let me know if you discover any problems.