Created attachment 1938 [details]
Patch that adds GPU/NIC affinity

Hi,

Please find enclosed a patch that adds support for GPU and NIC affinity to Slurm. This is a preliminary implementation. We are still working on it, but we are releasing the patch now in order to get your feedback on our approach.

* Rationale

To use GPUDirect RDMA efficiently, choosing the right affinity is important. For the best performance with GPUDirect RDMA, the GPU and the NIC must be on the same PCI Express controller. Even without GPUDirect RDMA, binding a GPU to its closest CPU can improve data transfer performance on some hardware platforms.

* Implementation

In our proposal, to choose the right GPU/NIC affinity we first determine on which socket a task runs. We then restrict the task to the GPU and the NIC connected to that socket by setting two environment variables. Therefore, on a hardware platform where each socket has its own PCI Express controller, we guarantee two properties:

- the CPU and the GPU are on the same PCI Express controller
- the GPU and the NIC are on the same PCI Express controller

* Usage

Two new parameters have been implemented:

--accel-bind=2 sets both GPU and NIC affinity (useful for GPUDirect RDMA)
--accel-bind=1 sets only the GPU affinity

If you launch:

srun --gres=gpu:2 --ntasks-per-socket=2 --accel-bind=2 -n 4 ./MPI_app

the two instances of MPI_app running on the first socket will use GPU 0 and NIC 0, and the other two instances will use GPU 1 and NIC 1.

It is important to bind the tasks to a particular socket. If a task can use cores from two or more sockets, our implementation cannot determine a GPU/NIC affinity.

* Limitations

- We assume exactly one NIC and one GPU per socket: the selection algorithm supports only this configuration. We are considering enhancing that.
- You need to use TaskPlugin=task/affinity (the current version does not support the task/cgroup plugin).
- The code works with Open MPI >= 1.8 (it sets OMPI_MCA_btl_openib_if_include to select the right IB interface), InfiniBand cards (named mlx4_X) and NVIDIA GPUs.

Thanks,
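The socket-locality property described above can be checked from userspace: on Linux, every PCI device exposes a numa_node file in sysfs, so comparing a NIC's and a GPU's numa_node values tells whether they hang off the same socket's PCI Express controller. A minimal sketch; the sysfs paths and device names below are illustrative and vary per machine:

```shell
# Sketch: decide whether two PCI devices share a NUMA node (socket).
# Typical arguments (illustrative paths, they differ per system):
#   /sys/class/infiniband/mlx4_0/device/numa_node
#   /sys/bus/pci/devices/0000:02:00.0/numa_node
same_numa_node() {
    [ "$(cat "$1")" = "$(cat "$2")" ]
}
```

A task bound to a socket could use such a check to pick, among the visible devices, the GPU and NIC local to that socket.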
This seems to assume that NICs are not Slurm GRES. In that case, there would be no need to add these new fields to the job_descriptor or batch_job_launch_msg data structures. However, this may not be the best option. A more flexible option might be to define both GPUs and NICs as GRES, then allocate and bind to the closest resources. For example:

# in slurm.conf
GresTypes=gpu,nic
NodeName=tux[1-32] Gres=gpu:4,nic:no_consume:2 ...

# in gres.conf
Name=gpu File=/dev/nvidia0 CPUs=0-7
Name=gpu File=/dev/nvidia1 CPUs=8-15
Name=gpu File=/dev/nvidia2 CPUs=16-23
Name=gpu File=/dev/nvidia3 CPUs=24-31
Name=nic File=/dev/mlx4_0 CPUs=0-15
Name=nic File=/dev/mlx4_1 CPUs=16-31

With something like the above, jobs will automatically be bound to specific GPUs and NICs given an appropriate execute line ("sbatch --gres=gpu:2,nic:2 ..."). Then it is simply a matter of binding individual tasks to the specific GPUs and NICs that align.

Run "configure" with "--enable-developer" and it will report some errors in the code that should be fixed.

I see blocks of spaces used in many places rather than tabs. Please use tabs. Also limit line length to 80 characters. See the "Linux Kernel Coding Style" document here: http://slurm.schedmd.com/coding_style.pdf
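The per-task step ("binding individual tasks to specific GPUs and NICs that align") amounts to mapping a task's CPU ids onto the cpus= ranges in gres.conf. A minimal sketch, hard-coding the two NIC ranges from the example configuration (CPUs 0-15 on the first NIC, CPUs 16-31 on the second); a real implementation would parse gres.conf rather than hard-code the ranges:

```shell
# Sketch: map a CPU id to the NIC whose cpus= range covers it, using the
# example gres.conf ranges. Hard-coded purely for illustration.
nic_for_cpu() {
    if [ "$1" -le 15 ]; then
        echo mlx4_0
    else
        echo mlx4_1
    fi
}
```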
I have studied the patch some more. Since more than one job can run on each node, all of the resource selection needs to happen in slurmctld. The new logic to select GPUs and NICs for a job step is in slurmstepd, which duplicates logic already in src/common/gres.c used by slurmctld. This patch cannot work if you want to support more than one job per compute node or use the task/cgroup plugin, both of which are essential capabilities. Let me work on adding this functionality, and perhaps you can work on testing it.
Created attachment 1947 [details]
local binding patch

Commit: https://github.com/SchedMD/slurm/commit/1ae5ad6d7672bba25009eb322a7ce8fe8f869bbd

Examples of use:

$ cat /etc/slurm/gres.conf
Name=gpu File=/dev/nvidia0 CPUs=0
Name=gpu File=/dev/nvidia1 CPUs=1
Name=gpu File=/dev/nvidia2 CPUs=2
Name=gpu File=/dev/nvidia3 CPUs=3
Name=nic File=/dev/mlx4_0 CPUs=0-1
Name=nic File=/dev/mlx4_1 CPUs=2-3

$ srun --gres=gpu:4,nic:2 -n4 --cpu_bind=v --accel-bind=gn -l tmp1
0: cpu_bind=MASK - jette, task 0 0 [14127]: mask 0x1 set
1: cpu_bind=MASK - jette, task 1 1 [14128]: mask 0x2 set
2: cpu_bind=MASK - jette, task 2 2 [14129]: mask 0x4 set
3: cpu_bind=MASK - jette, task 3 3 [14130]: mask 0x8 set
0: CUDA_VISIBLE_DEVICES=0
0: OMPI_MCA_btl_openib_if_include=mlx4_0
1: CUDA_VISIBLE_DEVICES=1
1: OMPI_MCA_btl_openib_if_include=mlx4_0
2: CUDA_VISIBLE_DEVICES=2
2: OMPI_MCA_btl_openib_if_include=mlx4_1
3: CUDA_VISIBLE_DEVICES=3
3: OMPI_MCA_btl_openib_if_include=mlx4_1

(Without binding to local devices)

$ srun --gres=gpu:4,nic:2 -n4 --cpu_bind=v -l tmp1
0: cpu_bind=MASK - jette, task 0 0 [14367]: mask 0x1 set
1: cpu_bind=MASK - jette, task 1 1 [14368]: mask 0x2 set
2: cpu_bind=MASK - jette, task 2 2 [14369]: mask 0x4 set
3: cpu_bind=MASK - jette, task 3 3 [14370]: mask 0x8 set
0: CUDA_VISIBLE_DEVICES=0,1,2,3
0: OMPI_MCA_btl_openib_if_include=mlx4_0,mlx4_1
1: CUDA_VISIBLE_DEVICES=0,1,2,3
1: OMPI_MCA_btl_openib_if_include=mlx4_0,mlx4_1
2: CUDA_VISIBLE_DEVICES=0,1,2,3
2: OMPI_MCA_btl_openib_if_include=mlx4_0,mlx4_1
3: CUDA_VISIBLE_DEVICES=0,1,2,3
3: OMPI_MCA_btl_openib_if_include=mlx4_0,mlx4_1
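The tmp1 script itself is not included above; a hypothetical stand-in (not part of the patch) that would produce the CUDA_VISIBLE_DEVICES and OMPI_MCA_btl_openib_if_include lines shown is simply a per-task print of the environment Slurm exports (the cpu_bind lines come from --cpu_bind=v, not the script):

```shell
# Hypothetical reconstruction of the "tmp1" test script used in the
# examples: each task prints the binding environment Slurm exports to it.
print_binding() {
    echo "${SLURM_PROCID}: CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"
    echo "${SLURM_PROCID}: OMPI_MCA_btl_openib_if_include=${OMPI_MCA_btl_openib_if_include}"
}
print_binding
```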
The code is done. Let me know if you discover any problems.