On our Selene cluster, we tried to enable Multi-Instance GPU (aka MIG, see https://docs.nvidia.com/datacenter/tesla/mig-user-guide/) on our A100 GPUs. To let users test multiple MIG configurations from within one Slurm job, we wanted to allow them to create/destroy MIG instances from a job step. We found that a MIG instance cannot be created from within a Slurm job step because the GPU is in use by another process, for instance from an interactive job step:

    $ nvidia-smi mig --id 1 --create-gpu-instance 3g.40gb,3g.40gb --default-compute-instance
    Unable to create a GPU instance on GPU 1 using profile 3g.40gb: In use by another client

slurmstepd is the process that keeps the GPUs "in use" while the job step is active:

    $ lsof /dev/nvidia0
    COMMAND     PID USER  FD  TYPE DEVICE SIZE/OFF NODE NAME
    slurmstep 28695 root  mem  CHR  195,0          750  /dev/nvidia0
    slurmstep 28695 root  21u  CHR  195,0      0t0 750  /dev/nvidia0
    slurmstep 28695 root  29u  CHR  195,0      0t0 750  /dev/nvidia0
    slurmstep 28695 root  30u  CHR  195,0      0t0 750  /dev/nvidia0

Through /proc/$(pidof slurmstepd)/maps, we saw that slurmstepd maps libnvidia-opencl.so and libcuda.so. Since OpenCL is involved, we suspected that the GPU handles were created by hwloc. Looking at the Slurm code, we discovered that hwloc is used in the PMIx code path:

https://github.com/SchedMD/slurm/blob/5921b6abe8e4f813d107dce9400c206e4ebf370d/src/plugins/mpi/pmix/pmixp_client.c#L318

We then confirmed that slurmstepd acquires handles on the /dev/nvidia* files only when using --mpi=pmix.

We are opening this issue to request that the PMIx code in Slurm filter out the "OS devices" using the hwloc API, e.g. by adding a line like the following in _set_topology:

    hwloc_topology_set_type_filter(topology, HWLOC_OBJ_OS_DEVICE, HWLOC_TYPE_FILTER_KEEP_NONE);

As the hwloc code shows, this prevents hwloc from querying the OpenCL devices:

https://github.com/open-mpi/hwloc/blob/6bfd272e1f4e29f9702a8915a0396845a269fc8a/hwloc/topology-opencl.c#L58-L60

I believe the topology information will still be sufficient for PMIx with the change above; I modified the Slurm code manually and verified that slurmstepd no longer acquires GPU handles.
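For debugging this kind of problem, the lsof check above can also be reproduced programmatically. Below is a minimal, hedged sketch (not part of our tooling) that walks Linux /proc to find which PIDs hold an open file descriptor on a given device node; unlike lsof it only inspects open fds, not mmap'ed regions, and silently skips processes it cannot read:

```python
import os

def holders(device_path):
    """Return PIDs that have an open fd pointing at device_path.

    A minimal lsof-like sketch using Linux /proc; mmap'ed mappings
    (like the 'mem' entry in the lsof output) are not covered.
    """
    target = os.path.realpath(device_path)
    pids = []
    for pid in filter(str.isdigit, os.listdir("/proc")):
        fd_dir = "/proc/%s/fd" % pid
        try:
            fds = os.listdir(fd_dir)
        except OSError:  # permission denied, or the process exited
            continue
        for fd in fds:
            try:
                if os.path.realpath(os.path.join(fd_dir, fd)) == target:
                    pids.append(int(pid))
                    break
            except OSError:  # fd closed while we were looking
                continue
    return pids

# In our case, holders("/dev/nvidia0") would list slurmstepd's PID
# whenever a --mpi=pmix job step is active.
```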
By the way, another side effect of slurmstepd keeping a handle on the GPUs is that a few MB are allocated on each GPU by default:

    $ srun --mpi=pmix nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits
    3, 81252

As with the /dev/nvidia* files, no GPU memory is used without PMIx:

    $ srun --mpi=none nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits
    0, 81252

This is less important than the issue above, but it also tripped some of our prolog scripts that expect 0 MB to be allocated at the end of a job (combined with the fact that the slurmstepd process wasn't fully reaped at that point).
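To illustrate the kind of check that tripped: our scripts parse the nvidia-smi CSV output above and expect every GPU to report 0 MiB used. A hedged sketch of that check (the function name and exact parsing are illustrative, not our actual script):

```python
def gpus_idle(csv_output):
    """Return True if every GPU reports 0 MiB used.

    Expects the output of:
      nvidia-smi --query-gpu=memory.used,memory.total \
                 --format=csv,noheader,nounits
    i.e. one "used, total" pair per GPU per line.
    """
    for line in csv_output.strip().splitlines():
        used, _total = (int(field) for field in line.split(","))
        if used != 0:
            return False
    return True

# With a PMIx job step active, the check fails because slurmstepd
# holds a few MiB on the GPU:
#   gpus_idle("3, 81252")  -> False
#   gpus_idle("0, 81252")  -> True
```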
Danny, I'm in the loop and we are looking into a solution now.
Yes, I discussed this with Artem, and we have already established that the solution I initially suggested is unlikely to work:

    hwloc_topology_set_type_filter(topology, HWLOC_OBJ_OS_DEVICE, HWLOC_TYPE_FILTER_KEEP_NONE);

This filter would also remove important device information that is discovered after PCIe enumeration, for instance by querying the sysfs directories /sys/class/net and /sys/class/infiniband. Hence, we would no longer be able to get the names of the InfiniBand HCAs (like mlx5_0) from the topology. So we need another approach; we'll update this bug once we have other suggestions.
Hi Artem/Felix,

Have you arrived at any other conclusion about this issue? Do you have any feedback?

Thank you
Hello Felip,

Our plan right now is to disable OpenCL in hwloc under slurmd by setting the environment variable "HWLOC_COMPONENTS=-opencl" in the systemd unit file of slurmd. See https://www.open-mpi.org/projects/hwloc/doc/v2.3.0/a00357.php

This has been tested locally but not yet deployed to our cluster. We discussed multiple options for a long-term fix, such as caching the topology file on the filesystem or calling hwloc in a subprocess, but these would require significant code changes.

Lowering the importance to 4 since we have a fairly good workaround. Let me know if you want me to close this bug for now.
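For reference, rather than editing the slurmd unit file in place, the workaround can also be applied as a systemd drop-in; the path below is an illustrative assumption, not what we deployed:

```ini
# /etc/systemd/system/slurmd.service.d/hwloc.conf  (hypothetical drop-in path)
[Service]
# Tell hwloc not to load its OpenCL discovery component, so slurmstepd
# never acquires /dev/nvidia* handles through the PMIx topology query.
Environment=HWLOC_COMPONENTS=-opencl
```

After adding the drop-in, `systemctl daemon-reload` followed by `systemctl restart slurmd` picks it up.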
Hi Felix,

For the moment we decided to add an entry to the FAQ on our website. It will be published with the next release, 20.11.3.

    commit 767b9189e339d33edaad16833a8b7c9d599e5ca7
    Author:     Felip Moll <felip.moll@schedmd.com>
    AuthorDate: Tue Jan 5 14:44:47 2021 +0100
    Commit:     Ben Roberts <ben@schedmd.com>
    CommitDate: Wed Jan 6 11:11:40 2021 -0600

        Docs - Add FAQ about Multi-Instance GPU and PMIx

        Bug 10356

If we find this starts to bother a lot of sites or has worse implications, we can reopen the bug. Thanks for your investigation.