Bug 10356 - slurmstepd creates an OpenCL handle on all GPUs through PMIx / hwloc
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: PMIx
Version: 20.02.6
Hardware: Linux
OS: Linux
Priority: ---
Severity: 4 - Minor Issue
Assignee: Felip Moll
QA Contact: Ben Roberts
 
Reported: 2020-12-03 16:22 MST by Felix Abecassis
Modified: 2021-03-15 17:29 MDT
CC List: 3 users

Site: NVIDIA (PSLA)


Description Felix Abecassis 2020-12-03 16:22:34 MST
On our Selene cluster we tried to enable the Multi-Instance GPU (aka MIG, see https://docs.nvidia.com/datacenter/tesla/mig-user-guide/) on our A100 GPUs. To allow users to test multiple MIG configurations from within one Slurm job, we wanted to allow users to create/destroy MIG instances from a job step.

We realized that we can't create a MIG instance from within a Slurm job step (for instance from an interactive job step), since the GPU is in use by another process:
$ nvidia-smi mig --id 1 --create-gpu-instance 3g.40gb,3g.40gb --default-compute-instance
Unable to create a GPU instance on GPU  1 using profile 3g.40gb: In use by another client

Slurmstepd is the process that causes the GPUs to be "in use" while the job step is active:
$ lsof /dev/nvidia0
COMMAND     PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
slurmstep 28695 root  mem    CHR  195,0           750 /dev/nvidia0
slurmstep 28695 root   21u   CHR  195,0      0t0  750 /dev/nvidia0
slurmstep 28695 root   29u   CHR  195,0      0t0  750 /dev/nvidia0
slurmstep 28695 root   30u   CHR  195,0      0t0  750 /dev/nvidia0

Through /proc/$(pidof slurmstepd)/maps, we saw that slurmstepd is using libnvidia-opencl.so and libcuda.so. Since OpenCL is involved, we suspected that the GPU handles were created by hwloc. Looking at the Slurm code, we discovered that hwloc is being used in the PMIx code path:
https://github.com/SchedMD/slurm/blob/5921b6abe8e4f813d107dce9400c206e4ebf370d/src/plugins/mpi/pmix/pmixp_client.c#L318

We then confirmed that slurmstepd acquires handles on the /dev/nvidia* files only when using --mpi=pmix.

We are opening this issue to request that the PMIx code in Slurm filter out the "OS devices" using the hwloc API, e.g. by adding a line like the following in _set_topology:
hwloc_topology_set_type_filter(topology, HWLOC_OBJ_OS_DEVICE, HWLOC_TYPE_FILTER_KEEP_NONE);  

As you can see in the hwloc code, this filter prevents hwloc from querying the OpenCL devices:
https://github.com/open-mpi/hwloc/blob/6bfd272e1f4e29f9702a8915a0396845a269fc8a/hwloc/topology-opencl.c#L58-L60 
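
For illustration, here is a minimal standalone hwloc 2.x sketch (not the actual _set_topology code in Slurm, just the public hwloc API) showing where such a filter call would sit relative to topology initialization and load:

#include <hwloc.h>

int main(void)
{
    hwloc_topology_t topology;

    hwloc_topology_init(&topology);
    /* Proposed filter: drop all "OS device" objects (OpenCL, CUDA, NVML,
     * network, ...) before loading, so no GPU handles would be opened. */
    hwloc_topology_set_type_filter(topology, HWLOC_OBJ_OS_DEVICE,
                                   HWLOC_TYPE_FILTER_KEEP_NONE);
    hwloc_topology_load(topology);
    /* ... the loaded topology would then be used for PMIx as today ... */
    hwloc_topology_destroy(topology);
    return 0;
}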

I believe the topology information will still be sufficient for PMIx with the change above; I modified the Slurm code manually and verified that slurmstepd no longer acquires GPU handles.
Comment 1 Felix Abecassis 2020-12-03 16:27:35 MST
By the way, another side effect of slurmstepd keeping a handle on the GPUs is that a few MB are allocated on each GPU by default:
$ srun --mpi=pmix nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits
3, 81252

As with the /dev/nvidia* files, no memory is used on the GPU without PMIx:
$ srun --mpi=none nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits
0, 81252

This is less important than the issue above, but it also tripped some of our prolog scripts, which expect 0 MB to be allocated at the end of a job (combined with the fact that the slurmstepd process wasn't fully reaped at that point).
Comment 2 Artem Polyakov 2020-12-04 16:44:10 MST
Danny,
I'm in the loop and we are looking for a solution now.
Comment 3 Felix Abecassis 2020-12-04 18:37:16 MST
Yes, I discussed this with Artem and we have already established that the solution I initially suggested is unlikely to work:
hwloc_topology_set_type_filter(topology, HWLOC_OBJ_OS_DEVICE, HWLOC_TYPE_FILTER_KEEP_NONE);

This would also remove important device information that is discovered after PCIe enumeration, for instance by querying the sysfs folders /sys/class/net and /sys/class/infiniband.

Hence, we wouldn't be able to get the names of the InfiniBand HCAs (like mlx5_0) from the topology, so we need another approach. We'll update this bug once we have other suggestions.
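
To make the limitation concrete, here is a small standalone sketch (again just the public hwloc API, not Slurm code) that lists the "OS device" objects: OpenCL handles (e.g. "opencl0d0") and OpenFabrics HCAs (e.g. "mlx5_0") are both HWLOC_OBJ_OS_DEVICE objects, distinguished only by their osdev type, so filtering out the whole object type drops the HCA names along with the GPU handles:

#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    hwloc_topology_t topology;
    hwloc_obj_t osdev = NULL;

    hwloc_topology_init(&topology);
    /* Keep every I/O object so both GPU and InfiniBand OS devices are visible. */
    hwloc_topology_set_io_types_filter(topology, HWLOC_TYPE_FILTER_KEEP_ALL);
    hwloc_topology_load(topology);

    /* Both the OpenCL/CUDA handles and the InfiniBand HCAs show up here. */
    while ((osdev = hwloc_get_next_osdev(topology, osdev)) != NULL)
        printf("osdev type=%d name=%s\n",
               (int) osdev->attr->osdev.type,
               osdev->name ? osdev->name : "(unnamed)");

    hwloc_topology_destroy(topology);
    return 0;
}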
Comment 4 Felip Moll 2020-12-28 08:07:12 MST
Hi Artem/Felix,

Have you arrived at any other conclusion about this issue?
Do you have any feedback?

Thank you
Comment 5 Felix Abecassis 2021-01-04 09:30:35 MST
Hello Felip,

Our plan right now is to disable OpenCL in hwloc under slurmd by setting the environment variable "HWLOC_COMPONENTS=-opencl" in the systemd unit file of slurmd. See https://www.open-mpi.org/projects/hwloc/doc/v2.3.0/a00357.php
This has been tested locally but not deployed to our cluster yet.
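
For reference, this is the kind of systemd drop-in we have in mind (the drop-in path is just an example, not the exact file we deploy):

# /etc/systemd/system/slurmd.service.d/hwloc-no-opencl.conf
[Service]
Environment="HWLOC_COMPONENTS=-opencl"

followed by a systemctl daemon-reload and a restart of slurmd so the new environment takes effect.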

We discussed multiple options for a long-term fix, such as caching the topology file on the filesystem or calling hwloc in a subprocess, but these would require significant code changes.

Lowering the importance to 4 since we have a fairly good workaround. Let me know if you want me to close this bug for now.
Comment 13 Felip Moll 2021-01-06 10:55:28 MST
Hi Felix,

For the moment we decided to add an entry to the FAQ on the website. It will show up with the launch of the next release, 20.11.3.

commit 767b9189e339d33edaad16833a8b7c9d599e5ca7
Author:     Felip Moll <felip.moll@schedmd.com>
AuthorDate: Tue Jan 5 14:44:47 2021 +0100
Commit:     Ben Roberts <ben@schedmd.com>
CommitDate: Wed Jan 6 11:11:40 2021 -0600

    Docs - Add FAQ about Multi-Instance GPU and PMIx
    
    Bug 10356

If we find this starts to bother a lot of sites or has worse implications, we can reopen the bug.

Thanks for your investigation.