| Summary: | slurmstepd creates an OpenCL handle on all GPUs through PMIx / hwloc | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Felix Abecassis <fabecassis> |
| Component: | PMIx | Assignee: | Felip Moll <felip.moll> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | Ben Roberts <ben> |
| Severity: | 4 - Minor Issue | CC: | artpol84, jbernauer, lyeager |
| Version: | 20.02.6 | Hardware: | Linux |
| OS: | Linux | Site: | NVIDIA (PSLA) |
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=11091 | | |
Site: | NVIDIA (PSLA) | Alineos Sites: | --- |
Atos/Eviden Sites: | --- | Confidential Site: | --- |
Coreweave sites: | --- | Cray Sites: | --- |
DS9 clusters: | --- | HPCnow Sites: | --- |
HPE Sites: | --- | IBM Sites: | --- |
NOAA SIte: | --- | OCF Sites: | --- |
Recursion Pharma Sites: | --- | SFW Sites: | --- |
SNIC sites: | --- | Linux Distro: | --- |
Machine Name: | CLE Version: | ||
Version Fixed: | Target Release: | --- | |
DevPrio: | --- | Emory-Cloud Sites: | --- |
Description
Felix Abecassis, 2020-12-03 16:22:34 MST

By the way, another side effect of slurmstepd keeping a handle on the GPUs is that a few MB are allocated on each GPU by default:

```
$ srun --mpi=pmix nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits
3, 81252
```

As with the /dev/nvidia* files, no GPU memory is used when PMIx is not involved:

```
$ srun --mpi=none nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits
0, 81252
```

This is less important than the issue above, but it also tripped some of our prolog scripts, which expect 0 MB to be allocated at the end of a job (combined with the fact that the slurmstepd process wasn't fully reaped at that point).

---

Danny, I'm in the loop and we are looking for the solution now.

---

Yes, I discussed this with Artem and we have already established that the solution I suggested initially is unlikely to work:

```
hwloc_topology_set_type_filter(topology, HWLOC_OBJ_OS_DEVICE, HWLOC_TYPE_FILTER_KEEP_NONE);
```

This would also remove important device information that is only discovered after PCIe enumeration, for instance by querying the sysfs directories /sys/class/net and /sys/class/infiniband. We would then be unable to get the names of the InfiniBand HCAs (such as mlx5_0) from the topology. So we need another approach; we'll update this bug once we have other suggestions.

---

Hi Artem/Felix,

Have you arrived at any other conclusion about this issue? Do you have any feedback?

Thank you

---

Hello Felip,

Our plan right now is to disable OpenCL in hwloc under slurmd by setting the environment variable "HWLOC_COMPONENTS=-opencl" in the systemd unit file of slurmd; see https://www.open-mpi.org/projects/hwloc/doc/v2.3.0/a00357.php. This has been tested locally but not deployed to our cluster yet.

We discussed multiple options for long-term fixes, such as caching the topology file on the filesystem or calling hwloc in a subprocess, but these would require significant code changes.

Lowering the importance to 4 since we have a fairly good workaround. Let me know if you want me to close this bug for now.

---

Hi Felix,

For the moment we decided to add an entry to the FAQ on the website. It will show up with the launch of the next release, 20.11.3.

```
commit 767b9189e339d33edaad16833a8b7c9d599e5ca7
Author:     Felip Moll <felip.moll@schedmd.com>
AuthorDate: Tue Jan 5 14:44:47 2021 +0100
Commit:     Ben Roberts <ben@schedmd.com>
CommitDate: Wed Jan 6 11:11:40 2021 -0600

    Docs - Add FAQ about Multi-Instance GPU and PMIx

    Bug 10356
```

If we find this starts to bother a lot of sites or has worse implications, we can reopen the bug. Thanks for your investigation.
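
For illustration, here is a minimal sketch of the kind of prolog check Felix describes, assuming nvidia-smi is on the PATH; the script itself is hypothetical and not taken from the bug:

```sh
#!/bin/sh
# Hypothetical prolog fragment: refuse to start a job if any GPU still has
# memory allocated, e.g. because a previous step's slurmstepd is still
# holding an OpenCL handle and has not been fully reaped yet.
used=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | sort -n | tail -1)
if [ "${used:-0}" -ne 0 ]; then
    echo "GPU memory still allocated: ${used} MiB" >&2
    exit 1
fi
```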
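A minimal sketch of the HWLOC_COMPONENTS workaround, assuming slurmd runs as a systemd service named slurmd.service; the drop-in file name is arbitrary and the paths may differ per distribution:

```sh
# Create a systemd drop-in that exports HWLOC_COMPONENTS=-opencl to slurmd,
# so the hwloc library inherited by slurmstepd never loads its OpenCL
# discovery component (and thus never creates a context on the GPUs).
mkdir -p /etc/systemd/system/slurmd.service.d
cat > /etc/systemd/system/slurmd.service.d/hwloc.conf <<'EOF'
[Service]
Environment=HWLOC_COMPONENTS=-opencl
EOF
systemctl daemon-reload
systemctl restart slurmd
```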
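The effect can be checked outside Slurm with lstopo, assuming hwloc's command-line tools are installed: with the variable set, the OpenCL OS-device objects should disappear from the topology while sysfs-discovered devices such as mlx5_0 remain visible.

```sh
# OpenCL OS devices are listed by default (output varies per machine):
lstopo --of console | grep -i opencl
# With the component disabled, only sysfs-based OS devices remain:
HWLOC_COMPONENTS=-opencl lstopo --of console | grep -i -e opencl -e mlx5
```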