Ticket 7919 - slurm RPM dependency on libnvidia-ml.so
Summary: slurm RPM dependency on libnvidia-ml.so
Status: RESOLVED DUPLICATE of ticket 9525
Alias: None
Product: Slurm
Classification: Unclassified
Component: Build System and Packaging
Version: 19.05.3
Hardware: Linux
Importance: --- 3 - Medium Impact
Assignee: Tim Wickberg
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2019-10-11 12:29 MDT by Kilian Cavalotti
Modified: 2021-05-21 02:26 MDT
CC List: 7 users

See Also:
Site: Stanford
Machine Name: Sherlock
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---


Attachments
SPEC patch (423 bytes, patch)
2019-10-11 18:59 MDT, Kilian Cavalotti

Description Kilian Cavalotti 2019-10-11 12:29:30 MDT
Hi!

When activating NVML support in Slurm (which the configure script enables by default if it finds nvml.h and libnvidia-ml.so on the build host), it looks like the generated "slurm" RPM depends on "libnvidia-ml.so.1()(64bit)":

# rpm -qRp ./slurm-19.05.3-2.el7.x86_64.rpm  | grep nvidia
libnvidia-ml.so.1()(64bit)

Is that expected? That would force installing the NVIDIA driver on all the nodes of a heterogeneous cluster, even nodes without GPUs, which seems a bit overkill.
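For context: that dependency is not listed in the spec file itself; rpm's automatic find-requires scan adds it because one of the packaged shared objects is linked against libnvidia-ml. A quick way to confirm which file pulls it in, assuming the RPM is at hand (the paths below are only illustrative):

# mkdir /tmp/slurm-rpm && cd /tmp/slurm-rpm
# rpm2cpio /path/to/slurm-19.05.3-2.el7.x86_64.rpm | cpio -idm
# find . -name '*.so*' -exec sh -c 'readelf -d "$1" 2>/dev/null | grep -q libnvidia-ml && echo "$1"' _ {} \;

On a build where configure found NVML, this should single out the NVML GPU plugin shipped in the package (the exact .so path varies by release), which is what the dependency generator turns into "libnvidia-ml.so.1()(64bit)".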

I thought from the documentation that NVML support was kind of opportunistic and that the NVML libs would only be used if available on the nodes:
"If AutoDetect=nvml is set in gres.conf, and the NVIDIA Management Library (NVML) is installed on the node and was enabled during Slurm configuration, configuration details will automatically be filled in for any system-detected NVIDIA GPU. "

It seems like in practice, if NVML was enabled during Slurm configuration, it *has* to be installed on the node.

Did I miss something here? Or is it just a matter of fixing the spec file to remove the requirement on libnvidia-ml.so?

Thanks!
-- 
Kilian
Comment 3 Kilian Cavalotti 2019-10-11 18:59:25 MDT
Created attachment 11934 [details]
SPEC patch

Here's a patch to the SPEC file so that the slurm RPMs do not depend on libnvidia-ml.so, even when NVML support has been enabled at configure time.
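For anyone reading along without opening the attachment: a generic way to drop an auto-generated requirement from an RPM is rpm's dependency-filtering macro, shown below as a sketch of the idea (the attached patch may well do it differently):

# added to the spec file preamble
%global __requires_exclude ^libnvidia-ml\.so.*$

With a filter like that, the package still ships the NVML-enabled plugin, but "rpm -qRp" no longer reports libnvidia-ml.so.1()(64bit), so the driver is not forced onto GPU-less nodes.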

Cheers,
-- 
Kilian
Comment 5 timeu 2020-07-08 02:56:22 MDT
Hi, we have basically the same issue.
We want to use configless Slurm with 20.02.2 and would like to have a single gres.conf with AutoDetect=nvml for all GPU nodes of our heterogeneous HPC cluster.
Any chance this patch can be upstreamed?

We have an active support contract with SchedMD.
Comment 6 Lyn 2021-01-13 12:00:32 MST
We at NASA/NCCS are also hitting this bug with 20.02.6 in multiple GPU environments, and we've been using the egrep -v workaround. We don't understand why find-requires would error out even when libnvidia-ml is installed in a system default location like /usr/lib64. The resolution we'd like to see is a slurm spec file that is aware of libnvidia-ml.so.1 but does not fail to install when it can't find it, or that is smarter about how to locate it.
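For completeness, the egrep -v workaround mentioned above usually takes roughly this shape: rpm's stock find-requires wrapped in a small filter. The script name and pattern here are hypothetical, and exact details vary by site.

In the spec file:
%global _use_internal_dependency_generator 0
%global __find_requires %{_sourcedir}/filter-requires.sh

filter-requires.sh (kept next to the spec, marked executable):
#!/bin/sh
/usr/lib/rpm/find-requires | egrep -v '^libnvidia-ml\.so'

Either this or a __requires_exclude filter strips the generated requirement; the spec fix that eventually shipped (see the following comments) made the workaround unnecessary.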
Comment 7 timeu 2021-02-09 14:54:17 MST
Seems to be fixed in 20.02.6: https://github.com/SchedMD/slurm/commit/1be5492c274e170451ed18763e7eeea826f57cb7
Comment 8 Tim Wickberg 2021-02-11 12:32:31 MST
This is fixed in the spec file shipped alongside Slurm starting with 20.02.6 / 20.11.0.

- Tim

*** This ticket has been marked as a duplicate of ticket 9525 ***