Hi!

When activating NVML support in Slurm (which the configure script enables by default if it finds nvml.h and libnvidia-ml.so on the build host), it looks like the generated "slurm" RPM depends on "libnvidia-ml.so.1()(64bit)":

# rpm -qRp ./slurm-19.05.3-2.el7.x86_64.rpm | grep nvidia
libnvidia-ml.so.1()(64bit)

Is that expected? That would force installing the NVIDIA driver on all the nodes of a heterogeneous cluster, even nodes without GPUs, which seems a bit overkill. I understood from the documentation that NVML support was opportunistic, and that the NVML libraries would only be used if available on the node:

"If AutoDetect=nvml is set in gres.conf, and the NVIDIA Management Library (NVML) is installed on the node and was enabled during Slurm configuration, configuration details will automatically be filled in for any system-detected NVIDIA GPU."

In practice, though, it seems that if NVML was enabled during Slurm configuration, it *has* to be installed on the node. Did I miss something here? Or is it just a matter of fixing the spec file to remove the requirement on libnvidia-ml.so?

Thanks!
-- Kilian
Created attachment 11934 [details]
SPEC patch

Here's a patch to the SPEC file so the slurm RPMs don't depend on libnvidia-ml.so, even if NVML support was enabled at configure time.

Cheers,
-- Kilian
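For context, one common way to achieve this kind of change in an RPM spec file is the `__requires_exclude` macro, which drops matching sonames from the auto-generated Requires. This is only a sketch of the idea; the attached patch may take a different approach:

```spec
# Don't turn the NVML library into a hard RPM dependency: nodes without
# GPUs shouldn't need the NVIDIA driver installed just to install slurm.
# (Sketch only; the exact pattern and escaping are assumptions.)
%global __requires_exclude ^libnvidia-ml\\.so
```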
Hi, we have basically the same issue. We want to use configless Slurm with 20.02.2 and would like to have a single gres.conf with AutoDetect=nvml for all GPU nodes of our heterogeneous HPC cluster. Any chance this patch can be upstreamed? We have an active support contract with SchedMD.
We at NASA/NCCS are also experiencing this bug in 20.02.6 in multiple environments with GPUs. We've leveraged the workaround (egrep -v). We don't understand why find-requires would error out even when libnvidia-ml is installed in a system default location like /usr/lib64. The resolution we would like to see is a slurm spec file that is aware of libnvidia-ml.so.1, but which does not fail to install when it can't find it, or which is smarter about how to find it.
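For anyone else hitting this, a sketch of the "egrep -v" workaround mentioned above: wrap rpm's stock find-requires script so the NVML soname never reaches the package's auto-generated Requires list. The wrapper path and the exact rpmbuild invocation are assumptions, not the precise setup we use:

```shell
# Create a find-requires wrapper that strips the libnvidia-ml.so
# dependency from rpm's automatic dependency generation.
cat > /tmp/find-requires-no-nvml <<'EOF'
#!/bin/sh
/usr/lib/rpm/find-requires "$@" | egrep -v '^libnvidia-ml\.so'
EOF
chmod +x /tmp/find-requires-no-nvml

# Then point rpmbuild at the wrapper when building, e.g.:
#   rpmbuild --define '__find_requires /tmp/find-requires-no-nvml' -ta slurm-20.02.6.tar.bz2
```

The resulting slurm RPM then installs cleanly on non-GPU nodes, while GPU nodes still pick up NVML at runtime if the driver is present.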
Seems to be fixed in 20.06.1: https://github.com/SchedMD/slurm/commit/1be5492c274e170451ed18763e7eeea826f57cb7
This is fixed in the spec file shipped alongside Slurm starting with 20.02.6 / 20.11.0.

- Tim

*** This ticket has been marked as a duplicate of ticket 9525 ***