9525 – Build and install 20.02.X problem libnvidia-ml missing

Status:	RESOLVED FIXED

Alias:	None

Product:	Slurm
Classification:	Unclassified
Component:	Configuration
Version:	20.02.3
Hardware:	Linux Linux

Importance:	--- 3 - Medium Impact
Assignee:	Tim McMullan
QA Contact:

URL:

Duplicates (1):	7919
Depends on:
Blocks:

Reported:	2020-08-06 07:09 MDT by Marco Induni
Modified:	2021-02-11 12:32 MST
CC List:	1 user

See Also:
Site:	CSCS - Swiss National Supercomputing Centre
Version Fixed:	20.02.6 20.11.0pre1

Attachments
bug9525 workaround (432 bytes, patch) 2020-08-07 08:32 MDT, Tim McMullan

Description Marco Induni 2020-08-06 07:09:49 MDT

Dear Support,
I'm trying to build and install Slurm 20.02.4, but I'm facing a dependency problem with the generated rpms.
Here are the steps I followed:
1) Build command

VER=20.02.4 ; rpmbuild -tb --define "_prefix /opt/slurm/$VER" --define "_sysconfdir /etc/slurm"  --define "_slurm_sysconfdir /etc/slurm" slurm-$VER.tar.bz2


2) The rpms are generated correctly, but the install fails:

rpm -Uvh slurm-20.02.4-1.el7.x86_64.rpm
error: Failed dependencies:
	libnvidia-ml.so.1()(64bit) is needed by slurm-20.02.4-1.el7.x86_64


Note that libnvidia-ml.so.1 is present, as the ldconfig output shows:

 ldconfig -p | grep libnvidia-ml
	libnvidia-ml.so.1 (libc6,x86-64) => /lib64/libnvidia-ml.so.1
	libnvidia-ml.so.1 (libc6) => /lib/libnvidia-ml.so.1
	libnvidia-ml.so (libc6,x86-64) => /lib64/libnvidia-ml.so
	libnvidia-ml.so (libc6) => /lib/libnvidia-ml.so


So the build was configured to use it:
...
configure:21362: checking for nvml.h
configure:21362: result: yes
configure:21370: checking for nvmlInit in -lnvidia-ml
configure:21395: gcc -o conftest -DNUMA_VERSION1_COMPATIBILITY -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -Wl,-z,lazy  -m64 -mtune=generic -std=gnu99 -pthread -I/usr/local/cuda/include -I/usr/cuda/include  -Wl,-z,relro -Wl,-z,lazy  conftest.c -lnvidia-ml   -lresolv  >&5
configure:21395: $? = 0
configure:21404: result: yes
...

but the install then fails with the message reported above.

These NVIDIA files are not part of any installed rpm (they were installed with the original NVIDIA installer), so could this be the problem?

Thank you
Marco Induni

Comment 1 Tim McMullan 2020-08-07 08:32:58 MDT

Created attachment 15350 [details]
bug9525 workaround

Hi Marco,

That is indeed the issue.  rpm checks the list of installed rpms for what they provide, and if it doesn't find libnvidia-ml.so, it fails.  Since you installed the NVIDIA libraries manually, there is no rpm for it to look at.
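
As a rough sketch of the mismatch described here, the function below asks both the dynamic linker cache and the RPM database about the same library. `check_capability` is a hypothetical helper name, not part of Slurm or rpm; the sketch assumes `ldconfig` and `rpm` are on the PATH and treats a missing tool as "not found".

```shell
#!/bin/sh
# Report whether a shared library is visible to the dynamic linker
# (ldconfig cache) and whether any installed rpm "Provides" it.
check_capability() {
    soname="$1"                       # e.g. libnvidia-ml.so.1
    capability="${soname}()(64bit)"   # the form rpm printed in the error above

    # What the configure/link step effectively relies on.
    if command -v ldconfig >/dev/null 2>&1 &&
       ldconfig -p 2>/dev/null | grep -q -F "$soname"; then
        echo "linker cache: found"
    else
        echo "linker cache: not found"
    fi

    # What rpm consults when resolving package dependencies.
    if command -v rpm >/dev/null 2>&1 &&
       rpm -q --whatprovides "$capability" >/dev/null 2>&1; then
        echo "rpm database: provided"
    else
        echo "rpm database: not provided"
    fi
}

check_capability "${1:-libnvidia-ml.so.1}"
```

On a system like the one in this report, where the library came from the NVIDIA installer rather than an rpm, the first check would be expected to succeed and the second to fail.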

I've attached a workaround patch to the spec file that will exclude libnvidia-ml from the requirements, but we generally anticipate that if you are installing Slurm with rpms, CUDA would be installed that way as well.

I don't think we would want to do this in general though since it could cause some issues for sites that do expect cuda to be installed with the RPMs.
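
The attachment itself is not reproduced in this thread. As an assumption about how such an exclusion is commonly written, RPM's automatic dependency generator honors a `__requires_exclude` macro holding a regular expression, so a spec line such as `%global __requires_exclude ^libnvidia-ml` would be one possible form; the actual patch may differ. The hypothetical `filter_requires` helper below only previews which requirement lines a pattern like that would drop.

```shell
#!/bin/sh
# Preview the effect of a requires-exclude style pattern.
# Reads one requirement per line on stdin and prints the ones that
# would remain after lines matching the extended regex are dropped.
filter_requires() {
    pattern="$1"   # assumed pattern, e.g. ^libnvidia-ml
    grep -E -v -- "$pattern"
}

# Sample input written in the format rpm uses for library requirements.
printf '%s\n' \
    'libc.so.6()(64bit)' \
    'libnvidia-ml.so.1()(64bit)' \
    'libpthread.so.0()(64bit)' |
    filter_requires '^libnvidia-ml'
```

To check a freshly built package, the same filter can be fed from `rpm -qp --requires slurm-20.02.4-1.el7.x86_64.rpm`; once the exclusion is in the spec file, that listing should no longer contain a libnvidia-ml entry at all.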

Let me know if this works for you!

Thanks!
--Tim

Comment 2 Marco Induni 2020-08-10 08:06:51 MDT

> Hi Marco,

Hi Tim,
 
> That is indeed the issue.  rpm checks the list of installed rpms for what
> they provide, and if it doesn't find libnvidia-ml.so, it fails. 

But if I understood correctly, the build doesn't check the rpms and instead looks for the libraries directly (see the log: configure:21370: checking for nvmlInit in -lnvidia-ml ... configure:21404: result: yes), and this enables the nvidia-ml support.
So the NVIDIA support is built, but then the install fails because the requested rpms are not found. I think this is a little odd: the build looks for one set of files and the install for another.

I can cope with the spec workaround anyway (thank you for the attachment), but maybe this is worth mentioning in the documentation.


Thank you
Marco

Comment 4 Tim McMullan 2020-08-12 10:19:04 MDT

Hey Marco,

I agree that it is a little odd, and I think we are actually doing the same operation to allow manual installation of pmix.  I formalized the workaround into a patch and have it up for review.  If it is decided to not use that, I will make sure the documentation is clear!

Thanks!
--Tim

Comment 8 Tim McMullan 2020-10-09 14:19:48 MDT

Hey Marco,

We chatted about this internally and have pushed this into the spec file.  It should start showing up in 20.02.6/20.11!
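
To decide whether a given release still needs the spec workaround, a small version comparison can help. This sketch assumes the change is present in every release from 20.02.6 onward (including 20.11), as stated above, and that `sort -V` (version sort, as in GNU coreutils) is available; `has_nvml_exclude` is a hypothetical helper name.

```shell
#!/bin/sh
# Succeed if the given Slurm version is at or above the first release
# said to carry the libnvidia-ml exclusion (20.02.6).
has_nvml_exclude() {
    version="$1"
    fixed="20.02.6"
    # With version sort, the older of the two strings comes first.
    lowest=$(printf '%s\n%s\n' "$fixed" "$version" | sort -V | head -n 1)
    [ "$lowest" = "$fixed" ]
}

for v in 20.02.4 20.02.6 20.11.0; do
    if has_nvml_exclude "$v"; then
        echo "$v: exclusion expected in the shipped spec file"
    else
        echo "$v: apply the workaround patch"
    fi
done
```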

I'm going to close this out for now, but let me know if you have any questions!

Thanks!
--Tim

Comment 9 Tim Wickberg 2021-02-11 12:32:31 MST

*** Bug 7919 has been marked as a duplicate of this bug. ***