Bug 10288

Summary: Slurm fails to build with rpmbuild for PMIx and UCX
Product: Slurm
Reporter: Misha Ahmadian <misha.ahmadian>
Component: Build System and Packaging
Assignee: Tim McMullan <mcmullan>
Status: RESOLVED FIXED
QA Contact:
Severity: 3 - Medium Impact
Priority: ---
CC: felip.moll, mcmullan
Version: 20.11.0
Hardware: Linux
OS: Linux
Site: TTU
Linux Distro: CentOS
Version Fixed: 20.11.1
Attachments:
  bug10288 workaround
  Slurm rpmbuild Log file (after the spec.patch was applied)
  Slurm rpmbuild exit status log
  mlx ucx fix

Description Misha Ahmadian 2020-11-24 19:59:01 MST
Hello,

I'm trying to build RPMs for Slurm 20.11.0 with rpmbuild, including PMIx and UCX support. However, slurm.spec keeps looking for the PMIx RPM rather than using the path I provide through the rpmbuild macros.

I installed PMIx on an NFS location that is shared between all the nodes. I also have a specific version of UCX on that NFS location, which comes from the "hpcx" package and is optimized for the Mellanox drivers. When I try to build Slurm with the following macro definitions, it prints an error and stops:

# rpmbuild --define '_with_pmix --with-pmix=/opt/apps/nfs/custom/pmix/install/3.2' --define '_with_ucx --with-ucx=/opt/apps/nfs/custom/ucx' -ta slurm-20.11.0.tar.bz2 --with slurmrestd

warning: Macro expanded in comment on line 22: %_prefix path		install path for commands, libraries, etc.

warning: Macro expanded in comment on line 26: %_with_slurmrestd 1	build slurmrestd

warning: Macro expanded in comment on line 32: %_with_mysql 1		require mysql/mariadb support

warning: Macro expanded in comment on line 36: %_with_ucx path		require ucx support

warning: Macro expanded in comment on line 37: %_with_pmix path	require pmix support

warning: Macro expanded in comment on line 170: %define _unpackaged_files_terminate_build      0

error: Failed build dependencies:
	pmix is needed by slurm-20.11.0-1.el8.x86_64

There is no error for UCX because we also have an older version of UCX that was installed by RPM under /usr, and rpmbuild picks that up instead of the provided path. (Both the PMIx and UCX directories under the NFS location contain the required libs and header files.)

I also looked into the slurm.spec file, and it looks like for both PMIx and UCX it keeps searching for the RPM packages and ignores the path that is defined in the macro.
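The claim that a custom prefix contains the required headers and libraries can be checked before invoking rpmbuild. The following is a minimal sketch, not part of Slurm: the function name `check_prefix` is hypothetical, and it assumes the conventional layout of headers under `include/` and shared objects under `lib/` or `lib64/`.

```shell
# check_prefix PREFIX HEADER LIBNAME
# Succeeds when PREFIX/include/HEADER exists and a shared object named
# libLIBNAME.so exists under PREFIX/lib or PREFIX/lib64.
# (Hypothetical helper; the directory layout is an assumption.)
check_prefix() {
    prefix=$1 header=$2 lib=$3
    [ -f "$prefix/include/$header" ] || {
        echo "missing header: $prefix/include/$header" >&2
        return 1
    }
    for d in lib lib64; do
        [ -e "$prefix/$d/lib$lib.so" ] && return 0
    done
    echo "missing library: lib$lib.so under $prefix/lib or $prefix/lib64" >&2
    return 1
}

# Example usage with the paths from this report (the PMIx header and
# library names are the usual ones and are assumed here):
#   check_prefix /opt/apps/nfs/custom/ucx ucp/api/ucp.h ucp
#   check_prefix /opt/apps/nfs/custom/pmix/install/3.2 pmix.h pmix
```

A check like this only confirms that the files exist; it does not prove that the spec file or configure will use them.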

slurm.spec:
...
%if %{with pmix}
BuildRequires: pmix
%global pmix_version %(rpm -q pmix --qf "%{VERSION}")
%endif

%if %{with ucx}
BuildRequires: ucx-devel
%global ucx_version %(rpm -q ucx-devel --qf "%{VERSION}")
%endif

I'm wondering if there is something that I'm missing during the installation, or any workaround that can help get this fixed.

I've been following this document to install Slurm with PMIx:
https://slurm.schedmd.com/mpi_guide.html

Best Regards,
Misha
Comment 1 Tim McMullan 2020-11-25 10:13:58 MST
Created attachment 16827 [details]
bug10288 workaround

Hi Misha,

I've attached a patch to the slurm.spec file that should allow this to work.  Would you give this a try and let me know if it works for you?

Thanks!
--Tim
Comment 2 Misha Ahmadian 2020-11-25 13:57:25 MST
Created attachment 16840 [details]
Slurm rpmbuild Log file (after the spec.patch was applied)

Hi Tim,

Thanks for your response. It looks like it worked partially, but it now gives me a different error: it complains about the UCX location, which is very strange. Please find the attached build log file and the build exit status file for more details.

I used the following command to build Slurm after applying the patch:

rpmbuild --define '_with_pmix --with-pmix=/opt/apps/nfs/custom/ext-libs/pmix/install/3.2' --define '_with_ucx --with-ucx=/opt/apps/nfs/custom/ext-libs/ucx' -ta slurm-20.11.0-patched.tar.bz2 --with slurmrestd --with mysql --with ofed 2>&1 | tee build.log

As you can see in the build.log file, the configure process complains about missing ucx under the /opt/apps/nfs/custom/ext-libs/ucx:

...
checking for pmix installation... /opt/apps/nfs/custom/ext-libs/pmix/install/3.2
checking for freeipmi installation... /usr
checking for rrdtool installation... /usr
checking for ucx installation... /opt/apps/nfs/custom/ext-libs/ucx
checking for ucp_cleanup in -lucp... no
configure: error: unable to locate ucx installation
error: Bad exit status from /var/tmp/rpm-tmp.TpUh8e (%build)

However, I am able to locate the ucp_cleanup function in both the libucp.so library and the ucp.h header:

# grep -ir ucp_cleanup /opt/apps/nfs/custom/ext-libs/ucx/lib/
Binary file /opt/apps/nfs/custom/ext-libs/ucx/lib/libucp.so.0.0.0 matches
Binary file /opt/apps/nfs/custom/ext-libs/ucx/lib/libucp.a matches

# grep -ir ucp_cleanup /opt/apps/nfs/custom/ext-libs/ucx/include/
/opt/apps/nfs/custom/ext-libs/ucx/include/ucp/api/ucp.h:void ucp_cleanup(ucp_context_h context_p);

There is also a proper symlink to libucp.so.0.0.0:

# ls -l ucp_cleanup /opt/apps/nfs/custom/ext-libs/ucx/lib/libucp.*
ls: cannot access 'ucp_cleanup': No such file or directory
-rw-r--r-- 1 root root 10157598 Oct 13 10:10 /opt/apps/nfs/custom/ext-libs/ucx/lib/libucp.a
-rw-r--r-- 1 root root      933 Oct 13 10:10 /opt/apps/nfs/custom/ext-libs/ucx/lib/libucp.la
lrwxrwxrwx 1 root root       15 Oct 13 10:10 /opt/apps/nfs/custom/ext-libs/ucx/lib/libucp.so -> libucp.so.0.0.0
lrwxrwxrwx 1 root root       15 Oct 13 10:10 /opt/apps/nfs/custom/ext-libs/ucx/lib/libucp.so.0 -> libucp.so.0.0.0
-rwxr-xr-x 1 root root  3852456 Oct 13 10:10 /opt/apps/nfs/custom/ext-libs/ucx/lib/libucp.so.0.0.0

I'm not sure what prevents the Slurm rpmbuild (or the spec file) from finding UCX under the given directory. Perhaps it ignores the subdirectories (bin, lib, include, ...)? Any idea on this?
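One thing that could be checked at this point, as an assumption that this report does not confirm, is whether libucp.so itself depends on other shared libraries that cannot be resolved from the default search path. An unresolved dependency can make a link test fail even though the symbol is present in the library. The sketch below filters `ldd` output for such entries; the helper name `unresolved_libs` is hypothetical.

```shell
# unresolved_libs
# Read `ldd` output on stdin and print the name of every dependency that
# the dynamic loader reports as "not found".
# (Hypothetical helper, not part of Slurm or UCX.)
unresolved_libs() {
    awk '/=> not found/ { print $1 }'
}

# Example usage with the library path from this report:
#   ldd /opt/apps/nfs/custom/ext-libs/ucx/lib/libucp.so | unresolved_libs
```

An empty result would suggest that the dependencies resolve in the current environment, in which case the cause lies elsewhere.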

Best,
Misha
Comment 3 Misha Ahmadian 2020-11-25 13:58:23 MST
Created attachment 16841 [details]
Slurm rpmbuild exit status log
Comment 4 Tim McMullan 2020-11-25 14:47:22 MST
Thanks for the logs Misha!

It looks like the patch did what it was supposed to do at least!

configure just tries to link a test program, and from a quick check it appears to expect the library to be in $ucx_path/lib.

I did a quick check against my ucx install on rhel8 and it seemed to find the library ok.  What version of ucx are you trying to build against?

Can you try doing just a "configure --with-ucx=/opt/apps/nfs/custom/ext-libs/ucx" without rpmbuild?  This might help narrow down if rpmbuild is getting in the way or if something else might be going on.
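When configure is run on its own like this, config.log records the exact compiler and linker command used for each check along with its error output. A minimal sketch for pulling out the relevant section follows; the helper name `show_check` and the default of 15 context lines are illustrative choices, and it relies on `grep -A`, which GNU and BSD grep provide.

```shell
# show_check LOGFILE PATTERN [CONTEXT]
# Print each line of LOGFILE that contains "checking for PATTERN", followed
# by CONTEXT lines of trailing context (default 15). In config.log those
# lines normally hold the link command and any error messages.
# (Hypothetical helper, not part of Slurm.)
show_check() {
    grep -n -A "${3:-15}" -- "checking for $2" "$1"
}

# Example usage, run from the Slurm source tree:
#   ./configure --with-ucx=/opt/apps/nfs/custom/ext-libs/ucx
#   show_check config.log ucp_cleanup
```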

Thanks!
--Tim
Comment 5 Misha Ahmadian 2020-11-25 16:14:04 MST
Created attachment 16843 [details]
HPCx-UCX-1.9

Hi Tim,

Running "configure --with-ucx=/opt/apps/nfs/custom/ext-libs/ucx" gives the same error. Here is the thing: we have two versions of UCX on our system:

1) UCX v1.9, which comes from the Mellanox HPCx package; we simply copied it to a shared location. It already works with our OpenMPI with no issue and performs better than the non-Mellanox version. This version is located under "/opt/apps/nfs/custom/ext-libs/ucx" and has the following directory layout:
(bin  debug  include  lib  mt  prof  share)

However, this is the version for which the Slurm build (rpmbuild or configure) cannot find UCX (libucp.so) under that location.

2) UCX v1.8, which was installed via yum on all the nodes just for testing:

ucx-1.8.0-1.50218.x86_64
ucx-devel-1.8.0-1.50218.x86_64
ucx-knem-1.8.0-1.50218.x86_64
ucx-rdmacm-1.8.0-1.50218.x86_64

This version does not perform as well as the Mellanox version. However, since it is installed under /usr, both rpmbuild and configure can pick up libucp.so with no issue:

# ./configure --with-ucx=/usr
...
checking for ucx installation... /usr
checking for ucp_cleanup in -lucp... yes
configure: ucx checking result: /usr/lib64
...

My guess is Slurm is looking for lib64 directory instead of (both lib & lib64) under the given root directory (Not sure if that's correct). Otherwise, it might be something strange with our version of UCX.

I've attached the HPCx ucx-1.9 in case you need it for testing.

Best,
Misha
Comment 6 Misha Ahmadian 2020-11-25 19:03:35 MST
(In reply to Misha Ahmadian from comment #5)
> 
> My guess is Slurm is looking for lib64 directory instead of (both lib &
> lib64) under the given root directory (Not sure if that's correct).
> Otherwise, it might be something strange with our version of UCX.
> 

I created a "lib64" symlink to the "lib" directory under "/opt/apps/nfs/custom/ext-libs/ucx", and that didn't help. There must be something else that prevents Slurm from finding UCX in that location, where OpenMPI can.

Best,
Misha
Comment 7 Tim McMullan 2020-11-30 06:12:05 MST
I was able to reproduce the issue. A viable workaround for now should be to add the path to ucx to LD_LIBRARY_PATH before running the rpmbuild command.
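A minimal sketch of that workaround follows. It assumes that "the path to ucx" means the `lib` subdirectory of the UCX prefix, which the comment does not spell out, and the helper name `prepend_ld_path` is hypothetical.

```shell
# prepend_ld_path DIR
# Print DIR followed by the current LD_LIBRARY_PATH, without leaving a
# trailing colon when the variable is unset or empty.
# (Hypothetical helper; the choice of the lib subdirectory is an assumption.)
prepend_ld_path() {
    printf '%s%s\n' "$1" "${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
}

# Example usage with the paths and rpmbuild options from this report:
#   export LD_LIBRARY_PATH=$(prepend_ld_path /opt/apps/nfs/custom/ext-libs/ucx/lib)
#   rpmbuild --define '_with_pmix --with-pmix=/opt/apps/nfs/custom/ext-libs/pmix/install/3.2' \
#            --define '_with_ucx --with-ucx=/opt/apps/nfs/custom/ext-libs/ucx' \
#            -ta slurm-20.11.0-patched.tar.bz2 --with slurmrestd --with mysql --with ofed
```

Avoiding the trailing colon matters because an empty entry in LD_LIBRARY_PATH is treated as the current directory.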

I'm working on a patch that should get this working without setting anything, but setting LD_LIBRARY_PATH should let it be detected!

Let me know if that works for you!
--Tim
Comment 8 Misha Ahmadian 2020-11-30 07:38:13 MST
Hi Tim,

That sounds good. I'll try compiling Slurm with LD_LIBRARY_PATH set, and meanwhile I look forward to the patch from you.

Best,
Misha
Comment 10 Tim McMullan 2020-11-30 10:17:41 MST
Created attachment 16864 [details]
mlx ucx fix

Hi Misha,

I've attached a patch you can try that updates the configure script so you shouldn't have to set LD_LIBRARY_PATH up before building the RPM.  This worked for me locally when using the HPCx version of UCX.

Let me know how it goes for you!
Thanks!
--Tim
Comment 11 Misha Ahmadian 2020-11-30 20:26:09 MST
Hi Tim,

That works perfectly now. I appreciate your help.

Best,
Misha
Comment 15 Tim McMullan 2020-12-09 05:53:36 MST
The patches for this have been merged and should land in 20.11.1!

Let us know if you have any other issues!

Thanks!
--Tim
Comment 16 Misha Ahmadian 2020-12-09 13:28:25 MST
Hi Tim,

Thanks. I have one more thing to add: the last time I tried to rebuild the RPMs with the given patches (including the latest one that fixed the UCX build issue), I realized that I still have to set LD_LIBRARY_PATH and point it to the external UCX location.

It looks like the first time, when I said the rpmbuild passed successfully after applying your patches, I already had LD_LIBRARY_PATH set in my environment, so I didn't get a chance to report it.

I'm not sure if you've already noticed that, but hopefully, this would also help you. 
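To rule out this kind of false positive when testing a build fix, the build can be run with LD_LIBRARY_PATH explicitly removed from the environment. The sketch below uses `env -u`, which GNU coreutils and the BSDs provide but POSIX does not require; the wrapper name `run_without_ld_path` is hypothetical.

```shell
# run_without_ld_path COMMAND [ARGS...]
# Run COMMAND with LD_LIBRARY_PATH removed from its environment, so that a
# value left over in the calling shell cannot mask a build problem.
# (Hypothetical helper; relies on the non-POSIX `env -u` option.)
run_without_ld_path() {
    env -u LD_LIBRARY_PATH "$@"
}

# Example usage with the tarball name from this report:
#   run_without_ld_path rpmbuild \
#       --define '_with_ucx --with-ucx=/opt/apps/nfs/custom/ext-libs/ucx' \
#       -ta slurm-20.11.0-patched.tar.bz2
```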

Best,
Misha
Comment 17 Tim Wickberg 2022-02-25 13:37:09 MST
*** Bug 8647 has been marked as a duplicate of this bug. ***