Summary: | Slurm fails to build with rpmbuild for PMIx and UCX | ||
---|---|---|---|
Product: | Slurm | Reporter: | Misha Ahmadian <misha.ahmadian> |
Component: | Build System and Packaging | Assignee: | Tim McMullan <mcmullan> |
Status: | RESOLVED FIXED | QA Contact: | |
Severity: | 3 - Medium Impact | ||
Priority: | --- | CC: | felip.moll, mcmullan |
Version: | 20.11.0 | ||
Hardware: | Linux | ||
OS: | Linux | ||
Site: | TTU | Alineos Sites: | --- |
Atos/Eviden Sites: | --- | Confidential Site: | --- |
Coreweave sites: | --- | Cray Sites: | --- |
DS9 clusters: | --- | HPCnow Sites: | --- |
HPE Sites: | --- | IBM Sites: | --- |
NOAA SIte: | --- | OCF Sites: | --- |
Recursion Pharma Sites: | --- | SFW Sites: | --- |
SNIC sites: | --- | Linux Distro: | CentOS |
Machine Name: | CLE Version: | ||
Version Fixed: | 20.11.1 | Target Release: | --- |
DevPrio: | --- | Emory-Cloud Sites: | --- |
Attachments: |
bug10288 workaround
Slurm rpmbuild Log file (after the spec.patch was applied)
Slurm rpmbuild exit status log
mlx ucx fix |
Description
Misha Ahmadian
2020-11-24 19:59:01 MST
Created attachment 16827 [details]
bug10288 workaround
Hi Misha,
I've attached a patch to the slurm.spec file that should allow this to work. Would you give this a try and let me know if it works for you?
Thanks!
--Tim

Created attachment 16840 [details]
Slurm rpmbuild Log file (after the spec.patch was applied)
Hi Tim,
Thanks for your response. It looks like it worked partially, but it now gives me a different error: it complains about the UCX location, which is very strange. Please find the attached build log file and build exit status file for more details.
I used the following command to build Slurm after applying the patch:
rpmbuild --define '_with_pmix --with-pmix=/opt/apps/nfs/custom/ext-libs/pmix/install/3.2' --define '_with_ucx --with-ucx=/opt/apps/nfs/custom/ext-libs/ucx' -ta slurm-20.11.0-patched.tar.bz2 --with slurmrestd --with mysql --with ofed 2>&1 | tee build.log
As you can see in the build.log file, the configure step complains that UCX is missing under /opt/apps/nfs/custom/ext-libs/ucx:
...
checking for pmix installation... /opt/apps/nfs/custom/ext-libs/pmix/install/3.2
checking for freeipmi installation... /usr
checking for rrdtool installation... /usr
checking for ucx installation... /opt/apps/nfs/custom/ext-libs/ucx
checking for ucp_cleanup in -lucp... no
configure: error: unable to locate ucx installation
error: Bad exit status from /var/tmp/rpm-tmp.TpUh8e (%build)
However, I am able to locate the ucp_cleanup function in both libucp.so and ucp.h:
# grep -ir ucp_cleanup /opt/apps/nfs/custom/ext-libs/ucx/lib/
Binary file /opt/apps/nfs/custom/ext-libs/ucx/lib/libucp.so.0.0.0 matches
Binary file /opt/apps/nfs/custom/ext-libs/ucx/lib/libucp.a matches
# grep -ir ucp_cleanup /opt/apps/nfs/custom/ext-libs/ucx/include/
/opt/apps/nfs/custom/ext-libs/ucx/include/ucp/api/ucp.h:void ucp_cleanup(ucp_context_h context_p);
There is also a proper symlink to libucp.so.0.0.0:
# ls -l ucp_cleanup /opt/apps/nfs/custom/ext-libs/ucx/lib/libucp.*
ls: cannot access 'ucp_cleanup': No such file or directory
-rw-r--r-- 1 root root 10157598 Oct 13 10:10 /opt/apps/nfs/custom/ext-libs/ucx/lib/libucp.a
-rw-r--r-- 1 root root 933 Oct 13 10:10 /opt/apps/nfs/custom/ext-libs/ucx/lib/libucp.la
lrwxrwxrwx 1 root root 15 Oct 13 10:10 /opt/apps/nfs/custom/ext-libs/ucx/lib/libucp.so -> libucp.so.0.0.0
lrwxrwxrwx 1 root root 15 Oct 13 10:10 /opt/apps/nfs/custom/ext-libs/ucx/lib/libucp.so.0 -> libucp.so.0.0.0
-rwxr-xr-x 1 root root 3852456 Oct 13 10:10 /opt/apps/nfs/custom/ext-libs/ucx/lib/libucp.so.0.0.0
I'm not sure what prevents the Slurm rpmbuild (or the spec file) from finding UCX under the given directory. Perhaps it ignores the subdirectories (bin, lib, include, ...)? Any ideas?
Best,
Misha
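For reference, the "checking for ucp_cleanup in -lucp" line has the form of Autoconf's library check, which compiles and links a small test program rather than searching for files, so grep finding the symbol does not guarantee the check passes. Below is a minimal sketch of running an equivalent check by hand. The helper name try_link, the default compiler cc, and the exact flags are assumptions; Slurm's configure script may pass different options.

```shell
#!/bin/sh
# try_link PREFIX LIB SYMBOL  (hypothetical helper, not part of Slurm)
# Declares SYMBOL with a dummy prototype, calls it from main(), and tries
# to link against -lLIB with PREFIX/lib on the library search path.
# Returns the compiler's exit status; linker errors stay visible on stderr.
try_link() {
    prefix=$1
    lib=$2
    symbol=$3
    tmpdir=$(mktemp -d) || return 2
    cat > "$tmpdir/conftest.c" <<EOF
char $symbol (void);
int main (void) { return $symbol (); }
EOF
    # CC defaults to "cc" here; this is an assumption about the build host.
    ${CC:-cc} -o "$tmpdir/conftest" "$tmpdir/conftest.c" \
        -I"$prefix/include" -L"$prefix/lib" -l"$lib"
    status=$?
    rm -rf "$tmpdir"
    return "$status"
}

# Example for the paths in this report:
#   try_link /opt/apps/nfs/custom/ext-libs/ucx ucp ucp_cleanup
```

Run this way, the linker's own message is printed instead of configure's one-line summary. The same failing command and its output should also be recorded in config.log in the build directory.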
Created attachment 16841 [details]
Slurm rpmbuild exit status log
Thanks for the logs Misha!
It looks like the patch did what it was supposed to do at least! configure just tries to link a program, and from a quick check it appears to expect the library under $ucx_path/lib. I also checked against my ucx install on rhel8 and it seemed to find the library ok.
What version of ucx are you trying to build against? Can you try doing just a "configure --with-ucx=/opt/apps/nfs/custom/ext-libs/ucx" without rpmbuild? This might help narrow down if rpmbuild is getting in the way or if something else might be going on.
Thanks!
--Tim

Created attachment 16843 [details]
HPCx-UCX-1.9
Hi Tim,
Running "configure --with-ucx=/opt/apps/nfs/custom/ext-libs/ucx" gives the same error. Here is the thing: we have two versions of UCX on our system:
1) UCX v1.9, which comes from the Mellanox HPCx package; we simply copied it to a shared location. It already works with our OpenMPI with no issues and performs better than the non-Mellanox version. This version is located under "/opt/apps/nfs/custom/ext-libs/ucx" and has the following directory layout:
(bin debug include lib mt prof share)
However, this is the one where the Slurm build (rpmbuild or configure) cannot find UCX (libucp.so).
2) UCX v1.8, which was installed via yum on all the nodes just for testing:
ucx-1.8.0-1.50218.x86_64
ucx-devel-1.8.0-1.50218.x86_64
ucx-knem-1.8.0-1.50218.x86_64
ucx-rdmacm-1.8.0-1.50218.x86_64
This version does not perform as well as the Mellanox version. However, since it is installed under /usr, both rpmbuild and configure can pick up libucp.so with no issue:
# ./configure --with-ucx=/usr
...
checking for ucx installation... /usr
checking for ucp_cleanup in -lucp... yes
configure: ucx checking result: /usr/lib64
...
My guess is that Slurm is looking for a lib64 directory rather than both lib and lib64 under the given root directory (not sure if that's correct). Otherwise, it might be something strange with our version of UCX.
I've attached the HPCx ucx-1.9 in case you need it for testing.
Best,
Misha
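The lib versus lib64 guess can be checked directly on the filesystem. The sketch below uses a hypothetical helper, find_lib_dir, that reports which of the two directories under a prefix holds a given shared library. It only tests for the file and is not how Slurm's configure script performs its search.

```shell
#!/bin/sh
# find_lib_dir PREFIX NAME  (hypothetical helper for illustration)
# Prints the first of PREFIX/lib64 and PREFIX/lib that contains libNAME.so.
# Returns non-zero if neither directory contains it.
find_lib_dir() {
    for d in "$1/lib64" "$1/lib"; do
        if [ -e "$d/lib$2.so" ]; then
            printf '%s\n' "$d"
            return 0
        fi
    done
    return 1
}

# Example for the paths in this report:
#   find_lib_dir /opt/apps/nfs/custom/ext-libs/ucx ucp
#   find_lib_dir /usr ucp
```

Given the directory listing earlier in this thread, the first call should report the lib directory of the HPCx copy, since that tree has no lib64.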
(In reply to Misha Ahmadian from comment #5)
> My guess is Slurm is looking for lib64 directory instead of (both lib &
> lib64) under the given root directory (Not sure if that's correct).
> Otherwise, it might be something strange with our version of UCX.
I created a "lib64" symlink to the "lib" directory under "/opt/apps/nfs/custom/ext-libs/ucx" and that didn't help. There must be something else that prevents Slurm from finding ucx in that location where OpenMPI can.
Best,
Misha

I was able to reproduce the issue, and a viable workaround for now should be to add the path to ucx to LD_LIBRARY_PATH before running the rpmbuild command. I'm working on a patch that should get this working without setting anything, but setting LD_LIBRARY_PATH should let it be detected!
Let me know if that works for you!
--Tim

Hi Tim,
That sounds good. I'll try compiling Slurm with LD_LIBRARY_PATH set, and meanwhile, I look forward to having the patch from you.
Best,
Misha

Created attachment 16864 [details]
mlx ucx fix
Hi Misha,
I've attached a patch you can try that updates the configure script so you shouldn't have to set LD_LIBRARY_PATH up before building the RPM. This worked for me locally when using the HPCx version of UCX.
Let me know how it goes for you!
Thanks!
--Tim
Hi Tim,
That works perfectly now. I appreciate your help.
Best,
Misha

The patches for this have been merged and should land in 20.11.1! Let us know if you have any other issues!
Thanks!
--Tim

Hi Tim,
Thanks. I have one more thing to add: the last time I tried to rebuild the RPMs with the given patches (including the latest one that fixed the UCX build issue), I realized that I still have to set LD_LIBRARY_PATH and point it to the external ucx location. It looks like the first time, when I said rpmbuild passed successfully after applying your patches, it was because I already had LD_LIBRARY_PATH set in my environment, and I didn't get a chance to report it. I'm not sure if you've already noticed that, but hopefully this helps.
Best,
Misha

*** Bug 8647 has been marked as a duplicate of this bug. *** |
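Since the last comment reports that LD_LIBRARY_PATH still had to be set, here is a minimal sketch of that workaround. The paths and rpmbuild options are copied from the command quoted earlier in this thread. The helper names with_lib_dir and build_slurm_rpms are hypothetical, and using the lib subdirectory of the UCX prefix is an assumption about which path was meant.

```shell
#!/bin/sh
# Hypothetical helper: print DIR prepended to the current LD_LIBRARY_PATH,
# without leaving a trailing ':' when the variable is unset or empty.
with_lib_dir() {
    printf '%s\n' "$1${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
}

# Install prefixes as given in this report.
UCX_PREFIX=/opt/apps/nfs/custom/ext-libs/ucx
PMIX_PREFIX=/opt/apps/nfs/custom/ext-libs/pmix/install/3.2

# Runs rpmbuild with the UCX library directory (assumed to be
# $UCX_PREFIX/lib) added to LD_LIBRARY_PATH for this one command only.
build_slurm_rpms() {
    LD_LIBRARY_PATH=$(with_lib_dir "$UCX_PREFIX/lib") \
    rpmbuild \
        --define "_with_pmix --with-pmix=$PMIX_PREFIX" \
        --define "_with_ucx --with-ucx=$UCX_PREFIX" \
        -ta slurm-20.11.0-patched.tar.bz2 \
        --with slurmrestd --with mysql --with ofed
}
```

Calling build_slurm_rpms starts the build. Scoping the variable to the single rpmbuild invocation keeps the rest of the shell session unchanged, which makes it easier to tell afterwards whether a successful build depended on the setting.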