Summary: | slurmstepd: task/cgroup: plugin not compiled with hwloc support, skipping affinity. | ||
---|---|---|---|
Product: | Slurm | Reporter: | Hjalti Sveinsson <hjalti.sveinsson> |
Component: | slurmstepd | Assignee: | David Bigagli <david> |
Status: | RESOLVED INFOGIVEN | QA Contact: | |
Severity: | 2 - High Impact | ||
Priority: | --- | CC: | brian, da |
Version: | 14.11.6 | ||
Hardware: | Linux | ||
OS: | Linux | ||
Site: | deCODE | Alineos Sites: | --- |
Atos/Eviden Sites: | --- | Confidential Site: | --- |
Coreweave sites: | --- | Cray Sites: | --- |
DS9 clusters: | --- | HPCnow Sites: | --- |
HPE Sites: | --- | IBM Sites: | --- |
NOAA Site: | --- | OCF Sites: | --- |
Recursion Pharma Sites: | --- | SFW Sites: | --- |
SNIC sites: | --- | Linux Distro: | --- |
Machine Name: | CLE Version: | ||
Version Fixed: | Target Release: | --- | |
DevPrio: | --- | Emory-Cloud Sites: | --- |
Description
Hjalti Sveinsson
2015-05-31 22:04:49 MDT
Hi,

yes, you need to install the hwloc-devel library package and rebuild the binaries. You only need to reinstall slurmd and slurmstepd and restart the slurmd daemons in your cluster. Move the old slurmd and slurmstepd to backup files, install the new binaries, and restart. You don't need to stop all services; slurmd will simply restart and the running jobs are unaffected. The configuration files are untouched by the install in any case.

David

Hi David,

thank you for your response. So I only need to replace /usr/sbin/slurmstepd and /usr/sbin/slurmd on all machines, including the Slurm controller, by moving the old ones to /usr/sbin/slurmstepd-old and /usr/sbin/slurmd-old, and then restart services ("service slurmd restart") on the HPC nodes?

regards,
Hjalti Sveinsson

Hi,

my answer was actually incomplete, sorry about that. You also need to recompile the Slurm library that links with the hwloc library and is used by the affinity plugin. You can build using make install or rebuild the slurm and slurm-plugins RPMs.

David

So basically I will:

1. Install the hwloc and hwloc-devel packages with yum on the machine that I will use to recompile the Slurm RPM packages.
2. Recompile from slurm.14.11.6.tar.bz2: "rpmbuild -ta slurm.*.tar.bz2"
3. Reinstall the recompiled slurm RPM package on one machine. Only the slurm.14.11.6.rpm package.
4. Copy /usr/sbin/slurmstepd and /usr/sbin/slurmd from the machine with the recompiled packages to all other Slurm HPC machines and clients.
5. Restart slurmd on the HPC nodes.

Am I missing something? Do I need to reinstall the recompiled slurm-plugins package on all machines as well?

regards,
Hjalti Sveinsson

Modify step 4. Instead of reinstalling only the binaries I mentioned, it is more straightforward to install the entire RPM. Yes, you need to install slurm-plugins as well, as there are libraries that use the hwloc library. Basically all HPC machines need the new slurmd, slurmstepd and the libraries.

David

Great, thank you, the problem seems to be fixed now. We reinstalled slurm and slurm-plugins after recompiling the RPMs and restarted the slurmd services.

regards,
Hjalti Sveinsson
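
For readers who hit the same message, the procedure agreed on in this thread can be sketched as a small shell script. This is only an illustration under stated assumptions, not a tested deployment procedure: the RPM output directory, the `DRY_RUN` switch, the helper function names, and the use of `rpm --replacepkgs` to reinstall an unchanged version are not from the thread. With the default `DRY_RUN=1` the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch of the rebuild procedure discussed in this thread.
# Assumptions (not from the thread): RPM_DIR, DRY_RUN, and all
# function names below are illustrative only.

DRY_RUN="${DRY_RUN:-1}"                            # 1 = only print commands
RPM_DIR="${RPM_DIR:-$HOME/rpmbuild/RPMS/x86_64}"   # assumed rpmbuild output dir

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Build host: install hwloc and its headers, then rebuild the RPMs
# from the Slurm source tarball given as the first argument.
build_rpms() {
    tarball="$1"
    run yum install -y hwloc hwloc-devel
    run rpmbuild -ta "$tarball"
}

# Every compute node: reinstall both packages, then restart slurmd.
# --replacepkgs asks rpm to reinstall a package whose version is unchanged.
reinstall_node() {
    run rpm -Uvh --replacepkgs \
        "$RPM_DIR"/slurm-14.11.6*.rpm \
        "$RPM_DIR"/slurm-plugins-14.11.6*.rpm
    run service slurmd restart
}

# Return 0 if ldd-style output on stdin mentions libhwloc.
links_hwloc() {
    grep -q 'libhwloc'
}
```

A possible way to use it: call `build_rpms` with the tarball path on the build host, then `reinstall_node` on each compute node, setting `DRY_RUN=0` once the printed commands look right. Afterwards, something like `ldd /usr/lib64/slurm/task_cgroup.so | links_hwloc && echo "hwloc linked"` could confirm that the plugin now links against hwloc; the plugin path here is an assumption and depends on how the packages were built.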