Ticket 1713 - slurmstepd: task/cgroup: plugin not compiled with hwloc support, skipping affinity.
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmstepd
Version: 14.11.6
Hardware: Linux Linux
Importance: --- 2 - High Impact
Assignee: David Bigagli
Reported: 2015-05-31 22:04 MDT by Hjalti Sveinsson
Modified: 2015-06-01 21:33 MDT
Site: deCODE


Description Hjalti Sveinsson 2015-05-31 22:04:49 MDT
Hello,

We recently upgraded from slurm-14.03.5 to 14.11.6, and everything has been going well since.

However, this weekend a user submitting jobs on the cluster got the following error message, and his jobs failed.

Error:
slurmstepd: task/cgroup: plugin not compiled with hwloc support, skipping affinity.

Do we need to install the hwloc-devel package on the machine we use to build the RPM packages, and then reinstall the Slurm packages on all nodes in the cluster, the clients, and the Slurm controller?

If so, do we have to stop all Slurm services during the reinstall? Does the config (slurm.conf) get reinstalled?

Do we only need to reinstall the slurm package, or do we need to reinstall all packages (munge, sql, torque, perlapi, plugins, sjstat, slurmdbd)?

regards,
Hjalti Sveinsson
Comment 1 David Bigagli 2015-06-01 06:13:56 MDT
Hi,
   Yes, you need to install the hwloc-devel package and rebuild the binaries.
You only need to reinstall slurmd and slurmstepd and restart the slurmds
in your cluster. Move the old slurmd and slurmstepd to backup files, install
the new binaries, and restart. You don't need to stop all services; slurmd
will simply restart and the running jobs are unaffected.
The configuration files are untouched by the install in any case.

David
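
The "move the old binaries aside" step from Comment 1 could be sketched as below. This is a minimal illustration, not an official procedure: the `backup_binary` helper and the default `.bak` suffix are hypothetical choices, and later comments in this ticket recommend installing the rebuilt RPMs instead of swapping individual binaries.

```shell
#!/bin/sh
# Move a binary aside so a rebuilt one can be installed in its place.
# backup_binary is a hypothetical helper, not part of Slurm.
backup_binary() {
    # $1: path to the binary
    # $2: suffix for the backup name (default ".bak", an arbitrary choice)
    suffix="${2:-.bak}"
    if [ ! -f "$1" ]; then
        echo "skip: $1 not found" >&2
        return 1
    fi
    mv "$1" "$1$suffix"
}
```

On a compute node this might be called as `backup_binary /usr/sbin/slurmd` and `backup_binary /usr/sbin/slurmstepd` (as root), followed by installing the new binaries and restarting slurmd.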
Comment 2 Hjalti Sveinsson 2015-06-01 06:21:38 MDT
Hi David,

thank you for your response.

So I only need to replace /usr/sbin/slurmstepd and /usr/sbin/slurmd on all machines, including the Slurm controller, by moving the old ones to /usr/sbin/slurmstepd-old and /usr/sbin/slurmd-old, and then restart the service ("service slurmd restart") on the HPC nodes?

regards,
Hjalti Sveinsson
Comment 3 David Bigagli 2015-06-01 06:34:20 MDT
Hi,
   My answer was actually incomplete, sorry about that. You also need to
recompile the Slurm library that links with the hwloc library and is used by the affinity plugin. You can build using make install or rebuild the slurm and
slurm-plugins RPMs.

David
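
One way to check whether a rebuilt plugin actually picked up hwloc is to look at its dynamic dependencies. The sketch below assumes the task/cgroup plugin links libhwloc dynamically and lives at /usr/lib64/slurm/task_cgroup.so; both are assumptions about a typical RPM install, so adjust the path for your site. The helper names are hypothetical.

```shell
#!/bin/sh
# Succeeds when the ldd-style output on stdin lists libhwloc.
mentions_hwloc() {
    grep -q 'libhwloc\.so'
}

# Succeeds when the shared object at $1 is dynamically linked against libhwloc.
plugin_has_hwloc() {
    ldd "$1" 2>/dev/null | mentions_hwloc
}

# Assumed plugin location; override with PLUGIN=/path/to/task_cgroup.so
PLUGIN="${PLUGIN:-/usr/lib64/slurm/task_cgroup.so}"
if plugin_has_hwloc "$PLUGIN"; then
    echo "$PLUGIN: linked against hwloc"
else
    echo "$PLUGIN: no hwloc linkage found (or file missing)"
fi
```

Running this on the build host after the rebuild, and on a compute node after the install, gives a quick sanity check before jobs are submitted again.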
Comment 4 Hjalti Sveinsson 2015-06-01 08:42:20 MDT
So basically I will:

1. Install the hwloc and hwloc-devel packages with yum on the machine that I will use to rebuild the Slurm RPM packages.
2. Rebuild from slurm.14.11.6.tar.bz2 with "rpmbuild -ta slurm.*.tar.bz2".
3. Reinstall the recompiled slurm RPM package on one machine. Only the slurm.14.11.6.rpm package.
4. Copy /usr/sbin/slurmstepd and /usr/sbin/slurmd from the machine with the recompiled packages to all other Slurm HPC machines and clients.
5. Restart slurmd on the HPC nodes.

Am I missing something?

Do I need to reinstall the recompiled slurm-plugins package on all machines as well?

regards,
Hjalti Sveinsson
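
Steps 1 and 2 of the list above could look roughly like this on the build host. It is a sketch only: the `run` and `rebuild_slurm_rpms` helpers and the `DRY_RUN` switch are hypothetical, and the tarball path is passed in by the caller. By default the script only prints the commands.

```shell
#!/bin/sh
# Sketch of the rebuild on the build host. With DRY_RUN=1 (the default here)
# the commands are only printed; set DRY_RUN=0 and run as root to execute them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

rebuild_slurm_rpms() {
    # $1: path to the Slurm source tarball (assumed to be downloaded already)
    run yum install -y hwloc hwloc-devel
    run rpmbuild -ta "$1"
}

# Only act when a tarball path is given on the command line.
if [ -n "${1:-}" ]; then
    rebuild_slurm_rpms "$1"
fi
```

The dry-run default is a deliberate choice so the command sequence can be reviewed before anything is installed or built.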
Comment 5 David Bigagli 2015-06-01 08:47:04 MDT
Modify step 4. Instead of installing only the binaries I mentioned, it is
more straightforward to install the entire RPM. Yes, you need to install
slurm-plugins as well, as it contains libraries that use the hwloc library.

Basically, all HPC machines need the new slurmd, slurmstepd, and the libraries.

David
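
Rolling the rebuilt slurm and slurm-plugins RPMs out to the compute nodes might be scripted along these lines. Everything site-specific here is an assumption: the staging directory, the RPM file name patterns, the use of ssh, and the hypothetical `node_cmd` helper. The script only prints one command line per node so it can be reviewed before being run; `--replacepkgs` is used because the rebuilt packages carry the same version as the installed ones.

```shell
#!/bin/sh
# Emit one ssh command line per node; nothing is executed here.
# RPM_DIR is a hypothetical staging directory holding the rebuilt RPMs on each node.
RPM_DIR="${RPM_DIR:-/tmp/slurm-rpms}"

node_cmd() {
    # $1: node hostname
    printf "ssh %s 'rpm -Uvh --replacepkgs %s/slurm-14.11.6*.rpm %s/slurm-plugins-14.11.6*.rpm && service slurmd restart'\n" \
        "$1" "$RPM_DIR" "$RPM_DIR"
}

# Node names are taken from the command line, e.g.: ./rollout.sh node01 node02
for node in "$@"; do
    node_cmd "$node"
done
```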
Comment 6 Hjalti Sveinsson 2015-06-01 21:33:27 MDT
Great, thank you. The problem seems to be fixed now.

We reinstalled slurm and slurm-plugins after recompiling the RPMs and restarted the slurmd services.

regards,

Hjalti Sveinsson
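
A quick way to confirm the warning has stopped appearing is to count its occurrences in a job output file or in the slurmd log. The sketch below uses a hypothetical `count_hwloc_warnings` helper, and /var/log/slurmd.log is an assumed location (the real one is whatever SlurmdLogFile is set to in slurm.conf).

```shell
#!/bin/sh
# Count lines carrying the hwloc warning from this ticket in a given file.
count_hwloc_warnings() {
    # $1: file to scan; prints 0 when the file is unreadable or has no match
    if [ ! -r "$1" ]; then
        echo 0
        return 0
    fi
    grep -c 'plugin not compiled with hwloc support' "$1" || true
}

# Assumed log location; override with LOG=/path/to/file
count_hwloc_warnings "${LOG:-/var/log/slurmd.log}"
```

Lines logged before the reinstall are still counted, so compare the count before and after running a test job rather than expecting zero.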