Hello, we recently upgraded from slurm-14.03.5 to 14.11.6 and everything has been going well since. However, this weekend a user submitting jobs on the cluster got the following error message and his jobs failed. Error: slurmstepd: task/cgroup: plugin not compiled with hwloc support, skipping affinity. Do we need to install the hwloc-devel package on the machine we use to build the rpm packages, and then reinstall the slurm packages on all nodes in the cluster, the clients and the Slurm controller? If so, do we have to stop all slurm services during the reinstall? Does the config (slurm.conf) get reinstalled? Do we only need to reinstall the slurm package, or all packages (munge, sql, torque, perlapi, plugins, sjstat, slurmdbd)? regards, Hjalti Sveinsson
Hi, yes, you need to install hwloc-devel and rebuild the binaries. You only need to reinstall slurmd and slurmstepd and restart the slurmds in your cluster. Move the old slurmd and slurmstepd to backup files, install the new binaries and restart. You don't need to stop all services; slurmd will simply restart and the running jobs are unaffected. The configuration files are untouched by the install in any case. David
Hi David, thank you for your response. So I only need to replace /usr/sbin/slurmstepd and /usr/sbin/slurmd on all machines, including the Slurm controller, by moving the old ones to /usr/sbin/slurmstepd-old and /usr/sbin/slurmd-old, and then restart the services ("service slurmd restart") on the HPC nodes? regards, Hjalti Sveinsson
Hi, my answer was actually incomplete, sorry about that. You also need to recompile the slurm library that links with the hwloc library and is used by the affinity plugin. You can build using make install or rebuild the slurm and slurm-plugins rpms. David
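One way to confirm that a rebuilt plugin actually picked up hwloc is to look for libhwloc in its `ldd` output. The following is a minimal sketch, not something from the thread: the plugin path `/usr/lib64/slurm/task_cgroup.so` and the helper name `links_hwloc` are assumptions, so adjust them for your install prefix.

```shell
#!/bin/sh
# Reads `ldd` output on stdin; succeeds if a libhwloc line is present.
# (links_hwloc is a hypothetical helper name, not part of Slurm.)
links_hwloc() {
    grep -q 'libhwloc\.so'
}

# Assumed plugin location; override with PLUGIN=/path/to/task_cgroup.so.
PLUGIN="${PLUGIN:-/usr/lib64/slurm/task_cgroup.so}"

if [ -e "$PLUGIN" ]; then
    if ldd "$PLUGIN" | links_hwloc; then
        echo "$PLUGIN: linked against hwloc"
    else
        echo "$PLUGIN: NOT linked against hwloc"
    fi
fi
```

If the plugin is reported as not linked, the build host most likely still lacked hwloc-devel when the rpms were built.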
So basically I will:
1. Install the hwloc and hwloc-devel packages with yum on the machine that I will use to recompile the slurm rpm packages.
2. Recompile from slurm-14.11.6.tar.bz2: "rpmbuild -ta slurm-*.tar.bz2"
3. Reinstall the recompiled slurm rpm package on one machine. Only the slurm-14.11.6 rpm package.
4. Copy /usr/sbin/slurmstepd and /usr/sbin/slurmd from the machine with the recompiled packages to all other Slurm HPC machines and clients.
5. Restart slurmd on the HPC nodes.
Am I missing something? Do I need to reinstall the recompiled slurm-plugins package on all machines as well? regards, Hjalti Sveinsson
Modify step 4. Instead of reinstalling only the binaries I mentioned, it is more straightforward to install the entire rpm. Yes, you need to install slurm-plugins as well, as there are libraries in it that use the hwloc library. Basically, all HPC machines need the new slurmd, slurmstepd and the libraries. David
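The revised procedure can be sketched as a small script. This is only an illustration under stated assumptions: the tarball name, the `RPMDIR` output directory, the rpm file name patterns and the helper names (`run`, `rebuild_on_build_host`, `reinstall_on_node`) are hypothetical and not taken from the thread. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Prints each command by default; set DRY_RUN=0 to actually execute.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Assumed values: adjust the tarball name and the rpmbuild output directory.
TARBALL="${TARBALL:-slurm-14.11.6.tar.bz2}"
RPMDIR="${RPMDIR:-$HOME/rpmbuild/RPMS/x86_64}"

# On the build machine: install hwloc headers, then rebuild the rpms.
rebuild_on_build_host() {
    run yum install -y hwloc hwloc-devel
    run rpmbuild -ta "$TARBALL"
}

# On every HPC node: reinstall slurm and slurm-plugins, then restart slurmd.
reinstall_on_node() {
    # --replacepkgs lets rpm reinstall a package with an unchanged version.
    run rpm -Uvh --replacepkgs "$RPMDIR"/slurm-14.11.6*.rpm "$RPMDIR"/slurm-plugins-14.11.6*.rpm
    run service slurmd restart
}
```

Call `rebuild_on_build_host` once on the build machine and `reinstall_on_node` on each node after the rpms have been copied there. Reviewing the dry-run output first is a cheap way to catch a wrong path before touching a production node.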
Great, thank you. The problem seems to be fixed now. We reinstalled slurm and slurm-plugins after recompiling the rpms and restarted the slurmd services. regards, Hjalti Sveinsson