The systemd service file /usr/lib/systemd/system/slurmd.service (installed by the slurm-16.05.6-1.el7.centos.x86_64 RPM) correctly increases the locked-memory, stack-size, and open-files limits for the slurmd daemon.
This is required for Infiniband or OmniPath fabrics, see for example https://bugs.schedmd.com/show_bug.cgi?id=3363.
Unfortunately, at system boot time /usr/lib/systemd/system/slurmd.service is *not used*, since the daemons are started by /etc/init.d/slurm instead of systemd, even on EL7 (RHEL 7 or CentOS 7) systems. Hence the limits are *not increased* as required at boot time, and we only get the system defaults:
# cat "/proc/$(pgrep -u 0 slurmd)/limits"
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             1029471              1029471              processes
Max open files            4096                 4096                 files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       1029471              1029471              signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us
The workaround is to duplicate the limit settings from /usr/lib/systemd/system/slurmd.service in the file /etc/sysconfig/slurm (sourced by /etc/init.d/slurm at boot time):
echo ulimit -l unlimited -s unlimited -n 51200 >> /etc/sysconfig/slurm
If the limits need to be reconfigured, they must be changed in both files.
Suggestion to developers: Please add the file /etc/sysconfig/slurm to the slurm RPM with the content:
ulimit -l unlimited -s unlimited -n 51200
Note: If slurmd is restarted by "systemctl restart slurmd", the limits in /usr/lib/systemd/system/slurmd.service are honored correctly.
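To confirm which limits a running slurmd actually received (after either the /etc/sysconfig/slurm workaround or a systemctl restart), the soft limits can be read back from /proc. This is a minimal sketch; the helper name soft_limit is hypothetical and not part of Slurm:

```shell
#!/bin/sh
# soft_limit FILE NAME
# Print the soft limit from the row labelled NAME in a file formatted like
# /proc/<pid>/limits. (Hypothetical helper, for illustration only.)
soft_limit() {
    # The row label (e.g. "Max open files") itself contains spaces, so
    # strip the label first; the first remaining field is the soft limit.
    sed -n "s/^$2  *//p" "$1" | awk '{ print $1; exit }'
}
```

For example, soft_limit "/proc/$(pgrep -u 0 slurmd)/limits" 'Max locked memory' should print unlimited once the raised limits are in effect, rather than the 65536 default shown above.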
As concluded in https://bugs.schedmd.com/show_bug.cgi?id=3363, the configuration of /etc/init.d/slurm for Slurm daemons on Systemd OSes (for example, RHEL7/CentOS7) is unwarranted. Only Systemd should be used on such OSes. It is the slurm-16.05.6-1.el7.centos.x86_64 RPM that installs the /etc/init.d/slurm service, and this is due to the lines in the slurm.spec file mentioned in https://bugs.schedmd.com/show_bug.cgi?id=3363#c13.
Until the slurm.spec file can be corrected, a working solution is to disable execution of /etc/init.d/slurm:
chkconfig --del slurm
systemctl enable slurmd
I'm working on a completely revised approach to the slurm.spec file for future releases; but due to the complexity it won't be in the 17.02 release and will need to wait until 17.11 (although should be usable before then if desired).
Does the current workaround to disable the init scripts manually suffice for now?
(In reply to Tim Wickberg from comment #2)
> I'm working on a completely revised approach to the slurm.spec file for
> future releases; but due to the complexity it won't be in the 17.02 release
> and will need to wait until 17.11 (although should be usable before then if
> desired).
> Does the current workaround to disable the init scripts manually suffice for
> now?
I'm fine with the workaround for now, since I understand the problem and have found a workaround. I've spoken to a couple of other Slurm sites, and they have independently discovered the same bug on CentOS 7 systems. I think this init script problem should be shared on the Slurm mailing list, since every site with Systemd-based systems will be affected if they install the Slurm RPMs.
For the record the best workaround for Systemd systems is:
chkconfig --del slurm
rm -f /etc/init.d/slurm
This must be repeated every time Slurm is updated.
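Since the init script comes back with every package update, a small check can be run after each update to spot it. This is a sketch only; the function name check_slurm_init and its optional ROOT argument are hypothetical, added so the check can be tried against a scratch directory:

```shell
#!/bin/sh
# check_slurm_init [ROOT]
# Report whether the SysV init script is installed alongside the systemd
# unit under ROOT (default: the live filesystem). Returns 1 on conflict.
check_slurm_init() {
    root="${1:-}"
    initscript="$root/etc/init.d/slurm"
    unit="$root/usr/lib/systemd/system/slurmd.service"
    if [ -e "$initscript" ] && [ -e "$unit" ]; then
        echo "conflict: $initscript is installed alongside $unit"
        return 1
    fi
    echo "ok: no conflicting init script"
    return 0
}
```

Run with no argument, it inspects the live system; a non-zero exit status means the chkconfig --del and rm -f steps above need to be repeated.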
We've not seen this bug, because we explicitly start the daemons using systemd, but in the meantime I'm happy to ensure that the SysV init file is not included in the RPMs I'm making available via COPR.
Tim - I'd be happy to help with the new .spec file. I'm a Fedora packager, and I know a couple of other people interested in improving it, too.
I've created an enhancement request as bug 3396 that discusses more aggressive changes to the slurm.spec file.
I'm looking into adjusting the existing spec file to install the init scripts OR the service files, but not both as is currently done. That I can have ready for 17.02.
*** Bug 3363 has been marked as a duplicate of this bug. ***
Commit faf9b41362a fixes the slurm.spec file to prevent installation of both the init scripts and systemd service files. This will be included in the 17.02.0 release.
Further work to overhaul our RPM packaging is discussed on bug 3396, and will need to wait until the 17.11 release (although could potentially be used to package 17.02 if desired).