|Summary:||slurmd limits in /usr/lib/systemd/system/slurmd.service are ignored at boot time|
|Component:||Configuration||Assignee:||Tim Wickberg <tim>|
|Status:||RESOLVED FIXED||QA Contact:|
|Severity:||3 - Medium Impact|
|Priority:||---||CC:||adam.huffman, brian.gilmer, matejz|
|Site:||DTU Physics||Alineos Sites:||---|
|Bull/Atos Sites:||---||Confidential Site:||---|
|Cray Sites:||---||HPCnow Sites:||---|
|HPE Sites:||---||IBM Sites:||---|
|NOAA SIte:||---||OCF Sites:||---|
|SFW Sites:||---||SNIC sites:||---|
|Linux Distro:||---||Machine Name:|
|CLE Version:||Version Fixed:||17.02.0|
Description Ole.H.Nielsen@fysik.dtu.dk 2016-12-30 02:10:52 MST
The systemd service file /usr/lib/systemd/system/slurmd.service (installed by the slurm-16.05.6-1.el7.centos.x86_64 RPM) correctly increases some limits for the slurmd daemon:

    LimitNOFILE=51200
    LimitMEMLOCK=infinity
    LimitSTACK=infinity

This is required for InfiniBand or Omni-Path fabrics, see for example https://bugs.schedmd.com/show_bug.cgi?id=3363.

Unfortunately, at system boot time the /usr/lib/systemd/system/slurmd.service is *not used*, since the daemons are started by /etc/init.d/slurm instead of systemd, even on EL7 (RHEL 7 or CentOS 7) systems. Hence the limits are *not increased* as required at boot time, and we only get the system defaults:

    # cat "/proc/$(pgrep -u 0 slurmd)/limits"
    Limit                     Soft Limit           Hard Limit           Units
    Max cpu time              unlimited            unlimited            seconds
    Max file size             unlimited            unlimited            bytes
    Max data size             unlimited            unlimited            bytes
    Max stack size            8388608              unlimited            bytes
    Max core file size        0                    unlimited            bytes
    Max resident set          unlimited            unlimited            bytes
    Max processes             1029471              1029471              processes
    Max open files            4096                 4096                 files
    Max locked memory         65536                65536                bytes
    Max address space         unlimited            unlimited            bytes
    Max file locks            unlimited            unlimited            locks
    Max pending signals       1029471              1029471              signals
    Max msgqueue size         819200               819200               bytes
    Max nice priority         0                    0
    Max realtime priority     0                    0
    Max realtime timeout      unlimited            unlimited            us

The workaround is to duplicate the limit settings from /usr/lib/systemd/system/slurmd.service in the file /etc/sysconfig/slurm (sourced by /etc/init.d/slurm at boot time):

    echo ulimit -l unlimited -s unlimited -n 51200 >> /etc/sysconfig/slurm

If the limits need to be reconfigured, they must be changed in both files.

Suggestion to developers: Please add the file /etc/sysconfig/slurm to the slurm RPM with the content:

    ulimit -l unlimited -s unlimited -n 51200

Note: If slurmd is restarted by "systemctl restart slurmd", the limits in /usr/lib/systemd/system/slurmd.service are honored correctly.
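The check described above can be scripted so that a node is flagged when slurmd runs with the default limits. The following is only a minimal sketch: the function names soft_limit and check_limits are hypothetical, and the expected values are simply the three settings from slurmd.service quoted in the description (systemd's "infinity" shows up as "unlimited" in /proc).

```shell
#!/bin/sh
# Print the soft limit for one named row of a /proc/<pid>/limits style file.
# usage: soft_limit "Max open files" /proc/1234/limits
soft_limit() {
    awk -v name="$1" '
        index($0, name) == 1 {
            # Drop the row label, then take the first remaining column.
            rest = substr($0, length(name) + 1)
            split(rest, f, " ")
            print f[1]
            exit
        }' "$2"
}

# Compare the soft limits in a limits file with the values that
# slurmd.service requests. Returns non-zero if any of them differ.
check_limits() {
    file=$1
    status=0
    while IFS='|' read -r name want; do
        got=$(soft_limit "$name" "$file")
        if [ "$got" != "$want" ]; then
            echo "$name: soft limit is ${got:-missing}, expected $want"
            status=1
        fi
    done <<EOF
Max open files|51200
Max locked memory|unlimited
Max stack size|unlimited
EOF
    return "$status"
}

# Example use on a compute node (run as root):
#   check_limits "/proc/$(pgrep -u 0 slurmd)/limits"
```

Only the soft limit column is compared, which appears sufficient to distinguish the two cases reported here.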
Comment 1 Ole.H.Nielsen@fysik.dtu.dk 2017-01-03 13:05:50 MST
As concluded in https://bugs.schedmd.com/show_bug.cgi?id=3363, the configuration of /etc/init.d/slurm for Slurm daemons on Systemd OSes (for example, RHEL7/CentOS7) is unwarranted. Only Systemd should be used on such OSes.

It is the slurm-16.05.6-1.el7.centos.x86_64 RPM which installs the /etc/init.d/slurm service, and it is due to the lines in the slurm.spec file mentioned in https://bugs.schedmd.com/show_bug.cgi?id=3363#c13.

Until the slurm.spec file can be corrected, a working solution is to disable execution of /etc/init.d/slurm:

    chkconfig --del slurm
    systemctl enable slurmd
Comment 2 Tim Wickberg 2017-01-11 16:51:47 MST
I'm working on a completely revised approach to the slurm.spec file for future releases, but due to the complexity it won't be in the 17.02 release and will need to wait until 17.11 (although it should be usable before then if desired).

Does the current workaround to disable the init scripts manually suffice for now?
Comment 3 Ole.H.Nielsen@fysik.dtu.dk 2017-01-12 00:40:46 MST
(In reply to Tim Wickberg from comment #2)
> I'm working on a completely revised approach to the slurm.spec file for
> future releases; but due to the complexity it won't be in the 17.02 release
> and will need to wait until 17.11 (although should be usable before then if
> desired).
>
> Does the current workaround to disable the init scripts manually suffice for
> now?

I'm fine with the workaround for now, since I understand the problem and found a workaround.

I've spoken to a couple of other Slurm sites, and they have independently discovered the same bug on CentOS 7 systems. I think this init scripts problem should be shared on the Slurm mailing list, since every site with Systemd based systems will be affected if they install the Slurm RPMs.

For the record the best workaround for Systemd systems is:

    chkconfig --del slurm
    rm -f /etc/init.d/slurm

This must be repeated every time Slurm is updated.
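Since the workaround has to be repeated after every update, a small check can tell whether the RPM has put the init script back next to the systemd unit. This is a sketch under assumptions: the function name slurm_boot_files and its optional ROOT argument are hypothetical (the argument exists only so the check can be tried against a scratch directory), while the two paths and the commands in the usage comment are the ones quoted in this bug.

```shell
#!/bin/sh
# Report which start-up mechanisms for slurmd are installed under a root
# directory. The optional ROOT argument is a hypothetical convenience for
# testing against a scratch tree; it defaults to the real filesystem.
slurm_boot_files() {
    root=${1:-}
    init_script="$root/etc/init.d/slurm"
    unit_file="$root/usr/lib/systemd/system/slurmd.service"

    if [ -e "$init_script" ] && [ -e "$unit_file" ]; then
        echo "both"      # the conflicting state described in this bug
    elif [ -e "$init_script" ]; then
        echo "sysv"
    elif [ -e "$unit_file" ]; then
        echo "systemd"
    else
        echo "none"
    fi
}

# Example use after a package update (run as root):
#   if [ "$(slurm_boot_files)" = "both" ]; then
#       chkconfig --del slurm
#       rm -f /etc/init.d/slurm
#       systemctl enable slurmd
#   fi
```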
Comment 4 Adam Huffman 2017-01-12 08:18:56 MST
We've not seen this bug, because we explicitly start using SystemD, but I'm happy to ensure that the SysV init file is not included in the RPMs I'm making available via COPR, in the meantime. Tim - I'd be happy to help with the new .spec file. I'm a Fedora packager, and I know a couple of other people interested in improving it, too.
Comment 5 Tim Wickberg 2017-01-12 11:35:58 MST
I've created an enhancement request as bug 3396 that discusses more aggressive changes to the slurm.spec file. I'm looking into adjusting the existing spec file to install the init scripts OR the service files, but not both as is currently done. That I can have ready for 17.02.
Comment 6 Tim Wickberg 2017-01-25 11:32:53 MST
*** Bug 3363 has been marked as a duplicate of this bug. ***
Comment 7 Tim Wickberg 2017-02-22 18:37:08 MST
Commit faf9b41362a fixes the slurm.spec file to prevent installation of both the init scripts and systemd service files. This will be included in the 17.02.0 release.

Further work to overhaul our RPM packaging is discussed on bug 3396, and will need to wait until the 17.11 release (although it could potentially be used to package 17.02 if desired).

- Tim