Hi Support,

On compute nodes we have this issue with slurmd/systemctl: "Can't open PID file". Please see below.

[root@c1-compute-1 ~]# systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2020-01-23 18:03:05 AEDT; 16h ago
  Process: 45980 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 45982 (slurmd)
    Tasks: 1
   Memory: 2.4M
   CGroup: /system.slice/slurmd.service
           └─45982 /usr/sbin/slurmd

Jan 23 18:03:05 c1-compute-1.wehi.edu.au systemd[1]: Starting Slurm node daemon...
Jan 23 18:03:05 c1-compute-1.wehi.edu.au systemd[1]: Can't open PID file /var/run/slurm/slurmd.pid (yet?) after start: No such file or directory
Jan 23 18:03:05 c1-compute-1.wehi.edu.au systemd[1]: Started Slurm node daemon.

slurmd itself appears to work properly; it is just systemctl reporting this issue.

The service unit file:

[root@c1-compute-1 ~]# cat /usr/lib/systemd/system/slurmd.service
[Unit]
Description=Slurm node daemon
After=munge.service network.target remote-fs.target
ConditionPathExists=/etc/slurm/slurm.conf

[Service]
Type=forking
EnvironmentFile=-/etc/sysconfig/slurmd
ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
PIDFile=/var/run/slurm/slurmd.pid
User=root
Group=slurm
KillMode=process
LimitNOFILE=131072
LimitMEMLOCK=infinity
LimitSTACK=infinity
Delegate=yes

[Install]
WantedBy=multi-user.target

The permissions on the directory and PID file:

[root@c1-compute-1 ~]# ls -al /var/run/slurm/
total 4
drwxr-xr-x  2 slurm slurm  60 Jan 23 18:03 .
drwxr-xr-x 27 root  root  840 Jan 23 17:39 ..
-rw-r--r--  1 root  slurm   6 Jan 23 18:03 slurmd.pid

Thank you,
Laszlo
Hi Laszlo, This is happening because slurmd creates the PID file slightly after systemd first tries to read it. For commands where systemd needs to know the PID (e.g. systemctl restart slurmd.service), it will re-read the file, which appears to be created properly. From a functional standpoint, this error shouldn't have any impact on systemd or Slurm. That said, I'm testing some changes to how we launch slurmd et al. in the unit files that should fix this problem. Is the error reported by systemd causing you any issues? Thank you! --Tim (McMullan)
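One way to confirm that the message is only cosmetic is to compare the PID that systemd tracks with the PID that slurmd writes to its PID file. The following is a minimal sketch, not part of Slurm; the helper name pid_file_matches is hypothetical, and it assumes `systemctl show -p MainPID <unit>` prints a single line of the form "MainPID=12345".

```shell
#!/bin/sh
# Hypothetical helper: return 0 when the PID stored in pid_file equals
# the PID reported by systemd.
# show_output is the text printed by `systemctl show -p MainPID <unit>`,
# assumed to look like "MainPID=12345".
pid_file_matches() {
    pid_file="$1"
    show_output="$2"
    # Strip the "MainPID=" prefix and compare with the file contents.
    [ "$(cat "$pid_file")" = "${show_output#MainPID=}" ]
}

# On a compute node this could be used as:
#   pid_file_matches /var/run/slurm/slurmd.pid \
#       "$(systemctl show -p MainPID slurmd)" && echo "PID file and systemd agree"
```

If the two values agree once the daemon is up, systemd has picked up the right main process despite the warning logged at start.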
Hi Tim, I have not observed any functional problems at all. If you can fix the timing or find a workaround, that would be great. Cheers, Laszlo
Hey Laszlo, The quickest workaround is to comment out the "PIDFile=" line in the unit file and do a daemon-reload. Instead of reading the PID file we write out, systemd will "guess" the main PID (and in my tests it does so correctly). I'm still testing a more proper solution and will keep you posted here on it! Thanks! --Tim
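The workaround could be scripted roughly as follows. This is a sketch only: the helper name comment_out_pidfile is hypothetical, and the unit file path in the usage comment is the one shown earlier in this ticket.

```shell
#!/bin/sh
# Hypothetical helper: comment out any active PIDFile= line in a
# systemd unit file. The unit path is passed in as the first argument.
comment_out_pidfile() {
    unit_file="$1"
    # Prefix the line with '#' so systemd ignores it; sed keeps the
    # original file as a .bak copy next to it.
    sed -i.bak 's/^PIDFile=/#&/' "$unit_file"
}

# On a compute node (as root), the steps would look like:
#   comment_out_pidfile /usr/lib/systemd/system/slurmd.service
#   systemctl daemon-reload
#   systemctl restart slurmd
```

Note that a unit file under /usr/lib/systemd/system may be replaced on a package update. Copying it to /etc/systemd/system and editing the copy there keeps the change in place, since systemd prefers units in /etc.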
We've landed a change to the suggested unit files that runs the slurm daemons in the foreground for systemd. This should work for 19.05 as well, but is currently set for 20.11. I think for now though, just running with "PIDFile=" commented out in the unit file should be fine! Thanks! --Tim
Thanks Tim, I will try your suggested workaround.
Hi! Just wanted to check in and make sure this was working for you! Thanks! --Tim
I'm closing this for now since the patch has landed. Feel free to re-open if you still have the same problem, or open a new ticket if a new issue arises! Thanks! --Tim
Tested with 20.02 and the problem is still there.
Sorry about that, I should have mentioned it in the comment! The change we decided to make to the suggested unit files was different enough that we chose to put it in 20.11 but not 20.02. For now, I would suggest continuing to run without the "PIDFile=" line or switching to the style suggested for 20.11. Thanks! --Tim
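For sites that prefer the foreground style on an older release, the unit file from this ticket could be converted along the following lines. This is a sketch under assumptions, not the shipped 20.11 unit file: -D is slurmd's documented option for running in the foreground, while Type=simple and the helper name make_foreground_unit are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper: derive a foreground-style unit from a forking one.
#   src: existing unit file (Type=forking with a PIDFile= line)
#   dst: where to write the converted unit
make_foreground_unit() {
    src="$1"
    dst="$2"
    # 1. Type=simple (an assumption): systemd treats the process it
    #    starts as the main PID, so no PID file is needed.
    # 2. Add -D so slurmd stays in the foreground.
    # 3. Drop the PIDFile= line entirely.
    sed -e 's/^Type=forking$/Type=simple/' \
        -e 's|^\(ExecStart=.*/slurmd\) |\1 -D |' \
        -e '/^PIDFile=/d' "$src" > "$dst"
}

# On a compute node (as root), for example:
#   make_foreground_unit /usr/lib/systemd/system/slurmd.service \
#       /etc/systemd/system/slurmd.service
#   systemctl daemon-reload
#   systemctl restart slurmd
```

It is worth comparing the result against the unit file shipped with 20.11 before relying on it, since that file is the authoritative version of this style.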