Bug 8388 - slurmd/systemctl can't open pid file
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 19.05.2
Hardware: Linux Linux
Importance: --- 4 - Minor Issue
Assignee: Tim McMullan
Reported: 2020-01-23 16:17 MST by dl_support-hpc
Modified: 2023-02-08 11:13 MST

Site: WEHI
Version Fixed: 20.11.0pre1
Description dl_support-hpc 2020-01-23 16:17:33 MST
Hi Support,

On our compute nodes, systemd reports "Can't open PID file" when starting slurmd. Please see below.

[root@c1-compute-1 ~]# systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2020-01-23 18:03:05 AEDT; 16h ago
  Process: 45980 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 45982 (slurmd)
    Tasks: 1
   Memory: 2.4M
   CGroup: /system.slice/slurmd.service
           └─45982 /usr/sbin/slurmd

Jan 23 18:03:05 c1-compute-1.wehi.edu.au systemd[1]: Starting Slurm node daemon...
Jan 23 18:03:05 c1-compute-1.wehi.edu.au systemd[1]: Can't open PID file /var/run/slurm/slurmd.pid (yet?) after start: No such file or directory
Jan 23 18:03:05 c1-compute-1.wehi.edu.au systemd[1]: Started Slurm node daemon.

slurmd itself appears to work properly; it is just systemctl that reports this issue.

The service unit file:

[root@c1-compute-1 ~]# cat /usr/lib/systemd/system/slurmd.service
[Unit]
Description=Slurm node daemon
After=munge.service network.target remote-fs.target
ConditionPathExists=/etc/slurm/slurm.conf

[Service]
Type=forking
EnvironmentFile=-/etc/sysconfig/slurmd
ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
PIDFile=/var/run/slurm/slurmd.pid
User=root
Group=slurm
KillMode=process
LimitNOFILE=131072
LimitMEMLOCK=infinity
LimitSTACK=infinity
Delegate=yes


[Install]
WantedBy=multi-user.target

The permission on the directory and pid file:

[root@c1-compute-1 ~]# ls -al /var/run/slurm/
total 4
drwxr-xr-x  2 slurm slurm  60 Jan 23 18:03 .
drwxr-xr-x 27 root  root  840 Jan 23 17:39 ..
-rw-r--r--  1 root  slurm   6 Jan 23 18:03 slurmd.pid

Thank you,
Laszlo
Comment 1 Tim McMullan 2020-01-30 07:32:53 MST
Hi Laszlo,

This is happening because we create the PID file slightly after systemd tries to read it.  For commands where systemd needs to know the PID (e.g. systemctl restart slurmd.service), it will re-read the file (which appears to be getting created properly).  From a functional standpoint, this error shouldn't have any impact on systemd or Slurm.
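
One way to confirm that systemd ends up with the right PID despite the startup message is to compare the PID file with the main PID that systemd reports. The following is only a sketch: the helper name pid_file_matches is made up for illustration, and the PID file path is the one from the unit file above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm or systemd):
# succeeds only if the PID stored in the given file equals the expected PID.
pid_file_matches() {
    pidfile=$1
    expected=$2
    # A missing or unreadable PID file counts as a mismatch.
    [ -r "$pidfile" ] || return 1
    # Take the first line and strip any whitespace around the number.
    actual=$(head -n 1 "$pidfile" | tr -d '[:space:]')
    [ -n "$actual" ] && [ "$actual" = "$expected" ]
}

# Possible usage on a compute node (path assumed from the unit file above):
#   main_pid=$(systemctl show -p MainPID slurmd | cut -d= -f2)
#   pid_file_matches /var/run/slurm/slurmd.pid "$main_pid" && echo "PID file matches"
```

If the two values agree once slurmd is up, the message at start time is only the timing issue described above.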

That said, I'm testing some changes to how we launch slurmd et al in unit files that should fix this problem.

Is the error reported by systemd causing you any issues?

Thank you!
--Tim (McMullan)
Comment 2 dl_support-hpc 2020-02-02 15:42:46 MST
Hi Tim,

I have not observed any functional problems at all. If you can fix the timing or find a workaround, that would be great.

Cheers,
Laszlo
Comment 3 Tim McMullan 2020-02-04 10:57:56 MST
Hey Laszlo,

The quickest workaround you could use is to just comment out the "PIDFile=*" line in the unit file and do a daemon-reload. Instead of reading the PID file we write out, systemd will "guess" the main PID (and in my tests does so correctly).
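
As a rough sketch, the workaround could be scripted like this. The function name comment_out_pidfile is made up for illustration, and keeping a .bak copy of the unit file is an extra precaution assumed here, not something the workaround requires.

```shell
#!/bin/sh
# Hypothetical helper: comment out any active "PIDFile=" line in a unit file.
# The original file is kept alongside it with a .bak suffix.
comment_out_pidfile() {
    unit=$1
    cp -p "$unit" "$unit.bak" || return 1
    # Prefix the matching line with '#'; already-commented lines do not match.
    sed 's/^[[:space:]]*PIDFile=/#&/' "$unit.bak" > "$unit"
}

# Possible usage as root on a compute node (unit path taken from this report):
#   comment_out_pidfile /usr/lib/systemd/system/slurmd.service
#   systemctl daemon-reload
# The change should apply the next time the service is started.
```

Note that a package update may restore the original unit file, in which case the edit would need to be repeated.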

I'm still testing a more proper solution and will keep you posted here on it!

Thanks!
--Tim
Comment 6 Tim McMullan 2020-03-16 12:53:20 MDT
We've landed a change to the suggested unit files that runs the slurm daemons in the foreground for systemd.  This should work for 19.05 as well, but is currently set for 20.11.
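
For illustration only, a foreground-style unit might look roughly like the one written by the sketch below. It assumes Type=simple together with slurmd's -D (run in foreground) option and reuses the paths from the unit file in the description; it is not necessarily identical to the unit file shipped with 20.11, and the function name write_foreground_unit is made up.

```shell
#!/bin/sh
# Hypothetical helper: write a foreground-style slurmd unit to the given path.
# Assumptions: Type=simple plus "slurmd -D"; no PIDFile= line is needed because
# systemd tracks the process it started directly.
write_foreground_unit() {
    # The quoted 'EOF' keeps $SLURMD_OPTIONS and $MAINPID literal in the output.
    cat > "$1" <<'EOF'
[Unit]
Description=Slurm node daemon
After=munge.service network.target remote-fs.target
ConditionPathExists=/etc/slurm/slurm.conf

[Service]
Type=simple
EnvironmentFile=-/etc/sysconfig/slurmd
ExecStart=/usr/sbin/slurmd -D $SLURMD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
LimitNOFILE=131072
LimitMEMLOCK=infinity
LimitSTACK=infinity
Delegate=yes

[Install]
WantedBy=multi-user.target
EOF
}

# Possible usage as root:
#   write_foreground_unit /etc/systemd/system/slurmd.service
#   systemctl daemon-reload
```

Because the daemon no longer forks into the background, systemd does not have to wait for or read a PID file, so the timing issue does not arise.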

I think for now though, just running with "PIDFile=" commented out in the unit file should be fine!

Thanks!
--Tim
Comment 7 dl_support-hpc 2020-03-16 18:32:32 MDT
Thanks Tim, I will try your suggested workaround.
Comment 8 Tim McMullan 2020-03-26 12:58:53 MDT
Hi!


Just wanted to check in and make sure this was working for you!

Thanks!
--Tim
Comment 9 Tim McMullan 2020-04-07 05:54:58 MDT
I'm closing this for now since the patch has landed.  Feel free to re-open if you still have the same problem, or open a new ticket if a new issue arises!

Thanks!
--Tim
Comment 10 Issam SAID 2020-06-26 11:53:26 MDT
Tested with 20.02 and the problem is still there.
Comment 11 Tim McMullan 2020-06-26 12:12:32 MDT
Sorry about that, I should have mentioned it in the comment!  The change we decided to make to the suggested unit files was different enough that we chose to put it in 20.11 but not 20.02.

For now, I would suggest continuing to run without the "PIDFile=" line or switching to the style suggested for 20.11.

Thanks!
--Tim