Bug 11878

Summary: Configless Slurm fails due to failing SRV record lookup on EL8 (CentOS 8)
Product: Slurm    Reporter: Ole.H.Nielsen <Ole.H.Nielsen>
Component: slurmd    Assignee: Tim McMullan <mcmullan>
Status: RESOLVED FIXED    QA Contact:
Severity: 3 - Medium Impact    Priority: ---
Version: 20.11.7    CC: tru
Hardware: Linux    OS: Linux
Site: DTU Physics
Version Fixed: 21.08.0    Target Release: ---

Description Ole.H.Nielsen@fysik.dtu.dk 2021-06-22 07:00:29 MDT
We're testing some compute nodes running EL8 (CentOS 8.4 and AlmaLinux 8.4).
Our cluster setup uses Configless Slurm with a DNS SRV record.  This is working correctly, for example, as shown by a DNS lookup on the slurmctld server (running CentOS 7.9):

$ host -t SRV _slurmctld._tcp 
_slurmctld._tcp.nifl.fysik.dtu.dk has SRV record 0 0 6817 que.nifl.fysik.dtu.dk.
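
(For reference, the four SRV fields are priority, weight, port, and target, so this record tells a configless slurmd to contact que.nifl.fysik.dtu.dk on port 6817.)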

On the EL8 compute nodes, however, slurmd fails to start, and we see these lines in the syslog:

Jun 22 14:03:26 h002 slurmd[1274]: error: resolve_ctls_from_dns_srv: res_nsearch error: Host name lookup failure
Jun 22 14:03:26 h002 slurmd[1274]: error: fetch_config: DNS SRV lookup failed
Jun 22 14:03:26 h002 slurmd[1274]: error: _establish_configuration: failed to load configs
Jun 22 14:03:26 h002 slurmd[1274]: error: slurmd initialization failed
Jun 22 14:03:26 h002 systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FAILURE
Jun 22 14:03:26 h002 systemd[1]: slurmd.service: Failed with result 'exit-code'.

It is indeed the DNS lookup of SRV records that seems to be causing problems on the EL8 compute nodes:

[root@h002 ~]# host -t SRV _slurmctld._tcp 
Host _slurmctld._tcp not found: 3(NXDOMAIN)

Normal DNS A records are OK using just the short hostname:

[root@h002 ~]# host h001
h001.nifl.fysik.dtu.dk has address 10.2.133.113

But when I append the entire FQDN, the SRV record can be looked up correctly:

[root@h002 ~]# host -t SRV _slurmctld._tcp.nifl.fysik.dtu.dk.
_slurmctld._tcp.nifl.fysik.dtu.dk has SRV record 0 0 6817 que.nifl.fysik.dtu.dk.

As far as I know, there are no issues with our DNS servers, which run on CentOS 7.9.

The node's resolv.conf file is generated automatically:

[root@h002 ~]# cat /etc/resolv.conf
# Generated by NetworkManager
search nifl.fysik.dtu.dk
nameserver 10.2.128.110
nameserver 10.2.128.2

I've tested DNS lookups of SRV records on all Linux servers in our network, and a consistent picture emerges:

* On all EL8 servers the command "host -t SRV _slurmctld._tcp" fails to return any results!  The OS versions tested include CentOS 8.3, CentOS 8.4, AlmaLinux 8.4, CentOS Stream 8, and Fedora 34.

* All EL7 servers are OK.

It would seem that DNS SRV records are treated differently on EL8 hosts than on EL7 hosts.  The slurmd daemon fails to start in Configless mode on EL8 hosts owing to this problem.

Question:  Are you aware of what has changed wrt. SRV records on EL8?

A fix for DNS lookups in slurmd on EL8 systems may be required.

Thanks a lot,
Ole
Comment 1 Tim McMullan 2021-06-22 10:21:51 MDT
Hi Ole,

Thanks for the report!  I've not yet heard of this issue, but in my own test environment I wasn't able to reproduce it quickly.  I'm currently testing with CentOS Stream 8, but host/dig can both get the appropriate response without the FQDN attached.

My resolv.conf is essentially the same as yours and is also generated by NetworkManager (I'm using DHCP reservations for all my hosts).  I updated and rebooted the test node before the test, so it's very current.

Just as a sanity check and to help me in trying to replicate the issue:
Are the /etc/resolv.conf files the same on the el7 and el8 nodes?
Does /etc/hostname have the FQDN or just the hostname on el7 and the el8 nodes?
Are the srv records defined on both DNS servers?
Can you share how you defined the SRV records on the DNS servers?

Hopefully with this I'll be able to match my setup to yours, reproduce, and find what changed!

Thanks!
--Tim
Comment 2 Ole.H.Nielsen@fysik.dtu.dk 2021-06-23 05:40:30 MDT
Hi Tim,

Thanks for a quick response.  Answers are below:

(In reply to Tim McMullan from comment #1)
> Thanks for the report!  I've not yet heard of this issue, but in my own test
> environment I wasn't able to reproduce it quickly.  I'm currently testing
> with CentOS Stream 8, but host/dig can both get the appropriate response
> without the FQDN attached.

In my cluster network, the SRV record cannot be looked up on the EL8 node unless I add the FQDN:

[root@h002 ~]# dig +short -t SRV -n _slurmctld._tcp
[root@h002 ~]# dig +short -t SRV -n _slurmctld._tcp.nifl.fysik.dtu.dk.
0 0 6817 que.nifl.fysik.dtu.dk.

[root@h002 ~]# host -t SRV _slurmctld._tcp
Host _slurmctld._tcp not found: 3(NXDOMAIN)
[root@h002 ~]# host -t SRV _slurmctld._tcp.nifl.fysik.dtu.dk.
_slurmctld._tcp.nifl.fysik.dtu.dk has SRV record 0 0 6817 que.nifl.fysik.dtu.dk.

On an EL7 node the "host" command works without FQDN, but "dig" doesn't:

[root@s004 ~]# dig +short -t SRV -n _slurmctld._tcp
[root@s004 ~]# host -t SRV _slurmctld._tcp
_slurmctld._tcp.nifl.fysik.dtu.dk has SRV record 0 0 6817 que.nifl.fysik.dtu.dk.
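
(As an aside, dig only applies the resolv.conf search list when the +search option is given, so a short-name dig lookup would be expected to need something like

dig +search +short -t SRV _slurmctld._tcp

on both EL7 and EL8; the host behavior difference is a separate question.)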

Maybe I'm barking up the wrong tree here:  It may just be that the "host" command from the bind-utils RPM changed behavior between EL7 (bind-utils-9.11.4) and EL8 (bind-utils-9.11.26), but I couldn't find any changelog.

I wonder why your CentOS Stream 8 system behaves differently from mine?  We have a PC running CentOS Stream 8 where the FQDN is also required:

[root@tesla ~]# dig +short -t SRV -n _slurmctld._tcp
[root@tesla ~]# dig +short -t SRV -n _slurmctld._tcp.fysik.dtu.dk.
0 0 6817 que.fysik.dtu.dk.
[root@tesla ~]# dig +short -t SRV -n _slurmctld._tcp.nifl.fysik.dtu.dk.
0 0 6817 que.nifl.fysik.dtu.dk.

The network for this PC uses a campus-wide Infoblox DNS appliance, which is completely different from my cluster's CentOS 7.9 BIND DNS server.

> My resolv.conf is essentially the same as yours and is also being generated
> by NetworkManager (I'm using DHCP reservations for all my hosts), I
> updated/rebooted the test node before the test so its very current.
> 
> Just as a sanity check and to help me in trying to replicate the issue:
> Are the /etc/resolv.conf files the same on the el7 and el8 nodes?

Yes, verified on EL7 and EL8.  The same DHCP server services all cluster nodes.

> Does /etc/hostname have the FQDN or just the hostname on el7 and the el8
> nodes?

Full FQDN:

[root@h002 ~]# cat /etc/hostname
h002.nifl.fysik.dtu.dk

> Are the srv records defined on both DNS servers?

Yes, they are both DNS slave servers using the same authoritative DNS server.

> Can you share how you defined the SRV records on the DNS servers?

In the zone file for nifl.fysik.dtu.dk I have:

_slurmctld._tcp 3600 IN SRV 0 0 6817 que
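
With the zone origin nifl.fysik.dtu.dk, that relative record expands to:

_slurmctld._tcp.nifl.fysik.dtu.dk. 3600 IN SRV 0 0 6817 que.nifl.fysik.dtu.dk.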
Comment 3 Ole.H.Nielsen@fysik.dtu.dk 2021-06-23 06:20:54 MDT
I have made some new observations today.  Having found in Comment 2 that the FQDN is required to look up the SRV record (at least in our network, for reasons I don't yet understand), maybe this is not a DNS issue after all?

When my EL8 node "h002" boots, I get slurmd error messages in the syslog related to DNS lookup failures:

Jun 23 13:50:36 h002 slurmd[1281]: error: resolve_ctls_from_dns_srv: res_nsearch error: Host name lookup failure
Jun 23 13:50:36 h002 slurmd[1281]: error: fetch_config: DNS SRV lookup failed
Jun 23 13:50:36 h002 slurmd[1281]: error: _establish_configuration: failed to load configs
Jun 23 13:50:36 h002 systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FAILURE
Jun 23 13:50:36 h002 slurmd[1281]: error: slurmd initialization failed
Jun 23 13:50:36 h002 systemd[1]: slurmd.service: Failed with result 'exit-code'.

Consequently, the slurmd service has failed:

[root@h002 ~]# systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/slurmd.service.d
           └─core_limit.conf
   Active: failed (Result: exit-code) since Wed 2021-06-23 13:50:36 CEST; 4min 24s ago
  Process: 1281 ExecStart=/usr/sbin/slurmd -D $SLURMD_OPTIONS (code=exited, status=1/FAILURE)
 Main PID: 1281 (code=exited, status=1/FAILURE)

Jun 23 13:50:36 h002.nifl.fysik.dtu.dk systemd[1]: Started Slurm node daemon.
Jun 23 13:50:36 h002.nifl.fysik.dtu.dk systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FAILURE
Jun 23 13:50:36 h002.nifl.fysik.dtu.dk systemd[1]: slurmd.service: Failed with result 'exit-code'.

Strangely, when I now restart slurmd it works just fine:

[root@h002 ~]# systemctl restart slurmd
[root@h002 ~]# systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/slurmd.service.d
           └─core_limit.conf
   Active: active (running) since Wed 2021-06-23 13:55:05 CEST; 1s ago
 Main PID: 2151 (slurmd)
    Tasks: 2
   Memory: 18.4M
   CGroup: /system.slice/slurmd.service
           └─2151 /usr/sbin/slurmd -D

Jun 23 13:55:05 h002.nifl.fysik.dtu.dk systemd[1]: Started Slurm node daemon.

I've confirmed that this behavior repeats every time I reboot the node h002.  slurmd fails when started by Systemd during boot, but a few minutes later slurmd starts correctly from Systemd.  I think this rules out any temporary issue with our DNS servers, also because all other nodes in the cluster work just fine in Configless mode.

Perhaps the base issue is related to the slurmd error "fetch_config: DNS SRV lookup failed".

Could it be that slurmd is being started too early in the boot process, before some network services have fully come up?  In the syslog I see that NetworkManager is started with the exact same timestamp as the slurmd service:

Jun 23 13:50:36 h002 systemd[1]: Starting Network Manager...
Jun 23 13:50:36 h002 NetworkManager[1224]: <info>  [1624449036.3693] NetworkManager (version 1.30.0-7.el8) is starting... (for the first time)
Jun 23 13:50:36 h002 NetworkManager[1224]: <info>  [1624449036.3697] Read config: /etc/NetworkManager/NetworkManager.conf
Jun 23 13:50:36 h002 systemd[1]: Started Network Manager.
Jun 23 13:50:36 h002 NetworkManager[1224]: <info>  [1624449036.3722] bus-manager: acquired D-Bus service "org.freedesktop.NetworkManager"
Jun 23 13:50:36 h002 systemd[1]: Starting Network Manager Wait Online...
Jun 23 13:50:36 h002 systemd[1]: Reached target Network.

I tried delaying the startup of slurmd by adding an ExecStartPre sleep to /usr/lib/systemd/system/slurmd.service:

[Service]
Type=simple
EnvironmentFile=-/etc/sysconfig/slurmd
# Testing a delay:
ExecStartPre=/bin/sleep 30
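
As an aside, the same delay could also be applied without editing the packaged unit file, e.g. via a drop-in (the path below is just an example), followed by "systemctl daemon-reload":

# /etc/systemd/system/slurmd.service.d/delay.conf
[Service]
ExecStartPre=/bin/sleep 30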

After rebooting again, the slurmd service now starts correctly during the boot process:

[root@h002 ~]# systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/slurmd.service.d
           └─core_limit.conf
   Active: active (running) since Wed 2021-06-23 14:12:12 CEST; 3min 11s ago
  Process: 1273 ExecStartPre=/bin/sleep 30 (code=exited, status=0/SUCCESS)
 Main PID: 2055 (slurmd)
    Tasks: 2
   Memory: 19.5M
   CGroup: /system.slice/slurmd.service
           └─2055 /usr/sbin/slurmd -D

Jun 23 14:11:42 h002.nifl.fysik.dtu.dk systemd[1]: Starting Slurm node daemon...
Jun 23 14:12:12 h002.nifl.fysik.dtu.dk systemd[1]: Started Slurm node daemon.

With this experiment it seems to me that the issue may be a race condition between the start of the slurmd and NetworkManager services.
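
One way to confirm the ordering a unit actually got at boot, assuming the standard systemd tools are available, is:

systemd-analyze critical-chain slurmd.service
journalctl -b -u slurmd -u NetworkManager

which should show whether slurmd was started before the network was fully configured.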

What are your thoughts on this?

Thanks,
Ole
Comment 4 Tim McMullan 2021-06-23 06:37:02 MDT
(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #3)
> With this experiment it seems to me that the issue may be a race condition
> between the start of the slurmd and NetworkManager services.
> 
> What are your thoughts on this?

Thanks for the additional observations!  I think that could explain a lot of what we are seeing here.

Would you be able to add "Wants=network-online.target" to the slurmd unit file on an EL8 node and see if that makes any difference?  This may delay slurmd a little longer and let the network come up all the way without an added sleep.
Comment 5 Ole.H.Nielsen@fysik.dtu.dk 2021-06-23 06:40:42 MDT
Following up on my idea in Comment 3 that slurmd starts before the network is fully up, I've found that the slurmd.service file may depend incorrectly on network.target:

[Unit]
Description=Slurm node daemon
After=munge.service network.target remote-fs.target

As discussed in the RHEL8 documentation:

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/configuring_and_managing_networking/systemd-network-targets-and-services_configuring-and-managing-networking#differences-between-the-network-and-network-online-systemd-target_systemd-network-targets-and-services

we may in fact require network-online.target in the slurmd.service file so that Configless Slurm works correctly:

[Unit]
Description=Slurm node daemon
After=munge.service network-online.target remote-fs.target

I've removed the ExecStartPre delay from Comment 3 and rebooted the system.  Now slurmd starts correctly at boot time:

[root@h002 ~]# systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/slurmd.service.d
           └─core_limit.conf
   Active: active (running) since Wed 2021-06-23 14:28:25 CEST; 18s ago
 Main PID: 1657 (slurmd)
    Tasks: 2
   Memory: 19.4M
   CGroup: /system.slice/slurmd.service
           └─1657 /usr/sbin/slurmd -D

Jun 23 14:28:25 h002.nifl.fysik.dtu.dk systemd[1]: Started Slurm node daemon.

Therefore I would like to propose the following patch introducing the Systemd network-online.target:

[root@h002 ~]# diff -c /usr/lib/systemd/system/slurmd.service /usr/lib/systemd/system/slurmd.service.orig 
*** /usr/lib/systemd/system/slurmd.service	2021-06-23 14:37:15.890078007 +0200
--- /usr/lib/systemd/system/slurmd.service.orig	2021-05-28 10:42:26.000000000 +0200
***************
*** 1,6 ****
  [Unit]
  Description=Slurm node daemon
! After=munge.service network-online.target remote-fs.target
  #ConditionPathExists=/etc/slurm/slurm.conf
  
  [Service]
--- 1,6 ----
  [Unit]
  Description=Slurm node daemon
! After=munge.service network.target remote-fs.target
  #ConditionPathExists=/etc/slurm/slurm.conf
  
  [Service]


Does this seem to be a correct conclusion?

Thanks,
Ole
Comment 6 Tim McMullan 2021-06-23 06:47:40 MDT
Looks like we came to basically the same conclusion here :)

I'll work on getting this included!
Comment 7 Ole.H.Nielsen@fysik.dtu.dk 2021-06-23 06:51:46 MDT
(In reply to Tim McMullan from comment #6)
> Looks like we came to basically the same conclusion here :)
> 
> I'll work on getting this included!

Thanks!  I'm confused whether we want to use wants= or after= or possibly both in the service file.  This is defined in https://www.freedesktop.org/software/systemd/man/systemd.unit.html but I don't fully understand the distinction.  There is a comment "It is a common pattern to include a unit name in both the After= and Wants= options, in which case the unit listed will be started before the unit that is configured with these options."

One issue remains, though:  Why do my EL8 systems require the FQDN in DNS lookups, whereas yours don't?

Thanks,
Ole
Comment 9 Tim McMullan 2021-06-23 07:12:06 MDT
(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #7)
> Thanks!  I'm confused whether we want to use wants= or after= or possibly
> both in the service file.  This is defined in
> https://www.freedesktop.org/software/systemd/man/systemd.unit.html but I
> don't fully understand the distinction.  There is a comment "It is a common
> pattern to include a unit name in both the After= and Wants= options, in
> which case the unit listed will be started before the unit that is
> configured with these options."

I read through https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/ and https://www.freedesktop.org/software/systemd/man/systemd.special.html and between them I think the right thing to do is have network-online.target in both After= and Wants=.

This quote in particular makes me think that: "systemd automatically adds dependencies of type Wants= and After= for this target unit to all SysV init script service units with an LSB header referring to the "$network" facility."
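
For an existing installation this can be applied locally without patching the packaged unit file, e.g. with a drop-in along these lines (the file name is just an example), followed by "systemctl daemon-reload":

# /etc/systemd/system/slurmd.service.d/network-online.conf
[Unit]
Wants=network-online.target
After=network-online.target

Note that network-online.target only delays anything if a wait service such as NetworkManager-wait-online.service is enabled; the boot log in comment 3 shows that it is on h002.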

> One issue remains, though:  Why do my EL8 systems require the FQDN in DNS
> lookups, whereas yours don't?

I do find that strange.  I'm going to tweak my environment some to match yours as best I can and see if it behaves the same way.  My DNS is run on FreeBSD with unbound so it won't be perfect... but it's worth a try at least.
Comment 11 Tim McMullan 2021-07-01 09:49:14 MDT
Hi Ole,

We've pushed changes to the unit files for Slurm for 21.08+!

I've tried (mostly) replicating your setup with the details you provided and haven't seen the same behavior yet... nor have I found a good explanation for the difference you mention between EL7 and EL8.  Have you found anything?

Thanks!
--Tim
Comment 12 Ole.H.Nielsen@fysik.dtu.dk 2021-07-01 12:24:31 MDT
(In reply to Tim McMullan from comment #11)
> Hi Ole,
> 
> We've pushed changes to the unit files for Slurm for 21.08+!
> 
> I've tried (mostly) replicating your setup with the details you provided and
> haven't seen the same behavior yet... nor have I found a good explanation
> for the difference you mention between EL7 and EL8.  Have you found anything?

As you can see in Comment 3, the issue is due to a race condition between the network being up before or after slurmd is started.  The winner of the race condition may depend on many things.

IMHO, the correct and safe solution is to start slurmd only after the network-online target!  This is crucial in the case of Configless Slurm.

Thanks,
Ole
Comment 13 Tim McMullan 2021-07-01 12:30:46 MDT
(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #12)
> (In reply to Tim McMullan from comment #11)
> > Hi Ole,
> > 
> > We've pushed changes to the unit files for Slurm for 21.08+!
> > 
> > I've tried (mostly) replicating your setup with the details you provided and
> > haven't seen the same behavior yet... nor have I found a good explanation
> > for the difference you mention between EL7 and EL8.  Have you found anything?
> 
> As you can see in Comment 3, the issue is due to a race condition between
> the network being up before or after slurmd is started.  The winner of the
> race condition may depend on many things.
> 
> IMHO, the correct and safe solution is to start slurmd only after the
> network-online target!  This is crucial in the case of Configless Slurm.
> 
> Thanks,
> Ole

Sorry Ole, the situation I was trying to replicate was the difference in behavior of dig and host!

https://github.com/SchedMD/slurm/commit/e1e7926
and
https://github.com/SchedMD/slurm/commit/e88f7ff

Fix the issue in 21.08 :)

Thanks!
-Tim
Comment 14 Ole.H.Nielsen@fysik.dtu.dk 2021-07-02 04:21:03 MDT
(In reply to Tim McMullan from comment #13)
> Sorry Ole, the situation I was trying to replicate was the difference in
> behavior of dig and host!

I've verified again that on all our EL7 servers dig and host both work correctly, while on all EL8 servers (CentOS 8.3, 8.4, Stream, AlmaLinux 8.4) they don't (the FQDN is required).

Additionally, I have access to the Slurm cluster at another university, and on their EL7 nodes dig/host both work correctly.  They have installed an AlmaLinux 8.4 node which shows the same behavior:

$ cat /etc/redhat-release 
AlmaLinux release 8.4 (Electric Cheetah)
$ host -t SRV _slurmctld._tcp 
Host _slurmctld._tcp not found: 3(NXDOMAIN)
$ host -t SRV _slurmctld._tcp.grendel.cscaa.dk
_slurmctld._tcp.grendel.cscaa.dk has SRV record 0 0 6817 in4.grendel.cscaa.dk.
$ dig +short -t SRV -n _slurmctld._tcp 
$ dig +short -t SRV -n _slurmctld._tcp.grendel.cscaa.dk
0 0 6817 in4.grendel.cscaa.dk.

So I believe the DNS SRV record problem is not due to our particular network or DNS setup.

Could you possibly ask some other Slurm sites and SchedMD colleagues to check the dig/host behavior on any available EL8 nodes?

Thanks,
Ole
Comment 15 Tim McMullan 2021-07-09 11:30:13 MDT
Hey Ole,

I did some asking around internally and the RHEL8 systems we have aren't exhibiting the issue.  Since this doesn't seem to be impacting Slurm functionality and I'm not finding the issue thus far, it might be better to ask Red Hat whether they know what's going on with the host/dig behavior differences.

Thanks,
--Tim
Comment 16 Ole.H.Nielsen@fysik.dtu.dk 2021-07-23 05:41:38 MDT
(In reply to Tim McMullan from comment #13)
> Sorry Ole, the situation I was trying to replicate was the difference in
> behavior of dig and host!
> 
> https://github.com/SchedMD/slurm/commit/e1e7926
> and
> https://github.com/SchedMD/slurm/commit/e88f7ff
> 
> Fix the issue in 21.08 :)

Today there is a thread "slumctld don't start at boot" on the slurm-users list, where a site running Rocky Linux 8.4 sees slurmctld fail to start.

Do you think you could push the change to 20.11.9 as well, since this issue may be hitting more broadly on EL 8.4 systems?

Thanks,
Ole
Comment 17 Tim McMullan 2021-08-31 10:08:09 MDT
Hi Ole,

I did some chatting internally, and for now the feeling is to leave things the way they are in 20.11.  The issue doesn't seem to be impacting many people right now, and we are reluctant to change the unit files in case we introduce something unexpected... and it's very easy to handle if there is a problem, since it doesn't require any code changes.

I'm going to mark this resolved for now, since the issue is fixed in 21.08.

Thanks!
--Tim