Summary: | Configless Slurm: slurmd -C does not work | ||
---|---|---|---|
Product: | Slurm | Reporter: | Ole.H.Nielsen <Ole.H.Nielsen> |
Component: | slurmd | Assignee: | Tim McMullan <mcmullan> |
Status: | RESOLVED FIXED | QA Contact: | |
Severity: | 4 - Minor Issue | ||
Priority: | --- | CC: | ward.poelmans |
Version: | 20.11.5 | ||
Hardware: | Linux | ||
OS: | Linux | ||
Site: | DTU Physics | | |
Version Fixed: | 21.08pre1 | Target Release: | --- |
**Description — Ole.H.Nielsen@fysik.dtu.dk, 2021-04-22 02:00:42 MDT**
The error also occurs even when I specify the slurmctld server:

```
$ slurmd -C --conf-server que
NodeName=a001
slurmd: error: s_p_parse_file: unable to status file /etc/slurm/slurm.conf: No such file or directory, retrying in 1sec up to 60sec
slurmd: error: ClusterName needs to be specified
slurmd: Considering each NUMA node as a socket
CPUs=40 Boards=1 SocketsPerBoard=4 CoresPerSocket=10 ThreadsPerCore=1 RealMemory=385380
UpTime=22-10:46:45
```

**Tim McMullan (comment #2)**

Hi Ole,

I'm having trouble reproducing this locally: when I run slurmd -C it never attempts to init the config file at all (even when not running configless). Would you be able to run "slurmd -C" in gdb, set breakpoints on "slurm_conf_init" and "s_p_parse_file", and attach a backtrace from when it hits one of those breakpoints?

Thanks!
--Tim

**Ole H. Nielsen**

Hi Tim,

(In reply to Tim McMullan from comment #2)
> I'm having trouble reproducing this locally, as when I run slurmd -C it
> never attempts to init the config file at all (even when not running
> configless). Would you be able to run "slurmd -C" in gdb, set a break point
> for "slurm_conf_init" and "s_p_parse_file", and attach a backtrace from it
> when it hits one of those break points?

This is what I see (I don't know if I'm doing exactly what you asked for):

```
[root@a001 ~]# gdb slurmd
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-120.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/sbin/slurmd...done.
(gdb) break slurm_conf_init
Breakpoint 1 at 0x40a250
(gdb) break s_p_parse_file
Function "s_p_parse_file" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 2 (s_p_parse_file) pending.
(gdb) run -C
Starting program: /usr/sbin/slurmd -C
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".

Breakpoint 2, s_p_parse_file (hashtbl=hashtbl@entry=0x63f850,
    hash_val=hash_val@entry=0x6351b0 <slurm_conf+400>,
    filename=filename@entry=0x7ffff7966c12 "/etc/slurm/slurm.conf",
    ignore_new=ignore_new@entry=false) at parse_config.c:1119
1119    parse_config.c: No such file or directory.
Missing separate debuginfos, use: debuginfo-install slurm-slurmd-20.11.5-1.el7.x86_64
(gdb) where
#0  s_p_parse_file (hashtbl=hashtbl@entry=0x63f850,
    hash_val=hash_val@entry=0x6351b0 <slurm_conf+400>,
    filename=filename@entry=0x7ffff7966c12 "/etc/slurm/slurm.conf",
    ignore_new=ignore_new@entry=false) at parse_config.c:1119
#1  0x00007ffff78b74b4 in _init_slurm_conf (file_name=file_name@entry=0x0)
    at read_config.c:3124
#2  0x00007ffff78bd266 in slurm_conf_lock () at read_config.c:3420
#3  0x00007ffff78ddfbd in slurm_get_sched_params () at slurm_protocol_api.c:592
#4  0x000000000042304b in xcpuinfo_hwloc_topo_get (p_cpus=0x639a94,
    p_boards=0x639a96, p_sockets=0x639a98, p_cores=0x639a9a,
    p_threads=0x639a9c, p_block_map_size=0x639ab0, p_block_map=0x639ab8,
    p_block_map_inv=0x639ac0) at xcpuinfo.c:325
#5  0x000000000040ea06 in _print_config () at slurmd.c:1337
#6  _process_cmdline (av=0x7fffffffdf48, ac=2) at slurmd.c:1420
#7  _slurmd_init () at slurmd.c:1680
#8  main (argc=2, argv=0x7fffffffdf48) at slurmd.c:295
```

A new observation: on another node with the same 20.11.5 I don't see the problem:

```
[root@b001 ~]# slurmd -C
NodeName=b001 CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=1 RealMemory=772299
UpTime=23-06:09:45
```

Looking at the outputs of these two very similar Dell PowerEdge R640 servers, I see a difference:

```
a001: CPUs=40 Boards=1 SocketsPerBoard=4 CoresPerSocket=10 ThreadsPerCore=1
b001: CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=1
```

The only difference is that the a??? nodes were delivered by Dell with performance-optimized BIOS settings, including Sub NUMA Cluster = Enabled, hence the SocketsPerBoard=4 CoresPerSocket=10. This is at variance with what's defined in slurm.conf. The relevant lines from slurm.conf are:

```
# Dell R640 Intel Xeon Skylake 40-core nodes (768 GB RAM)
NodeSet=b_nodes_768 Nodes=b[001-012]
NodeName=b[001-012] Weight=10735 Sockets=2 CoresPerSocket=20 ThreadsPerCore=1 RealMemory=768000 TmpDisk=140000 Feature=xeon6148v5,opa,xeon40
# Dell R640 Intel Xeon Cascade Lake 40-core nodes (384 GB RAM)
NodeSet=a_nodes_384 Nodes=a[001-128]
NodeName=a[001-128] Weight=10536 Sockets=2 CoresPerSocket=20 ThreadsPerCore=1 RealMemory=384000 TmpDisk=140000 Feature=xeon6242r,opa,xeon40
```

Maybe slurmd gets confused by the differences and decides to read the local slurm.conf? Do you have any ideas about the best way to proceed?

**Tim McMullan**

That's perfect, thank you! I'll let you know if I need any more information!

--Tim

**Tim McMullan (comment #6)**

Hi Ole,

I've taken a look at the trace, your configuration, and the hint you gave about the node differences, and I see what is happening here. We have a parameter that deals with how we want to handle multiple NUMA nodes on a socket (SchedulerParameters=Ignore_NUMA), so in your case slurmd -C attempts to read the Slurm config when it otherwise wouldn't. I'm looking into how best to approach this from our side.

There are a couple of options for working around this:

a) Disable "Sub NUMA Cluster" in the BIOS if you don't need it, which will avoid the issue with slurmd -C.

b) Update the settings in slurm.conf to "SocketsPerBoard=4 CoresPerSocket=10" for the Cascade Lake nodes.

Which way you go is really up to you and whether this is something new you need in your cluster. It seems like you aren't running with Sub NUMA on the Skylake nodes, so disabling it would make the new nodes more closely resemble the older ones.
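[Editor's note: option (b) would amount to redefining the a-nodes along these lines — a sketch only, built from the slurm.conf excerpt quoted earlier in this ticket; Weight, RealMemory, TmpDisk, and Feature values are copied from it unchanged:]

```
# Dell R640 Intel Xeon Cascade Lake 40-core nodes (384 GB RAM),
# Sub NUMA Cluster enabled: describe each NUMA domain as a socket,
# matching what "slurmd -C" reports (SocketsPerBoard=4 CoresPerSocket=10)
NodeName=a[001-128] Weight=10536 SocketsPerBoard=4 CoresPerSocket=10 ThreadsPerCore=1 RealMemory=384000 TmpDisk=140000 Feature=xeon6242r,opa,xeon40
```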
This particular call happening before the Slurm config exists should only happen with slurmd -C, so I expect it won't be an issue when slurmd starts normally!

Thanks!
--Tim

**Ole H. Nielsen**

Hi Tim,

(In reply to Tim McMullan from comment #6)
> We have a parameter that deals with how we want to handle multiple numa
> nodes on a socket (SchedulerParameters=Ignore_NUMA), so in your case it
> attempts to read the slurm config when it otherwise wouldn't. I'm looking
> into how to best approach this from our side now.

Thanks, this should get sorted out eventually.

> There are a couple options for you as far as "working around" this:
>
> a) Disable "Sub NUMA Cluster" in the bios if you don't need it, which will
> avoid the issue with slurmd -C
>
> b) Update the settings in slurm.conf to "SocketsPerBoard=4
> CoresPerSocket=10" for the Cascade Lake nodes

I've seen papers documenting that Sub NUMA Cluster (SNC) can give a couple of percent better performance, so I'd like to go with that.

> This particular call happening before the slurm config exists should only
> happen with slurmd -C, so I expect it won't be an issue for when the slurmd
> starts normally!

Good to know that there are no further impacts.
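[Editor's note: the topology arithmetic behind the two readings discussed above can be sketched as follows. This is illustrative Python, not Slurm code; the function name and parameters are made up for the example.]

```python
def reported_topology(sockets, cores_per_socket, numa_per_socket, numa_as_socket):
    """Return (SocketsPerBoard, CoresPerSocket) as slurmd -C would report them.

    When slurmd considers "each NUMA node as a socket" (i.e. Ignore_NUMA is
    not set), every NUMA domain counts as a socket and the physical cores
    are divided among the domains.
    """
    if numa_as_socket and numa_per_socket > 1:
        return (sockets * numa_per_socket, cores_per_socket // numa_per_socket)
    return (sockets, cores_per_socket)

# b001: SNC disabled, 1 NUMA domain per socket -> matches slurm.conf
print(reported_topology(2, 20, 1, True))   # (2, 20)
# a001: SNC enabled, 2 NUMA domains per socket -> at variance with slurm.conf
print(reported_topology(2, 20, 2, True))   # (4, 10)
```

Either working around the mismatch (option b) or disabling SNC (option a) makes the reported and configured tuples agree again.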
**Ole H. Nielsen**

I have a new observation: I've reconfigured the BIOS for SNC on some Xeon Cascade Lake nodes, so that I now have 4 NUMA domains on 2 sockets:

```
$ numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 5 6 10 11 12 15 16 40 41 42 45 46 50 51 52 55 56
node 0 size: 46795 MB
node 0 free: 23177 MB
node 1 cpus: 3 4 7 8 9 13 14 17 18 19 43 44 47 48 49 53 54 57 58 59
node 1 size: 48360 MB
node 1 free: 41965 MB
node 2 cpus: 20 21 22 25 26 30 31 32 35 36 60 61 62 65 66 70 71 72 75 76
node 2 size: 48376 MB
node 2 free: 46559 MB
node 3 cpus: 23 24 27 28 29 33 34 37 38 39 63 64 67 68 69 73 74 77 78 79
node 3 size: 48376 MB
node 3 free: 46409 MB
node distances:
node   0   1   2   3
  0:  10  11  21  21
  1:  11  10  21  21
  2:  21  21  10  11
  3:  21  21  11  10
```

I've also reconfigured slurm.conf on the slurmctld server (we're using Configless) and done an "scontrol reconfig", and things seem to be OK:

```
$ scontrol show node s004
NodeName=s004 Arch=x86_64 CoresPerSocket=10
   CPUAlloc=24 CPUTot=80 CPULoad=3.32
   AvailableFeatures=xeon5218r,GPU_RTX3090
   ActiveFeatures=xeon5218r,GPU_RTX3090
   Gres=gpu:RTX3090:10
   NodeAddr=s004 NodeHostName=s004 Version=20.11.5
   OS=Linux 3.10.0-1160.24.1.el7.x86_64 #1 SMP Thu Apr 8 19:51:47 UTC 2021
   RealMemory=191000 AllocMem=35840 FreeMem=158357 Sockets=4 Boards=1
   State=MIXED ThreadsPerCore=2 TmpDisk=850000 Weight=19336 Owner=N/A MCS_label=N/A
   Partitions=sm3090
   BootTime=2021-04-26T09:17:43 SlurmdStartTime=2021-04-26T12:04:20
   CfgTRES=cpu=80,mem=191000M,billing=160
   AllocTRES=cpu=24,mem=35G
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Comment=(null)
```

In spite of this seemingly correct configuration, I still get the error message on the node:

```
$ slurmd -C
NodeName=s004
slurmd: error: s_p_parse_file: unable to status file /etc/slurm/slurm.conf: No such file or directory, retrying in 1sec up to 60sec
slurmd: error: ClusterName needs to be specified
slurmd: Considering each NUMA node as a socket
CPUs=80 Boards=1 SocketsPerBoard=4 CoresPerSocket=10 ThreadsPerCore=2 RealMemory=191909
UpTime=0-22:57:14
```

The node had been rebooted, and I've also restarted slurmd. So something is still missing! Is it possible that a restart of slurmctld is also required for the node reconfiguration to work correctly? AFAIK, such a requirement hasn't been documented anywhere.

Thanks,
Ole

**Tim McMullan (comment #8)**

Hi Ole,

I'm not surprised to see that slurmd -C is still showing errors. slurmd -C never tries to determine where the config file is, because it *almost* never needs the config; the one "Ignore_NUMA" option is the only thing that would change the output. I expect that in configless mode, slurmd -C will always display that error on a node with multiple NUMA domains per socket, but it shouldn't impact the slurmd that is running as a daemon.

Thanks!
--Tim

**Ole H. Nielsen**

Hi Tim,

(In reply to Tim McMullan from comment #8)
> I'm not surprised to see that slurmd -C is still showing errors. slurmd -C
> never tries to determine where the config file is because it *almost* never
> needs the config. The one "Ignore_NUMA" option is the only one that would
> change the output.

IMHO, slurmd should read the slurm.conf file (from configless or wherever) instead of throwing an error and waiting for 60 seconds before printing the output.

> I expect that in configless, slurmd -C will always display that error when
> on a node with multiple numa domains per socket, but shouldn't impact the
> slurmd that is running as a daemon.

I agree that the daemon seems to be running correctly. Nevertheless, I don't think an error message and a timeout are correct and unavoidable behavior, when all it takes is reading the slurm.conf. The code to do so must already be available in slurmd. Could you kindly flag this as a bug?

Thanks,
Ole

**Tim McMullan (comment #10)**

Hi Ole,

I'm sorry it wasn't clearer, but we do consider the behavior a bug, and I'm looking at an appropriate fix. However, what we consider the bug here is that it tries to read the config file at all.
slurmd -C is intended to be run before the node has been configured, to show what we think the hardware configuration is. The attempt to read the config isn't intended (and shouldn't be necessary), so that is what I'm working on fixing!

I hope this clears things up. Thanks!
--Tim

**Ole H. Nielsen**

(In reply to Tim McMullan from comment #10)
> I'm sorry it wasn't more clear, but we do consider the behavior a bug and
> I'm looking at an appropriate fix. However, what we are considering the bug
> here is that it tries to read the config file at all.
>
> slurmd -C is intended to be run before the node has been configured to show
> what we think the hardware configuration is. The attempt to read the
> config isn't intended (and shouldn't be necessary), so that is what I'm
> working on fixing!
>
> I hope this clears things up.

Perfect, thanks!
Ole

**Tim McMullan**

Hi Ole,

This issue has been fixed on master and should appear in 21.08! I'm going to resolve this for now!

Thanks!
--Tim

*** Ticket 12053 has been marked as a duplicate of this ticket. ***
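[Editor's note: the "retrying in 1sec up to 60sec" behaviour visible in the error messages throughout this ticket can be sketched as a bounded retry loop. This is illustrative Python, not the actual s_p_parse_file implementation; the function name and the injectable sleep parameter are made up for the example. The 21.08 fix removes the config read from slurmd -C entirely rather than tuning a loop like this.]

```python
import os
import time

def wait_for_config(path, retry_s=1, timeout_s=60, sleep=time.sleep):
    """Poll for a config file, retrying every retry_s seconds up to timeout_s.

    Mirrors the reported behaviour: each failed check logs an error and
    sleeps; after the timeout the caller gives up and proceeds without the
    file (leading to follow-on errors such as the missing ClusterName).
    """
    waited = 0
    while not os.path.exists(path):
        if waited >= timeout_s:
            return False  # gave up waiting for the file
        print(f"error: unable to stat file {path}, "
              f"retrying in {retry_s}sec up to {timeout_s}sec")
        sleep(retry_s)
        waited += retry_s
    return True
```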