Summary: | Configless Slurm: slurmd -C does not work | ||
---|---|---|---|
Product: | Slurm | Reporter: | Ole.H.Nielsen <Ole.H.Nielsen> |
Component: | slurmd | Assignee: | Tim McMullan <mcmullan> |
Status: | RESOLVED FIXED | QA Contact: | |
Severity: | 4 - Minor Issue | ||
Priority: | --- | CC: | ward.poelmans |
Version: | 20.11.5 | ||
Hardware: | Linux | ||
OS: | Linux | ||
Site: | DTU Physics | | |
Version Fixed: | 21.08pre1 | Target Release: | --- |
**Description — Ole.H.Nielsen@fysik.dtu.dk, 2021-04-22 02:00:42 MDT**
The error also occurs even when I specify the slurmctld server:

```
$ slurmd -C --conf-server que
NodeName=a001
slurmd: error: s_p_parse_file: unable to status file /etc/slurm/slurm.conf: No such file or directory, retrying in 1sec up to 60sec
slurmd: error: ClusterName needs to be specified
slurmd: Considering each NUMA node as a socket
CPUs=40 Boards=1 SocketsPerBoard=4 CoresPerSocket=10 ThreadsPerCore=1 RealMemory=385380
UpTime=22-10:46:45
```

**Tim McMullan (comment #2)**

Hi Ole,

I'm having trouble reproducing this locally: when I run slurmd -C it never attempts to init the config file at all (even when not running configless). Would you be able to run "slurmd -C" in gdb, set breakpoints on "slurm_conf_init" and "s_p_parse_file", and attach a backtrace from when it hits one of those breakpoints?

Thanks!
--Tim

**Ole H. Nielsen**

Hi Tim,

(In reply to Tim McMullan from comment #2)
> I'm having trouble reproducing this locally, as when I run slurmd -C it
> never attempts to init the config file at all (even when not running
> configless). Would you be able to run "slurmd -C" in gdb, set a break point
> for "slurm_conf_init" and "s_p_parse_file", and attach a backtrace from it
> when it hits one of those break points?

This is what I see (I don't know if I'm doing exactly what you asked for):

```
[root@a001 ~]# gdb slurmd
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-120.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/sbin/slurmd...done.
(gdb) break slurm_conf_init
Breakpoint 1 at 0x40a250
(gdb) break s_p_parse_file
Function "s_p_parse_file" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 2 (s_p_parse_file) pending.
(gdb) run -C
Starting program: /usr/sbin/slurmd -C
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".

Breakpoint 2, s_p_parse_file (hashtbl=hashtbl@entry=0x63f850,
    hash_val=hash_val@entry=0x6351b0 <slurm_conf+400>,
    filename=filename@entry=0x7ffff7966c12 "/etc/slurm/slurm.conf",
    ignore_new=ignore_new@entry=false) at parse_config.c:1119
1119    parse_config.c: No such file or directory.
Missing separate debuginfos, use: debuginfo-install slurm-slurmd-20.11.5-1.el7.x86_64
(gdb) where
#0  s_p_parse_file (hashtbl=hashtbl@entry=0x63f850,
    hash_val=hash_val@entry=0x6351b0 <slurm_conf+400>,
    filename=filename@entry=0x7ffff7966c12 "/etc/slurm/slurm.conf",
    ignore_new=ignore_new@entry=false) at parse_config.c:1119
#1  0x00007ffff78b74b4 in _init_slurm_conf (file_name=file_name@entry=0x0)
    at read_config.c:3124
#2  0x00007ffff78bd266 in slurm_conf_lock () at read_config.c:3420
#3  0x00007ffff78ddfbd in slurm_get_sched_params () at slurm_protocol_api.c:592
#4  0x000000000042304b in xcpuinfo_hwloc_topo_get (p_cpus=0x639a94,
    p_boards=0x639a96, p_sockets=0x639a98, p_cores=0x639a9a,
    p_threads=0x639a9c, p_block_map_size=0x639ab0, p_block_map=0x639ab8,
    p_block_map_inv=0x639ac0) at xcpuinfo.c:325
#5  0x000000000040ea06 in _print_config () at slurmd.c:1337
#6  _process_cmdline (av=0x7fffffffdf48, ac=2) at slurmd.c:1420
#7  _slurmd_init () at slurmd.c:1680
#8  main (argc=2, argv=0x7fffffffdf48) at slurmd.c:295
```

A new observation: on another node with the same 20.11.5 I don't see the problem:

```
[root@b001 ~]# slurmd -C
NodeName=b001 CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=1 RealMemory=772299
UpTime=23-06:09:45
```

Looking at the outputs of these two very similar Dell PowerEdge R640 servers, I see a difference:

```
a001: CPUs=40 Boards=1 SocketsPerBoard=4 CoresPerSocket=10 ThreadsPerCore=1
b001: CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=1
```

The only difference is that the a??? nodes were delivered by Dell with performance-optimized BIOS settings, including Sub NUMA Cluster = Enabled, hence the SocketsPerBoard=4 CoresPerSocket=10. This is at variance with what's defined in slurm.conf. The relevant lines from slurm.conf are:

```
# Dell R640 Intel Xeon Skylake 40-core nodes (768 GB RAM)
NodeSet=b_nodes_768 Nodes=b[001-012]
NodeName=b[001-012] Weight=10735 Sockets=2 CoresPerSocket=20 ThreadsPerCore=1 RealMemory=768000 TmpDisk=140000 Feature=xeon6148v5,opa,xeon40
# Dell R640 Intel Xeon Cascade Lake 40-core nodes (384 GB RAM)
NodeSet=a_nodes_384 Nodes=a[001-128]
NodeName=a[001-128] Weight=10536 Sockets=2 CoresPerSocket=20 ThreadsPerCore=1 RealMemory=384000 TmpDisk=140000 Feature=xeon6242r,opa,xeon40
```

Maybe slurmd gets confused by the differences and decides to read the local slurm.conf? Do you have any ideas about the best way to proceed?

**Tim McMullan**

That's perfect, thank you! I'll let you know if I need any more information!

--Tim

**Tim McMullan (comment #6)**

Hi Ole,

I've taken a look at the trace, your configuration, and the hint you gave about the node differences, and I see what is happening here. We have a parameter that deals with how we want to handle multiple NUMA nodes on a socket (SchedulerParameters=Ignore_NUMA), so in your case slurmd -C attempts to read the Slurm config when it otherwise wouldn't. I'm looking into how best to approach this from our side.

There are a couple of options for working around this:

a) Disable "Sub NUMA Cluster" in the BIOS if you don't need it, which will avoid the issue with slurmd -C.

b) Update the settings in slurm.conf to "SocketsPerBoard=4 CoresPerSocket=10" for the Cascade Lake nodes.

Which way you go is really up to you and whether this is something new you need in your cluster. It seems like you aren't running with Sub NUMA on the Skylake nodes, so disabling it would make the new nodes more closely resemble the older ones.
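[Editor's note: option (b) would amount to redefining the a-nodes along these lines — a sketch only, built from the slurm.conf excerpt quoted earlier in this ticket; Weight, RealMemory, TmpDisk, and Feature values are copied from it unchanged:]

```
# Dell R640 Intel Xeon Cascade Lake 40-core nodes (384 GB RAM),
# Sub NUMA Cluster enabled: describe each NUMA domain as a socket,
# matching what "slurmd -C" reports (SocketsPerBoard=4 CoresPerSocket=10)
NodeName=a[001-128] Weight=10536 SocketsPerBoard=4 CoresPerSocket=10 ThreadsPerCore=1 RealMemory=384000 TmpDisk=140000 Feature=xeon6242r,opa,xeon40
```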
This particular call happening before the Slurm config exists should only happen with slurmd -C, so I expect it won't be an issue when slurmd starts normally!

Thanks!
--Tim

**Ole H. Nielsen**

Hi Tim,

(In reply to Tim McMullan from comment #6)
> We have a parameter that deals with how we want to handle multiple numa
> nodes on a socket (SchedulerParameters=Ignore_NUMA), so in your case it
> attempts to read the slurm config when it otherwise wouldn't. I'm looking
> into how to best approach this from our side now.

Thanks, this should get sorted out eventually.

> There are a couple options for you as far as "working around" this:
>
> a) Disable "Sub NUMA Cluster" in the bios if you don't need it, which will
> avoid the issue with slurmd -C
>
> b) Update the settings in slurm.conf to "SocketsPerBoard=4
> CoresPerSocket=10" for the Cascade Lake nodes

I've seen papers documenting that Sub NUMA Cluster (SNC) can give a couple of percent better performance, so I'd like to go with that.

> This particular call happening before the slurm config exists should only
> happen with slurmd -C, so I expect it won't be an issue for when the slurmd
> starts normally!

Good to know that there are no further impacts.
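[Editor's note: the topology arithmetic behind the two readings discussed above can be sketched as follows. This is illustrative Python, not Slurm code; the function name and parameters are made up for the example.]

```python
def reported_topology(sockets, cores_per_socket, numa_per_socket, numa_as_socket):
    """Return (SocketsPerBoard, CoresPerSocket) as slurmd -C would report them.

    When slurmd considers "each NUMA node as a socket" (i.e. Ignore_NUMA is
    not set), every NUMA domain counts as a socket and the physical cores
    are divided among the domains.
    """
    if numa_as_socket and numa_per_socket > 1:
        return (sockets * numa_per_socket, cores_per_socket // numa_per_socket)
    return (sockets, cores_per_socket)

# b001: SNC disabled, 1 NUMA domain per socket -> matches slurm.conf
print(reported_topology(2, 20, 1, True))   # (2, 20)
# a001: SNC enabled, 2 NUMA domains per socket -> at variance with slurm.conf
print(reported_topology(2, 20, 2, True))   # (4, 10)
```

Either working around the mismatch (option b) or disabling SNC (option a) makes the reported and configured tuples agree again.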
**Ole H. Nielsen**

I have a new observation: I've reconfigured the BIOS for SNC on some Xeon Cascade Lake nodes, so that I now have 4 NUMA domains on 2 sockets:

```
$ numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 5 6 10 11 12 15 16 40 41 42 45 46 50 51 52 55 56
node 0 size: 46795 MB
node 0 free: 23177 MB
node 1 cpus: 3 4 7 8 9 13 14 17 18 19 43 44 47 48 49 53 54 57 58 59
node 1 size: 48360 MB
node 1 free: 41965 MB
node 2 cpus: 20 21 22 25 26 30 31 32 35 36 60 61 62 65 66 70 71 72 75 76
node 2 size: 48376 MB
node 2 free: 46559 MB
node 3 cpus: 23 24 27 28 29 33 34 37 38 39 63 64 67 68 69 73 74 77 78 79
node 3 size: 48376 MB
node 3 free: 46409 MB
node distances:
node   0   1   2   3
  0:  10  11  21  21
  1:  11  10  21  21
  2:  21  21  10  11
  3:  21  21  11  10
```

I've also reconfigured slurm.conf on the slurmctld server (we're using Configless) and done an "scontrol reconfig", and things seem to be OK:

```
$ scontrol show node s004
NodeName=s004 Arch=x86_64 CoresPerSocket=10
   CPUAlloc=24 CPUTot=80 CPULoad=3.32
   AvailableFeatures=xeon5218r,GPU_RTX3090
   ActiveFeatures=xeon5218r,GPU_RTX3090
   Gres=gpu:RTX3090:10
   NodeAddr=s004 NodeHostName=s004 Version=20.11.5
   OS=Linux 3.10.0-1160.24.1.el7.x86_64 #1 SMP Thu Apr 8 19:51:47 UTC 2021
   RealMemory=191000 AllocMem=35840 FreeMem=158357 Sockets=4 Boards=1
   State=MIXED ThreadsPerCore=2 TmpDisk=850000 Weight=19336 Owner=N/A MCS_label=N/A
   Partitions=sm3090
   BootTime=2021-04-26T09:17:43 SlurmdStartTime=2021-04-26T12:04:20
   CfgTRES=cpu=80,mem=191000M,billing=160
   AllocTRES=cpu=24,mem=35G
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Comment=(null)
```

In spite of this seemingly correct configuration, I still get the error message on the node:

```
$ slurmd -C
NodeName=s004
slurmd: error: s_p_parse_file: unable to status file /etc/slurm/slurm.conf: No such file or directory, retrying in 1sec up to 60sec
slurmd: error: ClusterName needs to be specified
slurmd: Considering each NUMA node as a socket
CPUs=80 Boards=1 SocketsPerBoard=4 CoresPerSocket=10 ThreadsPerCore=2 RealMemory=191909
UpTime=0-22:57:14
```

The node had been rebooted, and I've also restarted slurmd. So something is still missing! Is it possible that a restart of slurmctld is also required for the node reconfiguration to work correctly? AFAIK, such a requirement hasn't been documented anywhere.

Thanks,
Ole

**Tim McMullan (comment #8)**

Hi Ole,

I'm not surprised to see that slurmd -C is still showing errors. slurmd -C never tries to determine where the config file is, because it *almost* never needs the config; the one "Ignore_NUMA" option is the only thing that would change the output. I expect that in configless mode, slurmd -C will always display that error on a node with multiple NUMA domains per socket, but it shouldn't impact the slurmd that is running as a daemon.

Thanks!
--Tim

**Ole H. Nielsen**

Hi Tim,

(In reply to Tim McMullan from comment #8)
> I'm not surprised to see that slurmd -C is still showing errors. slurmd -C
> never tries to determine where the config file is because it *almost* never
> needs the config. The one "Ignore_NUMA" option is the only one that would
> change the output.

IMHO, slurmd should read the slurm.conf file (from configless or wherever) instead of throwing an error and waiting for 60 seconds before printing the output.

> I expect that in configless, slurmd -C will always display that error when
> on a node with multiple numa domains per socket, but shouldn't impact the
> slurmd that is running as a daemon.

I agree that the daemon seems to be running correctly. Nevertheless, I don't think an error message and a timeout are correct and unavoidable behavior, when all it takes is reading the slurm.conf. The code to do so must already be available in slurmd. Could you kindly flag this as a bug?

Thanks,
Ole

**Tim McMullan (comment #10)**

Hi Ole,

I'm sorry it wasn't clearer, but we do consider the behavior a bug, and I'm looking at an appropriate fix. However, what we consider the bug here is that it tries to read the config file at all.
slurmd -C is intended to be run before the node has been configured, to show what we think the hardware configuration is. The attempt to read the config isn't intended (and shouldn't be necessary), so that is what I'm working on fixing!

I hope this clears things up. Thanks!
--Tim

**Ole H. Nielsen**

(In reply to Tim McMullan from comment #10)
> I'm sorry it wasn't more clear, but we do consider the behavior a bug and
> I'm looking at an appropriate fix. However, what we are considering the bug
> here is that it tries to read the config file at all.
>
> slurmd -C is intended to be run before the node has been configured to show
> what we think the hardware configuration is. The attempt to read the
> config isn't intended (and shouldn't be necessary), so that is what I'm
> working on fixing!
>
> I hope this clears things up.

Perfect, thanks!
Ole

**Tim McMullan**

Hi Ole,

This issue has been fixed on master and should appear in 21.08! I'm going to resolve this for now!

Thanks!
--Tim

*** Ticket 12053 has been marked as a duplicate of this ticket. ***
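[Editor's note: the "retrying in 1sec up to 60sec" behaviour visible in the error messages throughout this ticket can be sketched as a bounded retry loop. This is illustrative Python, not the actual s_p_parse_file implementation; the function name and the injectable sleep parameter are made up for the example. The 21.08 fix removes the config read from slurmd -C entirely rather than tuning a loop like this.]

```python
import os
import time

def wait_for_config(path, retry_s=1, timeout_s=60, sleep=time.sleep):
    """Poll for a config file, retrying every retry_s seconds up to timeout_s.

    Mirrors the reported behaviour: each failed check logs an error and
    sleeps; after the timeout the caller gives up and proceeds without the
    file (leading to follow-on errors such as the missing ClusterName).
    """
    waited = 0
    while not os.path.exists(path):
        if waited >= timeout_s:
            return False  # gave up waiting for the file
        print(f"error: unable to stat file {path}, "
              f"retrying in {retry_s}sec up to {timeout_s}sec")
        sleep(retry_s)
        waited += retry_s
    return True
```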