Bug 13805 - slurmd general protection error
Summary: slurmd general protection error
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd (show other bugs)
Version: 21.08.6
Hardware: Linux Linux
Importance: --- 4 - Minor Issue
Assignee: Dominik Bartkiewicz
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2022-04-08 09:33 MDT by Paul Edmon
Modified: 2023-03-31 02:53 MDT (History)
0 users

See Also:
Site: Harvard University
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 23.02.2
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
/var/log/messages for holy7c16602 (3.08 MB, application/x-gzip)
2022-04-08 11:51 MDT, Paul Edmon
Details
slurm_task_prolog (543 bytes, application/x-shellscript)
2022-04-08 11:51 MDT, Paul Edmon
Details
Current slurm.conf (70.77 KB, text/x-matlab)
2022-04-08 12:01 MDT, Paul Edmon
Details
Current topology.conf (4.42 KB, text/x-matlab)
2022-04-08 12:20 MDT, Paul Edmon
Details
/var/log/messages for holy7c16602 for the week ending 03/20/22 (574.50 KB, application/x-gzip)
2022-04-08 12:43 MDT, Paul Edmon
Details
gdb output for frames 4 and 5 (21.67 KB, text/plain)
2022-04-09 07:23 MDT, Paul Edmon
Details
Diff file for the last changes made to the slurm.conf and topology.conf (4.96 KB, patch)
2022-04-09 07:27 MDT, Paul Edmon
Details | Diff

Description Paul Edmon 2022-04-08 09:33:04 MDT
I'm seeing nodes go into failure mode recently.  When I looked into our logs on the nodes and saw:

Apr  8 11:21:20 holy7c16602 kernel: [13019768.592961] traps: slurmd[103269] general protection ip:2b6140156065 sp:2b614617da90 error:0 in libslurmfull.so[2b61400af000+1d1000]
Apr  8 11:21:20 holy7c16602 abrt-hook-ccpp[103270]: Process 97407 (slurmd) of user 0 killed by SIGSEGV - dumping core
Apr  8 11:21:20 holy7c16602 abrt-server[103271]: Package 'slurm-slurmd' isn't signed with proper key
Apr  8 11:21:20 holy7c16602 abrt-server[103271]: 'post-create' on '/var/spool/abrt/ccpp-2022-04-08-11:21:20-97407' exited with 1
Apr  8 11:21:20 holy7c16602 abrt-server[103271]: Deleting problem directory '/var/spool/abrt/ccpp-2022-04-08-11:21:20-97407'

I'm not sure if this is related to a user code or if this is a bug with slurmd? Any thoughts?
Comment 1 Michael Hinton 2022-04-08 10:44:45 MDT
Can you attach a `t a a bt` GDB backtrace from the core file produced by that slurmd segfault?
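For reference, that backtrace can also be captured non-interactively once matching debug symbols are installed; a minimal sketch (the core file name is a placeholder):

# Install matching debug symbols so frames resolve (EL7)
debuginfo-install -y slurm-slurmd-21.08.6-1fasrc01.el7.x86_64

# "t a a bt" is short for "thread apply all bt"; dump it in batch mode
gdb --batch -ex 'thread apply all bt' /usr/sbin/slurmd core.<pid> > slurmd-bt.txt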
Comment 2 Paul Edmon 2022-04-08 10:52:44 MDT
[root@holy7c16602 slurmd]# gdb /sbin/slurmstepd core.195454
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-120.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/sbin/slurmstepd...done.

warning: exec file is newer than core file.
[New LWP 195454]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `slurmstepd: [57480851'.
Program terminated with signal 6, Aborted.
#0  0x00002ab1461b7387 in raise () from /usr/lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install slurm-slurmd-21.08.6-1fasrc01.el7.x86_64
(gdb) t a a bt

Thread 1 (Thread 0x2ab144a13300 (LWP 195454)):
#0  0x00002ab1461b7387 in raise () from /usr/lib64/libc.so.6
#1  0x00002ab1461b8a78 in abort () from /usr/lib64/libc.so.6
#2  0x00002ab1461f9f67 in __libc_message () from /usr/lib64/libc.so.6
#3  0x00002ab146202329 in _int_free () from /usr/lib64/libc.so.6
#4  0x00002ab144d85198 in _xstrfmtcatat (str=0x2fb7e, pos=0x1439ff0, fmt=0x6 <Address 0x6 out of bounds>) at xstring.c:301
#5  0x00002ab144cd4a98 in plugrack_use_by_type (rack=0x2fb7e, full_type=0x2fb7e <Address 0x2fb7e out of bounds>) at plugrack.c:331
#6  0x000000000144d550 in ?? ()
#7  0x00007ffcf2c0ea50 in ?? ()
#8  0x0000000001445fd0 in ?? ()
#9  0x00007ffcf2c0ea30 in ?? ()
#10 0x00002ab144cff457 in arg_set_data_wait_all_nodes (opt=0x2ab144d85198 <_xstrfmtcatat+128>, opt@entry=<error reading variable: Cannot access memory at address 0x100000000001008>, arg=0x2fb7e, errors=0x6)
    at slurm_opt.c:4945
(gdb)
Comment 3 Michael Hinton 2022-04-08 11:06:34 MDT
Can you give more context about this failure? Is it all nodes, or some? Is it only happening since an upgrade or some other event, or did it just start happening out of the blue? Can you attach the slurmd.log for a node that segfaulted like this? Are all the failures similar to each other?

Thanks!
-Michael
Comment 4 Paul Edmon 2022-04-08 11:45:13 MDT
It seems to have started out of the blue within the past day or so.  I 
noticed more and more hosts showing up as:

[root@sa02 ~]# scontrol show node holy7c16208
NodeName=holy7c16208 Arch=x86_64 CoresPerSocket=24
    CPUAlloc=0 CPUTot=48 CPULoad=0.03
    AvailableFeatures=intel,holyhdr,cascadelake,avx,avx2,avx512
    ActiveFeatures=intel,holyhdr,cascadelake,avx,avx2,avx512
    Gres=(null)
    NodeAddr=holy7c16208 NodeHostName=holy7c16208 Version=21.08.6
    OS=Linux 3.10.0-1160.36.2.el7.x86_64 #1 SMP Wed Jul 21 11:57:15 UTC 2021
    RealMemory=192892 AllocMem=0 FreeMem=165111 Sockets=2 Boards=1
    MemSpecLimit=4096
    State=DOWN ThreadsPerCore=1 TmpDisk=70265 Weight=1 Owner=N/A 
MCS_label=N/A
    Partitions=emergency,serial_requeue,shared
    BootTime=2021-08-12T17:36:53 SlurmdStartTime=2022-04-08T11:08:43
    LastBusyTime=2022-04-08T10:49:10
    CfgTRES=cpu=48,mem=192892M,billing=95
    AllocTRES=
    CapWatts=n/a
    CurrentWatts=0 AveWatts=0
    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
    Reason=Not responding [slurm@2022-04-08T10:58:27]

in the scheduler.  At first I was just rebooting them, thinking that a 
user job was hosing the nodes or that our older IB fabric was having 
issues, but it's been impacting our newer hardware as well.  The failures 
all seem to be the same type.

Apr  8 10:41:50 holy7c16208 kernel: [20576238.128832] slurmd[256218]: 
segfault at 2cbee ip 00002b54f8972065 sp 00002b54ff2a2a90 error 4 in 
libslurmfull.so[2b54f88cb000+1d1000]

That's another example. I will attach the full log from the node I 
looked at earlier so you can see the context.

As for recent changes, we upgraded to 21.08.6 at the start of March and 
it's been running fine.  The only other substantial change was that we 
started running a Slurm task prolog to set MALLOC_ARENA_MAX, due to 
issues with Matlab eating a ton of memory by default.  I will attach the 
task prolog so you can see what we are doing.
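For reference, a TaskProlog is just a script whose stdout slurmd interprets: any line of the form "export NAME=value" is added to the task's environment. A minimal sketch of the idea (not the attached script; the arena count of 4 is an assumption):

#!/bin/bash
# Printed "export NAME=value" lines are injected into each task's environment by slurmd.
# The value 4 here is only an illustrative assumption.
echo "export MALLOC_ARENA_MAX=4"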

-Paul Edmon-

Comment 5 Paul Edmon 2022-04-08 11:48:15 MDT
By the way, this appears to be a slow rolling outage.  Random nodes 
pick up this error and then close off.  We do have a serial_requeue 
backfill queue running over all our hardware, so if this is caused by a 
job in there, that might explain why it's seemingly random and widespread. 
Though why a job would cause this sort of issue is beyond me, as jobs 
should be self-contained in a cgroup.

-Paul Edmon-

Comment 6 Paul Edmon 2022-04-08 11:51:14 MDT
Created attachment 24341 [details]
/var/log/messages for holy7c16602

Contains slurmd logs
Comment 7 Paul Edmon 2022-04-08 11:51:31 MDT
Created attachment 24342 [details]
slurm_task_prolog
Comment 8 Michael Hinton 2022-04-08 11:58:44 MDT
(In reply to Paul Edmon from comment #5)
> By the way this appears to be a slow rolling outage.  Just random nodes 
> pick up this error and then close off.  We do have a serial_requeue 
> backfill queue running over all our hardware, so if this is caused by a 
> job in there that might explain why its seemingly random and widespread. 
> Though why a job would cause this sort of issue is beyond me as they 
> should be self contained in a cgroup.
Can you attach your current slurm.conf? Does your cluster allow root jobs that could possibly tamper with system settings on the node?

If there is a job that is causing these outages, can you try to find that job in the slurmctld.log? What jobs are running on each node before they fail? Are the failures linked to the slurmd trying to start a job, or is it a random failure? If you drain some nodes, do they still eventually fail?
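One way to pull up the candidate jobs is from accounting rather than the logs; a sketch, with the node name, time window, and log path as placeholders:

# Jobs that ran on the node in the window leading up to the segfault
sacct -X -N holy7c16602 -S 2022-04-08T10:00 -E 2022-04-08T11:30 \
      -o JobID,JobName,User,Partition,Start,End,State

# Or grep the controller log for allocations on that node (log path is a guess)
grep holy7c16602 /var/log/slurm/slurmctld.log | grep Allocate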
Comment 9 Paul Edmon 2022-04-08 12:01:06 MDT
Created attachment 24343 [details]
Current slurm.conf
Comment 10 Michael Hinton 2022-04-08 12:02:56 MDT
Could you also do a `t a a bt full` on the same core file?
Comment 11 Paul Edmon 2022-04-08 12:06:30 MDT
I've attached my slurm.conf.  We do allow root jobs but no one is 
running any right now and users shouldn't be able to run them.

I haven't tried to trace back whether a block of jobs is causing this yet.  
We run in a mode where nodes can have multiple different types of jobs 
on them, so I would need to go see if there is a common block of jobs 
that caused this or not.  I can start looking and see if anything 
suspicious is going on.

-Paul Edmon-

Comment 12 Paul Edmon 2022-04-08 12:06:34 MDT
I'm sorry, I just realized I did the original bt on the wrong core file.  
Here is the proper one, with much more info:

[root@holy7c16602 slurmd]# gdb /sbin/slurmd core.97407
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-120.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
<http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/sbin/slurmd...done.
[New LWP 103269]
[New LWP 97407]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/slurmd -D -s'.
Program terminated with signal 11, Segmentation fault.
#0  0x00002b6140156065 in host_prefix_end (dims=1, 
hostname=0x3439323736393439 <Address 0x3439323736393439 out of bounds>) 
at hostlist.c:471
471     hostlist.c: No such file or directory.
Missing separate debuginfos, use: debuginfo-install 
slurm-slurmd-21.08.6-1fasrc01.el7.x86_64
(gdb) t a a bt full

Thread 2 (Thread 0x2b613fca5880 (LWP 97407)):
#0  0x00002b61416219dd in accept () from /usr/lib64/libpthread.so.0
No symbol table info available.
#1  0x00002b61401eeb1b in slurm_accept_msg_conn (fd=<optimized out>, 
addr=<optimized out>) at slurm_protocol_socket.c:468
         len = 128
#2  0x000000000040f422 in _msg_engine () at slurmd.c:473
         cli = 0x21a7270
         sock = <optimized out>
#3  main (argc=3, argv=0x7ffca285ee98) at slurmd.c:388
         pidfd = 7
         blocked_signals = {13, 0}
         oom_value = <optimized out>
         curr_uid = <optimized out>
         time_stamp = "Mon, 14 Mar 2022 14:29:44 
-0400\000\200텢\374\177\000\000\017\004\000\000\000\000\000\000\377\377\377\377\377\377\377\377\230\211/b\000\000\000\000\363\000\342\033\000\000\000\000<{\241(\377\037\000\000\337\277i@a+\000\000\000\000\000\000\000\000\000\000\300텢\374\177\000\000\320u\372\001\000\000\000\000\360셢\374\177\000\000\365\265i@a+\000\000\003\000\000\000\000\000\000\000\360\001\000\000\000\000\000\000|\000\000\000\000\000\000\000\022N\213Aa+\000\000`y\372\001\000\000\000\000`y\372\001\000\000\000\000\350텢\374\177\000\000Py\372\001\000\000\000\000`g\277Aa+\000\000\260\006\002", 
'\000' <repeats 13 times>...
         __func__ = "main"

Thread 1 (Thread 0x2b614617e700 (LWP 103269)):
#0  0x00002b6140156065 in host_prefix_end (dims=1, 
hostname=0x3439323736393439 <Address 0x3439323736393439 out of bounds>) 
at hostlist.c:471
         idx = <optimized out>
#1  hostname_create_dims (hostname=hostname@entry=0x3439323736393439 
<Address 0x3439323736393439 out of bounds>, dims=1) at hostlist.c:502
         hn = 0x2b619c02e880
         p = 0x2b614617dae0 "\020\333\027Fa+"
         idx = 0
         hostlist_base = 10
         __func__ = "hostname_create_dims"
#2  0x00002b6140156eda in hostlist_push_host_dims 
(hl=hl@entry=0x2b619c028a30, str=str@entry=0x3439323736393439 <Address 
0x3439323736393439 out of bounds>, dims=<optimized out>) at hostlist.c:1960
         hr = <optimized out>
         hn = <optimized out>
#3  0x00002b6140157268 in hostlist_push_host 
(hl=hl@entry=0x2b619c028a30, str=0x3439323736393439 <Address 
0x3439323736393439 out of bounds>) at hostlist.c:1979
         dims = <optimized out>
#4  0x00002b614016654f in bitmap2hostlist 
(bitmap=bitmap@entry=0x2b619c028940) at node_conf.c:199
         i = 1604
         first = <optimized out>
         last = 1606
         hl = 0x2b619c028a30
#5  0x00002b6143b2c7ae in route_p_split_hostlist (hl=<optimized out>, 
sp_hl=0x2b614617dcc0, count=0x2b614617dcbc, tree_width=<optimized out>) 
at route_topology.c:233
         i = 1
         j = <optimized out>
         k = <optimized out>
         hl_ndx = 0
         msg_count = 12
         sw_count = 8
         lst_count = 0
         buf = 0x66bd <Address 0x66bd out of bounds>
         nodes_bitmap = <optimized out>
         fwd_bitmap = 0x2b619c028940
         node_read_lock = {conf = NO_LOCK, job = NO_LOCK, node = 
READ_LOCK, part = NO_LOCK, fed = NO_LOCK}
         __func__ = "route_p_split_hostlist"
#6  0x00002b61401f1c98 in route_g_split_hostlist 
(hl=hl@entry=0x2b619c013d00, sp_hl=sp_hl@entry=0x2b614617dcc0, 
count=count@entry=0x2b614617dcbc, tree_width=12) at slurm_route.c:168
         rc = <optimized out>
         j = <optimized out>
         nnodes = 0
         nnodex = 0
         buf = 0x2b619c013d00 "\255", <incomplete sequence \336>
#7  0x00002b614013c3fa in forward_msg (forward_struct=0x2b619c028060, 
header=header@entry=0x2b614617dda0) at forward.c:603
         hl = 0x2b619c013d00
         sp_hl = 0x2b619c02c1f0
         hl_count = 0
---Type <return> to continue, or q <return> to quit---
#8  0x00002b61401b1485 in slurm_receive_msg_and_forward (fd=8, 
orig_addr=<optimized out>, msg=msg@entry=0x2b619c014350) at 
slurm_protocol_api.c:1431
         buf = 0x2b619c028130 "%"
         buflen = 269
         header = {version = 9472, flags = 0, msg_index = 0, msg_type = 
1008, body_length = 0, ret_cnt = 0, forward = {cnt = 12, init = 65534,
             nodelist = 0x2b619c02b190 
"holy7c[18604,20105,20610,24301-24302,24604],holygpu2c[0917,1125],holygpu7c[1323,26104-26105,26301]", 
timeout = 0, tree_width = 0}, orig_addr = {ss_family = 2,
             __ss_padding = "\224~\n\037\024\227", '\000' <repeats 111 
times>, __ss_align = 0}, ret_list = 0x0}
         rc = <optimized out>
         auth_cred = 0x0
         buffer = 0x2b619c02baf0
         __func__ = "slurm_receive_msg_and_forward"
#9  0x000000000040c8f4 in _service_connection (arg=<optimized out>) at 
slurmd.c:568
         con = 0x216d160
         msg = 0x2b619c014350
         __func__ = "_service_connection"
         rc = 0
#10 0x00002b614161aea5 in start_thread () from /usr/lib64/libpthread.so.0
No symbol table info available.
#11 0x00002b614192d9fd in clone () from /usr/lib64/libc.so.6
No symbol table info available.

Comment 13 Michael Hinton 2022-04-08 12:18:47 MDT
Can you attach your current topology.conf? The crash is originating in the route/topology code.
Comment 14 Michael Hinton 2022-04-08 12:19:38 MDT
And I can go ahead and mark your attachments private, if you would like.
Comment 15 Paul Edmon 2022-04-08 12:20:59 MDT
Created attachment 24346 [details]
Current topology.conf
Comment 16 Paul Edmon 2022-04-08 12:21:32 MDT
No need, nothing in here is sensitive.  I will redact anything that is.

-Paul Edmon-

Comment 17 Michael Hinton 2022-04-08 12:39:09 MDT
From the backtrace, in thread 1 in bitmap2hostlist() (frame #4), it looks like entry 1604 of node_record_table_ptr has a corrupted name (str=0x3439323736393439 <Address
0x3439323736393439 out of bounds>). So it segfaults when it does hostlist_push_host(hl, node_record_table_ptr[i].name).

I'm not sure how this could happen, but thread 2 frame #3 looks very suspicious. The time_stamp variable is all corrupted in slurmd.c --> main():

         time_stamp = "Mon, 14 Mar 2022 14:29:44
-0400\000\200텢\374\177\000\000\017\004\000\000\000\000\000\000\377\377\377\377\377\377\377\377\230\211/b\000\000\000\000\363\000\342\033\000\000\000\000<{\241(\377\037\000\000\337\277i@a+\000\000\000\000\000\000\000\000\000\000\300텢\374\177\000\000\320u\372\001\000\000\000\000\360셢\374\177\000\000\365\265i@a+\000\000\003\000\000\000\000\000\000\000\360\001\000\000\000\000\000\000|\000\000\000\000\000\000\000\022N\213Aa+\000\000`y\372\001\000\000\000\000`y\372\001\000\000\000\000\350텢\374\177\000\000Py\372\001\000\000\000\000`g\277Aa+\000\000\260\006\002",

Can you get me the slurmd.log for 3/14 and see what it says for "slurmd: slurmd started on  ..."? If that looks normal, then I think something is corrupting memory, because I see no other place time_stamp could be touched.

Is the hardware faulty at all? It sounds like probably not, since it's happening to new hardware.
Comment 18 Paul Edmon 2022-04-08 12:43:59 MDT
Created attachment 24347 [details]
/var/log/messages for holy7c16602 for the week ending 03/20/22
Comment 19 Michael Hinton 2022-04-08 12:47:43 MDT
(In reply to Michael Hinton from comment #17)
> Can you get me the slurmd.log for 3/14 and see what it says for "slurmd:
> slurmd started on  ..."? If that looks normal, then I think something is
> corrupting memory, because I see no other place time_stamp could be touched.
Sorry, I think this is a red herring that I overreacted to. I think GDB prints all bytes in a stack-allocated array like `char time_stamp[256];`, and the timestamp string does seem to end in a NULL byte (\000), meaning the bytes are just random values on the stack that don't matter. The 3/14 slurmd log shows nothing weird.
Comment 20 Paul Edmon 2022-04-08 12:49:21 MDT
Sure, I've attached that log.  Looking through it I do see a slurmd 
restart at the time listed in that timestamp.

The hardware isn't faulty, and it's happening in rashes across the cluster 
on different classes of nodes.  At one point overnight I had about 200 
nodes not responding.  So I doubt it's hardware.

We have been making quite a few changes to the slurm.conf and 
topology.conf to add new nodes and decommission old ones, so I have 
been doing global restarts to pull in the new hardware and pull out the 
old stuff.  I did double-check, though, that my slurm.conf and 
topology.conf were consistent, and when they weren't I fixed them ASAP.

It does look like I did do a global restart at that time as the 
slurmctld shows:


Mar 14 14:29:39 holy-slurm02 slurmctld[23705]: sched: Allocate 
JobId=65773208_669(65776019) NodeList=holyhbs04 #CPUs=1 
Partition=serial_requeue
Mar 14 14:29:39 holy-slurm02 slurmctld: slurmctld: _job_complete: 
JobId=65773208_661(65776011) WEXITSTATUS 0
Mar 14 14:29:39 holy-slurm02 slurmctld[23705]: _job_complete: 
JobId=65773208_661(65776011) WEXITSTATUS 0
Mar 14 14:29:44 holy-slurm02 systemd: Stopping Slurm controller daemon...
Mar 14 14:29:44 holy-slurm02 kernel: nfsd: too many open connections, 
consider increasing the number of threads
Mar 14 14:29:44 holy-slurm02 kernel: nfsd: too many open connections, 
consider increasing the number of threads
Mar 14 14:29:44 holy-slurm02 kernel: nfsd: too many open connections, 
consider increasing the number of threads
Mar 14 14:29:44 holy-slurm02 kernel: nfsd: too many open connections, 
consider increasing the number of threads
Mar 14 14:29:44 holy-slurm02 kernel: nfsd: too many open connections, 
consider increasing the number of threads
Mar 14 14:29:44 holy-slurm02 kernel: nfsd: too many open connections, 
consider increasing the number of threads
Mar 14 14:29:44 holy-slurm02 kernel: nfsd: too many open connections, 
consider increasing the number of threads
Mar 14 14:29:44 holy-slurm02 kernel: nfsd: too many open connections, 
consider increasing the number of threads
Mar 14 14:29:44 holy-slurm02 kernel: nfsd: too many open connections, 
consider increasing the number of threads
Mar 14 14:29:44 holy-slurm02 kernel: nfsd: too many open connections, 
consider increasing the number of threads
Mar 14 14:29:48 holy-slurm02 systemd: Stopped Slurm controller daemon.
Mar 14 14:29:48 holy-slurm02 systemd: Started Slurm controller daemon.
Mar 14 14:29:50 holy-slurm02 journal: Suppressed 263 messages from 
/system.slice/slurmctld.service
Mar 14 14:29:50 holy-slurm02 slurmctld[12455]: No memory enforcing 
mechanism configured.
Mar 14 14:29:50 holy-slurm02 slurmctld[12455]: topology/tree: init: 
topology tree plugin loaded
Mar 14 14:29:50 holy-slurm02 slurmctld[12455]: sched: Backfill scheduler 
plugin loaded
Mar 14 14:29:50 holy-slurm02 kernel: net_ratelimit: 165 callbacks suppressed
Mar 14 14:29:50 holy-slurm02 kernel: nfsd: too many open connections, 
consider increasing the number of threads
Mar 14 14:29:50 holy-slurm02 slurmctld[12455]: route/topology: init: 
route topology plugin loaded
Mar 14 14:29:50 holy-slurm02 kernel: nfsd: too many open connections, 
consider increasing the number of threads

-Paul Edmon-

Comment 21 Michael Hinton 2022-04-08 13:46:50 MDT
(In reply to Paul Edmon from comment #20)
> Sure. I've attached that log.  Looking through it I do see a slurmd 
> restart at the time listed in the that timestamp.
What is your process for adding/removing nodes from the config? I'm wondering if this is causing issues.

This is what we recommend:

1) Stop the slurmctld daemon (e.g. "systemctl stop slurmctld" on the head node)
2) Update the slurm.conf/topology.conf file on all nodes in the cluster
3) Restart the slurmd daemons on all nodes (e.g. "systemctl restart slurmd" on all nodes)
4) Restart the slurmctld daemon (e.g. "systemctl start slurmctld" on the head node)

"The slurmctld daemon has a multitude of bitmaps to track state of nodes and cores in the system. Adding (or removing) nodes to a running system would require the slurmctld daemon to rebuild all of those bitmaps, which the developers feel would be safer to do by restarting the daemon."

See https://slurm.schedmd.com/faq.html#add_nodes.
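In shell terms, the sequence above is roughly the following (a sketch; pdsh and the host range stand in for salt or whatever fan-out tool is in use):

# 1) on the head node
systemctl stop slurmctld
# 2) push the updated slurm.conf/topology.conf to every node (distribution tooling not shown)
# 3) restart every slurmd (example host range is a placeholder)
pdsh -w 'holy7c[16201-16212]' systemctl restart slurmd
# 4) on the head node
systemctl start slurmctld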

So basically, a reconfig won't cut it when you add or remove nodes, and perhaps that is the issue we are seeing here.

What I want to know is when the configs were changed (and what they were changed to), when was holy7c16602 (and other slurmds) either restarted or reconfigured, and when was the slurmctld restarted or reconfigured.

I see that node holy7c16602 got reconfigure requests on:

    Apr  4 07:07:45
    Apr  7 10:45:39

and was restarted at Apr  8 11:39:10 after the segfault.

The controller restarted at Mar 14 14:29:48 - was it ever restarted or reconfigured after that time? When did you last change the config?

Thanks,
-Michael
Comment 22 Paul Edmon 2022-04-08 13:50:48 MDT
Our general process for updates is to change the slurm.conf and 
topology.conf and simultaneously restart all the slurmctld and slurmd 
daemons via salt.  Unfortunately salt is not completely reliable, so some 
nodes don't get the restart, but they do show up in the slurmctld log as 
being out of sync.  We then go clean them up.  We've been doing it this 
way for probably about 6-7 years without much issue.

We have made several changes to the slurm.conf and topology.conf since 
the 14th.  I've done two this week, one on Tuesday and another on 
Wednesday.  This is the first time I've ever seen this happen after 
changes, though.  I've certainly seen it complain about slurmds being out 
of sync because they didn't get restarted, but I've never seen it throw a 
segfault like this.

-Paul Edmon-

Comment 23 Paul Edmon 2022-04-08 13:53:55 MDT
I will note that the rash of unresponsive nodes seems to have mostly 
passed.  So I think your theory of a bad bitmap, due to a change to the 
slurm.conf not being propagated, is correct.  It's possible that these 
nodes did not get their slurmd restarted with the latest change and thus 
had a bad map.

-Paul Edmon-

Comment 24 Michael Hinton 2022-04-08 15:07:46 MDT
I wonder if there was some recent change in 21.08 that allowed for this segfault to occur, since you've never seen it before.

I want to try to reproduce this. Can you describe the config change you most recently did in slurm.conf and topology.conf (was it adding nodes, removing nodes, changing switch configurations, changing node names in place, etc.)?
Comment 25 Michael Hinton 2022-04-08 18:18:33 MDT
Hi Paul,

Can you go back into the slurmd core file and get more information for me from frames 5 and 6 from thread 1?

# frame 5 - route_p_split_hostlist()
f 5
p i
p j
p k
p node_record_count
p switch_record_cnt
p switch_levels
p switch_record_table
p switch_record_table[j]
p switch_record_table[j].name
p switch_record_table[j].switch_index
p switch_record_table[j].switch_index[i]
p switch_record_table[j].switch_index[i].name
p switch_record_table[k]
p switch_record_table[k].name
p switch_record_table[k].node_bitmap
p switch_record_table[k].node_bitmap[0]
p switch_record_table[k].node_bitmap[1]
p nodes_bitmap
p nodes_bitmap[0]
p nodes_bitmap[1]
p sw_count
p fwd_bitmap
p fwd_bitmap[0]
p fwd_bitmap[1]


# frame 4 - bitmap2hostlist()
f 4
p i
p first
p last
p node_record_count
p switch_record_cnt
p switch_levels
p node_record_table_ptr
p node_record_table_ptr[0]
p node_record_table_ptr[0].name
p node_record_table_ptr[1]
p node_record_table_ptr[1].name
p node_record_table_ptr[2]
p node_record_table_ptr[2].name
p node_record_table_ptr[1602]
p node_record_table_ptr[1602].name
p node_record_table_ptr[1603]
p node_record_table_ptr[1603].name
p node_record_table_ptr[1604]
p node_record_table_ptr[1604].name
p node_record_table_ptr[1605]
p node_record_table_ptr[1605].name
p node_record_table_ptr[1606]
p node_record_table_ptr[1606].name

Lots of these may be optimized out, but that's ok.

I have a theory that the number of nodes initialized in the bitmaps for the switches was a few nodes greater than the number of nodes initialized from slurm.conf, which caused a mismatch between the two bitmaps.
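If it is easier, those prints can be run against the core in one pass with a gdb command file; a sketch (only a few of the prints are shown):

cat > prints.gdb <<'EOF'
thread 1
frame 5
p node_record_count
p switch_record_cnt
p sw_count
frame 4
p node_record_table_ptr[1604]
p node_record_table_ptr[1604].name
EOF

gdb --batch -x prints.gdb /usr/sbin/slurmd core.97407 > frames-4-5.txt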
Comment 26 Paul Edmon 2022-04-09 07:23:28 MDT
Created attachment 24352 [details]
gdb output for frames 4 and 5
Comment 27 Paul Edmon 2022-04-09 07:27:16 MDT
Created attachment 24353 [details]
Diff file for the last changes made to the slurm.conf and topology.conf
Comment 28 Paul Edmon 2022-04-09 07:29:32 MDT
I've attached the output of that, along with the last diff done to the 
slurm.conf and topology.conf (we make changes to these by git merge 
request, so it's easy to pull the last changes).  As you can see, we 
removed several blocks of nodes.

-Paul Edmon-

Comment 30 Michael Hinton 2022-04-11 11:28:34 MDT
Hey Paul.

It looks like the switch record table's node bitmaps were getting corrupted somehow, so I want to see what the whole table looks like. Could you do some more GDB prints for me?:

f 5
p switch_record_table
p switch_record_cnt
p switch_levels
p switch_record_table[0]
p switch_record_table[0].node_bitmap
p switch_record_table[0].node_bitmap[0]
p switch_record_table[0].node_bitmap[1]
p switch_record_table[1]
p switch_record_table[1].node_bitmap
p switch_record_table[1].node_bitmap[0]
p switch_record_table[1].node_bitmap[1]
p switch_record_table[2]
p switch_record_table[2].node_bitmap
p switch_record_table[2].node_bitmap[0]
p switch_record_table[2].node_bitmap[1]
p switch_record_table[3]
p switch_record_table[3].node_bitmap
p switch_record_table[3].node_bitmap[0]
p switch_record_table[3].node_bitmap[1]
p switch_record_table[4]
p switch_record_table[4].node_bitmap
p switch_record_table[4].node_bitmap[0]
p switch_record_table[4].node_bitmap[1]
p switch_record_table[5]
p switch_record_table[5].node_bitmap
p switch_record_table[5].node_bitmap[0]
p switch_record_table[5].node_bitmap[1]
p switch_record_table[6]
p switch_record_table[6].node_bitmap
p switch_record_table[6].node_bitmap[0]
p switch_record_table[6].node_bitmap[1]
p switch_record_table[7]
p switch_record_table[7].node_bitmap
p switch_record_table[7].node_bitmap[0]
p switch_record_table[7].node_bitmap[1]
p switch_record_table[8]
p switch_record_table[8].node_bitmap
p switch_record_table[8].node_bitmap[0]
p switch_record_table[8].node_bitmap[1]
p switch_record_table[9]
p switch_record_table[9].node_bitmap
p switch_record_table[9].node_bitmap[0]
p switch_record_table[9].node_bitmap[1]
p switch_record_table[10]
p switch_record_table[10].node_bitmap
p switch_record_table[10].node_bitmap[0]
p switch_record_table[10].node_bitmap[1]
p switch_record_table[11]
p switch_record_table[11].node_bitmap
p switch_record_table[11].node_bitmap[0]
p switch_record_table[11].node_bitmap[1]
p switch_record_table[12]
p switch_record_table[12].node_bitmap
p switch_record_table[12].node_bitmap[0]
p switch_record_table[12].node_bitmap[1]
p switch_record_table[13]
p switch_record_table[13].node_bitmap
p switch_record_table[13].node_bitmap[0]
p switch_record_table[13].node_bitmap[1]

Thanks,
-Michael
Comment 31 Paul Edmon 2022-04-11 12:10:23 MDT
#5  0x00002b6143b2c7ae in route_p_split_hostlist (hl=<optimized out>, 
sp_hl=0x2b614617dcc0, count=0x2b614617dcbc, tree_width=<optimized out>) 
at route_topology.c:233
233     route_topology.c: No such file or directory.
(gdb) p switch_record_table
$1 = (switch_record_t *) 0x20251d0
(gdb) p switch_record_cnt
$2 = 13
(gdb) p switch_levels
$3 = 2
(gdb) p switch_record_table[0]
$4 = {level = 2, link_speed = 1, name = 0x2044a80 "main", node_bitmap = 
0x20269b0,
   nodes = 0x1fc4520 
"bloxham-r940,bos2-01-[0208-0212],holy-es-dev[01-08],holy2a[01301-01302,01304,01306,01308-01316,02301,02303-02314,02316,03301-03308,04302-04305,04308,05101-05102,05104-05108,05201-05208,05301-05308,053"..., 
num_switches = 7, parent = 0, switches = 0x2026560 
"boston,holyoke,kubzansky,legewie,murphy,tata,lichtmandce", 
switches_dist = 0x2042330, switch_index = 0x1fc8540}
(gdb) p switch_record_table[0].node_bitmap
$5 = (bitstr_t *) 0x20269b0
(gdb) p switch_record_table[0].node_bitmap[0]
$6 = 1111704645
(gdb) p switch_record_table[0].node_bitmap[1]
$7 = 1653
(gdb) p switch_record_table[1]
$8 = {level = 1, link_speed = 1, name = 0x20265c0 "boston", node_bitmap 
= 0x2025630, nodes = 0x2026c50 "bos2-01-[0208-0212]", num_switches = 2, 
parent = 0, switches = 0x2024850 "boseth,bosib", switches_dist = 0x20448e0,
   switch_index = 0x2083c10}
(gdb) p switch_record_table[1].node_bitmap
$9 = (bitstr_t *) 0x2025630
(gdb) p switch_record_table[1].node_bitmap[0]
$10 = 1111704645
(gdb) p switch_record_table[1].node_bitmap[1]
$11 = 1653
(gdb) p switch_record_table[2]
$12 = {level = 0, link_speed = 1, name = 0x2044b10 "boseth", node_bitmap 
= 0x203fcd0, nodes = 0x1fc8600 "bos2-01-0208", num_switches = 0, parent 
= 1, switches = 0x0, switches_dist = 0x1fc85b0, switch_index = 0x0}
(gdb) p switch_record_table[2].node_bitmap
$13 = (bitstr_t *) 0x203fcd0
(gdb) p switch_record_table[2].node_bitmap[0]
$14 = 1111704645
(gdb) p switch_record_table[2].node_bitmap[1]
$15 = 1653
(gdb) p switch_record_table[3]
$16 = {level = 0, link_speed = 1, name = 0x219c640 "bosib", node_bitmap 
= 0x2045290, nodes = 0x2044a20 "bos2-01-02[09-12]", num_switches = 0, 
parent = 1, switches = 0x0, switches_dist = 0x2025130, switch_index = 0x0}
(gdb) p switch_record_table[3].node_bitmap
$17 = (bitstr_t *) 0x2045290
(gdb) p switch_record_table[3].node_bitmap[0]
$18 = 1111704645
(gdb) p switch_record_table[3].node_bitmap[1]
$19 = 1653
(gdb) p switch_record_table[4]
$20 = {level = 1, link_speed = 1, name = 0x1fdd9f0 "holyoke", 
node_bitmap = 0x20268b0,
   nodes = 0x2028c70 
"bloxham-r940,holy2a[01301-01302,01304,01306,01308-01316,02301,02303-02314,02316,03301-03308,04302-04305,04308,05101-05102,05104-05108,05201-05208,05301-05308,05310-05316,06305,09201-09202,09204,09206-"..., 
num_switches = 3, parent = 0, switches = 0x2044930 
"holyeth,holyib,holyhdr", switches_dist = 0x203fa80, switch_index = 
0x20841d0}
(gdb) p switch_record_table[4].node_bitmap
$21 = (bitstr_t *) 0x20268b0
(gdb) p switch_record_table[4].node_bitmap[0]
$22 = 1111704645
(gdb) p switch_record_table[4].node_bitmap[1]
$23 = 1653
(gdb) p switch_record_table[5]
$24 = {level = 0, link_speed = 1, name = 0x2025880 "holyeth", 
node_bitmap = 0x2042730, nodes = 0x203fdd0 
"holygpu2a601,holygpu2a605,holygpu2a609,holygpu7c0915,holygpu7c17[01,06,11,16,21,26],holyolveczkygpu01",
   num_switches = 0, parent = 4, switches = 0x0, switches_dist = 
0x2047200, switch_index = 0x0}
(gdb) p switch_record_table[5].node_bitmap
$25 = (bitstr_t *) 0x2042730
(gdb) p switch_record_table[5].node_bitmap[0]
$26 = 1111704645
(gdb) p switch_record_table[5].node_bitmap[1]
$27 = 1653
(gdb) p switch_record_table[6]
$28 = {level = 0, link_speed = 1, name = 0x2026c30 "holyib", node_bitmap 
= 0x20447e0,
   nodes = 0x20465a0 
"bloxham-r940,holy2a013[01-02,04,06,08-16],holy2a023[01,03-14,16],holy2a0330[1-8],holy2a0430[2-5,8],holy2a0510[1-2,4-8],holy2a0520[1-8],holy2a053[01-08,10-16],holy2a06305,holy2a0920[1-2,4,6-8],holy2a10"..., 
num_switches = 0, parent = 4, switches = 0x0, switches_dist = 0x1fc8020, 
switch_index = 0x0}
(gdb) p switch_record_table[6].node_bitmap
$29 = (bitstr_t *) 0x20447e0
(gdb) p switch_record_table[6].node_bitmap[0]
$30 = 1111704645
(gdb) p switch_record_table[6].node_bitmap[1]
$31 = 1653
(gdb) p switch_record_table[7]
$32 = {level = 0, link_speed = 1, name = 0x1fbafe0 "holyhdr", 
node_bitmap = 0x20411c0,
   nodes = 0x2040c30 
"holy2c0929[01-02],holy2c0930[01-02],holy2c0933[01-02],holy2c0934[01-02],holy2c1129,holy7c021[03-12],holy7c022[01-12],holy7c023[01-12],holy7c024[01-12],holy7c025[01-12],holy7c026[01-12],holy7c041[01-12"..., 
num_switches = 0, parent = 4, switches = 0x0, switches_dist = 0x2044aa0, 
switch_index = 0x0}
(gdb) p switch_record_table[7].node_bitmap
$33 = (bitstr_t *) 0x20411c0
(gdb) p switch_record_table[7].node_bitmap[0]
$34 = 1111704645
(gdb) p switch_record_table[7].node_bitmap[1]
$35 = 1653
(gdb) p switch_record_table[8]
$36 = {level = 0, link_speed = 1, name = 0x2044a50 "kubzansky", 
node_bitmap = 0x2044bb0, nodes = 0x203fbd0 "holy7c23107", num_switches = 
0, parent = 0, switches = 0x0, switches_dist = 0x2044cc0, switch_index = 
0x0}
(gdb) p switch_record_table[8].node_bitmap
$37 = (bitstr_t *) 0x2044bb0
(gdb) p switch_record_table[8].node_bitmap[0]
$38 = 1111704645
(gdb) p switch_record_table[8].node_bitmap[1]
$39 = 1653
(gdb) p switch_record_table[9]
$40 = {level = 0, link_speed = 1, name = 0x1fc1c70 "legewie", 
node_bitmap = 0x2042830, nodes = 0x203fba0 "holylegewie[01-02]", 
num_switches = 0, parent = 0, switches = 0x0, switches_dist = 0x20267f0, 
switch_index = 0x0}
(gdb) p switch_record_table[9].node_bitmap
$41 = (bitstr_t *) 0x2042830
(gdb) p switch_record_table[9].node_bitmap[0]
$42 = 1111704645
(gdb) p switch_record_table[9].node_bitmap[1]
$43 = 1653
(gdb) p switch_record_table[10]
$44 = {level = 0, link_speed = 1, name = 0x203fb40 "lichtmandce", 
node_bitmap = 0x20265e0, nodes = 0x203fb70 "lichtmandce[01-04]", 
num_switches = 0, parent = 0, switches = 0x0, switches_dist = 0x20b3830, 
switch_index = 0x0}
(gdb) p switch_record_table[10].node_bitmap
$45 = (bitstr_t *) 0x20265e0
(gdb) p switch_record_table[10].node_bitmap[0]
$46 = 1111704645
(gdb) p switch_record_table[10].node_bitmap[1]
$47 = 1653
(gdb) p switch_record_table[11]
$48 = {level = 0, link_speed = 1, name = 0x21dd500 "murphy", node_bitmap 
= 0x20266e0, nodes = 0x203fb10 "holy7c21101", num_switches = 0, parent = 
0, switches = 0x0, switches_dist = 0x2024fd0, switch_index = 0x0}
(gdb) p switch_record_table[11].node_bitmap
$49 = (bitstr_t *) 0x20266e0
(gdb) p switch_record_table[11].node_bitmap[0]
$50 = 1111704645
(gdb) p switch_record_table[11].node_bitmap[1]
$51 = 1653
(gdb) p switch_record_table[12]
$52 = {level = 0, link_speed = 1, name = 0x21dd9d0 "tata", node_bitmap = 
0x2025530, nodes = 0x203fae0 "holy-es-dev[01-08]", num_switches = 0, 
parent = 0, switches = 0x0, switches_dist = 0x2025020, switch_index = 0x0}
(gdb) p switch_record_table[12].node_bitmap
$53 = (bitstr_t *) 0x2025530
(gdb) p switch_record_table[12].node_bitmap[0]
$54 = 1111704645
(gdb) p switch_record_table[12].node_bitmap[1]
$55 = 1653
(gdb) p switch_record_table[13]
$56 = {level = 0, link_speed = 0, name = 0x101 <Address 0x101 out of 
bounds>, node_bitmap = 0x42, nodes = 0xe0 <Address 0xe0 out of bounds>, 
num_switches = 17477, parent = 16963,
   switches = 0x675 <Address 0x675 out of bounds>, switches_dist = 0x0, 
switch_index = 0x0}
(gdb) p switch_record_table[13].node_bitmap
$57 = (bitstr_t *) 0x42
(gdb) p switch_record_table[13].node_bitmap[0]
Cannot access memory at address 0x42
(gdb) p switch_record_table[13].node_bitmap[1]
Cannot access memory at address 0x4a

Comment 33 Michael Hinton 2022-04-11 14:51:33 MDT
Was there ever a time when the cluster was configured with 1653 nodes, even temporarily? Your older config had 1588 nodes, and your newer config had 1566, so I'm not sure how each switch's node bitmap ended up sized for 1653 nodes. It seems that this is only possible if node_record_count (and thus the number of nodes configured in slurm.conf) was 1653 at some point. I'm just wondering if you can somehow account for that 1653 node count (maybe from some stray old slurm.conf or something).
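One way to double-check a node count from a saved copy of a config is to expand every NodeName= expression and count the unique hosts; a sketch (the file path is a placeholder, and NodeName=DEFAULT lines are excluded):

grep -oP 'NodeName=\S+' /etc/slurm/slurm.conf.old \
    | grep -v 'NodeName=DEFAULT' \
    | sed 's/^NodeName=//' \
    | while read -r expr; do scontrol show hostnames "$expr"; done \
    | sort -u | wc -l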
Comment 34 Paul Edmon 2022-04-11 17:58:42 MDT
Yes, there was a time.  We did several decoms over the past week.  The 
first set I did removed about 70 nodes, which would put it at 1653.  
That was followed the next day by the next set, which took us down to 1566.

-Paul Edmon-

Comment 35 Michael Hinton 2022-04-11 18:03:51 MDT
(In reply to Paul Edmon from comment #34)
> Yes, there was a time.  We did several decoms over the past week.  The 
> first set I did removed about 70 nodes, which would put it at 1653.   
> That was followed the next day by the next set which to us down to 1566.
Is it possible that you did a scontrol reconfigure instead of a slurmd restart when you went from 1653 to 1566?

For each slurmd that segfaulted like this, did it only segfault once?
Comment 36 Paul Edmon 2022-04-11 18:06:06 MDT
Nope, I did a global restart.  That said, it's possible that this host was 
skipped when I did the global restart.  Unfortunately salt is not 
completely reliable and can miss hosts.

I can confirm that hosts this happened to only segfaulted like this 
once.  Once I rebooted them they have been fine.
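A quick way to spot slurmds that missed a restart is to compare each node's SlurmdStartTime against the time of the last config change; a sketch (the cutoff timestamp is a placeholder):

scontrol show nodes \
    | grep -oP 'NodeName=\S+|SlurmdStartTime=\S+' \
    | paste -d' ' - - \
    | awk -F'[= ]' '$4 < "2022-04-07T10:45" {print $2, $4}'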

-Paul Edmon-

Comment 37 Paul Edmon 2022-04-12 07:26:07 MDT
One other note: it is possible that a scontrol reconfigure was run after 
the global restart to pull in other changes or clean things up.  So it's 
possible that we did a global restart, didn't tag that node, and then 
scontrol reconfigured it.

-Paul Edmon-

Comment 38 Michael Hinton 2022-05-10 12:19:59 MDT
Hi Paul, since you aren't hitting this anymore, and since restarting the slurmd has proved to be a workaround, I'm going to go ahead and reduce the severity to 4. I still plan on looking into this further to try to fix it permanently, even with unexpected reconfigures and node additions and removals.

Thanks!
-Michael
Comment 39 Paul Edmon 2022-05-10 12:25:03 MDT
  Sounds good.

-Paul Edmon-

Comment 46 Dominik Bartkiewicz 2023-03-31 02:53:57 MDT
Hi,

Sorry it took so long.
We finally fixed this issue with this commit:
https://github.com/SchedMD/slurm/commit/7f988ec71
It will be included in 23.02.2 and above.
I'll go ahead and close this out. Feel free to comment or reopen if needed.
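After upgrading, the running daemon versions can be confirmed quickly; a sketch (pdsh and the host range are placeholders for whatever fan-out tool is in use):

scontrol version                                   # controller side
pdsh -w 'holy7c[16201-16212]' slurmd -V | sort -u  # every compute node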

Dominik