Ticket 14276 - slurmctld crash on "scontrol reconfigure"
Summary: slurmctld crash on "scontrol reconfigure"
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 22.05.0
Hardware: Linux
Importance: --- 4 - Minor Issue
Assignee: Carlos Tripiana Montes
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2022-06-08 11:43 MDT by Felix Abecassis
Modified: 2022-08-04 04:06 MDT
CC: 3 users

See Also:
Site: NVIDIA (PSLA)


Description Felix Abecassis 2022-06-08 11:43:14 MDT
Running on the Slurm 22.05 branch under Ubuntu 22.04, calling "scontrol reconfigure" will cause slurmctld to segfault:
$ sudo scontrol reconfigure
$ sudo scontrol reconfigure
slurm_reconfigure error: Zero Bytes were transmitted or received

Running slurmctld under gdb during the "scontrol reconfigure" calls:
$ sudo gdb -batch -ex run -ex bt --args /usr/local/sbin/slurmctld -D -i -v
[...]
Thread 25 "srvcn" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff561f640 (LWP 540148)]
__GI_getenv (name=0x7ffff734dede "UTLS_NO_IMPLICIT_INIT", name@entry=0x7ffff734dedc "GNUTLS_NO_IMPLICIT_INIT") at ./stdlib/getenv.c:84
84      ./stdlib/getenv.c: No such file or directory.
#0  __GI_getenv (name=0x7ffff734dede "UTLS_NO_IMPLICIT_INIT", name@entry=0x7ffff734dedc "GNUTLS_NO_IMPLICIT_INIT") at ./stdlib/getenv.c:84
#1  0x00007ffff7bca376 in __GI___libc_secure_getenv (name=name@entry=0x7ffff734dedc "GNUTLS_NO_IMPLICIT_INIT") at ./stdlib/secure-getenv.c:29
#2  0x00007ffff721abbc in lib_deinit () at ../../lib/global.c:517
#3  0x00007ffff7fc51a2 in call_destructors (closure=closure@entry=0x7fffd802aba0) at ./elf/dl-close.c:129
#4  0x00007ffff7cf9c85 in __GI__dl_catch_exception (exception=<optimized out>, operate=<optimized out>, args=<optimized out>) at ./elf/dl-error-skeleton.c:182
#5  0x00007ffff7fc5636 in _dl_close_worker (force=force@entry=false, map=<optimized out>, map=<optimized out>) at ./elf/dl-close.c:292
#6  0x00007ffff7fc62a2 in _dl_close_worker (force=false, map=0x7fffd800ee40) at ./elf/dl-close.c:150
#7  _dl_close (_map=0x7fffd800ee40) at ./elf/dl-close.c:818
#8  0x00007ffff7cf9c28 in __GI__dl_catch_exception (exception=exception@entry=0x7ffff561eaa0, operate=<optimized out>, args=<optimized out>) at ./elf/dl-error-skeleton.c:208
#9  0x00007ffff7cf9cf3 in __GI__dl_catch_error (objname=0x7ffff561eaf8, errstring=0x7ffff561eb00, mallocedp=0x7ffff561eaf7, operate=<optimized out>, args=<optimized out>) at ./elf/dl-error-skeleton.c:227
#10 0x00007ffff7c151ae in _dlerror_run (operate=<optimized out>, args=<optimized out>) at ./dlfcn/dlerror.c:138
#11 0x00007ffff7c14ed8 in __dlclose (handle=<optimized out>) at ./dlfcn/dlclose.c:31
#12 0x00007ffff7854f7f in _libpmix_close (lib_plug=<optimized out>) at mpi_pmix.c:123
#13 fini () at mpi_pmix.c:208
#14 0x00007ffff7e99542 in plugin_unload (plug=0x7fffd801ab90) at plugin.c:315
#15 0x00007ffff7e999c4 in plugin_context_destroy (c=<optimized out>) at plugin.c:477
#16 0x00007ffff7ec5717 in _mpi_fini_locked () at slurm_mpi.c:261
#17 0x00007ffff7ec6c2e in mpi_g_daemon_reconfig () at slurm_mpi.c:585
#18 0x00005555556070a8 in _slurm_rpc_reconfigure_controller (msg=0x7fffc8000b80) at proc_req.c:3326
#19 0x0000555555608acd in slurmctld_req (msg=msg@entry=0x7fffc8000b80) at proc_req.c:6660
#20 0x0000555555586caf in _service_connection (arg=<optimized out>) at controller.c:1380
#21 0x00007ffff7c19b43 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#22 0x00007ffff7caba00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81


I have verified that setting "GNUTLS_NO_IMPLICIT_INIT=1" (see https://man7.org/linux/man-pages/man3/gnutls_global_init.3.html) in the environment of slurmctld fixes the bug. The stack trace seems to indicate that the problem could be deep into libpmix or libcurl and related to the use of dlopen.
Hence it might not be Slurm's fault, but I was also wondering if dlopening libpmix from the slurmctld context is truly needed? 
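
For reference, one way to apply that environment workaround to a systemd-managed slurmctld is a unit drop-in. The path and file name below are assumptions about the setup, not something from this ticket:

```ini
# Hypothetical drop-in, e.g. created via "systemctl edit slurmctld":
# /etc/systemd/system/slurmctld.service.d/gnutls-workaround.conf
[Service]
Environment=GNUTLS_NO_IMPLICIT_INIT=1
```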


I haven't tested outside of my single-node environment so far, so I'm filing this as a "Minor Issue" for now. I have not seen this issue before, so I believe it might be related to the software versions on Ubuntu 22.04:
ii  libcurl3-gnutls:amd64                           7.81.0-1ubuntu1.2                                                amd64        easy-to-use client-side URL transfer library (GnuTLS flavour)
ii  libgnutls30:amd64                               3.7.3-4ubuntu1                                                   amd64        GNU TLS library - main runtime library
ii  libpmix-dev:amd64                               4.1.2-2ubuntu1                                                   amd64        Development files for the PMI Exascale library
Comment 1 Carlos Tripiana Montes 2022-06-09 03:04:31 MDT
Hi Felix,

> I was also wondering if dlopening libpmix from the slurmctld context is truly needed?

As part of the improvements from Bug 9395 that landed in 22.05, ctld now needs to load the MPI plugins at startup and on reconfigure. This is why you hadn't seen this crash in the ctld until now.

I've taken a look at this and realized we can't do anything on our side to work around it.

I've even gone all the way down to glibc-2.35/stdlib/getenv.c line 84 in the sources of the Ubuntu package. I can't tell which pointer is corrupted, or why; it seems to be either __environ or one of its entries. Whatever it is, I can't find an open bug for this anywhere on the Internet. I'm a bit puzzled, but I'm not able to extract more knowledge from this.

If you are OK with it, I can close the bug as "info given"; I'm afraid this is just too deep in the OS to be worked around in a better way.

Regards,
Carlos.
Comment 2 Felix Abecassis 2022-06-09 10:22:27 MDT
Hi Carlos,

> As part of the great improvements done in Bug 9395 landed in 22.05, ctld now needs to load the MPI plugins at startup, or if reconfigured.

I don't have access to Bug 9395, so I don't know what the context is (but it's not too important). Perhaps you can link the git commits instead?

> If you are OK I can close the bug as info given, I'm afraid this is just too deep into the OS to be workarounded in a better way.

Yes I think that's fair, this was a long shot anyway. 
But I'm curious, were you able to reproduce the issue on your side or did you investigate just with the stack trace?

Thanks
Comment 3 Jason Booth 2022-06-09 10:51:05 MDT
> I don't have access to Bug 9395, so I don't know what the context is (but it's 
> not too important). Perhaps you can link the git commits instead?


For bug#9395 here are the relevant commits.
> https://github.com/SchedMD/slurm/commit/c67c071ffa994b0c1ebadccf637b804f51c753eb
> https://github.com/SchedMD/slurm/commit/442576f78bcc0ec91c482cfa2106a758b750af8d

These were in preparation for those changes.
> https://github.com/SchedMD/slurm/commit/92cc7a296db60ec6231ca9a01dbe19ffae5c5945
> https://github.com/SchedMD/slurm/commit/9efbf3b008ba4e82550679355f5fc92c01c14917

Carlos will reply to your other questions.
Comment 4 Carlos Tripiana Montes 2022-06-10 01:37:12 MDT
I am running ubuntu 22.04 up to date, plus slurm 22.05.0. My compilation is as follows:

./configure --prefix=/home/tripi/slurm/22.05.0_14276/inst --disable-optimizations --enable-debug --enable-memory-leak-debug --enable-developer --enable-multiple-slurmd --with-hwloc --with-ucx --with-pmix --with-munge --with-hdf5 --with-pam_dir=/home/tripi/slurm/22.05.0_14276/inst/lib/security

The PMIx in use is the same as you, from system packages. Same for UCX and the rest.

Firing this crazy loop:

while true; do scontrol reconfigure; done

I can't make my slurmctld crash. Whatever is happening in your GDB stack trace has been traced down into glibc. I'm not sure, but it looks like a corrupt pointer in __environ messing up the getenv function. How glibc ended up with that corruption is a mystery to me, because I can't reproduce it on the same Ubuntu as you.

I'm sorry that I can't make it fail; maybe just an "apt-get upgrade" will fix your issue? I hope so.

Regarding Bug 9395: Jason missed some commits, but that's not really important. The summary is https://slurm.schedmd.com/mpi.conf.html. We now have this file (it supports configless), and it can be used to tune the config for the specific underlying PMI. For now, only PMIx can be tuned.

Regards,
Carlos.
Comment 5 Carlos Tripiana Montes 2022-06-10 03:48:35 MDT
Could you send the raw output of:

sudo gdb -batch -ex run -ex "thread apply all bt full" --args /usr/local/sbin/slurmctld -D -i -v

I've discovered this [1], so I want to be sure it is not causing problems for Slurm.

[1] https://github.com/xianyi/OpenBLAS/issues/716#issuecomment-164334498
Comment 6 Felix Abecassis 2022-06-10 13:01:20 MDT
Running this command did not reveal anything new, but you did send me in the right direction to finish investigating this bug. So thanks!

Running the application under gdb and setting a breakpoint in "getenv", I saw that "ZES_ENABLE_SYSMAN" was set during the first invocation of getenv("GNUTLS_NO_IMPLICIT_INIT") (no crash), but was not set during the second invocation of getenv("GNUTLS_NO_IMPLICIT_INIT"). 

At first I suspected the OneAPI plugin of Slurm:
https://github.com/SchedMD/slurm/blob/e54b6d224c7873ba38a0fcfd2b41bbba0eaeb58b/src/plugins/gpu/oneapi/gpu_oneapi.c#L967
But commenting this line did not solve the problem, and then I also realized this plugin was not even active on my setup.
However, this pattern might still be dangerous given that, as you pointed out, this call to setenv() is unsafe. But that's not the problem I was facing.

Digging further, I noticed that hwloc is also setting this environment variable unconditionally, and the hwloc version on Ubuntu 22.04 (2.7.0) is using putenv():
https://github.com/open-mpi/hwloc/blob/hwloc-2.7.0/hwloc/topology.c#L85
And from man putenv:
> The string pointed to by string becomes part of the environment, so altering the string changes the environment.

Hence the corruption after a dlclose(): the environment now references a string from an unloaded library.

The following code is a simpler repro that generates the segfault:
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
        /* Loading libhwloc runs its constructor, which putenv()s a string
         * owned by the library. */
        void *lib = dlopen("libhwloc.so.15", RTLD_NOW);
        printf("dlopen: %p\n", lib);
        printf("getenv: %p\n", (void *)getenv("GNUTLS_NO_IMPLICIT_INIT"));

        /* Unloading the library leaves __environ pointing at unmapped memory. */
        printf("dlclose: %d\n", dlclose(lib));

        /* This getenv() walk over __environ now segfaults. */
        printf("getenv: %p\n", (void *)getenv("GNUTLS_NO_IMPLICIT_INIT"));
        return 0;
}



Now that I knew where to look (hwloc), I noticed that they were aware of the problem with putenv:
https://github.com/open-mpi/hwloc/pull/514
And another user reported a similar problem too with dlopen/dlclose:
https://github.com/open-mpi/hwloc/issues/533

So the issue will be solved when Ubuntu upgrades hwloc to 2.7.1, so that's good news.
Comment 7 Felix Abecassis 2022-06-10 13:02:48 MDT
Setting as RESOLVED, unfortunately there is no status for INFOGIVENTOMYSELF :)
Comment 8 Felix Abecassis 2022-06-10 13:49:23 MDT
BTW I noticed there is already an Ubuntu request to upgrade libhwloc for this bug, so I pinged it:
https://bugs.launchpad.net/ubuntu/+source/hwloc/+bug/1968742?comments=all
Comment 9 Carlos Tripiana Montes 2022-06-13 01:50:14 MDT
I'm glad this was your issue.

I was aware of that lore, and I was looking for the full backtrace to see if something related arose.

My main concern, though, was not being able to reproduce the issue myself with the same packages/OS. That comes down to thread execution order, and I might have been lucky enough... or not, because I was not able to get a reproducer.

Btw, I don't expect the oneapi init to cause trouble here, because gpu_plugin_init uses locks and thus the setenv is in a locked region, so it is thread-safe.

At the end of the day, a fix is going to get released sooner or later, so that's good news.

Good job with your investigation. This was helpful.

Cheers,
Carlos.
Comment 10 Taras Shapovalov 2022-08-03 10:39:51 MDT
We consistently reproduce the issue on Rocky8 and CentOS7 with hwloc 2.7.0.

> when Ubuntu upgrades hwloc to 2.7.1

I see the hwloc developers are in no hurry to fix the issue; in both 2.7.1 and 2.8.0 they still use putenv:

https://github.com/open-mpi/hwloc/blob/hwloc-2.8.0/tests/hwloc/levelzero.c#L25

Our workaround:

[root@ts-tr-c7 ~]# cat /etc/sysconfig/slurmctld
ZES_ENABLE_SYSMAN=1
[root@ts-tr-c7 ~]#

Taking into account that it may take a while until hwloc is really fixed, do you think Slurm 22.05 could just set this automatically on start?
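
Pre-setting the variable presumably works because hwloc only injects its entry when the variable is absent, so with ZES_ENABLE_SYSMAN already in the real environment, the library-owned string never enters __environ. A sketch of that guard pattern (an illustration of the mechanism, not the actual hwloc source):

```c
#include <stdlib.h>

/* Illustrative guard: in hwloc, the injected string lives in the shared
 * library's image and dangles after dlclose(); here it is a program
 * literal, which is fine for a standalone demo.  If the variable is
 * already set, putenv() never runs, so no foreign pointer enters the
 * environment. */
static void set_sysman_if_unset(void)
{
        if (!getenv("ZES_ENABLE_SYSMAN"))
                putenv((char *)"ZES_ENABLE_SYSMAN=1");
}
```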
Comment 11 Taras Shapovalov 2022-08-04 04:06:15 MDT
Please disregard my last question; updating to 2.7.1 fixed the issue.