Running on the Slurm 22.05 branch under Ubuntu 22.04, calling "scontrol reconfigure" will cause slurmctld to segfault:

$ sudo scontrol reconfigure
$ sudo scontrol reconfigure
slurm_reconfigure error: Zero Bytes were transmitted or received

Running slurmctld under gdb during the "scontrol reconfigure" calls:

$ sudo gdb -batch -ex run -ex bt --args /usr/local/sbin/slurmctld -D -i -v
[...]
Thread 25 "srvcn" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff561f640 (LWP 540148)]
__GI_getenv (name=0x7ffff734dede "UTLS_NO_IMPLICIT_INIT", name@entry=0x7ffff734dedc "GNUTLS_NO_IMPLICIT_INIT") at ./stdlib/getenv.c:84
84 ./stdlib/getenv.c: No such file or directory.
#0 __GI_getenv (name=0x7ffff734dede "UTLS_NO_IMPLICIT_INIT", name@entry=0x7ffff734dedc "GNUTLS_NO_IMPLICIT_INIT") at ./stdlib/getenv.c:84
#1 0x00007ffff7bca376 in __GI___libc_secure_getenv (name=name@entry=0x7ffff734dedc "GNUTLS_NO_IMPLICIT_INIT") at ./stdlib/secure-getenv.c:29
#2 0x00007ffff721abbc in lib_deinit () at ../../lib/global.c:517
#3 0x00007ffff7fc51a2 in call_destructors (closure=closure@entry=0x7fffd802aba0) at ./elf/dl-close.c:129
#4 0x00007ffff7cf9c85 in __GI__dl_catch_exception (exception=<optimized out>, operate=<optimized out>, args=<optimized out>) at ./elf/dl-error-skeleton.c:182
#5 0x00007ffff7fc5636 in _dl_close_worker (force=force@entry=false, map=<optimized out>, map=<optimized out>) at ./elf/dl-close.c:292
#6 0x00007ffff7fc62a2 in _dl_close_worker (force=false, map=0x7fffd800ee40) at ./elf/dl-close.c:150
#7 _dl_close (_map=0x7fffd800ee40) at ./elf/dl-close.c:818
#8 0x00007ffff7cf9c28 in __GI__dl_catch_exception (exception=exception@entry=0x7ffff561eaa0, operate=<optimized out>, args=<optimized out>) at ./elf/dl-error-skeleton.c:208
#9 0x00007ffff7cf9cf3 in __GI__dl_catch_error (objname=0x7ffff561eaf8, errstring=0x7ffff561eb00, mallocedp=0x7ffff561eaf7, operate=<optimized out>, args=<optimized out>) at ./elf/dl-error-skeleton.c:227
#10 0x00007ffff7c151ae in _dlerror_run (operate=<optimized out>, args=<optimized out>) at ./dlfcn/dlerror.c:138
#11 0x00007ffff7c14ed8 in __dlclose (handle=<optimized out>) at ./dlfcn/dlclose.c:31
#12 0x00007ffff7854f7f in _libpmix_close (lib_plug=<optimized out>) at mpi_pmix.c:123
#13 fini () at mpi_pmix.c:208
#14 0x00007ffff7e99542 in plugin_unload (plug=0x7fffd801ab90) at plugin.c:315
#15 0x00007ffff7e999c4 in plugin_context_destroy (c=<optimized out>) at plugin.c:477
#16 0x00007ffff7ec5717 in _mpi_fini_locked () at slurm_mpi.c:261
#17 0x00007ffff7ec6c2e in mpi_g_daemon_reconfig () at slurm_mpi.c:585
#18 0x00005555556070a8 in _slurm_rpc_reconfigure_controller (msg=0x7fffc8000b80) at proc_req.c:3326
#19 0x0000555555608acd in slurmctld_req (msg=msg@entry=0x7fffc8000b80) at proc_req.c:6660
#20 0x0000555555586caf in _service_connection (arg=<optimized out>) at controller.c:1380
#21 0x00007ffff7c19b43 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#22 0x00007ffff7caba00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

I have verified that setting "GNUTLS_NO_IMPLICIT_INIT=1" (see https://man7.org/linux/man-pages/man3/gnutls_global_init.3.html) in the environment of slurmctld fixes the bug.

The stack trace seems to indicate that the problem could be deep inside libpmix or libcurl and related to the use of dlopen. Hence it might not be Slurm's fault, but I was also wondering if dlopening libpmix from the slurmctld context is truly needed?

I haven't tested outside of my single-node environment so far, so I'm filing as "Minor Issue" for now.

I have not seen this issue before, so I believe it might be related to the software versions on Ubuntu 22.04:

ii libcurl3-gnutls:amd64 7.81.0-1ubuntu1.2 amd64 easy-to-use client-side URL transfer library (GnuTLS flavour)
ii libgnutls30:amd64 3.7.3-4ubuntu1 amd64 GNU TLS library - main runtime library
ii libpmix-dev:amd64 4.1.2-2ubuntu1 amd64 Development files for the PMI Exascale library
Hi Felix,

> I was also wondering if dlopening libpmix from the slurmctld context is truly needed?

As part of the great improvements done in Bug 9395 landed in 22.05, ctld now needs to load the MPI plugins at startup, or if reconfigured. This is why you haven't seen this crash happen in the ctld until now.

I've taken a look at this and concluded that we can't do anything on our side to work around it. I even went all the way down to glibc-2.35/stdlib/getenv.c, line 84, in the sources of the Ubuntu package. I can't really see which pointer is corrupted or why it is corrupted, but it seems to be either __environ or one of the indexes. Whatever it is, I can't find an open bug for it on the Internet. I'm a bit puzzled, but I'm not able to extract more knowledge from this.

If you are OK I can close the bug as info given, I'm afraid this is just too deep into the OS to be workarounded in a better way.

Regards,
Carlos.
Hi Carlos,

> As part of the great improvements done in Bug 9395 landed in 22.05, ctld now needs to load the MPI plugins at startup, or if reconfigured.

I don't have access to Bug 9395, so I don't know what the context is (but it's not too important). Perhaps you can link the git commits instead?

> If you are OK I can close the bug as info given, I'm afraid this is just too deep into the OS to be workarounded in a better way.

Yes I think that's fair, this was a long shot anyway. But I'm curious, were you able to reproduce the issue on your side or did you investigate just with the stack trace?

Thanks
> I don't have access to Bug 9395, so I don't know what the context is (but it's
> not too important). Perhaps you can link the git commits instead?

For bug#9395 here are the relevant commits.
> https://github.com/SchedMD/slurm/commit/c67c071ffa994b0c1ebadccf637b804f51c753eb
> https://github.com/SchedMD/slurm/commit/442576f78bcc0ec91c482cfa2106a758b750af8d

These were in preparation for those changes.
> https://github.com/SchedMD/slurm/commit/92cc7a296db60ec6231ca9a01dbe19ffae5c5945
> https://github.com/SchedMD/slurm/commit/9efbf3b008ba4e82550679355f5fc92c01c14917

Carlos will reply to your other questions.
I am running an up-to-date Ubuntu 22.04, plus Slurm 22.05.0. My build is configured as follows:

./configure --prefix=/home/tripi/slurm/22.05.0_14276/inst --disable-optimizations --enable-debug --enable-memory-leak-debug --enable-developer --enable-multiple-slurmd --with-hwloc --with-ucx --with-pmix --with-munge --with-hdf5 --with-pam_dir=/home/tripi/slurm/22.05.0_14276/inst/lib/security

The PMIx in use is the same as yours, from system packages. Same for UCX and the rest.

Firing this crazy loop:

while [ 1 ]; do scontrol reconfigure; done

I can't make my slurmctld crash.

Whatever is happening in your GDB stack trace has been traced down into glibc. I'm not sure, but it looks like a corrupt pointer in __environ is messing up the getenv function. How glibc ended up with that corruption is a mystery to me, because I can't reproduce it using the same Ubuntu release as you.

I'm sorry that I can't make it fail, but maybe just an "apt-get upgrade" will fix your issue? I hope so.

Regarding Bug 9395, Jason missed some commits, but that's not really important. The summary is: https://slurm.schedmd.com/mpi.conf.html. We now have this file (it supports configless), and it can be used to tune the config for the specific underlying PMI. For now, only PMIx can be tuned.

Regards,
Carlos.
Could you send the raw output of:

sudo gdb -batch -ex run -ex "thread apply all bt full" --args /usr/local/sbin/slurmctld -D -i -v

I've discovered this [1], so I want to be sure it is not causing problems for Slurm.

[1] https://github.com/xianyi/OpenBLAS/issues/716#issuecomment-164334498
Running this command did not reveal anything new, but you did point me in the right direction to finish investigating this bug. So thanks!

Running the application under gdb and setting a breakpoint in "getenv", I saw that "ZES_ENABLE_SYSMAN" was set during the first invocation of getenv("GNUTLS_NO_IMPLICIT_INIT") (no crash), but was not set during the second invocation of getenv("GNUTLS_NO_IMPLICIT_INIT").

At first I suspected the OneAPI plugin in Slurm:
https://github.com/SchedMD/slurm/blob/e54b6d224c7873ba38a0fcfd2b41bbba0eaeb58b/src/plugins/gpu/oneapi/gpu_oneapi.c#L967
But commenting out this line did not solve the problem, and then I also realized this plugin was not even active on my setup. However: this pattern might still be dangerous given that, as you pointed out, this call to setenv() is unsafe. But that's not the problem I was facing.

Digging further, I noticed that hwloc is also setting this environment variable unconditionally, and the hwloc version on Ubuntu 22.04 (2.7.0) is using putenv():
https://github.com/open-mpi/hwloc/blob/hwloc-2.7.0/hwloc/topology.c#L85

And from man putenv:
> The string pointed to by string becomes part of the environment, so altering the string changes the environment.

Hence the corruption after a dlclose(): the environment now references a string that lived in the unloaded library, so the next getenv() call walks over a dangling pointer.
The following code is a simpler repro that triggers the segfault:

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int main(void)
{
    /* Loading hwloc 2.7.0 leaves a putenv() string owned by the library in the environment. */
    void *lib = dlopen("libhwloc.so.15", RTLD_NOW);
    printf("dlopen: %p\n", lib);
    if (!lib)
        return 1; /* hwloc not installed: nothing to reproduce */
    printf("getenv: %p\n", getenv("GNUTLS_NO_IMPLICIT_INIT"));
    /* After dlclose(), that environment entry points into unmapped memory. */
    printf("dlclose: %d\n", dlclose(lib));
    /* This lookup scans the whole environment and crashes on the dangling entry. */
    printf("getenv: %p\n", getenv("GNUTLS_NO_IMPLICIT_INIT"));
}

Now that I knew where to look (hwloc), I noticed that they were aware of the problem with putenv:
https://github.com/open-mpi/hwloc/pull/514

And another user reported a similar problem with dlopen/dlclose:
https://github.com/open-mpi/hwloc/issues/533

So the issue will be solved when Ubuntu upgrades hwloc to 2.7.1, which is good news.
Setting as RESOLVED, unfortunately there is no status for INFOGIVENTOMYSELF :)
BTW I noticed there is already an Ubuntu request to upgrade libhwloc for this bug, so I pinged it: https://bugs.launchpad.net/ubuntu/+source/hwloc/+bug/1968742?comments=all
I'm glad this was your issue. I was aware of that lore, and I was after the full backtrace to see if something related came up.

My main concern, though, was not being able to reproduce the issue myself with the same packages/OS. That's a matter of thread execution order and I might have been lucky enough... or not, because I was not able to get a reproducer.

Btw, I don't expect oneapi init to cause trouble here, because gpu_plugin_init is using locks and thus setenv is in a locked area, so thread safe.

At the end of the day a fix is going to be released sooner or later, so that's good news. Good job with the investigation on your side. This was helpful.

Cheers,
Carlos.
We constantly reproduce the issue on Rocky8 and CentOS7 with hwloc 2.7.0.

> when Ubuntu upgrades hwloc to 2.7.1

I see the hwloc developers are not really in a hurry to fix the issue; in both 2.7.1 and 2.8.0 they still use putenv:
https://github.com/open-mpi/hwloc/blob/hwloc-2.8.0/tests/hwloc/levelzero.c#L25

Our workaround:

[root@ts-tr-c7 ~]# cat /etc/sysconfig/slurmctld
ZES_ENABLE_SYSMAN=1
[root@ts-tr-c7 ~]#

Given that it may take time until hwloc is really fixed, do you think Slurm 22.05 could just set this automatically on start?
Please disregard my last question, updating to 2.7.1 fixed the issue.