Bug 12396 - Adding Open PMIx 5.x support in the mpi/pmix build
Summary: Adding Open PMIx 5.x support in the mpi/pmix build
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Build System and Packaging
Version: 22.05.x
Hardware: Linux Linux
Importance: --- C - Contributions
Assignee: Danny Auble
QA Contact:
URL:
Duplicates: 12798 (view as bug list)
Depends on:
Blocks:
 
Reported: 2021-08-31 04:49 MDT by Isaias Compres
Modified: 2022-04-22 11:17 MDT
CC: 7 users

See Also:
Site: -Other-
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 22.05.0pre1
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
patch (4.16 KB, patch)
2021-08-31 04:49 MDT, Isaias Compres
patch with mpi_pmix.c changes (5.11 KB, patch)
2021-08-31 05:31 MDT, Isaias Compres
Updated to apply to current HEAD (8.24 KB, patch)
2022-03-09 07:35 MST, Isaias Compres
v6 (32.69 KB, patch)
2022-03-28 15:56 MDT, Danny Auble
Updated PMIx integration patch (90.72 KB, patch)
2022-04-13 12:54 MDT, Ralph Castain
Updated PMIx integration patch (213.07 KB, patch)
2022-04-13 13:03 MDT, Ralph Castain
v7 (11.09 KB, patch)
2022-04-13 13:57 MDT, Danny Auble
v8 (11.18 KB, patch)
2022-04-18 12:37 MDT, Danny Auble

Description Isaias Compres 2021-08-31 04:49:00 MDT
Created attachment 21100 [details]
patch

The mpi/pmix plugin does not build correctly with Open PMIx major versions larger than 4.

This patch increases the accepted major version of Open PMIx to 5.

This patch preserves the current approach in the build system, and increases the upper check to 5.  There is an individual check and target for each major version of Open PMIx.

Some discussion about it took place here:
https://bugs.schedmd.com/show_bug.cgi?id=7263
Comment 1 Isaias Compres 2021-08-31 05:31:42 MDT
Created attachment 21102 [details]
patch with mpi_pmix.c changes

I missed the /src/plugins/mpi/pmix/mpi_pmix.c changes in the previous attachment.
Comment 3 Danny Auble 2021-10-28 10:42:04 MDT
Isaias, this patch does not get v4 (or v5) to compile.

With or without this patch I get

../../../../../../slurm/src/plugins/mpi/pmix/pmixp_client_v2.c: In function ‘pmixp_lib_init’:
../../../../../../slurm/src/plugins/mpi/pmix/pmixp_client_v2.c:252:9: error: implicit declaration of function ‘PMIx_Register_event_handler’ [-Werror=implicit-function-declaration]
  252 |         PMIx_Register_event_handler(NULL, 0, NULL, 0, _errhandler,
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~
../../../../../../slurm/src/plugins/mpi/pmix/pmixp_client_v2.c: In function ‘pmixp_lib_finalize’:
../../../../../../slurm/src/plugins/mpi/pmix/pmixp_client_v2.c:262:9: error: implicit declaration of function ‘PMIx_Deregister_event_handler’ [-Werror=implicit-function-declaration]
  262 |         PMIx_Deregister_event_handler(0, _op_callbk, NULL);
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
cc1: all warnings being treated as errors
make: *** [Makefile:1502: mpi_pmix_v4_la-pmixp_client_v2.lo] Error 1
make: *** Waiting for unfinished jobs....

for v4.  I am interested in that working before we add v5 to the mix.  Could you please go revisit bug 7263 and see if we can get something there to get v4 working first?
Comment 4 Felix Abecassis 2021-10-29 11:33:40 MDT
Danny, this patch fixes issues such as what I described here: https://bugs.schedmd.com/show_bug.cgi?id=12376#c8

I have verified that this patch resolves that situation, and it does not generate any compile-time errors *by default*: to my knowledge, -Werror must be added manually.
Comment 5 Danny Auble 2021-10-29 11:54:06 MDT
Felix, I would consider this patch a "hope for the best" approach.  I have no confidence this will actually allow you to run v4 correctly with Slurm.  Until we get a patch that addresses any concerns on our end, we consider >= v4 unsupported.

Anyone developing for Slurm should always configure with --enable-developer to avoid any warnings.  We don't expect end users to do this, but we do expect this from developers.
Comment 6 Felix Abecassis 2021-10-29 12:31:22 MDT
> I have no confidence this will actually allow you to run v4 correctly with Slurm.

FWIW, just one data point: with PMIx 4.1 from Ubuntu 21.10 and Slurm master from a few days ago, everything seems to be working fine. 

> Until we get a patch that addresses any concerns on our end we consider >= v4 unsupported.

What are the remaining concerns? As mentioned in https://bugs.schedmd.com/show_bug.cgi?id=7263#c64, the first step would be to enable compiling against PMIx v4 without new features, and the patch attached to this bug seems to fit the bill.
Comment 7 Danny Auble 2021-10-29 12:46:02 MDT
Felix, can you relay what you did to confirm this patch worked?

Seeing as the patch causes obvious warnings, it is an instant fail for us to include.  I am guessing basic functionality is probably easy to achieve, but we have not spent any time trying to figure out what is missing, and we expect the pmix developers to supply this analysis.  As the patches given so far haven't met our standards, there is concern on our end, so the vetting might take longer than usual.

As you probably already know, we have removed the past partial support for >= v4 until this is resolved. At the moment I advise anyone looking to use pmix to use v3.x, as we are confident this works.
Comment 8 Felix Abecassis 2021-10-29 14:21:17 MDT
> Felix, can you relay what you did to confirm this patch worked?

I verified that pmix_v4 is listed when doing "srun --mpi=list", then tested multiple MPI applications on a development workstation with one slurmd; it seemed to work fine. Of course multinode testing would be required for validation purposes, but I'm sure Isaias did more testing than I did.

> Seeing as the patch causes obvious warnings it is an instant fail for us to include.

Understandable, but that will be easy to correct if it's indeed just a missing #include. 

> I am guessing basic functionality is probably easy to achieve but we have not spent any time trying to figure out what is missing and expect the pmix developers to supply this analysis

Isaias, what else would be needed besides this patch for simply supporting PMIx v4 without leveraging new features?
Comment 9 Jason Booth 2021-11-01 13:57:10 MDT
*** Bug 12798 has been marked as a duplicate of this bug. ***
Comment 10 Isaias Compres 2021-11-01 14:08:32 MDT
Hi Felix, Danny,

This is an include that is missing in upstream Slurm.  It is not an issue that was introduced with this patch set.

As Ralph points out in Bug 7263, you just need to add:

#include <pmix.h>

Around line 55 of:
https://github.com/SchedMD/slurm/blob/master/src/plugins/mpi/pmix/pmixp_client_v2.c  ​

I am currently on a business trip until Friday.  I will probably not be able to produce a new patch and test it before then, although this is very simple to do.

This patch does not add any features.  It only enables the build system to produce a plugin with v4 and v5 Open PMIx source trees.  

Ralph has clarified that there are no compatibility changes in any PMIx release after 2.2 that would affect this base functionality.  This is different from the changes in Bug 7263: those changes add features that require further scrutiny.
Comment 11 Danny Auble 2021-11-01 14:14:47 MDT
Thanks Isaias, I felt it was something simple like that.  I am guessing things moved from one place to another in v4.

Mike (just added) could you please post any findings out of the norm on your end with your testing?

21.08.3 will be tagged tomorrow so this will not be in there, but if your testing goes well I will push it for 21.08.4.
Comment 12 Danny Auble 2022-03-08 18:57:56 MST
Mike, can you verify this patch fixes things for you?
Comment 13 Isaias Compres 2022-03-09 07:35:26 MST
Created attachment 23783 [details]
Updated to apply to current HEAD

Updated this patch to apply to the current _master_.  Also added the missing pmix.h include mentioned in the thread.
Comment 14 mike coyne 2022-03-21 07:57:55 MDT
(In reply to Danny Auble from comment #12)
> Mike, can you verify this patch fixes things for you?

Danny, the patch does seem to work as intended, although I did run into an error with PMIx v4.1.1 and v4.1.2.  What I believe I am seeing is an issue between openmpi 5.0.0.rc2 and openmpi 5.0.0.rc1.  If I tested with ompi5rc2, it would fail to connect to Slurm when it was compiled using the internal PMIx version, but it would function correctly (as far as I could see) if I compiled it with the external PMIx version used to build Slurm's pmix_v4.  Now what is interesting is the announcement from Ralph Castain on the SchedMD mailing list about an updated plugin patch for Slurm to expose more of the PMIx server interface: https://github.com/slurm-pmix/slurm/wiki/Patches . I am wondering if these changes could be incorporated into the pmix_v4+ interface?  As a note, I have not gotten an answer from my openmpi contacts yet.

https://lists.schedmd.com/pipermail/slurm-users/2022-March/008466.html

Mike
Comment 15 mike coyne 2022-03-21 08:06:20 MDT
(In reply to mike coyne from comment #14)

Example error output for ompi5rc2:

[gr0171.localdomain:23407] PMIX ERROR: PACK-MISMATCH in file client/pmix_client.c at line 832

PMIx_Init failed for the following reason:

  PACK-MISMATCH

Open MPI requires access to a local PMIx server to execute. Please ensure
that either you are operating in a PMIx-enabled environment, or use "mpirun"
to execute the job.
Comment 16 Danny Auble 2022-03-28 15:56:23 MDT
Created attachment 24095 [details]
v6

Thanks Mike, I wonder what is going on.  I am seeing many more issues on my system.

v4 (4.1.2 on the v4.1 branch) segfaults the slurmstepd

backtrace

#0  __strncmp_avx2 () at ../sysdeps/x86_64/multiarch/strcmp-avx2.S:105
#1  0x00007f6b0bd7ca66 in register_nspace (nptr=0x7f6afc001530) at ../../../../../pmix/src/mca/pmdl/ompi5/pmdl_ompi5.c:398
#2  0x00007f6b0bef3aa2 in pmix_pmdl_base_register_nspace () from /home/da/pmix/4/lib/libpmix.so
#3  0x00007f6b0be2945b in _register_nspace () from /home/da/pmix/4/lib/libpmix.so
#4  0x00007f6b0bd5bebf in ?? () from /lib/x86_64-linux-gnu/libevent_core-2.1.so.7
#5  0x00007f6b0bd5c75f in event_base_loop () from /lib/x86_64-linux-gnu/libevent_core-2.1.so.7
#6  0x00007f6b0be5a7e6 in progress_engine () from /home/da/pmix/4/lib/libpmix.so
#7  0x00007f6b0cb53947 in start_thread (arg=<optimized out>) at pthread_create.c:435
#8  0x00007f6b0cbe3a44 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:100

This is with ompi 5rc3 configured with pmix 4.1.2

I tried the pmix v4.2 branch but that appears to be behind the v4.1 branch at 4.1.1rc4 and had the same result.

The v3.2 branch (3.2.3rc1) works just fine, though.

There was a problem with the patch for v5.  It was pointing to the v4 libs so if you tried to make it work you would get this error when compiling...

../../../../../../slurm/src/plugins/mpi/pmix/pmixp_client.c:71:9: note: ‘#pragma message: PMIx version mismatch: the major version seen during configuration was 5L but found 4L compilation will most likely fail.  Please reconfigure against the new version.’
   71 | #pragma message "PMIx version mismatch: the major version seen during configuration was " VALUE(HAVE_PMIX_VER) "L but found " VALUE(PMIX_VERSION_MAJOR) " compilation will most likely fail.  Please reconfigure against the new version."
      |         ^~~~~~~

This is fixed in the attached patch, but then you get these errors when compiling...

In file included from /home/da/pmix/5/include/pmix_common.h:4108,
                 from /home/da/pmix/5/include/pmix_server.h:64,
                 from ../../../../../../slurm/src/plugins/mpi/pmix/pmixp_common.h:57,
                 from ../../../../../../slurm/src/plugins/mpi/pmix/pmixp_client.c:39:
../../../../../../slurm/src/plugins/mpi/pmix/pmixp_client.c: In function ‘_general_proc_info’:
/home/da/pmix/5/include/pmix_deprecated.h:210:12: error: implicit declaration of function ‘PMIx_Info_load’; did you mean ‘PMIX_INFO_LOAD’? [-Werror=implicit-function-declaration]
  210 |     (void) PMIx_Info_load(i, k, d, t)
      |            ^~~~~~~~~~~~~~
../../../../../../slurm/src/plugins/mpi/pmix/pmixp_client.h:59:9: note: in expansion of macro ‘PMIX_INFO_LOAD’
   59 |         PMIX_INFO_LOAD(kvp, key_str, val, type);                \
      |         ^~~~~~~~~~~~~~
../../../../../../slurm/src/plugins/mpi/pmix/pmixp_client.c:121:9: note: in expansion of macro ‘PMIXP_KVP_CREATE’
  121 |         PMIXP_KVP_CREATE(kvp, PMIX_SPAWNED, &flag, PMIX_BOOL);
      |         ^~~~~~~~~~~~~~~~
cc1: all warnings being treated as errors
make: *** [Makefile:1431: mpi_pmix_v5_la-pmixp_client.lo] Error 1
make: *** Waiting for unfinished jobs....

So from what I can tell, neither v4 nor v5 is ready for primetime. Until these things are fixed, Slurm will not be allowed to work against them, as it seems that would cause more issues than it fixes.

Isaias, could you help me out?  I am not sure how you want to proceed.

In your next version please use 'git format-patch', with your changes starting with 'fixup!' so it is easy to tell what you did differently.

Thanks!
Comment 17 Danny Auble 2022-03-28 15:56:43 MDT
Comment on attachment 23783 [details]
Updated to apply to current HEAD

Replaced with v6
Comment 18 Ralph Castain 2022-04-12 13:11:31 MDT
Sorry to come late to the party - Isaias just pointed me at this ticket this morning (I didn't know about it).

FWIW: the attached patch contains nothing relevant to the PMIx integration. It is a bunch of changes to the Slurm step manager and GRES support. I think someone attached the wrong file.

I will try to replicate some of the problems reported here. As for OMPI v5, please keep in mind that it is relatively unstable at this time. For one thing, there have been significant configure issues with it picking up the wrong PMIx, HWLOC, and libevent installations. They have been working on fixing it and _may_ have gotten the kinks worked out - but some of the problems you describe here sound very much like what they were encountering.

I applied my patch to the head of Slurm master branch and hit a couple of problems. It appears some changes had been made to the PMIx integration there. It was fairly easy to find a fix for the problems (which did not appear as conflicts as they introduced new - unfortunately now incorrect - lines of code).

Given that OMPI may itself be having problems, I recommend testing with the PMIx examples. Between the various examples, we pretty much cover everything an MPI app would do - and more. Should give you at least an initial smoke test.

Ralph
Comment 19 Ralph Castain 2022-04-12 13:33:06 MDT
FWIW: I have built and tested head of Slurm master with my patch (with the change I mentioned earlier due to changes in the master branch) and it works fine.

Will now check against PMIx v4.x
Ralph
Comment 20 Ralph Castain 2022-04-12 15:15:26 MDT
Updating: I have built my amended patch against PMIx v4.2 and it built and ran successfully.

Moving on to v4.1
Ralph
Comment 21 Ralph Castain 2022-04-12 16:41:33 MDT
Well, PMIx v4.1 checks out fine as well - moving on to v3.2
Ralph
Comment 22 Ralph Castain 2022-04-12 20:16:33 MDT
I've tested and confirmed all builds/works with PMIx v3.2 as well. I'm now going to test the OMPI master branch. However, that takes a couple of hours to build on my poor VM, so it won't be something I can complete quickly.

I'll update again once I have more. If the OMPI tests pass, I'll update my patch and post it here.

Ralph
Comment 23 Isaias Compres 2022-04-13 03:20:50 MDT
Hi Danny, Ralph, Mike,

Just a bit of context, since it has been several weeks since this was opened:  

The intent of this patch was to separate the build system concern from the Open PMIx features concern in 1978 (https://bugs.schedmd.com/show_bug.cgi?id=1978).

I tried to keep it as bare-bones as possible and only touch the build system parts, but as you know, there are small changes needed in mpi_pmix.c and pmixp_client_v2.c, since these do runtime checks based on ./configure detected values.

As far as I have understood from Ralph, cross-version PMIx client-server interactions should work.  So there is an opportunity to simplify things by dropping sub 2.1 PMIx support.  Please correct me if this is not accurate, Ralph.

Danny:
I have been testing both Slurm and Open PMIx only at HEAD.  I can try to set up an automated test environment to replicate the issues.  So far, staying at HEAD, I have had no problems.

Ralph: 
I have little experience with Open MPI, so I have started with an MPICH only setup.  I can report that with MPICH it works as expected.  I have to add Open MPI to our workflow, but this probably will not happen in the next few weeks.
Comment 24 Isaias Compres 2022-04-13 03:24:08 MDT
(In reply to Isaias Compres from comment #23)

> The intent of this patch was to separate the build system concern, from the
> Open PMIx features concern in 1978
> (https://bugs.schedmd.com/show_bug.cgi?id=1978).

Typo here, posted the wrong bug number.  I was referring to 7263:
https://bugs.schedmd.com/show_bug.cgi?id=7263
Comment 25 Ralph Castain 2022-04-13 12:24:56 MDT
Okay, I have tested OMPI master (using its internal copy of PMIx master branch) against Slurm built with PMIx v3.2 and all works fine. The patch therefore appears to be good to me.

I'll reassemble a final patch and post it here for others to try.
Comment 26 Ralph Castain 2022-04-13 12:54:28 MDT
Created attachment 24434 [details]
Updated PMIx integration patch

Afraid I have no idea how to use "git format-patch" and the docs I can find are less than helpful. I have provided the aggregated diff as applied to the Slurm master branch.
Comment 27 Tim Wickberg 2022-04-13 13:03:05 MDT
Ralph - Does this exist as a broken-up set of commits on your local git branch somewhere?

If so, and assuming there are 5 relevant commits you'd like to send over to us, the way we'd generate the patch internally would be:

git format-patch -5 --stdout > bug12396_master_v1.patch

This helps us maintain the separation between different commits, and make sure we don't inadvertently drop the Author and other metadata when applying it upstream.
Comment 28 Ralph Castain 2022-04-13 13:03:20 MDT
Created attachment 24435 [details]
Updated PMIx integration patch

Sorry - forgot you guys include the build product in your repo.
Comment 29 Tim Wickberg 2022-04-13 13:05:40 MDT
FYI - no need to include the autoreconf changes.

Whoever reviews it on our end will add that in before it gets pushed upstream - we need to make sure it comes through on the exact automake flavor we've standardized on anyway.
Comment 30 Ralph Castain 2022-04-13 13:56:25 MDT
(In reply to Tim Wickberg from comment #27)
> Ralph - Does this exist as a broken-up set of commits on your local git
> branch somewhere?
> 
> If so, and assuming there are 5 relevant commits you'd like to send over to
> us, the way we'd generate the patch internally would be:
> 
> git format-patch -5 --stdout > bug12396_master_v1.patch
> 
> This helps us maintain the separation between different commits, and make
> sure we don't inadvertently drop the Author and other metadata when applying
> it upstream.

No - in order to create this, I have to copy the files from the modified version I created into a fresh clone of your master branch. It isn't as simple as grabbing a set of commits since (a) they are sprinkled in time and (b) my clone of your repo does not include the build product as it complicates my work.

Afraid all I can do is hand you the diff. Not sure I really care about the Authorship etc. so long as we can get something that works. :-)

Autoreconf changes - got it. I included them anyway, but you can overwrite.
Comment 31 Danny Auble 2022-04-13 13:57:05 MDT
Created attachment 24437 [details]
v7

Thanks for trying things out, Ralph.  As Isaias stated in comment 24, this bug is only to get to a point where pmix 4+ will be allowed to work with Slurm.  This bug should not be used to further any new functionality; only basic working support.  I believe we can cross that bridge when we come to it, either here or in bug 7263 (which I think is where this should happen at the moment).

In my tests I am using the head of pmix 4.1 at the moment. I am guessing 4.2 is in the v4.2 branch, but it isn't clear as the VERSION there is still 4.1.2.

The segfault I am seeing in comment 16 is still there (in the PMIx code):

#0  __strncmp_avx2 () at ../sysdeps/x86_64/multiarch/strcmp-avx2.S:105
#1  0x00007feaa48e3a66 in register_nspace (nptr=0x7fea94001530) at ../../../../../pmix/src/mca/pmdl/ompi5/pmdl_ompi5.c:398
#2  0x00007feaa4a50aa2 in pmix_pmdl_base_register_nspace () from /home/da/pmix/4/lib/libpmix.so
#3  0x00007feaa498645b in _register_nspace () from /home/da/pmix/4/lib/libpmix.so
#4  0x00007feaa48b7ebf in ?? () from /lib/x86_64-linux-gnu/libevent_core-2.1.so.7
#5  0x00007feaa48b875f in event_base_loop () from /lib/x86_64-linux-gnu/libevent_core-2.1.so.7
#6  0x00007feaa49b77e6 in progress_engine () from /home/da/pmix/4/lib/libpmix.so
#7  0x00007feaa56b3947 in start_thread (arg=<optimized out>) at pthread_create.c:435
#8  0x00007feaa5743a44 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:100

This is just running a simple helloworld.  The same happens with ompi 4.1 or 5.  Note also that in frame 1 of the backtrace I am always getting ompi5/pmdl_ompi5.c; I would have expected ompi4 when running with ompi v4.

I am also assuming that v5 is master; that might also be incorrect. The same compile issues reported in comment 16 are still present...

In file included from /home/da/pmix/5/include/pmix_common.h:4111,
                 from /home/da/pmix/5/include/pmix_server.h:64,
                 from ../../../../../../slurm/src/plugins/mpi/pmix/pmixp_common.h:57,
                 from ../../../../../../slurm/src/plugins/mpi/pmix/pmixp_client.c:39:
../../../../../../slurm/src/plugins/mpi/pmix/pmixp_client.c: In function ‘_general_proc_info’:
/home/da/pmix/5/include/pmix_deprecated.h:222:12: error: implicit declaration of function ‘PMIx_Info_load’; did you mean ‘PMIX_INFO_LOAD’? [-Werror=implicit-function-declaration]
  222 |     (void) PMIx_Info_load(i, k, d, t)
      |            ^~~~~~~~~~~~~~
../../../../../../slurm/src/plugins/mpi/pmix/pmixp_client.h:59:9: note: in expansion of macro ‘PMIX_INFO_LOAD’
   59 |         PMIX_INFO_LOAD(kvp, key_str, val, type);                \
      |         ^~~~~~~~~~~~~~
../../../../../../slurm/src/plugins/mpi/pmix/pmixp_client.c:121:9: note: in expansion of macro ‘PMIXP_KVP_CREATE’
  121 |         PMIXP_KVP_CREATE(kvp, PMIX_SPAWNED, &flag, PMIX_BOOL);
      |         ^~~~~~~~~~~~~~~~
cc1: all warnings being treated as errors
make: *** [Makefile:1431: mpi_pmix_v5_la-pmixp_client.lo] Error 1
make: *** Waiting for unfinished jobs....

I get this same issue with the v4.2 branch, so I am guessing this change happened there since v4.1.

So everything that is in comment 16 is still happening as I test with updated heads on all branches.

As stated in comment 16, pmix 3.2 works without problem.

Attached is an updated patch that I am trying to get working.  I will not be looking at your patch as it is beyond the scope of this bug at the moment.
Comment 32 Danny Auble 2022-04-13 13:59:39 MDT
One thing I will make note of is that your patch rips out the pmix versioning for some reason.  It isn't uncommon for people to compile Slurm against multiple versions of pmix on their systems.  This patch would be rejected for this alone.  We need to keep the versioning.
Comment 33 Ralph Castain 2022-04-13 14:56:00 MDT
I won't waste any more of my time on Slurm/PMIx integration then. Good luck and hope you get your version working.

Ralph
Comment 34 Danny Auble 2022-04-13 16:18:57 MDT
Ralph -
I am just trying to get things to work, but we need to start at a point that does not introduce additional functionality, and does not remove the existing multi-version support, as is currently the case with the patch you're proposing. This should look a lot closer to what's in the latest patch I attached (attachment 24437 [details]).

But I do not have the expertise to figure out why PMIx isn't working with that minimal subset of changes alone. As stated before I am not sure how to proceed. This is clearly segfaulting in the PMIx code for v4. And v5 doesn't compile cleanly in Slurm (setting aside the additional changes included in your patch set).

At this time I can't include any support for v4+ until PMIx itself does not segfault the stepd. If it segfaults for me, it will surely segfault for someone else, which will only cause problems for us.
Comment 35 Isaias Compres 2022-04-14 02:54:41 MDT
(In reply to Danny Auble from comment #32)
> One thing I will make note of is your patch rips out the pmix versioning for
> some reason.  It isn't uncommon for people to compile Slurm against multiple
> versions of pmix on their systems.  This patch would be rejected for this
> alone.  We need to keep the versioning.

Hi Danny,

I have to say that I agree with Ralph in simplifying the build.  

The current build system creates separate build paths for V1, V2, and V3; we are adding V4 and V5 now, and given the pace of development of Open PMIx and the PMIx standard, very soon V6 and higher.  Furthermore, Open PMIx handles version mismatches; this has been an important feature for a while that targets precisely your concerns here about version mixes (although it was originally motivated by containers).

I understand and agree that there is value in multi-version build support, but given the nature of m4 scripts, there is no way to keep the control logic sane or understandable.  So in my opinion, the benefit of simultaneous multi-version builds does not outweigh the complexity added, given m4's limitations.  But that is only my opinion.

I will defer to you for a final decision here.  I will produce an updated patch based on your decision.

PS: Sorry for the late replies always, but it can't be helped given the time zone difference.
Comment 36 Danny Auble 2022-04-14 08:14:33 MDT
I think this bug alone is enough to convince me of the opposite, as we are trying to make as stable an experience as possible.  At this moment v4+ has shown it doesn't work correctly.  As a side note, mpirun segfaults in the exact same location...

Core was generated by `/home/da/openmpi/4.1/22.05/snowflake/bin/mpirun helloworld'.
...
#0  __strncmp_avx2 () at ../sysdeps/x86_64/multiarch/strcmp-avx2.S:105
#1  0x00007fe0df36ea66 in register_nspace (nptr=0x7fe0d8001190) at ../../../../../pmix/src/mca/pmdl/ompi5/pmdl_ompi5.c:398
#2  0x00007fe0df755aa2 in pmix_pmdl_base_register_nspace () from /home/da/pmix/4/lib/libpmix.so.2
#3  0x00007fe0df68b45b in _register_nspace () from /home/da/pmix/4/lib/libpmix.so.2
#4  0x00007fe0e34c9ebf in ?? () from /lib/x86_64-linux-gnu/libevent_core-2.1.so.7
#5  0x00007fe0e34ca75f in event_base_loop () from /lib/x86_64-linux-gnu/libevent_core-2.1.so.7
#6  0x00007fe0df6bc7e6 in progress_engine () from /home/da/pmix/4/lib/libpmix.so.2
#7  0x00007fe0e3319947 in start_thread (arg=<optimized out>) at pthread_create.c:435
#8  0x00007fe0e33a9a44 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:100

This hopefully shows there is a situation happening without Slurm involved at all.

The fact that the PMIx v4.2 and master branches don't compile cleanly with Slurm is also a reason to gate on version.

The versioning used in the m4 script is fairly simple to add, as we have demonstrated with this patch. I am not sure I understand the complexities you are citing. To add a new version you copy the current code and just bump the number; it is about as simple as one could hope.

As PMIx progresses we can easily remove versions, as we have done in the past.  In 59bd4fc2d2 we removed support for v1, which probably represents the most work that removing any version will ever need, as it was much different from the more modern versions.

As v3 seems to continue to be stable I am ok removing support for v2 as well, but I think this discussion is outside of the scope of the bug.
Comment 37 Ralph Castain 2022-04-15 15:49:10 MDT
I think I finally figured out your segfault problem. My fault for being dense and not looking into it more closely. If we look at the trace, we see:

v4 (4.1.2 on the v4.1 branch) segfaults the slurmstepd

backtrace

#0  __strncmp_avx2 () at ../sysdeps/x86_64/multiarch/strcmp-avx2.S:105
#1  0x00007f6b0bd7ca66 in register_nspace (nptr=0x7f6afc001530) at ../../../../../pmix/src/mca/pmdl/ompi5/pmdl_ompi5.c:398
#2  0x00007f6b0bef3aa2 in pmix_pmdl_base_register_nspace () from /home/da/pmix/4/lib/libpmix.so
#3  0x00007f6b0be2945b in _register_nspace () from /home/da/pmix/4/lib/libpmix.so
#4  0x00007f6b0bd5bebf in ?? () from /lib/x86_64-linux-gnu/libevent_core-2.1.so.7
#5  0x00007f6b0bd5c75f in event_base_loop () from /lib/x86_64-linux-gnu/libevent_core-2.1.so.7
#6  0x00007f6b0be5a7e6 in progress_engine () from /home/da/pmix/4/lib/libpmix.so
#7  0x00007f6b0cb53947 in start_thread (arg=<optimized out>) at pthread_create.c:435
#8  0x00007f6b0cbe3a44 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:100

The problem here is that there is no "ompi5" component in the v4.1 series. However, there is one in the v4.0 series (it was removed because we combined support for all OMPI versions into one "ompi" component in the v4.1 series).

If you look closer at the path, you'll see that this component is being drawn from a different location than the PMIx install:

at ../../../../../pmix/src/mca/pmdl/ompi5/pmdl_ompi5.c:398

In other words, this component is coming from a different PMIx installation than the function that is calling it:

from /home/da/pmix/4/lib/libpmix.so

So we are mixing libraries here, which is leading to the segfault. I suspect the problem is that you have multiple PMIx library paths in your environment, and that is causing a mixing of components to be dynamically loaded.
Comment 38 Ralph Castain 2022-04-15 15:51:47 MDT
Another possibility is that you installed the v4.1 library in the same location as you had previously installed v4.0, without first doing an "rm -rf" of that location to remove all the prior files. PMIx doesn't perform that removal for you, and so in cases where components are removed, the "stale" files are left behind - and get picked up, leading to this kind of segfault behavior.
Comment 39 Danny Auble 2022-04-15 15:58:35 MDT
Ralph, you must be a genius.  The problem was directly related to me having reused my dir to build both pmix 4.0 and 4.1.  I am not sure why pmix was deciding to mix and match there, but removing everything in the pmix lib install dir made everything happy for me.

Ok, so this is good for pmix 4.1.  Could you help me figure out how to make the warnings go away from the pmix v5 stuff?
Comment 40 Ralph Castain 2022-04-15 16:47:40 MDT
Yeah, I know what the problem is there. When we deprecate things, we move them from whatever header they are currently in to the "pmix_deprecated.h" header. In this case, the Standard deprecated the "PMIX_INFO_LOAD" macro, replacing it with a new "PMIx_Info_load" function.

When I moved the macro to the deprecated header, I linked it back to the function, which is declared in "pmix.h". However, the deprecated header is included from "pmix_common.h" - and so if you don't have "pmix.h" in your file, then the function is not resolved.

I will add the "pmix_deprecated.h" header to "pmix.h" so we avoid this problem. Should have it committed soon. In the meantime, you can fix it locally by just adding "pmix.h" to your includes.
Comment 41 Ralph Castain 2022-04-15 16:57:30 MDT
Actually, as I start to do the edits, I realize that I can't do what I had thought. The problem is that the macros were originally defined in "pmix_common.h", so backward compatibility would say that "pmix_deprecated.h" should be included from there (as it currently is). However, the macros now reference APIs that are defined in "pmix.h", which is where they belong.

If I move the deprecated macros to "pmix.h", the functions will resolve - but you will still fail to compile as you only include "pmix_common.h", and so the macros won't be found.

The only solution I can see is for you to add "pmix.h" to your includes. I'll have to think more and see if I can come up with a scheme that can resolve the problem.
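The header dependency described above can be illustrated with a self-contained sketch (hypothetical names standing in for the real PMIx headers, not the actual PMIx sources). The deprecated macro expands to a call to the replacement function, so a file that sees only the macro's header (the "pmix_common.h" case) fails to compile unless the function's declaration (from "pmix.h") is also visible:

```c
#include <string.h>

/* --- what "pmix.h" would provide: the replacement API --- */
int Info_load(char *dest, const char *key); /* stand-in for PMIx_Info_load */

/* --- what "pmix_deprecated.h" (pulled in via "pmix_common.h") would
 *     provide: the old macro, now forwarding to the new function --- */
#define INFO_LOAD(d, k) Info_load((d), (k))

/* trivial stand-in body; the real function lives in libpmix */
int Info_load(char *dest, const char *key)
{
    strcpy(dest, key);
    return 0;
}
```

Old code that uses `INFO_LOAD` keeps compiling only because `Info_load` is declared in the same translation unit; drop the declaration and the macro expansion no longer resolves, which is exactly the failure mode described above.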
Comment 42 Danny Auble 2022-04-15 17:03:29 MDT
Adding pmix.h isn't that big a deal, we have to do it in the v4 patch already for src/plugins/mpi/pmix/pmixp_client_v2.c.

I wonder if we just add it to

src/plugins/mpi/pmix/pmixp_common.h

things will get fixed correctly for both versions?

Or should we try to avoid including this file and do it better?  Seems like we shouldn't be using deprecated things, right?
Comment 43 Ralph Castain 2022-04-15 17:40:51 MDT
It's up to you. Problem is that the older versions won't have the new functions in them - they only have the macros. So unless you want to add #if's to the code, you're probably better off leaving the macros.
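The #if alternative mentioned above would look roughly like this (a hedged sketch with stand-in names and a stand-in version macro, not actual PMIx definitions): guard on the library version so new builds call the replacement function while older ones fall back to the macro.

```c
#include <string.h>

#define LIB_VERSION_MAJOR 5 /* stand-in: pretend we built against v5 */

/* New API, only present in newer library versions (stand-in body) */
static int Info_load(char *dest, const char *key)
{
    strcpy(dest, key);
    return 0;
}

/* Deprecated macro, present in all versions */
#define INFO_LOAD(d, k) strcpy((d), (k))

static void load_key(char *dest, const char *key)
{
#if LIB_VERSION_MAJOR >= 5
    Info_load(dest, key); /* use the new function where it exists... */
#else
    INFO_LOAD(dest, key); /* ...fall back to the macro elsewhere */
#endif
}
```

Since the macro remains available in every version, using it unconditionally avoids the guard entirely, which is the simpler option being recommended here.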

On the ompi5 plugin: we check plugins for compatibility with their respective framework interface definitions. In this case, the interfaces didn't change between PMIx v4.0 and v4.1. What you exposed is actually a bug in the ompi5 plugin that doesn't get exercised by v4.0, but does get touched by values passed in from v4.1.

I'll try to diagnose that further and issue a patch for v4.0.
Comment 44 Ralph Castain 2022-04-15 17:42:08 MDT
> I wonder if we just add it to

> src/plugins/mpi/pmix/pmixp_common.h

> things will get fixed correctly for both versions?

Yes - there is no harm in adding "pmix.h" to any file for all versions. Probably the safest thing to do.
Comment 45 Ralph Castain 2022-04-15 18:06:51 MDT
Figured out a solution for the deprecated macro situation:

https://github.com/openpmix/openpmix/pull/2566

Still, no harm in adding pmix.h to your common header, and probably worth doing for the long run.
Comment 46 Ralph Castain 2022-04-16 08:39:57 MDT
Okay, I have updated the deprecated header on both master and v4.2 branches. Please let me know if that fixed the problem.

Ralph
Comment 47 Ralph Castain 2022-04-16 20:53:53 MDT
> Seems like we shouldn't be using deprecated things right?

Just to clarify: the Standard will deprecate and eventually remove things such as these macros. However, the *library* will never remove them - we have promised to maintain backward compatibility for our users.

Thus, we move the macros to the "deprecated" header where they will remain essentially forever. We show the replacement APIs in the pmix.h header so people who are writing new code will use them. People who already used the macros don't need to do anything.

So it is up to you - personally, since you already have the macros in the code and want to support all the prior versions, I'd just leave them. No harm in doing so.
Comment 48 Danny Auble 2022-04-18 12:27:05 MDT
Ralph, I can confirm your patch to the pmix branches does fix the compilation problems.

So the status here is this.  I see pmix v4.* all working as expected so I am ok putting that into Slurm.

If I use pmix v5 (master branch head) the stepd locks up with this backtrace...

#0  0x00007fc68195c516 in pmix_value_array_set_size () from /home/da/pmix/5/lib/libpmix.so
#1  0x00007fc68195857b in group_register () from /home/da/pmix/5/lib/libpmix.so
#2  0x00007fc681954bd5 in register_variable () from /home/da/pmix/5/lib/libpmix.so
#3  0x00007fc6819555d5 in pmix_mca_base_var_register () from /home/da/pmix/5/lib/libpmix.so
#4  0x00007fc681954299 in pmix_mca_base_var_cache_files () from /home/da/pmix/5/lib/libpmix.so
#5  0x00007fc6819549ad in pmix_mca_base_var_init () from /home/da/pmix/5/lib/libpmix.so
#6  0x00007fc681920089 in pmix_init_util () from /home/da/pmix/5/lib/libpmix.so
#7  0x00007fc6819202fe in pmix_rte_init () from /home/da/pmix/5/lib/libpmix.so
#8  0x00007fc6818fb8be in PMIx_server_init () from /home/da/pmix/5/lib/libpmix.so
#9  0x00007fc681ab1b00 in pmixp_lib_init () at ../../../../../../slurm/src/plugins/mpi/pmix/pmixp_client_v2.c:245
#10 0x00007fc681a9b6c3 in pmixp_libpmix_init () at ../../../../../../slurm/src/plugins/mpi/pmix/pmixp_client.c:493
#11 0x00007fc681aa3b77 in pmixp_stepd_init (job=job@entry=0x55c08981f4a0, env=env@entry=0x55c08981f598) at ../../../../../../slurm/src/plugins/mpi/pmix/pmixp_server.c:422
#12 0x00007fc681a991eb in mpi_p_slurmstepd_prefork (job=0x55c08981f4a0, env=0x55c08981f598) at ../../../../../../slurm/src/plugins/mpi/pmix/mpi_pmix.c:167
#13 0x00007fc68297d365 in mpi_g_slurmstepd_prefork (job=job@entry=0x55c08981f4a0, env=env@entry=0x55c08981f598) at ../../../../slurm/src/common/slurm_mpi.c:257
#14 0x000055c088e4ae10 in job_manager (job=job@entry=0x55c08981f4a0) at ../../../../../slurm/src/slurmd/slurmstepd/mgr.c:1242
#15 0x000055c088e46463 in main (argc=<optimized out>, argv=<optimized out>) at ../../../../../slurm/src/slurmd/slurmstepd/slurmstepd.c:190

The debug in the stepd is

[Apr 18 12:19:53.821796 3393634 slurmstepd   0x7fc68244fd40] [621157.0] debug:  MPI: Type: pmix_v5
[Apr 18 12:19:53.822224 3393634 slurmstepd   0x7fc68244fd40] [621157.0] debug:  mpi/pmix_v5: init: PMIx plugin loaded
[Apr 18 12:19:53.822328 3393634 slurmstepd   0x7fc68244fd40] [621157.0] debug:  spank: opening plugin stack /home/da/slurm/22.05/snowflake/etc/plugstack.conf
[Apr 18 12:19:53.822343 3393634 slurmstepd   0x7fc68244fd40] [621157.0] debug:  mpi/pmix_v5: mpi_p_slurmstepd_prefork: (null) [0]: mpi_pmix.c:162: start
[Apr 18 12:19:53.822397 3393634 slurmstepd   0x7fc68244fd40] [621157.0] debug:  mpi/pmix_v5: pmixp_info_nspace_usock: setup sockets

That is where it hangs.  Any ideas?
Comment 49 Danny Auble 2022-04-18 12:37:35 MDT
Created attachment 24519 [details]
v8

This is the updated patch, putting the pmix.h include in src/plugins/mpi/pmix/pmixp_common.h instead of the .c file, which only fixed things for <= v4.1.
Comment 50 Danny Auble 2022-04-18 12:37:54 MDT
Comment on attachment 24437 [details]
v7

obsoleted by v8
Comment 51 Ralph Castain 2022-04-19 08:14:46 MDT
(In reply to Danny Auble from comment #48)
> Ralph, I can confirm your patch to the pmix branches does fix the
> compilation problems.
> 
> So the status here is this.  I see pmix v4.* all working as expected so I am
> ok putting that into Slurm.
> 
> If I use pmix v5 (master branch head) the stepd locks up with this
> backtrace...
> 
> #0  0x00007fc68195c516 in pmix_value_array_set_size () from
> /home/da/pmix/5/lib/libpmix.so
> #1  0x00007fc68195857b in group_register () from
> /home/da/pmix/5/lib/libpmix.so
> #2  0x00007fc681954bd5 in register_variable () from
> /home/da/pmix/5/lib/libpmix.so
> #3  0x00007fc6819555d5 in pmix_mca_base_var_register () from
> /home/da/pmix/5/lib/libpmix.so
> #4  0x00007fc681954299 in pmix_mca_base_var_cache_files () from
> /home/da/pmix/5/lib/libpmix.so
> #5  0x00007fc6819549ad in pmix_mca_base_var_init () from
> /home/da/pmix/5/lib/libpmix.so
> #6  0x00007fc681920089 in pmix_init_util () from
> /home/da/pmix/5/lib/libpmix.so
> #7  0x00007fc6819202fe in pmix_rte_init () from
> /home/da/pmix/5/lib/libpmix.so
> #8  0x00007fc6818fb8be in PMIx_server_init () from
> /home/da/pmix/5/lib/libpmix.so
> #9  0x00007fc681ab1b00 in pmixp_lib_init () at
> ../../../../../../slurm/src/plugins/mpi/pmix/pmixp_client_v2.c:245
> #10 0x00007fc681a9b6c3 in pmixp_libpmix_init () at
> ../../../../../../slurm/src/plugins/mpi/pmix/pmixp_client.c:493
> #11 0x00007fc681aa3b77 in pmixp_stepd_init (job=job@entry=0x55c08981f4a0,
> env=env@entry=0x55c08981f598) at
> ../../../../../../slurm/src/plugins/mpi/pmix/pmixp_server.c:422
> #12 0x00007fc681a991eb in mpi_p_slurmstepd_prefork (job=0x55c08981f4a0,
> env=0x55c08981f598) at
> ../../../../../../slurm/src/plugins/mpi/pmix/mpi_pmix.c:167
> #13 0x00007fc68297d365 in mpi_g_slurmstepd_prefork
> (job=job@entry=0x55c08981f4a0, env=env@entry=0x55c08981f598) at
> ../../../../slurm/src/common/slurm_mpi.c:257
> #14 0x000055c088e4ae10 in job_manager (job=job@entry=0x55c08981f4a0) at
> ../../../../../slurm/src/slurmd/slurmstepd/mgr.c:1242
> #15 0x000055c088e46463 in main (argc=<optimized out>, argv=<optimized out>)
> at ../../../../../slurm/src/slurmd/slurmstepd/slurmstepd.c:190
> 
> The debug in the stepd is
> 
> [Apr 18 12:19:53.821796 3393634 slurmstepd   0x7fc68244fd40] [621157.0]
> debug:  MPI: Type: pmix_v5
> [Apr 18 12:19:53.822224 3393634 slurmstepd   0x7fc68244fd40] [621157.0]
> debug:  mpi/pmix_v5: init: PMIx plugin loaded
> [Apr 18 12:19:53.822328 3393634 slurmstepd   0x7fc68244fd40] [621157.0]
> debug:  spank: opening plugin stack
> /home/da/slurm/22.05/snowflake/etc/plugstack.conf
> [Apr 18 12:19:53.822343 3393634 slurmstepd   0x7fc68244fd40] [621157.0]
> debug:  mpi/pmix_v5: mpi_p_slurmstepd_prefork: (null) [0]: mpi_pmix.c:162:
> start
> [Apr 18 12:19:53.822397 3393634 slurmstepd   0x7fc68244fd40] [621157.0]
> debug:  mpi/pmix_v5: pmixp_info_nspace_usock: setup sockets
> 
> That is where it hangs.  Any ideas?

Puzzling - I'm not sure why that would happen. All it is doing is a realloc to increase the size of the array by 1 more than the item being indexed, which seems innocent enough.

Can you build PMIx with --enable-debug and get a line number? We could perhaps look to see if the index of the item it is trying to set is an enormous number - perhaps a variable wasn't properly initialized in this environment.
Comment 52 Danny Auble 2022-04-19 14:07:28 MDT
Sorry I didn't have that before :).

(gdb) where
#0  0x00007f35acb7d376 in pmix_value_array_set_size () from /home/da/pmix/5/lib/libpmix.so
#1  0x00007f35acb73aaf in pmix_value_array_set_item (array=0x55a8e56dbf28, item_index=0, item=0x7ffe395574cc) at /home/da/pmix/5/pmix/src/class/pmix_value_array.h:204
#2  0x00007f35acb73b2a in pmix_value_array_append_item (array=0x55a8e56dbf28, item=0x7ffe395574cc) at /home/da/pmix/5/pmix/src/class/pmix_value_array.h:227
#3  0x00007f35acb7507e in group_register (project_name=0x7f35accd34d4 "pmix", framework_name=0x7f35accd34d0 "mca", component_name=0x7f35accd34cb "base", description=0x0) at ../../../../pmix/src/mca/base/pmix_mca_base_var_group.c:303
#4  0x00007f35acb750d2 in pmix_mca_base_var_group_register (project_name=0x7f35accd34d4 "pmix", framework_name=0x7f35accd34d0 "mca", component_name=0x7f35accd34cb "base", description=0x0) at ../../../../pmix/src/mca/base/pmix_mca_base_var_group.c:312
#5  0x00007f35acb6e731 in register_variable (project_name=0x7f35accd34d4 "pmix", framework_name=0x7f35accd34d0 "mca", component_name=0x7f35accd34cb "base", variable_name=0x7f35accd3604 "param_files", description=0x7f35accd35c8 "Path for MCA configuration files containing variable values", type=PMIX_MCA_BASE_VAR_TYPE_STRING, flags=PMIX_MCA_BASE_VAR_FLAG_NONE, synonym_for=0, storage=0x7f35acd38b88 <pmix_mca_base_var_files>) at ../../../../pmix/src/mca/base/pmix_mca_base_var.c:1203
#6  0x00007f35acb6f308 in pmix_mca_base_var_register (project_name=0x7f35accd34d4 "pmix", framework_name=0x7f35accd34d0 "mca", component_name=0x7f35accd34cb "base", variable_name=0x7f35accd3604 "param_files", description=0x7f35accd35c8 "Path for MCA configuration files containing variable values", type=PMIX_MCA_BASE_VAR_TYPE_STRING, storage=0x7f35acd38b88 <pmix_mca_base_var_files>) at ../../../../pmix/src/mca/base/pmix_mca_base_var.c:1341
#7  0x00007f35acb6c29a in pmix_mca_base_var_cache_files (rel_path_search=false) at ../../../../pmix/src/mca/base/pmix_mca_base_var.c:349
#8  0x00007f35acb6bfa7 in pmix_mca_base_var_init () at ../../../../pmix/src/mca/base/pmix_mca_base_var.c:263
#9  0x00007f35acb110e0 in pmix_init_util (info=0x55a8e56d9a30, ninfo=2, helpdir=0x0) at ../../pmix/src/runtime/pmix_init.c:152
#10 0x00007f35acb11318 in pmix_rte_init (type=2, info=0x55a8e56d9a30, ninfo=2, cbfunc=0x0) at ../../pmix/src/runtime/pmix_init.c:216
#11 0x00007f35acaaaf2d in PMIx_server_init (module=0x7f35acd6cc80 <slurm_pmix_cb>, info=0x55a8e56d9a30, ninfo=2) at ../../pmix/src/server/pmix_server.c:632
#12 0x00007f35acd5fb00 in pmixp_lib_init () at ../../../../../../slurm/src/plugins/mpi/pmix/pmixp_client_v2.c:245
#13 0x00007f35acd496c3 in pmixp_libpmix_init () at ../../../../../../slurm/src/plugins/mpi/pmix/pmixp_client.c:493
#14 0x00007f35acd51b77 in pmixp_stepd_init (job=job@entry=0x55a8e56aa4d0, env=env@entry=0x55a8e56aa5c8) at ../../../../../../slurm/src/plugins/mpi/pmix/pmixp_server.c:422
#15 0x00007f35acd471eb in mpi_p_slurmstepd_prefork (job=0x55a8e56aa4d0, env=0x55a8e56aa5c8) at ../../../../../../slurm/src/plugins/mpi/pmix/mpi_pmix.c:167
#16 0x00007f35adc2b365 in mpi_g_slurmstepd_prefork (job=job@entry=0x55a8e56aa4d0, env=env@entry=0x55a8e56aa5c8) at ../../../../slurm/src/common/slurm_mpi.c:257
#17 0x000055a8e5101e10 in job_manager (job=job@entry=0x55a8e56aa4d0) at ../../../../../slurm/src/slurmd/slurmstepd/mgr.c:1242
#18 0x000055a8e50fd463 in main (argc=<optimized out>, argv=<optimized out>) at ../../../../../slurm/src/slurmd/slurmstepd/slurmstepd.c:190
Comment 53 Ralph Castain 2022-04-19 21:28:29 MDT
Could you please try applying this patch to your PMIx source:

diff --git a/src/class/pmix_value_array.c b/src/class/pmix_value_array.c
index d228ddc..3ee4c58 100644
--- a/src/class/pmix_value_array.c
+++ b/src/class/pmix_value_array.c
@@ -27,7 +27,7 @@ static void pmix_value_array_construct(pmix_value_array_t *array)
     array->array_items = NULL;
     array->array_size = 0;
     array->array_item_sizeof = 0;
-    array->array_alloc_size = 0;
+    array->array_alloc_size = 1;
 }
 
 static void pmix_value_array_destruct(pmix_value_array_t *array)
Comment 54 Danny Auble 2022-04-20 08:21:16 MDT
Doing that causes srun to fail.

srun --mpi=pmix_v5 -n1 helloworld
srun: error: snowflake7: task 0: Exited with exit code 14
[snowflake:598966] PMIX ERROR: PACK-MISMATCH in file ../../pmix/src/client/pmix_client.c at line 832
--------------------------------------------------------------------------
PMIx_Init failed for the following reason:

  PACK-MISMATCH

Open MPI requires access to a local PMIx server to execute. Please ensure
that either you are operating in a PMIx-enabled environment, or use "mpirun"
to execute the job.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and MPI will try to terminate your MPI job as well)
[snowflake:598966] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
Comment 55 Danny Auble 2022-04-20 08:26:35 MDT
I am actually getting that with or without the patch now.  I don't know what changed, but I am no longer locked up and am instead getting that error message in all cases with v5.
Comment 56 Ralph Castain 2022-04-20 08:56:23 MDT
This is with the head of Slurm master branch? What OMPI branch are you using? Have you tried just running one of the PMIx examples (e.g., client or client2)?

I can try to locally replicate later today.
Comment 57 Danny Auble 2022-04-20 09:15:15 MDT
Thanks Ralph, yes, the Slurm master head.  That error is from ompi 5.0

The ompi 4.1 error is this...

srun --mpi=pmix_v5 -n1 helloworld
srun: error: snowflake7: task 0: Exited with exit code 1
--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened.  This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded).  Note that
PMIX stopped checking at the first component that it did not find.

Host:      snowflake
Framework: psec
Component: munge
--------------------------------------------------------------------------

--------------------------------------------------------------------------
It looks like pmix_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during pmix_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
PMIX developer):

  pmix_psec_base_open failed
  --> Returned value -46 instead of PMIX_SUCCESS
--------------------------------------------------------------------------

[snowflake:1309404] PMIX ERROR: NOT-FOUND in file ../../../../../../../../ompi/opal/mca/pmix/pmix3x/pmix/src/client/pmix_client.c at line 562
[snowflake:1309404] OPAL ERROR: Not found in file ../../../../../../ompi/opal/mca/pmix/pmix3x/pmix3x_client.c at line 112
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:

  version 16.05 or later: you can use SLURM's PMIx support. This
  requires that you configure and build SLURM --with-pmix.

  Versions earlier than 16.05: you must use either SLURM's PMI-1 or
  PMI-2 support. SLURM builds PMI-1 by default, or you can manually
  install PMI-2. You must then build Open MPI using --with-pmi pointing
  to the SLURM PMI library location.

Please configure as appropriate and try again.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[snowflake:1309404] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!

PMIx <= v4 works in all cases.

srun --mpi=pmix_v5 -n2 -N2 /home/da/pmix/5/build/test/pmix_client -n 2 --job-fence -c
==1377153== OK
==1377154== OK

So that seems to work?

As a side note, it looks like Mike was getting the error I am seeing with ompi v5 back in comment 15, now that I am looking closer.
Comment 58 Ralph Castain 2022-04-20 11:24:32 MDT
This error:

> The opmi 4.1 error is this...

> srun --mpi=pmix_v5 -n1 helloworld
> srun: error: snowflake7: task 0: Exited with exit code 1
> --------------------------------------------------------------------------
> A requested component was not found, or was unable to be opened.  This
> means that this component is either not installed or is unable to be
> used on your system (e.g., sometimes this means that shared libraries
> that the component requires are unable to be found/loaded).  Note that
> PMIX stopped checking at the first component that it did not find.

> Host:      snowflake
> Framework: psec
> Component: munge

could be the result of a configuration mismatch. When you built PMIx for Slurm, I assume you added the --with-munge configuration option? Did you do the same for OMPI?

You might check to see if there is an "mca_psec_munge" library in the OMPI lib install area - should be under the "pmix" subdirectory, I believe.

Also check the PMIx install area that Slurm is tied to - I'm pretty sure you'll find it there. The error indicates that the server told the app to use the munge security plugin, but the app cannot find it.
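The check described above can be sketched as a small helper (hypothetical, not part of PMIx or Slurm): scan an install's lib directory, e.g. the "pmix" subdirectory under the OMPI or PMIx lib install area, for a file whose name contains the munge psec component.

```c
#include <dirent.h>
#include <string.h>

/* Return 1 if some file in libdir looks like the munge psec component
 * (e.g. "pmix_mca_psec_munge.so"); 0 otherwise. A missing directory
 * counts as "not found". */
static int has_psec_munge(const char *libdir)
{
    DIR *d = opendir(libdir);
    if (!d)
        return 0;

    int found = 0;
    struct dirent *e;
    while ((e = readdir(d)) != NULL) {
        if (strstr(e->d_name, "psec_munge") != NULL) {
            found = 1;
            break;
        }
    }
    closedir(d);
    return found;
}
```

For example, pointing it at a path like /home/da/pmix/5/lib/pmix (an install location from this thread) would show whether the server-side component the app is being told to use is actually present on the app side.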
Comment 59 Ralph Castain 2022-04-20 11:32:35 MDT
(In reply to Danny Auble from comment #55)
> I am actually getting that with or without the patch now.  I don't know what
> changed, but I am no longer locked up, but and getting that error message
> for all cases with v5.

Did you update OMPI v5 by any chance? I'm not sure how closely you are tracking their branch head - they updated PMIx in it last week. They are lagging behind a fair amount, so you may see some errors due to the lag time as the PMIx master moves ahead of the PMIx v4.2 release branch, and OMPI lags even further back on updates to their copy of the v4.2 branch.
Comment 60 Danny Auble 2022-04-20 11:44:11 MDT
I have munge installed normally, not as a standalone; I haven't ever used --with-munge on a configure line, but it seems like it is finding it anyway:

External Packages
-----------------------
Curl: no (explicitly disabled)
Jansson: no (explicitly disabled)
munge: yes (pkg-config: default search paths)
ZLIB: yes (pkg-config: default search paths)

I don't see the lib installed anywhere, but I do see it compiled as a .la:

/home/da/pmix/5/build/src/mca/psec/munge/libpmix_mca_psec_munge.la
/home/da/pmix/5/build/src/mca/psec/munge/.libs/libpmix_mca_psec_munge.a
/home/da/pmix/5/build/src/mca/psec/munge/.libs/libpmix_mca_psec_munge.la

I don't see --with-munge as an ompi option, but I put it on anyway; that didn't change the end result.  I still get the same error.

I just pulled and rebuilt ompi 5.0.  I never use the locally provided prrte or pmix; I always use my own (autogen.pl --no-3rdparty pmix,prrte).  Things are working now there!

I am guessing the problem is now only with ompi v4.1, which seems like a configuration issue?
Comment 61 Ralph Castain 2022-04-20 12:15:12 MDT
(In reply to Danny Auble from comment #60)
> I have munge install normally not as a standalone, I haven't ever used
> --with-munge on a configure line, but it seems like it is finding it anyway
> 
> External Packages
> -----------------------
> Curl: no (explicitly disabled)
> Jansson: no (explicitly disabled)
> munge: yes (pkg-config: default search paths)
> ZLIB: yes (pkg-config: default search paths)
> 
> I don't see the lib installed anywhere, but I do it see it compiled as a .la
> 
> /home/da/pmix/5/build/src/mca/psec/munge/libpmix_mca_psec_munge.la
> /home/da/pmix/5/build/src/mca/psec/munge/.libs/libpmix_mca_psec_munge.a
> /home/da/pmix/5/build/src/mca/psec/munge/.libs/libpmix_mca_psec_munge.la
> 
> I don't see --with-munge as an opmi option, but I put it on anyway, but that
> didn't change the end result.  I still get the same error.
> 
> I just pulled and rebuild ompi 5.0, I never use the locally provided prrte
> or pmix, I am always using my own (autogen.pl --no-3rdparty pmix,prrte).
> Things are working now there!
> 
> I am guessing the problem is now only with v4.1 ompi which seems like a
> configuration issue?

I suspect so - if you look in that OMPI install tree, do you see an mca_psec_munge somewhere? Or are you also building OMPI v4.1 with an external PMIx?
Comment 62 Danny Auble 2022-04-20 12:16:05 MDT
Always external.
Comment 63 Ralph Castain 2022-04-20 13:28:04 MDT
I don't know if OMPI v4 will work with an external version of PMIx v5 - I'm not sure their configure logic knows how to deal with it.

Can we recap the situation? I'm getting lost in all the v's and am unsure just what is working and what is not. It sounds like you now have OMPI v5 (pointing at an external version of PMIxv5) working with Slurm+PMIxv5 - is that correct?

And it sounds like OMPI v4 (pointing at an external version of PMIxv?) is not able to run against Slurm+PMIxv5?
Comment 64 Danny Auble 2022-04-21 11:10:32 MDT
(In reply to Ralph Castain from comment #63)
> I don't know if OMPI v4 will work with an external version of PMIx v5 - I'm
> not sure their configure logic knows how to deal with it.
> 
> Can we recap the situation? I'm getting lost in all the v's and am unsure
> just what is working and what is not. It sounds like you now have OMPI v5
> (pointing at an external version of PMIxv5) working with Slurm+PMIxv5 - is
> that correct?
> 
> And it sounds like OMPI v4 (pointing at an external version of PMIxv?) is
> not able to run against Slurm+PMIxv5?

This is all correct.  

OMPI v5 + PMIxv[3-5] all seems to work now.

OMPI v4 is totally fine compiling against PMIxv5 but doesn't work.

PMIxv[3-4] seems to work in all situations.

So at this point I am ok putting v8 into 22.05.  We will need to update the docs as well, as they talk about PMIxv1, which is not even supported anymore.

In any case, I think the integration is in a good spot at the moment.  Do you think PMIxv5 will ever work with OMPI v4.*?
Comment 65 Ralph Castain 2022-04-21 11:44:11 MDT
> In anycase I think the integration is in a good spot at the moment.  Do you think PMIxv5 will ever work with OMPI v4.*?

I'm not sure why it _doesn't_ work, to be honest - I can't think of any reason why it would fail. Probably something the OMPI folks need to look at - I can file a ticket for them.
Comment 66 Danny Auble 2022-04-21 16:43:32 MDT
Comment on attachment 24519 [details]
v8

Thanks for everyone's help on this.  v8 has been pushed to master commits 4182b6dcd4..dabecc3d74.
Comment 67 Danny Auble 2022-04-21 16:46:24 MDT
Comment on attachment 24435 [details]
Updated PMIx integration patch

As discussed this is beyond the scope of this bug.  I think bug 7263 is where this probably fits the best.
Comment 68 Danny Auble 2022-04-21 16:46:49 MDT
Please reopen if anything is found not working correctly here.
Comment 69 Felix Abecassis 2022-04-22 10:53:31 MDT
Thanks a lot Isaias, Danny and Ralph for all the hard work to get PMIx v4/v5 support in Slurm.

I tried the current master branch (d563ec24) with Ubuntu 22.04 and libpmix-dev=4.1.2-2ubuntu1 (from Ubuntu), but it didn't seem to work. PMIx v4 was found by "configure" after some help:
./configure --with-pmix=/usr/lib/x86_64-linux-gnu/pmix2

But the pmix plugin was not being built by "make" and so did not show up in "srun --mpi=list" after an install.

I think the issue is simply that v4 and v5 cases are missing here:
https://github.com/SchedMD/slurm/blob/d563ec24bf5ef502d2eb3bd25ee45e73cfd1122b/auxdir/x_ac_pmix.m4#L196-L198

This is needed here to add the pmix plugin to the build:
https://github.com/SchedMD/slurm/blob/d563ec24bf5ef502d2eb3bd25ee45e73cfd1122b/src/plugins/mpi/Makefile.am#L4-L6
Comment 70 Danny Auble 2022-04-22 11:17:56 MDT
Thanks Felix for reporting, this is fixed in 73ea2d56f8.

Please note that if a bug is closed and I don't respond within an hour, the odds of me responding are right around 0 ;).  Reopening the bug is a good way to avoid this if something is required.