Ticket 16687 - PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
Summary: PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: PMIx
Version: 23.02.2
Hardware: Linux
OS: Linux
Priority: ---
Severity: 3 - Medium Impact
Assignee: Alejandro Sanchez
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2023-05-10 08:12 MDT by Josko Plazonic
Modified: 2023-05-16 07:38 MDT

See Also:
Site: Princeton (PICSciE)
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 23.02.3
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Description Josko Plazonic 2023-05-10 08:12:42 MDT
We just upgraded our clusters to 23.02.2 (from 22.05.8) and started seeing this in MPI jobs (openmpi 4.1.0, built against pmix 3.2.3):

[stellar-i02n6:239005] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168

I tracked it down to a failure to read

/var/spool/slurmd/pmix.835089.6/pmix_dstor_ds12_251982/dstore_sm.lock
/var/spool/slurmd/pmix.835089.6/pmix_dstor_ds21_251982/smlockseg-slurm.pmix.835089.6
/var/spool/slurmd/pmix.835089.6/pmix_dstor_ds21_251982/initial-pmix_shared-segment-0

files which are created by slurmstepd:

COMMAND      PID USER  FD   TYPE DEVICE SIZE/OFF      NODE NAME
slurmstep 251982 root mem    REG  253,0     4096 107895504 /var/spool/slurmd/pmix.835089.6/pmix_dstor_ds21_251982/smlockseg-slurm.pmix.835089.6
slurmstep 251982 root mem    REG  253,0     4096 107895502 /var/spool/slurmd/pmix.835089.6/pmix_dstor_ds21_251982/initial-pmix_shared-segment-0

and here are their permissions

/var/spool/slurmd/pmix.835089.6/pmix_dstor_ds21_251982:
total 8
drwxr-x--- 2 plazonic root   80 May 10 09:58 .
drwxrwx--- 4 plazonic root   66 May 10 09:58 ..
-rw------- 1 root     root 4096 May 10 09:58 initial-pmix_shared-segment-0
-rw------- 1 root     root 4096 May 10 09:58 smlockseg-slurm.pmix.835089.6

The PMIX error is printed immediately after the attempt to open the above files, e.g.

openat(AT_FDCWD, "/var/spool/slurmd/pmix.835087.1//pmix_dstor_ds12_249307/dstore_sm.lock", O_RDWR) = -1 EACCES (Permission denied)
write(2, "[stellar-i02n5:249395] PMIX ERRO"..., 85) = 85

dstore_sm.lock is similarly owned by root:

/var/spool/slurmd/pmix.835089.6/pmix_dstor_ds12_251982:
total 8
drwxr-x--- 2 plazonic root   65 May 10 09:58 .
drwxrwx--- 4 plazonic root   66 May 10 09:58 ..
-rw------- 1 root     root 4096 May 10 09:58 dstore_sm.lock
-rw------- 1 root     root 4096 May 10 09:58 initial-pmix_shared-segment-0
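
To illustrate the failure mode (with a stand-in path, not one of the actual files): any process running as the job user gets exactly this EACCES when it tries to open a root-owned, mode 0600 file for read/write, which is what the PMIx client does with dstore_sm.lock. A minimal standalone C check:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Stand-in for a root-owned, mode 0600 file such as dstore_sm.lock */
    const char *path = "/tmp/root_owned_0600_file";
    int fd = open(path, O_RDWR);
    if (fd < 0)
        fprintf(stderr, "open(%s, O_RDWR) failed: %s\n", path, strerror(errno));
    return 0;
}

Run as a non-root user against such a file, that prints "Permission denied", matching the strace above.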

Is this a new bug, or is there some configuration item we should have changed or added when we upgraded to 23.02.2 that would fix this?

Thanks
Comment 1 Alejandro Sanchez 2023-05-12 05:01:06 MDT
Hi Josko,

Thanks for reporting. I'll try to reproduce this locally and let you know what I find. In the meantime, before the Slurm upgrade, when you were using 22.05.8, were you running the same openmpi 4.1.0 built against pmix 3.2.3?

Also, can you reproduce this with a simple 2-node MPI hello world test program? What do the job request and the program look like?

Thanks
Comment 2 Josko Plazonic 2023-05-12 11:23:52 MDT
Nothing but slurm changed. And yes, this was tested with a very simple 2 node hello world mpi test. It happens right at the start of the job.
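
For reference, the test is essentially the standard MPI hello world, something along these lines (the exact source of hello_mpi2 is not attached, so this is just a plausible equivalent):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank = 0;
    MPI_Init(&argc, &argv);               /* PMIx is exercised during init */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("Hello World from Node %d\n", rank);
    MPI_Finalize();
    return 0;
}

Built with mpicc from the same openmpi 4.1.0 stack and launched with srun across 2 nodes.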
Comment 3 Josko Plazonic 2023-05-15 07:50:00 MDT
Any news here? This is what those same files look like on a cluster where we are still on 22.05.7 - note how they are owned by me, rather than root. This is exactly the same build of pmix and openmpi - the difference is in slurm.

[root@adroit-h11n4 ~]# ls -laR /var/spool/slurmd/pmix.1771026.0
/var/spool/slurmd/pmix.1771026.0:
total 4
drwxrwx---   4 plazonic root   68 May 15 09:47 .
drwxr-xr-x. 10 root     root 4096 May 15 09:47 ..
drwxr-x---   2 plazonic root   65 May 15 09:47 pmix_dstor_ds12_2858799
drwxr-x---   2 plazonic root  157 May 15 09:47 pmix_dstor_ds21_2858799

/var/spool/slurmd/pmix.1771026.0/pmix_dstor_ds12_2858799:
total 8
drwxr-x--- 2 plazonic root   65 May 15 09:47 .
drwxrwx--- 4 plazonic root   68 May 15 09:47 ..
-rw-rw---- 1 plazonic root 4096 May 15 09:47 dstore_sm.lock
-r--rw---- 1 plazonic root 4096 May 15 09:47 initial-pmix_shared-segment-0

/var/spool/slurmd/pmix.1771026.0/pmix_dstor_ds21_2858799:
total 8200
drwxr-x--- 2 plazonic root     157 May 15 09:47 .
drwxrwx--- 4 plazonic root      68 May 15 09:47 ..
-r--rw---- 1 plazonic root    4096 May 15 09:47 initial-pmix_shared-segment-0
-r--rw---- 1 plazonic root 4194304 May 15 09:47 smdataseg-slurm.pmix.1771026.0-0
-rw-rw---- 1 plazonic root    4096 May 15 09:47 smlockseg-slurm.pmix.1771026.0
-r--rw---- 1 plazonic root 4194304 May 15 09:47 smseg-slurm.pmix.1771026.0-0
Comment 4 Alejandro Sanchez 2023-05-15 08:03:31 MDT
I've been trying to reproduce this locally by installing the same stack:

Slurm 23.02.2
PMIx 3.2.3
OpenMPI 4.1.0

But the build of OpenMPI 4.1.0 is failing for me with this error:

error: unknown type name ‘ptrdiff_t’

So I'm trying to figure out what's going on with this version at the moment.

In the meantime, looking at the PMIx-related changes between 22.05.8 and 23.02.2, I highly suspect this one:

https://github.com/SchedMD/slurm/commit/d23cad68dfa

and perhaps this one (though I suspect it less):

https://github.com/SchedMD/slurm/commit/9985efc4a85
Comment 5 Josko Plazonic 2023-05-15 08:16:15 MDT
That first change does look like a very likely culprit. If I could figure out what a fix/workaround might be I'd test it, but the order of execution isn't clear to me yet - open to suggestions.

As far as reproducing goes - if you can test on a RHEL8-compatible system, we have these rpms at

http://springdale.princeton.edu/data/springdale/computational/8.7/x86_64/

The pmix rpm is there, and the openmpi rpms start with openmpi040100-gcc (-runtime/-devel).
Comment 6 Josko Plazonic 2023-05-15 08:49:56 MDT
Yup, that's the culprit (https://github.com/SchedMD/slurm/commit/d23cad68dfa). This is before I installed the slurm build with the revert:

[plazonic@stellar-intel ~]$ salloc -N 2 -c 1 -t 30:00 --reservation=slurmfix
salloc: Granted job allocation 837668
salloc: Nodes stellar-i01n[6-7] are ready for job
[plazonic@stellar-i01n6 ~]$ module load openmpi/gcc/4.1.0
[plazonic@stellar-i01n6 ~]$ srun ./hello_mpi2
[stellar-i01n7:1156328] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
[stellar-i01n6:1029047] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
Hello World from Node 0
Hello World from Node 1

And this is after

[plazonic@stellar-intel ~]$ salloc -N 2 -c 1 -t 30:00 --reservation=slurmfix
salloc: Pending job allocation 837669
salloc: job 837669 queued and waiting for resources
salloc: job 837669 has been allocated resources
salloc: Granted job allocation 837669
salloc: Waiting for resource configuration
salloc: Nodes stellar-i01n[6-7] are ready for job
[plazonic@stellar-i01n6 ~]$ module load openmpi/gcc/4.1.0
[plazonic@stellar-i01n6 ~]$ srun ./hello_mpi2
Hello World from Node 1
Hello World from Node 0

No more PMIX errors.

Now, I am not sure that a simple revert is enough - I don't know much about pmix, why the code in pmixp_client.c puts all these ops in a lresp list (in d23cad68dfa), or whether a proper fix belongs in a different/earlier spot instead of a revert - I'll leave that to you and hold off on patching until I hear from you.

Thanks.
Comment 12 Alejandro Sanchez 2023-05-16 07:38:52 MDT
Josko,

While the local revert worked as a temporary workaround, a more appropriate fix has been pushed to the 23.02 branch ahead of the 23.02.3 tag, commit:

commit 1f9386909230cd73506d88f02f75126924d3f41e
Author:     Danny Auble <da@schedmd.com>
AuthorDate: Mon May 15 18:35:25 2023 +0200

    mpi/pmix - fix PMIx shmem backed files permissions regression.
    
    Introduced in 23.02.2 commit d23cad68df.
    
    Bug 16687
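
The diff itself isn't quoted here, but the general requirement the regression broke is visible in your listings: the shmem-backed files a root-run slurmstepd creates on behalf of the job have to end up accessible to the job's uid/gid before the user's tasks try to open them. A rough sketch of that pattern (made-up helper name and path, not the actual Slurm code or this commit):

/* Illustration only - hypothetical helper, not the actual Slurm fix. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* create_backing_file() is a made-up name for this sketch. */
static int create_backing_file(const char *path, uid_t job_uid, gid_t job_gid)
{
    int fd = open(path, O_CREAT | O_RDWR, 0600);
    if (fd < 0) {
        perror("open");
        return -1;
    }
    /* Hand the file to the job user; requires the caller to be root. */
    if (fchown(fd, job_uid, job_gid) < 0) {
        perror("fchown");
        close(fd);
        return -1;
    }
    return fd;
}

int main(void)
{
    /* Example path and ids only; the real files live under SlurmdSpoolDir. */
    int fd = create_backing_file("/tmp/example_dstore.lock", 1000, 1000);
    if (fd >= 0)
        close(fd);
    return 0;
}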

I'm gonna go ahead and close the bug as resolved. Please reopen it or open a new bug if there's anything else.

Thanks for reporting and testing with the revert applied.