We just upgraded our clusters to 23.02.2 (from 22.05.8) and we started seeing this in MPI jobs (OpenMPI 4.1.0, built against PMIx 3.2.3):

[stellar-i02n6:239005] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168

I tracked it down to a failure to read these files:

/var/spool/slurmd/pmix.835089.6/pmix_dstor_ds12_251982/dstore_sm.lock
/var/spool/slurmd/pmix.835089.6/pmix_dstor_ds21_251982/smlockseg-slurm.pmix.835089.6
/var/spool/slurmd/pmix.835089.6/pmix_dstor_ds21_251982/initial-pmix_shared-segment-0

which are created by slurmstepd:

COMMAND    PID    USER FD  TYPE DEVICE SIZE/OFF NODE      NAME
slurmstep  251982 root mem REG  253,0  4096     107895504 /var/spool/slurmd/pmix.835089.6/pmix_dstor_ds21_251982/smlockseg-slurm.pmix.835089.6
slurmstep  251982 root mem REG  253,0  4096     107895502 /var/spool/slurmd/pmix.835089.6/pmix_dstor_ds21_251982/initial-pmix_shared-segment-0

and here are their permissions:

/var/spool/slurmd/pmix.835089.6/pmix_dstor_ds21_251982:
total 8
drwxr-x--- 2 plazonic root   80 May 10 09:58 .
drwxrwx--- 4 plazonic root   66 May 10 09:58 ..
-rw------- 1 root     root 4096 May 10 09:58 initial-pmix_shared-segment-0
-rw------- 1 root     root 4096 May 10 09:58 smlockseg-slurm.pmix.835089.6

The PMIX error is printed immediately after trying to open the files above, e.g.:

openat(AT_FDCWD, "/var/spool/slurmd/pmix.835087.1//pmix_dstor_ds12_249307/dstore_sm.lock", O_RDWR) = -1 EACCES (Permission denied)
write(2, "[stellar-i02n5:249395] PMIX ERRO"..., 85) = 85

dstore_sm.lock is similarly owned by root:

/var/spool/slurmd/pmix.835089.6/pmix_dstor_ds12_251982:
total 8
drwxr-x--- 2 plazonic root   65 May 10 09:58 .
drwxrwx--- 4 plazonic root   66 May 10 09:58 ..
-rw------- 1 root     root 4096 May 10 09:58 dstore_sm.lock
-rw------- 1 root     root 4096 May 10 09:58 initial-pmix_shared-segment-0

Is this a new bug, or is there some configuration item we should have changed/added when we upgraded to 23.02.2 that would fix this?

Thanks
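For reference, the client-side access pattern that fails is essentially the one below: the PMIx client inside the application process (running as the job user) opens the stepd-created lock file read-write and maps it to attach to the shared lock. This is only an illustrative sketch, not the actual gds/ds12 code; the path is copied from the strace above.

/* Illustrative sketch only: a client process attaching to a shared-memory
 * lock file.  With the 23.02.2 behavior the file is root:root 0600, so
 * open() fails with EACCES for the job user. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const char *lock = "/var/spool/slurmd/pmix.835087.1/"
                       "pmix_dstor_ds12_249307/dstore_sm.lock";

    int fd = open(lock, O_RDWR);          /* needs read-write access */
    if (fd < 0) {
        fprintf(stderr, "open(%s): %s\n", lock, strerror(errno));
        return 1;                          /* EACCES on 23.02.2 */
    }

    /* The real code maps the segment and uses a process-shared lock
     * stored inside it; here we only demonstrate the mapping step. */
    void *seg = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (seg == MAP_FAILED)
        perror("mmap");
    else
        munmap(seg, 4096);
    close(fd);
    return 0;
}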
Hi Josko,

Thanks for reporting. I'll try to reproduce this locally and let you know what I find.

In the meantime: before the Slurm upgrade, when you were still on 22.05.8, were you using the same OpenMPI 4.1.0 / PMIx 3.2.3 builds? Also, can you reproduce it with a simple two-node MPI hello world test program? What do the job request and the program look like?

Thanks
Nothing but Slurm changed. And yes, this was tested with a very simple two-node MPI hello world test; the error appears right at the start of the job.
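The test program was along these lines (the exact hello_mpi2 source isn't attached to this report, so this is just a representative equivalent):

/* Representative two-node MPI hello world; names are illustrative. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank = 0;
    MPI_Init(&argc, &argv);                 /* PMIx wire-up happens here */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("Hello World from Node %d\n", rank);
    MPI_Finalize();
    return 0;
}

Built with mpicc and launched with srun -N 2, the PMIX ERROR line shows up during startup, before any of the program's own output.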
Any news here?

This is what those same files look like on a cluster where we are still on 22.05.7 - note how they are owned by me, rather than root. This is exactly the same build of pmix and openmpi - the only difference is Slurm.

[root@adroit-h11n4 ~]# ls -laR /var/spool/slurmd/pmix.1771026.0
/var/spool/slurmd/pmix.1771026.0:
total 4
drwxrwx---   4 plazonic root   68 May 15 09:47 .
drwxr-xr-x. 10 root     root 4096 May 15 09:47 ..
drwxr-x---   2 plazonic root   65 May 15 09:47 pmix_dstor_ds12_2858799
drwxr-x---   2 plazonic root  157 May 15 09:47 pmix_dstor_ds21_2858799

/var/spool/slurmd/pmix.1771026.0/pmix_dstor_ds12_2858799:
total 8
drwxr-x--- 2 plazonic root   65 May 15 09:47 .
drwxrwx--- 4 plazonic root   68 May 15 09:47 ..
-rw-rw---- 1 plazonic root 4096 May 15 09:47 dstore_sm.lock
-r--rw---- 1 plazonic root 4096 May 15 09:47 initial-pmix_shared-segment-0

/var/spool/slurmd/pmix.1771026.0/pmix_dstor_ds21_2858799:
total 8200
drwxr-x--- 2 plazonic root     157 May 15 09:47 .
drwxrwx--- 4 plazonic root      68 May 15 09:47 ..
-r--rw---- 1 plazonic root    4096 May 15 09:47 initial-pmix_shared-segment-0
-r--rw---- 1 plazonic root 4194304 May 15 09:47 smdataseg-slurm.pmix.1771026.0-0
-rw-rw---- 1 plazonic root    4096 May 15 09:47 smlockseg-slurm.pmix.1771026.0
-r--rw---- 1 plazonic root 4194304 May 15 09:47 smseg-slurm.pmix.1771026.0-0
I've been trying to reproduce locally by installing the same stack:

Slurm 23.02.2
PMIx 3.2.3
OpenMPI 4.1.0

but the OpenMPI 4.1.0 build is failing for me with:

error: unknown type name 'ptrdiff_t'

so I'm still trying to figure out what's going on with that version.

In the meantime, looking at the PMIx-related changes between 22.05.8 and 23.02.2, I strongly suspect this one:

https://github.com/SchedMD/slurm/commit/d23cad68dfa

and perhaps (though I suspect it much less) this one:

https://github.com/SchedMD/slurm/commit/9985efc4a85
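As an aside on the build failure: ptrdiff_t is declared in <stddef.h>, so this error usually means a translation unit uses the type before that header is (transitively) included - typically a compiler/header mismatch in the OpenMPI build rather than anything Slurm-related. A minimal illustration of the failure mode (not the actual OpenMPI source):

/* Minimal illustration (not OpenMPI code): without <stddef.h> the
 * compiler reports "unknown type name 'ptrdiff_t'". */
#include <stddef.h>   /* provides ptrdiff_t */

ptrdiff_t byte_distance(const char *a, const char *b)
{
    return a - b;     /* pointer difference has type ptrdiff_t */
}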
That first change does look like a very likely culprit. If I could figure out a fix/workaround I would test it, but the order of execution is not yet clear to me - open to suggestions.

As for reproducing: if you can test on a RHEL8-compatible system, we have these rpms at http://springdale.princeton.edu/data/springdale/computational/8.7/x86_64/ - the pmix rpm is there, and the openmpi rpms start with openmpi040100-gcc (-runtime/-devel).
Yup, that's the culprit (https://github.com/SchedMD/slurm/commit/d23cad68dfa).

This is before I installed a slurm build with the revert:

[plazonic@stellar-intel ~]$ salloc -N 2 -c 1 -t 30:00 --reservation=slurmfix
salloc: Granted job allocation 837668
salloc: Nodes stellar-i01n[6-7] are ready for job
[plazonic@stellar-i01n6 ~]$ module load openmpi/gcc/4.1.0
[plazonic@stellar-i01n6 ~]$ srun ./hello_mpi2
[stellar-i01n7:1156328] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
[stellar-i01n6:1029047] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
Hello World from Node 0
Hello World from Node 1

And this is after:

[plazonic@stellar-intel ~]$ salloc -N 2 -c 1 -t 30:00 --reservation=slurmfix
salloc: Pending job allocation 837669
salloc: job 837669 queued and waiting for resources
salloc: job 837669 has been allocated resources
salloc: Granted job allocation 837669
salloc: Waiting for resource configuration
salloc: Nodes stellar-i01n[6-7] are ready for job
[plazonic@stellar-i01n6 ~]$ module load openmpi/gcc/4.1.0
[plazonic@stellar-i01n6 ~]$ srun ./hello_mpi2
Hello World from Node 1
Hello World from Node 0

No more PMIX errors.

Now, I am not sure that a simple revert is enough - I don't know much about pmix, why the code in pmixp_client.c puts all these ops into an lresp list (in d23cad68dfa), or whether that is necessary for a proper fix (in a different/earlier spot) instead of a revert - I'll leave that to you and wait to patch this until I hear from you.

Thanks.
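For what it's worth, the ownership difference between the 22.05.7 and 23.02.2 listings suggests the general shape of a fix: slurmstepd (running as root) has to hand the per-step dstore files over to the job user so the application processes can open them. The sketch below only illustrates that idea with assumed names (walk_and_chown, job_uid/job_gid); it is not the actual Slurm code or the SchedMD patch.

/* Illustration only (hypothetical helper, not the SchedMD fix):
 * recursively chown the per-step PMIx dstore directory to the job
 * user so the shared-memory lock/segment files become readable. */
#define _XOPEN_SOURCE 500
#include <ftw.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

static uid_t job_uid;   /* assumed to be set to the job owner's uid */
static gid_t job_gid;   /* assumed to be set to the job owner's gid */

static int chown_cb(const char *path, const struct stat *sb,
                    int typeflag, struct FTW *ftwbuf)
{
    (void)sb; (void)typeflag; (void)ftwbuf;
    if (chown(path, job_uid, job_gid) != 0) {
        perror(path);
        return -1;      /* stop the walk on error */
    }
    return 0;
}

/* e.g. walk_and_chown("/var/spool/slurmd/pmix.835089.6", uid, gid) */
int walk_and_chown(const char *dstore_dir, uid_t uid, gid_t gid)
{
    job_uid = uid;
    job_gid = gid;
    return nftw(dstore_dir, chown_cb, 16, FTW_PHYS);
}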
Josko,

While the local revert worked as a temporary workaround, a more appropriate fix has been pushed to the 23.02 branch ahead of the 23.02.3 tag:

commit 1f9386909230cd73506d88f02f75126924d3f41e
Author:     Danny Auble <da@schedmd.com>
AuthorDate: Mon May 15 18:35:25 2023 +0200

    mpi/pmix - fix PMIx shmem backed files permissions regression.
    Introduced in 23.02.2 commit d23cad68df.

    Bug 16687

I'm going to go ahead and close the bug as resolved. Please reopen, or open a new bug, if there's anything else.

Thanks for reporting and for testing with the revert applied.