Ticket 5956 - Error in assigning contexts running multiple MPI jobs on the same node with srun
Summary: Error in assigning contexts running multiple MPI jobs on the same node with srun
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other
Version: 18.08.3
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Nate Rini
 
Reported: 2018-10-31 08:06 MDT by Cineca HPC Systems
Modified: 2018-11-30 09:42 MST

Site: Cineca
Machine Name: galileo


Attachments
requested info and logs (86.40 KB, application/x-compressed-tar)
2018-11-06 07:13 MST, Cineca HPC Systems
Details
mpirun and srun environment (6.86 KB, application/x-compressed-tar)
2018-11-23 09:42 MST, Cineca HPC Systems
Details
strace of mpi apps and slurmd logs (116.78 KB, application/x-compressed-tar)
2018-11-27 09:07 MST, Cineca HPC Systems
Details

Description Cineca HPC Systems 2018-10-31 08:06:14 MDT
Hi,

on our 36-core Broadwell nodes with one QLogic QDR InfiniBand HCA (16 hardware contexts), we set the environment variable PSM_RANKS_PER_CONTEXT=4 to allow more than 16 MPI processes per node.
Before upgrading Slurm from 17.11.7 to 18.08.3, everything worked fine with jobs sharing the same node.
After the upgrade we see the following behaviour when running two jobs like this:

$ srun -N 2 --ntasks-per-node=16 -p gll_usr_prod -w node[234-235] -t 6:00:00 --pty bash

1) launching the same MPI application in both jobs with mpirun, everything works fine: each job occupies 4 contexts per node (16 tasks per node)

2) launching the same MPI application, one job with srun and the other with mpirun, also works fine: each job occupies 4 contexts per node (16 tasks per node)

3) launching the same MPI application in both jobs with srun, the first job works fine (4 contexts per node) while the second fails with the following errors:

[ibaccare@node234 Programs]$ srun  ./hello_mpi_ompi_2.1.1
node234.11959can't open /dev/ipath, network down (err=26)
node234.11960can't open /dev/ipath, network down (err=26)
node234.11962can't open /dev/ipath, network down (err=26)
node234.11963can't open /dev/ipath, network down (err=26)
--------------------------------------------------------------------------
PSM was unable to open an endpoint. Please make sure that the network link is
active on the node and the hardware is functioning.

  Error: Could not detect network connectivity
--------------------------------------------------------------------------
node234.11959ipath_userinit: assign_context command failed: Network is down
[...]
node235.35701can't open /dev/ipath, network down (err=26)
node235.35706can't open /dev/ipath, network down (err=26)
node235.35709can't open /dev/ipath, network down (err=26)
node235.35710can't open /dev/ipath, network down (err=26)
--------------------------------------------------------------------------
PSM was unable to open an endpoint. Please make sure that the network link is
active on the node and the hardware is functioning.

  Error: Could not detect network connectivity
--------------------------------------------------------------------------
node235.35709ipath_userinit: assign_context command failed: Network is down
[...]
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
slurmstepd: error: *** STEP 247805.14 ON node234 CANCELLED AT 2018-10-31T14:54:34 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: node235: tasks 16-18: Exited with exit code 1
srun: Terminating job step 247805.14
srun: error: node235: tasks 19-31: Killed
srun: error: node234: tasks 0-4: Exited with exit code 1
srun: error: node234: tasks 5-15: Killed


You may notice that the error messages relate *ONLY* to 4 of the 16 processes
launched by the second srun command. So the first job is correctly consuming 4 hardware
contexts (leaving 12 hw contexts free), but apparently the second job is not allowed to
share the remaining free contexts. In fact, running only 12 processes per node works:

[ibaccare@node234 Programs]$ srun -N 2 -n 24 --ntasks-per-node=12 ./hello_mpi_ompi_2.1.1
<it works!!!>

and the number of free hw contexts is then, as expected, zero on both nodes:

[root@master ~]# xdsh node23[4,5] cat /sys/class/infiniband/qib0/nfreectxts
node234: 0
node235: 0
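
For reference, the context arithmetic can be checked with a small sketch like this before launching the second job (it assumes PSM_RANKS_PER_CONTEXT is exported as described above and the qib0 device path shown here; purely illustrative):

# each 16-task-per-node job needs ceil(16 / PSM_RANKS_PER_CONTEXT) = 4 contexts per node
TASKS_PER_NODE=16
NEEDED=$(( (TASKS_PER_NODE + PSM_RANKS_PER_CONTEXT - 1) / PSM_RANKS_PER_CONTEXT ))
FREE=$(cat /sys/class/infiniband/qib0/nfreectxts)
echo "need $NEEDED contexts per node, $FREE currently free"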

To sum up: when both jobs are run with srun, context sharing does not seem to be enabled for
the second job.

This happens for both IntelMPI and OpenMPI.

Thanks
ale & isa
Comment 1 Nate Rini 2018-10-31 12:17:07 MDT
Ale, Isa,

Can you please upload your Slurm configuration?

Can you also please call:
ldd hello_mpi_ompi_2.1.1 (both for Intel and OpenMPI)
lsb_release -a
ofed_info | head -1
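
(A minimal sketch for collecting these into one archive to attach; the output file names below are only placeholders:)
ldd ./hello_mpi_ompi_2.1.1 > ldd_openmpi.out 2>&1
ldd ./hello_mpi_sleep_impi > ldd_intelmpi.out 2>&1
lsb_release -a > lsb_release.out 2>&1
ofed_info | head -1 > ofed_info.out
tar czf requested-info.tgz *.out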

Thanks
--Nate
Comment 2 Nate Rini 2018-10-31 12:29:33 MDT
Ale, Isa,

Can you also please upload your Slurm logs from the affected nodes, along with the logs from the node running slurmctld?

Thanks
--Nate
Comment 4 Cineca HPC Systems 2018-11-06 07:13:27 MST
Created attachment 8218 [details]
requested info and logs

Hi Nate
I'm attaching the following files

slurm.conf
hello_mpi_ompi_2.1.1.ldd.out (ldd of Intelmpi exe)
hello_mpi_sleep_impi.ldd.out (ldd of Openmpi exe)
redhat-release
proc_version
ofed_info.out

and the relevant logs of nodes and controller

slurmd-node423.log
slurmd-node424.log
slurmctld.log

In job 259968 we launched the first srun, which consumed 4 hw contexts:

[root@master ~]# xdsh node4[23,24] cat /sys/class/infiniband/qib0/nfreectxts
node423: 12
node424: 12

In job 259969 we first launched the second srun, which crashed, and then an
srun with --ntasks-per-node=12, which consumed the remaining 12 contexts:

[root@master ~]# xdsh node4[23,24] cat /sys/class/infiniband/qib0/nfreectxts
node423: 0
node424: 0

thanks 
ale & isa
Comment 5 Cineca HPC Systems 2018-11-06 07:15:23 MST
Nate
ERRATA:

hello_mpi_ompi_2.1.1.ldd.out (ldd of Openmpi exe)
hello_mpi_sleep_impi.ldd.out (ldd of Intelmpi exe)

sorry for the mistake ;-)

thanks
ale
Comment 6 Nate Rini 2018-11-06 08:44:33 MST
(In reply to Cineca HPC Systems from comment #4)
> in the job 259969 we first launched the second srun that crashes and then 
> srun --ntasks-per-node=12 that consumes the remaining 12 contexts
Do the processes of the first job die or do they stay around as unkillable zombies?
Comment 7 Cineca HPC Systems 2018-11-07 03:20:12 MST
(In reply to Nate Rini from comment #6)
> (In reply to Cineca HPC Systems from comment #4)
> > in the job 259969 we first launched the second srun that crashes and then 
> > srun --ntasks-per-node=12 that consumes the remaining 12 contexts
> Do the processes of the first job die or do they stay around as unkillable
> zombies?

None of the jobs leave any zombies.

thanks
Comment 8 Nate Rini 2018-11-07 09:01:47 MST
Ale, Isa,

> None of the jobs leave any zombies.
If the job processes are being cleaned up, then this is likely an issue with PSM.

Do you have PSM_RANKS_PER_CONTEXT environmental variable set in your job? 

If not, can you try:
> export PSM_RANKS_PER_CONTEXT=4

--Nate
Comment 9 Cineca HPC Systems 2018-11-08 06:33:55 MST
(In reply to Nate Rini from comment #8)
> Ale, Isa,
> 
> > None of the jobs leave any zombies.
> If the job processes are being cleaned up, then this is likely an issue with
> PSM.
> 
> Do you have PSM_RANKS_PER_CONTEXT environmental variable set in your job? 
> 
> If not, can you try:
> > export PSM_RANKS_PER_CONTEXT=4
> 
> --Nate

As we wrote in the first comment, PSM_RANKS_PER_CONTEXT=4 is already set in the jobs' environment.

thanks
ale
Comment 10 Nate Rini 2018-11-08 09:12:16 MST
Ale, 

Slurm should not be affecting PSM directly. Can you attach a copy of /etc/slurm/cgroup.conf to this ticket?

Can you also call this srun after the first job that crashes?
> srun -N 2 --ntasks-per-node=16 -p gll_usr_prod -w node[234-235] -t 6:00:00 --pty bash -c "env |grep -e PSM -e SLURM;stat /dev/ipath; ipathstats"

I would like to verify that the environment is being passed through as expected and that /dev/ipath is visible to the user processes.

Thanks,
--Nate
Comment 11 Cineca HPC Systems 2018-11-13 05:03:33 MST
Hi Nate
this is the cgroup.conf

[ibaccare@node186 Programs]$ cat /etc/slurm/cgroup.conf 
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
ConstrainDevices=yes
ConstrainKmemSpace=no
TaskAffinity=no
AllowedRamSpace=100
AllowedSwapSpace=0
MaxRAMPercent=100
MaxSwapPercent=100
MinRAMSpace=30

this is the environment

[ibaccare@node186 Programs]$ env |grep -e PSM -e SLURM
SLURM_CHECKPOINT_IMAGE_DIR=/var/slurm/checkpoint
SLURM_NODELIST=node[186-187]
SLURM_JOB_NAME=bash
SLURMD_NODENAME=node186
SLURM_TOPOLOGY_ADDR=node186
SLURM_NTASKS_PER_NODE=16
SLURM_PRIO_PROCESS=0
SLURM_SRUN_COMM_PORT=36285
SLURM_JOB_QOS=normal
SLURM_PTY_WIN_ROW=39
SLURM_TOPOLOGY_ADDR_PATTERN=node
SLURM_CPU_BIND_VERBOSE=quiet
SLURM_CPU_BIND_LIST=0x000010000,0x000040000,0x000020000,0x000080000,0x000100000,0x000200000,0x000400000,0x000800000,0x001000000,0x002000000,0x004000000,0x008000000,0x010000000,0x020000000,0x040000000,0x080000000
SLURM_NNODES=2
SLURM_STEP_NUM_NODES=2
SLURM_JOBID=296059
SLURM_NTASKS=32
SLURM_LAUNCH_NODE_IPADDR=10.23.16.165
SLURM_STEP_ID=0
SLURM_STEP_LAUNCHER_PORT=36285
SLURM_TASKS_PER_NODE=16(x2)
SLURM_WORKING_CLUSTER=galileo:io07:6817:8448
SLURM_JOB_ID=296059
SLURM_JOB_USER=ibaccare
SLURM_STEPID=0
SLURM_SRUN_COMM_HOST=10.23.16.165
SLURM_CPU_BIND_TYPE=mask_cpu:
SLURM_PTY_WIN_COL=151
SLURM_UMASK=0022
SLURM_JOB_UID=28550
SLURM_NODEID=0
SLURM_SUBMIT_DIR=/galileo/home/userinternal/ibaccare
SLURM_TASK_PID=22081
SLURM_NPROCS=32
SLURM_CPUS_ON_NODE=16
SLURM_DISTRIBUTION=block
SLURM_PROCID=0
SLURM_JOB_NODELIST=node[186-187]
SLURM_PTY_PORT=34011
SLURM_LOCALID=0
PSM_RANKS_PER_CONTEXT=4
SLURM_JOB_GID=25200
SLURM_JOB_CPUS_PER_NODE=16(x2)
SLURM_CLUSTER_NAME=galileo
SLURM_GTIDS=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
SLURM_SUBMIT_HOST=node165
SLURM_JOB_PARTITION=gll_usr_prod
SLURM_STEP_NUM_TASKS=32
SLURM_JOB_ACCOUNT=cin_staff
SLURM_JOB_NUM_NODES=2
SLURM_STEP_TASKS_PER_NODE=16(x2)
SLURM_STEP_NODELIST=node[186-187]
SLURM_CPU_BIND=quiet,mask_cpu:0x000010000,0x000040000,0x000020000,0x000080000,0x000100000,0x000200000,0x000400000,0x000800000,0x001000000,0x002000000,0x004000000,0x008000000,0x010000000,0x020000000,0x040000000,0x080000000

[ibaccare@node186 Programs]$ stat /dev/ipath
  File: ‘/dev/ipath’
  Size: 0               Blocks: 0          IO Block: 4096   character special file
Device: 5h/5d   Inode: 14361       Links: 1     Device type: f6,0
Access: (0666/crw-rw-rw-)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2018-09-28 18:43:12.307213177 +0200
Modify: 2018-09-28 18:43:12.307213177 +0200
Change: 2018-09-28 18:43:12.307213177 +0200
 Birth: -

we don't have the ipathstats command

thanks
ale
Comment 12 Nate Rini 2018-11-19 10:13:54 MST
Ale,

(In reply to Cineca HPC Systems from comment #11)
> [ibaccare@node186 Programs]$ env |grep -e PSM -e SLURM
> SLURM_CPU_BIND_TYPE=mask_cpu:
> PSM_RANKS_PER_CONTEXT=4

> [ibaccare@node186 Programs]$ stat /dev/ipath
>   File: ‘/dev/ipath’
> Access: (0666/crw-rw-rw-)  Uid: (    0/    root)   Gid: (    0/    root
Looks like the device is visible and Slurm is not hiding it from the job with cgroups.
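
If you ever want to double-check this from inside a job step, something along these lines would show the devices-cgroup whitelist (this assumes the usual cgroup v1 layout, which may differ on your system):
>CG=$(awk -F: '/devices/ {print $3}' /proc/self/cgroup)
>cat /sys/fs/cgroup/devices${CG}/devices.list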

> we don't have ipathstats command
It wasn't required, but it would have been nice to verify the state of the device outside of MPI.

It doesn't look like Slurm is affecting PSM directly, and your environment also appears correct.

>36 cores Broadwell nodes with one QLogic QDR InfiniBand HCA (16 hardware contexts)
>$ srun -N 2 --ntasks-per-node=16 -p gll_usr_prod -w node[234-235] -t 6:00:00 --pty bash

Looking at the Intel docs:
>Each MPI process requires a context. If there are more MPI processes than hardware contexts, the hardware contexts will be shared. They can be shared 2, 3 or 4 ways, supporting a maximum of 4x16=64 processes.

Can you try setting these for your test job?
>PSM_RANKS_PER_CONTEXT=2
>PSM_SHAREDCONTEXTS=1
I want to verify that multiple jobs can be made to work from inside an srun call. PSM_SHAREDCONTEXTS should be 1 by default, but it is safer to make sure it is set.
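
For example, from inside the interactive allocation shown in the description (the binary name is reused from your report):
>export PSM_SHAREDCONTEXTS=1
>export PSM_RANKS_PER_CONTEXT=2
>srun ./hello_mpi_ompi_2.1.1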

>2) launching the same MPI application, one with srun and the second with mpirun, also works fine: each job occupies 4 contexts per node (16 tasks per node)
Based on your description, a change is getting applied by srun but nothing should be touching PSM directly.

Can you also try disabling CPU binding by adding this argument to your job:
>--cpu-bind=none

If neither of those works, please set the following for your Intel MPI job and send all the logs:
>export I_MPI_DEBUG=5

Also, please try adding this argument to your OpenMPI mpirun call:
>-mca mpi_show_mca_params all
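
Putting these together, the test runs would look roughly like this (binaries reused from elsewhere in this ticket):
>srun --cpu-bind=none ./hello_mpi_ompi_2.1.1
>I_MPI_DEBUG=5 srun ./hello_mpi_sleep_impi
>mpirun -mca mpi_show_mca_params all ./hello_mpi_ompi_2.1.1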

It might be worthwhile to open a parallel ticket with Intel about the QLogic TrueScale (PSM) driver, as we could be hitting some bug or issue with the driver.

--Nate
Comment 13 Cineca HPC Systems 2018-11-23 09:42:29 MST
Hi Nate

Running further tests with IntelMPI, we noticed that we had missed an important error it reports when two different users share the same node. We first ran a job like this as the user ibaccare:

[ibaccare@node444 ~]$ srun -N 1 -n 1 --ntasks-per-node=1 ~ibaccare/Programs/./hello_mpi_sleep_impi_env |& tee srun-ENV.1

then the user afederic ran the same job:

[afederic@node444 ~]$ srun -N 1 -n 1 --ntasks-per-node=1 ~ibaccare/Programs/./hello_mpi_sleep_impi_env |& tee srun-ENV.2
[...]
Error attaching to shared memory object in shm_open: Permission denied (err=9)
[0] MPI startup(): tmi fabric is not available and fallback fabric is not enabled
srun: error: node444: task 0: Exited with exit code 254
srun: Terminating job step 422673.41

While the first job was running, we checked the psm file created in /dev/shm:

node444: -rwx------ 1 ibaccare interactive 6352896 Nov 23 17:06 psm_shm.0fff0fff-0000-0000-0000-0fff0fff0fff

If the first job is run by afederic, the error occurs in ibaccare's job, and the psm file created has the
same name but is owned by afederic:

node444: -rwx------ 1 afederic interactive 99557376 Nov 23 16:47 psm_shm.0fff0fff-0000-0000-0000-0fff0fff0fff

Hence our impression is that the second job tries to use the same file instead of creating a new one.

In addition to that, running the following jobs with OpenMPI

[ibaccare@node444 ~]$ srun -n 32 --ntasks-per-node=16 ~ibaccare/Programs/./hello_mpi_sleep_ompi_env
[afederic@node444 ~]$ srun -n 24 --ntasks-per-node=12 ~ibaccare/Programs/./hello_mpi_sleep_ompi_env

we noticed that 2 psm files are created in /dev/shm

[root@master ~]# xdsh node[444,455] ls -l /dev/shm \| grep -v check
node444: total 170256
node444: -rwx------ 1 afederic interactive 74702848 Nov 23 16:43 psm_shm.1a000000-1173-0000-1a00-00001a000000
node444: -rwx------ 1 ibaccare interactive 99557376 Nov 23 16:43 psm_shm.1a000000-8a73-0000-1a00-00001a000000
node455: total 170336
node455: -rwx------ 1 afederic interactive 74702848 Nov 23 16:43 psm_shm.1a000000-1173-0000-1a00-00001a000000
node455: -rwx------ 1 ibaccare interactive 99557376 Nov 23 16:43 psm_shm.1a000000-8a73-0000-1a00-00001a000000

but the second job uses one context per rank in spite of having set PSM_RANKS_PER_CONTEXT=4

[root@master ~]# xdsh node[444,455] cat /sys/class/infiniband/qib0/nfreectxts
node444: 0
node455: 0

So while the first job is using 4 contexts per node (shared by the 16 tasks per node), the second uses all 
the remaining 12 contexts, one per rank.

When using mpirun for the same tests

[ibaccare@node444 ~]$ mpirun -n 1 ~ibaccare/Programs/./hello_mpi_sleep_impi_env |& tee mpirun-ENV.1
[afederic@node444 ~]$ mpirun -n 1 ~ibaccare/Programs/./hello_mpi_sleep_impi_env |& tee mpirun-ENV.2

there are two different psm files in /dev/shm

node444: -rwx------ 1 ibaccare interactive 6352896 Nov 23 17:25 psm_shm.48260000-cdc9-896d-577b-050011bc0a17
node444: -rwx------ 1 afederic interactive 6352896 Nov 23 17:25 psm_shm.68260000-dd90-066e-577b-050011bc0a17

and everything works fine.

We are attaching the environment files {s,mpi}run-ENV.[1,2] produced with PSM_VERBOSE_ENV=1 and I_MPI_DEBUG=5
on both srun and mpirun.

thanks
ale & isa
Comment 14 Cineca HPC Systems 2018-11-23 09:42:55 MST
Created attachment 8406 [details]
mpirun and srun environment
Comment 16 Nate Rini 2018-11-26 10:32:12 MST
>node444.9711Error attaching to shared memory object in shm_open: Permission denied (err=9)


Please activate debug logging on slurmd to see how the cgroups are being configured. Can you please add this line to your slurm.conf on the test nodes and SIGHUP your slurmd daemons:
>SlurmdDebug=debug3


Can you try using strace to see which file it is failing to open?
>[ibaccare@node444 ~]$ srun -n 32 --ntasks-per-node=16 strace -e open -tff -s9999 ~ibaccare/Programs/./hello_mpi_sleep_ompi_env
>[afederic@node444 ~]$ srun -n 24 --ntasks-per-node=12 strace -e open -tff -s9999 ~ibaccare/Programs/./hello_mpi_sleep_ompi_env


Please remove the SlurmdDebug line after testing to avoid filling your logs. Please attach the compressed log to this ticket.

>the second job uses one context per rank in spite of having set PSM_RANKS_PER_CONTEXT=4

The logs provided show the env is being passed correctly to the application by Slurm:
>srun-ENV.1:node444.9711env  PSM_RANKS_PER_CONTEXT     Number of ranks per context              => 4 (default was 1)
>srun-ENV.2:node444.9683env  PSM_RANKS_PER_CONTEXT     Number of ranks per context              => 4 (default was 1)

I suggest opening a bug with the MPI provider about this issue.

> there are two different psm files in /dev/shm
> 
> node444: -rwx------ 1 ibaccare interactive 6352896 Nov 23 17:25
> psm_shm.48260000-cdc9-896d-577b-050011bc0a17
> node444: -rwx------ 1 afederic interactive 6352896 Nov 23 17:25
> psm_shm.68260000-dd90-066e-577b-050011bc0a17
> 
> and everything works fine.


Are you using pam_namespace.so, pam_slurm, or pam_slurm_adopt to control /dev/shm instances? Using mpirun outside of Slurm likely escapes any cgroup containment set up in your cgroup.conf (your config does have device constraints active).
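
(A quick way to check, if unsure; the paths below are the usual defaults:)
>grep -r pam_namespace /etc/pam.d/
>grep -v '^#' /etc/security/namespace.conf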

Have you tried telling the job to only use tmi to see if jobs work without shm?
>export I_MPI_FABRICS=tmi


--Nate
Comment 17 Cineca HPC Systems 2018-11-27 09:06:51 MST
(In reply to Nate Rini from comment #16)
> Please activate debug logging on slurmd to see how the cgroups are being
> configured. Can you please add this line to your slurm.conf on the test
> nodes and SIGHUP your slurmd daemons:
> >SlurmdDebug=debug3
> 
> 
> Can you try using strace to see which file it is failing to open?
> >[ibaccare@node444 ~]$ srun -n 32 --ntasks-per-node=16 strace -e open -tff -s9999 ~ibaccare/Programs/./hello_mpi_sleep_ompi_env
> >[afederic@node444 ~]$ srun -n 24 --ntasks-per-node=12 strace -e open -tff -s9999 ~ibaccare/Programs/./hello_mpi_sleep_ompi_env


Strace and logs are attached (strace-and-slurmd-logs.tgz)


> The logs provided show the env is being passed correctly to the application
> by Slurm:
> >srun-ENV.1:node444.9711env  PSM_RANKS_PER_CONTEXT     Number of ranks per context              => 4 (default was 1)
> >srun-ENV.2:node444.9683env  PSM_RANKS_PER_CONTEXT     Number of ranks per context              => 4 (default was 1)
> 
> I suggest opening a bug with the MPI provider about this issue.


Why should we open an issue with Intel or OpenMPI when everything works fine using mpirun to launch MPI applications?


> Are you using pam_namespace.so or pam_slurm or pam_slurm_adopt to control
> /dev/shm instances? Using mpirun outside of Slurm is likely escaping any
> kind of cgroup containment setup in your cgroup.conf (which your config has
> constraints active on devices).


We are only using pam_slurm_adopt in this way

[root@node479 ~]# grep account /etc/pam.d/sshd
account    required     pam_nologin.so
account    include      password-auth
account    sufficient   pam_slurm_adopt.so
account    required     pam_access.so

but the MPI apps were always launched inside the shell opened by srun.


> Have you tried telling the job to only use tmi to see if jobs work without
> shm?
> >export I_MPI_FABRICS=tmi

yes, same results.

thanks
ale & isa
Comment 18 Cineca HPC Systems 2018-11-27 09:07:21 MST
Created attachment 8439 [details]
strace of mpi apps and slurmd logs
Comment 20 Nate Rini 2018-11-27 13:32:26 MST
(In reply to Cineca HPC Systems from comment #17)
> > The logs provided show the env is being passed correctly to the application
> > by Slurm:
> > >srun-ENV.1:node444.9711env  PSM_RANKS_PER_CONTEXT     Number of ranks per context              => 4 (default was 1)
> > >srun-ENV.2:node444.9683env  PSM_RANKS_PER_CONTEXT     Number of ranks per context              => 4 (default was 1)
> > 
> > I suggest opening a bug with the MPI provider about this issue.

Looking at the strace logs, it appears the MPI library is locking other users out, or possibly it should be using a new UUID for the psm_shm file per user:

First run:
> [pid 26995] 16:40:20 open("/dev/shm/psm_shm.0fff0fff-0000-0000-0000-0fff0fff0fff", O_RDWR|O_CREAT|O_EXCL|O_TRUNC|O_NOFOLLOW|O_CLOEXEC, 0700) = 6

mode = 0700 = -rwx------

Second run:
> [pid 27028] 16:40:27 open("/dev/shm/psm_shm.0fff0fff-0000-0000-0000-0fff0fff0fff", O_RDWR|O_CREAT|O_EXCL|O_TRUNC|O_NOFOLLOW|O_CLOEXEC, 0700) = -1 EEXIST (File exists)
> [pid 27028] 16:40:27 open("/dev/shm/psm_shm.0fff0fff-0000-0000-0000-0fff0fff0fff", O_RDWR|O_NOFOLLOW|O_CLOEXEC) = -1 EACCES (Permission denied)
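
The mode explains the EACCES: the shm segment is created 0700 and owned by the first user, so the second user's open is denied. Purely as an illustration, the same denial can be reproduced without MPI while the first job holds the segment:
>stat -c '%U %a %n' /dev/shm/psm_shm.0fff0fff-0000-0000-0000-0fff0fff0fff
>cat /dev/shm/psm_shm.0fff0fff-0000-0000-0000-0fff0fff0fff    # as the second user -> Permission denied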

> Why should we open an issue with Intel or OpenMPI when everything works fine
> using mpirun to launch MPI applications?

I suspect the MPI library detects the srun launch and alters its behaviour, causing the failure. The environment is being handed down by Slurm as expected, and Slurm has no direct control over how PSM is implemented. Based on your setup, cgroups are not active against /dev/shm or the PSM devices either.
Comment 21 Cineca HPC Systems 2018-11-29 09:35:43 MST
Hi Nate,

we found a solution to force the IntelMPI library to create different /dev/shm/psm_shm.UUID files when launching with srun.

We dumped the environment of all the processes launched by Intel mpirun. The process tree is:

1. mpirun launches Intel mpiexec.hydra
2. mpiexec.hydra launches srun
3. srun launches Intel pmi_proxy via slurmctld
4. pmi_proxy launches the MPI processes
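
The per-process environment can be dumped with, for example (the <pid> below is a placeholder, not a value from this ticket):

tr '\0' '\n' < /proc/<pid>/environ | grep I_MPI_HYDRA_UUID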

Looking at the MPI process environment variables, we found the variable I_MPI_HYDRA_UUID, and the UUID in the /dev/shm file name matched the value of that variable.

When the MPI processes are launched with srun, I_MPI_HYDRA_UUID is not set.

So by setting it before running srun, we can force the creation of two different psm files. For example:

[ibaccare@node476 Programs]$ export I_MPI_HYDRA_UUID=`uuidgen`
[ibaccare@node476 Programs]$ srun -n 32 --ntasks-per-node=16 ~ibaccare/Programs/hello_mpi_sleep_impi

[afederic@node476 ~]$ export I_MPI_HYDRA_UUID=`uuidgen`
[afederic@node476 ~]$ srun -n 32 --ntasks-per-node=16 ~ibaccare/Programs/hello_mpi_sleep_impi

[root@master ~]# xdsh node[476,477] ls -ls /dev/shm/psm\*
node476: 97224 -rwx------ 1 ibaccare interactive 99557376 Nov 29 17:18 /dev/shm/psm_shm.122f7fba-a77b-47ed-ad43-4887319f8e44
node476: 97224 -rwx------ 1 afederic interactive 99557376 Nov 29 17:18 /dev/shm/psm_shm.3c038879-af69-428d-ac7c-463e85790821
node477: 97224 -rwx------ 1 ibaccare interactive 99557376 Nov 29 17:18 /dev/shm/psm_shm.122f7fba-a77b-47ed-ad43-4887319f8e44
node477: 97224 -rwx------ 1 afederic interactive 99557376 Nov 29 17:18 /dev/shm/psm_shm.3c038879-af69-428d-ac7c-463e85790821
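
A minimal sketch of wrapping this workaround in a batch script (the #SBATCH lines below are illustrative only):

#!/bin/bash
#SBATCH -N 2
#SBATCH --ntasks-per-node=16
export I_MPI_HYDRA_UUID=$(uuidgen)   # force a unique psm_shm UUID for this job
srun ~ibaccare/Programs/hello_mpi_sleep_impi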

We cannot find a similar solution for OpenMPI. 

thanks
ale
Comment 22 Cineca HPC Systems 2018-11-29 09:46:40 MST
Sorry Nate, 
I meant slurmstepd in the line below 

> 3. srun launches Intel pmi_proxy via slurmctld
Comment 24 Nate Rini 2018-11-30 09:42:49 MST
Ale,

(In reply to Cineca HPC Systems from comment #21)
> we found a solution to force the IntelMPI library to create different
> /dev/shm/psm_shm.UUID files when launching with srun.

That is good to know. Thanks for reporting that back.

> We cannot find a similar solution for OpenMPI. 

We are not aware of a similar solution, but this does look like a bug that the Open MPI team should look at. We suggest that you get in touch with them for a solution rather than having us look into a workaround.

For now, I am going to resolve this issue since the best course of action would be to talk with the OpenMPI and Intel MPI teams. Please reply to reopen this ticket.
--Nate