Bug 10383

Summary: OpenMPI issue with Slurm and UCX support (Step resources limited to lower mem/cpu after upgrade to 20.11)
Product: Slurm
Reporter: Misha Ahmadian <misha.ahmadian>
Component: Other
Assignee: Tim Wickberg <tim>
Status: RESOLVED FIXED
Severity: 2 - High Impact
Priority: ---
CC: Alan.Sill, albert.gil, bart.oldeman, csamuel, fabecassis, felip.moll, fordste5, greg.wickham, kaizaad, kenneth.hoste, kilian, lyeager, maxime.boissonneault, sts, yellowhat46
Version: 20.11.0
Hardware: Linux
OS: Linux
Site: TTU
Version Fixed: 20.11.3
Attachments: slurm-20.11.0-rpmbuild.log
slurm.conf
cgroup.conf
slurmd_cpu-25-20.log
slurmd_cpu-25-21.log
slurm-srun-orted-fix.patch

Description Misha Ahmadian 2020-12-07 10:03:12 MST
Created attachment 17000 [details]
slurm-20.11.0-rpmbuild.log

Hello,

We are experiencing a bizarre situation here at the HPC Center of Texas Tech University, which has forced us to extend our maintenance downtime. Since we upgraded Slurm from v20.02.3 to v20.11.0, OpenMPI has not been functioning correctly, and despite checking various debugging parameters on the system we have not been able to fix the issue.

First, we compiled Slurm 20.11.0 against external PMIx (v1, v2, v3) and UCX v1.9 packages, but we ran into errors during the rpmbuild process. Tim then helped us patch Slurm 20.11.0 twice to fix those issues:

https://bugs.schedmd.com/show_bug.cgi?id=10288

After installing Slurm 20.11.0, we recompiled OpenMPI with Slurm support, PMI support (pointing to /usr/include/slurm), and PMIx and UCX support (pointing to the external PMIx and UCX). However, we were never able to run any multi-node jobs. Single-node MPI jobs ran fine, but no job spanning more than one node ever completed. (We tried both mpirun and srun --mpi=pmix for this case; neither worked for us, and the slurmd.log file on the worker node showed an error that PMIx could not communicate with the outside.)

Unfortunately, we didn't keep those logs, since we decided to recompile and reinstall Slurm 20.11.0 without PMIx and UCX. However, after we recompiled OpenMPI against this later Slurm build, we ran into a similar issue.
After spending a considerable amount of time looking for the cause, we realized that, for some strange reason, compiling OpenMPI with both "--with-slurm" and "--with-ucx" causes a problem for multi-node jobs. At first we thought it might be a PMI (not PMIx) issue, so we recompiled OpenMPI with the "--without-pmi" flag, but we still could not run a multi-node job successfully. Dropping either "--with-slurm" or "--with-ucx" resolves the issue, but that is not a preferable way forward for us.

To give you more details regarding this issue, please find the attached rpmbuild output log of Slurm 20.11.0, along with the following explanation:

1) Compiling OpenMPI with Slurm, with UCX, with/without PMI support:

$OPENMPI_4.0.4/configure --with-slurm --with-pmi=/usr --with-ucx=/opt/apps/nfs/custom/ext-libs/ucx 
OR
$OPENMPI_4.0.4/configure --with-slurm --without-pmi --with-ucx=/opt/apps/nfs/custom/ext-libs/ucx

In both cases, when we run a multi-node job, it fails:

* MPITEST program (mpitest.c):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
  int rank;
  char hostname[256];

  MPI_Init(&argc,&argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  gethostname(hostname,255);

  printf("Hello world!  I am process number: %d on host %s\n", rank, hostname);

  MPI_Finalize();

  return 0;
}


* Job Submission Script:

#!/bin/bash
#SBATCH -J MPI_Test
#SBATCH -N 2
#SBATCH --ntasks-per-node=128
#SBATCH -o %x.%j.out
#SBATCH -e %x.%j.err
#SBATCH -p test
 
module load openmpi-4.0.4

mpicc mpitest.c -o mpitest

mpirun ./mpitest



* The MPI_Test.out file is empty for this job, and the error file contains the following:

--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 192 with PID 67722 on node cpu-25-27 exited on signal 9 (Killed).
--------------------------------------------------------------------------


* Slurmd.logs:

The first node (cpu-25-26) has a clean slurmd log:

[2020-12-07T10:30:58.069] task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 11211
[2020-12-07T10:30:58.069] task/affinity: batch_bind: job 11211 CPU input mask for node: 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
[2020-12-07T10:30:58.069] task/affinity: batch_bind: job 11211 CPU final HW mask for node: 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
[2020-12-07T10:30:58.243] [11211.extern] Considering each NUMA node as a socket
[2020-12-07T10:30:58.254] [11211.extern] task/cgroup: _memcg_initialize: /slurm/uid_98263/job_11211: alloc=515456MB mem.limit=515456MB memsw.limit=644320MB
[2020-12-07T10:30:58.254] [11211.extern] task/cgroup: _memcg_initialize: /slurm/uid_98263/job_11211/step_extern: alloc=515456MB mem.limit=515456MB memsw.limit=644320MB
[2020-12-07T10:30:58.267] Launching batch job 11211 for UID 98263
[2020-12-07T10:30:58.274] [11211.batch] Considering each NUMA node as a socket
[2020-12-07T10:30:58.277] [11211.batch] task/cgroup: _memcg_initialize: /slurm/uid_98263/job_11211: alloc=515456MB mem.limit=515456MB memsw.limit=644320MB
[2020-12-07T10:30:58.277] [11211.batch] task/cgroup: _memcg_initialize: /slurm/uid_98263/job_11211/step_batch: alloc=515456MB mem.limit=515456MB memsw.limit=644320MB
[2020-12-07T10:31:25.775] [11211.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:35072
[2020-12-07T10:31:25.777] [11211.batch] done with job
[2020-12-07T10:31:25.781] [11211.extern] done with job

But the second node (cpu-25-27) shows an error:

[2020-12-07T10:30:58.241] [11211.extern] Considering each NUMA node as a socket
[2020-12-07T10:30:58.251] [11211.extern] task/cgroup: _memcg_initialize: /slurm/uid_98263/job_11211: alloc=515456MB mem.limit=515456MB memsw.limit=644320MB
[2020-12-07T10:30:58.251] [11211.extern] task/cgroup: _memcg_initialize: /slurm/uid_98263/job_11211/step_extern: alloc=515456MB mem.limit=515456MB memsw.limit=644320MB
[2020-12-07T10:30:59.286] launch task StepId=11211.0 request from UID:98263 GID:230 HOST:10.100.25.26 PORT:58424
[2020-12-07T10:30:59.286] task/affinity: lllp_distribution: entire node must be allocated, disabling affinity
[2020-12-07T10:30:59.286] task/affinity: lllp_distribution: JobId=11211 manual binding: mask_cpu,one_thread
[2020-12-07T10:30:59.292] [11211.0] Considering each NUMA node as a socket
[2020-12-07T10:30:59.296] [11211.0] task/cgroup: _memcg_initialize: /slurm/uid_98263/job_11211: alloc=515456MB mem.limit=515456MB memsw.limit=644320MB
[2020-12-07T10:30:59.296] [11211.0] task/cgroup: _memcg_initialize: /slurm/uid_98263/job_11211/step_0: alloc=4027MB mem.limit=4027MB memsw.limit=5033MB
[2020-12-07T10:31:19.192] [11211.0] task/cgroup: _oom_event_monitor: oom-kill event count: 1
[2020-12-07T10:31:20.082] [11211.0] task/cgroup: _oom_event_monitor: oom-kill event count: 2
[2020-12-07T10:31:20.359] [11211.0] task/cgroup: _oom_event_monitor: oom-kill event count: 3
[2020-12-07T10:31:20.770] [11211.0] task/cgroup: _oom_event_monitor: oom-kill event count: 4
[2020-12-07T10:31:21.896] [11211.0] task/cgroup: _oom_event_monitor: oom-kill event count: 5
[2020-12-07T10:31:22.024] [11211.0] task/cgroup: _oom_event_monitor: oom-kill event count: 6
[2020-12-07T10:31:24.222] [11211.0] task/cgroup: task_cgroup_memory_check_oom: StepId=11211.0 hit memory+swap limit at least once during execution. This may or may not result in some failure.
[2020-12-07T10:31:24.223] [11211.0] error: Detected 6 oom-kill event(s) in StepId=11211.0 cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
[2020-12-07T10:31:25.781] [11211.extern] done with job
[2020-12-07T10:31:27.063] [11211.0] error: Failed to send MESSAGE_TASK_EXIT: Connection refused
[2020-12-07T10:31:27.065] [11211.0] done with job

* Slurmctld Log:

[2020-12-07T10:30:57.748] _slurm_rpc_submit_batch_job: JobId=11211 InitPrio=24404 usec=12409
[2020-12-07T10:30:58.065] sched: Allocate JobId=11211 NodeList=cpu-25-[26-27] #CPUs=256 Partition=test
[2020-12-07T10:31:25.776] _job_complete: JobId=11211 WEXITSTATUS 137
[2020-12-07T10:31:25.777] _job_complete: JobId=11211 done


* Please note that this is a very simple send/receive job, and it typically takes over 30 seconds (up to a minute) to complete, which is very unusual.
* Please also note that the srun command also fails in this case (srun --mpi=none OR srun --mpi=pmi2).
* The slurmd log may suggest an issue with the network, but we do not actually see any problem with the network at this point. Moreover, the same OpenMPI configuration worked fine with Slurm 20.02.3.


2) In our second test, we compiled OpenMPI once without Slurm (keeping UCX) and once without UCX (keeping Slurm):

$OPENMPI_4.0.4/configure --without-slurm --without-pmi --with-ucx=/opt/apps/nfs/custom/ext-libs/ucx 

$OPENMPI_4.0.4/configure --with-slurm --with-pmi=/usr --without-ucx

In both cases, the job submission script shown above runs with no problem.


So, at this point, we cannot find the relationship between UCX, Slurm, and OpenMPI, since they are supposed to be independent of each other. Therefore, before we roll back to Slurm 20.02, we decided to contact you to see if you can assist us with this issue, which has affected our cluster for more than a week.


Best Regards,
Misha
Comment 1 Misha Ahmadian 2020-12-07 10:04:25 MST
Created attachment 17001 [details]
slurm.conf
Comment 2 Misha Ahmadian 2020-12-07 10:04:49 MST
Created attachment 17002 [details]
cgroup.conf
Comment 4 Marcin Stolarek 2020-12-08 06:01:19 MST
Misha,

I'm looking into the case, but until now I can't reproduce the issue.

cheers,
Marcin
Comment 5 Marcin Stolarek 2020-12-08 09:41:54 MST
Misha,

Not a resolution, but could you please check how much memory is used by the job when you disable memory limit handling by task/cgroup, by setting
>ConstrainRAMSpace=no
in cgroup.conf?
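
For reference, the relevant cgroup.conf excerpt for that test would look roughly like this (a sketch; every line except ConstrainRAMSpace=no is an assumption about your current settings):

# cgroup.conf (hypothetical excerpt -- only ConstrainRAMSpace=no is the change being requested)
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=no
ConstrainSwapSpace=yes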

cheers,
Marcin
Comment 6 Marcin Stolarek 2020-12-08 09:57:48 MST
Additionally, could you please increase SlurmdDebug to `debug3`? Today you have the numeric value 3, which corresponds to the info level (we generally don't recommend numeric values). Please also share the full slurmd log from the time when you reproduce the issue.
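
One way to do that (a sketch; it assumes slurm.conf is shared to the compute nodes and that a reconfigure is enough -- restarting slurmd also works):

# in slurm.conf, replace the numeric level with the named one
SlurmdDebug=debug3

# then push the change out from the controller
scontrol reconfigure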
Comment 7 Misha Ahmadian 2020-12-08 10:46:51 MST
Hi Marcin,

Thanks for your response. This is currently the critical issue preventing us from bringing the cluster back online, and I really appreciate your help.

I set ConstrainRAMSpace=no in cgroup.conf and SlurmdDebug=debug3 in slurm.conf and restarted all slurmctld and slurmd daemons. I then submitted an MPI test program (the same code as I provided above), compiled against an OpenMPI with Slurm and UCX support. Below is the same error that we got before:

--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 131 with PID 70697 on node cpu-25-21 exited on signal 9 (Killed).
--------------------------------------------------------------------------

The job ran on two CPU nodes (cpu-25-[20-21]) with internal IP addresses x.x.25.20 and x.x.25.21. Please find the corresponding slurmd logs that I collected from each node in the following posts.

The interesting part is the error on the second node (cpu-25-21), which is about a connection being refused by the first node:


[2020-12-08T11:15:28.936] [11276.0] debug2: Error connecting slurm stream socket at x.x.25.20:63936: Connection refused
[2020-12-08T11:15:28.936] [11276.0] debug:  _send_srun_resp_msg: 5/5 failed to send msg type 6003: Connection refused
[2020-12-08T11:15:28.936] [11276.0] error: Failed to send MESSAGE_TASK_EXIT: Connection refused


Please also note that we do not have a firewall set up on the worker nodes, and I'm not sure what may be causing this. Let me know if you need more information from me.

Best,
Misha
Comment 8 Misha Ahmadian 2020-12-08 10:48:11 MST
Created attachment 17031 [details]
slurmd_cpu-25-20.log
Comment 9 Misha Ahmadian 2020-12-08 10:48:40 MST
Created attachment 17032 [details]
slurmd_cpu-25-21.log
Comment 10 Misha Ahmadian 2020-12-08 10:53:55 MST
I was also curious whether the CPU/network architecture might have any impact on Slurm and the way it works/communicates between nodes, since the CPU nodes of our cluster have AMD EPYC Rome processors and a Mellanox HDR 200 Gbps InfiniBand network, which might differ from the test system on your side.

Best,
Misha
Comment 11 Misha Ahmadian 2020-12-08 14:10:17 MST
Hi Marcin,

So I have some updates for you:

In my previous attempt, I set ConstrainRAMSpace=no but didn't set ConstrainSwapSpace=no, and I got the same error and the same slurmd logs that I sent you.

I went ahead and set both ConstrainRAMSpace=no and ConstrainSwapSpace=no, and I was able to run the following job with proper output. The slurm_connect failure was still there but didn't cause the job to fail:

#!/bin/bash
#SBATCH -J MPI_Test
#SBATCH -N 2
#SBATCH --ntasks-per-node=128
#SBATCH -o %x.%j.out
#SBATCH -e %x.%j.err
#SBATCH -p test
 
module load openmpi-4.0.4

mpicc mpitest.c -o mpitest

mpirun ./mpitest


Please note that the OpenMPI in this case was compiled against Slurm and UCX, but the cgroup memory constraint was disabled. Furthermore, when we disable the memory cgroup, the whole job life cycle is much faster than when we leave it enabled.

So, I went ahead and did the following:

1) I checked the memory usage of the successful job, and it looks very reasonable:

sacct -j 11295 --format=partition,jobid,ntasks,nodelist,maxrss,maxvmsize
 Partition        JobID   NTasks        NodeList     MaxRSS  MaxVMSize
---------- ------------ -------- --------------- ---------- ----------
      test 11295                  cpu-25-[26-27]
           11295.batch         1       cpu-25-26      3904K    274048K
           11295.extern        2  cpu-25-[26-27]          0      4352K
           11295.0             1       cpu-25-27          0    342356K

2) Then I checked the cgroups on both nodes. It looks like the cgroup is doing its job correctly:

On both cpu-25-[26-27]:

/sys/fs/cgroup/memory/slurm/uid_99577/job_11296/memory.memsw.limit_in_bytes ->
540494790656


which is the correct memory limit for this job (DefMemPerCPU=4027 is defined in slurm.conf, and scontrol shows the correct memory request).

However, on the second node (cpu-25-27) the cgroup for step_0 is limited to 4027M:

/sys/fs/cgroup/memory/slurm/uid_99577/job_11296/step_0/memory.memsw.limit_in_bytes -> 4222615552

I think that should be fine as well. The thing is, we are having a hard time finding the relationship between enabling/disabling the memory cgroup (RAM, swap), UCX, and OpenMPI!
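
As a quick sanity check of those numbers (a rough sketch; MB is treated as MiB here, and the 25% swap headroom is an assumed AllowedSwapSpace value):

# job level: 128 tasks/node * DefMemPerCPU=4027 MB
echo $((128 * 4027))                # 515456 -> matches alloc/mem.limit in the slurmd logs
echo $((128 * 4027 * 1024 * 1024))  # 540494790656 -> matches the job-level memsw.limit_in_bytes above
# step_0 (orted): 1 CPU * 4027 MB
echo $((4027 * 1024 * 1024))        # 4222615552 -> matches the step_0 memsw.limit_in_bytes above
# memsw.limit=644320MB in the slurmd logs would be 515456MB plus an assumed AllowedSwapSpace of 25%
echo $((515456 * 125 / 100))        # 644320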

Best,
Misha
Comment 12 Marcin Stolarek 2020-12-09 07:24:10 MST
Misha,

The value of MaxVMSize returned by sacct may be underestimated if it jumped quickly. Accounting data is collected at the JobAcctGatherFrequency interval; you're on the default of 30s.

As you noticed, memory for step 0 is set to 4GB - as for one CPU and one task - while you have --ntasks-per-node=128 and mpirun starts 128 tasks. Could you please check `scontrol show step JOBID` while the step is executing? (You can add `sleep(100);` to your test program to make it easier.) I'm looking for potential differences between Slurm 20.02 and 20.11 behavior here.
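
Something like this should catch the step while it is still alive (a sketch; <JOBID> is a placeholder for the id reported by sbatch):

sbatch submit.sh            # the test job, with the sleep added so step 0 stays around
scontrol show step <JOBID>  # run from a login node while the job is executing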

I didn't fully check the UCX code, but it sounds possible that starting 128 tasks, even for a very simple job, may allocate a fair amount of memory (as you know, it may be platform dependent).

The way we recommend starting OpenMPI applications[1] is to build OpenMPI with PMI support and then let srun start the tasks instead of mpirun. Using this method I'm getting the correct number of tasks for the step, and the total memory for step_0 coming from --mem-per-cpu is higher, multiplied by the number of CPUs.
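
Concretely, with an OpenMPI built with PMI support, the launch line in the batch script would become something like this (a sketch; the rest of the script stays as in your example above):

srun --mpi=pmi2 ./mpitest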

Let me know how that works on your side.

cheers,
Marcin
[1]https://slurm.schedmd.com/mpi_guide.html#open_mpi
Comment 13 Misha Ahmadian 2020-12-09 15:43:50 MST
Hi Marcin,

Thanks for your suggestions, and I may have something for you now:

1) I compiled the OpenMPI with the following options:

./configure --with-pmi=/usr --with-ucx=/opt/apps/nfs/custom/ext-libs/ucx --with-slurm

2) Then submitted a job with this openmpi:

#!/bin/bash
#SBATCH -J Misha_MPI
#SBATCH -N 2
#SBATCH --ntasks-per-node=128
#SBATCH -o %x.%j.out
#SBATCH -e %x.%j.err
#SBATCH -p test

ml openmpi-4.0.4

mpicc mpitest.c -o mpitest

mpirun ./mpitest


So this job fails if we enable ConstrainRAMSpace and ConstrainSwapSpace, but it runs fine if we disable them. Below is the "scontrol" output for both the job and its steps:

$ scontrol show job 11339
JobId=11339 JobName=Misha_MPI
   UserId=xxxx(xxxx) GroupId=xxx(xxx) MCS_label=N/A
   Priority=23448 Nice=0 Account=default QOS=normal
   JobState=COMPLETED Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:21 TimeLimit=2-00:00:00 TimeMin=N/A
   SubmitTime=2020-12-09T15:12:20 EligibleTime=2020-12-09T15:12:20
   AccrueTime=2020-12-09T15:12:20
   StartTime=2020-12-09T15:12:20 EndTime=2020-12-09T15:12:41 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-12-09T15:12:20
   Partition=test AllocNode:Sid=cpu-23-1.localdomain:46666
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=cpu-25-[20-21]
   BatchHost=cpu-25-20
   NumNodes=2 NumCPUs=256 NumTasks=256 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=256,mem=1030912M,node=2,billing=256
   Socks/Node=* NtasksPerN:B:S:C=128:0:*:* CoreSpec=*
   MinCPUsNode=128 MinMemoryCPU=4027M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Reservation=mpi_test
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/xxxx/slurm_test/mpi/submit.sh
   WorkDir=/home/xxxx/slurm_test/mpi
   StdErr=/home/xxxx/slurm_test/mpi/Misha_MPI.11339.err
   StdIn=/dev/null
   StdOut=/home/xxx/slurm_test/mpi/Misha_MPI.11339.out
   Power=
   NtasksPerTRES:0


$ scontrol show step 11339
StepId=11339.extern UserId=98263 StartTime=2020-12-09T15:12:20 TimeLimit=UNLIMITED
   State=RUNNING Partition=test NodeList=cpu-25-[20-21]
   Nodes=2 CPUs=0 Tasks=2 Name=extern Network=(null)
   TRES=(null)
   ResvPorts=(null)
   CPUFreqReq=Default
   SrunHost:Pid=(null):0

StepId=11339.batch UserId=98263 StartTime=2020-12-09T15:12:20 TimeLimit=UNLIMITED
   State=RUNNING Partition=test NodeList=cpu-25-20
   Nodes=1 CPUs=0 Tasks=1 Name=batch Network=(null)
   TRES=(null)
   ResvPorts=(null)
   CPUFreqReq=Default
   SrunHost:Pid=(null):0

StepId=11339.0 UserId=98263 StartTime=2020-12-09T15:12:22 TimeLimit=UNLIMITED
   State=RUNNING Partition=test NodeList=cpu-25-21
   Nodes=1 CPUs=1 Tasks=1 Name=orted Network=(null)
   TRES=cpu=1,mem=4027M,node=1
   ResvPorts=(null)
   CPUFreqReq=Default Dist=Cyclic
   SrunHost:Pid=cpu-25-20:88803

* As you can observe, step_0, which is the one that runs the ORTED daemon, has only 1 CPU and ~3.9G of RAM! When we increase the number of nodes, this amount of RAM will never be sufficient for ORTED to proceed. More interestingly, when the cgroup memory constraint is in place, ORTED cannot exceed this memory amount and fails. It also looks like when we compile OpenMPI with UCX, memory consumption grows significantly compared to non-UCX builds.

The output of the sacct for this job:

# sacct -j 11341 --format=partition,jobid,ntasks,nodelist,maxrss,maxvmsize
 Partition        JobID   NTasks        NodeList     MaxRSS  MaxVMSize
---------- ------------ -------- --------------- ---------- ----------
      test 11341                  cpu-25-[20-21]
           11341.batch         1       cpu-25-20      5080K    340604K
           11341.extern        2  cpu-25-[20-21]          0      4352K
           11341.0           256  cpu-25-[20-21]    102496K    475976K


3) Then I submitted the job with the same OpenMPI and test program, but this time with srun and PMI2:

#!/bin/bash
#SBATCH -J Misha_MPI
#SBATCH -N 2
#SBATCH --ntasks-per-node=128
#SBATCH -o %x.%j.out
#SBATCH -e %x.%j.err
#SBATCH -p test

srun --mpi=pmi2 ./mpitest

This job never fails, regardless of whether we enable or disable the cgroup memory constraint:

$ scontrol show job 11341
JobId=11341 JobName=Misha_MPI
   UserId=xxxx(xxx) GroupId=xxxx(xxx) MCS_label=N/A
   Priority=23448 Nice=0 Account=default QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:32 TimeLimit=2-00:00:00 TimeMin=N/A
   SubmitTime=2020-12-09T15:21:13 EligibleTime=2020-12-09T15:21:13
   AccrueTime=2020-12-09T15:21:13
   StartTime=2020-12-09T15:21:13 EndTime=2020-12-11T15:21:13 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-12-09T15:21:13
   Partition=test AllocNode:Sid=cpu-23-1.localdomain:46666
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=cpu-25-[20-21]
   BatchHost=cpu-25-20
   NumNodes=2 NumCPUs=256 NumTasks=256 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=256,mem=1030912M,node=2,billing=256
   Socks/Node=* NtasksPerN:B:S:C=128:0:*:* CoreSpec=*
   MinCPUsNode=128 MinMemoryCPU=4027M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Reservation=mpi_test
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/xxxx/slurm_test/mpi/submit.sh
   WorkDir=/home/xxxx/slurm_test/mpi
   StdErr=/home/xxxx/slurm_test/mpi/Misha_MPI.11341.err
   StdIn=/dev/null
   StdOut=/home/xxxx/slurm_test/mpi/Misha_MPI.11341.out
   Power=
   NtasksPerTRES:0

$ scontrol show step 11341
StepId=11341.extern UserId=98263 StartTime=2020-12-09T15:21:13 TimeLimit=UNLIMITED
   State=RUNNING Partition=test NodeList=cpu-25-[20-21]
   Nodes=2 CPUs=0 Tasks=2 Name=extern Network=(null)
   TRES=(null)
   ResvPorts=(null)
   CPUFreqReq=Default
   SrunHost:Pid=(null):0

StepId=11341.batch UserId=98263 StartTime=2020-12-09T15:21:13 TimeLimit=UNLIMITED
   State=RUNNING Partition=test NodeList=cpu-25-20
   Nodes=1 CPUs=0 Tasks=1 Name=batch Network=(null)
   TRES=(null)
   ResvPorts=(null)
   CPUFreqReq=Default
   SrunHost:Pid=(null):0

StepId=11341.0 UserId=98263 StartTime=2020-12-09T15:21:15 TimeLimit=UNLIMITED
   State=RUNNING Partition=test NodeList=cpu-25-[20-21]
   Nodes=2 CPUs=256 Tasks=256 Name=mpitest Network=(null)
   TRES=cpu=256,mem=1030912M,node=2
   ResvPorts=(null)
   CPUFreqReq=Default Dist=Block
   SrunHost:Pid=cpu-25-20:90351

Now, as you can see, step_0 has all the requested resources, and it works fine!

# sacct -j 11339 --format=partition,jobid,ntasks,nodelist,maxrss,maxvmsize
 Partition        JobID   NTasks        NodeList     MaxRSS  MaxVMSize
---------- ------------ -------- --------------- ---------- ----------
      test 11339                  cpu-25-[20-21]
           11339.batch         1       cpu-25-20      3812K    274040K
           11339.extern        2  cpu-25-[20-21]          0      4352K
           11339.0             1       cpu-25-21          0    342340K

The sacct output shows a lower MaxVMSize, since PMI2 is not compiled against any UCX packages and is expected to have a slightly lower memory footprint than ORTED.

The questions are:

1) As you mentioned, "MaxVMSize" may not be a very reliable number, and we can expect the real value to exceed it. Still, why do ORTED and UCX require so much memory? Does slurmstepd or any other Slurm process add extra memory footprint, or is it just the nature of MPI launchers?

2) Why should step_0 for MPI jobs (and especially ORTED) be limited to 1 CPU? Can we just let it have all the resources, the way srun does?

3) Why was this not an issue with v20.02? (Although we were experiencing slower multi-node MPI jobs with Slurm 20.02 compared to our older clusters with the UGE scheduler.)

Please let me know if you need more information from me, and thank you very much for helping us in this case.


Best,
Misha
Comment 14 Misha Ahmadian 2020-12-09 17:16:58 MST
Marcin,

I did one more step in addition to what I posted above:

1) I compiled OpenMPI without Slurm but kept the UCX:

./configure --without-pmi --with-ucx=/opt/apps/nfs/custom/ext-libs/ucx --without-slurm


2) Then I submitted a job to Slurm

#!/bin/bash
#SBATCH -J Misha_MPI
#SBATCH -N 2
#SBATCH --ntasks-per-node=128
#SBATCH -o %x.%j.out
#SBATCH -e %x.%j.err
#SBATCH -p test

ml openmpi-4.0.4

mpicc mpitest.c -o mpitest

mpirun -np 256 -map-by ppr:128:node -bind-to core -hostfile machinefile.$SLURM_JOB_ID ./mpitest


*(The Prolog generates machinefile.$SLURM_JOB_ID for every job, and the Epilog removes it; see the sketch below.)
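
For context, a hypothetical sketch of such a Prolog/Epilog pair (the shared path is an assumption, and it assumes SLURM_JOB_NODELIST is available in the Prolog/Epilog environment):

# Prolog sketch: expand the allocated node list into a hostfile, one hostname per line
scontrol show hostnames "$SLURM_JOB_NODELIST" > /shared/scratch/machinefile.$SLURM_JOB_ID

# Epilog sketch: remove the hostfile when the job ends
rm -f /shared/scratch/machinefile.$SLURM_JOB_ID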

* This job again runs with no issue, with or without the cgroup memory constraint.

$ scontrol show job 11352
JobId=11352 JobName=Misha_MPI
   UserId=xxxx(xxx) GroupId=xxxx(xxx) MCS_label=N/A
   Priority=22568 Nice=0 Account=default QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:51 TimeLimit=2-00:00:00 TimeMin=N/A
   SubmitTime=2020-12-09T17:52:13 EligibleTime=2020-12-09T17:52:13
   AccrueTime=2020-12-09T17:52:13
   StartTime=2020-12-09T17:52:14 EndTime=2020-12-11T17:52:14 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-12-09T17:52:14
   Partition=test AllocNode:Sid=cpu-23-1.localdomain:52013
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=cpu-25-[20-21]
   BatchHost=cpu-25-20
   NumNodes=2 NumCPUs=256 NumTasks=256 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=256,mem=1030912M,node=2,billing=256
   Socks/Node=* NtasksPerN:B:S:C=128:0:*:* CoreSpec=*
   MinCPUsNode=128 MinMemoryCPU=4027M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Reservation=mpi_test
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/xxxx/slurm_test/mpi/submit.sh
   WorkDir=/home/xxxx/slurm_test/mpi
   StdErr=/home/xxxx/slurm_test/mpi/Misha_MPI.11352.err
   StdIn=/dev/null
   StdOut=/home/xxxx/slurm_test/mpi/Misha_MPI.11352.out
   Power=
   NtasksPerTRES:0


$ scontrol show step 11352
StepId=11352.extern UserId=98263 StartTime=2020-12-09T17:52:14 TimeLimit=UNLIMITED
   State=RUNNING Partition=test NodeList=cpu-25-[20-21]
   Nodes=2 CPUs=0 Tasks=2 Name=extern Network=(null)
   TRES=(null)
   ResvPorts=(null)
   CPUFreqReq=Default
   SrunHost:Pid=(null):0

StepId=11352.batch UserId=98263 StartTime=2020-12-09T17:52:14 TimeLimit=UNLIMITED
   State=RUNNING Partition=test NodeList=cpu-25-20
   Nodes=1 CPUs=0 Tasks=1 Name=batch Network=(null)
   TRES=(null)
   ResvPorts=(null)
   CPUFreqReq=Default
   SrunHost:Pid=(null):0

As we can see, there is no step_0 for this job, and mpirun uses all the resources allocated to the job:

$ sacct -j 11352 --format=partition,jobid,ntasks,nodelist,maxrss,maxvmsize
 Partition        JobID   NTasks        NodeList     MaxRSS  MaxVMSize
---------- ------------ -------- --------------- ---------- ----------
      test 11352                  cpu-25-[20-21]
           11352.batch         1       cpu-25-20   1835.50M    340604K
           11352.extern        2  cpu-25-[20-21]          0    134288K

However, compiling OpenMPI without Slurm support is not preferable. We want users to avoid having to provide a machinefile or be specific about their process placement, as I did in the job submission script above. Moreover, we would like to allow users their choice of "mpirun" or "srun", and not restrict them to the "srun" command for their MPI jobs.

Best,
Misha
Comment 15 Marcin Stolarek 2020-12-10 04:59:18 MST
Misha,

I'll start by rephrasing what you wrote, adding some comments to make sure we're on the same page.

The job executes correctly under memory control when you use srun, or mpirun built without Slurm support. When tasks are launched by srun, srun is responsible for the step request, and it requests a step with all CPUs, which results in the correct memory calculation (#CPUs * --mem-per-cpu). In the case of mpirun without Slurm support, processes are spawned on the nodes from the machine file using ssh/rsh; if you have pam_slurm_adopt enabled, they are then adopted into the extern step, which contains all the job resources.

Tasks are killed by the OOM killer when the step request is made by mpirun compiled with Slurm support. Looking at the openmpi-4.0.4 code, I see that it actually calls srun behind the scenes, adding --ntasks-per-node=1 on the command line:
>./orte/mca/plm/slurm/plm_slurm_module.c
>260     /*                                                                           
>261      * SLURM srun OPTIONS                                                        
>262      */                                                                          
>263                                                                                  
>264     /* add the srun command */                                                   
>265     opal_argv_append(&argc, &argv, "srun");                                      
>266                                                                                  
>267     /* start one orted on each node */                                           
>268     opal_argv_append(&argc, &argv, "--ntasks-per-node=1");    

You can even verify that by executing `ps aux | grep srun | grep orted` on the batch host (normally the first node in the allocation), which will give you a line similar to:
>srun --ntasks-per-node=1 --kill-on-bad-exit --ntasks=2 orted -mca ess "slurm" [...] 

This setting was made in OpenMPI years ago, in commit 83e59e67610 (introduced in openmpi-2.0.0).
I compared the step parameters when mpirun is used to create the step on Slurm 20.02, and I'm seeing exactly the same behavior.

>The sacct shows a lower MaxVMSize since the PMI2 was not compiled against any UCX packages and is supposed to have a slightly lower memory footprint than ORTED.
As far as I understand, PMI2 is a simple and tiny mechanism (a key-value container) for exchanging information between the resource manager and the MPI launcher, while UCX is mostly about interprocess communication optimization. I don't expect UCX to have much, if any, impact on PMI. (The terminology here is a little confusing, since Slurm's PMI-based runtime environment is the equivalent of orted.)

>1) As you mentioned, maybe the "MaxVMSize" is not a very reliable number, but we 
>can expect this number to be exceeded. However, why the ORTED and UCX require 
>such a huge memory? Does Slurmstepd or any other Slurm processes add extra 
>memory footprint, or is it just the nature of MPI launchers?
I'm pretty sure that the memory really is exceeded here, since we rely on the kernel/cgroup mechanism. The only processes in the cgroup are the job tasks - slurmd/slurmstepd don't contribute here. Judging how much memory an application requires is outside our (SchedMD) expertise; however, from my reading while working on the case, I feel that a larger minimal footprint for UCX may be expected. At the end of the day, its goal is to optimize real application runs, not just a test Init/Fini app.

>2) Why the step_0 for MPI jobs (and especially the ORTED) should be limited to 1 
>CPU? Can we just let it have all the resources just like the way srun does?
>3) Why was this not an issue with v20.02? (Although were experiencing slower 
>multi-node MPI jobs with Slurm 20.02 compares to our older clusters with UGE 
>scheduler)
I think I answered both of those above. In the tests I did, I don't see a difference between 20.02 and 20.11.

cheers,
Marcin
Comment 16 Misha Ahmadian 2020-12-10 11:19:09 MST
Hi Marcin,

> The job executes correctly under memory control when you use srun or mpirun
> build without Slurm support. When tasks are lanuched by srun, then it's
> responsible for step request, it's requesting a step with all CPUs which
> results in correct memory calculation (#CPUs * --mem-per-cpu). In case of
> mpirun without Slurm support, porcesses are spawned on nodes using ssh/rsh
> to the nodes from the machine file, if you have pam_slurm_adopt enabled
> those are than adopted to an extern step which contains all the job
> resources.

That's correct.

 
> Tasks are killed by OOM killer when step request is done by mpirun complied
> with Slurm support. Looking at openmpi-4.0.4 code I see that it actually
> calls srun behind the scene adding --ntasks-per-node=1 in the command line:
> >./orte/mca/plm/slurm/plm_slurm_module.c
> >260     /*                                                                           
> >261      * SLURM srun OPTIONS                                                        
> >262      */                                                                          
> >263                                                                                  
> >264     /* add the srun command */                                                   
> >265     opal_argv_append(&argc, &argv, "srun");                                      
> >266                                                                                  
> >267     /* start one orted on each node */                                           
> >268     opal_argv_append(&argc, &argv, "--ntasks-per-node=1");    
> 
> you can even verify that executing `ps aux | grep srun | grep orted` on the
> batch host(normally the 1st node in the allocation) which will give you a
> line similar to:
> >srun --ntasks-per-node=1 --kill-on-bad-exit --ntasks=2 orted -mca ess "slurm" [...] 
> 
> This setting was done in openmpi years ago, in commit 83e59e67610
> (introduced in openmpi-2.0.0.
> I compared the step parameters when mpirun used to create the step on Slurm
> 20.02 and I'm seeing exactly the same behavior.

Thanks for letting me know about this. That was interesting!

> 
> >The sacct shows a lower MaxVMSize since the PMI2 was not compiled against any UCX packages and is supposed to have a slightly lower memory footprint than ORTED.
> As far as I understand PMI2 is simple and tiny mechanism (Key-Value
> container) to exchange the information between resource manager and MPI
> launcher, while UCX is mostly about interproces communication optimization.
> I don't expect UCX having much if any impact on PMI (The terminology here is
> a little bit confusing since Slurm PMI is the runtime environment equivalent
> for orted).
> 
> >1) As you mentioned, maybe the "MaxVMSize" is not a very reliable number, but we 
> >can expect this number to be exceeded. However, why the ORTED and UCX require 
> >such a huge memory? Does Slurmstepd or any other Slurm processes add extra 
> >memory footprint, or is it just the nature of MPI launchers?
> I'm pretty sure that the memory is exceeded here, since we relay on the
> kernel/cgroup mechanism. The only processes in the cgroup are job tasks-
> slurmd/slurmstepd don't contribute here. Judging on the application required
> memory is outside our (SchedMD) expertise, however from my reading while
> working on the case I feel that larger minimal footprint for UCX may be
> expected. At the end of the day it's goal is to optimize real application
> runs not just test Init/Fini app.

So, here is the thing. I think I somehow misinterpreted the output of sacct that I posted above. (I repost it here again): 

# sacct -j 11341 --format=partition,jobid,ntasks,nodelist,maxrss,maxvmsize
 Partition        JobID   NTasks        NodeList     MaxRSS  MaxVMSize
---------- ------------ -------- --------------- ---------- ----------
      test 11341                  cpu-25-[20-21]
           11341.batch         1       cpu-25-20      5080K    340604K
           11341.extern        2  cpu-25-[20-21]          0      4352K
           11341.0           256  cpu-25-[20-21]    102496K    475976K

This is the test job compiled and executed with an OpenMPI that was built with Slurm, UCX, and PMI support. When I looked again, I found MaxVMSize is 465MB (not 4GB), since I had missed the 'K' next to 475976K. I understand the values of MaxVMSize and MaxRSS can be underestimated, since accounting gathering happens every 30 seconds and may simply miss the actual maximum. But after discussing with other people here and looking around on the internet and the OpenMPI GitHub page, we still have a hard time understanding why MPI_Init should require more than 3.9G to run an MPI job. Moreover, we never had such an issue before with the other clusters.

>I think I answered those two above. In tests I did I don't see the difference between >20.02 and 20.11.

I understand there is no difference between Slurm 20.11 and 20.02 in this case, but we were trying to understand what might have caused this to happen to our MPI jobs, since we never saw it in the previous version of Slurm with the same OS packages and OpenMPI build.

One more question: do you think something might have changed in the recent version of Slurm in terms of cgroup or memory-control handling, or are we missing something in our config files? (Just curious to know.)

Thanks for all your helpful explanation. We appreciate your help.

Best,
Misha
Comment 17 Marcin Stolarek 2020-12-10 11:29:36 MST
Could you please try applying this patch to your OpenMPI build:
>./orte/mca/plm/slurm/plm_slurm_module.c
>@@ -267,6 +267,11 @@ static void launch_daemons(int fd, short args, void *cbdata)
>     /* start one orted on each node */
>     opal_argv_append(&argc, &argv, "--ntasks-per-node=1");
> 
>+    /* add all CPUs to this task */
>+    cpus_on_node = getenv("SLURM_CPUS_ON_NODE");
>+    asprintf(&tmp, "--cpus-per-task=%s", cpus_on_node);
>+    opal_argv_append(&argc, &argv, tmp);
>+    free(tmp);
>+
>     if (!orte_enable_recovery) {
>         /* kill the job if any orteds die */
>         opal_argv_append(&argc, &argv, "--kill-on-bad-exit");

The above snippet shows the final location of the file in the openmpi tar.gz (after the prrte subproject inclusion)[1].


>One more question: Do you think something might have been changed in the recent version of Slurm in terms of handling the cgroup or memory control? or are we missing something in our config files? (Just curious to know)
I'm further investigating it and I'll keep you posted.

cheers,
Marcin
[1]https://github.com/openpmix/prrte/pull/698
Comment 18 Misha Ahmadian 2020-12-10 13:17:42 MST
Marcin,

I appreciate that. The patch sounds like it's going to fix the issue. 

Would you mind sending me the patch file generated by the diff command instead? This patch gives a "malformed patch" error. Then I can put the patch in the root directory of the OpenMPI source and apply it. That would be helpful for our Spack builds as well.
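
What I'm after is something that can be generated and applied along these lines (a sketch; the tree names are placeholders):

cp -r openmpi-4.0.4 openmpi-4.0.4.orig
# hand-edit openmpi-4.0.4/orte/mca/plm/slurm/plm_slurm_module.c as in comment 17
diff -ru openmpi-4.0.4.orig openmpi-4.0.4 > slurm-srun-orted-fix.patch
# later, from inside a clean openmpi-4.0.4 source tree:
patch -p1 < slurm-srun-orted-fix.patch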

Is this for a specific version of OpenMPI? (I see the changes are made for OpenPMIx, but I'm not sure which version of OpenMPI has this version of PMIx.)

BTW, it looks like there has been more discussion since you posted the patch above.


Thank you,
Misha
Comment 19 Misha Ahmadian 2020-12-10 15:46:08 MST
Hi Marcin,

Please ignore my previous post. I was able to produce the patch and apply it correctly. The good news is that OpenMPI works just fine after I applied the patch!

I recompiled OpenMPI with Slurm, UCX, and PMI support after applying the patch, submitted the same test job across multiple nodes, and it worked fine. Below is the scontrol output for that job:

$ scontrol show step 11396
StepId=11396.extern UserId=98263 StartTime=2020-12-10T16:12:07 TimeLimit=UNLIMITED
   State=RUNNING Partition=test NodeList=cpu-25-[33-34]
   Nodes=2 CPUs=0 Tasks=2 Name=extern Network=(null)
   TRES=(null)
   ResvPorts=(null)
   CPUFreqReq=Default
   SrunHost:Pid=(null):0

StepId=11396.batch UserId=98263 StartTime=2020-12-10T16:12:07 TimeLimit=UNLIMITED
   State=RUNNING Partition=test NodeList=cpu-25-33
   Nodes=1 CPUs=0 Tasks=1 Name=batch Network=(null)
   TRES=(null)
   ResvPorts=(null)
   CPUFreqReq=Default
   SrunHost:Pid=(null):0

StepId=11396.0 UserId=98263 StartTime=2020-12-10T16:12:11 TimeLimit=UNLIMITED
   State=RUNNING Partition=test NodeList=cpu-25-34
   Nodes=1 CPUs=128 Tasks=1 Name=orted Network=(null)
   TRES=cpu=128,mem=515456M,node=1
   ResvPorts=(null)
   CPUFreqReq=Default Dist=Cyclic
   SrunHost:Pid=cpu-25-33:111824


We were also trying to push the changes upstream to the OpenMPI repo, but we are not sure which branch would be the right place. Would you mind adding this to OpenMPI as well?

https://github.com/open-mpi/ompi

I also realized that "orte/mca/plm/slurm/plm_slurm_module.c" in OpenMPI uses "opal_argv_append" instead of the "prte_argv_append" used in OpenPMIx.

Thanks for all your help on this issue. We really appreciate it.

Best Regards,
Misha
Comment 20 Marcin Stolarek 2020-12-11 01:45:36 MST
*** Bug 10413 has been marked as a duplicate of this bug. ***
Comment 21 Marcin Stolarek 2020-12-11 02:11:44 MST
>We were also trying to push the changes upstream to the OpenMPI RP but not sure which pbanch would be the right place. Would you mind to add this to OpenMPI as well?
That's already done[1]. The function-name differences are context/version dependent. I'm glad you resolved them for your version of OpenMPI.

The major difference between Slurm 20.02 and 20.11 causing the issue is that in 20.11 we switched to steps being "--exclusive" by default, while getting the previous behavior now requires use of the "--overlap" option.[2]

I feel like we have a complete understanding of the case now, but just to make sure - is there any outstanding question on your side?

cheers,
Marcin

[1]https://github.com/openpmix/prrte/commit/0288ebbc15c36e1d3c32f6d12c47237053e06101
[2]https://slurm.schedmd.com/srun.html#OPT_overlap
Comment 22 Alan Sill 2020-12-11 09:15:31 MST
The fix linked above is for the OpenPMIX code base. Is there a similar fix yet for the OpenMPI code base?
Comment 23 Misha Ahmadian 2020-12-11 09:26:29 MST
Hi Marcin,

> >We were also trying to push the changes upstream to the OpenMPI RP but not sure which pbanch would be the right place. Would you mind to add this to OpenMPI as well?
> That's already done[1]. The function name differences are context/version
> dependent. I'm glad you resolved those for you version of openmpi.

Thanks again for your help. The link in [1] points to OpenPMIx, not to the OpenMPI repo. I'm not sure how that would be merged into OpenMPI, or whether we need a separate push for OpenMPI since the function names are different ("prte_argv_append" in OpenPMIx vs. "opal_argv_append" in OpenMPI).

 
> The major difference between Slurm 20.02 and 20.11 causing the issue is that
> in 20.11 we switched to steps being "--exclusive" while previous behavior
> requires use of "--overlap" option.[2]

Does that suggest that using "--overlap" with the sbatch command would fix this issue without patching OpenMPI? (It may not be preferable to force users to add yet another Slurm option to their OpenMPI job submissions.)

Best,
Misha


> [1]https://github.com/openpmix/prrte/commit/0288ebbc15c36e1d3c32f6d12c47237053e06101
Comment 24 Kilian Cavalotti 2020-12-11 09:47:48 MST
(In reply to Misha Ahmadian from comment #23)
> > The major difference between Slurm 20.02 and 20.11 causing the issue is that
> > in 20.11 we switched to steps being "--exclusive" while previous behavior
> > requires use of "--overlap" option.[2]
> 
> Does that suggest that if we use "--overlap" with sbatch command, it would
> fix this issue if we wouldn't patch the OpenMPI? (That may not be preferable
> to force users to add another Slurm options to their OpenMPI job submissions)

AFAIK, --overlap is not a valid sbatch/salloc option, it can only be used with srun.

I do share the pain here and second the motion to get the change pushed upstream in Open MPI. But beyond this, having to modify and update Open MPI (and potentially user applications that may have been dependent on previous Open MPI versions) won't be an acceptable course of action for many sites, I'm afraid.

What would SchedMD recommend as a workaround that would still allow users to run their MPI codes with existing Open MPI installations, using mpirun, under Slurm 20.11?

Cheers,
--
Kilian
Comment 25 Kilian Cavalotti 2020-12-11 10:44:58 MST
(In reply to Kilian Cavalotti from comment #24)
> (In reply to Misha Ahmadian from comment #23)
> > > The major difference between Slurm 20.02 and 20.11 causing the issue is that
> > > in 20.11 we switched to steps being "--exclusive" while previous behavior
> > > requires use of "--overlap" option.[2]
> > 
> > Does that suggest that if we use "--overlap" with sbatch command, it would
> > fix this issue if we wouldn't patch the OpenMPI? (That may not be preferable
> > to force users to add another Slurm options to their OpenMPI job submissions)
> 
> AFAIK, --overlap is not a valid sbatch/salloc option, it can only be used
> with srun.

As an additional data point, I also tried to set SLURM_OVERLAP before launching mpirun:

$ salloc -N 2 --ntasks-per-node=4 
$ export SLURM_OVERLAP=1
$ mpirun ...

I verified that the srun processes spawned by mpirun correctly inherited the environment variable:

$ for p in $(pgrep -u $USER srun); do xargs -0 -L1 -a /proc/$p/environ | grep SLURM_OVERLAP; done
SLURM_OVERLAP=1
SLURM_OVERLAP=1

but that doesn't seem to solve the problem.

Cheers,
--
Kilian
Comment 26 Marcin Stolarek 2020-12-14 05:17:34 MST
Prrte inclusion is out of scope for SchedMD support, but from my experience/understanding, prrte is currently an open-mpi/ompi subproject, meaning that the only change required to make this fix effective for future ompi builds is to move the subproject pointer. I've opened a pull request to check with the ompi maintainers how they manage that part[1].

When it comes to the --overlap steps, things today are a little more complicated, since the change was part of a bigger rewrite of the step management code, and maybe my reply in comment 21 was too terse. I'll start with a quote from the 20.11 RELEASE NOTES:
> -- By default a step started with srun will get --exclusive behavior meaning
>    no other parallel step will be allowed to run on the same resources at
>    the same time.  To get the previous default behavior which allowed
>    parallel steps to share all resources use the new srun '--overlap' option.
>-- In conjunction to --exclusive behavior being the default for a step there
>     is also another option for step management, --whole, which will allow a
>     step access to all resources of a node in a job allocation.  This will
>     allocate all resources on a node allocated the step. No other parallel
>     step will have access to those unless --overlap is used.
Given the way mpirun spawns orted when OpenMPI is configured with the `--with-slurm` option, the most appropriate change would be to add --whole to that srun call, but since --whole is not compatible with older srun versions, adding -c is the most appropriate option there, in my opinion.
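
With the -c addition from the patch in comment 17, the srun command line that mpirun builds would come out roughly as follows (a sketch based on the ps output quoted earlier, assuming 128-core nodes; the bracketed part is elided as before):

srun --ntasks-per-node=1 --cpus-per-task=128 --kill-on-bad-exit --ntasks=2 orted -mca ess "slurm" [...]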

Putting the overlap/whole detail aside: setting an environment variable before mpirun sounds doable, and may be an option for users via environment modules scripts. (The other way to get this working today is to use a cli_filter plugin.)

Although the SLURM_WHOLE variable is not documented, it is implemented as an input variable of srun, so you can give it a try (as Josko suggested in Bug 10430). I'm seeing more oddities that may come up with the way srun is used by openmpi, but here is SLURM_WHOLE in action:

>[salloc] bash-4.2# srun  --ntasks-per-node=1  /bin/bash -c 'if [ $SLURM_NODEID -eq 0 ]; then scontrol show step; fi'
>StepId=58889.0 UserId=0 StartTime=2020-12-14T12:09:34 TimeLimit=UNLIMITED
>   State=RUNNING Partition=par1 NodeList=test[01,08]
>   Nodes=2 CPUs=2 Tasks=2 Name=bash Network=(null)
>   TRES=cpu=2,mem=0,node=2
>   ResvPorts=12043-12044
>   CPUFreqReq=Default Dist=Cyclic
>   SrunHost:Pid=slurmctl:28541
>[salloc] bash-4.2# export SLURM_WHOLE=1
>[salloc] bash-4.2# srun  --ntasks-per-node=1  /bin/bash -c 'if [ $SLURM_NODEID -eq 0 ]; then scontrol show step; fi'
>StepId=58889.1 UserId=0 StartTime=2020-12-14T12:09:40 TimeLimit=UNLIMITED
>   State=RUNNING Partition=par1 NodeList=test[01,08]
>   Nodes=2 CPUs=64 Tasks=2 Name=bash Network=(null)
>   TRES=cpu=64,mem=0,node=2
>   ResvPorts=12045-12046
>   CPUFreqReq=Default Dist=Cyclic
>   SrunHost:Pid=slurmctl:28616


I tried to cover all your questions, if I miss something or you need more clarification please let me know.

cheers,
Marcin


[1]https://github.com/open-mpi/ompi/pull/8284
Comment 27 Kilian Cavalotti 2020-12-14 05:41:07 MST
Hi,

I am currently out of office, returning on January 4. 

If you need to reach Stanford Research Computing, please email srcc-support@stanford.edu

Cheers,
Comment 28 Marcin Stolarek 2020-12-14 06:00:24 MST
Misha,

Do you agree that we can close the case as "information given", or should we lower the bug severity to 4 for now?

cheers,
Marcin
Comment 29 Misha Ahmadian 2020-12-14 09:17:16 MST
Hi Marcin,

Thank you very much for your detailed response. I think we still have to look into the --whole and --overlap options more deeply. I'm hoping that the new changes in Slurm 20.11 won't affect our current policy of allowing users to share a node between different jobs without oversubscribing resources or overlapping one another.

If that's OK with you, I'm going to lower the severity of this bug to 4 and leave it open for a few days in case anybody (including us) has further questions; then I'll close the ticket.

Best,
Misha
Comment 30 Kilian Cavalotti 2020-12-14 09:34:27 MST
(In reply to Marcin Stolarek from comment #26)
> Prrte inclusinon is out of scope for SchedMD support

Granted. But breaking a good proportion of user applications when going from Slurm 20.02 to 20.11 definitely seems in scope for SchedMD support, I believe.

The bottom line is that user jobs relying on mpirun/mpiexec that had been working fine until 20.11 suddenly stopped working with the new step behavior. Although some users may be able to modify their scripts to use srun instead of mpirun, some may not (think of proprietary applications, for instance, that use wrappers to start MPI jobs with mpirun or mpiexec behind the scenes).

Expecting external software to be patched to accommodate Slurm's new behavior is probably not very reasonable. So I guess the question is: what option does SchedMD propose to restore the previous behavior, or at least to allow existing jobs and applications to work normally under 20.11 without modifications?

Thanks,
-- 
Kilian, not very happy right now.
Comment 31 Marcin Stolarek 2020-12-15 04:11:35 MST
>I'm hoping that the new changes in Slurm 20.11 won't change our current policy, allowing users to share a node between different jobs without oversubscribing the resources or overlapping each other.
The change is only for step management happening inside the allocation.

>If that's ok with you, I'm going to lower the severity of this bug to 4 and let it open for a few days, if anybody (including us) had further questions, then I'll close the ticket.
That makes sense.

>Granted. But breaking a good proportion of user applications when going from Slurm 20.02 to 20.11 definitely seems in scope for SchedMD support, I believe.
Don't take me wrong, I'm not trying to say that we don't care. I wanted to emphasize that I can't be so sure about the way ompi development is maintained - it's my technical opinion, not anything SchedMD can advise on.

>The bottom line is that user jobs relying on mpirun/mpiexec that have been working fine until 20.11 suddenly stopped working with the new step behavior. Although some users may be able to modify their scripts to use srun instead of mpirun, some may not (think proprietary application, for instance, that use wrapper to start MPI jobs using mpirun or mpiexec under the scenes).

I at least partially agree - that's why I've created the mentioned PR against pmix/prrte, which is ultimately used by openmpi, and why we're also trying to improve the openmpi/slurm FAQ[2]. Looking at it from the other side: the fact that openmpi's mpirun executing srun --ntasks-per-node=1 internally worked to a certain extent before (some parts of the Slurm process launch didn't work correctly - process affinity, task distribution - which was always the reason for us to recommend srun instead of mpirun) doesn't mean it was correct or optimal to use the Slurm launcher to launch the OMPI launcher.

>Expecting external software to be patched to accommodate Slurm's new behavior is probably not very reasonable. So I guess the question is: what option does SchedMD propose to restore the previous behavior, or at least to allow existing jobs and applications to work normally under 20.11 without modifications?
I see your point; however, I think it's also reasonable to say that if external software makes use of Slurm, it should accommodate changes in major releases of Slurm. (Building mpirun without Slurm support and spawning processes over SSH, with adoption by pam_slurm_adopt, still works.) As mentioned in comment 26, doing:
>export SLURM_WHOLE=1
>mpirun ./mpi_program
should work as well - this doesn't require rebuilding the openmpi binaries. We're having an internal discussion on the best approach to this surprising consequence, and I'll keep you posted.
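
In batch-script form that would look something like this (a sketch based on the test script earlier in this bug; SLURM_WHOLE is the undocumented srun input variable mentioned in comment 26):

#!/bin/bash
#SBATCH -N 2
#SBATCH --ntasks-per-node=128
#SBATCH -p test

module load openmpi-4.0.4
export SLURM_WHOLE=1   # picked up by the srun that mpirun runs internally
mpirun ./mpitest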

cheers,
Marcin

[2]https://github.com/open-mpi/ompi-www/pull/342
Comment 33 Luke Yeager 2020-12-15 12:27:18 MST
I'd like to point out that this set of changes can break applications which don't use OpenMPI, too.

From NEWS:
> -- Make --exclusive the default with srun as a step adding --overlap to
>    reverse behavior.
> -- Add --whole option to srun to allocate all resources on a node
>    in an allocation.
It was not clear to me from reading NEWS that these two items are so closely related. In fact, you need both --overlap and --whole if you want to undo the --exclusive flag which srun now adds by default. I think the documentation could be clearer about this.

> # 20.02 behavior, which applications rely on
> $ salloc -N1 --exclusive
> $ srun -n2 nproc
> 80
> 80

> # 20.11 default behavior (breaks applications)
> $ salloc -N1 --exclusive
> $ srun -n2 nproc
> 4
> 4

> # actual application error message
> + exec numactl --physcpubind=0-4,40-44 --membind=0 -- ...
> libnuma: Warning: cpu argument 4,40-44 out of range
> <0-4,40-44> is invalid
This is not a CPU binding issue - the CPUs are not visible in the cgroup, even with '--cpu-bind=none'. We're using 'TaskPlugin=affinity,cgroup' and 'TaskPluginParam=Sched,None'.
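
For completeness, passing the new flags explicitly inside the allocation is what restores the old numbers (a sketch mirroring the 20.02 example above; the reported values depend on the node):

> $ salloc -N1 --exclusive
> $ srun --whole --overlap -n2 nproc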

I realize SchedMD is aware of this because they have worked around it in OpenMPI with [1]. But I wanted to make sure others reading this bug are aware of this fundamental change.

(In reply to Marcin Stolarek from comment #31)
>> Expecting external software to be patched to accommodate Slurm's new
>> behavior is probably not very reasonable. So I guess the question is: what
>> option does SchedMD propose to restore the previous behavior, or at least to
>> allow existing jobs and applications to work normally under 20.11 without
>> modifications?
> We're having an internal discussion on the best approach to this surprising
> consequence and I'll keep you posted.
I share Kilian's concern and am eager to hear whether any solution other than setting SLURM_WHOLE=1 will be recommended.

[1] https://github.com/openpmix/prrte/commit/e0991eb074294a81823c636b28c39db7c01dd19e
Comment 34 Marcin Stolarek 2020-12-16 01:18:42 MST
Luke,

>[...] I think the documentation could be clearer about this.
We're actively working on that (on our side, and by helping with openmpi). Is there any place other than our MPI guide[1] where you'd expect to find this kind of information?

>I share Kilian's concern and am eager to hear whether any solution other than setting SLURM_WHOLE=1 will be recommended.
Do you have anything in mind? What's your main concern with the environment variable workaround?

cheers,
Marcin

[1]https://slurm.schedmd.com/mpi_guide.html
Comment 35 Marcin Stolarek 2020-12-16 01:31:27 MST
*** Bug 10444 has been marked as a duplicate of this bug. ***
Comment 36 Maxime Boissonneault 2020-12-16 06:46:32 MST
The problem with a workaround that goes through environment variables is that it is brittle. Environment variables can be set and unset. One should not have to rely on an environment variable to get the correct behavior.
Comment 37 Maxime Boissonneault 2020-12-16 06:50:34 MST
A better solution would be a configuration option that simply restores the right behavior, without needing to define a new environment variable.
Comment 39 Marcin Stolarek 2020-12-16 07:10:34 MST
>A better solution would rather be a configuration option[...]

We thought about a configuration option for that, but we believe that in the long term it would lead to end-user confusion.
If one wants to set it for every srun execution, cli_filter would be the way to go; however, this doesn't sound like a good approach to me - changing all srun calls just to make the OpenMPI mpiexec work as expected.

cheers,
Marcin
Comment 40 Maxime Boissonneault 2020-12-16 07:14:36 MST
Consider this. Every cluster has a single installation of Slurm. Every cluster has *multiple* installations of OpenMPI. In our case, we have 100 of them, literally. We have multiple versions of OpenMPI, compiled with multiple versions of compilers, for multiple CPU architectures.

And that's not even counting the versions of MPI that can be shipped with vendor-provided solutions. Breaking OpenMPI is *not an option*, ever.
Comment 41 Maxime Boissonneault 2020-12-16 07:22:17 MST
Sites that heavily rely on OpenMPI will just end up defining SLURM_WHOLE=1 by default everywhere (that's what we just did as a stop-gap measure). Except it will be ever so brittle, and it will resurface as a problem in the future when a workflow somewhere resets the environment.

You mention that using the cli_filter is an option. Can you point us to documentation about this? It definitely sounds like a better option than an environment variable.
Comment 42 Alan Sill 2020-12-16 07:34:38 MST
We have implemented the patch and incorporated it into our spack-based build procedure, so we are less concerned than others on this thread about applying the patch consistently, as it will be applied to all versions we build. I am more concerned, however, about the effect of the recent changes to Slurm on scheduling for multiple-user and multiple-job node occupancy.

Unlike some sites, we rely on the ability to have jobs share nodes and not be restricted to scheduling jobs on a whole-node basis. Our latest cluster has 128 cores per node, and it just doesn't make sense at our scale to restrict scheduling to entire nodes at a time. I do not understand the full implications of the recent changes, and would like to request and suggest that a full explanation of the change, including the motivation for making it and how to work around it if needed, be made in the documentation.

We need a clear understanding of how to tune scheduling to allow for the various ways in which a site may choose to run. Only the largest outfits can afford to schedule on a whole-node basis for all jobs, and we, as well as the majority of sites I believe, are not in that category.

Thanks.
Comment 43 Maxime Boissonneault 2020-12-16 07:37:54 MST
My understanding is that SLURM_WHOLE does not force whole node scheduling. It just forces srun to not constrain MPI processes *within* the job's allocation to less than the job's allocated resources. It negates the new "--exclusive" default.
Comment 44 Alan Sill 2020-12-16 08:40:32 MST
> My understanding is that SLURM_WHOLE does not force whole node scheduling. It
> just forces srun to not constrain MPI processes *within* the job's allocation
> to less than the job's allocated resources. It negates the new "--exclusive"
> default.

Thanks. Given then that (as per previous comments) it is not yet documented, I'll repeat my request for complete documentation of the new flags and behaviors.

Also, I think it is important that Slurm maintain any components related to Slurm usage in upstream projects such as OpenMPI, OpenPMIx, etc., to avoid old behaviors tripping over new features, as has been mentioned above.
Comment 45 Luke Yeager 2020-12-16 09:50:24 MST
I would love to hear the argument in *favor* of this patch. The commits reference bug#8572, but it is private.


(In reply to Marcin Stolarek from comment #34)
> What's your main concern in the environment variable workaround?
1. I think that an envvar solution is obscure. I prefer to simply add 'set -x' at the top of my sbatch script and for the resulting log to be reproducible by others. But now I'll have to add an 'env' somewhere to further clarify the behavior of my job so that it's reproducible at other sites with different Slurm versions and/or with different envvars set.

2. Let's assume that the argument in favor of this change is persuasive for a moment. If I set 'SLURM_WHOLE=1' for all users, how do I enable users who want to opt in to the new Slurm behavior? There aren't flags for '--no-whole' or '--no-overlap', so they're reduced to running 'unset SLURM_WHOLE' in their sbatch script? I don't love that. Setting '--exclusive' explicitly on the srun invocation doesn't undo the envvar (come to think of it, why is that option even there anymore - only for backwards compatibility?).


(In reply to Alan Sill from comment #42)
> Unlike some sites, we rely on the ability to have jobs share nodes nd not be
> restricted to scheduling jobs on a whole-node basis.
Alan, look out for bug#10449 - srun now behaves differently when run standalone (outside of sbatch/salloc) vs. when it's run inside of an allocation. The '--exclusive' flag is overloaded for srun, which was already confusing.


(In reply to Maxime Boissonneault from comment #43)
> My understanding is that SLURM_WHOLE does not force whole node scheduling.
> It just forces srun to not constrain MPI processes *within* the job's
> allocation to less than the job's allocated resources. It negates the new
> "--exclusive" default.
It doesn't negate it fully. The --overlap flag is required to negate other parts of the new behavior, e.g. for 'srun --jobid=X' (see bug#10450).


(In reply to Alan Sill from comment #44)
> I'll repeat my request for complete documentation of the new behavior,
> flags, and behaviors.
Yes, I feel this warrants a blog post or something. Unless you can find a way to make it loud and clear enough in the manpages, NEWS, slurm-users updates, and/or any of the other existing channels.
Comment 46 Bart Oldeman 2020-12-16 11:26:54 MST
Using --mca plm_slurm_args '--whole', or OMPI_MCA_plm_slurm_args="--whole", or setting that in a configuration file does the trick as well, by the way, in a more Open MPI-specific fashion than SLURM_WHOLE=1, if you can't patch Open MPI with the patch in https://github.com/open-mpi/ompi/pull/8288
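For reference, a sketch of the three Open MPI-side variants mentioned above (the MCA parameter file paths are the conventional defaults; adjust to your installation):

# 1) per invocation:
$ mpirun --mca plm_slurm_args '--whole' ./mpi_program
# 2) via the environment:
$ export OMPI_MCA_plm_slurm_args="--whole"
$ mpirun ./mpi_program
# 3) persistently, e.g. in ~/.openmpi/mca-params.conf or
#    <ompi_prefix>/etc/openmpi-mca-params.conf, add the line:
#      plm_slurm_args = --whole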
Comment 47 Marcin Stolarek 2020-12-17 03:23:58 MST
>Every cluster has *multiple* installation of OpenMPI.
I see your point, however, the patches [1,2] are really simple and should be easy to apply to any version. I understand that it requires more labor, but unfortunately any change in behavior can lead to application incompatibilities.

>Sites who heavily rely on OpenMPI will just end up defining SLURM_WHOLE=1 by default everywhere
It's not something I'd recommend. I think that if patching sounds like too much labor for you, a better way would be to export the variable from the OpenMPI environment modules, or from a version-specific configuration as Bart suggested in comment 46.

>I am more concerned, however, abut the effect on scheduling for multiple-user and multiple-job node occupancy of the recent changes to slurm. 
The `--whole` option applies to step management, not the job/allocation. Per our "Quick Start User Guide":
>jobs, or allocations of resources assigned to a user for a specified amount of
>time, and job steps, which are sets of (possibly parallel) tasks within a job.[3]

>I do not understand the full implications of the recent changes, and would like to request and suggest that a full explanation of the change, including motivation for making it and how to work around it if needed, be made in the documentation. 
Could you please open a separate ticket, where we can fully discuss your case? Briefly: the job allocation is not affected at all; the `--whole` option means that all the resources available to the job are given to the step requesting `--whole`. As for the reasoning behind the change:
We've seen many users in the past actually expecting that steps are allocated resources exclusively within a job. This previously required the use of the `--exclusive` option, which was a little difficult to understand since it has a different meaning for:
1) an srun creating the allocation (not under salloc or an sbatch-submitted script), where it means that nodes are allocated exclusively to the user;
2) an srun running in an existing job/allocation (under salloc or a script submitted via sbatch), where `--exclusive` (the default since 20.11) means that processes created in this step are given only the resources requested by this step. This is the issue for the mpirun -> srun --ntasks-per-node=1 -> orted -> user_mpi_application execution stack, since mpirun wants srun to start only one orted per node while giving it all the resources, so that it can spawn the user tasks. The new default gives orted only the resources calculated for --ntasks-per-node=1, as requested. This is the root cause of the original issue here: the request was for the default of 1 CPU per task with --mem-per-cpu=X, so the step is limited to only X of memory since we give it one CPU (see the sketch below).
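To make the resource arithmetic concrete, a sketch of the failure mode (node counts and memory values are illustrative only, not taken from this ticket):

# job: 8 tasks per node at 4G per CPU => up to 8 x 4G per node available to the job
$ sbatch -N2 --ntasks-per-node=8 --mem-per-cpu=4G job.sh
# inside job.sh, mpirun launches its daemons roughly as:
#   srun --ntasks-per-node=1 ... orted
# under the 20.11.0 default (implicit --exclusive, no --whole) that step is sized
# for one task = one CPU, so it is capped at 1 x 4G, yet orted has to fork all
# 8 MPI ranks per node inside that step and they hit the 4G step memory limit.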

>Also I think it is important that slurm maintain any components related to slurm usage in upstream projects such as OpenMPI, OpenPMIX, etc. to avoid old behaviors tripping over new features, as has been mentioned above.
I guess you mean that SchedMD should maintain those? The current state is that SchedMD owns Slurm, and Slurm has its own application launcher integrating with popular MPI implementations over various versions of PMI[4]. The way we recommend running MPI applications is to use srun without mpirun/mpiexec. We understand that this is sometimes difficult (especially when the mpirun/mpiexec call is hard-coded by commercial apps), and we document the difficulties related to this on our MPI page[4]. We're actively working on additional info about the issue in this bug; if you want to get notified about that, please follow Bug 10453 (and the already closed bug 10430).

>I think that an envvar solution is obscure.
I agree, but I'd not call it a solution. As mentioned above, I strongly believe that exclusive allocation of steps is generally expected. Use of the SLURM_WHOLE environment variable is one of the ways to make the execution stack explained above work as before, for the fairly uncommon case where the Slurm launcher is used to start another launcher.

>There aren't flags for '--no-whole' or '--no-overlap'
Yes. We think that --whole and --overlap are two simple flags that should, long term, let people get what they want with --exclusive being the default, in the most intuitive and simple way. As of today we don't want to further complicate the options available to end users.

I hope we're making things clearer for you, and I'm happy to follow up with you.

cheers,
Marcin


[1] https://github.com/openpmix/prrte/commit/e0991eb074294a81823c636b28c39db7c01dd19e
[2]https://github.com/open-mpi/ompi/pull/8288/files
[3]https://slurm.schedmd.com/quickstart.html
[4]https://slurm.schedmd.com/mpi_guide.html
Comment 48 Marcin Stolarek 2020-12-17 08:14:25 MST
*** Bug 10473 has been marked as a duplicate of this bug. ***
Comment 49 Luke Yeager 2020-12-18 11:18:44 MST
There are two breaking changes in 20.11:

(A) resources allocated to one jobstep may not be allocated to another overlapping jobstep (mitigated by --overlap)
(B) jobsteps now only get the minimum amount of resources necessary to fulfill the explicitly defined request, instead of all resources in the allocation (mitigated by --whole)

You have given the reasoning behind change (A) in comment #47. Perhaps this is a problem for some sites, but I haven't heard anyone complaining about it. And, as I said in bug#10450, this breaking change is loud and easy to understand. I would even go so far as to say that the argument in favor of this new default is convincing and that I like the change.

However, I haven't heard any argument in favor of (B), and I haven't heard anyone who is excited about the new default behavior.

It seems like you reached for the wrong tool in your arsenal when you went for the existing --exclusive flag to solve (A). Is that right? Did you actually mean to implement ~'--no-overlap', and only accidentally also implement ~'--no-whole'?


Are you open to reverting (B)? If not, I would appreciate some advice with our transition plan (see bug#10489).
Comment 50 Bart Oldeman 2020-12-18 11:33:33 MST
For the Open MPI mpirun/mpiexec case one way to solve this within Slurm is to have "--cpu-bind=none" imply "--whole". I don't know if there are adverse consequences to this though.
Comment 51 Misha Ahmadian 2020-12-18 12:27:34 MST
Created attachment 17226 [details]
slurm-srun-orted-fix.patch

(In reply to Luke Yeager from comment #49)
> There are two breaking changes in 20.11:
> 
> (A) resources allocated to one jobstep may not be allocated to another
> overlapping jobstep (mitigated by --overlap)
> (B) jobsteps now only get the minimum amount of resources necessary to
> fulfill the explicitly defined request, instead of all resources in the
> allocation (mitigated by --whole)


So, as a result of Slurm's new behavior explained in (B), it turned out that the way OpenMPI v2+ is hardcoded to use srun with only one task for ORTED (--ntasks-per-node=1) was a bad idea (even in previous versions of Slurm, where overlapping behavior was in place) and had to be modified anyway. Using --whole might be an alternative solution for "non-mpirun" jobs, but "mpirun" definitely requires the users to add SLURM_WHOLE=1 to get around this issue.

I think the new change in the OMPI repo is not a good solution at all:
https://github.com/open-mpi/ompi/pull/8288/commits/7bac7eed6ef423e47fe980b4c32eae36b8e1d4cb

Hardcoding the environment variable in the C code is not what we should see in OpenMPI, since it wouldn't fix the underlying issue with previous versions of Slurm.

HPC sites that have an easier way of building their OpenMPI installations (such as with Spack or EasyBuild) may add the patch that I included in this post and rebuild their OpenMPI v2+.

For instance, we did this for Spack by copying the patch file into this directory:
$spack_root/var/spack/repos/ttu/packages/openmpi

And adding the following line into the $spack_root/var/spack/repos/ttu/packages/openmpi/package.py

patch('slurm-srun-orted-fix.patch', when='@2.0.0:')

That got the issue fixed for us: we rebuilt the OpenMPI packages and ran spack install to rebuild all the MPI packages within a day.
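Putting it together, the Spack workflow looks roughly like this (the repo path and version range are our site's; shown only as a sketch):

$ cp slurm-srun-orted-fix.patch $spack_root/var/spack/repos/ttu/packages/openmpi/
# package.py gains:  patch('slurm-srun-orted-fix.patch', when='@2.0.0:')
$ spack install openmpi     # then rebuild the packages that depend on it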

In case you're interested, the patch file for this fix is attached. We're trying to see whether we can push the changes to the OMPI repo.

Best,
Misha
Comment 52 Alan Sill 2020-12-18 12:41:16 MST
As this is a show-stopper for mpirun users, I raised the impact level of the bug back to its original value of 2. We are running successfully with the change suggested by Misha above, and have incorporated it into our local spack build process for OpenMPI, but I agree with Misha that this should be pushed back upstream to OpenMPI to avoid having to use environment variable workarounds.

Note this is based on the original patch suggested in the context of OpenPMIx by Marcin above. I think this is a better solution and can be applied to existing OpenMPI installations to avoid this problem entirely.

This is probably our last input into this bug, and I suggest that it be closed once the above suggestion has been reviewed.
Comment 53 Maxime Boissonneault 2020-12-18 12:43:33 MST
In the case of Compute Canada, our software stack is completely distributed. We cannot recompile OpenMPI without risking crashes on all of our clusters, or without scheduling a maintenance window on all of our clusters, so patching OpenMPI is not an option.

I would also highly favor reverting (B).
Comment 54 Bart Oldeman 2020-12-18 12:58:03 MST
There is an important issue with the patch above unless I missed something obvious:

it effectively adds --cpus-per-task=$SLURM_CPUS_ON_NODE to srun on the head node, assuming that this value is uniform across all sister nodes. This is often, but not always, the case; e.g. you can easily have an MPI job with 8 tasks on node A, 4 on node B, and 2 on node C.

In non-uniform situations we then have SLURM_JOB_CPUS_PER_NODE=8,4,2, which cannot be passed as-is to srun (it then says: srun: error: Invalid numeric value "8,4,2" for --cpus-per-task.)
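To illustrate (CPU counts hypothetical), the patch effectively turns the orted launch into something like:

#   srun --ntasks-per-node=1 --cpus-per-task=$SLURM_CPUS_ON_NODE ... orted
# on a heterogeneous allocation the two variables disagree:
#   SLURM_CPUS_ON_NODE=8            (local value on the rank-0 node only)
#   SLURM_JOB_CPUS_PER_NODE=8,4,2   (per-node values across the job)
# so --cpus-per-task=8 over-requests on the 4- and 2-CPU nodes, and the per-node
# list cannot be substituted because --cpus-per-task only accepts a single number.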
Comment 55 Kenneth Hoste 2020-12-18 13:33:38 MST
I think it's clear by now that the impact of the breaking changes in Slurm 20.11 should not be underestimated, by either SchedMD or the HPC community.

It's also clear from the initial discussion that the consequences of these changes are not only pretty big for people who are used to running 'mpirun' in job scripts (which is way more common than using 'srun', despite what is recommended, from what I can tell), but also that it's far from trivial to realize what is actually causing the vastly different behavior.
Many HPC sites and sysadmins or support teams will be spending/losing a lot of time on this, no doubt...

So I want to echo the requests to make sure that the impact of the changes is widely understood. Please document them as best you can, and make sure that SchedMD customers and the HPC community at large are well aware of what has changed, why, and what the impact is.

That's not enough though. Getting this properly documented is necessary and very important, but it's not sufficient to prevent other HPC sites from running into the problems being discussed here. Let's not pretend that people read through the Slurm documentation front-to-back on every update...


To me, this whole situation is a very good example of why semantic versioning exists: if breaking changes are made, or if defaults are changed, you bump the major version number.

With the current Slurm versioning scheme (which is essentially <year>.<month>), you totally lose that *excellent* and well understood mechanism of communicating breaking changes: if the major version number changes, it's a warning sign to pay close attention to what has changed, and how it may impact you.

This is way more valuable than knowing when a Slurm version was released (which is all the current versioning scheme conveys).

Please keep this in mind going forward.
Comment 56 Alan Sill 2020-12-18 13:58:37 MST
(In reply to Bart Oldeman from comment #54)
> There is an important issue with the patch above unless I missed something
> obvious:
> 
> it effectively adds --cpus-per-task=$SLURM_CPUS_ON_NODE to srun on the head
> node, assuming that this is value is uniform across all sister nodes. This
> is often but not always the case. I.e. you can easily have an MPI job with 8
> tasks on node A, 4 on node B, and 2 on node C.
> 
> in non-uniform situations we then have SLURM_JOB_CPUS_PER_NODE=8,4,2 which
> cannot be passed as-is to srun (it then says: srun: error: Invalid numeric
> value "8,4,2" for --cpus-per-task.)

AFAICT $SLURM_CPUS_ON_NODE is not even *defined* on the submit or head nodes, unless they are Slurm execution nodes, and it is evaluated on each node. Please correct me right away if I am wrong!

But to the larger issue you raise of non-uniformity: our users will typically do an salloc to get nodes and execute mpirun from one of those nodes. In this situation, the condition you indicate could arise. Others who understand this step in more detail should please feel free to correct this statement.

Experimentally, the problem we see is not actually with the number of tasks or cores assigned, but with the memory limits encountered at step 0. For some reason, as indicated in the original ticket above, step 0 runs out of memory when the launch occurs. It is another question (and possibly a more important one) why this launch step takes so much memory.
Comment 57 Bart Oldeman 2020-12-18 14:24:51 MST
(In reply to Alan Sill from comment #56)
> (In reply to Bart Oldeman from comment #54)
> > There is an important issue with the patch above unless I missed something
> > obvious:
> > 
> > it effectively adds --cpus-per-task=$SLURM_CPUS_ON_NODE to srun on the head
> > node, assuming that this is value is uniform across all sister nodes. This
> > is often but not always the case. I.e. you can easily have an MPI job with 8
> > tasks on node A, 4 on node B, and 2 on node C.
> > 
> > in non-uniform situations we then have SLURM_JOB_CPUS_PER_NODE=8,4,2 which
> > cannot be passed as-is to srun (it then says: srun: error: Invalid numeric
> > value "8,4,2" for --cpus-per-task.)
> 
> AFAICT the $SLURM_CPUS_ON_NODE is not even *defined* on the submit or head
> nodes, unless they are slurm execution nodes, and this ie evlauated on each
> node. Please correct me right away if I am wrong!

Sorry, wrong terminology: I meant the node that contains the process with rank 0, aka the primary node, aka the mother superior.
In my scenario on that node:
SLURM_CPUS_ON_NODE=8
SLURM_JOB_CPUS_PER_NODE=8,4,2
Comment 58 Alan Sill 2020-12-18 14:33:02 MST
(In reply to Bart Oldeman from comment #57)
> (In reply to Alan Sill from comment #56)
> > (In reply to Bart Oldeman from comment #54)
> > > There is an important issue with the patch above unless I missed something
> > > obvious:
> > > 
> > > it effectively adds --cpus-per-task=$SLURM_CPUS_ON_NODE to srun on the head
> > > node, assuming that this is value is uniform across all sister nodes. This
> > > is often but not always the case. I.e. you can easily have an MPI job with 8
> > > tasks on node A, 4 on node B, and 2 on node C.
> > > 
> > > in non-uniform situations we then have SLURM_JOB_CPUS_PER_NODE=8,4,2 which
> > > cannot be passed as-is to srun (it then says: srun: error: Invalid numeric
> > > value "8,4,2" for --cpus-per-task.)
> > 
> > AFAICT the $SLURM_CPUS_ON_NODE is not even *defined* on the submit or head
> > nodes, unless they are slurm execution nodes, and this ie evlauated on each
> > node. Please correct me right away if I am wrong!
> 
> Sorry wrong terminology, I meant the node that contains the process with
> rank 0, aka primary node, aka mother superior.
> In my scenario on that node:
> SLURM_CPUS_ON_NODE=8
> SLURM_JOB_CPUS_PER_NODE=8,4,2

Right, exactly. So we're on the same page on this, good. 

So this goes back to my original comment: we don't actually *care* how many cores are assigned or how this plays out in terms of counting the CPUs on a node. We just care about not running out of memory in step_0. Right? Using WHOLE or manipulating the number of CPUs on a node is just a workaround for not triggering whatever causes step_0 to fail. Experimentally, using this workaround allows the launch to proceed.

Marcin, please comment.
Comment 59 Maxime Boissonneault 2020-12-19 06:27:26 MST
I just want to report that we have confirmed that this *also* breaks Intel MPI when started with mpiexec. I suspect it breaks basically every MPI implementation unless you use srun.
Comment 60 Maxime Boissonneault 2020-12-19 06:28:12 MST
This basically means that the only way forward is to define SLURM_WHOLE=1 everywhere to basically revert the breaking change.
Comment 61 Marcin Stolarek 2020-12-20 15:29:27 MST
Luke,
>It seems like you reached for the wrong tool in your arsenal when you went for the existing --exclusive flag to solve (A). Is that right? Did you actually mean to implement ~'--no-overlap', and only accidentally also implemented ~'--no-whole'?

The changes are not accidental. When you think about it, it's quite natural to prevent a step from using all resources when you don't allow "overlapping". Think of an OpenMP app checking the number of available CPUs and starting that many threads. I see your point and I hope we'll find a solution that can satisfy your needs in the other bug.

Misha,
Alan,
As stated before, I can't be an authoritative source of information on the Open MPI process for prrte subproject inclusions. Based on the info we've got, it sounds like the patch you're referring to got merged[1], but that's more of an ompi question than a SchedMD/Slurm one.

Kenneth,
>To me, this whole situation is a very good example of why semantic versioning exists: if breaking changes are made, or if defaults are changed, you bump the major version number.
>With the current Slurm versioning scheme (which is essentially <year>.<month>), you totally lose that *excellent* and well understood mechanism of communicating breaking changes: if the major version number changes, it's a warning sign to pay close attention to what has changed, and how it may impact you.
We generally treat the first two numbers as the "name" of a major release. In that sense 19.05, 20.02, and 20.11 are major versions. Each has its own RELEASE_NOTES file containing information about important changes and introduced state file/protocol changes. Those are also major version numbers in terms of backward compatibility and our commercial support.

Alan,
>Using WHOLE or manipulating the number of cpus on a node is just a[...]
Using -c as an estimate (which will be wrong for non-uniform allocations) is the way to tell Slurm that, for the one task mpirun wants to start (--ntasks-per-node=1), we want all CPUs to be allocated. This has indirect consequences, such as getting the appropriate memory when it is requested as --mem-per-cpu.
I don't think it's a workaround; looking at it from the resource management perspective, it's just a more accurate step specification, isn't it?

Maxime,
Thanks for letting us know. We haven't yet updated our docs, but I already did an strace of Intel MPI's mpirun and confirmed that it relies on a similar --ntasks-per-node=1. I'm sure we'll include that info as a result of Bug 10453.

cheers,
Marcin

[1]https://github.com/open-mpi/ompi/pull/8296
Comment 62 Bart Oldeman 2020-12-21 11:55:46 MST
The Intel MPI issue isn't new: since it doesn't use --cpu-bind=none or SLURM_CPU_BIND=none, its mpirun has been problematic over the years, going all the way back to 2014, e.g.
https://bugs.schedmd.com/show_bug.cgi?id=1111
https://bugs.schedmd.com/show_bug.cgi?id=4268
https://bugs.schedmd.com/show_bug.cgi?id=7681
Comment 63 Tim Wickberg 2021-01-07 21:50:52 MST
We've been reviewing this change internally, and will be making an adjustment to the default behavior in 20.11.3 to resolve this and restore the "--whole" style allocation behavior to srun going forward.

For 20.11.{0,1,2}, setting SLURM_WHOLE=1 is the best workaround.

As further background behind this change: there was a customer request that the "--exclusive" srun option be made the default in 20.11, and this was done ahead of 20.11.0. Unfortunately some aspects of this had unforeseen impacts as have been discussed extensively on this ticket, most especially with external MPI stacks, and half of the functional changes described here have been reverted ahead of 20.11.3 to address this.

The --exclusive option (when used for step layout; no changes were made with respect to how that option works on job allocations) has had two orthogonal pieces:

- Controlling whether the job step is permitted to overlap on the assigned resources with other job steps. (The --overlap flag was introduced to opt-in to this, and the default behavior for 20.11 was changed and remains changed to providing non-overlapping allocations.)

- Restricting the step allocation to the minimum resources required, rather than permitting access to all resources assigned to the job on each node. (Access to all resources was made available through the --whole flag; see the illustration below.)
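As a concrete illustration of how the two pieces interact on 20.11.3 (the step commands are placeholders; the exact pending behavior depends on the configuration):

$ salloc -N1 -n4
# default: each step gets the whole node's job resources (--whole) but is
# non-overlapping, so a second concurrent step may wait for the first to finish
$ srun -n2 ./step_a &
$ srun -n2 ./step_b
# opting in to sharing resources between concurrently running steps:
$ srun -n2 --overlap ./step_a &
$ srun -n2 --overlap ./step_b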

The first change to non-overlapping behavior is what I believe was originally intended by that request, and that aspect remains the new default behavior going forward. That can be overridden by all steps in the job requesting --overlap, but we believe workflows that would intentionally desire such behavior to be rare in practice.

However, as noted here, the latter change unfortunately causes considerable issues with OpenMPI (and other MPI flavors). This also causes issues for simple uses of srun, such as "sbatch -N 1 -n 24 --wrap srun ./my application", which as a result of these changes would only see a single core, rather than the full 24 cores it would have been assigned before (on most systems, assuming CR_CORE and a few other defaults).

Given the large number of installed MPI stacks affected by that latter change, to say nothing of the number of job scripts potentially impacted, we have reverted that behavioral change ahead of 20.11.3.

The --whole behavior thus becomes the default step allocation for 20.11. The change in 20.11.3 compared to 20.02 is now limited to making step allocations non-overlapping (or "exclusive", although that term is loaded with baggage due to the complicated history of the --exclusive flag).

- Tim

(Commit bde072c607 is the relevant one, available on the slurm-20.11 / master branches now.)
Comment 64 Kilian Cavalotti 2021-01-08 08:27:39 MST
Hi Tim,

(In reply to Tim Wickberg from comment #63)
> We've been reviewing this change internally, and will be making an
> adjustment the the default behavior in 20.11.3 to resolve this, and restore
> the "--whole" style allocation behavior to srun going forward.

Thanks a lot for doing this, for taking the time to detail the rationale behind those changes, and distinguishing between the two features. This will certainly help many sites achieve a smooth transition to 20.11.

Cheers,
--
Kilian