Ticket 13351 - core assignments broken with mpirun
Summary: core assignments broken with mpirun
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 21.08.5
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Director of Support
Duplicates: 13339 13474
 
Reported: 2022-02-07 12:03 MST by Kaylea Nelson
Modified: 2022-08-17 07:33 MDT

Site: Yale
Version Fixed: 21.08.6


Attachments
hello world output (6.71 KB, text/plain) - 2022-02-07 12:03 MST, Kaylea Nelson
slurm conf (8.65 KB, text/plain) - 2022-02-07 12:04 MST, Kaylea Nelson
21.05.5 v1 (1.60 KB, patch) - 2022-02-07 15:26 MST, Michael Hinton
slurm_out_job18 (6.39 KB, text/plain) - 2022-02-08 12:31 MST, Kaylea Nelson
slurmd and slurmctld logs for job 18 (4.81 KB, text/plain) - 2022-02-08 12:35 MST, Kaylea Nelson
slurm conf for the test cluster (8.37 KB, text/plain) - 2022-02-08 12:49 MST, Kaylea Nelson
output for 21.08.4 successful job (6.36 KB, text/plain) - 2022-02-08 13:37 MST, Kaylea Nelson
slurmd slurmctld 21.08.4 successful job (4.67 KB, text/plain) - 2022-02-08 13:37 MST, Kaylea Nelson
mpi hello world (912 bytes, text/x-csrc) - 2022-02-09 12:38 MST, Kaylea Nelson

Description Kaylea Nelson 2022-02-07 12:03:42 MST
Created attachment 23320 [details]
hello world output

Hi,

We just upgraded our Grace cluster to 21.08.5 and have encountered a new issue with "mpirun". When a job is submitted to multiple cores on multiple nodes, all the cores on the first node are allocated properly, but on every other node all of the tasks for that node are executed on only a single core. srun seems to be operating correctly (each task gets its own core).

We have clusters running 21.08.2 and 21.08.4, and on both of those clusters each task gets its own core regardless of "mpirun" vs "srun". I have also tested multiple versions of OpenMPI with the same results (2.1.2, 3.1.1, and 4.0.5).

I've attached the output of a simple hello world that demonstrates this.

Thanks,
Kaylea
Comment 1 Kaylea Nelson 2022-02-07 12:04:03 MST
Created attachment 23321 [details]
slurm conf
Comment 3 Kaylea Nelson 2022-02-07 14:33:32 MST
Looking through the release notes for 21.08.6, this sounds like our problem: 

Fix affinity of the batch step if batch host is different than the first
    node in the allocation.

Do you have a patch for this that we can apply?
Comment 4 Jason Booth 2022-02-07 14:45:59 MST
I think it is more likely you are running into a different set of changes. There are also some changes starting in 20.11 with --exclusive and srun. 

Here is a summary of the changes:

For 20.11.3+, the default behavior for srun within an allocation is:

* Exclusive access to all resources it requests (srun --exclusive)
* All the resources of the job on the node (srun --whole)

These can be overridden by:

* srun --overlap (srun can overlap each other)
* srun --exact (only use exactly the CPU resources requested)


You can use --overlap if your jobs need steps to overlap CPUs.
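
For illustration, a minimal sketch of how these flags combine inside a single allocation (the step programs are placeholders):

#!/bin/bash
#SBATCH -N 1
#SBATCH -n 4

# Two steps sharing the allocated CPUs:
srun --overlap -n 4 ./step_a &
srun --overlap -n 4 ./step_b &
wait

# A step restricted to exactly what it requests (1 task, 1 CPU):
srun --exact -n 1 -c 1 ./step_c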
Comment 5 Kaylea Nelson 2022-02-07 14:54:50 MST
Hi Jason,

I tried export SLURM_OVERLAP=1 but am still seeing that on the non-batch host nodes, all of the tasks are being pinned to one CPU. Things look fine on the batch host (one task per CPU).

Kaylea
Comment 6 Kaylea Nelson 2022-02-07 14:56:16 MST
Here is my submission script:

#!/bin/bash
#SBATCH -J 2020b
#SBATCH -p day
#SBATCH -N 3
#SBATCH --ntasks-per-node=10

export SLURM_OVERLAP=1
module load OpenMPI/4.0.5-GCC-10.2.0
module list

echo 'format: Hello from <node>:<cpu_id>, <task_number>'

echo "### with srun ###"
srun ./c_eb_ompi_405_gcc_ucx

echo "### with mpirun ###"
mpirun ./c_eb_ompi_405_gcc_ucx
Comment 7 Michael Hinton 2022-02-07 15:16:56 MST
(In reply to Kaylea Nelson from comment #3)
> Looking through the release notes for 21.08.6, this sounds like our problem: 
> 
> Fix affinity of the batch step if batch host is different than the first
>     node in the allocation.
> 
> Do you have a patch for this that we can apply?
Let me get a patch for you to try that will apply cleanly to 21.08.5.

Can you give us some more information on a job that fails?

    scontrol show job <job-id>

Let's see if the batch host appears to be different than the first node of the allocation for all these jobs.
Comment 8 Michael Hinton 2022-02-07 15:26:07 MST
Created attachment 23335 [details]
21.05.5 v1

Ok, v1 is the patch that corresponds with commit 538420ad4d5 in 21.08.6 (Fix affinity of the batch step if batch host is different than the first node in the allocation) - but without the NEWS entry. Can you give that a try and see if it fixes things?

Thanks!
-Michael
Comment 9 Kaylea Nelson 2022-02-07 15:28:16 MST
JobId=50789647 JobName=2020b
   UserId=kln26(11135) GroupId=support(11133) MCS_label=N/A
   Priority=49718 Nice=0 Account=admins QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:00:07 TimeLimit=01:00:00 TimeMin=N/A
   SubmitTime=2022-02-07T17:11:17 EligibleTime=2022-02-07T17:11:17
   AccrueTime=2022-02-07T17:11:17
   StartTime=2022-02-07T17:13:48 EndTime=2022-02-07T18:13:48 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-02-07T17:13:48 Scheduler=Backfill
   Partition=admintest AllocNode:Sid=grace2:194801
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=p08r01n44,p09r07n[11,24]
   BatchHost=p08r01n44
   NumNodes=3 NumCPUs=30 NumTasks=30 CPUs/Task=1 ReqB:S:C:T=0:0:*:1
   TRES=cpu=30,mem=150G,node=3
   Socks/Node=* NtasksPerN:B:S:C=10:0:*:1 CoreSpec=*
   JOB_GRES=(null)
     Nodes=p08r01n44 CPU_IDs=18-27 Mem=51200 GRES=
     Nodes=p09r07n11 CPU_IDs=0-5,12-15 Mem=51200 GRES=
     Nodes=p09r07n24 CPU_IDs=0,2,4-10,33 Mem=51200 GRES=
   MinCPUsNode=10 MinMemoryCPU=5G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/vast/palmer/home.grace/kln26/mpi_tests/test_for_schedmd.sub
   WorkDir=/vast/palmer/home.grace/kln26/mpi_tests
   StdErr=/vast/palmer/home.grace/kln26/mpi_tests/slurm-50789647.out
   StdIn=/dev/null
   StdOut=/vast/palmer/home.grace/kln26/mpi_tests/slurm-50789647.out
   Power=
Comment 10 Michael Hinton 2022-02-07 15:57:25 MST
(In reply to Kaylea Nelson from comment #0)
> We just upgraded our Grace cluster to 21.08.5 and have encountered a new
> issue with "mpirun"...
> We have clusters running 21.08.2 and 21.08.4 and on both of those clusters,
> each task gets its own core regardless of "mpirun" vs "srun".
Are you saying that this issue does not show up on 21.08.[2,4] but it does on 21.08.5?
Comment 11 Michael Hinton 2022-02-07 15:58:25 MST
And what was the Grace cluster running before you upgraded it to 21.08.5? 21.08.4? 20.11?
Comment 12 Michael Hinton 2022-02-07 16:14:52 MST
Can you attach relevant snippets of the slurmctld log and slurmd logs for each of the nodes for job 50789647?
Comment 14 Michael Hinton 2022-02-07 17:11:11 MST
If v1 does not work, could you try adding an explicit task count to the submission script (`-n30`) to see if that fixes the issue with mpirun?
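
For example, the script from comment 6 would just gain one line (a sketch):

#SBATCH -N 3
#SBATCH --ntasks-per-node=10
#SBATCH -n 30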
Comment 15 Kaylea Nelson 2022-02-08 12:23:51 MST
We installed 21.08.5 on our test cluster and have been able to reproduce the issue there. We tried deploying the v1 patch, but the problem persists. I also tried -n 32 on that cluster (2 nodes, 16 cores each), with the same result. I'm going to upload the pertinent snippets for a representative job.

We were running 20.11.8 on Grace prior to the upgrade last week. I tried the same test on two of our other clusters running 21.08.2 and 21.08.4, and there were no affinity issues on the non-batch host nodes (every task got its own CPU on every node).
Comment 16 Kaylea Nelson 2022-02-08 12:31:24 MST
Created attachment 23361 [details]
slurm_out_job18
Comment 17 Kaylea Nelson 2022-02-08 12:35:08 MST
Created attachment 23362 [details]
slurmd and slurmctld logs for job 18
Comment 18 Kaylea Nelson 2022-02-08 12:49:09 MST
Created attachment 23365 [details]
slurm conf for the test cluster
Comment 19 Michael Hinton 2022-02-08 12:50:11 MST
(In reply to Kaylea Nelson from comment #15)
> We installed 21.08.5 on our test cluster and have been able to reproduce the
> issue there.
> ...
> I tried the
> same test on two other of our clusters with 21.08.2 and 21.08.4 and there
> was no affinity issues on the non-batch host nodes (every task go its own
> cpu on every node).
Could you downgrade your test cluster to 21.08.4 and confirm that the issue goes away? If so, that will help narrow down the issue. If not, then there must be some difference between this cluster and your other clusters that is to blame.
Comment 20 Kaylea Nelson 2022-02-08 13:33:23 MST
We rolled the test cluster back to 21.08.4 and the issue went away. I'll upload relevant logs in case they are of use.
Comment 21 Kaylea Nelson 2022-02-08 13:37:24 MST
Created attachment 23368 [details]
output for 21.08.4 successful job
Comment 22 Kaylea Nelson 2022-02-08 13:37:54 MST
Created attachment 23369 [details]
slurmd slurmctld 21.08.4 successful job
Comment 23 Kaylea Nelson 2022-02-08 13:47:32 MST
Upon closer inspection, lesser problems continue with 21.08.4 on the test cluster, but they are a bit different. Some CPUs are reused for multiple tasks, but a larger range of CPUs is used. Oddly, this new issue is no longer limited to the non-batch host; we are seeing reused CPUs on both nodes now.
Comment 24 Kaylea Nelson 2022-02-08 14:58:02 MST
More updates. Our production cluster running 21.08.4 is behaving completely correctly with mpirun. We have found it is actually very easy to see the issue using the --report-bindings flag on mpirun. When things work properly, every rank is bound. With 21.08.5 on Grace or the test cluster, the batch host has bindings but the non-batch host nodes have none.

We are looking into why our test cluster on .4 is behaving differently from the production cluster running .4.
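
For reference, the check is as simple as adding the flag to the mpirun line in the submission script, e.g.:

mpirun --report-bindings ./c_eb_ompi_405_gcc_ucx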
Comment 26 Kaylea Nelson 2022-02-08 15:53:19 MST
We downgraded 2 compute nodes to 21.08.4 on the Grace cluster (the one originally exhibiting the issue) and *the issue went away*. We left the server and database on 21.08.5 for this test, since they are still serving the rest of the nodes, which are also still on 21.08.5. So we feel pretty confident that this is an issue introduced in 21.08.5.


# Using mpirun clients/server 21.08.5/21.08.5 
[p08r02n21.grace.hpc.yale.internal:40919] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [./B][]
[p08r02n21.grace.hpc.yale.internal:40919] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/.][]
[p08r02n24.grace.hpc.yale.internal:170940] MCW rank 3 is not bound (or bound to all available processors)
[p08r02n24.grace.hpc.yale.internal:170939] MCW rank 2 is not bound (or bound to all available processors)
p08r02n21.grace.hpc.yale.internal core 2 (1/4)
p08r02n21.grace.hpc.yale.internal core 0 (0/4)
p08r02n24.grace.hpc.yale.internal core 0 (3/4)
p08r02n24.grace.hpc.yale.internal core 0 (2/4)

# Using mpirun clients/server 21.08.4/21.08.5
[aam233@grace1:~/test/mpi] cat slurm-50894200.out
[p08r02n21.grace.hpc.yale.internal:42623] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/.][]
[p08r02n21.grace.hpc.yale.internal:42623] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [./B][]
[p08r02n24.grace.hpc.yale.internal:172625] MCW rank 2 bound to socket 0[core 0[hwt 0]]: [B/.][]
[p08r02n24.grace.hpc.yale.internal:172625] MCW rank 3 bound to socket 0[core 1[hwt 0]]: [./B][]
p08r02n21.grace.hpc.yale.internal core 2 (1/4)
p08r02n21.grace.hpc.yale.internal core 0 (0/4)
p08r02n24.grace.hpc.yale.internal core 0 (2/4)
p08r02n24.grace.hpc.yale.internal core 2 (3/4)
Comment 27 Michael Hinton 2022-02-08 16:08:26 MST
The most likely candidate for the difference between 21.08.4 and 21.08.5 is commit https://github.com/SchedMD/slurm/commit/6e13352fc2:

"Fix srun -c and --threads-per-core imply --exact

"In 21.08, srun --cpus-per-task and threads-per-core were supposed to
imply --exact. This worked when that new behavior was introduced, but
was regressed by commit 5154ed2 before 21.08 was released. That
commit did not handle setting --exact in step_req if -c was also set."

The issue fixed by this commit affected 21.08.0 - 21.08.4, but not 20.11.3+.

...Except your problem jobs in 21.08.5 don't seem to use any of these options (or do they?). But perhaps OpenMPI uses these options internally when calling srun.

Would you be willing to test 21.08.5 with commit 6e13352fc2 reverted, to see if that is the main difference? In the meantime, I will work on creating a local reproducer with OpenMPI 4.0.5.

Do you have other flavors of MPI that you can also test to see if they are affected?
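
If it helps, one way to build such a test version might look roughly like this (a sketch only; adjust the tag, install prefix, and configure options to your local build recipe, and the revert may need minor conflict resolution):

git clone https://github.com/SchedMD/slurm.git && cd slurm
git checkout slurm-21-08-5-1
git revert --no-edit 6e13352fc2
./configure --prefix=/opt/slurm/21.08.5-no-6e13352 && make -j && make install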
Comment 28 Michael Hinton 2022-02-09 12:34:29 MST
What mechanism does c_eb_ompi_405_gcc_ucx use to get the CPU id? MPI_Get_processor_name()?
Comment 29 Kaylea Nelson 2022-02-09 12:37:28 MST
I use sched_getcpu(). I've attached the program.
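
(For what it's worth, a quick way to sanity-check the affinity mask itself without MPI might be something like the line below; taskset comes from util-linux, and SLURM_PROCID is set by srun for each task.)

srun bash -c 'echo "$(hostname) task ${SLURM_PROCID}: allowed CPUs $(taskset -cp $$ | cut -d: -f2)"'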
Comment 30 Kaylea Nelson 2022-02-09 12:38:00 MST
Created attachment 23391 [details]
mpi hello world
Comment 31 Kaylea Nelson 2022-02-10 12:10:09 MST
OpenMPI 4.0.5 seems to be fixed when we revert commit 6e13352fc2. Thanks!

However, we are still having problems with our other OpenMPI versions (2.1.2 and 3.1.1).
They bind each rank to every core on one of the sockets.
- If every core assigned on a particular node is on a single socket, --report-bindings reports:
MCW rank 4 is not bound (or bound to all available processors) 
- If the cores assigned span two sockets, --report-bindings reports bindings, such as:
[c19n01.farnam.hpc.yale.internal:28584] MCW rank 12 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B][./.]
[c19n01.farnam.hpc.yale.internal:28584] MCW rank 13 bound to socket 1[core 2[hwt 0]], socket 1[core 3[hwt 0]]: [./.][B/B]

If I try to force "--bind-to core", I get the following error:

--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to:     CORE
   Node:        p08r01n27
   #processes:  2
   #cpus:       1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--------------------------------------------------------------------------

This behavior also happens on our cluster running 21.08.2.

Any ideas?
Comment 33 Michael Hinton 2022-02-10 15:12:17 MST
(In reply to Kaylea Nelson from comment #31)
> OpenMPI 4.0.5 seems to be fixed when we revert commit 6e13352fc2. Thanks!
Great!

> However, we are still having problems with our other OpenMPIs (2.1.2 and
> 3.1.1).  
> Its binding each rank to every core on one of the sockets.
> ...
> This behavior also happens on our cluster running 21.0.8.2. 
> 
> Any ideas?
Did this same issue happen on 20.11.3+?
Comment 34 Michael Hinton 2022-02-10 16:31:10 MST
Can you also give us more information on:

* what PMI[x] libraries Slurm is configured with
* if you are using a job_submit plugin, and if that changes --threads-per-core, --cpus-per-task, or --hint=nomultithread
Comment 35 Kaylea Nelson 2022-02-10 20:13:02 MST
To answer your questions:
- We use PMI2.
- We have SLURM_HINT=nomultithread set on Grace. None of the others are set in job_submit or the default environment. However, we see all of the behavior below on our clusters without that hint as well.

I did some experimenting and I think the discrepancy between OpenMPI 3.1.1 and 4.0.5 has to do with mpirun's bind-to and map-by defaults.

- If I run any version of OpenMPI outside of Slurm (just ssh to a node), mpirun uses bind-to numa, map-by numa. This makes sense because this is the default behavior for OpenMPI if you have more than two tasks.

- If I run inside sbatch with simple parameters of -N1 -n <anything greater than 2>, they start to deviate. What I don't know is which one is the expected behavior inside Slurm.
-- 3.1.1 remains at numa/numa
-- 4.0.5 becomes core/core

- If I try to force `--bind-to core` (in an attempt to make 3.1.1 behave more like 4.0.5):
-- 3.1.1:
   a) on a job with cores all on one socket, this runs as core/numa
   b) on a job with cores that span two sockets, this fails with the error in my previous comment
-- 4.0.5: remains core/core, since it was already binding this way

Ways I have gotten 3.1.1 with `--bind-to core` across two sockets to work (case b above); see the sketch after this list:
-- if I request every core on the node, 3.1.1b works (runs successfully as core/numa)
-- if I set "-c", 3.1.1b works (runs successfully as core/numa)
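
A sketch of the second workaround, for concreteness (the binary name here is a placeholder for a 3.1.1 build of the hello world):

#!/bin/bash
#SBATCH -N 2
#SBATCH --ntasks-per-node=10
#SBATCH -c 1

mpirun --bind-to core --report-bindings ./c_eb_ompi_311_gcc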

So my questions are:
- Why does OpenMPI 4.0.5 mpirun run as bind-to core, map-by core under Slurm but the older versions don't? Which way is the expected behavior?
- Why does --bind-to core fail across multiple sockets with older versions of OpenMPI unless I set -c1 or request all of the cores on the node?

Thanks!
Comment 36 Michael Hinton 2022-02-11 10:59:23 MST
(In reply to Kaylea Nelson from comment #35)
> So my questions are:
> - Why does OpenMPI 4.0.5 mpirun run as bind-to core, map-by core under Slurm
> but the older versions don't? Which way is the expected behavior?
> 
> - Why does --bind-to core fail across multiple sockets with older versions
> of OpenMPI unless I set -c1 or request all of the cores on the node?
Scanning OpenMPI's NEWS file, perhaps it has something to do with these commits:

4.0.5 -- August, 2020
---------------------
...
- Disable binding of MPI processes to system resources by Open MPI
  if an application is launched using SLURM's srun command.
...
- Fix a problem with mpirun when the --map-by option is used.
  Thanks to Wenbin Lyu for reporting.

I think the first one is commit https://github.com/open-mpi/ompi/commit/c72f295dfa; I'm not sure what commit the second one is associated with, but it seems less likely to be relevant. It's hard to tell if the first one applies in the case of an mpirun under sbatch in a Slurm allocation.

I'll keep looking into it, but this may be something to report to the OpenMPI guys directly. I would also check to make sure this isn't something fixed in the latest versions of OpenMPI 4.x, but NEWS doesn't seem to show any changes along those lines.

-Michael
Comment 37 Michael Hinton 2022-02-15 17:48:52 MST
(In reply to Kaylea Nelson from comment #35)
> - we have SLURM_HINT=nomultithread set on Grace.
Did you set this for all jobs, and if so, how? This implies --threads-per-core=1, which implies --exact. This might be why reverting commit 6e13352fc2 fixed things for you. See https://github.com/SchedMD/slurm/blob/slurm-21-08-5-1/src/common/proc_args.c#L784-L789.

Regarding your other questions - for the life of me, I can't reproduce the single-cpu pinning on the non-batch node with mpirun, even with a true multi-node, multi-socket setup. I'll keep trying, but feel free to add more information or another example of the issue.

I want to first focus on the issue you are seeing with an unmodified 21.08.5; the other issues can be addressed after. Or, if they are distinctly different and need attention, feel free to open up a separate ticket to work on them in parallel. That will prevent muddying up the waters here.

Thanks!
-Michael
Comment 39 Michael Hinton 2022-02-16 12:00:55 MST
Ok Kaylea, we are able to reproduce the issue. The key was setting `SLURM_HINT=nomultithread` for the sbatch process (`SLURM_HINT=nomultithread sbatch 13351.batch`). This implies --threads-per-core=1, which implies --exact. Another way to reproduce it is to set `export SLURM_EXACT=1` in the batch script.
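
For the record, a minimal reproducer along these lines looks roughly like this (a sketch; the script and program names are placeholders, and any multi-node MPI hello world should do):

#!/bin/bash
#SBATCH -N 2
#SBATCH --ntasks-per-node=10

export SLURM_EXACT=1                    # or submit with: SLURM_HINT=nomultithread sbatch repro.sh
mpirun --report-bindings ./mpi_hello    # ranks on the non-batch host end up pinned to a single CPU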

I still need to look into why this only happens for mpirun, and not for srun.

As for why the batch node is OK with mpirun but the other nodes are not: it appears that mpirun reuses the batch step on the batch node (which has access to the whole node) instead of creating its own new step. This might be a separate issue altogether.

We'll keep looking into it. In the meantime, we recommend using srun instead of mpirun where possible.

Thanks,
-Michael
Comment 45 Jason Booth 2022-02-18 12:52:57 MST
*** Ticket 13474 has been marked as a duplicate of this ticket. ***
Comment 46 Kaylea Nelson 2022-02-18 12:59:57 MST
Thanks for your continued work on this.

I'll take up the different behavior between OpenMPI 4 and <4 with OpenMPI and see what they say.

I wonder if the "bind-to core" failure with older OpenMPI versions is related to the initial issue, since it resolves with the use of -c 1. I'll leave it here for now and test whether the issue persists once the primary affinity issue is fixed. If it does, I'll submit it as a separate ticket.
Comment 49 Michael Hinton 2022-02-18 15:31:47 MST
So, we've got a fix pending review that will prevent --threads-per-core from implying --exact. This should fix the OpenMPI issue in 21.08.5 where tasks are pinned to a single CPU. Let me explain why that was happening by comparing vanilla srun vs. mpirun.

Let's assume we launch a batch script with SLURM_EXACT=1:

SLURM_EXACT=1 sbatch script.batch

#!/bin/bash
#SBATCH -N 2
#SBATCH --ntasks-per-node=10
...
srun <prog>
mpirun <prog>

Since SLURM_EXACT=1 for sbatch, srun inherits that and implicitly does --exact. srun also inherits the node and task counts (-N2, --ntasks-per-node=10). So with --exact, srun gets exactly what it requests - two nodes and 20 tasks-worth of cores. But this just happens to be the same as getting the whole allocation. So there is little difference between --exact and no --exact in this case.

mpirun internally execve()'s something like `srun --ntasks=2 orted ...` (for one orted on each node). In this case, there is a *big* difference between --exact and no --exact, because the # of tasks is getting overwritten by mpirun. So with --exact, srun now only gets 2 tasks-worth of the job allocation. Now, each orted only has access to 1 core to spawn 10 tasks on. This is why there is single-cpu pinning.
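
Roughly, the two effective step requests inside the allocation look like this (a sketch; the actual orted arguments vary by OpenMPI version):

# srun in the batch script inherits the job geometry:
srun --exact -N2 --ntasks-per-node=10 <prog>   # 20 tasks -> 20 cores, i.e. effectively the whole allocation
# mpirun's internal launcher overrides the task count:
srun --exact --ntasks=2 orted ...              # 2 tasks -> 2 cores total, for 20 MPI ranks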

I believe that this issue will likely affect other MPI libraries besides OpenMPI.
Comment 70 Michael Hinton 2022-02-22 10:34:12 MST
Kaylea,

This has been fixed in the upcoming 21.08.6 release with commits https://github.com/SchedMD/slurm/compare/0f6a61dcf9df...08ad49515fb6. Thanks for bringing this to our attention!

I'm going to go ahead and mark this ticket as resolved. Please open up new tickets for your other MPI issues if you still need us to look at them.

Thanks!
-Michael
Comment 71 Kaylea Nelson 2022-02-22 10:38:29 MST
Hi Michael,

Great, thanks! Just to clarify: with the fix in 21.08.6, is it still a problem to specify -c X when using mpirun? I ask because I know some of our users run hybrid MPI/OpenMP jobs, and that has traditionally been the setup.

Thanks,
Kaylea
Comment 72 Michael Hinton 2022-02-22 10:38:52 MST
*** Ticket 13339 has been marked as a duplicate of this ticket. ***
Comment 73 Michael Hinton 2022-02-22 10:55:14 MST
(In reply to Kaylea Nelson from comment #71)
> Great thanks! Just to clarify, with the fix in 21.08.6 is it still a problem
> to specify -c X when using mpirun? I ask because I know some of our users
> run hybrid MPI/OpenMP jobs and that has been traditionally the setup.
Yes. If you have -c/--cpus-per-task set, this will still imply --exact, and you'll get the same issue whenever you use mpirun.

-Michael
Comment 76 Michael Hinton 2022-02-23 09:29:24 MST
Kaylea, sorry for the commotion, but we have some more updates:

Now, as of 21.08.6 (commit https://github.com/SchedMD/slurm/commit/f6f6b4ff59), -c/--cpus-per-task will no longer imply --exact. --exact is what messes up mpirun and causes the single-cpu pinning. mpirun behavior should now be the same as it was in 21.08.4 and earlier, regardless of -c or --threads-per-core. So disregard comment 73. -c can be set for sbatch/salloc without breaking mpirun.

In 22.05, -c will imply --exact, but a -c set by sbatch/salloc will no longer be inherited by srun. This should keep MPI happy and not break existing MPI programs (since mpirun doesn't use -c when calling srun internally).
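
So, for example, a hybrid MPI/OpenMP script of the usual shape should keep working with mpirun on 21.08.6 (a sketch; the program name and counts are placeholders):

#!/bin/bash
#SBATCH -N 2
#SBATCH --ntasks-per-node=4
#SBATCH -c 5                                  # CPUs per MPI rank, for the OpenMP threads

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
mpirun ./hybrid_prog                          # with this fix, -c no longer implies --exact for mpirun's internal srun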

Thanks!
-Michael
Comment 77 Kaylea Nelson 2022-02-23 09:31:59 MST
Thanks for the update, that's great news! We look forward to 22.05.
Comment 78 Michael Hinton 2022-02-23 09:53:30 MST
*** Ticket 13339 has been marked as a duplicate of this ticket. ***
Comment 79 Michael Hinton 2022-02-24 12:29:32 MST
*** Ticket 13474 has been marked as a duplicate of this ticket. ***
Comment 80 Skyler Malinowski 2022-08-17 07:33:53 MDT
*** Ticket 14751 has been marked as a duplicate of this ticket. ***