Bug 10413 - MPI allocation regression with mpirun
Summary: MPI allocation regression with mpirun
Status: RESOLVED DUPLICATE of bug 10383
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 20.11.0
Hardware: Linux
OS: Linux
Severity: 3 - Medium Impact
Assignee: Marcin Stolarek
 
Reported: 2020-12-09 19:02 MST by Kilian Cavalotti
Modified: 2021-01-04 07:59 MST
CC List: 2 users

Site: Stanford
Machine Name: Sherlock


Description Kilian Cavalotti 2020-12-09 19:02:24 MST
Hi there, 

It looks like since our upgrade to 20.11, we've had a lot of users reporting massive slowdowns for their MPI jobs. It turns out that MPI applications started with mpirun rather than srun are not allocated resources correctly. The main visible effect is that processes are all pinned to CPU 0 on the nodes, leading to very poor performance. Things work fine when MPI codes are launched with srun.

Here's a simple example to illustrate the issue:

$ salloc -p test -N 2 --tasks-per-node=4
salloc: Pending job allocation 13283048
salloc: job 13283048 queued and waiting for resources
salloc: job 13283048 has been allocated resources
salloc: Granted job allocation 13283048
salloc: Waiting for resource configuration
salloc: Nodes sh03-01n[71-72] are ready for job

$ time srun -l bash -c 'printf "%s | CPU: %s (pid: %s)\n" $(hostname) $(ps -h -o psr,pid $$)'
0: sh03-01n71.int | CPU: 0 (pid: 21996)
1: sh03-01n71.int | CPU: 1 (pid: 21997)
2: sh03-01n71.int | CPU: 2 (pid: 21998)
3: sh03-01n71.int | CPU: 3 (pid: 21999)
4: sh03-01n72.int | CPU: 0 (pid: 20374)
5: sh03-01n72.int | CPU: 1 (pid: 20375)
6: sh03-01n72.int | CPU: 2 (pid: 20376)
7: sh03-01n72.int | CPU: 3 (pid: 20377)

real    0m0.344s
user    0m0.073s
sys     0m0.012s

$ ml openmpi/4.0.3
$ time mpirun bash -c 'printf "%s | CPU: %s (pid: %s)\n" $(hostname) $(ps -h -o psr,pid $$)'
sh03-01n71.int | CPU: 0 (pid: 22107)
sh03-01n72.int | CPU: 0 (pid: 20452)
sh03-01n71.int | CPU: 0 (pid: 22108)
sh03-01n71.int | CPU: 0 (pid: 22110)
sh03-01n72.int | CPU: 0 (pid: 20453)
sh03-01n71.int | CPU: 0 (pid: 22112)
sh03-01n72.int | CPU: 0 (pid: 20455)
sh03-01n72.int | CPU: 0 (pid: 20457)

real    0m11.438s
user    0m0.035s
sys     0m0.062s

See how all the ranks are properly dispatched to different CPUs with srun, and how they all run on CPU 0 with mpirun.

There's also a 10x difference in startup times between srun and mpirun, which I don't recall seeing in previous Slurm versions.
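
For reference, one way to double-check the actual affinity masks, rather than the instantaneous CPU reported by ps, is with taskset (a sketch only, assuming taskset from util-linux is available on the compute nodes):

$ srun -l bash -c 'taskset -cp $$'    # each task should report its own CPU
$ mpirun bash -c 'taskset -cp $$'     # on 20.11, each task's mask collapses to a single CPU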


Our MPI installations have not changed since before moving to 20.11, and jobs using mpirun used to work fine with 20.02.

Happy to provide additional details to help debug this!

Thanks,
--
Kilian
Comment 3 Marcin Stolarek 2020-12-10 10:25:58 MST
Kilian,

I think you know that's kind of out of scope for us, but could you please check an Open MPI build with the following patch applied to the prrte component:

>./orte/mca/plm/slurm/plm_slurm_module.c
>@@ -267,6 +267,11 @@ static void launch_daemons(int fd, short args, void *cbdata)
>     /* start one orted on each node */
>     opal_argv_append(&argc, &argv, "--ntasks-per-node=1");
> 
>+    /* add all CPUs to this task */
>+    cpus_on_node = getenv("SLURM_CPUS_ON_NODE");
>+    asprintf(&tmp, "--cpus-per-task=%s", cpus_on_node);
>+    opal_argv_append(&argc, &argv, tmp);
>+
>     if (!orte_enable_recovery) {
>         /* kill the job if any orteds die */
>         opal_argv_append(&argc, &argv, "--kill-on-bad-exit");

The above snippet shows the final location of the file in the openmpi tar.gz (after the prrte subproject inclusion).
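
A test build might look roughly like this (a sketch only: the patch file name and install prefix are placeholders, and the hunk above would first need to be saved as a proper unified diff against orte/mca/plm/slurm/plm_slurm_module.c):

$ tar xf openmpi-4.0.3.tar.gz && cd openmpi-4.0.3
$ patch -p0 < ../plm_slurm_cpus.patch     # placeholder name for the unified diff of the hunk above
$ ./configure --prefix=$HOME/sw/openmpi/4.0.3-patched --with-slurm --with-pmi
$ make -j8 && make install
$ # then point the openmpi module / PATH / LD_LIBRARY_PATH at the new prefix and re-run the mpirun test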

cheers,
Marcin
Comment 4 Kilian Cavalotti 2020-12-10 10:36:24 MST
Hi Marcin, 

(In reply to Marcin Stolarek from comment #3)
> I think you know that's kind of out of scope for us, but could you please
> check openmpi build with the following patch applied to prrte component:

Thanks for the suggestion!

I opened this bug here because that same Open MPI installation (4.0.3) worked fine with either srun or mpirun in Slurm 20.02, but it doesn't anymore in Slurm 20.11. 

If a bug were present in Open MPI, it would certainly present itself the same way in Slurm 20.02 and 20.11, right? Since it didn't in 20.02, I'd tend to think that something changed in 20.11 in this regard, and that's what I'm curious about.


Cheers,
-- 
Kilian
Comment 5 Kilian Cavalotti 2020-12-10 15:41:28 MST
Hi Marcin, 

I gave your suggestion a try (adding --cpus-per-task to the srun command that starts the orted daemons on the nodes), and it looks like it works:

$ time mpirun --mca plm_slurm_args '--cpus-per-task=4' bash -c 'printf "%s | CPU: %s (pid: %s)\n" $(hostname) $(ps -h -o psr,pid $$)'
sh03-01n72.int | CPU: 2 (pid: 27828)
sh03-01n72.int | CPU: 1 (pid: 27827)
sh03-01n71.int | CPU: 0 (pid: 17352)
sh03-01n72.int | CPU: 0 (pid: 27826)
sh03-01n71.int | CPU: 1 (pid: 17353)
sh03-01n71.int | CPU: 2 (pid: 17354)
sh03-01n72.int | CPU: 3 (pid: 27829)
sh03-01n71.int | CPU: 3 (pid: 17355)

real    0m46.684s
user    0m0.013s
sys     0m0.040s

But recompiling Open MPI to add this workaround won't really fly with our users, I'm afraid. :\
So I'm still looking for what in 20.11 may have introduced that behavior change.
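
(Just a thought, untested here: since the extra argument in the prrte patch is built from SLURM_CPUS_ON_NODE, the same setting can in principle be injected without a rebuild, per job or per user, through Open MPI's MCA environment variable rather than the --mca command-line option:)

$ export OMPI_MCA_plm_slurm_args="--cpus-per-task=${SLURM_CPUS_ON_NODE}"
$ time mpirun bash -c 'printf "%s | CPU: %s (pid: %s)\n" $(hostname) $(ps -h -o psr,pid $$)'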

Thanks!
--
Kilian
Comment 6 Marcin Stolarek 2020-12-11 01:45:36 MST
Kilian,

The thing is that we changed the default for job steps to be '--exclusive'. The patch I shared is not just a workaround; it got merged into prrte[1].
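
To see the new default in isolation (a quick sketch against an allocation like the one in the description; what shows up in the mask depends on the TaskPlugin in use): a one-task-per-node step, which is exactly what mpirun uses to launch the orted daemons, is now confined to a single CPU per node, and every rank forked by orted inherits that mask.

$ # inside the salloc -N 2 --tasks-per-node=4 allocation
$ srun --ntasks-per-node=1 grep Cpus_allowed_list /proc/self/status
$ # with the 20.11 exclusive-step default each node reports a single allowed CPU,
$ # whereas on 20.02 the step could see all CPUs allocated to the job on that node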

Please take a look at Bug 10383 comment 15 - it covers the complexity of the case.

I'll go ahead and close this bug as a duplicate now.

cheers,
Marcin
[1]https://github.com/openpmix/prrte/commit/0288ebbc15c36e1d3c32f6d12c47237053e06101

*** This bug has been marked as a duplicate of bug 10383 ***
Comment 7 Kilian Cavalotti 2020-12-11 08:51:17 MST
(In reply to Marcin Stolarek from comment #6)
> Kilian,
> 
> The thing is that we changed the default for steps being '--exclusive'. 

Ah, so that's coming from a Slurm change indeed. :)
But wow, that's a pretty steep direction reversal. I did find the mention of that change (4eccd2f9e) in the RELEASE_NOTES, but didn't realize the kind of impact it would have.

Out of curiosity, what was the rationale for changing the default behavior?

> The
> patch I shared is not only a workaround, it got merged to prrte[1]
> 
> Please take a look at Bug 10383 comment 15 - it covers the complexity of the
> case.
> 
> I'll go ahead and close the bug as duplicate now.

Got it, thanks for pointing all this out!

Cheers,
-- 
Kilian