Ticket 13339 - MPI job performance
Summary: MPI job performance
Status: RESOLVED DUPLICATE of ticket 13351
Alias: None
Product: Slurm
Classification: Unclassified
Component: Heterogeneous Jobs
Version: 21.08.5
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Director of Support
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2022-02-04 11:12 MST by Wei Feinstein
Modified: 2022-02-24 10:12 MST

See Also:
Site: LBNL - Lawrence Berkeley National Laboratory


Attachments
screenshot (86.67 KB, image/png)
2022-02-04 12:17 MST, Wei Feinstein

Description Wei Feinstein 2022-02-04 11:12:11 MST
Hello,

We just upgraded Slurm from 20.02 to 21.08 this week.

A number of users have contacted us about MPI application performance being very slow. I am reading the release notes as we speak.

Any ideas or suggestions?

Thank you,
Wei
Comment 1 Wei Feinstein 2022-02-04 12:17:08 MST
Created attachment 23298 [details]
screenshot
Comment 2 Wei Feinstein 2022-02-04 12:20:58 MST
Here is a sample of the job script:
#!/bin/bash -l

#########lbl##################
#SBATCH --account=nano
#SBATCH --job-name=S8-S8_hopping
#SBATCH --partition=etna

#SBATCH --time=12:00:00
#SBATCH --qos=normal

#SBATCH --nodes=4
#SBATCH --ntasks-per-node=24
#SBATCH --ntasks-per-core=1
#SBATCH --cpus-per-task=1
#SBATCH --hint=nomultithread
##SBATCH --mem=120GB

#SBATCH --output=job.o%j
#SBATCH --error=job.e%j
#SBATCH --exclusive

# load modules
module purge
module load gpaw/21.1.1b1

# executable
MPIR="mpirun gpaw python"

##############################
# run calculation
export OMP_NUM_THREADS=1
cd $SLURM_SUBMIT_DIR

$MPIR input.py > LOG

The problem is that only one of the nodes allocated to the job is running at 100%, while the others sit at 0-5% utilization:

The 3 nodes other than the 1st each have 24 MPI tasks running, but with very low %CPU.

Is it something to do with "exclusive/overlap"?

Thank you,
Wei
Comment 3 Michael Hinton 2022-02-04 12:49:37 MST
Hi Wei,

It could be related to how 20.11 changed --exclusive behavior with srun. Check the 20.11 release notes for more information (i.e., RELEASE_NOTES on the slurm-20.11 branch; the 20.11 release notes are not present in 21.08). For now, here is a summary of the changes:

For 20.11.3+, the default behavior for srun within an allocation is:

* Exclusive access to all resources it requests (srun --exclusive)
* All the resources of the job on the node (srun --whole)

These can be overridden by:

* srun --overlap (srun steps can overlap each other)
* srun --exact (only use exactly the CPU resources requested)

Remember: --exact and --overlap only affect CPUs. 
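
To illustrate (the application names and task counts below are placeholders, not taken from this ticket), inside a job allocation these options are used like this:

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=24

# 20.11.3+ default: the step gets exclusive access to the CPUs it requests
# and all of the job's resources on each node.
srun -n 48 ./mpi_app

# Let two steps share the allocation's CPUs at the same time:
srun -n 24 --overlap ./mpi_app &
srun -n 24 --overlap ./helper_app &
wait

# Give a step only the CPUs it explicitly asks for (here 4):
srun -n 4 --exact ./small_tool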

I will look into this some more.

Thanks,
-Michael
Comment 4 Michael Hinton 2022-02-04 13:03:26 MST
(In reply to Wei Feinstein from comment #2)
> Is it something to do with "exclusive/overlap"?
I believe so. I think the solution is to set SLURM_OVERLAP=1 in the batch script. mpirun calls srun internally, so this is the way to make that implicit srun call use --overlap for each step.
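
As a sketch (reusing the job script from comment 2), that amounts to exporting the variable before the mpirun line:

# Make the srun call that mpirun issues internally behave as --overlap
export SLURM_OVERLAP=1

export OMP_NUM_THREADS=1
cd $SLURM_SUBMIT_DIR

$MPIR input.py > LOG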
Comment 5 Wei Feinstein 2022-02-07 14:51:49 MST
Hi Michael,

When launching MPI jobs, it appears that the processes running on the non-head nodes (i.e., not the first node in SLURM_NODELIST) are limited to a single CPU when launched with mpirun (they don't utilize all the resources on those nodes).

I suggested setting SLURM_OVERLAP=1, but it doesn't seem to make any difference.

Anything I should try?

Thanks,
Wei
Comment 6 Michael Hinton 2022-02-07 15:06:27 MST
(In reply to Wei Feinstein from comment #5)
> I suggested setting SLURM_OVERLAP=1, but it doesn't seem to make any
> difference.
> 
> Anything I should try?
I'm not quite sure yet. However, there is another recent ticket (bug 13351) that seems to have the exact same symptoms as you see here. I'm going to look over both to see if there are any clues as to what's going on.
Comment 7 Michael Hinton 2022-02-07 16:19:37 MST
What version of MPI are you using? Is it only a problem with mpirun? Or does this also happen with srun?
Comment 8 Michael Hinton 2022-02-07 17:14:52 MST
Could you also try explicitly setting the task count (-n 96) to see if that fixes it? And could you run `scontrol show job <job-id>` on an affected job after you reproduce the issue?
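
For reference, a rough sketch of both suggestions (96 assumes the 4 x 24 task layout from comment 2; <job-id> is a placeholder):

# Explicit task count on the launcher line
mpirun -np 96 gpaw python input.py > LOG

# After reproducing the slowdown, capture the job's allocation details
scontrol show job <job-id>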
Comment 9 Wei Feinstein 2022-02-08 10:29:56 MST
Hi Michael,

The following line seems to be the culprit. 

--cpus-per-task=1

Once it is commented out, %CPU usage on all the nodes shoots up to 100%. This behavior is definitely new since the Slurm upgrade.

It is important to understand the underlying reasons. 

Can you please help? 

Thank you,

Wei
Comment 10 Michael Hinton 2022-02-08 11:02:17 MST
(In reply to Wei Feinstein from comment #9)
> The following line seems to be the culprit. 
> 
> --cpus-per-task=1
> 
> Once it is commented out, %CPU usage on all the nodes shoots up to 100%.
> This behavior is definitely new since the Slurm upgrade.
Great!

> It is important to understand the underlying reasons. 
I believe this is because --cpus-per-task now implies --exact in 21.08.

From RELEASE_NOTES:

" -- --cpus-per-task and --threads-per-core now imply --exact.
    This fixes issues where steps would be allocated the wrong number of CPUs."

And from https://slurm.schedmd.com/srun.html#OPT_cpus-per-task:

"-c, --cpus-per-task=<ncpus>
"...
"Explicitly requesting this option implies --exact. The default is one CPU per process and does not imply --exact."

Since the default is already 1 CPU per task, perhaps instruct users to remove `--cpus-per-task=1` if they have it in their scripts.
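
For example, the resource-request block from comment 2 would become (only the one line changes):

#SBATCH --nodes=4
#SBATCH --ntasks-per-node=24
#SBATCH --ntasks-per-core=1
##SBATCH --cpus-per-task=1    # removed; 1 CPU per task is already the default
#SBATCH --hint=nomultithread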

It's still not clear to me why this causes the behavior that you are seeing, though. Let me reproduce the issue and dig into it.

Do you only see this issue with mpirun, or does it also show up when mpirun is replaced with srun in a similar submission script?
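
Purely to illustrate that comparison (whether this works at your site depends on how Open MPI was built against Slurm's PMI/PMIx support), the launcher line would be swapped for something like:

# Hypothetical srun-based launch of the same step
srun gpaw python input.py > LOG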

Thanks,
-Michael
Comment 11 Michael Hinton 2022-02-08 11:25:32 MST
(In reply to Michael Hinton from comment #7)
> What version of MPI are you using? Is it only a problem with mpirun? Or does
> this also happen with srun?
Do you have the answers to these questions? That would help us track down the issue.
Comment 12 Wei Feinstein 2022-02-08 11:32:56 MST
srun has never been used within job submission scripts here; I think Slurm has to be compiled with certain flags to allow srun with sbatch. srun is used to request interactive nodes, but not within sbatch scripts.

Thank you for your assistance in getting to the bottom of this.

Wei
Comment 13 Michael Hinton 2022-02-08 11:40:55 MST
Are your users using OpenMPI? If so, what version? If not, what MPI library are they using?
Comment 14 Wei Feinstein 2022-02-08 12:33:26 MST
The user who benefited from removing --cpus-per-task=1 uses Open MPI, as below:
mpirun (Open MPI) 3.0.1

I know other users might be using different MPI implementations, including Intel MPI.

I hope the solution doesn't have to depend on the MPI version/vendor, etc.

Thank you,

Wei
Comment 15 Wei Feinstein 2022-02-08 13:02:50 MST
Hi Michael,

Here is another user's error. 

[1644004097.360247] [n0188:13379:0]         select.c:434  UCX  ERROR no active messages transport to <no debug data>: posix/memory - Destination is unreachable, sysv/memory - Destination is unreachable, self/memory - Destination is unreachable, sockcm/sockaddr - no am bcopy
Abort(1091215) on node 35 (rank 35 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(136)........:
MPID_Init(904)...............:
MPIDI_OFI_mpi_init_hook(1471): OFI get address vector map failed


More info about the LAMMPS application:

[siddharthsundararaman@n0000 K_atom1]$ which lmp
/global/software/sl-7.x86_64/modules/apps/ms/lammps/3Mar20-mpi/bin/lmp
[siddharthsundararaman@n0000 K_atom1]$ ldd /global/software/sl-7.x86_64/modules/apps/ms/lammps/3Mar20-mpi/bin/lmp
	linux-vdso.so.1 =>  (0x00007ffc101a2000)
	libmpicxx.so.12 => /global/software/sl-7.x86_64/modules/langs/intel/parallel_studio_xe_2020_update4_cluster_edition/compilers_and_libraries_2020.1.217/linux/mpi/intel64/lib/libmpicxx.so.12 (0x00002ad5c3dce000)
	libmpifort.so.12 => /global/software/sl-7.x86_64/modules/langs/intel/parallel_studio_xe_2020_update4_cluster_edition/compilers_and_libraries_2020.1.217/linux/mpi/intel64/lib/libmpifort.so.12 (0x00002ad5c3fee000)
	libmpi.so.12 => /global/software/sl-7.x86_64/modules/langs/intel/parallel_studio_xe_2020_update4_cluster_edition/compilers_and_libraries_2020.1.217/linux/mpi/intel64/lib/release/libmpi.so.12 (0x00002ad5c43ad000)
	libdl.so.2 => /lib64/libdl.so.2 (0x00002ad5c5530000)
	librt.so.1 => /lib64/librt.so.1 (0x00002ad5c5734000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00002ad5c593c000)
	libjpeg.so.62 => /lib64/libjpeg.so.62 (0x00002ad5c5b58000)
	libpng15.so.15 => /lib64/libpng15.so.15 (0x00002ad5c5dad000)
	libz.so.1 => /global/software/sl-7.x86_64/modules/langs/python/3.7/lib/libz.so.1 (0x00002ad5c3be5000)
	libfftw3.so.3 => /global/software/sl-7.x86_64/modules/intel/2020.1.024.par/fftw/3.3.8-intel/lib/libfftw3.so.3 (0x00002ad5c5fd8000)
	libfftw3_omp.so.3 => /global/software/sl-7.x86_64/modules/intel/2020.1.024.par/fftw/3.3.8-intel/lib/libfftw3_omp.so.3 (0x00002ad5c62e5000)
	libpython2.7.so.1.0 => /lib64/libpython2.7.so.1.0 (0x00002ad5c64ec000)
	libm.so.6 => /lib64/libm.so.6 (0x00002ad5c68b8000)
	libstdc++.so.6 => /global/software/sl-7.x86_64/modules/langs/python/3.7/lib/libstdc++.so.6 (0x00002ad5c3c07000)
	libiomp5.so => /global/software/sl-7.x86_64/modules/langs/python/3.7/lib/libiomp5.so (0x00002ad5c6bba000)
	libgcc_s.so.1 => /global/software/sl-7.x86_64/modules/langs/python/3.7/lib/libgcc_s.so.1 (0x00002ad5c3d7c000)
	libc.so.6 => /lib64/libc.so.6 (0x00002ad5c6fa4000)
	libfabric.so.1 => /global/software/sl-7.x86_64/modules/langs/intel/parallel_studio_xe_2020_update4_cluster_edition/compilers_and_libraries_2020.1.217/linux/mpi/intel64/libfabric/lib/libfabric.so.1 (0x00002ad5c7372000)
	/lib64/ld-linux-x86-64.so.2 (0x00002ad5c3baa000)
	libimf.so => /global/software/sl-7.x86_64/modules/langs/intel/parallel_studio_xe_2020_update4_cluster_edition/compilers_and_libraries_2020.1.217/linux/compiler/lib/intel64_lin/libimf.so (0x00002ad5c75b1000)
	libsvml.so => /global/software/sl-7.x86_64/modules/langs/intel/parallel_studio_xe_2020_update4_cluster_edition/compilers_and_libraries_2020.1.217/linux/compiler/lib/intel64_lin/libsvml.so (0x00002ad5c7cad000)
	libirng.so => /global/software/sl-7.x86_64/modules/langs/intel/parallel_studio_xe_2020_update4_cluster_edition/compilers_and_libraries_2020.1.217/linux/compiler/lib/intel64_lin/libirng.so (0x00002ad5c975f000)
	libintlc.so.5 => /global/software/sl-7.x86_64/modules/langs/intel/parallel_studio_xe_2020_update4_cluster_edition/compilers_and_libraries_2020.1.217/linux/compiler/lib/intel64_lin/libintlc.so.5 (0x00002ad5c9ac9000)
	libutil.so.1 => /lib64/libutil.so.1 (0x00002ad5c9d41000)
Comment 16 Michael Hinton 2022-02-08 13:07:36 MST
(In reply to Wei Feinstein from comment #15)
> Here is another user's error. 
> ...
Is this related to the issue at hand, or do you think it's a separate issue? If it's separate, would you mind opening up a new ticket?
Comment 17 Wei Feinstein 2022-02-08 13:13:35 MST
Hi Michael,

The 2nd user started having this problem after the Slurm upgrade. They both work in the same research group but use different MPI applications.

Please let me know if I need to open a separate ticket.

Thank you,
Wei
Comment 18 Michael Hinton 2022-02-08 13:15:18 MST
(In reply to Wei Feinstein from comment #17)
> The 2nd user started having this problem after the Slurm upgrade. They both
> work in the same research group but use different MPI applications.
Ok. Is the MPI version the same? Are there any Slurm errors? (I don't see anything in that output that I can help with.) Can you provide slurmctld error logs for that failed job?
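
As a sketch of what would help here (the log paths below are assumptions; the real locations are set by SlurmctldLogFile and SlurmdLogFile in slurm.conf):

# On the controller, pull any lines mentioning the failed job's ID
grep "<job-id>" /var/log/slurmctld.log

# On the compute nodes the job ran on, the slurmd log may also be useful
grep "<job-id>" /var/log/slurmd.log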
Comment 19 Michael Hinton 2022-02-08 13:16:05 MST
(In reply to Wei Feinstein from comment #17)
> Please let me know if I need to open a separate ticket.
Actually, yes, let's open a separate ticket to not muddy the waters of this ticket.
Comment 20 Wei Feinstein 2022-02-08 13:31:33 MST
Hi Michael,

Let me re-summarize one use case here:

When launching MPI jobs, it appears that the processes running on the non-head nodes (i.e., not the first node in SLURM_NODELIST) are limited to a single CPU when launched with mpirun (they don't utilize all the resources on those nodes).

Once the following line is removed, %CPU usage on all the nodes shoots up to 100%. This behavior is definitely new since the Slurm upgrade.

--cpus-per-task=1

The user was using openMPI as below:
mpirun (Open MPI) 3.0.1 

I asked them to add SLURM_OVERLAP=1 or mpirun -np xxx, but it didn't make any difference. Only removing --cpus-per-task=1 seems to bring performance back to what it was prior to the Slurm upgrade.


srun has never been used within job submission scripts here; I think Slurm has to be compiled with certain flags to allow srun with sbatch. srun is used to request interactive nodes, but not within sbatch scripts.

Let me know if you need any other information.

Thank you,
Wei
Comment 21 Michael Hinton 2022-02-21 15:00:13 MST
Hi Wei,

So we hope to fix this issue soon in bug 13351. Let me explain what is happening.

In 21.08.5, --threads-per-core and -c/--cpus-per-task began implying --exact (and --hint=nomultithread implies --threads-per-core=1, which in turn implies --exact). This messed up MPI jobs because internally they call something like `srun -nX ... orted ...`. But since --exact was implied, orted only got X cpus, instead of all of them. (See bug 13351 comment 49.)

We are planning on reverting --threads-per-core from implying --exact, which should fix the issue. --cpus-per-task will still imply --exact, so make sure that option isn't set when calling mpirun.

Thanks,
-Michael
Comment 22 Michael Hinton 2022-02-22 10:38:51 MST
Hi Wei,

This issue should now be fixed in the upcoming 21.08.6 release.

Thanks!
-Michael

*** This ticket has been marked as a duplicate of ticket 13351 ***
Comment 23 Wei Feinstein 2022-02-22 10:45:51 MST
Hi Michael,

When the upcoming 21.08.6 is released, what do I need to do? Install a patch?

Thank you,

Wei
Comment 24 Michael Hinton 2022-02-22 10:53:03 MST
(In reply to Wei Feinstein from comment #23)
> When the upcoming 21.08.6 is released, what do I need to do? Install a patch?
Just rebuild the 21.08.6 release of Slurm and restart the slurmctld and slurmds. That should fix the issue.
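
A rough sketch of that rebuild (the download URL, build prefix, and use of systemd units are assumptions about your setup):

# On a build host:
wget https://download.schedmd.com/slurm/slurm-21.08.6.tar.bz2
tar -xjf slurm-21.08.6.tar.bz2
cd slurm-21.08.6
./configure --prefix=/usr/local/slurm   # match your existing build options
make -j && make install

# Then restart the daemons:
systemctl restart slurmctld             # on the controller
systemctl restart slurmd                # on each compute node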

Thanks,
-Michael
Comment 25 Michael Hinton 2022-02-22 10:58:42 MST
Also, make sure users do not combine --cpus-per-task=X with mpirun, as this will imply --exact and cause the issue even in 21.08.6. However, --hint=nomultithread/SLURM_HINT=nomultithread and --threads-per-core=X will no longer imply --exact in 21.08.6, avoiding the issue when these are set.
Comment 26 Wei Feinstein 2022-02-22 13:08:28 MST
Hi Michael,

Thank you for the information.

Rebuilding the 21.08.6 release (a minor release) would be too much for our users, as it would require downtime. We will probably wait until our next upgrade.

What do you think?

Wei
Comment 27 Michael Hinton 2022-02-23 09:53:30 MST
Hi Wei,

I wanted to share some recent developments regarding this issue from bug 13351:

Now, as of 21.08.6 (commit https://github.com/SchedMD/slurm/commit/f6f6b4ff59), -c/--cpus-per-task will no longer imply --exact. --exact is what messes up mpirun and causes the single-cpu pinning. Now, mpirun behavior should be the same as it was in 21.08.4 and before, regardless of -c or --threads-per-core. See bug 13351 comment 76.

So please disregard what I said earlier about instructing users not to use -c/--cpus-per-task with MPI jobs. Users can freely use that option without messing up MPI jobs in <= 21.08.4 and 21.08.6.

(In reply to Wei Feinstein from comment #26)
> Rebuilding the 21.08.6 release (a minor release) would be too much for our
> users, as it would require downtime. We will probably wait until our next
> upgrade.
21.08.5 is a particularly bad version to squat on, because of this issue. We also always recommend running the latest minor version possible, because earlier minor versions are usually less stable.

Upgrading from 21.08.5 to 21.08.6 should not require significant downtime. As long as you are on the same major version (e.g. 21.08) you should be able to painlessly upgrade to any minor version. With minor version upgrades, there are no db conversions, since the database schema is unchanged. And the protocol and RPCs are all the same, too, so minor versions can all interoperate with each other.

What this means is that you should be able to recompile and install Slurm on 21.08.6 and just restart Slurm. You won't need to stop any running jobs or mark any nodes as down, or create any maintenance reservations.

Feel free to open up a new ticket with us to guide you through any minor upgrades, if that is something that you want to consider.

With that, I'll go ahead and mark this as a duplicate of bug 13351. Thanks!
-Michael

*** This ticket has been marked as a duplicate of ticket 13351 ***
Comment 28 Wei Feinstein 2022-02-24 00:34:58 MST
Hi Michael,

Thanks for the explanation. I now have a better understanding of what was going on with our MPI issues. 

The upgrade we just did might be the first time that we used the latest release, and it was my first time taking on a Slurm upgrade. Nothing goes easy. Since we are on 21.08.5, -c/--cpus-per-task should be avoided for now.

From what you described about upgrading from 21.08.5 to 21.08.6, it sounds like the upgrade can be done without downtime and without affecting running jobs.

When will 21.08.6 be released?

Thank you,
Wei
Comment 29 Michael Hinton 2022-02-24 10:12:25 MST
(In reply to Wei Feinstein from comment #28)
> Thanks for the explanation. I now have a better understanding of what was
> going on with our MPI issues. 
Great!

> The upgrade we just did might be the first time that we used the latest
> release, and it was my first time taking on a Slurm upgrade.
> Nothing goes easy. Since we are on 21.08.5, -c/--cpus-per-task should be
> avoided for now.
Yes, and also --threads-per-core or --hint=nomultithread.

> From what you described about upgrading from 21.08.5 to 21.08.6, it sounds
> like the upgrade can be done without downtime and without affecting running jobs.
Exactly. It should not be the ordeal you had to deal with recently when upgrading multiple major versions :)

> When will 21.08.6 be released?
If not today, then within the next few weeks!

-Michael