Bug 7232 - Shared Memory Segments not removed at job termination
Summary: Shared Memory Segments not removed at job termination
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 18.08.7
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Nate Rini
 
Reported: 2019-06-12 01:41 MDT by Ole.H.Nielsen@fysik.dtu.dk
Modified: 2019-06-12 13:30 MDT
CC: 2 users

Site: DTU Physics


Attachments
slurm.conf (5.50 KB, text/plain)
2019-06-12 01:41 MDT, Ole.H.Nielsen@fysik.dtu.dk

Description Ole.H.Nielsen@fysik.dtu.dk 2019-06-12 01:41:34 MDT
Created attachment 10586
slurm.conf

With our new Xeon Skylake nodes we experience many job failures with an error message:

getshmem_C in getshmem.c: cannot create shared segment 8
No space left on device 

Using "ipcs -a" it turns out that there are 4096 allocated Shared Memory Segments which have not been not removed from previous jobs.  All new jobs will therefore fail when trying to allocate new segments.  I notice that many nodes have these orphaned Shared Memory Segments, but only a subset of nodes actually hit the system limit of 4096.

I believe the segments are created by the widely used code "VASP" which is built with Intel compilers, MKL and MPI.  Our nodes are running CentOS 7.6.

When a job terminates, I would expect slurmd to clean up all resources including Shared Memory Segments, for example, using the "ipcrm" command.  This does not seem to be the case.

Question: Should slurmd perform this cleanup?  If not, can you suggest a solution to this problem?

Thanks,
Ole
Comment 1 Nate Rini 2019-06-12 09:12:27 MDT
(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #0)
> Question: Should slurmd perform this cleanup?  If not, can you suggest a
> solution to this problem?
Slurm does not perform any cleanup of SysV IPC memory. This can be done via a job epilog/prolog script. The kernel also has an option to have SysV memory accounted for like other memory:
> sysctl kernel.shm_rmid_forced=1
This will cause the SysV IPC memory to be released when the last attached process exits, which may cause application issues depending on the memory usage.

It is also possible to tell the kernel to limit the shared memory usage per process/namespace:
> sysctl kernel.shmmax=0
This amount should then be subtracted from the available memory on the nodes in slurm.conf to avoid any OOM events.
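
For example, whichever setting is chosen could be made persistent on the nodes roughly like this (the file name under /etc/sysctl.d is only illustrative):

  # Illustrative sketch, run as root: persist the chosen setting and
  # reload the sysctl configuration files.
  echo 'kernel.shm_rmid_forced = 1' > /etc/sysctl.d/90-sysv-shm.conf
  sysctl --system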

--Nate
Comment 2 Ole.H.Nielsen@fysik.dtu.dk 2019-06-12 11:44:14 MDT
(In reply to Nate Rini from comment #1)
> (In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #0)
> > Question: Should slurmd perform this cleanup?  If not, can you suggest a
> > solution to this problem?
> Slurm does not perform any cleanup of SysV IPC memory. This can be done
> via a job epilog/prolog script. The kernel also has an option to have SysV
> memory accounted for like other memory:
> > sysctl kernel.shm_rmid_forced=1
> This will cause the SysV IPC memory to be released when the last attached
> process exits, which may cause application issues depending on the memory
> usage.
> 
> It is also possible to tell the kernel to limit the shared memory usage per
> process/namespace:
> > sysctl kernel.shmmax=0
> This amount should then be subtracted from the available memory on the nodes
> in slurm.conf to avoid any OOM events.

I've read the warnings about shm_rmid_forced, so it's probably safer to run ipcrm commands in an epilog script.  Would you have an example epilog script for this purpose?

Thanks,
Ole
Comment 3 Nate Rini 2019-06-12 12:13:39 MDT
(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #2)
> I've read the warnings about shm_rmid_forced, so it's probably safer to make
> ipcrm commands in an epilog script.  Would you have an example epilog script
> for this purpose?

We do not have an example that I would consider safe to run on a multi-user system. The main problem is that the namespace for SysV IPC is shared and there is no clear way to determine which job owns what. If you force only exclusive jobs on a node, then you could just clear out all of the memory at job end.
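
For that exclusive-node case, a minimal epilog sketch along these lines might work (untested; it assumes SLURM_JOB_USER is set in the epilog environment and the util-linux ipcs column layout):

  #!/bin/bash
  # Example Epilog sketch: remove all SysV shared memory segments owned by
  # the job's user. Only safe when jobs have exclusive use of the node.
  [ -n "$SLURM_JOB_USER" ] || exit 0
  [ "$SLURM_JOB_USER" = "root" ] && exit 0   # never touch root-owned segments

  # Column 2 is the shmid, column 3 the owner in util-linux ipcs output.
  ipcs -m | awk -v u="$SLURM_JOB_USER" '$1 ~ /^0x/ && $3 == u {print $2}' |
      while read -r shmid; do
          ipcrm -m "$shmid" || true
      done
  exit 0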

Unsharing a new SysV IPC namespace per job is possible, but that would be considered an RFE.
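
Purely as an illustration of that idea (not something Slurm does today), the util-linux unshare tool shows the effect; creating the IPC namespace needs root or CAP_SYS_ADMIN:

  # Conceptual demo: a SysV segment created inside a private IPC namespace
  # is destroyed automatically when the namespace's last process exits.
  sudo unshare --ipc bash -c 'ipcmk -M 1048576; ipcs -m'
  ipcs -m   # the segment never appears in the original namespace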

--Nate
Comment 4 Ole.H.Nielsen@fysik.dtu.dk 2019-06-12 13:29:27 MDT
(In reply to Nate Rini from comment #3)
> (In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #2)
> > I've read the warnings about shm_rmid_forced, so it's probably safer to run
> > ipcrm commands in an epilog script.  Would you have an example epilog script
> > for this purpose?
> 
> We do not have an example that I would consider safe to run on a multi-user
> system. The main problem is that the namespace for SysV IPC is shared and
> there is no clear way to determine which job owns what. If you force only
> exclusive jobs on a node, then you could just clear out all of the memory at
> job end.
> 
> Unsharing a new SysV IPC namespace per job is possible, but that would be
> considered an RFE.

Thanks very much for providing excellent information.  Please close this case.

Best regards,
Ole
Comment 5 Nate Rini 2019-06-12 13:30:57 MDT
(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #4)
> Thanks very much for providing excellent information.  Please close this
> case.

Closing per your response.

--Nate