Ticket 8258 - srun: error: Unable to create step for job 355169: Memory required by task is not available
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Configuration
Version: 19.05.3
Hardware: Linux Linux
Severity: 2 - High Impact
Assignee: Felip Moll
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2019-12-19 18:06 MST by Sudhakar Lakkaraju
Modified: 2019-12-23 12:48 MST

Site: ASC


Attachments
slurm configuration file 19.05.3 (17.72 KB, text/plain)
2019-12-19 18:06 MST, Sudhakar Lakkaraju
CGROUP CONF FILE (248 bytes, text/x-matlab)
2019-12-20 08:24 MST, Sudhakar Lakkaraju

Description Sudhakar Lakkaraju 2019-12-19 18:06:08 MST
Created attachment 12613 [details]
slurm configuration file 19.05.3

Hi,
Non-MPI jobs are not getting killed when they use excess memory. This started happening after we migrated from Slurm 17 to Slurm 19.

MPI jobs are getting killed immediately.
 
Please find the attached slurm.conf



Non-MPI job:
Here is the output from scontrol show job: 

JobId=355169 JobName=parallel8comG16
   UserId=asndcy(1001) GroupId=analyst(10000) MCS_label=N/A
   Priority=3235 Nice=0 Account=users QOS=sysadm
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:15:25 TimeLimit=7-00:00:00 TimeMin=N/A
   SubmitTime=2019-12-19T18:36:28 EligibleTime=2019-12-19T18:36:28
   AccrueTime=2019-12-19T18:36:28
   StartTime=2019-12-19T18:36:40 EndTime=2019-12-26T18:36:40 Deadline=N/A
   PreemptEligibleTime=2019-12-19T18:36:40 PreemptTime=None
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2019-12-19T18:36:40
   Partition=gpu_kepler AllocNode:Sid=dmcvlogin3.asc.edu:6102
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=dmc4
   BatchHost=dmc4
   NumNodes=1 NumCPUs=8 NumTasks=8 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=8,mem=8M,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=1M MinTmpDiskNode=0
   Features=dmc DelayBoot=00:00:00
   Reservation=dmc67-68
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/mnt/beegfs/home/asndcy/calc/gaussian/tests
   StdErr=/mnt/beegfs/home/asndcy/calc/gaussian/tests/parallel8comG16.o355169
   StdIn=/dev/null
   StdOut=/mnt/beegfs/home/asndcy/calc/gaussian/tests/parallel8comG16.o355169
Comment 1 Felip Moll 2019-12-20 06:53:26 MST
Hi,

Can you attach your cgroup.conf too?

In the RELEASE_NOTES of version 19.05 there are the following notes:

NOTE: MemLimitEnforce parameter has been removed and the functionality that
      was provided with it has been merged into a JobAcctGatherParams. It
      may be enabled by setting JobAcctGatherParams=OverMemoryKill, so now
      job and steps killing by OOM is enabled from the same place.


NOTE: slurmd and slurmctld will now fatal if two incompatible mechanisms for
      enforcing memory limits are set. This makes incompatible the use of
      task/cgroup memory limit enforcing (Constrain[RAM|Swap]Space=yes) with
      JobAcctGatherParams=OverMemoryKill, which could cause problems when a
      task is killed by one of them while the other is at the same time
      managing that task. The NoOverMemoryKill setting has been deprecated in
      favor of OverMemoryKill, since now the default is *NOT* to have any
      memory enforcement mechanism.


You currently have MemLimitEnforce=yes set in your slurm.conf. Can you switch to JobAcctGatherParams=OverMemoryKill? Only do that if you are not using cgroup enforcement, as explained in the second note.
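
For example, the relevant slurm.conf lines would look roughly like this (a sketch, and only applicable if cgroup memory enforcement is NOT in use):

#MemLimitEnforce=yes                    # removed in 19.05, comment out or delete
JobAcctGatherType=jobacct_gather/linux  # assumed here; OverMemoryKill needs a gather plugin other than 'none'
JobAcctGatherParams=OverMemoryKill      # kill jobs/steps that exceed their requested memory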
Comment 3 Sudhakar Lakkaraju 2019-12-20 08:24:18 MST
Created attachment 12619 [details]
CGROUP CONF FILE
Comment 4 Sudhakar Lakkaraju 2019-12-20 08:29:55 MST
We are using cgroup enforcement.
I would like to keep cgroup enforcement if possible.
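
For reference, cgroup memory enforcement of this kind is normally driven by cgroup.conf settings along these lines; this is a generic sketch, not the exact contents of the attached file:

CgroupAutomount=yes
ConstrainRAMSpace=yes   # enforce the job's RAM limit through the memory cgroup
ConstrainSwapSpace=yes  # also constrain RAM+swap
AllowedSwapSpace=0      # no extra swap beyond the allocation

It also relies on TaskPlugin=task/cgroup being set in slurm.conf.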
Comment 6 Sudhakar Lakkaraju 2019-12-20 11:21:51 MST
I made the following changes to slurm.conf:
- commented out the "MemLimitEnforce" parameter
- left JobAcctGatherParams unconfigured

Jobs are still not getting killed when they use excess memory.
We are still seeing the error "srun: error: Unable to create step for job : Memory required by task is not available".

Please help us fix this issue.
Comment 7 Felip Moll 2019-12-20 11:36:45 MST
I'd like to check which memory limit is applied to the cgroup of the job. You can do that by reading memory.limit_in_bytes of the cgroup created:

]$ cat memory.limit_in_bytes 
314572800

]$ pwd
/sys/fs/cgroup/memory/slurm/uid_1000/job_12566/step_0
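
For the job in this ticket, the equivalent check on node dmc4 would look something like this (uid 1001 is taken from the scontrol output above; the step_0 directory name is an assumption):

]$ cat /sys/fs/cgroup/memory/slurm/uid_1001/job_355169/step_0/memory.limit_in_bytes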

I see your job has 8MB granted:
   TRES=cpu=8,mem=8M,node=1

So you will probably see 31457280 bytes (30 x 1024 x 1024), which corresponds to 30MiB rather than the 8MB granted. That's because there's an internal minimum of 30MiB below which RAM is not constrained.

See XCGROUP_DEFAULT_MIN_RAM constant in ./common/xcgroup_read_config.h
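
If you want to confirm the value in your source tree, something like this works (a sketch; the path is relative to the src directory of the Slurm sources):

]$ grep -n XCGROUP_DEFAULT_MIN_RAM ./common/xcgroup_read_config.h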

If you want to constrain below 30MB you need to set MinRAMSpace in cgroup.conf, as the docs say:

MinRAMSpace=<number>
    Set a lower bound (in MB) on the memory limits defined by AllowedRAMSpace and AllowedSwapSpace. This prevents accidentally creating a memory
    cgroup with such a low limit that slurmstepd is immediately killed due to lack of RAM. The default limit is 30M.
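
For instance, to let the limit drop to this job's 8MB request, cgroup.conf could contain (a sketch; the value is chosen only to match this example):

MinRAMSpace=8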



----
In the unlikely case this doesn't fix your issue, I'll need the following:

- A check of how much memory the job is actually using, with sstat or directly on the node with 'ps' (example commands below).
- The slurmctld and slurmd logs. Before sending the logs, increase verbosity to:

SlurmctldDebug=debug2
SlurmdDebug=debug2

and then reproduce the issue and send back the logs.
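
For the memory check, something along these lines would do (a sketch; the job ID, node name and user come from this ticket, and the .batch step name is an assumption):

]$ sstat -j 355169.batch --format=JobID,MaxRSS,AveRSS,MaxVMSize
]$ ssh dmc4 ps -u asndcy -o pid,rss,vsz,etime,comm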
Comment 9 Felip Moll 2019-12-23 03:48:48 MST
Hi,

Do you have any feedback from my last comment?

I need to know if you're already safe in order to lower the severity of the bug (or close it if definitively fixed).
Comment 10 Sudhakar Lakkaraju 2019-12-23 09:02:09 MST
(In reply to Felip Moll from comment #9)
> Hi,
> 
> Do you have any feedback from my last comment?
> 
> I need to know if you're already safe in order to lower the severity of the
> bug (or close it if definitively fixed).

I could not afford to keep my queue system offline for more than two days. We reverted back to Slurm17. Please close this case.
Comment 11 Felip Moll 2019-12-23 11:21:02 MST
(In reply to Sudhakar Lakkaraju from comment #10)
> (In reply to Felip Moll from comment #9)
> > Hi,
> > 
> > Do you have any feedback from my last comment?
> > 
> > I need to know if you're already safe in order to lower the severity of the
> > bug (or close it if definitively fixed).
> 
> I could not afford to keep my queue system offline for more than two days.
> We reverted back to Slurm17. Please close this case.

Sudhakar, just remember that you can always set the severity to SEV-1, which means the system is unusable; in that case our responsiveness increases significantly, and we would be glad to help you as best we can.

In your case we requested info the same day you opened the bug; we expected some responses but didn't receive any.

I think is bad that you haven't managed to move to 19.05 and yours is an issue that we haven't seen before in an upgrade from 17.11 to 19.05. I feel that we should've done better in assisting you but was impossible with more input, so if you want to still give it a try just open a new bug asking for help on a planned upgrade, one of us will take the bug and be aware of your case.