Ticket 6518 - jobacct_gather/linux not fully supported with pam_slurm_adopt leading to memory overlimit
Summary: jobacct_gather/linux not fully supported with pam_slurm_adopt leading to memo...
Status: RESOLVED DUPLICATE of ticket 8656
Alias: None
Product: Slurm
Classification: Unclassified
Component: Accounting
Version: 17.11.2
Hardware: Linux Linux
: --- 6 - No support contract
Assignee: Jacob Jenson
 
Reported: 2019-02-14 07:13 MST by SafranTech
Modified: 2020-03-13 11:28 MDT

See Also:
Site: -Other-


Attachments
steps to reproduce and slurmd log (8.22 KB, application/x-gzip)
2019-02-14 07:13 MST, SafranTech

Description SafranTech 2019-02-14 07:13:38 MST
Created attachment 9177 [details]
steps to reproduce and slurmd log

Dear Support,

With pam_slurm_adopt enabled, we observe that jobs are killed by the jobacct_gather plugin.
It appears that an orphaned process is added after each termination of a tracked process, so the memory associated with the orphaned task is added to the total memory of the job step (here step_extern). This leads to the job step being killed by the Slurm plugin.
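
To make the reported symptom concrete, here is a toy model in Python. It is only a sketch of the behaviour described above, not Slurm's implementation: the class `StepAccounting`, the helper `simulate`, and all numbers are hypothetical, and how the orphaned entry actually arises inside jobacct_gather/linux is not modeled (see the attachment for the real analysis).

```python
from dataclasses import dataclass, field


@dataclass
class StepAccounting:
    """Toy per-step memory tally (hypothetical, not a Slurm data structure)."""
    limit_kb: int
    tracked: dict = field(default_factory=dict)   # pid -> RSS in kB
    orphans: list = field(default_factory=list)   # RSS of orphaned entries

    def adopt(self, pid: int, rss_kb: int) -> None:
        # A process adopted into the step (e.g. via pam_slurm_adopt).
        self.tracked[pid] = rss_kb

    def terminate(self, pid: int) -> None:
        # Reported symptom: the terminated process leaves behind an
        # orphaned entry whose memory still counts toward the step.
        self.orphans.append(self.tracked.pop(pid))

    def total_kb(self) -> int:
        return sum(self.tracked.values()) + sum(self.orphans)

    def over_limit(self) -> bool:
        return self.total_kb() > self.limit_kb


def simulate(sessions: int, rss_kb: int, limit_kb: int) -> StepAccounting:
    """Run `sessions` short-lived processes one after another."""
    acct = StepAccounting(limit_kb=limit_kb)
    for pid in range(1, sessions + 1):
        acct.adopt(pid, rss_kb)
        acct.terminate(pid)
    return acct
```

In this model no process is alive at the end, yet the tally grows by one orphaned entry per terminated process, so a step with enough short-lived adopted processes eventually crosses its limit.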

The attachment contains a complete description of the issue.

Regards,
Philippe
Comment 1 CSC sysadmins 2020-02-27 06:45:40 MST
Hi SchedMD,

Our site (CSC - IT Center for Science) suffers from this issue as well. This ticket contains an excellent analysis and a reproducer, so could you fix this issue?

Best Regards,
Tommi Tervo
CSC
Comment 2 SafranTech 2020-02-28 04:05:42 MST
Hello
Further to Tommi's comment, here is an update.
On our side, pending resolution of the bug, we work around it by adding this:

JobAcctGatherParams=NoOverMemoryKill

For more details,

Principal parameters in slurm.conf:
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherParams=NoOverMemoryKill
TaskPlugin=task/cgroup
TaskPluginParam=Cpusets

cgroup.conf contents:
ConstrainCores=yes
ConstrainRAMSpace=yes
MaxRAMPercent=98
ConstrainKmemSpace=no


Please be aware of this parameter (extract from the slurm.conf man page):


       JobAcctGatherParams
              Arbitrary parameters for the job account gather plugin.  Acceptable values at present include:

              NoShared            Exclude shared memory from accounting.

              UsePss              Use PSS value instead of RSS to calculate real usage of memory.  The PSS value will be saved as RSS.

              NoOverMemoryKill    Do not kill process that uses more then requested memory.  This parameter should be used with caution as if jobs exceeds its memory allocation it may
                                  affect other processes and/or machine health.  NOTE: It is recommended to limit memory by enabling task/cgroup in TaskPlugin and making use of
                                  ConstrainRAMSpace=yes cgroup.conf.  If so, having JobAcctGather as an extra mechanism for memory enforcement is not recommended, so setting NoOverMemoryKill is advised.
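
The NOTE in the extract above can be turned into a quick check. The Python sketch below reads the text of slurm.conf and cgroup.conf and reports whether jobacct_gather is still acting as a second memory-enforcement mechanism alongside task/cgroup. The function names `parse_conf`, `csv_values`, and `double_enforcement` are hypothetical helpers, the parsing is deliberately simplified (plain Key=Value lines, `#` comments, no includes), and the logic follows the man page text quoted here, so it may not apply to other Slurm releases.

```python
def parse_conf(text: str) -> dict:
    """Parse simple Key=Value lines; keys are lower-cased, '#' starts a comment."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip().lower()] = value.strip()
    return settings


def csv_values(settings: dict, key: str) -> set:
    """Return the comma-separated values of a setting as a lower-cased set."""
    raw = settings.get(key.lower(), "")
    return {v.strip().lower() for v in raw.split(",") if v.strip()}


def double_enforcement(slurm_conf: str, cgroup_conf: str) -> bool:
    """True if cgroups constrain RAM but jobacct_gather may still kill on memory."""
    slurm = parse_conf(slurm_conf)
    cgroup = parse_conf(cgroup_conf)
    gather_type = slurm.get("jobacctgathertype", "jobacct_gather/none").lower()
    gather_active = gather_type != "jobacct_gather/none"
    cgroup_limits = (
        "task/cgroup" in csv_values(slurm, "TaskPlugin")
        and cgroup.get("constrainramspace", "no").lower() == "yes"
    )
    kill_disabled = "noovermemorykill" in csv_values(slurm, "JobAcctGatherParams")
    return gather_active and cgroup_limits and not kill_disabled
```

With the settings listed in this comment the check returns False, because NoOverMemoryKill is set; dropping that parameter while keeping ConstrainRAMSpace=yes makes it return True.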


Mohamed Hendawi
Comment 3 Nate Rini 2020-03-13 11:28:31 MDT
Marking this as a duplicate

*** This ticket has been marked as a duplicate of ticket 8656 ***