Ticket 3674 - slurm_pam_adopt misses some cgroup subsystems
Summary: slurm_pam_adopt misses some cgroup subsystems
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other
Version: 17.02.1
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Tim Wickberg
Reported: 2017-04-06 17:59 MDT by Kilian Cavalotti
Modified: 2018-10-26 02:25 MDT

Site: Stanford


Description Kilian Cavalotti 2017-04-06 17:59:24 MDT
Hi there. 

Just installed Slurm 17.02.2 and tested the slurm_pam_adopt module. It compiles, loads and works fine, but it doesn't place the adopted processes in all the configured cgroups.

During my tests, it correctly placed the adopted SSH process in the "cpuset" and "freezer" subsystems, but missed "devices", "cpuacct" and "memory".

Illustration:

* Initial job submission:

sh-ln01 $ srun -w sh-101-59 -p test --pty bash
sh-101-59 $ cat /proc/$$/cgroup
11:hugetlb:/
10:freezer:/slurm/uid_215845/job_11354/step_0
9:memory:/slurm/uid_215845/job_11354/step_0
8:cpuacct,cpu:/slurm/uid_215845/job_11354/step_0/task_0
7:blkio:/
6:devices:/slurm/uid_215845/job_11354/step_0
5:pids:/
4:cpuset:/slurm/uid_215845/job_11354/step_0
3:net_prio,net_cls:/
2:perf_event:/
1:name=systemd:/system.slice/slurmd.service
[kilian@sh-101-59 ~]$


* SSH'ing to the node:

sh-ln01 $ ssh sh-101-59
sh-101-59 $ cat /proc/$$/cgroup
11:hugetlb:/
10:freezer:/slurm/uid_215845/job_11354/step_extern
9:memory:/
8:cpuacct,cpu:/
7:blkio:/
6:devices:/
5:pids:/
4:cpuset:/slurm/uid_215845/job_11354/step_extern
3:net_prio,net_cls:/
2:perf_event:/
1:name=systemd:/user.slice/user-215845.slice/session-2724.scope


And I can confirm that the shell spawned by the SSH session is confined to the CPUs allocated to the job, but can freely consume all the memory it wants.
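
For reference, this is how I cross-checked both (assuming the standard cgroup v1 layout; output omitted):

sh-101-59 $ taskset -cp $$                 # affinity list matches the job's cpuset
sh-101-59 $ grep memory /proc/$$/cgroup    # prints "9:memory:/", i.e. the root memory cgroup, so no job limit applies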


We have the following config:

JobAcctGatherType       = jobacct_gather/cgroup
ProctrackType           = proctrack/cgroup
TaskPlugin              = task/cgroup
PrologFlags             = Alloc,Contain
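
For completeness, whether the memory and devices subsystems are enforced at all also depends on cgroup.conf; the relevant settings look roughly like this (a minimal sketch, not our exact file):

CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes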


Thanks!
-- 
Kilian
Comment 1 Tim Wickberg 2017-04-06 18:02:48 MDT
Did you disable pam_systemd in the various PAM configs? The output you've given suggests it may still be enabled.
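A quick way to check is to grep the PAM stack (filenames vary by distro):

$ grep -r pam_systemd /etc/pam.d/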
Comment 2 Kilian Cavalotti 2017-04-06 18:12:54 MDT
(In reply to Tim Wickberg from comment #1)
> Did you disable pam_systemd in the various PAM configs? The output you've
> given suggests it may still be enabled.

Wow, spot on, thanks Tim! That's exactly it: disabling pam_systemd fixes the issue.

And I see there's a note about it in the README too; I missed it. For anyone else hitting this, it boils down to commenting out the pam_systemd session line in the sshd PAM stack, roughly as shown below.
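
A sketch of the relevant /etc/pam.d/sshd lines (exact file layout and module names vary by distro; the adopt module ships as pam_slurm_adopt.so):

#-session   optional     pam_systemd.so
account     required     pam_slurm_adopt.so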

Thanks!
-- 
Kilian
Comment 3 Kilian Cavalotti 2017-04-06 18:17:44 MDT
Forgot to close the ticket. Done.