Ticket 11154 - Extern process isn't putting processes in all cgroups
Summary: Extern process isn't putting processes in all cgroups
Status: RESOLVED DUPLICATE of ticket 5920
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmstepd
Version: 20.02.4
Hardware: Linux
Importance: --- 4 - Minor Issue
Assignee: Marcin Stolarek
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2021-03-19 12:24 MDT by Mikael Öhman
Modified: 2021-03-23 03:53 MDT

See Also:
Site: SNIC
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: C3SE
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Description Mikael Öhman 2021-03-19 12:24:56 MDT
I use pam_slurm_adopt and cgroups;

ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup

and cgroup.conf;

CgroupMountpoint=/sys/fs/cgroup
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
AllowedRAMSpace=100
ConstrainSwapSpace=yes
AllowedSwapSpace=0
ConstrainDevices=yes
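
A quick way to confirm those constraints actually took effect on a node is to read the per-job cgroup v1 files directly. A minimal sketch, assuming the v1 layout used below; <uid> and <jobid> are placeholders:

# Memory limit applied by task/cgroup when ConstrainRAMSpace=yes (in bytes):
cat /sys/fs/cgroup/memory/slurm/uid_<uid>/job_<jobid>/memory.limit_in_bytes
# Device whitelist applied when ConstrainDevices=yes:
cat /sys/fs/cgroup/devices/slurm/uid_<uid>/job_<jobid>/devices.list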

sbatch and srun processes seem to land in the correct cgroups and it all works perfectly, but the extern step processes created when ssh'ing into the node do not.

The cpuset cgroup contains the extern PID as I expected;
/sys/fs/cgroup/cpuset/slurm/uid_xxxxx/job_xxx/cgroup.procs

but it is not to be found in the memory and devices hierarchies:
/sys/fs/cgroup/memory/slurm/uid_xxxxx/job_xxx/cgroup.procs
/sys/fs/cgroup/devices/slurm/uid_xxxxx/job_xxx/cgroup.procs
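
One way to see every hierarchy a single process sits in is /proc/<pid>/cgroup, which avoids grepping each cgroup.procs file separately. A minimal sketch, assuming cgroup v1; <pid> is a placeholder for the adopted ssh shell's PID:

# One line per mounted v1 controller, in the form id:controller:path.
cat /proc/<pid>/cgroup
# Expected when adoption works for all controllers (paths as above):
# 7:cpuset:/slurm/uid_xxxxx/job_xxx
# 4:memory:/slurm/uid_xxxxx/job_xxx
# 1:devices:/slurm/uid_xxxxx/job_xxx
# A path of / or /user.slice/... next to memory or devices means the
# process is outside the job's cgroup for that controller.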

Also, running "nvidia-smi" on a shared GPU node that I ssh into via pam_slurm_adopt, I see all GPUs, not just the ones allocated to the job.
This suggests to me that the ssh shell isn't constrained in terms of memory or GPUs.

Oversight, bug, or have I missed a configuration option?
Comment 1 Marcin Stolarek 2021-03-22 05:30:03 MDT
Mikael,

Can you share your pam configuration for sshd?

cheers,
Marcin
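
For context, an sshd PAM stack using pam_slurm_adopt typically ends its account phase roughly as sketched below; this is illustrative only, and the exact stack is distro-specific:

# /etc/pam.d/sshd (fragment, illustrative)
account    required     pam_slurm_adopt.so
# pam_systemd is commonly disabled alongside it, since registering the
# session with systemd-logind can move it out of the job's cgroups:
# -session   optional     pam_systemd.so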
Comment 2 Mikael Öhman 2021-03-23 03:53:45 MDT
It turns out systemd-logind wasn't disabled and masked on these nodes. The fact that the cpuset cgroup was working threw me off. Sorry for the noise.

Best regards, Mikael
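
For completeness, the pam_slurm_adopt documentation recommends stopping and masking systemd-logind on compute nodes so it cannot re-parent adopted sessions; a minimal sketch:

# Run once per compute node (as root):
systemctl stop systemd-logind
systemctl mask systemd-logind

This would also explain the partial symptom: on cgroup v1, systemd typically does not manage the cpuset controller, so that hierarchy kept Slurm's placement while memory and devices did not.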

*** This ticket has been marked as a duplicate of ticket 5920 ***