Hi,

We're using the pam_slurm_adopt PAM module in conjunction with the cgroup devices subsystem to restrict user access to GPUs. It works mostly fine, except for a corner case I can't seem to figure out: when a user SSHes to a node, her shell doesn't inherit the cgroup devices settings, but is instead denied access to all devices.

So, it's basically:
1. srun --gres gpu:2 --pty bash -> can see 2 GPUs
2. ssh to the node with pam_slurm_adopt -> can't see any GPU at all

Our config is as follows:

  JobAcctGatherType = jobacct_gather/cgroup
  ProctrackType     = proctrack/cgroup
  PrologFlags       = Alloc,Contain
  TaskPlugin        = task/cgroup

cgroup.conf:

  ConstrainDevices=yes
  AllowedDevicesFile=/etc/slurm/cgroup_allowed_devices_file.conf

When I run this job, I correctly get access to only 2 GPUs:

  $ srun -p test --gres gpu:2 --pty bash
  xs-0057 $ nvidia-smi -L
  GPU 0: Tesla K80 (UUID: GPU-198795b7-0588-3272-ee67-7b395851387f)
  GPU 1: Tesla K80 (UUID: GPU-68e053ec-d3d2-92c5-5e19-3a12d2444d4a)
  xs-0057 $

The cgroups seem to be set correctly, as the logs show, first for step_extern, then for step_0:

  slurmd[5840]: launch task 5736.0 request from 215845.2709@10.10.0.21 (port 10681)
  slurmstepd[6248]: task/cgroup: /slurm/uid_215845/job_5736: alloc=12000MB mem.limit=12000MB memsw.limit=12000MB
  slurmstepd[6248]: task/cgroup: /slurm/uid_215845/job_5736/step_extern: alloc=12000MB mem.limit=12000MB memsw.limit=12000MB
  slurmstepd[6248]: task/cgroup: manage devices jor job '5736'
  [...]
  slurmstepd[6248]: Allowing access to device c 195:0 rwm
  slurmstepd[6248]: Allowing access to device c 195:1 rwm
  slurmstepd[6248]: Not allowing access to device c 195:2 rwm
  slurmstepd[6248]: Not allowing access to device c 195:3 rwm
  slurmstepd[6248]: Not allowing access to device c 195:4 rwm
  slurmstepd[6248]: Not allowing access to device c 195:5 rwm
  slurmstepd[6248]: Not allowing access to device c 195:6 rwm
  slurmstepd[6248]: Not allowing access to device c 195:7 rwm
  slurmstepd[6248]: Not allowing access to device c 195:8 rwm
  [...]
  slurmstepd[6254]: task/cgroup: /slurm/uid_215845/job_5736: alloc=12000MB mem.limit=12000MB memsw.limit=12000MB
  slurmstepd[6254]: task/cgroup: /slurm/uid_215845/job_5736/step_0: alloc=12000MB mem.limit=12000MB memsw.limit=12000MB
  slurmstepd[6254]: task/cgroup: manage devices jor job '5736'
  [...]
  slurmstepd[6254]: Allowing access to device c 195:0 rwm
  slurmstepd[6254]: Allowing access to device c 195:1 rwm
  slurmstepd[6254]: Not allowing access to device c 195:2 rwm
  slurmstepd[6254]: Not allowing access to device c 195:3 rwm
  slurmstepd[6254]: Not allowing access to device c 195:4 rwm
  slurmstepd[6254]: Not allowing access to device c 195:5 rwm
  slurmstepd[6254]: Not allowing access to device c 195:6 rwm
  slurmstepd[6254]: Not allowing access to device c 195:7 rwm
  slurmstepd[6254]: Not allowing access to device c 195:8 rwm
  [...]

Then, when I ssh to the node, the PID of the new shell is correctly added to the step_extern's process list, yet I can't seem to list the GPUs:

  $ ssh xs-0057
  xs-0057 $ echo $$
  7202
  xs-0057 $ cat /cgroup/devices/slurm/uid_215845/job_5736/step_extern/cgroup.procs
  6248
  6258
  7197
  7201
  7202
  7418
  xs-0057 $ nvidia-smi -L
  No devices found.

Since I was not entirely sure whether it was a Slurm issue or more of a cgroup issue, I tried to do this by hand:

1. create a new cgroup:

  xs-0057 # mkdir /cgroup/devices/test/

2. ssh as a user to the node:

  $ ssh xs-0057
  xs-0057 $ echo $$
  8216

3. configure the cgroup (195:* devices are the /dev/nvidia* devices, 195:255 is /dev/nvidiactl):

  xs-0057 # echo 8216 > /cgroup/devices/test/cgroup.procs
  xs-0057 # echo a > /cgroup/devices/test/devices.deny
  xs-0057 # echo "c 195:0 rwm" > /cgroup/devices/test/devices.allow
  xs-0057 # echo "c 195:1 rwm" > /cgroup/devices/test/devices.allow
  xs-0057 # echo "c 195:255 rwm" > /cgroup/devices/test/devices.allow
  xs-0057 # cat /cgroup/devices/test/devices.list
  c 195:0 rwm
  c 195:1 rwm
  c 195:255 rwm

4. check that access restriction works as a user:

  xs-0057 $ nvidia-smi -L
  GPU 0: Tesla K80 (UUID: GPU-198795b7-0588-3272-ee67-7b395851387f)
  GPU 1: Tesla K80 (UUID: GPU-68e053ec-d3d2-92c5-5e19-3a12d2444d4a)

5. now SSH in a new shell, add its PID to the test cgroup, and see what the device access is:

  $ ssh xs-0057
  xs-0057 $ echo $$
  9094

as root in another session:

  xs-0057 # echo 9094 > /cgroup/devices/test/cgroup.procs

back as the user:

  xs-0057 $ echo $$
  9094
  xs-0057 $ nvidia-smi -L
  GPU 0: Tesla K80 (UUID: GPU-198795b7-0588-3272-ee67-7b395851387f)
  GPU 1: Tesla K80 (UUID: GPU-68e053ec-d3d2-92c5-5e19-3a12d2444d4a)

So the new shell correctly inherits the device access restriction when its PID is added to the cgroup's list of processes. That makes me think that there is something different in the way Slurm initiates and/or sets up the cgroups.

Any insight would be much appreciated.
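For reference, the manual steps above can be collected into a small script. This is only a sketch under the same assumptions as the experiment: a cgroup v1 devices controller mounted at /cgroup/devices, NVIDIA character devices on major 195, and root privileges; `nvidia_allow_lines` and `confine_pid` are hypothetical helper names, not part of Slurm.

```shell
#!/bin/sh
# Sketch: whitelist the first N GPUs (plus /dev/nvidiactl) for a PID,
# reproducing the manual devices-cgroup test above. Assumes cgroup v1
# mounted at /cgroup/devices and must be run as root.

# Pure helper: print the devices.allow lines for the first N GPUs.
nvidia_allow_lines() {
    n=$1
    i=0
    while [ "$i" -lt "$n" ]; do
        echo "c 195:$i rwm"      # /dev/nvidia$i
        i=$((i + 1))
    done
    echo "c 195:255 rwm"         # /dev/nvidiactl
}

# Create the cgroup, start from deny-all, apply the whitelist, adopt the PID.
confine_pid() {
    cg=/cgroup/devices/$1
    pid=$2
    ngpus=$3
    mkdir -p "$cg"
    echo a > "$cg/devices.deny"
    nvidia_allow_lines "$ngpus" | while read -r line; do
        echo "$line" > "$cg/devices.allow"
    done
    echo "$pid" > "$cg/cgroup.procs"
}
```

Usage would be e.g. `confine_pid test 8216 2`, matching steps 1-3 above.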
Thanks,
--
Kilian

PS: Something that may be related (or not): the devices.list file in the cgroups set by Slurm always contains only "a *:* rwm", whatever the access restrictions are:

  xs-0057 # tree /cgroup/devices/slurm/
  /cgroup/devices/slurm/
  ├── cgroup.event_control
  ├── cgroup.procs
  ├── devices.allow
  ├── devices.deny
  ├── devices.list
  ├── notify_on_release
  ├── tasks
  └── uid_215845
      ├── cgroup.event_control
      ├── cgroup.procs
      ├── devices.allow
      ├── devices.deny
      ├── devices.list
      ├── job_5738
      │   ├── cgroup.event_control
      │   ├── cgroup.procs
      │   ├── devices.allow
      │   ├── devices.deny
      │   ├── devices.list
      │   ├── notify_on_release
      │   ├── step_0
      │   │   ├── cgroup.event_control
      │   │   ├── cgroup.procs
      │   │   ├── devices.allow
      │   │   ├── devices.deny
      │   │   ├── devices.list
      │   │   ├── notify_on_release
      │   │   └── tasks
      │   ├── step_extern
      │   │   ├── cgroup.event_control
      │   │   ├── cgroup.procs
      │   │   ├── devices.allow
      │   │   ├── devices.deny
      │   │   ├── devices.list
      │   │   ├── notify_on_release
      │   │   └── tasks
      │   └── tasks
      ├── notify_on_release
      └── tasks

  # find /cgroup/devices/slurm/ -iname devices.list -exec cat {} \;
  a *:* rwm
  a *:* rwm
  a *:* rwm
  a *:* rwm
  a *:* rwm

This is not the case when setting the cgroups manually, as shown above: there, devices.list correctly contains the list of whitelisted devices.
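Since devices.list evidently can't be trusted here, one way to see what a shell can really reach is to probe the device nodes directly: the devices cgroup enforces its policy at open(2) time, so simply attempting to open each node reveals the effective restriction. A sketch (`can_open` is a hypothetical helper; the device paths are the ones on this node):

```shell
#!/bin/sh
# Probe effective device access for the current shell instead of reading
# devices.list. Opening a node fails with EPERM when the devices cgroup
# denies it.

# Succeeds iff the calling process may open $1 for reading.
can_open() {
    ( : < "$1" ) 2>/dev/null
}

for dev in /dev/nvidia0 /dev/nvidia1 /dev/nvidia2 /dev/nvidiactl; do
    if can_open "$dev"; then
        echo "$dev: open allowed"
    else
        echo "$dev: open denied (or device absent)"
    fi
done
```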
One more thing: I just noticed that if I manually add the PID of the external SSH shell to the step_0 cgroup instead of the step_extern one, the device access restriction works correctly. So it looks like, despite what the logs seem to indicate, the devices cgroup is not correctly configured for the step_extern cgroup.

Cheers,
Kilian
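That observation suggests a temporary manual workaround: moving the adopted shell's PID into the job's step_0 devices cgroup by hand. A sketch, assuming the same cgroup v1 layout as above (the `step0_devices_procs` helper name and the UID/job/PID values are illustrative, taken from this report):

```shell
#!/bin/sh
# Sketch of a manual workaround: re-adopt an SSH shell's PID into the
# step_0 devices cgroup, where the whitelist is applied correctly.

# Pure helper: build the cgroup.procs path for a job's step_0 devices cgroup.
step0_devices_procs() {
    uid=$1
    job=$2
    echo "/cgroup/devices/slurm/uid_${uid}/job_${job}/step_0/cgroup.procs"
}

# As root on the compute node (example values from this report):
#   echo 7202 > "$(step0_devices_procs 215845 5736)"
```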
Kilian,

I am working with Ryan right now on making this module work correctly. You can follow along through bug 2097 if you would like. I am not sure whether it will fix your issue or not, but currently things aren't correct on many fronts.

FYI, I would strongly suggest using jobacct_gather/linux; jobacct_gather/cgroup doesn't buy you anything (except for slowing things down).
Hi Danny, Noted for jobacct_gather/linux. I'll take a look at #2097. Thanks! Kilian
This is fixed in the following commits:

  https://github.com/SchedMD/slurm/commit/3101754f5074c56408d9a2f62afe42b857b7c296
  https://github.com/SchedMD/slurm/commit/7f39ab4f1e4ab182ac65230b292759563e9a56e7

A lot has changed in how pam_slurm_adopt works, but what you were most likely experiencing was that the step_extern cgroup was explicitly denying access to the GPUs. We've changed it so that the devices step_extern cgroup inherits the attributes of the parent job_<jobid> cgroup (the first commit).

Please reopen if you have any issues.

Thanks,
Brian
Hi,

Just upgraded to 15.08.5. pam_slurm_adopt does indeed correctly set cpuset and freezer constraints on the step_extern cgroup, but we're not able to make it work for devices (nor memory). We added ConstrainDevices=yes in cgroup.conf.

Job 12380 is running, and I connect to the running node using pam_slurm_adopt:

  6464 ?        Ss     0:00  \_ sshd: sthiell [priv]
  6469 ?        S      0:00  |   \_ sshd: sthiell@pts/5
  6470 pts/5    Ss+    0:00  |       \_ -bash

Only the sshd PID running under root is found in /cgroup/devices/slurm/uid_282232/job_12380/step_extern/tasks. It looks like the sshd user process and other child PIDs are not added to the step_extern devices and memory cgroups:

  [root@xs-0060 ~]# cat /proc/6464/cgroup
  4:devices:/slurm/uid_282232/job_12380/step_extern
  3:cpuset:/slurm/uid_282232/job_12380/step_extern
  2:freezer:/slurm/uid_282232/job_12380/step_extern
  1:memory:/slurm/uid_282232/job_12380/step_extern
  [root@xs-0060 ~]# cat /proc/6469/cgroup
  4:devices:/
  3:cpuset:/slurm/uid_282232/job_12380/step_extern
  2:freezer:/slurm/uid_282232/job_12380/step_extern
  1:memory:/
  [root@xs-0060 ~]# cat /proc/6470/cgroup
  4:devices:/
  3:cpuset:/slurm/uid_282232/job_12380/step_extern
  2:freezer:/slurm/uid_282232/job_12380/step_extern
  1:memory:/

Any ideas?

Thanks,
Stephane
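Checking this by eye across many PIDs is tedious; the /proc/<pid>/cgroup dumps above can be checked with a small filter that reports which subsystems have escaped the expected job cgroup. A sketch (`missing_subsystems` is a hypothetical helper; the job path is the example from this report):

```shell
#!/bin/sh
# Sketch: read /proc/<pid>/cgroup (cgroup v1 format, "id:subsystem:path"
# per line) on stdin and print every subsystem whose path does NOT fall
# under the expected Slurm step cgroup.
missing_subsystems() {
    want=$1   # e.g. /slurm/uid_282232/job_12380/step_extern
    while IFS=: read -r hid subsys path; do
        case $path in
            "$want"*) ;;                 # PID adopted correctly here
            *) echo "$subsys" ;;         # PID escaped this hierarchy
        esac
    done
}

# Usage on the node, e.g.:
#   missing_subsystems /slurm/uid_282232/job_12380/step_extern < /proc/6470/cgroup
# For PID 6470 above this would report: devices and memory.
```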
I'm able to reproduce a similar situation as well. In my case, it appears to be a race condition where the child processes are being forked before the parent process is being added to the cgroup. We'll work on a patch and get back to you. It is odd that in your case it is only happening for memory and devices. Does this happen every time for you? Thanks, Brian
The situation that I found is fixed by this commit:

  https://github.com/SchedMD/slurm/commit/c7fa3f8f08695502a0076ac1085797570aaaa525

Would you try this commit, or 15.05.6, and see whether you still see the same behavior?

Thanks,
Brian
Hi Brian, That's great, I just upgraded from 15.05.5 to 15.05.6 and the problem is solved! cgroups on GPU devices and memory are now set correctly, for both job step and step_extern PIDs with pam_slurm_adopt. Thank you for the Christmas gift! Stephane
Great! Let us know if you see anything else. Thanks, Brian