Bug 489

Summary: Cgroup enhancements
Product: Slurm Reporter: Moe Jette <jette>
Component: Other Assignee: Danny Auble <da>
Severity: 5 - Enhancement CC: da, kilian, ryan_cox
Priority: ---    
Version: 15.08.2   
Hardware: Linux   
OS: Linux   
Site: BYU - Brigham Young University Alineos Sites: ---
Bull/Atos Sites: --- Confidential Site: ---
Cray Sites: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
OCF Sites: --- SFW Sites: ---
Machine Name: Version Fixed: 15.08.5 16.05.0-pre1
Target Release: 16.05 DevPrio: 3 - High
CLE Version:
Attachments: pam_slurm_adopt_2.diff

Description Moe Jette 2013-10-25 11:32:24 MDT
I am turning this email into a trouble ticket before it gets accidentally deleted.

Here is some information about how we do cgroups on our login nodes and 
my thoughts on how to set up cgroups better in SLURM.

This is on our login nodes:
root@m7int02:~# cat /etc/ssh/sshrc
# must do this for X forwarding purposes
# reads from stdin so be careful what you place before it
# see sshd manpage for details under SSHRC
if read proto cookie && [ -n "$DISPLAY" ]; then
    if [ `echo $DISPLAY | cut -c1-10` = 'localhost:' ]; then
        # X11UseLocalhost=yes
        echo add unix:`echo $DISPLAY | cut -c11-` $proto $cookie
    else
        # X11UseLocalhost=no
        echo add $DISPLAY $proto $cookie
    fi | xauth -q -
fi

/usr/local/sbin/interactive_cgroups_assign_process $PPID
root@m7int02:~# cat /usr/local/sbin/interactive_cgroups_assign_process

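#!/bin/bash
# PID to assign, passed as the first argument (sshrc passes $PPID)
pid=$1

# look up the owner of the process rather than relying on $USER
username=$(/usr/bin/stat -c %U /proc/$pid)

# when not running as root, move the process into the owner's cpu and memory cgroups
(("$UID"==0)) || (/bin/echo $pid > /cgroup/cpu/users/user_$username/tasks) >/dev/null 2>&1
(("$UID"==0)) || (/bin/echo $pid > /cgroup/memory/users/user_$username/tasks) >/dev/null 2>&1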
root@m7int02:~# ls -l /cgroup/memory/users/user_ryancox/tasks
--w-rw---- 1 ryancox root 0 Aug 13 16:54 
root@m7int02:~# grep SLURM /etc/ssh/*config
/etc/ssh/ssh_config:    SendEnv            SLURM_*

Basically, we have a script that creates cgroups for all users in 
/etc/passwd every few minutes.  It sets their "tasks" file to be owned 
by the user and write-only by the user.  It also sets a memory and 
cpu.shares limit for the user.  Users can't pull in processes from other 
users due to kernel restrictions, though they could reassign their own 
processes to a different cgroup that they have write access to.  That's 
not really a huge concern for us on our login nodes since there is only 
one cgroup per subsystem that they have access to at all.

/usr/local/sbin/interactive_cgroups_assign_process was designed so we 
can also cron something to go catch anything that didn't get assigned to 
a cgroup for whatever reason, so it has extra features like looking up 
the owner of the process rather than relying on $USER.
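
A minimal sketch of the per-user setup described above, assuming a cgroup v1 memory hierarchy.  The function name make_user_cgroup and its arguments are hypothetical, and this is not the actual script we run; the mode 260 simply mirrors the ls output shown earlier.

```shell
# Sketch: prepare one per-user cgroup directory.
# root      - top of one cgroup subsystem (e.g. /cgroup/memory); assumption
# user      - user name or numeric uid used in the directory name
# mem_limit - memory limit in bytes; the value is up to the site
make_user_cgroup() {
    local root=$1 user=$2 mem_limit=$3
    local dir="$root/users/user_$user"

    mkdir -p "$dir" || return 1

    # On a real cgroup mount the kernel creates the tasks file together with
    # the directory; creating it here only lets the sketch run on a plain directory.
    [ -e "$dir/tasks" ] || : > "$dir/tasks"

    # Per-user memory limit.
    echo "$mem_limit" > "$dir/memory.limit_in_bytes" || return 1

    # Hand the tasks file to the user. chown needs root, so skip it otherwise.
    if [ "$(id -u)" -eq 0 ]; then
        chown "$user" "$dir/tasks" || return 1
    fi

    # --w-rw----: the owner can write PIDs into the file but cannot read it.
    chmod 260 "$dir/tasks"
}
```

A cron job could call this once per user, for example make_user_cgroup /cgroup/memory ryancox 4294967296, where the 4 GB limit is only an illustration.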

What I would like to see in SLURM is the following:
A PAM module creates the appropriate cgroup for the user exactly as it 
does now when launched under slurmd or slurmstepd.  The pam module would 
need to ask SLURM what the user has been allocated on that node and 
create the appropriate cgroup, just like it would through slurmstepd.  
I'm not certain whether PAM has access to environment variables passed 
via ssh, but my testing suggests that it does not.  If it does not have 
access to them, simply assign the ssh-launched process to slurm/uid_$uid.  
I think this is your current plan.  The slurm/uid_$uid cgroups should have 
the aggregated job allocation limits per user (memory, cpus, etc) if they 
have multiple jobs on the same node.  So job1 (1GB memory) + job2 (2GB 
memory) results in slurm/uid_$uid/memory.limit_in_bytes of 3GB.  When 
job1 exits, reduce it to 2GB.
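
The aggregation above amounts to summing the limits of the jobs the user still has on the node.  A minimal sketch, with a hypothetical helper name and the limits given in bytes:

```shell
# Sum the memory limits (in bytes) of a user's jobs on this node.
# The result is the value the per-user cgroup limit would be set to.
aggregate_memory_limit() {
    local total=0 limit
    for limit in "$@"; do
        total=$((total + limit))
    done
    echo "$total"
}
```

With job1 at 1GB and job2 at 2GB, aggregate_memory_limit 1073741824 2147483648 gives the 3GB figure; calling it again with only the job2 value after job1 exits gives 2GB.  The result would be written to slurm/uid_$uid/memory.limit_in_bytes.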

I would like the pam module and the slurm code to create the cgroups 
then run the following commands per tasks file (uid, job, step, task):
chown $USER $tasksfile
chmod 200 $tasksfile   # or something else, as long as the user can write

If you do this, you can use an sshrc file to assign the task to the 
appropriate cgroup (job, step, or task) as long as AcceptEnv and SendEnv 
are configured correctly, just like the examples above.  This would work 
as follows in each of these scenarios:
1) Job is launched via sbatch or salloc.  From that script or shell, the 
user/code uses ssh to connect to node2 in the list.  $SLURM_* variables 
are sent due to ssh SendEnv.  PAM on node2 sees that the user has a job 
allocated on the node and creates slurm/uid_$uid/job_$job (and maybe 
step and task information?).  PAM assigns the ssh process to the 
slurm/uid_$uid cgroups since it doesn't have access to the $SLURM_* 
variables.  After the PAM stack is done, ssh calls sshrc.  sshrc 
reassigns the user's processes (starting at $PPID) from the 
slurm/uid_$uid cgroups to the appropriate slurm/uid_$uid/job_$job 
cgroups based on $SLURM_JOB_ID.

2) Job is launched via sbatch or salloc.  From a different shell 
completely, the user connects directly to node2 via ssh.  Since 
$SLURM_JOB_ID, etc. were not set, sshrc is unable to assign the process 
to the correct slurm/uid_$uid/job_$job.  However, PAM still assigns the 
process to slurm/uid_$uid.  Even though we lose the per-job accounting, 
the user is still subject to aggregate job allocation limits on that node.

3) Job is launched via sbatch or salloc.  sshrc is not set up and/or 
AcceptEnv and SendEnv are not set.  Since $SLURM_JOB_ID, etc. were not 
set, sshrc is unable to assign the process to the correct 
slurm/uid_$uid/job_$job.  However, PAM still assigns the process to 
slurm/uid_$uid.  Even though we lose the per-job accounting, the user is 
still subject to aggregate job allocation limits on that node.

4) Job is launched via sbatch or salloc.  The user maliciously sets 
$SLURM_JOB_ID to an incorrect value either in that shell or from a 
different host entirely, then uses ssh to connect to node2.  In this 
case, PAM still assigns that process to slurm/uid_$uid.  When sshrc 
runs, it tries to move that process to an incorrect 
slurm/uid_$uid/job_$job.  It will fail since the targeted 
slurm/uid_$uid/job_$job won't allow the user to write there since sshrc 
runs as the user.
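
The sshrc side of the four scenarios above could be sketched roughly as follows.  The helper name choose_tasks_file and the directory layout <root>/uid_<uid>/job_<jobid>/tasks are assumptions for illustration, not existing Slurm code.

```shell
# Print the tasks file an ssh-launched process should be moved into.
# root  - slurm directory of one cgroup subsystem (assumed layout)
# uid   - numeric uid of the user
# jobid - value of $SLURM_JOB_ID as received through ssh, possibly empty
choose_tasks_file() {
    local root=$1 uid=$2 jobid=$3
    local job_tasks="$root/uid_$uid/job_$jobid/tasks"

    # Scenario 1: a job ID arrived via SendEnv/AcceptEnv and the matching
    # tasks file exists and is writable by this user.
    if [ -n "$jobid" ] && [ -w "$job_tasks" ]; then
        echo "$job_tasks"
    else
        # Scenarios 2 and 3 (no job ID) and scenario 4 (bogus job ID):
        # stay in the per-user cgroup that PAM already picked.
        echo "$root/uid_$uid/tasks"
    fi
}
```

sshrc would then run something like echo $PPID > "$(choose_tasks_file /cgroup/memory/slurm "$(id -u)" "$SLURM_JOB_ID")" once per subsystem.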

We're still trying to determine whether ssh should pass anything more 
than $SLURM_JOB_ID or if it should send $SLURM_*.  We are actually about 
to use the cgroup release mechanism to gather stats from the memory and 
cpuacct cgroups so we can store per job per node data, but that's a 
different conversation :)

Hopefully this makes sense.
Comment 1 Martin Perry 2013-11-25 06:49:09 MST
The Slurm team at Bull has been planning to implement a PAM cgroups feature for some time. I'm not sure if this is consistent with your proposal above, but here is a description of what we plan to implement.  We're not planning to do anything with sshrc.

Our initial goal is to use cgroups with PAM to restrict compute node login access to users with Slurm resources (cpus) allocated on that node, and restrict login sessions to the set of cpus allocated to the user. This is similar to the functionality provided by the code in contribs/pam. Any new use of cgroups in Slurm needs to be compatible with the existing cgroups code in the proctrack, task and jobacct_gather plugins. The new feature will work as follows:

There will be a new plugin, PAMPlugin=pam/cgroup. The plugin will be loaded by slurmd. The plugin API defines the following functions:


pam_g_add_user_resources() will be called when a job is allocated resources, on each node in the allocation, and will do the following:

Create cpuset cgroups for the job and user (user cgroup may already exist). By default, the path will be /cgroup/cpuset/slurm/user_%uid/job_%jobid.
Add the set of cpus allocated to the job on this node to cpuset.cpus in the user and job cgroups.

pam_g_delete_user_resources() will be called when a job terminates, and will do the following:

Delete the job cgroup and update cpuset.cpus in the user cgroup from the remaining jobs for this user, if any. 
If there are no remaining jobs, kill any tasks attached to the user cgroup (i.e. active logins for this user) and delete the user cgroup.
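
As an illustration of the update step in pam_g_delete_user_resources(), the user cgroup's cpuset.cpus can be rebuilt as the union of the CPU lists of the remaining jobs.  This is only a sketch: union_cpus is a hypothetical helper, and it assumes each list holds single CPU ids separated by commas (ranges such as 0-3 are not handled).

```shell
# Merge the CPU lists of a user's remaining jobs into one cpuset.cpus value.
# Each argument is a comma-separated list of single CPU ids.
# Prints nothing when there are no remaining jobs.
union_cpus() {
    [ "$#" -gt 0 ] || return 0
    printf '%s\n' "$@" | tr ',' '\n' | sort -n -u | paste -s -d, -
}
```

For two remaining jobs holding CPUs 0,1 and 1,2,3 the call union_cpus 0,1 1,2,3 prints 0,1,2,3.  An empty result corresponds to the case where the user cgroup is deleted instead.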

The plugin will also contain an implementation of pam_sm_acct_mgmt(). This implementation will do the following:

Get user from PAM handle
If the user does not have a user cpuset cgroup on this node
     return denied status
Else
     attach PID to user cpuset cgroup
     return allowed status

There will be code to build a new PAM slurm library (pam_slurm_cgroup.so) containing this implementation of pam_sm_acct_mgmt. Admins can then include this library in the PAM configuration file under /etc/pam.d for the appropriate services (e.g. sshd).
Comment 2 Ryan Cox 2013-12-10 08:36:17 MST

Your plan does implement a subset of my proposal.  However, I'm wondering if there is a reason why only the cpuset cgroup is planned?  If you're creating a per-user cgroup, why not also restrict the memory?

Where it differs is that it doesn't allow for accounting like normal.  "Adopting" a process through a call from sshrc would allow for that since sshrc does have access to the SLURM_* variables if the ssh{,d}_config files allow for it.  At that point you know what job the process belongs to.
Comment 3 Moe Jette 2014-01-07 09:54:42 MST
Contents of recent relevant emails:

Andy Wettstein <wettstein@uchicago.edu> writes:

I use pam_exec module to blindly put an ssh session into the user's most
recent task cgroup.

This does seem to work fine for users that just log in to a node where their
job is running, but multinode jobs have severe limitations.  The cgroup isn't
created unless a slurm task has actually been started. For programs that launch
with ssh, this doesn't work at all really. If multinode jobs worked correctly,
then it would obviously be better to somehow detect the slurm job id and use

A fully functional 'pam_slurm' module would do just that (it has to
check for an active allocation on the node anyway), and would create the
uid cgroup as needed if one does not already exist.

This is how the old pam_slurm_cpuset[1] operated and eventually the
cgroups code was supposed to work similarly. (You also have to make
sure resources are added and subtracted from the UID cgroups as jobs
are created and destroyed)

[1] https://code.google.com/p/slurm-spank-plugins/wiki/CPUSET



Here is what I currently use for pam_exec:

#!/bin/sh

[ "$PAM_USER" = "root" ] && exit 0
[ "$PAM_TYPE" = "open_session" ] || exit 0

. /etc/sysconfig/slurm

# $squeue must hold the full path to the squeue binary; its assignment
# is not shown in this excerpt
if [ ! -x "$squeue" ]; then
    exit 0
fi

uidnumber=$(id -u "$PAM_USER")
host=$(hostname -s)

# last job the user started is where these tasks will go
jobid=$($squeue --noheader --format=%i --user=$PAM_USER --node=localhost | tail -1)

[ -z "$jobid" ] && exit 0

for system in freezer cpuset; do
     # assumed cgroup mount layout; the cgdir assignment is not shown in this excerpt
     cgdir=/cgroup/$system/slurm

     # if the cgdir doesn't exist skip it
     [ -d "$cgdir" ] || continue
     # first job step is where we'll put these tasks
     cgtasks=$(find "$cgdir/uid_$uidnumber/job_$jobid" -mindepth 2 -type f -name tasks -print -quit)
     # quoted so that an empty result fails the -f test instead of passing it
     [ -f "$cgtasks" ] && echo $PPID > "$cgtasks"
done

exit 0

On Wed, Jan 01, 2014 at 09:43:55PM -0800, Christopher Samuel wrote:

At SC13 a few of us were talking about the issue of what to do when
you have to allow users to SSH into nodes where their jobs are running.

Currently SLURM doesn't put SSH sessions permitted by the current
pam_slurm module into any control groups, it would be nice if it would
at least put them into the top level /cgroup/cpuset/slurm/uid_$UID for
the user so they can only affect their own jobs.

The problem is that with SSH's privilege separation the current SLURM
PAM module cannot do this as it will run as an unprivileged process
(not the user) prior to authentication and so cannot have the
permissions to move processes around.

However, it appears that something like a pam_slurm *session* library
would (I believe) run as the user in question - as long as it can
learn the PID of the shell being spawned.

The problem then is how to allow that to securely put itself into the
user's top level cgroup.  If slurmd just made the tasks file for that
top level cgroup owned by the user then the process could do it,
though there would be a small risk that the user could then move
processes from lower, insulated containers into the top level, though
they'd only be affecting their own stuff.


All the best,
  Christopher Samuel        Senior Systems Administrator
  VLSCI - Victorian Life Sciences Computation Initiative
  Email: samuel@unimelb.edu.au Phone: +61 (0)3 903 55545
  http://www.vlsci.org.au/      http://twitter.com/vlsci

andy wettstein
hpc system administrator
research computing center
university of chicago
Comment 4 Moe Jette 2015-05-05 09:28:31 MDT
For related work, see bug 1593
Comment 5 Ryan Cox 2015-10-22 09:50:26 MDT
Created attachment 2330

Here is the updated code for pam_slurm_adopt.c that was submitted in bug 1593.  Once tested and merged, this should resolve this bug report.  I have tested it and it Works For Me(TM).

The README shows all the options for the pam module.

Since this will likely be the most debated topic, I will clarify some things about the "action_unknown" option.  This is what happens when 1) the user has more than one job on the node and 2) the RPC call cannot identify what job the process belongs to.  This is almost exclusively a problem when the user tries to directly connect from a login node to a compute node (on which that user has multiple jobs running).
  any* = Pick a job in a (somewhat) random fashion. The user can ssh in
         but may be adopted into a job that exits earlier than the job
         they intended to check on. The ssh connection will at least be
         subject to appropriate limits and the user can be informed of
         better ways to accomplish their objectives if this becomes a problem
  user = Use the /slurm/uid_$UID cgroups. Not all cgroups set appropriate
         limits at this level so this may not be very effective.
         Additionally, job accounting at this level is impossible as is
         automatic cleanup of stray processes when the job exits
  allow = Let the connection through without adoption
  deny = Deny the connection

"any" seems to be the most reasonable default for now.  It ensures that the user's process is limited by cgroups and can be cleaned up by Slurm.  Its usage is also accounted for.  It does have the downside of unpredictability, to some degree, but IMO it's no worse than denying the connection.  From a user's perspective, they won't know if they have a single or multiple jobs on a node unless they explicitly check for it.  If you deny the connection, they will "randomly" not be able to log into some nodes and there's nothing they can do about it.  If "any" is set, they will be able to get in but may "randomly" get kicked out when the job exits.  To me, it seems better to be randomly kicked off than randomly denied access.  Again, they're almost certainly logging in directly from a login node to run something like top or strace, so it shouldn't be that long, hopefully.

I did implement Matthieu's idea to adopt into the /slurm/uid_$UID cgroups (action_unknown=user) but that has the limitation that the memory limits aren't aggregated (unless we have something misconfigured here), so it's almost worse than nothing at the moment, IMO.

I do like the idea of choosing the most recently started job or the job with the longest remaining time but that seems harder to implement since I don't think the slurmd has easy access to that information without querying the ctld, correct?  If so, that seems pretty expensive but at least it should be a rare thing.

Also, the extern step seems to stay around and any ssh-launched processes live on when a job exits.  The extern cgroups do appear to be cleaned up after the processes exit.

At some point I may also make an HTML page for documentation.
Comment 6 Moe Jette 2015-10-22 10:19:03 MDT
(In reply to Ryan Cox from comment #5)
> I do like the idea of choosing the most recently started job or the job with
> the longest remaining time but that seems harder to implement since I don't
> think the slurmd has easy access to that information without querying the
> ctld, correct?  If so, that seems pretty expensive but at least it should be
> a rare thing.

We don't have the expected job end times available in slurmd, but those would just be guesses anyway.
We could stat() the cgroup directories to see what was most recently created, that seems like the best option using readily available information.
In any case, it comes down to guesswork what will be "best".
Comment 7 Ryan Cox 2015-10-26 10:25:48 MDT
Created attachment 2336

That was a good idea.  I replaced "any" with "newest", which compares the cgroup mtimes to pick the newest job.  If for some reason "any" is desirable, it would be easy to add back.
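
For illustration only, the same mtime comparison can be sketched in shell (the module itself is C, and this is not its code).  The helper name newest_job_dir and the job_* directory naming under the per-user cgroup are assumptions.

```shell
# Print the most recently modified job_* directory under a per-user cgroup
# directory. Returns 1 and prints nothing if the user has no job directories.
newest_job_dir() {
    local uid_dir=$1 dir newest=""
    for dir in "$uid_dir"/job_*; do
        [ -d "$dir" ] || continue
        # -nt compares modification times
        if [ -z "$newest" ] || [ "$dir" -nt "$newest" ]; then
            newest=$dir
        fi
    done
    if [ -n "$newest" ]; then
        echo "$newest"
    else
        return 1
    fi
}
```

Called as newest_job_dir /cgroup/memory/slurm/uid_1000, where the path is an example layout, it prints the directory of the job an unidentified ssh session would be adopted into.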
Comment 8 Ryan Cox 2015-10-27 08:38:46 MDT
One thing I have noticed is that processes in the step_extern cgroups are not cleaned up automatically.  I do not know if this is intentional or not.  I could easily add something to epilog to kill all tasks in step_extern/tasks, if needed, but it would be nice if it gets cleaned up automatically.  Otherwise stray tasks can be left behind.
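
The epilog workaround mentioned above could look roughly like this.  It is a sketch under assumptions: kill_cgroup_tasks is a hypothetical helper, and the step_extern path in the usage note is an assumed layout that depends on how the cgroup hierarchy is mounted.

```shell
# Send SIGKILL to every PID listed in a cgroup tasks file.
# A missing or unreadable file is treated as "nothing to clean up".
kill_cgroup_tasks() {
    local tasks_file=$1 pid
    [ -r "$tasks_file" ] || return 0
    while read -r pid; do
        # PIDs that have already exited are ignored
        [ -n "$pid" ] && kill -KILL "$pid" 2>/dev/null
    done < "$tasks_file"
    return 0
}
```

An epilog running as root could call it once per subsystem, for example kill_cgroup_tasks /cgroup/freezer/slurm/uid_$SLURM_JOB_UID/job_$SLURM_JOB_ID/step_extern/tasks.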

Additionally, I'm not seeing any accounting for the extern step through sacct or in the mysql database.  Bug 1593 comment 9 makes me think it is meant to be there.

Should I file separate bugs for those?

That aside, pam_slurm_adopt has handled our testing very well and we are now running it in production.  We'll see if users find any bugs.
Comment 9 Ryan Cox 2015-11-03 07:46:21 MST
I ended up filing separate bugs: bug 2096 and bug 2097.

Also, pam_slurm_adopt has run in production for a week now with no known issues so far.  We have seen a fair amount of ssh traffic in various scenarios that have exercised different parts of the code.
Comment 10 Danny Auble 2015-11-03 07:58:18 MST
Hey Ryan, I have committed this to the 15.08 branch.  I haven't been able to test it as much as I wanted, but hope to do that soon.

I'll check out the new bugs as well.
Comment 11 Tim Wickberg 2015-12-01 09:09:09 MST
Setting DevPrio and TargetRelease flags.
Comment 12 Danny Auble 2015-12-01 13:11:38 MST
I think this is fixed, please reopen if you feel otherwise.