Bug 4122 - GPU enforcement
Summary: GPU enforcement
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 16.05.7
Hardware: Linux Linux
Importance: --- 3 - Medium Impact
Assignee: Tim Wickberg
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2017-08-30 15:53 MDT by Robert Yelle
Modified: 2017-12-14 00:06 MST

See Also:
Site: University of Oregon
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
cgroup.conf (604 bytes, application/octet-stream) - 2017-08-30 17:11 MDT, Robert Yelle
ATT00001.htm (232 bytes, text/html) - 2017-08-30 17:11 MDT, Robert Yelle
slurm.conf (7.28 KB, application/octet-stream) - 2017-08-30 17:11 MDT, Robert Yelle
ATT00002.htm (2.85 KB, text/html) - 2017-08-30 17:11 MDT, Robert Yelle
slurm.conf (8.94 KB, application/octet-stream) - 2017-12-12 17:08 MST, Robert Yelle
ATT00001.htm (1.75 KB, text/html) - 2017-12-12 17:08 MST, Robert Yelle

Description Robert Yelle 2017-08-30 15:53:51 MDT
Hello,

We have found a couple of potential problems with enforcement of GPUs used by Slurm jobs.  In one case, I was able to request interactive jobs as follows:

$ srun -p gpu --nodes=1 --ntasks=1 --gres=gpu:2 --pty bash -i

I was given node "n097".  So then I did this:

$ srun -p gpu --nodes=1 --ntasks=1 --gres=gpu:1 --nodelist=n097 --pty bash -i

This is fine since each node in our gpu partition has four GPUs.  However, when I load our cuda/8.0 module and run "deviceQuery", I get the following results for the two sessions created above:

session #1: 
  Device PCI Domain ID /Bus ID / location ID: 0 / 4 / 0
  Device PCI Domain ID /Bus ID / location ID: 0 / 5 / 0

session #2:
  Device PCI Domain ID /Bus ID / location ID: 0 / 4 / 0

This suggests to me that both sessions are sharing a GPU on node n097, namely "0/4/0".  Instead, the ID of a third GPU should be displayed.

In addition, if I open another session (#3) to node n097 - this time via SSH rather than through Slurm - I can access all four GPUs using deviceQuery:
  Device PCI Domain ID /Bus ID / location ID: 0 / 4 / 0
  Device PCI Domain ID /Bus ID / location ID: 0 / 5 / 0
  Device PCI Domain ID /Bus ID / location ID: 0 / 132 / 0
  Device PCI Domain ID /Bus ID / location ID: 0 / 133 / 0

Indeed, we are seeing instances of users reserving nodes without specifying the "--gres=gpu:X" option at all, and then getting around GPU enforcement by SSHing to the node they have reserved and using the GPUs that way.

The first problem appears to be a scheduling issue, in that the first GPU (0/4/0 in this case) is assigned whenever GPUs are requested, even if it has already been allocated to another job.  The second problem may be unrelated to the first, and appears to be a way of working around Slurm's resource enforcement - perhaps a failure in cgroups?  If I should submit separate tickets for these issues, let me know.  Are these bugs in 16.05.7 that are fixed in later releases?  What do you recommend?

Thanks,

Rob
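
A quick way to tell these two failure modes apart is to compare, from inside each srun session, what CUDA is told to use with what the job's device cgroup actually permits.  The following is only a sketch: it assumes cgroup v1 with the devices controller mounted under /sys/fs/cgroup/devices and nvidia-smi available on the node.

$ echo ${CUDA_VISIBLE_DEVICES:-unset}    # Slurm sets this only when gres.conf lists the GPU device files
$ nvidia-smi -L                          # the GPUs this process can actually open
$ cgpath=$(grep devices /proc/self/cgroup | cut -d: -f3)
$ cat /sys/fs/cgroup/devices${cgpath}/devices.list    # an "a *:* rwm" entry here means devices are not constrained for this job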
Comment 1 Tim Wickberg 2017-08-30 16:15:00 MDT
Can you attach your slurm.conf and cgroup.conf files?

One thing I'm looking for would be 'ConstrainDevices=yes' in cgroup.conf.

For SSH - do you use pam_slurm_adopt? If you do, and ConstrainDevices is enabled, then it should be limiting access. If you don't, no enforcement can take place - they're running outside of Slurm's control at that point.

- Tim
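
For reference, the device-constraint portion of cgroup.conf usually looks something like the sketch below; the path and the other Constrain* lines are illustrative rather than recommendations for this site.

# cgroup.conf (sketch)
CgroupAutomount=yes
ConstrainCores=yes
ConstrainDevices=yes
AllowedDevicesFile=/etc/slurm/cgroup_allowed_devices_file.conf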
Comment 2 Robert Yelle 2017-08-30 17:11:51 MDT
Created attachment 5178 [details]
cgroup.conf

Hi Tim,

ConstrainDevices is set to “yes”.

As far as I can tell, we are not using “pam_slurm_adopt”.  However, we are using a similar mechanism in Bright Cluster Manager for denying ssh access via PAM.  This mechanism does work to prevent users from accessing a node that they have not reserved in Slurm.  However, the issue we are seeing in problem #2 is that a user who has exceeded their GPU TRES limits (which are meant to leave GPUs available for others) can get around them by reserving only cores on a node in the gpu partition, without requesting any GPUs.  They are then granted ssh access to that node, where they can proceed to use any of its GPUs, whether those are already in use or not.  Would pam_slurm_adopt prevent that from happening?

Our latest slurm.conf and cgroup.conf files are attached.  If you have any recommended changes, let me know.

Thanks,

Rob
Comment 3 Robert Yelle 2017-08-30 17:11:52 MDT
Created attachment 5179 [details]
ATT00001.htm
Comment 4 Robert Yelle 2017-08-30 17:11:52 MDT
Created attachment 5180 [details]
slurm.conf
Comment 5 Robert Yelle 2017-08-30 17:11:52 MDT
Created attachment 5181 [details]
ATT00002.htm
Comment 6 Tim Wickberg 2017-08-30 18:51:51 MDT
I need to dig further into the first case (double-booking a single card), although I suspect that may have been addressed at some point since 16.05.7 was released.

> ConstrainDevices is set to “yes”.
> 
> As far as I can tell, we are not using “pam_slurm_adopt”.  However, we are
> using a similar mechanism in Bright Cluster Manager for denying ssh access
> via PAM.  This mechanism does work to prevent users from accessing a node
> that they have not reserved in Slurm.  However, the issue we are seeing in
> problem #2 is that a user may have exceeded their GPU TRES limits (to allow
> others to use GPUs) but they can get around them by reserving cores on a
> node on the gpu queue without reserving any GPUs, then they are granted ssh
> access to that node where they can proceed to use any of the GPUs on that
> node, whether they are already in use or not.  Would pam_slurm_adopt prevent
> that from happening?

You're likely using "pam_slurm", which only blocks users from accessing nodes they're not allowed on, but does not allow Slurm to manage the tasks they may launch through SSH, or enforce access to, e.g., the GPU devices.

pam_slurm_adopt does provide for that.

There's a concise guide to it included within the source as contribs/pam_slurm_adopt/README, or you can see it online at:

https://github.com/SchedMD/slurm/blob/master/contribs/pam_slurm_adopt/README

(We are working on a better version of the documentation at the moment, and hope to have that online sometime before the 17.11 release.)

> Our latest slurm.conf and cgroup.conf files are attached.  If you have any
> recommended changes, let me know.
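
In practice the README boils down to adding pam_slurm_adopt to sshd's PAM account stack; a minimal sketch, assuming a conventional /etc/pam.d/sshd layout (the exact stack and ordering vary by distribution, so follow the README for the details):

# /etc/pam.d/sshd (excerpt)
account    required    pam_slurm_adopt.so    # adopted ssh sessions inherit the job's cgroups, including device limits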
Comment 8 Robert Yelle 2017-11-14 08:03:35 MST
Hello,

After upgrading from 16.05.8 to 17.02.9, I was able to confirm that GPUs are still double-booked in Slurm.  For instance, we have 2 K80s per node, presented as 4 GPUs to the operating system with the following IDs (as displayed by DeviceQuery):

  Device PCI Domain ID /Bus ID / location ID: 0 / 4 / 0
  Device PCI Domain ID /Bus ID / location ID: 0 / 5 / 0
  Device PCI Domain ID /Bus ID / location ID: 0 / 132 / 0
  Device PCI Domain ID /Bus ID / location ID: 0 / 133 / 0

If I reserve two GPUs on one node (e.g. srun -p gpu --nodes=1 --ntasks=4 --gres=gpu:2 ...), I get these GPUs:

  Device PCI Domain ID /Bus ID / location ID: 0 / 4 / 0
  Device PCI Domain ID /Bus ID / location ID: 0 / 5 / 0

But if I then reserve two GPUs on the same node from a second job, I get the same GPUs:

  Device PCI Domain ID /Bus ID / location ID: 0 / 4 / 0
  Device PCI Domain ID /Bus ID / location ID: 0 / 5 / 0

Let me know if you need any further info from me.

Multiple users have reported to me that they suspect this has been happening to them.  One possible workaround for the time being would be to require that all 4 GPUs on a node be reserved at once (i.e. a minimum of 4, or multiples of 4, GPUs per request), to prevent other users from reserving the same GPUs - is that possible?

Thanks,

Rob
Comment 9 Robert Yelle 2017-12-12 10:44:57 MST
Hello,

I was wondering if there have been any developments or updates on this.  If you need assistance to reproduce this behavior, let me know.  I would be willing to have a screen-sharing session to demonstrate the issue, if that would be helpful.

Thanks,

Rob
Comment 10 Brian Christiansen 2017-12-12 16:21:16 MST
Hey Rob,

Can you attach your gres.conf?

Also, can you try configuring your nodes with 2 GPUs instead of the 4 you have now?

e.g.
NodeName=n[097-120] ... Gres=gpu:2

Thanks,
Brian
Comment 11 Robert Yelle 2017-12-12 17:08:48 MST
Created attachment 5728 [details]
slurm.conf

Hi Brian,

Most of our GPU nodes are currently in use, but I made the suggested change to n114 (testgpu partition - see attached slurm.conf).

Our current gres.conf file looks like this:

Name=gpu Count=4

I tried this on n114 with the suggested change to slurm.conf, then also changed gres.conf to:

Name=gpu Count=2

In either case, if I try to reserve a single GPU on this node from two accounts, the first GPU gets double-booked (output from deviceQuery below, same from both accounts):

  Device PCI Domain ID /Bus ID / location ID: 0 / 4 / 0

My latest slurm.conf is attached.

Thanks,

Rob
Comment 12 Robert Yelle 2017-12-12 17:08:48 MST
Created attachment 5729 [details]
ATT00001.htm
Comment 14 Brian Christiansen 2017-12-12 22:38:02 MST
The way GPU confinement works in the task/cgroup plugin is that it first whitelists a set of default devices and then blacklists the GPU devices that weren't requested. This is done with the following configs:

cgroup.conf:
ConstrainDevices=yes (which you have)

cgroup_allowed_devices_file.conf - There is an example file contained in the etc directory of the source tree. Just copy this to your etc directory without the .example extension.

e.g.
$ cat cgroup_allowed_devices_file.conf.example 
/dev/null
/dev/urandom
/dev/zero
/dev/sda*
/dev/cpu/*/*
/dev/pts/*


Then in your gres.conf you need to specify the devices that are associated with the gpus.

e.g.
Name=gpu Count=2 File=/dev/nvidia[0-1]

Note how Count=2 matches up with 2 files. The count can be inferred from the number of files.

So in summary, Slurm (slurmstepd) will create a cgroup for the job, whitelist everything in the cgroup_allowed_devices_file.conf file, and then blacklist the devices that were not requested.

Can you set up the device files and try again? I apologize if you already have this setup -- I haven't seen them in your configs yet.

Let me know if you have any questions.
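
Once the File= entries are in place, a quick way to re-test the original double-booking scenario is to open two one-GPU allocations on the same node and compare what each can see.  This is only a sketch, reusing the partition and node names from earlier in this bug; the exact CUDA_VISIBLE_DEVICES values and bus IDs will vary.

Terminal 1:
$ srun -p gpu --nodes=1 --ntasks=1 --gres=gpu:1 --pty bash -i
$ echo $CUDA_VISIBLE_DEVICES
$ nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader    # one bus ID, e.g. 00000000:04:00.0

Terminal 2 (a second allocation on the same node):
$ srun -p gpu --nodes=1 --ntasks=1 --gres=gpu:1 --nodelist=n097 --pty bash -i
$ nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader    # should now report a different bus ID than terminal 1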
Comment 15 Robert Yelle 2017-12-13 14:46:05 MST
Hi Brian,

Thank you for the info.  I made your recommended changes to cgroup_allowed_devices_file.conf and to gres.conf on n114 and that FIXED the problem:

appended to cgroup_allowed_devices_file.conf:
/dev/nvidia*

changed in gres.conf:
Name=gpu Count=4 File=/dev/nvidia[0-3]

Now when I reserve GPUs on this node from two different sessions, I DO get different GPUs according to deviceQuery, so this issue appears to be resolved.  I will roll out the changes to the rest of the GPU nodes in our cluster.

Thanks!!

Rob
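
Rolling the same change out to the remaining GPU nodes is mostly a matter of distributing the two edited files and restarting the node daemons, since gres.conf is read when slurmd starts.  A sketch only, assuming the configs live under /etc/slurm/, systemd-managed daemons, and pdsh being available; the node range is the one shown in comment 10.

# push the updated gres.conf and cgroup_allowed_devices_file.conf to the GPU nodes first
# (scp/rsync or whatever configuration management is in use), then:
$ pdsh -w n[097-120] 'systemctl restart slurmd'
$ scontrol show node n097 | grep -i gres    # confirm the controller still reports gpu:4 on each node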


Comment 16 Brian Christiansen 2017-12-14 00:06:18 MST
Good to hear. Let us know if you have any other issues.

Thanks,
Brian