Hello,

We have found a couple of potential problems with enforcement of GPUs used by Slurm jobs.

In one case, I was able to request interactive jobs as follows:

$ srun -p gpu --nodes=1 --ntasks=1 --gres=gpu:2 --pty bash -i

I was given node "n097". So then I did this:

$ srun -p gpu --nodes=1 --ntasks=1 --gres=gpu:1 --nodelist=n097 --pty bash -i

This is fine since each node in our gpu partition has four GPUs. However, when I load our cuda/8.0 module and run "deviceQuery", I get the following results for the two sessions created above:

session #1:
Device PCI Domain ID / Bus ID / location ID: 0 / 4 / 0
Device PCI Domain ID / Bus ID / location ID: 0 / 5 / 0

session #2:
Device PCI Domain ID / Bus ID / location ID: 0 / 4 / 0

This suggests to me that both sessions are sharing a GPU on node n097, namely "0/4/0". Instead, the ID of a third GPU should be displayed.

In addition, if I open up another session (#3) to node n097 - this time via SSH rather than Slurm - I can access all four GPUs using deviceQuery:

Device PCI Domain ID / Bus ID / location ID: 0 / 4 / 0
Device PCI Domain ID / Bus ID / location ID: 0 / 5 / 0
Device PCI Domain ID / Bus ID / location ID: 0 / 132 / 0
Device PCI Domain ID / Bus ID / location ID: 0 / 133 / 0

Indeed, we are seeing instances of users reserving nodes without specifying the "--gres=gpu:X" option, and instead getting around this by SSHing to the node they have reserved and accessing the GPUs that way.

The first problem appears to be a scheduling problem, in that the first GPU (0/4/0 in this case) is assigned when GPUs are requested, even if it has already been reserved. The second problem may be unrelated to the first, and appears to be a way of bypassing resource enforcement by Slurm - perhaps a failure in cgroups? If I should submit separate tickets for these issues, let me know.

Are these bugs in 16.05.7 that are fixed in future releases? What do you recommend?

Thanks,
Rob
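One mechanical way to check for this kind of overlap is to diff the device lists the two sessions see - a hypothetical helper script, using the PCI IDs reported above; in practice you would capture each list inside its own job with deviceQuery (or nvidia-smi):

```shell
#!/bin/bash
# Compare the GPU bus IDs visible to two Slurm sessions; any line common
# to both lists is a double-booked GPU. The IDs below are the ones that
# deviceQuery printed in sessions #1 and #2.
session1="0 / 4 / 0
0 / 5 / 0"
session2="0 / 4 / 0"

overlap=$(comm -12 <(printf '%s\n' "$session1" | sort) \
                   <(printf '%s\n' "$session2" | sort))
if [ -n "$overlap" ]; then
  echo "double-booked: $overlap"
fi
```

With the IDs above this prints "double-booked: 0 / 4 / 0"; with correct enforcement the overlap should be empty.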
Can you attach your slurm.conf and cgroup.conf files? One thing I'm looking for is 'ConstrainDevices=yes' in cgroup.conf.

For SSH - do you use pam_slurm_adopt? If you do, and ConstrainDevices is enabled, then it should be limiting access. If you don't, no enforcement can take place - they're running outside of Slurm's control at that point.

- Tim
Created attachment 5178 [details]
cgroup.conf

Hi Tim,

ConstrainDevices is set to "yes".

As far as I can tell, we are not using "pam_slurm_adopt". However, we are using a similar mechanism in Bright Cluster Manager for denying SSH access via PAM. This mechanism does work to prevent users from accessing a node that they have not reserved in Slurm. However, the issue we are seeing in problem #2 is that a user may have exceeded their GPU TRES limits (to allow others to use GPUs), but they can get around them by reserving cores on a node in the gpu queue without reserving any GPUs. They are then granted SSH access to that node, where they can proceed to use any of the GPUs on that node, whether those GPUs are already in use or not. Would pam_slurm_adopt prevent that from happening?

Our latest slurm.conf and cgroup.conf files are attached. If you have any recommended changes, let me know.

Thanks,
Rob
Created attachment 5180 [details] slurm.conf
The first case - double-booking a single card - I need to dig into further, although I suspect that may have been addressed at some point since 16.05.7 was released.

> As far as I can tell, we are not using "pam_slurm_adopt". However, we are
> using a similar mechanism in Bright Cluster Manager for denying ssh access
> via PAM. [...] Would pam_slurm_adopt prevent that from happening?

You're likely using "pam_slurm", which only blocks users from accessing nodes they're not allowed on, but does not allow Slurm to manage the tasks they may launch through SSH, or enforce access to, e.g., the GPU devices. pam_slurm_adopt does provide for that.

There's a concise guide to it included within the source as contribs/pam_slurm_adopt/README, or you can see it online at:

https://github.com/SchedMD/slurm/blob/master/contribs/pam_slurm_adopt/README

(We are working on a better version of the documentation at the moment, and hope to have that online sometime before the 17.11 release.)
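For reference, deploying pam_slurm_adopt typically amounts to one line in the SSH PAM stack - an illustrative sketch only, assuming a default /etc/pam.d/sshd layout; the exact placement, options, and interaction with the Bright-provided PAM rules may differ, so consult the README above before changing anything:

```
# /etc/pam.d/sshd -- illustrative fragment, not a drop-in file.
# Placed near the end of the account stack so that incoming SSH sessions
# are "adopted" into the user's existing job step (and its cgroup, so
# device limits like ConstrainDevices apply to them too).
account    required    pam_slurm_adopt.so
```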
Hello,

After upgrading from 16.05.8 to 17.02.9, I was able to confirm that GPUs are still double-booked in Slurm. For instance, we have 2 K80s per node, presented as 4 GPUs to the operating system with the following IDs (as displayed by deviceQuery):

Device PCI Domain ID / Bus ID / location ID: 0 / 4 / 0
Device PCI Domain ID / Bus ID / location ID: 0 / 5 / 0
Device PCI Domain ID / Bus ID / location ID: 0 / 132 / 0
Device PCI Domain ID / Bus ID / location ID: 0 / 133 / 0

If I reserve two GPUs on one node (e.g. srun -p gpu --nodes=1 --ntasks=4 --gres=gpu:2 ...), I get these GPUs:

Device PCI Domain ID / Bus ID / location ID: 0 / 4 / 0
Device PCI Domain ID / Bus ID / location ID: 0 / 5 / 0

But if I then reserve two GPUs on the same node from a second session, I get the same GPUs:

Device PCI Domain ID / Bus ID / location ID: 0 / 4 / 0
Device PCI Domain ID / Bus ID / location ID: 0 / 5 / 0

Let me know if you need any further info from me. Multiple users have reported back to me that they have suspected this happening. One possible workaround for the time being would be to require all 4 GPUs on a node to be reserved at once (i.e. a minimum of 4, or multiples of 4, GPUs per reservation), to prevent other users from reserving the same GPUs - is that possible?

Thanks,
Rob
Hello,

I was wondering if there have been any developments or updates on this. If you need assistance to reproduce this behavior, let me know. I would be willing to have a screen-sharing session to demonstrate the issue, if that would be helpful.

Thanks,
Rob
Hey Rob,

Can you attach your gres.conf? Also, can you try configuring your nodes with 2 GPUs instead of 4 as you have it now? e.g.

NodeName=n[097-120] ... Gres=gpu:2

Thanks,
Brian
Created attachment 5728 [details]
slurm.conf

Hi Brian,

Most of our GPU nodes are currently in use, but I made the suggested change to n114 (testgpu partition - see attached slurm.conf). Our current gres.conf file looks like this:

Name=gpu Count=4

I tried this on n114 with the suggested change to slurm.conf, then also changed gres.conf to:

Name=gpu Count=2

In either case, if I try to reserve a single GPU on this node from two accounts, the first GPU gets double-booked (output from deviceQuery below, same from both accounts):

Device PCI Domain ID / Bus ID / location ID: 0 / 4 / 0

My latest slurm.conf is attached.

Thanks,
Rob
The way GPU confinement works in the task/cgroup plugin is that it first whitelists a list of default devices and then blacklists the GPUs that weren't requested. This is done with the following configs:

cgroup.conf:
ConstrainDevices=yes (which you have)

cgroup_allowed_devices_file.conf - there is an example file contained in the etc directory of the source tree. Just copy this to your etc directory without the .example extension. e.g.

$ cat cgroup_allowed_devices_file.conf.example
/dev/null
/dev/urandom
/dev/zero
/dev/sda*
/dev/cpu/*/*
/dev/pts/*

Then in your gres.conf you need to specify the devices that are associated with the GPUs. e.g.

Name=gpu Count=2 File=/dev/nvidia[0-1]

Note how Count=2 matches up with 2 files; the count can be inferred from the number of files.

So in summary, Slurm (slurmstepd) will create a cgroup for the job, whitelist everything in the cgroup_allowed_devices_file.conf file, and then blacklist the devices not requested.

Can you set up the device files and try again? I apologize if you already have this set up - I haven't seen it in your configs yet. Let me know if you have any questions.
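As a quick sanity check that Count matches the File= list, you can expand the range by hand - a sketch using bash brace expansion (note this is only the shell analogue for counting; the [0-1] bracket notation in gres.conf is Slurm's own syntax, not shell syntax):

```shell
#!/bin/bash
# gres.conf line being checked: Name=gpu Count=2 File=/dev/nvidia[0-1]
# Expand the equivalent bash brace range and confirm the number of
# device files matches Count. Brace expansion is purely textual, so
# this works even on a machine without the /dev/nvidia* nodes.
files=(/dev/nvidia{0..1})
echo "Count=${#files[@]} Files=${files[*]}"
```

Running this prints "Count=2 Files=/dev/nvidia0 /dev/nvidia1"; a mismatch between that count and the Count= value in gres.conf would indicate a misconfigured device list.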
Hi Brian,

Thank you for the info. I made your recommended changes to cgroup_allowed_devices_file.conf and to gres.conf on n114, and that FIXED the problem:

appended to cgroup_allowed_devices_file.conf:
/dev/nvidia*

changed in gres.conf:
Name=gpu Count=4 File=/dev/nvidia[0-3]

Now when I reserve GPUs on this node from two different instances I DO get different GPUs according to deviceQuery, so this issue now appears to be resolved. I will roll out the changes to the rest of the GPU nodes in our cluster.

Thanks!!
Rob
Good to hear. Let us know if you have any other issues. Thanks, Brian