Bug 1458 - Setting nVidia GPU access from SLURM - exclusive process
Summary: Setting nVidia GPU access from SLURM - exclusive process
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other
Version: 14.03.0
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Moe Jette
 
Reported: 2015-02-15 10:51 MST by Josh Bowden
Modified: 2015-02-18 10:34 MST
CC List: 2 users

Site: CSIRO
Version Fixed: 15.08.1-pre3


Description Josh Bowden 2015-02-15 10:51:56 MST
Hi,

NVIDIA GPUs support the following compute modes, which are set using the nvidia-smi -c X switch, where X is one of:

0 - Default: shared mode, available to multiple processes
1 - Exclusive Thread: only one host thread is allowed to access the GPU for compute
2 - Prohibited: no host threads or processes are allowed to access the GPU for compute
3 - Exclusive Process: only one host process is allowed to access the GPU for compute

On our system the compute mode has been set for each device to (3) Exclusive Process by default.
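
For reference, setting the mode on a single device looks something like this (it requires root; the device index is just an example):

# set GPU 0 to exclusive-process mode; -c accepts the numeric value or the mode name
nvidia-smi -i 0 -c EXCLUSIVE_PROCESS    # equivalent to: nvidia-smi -i 0 -c 3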

How could we let a user change this access mode from SLURM? Changing it requires sudo access; under PBS, a job ran a pre-launch script that set the mode to whatever the user requested.

Thanks,
Josh.
Comment 1 Moe Jette 2015-02-15 13:25:25 MST
Slurm supports a large number of prolog and epilog options. For details see:
http://slurm.schedmd.com/prolog_epilog.html

The environment variables are documented in the slurm.conf man page (see the section labeled "Prolog and Epilog Scripts"). They should probably be added to the web page too, but are not there today. See:
http://slurm.schedmd.com/slurm.conf.html

Ideally the user's job would specify the GPU mode via a "--constraint" option, which would be passed to the script as an environment variable. I'm not sure we'll be able to add that before the version 15.08 release, but the prolog script can get the job's constraint information today using the squeue or scontrol command; it's just not as scalable as having Slurm pass the job's constraint information directly to the prolog in the launch RPC.
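
As a rough sketch of that workaround in a bash prolog (assuming SLURM_JOB_ID is available in the prolog environment, and using a hypothetical feature name such as gpu_ex_pro):

#!/bin/bash
# v14.11-style workaround: pull the job's Features= specification out of scontrol output
FEATURES=$(scontrol show job "$SLURM_JOB_ID" | tr ' ' '\n' | awk -F= '/^Features=/{print $2}')
if [[ "$FEATURES" == *gpu_ex_pro* ]]; then
    MODE=EXCLUSIVE_PROCESS    # hypothetical mapping from feature name to GPU compute mode
fi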
Comment 2 Josh Bowden 2015-02-15 14:44:52 MST
Hi Moe, thanks for the information.

We currently have a script that could be used as a prolog target. It ran in the prolog on our previous PBS Torque system and used the PBS_GPUFILE variable to determine what GPU had been allocated on each node.

How would I determine which GPUs have been allocated by SLURM? Is CUDA_VISIBLE_DEVICES set for each allocated node before the prolog script runs, and if so, how do I access this information?

Thanks again, Josh.
Comment 3 Moe Jette 2015-02-15 14:49:47 MST
Prolog does not currently get the CUDA env var, but the info is available and it could be set.

Comment 4 Josh Bowden 2015-02-15 15:37:05 MST
So, how could I get that information from within my prolog script then? It is currently a bash script.
Comment 5 Moe Jette 2015-02-15 16:02:56 MST
We'd have to modify the Slurm code to pass the prolog the same CUDA env vars as the app.

Comment 6 Josh Bowden 2015-02-15 16:12:52 MST
Is that possible? Will it require recompiling SLURM, and will our system admins be able to update the system easily?
Comment 7 Moe Jette 2015-02-15 16:26:42 MST
It would definitely require rebuilding Slurm, which is simple. At this point I'm not sure how difficult the code changes would be. We'll get back to you later in the week on that.

Comment 8 Josh Bowden 2015-02-15 16:49:27 MST
OK, I have warned our system admins. I'm happy to hear of any potential solution to the problem; I am surprised it has not been encountered previously.
Thanks, Josh.
Comment 9 Josh Bowden 2015-02-16 10:50:06 MST
And just to make clear how the prolog script worked: I would need a list of allocated nodes (host names preferably) and, for each node, a list of GPU devices.

e.g. something like:
<hostname>,<GPUNUM>,<GPUNUM>,...
n002,1,2
n034,0,1
n035,0,2
etc.

The script then ssh's onto each node and uses nvidia-smi to set the access mode.
I also need a variable containing the access mode that a user requested.
For example:

--gres=gpu:2:exclusive_process
or
--gres=gpu:2:normal
or
--gres=gpu:2:exclusive_thread

So I'd also need access to the requested access mode.
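
For reference, a rough sketch of the kind of head-node script I have in mind, assuming the node/GPU list above is in a hypothetical file gpu_list.csv and the requested mode arrives in a hypothetical variable REQUESTED_MODE:

#!/bin/bash
# Fan out over "<hostname>,<GPUNUM>,<GPUNUM>,..." lines and set the mode on each listed GPU
REQUESTED_MODE=${REQUESTED_MODE:-EXCLUSIVE_PROCESS}    # assumed to come from the user's request
while IFS=, read -r host gpus; do
    for gpu in ${gpus//,/ }; do
        ssh "$host" "nvidia-smi -i $gpu -c $REQUESTED_MODE"
    done
done < gpu_list.csv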

Is all of that possible?

Thanks,
Josh
Comment 10 Moe Jette 2015-02-16 11:16:20 MST
Wouldn't it be better to just run directly on each compute node rather than using ssh from the head node? See the prolog/epilog web page previously cited.

Comment 11 Josh Bowden 2015-02-16 11:30:21 MST
nvidia-smi requires root privileges to change the access mode, so I would be running as SlurmdUser from the compute or front-end node (i.e. the first row of the first table on the page: http://slurm.schedmd.com/prolog_epilog.html).

I would then have to either:
A/ ssh to any other nodes allocated and run nvidia-smi
or
B/ use 'srun' as SlurmdUser/root to run nvidia-smi on the allocated nodes (if this is possible)

I would have to ensure that PrologFlags=Alloc is not set, as I could otherwise be modifying GPU access while previous jobs are still running.
Comment 12 Moe Jette 2015-02-16 11:38:15 MST
The prolog on each compute node runs as root before the app gets launched. I was thinking it could be passed the allocated GPUs on that node using a CUDA env var, which we already have on hand to launch the app.

Comment 13 Josh Bowden 2015-02-16 11:58:17 MST
Well, if that is the case, then that is exactly what is needed, and it will simplify the prolog script.
Comment 14 Moe Jette 2015-02-17 10:11:24 MST
This will be addressed to the extent possible for now in version 14.11.5. Specifically, the Prolog (run as root on each allocated compute node) will now have the environment variable SLURM_JOB_GPUS passed to it. This will be the same as CUDA_VISIBLE_DEVICES on that specific compute node unless the device is bound to the job using cgroups (in that case, SLURM_JOB_GPUS will be the global GPU index while CUDA_VISIBLE_DEVICES always starts at 0). The changes are here if you are anxious to try this:

https://github.com/SchedMD/slurm/commit/2e95c20b3bf9bcddd9b0fe0048e222fb8306c90b

Note that this will not work if you have PrologFlags=alloc in slurm.conf (the necessary information is not available in that case).

Also note that you will need to use the squeue or scontrol command to get a job's "constraint" specification (i.e. Exclusive Thread, Prohibited, Exclusive Process, or Default).

Fixing those two things will need to wait until Slurm version 15.08 as changes to the RPCs are required.
Comment 15 Moe Jette 2015-02-18 04:43:47 MST
(In reply to Moe Jette from comment #14)
> Note that this (previous patch) will not work if you have in slurm.conf
>  PrologFlags=alloc (the necessary information is not available for that
> to work).
> 
> Also note, that you will need to use the squeue or scontrol command to get a
> job's "constraint" specification (i.e. Exclusive Thread, Prohibited,
> Exclusive Process, or Default).
> 
> Fixing those two things will need to wait until Slurm version 15.08 as
> changes to the RPCs are required.

Here are the fixes to the two problems described above. Changes are fairly extensive and several RPCs changed, so these changes will not be released until Slurm version 15.08, but you should be able to work around the shortcomings as described above.

https://github.com/SchedMD/slurm/commit/6966f77ee747685d4858dc80b8ea35f61872ee72
https://github.com/SchedMD/slurm/commit/06db2ded72ae192bb55f74f10cdb51610b14a8fb

You will have to define node features as desired in slurm.conf, something like this:
NodeName=tux[0-123] Features=gpu_ex,gpu_pro,gpu_ex_pro,gpu_def ...

The user can then specify the desired behaviour, something like this:
sbatch -C gpu_ex --gres=gpu:2 ...

The prolog in Slurm v15.08 will see these env vars:
SLURM_JOB_GPUS=0,1
SLURM_JOB_CONSTRAINTS=gpu_ex

In v14.11 with the previously mentioned commit you will see:
SLURM_JOB_GPUS=0,1
and you will need to use scontrol or squeue to get the constraints, like this:
$ scontrol show job $SLURM_JOB_ID
JobId=26 JobName=hostname
   ...
   Features=gpu_ex Gres=gpu:2 Reservation=(null)
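
Putting that together, a minimal per-node prolog might look something like the sketch below. This is only a sketch: the mapping from feature names to compute modes is an assumption based on the example feature names above, and on v14.11 the constraint would have to come from scontrol as shown rather than from SLURM_JOB_CONSTRAINTS.

#!/bin/bash
# Map the requested feature (assumed names from the slurm.conf example above) to a compute mode
case "$SLURM_JOB_CONSTRAINTS" in
    *gpu_ex_pro*) MODE=EXCLUSIVE_PROCESS ;;
    *gpu_pro*)    MODE=PROHIBITED ;;
    *gpu_ex*)     MODE=EXCLUSIVE_THREAD ;;
    *)            MODE=DEFAULT ;;
esac
# Set the mode on each GPU allocated to this job on this node
for gpu in ${SLURM_JOB_GPUS//,/ }; do
    nvidia-smi -i "$gpu" -c "$MODE"
done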

Please re-open if you encounter more difficulties.
Comment 16 Josh Bowden 2015-02-18 10:34:03 MST
Hi Moe,
That is great news. I have passed this on to our systems people, so I hope they can have it implemented in the near future.
Regards,
Josh.