Bug 2626 - Select GPU with gres
Summary: Select GPU with gres
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 15.08.11
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Tim Wickberg
QA Contact:
URL: https://wp.me/dk5d2
Depends on:
Blocks:
 
Reported: 2016-04-11 07:08 MDT by Davide Vanzo
Modified: 2022-02-18 05:32 MST

See Also:
Site: Vanderbilt
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Description Davide Vanzo 2016-04-11 07:08:44 MDT
Some of our cluster nodes have 4 GPUs + 12 CPU cores. We set up gres so that each GPU is bound to three CPU cores on the same PCIe root complex. Everything works fine until I want to use a specific GPU on a node.
For example, if I request the resources as:

#SBATCH --account=accre_gpu
#SBATCH --partition=titan
#SBATCH --nodes=1
#SBATCH --tasks-per-node=12
#SBATCH --gres=gpu:4
#SBATCH --mem=100G
#SBATCH --time=48:00:00

SLURM correctly initializes CUDA_VISIBLE_DEVICES=0,1,2,3. However, if in my script I do:

export CUDA_VISIBLE_DEVICES=2
srun -n 1 mycuda_program

the process always runs on GPU #0. That is also expected since, as your documentation says, "the environment variable set for the sbatch command only reflects the GPUs allocated to that job on that node, node zero of the allocation". However, it would simplify my life if there were a way to tell srun which GPU I want.
Is there any workaround to select a specific GPU?

Thanks,
Davide
Comment 1 Tim Wickberg 2016-04-11 08:12:49 MDT
Can you use:
"srun -n 1 --gres=gpu:1 mycuda_program"

to select just a single card to use?

The GPU plugin is overriding your CUDA_VISIBLE_DEVICES environment variable; it insists on being in control of the GPU allocations.
Comment 2 Davide Vanzo 2016-04-11 08:18:18 MDT
Tim,
that works if I don't care which one of the four cards gets assigned to my step. But if I want my program to run on GPU #2, leaving the other three idle, this doesn't work.
Comment 3 Tim Wickberg 2016-04-11 08:25:40 MDT
(In reply to Davide Vanzo from comment #2)
> Tim,
> that works if I don't care which one of the four cards get assigned to my
> step. But if I want my program to run on GPU #2 leaving the other three idle
> this doesn't work.

Ahh.

You'd need to ensure that whatever you launch via srun overrides the environment variable itself. The easiest way to do that is to create a wrapper around mycuda_program that sets the variable as desired right before execution.

E.g.

mylaunch.sh:
#!/bin/bash
# Override the device list set by the gres/gpu plugin for this step,
# then launch the real program.
export CUDA_VISIBLE_DEVICES="$1"
mycuda_program

then "srun -n 1 mylaunch.sh 2". 

There's currently no way to stop srun from resetting the variable at task launch, or to explicitly set which of the devices is assigned to that step. That may be worth filing as an enhancement if this is a common issue; I can refile this bug as such if you have no other questions at the moment.

- Tim
Comment 4 Davide Vanzo 2016-04-11 08:33:43 MDT
Yes, your solution works, thanks.
I just wanted to be sure I wasn't missing some feature before filing the enhancement request. Please refile this ticket as such.

Davide
Comment 5 Davide Vanzo 2016-04-12 03:43:52 MDT
Tim,
I found a weird SLURM behavior that seems to be connected to the issue in this ticket.
As I said, I set up gres on the twelve 4-GPU nodes with a shared gres.conf as follows:

NodeName=vmp[1243-1254] Name=gpu File=/dev/nvidia0 CPUs=0-2
NodeName=vmp[1243-1254] Name=gpu File=/dev/nvidia1 CPUs=3-5
NodeName=vmp[1243-1254] Name=gpu File=/dev/nvidia2 CPUs=6-8
NodeName=vmp[1243-1254] Name=gpu File=/dev/nvidia3 CPUs=9-11

and on slurm.conf (extract):

GresTypes=gpu
#
NodeName=vmp[1243-1254] RealMemory=128000 CPUs=12 Sockets=2 CoresPerSocket=6 ThreadsPerCore=1 Gres=gpu:4
#
PartitionName=titan Nodes=vmp[1243-1254] Default=NO MaxTime=20160 DefaultTime=15 DefMemPerNode=2000 MaxMemPerNode=124000 State=UP

Now, in the same cluster we have another set of GPU nodes that are not configured with gres; they are assigned exclusively to the requesting user and belong to a separate partition:

NodeName=vmp[802,805-808,813,815,818,824-826,833-838,844] RealMemory=45000 CPUs=8 Sockets=2 CoresPerSocket=4 ThreadsPerCore=2 Feature=cuda
#
PartitionName=gpu Nodes=vmp[802,805-808,813,815,818,824-826,833-838,844] Default=NO MaxTime=20160 DefaultTime=15 DefMemPerNode=2000 MaxMemPerNode=45000 State=UP Shared=EXCLUSIVE

Before gres was activated on the new GPU nodes, a user could request a node in the "gpu" partition and submit job steps with srun without problems. Now the application cannot see any GPU because CUDA_VISIBLE_DEVICES=NoDevFiles, even though these nodes are not configured with gres. Is this normal?

Davide
Comment 6 Tim Wickberg 2016-04-12 05:06:31 MDT
> Before the activation of gres on the new GPU nodes a user could have
> requested a node in the "gpu" partition and submitted the job step with srun
> without problems. Now what happens is that the application cannot see any
> GPU because CUDA_VISIBLE_DEVICES=NoDevFiles even if the nodes are not set
> with gres. Is this normal?

Expected behavior, yes. (I defer on what "normal" means in an HPC environment... from discussions with sysadmins across the globe I've learned to anticipate a lot of variability in system design.)

When you enabled the gres/gpu plugin, it started enforcing access control for all GPUs in the cluster and setting the CUDA-related environment variables accordingly. Since no devices are configured for those nodes, it simply blocks access to everything.

That said, it sounds like this is undesired behavior here. Slurm obviously doesn't anticipate being put in control of only some of the GPUs in a cluster while being expected to leave the rest alone.
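
To make that concrete (a hedged illustration, not something from your report): launching a trivial step in the "gpu" partition and printing the variable the plugin sets should come back with the NoDevFiles value you saw.

srun --partition=gpu --nodes=1 --ntasks=1 bash -c 'echo "$CUDA_VISIBLE_DEVICES"'
# expected to print NoDevFiles on those nodes, since gres/gpu has no devices configured there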

There are a few approaches you can take here:

1) Start managing those GPUs through GRES. This is the easiest option from Slurm's perspective. You could optionally use a job_submit plugin to automatically set the --gres request for jobs in that partition if you don't want your users to change their scripts. (A config sketch follows this list.)

2) Change the plugin to only enforce these variables and other settings for a node with gres/gpu devices configured. I can look into this as an enhancement if desired; I don't think this would be too difficult to handle.

3) Disable the gres/gpu plugin throughout. Not ideal for obvious reasons.
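
As a rough sketch of option 1 (an assumption on my part: each of those nodes exposes a single GPU at /dev/nvidia0; adjust the count, device paths, and any CPU bindings to the real hardware), the changes would look something like this.

In slurm.conf:
NodeName=vmp[802,805-808,813,815,818,824-826,833-838,844] RealMemory=45000 CPUs=8 Sockets=2 CoresPerSocket=4 ThreadsPerCore=2 Feature=cuda Gres=gpu:1

In gres.conf:
NodeName=vmp[802,805-808,813,815,818,824-826,833-838,844] Name=gpu File=/dev/nvidia0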
Comment 7 Davide Vanzo 2016-04-12 05:12:36 MDT
Thanks for confirming my hypothesis.
I'm already working on managing all GPU nodes via gres. That's something we planned to do anyway.
Thanks again.

DV
Comment 8 Davide Vanzo 2016-04-21 07:42:41 MDT
Tim,
I tried to implement the selection of specific GPUs with a second bash script, but that doesn't work for MPI jobs where the MPI library itself is built against SLURM. Hence it would be nice to have a native SLURM feature for this.

Davide
Comment 9 Tim Wickberg 2016-04-22 06:35:30 MDT
I figured out how you can abuse the Type field to accomplish this right now, although I'm still a bit unclear on your use case and it might not be a perfect match.

For my example here, I'm setting up a node with two GPUs in it, and I want to allow a job to (optionally) specify a given card.

In slurm.conf I have:
NodeName=zoidberg01 Gres=gpu:2

In gres.conf I have:
NodeName=zoidberg01 Name=gpu Type=a File=/tmp/a
NodeName=zoidberg01 Name=gpu Type=b File=/tmp/b

A job can still ask for a number of cards without caring about the specific type like this:
sbatch --gres=gpu:2 test.sh

or you can pick the specific card like so:
sbatch --gres=gpu:a test.sh
or
sbatch --gres=gpu:b test.sh

I believe you can also use this sub-selection with srun within the allocation:
salloc --gres=gpu:a,gpu:b
#both card a and b allocated
srun --gres=gpu:a mybin
#run on card a only

Note that "salloc --gres=gpu:2" isn't sufficient here - the srun type selection fails for some reason I haven't tracked down.

I'm not sure if that results in something close to what you're after or not - the type selection may have some quirks in your environment and I'd encourage thorough testing if you want to try this.

I'll also caution that there appear to be a few lingering issues with AccountingStorageTRES that can cause slurmdbd to segfault when adding additional TRES types, so I'd avoid adding something like "gres/gpu/a,gres/gpu/b" to that field for now.
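
Translated to the vmp nodes from earlier in this ticket, a hedged sketch of the typed configuration might look like this (the type names are arbitrary labels I made up; the CPU bindings are carried over from your existing gres.conf):

NodeName=vmp[1243-1254] Name=gpu Type=card0 File=/dev/nvidia0 CPUs=0-2
NodeName=vmp[1243-1254] Name=gpu Type=card1 File=/dev/nvidia1 CPUs=3-5
NodeName=vmp[1243-1254] Name=gpu Type=card2 File=/dev/nvidia2 CPUs=6-8
NodeName=vmp[1243-1254] Name=gpu Type=card3 File=/dev/nvidia3 CPUs=9-11

A job that wants the third card specifically could then request --gres=gpu:card2, while a plain --gres=gpu:4 should keep working for jobs that don't care which cards they get.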

- Tim
Comment 10 Davide Vanzo 2016-04-25 05:40:48 MDT
Tim,
thank you for your workaround. That should do exactly what I wanted.

One more related thing I found out recently. Let's assume I don't care which GPU my process lands on. If I submit a job requesting four GPUs on a node (--gres=gpu:4) and then run four steps as follows, I would expect SLURM to assign one GPU to each step. In fact all four GPUs do have an associated process, but the performance is very low compared to requesting one GPU and running a single step. Can you clarify how SLURM manages gres allocation across multiple srun invocations?

Davide


#SBATCH --nodes=1
#SBATCH --tasks-per-node=12
#SBATCH --gres=gpu:4
#SBATCH --mem=120G

srun --ntasks=1 --gres=gpu:1 --mem=30G my_cuda_code &
srun --ntasks=1 --gres=gpu:1 --mem=30G my_cuda_code &
srun --ntasks=1 --gres=gpu:1 --mem=30G my_cuda_code &
srun --ntasks=1 --gres=gpu:1 --mem=30G my_cuda_code &
wait
Comment 11 Tim Wickberg 2016-04-26 06:15:38 MDT
(In reply to Davide Vanzo from comment #10)
> Tim,
> thank you for your workaround. That should do exactly what I wanted.
> 
> One more related thing I found out recently. Let's assume I don't care on
> which GPU my process lands. If I submit a job requesting four GPUs on a node
> (--gres=gpu:4) and then I run four steps as follows, I would expect SLURM to
> assign one GPU to each step. As a matter of fact, all four GPUs have an
> associated process, but the performance is very low compared to requesting
> one GPU and running a single step. Can you clarify how SLURM manages gres
> allocation across multiple srun?

This should behave as you expect - each srun is being assigned a separate GPU, and the CUDA_VISIBLE_DEVICES environment variable should be set appropriately for each separate step.
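
A quick, hedged way to verify that from inside the allocation (illustrative only) is to launch a trivial step the same way you launch the real ones and print what it receives:

srun --ntasks=1 --gres=gpu:1 bash -c 'echo "step $SLURM_STEP_ID sees CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"'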

Can you elaborate on how the performance difference is visible?

You may be seeing some performance impact because the assigned CPUs are not local to the GPU itself, although this varies with the system layout.

Take a look at the CPUs option in gres.conf to see how you can steer that manually. You'd need to sort out how the locality is mapped within your nodes, though.
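
Two hedged suggestions for sorting that out, assuming hwloc is installed on the nodes and your driver's nvidia-smi supports the topo subcommand:

lstopo                # hwloc's view: each cudaN coprocessor appears under its NUMA node
nvidia-smi topo -m    # GPU link matrix plus a CPU affinity column per GPU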
Comment 12 Davide Vanzo 2016-04-29 06:20:56 MDT
Yes, I should already have the GPUs bound to the correct CPU sockets. Here is my gres.conf:

NodeName=vmp[1243-1254] Name=gpu File=/dev/nvidia0 CPUs=0-2
NodeName=vmp[1243-1254] Name=gpu File=/dev/nvidia1 CPUs=3-5
NodeName=vmp[1243-1254] Name=gpu File=/dev/nvidia2 CPUs=6-8
NodeName=vmp[1243-1254] Name=gpu File=/dev/nvidia3 CPUs=9-11

and hwloc gives me the following topology:

Machine (128GB total)
  NUMANode L#0 (P#0 64GB)
    Package L#0 + L3 L#0 (15MB)
      L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
      L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#1)
      L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#2)
      L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#3)
      L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 (P#4)
      L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 (P#5)
    HostBridge L#0
      PCIBridge
        PCI 15b3:1003
          Net L#0 "eth0"
          Net L#1 "eth1"
          OpenFabrics L#2 "mlx4_0"
      PCIBridge
        PCI 10de:17c2
          CoProc L#3 "cuda0"
      PCIBridge
        PCI 10de:17c2
          CoProc L#4 "cuda1"
      PCI 8086:8d62
        Block L#5 "sda"
      PCIBridge
        PCIBridge
          PCI 1a03:2000
      PCI 8086:8d02
        Block L#6 "sr0"
  NUMANode L#1 (P#1 64GB)
    Package L#1 + L3 L#1 (15MB)
      L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 (P#6)
      L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 (P#7)
      L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 + PU L#8 (P#8)
      L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 + PU L#9 (P#9)
      L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10 + PU L#10 (P#10)
      L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11 + PU L#11 (P#11)
    HostBridge L#6
      PCIBridge
        PCI 8086:1521
          Net L#7 "eth2"
        PCI 8086:1521
          Net L#8 "eth3"
      PCIBridge
        PCI 10de:17c2
          CoProc L#9 "cuda2"
      PCIBridge
        PCI 10de:17c2
          CoProc L#10 "cuda3"

The problem I'm dealing with at the moment is that when I request two GPUs and two tasks as follows:

#!/bin/bash
#SBATCH --account=accre_gpu
#SBATCH --partition=maxwell
#SBATCH --nodes=1
#SBATCH --ntasks=2
#SBATCH --gres=gpu:2
#SBATCH --mem=40G
#SBATCH --time=48:00:00
#SBATCH --job-name=amber
#SBATCH --output=run_2gpu.log

module load gcc
module load mvapich2_roce
module load amber

srun pmemd.cuda.MPI -O -i Data/mdin.GPU -p Data/prmtop -o mdout_2gpu -inf mdinfo_2gpu -x mdcrd -r restrt

where MVAPICH2 is built against SLURM's PMI2.
When querying the GPUs, it appears that each of the two processes is bound to both GPUs instead of a one-to-one pairing. This strongly affects performance.

$ nvidia-smi
Fri Apr 29 12:03:30 2016       
+------------------------------------------------------+                       
| NVIDIA-SMI 352.39     Driver Version: 352.39         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TIT...  On   | 0000:02:00.0     Off |                  N/A |
| 22%   63C    P2   155W / 250W |    348MiB / 12287MiB |     88%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX TIT...  On   | 0000:03:00.0     Off |                  N/A |
| 31%   74C    P2   177W / 250W |    369MiB / 12287MiB |     96%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX TIT...  On   | 0000:82:00.0     Off |                  N/A |
| 22%   29C    P8    16W / 250W |     23MiB / 12287MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX TIT...  On   | 0000:83:00.0     Off |                  N/A |
| 22%   37C    P8    16W / 250W |     23MiB / 12287MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      2760    C   .../mvapich2/2.1/cuda/7.5/bin/pmemd.cuda.MPI   211MiB |
|    0      2761    C   .../mvapich2/2.1/cuda/7.5/bin/pmemd.cuda.MPI   110MiB |
|    1      2760    C   .../mvapich2/2.1/cuda/7.5/bin/pmemd.cuda.MPI   110MiB |
|    1      2761    C   .../mvapich2/2.1/cuda/7.5/bin/pmemd.cuda.MPI   232MiB |
+-----------------------------------------------------------------------------+

I'm trying to figure out the cause of this behavior but I cannot pinpoint a specific source yet. Any suggestions?

Thank you,
Davide
Comment 13 Tim Wickberg 2016-04-29 06:31:47 MDT
(In reply to Davide Vanzo from comment #12)
> #!/bin/bash
> #SBATCH --account=accre_gpu
> #SBATCH --partition=maxwell
> #SBATCH --nodes=1
> #SBATCH --ntasks=2
> #SBATCH --gres=gpu:2
> #SBATCH --mem=40G
> #SBATCH --time=48:00:00
> #SBATCH --job-name=amber
> #SBATCH --output=run_2gpu.log
> 
> module load gcc
> module load mvapich2_roce
> module load amber
> 
> srun pmemd.cuda.MPI -O -i Data/mdin.GPU -p Data/prmtop -o mdout_2gpu -inf
> mdinfo_2gpu -x mdcrd -r restrt
> 
> where MVAPICH2 is built against SLURM PMI2.
> When querying the GPUs it seems that each of the two processes are bonded to
> both GPUs instead to be a one-to-one pairing. This strongly affects the
> performance.

I can confirm that's the current behavior; each task within the node does have access to all of the GRES associated with that node.

I don't think there's a way to do this explicit binding within a single step at present. If I understand you correctly, you'd like the first task to be allocated (and bound to) the first GPU only, and the second task the second GPU only?

I can confirm this doesn't work:

srun --gres=gpu:1 --ntasks=2

causes both launched tasks to only have access to the first card. (The second would be available to a separate task.)
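
One possible interim workaround, along the lines of the wrapper from comment 3 (a sketch only; it assumes the gres/gpu plugin merely sets the environment variable rather than constraining device access, and given comment 8 it may not play nicely with your MPI build, so treat it as untested):

bind_gpu.sh:
#!/bin/bash
# Point each task in the step at its own card by using the task's
# local rank as the device index before exec'ing the real program.
export CUDA_VISIBLE_DEVICES=$SLURM_LOCALID
exec "$@"

then something like "srun --ntasks=2 --gres=gpu:2 ./bind_gpu.sh pmemd.cuda.MPI ...".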
Comment 14 Davide Vanzo 2016-04-29 06:58:45 MDT
What I'm trying to do is simulate all the possible use cases and figure out what doesn't work before users do.
In this case I was simply trying to run Amber in parallel on 1, 2, 3, and 4 GPUs on a single node. From the Amber output the code appears to correctly bind task 0 to GPU 0 and task 1 to GPU 1, but as you can see that is not the case. I'm now testing with a different application to see if the problem is Amber itself.
Comment 15 Davide Vanzo 2016-05-10 09:16:33 MDT
Tim,
I solved the previous issue. Apparently it is normal for AMBER to use multiple GPUs from each CPU task, so everything is OK there.
However, I still have problems running multiple concurrent job steps within the same sbatch job. I am requesting four GPUs as follows:

#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --gres=gpu:4

Then I submit four job steps with:

srun --ntasks=1 --gres=gpu:1 mycode &
srun --ntasks=1 --gres=gpu:1 mycode &
srun --ntasks=1 --gres=gpu:1 mycode &
srun --ntasks=1 --gres=gpu:1 mycode
wait

My expectation is for SLURM to run each step on one of the four GPUs simultaneously. However, only the first step gets allocated, while the others wait until the previous one is done, printing the following warning:

srun: Job step creation temporarily disabled, retrying

Am I doing something wrong?

Davide
Comment 16 Tim Wickberg 2016-05-10 09:49:22 MDT
(In reply to Davide Vanzo from comment #15)
> Tim,
> I solved the previous issue. Apparently it is normal for AMBER to run
> multiple GPUs on each CPU task. Everything ok there.
> However, I still have problem running multiple concurrent job within the
> same sbatch. I am requesting four GPUs as follows:
> 
> #SBATCH --nodes=1
> #SBATCH --ntasks=4
> #SBATCH --gres=gpu:4
> 
> Then I submit four job steps with:
> 
> srun --ntasks=1 --gres=gpu:1 mycode &
> srun --ntasks=1 --gres=gpu:1 mycode &
> srun --ntasks=1 --gres=gpu:1 mycode &
> srun --ntasks=1 --gres=gpu:1 mycode
> wait
> 
> My expectation is for SLURM to allocate each step on one of the four GPU
> simultaneously. However, only the first step gets allocated, while the other
> wait until the previous is done and return the following warning:
> 
> srun: Job step creation temporarily disabled, retrying
> 
> Am I doing something wrong?
> 
> Davide

I'm guessing the memory limit is causing this. If you specify a --mem for each srun step do things run as expected?
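
For concreteness, a hedged sketch of that change applied to your four steps (reusing the 120G/30G figures from comment 10; adjust to your actual needs):

#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --gres=gpu:4
#SBATCH --mem=120G

# Give each step an explicit slice of the job's memory so the four steps
# don't each claim the whole allocation and end up running one at a time.
srun --ntasks=1 --gres=gpu:1 --mem=30G mycode &
srun --ntasks=1 --gres=gpu:1 --mem=30G mycode &
srun --ntasks=1 --gres=gpu:1 --mem=30G mycode &
srun --ntasks=1 --gres=gpu:1 --mem=30G mycode &
wait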
Comment 17 Davide Vanzo 2016-05-11 02:28:13 MDT
You're right. I had the same intuition when I was riding back home. I have to put the SchedMD website in a blacklist after 3pm so that I avoid spamming you when my brain is overloaded.
Feel free to close the ticket now.
Thank you again for your patience.

Davide
Comment 18 Tim Wickberg 2016-05-11 05:35:58 MDT
Happens to all of us; glad that one was a simple fix.

- Tim

(In reply to Davide Vanzo from comment #17)
> You're right. I had the same intuition when I was riding back home. I have
> to put the SchedMD website in a blacklist after 3pm so that I avoid spamming
> you when my brain is overloaded.
> Feel free to close the ticket now.
> Thank you again for your patience.
> 
> Davide