Bug 11091 - Support for Nvidia A100 MIG instances
Summary: Support for Nvidia A100 MIG instances
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: GPU
Version: 20.11.4
Hardware: Linux Linux
Importance: --- 4 - Minor Issue
Assignee: Director of Support
QA Contact:
URL:
Duplicates: 10583
Depends on:
Blocks:
 
Reported: 2021-03-15 15:25 MDT by Tony Racho
Modified: 2022-09-19 04:21 MDT
CC List: 8 users

See Also:
Site: CRAY
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: NIWA/WELLINGTON
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
v1 (2.65 KB, patch)
2021-04-28 15:45 MDT, Michael Hinton

Description Tony Racho 2021-03-15 15:25:52 MDT
Hi:

We have 8x A100 GPU nodes and we are planning to enable MIG on them.

Is this supported by Slurm, or will it be supported by Slurm?

Is there a roadmap or an ETA for this?

Thank you.

Cheers,
Tony
Comment 1 Michael Hinton 2021-03-15 17:29:25 MDT
Hi Tony,

(In reply to Tony Racho from comment #0)
> We have 8x A100 GPU nodes and we are planning to enable MIG on them.
> Is this supported by Slurm, or will it be supported by Slurm?
Can you describe exactly how you want Slurm to support MIG?

In 20.11, we did add a MultipleFiles field in gres.conf to support NVIDIA MIG devices. This allows multiple files to be associated with a single GPU. We are currently working on the documentation for that new field.
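As a rough sketch only (the device paths here are placeholders, not taken from a real MIG setup), a MultipleFiles line in gres.conf ties several device files to a single GRES:

    # gres.conf sketch -- placeholder paths, adjust to the files your MIG setup actually creates
    Name=gpu Type=a100 MultipleFiles=/dev/nvidia0,/dev/nvidia-caps/nvidia-cap2,/dev/nvidia-caps/nvidia-cap3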

I also found this in the FAQ (https://slurm.schedmd.com/faq.html#opencl_pmix):

"In order to use Multi-Instance GPUs with Slurm and PMIx you can instruct hwloc to not query OpenCL devices by setting the HWLOC_COMPONENTS=-opencl environment variable for slurmd, i.e. setting this variable in systemd unit file for slurmd." 

See also bug 10356 for more context. But it sounds like this is really just a workaround and will require some additional work.
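For what it's worth, here is a minimal sketch of the systemd approach the FAQ describes (the drop-in path is just an example; adjust it to wherever your distro keeps the slurmd unit):

    # e.g. /etc/systemd/system/slurmd.service.d/hwloc.conf (hypothetical drop-in path)
    [Service]
    Environment=HWLOC_COMPONENTS=-opencl

Then reload systemd and restart slurmd (systemctl daemon-reload && systemctl restart slurmd) for it to take effect.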

> Is there a roadmap or an ETA for this?
I know that MultipleFiles was meant to pave the way in the code to support MIG, but it's unclear if that is sufficient or if more needs to be done to properly support MIG. Let me get back to you on that.

Thanks,
-Michael
Comment 3 Michael Hinton 2021-03-16 10:49:36 MDT
(In reply to Tony Racho from comment #0)
> We have 8x A100 GPU nodes and we are planning to enable MIG on them.
> Is this supported by Slurm, or will it be supported by Slurm?
> Is there a roadmap or an ETA for this?
It looks like this just got approved and sponsored for 21.08. Please feel free to let us know how you expect Slurm to support MIG and we'll take that into consideration.

-Michael
Comment 4 Simon Raffeiner 2021-03-26 01:49:20 MDT
We at KIT are also interested in improved MIG support, especially for use cases like JupyterHub. Most users don't need a full A100 during the training/development phase; with MIG support we could handle up to 7 times more users on the same hardware.

Support for static partitioning is a good first step, but the ideal solution would support dynamic creation and destruction of MIG instances. This would probably require new parameters for sbatch/salloc and scheduling of sub-resources within a node, though.

- Simon
Comment 5 Simon Raffeiner 2021-03-26 01:57:41 MDT
*** Bug 10583 has been marked as a duplicate of this bug. ***
Comment 6 Chris Samuel (NERSC) 2021-03-31 13:10:21 MDT
Hi there,

NERSC are also interested (as per private discussions with Tim).

I'm being asked about the situation by our user support folks. Is there an update/plan I can pass on, please?

All the best,
Chris
Comment 8 Michael Hinton 2021-03-31 14:12:53 MDT
Hey Chris,

(In reply to Chris Samuel (NERSC) from comment #6)
> I'm being asked about the situation by our user support folks, is there an
> update/plan I can pass on please?
Specifically, MIG auto-detection with AutoDetect=nvml has been sponsored for 21.08. In theory, statically-partitioned MIG devices are supported today in 20.11 with the MultipleFiles gres.conf field, though we still need to test that this works as expected and to document MultipleFiles.
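For reference, the auto-detection side is the existing AutoDetect mechanism in gres.conf; a sketch of today's syntax (not a promise about what the final 21.08 configuration will look like, and it requires slurmd to be built against NVML):

    # gres.conf -- NVML-based GPU auto-detection
    AutoDetect=nvml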

Simon,

(In reply to Simon Raffeiner from comment #4)
> Support for static partitioning is a good first step, but the ideal solution
> would support dynamic creation and destruction of MIG instances.
This is currently not on the table, and probably won't be, barring a sponsorship.

Thanks!
-Michael
Comment 9 Chris Samuel (NERSC) 2021-03-31 14:24:33 MDT
Hi Michael,

Thank you, that's much appreciated!

All the best,
Chris
Comment 10 Michael Hinton 2021-04-06 13:10:47 MDT
Tony,

To summarize:

Statically-partitioned MIG devices should be supported today in 20.11 with the addition of MultipleFiles to gres.conf. We are in the process of documenting MultipleFiles.

Using AutoDetect=nvml to automatically detect MIG devices is not supported today, but will be in 21.08.

Using Slurm to dynamically change MIG partitions is not on the roadmap.

I'll go ahead and close this out. Feel free to reopen if you have further questions.

Thanks!
-Michael
Comment 11 Michael Hinton 2021-04-28 12:26:50 MDT
Just an update on this:

After playing around with MIGs for my 21.08 dev work, I would say that Slurm 20.11 doesn't support them out of the box. While MultipleFiles exists today in 20.11, Slurm does not allow duplicate files to be specified for GPUs in gres.conf, and MIG partitions share some device files (since they are all children of a parent GPU device such as /dev/nvidia0, each MIG partition needs that file in its device cgroup, I believe).

This restriction is pretty superficial in the code, but then there is the issue of GPU binding. To select a MIG device, you need to specify the UUID, GI ID, and CI ID in CUDA_VISIBLE_DEVICES, and this format is not yet supported in Slurm. So for more than one MIG device, your step will need to manually set CUDA_VISIBLE_DEVICES to achieve the right binding. This is something I hope to address in 21.08, but I can't make any promises.
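For reference, the manual form looks roughly like this (the IDs are placeholders based on NVIDIA's MIG enumeration scheme; check the MIG user guide for the exact syntax your driver/CUDA version expects):

    # set before launching the CUDA application; the IDs are placeholders
    export CUDA_VISIBLE_DEVICES=MIG-GPU-<gpu-uuid>/<gpu-instance-id>/<compute-instance-id>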

-Michael
Comment 12 Tony Racho 2021-04-28 14:55:59 MDT
Hi Michael:

Wanted to check if it’s possible to use MIGs at all with 20.11, e.g. perhaps by putting /dev/nvidia0 into cgroups ahead of time? Trying to understand whether anything more can be done, even just for internal testing, before 21.08.

Cheers,
Tony
Comment 13 Michael Hinton 2021-04-28 15:45:49 MDT
Created attachment 19176 [details]
v1

Hey Tony,

(In reply to Michael Hinton from comment #11)
> While MultipleFiles exists today
> in 20.11, Slurm does not allow duplicate files to be specified for GPUs in
> gres.conf...
> 
> This restriction is pretty superficial in the code
The attached v1 patch for 20.11 removes this restriction for files specified by MultipleFiles and should allow you to properly play around with MIGs in 20.11.

Just note that CUDA_VISIBLE_DEVICES will not be set appropriately, so you will have to set it yourself before you start your CUDA application. If there is only one MIG device in your job or step allocation, though, I've found that you can set CUDA_VISIBLE_DEVICES=0 and it will select the first device it sees. It might even select the first available device if CUDA_VISIBLE_DEVICES is unset, IIRC.
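A hypothetical batch script along those lines (the GRES type name and the application are placeholders, and it assumes a gres.conf like the one further down in this comment):

    #!/bin/bash
    #SBATCH --gres=gpu:a100:1
    # only one MIG slice in the allocation, so index 0 should resolve to it
    export CUDA_VISIBLE_DEVICES=0
    srun ./my_cuda_app    # placeholder for your CUDA application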

Here's the MIG documentation for how to set up MIG partitions: https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html#running-with-mig
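The MIG partitions themselves are created with nvidia-smi outside of Slurm; something along these lines (the profile IDs are examples for an A100, so double-check them against the guide above, and enabling MIG mode may require a GPU reset or reboot on some driver versions):

    # enable MIG mode on GPU 0, then create two 1g.5gb GPU instances with default compute instances
    sudo nvidia-smi -i 0 -mig 1
    sudo nvidia-smi mig -i 0 -cgi 19,19 -C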

In gres.conf, you'll want to create a single GRES line with MultipleFiles set for each MIG partition. It will probably look something like this:

    Name=gpu Type=a100 MultipleFiles=/dev/nvidia0,/dev/nvidia-caps/nvidia-cap2,/dev/nvidia-caps/nvidia-cap3
    Name=gpu Type=a100 MultipleFiles=/dev/nvidia0,/dev/nvidia-caps/nvidia-cap2,/dev/nvidia-caps/nvidia-cap4
    ...

The specific `nvidia-capX` files will vary depending on which partitions you create. You'll specify the GPU device file, one cap file for the child GPU instance (GI), and one cap file for the child compute instance (CI). The nvidia-cap file to GI/CI mapping can be found in /proc/driver/nvidia-caps/mig-minors, and this is explained further in the MIG docs.
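One way to work out that mapping (read-only commands, so they're safe to run on the node):

    # map GI/CI IDs to cap minor numbers, then match them against the /dev entries
    cat /proc/driver/nvidia-caps/mig-minors
    ls -l /dev/nvidia-caps/
    nvidia-smi -L    # lists MIG device UUIDs under their parent GPU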

You'll know it's working if `nvidia-smi -L` and `nvidia-smi` only show the GPU MIG partitions you requested for a job or step allocation. This means that cgroups is properly restricting device file access.
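For example, something like this inside an allocation (the GRES type name assumes the gres.conf lines above):

    # request one MIG slice and confirm that only that slice is visible
    srun --gres=gpu:a100:1 nvidia-smi -L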

If you do any testing with salloc, make sure you have `LaunchParameters=use_interactive_step` set (this replaces SallocDefaultCommand). Otherwise, salloc will give you a local shell on the login machine, with access to all of its GPUs and no cgroup restrictions, instead of a shell on the allocated node. This can be really confusing in test setups where the login machine and the node are the same machine.
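That's a one-line slurm.conf setting, followed by a reconfigure/restart of the daemons:

    # slurm.conf
    LaunchParameters=use_interactive_step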

That's about as much support as I can give you for 20.11, just to get your feet wet; in 21.08, MIG should have more official support.

-Michael
Comment 14 Tony Racho 2021-04-28 15:48:38 MDT
Thanks Michael. That should get us started.