Bug 10970 - Support auto-discovery of NVIDIA devices partitioned into separate MIG instances
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: GPU
Version: 21.08.x
Hardware: Linux
Importance: 5 - Enhancement
Assignee: Director of Support
Depends on: 10827 11883 12039
Reported: 2021-02-26 15:07 MST by Tim Wickberg
Modified: 2021-07-30 11:34 MDT
CC List: 7 users

Site: NVIDIA (PSLA)
Version Fixed: 21.08.0rc1
Target Release: 21.08
DevPrio: 1 - Paid


Attachments
v1 (74.96 KB, patch)
2021-05-14 16:47 MDT, Michael Hinton
v2 (81.64 KB, patch)
2021-05-18 12:46 MDT, Michael Hinton
2108 v3 (88.93 KB, patch)
2021-07-07 13:34 MDT, Michael Hinton

Description Tim Wickberg 2021-02-26 15:07:05 MST

    
Comment 6 Michael Hinton 2021-05-14 16:38:53 MDT
Ok, here is an example of MIG autodetection working on a Google Cloud instance with a real A100:

$ cat slurm.conf | grep -i gres
GresTypes=gpu
AccountingStorageTRES=gres/gpu,gres/gpu:a100
NodeName=DEFAULT Gres=gpu:a100:4

$ cat gres.conf 
Autodetect=nvml

$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

$ nvidia-smi -L
GPU 0: A100-SXM4-40GB (UUID: GPU-4caf5c13-9f23-e8e2-8dba-baca4b78b728)
  MIG 1g.5gb Device 0: (UUID: MIG-GPU-4caf5c13-9f23-e8e2-8dba-baca4b78b728/7/0)
  MIG 1g.5gb Device 1: (UUID: MIG-GPU-4caf5c13-9f23-e8e2-8dba-baca4b78b728/8/0)
  MIG 1g.5gb Device 2: (UUID: MIG-GPU-4caf5c13-9f23-e8e2-8dba-baca4b78b728/9/0)
  MIG 1g.5gb Device 3: (UUID: MIG-GPU-4caf5c13-9f23-e8e2-8dba-baca4b78b728/11/0)

$ srun --gres=gpu:1 sleep 6000 &
[1] 14096

$ srun --gres=gpu:1 sleep 6000 &
[2] 14117

$ srun --gpus=1 env | grep CUDA
CUDA_VISIBLE_DEVICES=MIG-GPU-4caf5c13-9f23-e8e2-8dba-baca4b78b728/9/0

$ srun --gpus=2 env | grep CUDA
CUDA_VISIBLE_DEVICES=MIG-GPU-4caf5c13-9f23-e8e2-8dba-baca4b78b728/9/0,MIG-GPU-4caf5c13-9f23-e8e2-8dba-baca4b78b728/11/0

$ salloc --gpus=2
salloc: Granted job allocation 65

[salloc job=65]$ nvidia-smi -L
GPU 0: A100-SXM4-40GB (UUID: GPU-4caf5c13-9f23-e8e2-8dba-baca4b78b728)
  MIG 1g.5gb Device 0: (UUID: MIG-GPU-4caf5c13-9f23-e8e2-8dba-baca4b78b728/9/0)
  MIG 1g.5gb Device 1: (UUID: MIG-GPU-4caf5c13-9f23-e8e2-8dba-baca4b78b728/11/0)

As you can see, cgroup device isolation is working correctly. I verified that I can't set CUDA_VISIBLE_DEVICES to another MIG device and use it if it wasn't granted in the allocation.

I will attach a v1 shortly.

P.S.: How I created the MIG partitions for this example (four GPU instances (GIs), each with one compute instance (CI) slice):

sudo nvidia-smi -mig 1
sudo reboot -h now
sudo nvidia-smi mig -cgi 19 -C
sudo nvidia-smi mig -cgi 19 -C
sudo nvidia-smi mig -cgi 19 -C
sudo nvidia-smi mig -cgi 19 -C
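For reference, the MIG device strings shown by `nvidia-smi -L` above follow the `MIG-<GPU-UUID>/<GI-ID>/<CI-ID>` format. A small shell sketch that pulls those pieces apart (the sample output is hard-coded here so it runs without a GPU; on a real node you would pipe in `nvidia-smi -L` instead):

```shell
# Sample `nvidia-smi -L` lines, hard-coded so no GPU is required.
sample='  MIG 1g.5gb Device 0: (UUID: MIG-GPU-4caf5c13-9f23-e8e2-8dba-baca4b78b728/9/0)
  MIG 1g.5gb Device 1: (UUID: MIG-GPU-4caf5c13-9f23-e8e2-8dba-baca4b78b728/11/0)'

# Extract the MIG identifier, then split it into GPU UUID, GI ID, and CI ID.
printf '%s\n' "$sample" | grep -o 'MIG-GPU-[^)]*' |
while IFS=/ read -r uuid gi ci; do
    echo "uuid=$uuid gi=$gi ci=$ci"
done
```

This is the same identifier format that shows up in CUDA_VISIBLE_DEVICES in the srun examples above.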
Comment 8 Michael Hinton 2021-05-14 16:47:16 MDT
Created attachment 19509 [details]
v1

Here is v1. Commits 1-12 are really just minor fixups and clarifications to the existing code. Commits 13-26 are the meat of the patch. Commits 27-29 optionally add a UniqueId field to gres.conf and do other things for testing.
Comment 11 Michael Hinton 2021-05-18 12:46:20 MDT
Created attachment 19550 [details]
v2

v2 is mostly the same as v1, but it adds the opt-in AutoDetect=uuid option. This option allows CUDA_VISIBLE_DEVICES to use UUIDs instead of GPU indexes (in v1, it would always use the UUID if available, which is not backwards-compatible and would surprise people).

v2 now also *always* sends the AutoDetect value from the slurmd to the stepd. Before, it was only sent if the job requested GPU binding or frequency modulation. This change was needed to enable AutoDetect=uuid.

Note that although v1 and v2 have a UniqueId gres.conf option, that is really only a convenience for testing, and I think we may want to ultimately remove that field from the final patchset. Combined with AutoDetect=uuid, this would mean that gres_slurmd_conf_t's unique_id would only be accessible via AutoDetect, which is probably what we want.
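For illustration, a hypothetical gres.conf combining the two mechanisms described above. The comma-separated form is an assumption on my part; this ticket only names the opt-in `uuid` option itself, and the exact syntax is whatever the v2 patch defines:

```
# gres.conf -- hypothetical sketch, not confirmed syntax:
# autodetect GPUs (including MIG instances) via NVML, and opt in
# to UUID-based CUDA_VISIBLE_DEVICES instead of GPU indexes
AutoDetect=nvml,uuid
```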

Now that test 39.18 has been updated in bug 8421, I plan on extending test 39.18 to test the changes in v2.

Three commits added in v2 compared to v1:

[PATCH 11/32] AutoDetect: Ignore case for `off` option
[PATCH 28/32] GRES: Always send AutoDetect flags to stepd
[PATCH 29/32] AutoDetect: Add `uuid` AutoDetect option
Comment 14 Michael Hinton 2021-07-07 13:34:52 MDT
Created attachment 20277 [details]
2108 v3

Same as v2, except rebased on the latest master and on 10827-2018-v2, with the following three commits added:


[PATCH 33/35] GRES: Print out UniqueID
Useful for debugging, and needed for testing.

[PATCH 34/35] GRES: Add unique_id to MPS records
Also needed for tests to work.

[PATCH 35/35] Testsuite - Test unique_id AutoDetect parsing and sorting
Comment 40 Michael Hinton 2021-07-30 10:40:33 MDT
Ok, MIG code is now in Slurm 21.08.0rc1. Here are the relevant commits:

The bulk of the MIG code work:
* https://github.com/SchedMD/slurm/compare/b02f5c8f6adc...3fa77496caf4

Tweak configure script so Slurm MIG code only compiles with CUDA 11.1:
* https://github.com/SchedMD/slurm/compare/5abc165e833e...4926f049afb9

Remove overzealous error from MIG code:
* https://github.com/SchedMD/slurm/commit/28f041cba0072f1809e381c103c564164fe2a549

Further tweaks of the MIG configure process:
* https://github.com/SchedMD/slurm/compare/fbcf6a9fb064...23c8b7b9e828
* https://github.com/SchedMD/slurm/commit/1ca5f08f038c389b5073ea3f00de3b4aef628467

If you have any issues with the new MIG additions, please open a new ticket and we'll address them there.

Thanks!
-Michael