Bug 10583 - Missing support for Nvidia A100 MIG mode
Summary: Missing support for Nvidia A100 MIG mode
Status: RESOLVED DUPLICATE of bug 11091
Alias: None
Product: Slurm
Classification: Unclassified
Component: GPU
Version: 20.11.2
Hardware: Linux Linux
Importance: --- 6 - No support contract
Assignee: Jacob Jenson
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2021-01-07 07:24 MST by Bernhard
Modified: 2021-03-26 01:57 MDT
CC: 3 users

See Also:
Site: -Other-
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments

Description Bernhard 2021-01-07 07:24:31 MST
Hello SchedMD Team,

First of all, thank you very much for the excellent software you develop!

We have recently deployed a DGX-A100 in our cluster. So far, we have been able to get Slurm 20.11.2 on Ubuntu 20.04 up and running in A100 standard mode.
The A100 also supports MIG (Multi-Instance GPU) mode, which allows the A100 hardware to be partitioned. Even though this is not our preferred mode of operation for the cluster, it offers options to better adapt cluster operation to user requirements.
More info on MIG:
https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html
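
For reference, the MIG partitioning itself is configured outside of Slurm with nvidia-smi; a rough sketch (profile ID 19 is the 1g.5gb profile on the A100-40GB, IDs vary by GPU model):

# Enable MIG mode on GPU 0 (takes effect after a GPU reset/reboot)
nvidia-smi -i 0 -mig 1
# Create two GPU instances with default compute instances
nvidia-smi mig -i 0 -cgi 19,19 -C
# List the resulting MIG devices
nvidia-smi -L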

However, Slurm appears to have no knowledge of the MIG configuration: a system in MIG mode is treated by Slurm as a system in "normal" mode. Only a CUDA application submitted via sbatch reports an error, e.g.:
CUDA error at dmmaTensorCoreGemm.cu:855 code=46(cudaErrorDevicesUnavailable) "cudaMalloc((void**)&A, sizeof(double) * M_GLOBAL * K_GLOBAL)"
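
As far as we understand the MIG documentation, once MIG mode is enabled no compute work can run on the parent GPU device any more; a process has to be bound to a specific MIG instance (e.g. via CUDA_VISIBLE_DEVICES), which Slurm 20.11 never does because it only knows the eight parent devices. Outside of Slurm the binding would look roughly like this (the UUID is a placeholder; the exact identifier format depends on the driver/CUDA version):

# List the MIG device identifiers
nvidia-smi -L
# Bind the application to one MIG compute instance
export CUDA_VISIBLE_DEVICES=MIG-GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/1/0
./cuda_app   # e.g. the dmmaTensorCoreGemm sample above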

gres.conf
NodeName=dgxa1 AutoDetect=nvml

slurm.conf
GresTypes=gpu,mps
NodeName=dgxa1 Gres=gpu:a100-sxm4-40gb:8,mps:1600 CPUs=256 Boards=1 SocketsPerBoard=2 CoresPerSocket=64 ThreadsPerCore=2 RealMemory=1031883
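
For completeness, an explicit gres.conf without NVML autodetection would look roughly like the following; it makes no difference here, since it also only describes the eight physical devices and Slurm 20.11 has no notion of MIG instances:

NodeName=dgxa1 Name=gpu Type=a100-sxm4-40gb File=/dev/nvidia[0-7]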

Interestingly, there is no configuration error message from either slurmd or slurmctld.

slurmd.log:
[2021-01-07T14:19:58.311] gpu/nvml: _get_system_gpu_list_nvml: 8 GPU system device(s) detected
[2021-01-07T14:19:58.311] debug:  Gres GPU plugin: Normalizing gres.conf with system GPUs
[2021-01-07T14:19:58.311] debug2: gres/gpu: _normalize_gres_conf: gres_list_conf:
[2021-01-07T14:19:58.311] debug2:     GRES[gpu] Type:a100-sxm4-40gb Count:8 Cores(256):(null)  Links:(null) Flags:HAS_TYPE File:(null)
[2021-01-07T14:19:58.311] debug2:     GRES[mps] Type:(null) Count:1600 Cores(256):(null)  Links:(null) Flags: File:(null)
[2021-01-07T14:19:58.311] debug2: gres/gpu: _normalize_gres_conf: preserving original `mps` GRES record
[2021-01-07T14:19:58.311] debug:  gres/gpu: _normalize_gres_conf: Including the following GPU matched between system and configuration:
[2021-01-07T14:19:58.311] debug:      GRES[gpu] Type:a100-sxm4-40gb Count:1 Cores(256):0-127  Links:0,0,0,0,0,0,0,-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia7
[2021-01-07T14:19:58.311] debug:  gres/gpu: _normalize_gres_conf: Including the following GPU matched between system and configuration:
[2021-01-07T14:19:58.311] debug:      GRES[gpu] Type:a100-sxm4-40gb Count:1 Cores(256):0-127  Links:0,0,0,0,0,0,-1,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia6
[2021-01-07T14:19:58.311] debug:  gres/gpu: _normalize_gres_conf: Including the following GPU matched between system and configuration:
[2021-01-07T14:19:58.311] debug:      GRES[gpu] Type:a100-sxm4-40gb Count:1 Cores(256):0-127  Links:0,0,0,0,0,-1,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia5
[2021-01-07T14:19:58.311] debug:  gres/gpu: _normalize_gres_conf: Including the following GPU matched between system and configuration:
[2021-01-07T14:19:58.311] debug:      GRES[gpu] Type:a100-sxm4-40gb Count:1 Cores(256):0-127  Links:0,0,0,0,-1,0,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia4
[2021-01-07T14:19:58.311] debug:  gres/gpu: _normalize_gres_conf: Including the following GPU matched between system and configuration:
[2021-01-07T14:19:58.311] debug:      GRES[gpu] Type:a100-sxm4-40gb Count:1 Cores(256):0-127  Links:0,0,0,-1,0,0,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia3
[2021-01-07T14:19:58.311] debug:  gres/gpu: _normalize_gres_conf: Including the following GPU matched between system and configuration:
[2021-01-07T14:19:58.311] debug:      GRES[gpu] Type:a100-sxm4-40gb Count:1 Cores(256):0-127  Links:0,0,-1,0,0,0,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia2
[2021-01-07T14:19:58.312] debug:  gres/gpu: _normalize_gres_conf: Including the following GPU matched between system and configuration:
[2021-01-07T14:19:58.312] debug:      GRES[gpu] Type:a100-sxm4-40gb Count:1 Cores(256):0-127  Links:0,-1,0,0,0,0,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia1
[2021-01-07T14:19:58.312] debug:  gres/gpu: _normalize_gres_conf: Including the following GPU matched between system and configuration:
[2021-01-07T14:19:58.312] debug:      GRES[gpu] Type:a100-sxm4-40gb Count:1 Cores(256):0-127  Links:-1,0,0,0,0,0,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0
[2021-01-07T14:19:58.312] debug2: gres/gpu: _normalize_gres_conf: gres_list_gpu
[2021-01-07T14:19:58.312] debug2:     GRES[gpu] Type:a100-sxm4-40gb Count:1 Cores(256):0-127  Links:-1,0,0,0,0,0,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0
[2021-01-07T14:19:58.312] debug2:     GRES[gpu] Type:a100-sxm4-40gb Count:1 Cores(256):0-127  Links:0,-1,0,0,0,0,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia1
[2021-01-07T14:19:58.312] debug2:     GRES[gpu] Type:a100-sxm4-40gb Count:1 Cores(256):0-127  Links:0,0,-1,0,0,0,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia2
[2021-01-07T14:19:58.312] debug2:     GRES[gpu] Type:a100-sxm4-40gb Count:1 Cores(256):0-127  Links:0,0,0,-1,0,0,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia3
[2021-01-07T14:19:58.312] debug2:     GRES[gpu] Type:a100-sxm4-40gb Count:1 Cores(256):0-127  Links:0,0,0,0,-1,0,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia4
[2021-01-07T14:19:58.312] debug2:     GRES[gpu] Type:a100-sxm4-40gb Count:1 Cores(256):0-127  Links:0,0,0,0,0,-1,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia5
[2021-01-07T14:19:58.312] debug2:     GRES[gpu] Type:a100-sxm4-40gb Count:1 Cores(256):0-127  Links:0,0,0,0,0,0,-1,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia6
[2021-01-07T14:19:58.312] debug2:     GRES[gpu] Type:a100-sxm4-40gb Count:1 Cores(256):0-127  Links:0,0,0,0,0,0,0,-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia7
[2021-01-07T14:19:58.312] debug:  Gres GPU plugin: Final normalized gres.conf list:
[2021-01-07T14:19:58.312] debug:      GRES[gpu] Type:a100-sxm4-40gb Count:1 Cores(256):0-127  Links:-1,0,0,0,0,0,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0
[2021-01-07T14:19:58.312] debug:      GRES[gpu] Type:a100-sxm4-40gb Count:1 Cores(256):0-127  Links:0,-1,0,0,0,0,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia1
[2021-01-07T14:19:58.312] debug:      GRES[gpu] Type:a100-sxm4-40gb Count:1 Cores(256):0-127  Links:0,0,-1,0,0,0,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia2
[2021-01-07T14:19:58.312] debug:      GRES[gpu] Type:a100-sxm4-40gb Count:1 Cores(256):0-127  Links:0,0,0,-1,0,0,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia3
[2021-01-07T14:19:58.312] debug:      GRES[gpu] Type:a100-sxm4-40gb Count:1 Cores(256):0-127  Links:0,0,0,0,-1,0,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia4
[2021-01-07T14:19:58.312] debug:      GRES[gpu] Type:a100-sxm4-40gb Count:1 Cores(256):0-127  Links:0,0,0,0,0,-1,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia5
[2021-01-07T14:19:58.312] debug:      GRES[gpu] Type:a100-sxm4-40gb Count:1 Cores(256):0-127  Links:0,0,0,0,0,0,-1,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia6
[2021-01-07T14:19:58.312] debug:      GRES[gpu] Type:a100-sxm4-40gb Count:1 Cores(256):0-127  Links:0,0,0,0,0,0,0,-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia7
[2021-01-07T14:19:58.312] debug:      GRES[mps] Type:(null) Count:1600 Cores(256):(null)  Links:(null) Flags: File:(null)
[2021-01-07T14:19:58.312] debug3: gres_device_major : /dev/nvidia0 major 195, minor 0
[2021-01-07T14:19:58.312] debug3: gres_device_major : /dev/nvidia1 major 195, minor 1
[2021-01-07T14:19:58.312] debug3: gres_device_major : /dev/nvidia2 major 195, minor 2
[2021-01-07T14:19:58.312] debug3: gres_device_major : /dev/nvidia3 major 195, minor 3
[2021-01-07T14:19:58.312] debug3: gres_device_major : /dev/nvidia4 major 195, minor 4
[2021-01-07T14:19:58.312] debug3: gres_device_major : /dev/nvidia5 major 195, minor 5
[2021-01-07T14:19:58.312] debug3: gres_device_major : /dev/nvidia6 major 195, minor 6
[2021-01-07T14:19:58.312] debug3: gres_device_major : /dev/nvidia7 major 195, minor 7
slurmctld.log:
[2021-01-07T14:35:07.050] node dgxa1 returned to service
[2021-01-07T14:35:07.355] sched: Allocate JobId=267 NodeList=dgxa1 #CPUs=2 Partition=All
[2021-01-07T14:35:09.032] _job_complete: JobId=267 WEXITSTATUS 0
[2021-01-07T14:35:09.033] _job_complete: JobId=267 done
[2021-01-07T14:37:01.912] _slurm_rpc_submit_batch_job: JobId=268 InitPrio=61 usec=393
[2021-01-07T14:37:02.600] sched: Allocate JobId=268 NodeList=dgxa1 #CPUs=2 Partition=All
[2021-01-07T14:37:03.852] _job_complete: JobId=268 WEXITSTATUS 1
[2021-01-07T14:37:03.852] _job_complete: JobId=268 done
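
As the slurmd log above shows, NVML detection only enumerates the eight parent GPUs, even though the node is in MIG mode. The mismatch is easy to see by comparing (sketch, using our node name):

nvidia-smi -L                         # lists the MIG devices under each A100
scontrol show node dgxa1 | grep Gres  # Slurm still reports only the eight a100-sxm4-40gb GPUs (plus mps)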

It would be great if you could take a look at this and consider supporting MIG in Slurm in the future.
Or am I overlooking existing configuration options for Nvidia MIG?

If not, a roadmap for Nvidia MIG support would be very welcome.

Best regards
Bernhard
Comment 1 Simon Raffeiner 2021-03-26 01:57:41 MDT
Duplicate of bug 11091

*** This bug has been marked as a duplicate of bug 11091 ***