Hello SchedMD Team,

First of all, thanks very much for your excellent software!

We have recently deployed a DGX A100 in our cluster. So far, we have been able to get Slurm 20.11.2 up and running on Ubuntu 20.04 with the A100s in standard mode. The A100 also supports a MIG (Multi-Instance GPU) mode, which allows the A100 hardware to be partitioned. Even though this is not our preferred method of cluster operation, it offers options to better adapt cluster operation to user requirements. More info on MIG: https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html

However, Slurm appears to have no knowledge of the MIG configuration: a system in MIG mode is treated by Slurm as a system in "normal" mode. Only a CUDA application submitted via sbatch reports an error, e.g.:

  CUDA error at dmmaTensorCoreGemm.cu:855 code=46(cudaErrorDevicesUnavailable) "cudaMalloc((void**)&A, sizeof(double) * M_GLOBAL * K_GLOBAL)"

gres.conf:

  NodeName=dgxa1 AutoDetect=nvml

slurm.conf:

  GresTypes=gpu,mps
  NodeName=dgxa1 Gres=gpu:a100-sxm4-40gb:8,mps:1600 CPUs=256 Boards=1 SocketsPerBoard=2 CoresPerSocket=64 ThreadsPerCore=2 RealMemory=1031883

Interestingly, there is no configuration error message, neither from slurmd nor from slurmctld.
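For reference, if MIG instances were ever modeled as GRES, one could imagine each MIG profile becoming its own GRES Type. The fragment below is a purely hypothetical sketch, not supported 20.11 syntax: the 3g.20gb Type name merely follows NVIDIA's MIG profile naming, and the count assumes two 3g.20gb instances carved out of each of the eight A100s.

```
# Hypothetical slurm.conf sketch: one GRES Type per MIG profile.
# 2 x 3g.20gb instances per A100, 8 GPUs -> 16 GRES. Not valid in 20.11.
GresTypes=gpu
NodeName=dgxa1 Gres=gpu:3g.20gb:16 CPUs=256 Boards=1 SocketsPerBoard=2 CoresPerSocket=64 ThreadsPerCore=2 RealMemory=1031883

# Hypothetical gres.conf: autodetection would have to enumerate the MIG
# instances rather than (as today) only the parent GPU devices.
NodeName=dgxa1 AutoDetect=nvml
```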
slurmd output:

[2021-01-07T14:19:58.311] gpu/nvml: _get_system_gpu_list_nvml: 8 GPU system device(s) detected
[2021-01-07T14:19:58.311] debug: Gres GPU plugin: Normalizing gres.conf with system GPUs
[2021-01-07T14:19:58.311] debug2: gres/gpu: _normalize_gres_conf: gres_list_conf:
[2021-01-07T14:19:58.311] debug2: GRES[gpu] Type:a100-sxm4-40gb Count:8 Cores(256):(null) Links:(null) Flags:HAS_TYPE File:(null)
[2021-01-07T14:19:58.311] debug2: GRES[mps] Type:(null) Count:1600 Cores(256):(null) Links:(null) Flags: File:(null)
[2021-01-07T14:19:58.311] debug2: gres/gpu: _normalize_gres_conf: preserving original `mps` GRES record
[2021-01-07T14:19:58.311] debug: gres/gpu: _normalize_gres_conf: Including the following GPU matched between system and configuration:
[2021-01-07T14:19:58.311] debug: GRES[gpu] Type:a100-sxm4-40gb Count:1 Cores(256):0-127 Links:0,0,0,0,0,0,0,-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia7
[2021-01-07T14:19:58.311] debug: gres/gpu: _normalize_gres_conf: Including the following GPU matched between system and configuration:
[2021-01-07T14:19:58.311] debug: GRES[gpu] Type:a100-sxm4-40gb Count:1 Cores(256):0-127 Links:0,0,0,0,0,0,-1,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia6
[2021-01-07T14:19:58.311] debug: gres/gpu: _normalize_gres_conf: Including the following GPU matched between system and configuration:
[2021-01-07T14:19:58.311] debug: GRES[gpu] Type:a100-sxm4-40gb Count:1 Cores(256):0-127 Links:0,0,0,0,0,-1,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia5
[2021-01-07T14:19:58.311] debug: gres/gpu: _normalize_gres_conf: Including the following GPU matched between system and configuration:
[2021-01-07T14:19:58.311] debug: GRES[gpu] Type:a100-sxm4-40gb Count:1 Cores(256):0-127 Links:0,0,0,0,-1,0,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia4
[2021-01-07T14:19:58.311] debug: gres/gpu: _normalize_gres_conf: Including the following GPU matched between system and configuration:
[2021-01-07T14:19:58.311] debug: GRES[gpu] Type:a100-sxm4-40gb Count:1 Cores(256):0-127 Links:0,0,0,-1,0,0,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia3
[2021-01-07T14:19:58.311] debug: gres/gpu: _normalize_gres_conf: Including the following GPU matched between system and configuration:
[2021-01-07T14:19:58.311] debug: GRES[gpu] Type:a100-sxm4-40gb Count:1 Cores(256):0-127 Links:0,0,-1,0,0,0,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia2
[2021-01-07T14:19:58.312] debug: gres/gpu: _normalize_gres_conf: Including the following GPU matched between system and configuration:
[2021-01-07T14:19:58.312] debug: GRES[gpu] Type:a100-sxm4-40gb Count:1 Cores(256):0-127 Links:0,-1,0,0,0,0,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia1
[2021-01-07T14:19:58.312] debug: gres/gpu: _normalize_gres_conf: Including the following GPU matched between system and configuration:
[2021-01-07T14:19:58.312] debug: GRES[gpu] Type:a100-sxm4-40gb Count:1 Cores(256):0-127 Links:-1,0,0,0,0,0,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0
[2021-01-07T14:19:58.312] debug2: gres/gpu: _normalize_gres_conf: gres_list_gpu
[2021-01-07T14:19:58.312] debug2: GRES[gpu] Type:a100-sxm4-40gb Count:1 Cores(256):0-127 Links:-1,0,0,0,0,0,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0
[2021-01-07T14:19:58.312] debug2: GRES[gpu] Type:a100-sxm4-40gb Count:1 Cores(256):0-127 Links:0,-1,0,0,0,0,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia1
[2021-01-07T14:19:58.312] debug2: GRES[gpu] Type:a100-sxm4-40gb Count:1 Cores(256):0-127 Links:0,0,-1,0,0,0,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia2
[2021-01-07T14:19:58.312] debug2: GRES[gpu] Type:a100-sxm4-40gb Count:1 Cores(256):0-127 Links:0,0,0,-1,0,0,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia3
[2021-01-07T14:19:58.312] debug2: GRES[gpu] Type:a100-sxm4-40gb Count:1 Cores(256):0-127 Links:0,0,0,0,-1,0,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia4
[2021-01-07T14:19:58.312] debug2: GRES[gpu] Type:a100-sxm4-40gb Count:1 Cores(256):0-127 Links:0,0,0,0,0,-1,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia5
[2021-01-07T14:19:58.312] debug2: GRES[gpu] Type:a100-sxm4-40gb Count:1 Cores(256):0-127 Links:0,0,0,0,0,0,-1,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia6
[2021-01-07T14:19:58.312] debug2: GRES[gpu] Type:a100-sxm4-40gb Count:1 Cores(256):0-127 Links:0,0,0,0,0,0,0,-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia7
[2021-01-07T14:19:58.312] debug: Gres GPU plugin: Final normalized gres.conf list:
[2021-01-07T14:19:58.312] debug: GRES[gpu] Type:a100-sxm4-40gb Count:1 Cores(256):0-127 Links:-1,0,0,0,0,0,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0
[2021-01-07T14:19:58.312] debug: GRES[gpu] Type:a100-sxm4-40gb Count:1 Cores(256):0-127 Links:0,-1,0,0,0,0,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia1
[2021-01-07T14:19:58.312] debug: GRES[gpu] Type:a100-sxm4-40gb Count:1 Cores(256):0-127 Links:0,0,-1,0,0,0,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia2
[2021-01-07T14:19:58.312] debug: GRES[gpu] Type:a100-sxm4-40gb Count:1 Cores(256):0-127 Links:0,0,0,-1,0,0,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia3
[2021-01-07T14:19:58.312] debug: GRES[gpu] Type:a100-sxm4-40gb Count:1 Cores(256):0-127 Links:0,0,0,0,-1,0,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia4
[2021-01-07T14:19:58.312] debug: GRES[gpu] Type:a100-sxm4-40gb Count:1 Cores(256):0-127 Links:0,0,0,0,0,-1,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia5
[2021-01-07T14:19:58.312] debug: GRES[gpu] Type:a100-sxm4-40gb Count:1 Cores(256):0-127 Links:0,0,0,0,0,0,-1,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia6
[2021-01-07T14:19:58.312] debug: GRES[gpu] Type:a100-sxm4-40gb Count:1 Cores(256):0-127 Links:0,0,0,0,0,0,0,-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia7
[2021-01-07T14:19:58.312] debug: GRES[mps] Type:(null) Count:1600 Cores(256):(null) Links:(null) Flags: File:(null)
[2021-01-07T14:19:58.312] debug3: gres_device_major : /dev/nvidia0 major 195, minor 0
[2021-01-07T14:19:58.312] debug3: gres_device_major : /dev/nvidia1 major 195, minor 1
[2021-01-07T14:19:58.312] debug3: gres_device_major : /dev/nvidia2 major 195, minor 2
[2021-01-07T14:19:58.312] debug3: gres_device_major : /dev/nvidia3 major 195, minor 3
[2021-01-07T14:19:58.312] debug3: gres_device_major : /dev/nvidia4 major 195, minor 4
[2021-01-07T14:19:58.312] debug3: gres_device_major : /dev/nvidia5 major 195, minor 5
[2021-01-07T14:19:58.312] debug3: gres_device_major : /dev/nvidia6 major 195, minor 6
[2021-01-07T14:19:58.312] debug3: gres_device_major : /dev/nvidia7 major 195, minor 7

slurmctld.log:

[2021-01-07T14:35:07.050] node dgxa1 returned to service
[2021-01-07T14:35:07.355] sched: Allocate JobId=267 NodeList=dgxa1 #CPUs=2 Partition=All
[2021-01-07T14:35:09.032] _job_complete: JobId=267 WEXITSTATUS 0
[2021-01-07T14:35:09.033] _job_complete: JobId=267 done
[2021-01-07T14:37:01.912] _slurm_rpc_submit_batch_job: JobId=268 InitPrio=61 usec=393
[2021-01-07T14:37:02.600] sched: Allocate JobId=268 NodeList=dgxa1 #CPUs=2 Partition=All
[2021-01-07T14:37:03.852] _job_complete: JobId=268 WEXITSTATUS 1
[2021-01-07T14:37:03.852] _job_complete: JobId=268 done

It would be great if you could take a look at this, and if this feature could be supported by Slurm in the future. Am I perhaps overlooking existing configuration options for NVIDIA MIG? If not, a roadmap for NVIDIA MIG support would be very welcome.

Best regards
Bernhard
Duplicate of bug 11091

*** This bug has been marked as a duplicate of bug 11091 ***