Bug 14890 - MIG configuration for A100
Summary: MIG configuration for A100
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: GPU
Version: 22.05.2
Hardware: Linux Linux
Importance: --- 4 - Minor Issue
Assignee: Marcin Stolarek
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2022-09-05 04:25 MDT by GSK-ONYX-SLURM
Modified: 2022-09-28 01:14 MDT
CC: 3 users

See Also:
Site: GSK
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: CentOS
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
slurmd logs with debug 2 (14.57 KB, text/plain), 2022-09-06 00:48 MDT, GSK-ONYX-SLURM
lstopo-no-graphics (8.17 KB, text/plain), 2022-09-09 00:56 MDT, GSK-ONYX-SLURM
mig-minors (111.18 KB, text/plain), 2022-09-19 06:33 MDT, GSK-ONYX-SLURM

Description GSK-ONYX-SLURM 2022-09-05 04:25:36 MDT
Hi SchedMD Team,

I am trying to set up MIGs for the A100 cards on a node with 2 physical GPUs.

The following instances have been already created:

{STV}[compute_current]root@gpu-504:~# nvidia-smi -L
GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-327e1f68-0093-e55a-7e99-f4ee02aa7597)
  MIG 3g.40gb     Device  0: (UUID: MIG-023cde4b-faa3-5b40-976d-8a9d3becfbdb)
  MIG 3g.40gb     Device  1: (UUID: MIG-e4655770-8526-5e3e-beda-daefb9347206)
GPU 1: NVIDIA A100 80GB PCIe (UUID: GPU-c41234ea-31fb-9e16-8cfd-0e73a9261d4f)
  MIG 3g.40gb     Device  0: (UUID: MIG-e66ffd4e-288d-59b4-9b2d-7549b6052fc3)
  MIG 3g.40gb     Device  1: (UUID: MIG-e1faa744-8f62-5df4-969d-903adecf4fe3)


My gres.conf file:

[...]
NodeName=gpu-504 Name=gpu Type=A100 AutoDetect=nvml
[...]

I did try to set up MultipleFiles= as well, but it didn't work for me.

My slurm.conf consists of cons_tres and gpu declarations:

[I am root!@uk1us104:~]# cat /etc/slurm/slurm.conf | grep gpu
AccountingStorageTRES=gres/gpu
GresTypes=gpu

[I am root!@uk1us104:~]# cat /etc/slurm/slurm.conf | grep gres
AccountingStorageTRES=gres/gpu

The nodes.conf file is included in slurm.conf, and here's the entry for the node:

[I am root!@uk1us104:~]# cat /etc/slurm/nodes.conf | grep gpu-504
NodeName=gpu-504 CPUs=96 Boards=1 SocketsPerBoard=2 CoresPerSocket=24 ThreadsPerCore=2 RealMemory=515800 TmpDisk=1048576 Weight=1 State=UNKNOWN Feature=gpu,gpunode,amd,amd_7352,rome,A100 Gres=gpu:A100:4

Previously the node was configured with two GPUs and it worked fine.

I looked into other tickets with similar issues, but had no luck: https://bugs.schedmd.com/show_bug.cgi?id=10970 and https://bugs.schedmd.com/show_bug.cgi?id=12826.

The problem is that the node is in the INVAL state and cannot be resumed:

[I am root!@uk1us104:~]# scontrol update node=gpu-504 state=resume reason=
slurm_update error: Invalid node state specified
[Stevenage (ukhpc)]
[I am root!@uk1us104:~]# sinfo -N -o "%15N %10T %150E" | grep gpu-504
gpu-504         inval      MIG testing                                                                              
gpu-504         inval      MIG testing                                                                              
gpu-504         inval      MIG testing                                                                              
gpu-504         inval      MIG testing                                                                              
[Stevenage (ukhpc)]
[I am root!@uk1us104:~]#

Logs taken for this node:

[2022-09-05T10:58:58.674] error: _slurm_rpc_node_registration node=gpu-504: Invalid argument
[2022-09-05T10:59:40.391] Invalid node state transition requested for node gpu-504 from=INVAL to=RESUME
[2022-09-05T10:59:40.391] _slurm_rpc_update_node for gpu-504: Invalid node state specified
[2022-09-05T11:00:16.077] update_node: node gpu-504 reason set to: MIG testing
[2022-09-05T11:00:16.077] update_node: node gpu-504 state set to DOWN
[2022-09-05T11:02:45.156] error: _slurm_rpc_node_registration node=gpu-504: Invalid argument
[2022-09-05T11:03:54.953] node_did_resp: node gpu-504 returned to service
[2022-09-05T11:05:13.784] error: Setting node gpu-504 state to INVAL with reason:gres/gpu count reported lower than configured (0 < 4)
[2022-09-05T11:05:13.784] error: _slurm_rpc_node_registration node=gpu-504: Invalid argument
[2022-09-05T11:05:19.510] Invalid node state transition requested for node gpu-504 from=INVAL to=RESUME
[2022-09-05T11:05:19.510] _slurm_rpc_update_node for gpu-504: Invalid node state specified

Thanks in advance for the support.

Radek
Comment 1 Marcin Stolarek 2022-09-05 06:23:29 MDT
It looks like slurmd tries to register without GPUs while the node is configured to have 4.
>[2022-09-05T11:05:13.784] error: Setting node gpu-504 state to INVAL with reason:gres/gpu count reported lower than configured (0 < 4)

Could you please attach slurmd log with debug2 level and GRES debug flag[1] enabled? 

cheers,
Marcin
[1]https://slurm.schedmd.com/slurm.conf.html#OPT_Gres
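
For reference, a minimal sketch of the two settings in question, assuming they are set in slurm.conf (adjust to local conventions):

SlurmdDebug=debug2
DebugFlags=Gres

After editing, a "scontrol reconfigure" (or a slurmd restart on the node) is needed for the new log level to take effect.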
Comment 2 GSK-ONYX-SLURM 2022-09-06 00:48:42 MDT
Created attachment 26614 [details]
slurmd logs with debug 2

Hi Marcin,

Attached you can find the slurmd logs with debug2. I haven't found an entry for the gpu-504 node, but there's something in the logs that may be related to this node:

[2022-09-06T07:19:57.251] We were configured to autodetect nvml functionality, but we weren't able to find that lib when Slurm was configured.

I am not sure how to set up the debug flag for Gres - is it: DebugFlags=GRES?

Thanks,
Radek
Comment 3 Marcin Stolarek 2022-09-06 02:25:44 MDT
>[2022-09-06T07:19:57.251] We were configured to autodetect nvml functionality, but we weren't able to find that lib when Slurm was configured.

This means that the ./configure script failed to find the NVML libraries while preparing the Slurm build. You can grep for NVML in config.log to find the details of the commands that were executed to check whether NVML is available.

cheers,
Marcin
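
A sketch of how this is typically traced and fixed at build time; the paths and package name below are assumptions for a CentOS host with NVIDIA's CUDA repository enabled:

grep -B2 -A10 nvml config.log    # shows the compile/link test that failed
# nvml.h normally ships in the cuda-nvml-devel-<version> package; once it is installed:
./configure --with-nvml=/usr/local/cuda    # assumes this release's configure accepts --with-nvml for a non-default prefix
make && make install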
Comment 4 GSK-ONYX-SLURM 2022-09-06 04:26:08 MDT
Hi Marcin,

you're right, I found something like this:

> configure:23406: WARNING: unable to locate libnvidia-ml.so and/or nvml.h
[...]
> ac_cv_header_nvml_h=no
[...]

I am able to find the libnvidia-ml.so library:

[I am root!@uk1us104:slurm-22.05.2]# ldconfig -p | grep libnvidia-ml.so
        libnvidia-ml.so.1 (libc6,x86-64) => /cm/local/apps/cuda-driver/libs/450.80.02/lib64/libnvidia-ml.so.1
        libnvidia-ml.so (libc6,x86-64) => /cm/local/apps/cuda-driver/libs/450.80.02/lib64/libnvidia-ml.so

but I am unable to find the nvml one. Do you know what package provides this header? 

I guess this means that I cannot use the AutoDetect=nvml setting. Does this also mean that I cannot configure MIGs, or do I just need to do it manually?

Cheers,
Radek
Comment 5 Marcin Stolarek 2022-09-06 05:21:20 MDT
>[...]but I am unable to find the nvml one. Do you know what package provides this header? 
It should be part of GPU Deployment Kit[1].

>Does this also mean that I cannot configure MIGs or I just need to do it manually?
Manual configuration is still possible. Just set MultipleFiles= and Cores= appropriately in your gres.conf for your cards[2,3] and remove AutoDetect=nvml.

The lack of NVML results in the inability to set memory and graphics clock frequencies on user request.

cheers,
Marcin
[1]https://developer.nvidia.com/gpu-deployment-kit
[2]https://slurm.schedmd.com/gres.conf.html#OPT_MultipleFiles
[3]https://slurm.schedmd.com/gres.conf.html#OPT_Cores
Comment 6 GSK-ONYX-SLURM 2022-09-06 06:58:12 MDT
(In reply to Marcin Stolarek from comment #5)

> Manual configuration is still possible. Just set MultipleFiles= and Cores=
> appropriately in your gres.conf for your cards[2,3] and remove
> AutoDetect=nvml.
> 

For the following devices:

{STV}[compute_current]root@gpu-504:~# ls -l /dev/nvidia-caps/
total 0
cr-------- 1 root root 233,   1 May 13 12:41 nvidia-cap1
cr--r--r-- 1 root root 233,  12 May 13 12:41 nvidia-cap12
cr--r--r-- 1 root root 233,  13 May 13 12:41 nvidia-cap13
cr--r--r-- 1 root root 233, 147 May 13 12:41 nvidia-cap147
cr--r--r-- 1 root root 233, 148 May 13 12:41 nvidia-cap148
cr--r--r-- 1 root root 233, 156 May 13 12:41 nvidia-cap156
cr--r--r-- 1 root root 233, 157 May 13 12:41 nvidia-cap157
cr--r--r-- 1 root root 233,   2 May 13 12:41 nvidia-cap2
cr--r--r-- 1 root root 233,  21 May 13 12:41 nvidia-cap21
cr--r--r-- 1 root root 233,  22 May 13 12:41 nvidia-cap22

{STV}[compute_current]root@gpu-504:~# nvidia-smi -L
GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-327e1f68-0093-e55a-7e99-f4ee02aa7597)
  MIG 3g.40gb     Device  0: (UUID: MIG-023cde4b-faa3-5b40-976d-8a9d3becfbdb)
  MIG 3g.40gb     Device  1: (UUID: MIG-e4655770-8526-5e3e-beda-daefb9347206)
GPU 1: NVIDIA A100 80GB PCIe (UUID: GPU-c41234ea-31fb-9e16-8cfd-0e73a9261d4f)
  MIG 3g.40gb     Device  0: (UUID: MIG-e66ffd4e-288d-59b4-9b2d-7549b6052fc3)
  MIG 3g.40gb     Device  1: (UUID: MIG-e1faa744-8f62-5df4-969d-903adecf4fe3)

I am trying to set the MultipleFiles= parameter in gres.conf, but it's still not working:

[...]
NodeName=gpu-504 Name=gpu Type=A100 MultipleFiles=/dev/nvidia-caps/nvidia-cap12,/dev/nvidia-caps/nvidia-cap13,/dev/nvidia-caps/nvidia-cap21,/dev/nvidia-caps/nvidia-cap22
[...]

There's only one device visible:

[2022-09-06T13:56:15.036] error: _slurm_rpc_node_registration node=gpu-504: Invalid argument
[2022-09-06T13:56:23.523] error: gres/gpu on node gpu-504 configured for 4 resources but 1 found, ignoring topology support
[2022-09-06T13:56:23.523] error: Setting node gpu-504 state to INVAL with reason:gres/gpu count reported lower than configured (1 < 4)
[2022-09-06T13:56:23.523] error: _slurm_rpc_node_registration node=gpu-504: Invalid argument

What am I doing wrong?

Thanks,
Radek
Comment 7 Marcin Stolarek 2022-09-08 02:53:21 MDT
I don't have access to a MIG card to experiment with. Just to check that I'm reading this correctly: the node has 2 GPU cards, each split into two MIGs?

In this case I'd expect you'll need two entries in gres.conf - one per physical card, each with a MultipleFiles directive. You can use Slurm name expansion in MultipleFiles, like: MultipleFiles=/dev/nvidia-caps/nvidia-cap[12,13].

Could you please share the lstopo-no-graphics output, so I can advise on the Cores=... setting?

cheers,
Marcin
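
While waiting for the lstopo output, a quick cross-check of which cores are local to each card (assuming a reasonably recent driver; column names may vary):

nvidia-smi topo -m    # the "CPU Affinity" column lists the core ranges closest to each GPU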
Comment 8 GSK-ONYX-SLURM 2022-09-09 00:55:42 MDT
(In reply to Marcin Stolarek from comment #7)
> I don't have access to a MIG card to experiment with. Just to check that I'm
> reading this correctly: the node has 2 GPU cards, each split into two MIGs?

Exactly. These are the outputs from the nvidia-smi commands executed on the node with MIGs:

{STV}[compute_current]root@gpu-504:~# nvidia-smi
Fri Sep  9 07:38:27 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100 80G...  On   | 00000000:25:00.0 Off |                   On |
| N/A   31C    P0    42W / 300W |     38MiB / 81920MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80G...  On   | 00000000:E1:00.0 Off |                   On |
| N/A   29C    P0    43W / 300W |     38MiB / 81920MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  0    1   0   0  |     19MiB / 40192MiB | 42      0 |  3   0    2    0    0 |
|                  |      0MiB / 65535MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0    2   0   1  |     19MiB / 40192MiB | 42      0 |  3   0    2    0    0 |
|                  |      0MiB / 65535MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  1    1   0   0  |     19MiB / 40192MiB | 42      0 |  3   0    2    0    0 |
|                  |      0MiB / 65535MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  1    2   0   1  |     19MiB / 40192MiB | 42      0 |  3   0    2    0    0 |
|                  |      0MiB / 65535MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
{STV}[compute_current]root@gpu-504:~#
{STV}[compute_current]root@gpu-504:~# nvidia-smi -L
GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-327e1f68-0093-e55a-7e99-f4ee02aa7597)
  MIG 3g.40gb     Device  0: (UUID: MIG-023cde4b-faa3-5b40-976d-8a9d3becfbdb)
  MIG 3g.40gb     Device  1: (UUID: MIG-e4655770-8526-5e3e-beda-daefb9347206)
GPU 1: NVIDIA A100 80GB PCIe (UUID: GPU-c41234ea-31fb-9e16-8cfd-0e73a9261d4f)
  MIG 3g.40gb     Device  0: (UUID: MIG-e66ffd4e-288d-59b4-9b2d-7549b6052fc3)
  MIG 3g.40gb     Device  1: (UUID: MIG-e1faa744-8f62-5df4-969d-903adecf4fe3)
{STV}[compute_current]root@gpu-504:~#


> 
> In this case I'd expect you'll need two entries in gres.conf - one per
> physical card, each with a MultipleFiles directive. You can use Slurm
> name expansion in MultipleFiles, like:
> MultipleFiles=/dev/nvidia-caps/nvidia-cap[12,13].


So, in that particular case, two entries should be added to the gres.conf - is that correct?

I did try something like this:

NodeName=gpu-504 Name=gpu Type=A100 MultipleFiles=/dev/nvidia0,/dev/nvidia-caps/nvidia-cap[12,13]
NodeName=gpu-504 Name=gpu Type=A100 MultipleFiles=/dev/nvidia1,/dev/nvidia-caps/nvidia-cap[21,22]

I am confused about whether the physical GPU card should also be mentioned along with the MIGs. Reading the documentation, I thought that once the MIGs are created there is no need to add the physical cards.
 

> 
> Could you please share the lstopo-no-graphics output, so I can advise on
> the Cores=... setting?

Thanks. I am attaching the file. 

Cheers,
Radek
Comment 9 GSK-ONYX-SLURM 2022-09-09 00:56:19 MDT
Created attachment 26681 [details]
lstopo-no-graphics
Comment 10 GSK-ONYX-SLURM 2022-09-13 01:13:12 MDT
Hey Marcin,

have you had a chance to take a look at this?

Cheers,
Radek
Comment 11 Dominik Bartkiewicz 2022-09-15 04:10:09 MDT
Hi

Sorry for the late response. Could you check if those lines make Slurm work as expected?
gres.conf:
...
NodeName=gpu-504 Name=gpu Type=A100 MultipleFiles=/dev/nvidia-caps/nvidia-cap[12,13] Cores=[0-23]
NodeName=gpu-504 Name=gpu Type=A100 MultipleFiles=/dev/nvidia-caps/nvidia-cap[21,22] Cores=[24-47]
...

Dominik
Comment 12 GSK-ONYX-SLURM 2022-09-15 23:59:40 MDT
(In reply to Dominik Bartkiewicz from comment #11)

Hi Dominik,

> NodeName=gpu-504 Name=gpu Type=A100
> MultipleFiles=/dev/nvidia-caps/nvidia-cap[12,13] Cores=[0-23]
> NodeName=gpu-504 Name=gpu Type=A100
> MultipleFiles=/dev/nvidia-caps/nvidia-cap[21,22] Cores=[24-47]

It's not working, log is saying:

[2022-09-16T06:37:36.559] error: Setting node gpu-504 state to INVAL with reason:gres/gpu count reported lower than configured (2 < 4)

However, I modified gres.conf a little by keeping Cores=, replacing MultipleFiles= with File=, and listing all the MIGs separately:

NodeName=gpu-504 Name=gpu Type=A100 File=/dev/nvidia-caps/nvidia-cap12 Cores=[0-23]
NodeName=gpu-504 Name=gpu Type=A100 File=/dev/nvidia-caps/nvidia-cap13 Cores=[0-23]
NodeName=gpu-504 Name=gpu Type=A100 File=/dev/nvidia-caps/nvidia-cap21 Cores=[24-47]
NodeName=gpu-504 Name=gpu Type=A100 File=/dev/nvidia-caps/nvidia-cap22 Cores=[24-47]

Now it's working:

[2022-09-16T06:37:17.433] error: _slurm_rpc_node_registration node=gpu-504: Invalid argument
[2022-09-16T06:37:36.559] error: Setting node gpu-504 state to INVAL with reason:gres/gpu count reported lower than configured (2 < 4)
[2022-09-16T06:37:36.559] error: _slurm_rpc_node_registration node=gpu-504: Invalid argument
[2022-09-16T06:38:48.689] update_node: node gpu-504 reason set to: MIG testing
[2022-09-16T06:38:48.689] update_node: node gpu-504 state set to DOWN
[2022-09-16T06:39:02.934] Invalid node state transition requested for node gpu-504 from=INVAL to=RESUME
[2022-09-16T06:39:02.934] _slurm_rpc_update_node for gpu-504: Invalid node state specified
[2022-09-16T06:42:32.217] node_did_resp: node gpu-504 returned to service
[2022-09-16T06:44:10.718] _node_config_validate: gres/gpu: Count changed on node gpu-504 (2 != 4)
[2022-09-16T06:44:26.323] update_node: node gpu-504 reason set to: MIG testing
[2022-09-16T06:44:26.323] update_node: node gpu-504 state set to DOWN
[2022-09-16T06:44:41.195] update_node: node gpu-504 state set to IDLE
[2022-09-16T06:45:17.410] Node gpu-504 now responding

The MIGs have not been tested yet, but at least the node is not in the invalid state anymore.

Could you please advise why the MultipleFiles= directive is not working as it should?

Thanks a lot!
Radek
Comment 13 Marcin Stolarek 2022-09-19 04:48:37 MDT
>[2022-09-16T06:37:36.559] error: Setting node gpu-504 state to INVAL with reason:gres/gpu count reported lower than configured (2 < 4)

This comes from the fact that the number of GPUs configured in your slurm.conf didn't match the number of entries in gres.conf. In general you'll need one entry with a MultipleFiles directive per GPU instance. Depending on how you partition the card, different nvidia-capXX files are required in the MultipleFiles directive.

While manual configuration of MIG is possible, I'd recommend building Slurm with NVML and relying on the autodetect functionality.

Could you please share the contents of /proc/driver/nvidia-caps/mig-minors, which should contain the GI/CI mapping to nvidia-capX files?

cheers,
Marcin
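
For reference, a sketch of how that mapping can be read out of the file; the layout is an assumption based on common driver versions, where each line pairs a gpu<N>/gi<M>[/ci<K>]/access path with the minor number of the matching /dev/nvidia-caps/nvidia-capX device:

grep -E 'gpu[01]/gi[0-9]+(/ci[0-9]+)?/access' /proc/driver/nvidia-caps/mig-minors
# hypothetical output fragment:
# gpu0/gi1/access 12
# gpu0/gi1/ci0/access 13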
Comment 14 GSK-ONYX-SLURM 2022-09-19 06:32:51 MDT
(In reply to Marcin Stolarek from comment #13)

> This comes from the fact that the number of GPUs configured in your slurm.conf
> didn't match the number of entries in gres.conf. In general you'll need one
> entry with a MultipleFiles directive per GPU instance. Depending on how you
> partition the card, different nvidia-capXX files are required in the
> MultipleFiles directive.

That makes sense. So my config is OK, because the number of entries corresponds to the number of GPUs declared in slurm.conf / nodes.conf. However, if one GPU (MIG) consisted of more than one nvidia-capXX file, it would be possible to use MultipleFiles. Just to make sure I understood it correctly...

> While manual configuration of MIG is possible, I'd recommend building
> Slurm with NVML and relying on the autodetect functionality.

Noted. Because we upgraded Slurm recently, I will do this next time. 

> Could you please share the contents of /proc/driver/nvidia-caps/mig-minors,
> which should contain the GI/CI mapping to nvidia-capX files?

The file attached.

Cheers,
Radek
Comment 15 GSK-ONYX-SLURM 2022-09-19 06:33:36 MDT
Created attachment 26853 [details]
mig-minors
Comment 16 Marcin Stolarek 2022-09-19 06:49:24 MDT
>However, if one GPU (MIG) consisted of more than one nvidia-capXX file, it would be possible to use MultipleFiles. Just to make sure I understood it correctly...

Yes - that's correct, one MIG device is handled by a few files. (Take a look at Bug 11091 comment 13).

Altogether the gres.conf handling your MIG configuration should look like:
>NodeName=gpu-504 Name=gpu Type=A100 File=/dev/nvidia0,/dev/nvidia-caps/nvidia-cap12,/dev/nvidia-caps/nvidia-cap13 Cores=[0-23]
>NodeName=gpu-504 Name=gpu Type=A100 File=/dev/nvidia0,/dev/nvidia-caps/nvidia-cap21,/dev/nvidia-caps/nvidia-cap22 Cores=[0-23]
>NodeName=gpu-504 Name=gpu Type=A100 File=/dev/nvidia1,/dev/nvidia-caps/nvidia-cap147,/dev/nvidia-caps/nvidia-cap148 Cores=[24-47]
>NodeName=gpu-504 Name=gpu Type=A100 File=/dev/nvidia1,/dev/nvidia-caps/nvidia-cap156,/dev/nvidia-caps/nvidia-cap157 Cores=[24-47]

I did double-check the above lines, but since I don't have access to a MIG GPU I wasn't able to test them. As mentioned before, I'd recommend using AutoDetect=nvml.

cheers,
Marcin
Comment 17 GSK-ONYX-SLURM 2022-09-20 01:16:40 MDT
(In reply to Marcin Stolarek from comment #16)

> Altogether the gres.conf handling your MIG configuration should look like:
> >NodeName=gpu-504 Name=gpu Type=A100 File=/dev/nvidia0,/dev/nvidia-caps/nvidia-cap12,/dev/nvidia-caps/nvidia-cap13 Cores=[0-23]
> >NodeName=gpu-504 Name=gpu Type=A100 File=/dev/nvidia0,/dev/nvidia-caps/nvidia-cap21,/dev/nvidia-caps/nvidia-cap22 Cores=[0-23]
> >NodeName=gpu-504 Name=gpu Type=A100 File=/dev/nvidia1,/dev/nvidia-caps/nvidia-cap147,/dev/nvidia-caps/nvidia-cap148 Cores=[24-47]
> >NodeName=gpu-504 Name=gpu Type=A100 File=/dev/nvidia1,/dev/nvidia-caps/nvidia-cap156,/dev/nvidia-caps/nvidia-cap157 Cores=[24-47]

My initial configuration which works (I mean where the node can be resumed) is:

NodeName=gpu-504 Name=gpu Type=A100 File=/dev/nvidia-caps/nvidia-cap12 Cores=[0-23]
NodeName=gpu-504 Name=gpu Type=A100 File=/dev/nvidia-caps/nvidia-cap13 Cores=[0-23]
NodeName=gpu-504 Name=gpu Type=A100 File=/dev/nvidia-caps/nvidia-cap21 Cores=[24-47]
NodeName=gpu-504 Name=gpu Type=A100 File=/dev/nvidia-caps/nvidia-cap22 Cores=[24-47]

Looking at both configurations, there are differences between what I created and what you suggested. Based on what you suggested, the physical GPU must also be added. I also noticed that you used File= instead of MultipleFiles=. Are you sure it's correct in that particular example, when we have more than one entry?

I did test both and they work fine, I mean the node can be resumed to the idle state.

> I did double-check the above lines, but since I don't have access to a MIG GPU I
> wasn't able to test them. As mentioned before, I'd recommend using
> AutoDetect=nvml.

I will proceed with adding auto detection next time.

Cheers,
Radek
Comment 18 Marcin Stolarek 2022-09-20 01:30:03 MDT
>Based on what you suggested, the physical GPU must also be added. [...]
Yes - according to the NVIDIA docs[1], the three capabilities are required to run the app. Each /dev file has a related capability, and setting those triples should allow the app to work correctly with ConstrainDevices=yes[2].

>I also noticed that you used File= instead of MultipleFiles=. Are you sure it's correct in that particular example, when we have more than one entry?

My fault (as I said, I'm unable to test it) - you should use MultipleFiles.

>I did test both and they work fine, I mean the node can be resumed to the idle state.

I'd suggest running a simple CUDA benchmark on the GPUs to verify that the cgroup confinement and the CUDA_VISIBLE_DEVICES environment variable are set correctly.


cheers,
Marcin
[1]https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html#cuda-baremetal
[2]https://slurm.schedmd.com/cgroup.conf.html#OPT_ConstrainDevices
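
A sketch of such a check; the partition and node name are the ones used in this ticket, deviceQuery stands in for any small CUDA test program, and the expected outcome is an assumption to verify:

# cgroup.conf on the node should contain ConstrainDevices=yes for device isolation to apply
srun -p gpu -w gpu-504 --gres=gpu:1 bash -c 'echo CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES; ./deviceQuery'
# Expected: CUDA_VISIBLE_DEVICES names a single device (an index, or a MIG UUID when
# NVML autodetect is in use) and deviceQuery reports only the one allocated 3g.40gb instance.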
Comment 19 GSK-ONYX-SLURM 2022-09-21 05:45:03 MDT
(In reply to Marcin Stolarek from comment #18)

> I'd suggest running a simple CUDA benchmark on the GPUs to verify that the cgroup
> confinement and the CUDA_VISIBLE_DEVICES environment variable are set correctly.

As per your suggestion, I ran a simple job:

#!/bin/bash
#SBATCH --job-name=gpu-hello
#SBATCH --partition=gpu
#SBATCH --gres=gpu:4
#SBATCH --nodelist=gpu-504
#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --time=00:10:00
#SBATCH --error=gpu_test_%A-%a.err
#SBATCH --output=gpu_test_%A-%a.log

env | grep "^SLURM" | sort
sleep 20
srun ./hello

and got the following result:

SLURM_CLUSTER_NAME=ukhpc
SLURM_CONF=//cm/shared/slurm/etc/slurm.conf
SLURM_CPUS_ON_NODE=2
SLURMD_NODENAME=gpu-504
SLURM_GPUS_ON_NODE=4
SLURM_GTIDS=0
SLURM_JOB_ACCOUNT=default
SLURM_JOB_CPUS_PER_NODE=2
SLURM_JOB_GID=1084
SLURM_JOB_GPUS=0,1,2,3
SLURM_JOB_ID=11273474
SLURM_JOBID=11273474
SLURM_JOB_NAME=gpu-hello
SLURM_JOB_NODELIST=gpu-504
SLURM_JOB_NUM_NODES=1
SLURM_JOB_PARTITION=gpu
SLURM_JOB_QOS=normal
SLURM_JOB_UID=13573
SLURM_JOB_USER=rd178639
SLURM_LOCALID=0
SLURM_MEM_PER_CPU=8192
SLURM_NNODES=1
SLURM_NODE_ALIASES=(null)
SLURM_NODEID=0
SLURM_NODELIST=gpu-504
SLURM_NPROCS=1
SLURM_NTASKS=1
SLURM_NTASKS_PER_NODE=1
SLURM_PRIO_PROCESS=0
SLURM_PROCID=0
SLURM_SCRIPT_CONTEXT=prolog_task
SLURM_SUBMIT_DIR=/home/rd178639/scripts
SLURM_SUBMIT_HOST=login2
SLURM_TASK_PID=20897
SLURM_TASKS_PER_NODE=1
SLURM_TOPOLOGY_ADDR=gpu-504
SLURM_TOPOLOGY_ADDR_PATTERN=node
SLURM_WORKING_CLUSTER=ukhpc:uk1us104:6917:9728:109
Hello world! I'm thread 0 in block 1
Hello world! I'm thread 1 in block 1
Hello world! I'm thread 2 in block 1
Hello world! I'm thread 3 in block 1
Hello world! I'm thread 4 in block 1
Hello world! I'm thread 5 in block 1
Hello world! I'm thread 6 in block 1
Hello world! I'm thread 7 in block 1
Hello world! I'm thread 8 in block 1
Hello world! I'm thread 9 in block 1
Hello world! I'm thread 10 in block 1
Hello world! I'm thread 11 in block 1
Hello world! I'm thread 12 in block 1
Hello world! I'm thread 13 in block 1
Hello world! I'm thread 14 in block 1
Hello world! I'm thread 15 in block 1
Hello world! I'm thread 0 in block 0
Hello world! I'm thread 1 in block 0
Hello world! I'm thread 2 in block 0
Hello world! I'm thread 3 in block 0
Hello world! I'm thread 4 in block 0
Hello world! I'm thread 5 in block 0
Hello world! I'm thread 6 in block 0
Hello world! I'm thread 7 in block 0
Hello world! I'm thread 8 in block 0
Hello world! I'm thread 9 in block 0
Hello world! I'm thread 10 in block 0
Hello world! I'm thread 11 in block 0
Hello world! I'm thread 12 in block 0
Hello world! I'm thread 13 in block 0
Hello world! I'm thread 14 in block 0
Hello world! I'm thread 15 in block 0
That's all!

However, when I run deviceQuery from the NVIDIA CUDA samples, I get only one device:

{STV}[compute_current]root@gpu-504:/tmp# ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA A100 80GB PCIe MIG 3g.40gb"
  CUDA Driver Version / Runtime Version          11.6 / 11.4
  CUDA Capability Major/Minor version number:    8.0
  Total amount of global memory:                 40192 MBytes (42144366592 bytes)
  (042) Multiprocessors, (064) CUDA Cores/MP:    2688 CUDA Cores
  GPU Max Clock rate:                            1410 MHz (1.41 GHz)
  Memory Clock rate:                             1512 Mhz
  Memory Bus Width:                              2560-bit
  L2 Cache Size:                                 20971520 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        167936 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 3 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 37 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.6, CUDA Runtime Version = 11.4, NumDevs = 1
Result = PASS
{STV}[compute_current]root@gpu-504:/tmp#


Is it something to do with the partitioning, or does deviceQuery just not show MIGs?

Cheers,
Radek
Comment 20 Marcin Stolarek 2022-09-21 06:12:22 MDT
Radek,

Although this is outside of our (SchedMD) expertise, using multiple instances of a partitioned MIG GPU from a single CUDA process appears to be unsupported today. Per the previously shared NVIDIA MIG User Guide[1]:
>CUDA Device Enumeration
>MIG supports running CUDA applications by specifying the CUDA device on which the 
>application should be run. With CUDA 11, only enumeration of a single MIG 
>instance is supported.

If I understand the case correctly: if one wants to use a full card, then it should be partitioned as a single instance.

cheers,
Marcin
[1]https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html#cuda-baremetal
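
For completeness, a sketch of selecting one instance explicitly; the UUID is copied from the nvidia-smi -L output earlier in this ticket, and whether CUDA_VISIBLE_DEVICES accepts MIG UUIDs depends on the driver/CUDA combination, so treat this as an assumption to verify locally:

CUDA_VISIBLE_DEVICES=MIG-023cde4b-faa3-5b40-976d-8a9d3becfbdb ./deviceQuery    # with CUDA 11, only this single instance will be enumerated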
Comment 21 Marcin Stolarek 2022-09-26 05:56:17 MDT
Is there anything else I can help you with?
Comment 22 GSK-ONYX-SLURM 2022-09-27 07:42:47 MDT
Let's close the ticket. I asked the end user to test the configuration but haven't received feedback yet.

Thanks a lot for your support!

Cheers,
Radek
Comment 23 Marcin Stolarek 2022-09-28 01:14:59 MDT
I'm glad to be of help. Feel free to reopen if needed.

cheers,
Marcin