Bug 11897 - Setup MPS on multi-GPU nodes
Summary: Setup MPS on multi-GPU nodes
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: GPU
Version: 20.11.7
Hardware: Linux
OS: Linux
Severity: 4 - Minor Issue
Assignee: Marcin Stolarek
Reported: 2021-06-23 12:15 MDT by Misha Ahmadian
Modified: 2021-07-07 02:38 MDT

Site: TTU


Description Misha Ahmadian 2021-06-23 12:15:09 MDT
Hello,

We're considering adding the MPS feature to our GPU nodes and letting Slurm handle it for users. I went through the online documentation for both Slurm and NVIDIA, but I still have a few questions before I proceed:

1) According to the Slurm documentation (https://slurm.schedmd.com/gres.html#MPS_Management):
> "Job requests for MPS will be processed the same as any other GRES except
> that the request must be satisfied using only one GPU per node and only one
> GPU per node may be configured for use with MPS."

I understand that users cannot split MPS across multiple GPU devices. However, does the statement above mean that Slurm can enable MPS on only one GPU device per node? Suppose we have 3 GPU devices within a node and 6 users submit 6 jobs targeting that node, each requesting gres/mps:50. Should we expect all 6 jobs to start running on that node, each utilizing half of a GPU, so that all three GPUs are in use simultaneously? Or will only one GPU be set up for MPS, with the remaining 2 GPUs staying idle until they are allocated to full-GPU (non-MPS) jobs?
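
To make the scenario concrete, this is roughly the setup I have in mind (just a sketch: the node name, device files, and the 100-shares-per-GPU split are my own assumptions based on the gres.conf examples in the documentation):

# slurm.conf (excerpt), hypothetical node with 3 GPUs and MPS enabled
GresTypes=gpu,mps
NodeName=gpu-node01 Gres=gpu:3,mps:300

# gres.conf on gpu-node01 (assuming the Count is split evenly, 100 per GPU)
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1
Name=gpu File=/dev/nvidia2
Name=mps Count=300

# each of the 6 jobs would then request half of one GPU's MPS shares:
sbatch --gres=mps:50 job.sh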


2) According to NVIDIA (https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf --> section 2.3.1.1 Limitations (p. 9)):
> "Only one user on a system may have an active MPS server. The MPS control
> daemon will queue MPS server activation requests from separate users, leading
> to serialized exclusive access of the GPU between users regardless of GPU
> exclusivity settings"

If so, how does Slurm handle multiple jobs from multiple users on a single GPU node? Can we expect to have multiple users with multiple MPS jobs on a single node? How does Slurm treat all the GPUs within a node? Suppose we have 1 GPU node with 2 GPUs, and two users each submit a job, one with MPS and one without. Does Slurm land both jobs on that single node at the same time?
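
For example (hypothetical submissions from two different users targeting the same 2-GPU node):

# user A: shares one GPU through MPS
sbatch --gres=mps:50 --wrap="./mps_app"

# user B: wants a whole, dedicated GPU
sbatch --gres=gpu:1 --wrap="./exclusive_app"

Would Slurm run both jobs on that node at the same time, on different GPUs?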


3) I was looking at the example Prolog script (https://github.com/SchedMD/slurm/blob/master/etc/prolog.example), and I have a few questions about it:

3.1)
>echo quit | ${MPS_CMD_DIR}nvidia-cuda-mps-control
># Test for presence of MPS zombie process
>ps aux | grep nvidia-cuda-mps | grep -v grep > /dev/null
>if [ $? -eq 0 ]; then
>	logger "`hostname` Slurm Prolog: MPS refusing to quit! Downing node"
>	${SLURM_CMD_DIR}scontrol update nodename=${SLURMD_NODENAME} State=DOWN Reason="MPS not quitting"
>fi

These lines suggest that Slurm expects one user with one MPS job per node. Otherwise, multiple users with multiple MPS jobs could lead to the "nvidia-cuda-mps-control" service being killed. (I'm still unsure about multiple users' jobs on a single GPU node.)
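
If multiple MPS users per node were supported, I would have expected the check to be scoped to the job's user, something along these lines (only a sketch, assuming SLURM_JOB_USER is available in the Prolog environment):

# hypothetical per-user variation of the zombie-process check
if pgrep -u "${SLURM_JOB_USER}" -f nvidia-cuda-mps > /dev/null; then
	logger "`hostname` Slurm Prolog: MPS still running for ${SLURM_JOB_USER}"
fi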


3.2)
Somewhere at the bottom of the 'prolog.example' file:
> unset CUDA_MPS_ACTIVE_THREAD_PERCENTAGE

According to the NVIDIA document (https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf --> section 2.3.5.2 Volta MPS Execution Resource Provisioning (p. 11)):
> "The MPS control utility provides 2 sets of commands to set / query the
> limit of all future MPS clients. See section 4.1.1 for more details. The limit
> can be further constrained for new clients by setting the environment
> variable CUDA_MPS_ACTIVE_THREAD_PERCENTAGE for a client process. See section
> 4.2.5 for more details."

So why does the Prolog unset the "CUDA_MPS_ACTIVE_THREAD_PERCENTAGE" variable if we expect multiple jobs (processes) to land on a single GPU, each utilizing a specific portion of it? Am I missing something here?
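
Just to check my understanding of the two mechanisms NVIDIA describes (the 50% value below is only an example):

# server-wide default for all future MPS clients, set via the control daemon
echo "set_default_active_thread_percentage 50" | nvidia-cuda-mps-control

# per-client limit, set in the client's environment before it starts
export CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=50
./my_cuda_app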


4) It is also mentioned in the Slurm document (https://slurm.schedmd.com/gres.html#MPS_Management) that:
> "An alternate mode of operation would be to permit jobs to be allocated
 whole GPUs then trigger the starting of an MPS server based upon comments in
 the job. For example, if a job is allocated whole GPUs, then search for a 
comment of "mps-per-gpu" or "mps-per-node" in the job (using the "scontrol 
show job" command) and use that as a basis for starting one MPS daemon per GPU
 or across all GPUs respectively."

I understand that it would be helpful to dedicate a full GPU node to MPI jobs and allow each process to utilize a portion of a GPU while they run concurrently on a single node. However, I don't understand the "search for a comment of mps-per-gpu or mps-per-node in the job (using the "scontrol show job" command)" part. Is that saying the Prolog has to search for the "mps-per-gpu" or "mps-per-node" comments inside the job submission script? If so, how can we access the job submission script from the Prolog? I could not find any Slurm variable inside the Prolog pointing to the job script file. How is the Prolog supposed to handle this situation?
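
The only approach I can imagine from the Prolog side is something like the following sketch, assuming "comment" means the job's Comment field (set with sbatch --comment=...) rather than text inside the script:

# sketch: look for an MPS hint in the job's Comment field from the Prolog
COMMENT=$(scontrol show job ${SLURM_JOB_ID} | grep -o "Comment=.*")
case "${COMMENT}" in
	*mps-per-gpu*)  echo "would start one MPS daemon per allocated GPU" ;;
	*mps-per-node*) echo "would start one MPS daemon across all GPUs" ;;
esac

Is that the intended reading, or is there really a way to inspect the submission script itself?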

Best Regards,
Misha
Comment 2 Marcin Stolarek 2021-06-24 08:12:16 MDT
Misha,

I found an older bug discussing a very similar, if not exactly the same, topic - Bug 7834.
Could you please take a look at it?

Let me know if it makes things clear for you.

cheers,
Marcin
Comment 3 Marcin Stolarek 2021-07-01 08:09:22 MDT
Did you find the answers you were looking for in the referenced bug? 

Is there anything else I can help you with?

cheers,
Marcin
Comment 4 Misha Ahmadian 2021-07-06 07:53:26 MDT
Hi Marcin,

Sorry for the delay in my response, and thanks for your quick reply. I think Bug 7834 makes almost everything clear to me. Honestly, I'm not impressed with the way NVIDIA handles MPS; it seems to me it could make users quite confused when they submit their jobs. I'd also suggest looking into the Multi-Instance GPU (MIG) feature that comes with the A100. I think that could be a great option to implement in Slurm.

Best Regards,
Misha
Comment 6 Marcin Stolarek 2021-07-07 02:38:34 MDT
> I'd also suggest looking into the Multi-Instance GPU (MIG) feature that comes with the A100. I think that could be a great option to implement in Slurm.
You may add yourself to the CC list of Bug 10970, which is public and deals with the enhancement request from NVIDIA.

> I think Bug 7834 makes almost everything clear to me.
I'll take that as confirmation that I can close the ticket. If you need any help here (to remove the 'almost' from the sentence above), please reopen.

cheers,
Marcin