Bug 7834

Summary: questions wrt mps
Product: Slurm    Reporter: hpc-admin
Component: GPU    Assignee: Director of Support <support>
Status: RESOLVED INFOGIVEN
Severity: 4 - Minor Issue
Priority: ---    CC: dev
Version: 19.05.2
Hardware: Linux
OS: Linux
See Also: https://bugs.schedmd.com/show_bug.cgi?id=6846
https://bugs.schedmd.com/show_bug.cgi?id=10273
https://bugs.schedmd.com/show_bug.cgi?id=11897
Site: Ghent

Description hpc-admin 2019-09-30 02:35:43 MDT
Hello,

I am trying to use MPS as described in the gres documentation, but I'm a bit confused by the explanation there.
We configured MPS in the node entries in slurm.conf; the gres.conf on the nodes only has the NVML autodetection entry.
The nodes have 4 GPUs, and the configuration sets mps=400 per node (so 100 per GPU).
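Roughly, the relevant configuration looks like this (node names are placeholders, not our exact entries):

    # slurm.conf
    GresTypes=gpu,mps
    NodeName=gpunode[01-04] Gres=gpu:4,mps:400 ...

    # gres.conf on the nodes
    AutoDetect=nvml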

a. Although one can configure all devices as MPS devices, a job can only request at most the MPS "resources" provided by one GPU. Is that the correct interpretation of the docs? (It also looks like that from initial testing.) Is there any reason why a job can't ask for MPS resources spanning more than one GPU? We are interested in multi-GPU MPS support and in MPS for certain MPI applications.
I found https://github.com/mknoxnv/ubuntu-slurm/blob/master/prolog.d/prolog-mps-per-gpu to handle it via the comment field, but I'm no fan.
I can always do this as a regular user, but I'd prefer integration in Slurm, of course ;)
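To make (a) concrete, this is the kind of thing I tried (the application name is just a placeholder):

    # fine: half of one GPU's MPS share (with mps=100 per GPU)
    srun --gres=mps:50 ./my_cuda_app

    # what we would like, but which does not appear to be accepted:
    # MPS resources spanning more than one GPU in a single job
    srun --gres=mps:200 ./my_cuda_app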

b. There is also "the same GPU can be allocated as MPS generic resources to multiple jobs belonging to multiple users". This suggests that multiple users can run at the same time, but the MPS documentation https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf (section 3.3.1) seems to contradict this: the different clients will have to wait for one another.


c. Also, the example prolog script that is shipped does not seem to work: it looks like you need to set CUDA_MPS_PIPE_DIRECTORY and/or CUDA_MPS_LOG_DIRECTORY before the "echo quit | ..." in order to kill the control daemon (I set both; I didn't check whether only one of them is enough).

Some other remarks on the script (see the sketch below):
- It is better to reset the device-id file only after the cleanup of the control daemon has succeeded (if you ever need to clean up manually, you then still have easy access to the original ids).
- When killing the server, the script tries to reset the CUDA devices to default using the newly requested devices instead of the old MPS devices.
- The CUDA devices are not set to exclusive mode (and I'm not sure why the reset to default is required at all).
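
Something along these lines is the cleanup ordering I have in mind (a rough sketch; the pipe/log directories and the device-id file path are placeholders for whatever the prolog actually uses):

    # point the control utility at the running daemon before asking it to quit;
    # without these two exports the "echo quit" did not reach the daemon here
    export CUDA_MPS_PIPE_DIRECTORY=/var/run/mps/pipe
    export CUDA_MPS_LOG_DIRECTORY=/var/run/mps/log
    echo quit | nvidia-cuda-mps-control

    # only after the daemon is gone, reset the devices and remove the saved
    # device-id file, so the original ids stay available for manual cleanup
    if [ -f /var/run/mps/devices ]; then
        for id in $(cat /var/run/mps/devices); do
            nvidia-smi -i "$id" -c DEFAULT
        done
        rm -f /var/run/mps/devices
    fi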


stijn
Comment 6 Michael Hinton 2019-10-04 14:22:11 MDT
Hello Stijn, thanks for the feedback.

(In reply to hpc-admin from comment #0)
> I am trying to use MPS as described in the gres documentation, but I'm a bit
> confused by the explanation there.
> We configured MPS in the node entries in slurm.conf; the gres.conf on the
> nodes only has the NVML autodetection entry.
> The nodes have 4 GPUs, and the configuration sets mps=400 per node (so 100
> per GPU).
> 
> a. Although one can configure all devices as MPS devices, a job can only
> request at most the MPS "resources" provided by one GPU. Is that the correct
> interpretation of the docs? (It also looks like that from initial testing.)
> Is there any reason why a job can't ask for MPS resources spanning more than
> one GPU? We are interested in multi-GPU MPS support and in MPS for certain
> MPI applications.
> I found
> https://github.com/mknoxnv/ubuntu-slurm/blob/master/prolog.d/prolog-mps-per-gpu
> to handle it via the comment field, but I'm no fan.
> I can always do this as a regular user, but I'd prefer integration in Slurm,
> of course ;)
Yes, that is the correct interpretation. The restriction that MPS can only run on one GPU at a time exists because GPUs have to be statically allocated to the MPS server beforehand, and that allocation can't be changed on the fly.

Multiple GPUs running MPS concurrently is obviously a desirable thing, but it might not be possible right now. The MPS docs referenced below (section 3.3.1) say "the control daemon allows at most one MPS server to be active at a time."
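
To illustrate (paths are placeholders, and this is only a sketch of how the daemon is typically started, not our prolog verbatim): the set of GPUs the MPS server can use is whatever is visible when the control daemon is launched, and changing it requires restarting the daemon.

    export CUDA_VISIBLE_DEVICES=0                      # fixed at daemon start
    export CUDA_MPS_PIPE_DIRECTORY=/var/run/mps/pipe   # placeholder path
    export CUDA_MPS_LOG_DIRECTORY=/var/run/mps/log     # placeholder path
    nvidia-cuda-mps-control -d    # servers spawned by this daemon only see GPU 0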

> b. There is also "the same GPU can be allocated as MPS generic resources to
> multiple jobs belonging to multiple users". This suggests that multiple users
> can run at the same time, but the MPS documentation
> https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf
> (section 3.3.1) seems to contradict this: the different clients will have to
> wait for one another.
Moe (who implemented MPS in Slurm in close coordination with NVIDIA) was surprised to find this in the docs, as he didn't recall this section and that diagram being there. At one point during development, he recalls being able to run multiple jobs with different users at the same time on the same GPU with minimal overhead.

It's possible that this section is still technically accurate, and that what's going on under the hood is a fast tear-down and setup of the MPS server when context switching, so to the user it still appears to run concurrently. Or it's possible that MPS recently changed in this regard. I'll need to look into this some more before I can say anything definitive. Feel free to try it out and let us know if you are able to run multiple users on the same GPU.
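
For example, something like this from two different accounts on the same node would show it (node name and binary are placeholders):

    # user A:
    srun -w gpunode01 --gres=mps:50 ./cuda_app &
    # user B, at the same time:
    srun -w gpunode01 --gres=mps:50 ./cuda_app &
    # if B's kernels only start once A's MPS clients have disconnected,
    # the serialization described in section 3.3.1 is what you are seeing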

> c. Also, the example prolog script that is shipped does not seem to work:
> it looks like you need to set CUDA_MPS_PIPE_DIRECTORY and/or
> CUDA_MPS_LOG_DIRECTORY before the "echo quit | ..." in order to kill the
> control daemon (I set both; I didn't check whether only one of them is
> enough).
> Some other remarks on the script:
> - It is better to reset the device-id file only after the cleanup of the
> control daemon has succeeded (if you ever need to clean up manually, you
> then still have easy access to the original ids).
> - When killing the server, the script tries to reset the CUDA devices to
> default using the newly requested devices instead of the old MPS devices.
> - The CUDA devices are not set to exclusive mode (and I'm not sure why the
> reset to default is required at all).
The script does say, "NOTE: This is only a sample and may need modification for your environment." Feel free to open up a separate “Contributions” bug and attach any patches for it. I'll be sure to keep these in mind as I look into MPS some more.

Thanks,
-Michael
Comment 7 Michael Hinton 2019-10-04 14:49:38 MDT
(In reply to hpc-admin from comment #0)
> I found https://github.com/mknoxnv/ubuntu-slurm/blob/master/prolog.d/prolog-mps-per-gpu
> to handle it via the comment field, but I'm no fan.
> I can always do this as a regular user, but I'd prefer integration in Slurm,
> of course ;)
This method is actually how the folks at NVIDIA do it today (and this script you found is nearly line-by-line identical to the copyrighted one we got from them). It might be your best bet to do what you want for now, if you are able.

Here is what Moe had to say about it: "This is separate from and incompatible with the gres/mps work done in version 19.05... It requires the whole node be allocated to the user (multiple users running in MPS mode at the same time would interfere) and the user to include some MPS directives in their job's "comment" field... This is a different method of managing MPS that is incompatible with the gres/mps logic, but might be attractive for some situations... This would only be used in rare situations."

We'll continue to look into this and see how we can improve Slurm MPS integration going forward.
Comment 9 hpc-admin 2019-10-05 07:47:07 MDT
hi michael,

thanks for the explanation.

With multiple GPUs I mean multiple GPUs assigned to the same job (so also the same user). I totally understand that starting a second job that also requests MPS on more GPUs on a node that is already running an MPS server (i.e. growing and possibly shrinking the set of GPUs) is a mess, but I'm a bit puzzled why there is a limitation on the number of GPUs for the first job that starts, i.e. starting the server with multiple GPUs from the start.

Wrt multi-user, the way I interpret the docs is that e.g. 2 users requesting mps=50 (of 100 total) is simply GPU sharing (combined with the 50% limitation), so it makes no real sense. It is useful for oversubscription, e.g. to set up debugging nodes, but for performance it is probably not ideal (and you can probably achieve the same without MPS by simply allowing GPUs to be shared).

Anyway, I think that for now we will not use the Slurm MPS support; I think we would need multi-GPU support (even in user-exclusive scheduling mode). Our main use case is asking for one or more full GPUs with sbatch, and then using srun to give each task e.g. 50% of a single GPU. Maybe I can use a task prolog to achieve this scenario already.
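
Something like this is the task prolog I have in mind (a rough sketch: it assumes an MPS daemon is already running for the allocated GPUs, that the Volta-style per-client thread limit is available, that SLURM_GPUS_ON_NODE is set for the job, and the 50% value is just an example):

    #!/bin/bash
    # TaskProlog: lines printed as "export NAME=value" are added to the
    # task's environment by slurmstepd
    # map each local task to one of the job's GPUs
    gpu=$(( SLURM_LOCALID % SLURM_GPUS_ON_NODE ))
    echo "export CUDA_VISIBLE_DEVICES=$gpu"
    # give the task roughly half of that GPU via MPS
    echo "export CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=50"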

(I'm fine with you closing this issue; I understand this is not a quick fix ;)

stijn
Comment 11 Moe Jette 2019-10-09 09:37:43 MDT
Christopher Samuel chris@csamuel.org via lists.schedmd.com 

On 10/8/19 12:30 PM, Goetz, Patrick G wrote:

> It looks like GPU resources can only be shared by processes run by the
> same user?

This is touched on in this bug
https://bugs.schedmd.com/show_bug.cgi?id=7834 where it appears at one
point MPS appeared to work for multiple users.

It may be that the side-channel attack via performance counters when
sharing NVIDIA GPUs between users caused them to rethink that
(speculating here, but it would make a lot of sense). That was
CVE-2018-6260, patched back in February.
Comment 12 Michael Hinton 2019-10-09 11:58:33 MDT
(In reply to Moe Jette from comment #11)
> Christopher Samuel chris@csamuel.org via lists.schedmd.com 
> 
> It may be that the side-channel attack via performance counters when
> sharing NVIDIA GPUs between users caused them to rethink that
> (speculating here, but it would make a lot of sense). That was
> CVE-2018-6260, patched back in February.
Well, that makes a lot of sense. It seems similar in kind to recent CPU hardware side-channel attacks, and NVIDIA’s fix of no longer allowing multiple users to run at once in MPS seems akin to disabling hyperthreading in CPUs to avoid Spectre-class attacks. It's unfortunate that we were not apprised of this situation back in February.

From Section 2.3.1.1 of the MPS docs: "Only one user on a system may have an active MPS server. The MPS control daemon will queue MPS server activation requests from separate users, leading to serialized exclusive access of the GPU between users regardless of GPU exclusivity settings."

From Section 3.3.1: “If there is an MPS server already active and the user id of the server and client match, then the control daemon allows the client to proceed to connect to the server. If there is an MPS server already active, but the server and client were launched with different user id’s, the control daemon requests the existing server to shutdown once all its clients have disconnected. Once the existing server has shutdown, the control daemon launches a new server with the same user id as that of the new user's client process.”
Comment 13 Michael Hinton 2019-10-09 12:57:31 MDT
Hi Stijn,

(In reply to hpc-admin from comment #9)
> with multiple gpus i mean multiple gpus assigned to the same job (so also
> same user). I totally understand that starting a second job also requesting
> mps on more gpus on a node that is already dealing with an mps server (ie
> growing and possibly shrinking the amount of gpus) is a mess, but I'm a bit
> puzzled why there's a limitation on the number of gpus for the first
> starting job; ie starting the server with multiple gpus from the start.
You are correct: only allowing one GPU for the starting job with MPS is an arbitrary limitation set by us, not by NVIDIA; we figured that the most common use case was to have a single GPU allocated as the 'spill-over' GPU for multiple jobs. However, we may look at expanding this to any number of GPUs in the future.

> anyway, i think that for now we'll not use the slurm mps; i think that we
> would need multi-gpu support (even in user-exclusive scheduling mode). our
> main use case is asking for one or more full gpus with sbatch, and then
> using srun to give task eg 50% of a single gpu. maybe i can use a taskprolog
> to already achieve this scenario.
Fair enough. If you can figure out a good alternative strategy using taskprolog, let me know. But your use case of splitting up multiple GPUs into subunits within a batch script does seem like a good argument for allowing multiple GPUs with MPS.

> wrt multi-user, the way I interpret the docs is that eg 2 users requesting
> mps=50 (of 100 total) is simply gpu sharing (combined with the 50%
> limitation), so it makes no real sense. it's useful for oversubscription eg
> to setup debugging nodes, but for performance it's probably not ideal (and
> you can probably achieve the same without mps and simply allow sharing gpus).
If MPS doesn't buy you better performance and utilization versus simple oversubscription, then I'm not sure what the point of it is. The docs do say “A single process may not utilize all the compute and memory-bandwidth capacity available on the GPU. MPS allows kernel and memcopy operations from different processes to overlap on the GPU, achieving higher utilization and shorter running times.” Now, whether the performance gain and higher utilization is worth the hassle is another question :) Of course, I would always expect exclusive use to have the highest performance.

Though NVIDIA apparently recently restricted MPS sharing to the same user, we believe that sharing a GPU between multiple jobs of a single user could still be valuable. If a user has a bunch of jobs that each need some fraction of a GPU, then it would be convenient to let Slurm manage their scheduling.

-Michael
Comment 14 Michael Hinton 2019-11-25 17:14:50 MST
Hi Stijn,

I'm going to close this for now. We'll be sure to consider your feedback the next time we touch MPS. I think you have good arguments, but for now, supporting multiple GPUs concurrently under MPS in a Slurm-native way would be an enhancement request that we would need to prioritize.

Thanks,
Michael