Ticket 4979 - Allow jobs to request GRES per task rather than per node
Summary: Allow jobs to request GRES per task rather than per node
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 17.11.5
Hardware: Linux
Importance: --- 5 - Enhancement
Assignee: Unassigned Developer
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2018-03-25 20:45 MDT by Christopher Samuel
Modified: 2019-03-06 12:09 MST
CC: 4 users

See Also:
Site: Swinburne
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 19.05.0-pre2
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments

Description Christopher Samuel 2018-03-25 20:45:32 MDT
Hi folks,

[Couldn't tell if this should be configuration, scheduling, slurmd or slurmctld sorry!]

Currently it appears that GRES are allocated per-node, meaning that a job that does:

#!/bin/bash
#SBATCH --ntasks=2
#SBATCH --tasks-per-node=2
#SBATCH -c 4
#SBATCH --gres=gpu:2

srun --gres=gpu:1 nvidia-smi -L

Results in the job being allocated 2 GPUs, but the individual tasks then end up being allocated the same GPU rather than each getting a different GPU.

$ cat slurm-51744.out 
GPU 0: Tesla P100-PCIE-12GB (UUID: GPU-74d38c15-d0b4-c3e5-c825-8db93c583c01)
GPU 0: Tesla P100-PCIE-12GB (UUID: GPU-74d38c15-d0b4-c3e5-c825-8db93c583c01)

$ uniq -c slurm-51744.out 
      2 GPU 0: Tesla P100-PCIE-12GB (UUID: GPU-74d38c15-d0b4-c3e5-c825-8db93c583c01)

It would be nice if there were a --gres-per-task option, analogous to the existing --cpus-per-task option, so that codes that would benefit from each task having a dedicated GPU could take advantage of it.

As a side issue it'd be nice to be able to request memory per task too for symmetry. :-)

All the best,
Chris
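
A common workaround on releases without a per-task GRES request is to let the step inherit all of the job's GPUs and have each task pin itself to one device by its node-local rank. A rough sketch of that approach, where ./my_gpu_app is a stand-in for the real application:

#!/bin/bash
#SBATCH --ntasks=2
#SBATCH --tasks-per-node=2
#SBATCH -c 4
#SBATCH --gres=gpu:2

# No --gres on srun, so the step sees both allocated GPUs; each task
# then restricts itself to one device via its node-local rank (0 or 1).
srun bash -c 'export CUDA_VISIBLE_DEVICES=$SLURM_LOCALID; ./my_gpu_app'

This only lines up cleanly when the number of tasks per node matches the number of GPUs per node.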
Comment 1 Tim Wickberg 2018-03-25 23:53:17 MDT
Chris - 

Retagging as an enhancement request. I believe we do have some similar work in progress for a future release, but can't quite comment on that at the moment.

- Tim
Comment 2 Christopher Samuel 2018-03-25 23:55:12 MDT
On 26/03/18 16:53, bugs@schedmd.com wrote:

> Retagging as an enhancement request.

I selected that, I was sure I had!  Need more caffeine..

> I believe we do have some similar work in progress for a future
> release, but can't quite comment on that at the moment.

Not a problem, thanks for letting me know. Or not. :-)

cheers,
Chris
Comment 3 Tim Wickberg 2018-03-25 23:59:31 MDT
(In reply to Christopher Samuel from comment #2)
> On 26/03/18 16:53, bugs@schedmd.com wrote:
> 
> > Retagging as an enhancement request.
> 
> I selected that, I was sure I had!  Need more caffeine..

... I need to document it somewhere, but I punt Sev5 requests by customers to Sev4 to make sure they get some initial triage before being relegated to an enhancement request.
 
> > I believe we do have some similar work in progress for a future
> > release, but can't quite comment on that at the moment.
> 
> Not a problem, thanks for letting me know. Or not. :-)

I'll try to update this as more details are available. Expect some news on this front by SLUG'18 at the latest. :)
Comment 8 Gordon Dexter 2019-03-06 09:37:50 MST
This issue is critical to our adoption of Slurm.

Tasks should never be given colliding GPU assignments.  We use our GPUs in Exclusive Process mode to allow proper use of GPU memory, so GPU collisions cause jobs to fail.

More generally, tasks should run wherever there are resources, spread out across nodes if necessary, or sharing nodes if that works better.

Frankly, the idea of per-node resource requests is completely unhelpful to us.  We have a variety of GPU servers, some with 2, 4, or even 8 GPUs, and we need to be able to schedule jobs that take maximum advantage of that heterogeneous hardware.  Splitting the work into small tasks, which the scheduler can fit on a smaller server or pack several of onto a larger one, is a core part of our workflow.
Comment 9 Moe Jette 2019-03-06 09:43:47 MST
This functionality is in Slurm version 19.05. Its official release date is in May. There should be a new pre-release (unsupported) next week.
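
For reference, the 19.05 series adds GPU-specific request options such as --gpus-per-task, which covers the per-task case asked for above. A minimal sketch of the original script rewritten with it (assuming each of the two tasks should get its own dedicated GPU, and that the site constrains devices per task via cgroups):

#!/bin/bash
#SBATCH --ntasks=2
#SBATCH --tasks-per-node=2
#SBATCH -c 4
#SBATCH --gpus-per-task=1

# With per-task GPU binding in effect, the two output lines should now
# show two different GPU UUIDs rather than the same one twice.
srun nvidia-smi -L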
Comment 10 Chris Samuel (NERSC) 2019-03-06 12:09:30 MST
Thanks Moe, much appreciated. I've been keeping an eye on master. :-)