Ticket 1725 - Enforcing GRES/CPUs binding
Summary: Enforcing GRES/CPUs binding
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Configuration
Version: 14.11.7
Hardware: Linux
Priority: ---
Severity: 5 - Enhancement
Assignee: Moe Jette
QA Contact:
URL:
Duplicates: 3705
Depends on:
Blocks:
 
Reported: 2015-06-04 13:25 MDT by Kilian Cavalotti
Modified: 2018-06-21 05:40 MDT
CC List: 4 users

See Also:
Site: Stanford
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 16.05.0-pre2
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
Kludge for desired behaviour (618 bytes, patch)
2015-06-09 10:36 MDT, Moe Jette
A better patch? (655 bytes, patch)
2015-06-10 05:08 MDT, Moe Jette

Description Kilian Cavalotti 2015-06-04 13:25:09 MDT
Hello,

We have dual-socket nodes featuring 4 GPUs, with 2 GPUs connected to each socket. We have a gres.conf like this:
--8<--------------------------------------------------------------------
# 4 GPUs nodes
NodeName=gpu-9-[6-10] Name=gpu Type=gtx File=/dev/nvidia[0-1] CPUs=[0-7]
NodeName=gpu-9-[6-10] Name=gpu Type=gtx File=/dev/nvidia[2-3] CPUs=[8-15]
--8<--------------------------------------------------------------------

Some of our users want to run multi-GPU codes that take advantage of P2P communication between the GPUs. So they usually submit 2-GPU jobs, and they really want to be allocated a pair of GPUs connected to the same socket.

That works fine when the node is empty: the first 2-GPU job gets GPUs 0,1 and the second one gets 2,3. But sometimes single-GPU jobs enter the dance, and there are situations where GPUs 0 and 2 are each busy with a single-GPU job when a 2-GPU job is submitted. Slurm then allocates GPUs 1 and 3 (which are each attached to a different socket) to that job, and users complain of bad performance.

They would prefer to wait in the queue rather than getting allocated GPUs that are not connected to the same socket. Is there a way to enforce this? I tried using --ntasks-per-socket=1, but it doesn't seem to work (although I have the feeling it used to in previous versions, but maybe I'm wrong).


Here's how to reproduce this:

1. run 3 1-GPU jobs
2. kill the job that uses GPU 1
3. submit a 2-GPU job

The job is allocated GPUs 1 and 3.
Some users would prefer the job to wait until either GPU 0 or GPU 2 becomes available, and be allocated 0,1 or 2,3.

--8<--------------------------------------------------------------------
$ for i in {1..3}; do
  srun --gres=gpu:1 bash -c 'echo -n "$SLURM_JOB_ID :: $CUDA_VISIBLE_DEVICES" ; sleep 100' &
done
2374029 :: 2
2374028 :: 0
2374030 :: 1

$ scancel 2374030

$ srun -n 1 --ntasks-per-socket=1 --gres=gpu:2 --pty bash
$ echo $CUDA_VISIBLE_DEVICES
1,3
--8<--------------------------------------------------------------------

I can see how other users would prefer to see their job start rather than wait for a close pair to become available, so ideally this behavior would be controlled by an srun/salloc/sbatch switch, like --enforce-gres-cpu-binding or something similar.

Thanks!
Kilian
Comment 1 Kilian Cavalotti 2015-06-04 13:29:04 MDT
And BTW, the gres.conf documentation says:

CPUs   [...] If specified, then only the identified CPUs can be allocated with each generic resource; an attempt to use other CPUs will not  be honored. 

which does not seem to be the case when a 1-CPU job gets allocated 2 GPUs on different sockets (in my previous example, the last job ran on CPU 9, but got allocated GPU 1, which should only be usable by CPUs 0-7).

Thanks!
Kilian
Comment 2 Moe Jette 2015-06-04 13:34:42 MDT
I was doing related work today and was seeing the behaviour as documented. Perhaps you could attach your config files. If you just added the CPUs option, the daemons might need restarting for it to take effect, but I'm not sure...

On June 4, 2015 6:29:04 PM PDT, bugs@schedmd.com wrote:
>http://bugs.schedmd.com/show_bug.cgi?id=1725
>
>--- Comment #1 from Kilian Cavalotti <kilian@stanford.edu> ---
>And BTW, the gres.conf documentation says:
>
>CPUs   [...] If specified, then only the identified CPUs can be
>allocated with
>each generic resource; an attempt to use other CPUs will not  be
>honored. 
>
>which does not seems to be the case where a 1 CPU job gets allocated 2
>GPUs on
>different sockets (in my previous example, the last job ran on CPU 9,
>but got
>allocated GPU 1 which should be only usable by CPUs 0-7.
>
>Thanks!
>Kilian
Comment 3 Kilian Cavalotti 2015-06-05 03:27:32 MDT
Hi Moe, 

The GRES config hasn't changed in a long time; the CPUs= option has been there from the start. I sent you our config files by email (I'm not too comfortable posting them here).

Thanks!
Kilian
Comment 4 Moe Jette 2015-06-05 08:19:39 MDT
I was able to reproduce the problem described, and was able to ensure that the user's tasks did not get split across sockets by using the --cores-per-socket option.

Here's my use case:
sbatch --gres=gpu:1 -n1 tmp1
sbatch --gres=gpu:1 -n1 tmp1
sbatch --gres=gpu:1 -n1 tmp1
scancel (the first job)

sbatch --gres=gpu:2 -n1 --cores-per-socket=8 tmp1
(this job waits until a whole socket is available)

The obvious downside is that the job consumes all cores on one socket, even if it only needs one. I'll keep working on this, but wanted to give you a progress report.
Comment 5 Kilian Cavalotti 2015-06-05 09:04:01 MDT
Hi Moe,

That works great! That will be a good workaround for our users in the meantime.

Thanks!
Kilian
Comment 6 Moe Jette 2015-06-05 10:16:21 MDT
There will be a fair bit of new logic required to add something like the "--enforce-gres-cpu-binding" option you describe. It's definitely not practical to add to version 14.11. I agree the functionality is worthwhile adding, but given the remaining time, this is highly unlikely to get into version 15.08 either. The best case would be for us to get the logic added to the next major release of Slurm and provide you with a patch for version 15.08 later this year. Since you at least have a work-around, I'm changing the ticket's severity to 5: enhancement.
Comment 7 Kilian Cavalotti 2015-06-08 07:11:39 MDT
Hi Moe, 

(In reply to Moe Jette from comment #6)
> There will be a fair bit of new logic required to add something like the
> "--enforce-gres-cpu-binding" option you describe. Its definitely not
> practical to add to version 14.11. I agree the functionality is worthwhile
> adding, but given the remaining time, this is highly unlikely to get into
> version 15.08 either. The best case would be for us to get the logic added
> to the next major release of Slurm and provide you with a patch for version
> 15.08 later this year. Since you at least have a work-around, I'm changing
> the ticket's severity to 5: enhancement.

The --cores-per-socket=8 workaround works great, with the caveat you mentioned: it takes a full socket for a 2-GPU job. So in our dual-socket setup, two 2-GPU jobs will take a whole node. That's fine for nodes with 4 GPUs, but we actually have a fair number of compute nodes featuring 8 GPUs. On those, half of the GPUs would be wasted by just two 2-GPU jobs. Obviously our users won't be too happy with that, and a fix planned for 16.05 will seem like a long time for them to wait. :)

So, just so I understand correctly: the CPUs= information in gres.conf is currently only used as a "hint", and works mostly to choose which CPUs to assign to the job once the GRES resources have been allocated (and not the other way around, i.e. choosing the GRES after the CPUs have been allocated). Is that correct?


I still observe unexpected behavior when submitting jobs to a partially-filled node. On an empty 4-GPU node with the following gres.conf:
NodeName=gpu-9-[6-10] Name=gpu Type=gtx File=/dev/nvidia[0-1] CPUs=[0-7]
NodeName=gpu-9-[6-10] Name=gpu Type=gtx File=/dev/nvidia[2-3] CPUs=[8-15]

1. I submit a first job that uses 1 GPU:
$ srun --gres gpu:1 --pty bash
$ echo $CUDA_VISIBLE_DEVICES
0

2. while the first one is still running, a 2-GPU job asking for 1 task per node waits (and I don't really understand why):
$ srun --ntasks-per-node=1 --gres=gpu:2 --pty bash
srun: job 2390816 queued and waiting for resources

3. whereas a 2-GPU job requesting 1 core per socket (so just 1 socket) actually gets GPUs allocated from two different sockets!
$ srun -n 1  --cores-per-socket=1 --gres=gpu:2 -p testk --pty bash
$ echo $CUDA_VISIBLE_DEVICES
1,2

That all seems quite counter-intuitive to me.

Exploring a little more, I found out that requesting --ntasks-per-node=1 would achieve the same benefits as the --cores-per-socket=8 workaround, without the drawback of consuming 8 cores. But it doesn't actually make sense to me. :)

So if you could just confirm what I'm seeing and maybe explain this behavior, I'd be very grateful.

Thanks!
Kilian
Comment 8 Moe Jette 2015-06-08 07:29:20 MDT
(In reply to Kilian Cavalotti from comment #7)
> Hi Moe, 
> 
> (In reply to Moe Jette from comment #6)
> > There will be a fair bit of new logic required to add something like the
> > "--enforce-gres-cpu-binding" option you describe. Its definitely not
> > practical to add to version 14.11. I agree the functionality is worthwhile
> > adding, but given the remaining time, this is highly unlikely to get into
> > version 15.08 either. The best case would be for us to get the logic added
> > to the next major release of Slurm and provide you with a patch for version
> > 15.08 later this year. Since you at least have a work-around, I'm changing
> > the ticket's severity to 5: enhancement.
> 
> The --cores-per-socket=8 works great with the caveat your mentioned: it
> takes a full socket for a 2-GPU job. So in our dual-socket setup, two 2-GPU
> jobs will take a whole node. It's fine for nodes with 4 GPUs, but we
> actually have a fair number of compute nodes featuring 8 GPUs. In those,
> half of the GPUs would be wasted by just two 2-GPU jobs. Obviously our users
> wont' be too happy with that, and a planned fix in 16.05 will seem like a
> long time for them to wait. :)

It's definitely too late to start a major development effort for the 15.08 release. We have several projects currently underway that need to be wrapped up, followed by testing, etc. We may be able to get you a patch for this later this year.


> So, just so I understand correctly, the CPUs= information in gres.conf is
> currently only used as a "hint", and works mostly to choose what CPUs to
> assign to the job once the GRES resources have been allocated (and not the
> other way around (choose GRES after CPUs have been allocated), is that
> correct?

It's more complex than that. First, there is a pass over the GRES information to determine which CPUs are available on each node for the specified GRES count (e.g. 1 GPU -> 8 CPUs, 2 GPUs -> 16 CPUs). Next there is a bunch of logic to select specific CPUs from those available to the job on each node, based upon a multitude of user options (e.g. --ntasks-per-node). Finally, the specific GRES are selected that best match the selected CPUs.
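To see the end result of those passes for a given job, you can compare the CPUs and the GPUs it actually received from inside the allocation (the same commands used elsewhere in this ticket):
--8<--------------------------------------------------------------------
# CPU IDs allocated to the job on each node
$ scontrol -d show job $SLURM_JOBID | grep CPU_IDs
# GPUs handed to the job
$ echo $CUDA_VISIBLE_DEVICES
--8<--------------------------------------------------------------------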


> I still observe unexpected behavior when submitting jobs to a
> partially-filled node. On an empty 4 GPUs node with the following gres.conf:
> NodeName=gpu-9-[6-10] Name=gpu Type=gtx File=/dev/nvidia[0-1] CPUs=[0-7]
> NodeName=gpu-9-[6-10] Name=gpu Type=gtx File=/dev/nvidia[2-3] CPUs=[8-15]
> 
> 1. I submit a first job that uses 1 GPU:
> $ srun --gres gpu:1 --pty bash
> $ echo $CUDA_VISIBLE_DEVICES
> 0
> 
> 2. while the first one is still running, a 2-GPU job asking for 1 task per
> node waits (and I don't really understand why):
> $ srun --ntasks-per-node=1 --gres=gpu:2 --pty bash
> srun: job 2390816 queued and waiting for resources
> 
> 3. whereas a 2-GPU job requesting 1 core per socket (so just 1 socket)
> actually gets GPUs allocated from two different sockets!
> $ srun -n 1  --cores-per-socket=1 --gres=gpu:2 -p testk --pty bash
> $ echo $CUDA_VISIBLE_DEVICES
> 1,2
> 
> That all seems quite counter-intuitive to me.

I would not expect this either. I will need to study the logic some more to answer your questions. I will try to respond later this week.


> Exploring a little more, I found out that requesting --ntasks-per-node=1
> would achieve the same benefits as the --cores-per-socket=8 workaround,
> without the drawback of consuming 8 cores. But it doesn't actually make
> sense to me. :)
> 
> So if you could just confirm what I'm seeing and maybe explain this
> behavior, I'd be very grateful.
> 
> Thanks!
> Kilian
Comment 9 Kilian Cavalotti 2015-06-08 08:23:21 MDT
Hi Moe, 

> It's definitely too late to start a major development effort for the 15.08
> release. We have several projects currently underway that need to be wrapped
> up, followed by testing, etc. We may be able to get you a patch for this
> later this year.

I perfectly understand. Hence my questions, because I feel everything should already be in place for this to work with the current code.

> It's more complex than that. First, there is a pass over the GRES
> information to determine whichy CPUs are available on each node for the
> specified GRES count (e.g. 1 GPU -> 8 CPUs, 2 GPUs -> 16 CPUs). Next there
> is a bunch of logic to select specific CPUs from those available to the job
> on each node based upon a multitude of user options (e.g.
> --ntasks-per-node). Finally, the specific GRES are selected that best match
> the selected CPUs.

Thanks for the explanation, very useful.

> I would not expect this either. I will need to study the logic some more to
> answer your questions. I will try to respond later this week.

Thanks a lot, much appreciated!

Cheers,
Kilian
Comment 10 Moe Jette 2015-06-09 09:30:53 MDT
(In reply to Kilian Cavalotti from comment #7)
> I still observe unexpected behavior when submitting jobs to a
> partially-filled node. On an empty 4 GPUs node with the following gres.conf:
> NodeName=gpu-9-[6-10] Name=gpu Type=gtx File=/dev/nvidia[0-1] CPUs=[0-7]
> NodeName=gpu-9-[6-10] Name=gpu Type=gtx File=/dev/nvidia[2-3] CPUs=[8-15]
> 
> 1. I submit a first job that uses 1 GPU:
> $ srun --gres gpu:1 --pty bash
> $ echo $CUDA_VISIBLE_DEVICES
> 0
> 
> 2. while the first one is still running, a 2-GPU job asking for 1 task per
> node waits (and I don't really understand why):
> $ srun --ntasks-per-node=1 --gres=gpu:2 --pty bash
> srun: job 2390816 queued and waiting for resources
> 
> 3. whereas a 2-GPU job requesting 1 core per socket (so just 1 socket)
> actually gets GPUs allocated from two different sockets!
> $ srun -n 1  --cores-per-socket=1 --gres=gpu:2 -p testk --pty bash
> $ echo $CUDA_VISIBLE_DEVICES
> 1,2
> 
> That all seems quite counter-intuitive to me.
> 
> Exploring a little more, I found out that requesting --ntasks-per-node=1
> would achieve the same benefits as the --cores-per-socket=8 workaround,
> without the drawback of consuming 8 cores. But it doesn't actually make
> sense to me. :)

That might work depending upon what else is allocated on the node, but definitely not reliably. In some cases jobs will be prevented from even being submitted. I've changed the logic so that #2 will now act like #3, which provides some consistency anyway. The commit is here:
https://github.com/SchedMD/slurm/commit/e1a00772c58a7b6f82c2489ee9169aad719dbb2d

I'll continue digging into the GRES scheduling logic. I'm now thinking that I may be able to at least provide you with a patch to provide the functionality you want.
Comment 11 Moe Jette 2015-06-09 10:36:22 MDT
Created attachment 1966 [details]
Kludge for desired behaviour

The attached patch, in addition to the previously made code changes, will give you the desired behaviour with Slurm version 14.11.8. This might need to be handled as a local patch for some time though (until the next major release after 15.08).

For this to work properly, all of the GPUs associated with a single socket on each node need to appear on a single line in the gres.conf file, like this:
Name=gpu File=/dev/nvidia[0-1] CPUs=[0-7]
Name=gpu File=/dev/nvidia[2-3] CPUs=[8-15]
Do NOT list each device on a separate line (which would create a separate entry in the table used here). The NodeName specifications in gres.conf are not relevant to this patch.

Then the user needs to request more than 1 GPU. In that case, ALL of the GPUs on that line of the gres.conf file must be un-allocated.

Here's the patch in-line:
diff --git a/src/common/gres.c b/src/common/gres.c
index 3059918..70699e4 100644
--- a/src/common/gres.c
+++ b/src/common/gres.c
@@ -3097,6 +3097,11 @@ static void	_job_core_filter(void *job_gres_data, void *node_gres_data,
 		    (node_gres_ptr->topo_gres_cnt_alloc[i] >=
 		     node_gres_ptr->topo_gres_cnt_avail[i]))
 			continue;
+if (!use_total_gres &&
+    (job_gres_ptr->gres_cnt_alloc > 1) &&
+    (node_gres_ptr->topo_gres_cnt_alloc[i] > 0) &&
+    (!strcmp(gres_name, "gpu")))
+	continue;
 		if (job_gres_ptr->type_model &&
 		    (!node_gres_ptr->topo_model[i] ||
 		     xstrcmp(job_gres_ptr->type_model,
Comment 12 Kilian Cavalotti 2015-06-09 12:27:03 MDT
Hi Moe, 

> The attached patch, in addition to the previously made code changes, will
> give you the desired behaviour with Slurm version 14.11.8. This might need
> to be handled as a local patch for some time though (until the next major
> release after 15.08).

Thanks for this.

> Do NOT list each device on a separate line (which would create a separate
> entry in the table used here). The NodeName specifications in gres.conf are
> not relevant to this patch.

I'm not sure what that means: will the NodeNames still be taken into account, or does the patch require having a different gres.conf file on each node?

> Then the user needs to request more than 1 GPU. In that case, ALL of the
> GPUs on that line of the gres.conf file must be un-allocated.

Mmmh, so if I understand that right, it means that on a 8-GPU node with the following gres.conf:
--8<--------------------------------------------------------------------
Name=gpu Type=gtx File=/dev/nvidia[0-3] CPUs=[0-7]
Name=gpu Type=gtx File=/dev/nvidia[4-7] CPUs=[8-15]
--8<--------------------------------------------------------------------
only two 2-GPU jobs will be able to run at the same time, is that right?

A first 2-GPU job would use, say, GPUs 0,1. Then, since you mentioned that all of the GPUs on the same gres.conf line must be un-allocated, the 2nd 2-GPU job would be allocated GPUs 4,5 and no other 2-GPU job would be able to run, which would waste 4 GPUs. I'm not sure that could fly with our users either.

Re-reading the documentation, I now understand that I got the meaning of --cores-per-socket=x wrong: I thought it would assign x cores per socket to the job, whereas it actually just selects nodes that physically have at least x cores per socket, whether those cores are already allocated or not.
Now, --cpu_bind=socket looks like it should do what I want: allocate CPU cores from the same socket to the job. But it doesn't seem like it does:

--8<--------------------------------------------------------------------
$ srun -n 2 --cpu_bind=cores,verbose --pty bash
cpu_bind_cores=UNK  - gpu-9-10, task  0  0 [26513]: mask 0x2 set

$ scontrol -d show job $SLURM_JOBID | grep CPU_IDs
     Nodes=gpu-9-10 CPU_IDs=1,9 Mem=8000
--8<--------------------------------------------------------------------

CPUs 1 and 9 are on different sockets, right? Shouldn't something like CPUs 1 and 2 be allocated to this job instead?

I can't find a way to make the scheduler allocate 2 cores from the same socket to a job. I feel that would be the first step toward getting bound-to-socket GPU allocation.

Thanks!
Kilian
Comment 13 Moe Jette 2015-06-10 02:16:59 MDT
(In reply to Kilian Cavalotti from comment #12)
> > Do NOT list each device on a separate line (which would create a separate
> > entry in the table used here). The NodeName specifications in gres.conf are
> > not relevant to this patch.
> 
> I'm not sure what that means: will the NodeNames still be taken into
> account, or does the patch requires having a different gres.conf file on
> each node?

Your current gres.conf file is fine. If you change it, then keep all of the GPUs associated with a single socket on one line. For example:

Nodename=tux[1-10] Name=gpu Type=gtx File=/dev/nvidia[0-1] CPUs=[0-7]
Nodename=tux[1-10] Name=gpu Type=gtx File=/dev/nvidia[2-3] CPUs=[8-15]

is good, but

Nodename=tux[1-10] Name=gpu Type=gtx File=/dev/nvidia0 CPUs=[0-7]
Nodename=tux[1-10] Name=gpu Type=gtx File=/dev/nvidia1 CPUs=[0-7]
Nodename=tux[1-10] Name=gpu Type=gtx File=/dev/nvidia2 CPUs=[8-15]
Nodename=tux[1-10] Name=gpu Type=gtx File=/dev/nvidia3 CPUs=[8-15]

will not work as desired with the latest patch.

> > Then the user needs to request more than 1 GPU. In that case, ALL of the
> > GPUs on that line of the gres.conf file must be un-allocated.
> 
> Mmmh, so if I understand that right, it means that on a 8-GPU node with the
> following gres.conf:
> --8<--------------------------------------------------------------------
> Name=gpu Type=gtx File=/dev/nvidia[0-3] CPUs=[0-7]
> Name=gpu Type=gtx File=/dev/nvidia[4-7] CPUs=[8-15]
> --8<--------------------------------------------------------------------
> only two 2-GPU jobs will be able to run at the same time, is that right?

Two 4-GPU jobs would run fine, but a job needing 2 GPUs would prevent anything other than 1-GPU jobs from using the other GPUs on that same socket.


> A first 2-GPU job would use say GPUs 0,1, then, since you mentioned that all
> of the GPUs on the same gres.conf must be un-allocated, that means that the
> 2nd 2-GPU job would be allocated GPUs 4,5 and no other 2-GPU job will be
> able to run. Which would waste 4 GPUs. I'm not sure that could fly with out
> users either.

Understood.


> Re-reading the documentation I now understand that I got the meaning of
> --cores-per-socket=x wrong: I thought it would assign x cores per socket to
> the job, where it actually just select nodes that physically have at least x
> cores per socket, whereas they're already allocated or not.
> Now, --cpu_bind=socket looks like it should do what I want: allocate CPU
> cores from the same socket to the job. But it doesn't seem like it does:
> 
> --8<--------------------------------------------------------------------
> $ srun -n 2 --cpu_bind=cores,verbose --pty bash
> cpu_bind_cores=UNK  - gpu-9-10, task  0  0 [26513]: mask 0x2 set
> 
> $ scontrol -d show job $SLURM_JOBID | grep CPU_IDs
>      Nodes=gpu-9-10 CPU_IDs=1,9 Mem=8000
> --8<--------------------------------------------------------------------
> 
> CPUs 1 and 9 are from different sockets, right? Shouldn't be like CPUs 1 and
> 2 be allocated to this job?

The cpu_bind option only controls the binding of tasks to allocated resources after those resources are allocated. It has no influence over resource allocation, which is where we need to address the issue.


> I can't find a way to make the scheduler allocate 2 cores from the same
> socket to a job. I feel that would be the first steps in my tries to get to
> bound-to-socket GPU allocation.

I might be able to provide a patch that confines a job allocation to one socket unless it specifically requests more sockets. That could leave idle cores on some sockets. I know it's a kludge, but that's about all that I can offer near-term. Would that be an attractive option?


> Thanks!
> Kilian
Comment 14 Moe Jette 2015-06-10 03:32:51 MDT
(In reply to Moe Jette from comment #13)
> > I can't find a way to make the scheduler allocate 2 cores from the same
> > socket to a job. I feel that would be the first steps in my tries to get to
> > bound-to-socket GPU allocation.
> 
> I might be able to provide a patch that confines a job allocation to one
> socket unless it specifically requests more sockets. That could lead to idle
> cores on some sockets that remain unused. I know it's a kludge, but that's
> about all that I can offer near-term. Would that be an attractive option?

My idea above is not going to work without a major re-write of the logic, but I have an alternate patch that should work ok. I'll test and post soon.
Comment 15 Moe Jette 2015-06-10 05:08:11 MDT
Created attachment 1967 [details]
A better patch?

The attachment (and below) is probably a better patch to address this specific issue. If the job requests a GPU count that is equal to the count on one line of the gres.conf file (i.e. representing the GPUs on one socket), then only sockets with none of those GPUs allocated to another job will be considered. It's clearly still a kludge, but from my understanding it would probably satisfy your needs, with some sockets having 2 GPUs and others 4 GPUs.

diff --git a/src/common/gres.c b/src/common/gres.c
index 3059918..a68a0c5 100644
--- a/src/common/gres.c
+++ b/src/common/gres.c
@@ -3097,6 +3097,11 @@ static void	_job_core_filter(void *job_gres_data, void *node_gres_data,
 		    (node_gres_ptr->topo_gres_cnt_alloc[i] >=
 		     node_gres_ptr->topo_gres_cnt_avail[i]))
 			continue;
+if (!use_total_gres &&
+    (job_gres_ptr->gres_cnt_alloc == node_gres_ptr->topo_gres_cnt_avail[i]) &&
+    (node_gres_ptr->topo_gres_cnt_alloc[i] > 0) &&
+    (!strcmp(gres_name, "gpu")))
+	continue;
 		if (job_gres_ptr->type_model &&
 		    (!node_gres_ptr->topo_model[i] ||
 		     xstrcmp(job_gres_ptr->type_model,
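To illustrate with the 8-GPU gres.conf from comment 12 (4 GPUs per line, i.e. per socket), here is a rough sketch of the resulting behaviour; job.sh is just a placeholder batch script:
--8<--------------------------------------------------------------------
# gres.conf (one line per socket, 4 GPUs each):
#   Name=gpu Type=gtx File=/dev/nvidia[0-3] CPUs=[0-7]
#   Name=gpu Type=gtx File=/dev/nvidia[4-7] CPUs=[8-15]

# GPU count matches the per-line count (4): with this patch, the job only
# starts once all 4 GPUs of one socket are un-allocated.
$ sbatch --gres=gpu:4 -n1 job.sh

# GPU count (2) does not match the per-line count (4): the extra filter
# does not apply, so this job is scheduled as before.
$ sbatch --gres=gpu:2 -n1 job.sh
--8<--------------------------------------------------------------------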
Comment 16 Kilian Cavalotti 2015-06-10 07:44:10 MDT
Hi Moe, 

Thanks a lot for spending time on this. From your description, I think that this would work. I'll have to test it and I'll let you know.

Cheers,
Kilian
Comment 17 Moe Jette 2015-06-26 05:29:22 MDT
(In reply to Kilian Cavalotti from comment #16)
> Hi Moe, 
> 
> Thanks a lot for spending time of this. From your description, I think that
> this would work. I'll have to test it and I'll let you know.
> 
> Cheers,
> Kilian

Hi Kilian,

Do you have an update on this?
Comment 18 Kilian Cavalotti 2015-06-26 05:52:42 MDT
Hi Moe, 

> Do you have an update on this?

Unfortunately, not really. It turns out that some of our users need P2P access between GPUs, but others don't care, as they usually request several GPUs in the same job but use them independently. So enforcing this kind of binding (a 4-GPU job could only run on a node if 4 GPUs on the same socket are available, but would have to wait if 2 GPUs are available on each socket) makes some users happy, and some not.

Bottom line, and ideally, we're really more interested in a switch that would allow users to choose that behavior at submission time. 

Currently, we tell the P2P users to check the value of CUDA_VISIBLE_DEVICES when their job starts, and if the allocated GPUs are not on the same socket, to just resubmit their job. That's obviously not ideal, since it kind of makes them lose their spot in the queue and wait again, but it's a bit better than wasting CPU cycles on a job that runs at half its expected speed.
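For reference, the check itself can be as simple as something like this at the top of the batch script (a rough sketch, assuming the 4-GPU gres.conf above where GPUs 0-1 are attached to socket 0 and GPUs 2-3 to socket 1; the actual resubmission is left to the user):
--8<--------------------------------------------------------------------
#!/bin/bash
# Bail out early if the allocated GPUs span both sockets.
case "$CUDA_VISIBLE_DEVICES" in
    0|1|2|3|0,1|2,3)
        echo "GPUs $CUDA_VISIBLE_DEVICES are on a single socket, proceeding"
        ;;
    *)
        echo "GPUs $CUDA_VISIBLE_DEVICES span both sockets, exiting so the job can be resubmitted" >&2
        exit 1
        ;;
esac
# ... the actual multi-GPU application would start here ...
--8<--------------------------------------------------------------------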
Comment 19 Moe Jette 2015-09-10 08:57:49 MDT
Related emails:

Quoting "Pocina, Goran" <Goran.Pocina@deshawresearch.com>:
> We're interested in this as well.  We have two different job requirements:
> 	1-  tightly coupled jobs that need to run on GPUs attached to the 
> same CPU.   These jobs would need 2, 3 or 4 GPUs, and
> 	2- loosely coupled jobs that could need anywhere from 1 to 40 GPUs, 
> distributed across multiple nodes, and that don't benefit from GPUs 
> sharing the same CPU.
>
> We've run into a couple challenges meeting these requirements.  If we 
> use a configuration file similar to yours, (mapping 4 GPUs to the 
> CPUs on the same socket), we can meet the first requirement, but 
> can't run jobs that need more than 4 GPUs.     If we set up a 
> gres.conf that simply lists GPUs without CPU mappings, then we can 
> meet the second requirement, but not the first.
>
> What we currently do is a little ugly, in that we use CPU allocation 
> as a proxy for GPU allocation.   That is, we disable all but 8 CPUs 
> on a system (4 per socket), and can then ask slurm for either an 
> entire socket, or any number of CPUs.   We then have code that looks 
> at the CPU assignment, and sets CUDA_VISIBLE_DEVICES according to a 
> fixed algorithm, before the job starts.
>
> Another challenge we run into is that our GPU/CPU resources quickly 
> become fragmented, and it becomes difficult to schedule any tightly 
> coupled jobs while loosely coupled jobs are running/pending.   It's 
> possible that fragmentation would be lessened if we were able to use 
> the following heuristic for scheduling:
>
> If a job needs N resources:
>
> A. If there are nodes with exactly N resources available, use those. END.
> B. If there are nodes with more than N resources available, use the 
> node with the *fewest* available resources. END.
> C. If there are only nodes with less than N resources available, 
> allocate all of the resources from the node with the *most* resources 
> available, reduce N by number of allocated resources, repeat.
>
> Currently for step C, slurm will either strive to spread jobs across 
> nodes, or concentrate jobs on the fewest number of nodes, depending 
> on settings.   Both these strategies result in resource 
> fragmentation, however.  Step C makes it more likely that resources 
> will be allocated and released in large chunks, if possible.
>
> -----Original Message-----
> From: Nathan Crawford [mailto:nathan.crawford@uci.edu]
> Sent: Tuesday, September 08, 2015 8:20 PM
> To: slurm-dev
> Subject: [slurm-dev] Recommendations for scheduling 8-GPU nodes as 
> two sets of four
>
>
> Hi All,
>
>    We have a couple nodes with 8 Nvidia Titan X GPUs each. We have 
> some software
> that can run in parallel across GPUs, but performance is only good if the
> inter-GPU communication stays on the PCI links of a single CPU socket.
>
>    Right now, the only thing I have been able to work reliably [with slurm
> 14.11.8 on Scientific Linux 6] is to define two types of gpus in the 
> gres.conf:
>
> NodeName=c-3-29,c-9-9 Name=gpu Type=titanxa File=/dev/nvidia0 CPUs=0-15
> NodeName=c-3-29,c-9-9 Name=gpu Type=titanxa File=/dev/nvidia1 CPUs=0-15
> NodeName=c-3-29,c-9-9 Name=gpu Type=titanxa File=/dev/nvidia2 CPUs=0-15
> NodeName=c-3-29,c-9-9 Name=gpu Type=titanxa File=/dev/nvidia3 CPUs=0-15
> NodeName=c-3-29,c-9-9 Name=gpu Type=titanxb File=/dev/nvidia4 CPUs=16-31
> NodeName=c-3-29,c-9-9 Name=gpu Type=titanxb File=/dev/nvidia5 CPUs=16-31
> NodeName=c-3-29,c-9-9 Name=gpu Type=titanxb File=/dev/nvidia6 CPUs=16-31
> NodeName=c-3-29,c-9-9 Name=gpu Type=titanxb File=/dev/nvidia7 CPUs=16-31
>
>
>    The downside is that the user needs to specify one GRES type or 
> the other at
> job submission. I suppose I could modify the job submit lua script to 
> pick one
> randomly or on current usage, but that could still lead to imbalanced usage.
>
>    I had earlier tried to have a single Type=titanx, with each device 
> restricted
> to the cores on one socket or the other. I couldn't figure out a way 
> to reliably
> restrict a single job to cores on a single socket.  Also, even with 
> the device
> restrictions, I was able to get a job with CPU cores on one socket, but using
> the GPU connected to the other socket.
>
>    Is there a recommended way to handle this situation? I'd like to 
> preserve the
> option of having a single job be able to use all 8 GPUs.
>
> Thanks,
> Nate Crawford
>
> --
> ________________________________________________________________________
> Dr. Nathan Crawford              nathan.crawford@uci.edu
> Modeling Facility Director
> Department of Chemistry
> 1102 Natural Sciences II         Office: 2101 Natural Sciences II
> University of California, Irvine  Phone: 949-824-4508
> Irvine, CA 92697-2025, USA
Comment 21 Moe Jette 2016-03-22 05:27:05 MDT
You will find this fixed in Slurm version 16.05 when released in late May. I added a "--gres-flags=enforce-binding" option to the salloc, sbatch and srun commands. If set, the only CPUs available to the job will be those bound to the selected GRES (i.e. the CPUs identified in the gres.conf file will be strictly enforced rather than advisory).

The changes are fairly extensive and not suitable for back-porting to version 15.08. The commit is here:
https://github.com/SchedMD/slurm/commit/5d7f8b7650f4a257520d12920d800edbe52167b4
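For example (a usage sketch; job.sh is just a placeholder batch script):
--8<--------------------------------------------------------------------
# Strictly enforce the CPUs= binding from gres.conf when picking GPUs:
# on the 4-GPU nodes described above, a 2-GPU request now waits for a
# same-socket pair (0,1 or 2,3) instead of being handed GPUs from two sockets.
$ srun -n 1 --gres=gpu:2 --gres-flags=enforce-binding --pty bash

# The same flag works with sbatch and salloc:
$ sbatch --gres=gpu:2 --gres-flags=enforce-binding job.sh
--8<--------------------------------------------------------------------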
Comment 22 Kilian Cavalotti 2016-03-22 05:33:39 MDT
Hi Moe, 

(In reply to Moe Jette from comment #21)
> You will find this fixed in Slurm version 16.05 when released in late May. I 
> added a "--gres-flags=enforce-binding" option to the salloc, sbatch and srun
> commands. If set, the only CPUs available to the job will be those bound to
> the selected GRES (i.e. the CPUs identified in the gres.conf file will be
> strictly enforced rather than advisory).
> 
> The changes are fairly extensive and not suitable for back-porting to
> version 15.08. The commit is here:
> https://github.com/SchedMD/slurm/commit/
> 5d7f8b7650f4a257520d12920d800edbe52167b4

This is excellent news, thank you very much!

Cheers, 
--
Kilian
Comment 23 Stephane Thiell 2016-05-27 09:59:58 MDT
Hi,

We tried this feature using 16.05.0-0rc2 but got a crash of slurmctld.

We have nodes with 2 sockets and 16 GPUs, 8 GPUs on each socket. As a test, we pre-allocated 6 GPUs on each socket and tried to run a 4-GPU job:

- without --gres-flags=enforce-binding, the job runs as expected, using GPUs from both sockets (SLURM_JOB_GPUS=0,1,13,15)

- with --gres-flags=enforce-binding, the slurm daemon crashes.

I ran slurmctld under gdb and got the following backtrace:

(gdb) bt
#0  0x00000030e56325e5 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x00000030e5633dc5 in abort () at abort.c:92
#2  0x00000030e562b70e in __assert_fail_base (fmt=<value optimized out>, assertion=0x6505fb "(b2) != ((void *)0)", file=0x650430 "bitstring.c", line=<value optimized out>, function=<value optimized out>)
    at assert.c:96
#3  0x00000030e562b7d0 in __assert_fail (assertion=0x6505fb "(b2) != ((void *)0)", file=0x650430 "bitstring.c", line=623, function=0x6508a7 "bit_and") at assert.c:105
#4  0x000000000050efc4 in bit_and (b1=0x9ee120, b2=0x0) at bitstring.c:623
#5  0x00007ffff699a564 in _gres_sock_job_test (job_gres_list=0xa37200, node_gres_list=0x8ea2f0, use_total_gres=false, core_bitmap=0x9ee120, core_start_bit=0, core_end_bit=19, job_id=44545, 
    node_name=0x8ab650 "xs-0001", node_i=0, s_p_n=0) at job_test.c:1088
#6  0x00007ffff69992e0 in _can_job_run_on_node (job_ptr=0xa56e80, core_map=0x9ee120, node_i=0, s_p_n=0, node_usage=0x9d0f70, cr_type=20, test_only=false, part_core_map=0x0) at job_test.c:605
#7  0x00007ffff699a7d7 in _get_res_usage (job_ptr=0xa56e80, node_map=0x9d3c70, core_map=0x9ee120, cr_node_cnt=65, node_usage=0x9d0f70, cr_type=20, cpu_cnt_ptr=0x7fffffffd4f8, test_only=false, 
    part_core_map=0x0) at job_test.c:1156
#8  0x00007ffff699fa7e in _select_nodes (job_ptr=0xa56e80, min_nodes=1, max_nodes=1, req_nodes=1, node_map=0x9d3c70, cr_node_cnt=65, core_map=0x9ee120, node_usage=0x9d0f70, cr_type=20, test_only=false, 
    part_core_map=0x0, prefer_alloc_nodes=false) at job_test.c:2905
#9  0x00007ffff69a01ea in cr_job_test (job_ptr=0xa56e80, node_bitmap=0x9d3c70, min_nodes=1, max_nodes=1, req_nodes=1, mode=0, cr_type=20, job_node_req=NODE_CR_ONE_ROW, cr_node_cnt=65, cr_part_ptr=0x9cfc90, 
    node_usage=0x9d0f70, exc_core_bitmap=0x0, prefer_alloc_nodes=false, qos_preemptor=false, preempt_mode=false) at job_test.c:3091
#10 0x00007ffff699117a in _run_now (job_ptr=0xa56e80, bitmap=0x9d3c70, min_nodes=1, max_nodes=1, req_nodes=1, job_node_req=1, preemptee_candidates=0x0, preemptee_job_list=0x7fffffffdbb8, exc_core_bitmap=0x0)
    at select_cons_res.c:1552
#11 0x00007ffff6992ce0 in select_p_job_test (job_ptr=0xa56e80, bitmap=0x9d3c70, min_nodes=1, max_nodes=1, req_nodes=1, mode=0, preemptee_candidates=0x0, preemptee_job_list=0x7fffffffdbb8, exc_core_bitmap=0x0)
    at select_cons_res.c:2245
#12 0x000000000052be86 in select_g_job_test (job_ptr=0xa56e80, bitmap=0x9d3c70, min_nodes=1, max_nodes=1, req_nodes=1, mode=0, preemptee_candidates=0x0, preemptee_job_list=0x7fffffffdbb8, exc_core_bitmap=0x0)
    at node_select.c:587
#13 0x00000000004947f0 in _pick_best_nodes (node_set_ptr=0x9ede40, node_set_size=1, select_bitmap=0x7fffffffdbd0, job_ptr=0xa56e80, part_ptr=0x91b920, min_nodes=1, max_nodes=1, req_nodes=1, test_only=false, 
    preemptee_candidates=0x0, preemptee_job_list=0x7fffffffdbb8, has_xand=false, exc_core_bitmap=0x0, resv_overlap=false) at node_scheduler.c:1823
#14 0x0000000000493469 in _get_req_features (node_set_ptr=0x9ede40, node_set_size=1, select_bitmap=0x7fffffffdbd0, job_ptr=0xa56e80, part_ptr=0x91b920, min_nodes=1, max_nodes=1, req_nodes=1, test_only=false, 
    preemptee_job_list=0x7fffffffdbb8, can_reboot=true) at node_scheduler.c:1279
#15 0x0000000000495a2f in select_nodes (job_ptr=0xa56e80, test_only=false, select_node_bitmap=0x0, unavail_node_str=0x9d53e0 "", err_msg=0x0) at node_scheduler.c:2294
#16 0x000000000047be7b in _schedule (job_limit=100) at job_scheduler.c:1745
#17 0x0000000000479ad3 in schedule (job_limit=0) at job_scheduler.c:967
#18 0x0000000000442904 in _slurmctld_background (no_data=0x0) at controller.c:1897
#19 0x000000000043f7a6 in main (argc=2, argv=0x7fffffffe518) at controller.c:603

Our gres.conf is:
Name=gpu Type=k80 File=/dev/nvidia0  CPUs=0,1,2,3,4,5,6,7,8,9
Name=gpu Type=k80 File=/dev/nvidia1  CPUs=0,1,2,3,4,5,6,7,8,9
Name=gpu Type=k80 File=/dev/nvidia2  CPUs=0,1,2,3,4,5,6,7,8,9
Name=gpu Type=k80 File=/dev/nvidia3  CPUs=0,1,2,3,4,5,6,7,8,9
Name=gpu Type=k80 File=/dev/nvidia4  CPUs=0,1,2,3,4,5,6,7,8,9
Name=gpu Type=k80 File=/dev/nvidia5  CPUs=0,1,2,3,4,5,6,7,8,9
Name=gpu Type=k80 File=/dev/nvidia6  CPUs=0,1,2,3,4,5,6,7,8,9
Name=gpu Type=k80 File=/dev/nvidia7  CPUs=0,1,2,3,4,5,6,7,8,9
Name=gpu Type=k80 File=/dev/nvidia8  CPUs=10,11,12,13,14,15,16,17,18,19
Name=gpu Type=k80 File=/dev/nvidia9  CPUs=10,11,12,13,14,15,16,17,18,19
Name=gpu Type=k80 File=/dev/nvidia10 CPUs=10,11,12,13,14,15,16,17,18,19
Name=gpu Type=k80 File=/dev/nvidia11 CPUs=10,11,12,13,14,15,16,17,18,19
Name=gpu Type=k80 File=/dev/nvidia12 CPUs=10,11,12,13,14,15,16,17,18,19
Name=gpu Type=k80 File=/dev/nvidia13 CPUs=10,11,12,13,14,15,16,17,18,19
Name=gpu Type=k80 File=/dev/nvidia14 CPUs=10,11,12,13,14,15,16,17,18,19
Name=gpu Type=k80 File=/dev/nvidia15 CPUs=10,11,12,13,14,15,16,17,18,19

Removing the job spool file allowed us to re-launch the slurmctld daemon.

Please reopen this bug and fix it as we really need this feature :)

Let me know if you need more details, core dumps, configs..

Thanks,

Stephane Thiell
Stanford
Comment 24 Moe Jette 2016-05-27 10:14:17 MDT
I'm treating this bug (#1725) as a request for the functionality.

With respect to problems in the new code, I have opened a new bug:
https://bugs.schedmd.com/show_bug.cgi?id=2766

The abort that you report in comment 23 is fixed in a recent commit, described in comment 3 of bug 2766:
https://bugs.schedmd.com/show_bug.cgi?id=2766#c3

I'm planning to do more testing in the week of May 30. We plan to release Slurm version 16.05.0 with this functionality on May 31.
Comment 25 Stephane Thiell 2016-05-27 10:40:24 MDT
Hi Moe,

Thank you for the quick reply.

Could we have access to #2766?

Good luck for the final release!

Stephane
Comment 26 Moe Jette 2016-05-27 10:42:26 MDT
(In reply to Stephane Thiell from comment #25)
> Could we have access to #2766?

You should be able to see it now.

> Good luck for the final release!

Thanks
Comment 27 Kilian Cavalotti 2016-06-02 08:50:21 MDT
Hi Moe, 

I'm happy to report that we tested the --gres-flags=enforce-binding option in 16.05 and that it works exactly as expected.

Thanks so much for your work on this.
Comment 28 Moe Jette 2016-06-02 08:56:47 MDT
(In reply to Kilian Cavalotti from comment #27)
> Hi Moe, 
> 
> I'm happy to report that we tested the --gres-flags=enforce-binding option
> in 16.05 and that it works exactly as expected.
> 
> Thanks so much for your work on this.

I appreciate the feedback. Thanks!
Comment 29 Tim Wickberg 2017-04-18 20:13:43 MDT
*** Ticket 3705 has been marked as a duplicate of this ticket. ***