Ticket 5189 - CPU/GPU binding somewhat enforced without --gres-flags=enforce-binding
Summary: CPU/GPU binding somewhat enforced without --gres-flags=enforce-binding
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other
Version: 17.11.6
Hardware: Linux
OS: Linux
Severity: 4 - Minor Issue
Assignee: Felip Moll
 
Reported: 2018-05-18 13:59 MDT by Kilian Cavalotti
Modified: 2018-07-18 09:05 MDT
CC: 2 users

Site: Stanford


Description Kilian Cavalotti 2018-05-18 13:59:07 MDT
Hi!

We have a CPU/GPU binding case that seems to work differently from what we expect.

We have a few nodes with 4 GPUs each; the GPUs are linked together with NVLink and are all attached to a single socket. The nodes are dual-socket (2x 10 cores) and the GPU topology is as follows:

# nvidia-smi topo -mp
        GPU0    GPU1    GPU2    GPU3    mlx5_0  CPU Affinity
GPU0     X      PIX     PIX     PIX     PHB     0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18
GPU1    PIX      X      PIX     PIX     PHB     0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18
GPU2    PIX     PIX      X      PIX     PHB     0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18
GPU3    PIX     PIX     PIX      X      PHB     0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18
mlx5_0  PHB     PHB     PHB     PHB      X

All the GPUs are connected to the first socket, and CPUs [1,3,5,7,9,11,13,15,17,19] have no direct access to the GPUs.
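As a cross-check independent of nvidia-smi (a suggestion only; the PCI address below is a placeholder for whatever "nvidia-smi -q" reports as the Bus Id of GPU0), the kernel exposes the same affinity through the standard sysfs attribute local_cpulist, which on this layout should list only the even-numbered CPUs:

$ cat /sys/bus/pci/devices/<pci-address-of-GPU0>/local_cpulist
0,2,4,6,8,10,12,14,16,18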

So we configured the gres.conf on that node like this:

NodeName=sh-114-04   name=gpu    File=/dev/nvidia[0-3]   COREs=0,2,4,6,8,10,12,14,16,18


Now, we would like to run a job using all 20 cores and the 4 GPUs on that node, but the scheduler doesn't seem to think that's possible, even when we don't use --gres-flags=enforce-binding. When that option is not used, the srun/sbatch man page says that "the CPUs identified in the gres.conf file will be [...] advisory". But in practice, that doesn't seem to be the case:

$ srun -N 1 -n 20 --gres gpu:4 -w sh-114-04 --pty bash
srun: error: Unable to allocate resources: Requested node configuration is not available

or:

$ srun -N 1 -n 1 -c 20 --gres gpu:4 -w sh-114-04 --pty bash
srun: error: Unable to allocate resources: Requested node configuration is not available

Actually, as soon as 1 GPU and more than 10 cores are requested, the job is rejected:

$ srun -N 1 -n 1 -c 11 --gres gpu:1 -w sh-114-04 --pty bash
srun: error: Unable to allocate resources: Requested node configuration is not available


When requesting only 10 cores, the job goes through, but the allocated cores are *not* all on the same socket:

$ srun -N 1 -n 1 -c 10 --gres gpu:4 -w sh-114-04 --pty bash
sh-114-04 $ taskset -c -p $$
pid 164762's current affinity list: 0,1,4,5,8,9,12,13,16,17

Even when --gres-flags=enforce-binding is specified, the allocated cores are spread across both sockets:

$ srun -N 1 -n 1 -c 10 --gres-flags=enforce-binding --gres gpu:4 -w sh-114-04 --pty bash
sh-114-04 $ taskset -c -p $$
pid 165298's current affinity list: 0,1,4,5,8,9,12,13,16,17


So, I have to admit that I'm a bit confused. How can we handle the case where all GRES are bound to a single socket, and still be able to schedule jobs that span both sockets?

Thanks!
-- 
Kilian
Comment 1 Felip Moll 2018-05-24 02:27:41 MDT
Hi Kilian,

There is a known bug about this issue (it is part of bug 4584).

Please try changing your gres.conf line to this:

NodeName=sh-114-04   name=gpu    File=/dev/nvidia[0-3]   CPUs=0-10


And repeat the tests.


It may be useful to set the debug flags to DebugFlags=CPU_Bind,GRES and see how the binding is done.
Send me the slurmctld log if it doesn't work properly.
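
For reference, a minimal slurm.conf excerpt along those lines could look like this (parameter names as in the slurm.conf man page; the exact debug levels are only a suggestion, and the change can be applied with "scontrol reconfigure"):

DebugFlags=CPU_Bind,Gres
SlurmctldDebug=debug2
SlurmdDebug=debug2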
Comment 2 Felip Moll 2018-05-24 05:02:03 MDT
If it works, you may then want to change it to:

NodeName=sh-114-04   name=gpu    File=/dev/nvidia[0-3]   Cores=0-10

since the CPUs keyword is deprecated.

Let me know how it goes.
Comment 3 Kilian Cavalotti 2018-05-24 10:07:57 MDT
Hi Felip, 

Thanks for getting back to me, but I'm confused:

1. I don't have access to bug #4584, so I can't really tell what's relevant to this case

2. how would changing the core id mapping to one that is not accurate solve the issue?

If I use COREs=0-10 (instead of the actual COREs=0,2,4,6,8,10,12,14,16,18), it will just trick the scheduler into thinking that both sockets have direct access to the GPUs, because cores 0-10 are effectively spread over the 2 sockets. 

But that's not the case: cores 0-10 don't all have the same access to the GPUs, and only even-numbered cores have direct access. Odd-numbered cores are physically located on the socket that has no PCI link to the NVLink board, so data would have to go through the inter-socket QPI link to travel between COREs=1,3,5,7,9 and the GPUs.

So we could end up with one job that allocates core #3 (based on a gres.conf that says this core is directly connected to the GPUs), which will yield bad performance, and another job that allocates core #0, for which performance will be much better. That would make things pretty unpredictable and hardly reproducible for users.

So, unless I'm missing something, that would actually defeat the whole purpose of having a COREs list in gres.conf.

Could you please clarify?

Thanks,
-- 
Kilian
Comment 4 Felip Moll 2018-05-25 02:48:06 MDT
(In reply to Kilian Cavalotti from comment #3)
> [...]
> 2. how would changing the core id mapping to one that is not accurate solve
> the issue?
> [...]
> Could you please clarify?

Yes,

Slurm uses abstract CPU indexing to identify cores. For example, if you have 2 sockets, 4 cores per socket, and 2 threads per core:

           S1                            S2
  C1   C3     C5    C7         C2   C4     C6    C8
T1|T2 T1|T2 T1|T2 T1|T2      T1|T2 T1|T2 T1|T2 T1|T2  

They will be numbered in Linux as 1,3,5,7 for socket 1, and 2,4,6,8 for socket 2.

But Slurm identifies cores like this:

CPU_ID = Board_ID  x  threads_per_board  +  Socket_ID  x threads_per_socket + Core_ID x threads_per_core + Thread_ID


So it ends like this:
           S1                            S2
  C1   C3     C5    C7         C2   C4     C6    C8
T1|T2 T1|T2 T1|T2 T1|T2      T1|T2 T1|T2 T1|T2 T1|T2  
1   2  3  4  5  6  7  8      9  10 11 12 13 14 15 16

And, because the logic in Slurm is based on cores, when you write your gres.conf you must select only the first thread of each core, so your gres.conf would end up with 1,3,5,7 for C1, C3, C5 and C7.

BUT, if you don't have HyperThreading, which I assume is your situation, things change to:

           S1                            S2
  C1   C3     C5    C7         C2   C4     C6    C8
  1    2       3    4           5    6     7      8

So in gres.conf you must specify, for the first socket's cores: 1,2,3,4.

This is the reason I suggested:

NodeName=sh-114-04   name=gpu    File=/dev/nvidia[0-3]   Cores=0-10


Maybe this is not clear enough in the gres.conf man page, and we should add an example to this sentence:

<<Specify the first thread CPU index numbers for the specific cores which can use this resource.>>
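
To make the position-based numbering concrete, here is a rough way to derive the gres.conf core list for the GPU socket on a node like this (a sketch only, not Slurm source: it assumes HT is off and that Slurm takes cores in ascending kernel-id order within each socket):

$ lscpu --parse=CPU,SOCKET | grep -v '^#' | \
    awk -F, '$2 == 0 { print "core index", n++, "-> kernel CPU", $1 }'
core index 0 -> kernel CPU 0
core index 1 -> kernel CPU 2
[...]
core index 9 -> kernel CPU 18

The left-hand indices (0-9) are what gres.conf expects here, even though the kernel numbers those CPUs 0,2,4,...,18.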
Comment 5 Kilian Cavalotti 2018-05-29 11:30:31 MDT
Hi Felip, 

Thanks for the clarification about core numbering, but there are still things that are not very clear to me.


1. The first thing is that the part about Slurm's core numbering scheme (CPU_ID = Board_ID  x  threads_per_board  +  Socket_ID  x threads_per_socket + Core_ID x threads_per_core + Thread_ID) has been removed from the gres.conf man page in commit 7c071ec990249dea8ceba4bfd51b97cd40c4051d 
And I can't find any other reference to it in the current (17.11.6) documentation, except for the srun --cpu_bind option.

My understanding was that this numbering scheme was obsolete. Is it still in use? And if it is, why has it been removed from the man pages?



2. I don't really follow the way that formula would be used. In my case, I have a 2-socket node, with two 10-core CPUs, and interleaved CPU ids, as defined by the BIOS:
# lscpu  | grep NUMA
NUMA node(s):          2
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19
# slurmd -C
NodeName=sh-114-04 CPUs=20 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=1 RealMemory=257672

If I take the "CPU_ID = Board_ID x threads_per_board + Socket_ID x threads_per_socket + Core_ID x threads_per_core + Thread_ID" formula, I understand that I have:
* one board, so Board_ID=1, 
* no thread per board, so threads_per_board=0, 
* no thread per socket, so threads_per_socket=0, 
* one thread per core (HT disabled), so threads_per_core=1
* and thread_id=0, because again, HT is off.

So if I understand things correctly, the formula basically becomes CPU_ID = Core_ID, doesn't it?

In all cases, the formula is a direct function of Core_ID as defined by the kernel, so I don't really get how an interleaved core-id distribution at the kernel level would end up in an even core-id distribution on the Slurm side.


But in any case, I tried your suggestion, and modified my gres.conf as you recommended:

NodeName=sh-114-04  name=gpu    File=/dev/nvidia[0-3]   COREs=0-10

I restarted slurmctld and sh-114-04's slurmd, but unfortunately, that didn't change anything, and I still can't submit a job using the whole machine, as before:

$ srun -N 1 -n 1 -c 20 --gres gpu:4 -w sh-114-04 --pty bash
srun: error: Unable to allocate resources: Requested node configuration is not available



3. Lastly, you recommended specifying COREs=0-10 in gres.conf, which is 11 cores total. Is that intentionally listing the 10 cores that supposedly belong to the same socket plus one extra on the second socket? I have to admit I really have a hard time following the logic here.



Thanks a lot for looking into this.

Cheers,
-- 
Kilian
Comment 6 Alejandro Sanchez 2018-05-30 10:30:03 MDT
Hi Kilian. Felip's on vacation and should be back shortly to continue working on this bug.
Comment 7 Felip Moll 2018-05-31 12:45:53 MDT
(In reply to Kilian Cavalotti from comment #5)
> Hi Felip, 
> 
> Thanks for the clarification about core numbering, but there are still
> things that are not very clear to me.
> 
> 
> 1. The first thing is that the part about Slurm's core numbering scheme
> (CPU_ID = Board_ID  x  threads_per_board  +  Socket_ID  x threads_per_socket
> + Core_ID x threads_per_core + Thread_ID) has been removed from the
> gres.conf man page in commit 7c071ec990249dea8ceba4bfd51b97cd40c4051d 
> And I can't seem to be finding any more reference about it in all of the
> current (17.11.6) documentation, except for the srun --cpu_bind option.
> 
> My understanding was that this numbering scheme was obsolete. Is it still in
> use? And if it is, why has it been removed from the man pages?

Well, I have to confirm internally why this change happened; it is not clear to me why it was removed.
Please let me check it next week when I come back.

> 
> 2. I don't really follow the way that formula would be used. In my case, I
> have a 2-socket node, with two 10-core CPUs, and interleaved CPU ids, as
> defined by the BIOS:

Here is an explanation of how it has/had to be used:

CPU_ID = Board_ID  x  threads_per_board  +  Socket_ID  x threads_per_socket + Core_ID x threads_per_core + Thread_ID

-----------------
NUMA node(s):          2
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19
NodeName=sh-114-04 CPUs=20 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=1 RealMemory=257672, HT disabled.
------------------

Since you have only one board:
Board_ID = 0
Threads_per_board=20

Socket ids are:
 0 = node0
 1 = node1

Threads_per_socket = 10

Since you have only 1 thread per core, no HT:
Threads_per_core = 1
Thread_ID = 0

Core_ID is the position of the core within its socket, from 1 to 10; this is the trick, and probably the part that is not very well explained.
All the IDs in the formula are indices relative to the object they belong to. For example, if the first socket has 10 cores numbered 0,2,4,6,8,10,12,14,16,18 by the kernel, core_id is the position in this list. Or imagine you have 2 boards and 2 sockets per board: on both boards the socket IDs would be 0 and 1, not 0-3.


Here are some examples.

To get the CPU_ID of the 3rd core of socket 0 (core numbered 4 by Linux):

CPU_ID = 0 x 20 + 0 x 10 + 4 x 1 + 0 = 4

To get the CPU_ID of core numbered 11, which corresponds to the 6th core of socket 1:

CPU_ID = 0 x 20 + 1 x 10 + 6 x 1 + 0 = 16
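
Applying the same position-based scheme to sh-114-04 (a sketch under the assumptions above: one board, HT off, 10 cores per socket), the mapping would be:

Slurm cores 0-9   -> socket 0 -> kernel CPUs 0,2,4,6,8,10,12,14,16,18
Slurm cores 10-19 -> socket 1 -> kernel CPUs 1,3,5,7,9,11,13,15,17,19

which is why Cores=0-9 is the way to describe "all cores on the GPU socket" to Slurm here.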



> But in any case, I tried your suggestion, and modified my gres.conf as you
> recommended:
> 
> NodeName=sh-114-04  name=gpu    File=/dev/nvidia[0-3]   COREs=0-10
> 

Uhms, that's not good; I expected it to work. This shouldn't change anything, but have you tried using Cores=0,1,2,3,4,5,6,7,8,9 or CPUs=0-9?


> I restarted slurmctld and ssh-114-04's slurmd

You don't need to restart any daemon to get gres.conf loaded. It will be read when launching the task.

> 
> 3. Lastly, you recommended to specify COREs=0-10 in gres.conf, which is 11
> cores total. 

Absolutely my mistake, I meant to say 0-9. Have you tried 0-9?


> 
> Thanks a lot for looking into this.

Anytime.
Comment 8 Felip Moll 2018-06-01 02:56:19 MDT
Kilian,

I'd also like to see the CPU binding log; enable it with:

DebugLevel=debug2
DebugFlags=CPU_Bind,GRES

Please capture it with both options: Cores=0-9 and Cores=0,2,4,6,8,10,12,14,16,18.


Thanks
Comment 9 Kilian Cavalotti 2018-06-01 16:38:09 MDT
Hi Felip, 

(In reply to Felip Moll from comment #7)
> Well, I have to confirm internally why this change happened, It is not clear
> to me why it has been removed.
> Please, let me check it next week when I come back.

Sure, and no worries, I'll be away myself for the next couple weeks.


> Core_ID is the position of the core in your socket, from 1 to 10, this is
> the trick and the probably not very well explained thing.
> All the id's for the formula are referenced from the index on the instance
> they appear, so for example, in socket 1, if we have 10 cores
> numbered 0,2,4,6,8,10,12,14,16,18, core_id is the position in this list.

Wow, this is complete news to me.
How does that relate to the "physical", kernel-level core numbering, then?

> Imagine you have 2 boards, and 2 sockets per board. In both boards
> the socket ID would be 0 and 1, not [0-4].

Gotcha.

> NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18
> NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19

> So, for getting CPU_ID of the 3rd core of socket 0 (core numbered 4 by
> linux):
> 
> CPU_ID=0x20 + 0x10 + 4x1 + 0 = 4
> 
> For getting the CPU_ID of core numbered 11, which corresponds to the 6th
> core of socket 1:
> 
> CPU_ID=0x20 + 1x10 + 6x1 + 0 = 16

I see, thanks a lot for detailing it.
But I have to say it is pretty counter-intuitive, and I'm pretty sure 99% of Slurm users are doing it wrong in their gres.conf, then.


> Uhms, that's not good, I expected it to work. This shouldn't change anything
> but have you tried using Cores=0,1,2,3,4,5,6,7,8,9 or CPUS=0-9?

> You don't need to restart any daemon to get gres.conf loaded. It will be
> read when launching the task.

I tried both with COREs=[0-9] and COREs=0,1,2,3,4,5,6,7,8,9. In both cases, I got the following:

$ srun -w sh-114-01 -n 1 -c 20 --gres gpu:4 --pty bash
srun: error: Unable to allocate resources: Requested node configuration is not available


Here's some more debugging information, as requested.

# scontrol show config | grep -i debug
DebugFlags              = CPU_Bind,Gres
SlurmctldDebug          = info
SlurmctldSyslogDebug    = debug2
SlurmdDebug             = info
SlurmdSyslogDebug       = debug2


* With Cores=0-9:

Jun  1 15:36:32 sh-sl01 slurmctld[26753]: gres: gpu state for job 19145692
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_cnt:4 node_cnt:0 type:(null)
Jun  1 15:36:32 sh-sl01 slurmctld[26753]: no job_resources info for job 19145692 rc=-1
Jun  1 15:36:32 sh-sl01 slurmctld[26753]: gres/gpu: state for sh-101-65
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_cnt found:0 configured:0 avail:0 alloc:0
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_bit_alloc:
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_used:(null)
Jun  1 15:36:32 sh-sl01 slurmctld[26753]: gres/gpu: state for sh-101-67
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_cnt found:0 configured:0 avail:0 alloc:0
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_bit_alloc:
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_used:(null)
Jun  1 15:36:32 sh-sl01 slurmctld[26753]: gres/gpu: state for sh-101-69
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_cnt found:0 configured:0 avail:0 alloc:0
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_bit_alloc:
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_used:(null)
Jun  1 15:36:32 sh-sl01 slurmctld[26753]: gres/gpu: state for sh-101-70
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_cnt found:0 configured:0 avail:0 alloc:0
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_bit_alloc:
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_used:(null)
Jun  1 15:36:32 sh-sl01 slurmctld[26753]: gres/gpu: state for sh-101-71
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_cnt found:0 configured:0 avail:0 alloc:0
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_bit_alloc:
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_used:(null)
Jun  1 15:36:32 sh-sl01 slurmctld[26753]: gres/gpu: state for sh-101-72
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_cnt found:0 configured:0 avail:0 alloc:0
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_bit_alloc:
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_used:(null)
Jun  1 15:36:32 sh-sl01 slurmctld[26753]: gres/gpu: state for sh-102-09
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_cnt found:0 configured:0 avail:0 alloc:0
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_bit_alloc:
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_used:(null)
Jun  1 15:36:32 sh-sl01 slurmctld[26753]: gres/gpu: state for sh-102-16
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_cnt found:0 configured:0 avail:0 alloc:0
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_bit_alloc:
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_used:(null)
Jun  1 15:36:32 sh-sl01 slurmctld[26753]: gres/gpu: state for sh-102-22
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_cnt found:0 configured:0 avail:0 alloc:0
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_bit_alloc:
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_used:(null)
Jun  1 15:36:32 sh-sl01 slurmctld[26753]: gres/gpu: state for sh-102-28
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_cnt found:0 configured:0 avail:0 alloc:0
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_bit_alloc:
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_used:(null)
Jun  1 15:36:32 sh-sl01 slurmctld[26753]: gres/gpu: state for sh-102-29
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_cnt found:0 configured:0 avail:0 alloc:0
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_bit_alloc:
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_used:(null)
Jun  1 15:36:32 sh-sl01 slurmctld[26753]: gres/gpu: state for sh-102-30
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_cnt found:0 configured:0 avail:0 alloc:0
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_bit_alloc:
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_used:(null)
Jun  1 15:36:32 sh-sl01 slurmctld[26753]: gres/gpu: state for sh-105-53
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_cnt found:0 configured:0 avail:0 alloc:0
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_bit_alloc:
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_used:(null)
Jun  1 15:36:32 sh-sl01 slurmctld[26753]: gres/gpu: state for sh-105-63
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_cnt found:0 configured:0 avail:0 alloc:0
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_bit_alloc:
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_used:(null)
Jun  1 15:36:32 sh-sl01 slurmctld[26753]: gres/gpu: state for sh-105-65
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_cnt found:0 configured:0 avail:0 alloc:0
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_bit_alloc:
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_used:(null)
Jun  1 15:36:32 sh-sl01 slurmctld[26753]: gres/gpu: state for sh-105-66
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_cnt found:0 configured:0 avail:0 alloc:0
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_bit_alloc:
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_used:(null)
Jun  1 15:36:32 sh-sl01 slurmctld[26753]: gres/gpu: state for sh-105-67
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_cnt found:0 configured:0 avail:0 alloc:0
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_bit_alloc:
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_used:(null)
Jun  1 15:36:32 sh-sl01 slurmctld[26753]: gres/gpu: state for sh-106-49
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_cnt found:0 configured:0 avail:0 alloc:0
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_bit_alloc:
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_used:(null)
Jun  1 15:36:32 sh-sl01 slurmctld[26753]: gres/gpu: state for sh-106-50
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_cnt found:0 configured:0 avail:0 alloc:0
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_bit_alloc:
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_used:(null)
Jun  1 15:36:32 sh-sl01 slurmctld[26753]: gres/gpu: state for sh-113-08
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_cnt found:4 configured:4 avail:4 alloc:1
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_bit_alloc:0
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_used:(null)
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  type[0]:(null)
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:   topo_cpus_bitmap[0]:0,2,4,6,8,10,12,14,16,18
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:   topo_gres_bitmap[0]:0-1
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:   topo_gres_cnt_alloc[0]:1
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:   topo_gres_cnt_avail[0]:2
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  type[1]:(null)
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:   topo_cpus_bitmap[1]:1,3,5,7,9,11,13,15,17,19
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:   topo_gres_bitmap[1]:2-3
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:   topo_gres_cnt_alloc[1]:0
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:   topo_gres_cnt_avail[1]:2
Jun  1 15:36:32 sh-sl01 slurmctld[26753]: gres/gpu: state for sh-113-09
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_cnt found:0 configured:0 avail:0 alloc:0
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_bit_alloc:
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_used:(null)
Jun  1 15:36:32 sh-sl01 slurmctld[26753]: gres/gpu: state for sh-114-01
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_cnt found:4 configured:4 avail:4 alloc:1
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_bit_alloc:0
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_used:(null)
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  type[0]:(null)
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:   topo_cpus_bitmap[0]:0-9
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:   topo_gres_bitmap[0]:0-3
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:   topo_gres_cnt_alloc[0]:1
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:   topo_gres_cnt_avail[0]:4
Jun  1 15:36:32 sh-sl01 slurmctld[26753]: gres/gpu: state for sh-114-03
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_cnt found:4 configured:4 avail:4 alloc:1
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_bit_alloc:0
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_used:(null)
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  type[0]:(null)
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:   topo_cpus_bitmap[0]:0,2,4,6,8,10,12,14,16,18
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:   topo_gres_bitmap[0]:0-3
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:   topo_gres_cnt_alloc[0]:1
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:   topo_gres_cnt_avail[0]:4
Jun  1 15:36:32 sh-sl01 slurmctld[26753]: gres/gpu: state for sh-114-04
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_cnt found:4 configured:4 avail:4 alloc:2
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_bit_alloc:0-1
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_used:(null)
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  type[0]:(null)
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:   topo_cpus_bitmap[0]:0,2,4,6,8,10,12,14,16,18
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:   topo_gres_bitmap[0]:0-3
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:   topo_gres_cnt_alloc[0]:1
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:   topo_gres_cnt_avail[0]:4
Jun  1 15:36:32 sh-sl01 slurmctld[26753]: gres/gpu: state for sh-114-01
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_cnt found:4 configured:4 avail:4 alloc:0
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_bit_alloc:
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_used:(null)
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  type[0]:(null)
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:   topo_cpus_bitmap[0]:0-9
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:   topo_gres_bitmap[0]:0-3
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:   topo_gres_cnt_alloc[0]:0
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:   topo_gres_cnt_avail[0]:4
Jun  1 15:36:32 sh-sl01 slurmctld[26753]: error: gres/gpu: job 19092237 dealloc node sh-114-02 topo gres count underflow (0 1)
Jun  1 15:36:32 sh-sl01 slurmctld[26753]: gres/gpu: state for sh-114-02
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_cnt found:4 configured:4 avail:4 alloc:0
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_bit_alloc:
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_used:(null)
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  type[0]:(null)
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:   topo_cpus_bitmap[0]:0,2,4,6,8,10,12,14,16,18
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:   topo_gres_bitmap[0]:0-3
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:   topo_gres_cnt_alloc[0]:0
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:   topo_gres_cnt_avail[0]:4
Jun  1 15:36:32 sh-sl01 slurmctld[26753]: gres/gpu: state for sh-114-03
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_cnt found:4 configured:4 avail:4 alloc:0
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_bit_alloc:
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_used:(null)
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  type[0]:(null)
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:   topo_cpus_bitmap[0]:0,2,4,6,8,10,12,14,16,18
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:   topo_gres_bitmap[0]:0-3
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:   topo_gres_cnt_alloc[0]:0
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:   topo_gres_cnt_avail[0]:4
Jun  1 15:36:32 sh-sl01 slurmctld[26753]: gres/gpu: state for sh-114-04
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_cnt found:4 configured:4 avail:4 alloc:1
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_bit_alloc:0
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_used:(null)
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  type[0]:(null)
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:   topo_cpus_bitmap[0]:0,2,4,6,8,10,12,14,16,18
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:   topo_gres_bitmap[0]:0-3
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:   topo_gres_cnt_alloc[0]:0
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:   topo_gres_cnt_avail[0]:4
Jun  1 15:36:32 sh-sl01 slurmctld[26753]: gres/gpu: state for sh-105-25
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_cnt found:0 configured:0 avail:0 alloc:0
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_bit_alloc:
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_used:(null)
Jun  1 15:36:32 sh-sl01 slurmctld[26753]: gres/gpu: state for sh-105-26
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_cnt found:0 configured:0 avail:0 alloc:0
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_bit_alloc:
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_used:(null)
Jun  1 15:36:32 sh-sl01 slurmctld[26753]: gres/gpu: state for sh-105-27
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_cnt found:0 configured:0 avail:0 alloc:0
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_bit_alloc:
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_used:(null)
Jun  1 15:36:32 sh-sl01 slurmctld[26753]: gres/gpu: state for sh-105-28
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_cnt found:0 configured:0 avail:0 alloc:0
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_bit_alloc:
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_used:(null)
Jun  1 15:36:32 sh-sl01 slurmctld[26753]: gres/gpu: state for sh-114-02
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_cnt found:4 configured:4 avail:4 alloc:0
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_bit_alloc:
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  gres_used:(null)
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:  type[0]:(null)
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:   topo_cpus_bitmap[0]:0,2,4,6,8,10,12,14,16,18
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:   topo_gres_bitmap[0]:0-3
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:   topo_gres_cnt_alloc[0]:0
Jun  1 15:36:32 sh-sl01 slurmctld[26753]:   topo_gres_cnt_avail[0]:4
Jun  1 15:36:32 sh-sl01 slurmctld[26753]: no job_resources info for job 19145692 rc=-1
Jun  1 15:36:32 sh-sl01 slurmctld[26753]: no job_resources info for job 19145692 rc=-1
Jun  1 15:36:32 sh-sl01 slurmctld[26753]: no job_resources info for job 19145692 rc=-1
Jun  1 15:36:32 sh-sl01 slurmctld[26753]: _pick_best_nodes: job 19145692 never runnable in partition test


* With Cores=0,2,4,6,8,10,12,14,16,18

Jun  1 15:33:19 sh-sl01 slurmctld[26753]: gres: gpu state for job 19145623
Jun  1 15:33:19 sh-sl01 slurmctld[26753]:  gres_cnt:4 node_cnt:0 type:(null)
Jun  1 15:33:19 sh-sl01 slurmctld[26753]: no job_resources info for job 19145623 rc=-1
Jun  1 15:33:19 sh-sl01 slurmctld[26753]: gres/gpu: state for sh-114-01
Jun  1 15:33:19 sh-sl01 slurmctld[26753]:  gres_cnt found:4 configured:4 avail:4 alloc:0
Jun  1 15:33:19 sh-sl01 slurmctld[26753]:  gres_bit_alloc:
Jun  1 15:33:19 sh-sl01 slurmctld[26753]:  gres_used:(null)
Jun  1 15:33:19 sh-sl01 slurmctld[26753]:  type[0]:(null)
Jun  1 15:33:19 sh-sl01 slurmctld[26753]:   topo_cpus_bitmap[0]:0-9
Jun  1 15:33:19 sh-sl01 slurmctld[26753]:   topo_gres_bitmap[0]:0-3
Jun  1 15:33:19 sh-sl01 slurmctld[26753]:   topo_gres_cnt_alloc[0]:0
Jun  1 15:33:19 sh-sl01 slurmctld[26753]:   topo_gres_cnt_avail[0]:4
Jun  1 15:33:19 sh-sl01 slurmctld[26753]: error: gres/gpu: job 19092237 dealloc node sh-114-02 topo gres count underflow (0 1)
Jun  1 15:33:19 sh-sl01 slurmctld[26753]: gres/gpu: state for sh-114-02
Jun  1 15:33:19 sh-sl01 slurmctld[26753]:  gres_cnt found:4 configured:4 avail:4 alloc:0
Jun  1 15:33:19 sh-sl01 slurmctld[26753]:  gres_bit_alloc:
Jun  1 15:33:19 sh-sl01 slurmctld[26753]:  gres_used:(null)
Jun  1 15:33:19 sh-sl01 slurmctld[26753]:  type[0]:(null)
Jun  1 15:33:19 sh-sl01 slurmctld[26753]:   topo_cpus_bitmap[0]:0,2,4,6,8,10,12,14,16,18
Jun  1 15:33:19 sh-sl01 slurmctld[26753]:   topo_gres_bitmap[0]:0-3
Jun  1 15:33:19 sh-sl01 slurmctld[26753]:   topo_gres_cnt_alloc[0]:0
Jun  1 15:33:19 sh-sl01 slurmctld[26753]:   topo_gres_cnt_avail[0]:4
Jun  1 15:33:19 sh-sl01 slurmctld[26753]: gres/gpu: state for sh-114-03
Jun  1 15:33:19 sh-sl01 slurmctld[26753]:  gres_cnt found:4 configured:4 avail:4 alloc:0
Jun  1 15:33:19 sh-sl01 slurmctld[26753]:  gres_bit_alloc:
Jun  1 15:33:19 sh-sl01 slurmctld[26753]:  gres_used:(null)
Jun  1 15:33:19 sh-sl01 slurmctld[26753]:  type[0]:(null)
Jun  1 15:33:19 sh-sl01 slurmctld[26753]:   topo_cpus_bitmap[0]:0,2,4,6,8,10,12,14,16,18
Jun  1 15:33:19 sh-sl01 slurmctld[26753]:   topo_gres_bitmap[0]:0-3
Jun  1 15:33:19 sh-sl01 slurmctld[26753]:   topo_gres_cnt_alloc[0]:0
Jun  1 15:33:19 sh-sl01 slurmctld[26753]:   topo_gres_cnt_avail[0]:4
Jun  1 15:33:19 sh-sl01 slurmctld[26753]: gres/gpu: state for sh-114-04
Jun  1 15:33:19 sh-sl01 slurmctld[26753]:  gres_cnt found:4 configured:4 avail:4 alloc:1
Jun  1 15:33:19 sh-sl01 slurmctld[26753]:  gres_bit_alloc:0
Jun  1 15:33:19 sh-sl01 slurmctld[26753]:  gres_used:(null)
Jun  1 15:33:19 sh-sl01 slurmctld[26753]:  type[0]:(null)
Jun  1 15:33:19 sh-sl01 slurmctld[26753]:   topo_cpus_bitmap[0]:0,2,4,6,8,10,12,14,16,18
Jun  1 15:33:19 sh-sl01 slurmctld[26753]:   topo_gres_bitmap[0]:0-3
Jun  1 15:33:19 sh-sl01 slurmctld[26753]:   topo_gres_cnt_alloc[0]:0
Jun  1 15:33:19 sh-sl01 slurmctld[26753]:   topo_gres_cnt_avail[0]:4
Jun  1 15:33:19 sh-sl01 slurmctld[26753]: gres/gpu: state for sh-105-25
Jun  1 15:33:19 sh-sl01 slurmctld[26753]:  gres_cnt found:0 configured:0 avail:0 alloc:0
Jun  1 15:33:19 sh-sl01 slurmctld[26753]:  gres_bit_alloc:
Jun  1 15:33:19 sh-sl01 slurmctld[26753]:  gres_used:(null)
Jun  1 15:33:19 sh-sl01 slurmctld[26753]: gres/gpu: state for sh-105-26
Jun  1 15:33:19 sh-sl01 slurmctld[26753]:  gres_cnt found:0 configured:0 avail:0 alloc:0
Jun  1 15:33:19 sh-sl01 slurmctld[26753]:  gres_bit_alloc:
Jun  1 15:33:19 sh-sl01 slurmctld[26753]:  gres_used:(null)
Jun  1 15:33:19 sh-sl01 slurmctld[26753]: gres/gpu: state for sh-105-27
Jun  1 15:33:19 sh-sl01 slurmctld[26753]:  gres_cnt found:0 configured:0 avail:0 alloc:0
Jun  1 15:33:19 sh-sl01 slurmctld[26753]:  gres_bit_alloc:
Jun  1 15:33:19 sh-sl01 slurmctld[26753]:  gres_used:(null)
Jun  1 15:33:19 sh-sl01 slurmctld[26753]: gres/gpu: state for sh-105-28
Jun  1 15:33:19 sh-sl01 slurmctld[26753]:  gres_cnt found:0 configured:0 avail:0 alloc:0
Jun  1 15:33:19 sh-sl01 slurmctld[26753]:  gres_bit_alloc:
Jun  1 15:33:19 sh-sl01 slurmctld[26753]:  gres_used:(null)
Jun  1 15:33:19 sh-sl01 slurmctld[26753]: gres/gpu: state for sh-114-02
Jun  1 15:33:19 sh-sl01 slurmctld[26753]:  gres_cnt found:4 configured:4 avail:4 alloc:0
Jun  1 15:33:19 sh-sl01 slurmctld[26753]:  gres_bit_alloc:
Jun  1 15:33:19 sh-sl01 slurmctld[26753]:  gres_used:(null)
Jun  1 15:33:19 sh-sl01 slurmctld[26753]:  type[0]:(null)
Jun  1 15:33:19 sh-sl01 slurmctld[26753]:   topo_cpus_bitmap[0]:0,2,4,6,8,10,12,14,16,18
Jun  1 15:33:19 sh-sl01 slurmctld[26753]:   topo_gres_bitmap[0]:0-3
Jun  1 15:33:19 sh-sl01 slurmctld[26753]:   topo_gres_cnt_alloc[0]:0
Jun  1 15:33:19 sh-sl01 slurmctld[26753]:   topo_gres_cnt_avail[0]:4
Jun  1 15:33:19 sh-sl01 slurmctld[26753]: no job_resources info for job 19145623 rc=-1
Jun  1 15:33:19 sh-sl01 slurmctld[26753]: no job_resources info for job 19145623 rc=-1
Jun  1 15:33:19 sh-sl01 slurmctld[26753]: no job_resources info for job 19145623 rc=-1
Jun  1 15:33:19 sh-sl01 slurmctld[26753]: _pick_best_nodes: job 19145623 never runnable in partition test
Jun  1 15:33:19 sh-sl01 slurmctld[26753]: _slurm_rpc_allocate_resources: Requested node configuration is not available


Cheers,
-- 
Kilian
Comment 10 Felip Moll 2018-06-06 05:38:54 MDT
(In reply to Kilian Cavalotti from comment #9)
> I tried both with COREs=[0-9] and COREs=0,1,2,3,4,5,6,7,8,9. In both cases,
> I got the following:
> 
> $ srun -w sh-114-01 -n 1 -c 20 --gres gpu:4 --pty bash
> srun: error: Unable to allocate resources: Requested node configuration is
> not available
> [...]

Hi Kilian,

As you commented in the first post, we see 2 problems here.

The first is that you cannot get all 20 cores when asking for GPUs.
The second is that the binding to CPUs is done incorrectly.

I want to tackle the second issue first, so could you please repeat the test with Cores=0-9 in gres.conf, running:


$ srun -N 1 -n 1 -c 10 --gres gpu:4 -w sh-114-04 --pty bash
sh-114-04 $ taskset -c -p $$


and 


$ srun -N 1 -n 1 -c 10 --gres-flags=enforce-binding --gres gpu:4 -w sh-114-04 --pty bash
sh-114-04 $ taskset -c -p $$



instead of -c 20?

I am currently working on reproducing the first issue.
Comment 11 Felip Moll 2018-06-06 05:43:33 MDT
Btw, I was wrong about how gres.conf is re-read:

Changes to the configuration file take effect upon restart of Slurm daemons, daemon receipt of the SIGHUP signal, or execution of the command "scontrol reconfigure".

The internal bitmaps have to be recreated; I was confusing it with another file. Sorry.
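
For reference, a minimal sketch of the ways to apply a gres.conf change, following the documentation quoted above (the systemd unit name is the conventional one and may differ per installation):

$ scontrol reconfigure          # ask the daemons to re-read their configuration
$ systemctl restart slurmd      # or restart slurmd on the affected compute node(s)
$ pkill -HUP slurmd             # or send SIGHUP to slurmd on the node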
Comment 14 Jesse Hostetler 2018-06-06 15:25:07 MDT
I can confirm this issue. (I also can't see #4584.) We have 2 sockets per node (16-core hyperthreaded Xeons; each socket is a NUMA node) with 2 GPUs per socket.

In gres.conf we have:
File=/dev/nvidia[0-1] Cores=0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30
File=/dev/nvidia[2-3] Cores=32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62

The core indices are the "logical" indices of the first processor unit (PU) in each core as reported by 'lstopo'. I tried many different ways of determining the indices and I am sure the ones I listed produce the desired affinity behavior.

Note that these are different from what is reported by nvidia-smi:
$ nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    CPU Affinity
GPU0     X      PHB     SYS     SYS     0-15,32-47
GPU1    PHB      X      SYS     SYS     0-15,32-47
GPU2    SYS     SYS      X      PHB     16-31,48-63
GPU3    SYS     SYS     PHB      X      16-31,48-63

As Kilian observed, the scheduler seems to always enforce CPU<->GPU binding regardless of whether --gres-flags=enforce-binding is used. I confirmed this by filling up all the cores with jobs, canceling jobs on specific cores, and then submitting GPU jobs.
Comment 15 Felip Moll 2018-06-07 06:12:09 MDT
(In reply to Jesse Hostetler from comment #14)
> I can confirm this issue. (I also can't see #4584). We have 2 sockets each
> (Xeon 16 cores hyperthreaded, each socket is a NUMA node) with 2 GPUs per
> socket. 
> 
> In gres.conf we have:
> File=/dev/nvidia[0-1] Cores=0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30
> File=/dev/nvidia[2-3] Cores=32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62
> 
> The core indices are the "logical" indices of the first processor unit (PU)
> in each core as reported by 'lstopo'. I tried many different ways of
> determining the indices and I am sure the ones I listed produce the desired
> affinity behavior.
> 
> Note that these are different from what is reported by nvidia-smi:
> $ nvidia-smi topo -m
>         GPU0    GPU1    GPU2    GPU3    CPU Affinity
> GPU0     X      PHB     SYS     SYS     0-15,32-47
> GPU1    PHB      X      SYS     SYS     0-15,32-47
> GPU2    SYS     SYS      X      PHB     16-31,48-63
> GPU3    SYS     SYS     PHB      X      16-31,48-63
> 
> As Killian observed, the scheduler seems to always enforce CPU<->GPU binding
> regardless of whether --gres-flags=enforce-binding is used. I confirmed this
> by filling up all the cores with jobs, canceling jobs on specific cores, and
> then submitting GPU jobs.

Thanks Jesse for your contribution.

You and Kilian appear to be right. I had already reproduced the situation and was looking into the reasoning behind it, but yes, enforce-binding does not seem to be taking any effect.

When you say 'the core indices are the "logical" indices of the first processor unit.. as reported by lstopo', I understand then that Cores=xxx in gres.conf matches exactly with the physical hierarchy and therefore these ids are not Slurm abstract ids, is that it? This doesn't seem to be the case for Kilian.
Comment 16 Jesse Hostetler 2018-06-07 11:26:22 MDT
(In reply to Felip Moll from comment #15)
> When you say 'the core indices are the "logical" indices of the first
> processor unit.. as reported by lstopo', I understand then that Cores=xxx in
> gres.conf matches exactly with the physical hierarchy and theerefore these
> ids are not Slurm abstract ids, is that it? This doesn't seem to be the case
> for Kilian.

Here is an excerpt of the output of 'lstopo' (I'll paste full output below):

$ lstopo
Machine (252GB total)
  NUMANode L#0 (P#0 126GB)
    Package L#0 + L3 L#0 (40MB)
      L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#32)
      L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
        PU L#2 (P#1)
        PU L#3 (P#33)
[...]

Each core has a corresponding 'PU L#n (P#m)' and 'PU L#n+1 (P#m+32)'. I use the 'n' values for the gres.conf indices. In our case this seems to match the (Socket * ThreadsPerSocket) + (Cores * ThreadsPerCore) format. 'nvidia-smi' and 'top' seem to use the 'm' numbers.

In the terminology of hwloc, I'm using the "logical" PU indices. I'm not sure if that's the same as "Slurm abstract ids". Parts of the docs suggest using the IDs from 'nvidia-smi topo', while other parts say to use the scheme I described. It would be helpful to describe exactly how Slurm determines the indices, so that the user can use the same method.

Full output:

$ lstopo
Machine (252GB total)
  NUMANode L#0 (P#0 126GB)
    Package L#0 + L3 L#0 (40MB)
      L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#32)
      L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
        PU L#2 (P#1)
        PU L#3 (P#33)
      L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
        PU L#4 (P#2)
        PU L#5 (P#34)
      L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
        PU L#6 (P#3)
        PU L#7 (P#35)
      L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
        PU L#8 (P#4)
        PU L#9 (P#36)
      L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
        PU L#10 (P#5)
        PU L#11 (P#37)
      L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
        PU L#12 (P#6)
        PU L#13 (P#38)
      L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
        PU L#14 (P#7)
        PU L#15 (P#39)
      L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
        PU L#16 (P#8)
        PU L#17 (P#40)
      L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
        PU L#18 (P#9)
        PU L#19 (P#41)
      L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
        PU L#20 (P#10)
        PU L#21 (P#42)
      L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
        PU L#22 (P#11)
        PU L#23 (P#43)
      L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
        PU L#24 (P#12)
        PU L#25 (P#44)
      L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
        PU L#26 (P#13)
        PU L#27 (P#45)
      L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
        PU L#28 (P#14)
        PU L#29 (P#46)
      L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
        PU L#30 (P#15)
        PU L#31 (P#47)
    HostBridge L#0
      PCIBridge
        PCI 10de:1b06
          GPU L#0 "card1"
          GPU L#1 "renderD128"
      PCIBridge
        PCI 10de:1b06
          GPU L#2 "card2"
          GPU L#3 "renderD129"
      PCI 8086:8d62
      PCIBridge
        PCI 8086:1533
          Net L#4 "enp5s0"
      PCIBridge
        PCI 8086:1533
          Net L#5 "enp6s0"
      PCIBridge
        PCIBridge
          PCI 1a03:2000
            GPU L#6 "card0"
            GPU L#7 "controlD64"
      PCI 8086:8d02
        Block(Disk) L#8 "sda"
        Block(Disk) L#9 "sdb"
        Block(Disk) L#10 "sdc"
        Block(Disk) L#11 "sdd"
        Block(Removable Media Device) L#12 "sr0"
  NUMANode L#1 (P#1 126GB)
    Package L#1 + L3 L#1 (40MB)
      L2 L#16 (256KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16
        PU L#32 (P#16)
        PU L#33 (P#48)
      L2 L#17 (256KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17
        PU L#34 (P#17)
        PU L#35 (P#49)
      L2 L#18 (256KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18
        PU L#36 (P#18)
        PU L#37 (P#50)
      L2 L#19 (256KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19
        PU L#38 (P#19)
        PU L#39 (P#51)
      L2 L#20 (256KB) + L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20
        PU L#40 (P#20)
        PU L#41 (P#52)
      L2 L#21 (256KB) + L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21
        PU L#42 (P#21)
        PU L#43 (P#53)
      L2 L#22 (256KB) + L1d L#22 (32KB) + L1i L#22 (32KB) + Core L#22
        PU L#44 (P#22)
        PU L#45 (P#54)
      L2 L#23 (256KB) + L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23
        PU L#46 (P#23)
        PU L#47 (P#55)
      L2 L#24 (256KB) + L1d L#24 (32KB) + L1i L#24 (32KB) + Core L#24
        PU L#48 (P#24)
        PU L#49 (P#56)
      L2 L#25 (256KB) + L1d L#25 (32KB) + L1i L#25 (32KB) + Core L#25
        PU L#50 (P#25)
        PU L#51 (P#57)
      L2 L#26 (256KB) + L1d L#26 (32KB) + L1i L#26 (32KB) + Core L#26
        PU L#52 (P#26)
        PU L#53 (P#58)
      L2 L#27 (256KB) + L1d L#27 (32KB) + L1i L#27 (32KB) + Core L#27
        PU L#54 (P#27)
        PU L#55 (P#59)
      L2 L#28 (256KB) + L1d L#28 (32KB) + L1i L#28 (32KB) + Core L#28
        PU L#56 (P#28)
        PU L#57 (P#60)
      L2 L#29 (256KB) + L1d L#29 (32KB) + L1i L#29 (32KB) + Core L#29
        PU L#58 (P#29)
        PU L#59 (P#61)
      L2 L#30 (256KB) + L1d L#30 (32KB) + L1i L#30 (32KB) + Core L#30
        PU L#60 (P#30)
        PU L#61 (P#62)
      L2 L#31 (256KB) + L1d L#31 (32KB) + L1i L#31 (32KB) + Core L#31
        PU L#62 (P#31)
        PU L#63 (P#63)
    HostBridge L#7
      PCIBridge
        PCI 14e4:16a1
          Net L#13 "ens10f0"
        PCI 14e4:16a1
          Net L#14 "ens10f1"
      PCIBridge
        PCI 10de:1b06
          GPU L#15 "card3"
          GPU L#16 "renderD130"
      PCIBridge
        PCI 10de:1b06
          GPU L#17 "card4"
          GPU L#18 "renderD131"
Comment 17 Felip Moll 2018-06-07 13:11:32 MDT
(In reply to Jesse Hostetler from comment #16)
> (In reply to Felip Moll from comment #15)
>...
> Here is an excerpt of the output of 'lstopo' (I'll paste full output below):
>...

This is a bit tricky indeed.

Do you have hyper-threading enabled?

Slurm gets all the 'P#n' found and numbers them sequentially, so it seems Slurm abstract ids would just match the 'PU L#n' shown by lstopo.

Different systems will get different results, for example:

felip@knc:~$ lstopo-no-graphics 
Machine (24GB total)
  NUMANode L#0 (P#0 12GB) + Package L#0 + L3 L#0 (12MB)
    L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
    L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#2)
    L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#4)
    L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#6)
    L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 (P#8)
    L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 (P#10)
  NUMANode L#1 (P#1 12GB) + Package L#1 + L3 L#1 (12MB)
    L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 (P#1)
    L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 (P#3)
    L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 + PU L#8 (P#5)
    L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 + PU L#9 (P#7)
    L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10 + PU L#10 (P#9)
    L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11 + PU L#11 (P#11)

i.e. here the gres.conf should look like:
File=/dev/nvidia[0-1] Cores=0,1,2,3,4,5  <<-- for socket 0
File=/dev/nvidia[2-3] Cores=6,7,8,9,10,11 <<-- for socket 1

But yes, we could probably say that 'PU L#n' is what we need to use, and if HT is enabled, just the first PU L#n of each core like you are doing. I don't know whether there could be cases where this rule does not hold; I think the formula is the safest way to go.

Moreover, I like to use physical names so as not to get confused across different configurations. I know that P#0 will *always* be P#0 regardless of whether you change NUMA settings and topology. For me, your physical ids are 0,32; 1,33; 2,34; .. and so on. Slurm will number them like this:

physical: 0,32, 1,33, 2,34, ...
abstract: 0  1  2  3  4  5  ...

Does it make sense?
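
As a rough aid (not an official recipe; it just parses the lstopo text layout shown above, so double-check the result against your own output and against the formula), the logical index of the first PU of each core can be listed with something like:

$ lstopo-no-graphics | grep -A1 'Core L#' | sed -n 's/.*PU L#\([0-9]*\).*/\1/p'

On a machine laid out like Jesse's, this should print 0,2,4,...,30 for socket 0 and 32,34,...,62 for socket 1, i.e. the values he put in Cores=.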
Comment 18 Jesse Hostetler 2018-06-07 15:58:14 MDT
(In reply to Felip Moll from comment #17)
> Do you have hyper-threading enabled?

Yes, hyper-threading is enabled.

> But yes, we probably could say that the 'PU L#n' is what we need to use, and
> if one has HT enabled just the first PU L#n of each Core like you are doing.
> I don't know if there's any possibility that this rule is not always
> accomplished, I think the formula is the safest way to go.

IMHO, the docs should describe how Slurm actually calculates the indices. If Slurm uses hwloc to look up the 'PU L#' numbers, then that's the way the user should determine them as well. But if Slurm is actually using its own formula that happens to coincide with lstopo most of the time, I'd rather be told to use the formula. This might deserve its own ticket, since we're pretty far from the original --gres-flags=enforce-binding issue at this point.

Thanks for your help!
Comment 19 Felip Moll 2018-06-08 01:49:41 MDT
(In reply to Jesse Hostetler from comment #18)
> (In reply to Felip Moll from comment #17)

> But if Slurm is actually using its own
> formula that happens to coincide with lstopo most of the time, I'd rather be
> told to use the formula.

That's it. In comment 7 I just said that I don't know why the explanation was removed from the docs. We are discussing this internally at the moment.


> This might deserve its own ticket, since we're
> pretty far from the original --gres-flags=enforce-bindings issue at this
> point.
> 

Well, affinity was one of the two problems Kilian had; I wanted to be sure his configuration is working well before dealing with the enforce-binding problem.


Thanks
Comment 20 Felip Moll 2018-06-19 04:38:25 MDT
Hi Kilian,

Could you please refer to my comment 10?

Thanks!
Comment 21 Felip Moll 2018-06-20 10:55:14 MDT
Kilian,

Is it possible you are using the cgroup task affinity plugin?

If the affinity does not work correctly, I would like you to try switching to the task/affinity plugin, i.e. change:

cgroup.conf:
TaskAffinity=no

slurm.conf:
TaskPlugin=task/cgroup,task/affinity

Waiting for your feedback.
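
A quick way to check what is currently configured (assuming the usual /etc/slurm location for cgroup.conf; adjust the path if needed):

$ scontrol show config | grep -i TaskPlugin
$ grep -i TaskAffinity /etc/slurm/cgroup.conf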
Comment 23 Felip Moll 2018-06-21 06:31:02 MDT
Well, after analyzing the code and other bugs about --gres-flags=enforce-binding, I see what's happening.

--gres-flags=enforce-binding is only intended to allow/disallow the usage of CPUs from different sockets when asking for multiple GRES.

This means that, for example, in this configuration:

NodeName=moll[1-3] Name=gpu Type=p100 File=/dev/nvidia[0-1] COREs=[0-9]
NodeName=moll[1-3] Name=gpu Type=p100 File=/dev/nvidia[2-3] COREs=[10-19]

as you commented in the past in bug 1725, if you enable DebugFlags=Gres and run:

]$ srun -N 1 -n 1  --gres gpu:1 -w moll1  sleep 1000 &   <--- [2018-06-21T14:05:01.275]   gres_bit_alloc:0
]$ srun -N 1 -n 1  --gres gpu:1 -w moll1  sleep 1000 &   <--- [2018-06-21T14:08:09.175]   gres_bit_alloc:0,2
]$ srun -N 1 -n 1  --gres gpu:1 -w moll1  sleep 1000 &   <--- [2018-06-21T14:10:10.145]   gres_bit_alloc:0-2

]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                19   skylake    sleep    slurm  R       0:05      1 moll1
                18   skylake    sleep    slurm  R       0:13      1 moll1
                17   skylake    sleep    slurm  R       0:26      1 moll1

]$ scancel 17   <--- [2018-06-21T14:05:01.275]   gres_bit_alloc:1-2


Now, look the difference in using the enforce-binding parameter when asking for 2 GPUs:

]$ srun -N 1 -n 1 --gres-flags=enforce-binding --gres gpu:2 -w moll1 --pty bash   <--- Here it won't run because no single socket has 2 free GPUs
srun: job 21 queued and waiting for resources

]$ srun -N 1 -n 1  --gres gpu:2 -w moll1 --pty bash    <--- Here it will run
[slurm@moll1 ~]$ exit


This is the behavior you tested in bug 1725 and seems correct to me.


What's true is that the documentation says:

"the CPUs identified in the gres.conf file will be [...] advisory"

But I think it is not well explained, basically because I don't see any intention in the code to do this. When you ask for a --gres, the available CPUs are *always* filtered to those listed in gres.conf. This means that if some of your cores do not appear in gres.conf and you ask for a GRES, there is no way to request all the cores while also requesting the GRES.

If you are curious, you can look at the function _can_job_run_on_node(). You will see at the start a call to gres_plugin_job_core_filter(), which removes the cores not available to the job's gres list. It basically clears the cpu_bitmap for CPUs which are not usable by this job (i.e. CPUs which are already bound to other jobs or lack GRES).

After this, gres_plugin_job_test() or _gres_sock_job_test() is called depending on whether enforce-binding is set or not, respectively (s_p_n = NO_VAL means use all sockets).

I will check with the dev team about this use case and let you know what I find.



In the meantime, please refer to my comment 10 to fix your affinity problems.
Comment 28 Felip Moll 2018-06-25 05:29:52 MDT
Hi Kilian,

Please, could you refer to my last comments 23 and 10?

One more note: the cons_tres plugin will address GPU scheduling in a different way than the current cons_res plugin and will fix this kind of issue. It will be available in the 19.05 release.
Comment 31 Felip Moll 2018-07-03 10:35:24 MDT
The documentation now explains how the Cores specification must be done in gres.conf:

Commit 3ee3795, in the upcoming 17.11.8 release.

Kilian, still waiting for a response on comment 28.

Thanks,
Felip
Comment 32 Felip Moll 2018-07-04 11:17:52 MDT
I am closing this bug as info given, since there has not been any response in the last month.

If you have further issues, please open a new bug and reference this one.

Regards
Comment 33 Kilian Cavalotti 2018-07-05 12:35:57 MDT
Hi Felip,

(In reply to Felip Moll from comment #20)
> Could you please refer to my comment 10?

Very, very sorry for the delayed response; I've been swamped with many things and am just getting to this now. Please see below.

> I was trying first to fix the second issue, so please, can you repeat the
> test having put Cores=0-9 in gres.conf and doing:
> 
> $ srun -N 1 -n 1 -c 10 --gres gpu:4 -w sh-114-04 --pty bash
> sh-114-04 $ taskset -c -p $$
> 
> and 
> 
> $ srun -N 1 -n 1 -c 10 --gres-flags=enforce-binding --gres gpu:4 -w
> sh-114-04 --pty bash
> sh-114-04 $ taskset -c -p $$
> 
> instead of -c 20 ?
> 
> I am working currently trying to reproduce the first issue.


Sure! Here it is:

$ grep sh-114-04 /etc/slurm/gres.conf
NodeName=sh-114-04  name=gpu  File=/dev/nvidia[0-3]   COREs=0-9

$ srun -N 1 -n 1 -c 10 --gres gpu:4 -w sh-114-04 --pty bash
srun: job 21356724 queued and waiting for resources
srun: job 21356724 has been allocated resources
[kilian@sh-114-04 ~]$ taskset -c -p $$
pid 3557's current affinity list: 0,1,4,5,8,9,12,13,16,17

$ srun -N 1 -n 1 -c 10 --gres gpu:4 --gres-flags=enforce-binding -w sh-114-04 --pty bash
[kilian@sh-114-04 ~]$ taskset -c -p $$
pid 4457's current affinity list: 0,1,4,5,8,9,12,13,16,17


I take it that "enforce-binding" doesn't seem to enforce much here, since the job got allocated CPUs that are not in the 0-9 range?


Cheers,
-- 
Kilian
Comment 34 Felip Moll 2018-07-06 08:09:02 MDT
Hi Kilian,

> Very very sorry for the delayed response, been swamped with many things,

No problem.

I assume that with so much chatter in this thread you may have missed some information.
Let me point out the two key comments.

> Sure! Here it is:
>
>$ grep sh-114-04 /etc/slurm/gres.conf
>NodeName=sh-114-04  name=gpu  File=/dev/nvidia[0-3]   COREs=0-9
>...

Well, I think your initial questions can be answered by the following:

1. How to configure gres.conf: see comment 17.

The short answer is to run the 'lstopo-no-graphics' command and configure gres.conf according to its output.
Use the logical number L#i in gres.conf. This logical number corresponds to the physical number P#i, which is the one seen by the OS and shown by taskset.



and:

> I take that "enforce-binding" doesn't seem to enforce much here, since the
> job got allocated CPUs that are not in the 0-9 range?
> 

2. See comment 23 for an extensive explanation on how enforce-binding works.


Please tell me whether these comments make things clearer, and don't hesitate to ask more.
Comment 35 Kilian Cavalotti 2018-07-06 10:55:14 MDT
Hi Felip, 

Oh boy, I realize that I spent quite some time writing a detailed comment on this bug yesterday and that it somehow didn't get submitted... Ugh.

Anyway, I'll try to summarize. And to make things a bit easier to track, I'd like to keep discussing the enforce-binding behavior in this ticket, and have created #5388 to discuss how the core ids should be specified in gres.conf.

(In reply to Felip Moll from comment #34)
> I assume that after so much chattering in this thread you could've miss some
> information.

Not really, I read the thread thoroughly and replied to many points, but that answer never made its way to the bug tracker. Sorry about this. :(  

> 1. How to configure gres.conf, see: comment 17

Thanks, I continued the discussion in #5388.

> 2. See comment 23 for an extensive explanation on how enforce-binding works.

Right, and thanks for the explanation. I think this is the very reason for my initial problem report in this ticket.

> When you ask for a --gres, the available
> cpus are *always* filtered to those that are available in gres.conf, this
> means that if your COREs doesn't appear in the gres.conf and you ask for a
> gres, then there will not be possibility to ask for all the cores while
> asking gres.

That explains the behavior we've observed, but I don't understand the rationale. I guess that configurations where GPUs are attached to a single socket were quite rare when that feature was first implemented, but with the advent of NVLink GPUs, it's becoming commonplace to have only one socket connected to the GPU board.

That's also the case with NICs or HCAs, which are typically connected to a single socket: if somebody defines a NIC as a GRES and specifies COREs= for it on a dual-socket node, does it mean that no job requesting that NIC could ever be scheduled on the whole node?

> For your curiosity you can look at the function: _can_job_run_on_node(). You
> will see at the start that there's a gres_plugin_job_core_filter() function
> call that removes the cores not available to the job gres list. It basically
> clears the cpu_bitmap for CPUs which are not usable by this job (i.e. for
> CPUs which are already bound to other jobs or lack GRES).

So maybe it would make sense to remove that filtering? Maybe a better approach would be to select the cores listed in COREs= as primary candidates to run a job
that asks for that GRES, but if none is available, still consider the other ones as potential candidates (unless enforce-binding is specified, of course), rather than filtering them out?

Cheers,
-- 
Kilian
Comment 36 Felip Moll 2018-07-10 05:46:50 MDT
> That's also the case with NICs or HCAs, which are typically connected to a
> single socket: if somebody defines a NIC as a GRES and specify COREs= for it
> on a dual-socket node, it means that no job requesting to use that NIC could
> ever be scheduled on the whole node?

Yes, that's the case unfortunately.


> So maybe it would make sense to remove that filtering? 

I have previously discussed removing that filter, i.e. eliminating cores only when the job requests --gres-flags=enforce-binding.

What happens is that, with the filtering omitted, the core-selection step would be much more likely to pick cores that do not have GPUs available on the same socket (unless --gres-flags=enforce-binding were used), leading to the right cores not being allocated.

It is not difficult to change, but the results would probably not be desirable.


> Maybe a better approach would be to select the cores listed in COREs= as primary candidates
> to run a job that asks for that GRES, but if none is available, still consider the other
> ones as potential candidates (unless enforce-binding is specified, of
> course), rather than filtering them out?

That's true. As I replied in bug 5388 and in comment 28, we are developing a new cons_tres plugin that will address GPU scheduling much better than the current cons_res plugin.
The design details are still in progress, but I can say it will be available in the 19.05 release.

I will take your recommendations and add them to our design document to make sure all of this is addressed.
The idea is basically to let the plugin identify the topology and select the best cores for the job according to the TRES, but the implementation details are still not settled.

I will also discuss it with the dev team to get more precise info.
Comment 40 Kilian Cavalotti 2018-07-12 09:07:06 MDT
(In reply to Felip Moll from comment #36)
> > So maybe it would make sense to remove that filtering? 
> 
> I discussed previously about removing that filter, i.e. we could eliminate
> cores only if the job requests --gres-flags=enforce-binding.
> 
> What happens is that after omitting the filtering, the call to select cores
> for the job would find it much more likely that cores selected would not
> have GPUs available in the same socket, (unless --gres-flags=enforce-binding
> were used), leading to not being able to allocate the right cores.

Well, the documentation says that COREs= in gres.conf is advisory, so unless "enforce-binding" is used, I think people expect that sometimes, the allocated cores may not be directly connected to their GRES.

At least, --gres-flags=enforce-binding would give them a way to control the behavior, because it looks like right now, the binding is enforced with or without the option.


To clarify further, I have a couple associated questions:

1. when a job requests n CPU cores and m GRES, which resource is selected first during the allocation? Is it GRES, and then CPU cores are chosen based on availability and job options? Or is it cores first, and then GRES are selected depending on the other criteria?

2. even without requesting any GRES, what's the best way to request that a single-task, 4-core job is allocated 4 cores on the same socket? Or 2 cores on each socket?
I know the --cores-per-socket option, but it's a selection option which basically makes all the nodes that don't have the requested number of cores/socket ineligible to run the job. But it doesn't determine the allocation of those cores, does it?


> That's true. Like I responded in the bug 5388 and in comment 28, we are
> developing a new cons_tres plugin that will address GPU scheduling much
> better than the current cons_res plugin.
> Details on the design are still in progress, but I can say it will be
> available in the 19.05 release.
> 
> I will take your recommendations and add it to our design document to make
> sure all of this is addressed.
> The idea is basically to let the plugin identify the topology and just
> select the best cores for the job accordingly to the TRES, but the details
> on the implementation are still not clear.
> 
> I will also comment it with dev. team to get some more precise info.

Thank you!

Cheers,
-- 
Kilian
Comment 41 Felip Moll 2018-07-17 14:17:51 MDT
> Well, the documentation says that COREs= in gres.conf is advisory, so unless
> "enforce-binding" is used, I think people expect that sometimes, the
> allocated cores may not be directly connected to their GRES.
> 
> At least, --gres-flags=enforce-binding would give them a way to control the
> behavior, because it looks like right now, the binding is enforced with or
> without the option.

Yep, that's the current discussion. As I commented in comment 23 the documentation is wrong.

What we could do as a workaround and temporary measure is the opposite: add a --gres-flags=disable-binding option to disable this filtering.
This way the user would be aware that the job may run with less than optimal performance when using a GPU.

It would be a temporary measure until the cons_tres plugin is finished.

Would this work for you?

I have very little time to add it to 18.08, but maybe I could provide a local patch. In any case the new feature wouldn't be in the 17.11 branch.


> To clarify further, I have a couple associated questions:
> 
> 1. when a job requests n CPU cores and m GRES, which resource is selected
> first during the allocation? Is it GRES, and then CPU cores are chosen based
> on availability and job options? Or is it cores first, and then GRES are
> selected depending on the other criteria?

At a very high level, the order is as follows:

a) Given a Pending job, identify all usable nodes, cores and GPUs based upon partitions, reservations, other jobs, and so on.
b) Then, if the job requested GPUs and these are pinned to specific cores as defined in gres.conf, remove the other cores from the available list. This is where the aforementioned filtering occurs.
c) Select the best cores for the job. Here only cpus-per-task, tasks-per-socket, tasks-per-node, mem-per-cpu and many other options are considered, but not GPU specifications.
d) Finally identify which GPUs can be used with the selected cores in c).


> 2. even without requesting any GRES, what's the best way to request that a
> single-task, 4-core job is allocated 4 cores on the same socket? Or 2 cores
> on each socket?

In that case you can use the --cpu-bind option to bind a task to a NUMA domain (or to a socket).

You have many options, like sockets, map_cpu:<list>, map_ldom:<list>, etc.

'man srun' will provide all the info.
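
To illustrate the kind of binding described above (the exact option spelling and behavior should be checked against the srun man page for your version; './my_app' is just a placeholder):

$ srun -n 1 -c 4 --cpu-bind=sockets ./my_app       # bind the task to whole sockets within its allocation
$ srun -n 2 --cpu-bind=map_ldom:0,1 ./my_app       # bind task 0 to NUMA domain 0 and task 1 to domain 1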


> I know the --cores-per-socket option, but it's a selection option which
> basically makes all the nodes that don't have the requested number of
> cores/socket ineligible to run the job. But it doesn't determine the
> allocation of those cores, does it?

Your statement is correct.
Comment 42 Kilian Cavalotti 2018-07-17 14:50:55 MDT
Hi Felip,

(In reply to Felip Moll from comment #41)
> What we could do as a workaround and temporary measure is the opposite, add
> a --gres-flags=disable-binding to disable this filtering.
> This way the user would be conscious about the possibility of running under
> the best performance when using a gpu.
> 
> It would be a temporary measure until the cons_tres plugin is finished.
> 
> Would this work for you?
> 
> I have very little time to add it to 18.08, but maybe I could provide a
> local patch. In any case the new feature wouldn't be in 17.11 branch.

That could be useful to have this option in 18.08, yes.

> In a very high level the order is as follows:
> 
> a) Given a Pending job, identify all usable nodes, cores and GPUs based upon
> partitions, reservations, other jobs, and so on.
> b) Then, if the job requested GPUs and this are pinned to some Cores as
> defined in gres.conf, remove the other cores from the available list. Here
> is where the mentioned filtering occurs.
> c) Select the best cores for the job. Here only cpus-per-task,
> tasks-per-socket, tasks-per-node, mem-per-cpu and other many options are
> considered, but not GPU specifications.
> d) Finally identify which GPUs can be used with the selected cores in c).

That's extremely useful, thank you for the description!

> In that case you can use the --cpu-bind option to bind a task to a NUMA
> domain (or to a socket).
> 
> You have many options, like sockets, map_cpu:<list>, map_ldom:<list>, etc.
> 
> 'man srun' will provide all the info.

Got it, thanks!

Cheers,
-- 
Kilian
Comment 52 Felip Moll 2018-07-18 06:54:50 MDT
Kilian,

We've managed to add this into 18.08.

https://github.com/SchedMD/slurm/commit/aa61233bb271fd09774c614b94edc2b71062f538
https://github.com/SchedMD/slurm/commit/6d349bc5a1ee9fb90d79b6cc8bbfcdc49466a2ca

We added --gres-flags=disable-binding and the corresponding entry in the man pages. I hope it helps.

If you need the patch for 17.11 I could create a backport for you. Otherwise it will be available in the future 18.08 release planned for mid-August.

Note this is a 'workaround'. The final solution will be to change the logic in the new cons_tres plugin in 19.05.
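
As a sketch of how this would look on 18.08+ (node name and sizes are just the ones from this ticket; please confirm the exact behavior against the 18.08 man page):

$ srun -N 1 -n 1 -c 20 --gres gpu:4 --gres-flags=disable-binding -w sh-114-04 --pty bash

i.e. a request for all 20 cores plus the 4 GPUs, like the one discussed above, should then be schedulable, at the cost of some of the allocated cores not being local to the GPUs.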
Comment 56 Felip Moll 2018-07-18 08:58:24 MDT
The wrong sentence stating that "the CPUs identified in the gres.conf file will be [...] advisory" has been deleted in commit 7cc2553b.

Kilian, at this point, is everything fine and clear for you?

If you don't need the 17.11 patch I will close the bug, otherwise I'll prepare the backport first.

Thanks for your patience.
Comment 57 Kilian Cavalotti 2018-07-18 09:05:58 MDT
(In reply to Felip Moll from comment #56)
> The wrong sentence about "the CPUs identified in the gres.conf file will be
> [...] advisory" have been deleted in commit 7cc2553b.
> 
> Kilian, at that point, is everything fine&clear for you?
> 
> If you don't need the 17.11 patch I will close the bug, otherwise I'll
> prepare the backport first.

Great, thanks a lot! Everything is fine, we don't really need a patch for 17.11 and can wait for 18.08 to be released, but we are really looking forward to the new cons_tres in 19.05!

Cheers,
--
Kilian