Ticket 10827 - Invalid gres specification with AutoDetect=nvml
Summary: Invalid gres specification with AutoDetect=nvml
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 20.11.4
Hardware: Linux
OS: Linux
Importance: --- 4 - Minor Issue
Assignee: Director of Support
QA Contact:
URL:
Duplicates: 11277 11693 11697
Depends on:
Blocks: 10933 10970
Reported: 2021-02-09 15:24 MST by Kilian Cavalotti
Modified: 2021-12-15 10:06 MST (History)
CC List: 5 users

See Also:
Site: Stanford
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 21.08.0
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
slurm.conf (72.18 KB, text/plain)
2021-02-09 16:10 MST, Kilian Cavalotti
Details
lstopo output (4.39 KB, text/plain)
2021-02-09 18:58 MST, Kilian Cavalotti
Details
slurmd logs (3.47 KB, text/x-log)
2021-02-16 11:01 MST, Kilian Cavalotti
Details
debug v1 (13.08 KB, patch)
2021-02-16 18:47 MST, Michael Hinton
Details | Diff
slurmctld/slurmd logs with debug patch (253.67 KB, application/gzip)
2021-02-17 15:50 MST, Kilian Cavalotti
Details
slurmctld logs with gres and selecttype debugflags (12.92 MB, application/x-bzip)
2021-02-19 18:15 MST, Kilian Cavalotti
Details
slurmd debug3 logs w/ AutoDetect=NVML (36.04 KB, text/x-log)
2021-02-22 17:57 MST, Kilian Cavalotti
Details
v1 (22.45 KB, patch)
2021-06-09 13:17 MDT, Michael Hinton
Details | Diff
2011 v2 (will not land in 20.11) (26.82 KB, patch)
2021-06-10 16:25 MDT, Michael Hinton
Details | Diff

Description Kilian Cavalotti 2021-02-09 15:24:34 MST
Hi SchedMD, 

We have a weird situation where job allocations seem to be failing on certain GPU nodes with an "Invalid gres specification" error at step creation time.

For instance, submitting a job to a particular host fails with "--gres=gpu:4", but works with "--gres=gpu:3" or "-G 4":

-- 8< -------------------------------------------------------
$ srun -p jamesz -w sh03-13n14  --gres=gpu:3 --pty bash
srun: job 17809957 queued and waiting for resources
srun: job 17809957 has been allocated resources

$ srun -p jamesz -w sh03-13n14 --gres=gpu:4 --pty bash
srun: job 17809893 queued and waiting for resources
srun: job 17809893 has been allocated resources
srun: error: Unable to create step for job 17809893: Invalid generic resource (gres) specification
srun: Force Terminated job 17809893

$ srun  -p jamesz -w sh03-13n14  -G 4 --pty bash
srun: job 17809948 queued and waiting for resources
srun: job 17809948 has been allocated resources
-- 8< -------------------------------------------------------

The gres.conf on that node just contains: "AutoDetect=nvml". 

Here's its "scontrol show node" detailed output:
-- 8< -------------------------------------------------------
NodeName=sh03-13n14 Arch=x86_64 CpuBind=cores CoresPerSocket=16 
   CPUAlloc=32 CPUTot=32 CPULoad=13.12
   AvailableFeatures=IB:HDR,CPU_MNF:AMD,CPU_GEN:RME,CPU_SKU:7502P,CPU_FRQ:2.50GHz,GPU_GEN:TUR,GPU_BRD:GEFORCE,GPU_SKU:RTX_2080Ti,GPU_MEM:11GB,GPU_CC:7.5,CLASS:SH3_G4FP32
   ActiveFeatures=IB:HDR,CPU_MNF:AMD,CPU_GEN:RME,CPU_SKU:7502P,CPU_FRQ:2.50GHz,GPU_GEN:TUR,GPU_BRD:GEFORCE,GPU_SKU:RTX_2080Ti,GPU_MEM:11GB,GPU_CC:7.5,CLASS:SH3_G4FP32
   Gres=gpu:4(S:0-1)
   GresDrain=N/A
   GresUsed=gpu:(null):0(IDX:N/A)
   NodeAddr=sh03-13n14 NodeHostName=sh03-13n14 Version=20.11.3
   OS=Linux 3.10.0-957.27.2.el7.x86_64 #1 SMP Mon Jul 29 17:46:05 UTC 2019 
   RealMemory=256000 AllocMem=179072 FreeMem=180067 Sockets=2 Boards=1
   State=ALLOCATED ThreadsPerCore=1 TmpDisk=0 Weight=144441 Owner=N/A MCS_label=N/A
   Partitions=jamesz,owners 
   BootTime=2021-01-14T17:02:07 SlurmdStartTime=2021-02-09T12:58:35
   CfgTRES=cpu=32,mem=250G,billing=154,gres/gpu=4
   AllocTRES=cpu=32,mem=179072M
   CapWatts=n/a
   CurrentWatts=n/s AveWatts=n/s
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Comment=(null)
-- 8< -------------------------------------------------------

And here's the last log of slurmd starting on the node:
-- 8< -------------------------------------------------------
Feb 09 12:57:35 sh03-13n14.int systemd[1]: Started Slurm node daemon.
Feb 09 12:57:35 sh03-13n14.int slurmd[49199]: Considering each NUMA node as a socket
Feb 09 12:57:35 sh03-13n14.int slurmd[49199]: Considering each NUMA node as a socket
Feb 09 12:57:35 sh03-13n14.int slurmd[49199]: error: _nvml_get_mem_freqs: Failed to get supported memory frequencies for the GPU : Not Supported
Feb 09 12:57:35 sh03-13n14.int slurmd[49199]: error: _nvml_get_mem_freqs: Failed to get supported memory frequencies for the GPU : Not Supported
Feb 09 12:57:35 sh03-13n14.int slurmd[49199]: error: _nvml_get_mem_freqs: Failed to get supported memory frequencies for the GPU : Not Supported
Feb 09 12:57:35 sh03-13n14.int slurmd[49199]: error: _nvml_get_mem_freqs: Failed to get supported memory frequencies for the GPU : Not Supported
Feb 09 12:57:35 sh03-13n14.int slurmd[49199]: gpu/nvml: _get_system_gpu_list_nvml: 4 GPU system device(s) detected
Feb 09 12:57:35 sh03-13n14.int slurmd[49199]: gres/gpu: _normalize_sys_gres_types: Could not find an unused configuration record with a GRES type that is a substring of system device `geforce_rtx_2080_ti`. Setting system GRES type to NULL
Feb 09 12:57:35 sh03-13n14.int slurmd[49199]: gres/gpu: _normalize_sys_gres_types: Could not find an unused configuration record with a GRES type that is a substring of system device `geforce_rtx_2080_ti`. Setting system GRES type to NULL
Feb 09 12:57:35 sh03-13n14.int slurmd[49199]: gres/gpu: _normalize_sys_gres_types: Could not find an unused configuration record with a GRES type that is a substring of system device `geforce_rtx_2080_ti`. Setting system GRES type to NULL
Feb 09 12:57:35 sh03-13n14.int slurmd[49199]: gres/gpu: _normalize_sys_gres_types: Could not find an unused configuration record with a GRES type that is a substring of system device `geforce_rtx_2080_ti`. Setting system GRES type to NULL
Feb 09 12:57:35 sh03-13n14.int slurmd[49199]: topology/tree: _validate_switches: TOPOLOGY: warning -- no switch can reach all nodes through its descendants. If this is not intentional, fix the topology.conf file.
Feb 09 12:57:35 sh03-13n14.int slurmd[49199]: slurmd version 20.11.3 started
Feb 09 12:57:35 sh03-13n14.int slurmd[49199]: slurmd started on Tue, 09 Feb 2021 12:57:35 -0800
-- 8< -------------------------------------------------------

Any idea where the "Invalid gres specification" error could come from?

Thanks!
--
Kilian
Comment 1 Michael Hinton 2021-02-09 16:02:22 MST
Hi Kilian,

That does seem bizarre and I wouldn't expect that. Is this perfectly repeatable with `--gres=gpu:4` on that node? Or is this dependent on what is allocated on the node?

Can you reproduce this without the `--pty` arg? Maybe with a `hostname` or `grep CUDA` instead?

Are you seeing this on just this node, or throughout the cluster?

Thanks,
-Michael
Comment 2 Kilian Cavalotti 2021-02-09 16:10:09 MST
Created attachment 17851 [details]
slurm.conf
Comment 3 Kilian Cavalotti 2021-02-09 16:12:28 MST
Hi Michael, 

(In reply to Michael Hinton from comment #1)
> That does seem bizarre and I wouldn't expect that. Is this perfectly
> repeatable with `--gres=gpu:4` on that node? 

Yes, perfectly repeatable. On this node and all the nodes that share the same CPU characteristics (single-socket AMD Rome CPU, with NPS=2). It doesn't seem to be occurring on dual-socket Intel nodes (although the number of NUMA nodes would be the same).

> Or is this dependent on what is
> allocated on the node?

This is a 4-GPU node, so the job can only be allocated if the 4 GPUs are available.

> Can you reproduce this without the `--pty` arg? Maybe with a `hostname` or
> `grep CUDA` instead?

It doesn't seem to be related to `--pty`:

$ srun  --partition=jamesz -w sh03-13n14 --gres=gpu:4 nvidia-smi -L
srun: job 17816858 queued and waiting for resources
srun: job 17816858 has been allocated resources
srun: error: Unable to create step for job 17816858: Invalid generic resource (gres) specification


> Are you seeing this on just this node, or throughout the cluster?

So far, it looks like it's mostly happening on AMD Rome GPU nodes (the sh03-* GPU nodes in slurm.conf). I haven't seen the same error on sh01-* or sh02-* nodes yet.

I couldn't find any additional info in the slurmd logs, even in debug3. Do you have any hint on how to get more information about the step creation error?

Thanks!
--
Kilian
Comment 4 Michael Hinton 2021-02-09 17:14:49 MST
(In reply to Kilian Cavalotti from comment #3)
> I couldn't find any additional info in the slurmd logs, even in debug3. Do
> you have any hint on how to get more information about the step creation
> error?
Yes - add the gres debug flag and hopefully that will help us see what kind of invalid GRES error it is. That was something we added in 20.11 to help debug these invalid GRES errors. Could you do that and post what you find?
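
For example, the flag can be toggled at runtime with something like:

    scontrol setdebugflags +gres
    scontrol setdebugflags -gres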

Thanks!
-Michael
Comment 5 Michael Hinton 2021-02-09 17:27:07 MST
Although, if I remember correctly, there is only one path for an invalid GRES error at the step level like you are seeing... I'll take a look and see what I find
Comment 6 Kilian Cavalotti 2021-02-09 18:16:26 MST
(In reply to Michael Hinton from comment #4)
> (In reply to Kilian Cavalotti from comment #3)
> > I couldn't find any additional info in the slurmd logs, even in debug3. Do
> > you have any hint on how to get more information about the step creation
> > error?
> Yes - add the gres debug flag and hopefully that will help us see what kind
> of invalid GRES error it is. That was something we added in 20.11 to help
> debug these invalid GRES errors. Could you do that and post what you find?


Ok, I gave it a try:


$ srun  --partition=jamesz -w sh03-13n14 --gres=gpu:4 nvidia-smi -L
srun: job 17823788 queued and waiting for resources
srun: job 17823788 has been allocated resources
srun: error: Unable to create step for job 17823788: Invalid generic resource (gres) specification
srun: Force Terminated job 17823788


and it produced this in the slurmctld logs:


2021-02-09T16:58:44.767185-08:00 sh03-sl01 slurmctld[30893]: gres:gpu(7696487) type:(null)(0) job:17823788 flags: state
2021-02-09T16:58:44.767295-08:00 sh03-sl01 slurmctld[30893]:  gres_per_node:4 node_cnt:0
2021-02-09T16:58:44.767405-08:00 sh03-sl01 slurmctld[30893]:  ntasks_per_gres:65534
2021-02-09T16:58:44.774632-08:00 sh03-sl01 slurmctld[30893]: Sock_gres state for sh02-15n11
2021-02-09T16:58:44.774741-08:00 sh03-sl01 slurmctld[30893]: Gres:gpu Type:(null) TotalCnt:4 MaxNodeGres:0
2021-02-09T16:58:44.774853-08:00 sh03-sl01 slurmctld[30893]:  Sock[ANY]Cnt:4 Bits:0-3 of 4
2021-02-09T16:58:44.774963-08:00 sh03-sl01 slurmctld[30893]: Sock_gres state for sh03-13n12
2021-02-09T16:58:44.775072-08:00 sh03-sl01 slurmctld[30893]: Gres:gpu Type:(null) TotalCnt:4 MaxNodeGres:0
2021-02-09T16:58:44.775179-08:00 sh03-sl01 slurmctld[30893]:  Sock[ANY]Cnt:4 Bits:0-3 of 4
2021-02-09T16:58:44.775285-08:00 sh03-sl01 slurmctld[30893]: Sock_gres state for sh03-13n13
2021-02-09T16:58:44.775396-08:00 sh03-sl01 slurmctld[30893]: Gres:gpu Type:(null) TotalCnt:4 MaxNodeGres:0
2021-02-09T16:58:44.775514-08:00 sh03-sl01 slurmctld[30893]:  Sock[ANY]Cnt:4 Bits:0-3 of 4
2021-02-09T16:58:44.775621-08:00 sh03-sl01 slurmctld[30893]: Sock_gres state for sh03-13n14
2021-02-09T16:58:44.775727-08:00 sh03-sl01 slurmctld[30893]: Gres:gpu Type:(null) TotalCnt:4 MaxNodeGres:0
2021-02-09T16:58:44.775832-08:00 sh03-sl01 slurmctld[30893]:  Sock[ANY]Cnt:4 Bits:0-3 of 4
2021-02-09T16:58:44.775936-08:00 sh03-sl01 slurmctld[30893]: Sock_gres state for sh03-13n15
2021-02-09T16:58:44.776044-08:00 sh03-sl01 slurmctld[30893]: Gres:gpu Type:(null) TotalCnt:4 MaxNodeGres:0
2021-02-09T16:58:44.776151-08:00 sh03-sl01 slurmctld[30893]:  Sock[ANY]Cnt:4 Bits:0-3 of 4
2021-02-09T16:58:44.776258-08:00 sh03-sl01 slurmctld[30893]: sched: _slurm_rpc_allocate_resources JobId=17823788 NodeList=(null) usec=28433

[...]

2021-02-09T16:58:59.601396-08:00 sh03-sl01 slurmctld[30893]: Sock_gres state for sh03-13n14
2021-02-09T16:58:59.601495-08:00 sh03-sl01 slurmctld[30893]: Gres:gpu Type:(null) TotalCnt:4 MaxNodeGres:0
2021-02-09T16:58:59.601597-08:00 sh03-sl01 slurmctld[30893]:  Sock[ANY]Cnt:4 Bits:0-3 of 4
2021-02-09T16:58:59.601693-08:00 sh03-sl01 slurmctld[30893]: gres/gpu: state for sh03-13n14
2021-02-09T16:58:59.601792-08:00 sh03-sl01 slurmctld[30893]:  gres_cnt found:4 configured:4 avail:4 alloc:3
2021-02-09T16:58:59.601894-08:00 sh03-sl01 slurmctld[30893]:  gres_bit_alloc:0-2 of 4
2021-02-09T16:58:59.601993-08:00 sh03-sl01 slurmctld[30893]:  gres_used:(null)
2021-02-09T16:58:59.602091-08:00 sh03-sl01 slurmctld[30893]:  links[0]:0, 0, 0, -1
2021-02-09T16:58:59.602190-08:00 sh03-sl01 slurmctld[30893]:  links[1]:0, 0, -1, 0
2021-02-09T16:58:59.602289-08:00 sh03-sl01 slurmctld[30893]:  links[2]:0, -1, 0, 0
2021-02-09T16:58:59.602391-08:00 sh03-sl01 slurmctld[30893]:  links[3]:-1, 0, 0, 0
2021-02-09T16:58:59.602494-08:00 sh03-sl01 slurmctld[30893]:  topo[0]:(null)(0)
2021-02-09T16:58:59.602599-08:00 sh03-sl01 slurmctld[30893]:   topo_core_bitmap[0]:0-31 of 32
2021-02-09T16:58:59.602700-08:00 sh03-sl01 slurmctld[30893]:   topo_gres_bitmap[0]:0 of 4
2021-02-09T16:58:59.602802-08:00 sh03-sl01 slurmctld[30893]:   topo_gres_cnt_alloc[0]:1
2021-02-09T16:58:59.602903-08:00 sh03-sl01 slurmctld[30893]:   topo_gres_cnt_avail[0]:1
2021-02-09T16:58:59.603004-08:00 sh03-sl01 slurmctld[30893]:  topo[1]:(null)(0)
2021-02-09T16:58:59.603103-08:00 sh03-sl01 slurmctld[30893]:   topo_core_bitmap[1]:0-31 of 32
2021-02-09T16:58:59.603201-08:00 sh03-sl01 slurmctld[30893]:   topo_gres_bitmap[1]:1 of 4
2021-02-09T16:58:59.603298-08:00 sh03-sl01 slurmctld[30893]:   topo_gres_cnt_alloc[1]:1
2021-02-09T16:58:59.603396-08:00 sh03-sl01 slurmctld[30893]:   topo_gres_cnt_avail[1]:1
2021-02-09T16:58:59.603523-08:00 sh03-sl01 slurmctld[30893]:  topo[2]:(null)(0)
2021-02-09T16:58:59.603626-08:00 sh03-sl01 slurmctld[30893]:   topo_core_bitmap[2]:0-31 of 32
2021-02-09T16:58:59.603722-08:00 sh03-sl01 slurmctld[30893]:   topo_gres_bitmap[2]:2 of 4
2021-02-09T16:58:59.603820-08:00 sh03-sl01 slurmctld[30893]:   topo_gres_cnt_alloc[2]:1
2021-02-09T16:58:59.603917-08:00 sh03-sl01 slurmctld[30893]:   topo_gres_cnt_avail[2]:1
2021-02-09T16:58:59.604014-08:00 sh03-sl01 slurmctld[30893]:  topo[3]:(null)(0)
2021-02-09T16:58:59.604109-08:00 sh03-sl01 slurmctld[30893]:   topo_core_bitmap[3]:0-31 of 32
2021-02-09T16:58:59.604206-08:00 sh03-sl01 slurmctld[30893]:   topo_gres_bitmap[3]:3 of 4
2021-02-09T16:58:59.604308-08:00 sh03-sl01 slurmctld[30893]:   topo_gres_cnt_alloc[3]:0
2021-02-09T16:58:59.604409-08:00 sh03-sl01 slurmctld[30893]:   topo_gres_cnt_avail[3]:1
2021-02-09T16:58:59.604513-08:00 sh03-sl01 slurmctld[30893]: gres:gpu(7696487) type:(null)(0) job:17823788 flags: state
2021-02-09T16:58:59.604618-08:00 sh03-sl01 slurmctld[30893]:  gres_per_node:4 node_cnt:1
2021-02-09T16:58:59.604714-08:00 sh03-sl01 slurmctld[30893]:  ntasks_per_gres:65534
2021-02-09T16:58:59.604810-08:00 sh03-sl01 slurmctld[30893]:  gres_bit_step_alloc:NULL
2021-02-09T16:58:59.604905-08:00 sh03-sl01 slurmctld[30893]:  gres_cnt_node_alloc[0]:3
2021-02-09T16:58:59.605002-08:00 sh03-sl01 slurmctld[30893]:  gres_bit_alloc[0]:0-2 of 4
2021-02-09T16:58:59.605100-08:00 sh03-sl01 slurmctld[30893]:  gres_cnt_step_alloc[0]:0
2021-02-09T16:58:59.605199-08:00 sh03-sl01 slurmctld[30893]:  total_node_cnt:1693 (sparsely populated for resource selection)
2021-02-09T16:58:59.605298-08:00 sh03-sl01 slurmctld[30893]:  gres_cnt_node_select[1672]:3
2021-02-09T16:58:59.605396-08:00 sh03-sl01 slurmctld[30893]:  gres_bit_select[1672]:0-2 of 4
2021-02-09T16:58:59.605517-08:00 sh03-sl01 slurmctld[30893]: sched: Allocate JobId=17823788 NodeList=sh03-13n14 #CPUs=1 Partition=jamesz

and then, prolog runs and the job completes right after:

2021-02-09T16:59:04.067424-08:00 sh03-sl01 slurmctld[30893]: prolog_running_decr: Configuration for JobId=17823788 is complete
2021-02-09T16:59:04.680645-08:00 sh03-sl01 slurmctld[30893]: _job_complete: JobId=17823788 WTERMSIG 1
2021-02-09T16:59:04.680938-08:00 sh03-sl01 slurmctld[30893]: gres/gpu: state for sh03-13n14
2021-02-09T16:59:04.681049-08:00 sh03-sl01 slurmctld[30893]:  gres_cnt found:4 configured:4 avail:4 alloc:0
2021-02-09T16:59:04.681151-08:00 sh03-sl01 slurmctld[30893]:  gres_bit_alloc: of 4
2021-02-09T16:59:04.681260-08:00 sh03-sl01 slurmctld[30893]:  gres_used:(null)
2021-02-09T16:59:04.681364-08:00 sh03-sl01 slurmctld[30893]:  links[0]:0, 0, 0, -1
2021-02-09T16:59:04.681465-08:00 sh03-sl01 slurmctld[30893]:  links[1]:0, 0, -1, 0
2021-02-09T16:59:04.681569-08:00 sh03-sl01 slurmctld[30893]:  links[2]:0, -1, 0, 0
2021-02-09T16:59:04.681666-08:00 sh03-sl01 slurmctld[30893]:  links[3]:-1, 0, 0, 0
2021-02-09T16:59:04.681762-08:00 sh03-sl01 slurmctld[30893]:  topo[0]:(null)(0)
2021-02-09T16:59:04.681861-08:00 sh03-sl01 slurmctld[30893]:   topo_core_bitmap[0]:0-31 of 32
2021-02-09T16:59:04.681955-08:00 sh03-sl01 slurmctld[30893]:   topo_gres_bitmap[0]:0 of 4
2021-02-09T16:59:04.682050-08:00 sh03-sl01 slurmctld[30893]:   topo_gres_cnt_alloc[0]:0
2021-02-09T16:59:04.682153-08:00 sh03-sl01 slurmctld[30893]:   topo_gres_cnt_avail[0]:1
2021-02-09T16:59:04.682249-08:00 sh03-sl01 slurmctld[30893]:  topo[1]:(null)(0)
2021-02-09T16:59:04.682346-08:00 sh03-sl01 slurmctld[30893]:   topo_core_bitmap[1]:0-31 of 32
2021-02-09T16:59:04.682447-08:00 sh03-sl01 slurmctld[30893]:   topo_gres_bitmap[1]:1 of 4
2021-02-09T16:59:04.682554-08:00 sh03-sl01 slurmctld[30893]:   topo_gres_cnt_alloc[1]:0
2021-02-09T16:59:04.682655-08:00 sh03-sl01 slurmctld[30893]:   topo_gres_cnt_avail[1]:1
2021-02-09T16:59:04.682753-08:00 sh03-sl01 slurmctld[30893]:  topo[2]:(null)(0)
2021-02-09T16:59:04.682850-08:00 sh03-sl01 slurmctld[30893]:   topo_core_bitmap[2]:0-31 of 32
2021-02-09T16:59:04.682946-08:00 sh03-sl01 slurmctld[30893]:   topo_gres_bitmap[2]:2 of 4
2021-02-09T16:59:04.683042-08:00 sh03-sl01 slurmctld[30893]:   topo_gres_cnt_alloc[2]:0
2021-02-09T16:59:04.683137-08:00 sh03-sl01 slurmctld[30893]:   topo_gres_cnt_avail[2]:1
2021-02-09T16:59:04.683234-08:00 sh03-sl01 slurmctld[30893]:  topo[3]:(null)(0)
2021-02-09T16:59:04.683331-08:00 sh03-sl01 slurmctld[30893]:   topo_core_bitmap[3]:0-31 of 32
2021-02-09T16:59:04.683428-08:00 sh03-sl01 slurmctld[30893]:   topo_gres_bitmap[3]:3 of 4
2021-02-09T16:59:04.683532-08:00 sh03-sl01 slurmctld[30893]:   topo_gres_cnt_alloc[3]:0
2021-02-09T16:59:04.683630-08:00 sh03-sl01 slurmctld[30893]:   topo_gres_cnt_avail[3]:1
2021-02-09T16:59:04.683725-08:00 sh03-sl01 slurmctld[30893]: _job_complete: JobId=17823788 done


So, from the controller's perspective, everything looks fine. It looks like the issue happens on the slurmd side.
But there, I can't find anything in the logs. With a new jobid (17823871) and SlurmdSyslogDebug=debug3, the only messages in the slurmd logs are:

Feb 09 17:11:13 sh03-13n14.int slurmd[126697]: debug3: in the service_connection
Feb 09 17:11:13 sh03-13n14.int slurmd[126697]: debug2: Start processing RPC: REQUEST_LAUNCH_PROLOG
Feb 09 17:11:13 sh03-13n14.int slurmd[126697]: debug2: Processing RPC: REQUEST_LAUNCH_PROLOG
Feb 09 17:11:13 sh03-13n14.int slurmd[126697]: debug3: state for jobid 17812328: ctime:1612908337 revoked:0 expires:2147483647
Feb 09 17:11:13 sh03-13n14.int slurmd[126697]: debug3: state for jobid 17807268: ctime:1612912180 revoked:0 expires:2147483647
Feb 09 17:11:13 sh03-13n14.int slurmd[126697]: debug3: state for jobid 17815518: ctime:1612919377 revoked:1612919415 expires:1612919537
Feb 09 17:11:13 sh03-13n14.int slurmd[126697]: debug3: state for jobid 17823852: ctime:1612919426 revoked:1612919427 expires:1612919547
Feb 09 17:11:13 sh03-13n14.int slurmd[126697]: debug:  Checking credential with 620 bytes of sig data
Feb 09 17:11:13 sh03-13n14.int slurmd[126697]: debug2: _insert_job_state: we already have a job state for job 17823871.  No big deal, just an FYI.
Feb 09 17:11:13 sh03-13n14.int slurmd[126697]: debug:  prep/script: _run_spank_job_script: _run_spank_job_script: calling /usr/sbin/slurmstepd spank prolog
Feb 09 17:11:13 sh03-13n14.int slurmd[126697]: debug:  [job 17823871] attempting to run prolog [/etc/slurm/scripts/prolog.sh]
Feb 09 17:11:13 sh03-13n14.int slurmd[126697]: job 17823871 prolog: creating local scratch dirs...
Feb 09 17:11:13 sh03-13n14.int slurmd[126697]: debug3: _spawn_prolog_stepd: call to _forkexec_slurmstepd
Feb 09 17:11:13 sh03-13n14.int slurmd[126697]: debug3: slurmstepd rank 0 (sh03-13n14), parent rank -1 (NONE), children 0, depth 0, max_depth 0
Feb 09 17:11:13 sh03-13n14.int slurmd[126697]: debug3: _spawn_prolog_stepd: return from _forkexec_slurmstepd 0
Feb 09 17:11:13 sh03-13n14.int slurmd[126697]: debug2: Finish processing RPC: REQUEST_LAUNCH_PROLOG
Feb 09 17:11:14 sh03-13n14.int slurmd[126697]: debug3: in the service_connection
Feb 09 17:11:14 sh03-13n14.int slurmd[126697]: debug2: Start processing RPC: REQUEST_TERMINATE_JOB
Feb 09 17:11:14 sh03-13n14.int slurmd[126697]: debug2: Processing RPC: REQUEST_TERMINATE_JOB
Feb 09 17:11:14 sh03-13n14.int slurmd[126697]: debug:  _rpc_terminate_job, uid = 398 JobId=17823871
Feb 09 17:11:14 sh03-13n14.int slurmd[126697]: debug:  credential for job 17823871 revoked
Feb 09 17:11:14 sh03-13n14.int slurmd[126697]: debug3: _kill_all_active_steps: Looking for job 17823871, found step from job 17812328
Feb 09 17:11:14 sh03-13n14.int slurmd[126697]: debug3: _kill_all_active_steps: Looking for job 17823871, found step from job 17812328
Feb 09 17:11:14 sh03-13n14.int slurmd[126697]: debug2: container signal 999 to StepId=17823871.extern
Feb 09 17:11:14 sh03-13n14.int slurmd[126697]: debug3: _kill_all_active_steps: Looking for job 17823871, found step from job 17807268
Feb 09 17:11:14 sh03-13n14.int slurmd[126697]: debug3: _kill_all_active_steps: Looking for job 17823871, found step from job 17807268
Feb 09 17:11:14 sh03-13n14.int slurmd[126697]: debug3: _kill_all_active_steps: Looking for job 17823871, found step from job 17812328
Feb 09 17:11:14 sh03-13n14.int slurmd[126697]: debug3: _kill_all_active_steps: Looking for job 17823871, found step from job 17812328
Feb 09 17:11:14 sh03-13n14.int slurmd[126697]: debug2: container signal 18 to StepId=17823871.extern
Feb 09 17:11:14 sh03-13n14.int slurmd[126697]: debug3: _kill_all_active_steps: Looking for job 17823871, found step from job 17807268
Feb 09 17:11:14 sh03-13n14.int slurmd[126697]: debug3: _kill_all_active_steps: Looking for job 17823871, found step from job 17807268
Feb 09 17:11:14 sh03-13n14.int slurmd[126697]: debug3: _kill_all_active_steps: Looking for job 17823871, found step from job 17812328
Feb 09 17:11:14 sh03-13n14.int slurmd[126697]: debug3: _kill_all_active_steps: Looking for job 17823871, found step from job 17812328
Feb 09 17:11:14 sh03-13n14.int slurmd[126697]: debug2: container signal 15 to StepId=17823871.extern
Feb 09 17:11:14 sh03-13n14.int slurmd[126697]: debug3: _kill_all_active_steps: Looking for job 17823871, found step from job 17807268
Feb 09 17:11:14 sh03-13n14.int slurmd[126697]: debug3: _kill_all_active_steps: Looking for job 17823871, found step from job 17807268
Feb 09 17:11:14 sh03-13n14.int slurmd[126697]: debug2: set revoke expiration for jobid 17823871 to 1612919594 UTS
Feb 09 17:11:14 sh03-13n14.int slurmd[126697]: debug:  Waiting for job 17823871's prolog to complete
Feb 09 17:11:14 sh03-13n14.int slurmd[126697]: debug:  Finished wait for job 17823871's prolog to complete
Feb 09 17:11:14 sh03-13n14.int slurmd[126697]: debug:  prep/script: _run_spank_job_script: _run_spank_job_script: calling /usr/sbin/slurmstepd spank epilog
Feb 09 17:11:14 sh03-13n14.int slurmd[126697]: debug:  [job 17823871] attempting to run epilog [/etc/slurm/scripts/epilog.sh]
Feb 09 17:11:14 sh03-13n14.int slurmd[126697]: debug:  completed epilog for jobid 17823871
Feb 09 17:11:14 sh03-13n14.int slurmd[126697]: debug:  JobId=17823871: sent epilog complete msg: rc = 0
Feb 09 17:11:14 sh03-13n14.int slurmd[126697]: debug2: Finish processing RPC: REQUEST_TERMINATE_JOB


There's nothing between REQUEST_LAUNCH_PROLOG and REQUEST_TERMINATE_JOB and nothing that I can correlate with an invalid gres specification... :\



Another troubling point is that when the Gres specification is really invalid, the job is rejected right away, not after resources have been allocated on the node:

$ srun  --partition=jamesz -w sh03-13n14 --gres=foobar:4 nvidia-smi -L
srun: error: Unable to allocate resources: Invalid generic resource (gres) specification

which is different from: 

$ srun  --partition=jamesz -w sh03-13n14 --gres=gpu:4 nvidia-smi -L
srun: job 17823830 queued and waiting for resources
srun: job 17823830 has been allocated resources
srun: error: Unable to create step for job 17823830: Invalid generic resource (gres) specification
srun: Force Terminated job 17823830


Hope this helps,

Thanks!
--
Kilian
Comment 7 Michael Hinton 2021-02-09 18:31:27 MST
Sorry, can you also reproduce this with the "Steps" debug flag enabled? You may see a new error that includes "No task can start on node"
Comment 8 Michael Hinton 2021-02-09 18:31:50 MST
(In reply to Michael Hinton from comment #7)
> Sorry, can you also reproduce this with the "Steps" debug flag enabled? You
> may see a new error that includes "No task can start on node"
You can turn off the gres debug flag, too
Comment 9 Michael Hinton 2021-02-09 18:36:00 MST
Can you also attach your current gres.conf?
Comment 10 Michael Hinton 2021-02-09 18:43:06 MST
Could you also restart or reconfigure sh03-13n14 with log level debug2 or debug3 so we can see better what AutoDetect is doing?
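
For instance, something along these lines in slurm.conf, followed by a slurmd restart on that node, should bump the slurmd log level:

    SlurmdDebug=debug3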

Do you still get the error without AutoDetect? Like, if you specified gres.conf manually for those nodes, does the error go away?
Comment 11 Kilian Cavalotti 2021-02-09 18:44:08 MST
(In reply to Michael Hinton from comment #7)
> Sorry, can you also reproduce this with the "Steps" debug flag enabled? You
> may see a new error that includes "No task can start on node"

With "scontrol setdebugflags +steps", I only see this in the slurmctld logs

Feb 09 17:39:19 sh03-sl01.int slurmctld[30893]: sched: Allocate JobId=17824584 NodeList=sh03-13n14 #CPUs=1 Partition=jamesz
Feb 09 17:39:21 sh03-sl01.int slurmctld[30893]: _pick_step_nodes: step pick 1-1 nodes, avail:sh02-03n08 idle:sh02-03n08 picked:NONE
Feb 09 17:39:21 sh03-sl01.int slurmctld[30893]: prolog_running_decr: Configuration for JobId=17824584 is complete

But that's not the right node, so it's probably from another job starting?

> Can you also attach your current gres.conf?

It's literally just this:

[root@sh03-13n14 ~]# cat /etc/slurm/gres.conf 
AutoDetect=nvml
Comment 12 Kilian Cavalotti 2021-02-09 18:49:18 MST
(In reply to Michael Hinton from comment #10)
> Could you also restart or reconfigure sh03-13n14 with log level debug2 or
> debug3 so we can see better what AutoDetect is doing?

Here are the slurmd start messages with debug3:

2021-02-09T17:43:56.269921-08:00 sh03-13n14 slurmd[132028]: Considering each NUMA node as a socket
2021-02-09T17:43:56.271512-08:00 sh03-13n14 slurmd[132028]: Considering each NUMA node as a socket
2021-02-09T17:43:56.271894-08:00 sh03-13n14 slurmd[132028]: GRES: Global AutoDetect=nvml(1)
2021-02-09T17:43:56.342958-08:00 sh03-13n14 slurmd[132028]: error: _nvml_get_mem_freqs: Failed to get supported memory frequencies for the GPU : Not Supported
2021-02-09T17:43:56.345605-08:00 sh03-13n14 slurmd[132028]: error: _nvml_get_mem_freqs: Failed to get supported memory frequencies for the GPU : Not Supported
2021-02-09T17:43:56.348185-08:00 sh03-13n14 slurmd[132028]: error: _nvml_get_mem_freqs: Failed to get supported memory frequencies for the GPU : Not Supported
2021-02-09T17:43:56.349945-08:00 sh03-13n14 slurmd[132028]: error: _nvml_get_mem_freqs: Failed to get supported memory frequencies for the GPU : Not Supported
2021-02-09T17:43:56.350177-08:00 sh03-13n14 slurmd[132028]: gpu/nvml: _get_system_gpu_list_nvml: 4 GPU system device(s) detected
2021-02-09T17:43:56.350261-08:00 sh03-13n14 slurmd[132028]: Gres GPU plugin: Normalizing gres.conf with system GPUs
2021-02-09T17:43:56.350511-08:00 sh03-13n14 slurmd[132028]: gres/gpu: _normalize_sys_gres_types: Could not find an unused configuration record with a GRES type that is a substring of system device `geforce_rtx_2080_ti`. Setting system GRES type to NULL
2021-02-09T17:43:56.350593-08:00 sh03-13n14 slurmd[132028]: gres/gpu: _normalize_sys_gres_types: Could not find an unused configuration record with a GRES type that is a substring of system device `geforce_rtx_2080_ti`. Setting system GRES type to NULL
2021-02-09T17:43:56.350673-08:00 sh03-13n14 slurmd[132028]: gres/gpu: _normalize_sys_gres_types: Could not find an unused configuration record with a GRES type that is a substring of system device `geforce_rtx_2080_ti`. Setting system GRES type to NULL
2021-02-09T17:43:56.350756-08:00 sh03-13n14 slurmd[132028]: gres/gpu: _normalize_sys_gres_types: Could not find an unused configuration record with a GRES type that is a substring of system device `geforce_rtx_2080_ti`. Setting system GRES type to NULL
2021-02-09T17:43:56.352020-08:00 sh03-13n14 slurmd[132028]: Gres GPU plugin: Final normalized gres.conf list:
2021-02-09T17:43:56.352100-08:00 sh03-13n14 slurmd[132028]:    GRES[gpu] Type:(null) Count:1 Cores(32):0-31  Links:0,0,0,-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0
2021-02-09T17:43:56.352179-08:00 sh03-13n14 slurmd[132028]:    GRES[gpu] Type:(null) Count:1 Cores(32):0-31  Links:0,0,-1,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia1
2021-02-09T17:43:56.352257-08:00 sh03-13n14 slurmd[132028]:    GRES[gpu] Type:(null) Count:1 Cores(32):0-31  Links:0,-1,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia2
2021-02-09T17:43:56.352343-08:00 sh03-13n14 slurmd[132028]:    GRES[gpu] Type:(null) Count:1 Cores(32):0-31  Links:-1,0,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia3
2021-02-09T17:43:56.352783-08:00 sh03-13n14 slurmd[132028]: gres/gpu: common_node_config_load: GRES: gpu device number 0(/dev/nvidia0):c 195:0 rwm
2021-02-09T17:43:56.352866-08:00 sh03-13n14 slurmd[132028]: gres/gpu: common_node_config_load: GRES: gpu device number 1(/dev/nvidia1):c 195:1 rwm
2021-02-09T17:43:56.352950-08:00 sh03-13n14 slurmd[132028]: gres/gpu: common_node_config_load: GRES: gpu device number 2(/dev/nvidia2):c 195:2 rwm
2021-02-09T17:43:56.353032-08:00 sh03-13n14 slurmd[132028]: gres/gpu: common_node_config_load: GRES: gpu device number 3(/dev/nvidia3):c 195:3 rwm
2021-02-09T17:43:56.353115-08:00 sh03-13n14 slurmd[132028]: Gres Name=gpu Type=(null) Count=1 Index=0 ID=7696487 File=/dev/nvidia0 Cores=0-31 CoreCnt=32 Links=0,0,0,-1
2021-02-09T17:43:56.353197-08:00 sh03-13n14 slurmd[132028]: Gres Name=gpu Type=(null) Count=1 Index=1 ID=7696487 File=/dev/nvidia1 Cores=0-31 CoreCnt=32 Links=0,0,-1,0
2021-02-09T17:43:56.353287-08:00 sh03-13n14 slurmd[132028]: Gres Name=gpu Type=(null) Count=1 Index=2 ID=7696487 File=/dev/nvidia2 Cores=0-31 CoreCnt=32 Links=0,-1,0,0
2021-02-09T17:43:56.353374-08:00 sh03-13n14 slurmd[132028]: Gres Name=gpu Type=(null) Count=1 Index=3 ID=7696487 File=/dev/nvidia3 Cores=0-31 CoreCnt=32 Links=-1,0,0,0
2021-02-09T17:43:56.353539-08:00 sh03-13n14 slurmd[132028]: topology/tree: init: topology tree plugin loaded
2021-02-09T17:43:56.353780-08:00 sh03-13n14 slurmd[132028]: topology/tree: _validate_switches: TOPOLOGY: warning -- no switch can reach all nodes through its descendants. If this is not intentional, fix the topology.conf file.
2021-02-09T17:43:56.360289-08:00 sh03-13n14 slurmd[132028]: route/default: init: route default plugin loaded
2021-02-09T17:43:56.361759-08:00 sh03-13n14 slurmd[132028]: task/affinity: init: task affinity plugin loaded with CPU mask 0xffffffff
2021-02-09T17:43:56.362535-08:00 sh03-13n14 slurmd[132028]: spank/lua: Loaded 0 plugins in this context
2021-02-09T17:43:56.362769-08:00 sh03-13n14 slurmd[132028]: cred/munge: init: Munge credential signature plugin loaded
2021-02-09T17:43:56.363000-08:00 sh03-13n14 slurmd[132028]: slurmd version 20.11.3 started
2021-02-09T17:43:56.364977-08:00 sh03-13n14 slurmd[132028]: slurmd started on Tue, 09 Feb 2021 17:43:56 -0800
2021-02-09T17:44:06.638341-08:00 sh03-13n14 slurmd[132028]: CPUs=32 Boards=1 Sockets=2 Cores=16 Threads=1 Memory=257614 TmpDisk=24564 Uptime=2248995 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
Waiting for data... (interrupt to abort)


> Do you still get the error without AutoDetect? Like, if you specified
> gres.conf manually for those nodes, does the error go away?

I tried replacing AutoDetect=nvml in gres.conf with this:

name=gpu    File=/dev/nvidia[0-3]   COREs=0-31

and restarting slurmd on the node, but the result is exactly the same:
* --gres=gpu:4 fails
* --gres=gpu:3 works
* -G 4 works


Cheers,
--
Kilian
Comment 13 Michael Hinton 2021-02-09 18:57:04 MST
Could you attach the `lstopo-no-graphics` output for node sh03-13n14, as well as what `slurmd -D` reports for that node? Just want to verify the topology, since it seems like something weird is going on.

(In reply to Kilian Cavalotti from comment #11)
> (In reply to Michael Hinton from comment #7)
> > Sorry, can you also reproduce this with the "Steps" debug flag enabled? You
> > may see a new error that includes "No task can start on node"
> 
> With "scontrol setdebugflags +steps", I only see this in the slurmctld logs
> 
> Feb 09 17:39:19 sh03-sl01.int slurmctld[30893]: sched: Allocate
> JobId=17824584 NodeList=sh03-13n14 #CPUs=1 Partition=jamesz
> Feb 09 17:39:21 sh03-sl01.int slurmctld[30893]: _pick_step_nodes: step pick
> 1-1 nodes, avail:sh02-03n08 idle:sh02-03n08 picked:NONE
> Feb 09 17:39:21 sh03-sl01.int slurmctld[30893]: prolog_running_decr:
> Configuration for JobId=17824584 is complete
Is your log level for the ctld at least verbose? If not, then it will miss this error:

https://github.com/SchedMD/slurm/blob/slurm-20-11-3-1/src/slurmctld/step_mgr.c#L1180-L1217
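
If needed, the ctld log level can also be raised on the fly with something like:

    scontrol setdebug verbose

and set back afterwards the same way.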

If your log level was at least verbose or debug[X], then we aren't hitting that ESLURM_INVALID_GRES, and so I think the only other spot we could be hitting is this:

https://github.com/SchedMD/slurm/blob/slurm-20-11-3-1/src/common/gres.c#L12440-L12524

For this last section, we could provide a debug patch to you that will print out the exact ESLURM_INVALID_GRES error being triggered.

(In reply to Kilian Cavalotti from comment #12)
> I tried replacing AutoDetect=nvml in gres.conf with this:
> 
> name=gpu    File=/dev/nvidia[0-3]   COREs=0-31
> 
> and restart slurmd on the node, but the result is exactly the same:
> * --gres=gpu:4 fails
> * --gres=gpu:3 works
> * -G 4 works
That seems good to me, so I don't think AutoDetect has anything to do with this.

-Michael
Comment 14 Kilian Cavalotti 2021-02-09 18:58:37 MST
Created attachment 17853 [details]
lstopo output
Comment 15 Kilian Cavalotti 2021-02-09 19:39:01 MST
(In reply to Michael Hinton from comment #13)
> Could you attach the `lstopo-no-graphics` output for node sh03-13n14, as
> well as what `slurmd -D` reports for that node? Just want to verify the
> topology, since it seems like something weird is going on.

The thing is that jobs can be successfully allocated and started with `-G 4`; it's just the `--gres gpu:4` syntax that generates the error. So it seems like the GPUs and topology are fine, right?

Here's the `slurmd -D` output:

[root@sh03-13n14 ~]# slurmd -D
slurmd: Considering each NUMA node as a socket
slurmd: Considering each NUMA node as a socket
slurmd: error: _nvml_get_mem_freqs: Failed to get supported memory frequencies for the GPU : Not Supported
slurmd: error: _nvml_get_mem_freqs: Failed to get supported memory frequencies for the GPU : Not Supported
slurmd: error: _nvml_get_mem_freqs: Failed to get supported memory frequencies for the GPU : Not Supported
slurmd: error: _nvml_get_mem_freqs: Failed to get supported memory frequencies for the GPU : Not Supported
slurmd: gpu/nvml: _get_system_gpu_list_nvml: 4 GPU system device(s) detected
slurmd: gres/gpu: _normalize_sys_gres_types: Could not find an unused configuration record with a GRES type that is a substring of system device `geforce_rtx_2080_ti`. Setting system GRES type to NULL
slurmd: gres/gpu: _normalize_sys_gres_types: Could not find an unused configuration record with a GRES type that is a substring of system device `geforce_rtx_2080_ti`. Setting system GRES type to NULL
slurmd: gres/gpu: _normalize_sys_gres_types: Could not find an unused configuration record with a GRES type that is a substring of system device `geforce_rtx_2080_ti`. Setting system GRES type to NULL
slurmd: gres/gpu: _normalize_sys_gres_types: Could not find an unused configuration record with a GRES type that is a substring of system device `geforce_rtx_2080_ti`. Setting system GRES type to NULL
slurmd: Gres Name=gpu Type=(null) Count=1 Index=0 ID=7696487 File=/dev/nvidia0 Cores=0-31 CoreCnt=32 Links=0,0,0,-1
slurmd: Gres Name=gpu Type=(null) Count=1 Index=1 ID=7696487 File=/dev/nvidia1 Cores=0-31 CoreCnt=32 Links=0,0,-1,0
slurmd: Gres Name=gpu Type=(null) Count=1 Index=2 ID=7696487 File=/dev/nvidia2 Cores=0-31 CoreCnt=32 Links=0,-1,0,0
slurmd: Gres Name=gpu Type=(null) Count=1 Index=3 ID=7696487 File=/dev/nvidia3 Cores=0-31 CoreCnt=32 Links=-1,0,0,0
slurmd: topology/tree: _validate_switches: TOPOLOGY: warning -- no switch can reach all nodes through its descendants. If this is not intentional, fix the topology.conf file.
slurmd: slurmd version 20.11.3 started
slurmd: killing old slurmd[132990]
slurmd: slurmd started on Tue, 09 Feb 2021 17:59:20 -0800
slurmd: CPUs=32 Boards=1 Sockets=2 Cores=16 Threads=1 Memory=257614 TmpDisk=24564 Uptime=2249919 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)


> Is your log level for the ctld at least verbose? If not, then it will miss
> this error:

Ah right, sorry. With debug logs in verbose, I got this for job 17827064:

2021-02-09T18:34:56.134368-08:00 sh03-sl01 slurmctld[31531]: STEPS: _pick_step_nodes: JobId=17827064 Currently running steps use 0 of allocated 1 CPUs on node sh03-13n14
2021-02-09T18:34:56.134470-08:00 sh03-sl01 slurmctld[31531]: STEPS: _pick_step_nodes: JobId=17827064 No task can start on node
2021-02-09T18:34:56.134594-08:00 sh03-sl01 slurmctld[31531]: STEPS: _slurm_rpc_job_step_create for JobId=17827064: Invalid generic resource (gres) specification
2021-02-09T18:34:56.135152-08:00 sh03-sl01 slurmctld[31531]: _job_complete: JobId=17827064 WTERMSIG 1
2021-02-09T18:34:56.135615-08:00 sh03-sl01 slurmctld[31531]: _job_complete: JobId=17827064 done
2021-02-09T18:34:56.210949-08:00 sh03-sl01 slurmctld[31531]: STEPS: Processing RPC details: REQUEST_STEP_COMPLETE for StepId=17827064.extern nodes 0-0 rc=0
2021-02-09T18:34:56.211055-08:00 sh03-sl01 slurmctld[31531]: STEPS: step dealloc on job node 0 (sh03-13n14) used: 0 of 1 CPUs
2021-02-09T18:34:56.211158-08:00 sh03-sl01 slurmctld[31531]: STEPS: _slurm_rpc_step_complete StepId=17827064.extern usec=140



That means we're in the case where gres_cpus < avail_cpus, right?

Cheers,
--
Kilian
Comment 16 Michael Hinton 2021-02-09 23:38:15 MST
(In reply to Kilian Cavalotti from comment #15)
> That means we're in the case where gres_cpus < avail_cpus, right?
Yes. I believe gres_plugin_step_test() is returning 0 both times, possibly from _step_test(), eventually setting gres_cpus, total_cpus, avail_cpus, avail_tasks, and total_tasks all to 0 and triggering the ESLURM_INVALID_GRES error. _step_test() code:

https://github.com/SchedMD/slurm/blob/slurm-20-11-3-1/src/common/gres.c#L12267-L12360

You can see that there are slightly different code paths for gres_per_node (--gres=gpu) vs gres_per_step (--gpus) in _step_test(), which might account for the difference.

I've actually seen this specific ESLURM_INVALID_GRES error get triggered back in 20.02.[0,1] for a customer, and it was due to some GRES bitmaps getting corrupted. I'm not sure if something similar is happening here, but it's almost like the step is expecting 4 GPUs but the job's GRES bitmaps have < 4 GPUs.

I'll try to see if I can emulate your config and hopefully reproduce the issue tomorrow. If I can't, I might resort to giving you a debug patch that can illuminate exactly what's going on, if you are willing.

Thanks,
-Michael
Comment 17 Michael Hinton 2021-02-09 23:55:00 MST
(In reply to Michael Hinton from comment #16)
> I've actually seen this specific ESLURM_INVALID_GRES error get triggered
> back in 20.02.[0,1] for a customer, and it was due to some GRES bitmaps
> getting corrupted.
I forgot that you guys were also partially privy to that whole fiasco :) But I don't believe this is nearly as serious as that was.
Comment 18 Kilian Cavalotti 2021-02-10 09:24:44 MST
(In reply to Michael Hinton from comment #16)
> I've actually seen this specific ESLURM_INVALID_GRES error get triggered
> back in 20.02.[0,1] for a customer, and it was due to some GRES bitmaps
> getting corrupted. I'm not sure if something similar is happening here, but
> it's almost like the step is expecting 4 GPUs but the job's GRES bitmaps
> have < 4 GPUs.
>
> I'll try to see if I can emulate your config and hopefully reproduce the
> issue tomorrow. If I can't, I might resort to giving you a debug patch that
> can illuminate exactly what's going on, if you are willing.

Good, thanks! And yes, a debugging patch would be fine if you can't reproduce the issue on your end (if you can make it log at the "info" debug level, that'd be best, as we have submission rates that sometimes go into the 500 jobs/second range, so logs quickly get crowded).

Thanks!
--
Kilian
Comment 19 Michael Hinton 2021-02-16 10:14:30 MST
Kilian,

(In reply to Kilian Cavalotti from comment #12)
> I tried replacing AutoDetect=nvml in gres.conf with this:
> 
> name=gpu    File=/dev/nvidia[0-3]   COREs=0-31
> 
> and restart slurmd on the node, but the result is exactly the same:
> * --gres=gpu:4 fails
> * --gres=gpu:3 works
> * -G 4 works
Can you still reproduce the error for the node if you set up a gres.conf like this?:

    Name=gpu Type=rtx File=/dev/nvidia[0-1] Cores=0-15
    Name=gpu Type=rtx File=/dev/nvidia[2-3] Cores=16-31

-Michael
Comment 20 Kilian Cavalotti 2021-02-16 11:01:16 MST
Created attachment 17950 [details]
slurmd logs

Hi Michael, 

(In reply to Michael Hinton from comment #19)
> Can you still reproduce the error for the node if you set up a gres.conf
> like this?:
> 
>     Name=gpu Type=rtx File=/dev/nvidia[0-1] Cores=0-15
>     Name=gpu Type=rtx File=/dev/nvidia[2-3] Cores=16-31

thanks for the suggestion, and that's something I tried too, but unfortunately, it didn't help:

[root@sh03-13n14 ~]# cat /etc/slurm/gres.conf 
Name=gpu Type=rtx File=/dev/nvidia[0-1] Cores=0-15
Name=gpu Type=rtx File=/dev/nvidia[2-3] Cores=16-31

[root@sh03-13n14 ~]# systemctl restart slurmd
[root@sh03-13n14 ~]# 

and still:
$ srun -p jamesz -w sh03-13n14  -t 0:2:0 --gres=gpu:4 nvidia-smi -L
srun: job 18400762 queued and waiting for resources
srun: job 18400762 has been allocated resources
srun: error: Unable to create step for job 18400762: Invalid generic resource (gres) specification
$

But it still works with --gres=gpu:3 and --gpus 4:

$ srun -p jamesz -w sh03-13n14  -t 0:2:0 --mem 1G --gres=gpu:3 nvidia-smi -L
srun: job 18401201 queued and waiting for resources
srun: job 18401201 has been allocated resources
GPU 0: GeForce RTX 2080 Ti (UUID: GPU-4558f0f3-a437-5acc-4501-4b066593f31a)
GPU 1: GeForce RTX 2080 Ti (UUID: GPU-565c7761-2b61-7cbc-69bf-3c9dff3b59a1)
GPU 2: GeForce RTX 2080 Ti (UUID: GPU-ae9eb151-45c4-4af3-3513-01b533f08770)

$ srun -p jamesz -w sh03-13n14  -t 0:2:0 --mem 1G --gpus 4 nvidia-smi -L
srun: job 18402637 queued and waiting for resources
srun: job 18402637 has been allocated resources
GPU 0: GeForce RTX 2080 Ti (UUID: GPU-1118f950-5276-c3ca-dce8-5c1a24ca732e)
GPU 1: GeForce RTX 2080 Ti (UUID: GPU-4558f0f3-a437-5acc-4501-4b066593f31a)
GPU 2: GeForce RTX 2080 Ti (UUID: GPU-565c7761-2b61-7cbc-69bf-3c9dff3b59a1)
GPU 3: GeForce RTX 2080 Ti (UUID: GPU-ae9eb151-45c4-4af3-3513-01b533f08770)

I'm attaching the slurmd logs, but they don't seem to show much. Not too sure about the two "WARNING: A line in gres.conf for GRES gpu:rtx has 2 more configured than expected in slurm.conf. Ignoring extra GRES." lines, though.

Thanks!
--
Kilian
Comment 21 Michael Hinton 2021-02-16 15:45:08 MST
(In reply to Kilian Cavalotti from comment #20)
> [root@sh03-13n14 ~]# cat /etc/slurm/gres.conf 
> Name=gpu Type=rtx File=/dev/nvidia[0-1] Cores=0-15
> Name=gpu Type=rtx File=/dev/nvidia[2-3] Cores=16-31
> 
Oops, my mistake. Your GPUs are type-less in slurm.conf, so it's ignoring the Type=rtx gres.conf entries and just going off of slurm.conf. Could you try it like this instead and see if that works?

    Name=gpu File=/dev/nvidia[0-1] Cores=0-15
    Name=gpu File=/dev/nvidia[2-3] Cores=16-31

-Michael
Comment 23 Michael Hinton 2021-02-16 18:56:53 MST
Hey Kilian,

Sorry for the delay. I wasn't able to reproduce things locally, so attached is a v1 debug patch for you to try. After applying the patch, only the slurmctld should need to be restarted.

Please do `scontrol setdebugflags +elasticsearch` to enable the logs and reproduce. You can turn off the v1 logging with a subsequent `scontrol setdebugflags -elasticsearch`. The log statements emitted are at level info, and they are disabled by default. Every log is prepended with "BUG_10827" and the job and/or step ID.

commit 2/2 has optional debug code that I don't think we'll need, but if we do, it's enabled with `scontrol setdebugflags +db_tres`.

Thanks,
-Michael

P.S. elasticsearch and db_tres were just unused debug flags that I repurposed for this patch only, so you won't get any other log junk :)
Comment 24 Kilian Cavalotti 2021-02-17 10:33:24 MST
(In reply to Michael Hinton from comment #21)
> Oops, my mistake. Your GPUs are type-less in slurm.conf, so it's ignoring
> the Type=rtx gres.conf entries and just going off of slurm.conf. Could you
> try it like this instead and see if that works?
> 
>     Name=gpu File=/dev/nvidia[0-1] Cores=0-15
>     Name=gpu File=/dev/nvidia[2-3] Cores=16-31
> 

No worries, but that doesn't make a difference, unfortunately.
I'll test the patch next.

Thanks!
--
Kilian
Comment 25 Kilian Cavalotti 2021-02-17 15:50:56 MST
Created attachment 17974 [details]
slurmctld/slurmd logs with debug patch

All right, so first of all, an observation: after marking the node down *AND* restarting slurmctld, the problem _sometimes_ disappears. But only for that given node, and not in all cases.

Of course, that probably doesn't help much in understanding the root cause, so I've applied the patch you provided and here are the logs for a node that still exhibits the problem.

The job submission was: 

$ srun -p menon -w sh03-13n03  -t 0:2:0 --gres=gpu:4 nvidia-smi -L
srun: job 18518225 queued and waiting for resources
srun: job 18518225 has been allocated resources
srun: error: Unable to create step for job 18518225: Invalid generic resource (gres) specification

That's the full slurmctld log, so to make things a bit easier to find: 
- the Slurm controller was restarted at 2021-02-17T14:33:09
- job 18518225 was submitted at 2021-02-17T14:40:44

The relevant lines are:
-- 8< ---------------------------------------------------------------
BUG_10827: JobId=18518225: _calc_cpus_per_task: cpus_per_task (1) = step_specs->cpu_count (1) / step_specs->num_tasks (1) 
BUG_10827: JobId=18518225: _calc_cpus_per_task: num_tasks (1) -= (cpu_array_value[0] (1) / cpus_per_task (1)) * job_ptr->job_resrcs->cpu_array_reps[0] (1) == (1)
BUG_10827: JobId=18518225: _calc_cpus_per_task: cpus_per_task = 1                
BUG_10827: JobId=18518225: _step_test: A: min_gres (4) > gres_cnt (3); core_cnt = 0; max_rem_nodes = 1; ignore_alloc = 1
BUG_10827: JobId=18518225: _step_test: bit_set_count(job_gres_ptr->gres_bit_alloc[0]) = 3 (3)
BUG_10827: JobId=18518225: _step_test: step_gres_ptr->gres_per_node = 4          
BUG_10827: JobId=18518225: _pick_step_nodes: A: gres_cpus = 0                    
BUG_10827: JobId=18518225: _step_test: A: min_gres (4) > gres_cnt (3); core_cnt = 0; max_rem_nodes = 1; ignore_alloc = 0
BUG_10827: JobId=18518225: _step_test: bit_set_count(job_gres_ptr->gres_bit_alloc[0]) = 3 (3)
BUG_10827: JobId=18518225: _step_test: step_gres_ptr->gres_per_node = 4          
BUG_10827: JobId=18518225: _pick_step_nodes: B: gres_cpus = 0                    
BUG_10827: JobId=18518225: _pick_step_nodes: avail_cpus = 1, total_cpus = 0, cpus_per_task = 1, max_rem_nodes = 1, node_inx = 0, first_step_node = 1
BUG_10827: JobId=18518225: _pick_step_nodes: avail_tasks (0) and total_tasks (0) /= cpus_per_task (1)
BUG_10827: JobId=18518225: _pick_step_nodes: ESLURM_INVALID_GRES    
-- 8< ---------------------------------------------------------------

Slurmd was running with AutoDetect=Nvml in gres.conf

I hope this will help pinpoint the issue!

Thanks,
--
Kilian
Comment 26 Michael Hinton 2021-02-17 16:51:47 MST
Thanks!

(In reply to Kilian Cavalotti from comment #25)
> All right so first of all, an observation: after passing the node down *AND*
> restarting slurmctld, the problem _sometimes_ disappears. But only for that
> given node, and not in all cases.
> 
> ...
> 
> BUG_10827: JobId=18518225: _step_test: A: min_gres (4) > gres_cnt (3);
> core_cnt = 0; max_rem_nodes = 1; ignore_alloc = 1
> BUG_10827: JobId=18518225: _step_test:
> bit_set_count(job_gres_ptr->gres_bit_alloc[0]) = 3 (3)
> BUG_10827: JobId=18518225: _step_test: step_gres_ptr->gres_per_node = 4     
> 
> BUG_10827: JobId=18518225: _pick_step_nodes: A: gres_cpus = 0               
> 
> BUG_10827: JobId=18518225: _step_test: A: min_gres (4) > gres_cnt (3);
> core_cnt = 0; max_rem_nodes = 1; ignore_alloc = 0
> BUG_10827: JobId=18518225: _step_test:
> bit_set_count(job_gres_ptr->gres_bit_alloc[0]) = 3 (3)
> BUG_10827: JobId=18518225: _step_test: step_gres_ptr->gres_per_node = 4     
> 
> BUG_10827: JobId=18518225: _pick_step_nodes: B: gres_cpus = 0               
> 
> BUG_10827: JobId=18518225: _pick_step_nodes: avail_cpus = 1, total_cpus = 0,
> cpus_per_task = 1, max_rem_nodes = 1, node_inx = 0, first_step_node = 1
> BUG_10827: JobId=18518225: _pick_step_nodes: avail_tasks (0) and total_tasks
> (0) /= cpus_per_task (1)
> BUG_10827: JobId=18518225: _pick_step_nodes: ESLURM_INVALID_GRES    
> -- 8< ---------------------------------------------------------------
The big takeaway here is that the job's gres_bit_alloc[] bitmap (i.e. bitmap of allocated GPUs on the node) only had a count of 3, not 4 like we would expect. So the job thinks it only had 3 GPUs allocated, while the step was expecting it to have 4.

It seems like the job's gres_bit_alloc bitmap is getting corrupted somehow (not being able to reproduce the issue consistently is also a red flag for bitmap corruption issues, in my experience, so that's helpful to know).

I'll look into making another debug patch that can hopefully help track down where the job's gres_bit_alloc is getting changed unexpectedly.

In the meantime, be on the lookout for any clues on the necessary conditions to reproduce this. E.g., last year around this time, 20.02 had some node GRES bitmap corruption issues, and IIRC it originated from nodes where jobs got preempted, and I think only if there were multiple jobs on the node. So anything along those lines. Or, if you can prove that the issue *doesn't* happen under certain conditions, that's helpful, too.

I think one thing we should try is to not use AutoDetect=nvml (at least for the AMD nodes) and explicitly set the gres.conf info for them. It might not affect currently-affected nodes, but sometimes it takes time to "flush out" corrupted bitmap issues (and might require draining the node until all jobs are done before it can "reset"). So my recommendation is to run without AutoDetect for a few days to see if that reduces the number of affected nodes.

How disruptive is this?

Thanks,
-Michael
Comment 27 Kilian Cavalotti 2021-02-17 18:07:45 MST
(In reply to Michael Hinton from comment #26)

> The big takeaway here is that the job's gres_bit_alloc[] bitmap (i.e. bitmap
> of allocated GPUs on the node) only had a count of 3, not 4 like we would
> expect. So the job thinks it only had 3 GPUs allocated, while the step was
> expecting it to have 4.
> 
> It seems like the job's gres_bit_alloc bitmap is getting corrupted somehow
> (not being able to reproduce the issue consistently is also a red flag for
> bitmap corruption issues, in my experience, so that's helpful to know).

Given that the issue sometimes disappears after a controller restart and marking the node down, I agree that it looks like a bitmap corruption, indeed.

> I'll look into making another debug patch that can hopefully help track down
> where the job's gres_bit_alloc is getting changed unexpectedly.

Sounds great, thanks!

> In the meantime, be on the lookout for any clues on the necessary conditions
> to reproduce this. E.g., last year around this time, 20.02 had some node
> GRES bitmap corruption issues, and IIRC it originated from nodes where jobs
> got preempted, and I think only if there were multiple jobs on the node. 

Oh yeah, that sounds like 100% of our workload :)

Although in this current instance, the issue also seems to happen on nodes where no preemptable jobs can run, so preemption may not be a factor. But we definitely have multiple jobs running on all our nodes.

> I think one thing we should try is to not use AutoDetect=nvml (at least for
> the AMD nodes) and explicitly set the gres.conf info for them. It might not
> affect currently-affected nodes, but sometimes it takes time to "flush out"
> corrupted bitmap issues (and might require draining the node until all jobs
> are done before it can "reset"). So my recommendation is to run without
> AutoDetect for a few days to see if that reduces the number of affected
> nodes.

Ok, I can try to continue to play with the gres.conf contents, but as far as I can tell, the autodetect part seems to be detecting GPUs correctly, and manual specification of the gres devices doesn't seem to make much of a difference.

Draining/downing and restarting the controller seems to help a bit, although I didn't find a reproducible sequence of actions that cures the problem in 100% of the cases. But I'll continue to look.

> How disruptive is this?

Not too much, because the majority of our GPU workload uses just 1 GPU at a time. Since it doesn't affect all the nodes, and because using --gpus works, it's not a huge blocker. 

But having bitmap corruption is quite worrisome so I'd very much like to get to the bottom of this and see the underlying problem fixed.

Thank you!
--
Kilian
Comment 28 Kilian Cavalotti 2021-02-19 10:26:28 MST
Hi Michael, 

Just a quick note to let you know that we deployed 20.11.4 this morning, and the issue is still present (the changelog had a couple of gres-related entries, so I just wanted to give it a try for the sake of completeness).

Cheers,
--
Kilian
Comment 29 Michael Hinton 2021-02-19 11:20:13 MST
(In reply to Kilian Cavalotti from comment #28)
> Just a quick note to let you know that we deployed 20.11.4 this morning, and
> that the issue is still present (the changelog had a couple gres-related
> entries so I'd just wanted to give it a try for the sake of completeness).
Ok, good to know.

I know you temporarily tried not using AutoDetect, with the same results, but what I was suggesting is to not use AutoDetect for an extended period of time (maybe a few days) to see if the number of issues/affected nodes goes down. Is that something you could try?

Thanks,
Michael
Comment 30 Michael Hinton 2021-02-19 11:23:00 MST
(In reply to Michael Hinton from comment #29)
> I know you temporarily tried not using AutoDetect, with the same results,
> but what I was suggesting is to not use AutoDetect for an extended period of
> time (maybe a few days) to see if the number of issues/affected nodes goes
> down. Is that something you could try?
You should be able to use AutoDetect in "sanity check" mode to derive the correct gres.conf, and then you can just turn off AutoDetect.

From https://slurm.schedmd.com/gres.html:

"By default, all system-detected devices are added to the node. However, if Type and File in gres.conf match a GPU on the system, any other properties explicitly specified (e.g. Cores or Links) can be double-checked against it. If the system-detected GPU differs from its matching GPU configuration, then the GPU is omitted from the node with an error. This allows gres.conf to serve as an optional sanity check and notifies administrators of any unexpected changes in GPU properties."
Comment 31 Kilian Cavalotti 2021-02-19 11:29:52 MST
Thanks for the suggestion!

We have such a variety of different configurations and GPU topologies that Autodetect was really a godsend for us, and I would really prefer not having to go back to manual specification. :)

Now, since we suspect that the GRES bitmaps may be corrupted for some of those nodes, what would be the best way to reset/recreate them?

Cheers,
--
Kilian
Comment 32 Michael Hinton 2021-02-19 11:42:36 MST
(In reply to Kilian Cavalotti from comment #31)
> Thanks for the suggestion!
> 
> We have such a variety of different configurations and GPU topologies that
> Autodetect was really a godsend for us, and I would really prefer not having
> to go back to manual specification. :)
> 
> Now, since we suspect that the GRES bitmaps may be corrupted for some of
> those nodes, what would be the best way to reset/recreate them?
Well, we know the job bitmaps are getting corrupted somehow, but I haven't seen any direct evidence that the node bitmaps are corrupted yet.

If the node bitmaps do get corrupted, I think the best way to deal with it is to drain the nodes until they're idle and then restart them.
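
A minimal sketch of that sequence, assuming "restart" means restarting the daemons (node name and drain reason are just examples):

scontrol update NodeName=sh03-13n14 State=DRAIN Reason="gres bitmap check"
# wait for the node to finish its running jobs and go idle+drained, then:
ssh sh03-13n14 systemctl restart slurmd
systemctl restart slurmctld
scontrol update NodeName=sh03-13n14 State=RESUME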

So this is the command that fails, right?

    srun -p jamesz -w sh03-13n14 --gres=gpu:4 --pty bash

That is a job allocation and step creation all in one go. Could we try exploring things a bit? What if we separated out the job and step creation, then check to see what the node reports in between?

Maybe something like:

salloc -p jamesz -w sh03-13n14 --gres=gpu:4
scontrol show node -d sh03-13n14
scontrol show job -d $SLURM_JOB_ID
srun --gres=gpu:4 sleep 60 &
scontrol show node -d sh03-13n14
scontrol show job -d $SLURM_JOB_ID

1) Does this fail in the same way as the all-in-one srun?
2) Do the job and node report anything unexpected?

You could derive a similar test with sbatch and a batch script, if that is more convenient.
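
A rough sbatch version of the same test (just a sketch):

#!/bin/bash
#SBATCH -p jamesz -w sh03-13n14 --gres=gpu:4
scontrol show node -d sh03-13n14
scontrol show job -d $SLURM_JOB_ID
srun --gres=gpu:4 nvidia-smi -L
scontrol show job -d $SLURM_JOB_ID

and then submit it with sbatch and check the job's output file afterwards.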

-Michael
Comment 33 Michael Hinton 2021-02-19 11:43:49 MST
(In reply to Michael Hinton from comment #32)
> salloc -p jamesz -w sh03-13n14 --gres=gpu:4
> scontrol show node -d sh03-13n14
> scontrol show job -d $SLURM_JOB_ID
> srun --gres=gpu:4 sleep 60 &
> scontrol show node -d sh03-13n14
> scontrol show job -d $SLURM_JOB_ID
Maybe throw in an scontrol show steps, as well
Comment 34 Michael Hinton 2021-02-19 11:44:59 MST
scontrol show steps $SLURM_JOB_ID, I should say
Comment 35 Kilian Cavalotti 2021-02-19 16:00:48 MST
(In reply to Michael Hinton from comment #32)
> Well, we know the job bitmaps are getting corrupted somehow, but I haven't
> seen any direct evidence that the node bitmaps are corrupted yet.

Got it.

> If the node bitmaps do get corrupted, I think the best way to deal with it
> is to drain the nodes until it's idle and then restart the nodes.

So that's what I mentioned in #c25:
>> after passing the node down *AND* restarting slurmctld, the problem _sometimes_ disappears. But only for that given node, and not in all cases.

But unfortunately, some nodes still exhibit the same problem after having been drained, so it doesn't seem like a 100% working solution.

> So this is the command that fails, right?
> 
>     srun -p jamesz -w sh03-13n14 --gres=gpu:4 --pty bash

Yep!

> That is a job allocation and step creation all in one go. Could we try
> exploring things a bit? What if we separated out the job and step creation,
> then check to see what the node reports in between?
> 
> Maybe something like:
> 
> salloc -p jamesz -w sh03-13n14 --gres=gpu:4
> scontrol show node -d sh03-13n14
> scontrol show job -d $SLURM_JOB_ID
> srun --gres=gpu:4 sleep 60 &
> scontrol show node -d sh03-13n14
> scontrol show job -d $SLURM_JOB_ID
> 
> 1) Does this fail in the same way as the all-in-one srun?
> 2) Do the job and node report anything unexpected?

Very good idea, and very interesting results! 
It looks like the issue is with the allocation step, which only allocates 3 GPUs:

-- 8< -------------------------------------------------------
$ salloc -p jamesz -w sh03-13n14 --gres=gpu:4
salloc: Pending job allocation 18663120
salloc: job 18663120 queued and waiting for resources
salloc: job 18663120 has been allocated resources
salloc: Granted job allocation 18663120
salloc: Waiting for resource configuration
salloc: Nodes sh03-13n14 are ready for job

$ scontrol show node -d sh03-13n14
NodeName=sh03-13n14 Arch=x86_64 CpuBind=cores CoresPerSocket=16 
   CPUAlloc=1 CPUTot=32 CPULoad=0.01
   AvailableFeatures=IB:HDR,CPU_MNF:AMD,CPU_GEN:RME,CPU_SKU:7502P,CPU_FRQ:2.50GHz,GPU_GEN:TUR,GPU_BRD:GEFORCE,GPU_SKU:RTX_2080Ti,GPU_MEM:11GB,GPU_CC:7.5,CLASS:SH3_G4FP32
   ActiveFeatures=IB:HDR,CPU_MNF:AMD,CPU_GEN:RME,CPU_SKU:7502P,CPU_FRQ:2.50GHz,GPU_GEN:TUR,GPU_BRD:GEFORCE,GPU_SKU:RTX_2080Ti,GPU_MEM:11GB,GPU_CC:7.5,CLASS:SH3_G4FP32
   Gres=gpu:4(S:0-1)
   GresDrain=N/A
   GresUsed=gpu:(null):3(IDX:0-2)
   NodeAddr=sh03-13n14 NodeHostName=sh03-13n14 Version=20.11.4
   OS=Linux 3.10.0-957.27.2.el7.x86_64 #1 SMP Mon Jul 29 17:46:05 UTC 2019 
   RealMemory=256000 AllocMem=6400 FreeMem=212252 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=144441 Owner=N/A MCS_label=N/A
   Partitions=jamesz,owners 
   BootTime=2021-01-14T17:00:52 SlurmdStartTime=2021-02-19T09:08:42
   CfgTRES=cpu=32,mem=250G,billing=154,gres/gpu=4
   AllocTRES=cpu=1,mem=6400M,gres/gpu=3
   CapWatts=n/a
   CurrentWatts=n/s AveWatts=n/s
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Comment=(null)
-- 8< -------------------------------------------------------

See how GresUsed=gpu:(null):3(IDX:0-2) and AllocTRES=cpu=1,mem=6400M,gres/gpu=3? I would definitely expect gres/gpu=4 here.

Then scontrol show job confirms it:
-- 8< -------------------------------------------------------
$ scontrol show job -d $SLURM_JOB_ID
JobId=18663120 JobName=interactive
   UserId=kilian(215845) GroupId=ruthm(32264) MCS_label=N/A
   Priority=112400 Nice=0 Account=ruthm QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:00:55 TimeLimit=02:00:00 TimeMin=N/A
   SubmitTime=2021-02-19T10:48:43 EligibleTime=2021-02-19T10:48:43
   AccrueTime=2021-02-19T10:49:29
   StartTime=2021-02-19T10:49:33 EndTime=2021-02-19T12:49:59 Deadline=N/A
   PreemptEligibleTime=2021-02-19T10:49:33 PreemptTime=None
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-02-19T10:49:33
   Partition=jamesz AllocNode:Sid=sh03-ln03.stanford.edu:75441
   ReqNodeList=sh03-13n14 ExcNodeList=(null)
   NodeList=sh03-13n14
   BatchHost=sh03-13n14
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=6400M,node=1,billing=47,gres/gpu=3
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   JOB_GRES=gpu:3
     Nodes=sh03-13n14 CPU_IDs=0 Mem=6400 GRES=gpu:3(IDX:0-2)
   MinCPUsNode=1 MinMemoryCPU=6400M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/home/users/kilian
   Power=
   TresPerNode=gpu:4
   NtasksPerTRES:0
-- 8< -------------------------------------------------------

We have JOB_GRES=gpu:3 but TresPerNode=gpu:4?


Then we got the "invalid gres specification" when steps are created by srun:
-- 8< -------------------------------------------------------
$ srun --gres=gpu:4 sleep 60 &
[1] 7292
srun: error: Unable to create step for job 18663120: Invalid generic resource (gres) specification
[1]+  Exit 1                  srun --gres=gpu:4 sleep 60

$ scontrol show node -d sh03-13n14
NodeName=sh03-13n14 Arch=x86_64 CpuBind=cores CoresPerSocket=16 
   CPUAlloc=1 CPUTot=32 CPULoad=0.01
   AvailableFeatures=IB:HDR,CPU_MNF:AMD,CPU_GEN:RME,CPU_SKU:7502P,CPU_FRQ:2.50GHz,GPU_GEN:TUR,GPU_BRD:GEFORCE,GPU_SKU:RTX_2080Ti,GPU_MEM:11GB,GPU_CC:7.5,CLASS:SH3_G4FP32
   ActiveFeatures=IB:HDR,CPU_MNF:AMD,CPU_GEN:RME,CPU_SKU:7502P,CPU_FRQ:2.50GHz,GPU_GEN:TUR,GPU_BRD:GEFORCE,GPU_SKU:RTX_2080Ti,GPU_MEM:11GB,GPU_CC:7.5,CLASS:SH3_G4FP32
   Gres=gpu:4(S:0-1)
   GresDrain=N/A
   GresUsed=gpu:(null):3(IDX:0-2)
   NodeAddr=sh03-13n14 NodeHostName=sh03-13n14 Version=20.11.4
   OS=Linux 3.10.0-957.27.2.el7.x86_64 #1 SMP Mon Jul 29 17:46:05 UTC 2019 
   RealMemory=256000 AllocMem=6400 FreeMem=212252 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=144441 Owner=N/A MCS_label=N/A
   Partitions=jamesz,owners 
   BootTime=2021-01-14T17:00:52 SlurmdStartTime=2021-02-19T09:08:42
   CfgTRES=cpu=32,mem=250G,billing=154,gres/gpu=4
   AllocTRES=cpu=1,mem=6400M,gres/gpu=3
   CapWatts=n/a
   CurrentWatts=n/s AveWatts=n/s
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Comment=(null)

$ scontrol show job -d $SLURM_JOB_ID
JobId=18663120 JobName=interactive
   UserId=kilian(215845) GroupId=ruthm(32264) MCS_label=N/A
   Priority=112400 Nice=0 Account=ruthm QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:01:46 TimeLimit=02:00:00 TimeMin=N/A
   SubmitTime=2021-02-19T10:48:43 EligibleTime=2021-02-19T10:48:43
   AccrueTime=2021-02-19T10:49:29
   StartTime=2021-02-19T10:49:33 EndTime=2021-02-19T12:49:59 Deadline=N/A
   PreemptEligibleTime=2021-02-19T10:49:33 PreemptTime=None
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-02-19T10:49:33
   Partition=jamesz AllocNode:Sid=sh03-ln03.stanford.edu:75441
   ReqNodeList=sh03-13n14 ExcNodeList=(null)
   NodeList=sh03-13n14
   BatchHost=sh03-13n14
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=6400M,node=1,billing=47,gres/gpu=3
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   JOB_GRES=gpu:3
     Nodes=sh03-13n14 CPU_IDs=0 Mem=6400 GRES=gpu:3(IDX:0-2)
   MinCPUsNode=1 MinMemoryCPU=6400M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/home/users/kilian
   Power=
   TresPerNode=gpu:4
   NtasksPerTRES:0

$ scontrol show steps $SLURM_JOB_ID
StepId=18663120.extern UserId=215845 StartTime=2021-02-19T10:49:33 TimeLimit=UNLIMITED
   State=RUNNING Partition=jamesz NodeList=sh03-13n14
   Nodes=1 CPUs=0 Tasks=1 Name=extern Network=(null)
   TRES=(null)
   ResvPorts=(null)
   CPUFreqReq=Default
   SrunHost:Pid=(null):0
-- 8< -------------------------------------------------------

Step creation fails, so no steps are reported, as expected.

So I guess the issue is at GRES allocation time, then?

Cheers,
--
Kilian
Comment 36 Michael Hinton 2021-02-19 16:08:04 MST
Ok, great! (I mean not great) So we can forget about the step stuff - it's just naturally failing because the job allocation is bad to begin with.

Could you briefly set the select and gres debug flags and reproduce the issue with salloc? Let's start with that, and if I can't make sense of it still, I'll work on another debug patch. Could you also do salloc with --gpus and the debug flags and attach the results, so we can compare the two? Thanks!
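
For reference, a sketch of toggling those flags at run time (DebugFlags names as in slurm.conf):

scontrol setdebugflags +Gres
scontrol setdebugflags +SelectType
# reproduce with salloc, then turn them back off:
scontrol setdebugflags -Gres
scontrol setdebugflags -SelectType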

-Michael
Comment 37 Michael Hinton 2021-02-19 16:08:59 MST
And I just want the slurmctld.log output
Comment 38 Kilian Cavalotti 2021-02-19 17:44:35 MST
(In reply to Michael Hinton from comment #36)
> Could you briefly set the select and gres debug flags and reproduce the
> issue with salloc? 

All right, I'm trying this, but it's pretty challenging to get anything through with the SelectType debugflag: it makes the controller grind to a halt and spend all its cycles logging, so jobs are almost not getting scheduled anymore... :(

We have currently 25,000 jobs in queue and new ones are submitted at up to 500 jobs/sec, so the additional logging part is really taking its toll. :)

I'll keep trying and keep you posted.

Cheers,
--
Kilian
Comment 39 Kilian Cavalotti 2021-02-19 18:15:31 MST
Created attachment 18040 [details]
slurmctld logs with gres and selecttype debugflags

All right, I have things to report :)


* First case (with --gres gpu:4) was job 18681496, and failed as expected:

$ salloc -p jamesz -w sh03-13n14 --gres=gpu:4
salloc: Pending job allocation 18681496
salloc: job 18681496 queued and waiting for resources
salloc: job 18681496 has been allocated resources
salloc: Granted job allocation 18681496
salloc: Waiting for resource configuration
salloc: Nodes sh03-13n14 are ready for job
$ srun nvidia-smi -L
srun: error: Unable to create step for job 18681496: Invalid generic resource (gres) specification


* Second case (with --gpus 4) was job 18681498, and made me realize that with --gpus, I got a multi-node allocation, which I guess is to be expected as well:

$ salloc -p jamesz -w sh03-13n14 --gpus 4
salloc: Pending job allocation 18681498
salloc: job 18681498 queued and waiting for resources
salloc: job 18681498 has been allocated resources
salloc: Granted job allocation 18681498
salloc: Waiting for resource configuration
salloc: Nodes sh03-13n[12,14-15] are ready for job

Step creation worked and used GPUs from those 3 different nodes:

$ srun nvidia-smi -L
GPU 0: GeForce RTX 2080 Ti (UUID: GPU-1118f950-5276-c3ca-dce8-5c1a24ca732e)
GPU 0: GeForce RTX 2080 Ti (UUID: GPU-f4efa9c1-4ad9-5608-174b-4fdf1d30ff50)
GPU 1: GeForce RTX 2080 Ti (UUID: GPU-15038071-b900-d120-8288-60f3a9c509ac)
GPU 0: GeForce RTX 2080 Ti (UUID: GPU-9d889fa6-f52f-09e6-9d01-3cd78b490514)


* Finally, adding -N1 to ensure single-node allocation, was job 18681610:

$ salloc -p jamesz -w sh03-13n14 -N 1 --gpus 4
salloc: Pending job allocation 18681610
salloc: job 18681610 queued and waiting for resources
salloc: job 18681610 has been allocated resources
salloc: Granted job allocation 18681610
salloc: Waiting for resource configuration
salloc: Nodes sh03-13n14 are ready for job

And step creation worked as we noted before:

$ srun nvidia-smi -L
GPU 0: GeForce RTX 2080 Ti (UUID: GPU-1118f950-5276-c3ca-dce8-5c1a24ca732e)
GPU 1: GeForce RTX 2080 Ti (UUID: GPU-4558f0f3-a437-5acc-4501-4b066593f31a)
GPU 2: GeForce RTX 2080 Ti (UUID: GPU-565c7761-2b61-7cbc-69bf-3c9dff3b59a1)
GPU 3: GeForce RTX 2080 Ti (UUID: GPU-ae9eb151-45c4-4af3-3513-01b533f08770)


Now, for the logs, I'm attaching the slurmctld log that covers the period when those jobs were submitted and allocated. I didn't want to miss any useful information so it's verbatim, and it's large.


I also noticed an odd behavior when trying to change logging level with "scontrol setdebug": it makes the daemon stop logging to systemd. I typically use "journalctl -u slurmctld -f" to follow the controller logs, and when the logging level is changed with scontrol, logging stops and nothing is logged anymore through journalctl until the daemon is restarted. 
The logging level used doesn't seem to make any difference, and it doesn't happen when enabling or disabling debugflags. It also looks like a relatively recent behavior as I don't recall seeing it in previous versions. I can file a separate ticket if necessary.


Thanks!
-- 
Kilian
Comment 41 Michael Hinton 2021-02-22 11:15:24 MST
(In reply to Kilian Cavalotti from comment #39)
> I also noticed an odd behavior when trying to change logging level with
> "scontrol setdebug": it makes the daemon stop logging to systemd. I
> typically use "journalctl -u slurmctld -f" to follow the controller logs,
> and when the logging level is changed with scontrol, logging stops and
> nothing is logged anymore through journalctl until the daemon is restarted. 
> The logging level used doesn't seem to make any difference, and it doesn't
> happen when enabling or disabling debugflags. It also looks like a
> relatively recent behavior as I don't recall seeing it in previous versions.
> I can file a separate ticket if necessary.
Interesting. Please open up a separate ticket for that. Thanks!
Comment 42 Kilian Cavalotti 2021-02-22 11:33:57 MST
(In reply to Michael Hinton from comment #41)
> Interesting. Please open up a separate ticket for that. Thanks!

Done in #10922.

Cheers,
--
Kilian
Comment 43 Michael Hinton 2021-02-22 13:14:52 MST
Good news: I was able to reproduce the issue, and I think I have a solution for you!

First, we noticed this in your attached slurmctld.conf:

2021-02-19T16:32:24-08:00 sh03-sl01 slurmctld[13294]: gres/gpu: state for sh03-13n14
2021-02-19T16:32:24-08:00 sh03-sl01 slurmctld[13294]:  gres_cnt found:4 configured:4 avail:4 alloc:0
2021-02-19T16:32:24-08:00 sh03-sl01 slurmctld[13294]:  gres_bit_alloc: of 4
2021-02-19T16:32:24-08:00 sh03-sl01 slurmctld[13294]:  gres_used:(null)
2021-02-19T16:32:24-08:00 sh03-sl01 slurmctld[13294]:  links[0]:0, 0, 0, -1
2021-02-19T16:32:24-08:00 sh03-sl01 slurmctld[13294]:  links[1]:0, 0, -1, 0
2021-02-19T16:32:24-08:00 sh03-sl01 slurmctld[13294]:  links[2]:0, -1, 0, 0
2021-02-19T16:32:24-08:00 sh03-sl01 slurmctld[13294]:  links[3]:-1, 0, 0, 0

'links' (i.e. NVIDIA NVLink) is backwards. It should be an identity matrix, like this:

links[0]:-1, 0, 0, 0
links[1]:0, -1, 0, 0
links[2]:0, 0, -1, 0
links[3]:0, 0, 0, -1

An identity matrix basically means that the GPU has no NVLinks, and the GPU is linked to itself. I believe this has the same effect in Slurm as if Links were unset in gres.conf.

Looking at the slurmctld log, it looks like all your AMD nodes with GPUs (sh03-12n[01-16],sh03-13n[0-15],sh03-14n[01-02]) have backwards links. You should run `nvidia-smi topo -m` to verify what NVML thinks the links are on those AMD nodes. I bet it’s an NVML problem, not a Slurm problem. At any rate, AutoDetect is messed up on those AMD nodes, like you initially thought.

Here are the links on a non-AMD node (sh02-12n07), which is correct:
 
2021-02-19T16:32:21-08:00 sh03-sl01 slurmctld[13294]: gres/gpu: state for sh02-12n07
2021-02-19T16:32:21-08:00 sh03-sl01 slurmctld[13294]:  gres_cnt found:4 configured:4 avail:4 alloc:1
2021-02-19T16:32:21-08:00 sh03-sl01 slurmctld[13294]:  gres_bit_alloc:0 of 4
2021-02-19T16:32:21-08:00 sh03-sl01 slurmctld[13294]:  gres_used:(null)
2021-02-19T16:32:21-08:00 sh03-sl01 slurmctld[13294]:  links[0]:-1, 0, 0, 0
2021-02-19T16:32:21-08:00 sh03-sl01 slurmctld[13294]:  links[1]:0, -1, 0, 0
2021-02-19T16:32:21-08:00 sh03-sl01 slurmctld[13294]:  links[2]:0, 0, -1, 0
2021-02-19T16:32:21-08:00 sh03-sl01 slurmctld[13294]:  links[3]:0, 0, 0, -1

In my reproducer, I was able to fix the issue by simply setting a gres.conf line with `Links=` unset.

I know you said that you tried that and it didn't work, but I wonder if maybe you didn't restart the slurmctld and slurmds (I don't believe an scontrol reconfig is sufficient for gres.conf changes, since it’s changing the node definition). I also noticed in your recent slurmctld log that there were a lot of "error: Node XXXXXX appears to have a different slurm.conf than the slurmctld" errors. So maybe your confs were not “going through” when you tried it last time.

So my advice is to leave AutoDetect on for all non-AMD GPU nodes, but turn it off for the AMD nodes and specify a gres.conf line explicitly for them. Here comes 20.11 to the rescue! 20.11 now allows AutoDetect to apply only to specific nodes. You can do that with a gres.conf like this, I think:

AutoDetect=nvml
NodeName=sh03-12n[01-16],sh03-13n[0-15],sh03-14n[01-02] AutoDetect=off Name=gpu File=/dev/nvidia[0-3]

I believe that should cover all your AMD GPU nodes.

This leaves AutoDetect enabled for all the nodes except those specified in the line with AutoDetect=off. I believe leaving “Cores=” unset is the same as setting Cores=0-31 (i.e. meaning that all cores/sockets are fair game for scheduling the GPU). And I believe leaving Links unset is the same as setting an identity matrix.

If for some reason the above doesn’t work, and you know that the configs are all properly synced, we could try this instead:

AutoDetect=nvml
NodeName=sh03-12n[01-16],sh03-13n[0-15],sh03-14n[01-02] AutoDetect=off Name=gpu File=/dev/nvidia0 Cores=0-31 Links=-1,0,0,0
NodeName=sh03-12n[01-16],sh03-13n[0-15],sh03-14n[01-02] AutoDetect=off Name=gpu File=/dev/nvidia1 Cores=0-31 Links=0,-1,0,0
NodeName=sh03-12n[01-16],sh03-13n[0-15],sh03-14n[01-02] AutoDetect=off Name=gpu File=/dev/nvidia2 Cores=0-31 Links=0,0,-1,0
NodeName=sh03-12n[01-16],sh03-13n[0-15],sh03-14n[01-02] AutoDetect=off Name=gpu File=/dev/nvidia3 Cores=0-31 Links=0,0,0,-1

That will match more closely to what AutoDetect would have actually set if it worked properly, but I still think this is equivalent in effect to the first simpler gres.conf.

The only side effect with disabling AutoDetect for these nodes is that I don’t believe the --gpu-freq option will work. This is because --gpu-freq is coupled to the auto detecting mechanism, so if it’s turned off, it won’t work. This is something that we want to eventually decouple in the future. If that is going to cause problems for you, let us know so I can point to a real-life customer who needs that enhancement.

If you could temporarily turn on the `gres` debug flag before restarting the nodes and ctld with the new config, the slurmctld log should confirm if links was set properly. You can turn off that debug flag immediately after restarting everything.

Let me know if that works for you or if you have any questions.

Thanks!
-Michael

---------------------

P.S. I’m sure you are aware, but just in case you aren't, the recommended procedure for adding nodes, removing nodes, or otherwise updating node definitions is:

a) Stop slurmctld(s)
b) Push out the updated slurm.conf/gres.conf throughout the cluster
c) Restart all slurmds simultaneously
d) Start slurmctld(s)

And the less safe, but usually ok procedure:

a) Push out the updated slurm.conf/gres.conf throughout the cluster
b) Restart slurmctld
c) Restart all slurmds really quickly

This second procedure may temporarily give you mismatching config errors.
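
As a concrete sketch of the first procedure on your setup (controller host taken from the logs above; the clush group name is an assumption):

# a) stop the controller
ssh sh03-sl01 systemctl stop slurmctld
# b) push out the updated slurm.conf/gres.conf to all nodes
# c) restart all slurmds while the controller is down
clush -w @compute 'systemctl restart slurmd'
# d) start the controller again
ssh sh03-sl01 systemctl start slurmctld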
Comment 45 Kilian Cavalotti 2021-02-22 13:31:55 MST
Hi Michael, 

Thanks a lot for the very complete response! I haven't tried setting up a manual gres.conf file yet, but I have a few questions/observations first::

1. none of the affected nodes has any NVLink GPU: they all are PCIe-based GPUs, so I'm not sure how NVlink could come at play here.

2. we *do* have NVLink GPUs on other nodes, all using AutoDetect=NVML as well,  and they don't exhibit the problem. Other nodes with PCIe GPUs are working fine as well. All the GPU nodes are using the same NVIDIA driver and the same version of the NVML library, so I assume that all the GPU nodes should have the same issue, right? But that's not what we observe.

3. I'm not sure to understand how the "links" property of a given GPU would change the number of GPUs detected on the node and explain that mismatch we've seen. Especially considering that `-N 1 --gpus 4` works while `--gres gpu:4` doesn't.

I'll give a try to your suggestion and set "links" of a few nodes to test, and report back.

Thanks!
--
Kilian
Comment 46 Michael Hinton 2021-02-22 13:42:32 MST
(In reply to Kilian Cavalotti from comment #45)
> 1. none of the affected nodes has any NVLink GPU: they all are PCIe-based
> GPUs, so I'm not sure how NVlink could come at play here.
You are right; NVLinks shouldn't come into play. But I believe NVML still returns an identity matrix, which is basically NVML saying that it sees no NVLinks on any GPUs. Slurm just takes that Links matrix and runs with it. But since that got messed up, it messed up Slurm's GPU scheduling.

> 2. we *do* have NVLink GPUs on other nodes, all using AutoDetect=NVML as
> well,  and they don't exhibit the problem. Other nodes with PCIe GPUs are
> working fine as well. All the GPU nodes are using the same NVIDIA driver and
> the same version of the NVML library, so I assume that all the GPU nodes
> should have the same issue, right? But that's not what we observe.
No, because it's possible that the NVML driver gets confused on AMD nodes, but not Intel nodes, and returns nvlinks that are wrong. If you run `nvidia-smi topo -m` on any of those AMD GPU nodes, that should indicate if it's NVML's fault or Slurm's fault. I'd like to see the output of that, if you have the time.

> 3. I'm not sure to understand how the "links" property of a given GPU would
> change the number of GPUs detected on the node and explain that mismatch
> we've seen. Especially considering that `-N 1 --gpus 4` works while `--gres
> gpu:4` doesn't.
Well, the node still shows 4 GPUs, right? It's just that Slurm's GPU scheduling looks at links to see if it can co-allocate GPUs a bit better when scheduling a job. If Links is in a messed-up form, I could see it messing up the GPU scheduling in weird ways, like we are seeing right now.

Thanks,
-Michael
Comment 47 Michael Hinton 2021-02-22 13:43:31 MST
(In reply to Michael Hinton from comment #46)
> (In reply to Kilian Cavalotti from comment #45)
> > 2. we *do* have NVLink GPUs on other nodes, all using AutoDetect=NVML as
> > well,  and they don't exhibit the problem. Other nodes with PCIe GPUs are
> > working fine as well. All the GPU nodes are using the same NVIDIA driver and
> > the same version of the NVML library, so I assume that all the GPU nodes
> > should have the same issue, right? But that's not what we observe.
> No, because it's possible that the NVML driver gets confused on AMD nodes,
> but not Intel nodes
*NVML library and/or NVIDIA driver
Comment 48 Kilian Cavalotti 2021-02-22 13:44:34 MST
Also a quick note that, without any configuration change over the week-end, "srun -p jamesz -w sh03-13n14 -N 1 --gpus 4 nvidia-smi -L" doesn't generate any step allocation error this morning. Which also seems to point towards a temporary bitmap corruption, rather than a permanent configuration issue.
Comment 49 Michael Hinton 2021-02-22 13:53:29 MST
(In reply to Kilian Cavalotti from comment #48)
> Which also seems to points
> towards a temporary bitmap corruption, rather than a permanent configuration
> issue.
That is strange. I'm not sure how to reconcile the transitory nature of that. But I do know that I can reproduce the issue with a malformed Links to get a job allocation with 3 GPUs when I request 4. So in this case, I wouldn't say it's a "corrupted bitmap" so much as the scheduler is confused and under-allocated the job because links is bad.

How often do you restart daemons? Is it possible that a slurmd was restarted between runs that could account for a change in configuration? I wonder if NVML reports bad links only some of the time on AMD nodes...
Comment 50 Kilian Cavalotti 2021-02-22 14:20:13 MST
(In reply to Kilian Cavalotti from comment #48)
> Also a quick note that, without any configuration change over the week-end,
> "srun -p jamesz -w sh03-13n14 -N 1 --gpus 4 nvidia-smi -L" doesn't not
> generate any step allocation error this morning. Which also seems to points
> towards a temporary bitmap corruption, rather than a permanent configuration
> issue.

Argh, we can forget about that, I was not testing the right node so let's pretend 
#c48 never happened. Sorry for the noise, back to real testing. :)
Comment 51 Kilian Cavalotti 2021-02-22 15:38:09 MST
(In reply to Michael Hinton from comment #43)
> 2021-02-19T16:32:24-08:00 sh03-sl01 slurmctld[13294]:  links[0]:0, 0, 0, -1
> 2021-02-19T16:32:24-08:00 sh03-sl01 slurmctld[13294]:  links[1]:0, 0, -1, 0
> 2021-02-19T16:32:24-08:00 sh03-sl01 slurmctld[13294]:  links[2]:0, -1, 0, 0
> 2021-02-19T16:32:24-08:00 sh03-sl01 slurmctld[13294]:  links[3]:-1, 0, 0, 0
> 
> 'links' (i.e. NVIDIA NVLink) is backwards. It should be an identity matrix,
> like this:
> 
> links[0]:-1, 0, 0, 0
> links[1]:0, -1, 0, 0
> links[2]:0, 0, -1, 0
> links[3]:0, 0, 0, -1

Right, there definitely seems to be something weird here, indeed.

> Looking at the slurmctld log, it looks like all your AMD nodes with GPUs
> (sh03-12n[01-16],sh03-13n[0-15],sh03-14n[01-02]) have backwards links. 

Correct, I've been able to check this from the slurmctld logs with the +gres debugflag:

# grep -B 4 'links\[0\]' slurmctld.log | awk '/state/ {node=$NF}; /links/ {if ($NF == "-1") {print node,": ERR"} else print node,": OK"}' | sort -u | clubak -b
---------------
sh02-14n[01-15],sh02-12n[04-17],sh02-13n[01-08,10-15],sh02-15n[01-11,13],sh01-29n[01-08],sh01-19n[01-06,08],sh02-16n[01-03,05-06,08-09],sh01-28n[01-02,11-12],sh01-27n[21,35] (83)
---------------
OK
---------------
sh03-13n[01-16],sh03-12n[01-06,11-16],sh03-14n[01-02] (30)
---------------
ERR

All the AMD-based GPU nodes (sh03-* nodes) have -1 as the last index in the links[0] line, whereas none of the Intel nodes (sh01-* and sh02-*) do.

> You
> should run `nvidia-smi topo -m` to verify what NVML thinks the links are on
> those AMD nodes.

Here's the output of "nvidia-smi topo -m" on a 4-GPU Intel node:

# ssh sh02-15n13 nvidia-smi topo  -m
        GPU0    GPU1    GPU2    GPU3    mlx5_0  CPU Affinity    NUMA Affinity
GPU0     X      PHB     SYS     SYS     SYS     0-9             0
GPU1    PHB      X      SYS     SYS     SYS     0-9             0
GPU2    SYS     SYS      X      PHB     PHB     10-19           1
GPU3    SYS     SYS     PHB      X      PHB     10-19           1
mlx5_0  SYS     SYS     PHB     PHB      X 

and on an AMD node:

# ssh sh03-13n14 nvidia-smi topo  -m
        GPU0    GPU1    GPU2    GPU3    mlx5_0  mlx5_1  mlx5_2  CPU Affinity    NUMA Affinity
GPU0     X      NODE    SYS     SYS     SYS     SYS     PHB     16-31           1
GPU1    NODE     X      SYS     SYS     SYS     SYS     NODE    16-31           1
GPU2    SYS     SYS      X      NODE    PHB     PHB     SYS     0-15            0
GPU3    SYS     SYS     NODE     X      NODE    NODE    SYS     0-15            0
mlx5_0  SYS     SYS     PHB     NODE     X      PIX     SYS
mlx5_1  SYS     SYS     PHB     NODE    PIX      X      SYS
mlx5_2  PHB     NODE    SYS     SYS     SYS     SYS      X 

So the only difference I can see is that on AMD nodes, the GPU-to-NUMA node affinity is reversed, in that GPUs 0-1 are connected to the higher-numbered CPUs, and GPUs 2-3 to the lower-numbered ones.

Could that explain the issue?


> In my reproducer, I was able to fix the issue by simply setting a gres.conf
> line with `Links=` unset.
> 
> I know you said that you tried that and it didn't work, but I wonder if
> maybe you didn't restart the slurmctld and slurmds (I don't believe an
> scontrol reconfig is sufficient for gres.conf changes, since it’s changing
> the node definition). 

Let me try again:

1. before any change:
# ssh sh03-13n14 cat /etc/slurm/gres.conf
Autodetect=nvml

$ srun -p jamesz -w sh03-13n14 -N 1 --gres=gpu:4 nvidia-smi -L
srun: job 18870857 queued and waiting for resources
srun: job 18870857 has been allocated resources
srun: error: Unable to create step for job 18870857: Invalid generic resource (gres) specification
srun: Force Terminated job 18870857
$

2. changing gres.conf to manually specify GPU-CPU bindings, and Links=
# ssh sh03-13n14 cat /etc/slurm/gres.conf
NodeName=sh03-13n14 AutoDetect=off Name=gpu File=/dev/nvidia[0-1] Cores=16-31
NodeName=sh03-13n14 AutoDetect=off Name=gpu File=/dev/nvidia[2-3] Cores=0-15
# ssh sh03-13n14 systemctl restart slurmd
#

3. restarting controller
# ssh sh03-sl01 systemctl restart slurmctld

4. trying again
$ srun -p jamesz -w sh03-13n14 -N 1 --gres=gpu:4 nvidia-smi -L
srun: job 18873767 queued and waiting for resources
srun: job 18873767 has been allocated resources
GPU 0: GeForce RTX 2080 Ti (UUID: GPU-1118f950-5276-c3ca-dce8-5c1a24ca732e)
GPU 1: GeForce RTX 2080 Ti (UUID: GPU-4558f0f3-a437-5acc-4501-4b066593f31a)
GPU 2: GeForce RTX 2080 Ti (UUID: GPU-565c7761-2b61-7cbc-69bf-3c9dff3b59a1)
GPU 3: GeForce RTX 2080 Ti (UUID: GPU-ae9eb151-45c4-4af3-3513-01b533f08770)


So that worked! I believe I left the AutoDetect=NVML line in gres.conf before, thinking that the node-specific line would override it, but I didn't have AutoDetect=off, so maybe that was the problem.

> I also noticed in>  your recent slurmctld log that there
> were a lot of "error: Node XXXXXX appears to have a different slurm.conf
> than the slurmctld" errors. So maybe your confs were not “going through”
> when you tried it last time.

Yeah, those messages were coming from the fact that I changed the default logging level in slurm.conf on the controller (when trying to investigate bug#10922) but not on the other nodes. I don't believe the gres.conf file is considered in this consistency check, is it? We don't have any gres.conf file on the controller, only on the GPU nodes.

> So my advice is to leave AutoDetect on for all non-AMD GPU nodes, but turn
> it off for the AMD nodes and specify a gres.conf line explicitly for them.
> Here comes 20.11 to the rescue! 20.11 now allows AutoDetect to apply only to
> specific nodes. You can do that with a gres.conf like this, I think:
> 
> AutoDetect=nvml
> NodeName=sh03-12n[01-16],sh03-13n[0-15],sh03-14n[01-02] AutoDetect=off
> Name=gpu File=/dev/nvidia[0-3]
> 
> I believe that should cover all your AMD GPU nodes.

Ok, I think we can do this on the short term to temporarily work around the issue and allow our users to keep using --gres=gpu:n for those who haven't embraced the new --gpu* options. But beyond this:

1. why is it working with --gpus? I mean it's good it's working, but shouldn't the links (mis)detection have the same effect, when the GPUs are requested via --gres or --gpus?

2. but more importantly, could we please fix AutoDetect=NVML? :)

The whole point of it is to avoid having to specify GPU topology manually, and having to update it each time new nodes are added to the cluster (which happens continuously for us). So if we need to maintain an exception list, that pretty much defeats its whole purpose. :\


Thanks!
--
Kilian
Comment 52 Michael Hinton 2021-02-22 16:20:17 MST
(In reply to Kilian Cavalotti from comment #51)
> So the only difference I can see is that on AMD nodes, the GPU-to-NUMA node
> affinity is reversed, in that GPUs 0-1 are connected to the higher-numbered
> CPUs, and GPU-23 to the lower-numbered ones.
> 
> Could that explain the issue?
I think it might.

> So that worked! I believe I left the Autodetect=NVLM line in gres.conf
> before, thinking that the node-specific line would override it, but I didn't
> have AutoDetect=Off, so maybe that was the problem.
Oh, right, that might have been it. Glad it works now!

> > I also noticed in>  your recent slurmctld log that there
> > were a lot of "error: Node XXXXXX appears to have a different slurm.conf
> > than the slurmctld" errors. So maybe your confs were not “going through”
> > when you tried it last time.
> 
> Yeah, those messages were coming from the fact that I changed the default
> logging level in slurm.conf on the controller (when trying to investigate
> bug#10922) but not on the other nodes. I don't believe the gres.conf file is
> considered in this consistency check, is it? We don't have any gres.conf
> file on the controller, only on the GPU nodes.
You're right about the controller not really caring about gres.conf, and that the mismatch error is with slurm.conf only. So forget I mentioned that in relation to gres.conf.

> 1. why is it working with --gpus? I mean it's good it's working, but
> shouldn't the links (mis)detection have the same effect, when the GPUs are
> requested via --gres or --gpus?
> 
> 2. but more importantly, could we please fix AutoDetect=NVML? :)
Those are great questions/requests. Let me look into it some more and get back to you :)

Thanks!
-Michael
Comment 53 Michael Hinton 2021-02-22 16:28:52 MST
Could you omit one of the AMD GPU nodes from the AutoDetect=off line in gres.conf, set SlurmdDebug=debug2, and then restart that slurmd on one of those AMD GPU nodes and attach the output? I want to see what AutoDetect is returning for Links (your prior slurmd logs were not verbose enough to show this).
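
For example, something like this (node ranges are illustrative; the point is just that sh03-13n14 is omitted from the AutoDetect=off line so it falls back to NVML detection):

# gres.conf
AutoDetect=nvml
NodeName=sh03-12n[01-16],sh03-13n[01-13,15-16],sh03-14n[01-02] AutoDetect=off Name=gpu File=/dev/nvidia[0-3]

# slurm.conf on that node (or SlurmdSyslogDebug, if you log via syslog)
SlurmdDebug=debug2

# then restart the slurmd on the test node
ssh sh03-13n14 systemctl restart slurmd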

-Michael
Comment 54 Michael Hinton 2021-02-22 16:35:50 MST
*output of the slurmd.log for that node
Comment 56 Kilian Cavalotti 2021-02-22 17:31:36 MST
(In reply to Michael Hinton from comment #53)
> Could you omit one of the AMD GPU nodes from the AutoDetect=off line in
> gres.conf, set SlurmdDebug=debug2, and then restart that slurmd on one of
> those AMD GPU nodes and attach the output? I want to see what AutoDetect is
> returning for Links (your prior slurmd logs were not verbose enough to show
> this).

Restored AutoDetect=NVML in gres.conf for sh03-13n14, bumped SlurmdSyslogDebug to debug2, and restarted slurmd. It logged this:

2021-02-22T16:28:32.374956-08:00 sh03-13n14 sysadm_audit: sysadm=kilian.root pid=191259 cmd="service slurmd restart" newpwd=/root ret=0
2021-02-22T16:28:32.437762-08:00 sh03-13n14 slurmd[191425]: Considering each NUMA node as a socket
2021-02-22T16:28:32.439937-08:00 sh03-13n14 slurmd[191425]: Considering each NUMA node as a socket
2021-02-22T16:28:32.507957-08:00 sh03-13n14 slurmd[191425]: error: _nvml_get_mem_freqs: Failed to get supported memory frequencies for the GPU : Not Supported
2021-02-22T16:28:32.509754-08:00 sh03-13n14 slurmd[191425]: error: _nvml_get_mem_freqs: Failed to get supported memory frequencies for the GPU : Not Supported
2021-02-22T16:28:32.510706-08:00 sh03-13n14 slurmd[191425]: error: _nvml_get_mem_freqs: Failed to get supported memory frequencies for the GPU : Not Supported
2021-02-22T16:28:32.511645-08:00 sh03-13n14 slurmd[191425]: error: _nvml_get_mem_freqs: Failed to get supported memory frequencies for the GPU : Not Supported
2021-02-22T16:28:32.511813-08:00 sh03-13n14 slurmd[191425]: gpu/nvml: _get_system_gpu_list_nvml: 4 GPU system device(s) detected
2021-02-22T16:28:32.512143-08:00 sh03-13n14 slurmd[191425]: gres/gpu: _normalize_sys_gres_types: Could not find an unused configuration record with a GRES type that is a substring of system device `geforce_rtx_2080_ti`. Setting system GRES type to NULL
2021-02-22T16:28:32.512226-08:00 sh03-13n14 slurmd[191425]: gres/gpu: _normalize_sys_gres_types: Could not find an unused configuration record with a GRES type that is a substring of system device `geforce_rtx_2080_ti`. Setting system GRES type to NULL
2021-02-22T16:28:32.512313-08:00 sh03-13n14 slurmd[191425]: gres/gpu: _normalize_sys_gres_types: Could not find an unused configuration record with a GRES type that is a substring of system device `geforce_rtx_2080_ti`. Setting system GRES type to NULL
2021-02-22T16:28:32.512395-08:00 sh03-13n14 slurmd[191425]: gres/gpu: _normalize_sys_gres_types: Could not find an unused configuration record with a GRES type that is a substring of system device `geforce_rtx_2080_ti`. Setting system GRES type to NULL
2021-02-22T16:28:32.513969-08:00 sh03-13n14 slurmd[191425]: Gres Name=gpu Type=(null) Count=1
2021-02-22T16:28:32.514050-08:00 sh03-13n14 slurmd[191425]: Gres Name=gpu Type=(null) Count=1
2021-02-22T16:28:32.514132-08:00 sh03-13n14 slurmd[191425]: Gres Name=gpu Type=(null) Count=1
2021-02-22T16:28:32.514213-08:00 sh03-13n14 slurmd[191425]: Gres Name=gpu Type=(null) Count=1
2021-02-22T16:28:32.514301-08:00 sh03-13n14 slurmd[191425]: topology/tree: init: topology tree plugin loaded
2021-02-22T16:28:32.516012-08:00 sh03-13n14 slurmd[191425]: topology/tree: _validate_switches: TOPOLOGY: warning -- no switch can reach all nodes through its descendants. If this is not intentional, fix the topology.conf file.
2021-02-22T16:28:32.525390-08:00 sh03-13n14 slurmd[191425]: route/default: init: route default plugin loaded
2021-02-22T16:28:32.526768-08:00 sh03-13n14 slurmd[191425]: task/affinity: init: task affinity plugin loaded with CPU mask 0xffffffff
2021-02-22T16:28:32.527640-08:00 sh03-13n14 slurmd[191425]: spank/lua: Loaded 0 plugins in this context
2021-02-22T16:28:32.527885-08:00 sh03-13n14 slurmd[191425]: cred/munge: init: Munge credential signature plugin loaded
2021-02-22T16:28:32.528008-08:00 sh03-13n14 slurmd[191425]: slurmd version 20.11.4 started
2021-02-22T16:28:32.528529-08:00 sh03-13n14 slurmd[191425]: slurmd started on Mon, 22 Feb 2021 16:28:32 -0800
2021-02-22T16:28:39.804700-08:00 sh03-13n14 slurmd[191425]: CPUs=32 Boards=1 Sockets=2 Cores=16 Threads=1 Memory=257614 TmpDisk=24564 Uptime=3367668 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)

I tried SlurmdSyslogDebug=debug3 as well but it doesn't seem to log anything more. Should there be anything else?

Cheers,
--
Kilian
Comment 57 Michael Hinton 2021-02-22 17:37:22 MST
I'm expecting to see something like this:

[2021-02-22T17:35:45.329] debug:  Log file re-opened
[2021-02-22T17:35:45.329] debug2: hwloc_topology_init
[2021-02-22T17:35:45.331] debug2: hwloc_topology_load
[2021-02-22T17:35:45.369] debug2: hwloc_topology_export_xml
[2021-02-22T17:35:45.369] debug:  CPUs:12 Boards:1 Sockets:1 CoresPerSocket:6 ThreadsPerCore:2
[2021-02-22T17:35:45.370] debug:  Reading cgroup.conf file /home/hintron/slurm/20.11/bazooka/etc/cgroup.conf
[2021-02-22T17:35:45.370] debug2: hwloc_topology_init
[2021-02-22T17:35:45.371] debug2: xcpuinfo_hwloc_topo_load: xml file (/home/hintron/slurm/20.11/bazooka/spool/slurmd-test1/hwloc_topo_whole.xml) found
[2021-02-22T17:35:45.372] debug:  CPUs:12 Boards:1 Sockets:1 CoresPerSocket:6 ThreadsPerCore:2
[2021-02-22T17:35:45.372] debug:  gres/gpu: init: loaded
[2021-02-22T17:35:45.373] debug:  gpu/nvml: init: init: GPU NVML plugin loaded
[2021-02-22T17:35:45.375] debug2: gpu/nvml: _nvml_init: Successfully initialized NVML
[2021-02-22T17:35:45.375] debug:  gpu/nvml: _get_system_gpu_list_nvml: Systems Graphics Driver Version: 450.36.06
[2021-02-22T17:35:45.375] debug:  gpu/nvml: _get_system_gpu_list_nvml: NVML Library Version: 11.450.36.06
[2021-02-22T17:35:45.375] debug2: gpu/nvml: _get_system_gpu_list_nvml: Total CPU count: 12
[2021-02-22T17:35:45.375] debug2: gpu/nvml: _get_system_gpu_list_nvml: Device count: 1
[2021-02-22T17:35:45.389] debug2: gpu/nvml: _get_system_gpu_list_nvml: GPU index 0:
[2021-02-22T17:35:45.389] debug2: gpu/nvml: _get_system_gpu_list_nvml:     Name: geforce_rtx_2060
[2021-02-22T17:35:45.389] debug2: gpu/nvml: _get_system_gpu_list_nvml:     UUID: GPU-f33de13b-d843-c631-b4b5-624315b535a1
[2021-02-22T17:35:45.389] debug2: gpu/nvml: _get_system_gpu_list_nvml:     PCI Domain/Bus/Device: 0:1:0
[2021-02-22T17:35:45.389] debug2: gpu/nvml: _get_system_gpu_list_nvml:     PCI Bus ID: 00000000:01:00.0
[2021-02-22T17:35:45.389] debug2: gpu/nvml: _get_system_gpu_list_nvml:     NVLinks: -1
[2021-02-22T17:35:45.389] debug2: gpu/nvml: _get_system_gpu_list_nvml:     Device File (minor number): /dev/nvidia0
[2021-02-22T17:35:45.389] debug2: gpu/nvml: _get_system_gpu_list_nvml:     CPU Affinity Range - Machine: 0-11
[2021-02-22T17:35:45.389] debug2: gpu/nvml: _get_system_gpu_list_nvml:     Core Affinity Range - Abstract: 0-5
[2021-02-22T17:35:45.390] error: _nvml_get_mem_freqs: Failed to get supported memory frequencies for the GPU : Not Supported

It looks like there are no debug or debug2 logs in your slurmd.log, let alone AutoDetect-specific debug2 logs. Are you sure that the slurmd is logging at level debug2?
Comment 59 Kilian Cavalotti 2021-02-22 17:57:53 MST
Created attachment 18059 [details]
slurmd debug3 logs w/ AutoDetect=NVML

(In reply to Michael Hinton from comment #57)
> It looks like there are no debug or debug2 logs in your slurmd.log, let
> alone AutoDetect-specific debug2 logs. Are you sure that the slurmd is
> logging at level debug2?

That's what it has in its /etc/slurm/slurm.conf:

[root@sh03-13n14 ~]# grep SlurmdSyslog /etc/slurm/slurm.conf 
SlurmdSyslogDebug=debug3

And with this, it looks like the debug messages are not making their way to the syslog files in /var/log; only the lines I sent before are logged there.

But journalctl has the debug entries, indeed. See attached w/ debug3.

It looks like the links are properly detected? But maybe the GPUs are not enumerated in the "right" order (see the "Note:"s below)?

-- 8< ----------------------------------------------------------
_get_system_gpu_list_nvml: GPU index 0:                                         
_get_system_gpu_list_nvml:     NVLinks: -1,0,0,0                                
_get_system_gpu_list_nvml: Note: GPU index 0 is different from minor number 3   
_get_system_gpu_list_nvml: GPU index 1:                                         
_get_system_gpu_list_nvml:     NVLinks: 0,-1,0,0                                
_get_system_gpu_list_nvml: Note: GPU index 1 is different from minor number 2   
_get_system_gpu_list_nvml: GPU index 2:                                         
_get_system_gpu_list_nvml:     NVLinks: 0,0,-1,0                                
_get_system_gpu_list_nvml: Note: GPU index 2 is different from minor number 1   
_get_system_gpu_list_nvml: GPU index 3:                                         
_get_system_gpu_list_nvml:     NVLinks: 0,0,0,-1                                
_get_system_gpu_list_nvml: Note: GPU index 3 is different from minor number 0  
-- 8< ----------------------------------------------------------


Thanks!
--
Kilian
Comment 61 Kilian Cavalotti 2021-02-23 10:31:45 MST
So, I've continued to look into this a little more, and it looks like this mis-ordering of the GPUs has also been causing *over* allocation of GRES, as I found a few occurrences of this in the slurmctld logs:

error: gres/gpu: job 18914375 node sh03-13n04 overallocated resources by 1, (5 > 4)


In any case, the common trait to all those AMD nodes is that the GPU indexes reported by the NVML don't match the minor number of the corresponding GPU (the X in /dev/nvidiaX). As shown below, the NVML numbers the GPUs in the order of their PCI addresses on the PCI bus (which is the default behavior), but for some reason, the minor numbers are created in the reverse order:

# nvidia-smi -q | grep -Ei "^GPU|minor"
GPU 00000000:05:00.0
    Minor Number                          : 3
GPU 00000000:44:00.0
    Minor Number                          : 2
GPU 00000000:89:00.0
    Minor Number                          : 1
GPU 00000000:C4:00.0
    Minor Number                          : 0

So GPU0 is /dev/nvidia3, GPU1 is /dev/nvidia2, etc.

From what I can gather from the NVML documentation, there's no guarantee that the indexes would match, or that they need to. It seems to be a legit case.
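
For reference, a quick way to put the two numbering schemes side by side on a node (plain nvidia-smi queries, nothing Slurm-specific):

# NVML ordering (GPU index assigned by ascending PCI bus ID):
nvidia-smi --query-gpu=index,pci.bus_id,name --format=csv
# kernel minor-number ordering (the X in /dev/nvidiaX):
nvidia-smi -q | grep -Ei "^GPU|minor"

On these AMD nodes the two orderings come out reversed, as shown above.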


Does the NVML GPU plugin assume that minor numbers will match GPU numbers? Maybe there's an issue there that's causing GPUs to be re-ordered and generating that backwards links matrix?

Cheers,
--
Kilian
Comment 62 Michael Hinton 2021-02-23 10:47:02 MST
(In reply to Kilian Cavalotti from comment #61)
> So, I've continued to look into this a little more, and it looks like this
> mis-ordering of the GPUs has also been causing *over* allocation of GRES, as
> I found a few occurrences of this in the slurmctld logs:
> 
> error: gres/gpu: job 18914375 node sh03-13n04 overallocated resources by 1,
> (5 > 4)
That's not good. I'll increase the severity of this to 2.

> In any case, the common trait to all those AMD nodes is that the GPU indexes
> reported by the NVML don't match the minor number of the corresponding GPU
> (the X in /dev/nvidiaX). As shown below, the NVML numbers the GPUs in the
> order of their PCI address on the PCI bus (which is the default behavior),
> but for some reasons, the minor numbers are created in the reverse order:
> 
> # nvidia-smi -q | grep -Ei "^GPU|minor"
> GPU 00000000:05:00.0
>     Minor Number                          : 3
> GPU 00000000:44:00.0
>     Minor Number                          : 2
> GPU 00000000:89:00.0
>     Minor Number                          : 1
> GPU 00000000:C4:00.0
>     Minor Number                          : 0
> 
> So GPU0 is /dev/nvidia3, GPU1 is /dev/nvidia2, etc.
> 
> From what I can gather from the NVML documentation, there's no guarantee
> that the indexes would match, or that they need to. It seems to be a legit
> case.
> 
> 
> Does the NVML GPU pluin assumes that minor numbers will match GPU numbers?
> Maybe there's an issue there that's causing GPUs to be re-ordered and the
> generating that backwards links matrix?
You've hit the nail on the head.

From https://slurm.schedmd.com/gres.html:
"GPU device files (e.g. /dev/nvidia1) are based on the Linux minor number assignment, while NVML's device numbers are assigned via PCI bus ID, from lowest to highest. Mapping between these two is indeterministic and system dependent, and could vary between boots after hardware or OS changes. For the most part, this assignment seems fairly stable. However, an after-bootup check is required to guarantee that a GPU device is assigned to a specific device file."

This was the motivation behind the "Note: GPU index 2 is different from minor number 1" warning.

In order to simplify AutoDetect matching, we also sort GRES records by device file, but I guess we don't modify the links. That right there might be the thing we just need to fix to accommodate this.

I'll look into it and see if I can't get a patch to you soon.

Thanks,
-Michael
Comment 63 Michael Hinton 2021-02-23 11:54:31 MST
I thought I noticed this earlier in the gres flag debug output, but this confirms it. It looks like AutoDetect is not setting the Core affinity correctly, either... 

Feb 22 16:39:57 sh03-13n14.int slurmd[192597]: debug2: gpu/nvml: _get_system_gpu_list_nvml:     NVLinks: -1,0,0,0
Feb 22 16:39:57 sh03-13n14.int slurmd[192597]: debug2: gpu/nvml: _get_system_gpu_list_nvml:     Device File (minor number): /dev/nvidia3
Feb 22 16:39:57 sh03-13n14.int slurmd[192597]: debug:  gpu/nvml: _get_system_gpu_list_nvml: Note: GPU index 0 is different from minor number 3
Feb 22 16:39:57 sh03-13n14.int slurmd[192597]: debug2: gpu/nvml: _get_system_gpu_list_nvml:     CPU Affinity Range - Machine: 16-31
Feb 22 16:39:57 sh03-13n14.int slurmd[192597]: debug2: gpu/nvml: _get_system_gpu_list_nvml:     Core Affinity Range - Abstract: 0-31

Since ThreadsPerCore=1, Core Affinity Range should be the same as CPU Affinity Range. That's something else I need to look into.
Comment 64 Michael Hinton 2021-02-23 12:11:46 MST
(In reply to Michael Hinton from comment #63)
> I thought I noticed this earlier in the gres flag debug output, but this
> confirms it. It looks like AutoDetect is not setting the Core affinity
> correctly, either... 
I opened bug 10932 to investigate the Core affinity being off with AutoDetect.
Comment 66 Michael Hinton 2021-02-23 13:29:16 MST
(In reply to Kilian Cavalotti from comment #61)
> So, I've continued to look into this a little more, and it looks like this
> mis-ordering of the GPUs has also been causing *over* allocation of GRES, as
> I found a few occurrences of this in the slurmctld logs:
> 
> error: gres/gpu: job 18914375 node sh03-13n04 overallocated resources by 1,
> (5 > 4)
Is this occurring even with the current workaround?
Comment 67 Kilian Cavalotti 2021-02-23 13:34:06 MST
On Tue, Feb 23, 2021 at 12:29 PM <bugs@schedmd.com> wrote:
> > error: gres/gpu: job 18914375 node sh03-13n04 overallocated resources by 1,
> > (5 > 4)
> Is this occurring even with the current workaround?

No I don't believe so, I've just noticed that on nodes still using Autodetect.

Cheers,
--
Kilian
Comment 68 Michael Hinton 2021-02-23 14:30:22 MST
(In reply to Kilian Cavalotti from comment #67)
> On Tue, Feb 23, 2021 at 12:29 PM <bugs@schedmd.com> wrote:
> > > error: gres/gpu: job 18914375 node sh03-13n04 overallocated resources by 1,
> > > (5 > 4)
> > Is this occurring even with the current workaround?
> 
> No I don't believe so, I've just noticed that on nodes still using
> Autodetect.
Just AMD nodes using AutoDetect, right? Ok, good.

I may be able to make a patch that fixes the links, which will help. But looking at the code, Slurm has always assumed that minor number == GPU ID when setting CUDA_VISIBLE_DEVICES (which is why we emitted that warning that the GPU ID didn't match the minor numbers).

See https://github.com/SchedMD/slurm/blob/slurm-20-11-4-1/src/plugins/gres/common/gres_common.c#L265-L292

So even if we got a patch to fix the links, you would still run into incorrect task binding for multi-GPU jobs, I think. I will open up an enhancement request to look into decoupling the minor number from the GPU ID.

Since you have a workaround for the links issue, I will downgrade this to a sev 3 and turn my attention to bug 10932, since that does look like a regression that impacts the entire cluster.

-Michael
Comment 69 Kilian Cavalotti 2021-02-23 14:42:13 MST
Hi Michael, 

Thanks for the update!

Yes, I think the root of the issue is that assumption that minor_number == GPU_ID. I looked in the NVML docs but couldn't find much detail about how the minor numbers are generated. But it seems clear that there's no guarantee they'll match GPU_IDs or the PCI address order.

So decoupling them seems like the right thing to do and that will likely fix all of the issues, from links detection to CPU binding, indeed.

Thanks!
--
Kilian
Comment 70 Michael Hinton 2021-03-01 12:58:16 MST
From bug 10932 comment 10:
> On sh03-13n15, I know see a links matrix that seems to be correctly oriented:
> sh03-13n15.int slurmd[144136]: debug2: gpu/nvml: _get_system_gpu_list_nvml:     NVLinks: -1,0,0,0
> sh03-13n15.int slurmd[144136]: debug2: gpu/nvml: _get_system_gpu_list_nvml:     NVLinks: 0,-1,0,0
> sh03-13n15.int slurmd[144136]: debug2: gpu/nvml: _get_system_gpu_list_nvml:     NVLinks: 0,0,-1,0
> sh03-13n15.int slurmd[144136]: debug2: gpu/nvml: _get_system_gpu_list_nvml:     NVLinks: 0,0,0,-1
Most machines I've dealt with have a consistent GPU device file enumeration order between boots once the hardware is set (though I've read somewhere that changing the hardware could affect this order). But apparently those AMD boxes do not enumerate the GPU minor numbers (and corresponding device files) in a static order from boot to boot.

Is the enumeration order random? Or is it always forwards (0,1,2,3) or backwards (3,2,1,0)? "Forwards" and "backwards" here mean ascending PCI bus ID and descending PCI bus ID, respectively, as that is how NVML/NVIDIA determines its GPU IDs.

What version of Linux are those nodes running? Is everything mostly the same as your other nodes except that they are using AMD CPUs?

Is there a way that this minor number assignment can be controlled? Until we can figure out why the OS is enumerating the GPUs in an undesired order, maybe the best we can do in the short term is write a script to reboot the nodes until they enumerate properly...
Comment 71 Kilian Cavalotti 2021-03-01 14:19:54 MST
(In reply to Michael Hinton from comment #70)
> From bug 10932 comment 10:
> > On sh03-13n15, I know see a links matrix that seems to be correctly oriented:
> > sh03-13n15.int slurmd[144136]: debug2: gpu/nvml: _get_system_gpu_list_nvml:     NVLinks: -1,0,0,0
> > sh03-13n15.int slurmd[144136]: debug2: gpu/nvml: _get_system_gpu_list_nvml:     NVLinks: 0,-1,0,0
> > sh03-13n15.int slurmd[144136]: debug2: gpu/nvml: _get_system_gpu_list_nvml:     NVLinks: 0,0,-1,0
> > sh03-13n15.int slurmd[144136]: debug2: gpu/nvml: _get_system_gpu_list_nvml:     NVLinks: 0,0,0,-1
> Most machines I've dealt with have a consistent GPU device file enumeration
> order between boots once the hardware is set (though I've read somewhere
> that changing the hardware could affect this order). But apparently those
> AMD boxes do not enumerate the GPU minor numbers (and corresponding device
> files) in a static order from boot to boot.

I do think the enumeration is consistent between reboots. Here's the order of minor numbers across all our GPU nodes (1st line is GPU0, 2nd is GPU1, etc.):

# clush -bw@gpu 'nvidia-smi -q | grep -Ei "minor"'
---------------
sh02-14n[05-15],sh02-15n[01-02,04-11,13],sh02-12n[04-12],sh02-13n[06-08,10-15],sh02-16n[02,05-06,08-09] (45)
---------------
    Minor Number                          : 0
    Minor Number                          : 1
    Minor Number                          : 2
    Minor Number                          : 3
---------------
sh01-29n[01-08],sh01-19n[01-06,08],sh02-12n[13-17],sh02-13n[01-05],sh01-28n[01-02,11-12],sh01-27n[21,35],sh02-[15-16]n03,sh02-16n01 (34)
---------------
    Minor Number                          : 0
    Minor Number                          : 1
    Minor Number                          : 2
    Minor Number                          : 3
    Minor Number                          : 4
    Minor Number                          : 5
    Minor Number                          : 6
    Minor Number                          : 7
---------------
sh03-[12-13]n[01-16],sh03-14n[01-02] (34)
---------------
    Minor Number                          : 3
    Minor Number                          : 2
    Minor Number                          : 1
    Minor Number                          : 0

NVML GPU ids and minor numbers seem to match on all nodes except the sh03-* which are all AMD Rome-based nodes.

If the enumeration order were changing between reboots, we'd see much more variation than that.

> Is the enumeration order random? 

The enumeration of the GPUs by NVML depends on the order of the devices on the PCIe bus, so it's deterministic in that way.
The minor numbers, on the other hand, don't follow the same ordering: I don't think they're random, they have to be determined somehow, but they're certainly not guaranteed to match the NVML GPU ids.
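
For illustration, here's a minimal sketch (assuming the pynvml Python bindings are installed) that prints the NVML index, PCI bus ID, and /dev/nvidiaX minor number side by side, which makes a mismatch like the one on the sh03-* nodes easy to spot:

import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)       # NVML id, assigned in PCI bus order
        bus_id = pynvml.nvmlDeviceGetPciInfo(handle).busId  # e.g. 00000000:41:00.0
        if isinstance(bus_id, bytes):                       # older pynvml versions return bytes
            bus_id = bus_id.decode()
        minor = pynvml.nvmlDeviceGetMinorNumber(handle)     # the X in /dev/nvidiaX
        print(f"NVML id {i}  bus {bus_id}  minor {minor}")
finally:
    pynvml.nvmlShutdown()

On the sh03-* nodes the minor column comes out reversed relative to the NVML id column, consistent with the nvidia-smi output above.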

> Or is it always forwards (0,1,2,3) or
> backwards (3,2,1,0)? "Forwards" and "backwards" here mean ascending PCI bus
> ID and descending PCI bus ID, respectively, as that is how NVML/NVIDIA
> determines its GPU IDs.

That would be a question for NVIDIA, I guess :)

> What version of Linux are those nodes running? Is everything mostly the same
> as your other nodes except that they are using AMD CPUs?

CentOS 7.6, kernel 3.10.0-957.27.2.el7.x86_64, and NVIDIA driver 460.32.03, all the same on all the nodes.

> Is there a way that this minor number assignment can be controlled? Until we
> can figure out why the OS is enumerating the GPUs in an undesired order,
> maybe the best we can do in the short term is write a script to reboot the
> nodes until it enumerates properly...

I don't believe the number assignment can be controlled, and it doesn't seem to change across reboots.

I think the assumption that it would match the NVML GPU ids is maybe what's causing the problem here. There's no guarantee it would match (and it actually does not always, as we can see here), so maybe the gres NVML plugin should not assume so. Could the minor number be considered just as another GPU property, like its name or its UUID?

Cheers,
--
Kilian
Comment 72 Michael Hinton 2021-03-02 12:10:08 MST
(In reply to Kilian Cavalotti from comment #71)
> I don't believe the number assignment can be controlled, and it doesn't seem
> to change across reboots.
Ok, good to know.

> I think the assumption that it would match the NVML GPU ids is maybe what's
> causing the problem here. There's no guarantee it would match (and it
> actually does not always, as we can see here), so maybe the gres NVML plugin
> should not assume so. Could the minor number be considered just as another
> GPU property, like its name or its UUID?
I agree that this is what needs to happen in Slurm. But I think this is going to be a 21.08 thing at the earliest, since Slurm has always implicitly assumed this. I.e., it's not a bug, but rather an inherent limitation. Also, adding a field will require altering packing/unpacking. I'll discuss this internally to see if we can start working on this.

-Michael
Comment 73 Michael Hinton 2021-03-02 14:08:32 MST
(In reply to Michael Hinton from comment #72)
> I'll discuss this
> internally to see if we can start working on this.
I was given the go-ahead to open an internal ticket to work on this, and I hope to be able to get this working by 21.08. I may ask for your help to test this along the way, since we don't really have access to AMD nodes with multiple NVIDIA GPUs.

-Michael
Comment 74 Kilian Cavalotti 2021-03-02 14:10:31 MST
On Tue, Mar 2, 2021 at 1:08 PM <bugs@schedmd.com> wrote:
> (In reply to Michael Hinton from comment #72)
> > I'll discuss this
> > internally to see if we can start working on this.
> I was given the go-ahead to open an internal ticket to work on this, and I hope
> to be able to get this working by 21.08. I may ask for your help to test this
> along the way, since we don't really have access to AMD nodes with multiple
> NVIDIA GPUs.

Great, thanks for letting me know!
I'll be happy to help in providing any information I can.

Cheers,
--
Kilian
Comment 76 Kilian Cavalotti 2021-04-06 10:52:46 MDT
*** Ticket 11277 has been marked as a duplicate of this ticket. ***
Comment 77 Michael Hinton 2021-04-09 15:11:43 MDT
Hey Kilian,

I'm going to go ahead and mark this as a sev 4, since we provided a workaround (explicitly specifying gres.conf GRES lines for AMD nodes) and since we are looking into ultimately fixing this in (internal) bug 10933 for (hopefully) 21.08. Once we make progress on 10933, I'll go ahead and contact you here to get your thoughts on the patch and maybe see how it works for you. How does that sound?

Thanks!
-Michael
Comment 79 Kilian Cavalotti 2021-04-09 15:40:42 MDT
Hi Michael, 

> I'm going to go ahead and mark this as a sev 4, since we provided a
> workaround (explicitly specifying gres.conf GRES lines for AMD nodes) and
> since we are looking into ultimately fixing this in (internal) bug 10933 for
> (hopefully) 21.08. Once we make progress on 10933, I'll go ahead and contact
> you here to get your thoughts on the patch and maybe see how it works for
> you. How does that sound?

Yep, that sounds good, thanks!

Cheers,
--
Kilian
Comment 80 Michael Hinton 2021-05-04 13:38:39 MDT
(In reply to Kilian Cavalotti from comment #71)
> # clush -bw@gpu 'nvidia-smi -q | grep -Ei "minor"'
> ...
> sh03-[12-13]n[01-16],sh03-14n[01-02] (34)
> ---------------
>     Minor Number                          : 3
>     Minor Number                          : 2
>     Minor Number                          : 1
>     Minor Number                          : 0
Kilian, could I get a confirmation that NVIDIA GPUs appear to be enumerated in nvidia-smi based on PCI bus ID (for both AMD and non-AMD nodes)?
Comment 81 Michael Hinton 2021-05-04 13:40:05 MDT
(In reply to Michael Hinton from comment #80)
> appear to be enumerated
I mean the order in which they appear in nvidia-smi
Comment 82 Kilian Cavalotti 2021-05-04 14:01:42 MDT
(In reply to Michael Hinton from comment #81)
> (In reply to Michael Hinton from comment #80)
> > appear to be enumerated
> I mean the order in which they appear in nvidia-smi

(In reply to Michael Hinton from comment #80)
> (In reply to Kilian Cavalotti from comment #71)
> > # clush -bw@gpu 'nvidia-smi -q | grep -Ei "minor"'
> > ...
> > sh03-[12-13]n[01-16],sh03-14n[01-02] (34)
> > ---------------
> >     Minor Number                          : 3
> >     Minor Number                          : 2
> >     Minor Number                          : 1
> >     Minor Number                          : 0
> Kilian, could I get a confirmation that NVIDIA GPUs appear to be enumerated
> in nvidia-smi based on PCI bus ID (for both AMD and non-AMD nodes)?

It depends :)

This is actually controlled by the CUDA_DEVICE_ORDER environment variable:
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars

> Values: FASTEST_FIRST, PCI_BUS_ID, (default is FASTEST_FIRST)	
> Description: FASTEST_FIRST causes CUDA to enumerate the available devices in fastest to slowest order using a simple heuristic. PCI_BUS_ID orders devices by PCI bus ID in ascending order.

Also see https://docs.nvidia.com/dgx/dgx-station-a100-user-guide/index.html#run-on-bare-metal-hw-guide

I'm not even sure the behavior is guaranteed when the environment variable is not defined and all the GPUs are identical. Maybe this is something that SchedMD's contacts at NVIDIA could help with?


Also, I found cases where the minor numbers are not just the NVML GPU ids in reverse, but something more... creative:

# nvidia-smi -q  | grep -Ei "minor|bus  "
    Minor Number                          : 1
        Bus                               : 0x01
    Minor Number                          : 0
        Bus                               : 0x41
    Minor Number                          : 3
        Bus                               : 0x81
    Minor Number                          : 2
        Bus                               : 0xC1

or:

# nvidia-smi -q  | grep -Ei "minor|bus  "
    Minor Number                          : 2
        Bus                               : 0x07
    Minor Number                          : 3
        Bus                               : 0x0A
    Minor Number                          : 0
        Bus                               : 0x47
    Minor Number                          : 1
        Bus                               : 0x4D
    Minor Number                          : 6
        Bus                               : 0x87
    Minor Number                          : 7
        Bus                               : 0x8D
    Minor Number                          : 4
        Bus                               : 0xC7
    Minor Number                          : 5
        Bus                               : 0xCA

About that NVML id, the nvidia-smi man page says:

   -i, --id=ID
       Display data for a single specified GPU or Unit.  The specified id may be the GPU/Unit's 0-based index in the natural enumeration returned by the driver, the GPU's board serial number, the GPU's UUID, or the GPU's PCI bus ID (as  domain:bus:device.function in hex).  It is recommended that users desiring consistency use either UUID or PCI bus ID, since device enumeration ordering is not guaranteed to be consistent between reboots and board serial number might be shared between multiple GPUs on the same board.


Cheers,
--
Kilian
Comment 83 Michael Hinton 2021-05-04 14:41:06 MDT
I think you are conflating NVML with CUDA (in this case).

We do say this in https://slurm.schedmd.com/gres.html#GPU_Management:

"NVML (which powers the nvidia-smi tool) numbers GPUs in order by their PCI bus IDs. For this numbering to match the numbering reported by CUDA, the CUDA_DEVICE_ORDER environmental variable must be set to CUDA_DEVICE_ORDER=PCI_BUS_ID."

The justification for this comes from an email exchange from one of the NVIDIA NVML developers back in 2018:

"/dev/nvidiaX enumeration is based on Linux minor number assignment, and NVML’s enumeration is done after sorting devices using PCI Bus ID. In short, mapping between these two is indeterministic and system dependent (based on PCI Bus ID assignment).
 
"Setting CUDA_​DEVICE_​ORDER=PCI_BUS_ID is sufficient to match CUDA and NVML enumeration. NVML already sorts device using PCI bus ID."

I think the confusion is that CUDA could number devices differently than NVML, depending on if the CUDA application sets CUDA_DEVICE_ORDER. But NVML (and by extension, nvidia-smi) should always number devices by PCI bus ID. And your output is showing that, so that is the confirmation I was looking for :)

Have you ever seen nvidia-smi _not_ enumerate devices in order of PCI bus ID? That would be news-worthy if so.

-Michael

P.S.: Now that I'm thinking about it, it might be good if Slurm automatically sets CUDA_DEVICE_ORDER=PCI_BUS_ID as a convenience to make sure that CUDA applications are guaranteed to have the same order as Slurm. That's something I will look into.
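
In the meantime, a minimal sketch of what a site- or user-side job wrapper could do (hypothetical, not something Slurm does today):

import os

# Pin CUDA's device numbering to PCI bus order so it matches NVML's
# (and therefore Slurm's) view. This must run before any CUDA-using
# library is initialized in the process.
os.environ.setdefault("CUDA_DEVICE_ORDER", "PCI_BUS_ID")

# ...only import/initialize CUDA libraries after this point...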
Comment 84 Kilian Cavalotti 2021-05-04 14:49:32 MDT
(In reply to Michael Hinton from comment #83)
> I think you are conflating NVML with CUDA (in this case).

Ah yes, I think you're right, I actually missed that additional level of abstraction. So there is the NVML id, the device minor number and the CUDA id, which can all be different (yet in the same 0-#GPUs integer range).

> I think the confusion is that CUDA could number devices differently than
> NVML, depending on if the CUDA application sets CUDA_DEVICE_ORDER. But NVML
> (and by extension, nvidia-smi) should always number devices by PCI bus ID.
> And your output is showing that, so that is the confirmation I was looking
> for :)

Right!

> Have you ever seen nvidia-smi _not_ enumerate devices in order of PCI bus
> ID? That would be news-worthy if so.

Nope: on all the GPU nodes we have (9 different server models), the NVML ids are in the order of the PCI bus addresses.

> P.S.: Now that I'm thinking about it, it might be good if Slurm
> > automatically sets CUDA_DEVICE_ORDER=PCI_BUS_ID as a convenience to make
> sure that CUDA applications are guaranteed to have the same order as Slurm.
> That's something I will look into.

I think that would make perfect sense indeed!

Thanks,
--
Kilian
Comment 85 Michael Hinton 2021-05-04 15:06:25 MDT
(In reply to Kilian Cavalotti from comment #84)
> (In reply to Michael Hinton from comment #83)
> > I think you are conflating NVML with CUDA (in this case).
> 
> Ah yes, I think you're right, I actually missed that additional level of
> abstraction. So there is the NVML id, the device minor number and the CUDA
> id, which can all be different (yet in the same 0-#GPUs integer range).
Yeah, I think that's a good summary of the three different "GPU ids" we are talking about.

> > P.S.: Now that I'm thinking about it, it might be good if Slurm
> > automatically sets CUDA_DEVICE_ORDER=PCI_BUS_ID as a convenience to make
> > sure that CUDA applications are guaranteed to have the same order as Slurm.
> > That's something I will look into.
> 
> I think that would make perfect sense indeed!
Opened bug 11529 to track this.

Thanks!
-Michael
Comment 86 Michael Hinton 2021-05-25 11:53:17 MDT
*** Ticket 11693 has been marked as a duplicate of this ticket. ***
Comment 87 Michael Hinton 2021-05-25 12:04:41 MDT
Kilian,

I've been thinking about it some more, and I think this issue can be solved in 20.11, at least the part where GPUs are getting ignored by the scheduler. The GPU binding issue is another, less serious matter.

The basic idea is that if AutoDetect is just going to set links to be e.g.

-1 0 0
0 -1 0
0 0 -1

(i.e. no nvlinks present), then we might as well set links to NULL, which would avoid the issue entirely for the vast majority of cases.

However, if nvlinks exist and links are not an "identity matrix," then AutoDetect can just sort the GPUs by links at the end instead of by device file, and that should preserve the original PCI bus order (since that is the order that links are set). The -1 (self) should indicate the position of the GPU on the PCI bus.
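
To illustrate the idea (a simplified sketch, not the actual Slurm C code): each detected GPU carries a Links row in which -1 marks the GPU itself, and because that self-position reflects PCI bus order, sorting on it restores enumeration order regardless of minor number:

# GPUs as AutoDetect finds them when sorted by device file (minor number),
# using the links rows observed on the sh03-* nodes
gpus = [
    {"minor": 0, "links": [0, 0, 0, -1]},
    {"minor": 1, "links": [0, 0, -1, 0]},
    {"minor": 2, "links": [0, -1, 0, 0]},
    {"minor": 3, "links": [-1, 0, 0, 0]},
]
gpus.sort(key=lambda g: g["links"].index(-1))   # position of -1 == PCI bus order
print([g["minor"] for g in gpus])               # [3, 2, 1, 0]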

I will look into this and keep you posted.

For the record, do any of your GPUs have nvlinks?

-Michael
Comment 88 Kilian Cavalotti 2021-05-25 12:26:51 MDT
Hi Michael, 

(In reply to Michael Hinton from comment #87)
> I've been thinking about it some more, and I think this issue can be solved
> in 20.11, at least the part where GPUs are getting ignored by the scheduler.
> The GPU binding issue is another, less serious matter.
> 
> The basic idea is that if AutoDetect is just going to set links to be e.g.
> 
> -1 0 0
> 0 -1 0
> 0 0 -1
> 
> (i.e. no nvlinks present), then we might as well set links to NULL, which
> would avoid the issue entirely for the vast majority of cases.
> 
> However, if nvlinks exist and links are not an "identity matrix," then
> AutoDetect can just sort the GPUs by links at the end instead of by device
> file, and that should preserve the original PCI bus order (since that is the
> order that links are set). The -1 (self) should indicate the position of the
> GPU on the PCI bus.
> 
> I will look into this and keep you posted.

Thanks for the update! Yes, that seems like an interesting idea worth exploring.

> For the record, do any of your GPUs have nvlinks?

Yes, we have a mix of configurations, over multiple GPU generations, some with plain old PCIe, some with NVLink.

Cheers,
--
Kilian
Comment 89 Michael Hinton 2021-06-02 17:37:42 MDT
*** Ticket 11697 has been marked as a duplicate of this ticket. ***
Comment 90 Michael Hinton 2021-06-09 12:50:05 MDT
Comment on attachment 17956 [details]
debug v1

Renaming to debug1 to differentiate from forthcoming real v1
Comment 91 Michael Hinton 2021-06-09 13:17:48 MDT
Created attachment 19901 [details]
v1

Kilian,

Here is v1. This should allow AutoDetect to work properly with your AMD GPU nodes. Let me know if it fixes the issue for you!

When AutoDetect is used, it now does a final sort of GPUs based on Links instead of device file. Since the 'self' index (-1) correlates to the PCI bus ID order, this should effectively sort GPUs by PCI bus ID. At the very least, it will make Links be in the right order and prevent the GPU scheduling issue initially reported in this bug.

While I don't think there will be any actual issues with GPU binding, we do still have some hardcoded assumptions that device file number is equivalent to GPU index/order, but that's out of the scope of what we can fix in 20.11, and is something I believe I fixed for 21.08 in bug 10933 patch v1.

Thanks!
-Michael
Comment 93 Michael Hinton 2021-06-10 16:25:38 MDT
Created attachment 19923 [details]
2011 v2 (will not land in 20.11)

v2 cleans up some stuff in v1, but doesn't change the behavior much. The biggest difference is that we keep the device file sort, and then do a links sort *after*. This is needed to prevent breaking things on AMD GPUs, which don't have links and thus need to be sorted by device file. (AMD GPUs may also have the same issue described in this bug, but I'll leave that to 21.08). v2 also adds another test and adds some more error messages.
Comment 95 Kilian Cavalotti 2021-06-11 17:12:53 MDT
Hi Michael,

(In reply to Michael Hinton from comment #93)
> v2 cleans up some stuff in v1, but doesn't change the behavior much. The
> biggest difference is that we keep the device file sort, and then do a links
> sort *after*. This is needed to prevent breaking things on AMD GPUs, which
> don't have links and thus need to be sorted by device file. (AMD GPUs may
> also have the same issue described in this bug, but I'll leave that to
> 21.08). v2 also adds another test and adds some more error messages.

Thank you very much for the patch! 

I just applied it and I can confirm that the original issue (Unable to create step for job XXX: Invalid generic resource (gres) specification) is now fixed.

Things look good on all the nodes I checked so far.
Looking forward to https://bugs.schedmd.com/show_bug.cgi?id=10933 in 21.08

Thanks a lot!

Cheers,
--
Kilian
Comment 96 Michael Hinton 2021-06-14 11:06:20 MDT
(In reply to Kilian Cavalotti from comment #95)
> I just applied it and I can confirm that the original issue (Unable to
> create step for job XXX: Invalid generic resource (gres) specification) is
> now fixed.
> 
> Things look good on all the nodes I checked so far.
Excellent! Glad it's working as expected.

> Looking forward to https://bugs.schedmd.com/show_bug.cgi?id=10933 in 21.08
I'm hoping that the patches in this bug can help simplify bug 10933. I'll take a look at it soon.

Thanks!
-Michael
Comment 99 Michael Hinton 2021-07-02 14:00:48 MDT
To Kilian (and anybody following along),

We are going to go ahead and land patch 2011 v2 in 21.08, since it's a non-trivial change and we don't want to risk destabilizing 20.11.9 at this point.

For 20.11, feel free to continue using patch 2011 v2 in production. Specifying GRES explicitly in gres.conf without using AutoDetect for affected nodes is also a workaround for those who do not want to run with a patch.
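
For reference, that explicit workaround looks roughly like this in gres.conf for an affected node (the node name, GPU type, and device-file ordering below are placeholders and need to match the actual hardware; Cores=... can be added to each line to carry the CPU affinity that AutoDetect would otherwise have filled in):

# gres.conf fragment for an affected node, with no AutoDetect line
NodeName=sh03-13n15 Name=gpu Type=geforce File=/dev/nvidia0
NodeName=sh03-13n15 Name=gpu Type=geforce File=/dev/nvidia1
NodeName=sh03-13n15 Name=gpu Type=geforce File=/dev/nvidia2
NodeName=sh03-13n15 Name=gpu Type=geforce File=/dev/nvidia3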

Thanks!
-Michael
Comment 101 Kilian Cavalotti 2021-07-02 14:18:40 MDT
Hi Michael,

(In reply to Michael Hinton from comment #99)
> To Kilian (and anybody following along),
> 
> We are going to go ahead and land patch 2011 v2 in 21.08, since it's a
> non-trivial change and we don't want to risk destabilizing 20.11.9 at this
> point.
> 
> For 20.11, feel free to continue using patch 2011 v2 in production.
> Specifying GRES explicitly in gres.conf without using AutoDetect for
> affected nodes is also a workaround for those who do not want to run with a
> patch.

Thanks for the heads-up. We'll continue applying the patch in 20.11, it's been working well so far. Thanks for providing it!

Cheers,
--
Kilian
Comment 107 Michael Hinton 2021-07-21 19:15:56 MDT
Hey Kilian,

This is finally fixed in master with commits https://github.com/SchedMD/slurm/compare/716b81c981c8...ef533e1f30cc. Now, AutoDetected GPUs will be sorted in order of enumeration rather than minor number, for both NVML and RSMI.

Thanks!
-Michael
Comment 108 Kilian Cavalotti 2021-07-22 04:35:14 MDT
Awesome, thanks!
--
Kilian