Bug 12642

Summary: Unable to create a step in job with GRES (typed, no_consume) after upgrade to 21.08
Product: Slurm
Reporter: Trey Dockendorf <tdockendorf>
Component: Configuration
Assignee: Marcin Stolarek <cinek>
Status: RESOLVED FIXED
Severity: 3 - Medium Impact
Priority: ---
CC: andrew.dangelo, bas.vandervlies, cinek, csc-slurm-tickets, quirin.lohr, scott, troy
Version: 21.08.2
Hardware: Linux
OS: Linux
See Also: https://bugs.schedmd.com/show_bug.cgi?id=12459
Site: Ohio State OSC
Version Fixed: 21.08.3 22.05pre1
Target Release: ---
Attachments: gres.conf
slurm.conf
job_submit.lua
job_submit_lib.lua
osc_common.lua
v2
v2 (from bug10498)
v1
v3
v4

Description Trey Dockendorf 2021-10-11 09:23:56 MDT
Created attachment 21685 [details]
gres.conf

I am testing the upgrade from 20.11.8 to 21.08.2 and have discovered I'm unable to submit jobs that request a GRES:

$ sbatch -A PZS0708 --gres=gpfs:1 --wrap 'sleep 600'
sbatch: error: Invalid generic resource (gres) specification
$ sbatch -A PZS0708 --gres=gpu:1 --wrap 'sleep 600'
sbatch: error: Invalid generic resource (gres) specification

I'm attaching our gres.conf and slurm.conf, which did not have this issue on 20.11.8.

The TRES entries are in the database too:

[root@slurmdbd01-test ~]# sacctmgr show tres
    Type            Name     ID
-------- --------------- ------
     cpu                      1
     mem                      2
  energy                      3
    node                      4
 billing                      5
      fs            disk      6
    vmem                      7
   pages                      8
    gres        gpfs:ess   1001
    gres    gpfs:project   1002
    gres    gpfs:scratch   1003
    gres             gpu   1004
    gres        gpu:v100   1005
    gres             ime   1006
    gres          pfsdir   1007
    gres      pfsdir:ess   1008
    gres  pfsdir:scratch   1009
    gres    gpu:v100-32g   1010
    gres   gpu:v100-quad   1011
    gres             vis   1012
    gres        gpu:p100   1013
    gres             mps   1014
 license          abaqus   1015
 license      abaqus@osc   1016
 license   abaquscae@osc   1017
 license abaqusexplicit+   1018
 license abaqusextended+   1019
 license       ansys@osc   1020
 license    ansyspar@osc   1021
 license comsolscript@o+   1022
 license        epik@osc   1023
 license       glide@osc   1024
 license     ligprep@osc   1025
 license      lsdyna@osc   1026
 license  macromodel@osc   1027
 license     qikprep@osc   1028
 license     starccm@osc   1029
 license  starccmpar@osc   1030
 license       stata@osc   1031
 license     usearch@osc   1032
 license     qikprop@osc   1033
Comment 1 Trey Dockendorf 2021-10-11 09:24:17 MDT
Created attachment 21686 [details]
slurm.conf
Comment 2 Trey Dockendorf 2021-10-11 10:30:02 MDT
Here is the verbose sbatch output; you can see the issue is that it tries to submit "gres:gres:gpu":

$ sbatch -A PZS0708 --gres=gpu --wrap 'sleep 600' -vvv
sbatch: defined options
sbatch: -------------------- --------------------
sbatch: account             : PZS0708
sbatch: gres                : gres:gres:gpu
sbatch: verbose             : 3
sbatch: wrap                : sleep 600
sbatch: -------------------- --------------------
sbatch: end of defined options
sbatch: debug2: spank: lua.so: init_post_opt = 0
sbatch: debug2: spank: private-tmpdir.so: init_post_opt = 0
sbatch: debug:  propagating SLURM_PRIO_PROCESS=0
sbatch: debug:  propagating UMASK=0002
sbatch: select/cons_res: common_init: select/cons_res loaded
sbatch: select/cons_tres: common_init: select/cons_tres loaded
sbatch: select/cray_aries: init: Cray/Aries node selection plugin loaded
sbatch: select/linear: init: Linear node selection plugin loaded with argument 276
sbatch: debug:  _get_next_gres: Failed to locate GRES gres
sbatch: error: Invalid generic resource (gres) specification
sbatch: debug2: spank: lua.so: exit = 0
sbatch: debug2: spank: private-tmpdir.so: exit = 0
Comment 3 Trey Dockendorf 2021-10-11 10:33:40 MDT
Looks like CLI filter Lua behavior changed. We have this in "slurm_cli_pre_submit":

  if options["gres"] ~= nil then
    local new_gres = gres_defaults(options["gres"])
    options["gres"] = new_gres
    posix.setenv("SLURM_JOB_GRES", options["gres"])
  end

The function gres_defaults:

function gres_defaults(gres)
  if gres == nil then
    return gres
  end
  local new_gres = {}
  for g in string.gmatch(gres, "([^,]+)") do
    if g == "pfsdir" then
      g = "pfsdir:ess"
    end
    new_gres[#new_gres+1] = g
  end
  gres = ""
  for i,g in ipairs(new_gres) do
    if gres ~= "" then
      gres = gres .. "," .. g
    else
      gres = g
    end
  end
  return gres
end

The entire point of this code is that if you request the "pfsdir" GRES it's transformed into the "pfsdir:ess" GRES.  Is there some bug or new behavior in how setting GRES via the CLI filter is handled?
Comment 4 Trey Dockendorf 2021-10-11 10:38:15 MDT
I dug into this comparing 20.11 and 21.08, and it looks like options["gres"] in 20.11 would be "gpu" but in 21.08 it's "gres:gpu".  Is this intentional or a bug?
Comment 5 Trey Dockendorf 2021-10-11 10:42:57 MDT
One more bit of data: this is also affecting job submit Lua, so I'm guessing the GRES handling in Lua has either changed or is broken.
Comment 6 Trey Dockendorf 2021-10-11 10:59:54 MDT
So there is seemingly a disconnect between sbatch and slurmctld: I got the -vvv output to show "gres:gpu", but now the scheduler is rejecting the job:

$ sbatch -A PZS0708 --gres=gpu --wrap 'sleep 600' -vvv
DEBUG: gres=gres:gpusbatch: defined options
sbatch: -------------------- --------------------
sbatch: account             : PZS0708
sbatch: gres                : gres:gpu
sbatch: verbose             : 3
sbatch: wrap                : sleep 600
sbatch: -------------------- --------------------
sbatch: end of defined options
sbatch: debug2: spank: lua.so: init_post_opt = 0
sbatch: debug2: spank: private-tmpdir.so: init_post_opt = 0
sbatch: debug:  propagating SLURM_PRIO_PROCESS=0
sbatch: debug:  propagating UMASK=0002
sbatch: select/cons_res: common_init: select/cons_res loaded
sbatch: select/cons_tres: common_init: select/cons_tres loaded
sbatch: select/cray_aries: init: Cray/Aries node selection plugin loaded
sbatch: select/linear: init: Linear node selection plugin loaded with argument 276
sbatch: debug:  auth/munge: init: Munge authentication plugin loaded
sbatch: error: Batch job submission failed: Invalid Trackable RESource (TRES) specification
sbatch: debug2: spank: lua.so: exit = 0
sbatch: debug2: spank: private-tmpdir.so: exit = 0


This is in slurmctld.log:

Oct 11 12:58:36 pitzer-slurm01-test slurmctld[109031]: _slurm_rpc_submit_batch_job: Invalid Trackable RESource (TRES) specification
Comment 7 Trey Dockendorf 2021-10-11 11:02:40 MDT
Created attachment 21690 [details]
job_submit.lua
Comment 8 Trey Dockendorf 2021-10-11 11:03:02 MDT
Created attachment 21691 [details]
job_submit_lib.lua
Comment 9 Trey Dockendorf 2021-10-11 11:03:16 MDT
Created attachment 21692 [details]
osc_common.lua
Comment 10 Trey Dockendorf 2021-10-11 11:04:29 MDT
Here are debug logs from slurmctld:

Oct 11 13:00:50 pitzer-slurm01-test slurmctld[109031]: debug2: gpu,gpfs:ess:1 is not a gres
Oct 11 13:00:50 pitzer-slurm01-test slurmctld[109031]: _slurm_rpc_submit_batch_job: Invalid Trackable RESource (TRES) specification

We have a job submit Lua script that, based on the submit directory, will add a GPFS GRES.  That now appears not to work.  I've attached our job submit Lua files.

It's worth noting again that all this code worked just fine with 20.11.8; only after upgrading to 21.08.2 did our Lua code break.
Comment 11 Trey Dockendorf 2021-10-11 11:18:27 MDT
Okay, I think I found out what is going on here and have some workarounds I will test deploying.

The CLI filter Lua options now carry a "gres:" prefix in options["gres"], but you can't pass a GRES with that prefix back to the options in the CLI filter, so in the CLI filter we have to strip the "gres:" prefix before submitting.

The job submit Lua requires the "gres:" prefix for all GRES added in Lua; otherwise the GRES is not found.

This is also causing an issue for some custom C plugins we deploy to add GRES to environment variables: they now get "gres:pfsdir:scratch" rather than just "pfsdir:scratch".  That is something I think we can work around too, if this is the new behavior.
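
For illustration, a minimal sketch of the CLI filter side of that workaround (assuming the 21.08 behavior of handing options["gres"] back with a leading "gres:" token; the strip_gres_prefix helper is hypothetical, and gres_defaults is the function quoted above):

local posix = require("posix")

-- Hypothetical helper: drop a leading "gres:" token so the value can be
-- handed back through options["gres"] without the duplicated prefix.
local function strip_gres_prefix(gres)
  if gres == nil then
    return nil
  end
  return (gres:gsub("^gres:", ""))
end

function slurm_cli_pre_submit(options, pack_offset)
  if options["gres"] ~= nil then
    -- 21.08 presents the option as e.g. "gres:gpu"; strip the prefix
    -- before applying the site defaults and handing it back.
    local new_gres = gres_defaults(strip_gres_prefix(options["gres"]))
    options["gres"] = new_gres
    posix.setenv("SLURM_JOB_GRES", options["gres"])
  end
  return slurm.SUCCESS
end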

It would be good to know from SchedMD if what I'm seeing is expected or a bug introduced in SLURM 21.08. I see no mention of such a change in the 21.08 release notes.
Comment 15 Trey Dockendorf 2021-10-13 07:14:56 MDT
I've managed to work around the issues with sbatch, but now I'm unable to use a GRES from srun.  Job:

#!/bin/bash
#SBATCH -t 00:05:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --partition=parallel-40core
#SBATCH -o output/parallel-%j.out

env | sort
echo "hostname"
srun --gres=gpfs:ess -n2 -N2 hostname

This is the error in logs:

hostname
srun: error: Unable to create step for job 2003686: Invalid generic resource (gres) specification

I've tried turning off all our CLI filters and job submit filters, and the issue is still present.  This is now preventing us from further testing SLURM 21.08.
Comment 16 Trey Dockendorf 2021-10-13 07:18:02 MDT
In case it helps, I submitted another job with debug5 on the scheduler and got this:

Oct 13 09:15:50 pitzer-slurm01-test slurmctld[89791]: debug3: StepDesc: user_id=20821 JobId=2003688 node_count=2-2 cpu_count=2 num_tasks=2
Oct 13 09:15:50 pitzer-slurm01-test slurmctld[89791]: debug3:    cpu_freq_gov=4294967294 cpu_freq_max=4294967294 cpu_freq_min=4294967294 relative=65534 task_dist=0x
1 plane=1
Oct 13 09:15:50 pitzer-slurm01-test slurmctld[89791]: debug3:    node_list=(null)  constraints=(null)
Oct 13 09:15:50 pitzer-slurm01-test slurmctld[89791]: debug3:    host=p0002 port=35043 srun_pid=164316 name=hostname network=(null) exclusive=yes
Oct 13 09:15:50 pitzer-slurm01-test slurmctld[89791]: debug3:    mem_per_cpu=4556 resv_port_cnt=65534 immediate=0 no_kill=no
Oct 13 09:15:50 pitzer-slurm01-test slurmctld[89791]: debug3:    overcommit=no time_limit=0
Oct 13 09:15:50 pitzer-slurm01-test slurmctld[89791]: debug3:    TRES_per_step=cpu:2
Oct 13 09:15:50 pitzer-slurm01-test slurmctld[89791]: debug3:    TRES_per_node=gres:gpfs:ess
Oct 13 09:15:50 pitzer-slurm01-test slurmctld[89791]: debug2: cpu:2 is not a gres
Oct 13 09:15:50 pitzer-slurm01-test slurmctld[89791]: debug2: Processing RPC: REQUEST_COMPLETE_BATCH_SCRIPT from UID=0
Oct 13 09:15:50 pitzer-slurm01-test slurmctld[89791]: debug3: Processing RPC details: REQUEST_COMPLETE_BATCH_SCRIPT for JobId=2003688
Oct 13 09:15:50 pitzer-slurm01-test slurmctld[89791]: _job_complete: JobId=2003688 WEXITSTATUS 1
Comment 18 Marcin Stolarek 2021-10-13 09:40:39 MDT
Comment on attachment 21729 [details]
v2

Trey,

After looking into this, we think we should handle the job_desc.gres field in job_submit/lua the same way. However, we'll just document that for 'tres_per_{node,socket,task}' adding the "gres:" prefix in the job_submit.lua script is now required.
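
For example, a minimal hypothetical job_submit.lua sketch of what that documented behavior implies (an illustration only, not the attached OSC script or the patch itself): a GRES written to tres_per_node now needs the "gres:" prefix:

-- Hypothetical sketch: with 21.08, GRES values written to tres_per_node
-- from job_submit.lua must carry the "gres:" type prefix.
function slurm_job_submit(job_desc, part_list, submit_uid)
  if job_desc.tres_per_node == nil then
    -- Pre-21.08 this could be set as "pfsdir:ess"; the prefix is now required.
    job_desc.tres_per_node = "gres:pfsdir:ess"
  end
  return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
  return slurm.SUCCESS
end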

Could you please give the following patch a try?

I've checked the code and played with cli_filter, and I believe those were not affected by the change[1].
 
cheers,
Marcin

[1]https://github.com/SchedMD/slurm/commit/6300d47c2d2485683dedac290b157a8bfe77f918
Comment 19 Trey Dockendorf 2021-10-13 10:34:22 MDT
I verified that the patch works, so I no longer have to add the "gres:" prefix in job_submit Lua.  However, the issue of not being able to use GRES with srun is still present.  Here is perhaps a simpler way to reproduce it, along with the previous comment's batch script:

$ salloc -A PZS0708 --gres=gpfs srun --gres=gpfs hostname
salloc: Pending job allocation 2003701
salloc: job 2003701 queued and waiting for resources
salloc: job 2003701 has been allocated resources
salloc: Granted job allocation 2003701
salloc: Waiting for resource configuration
salloc: Nodes p0006 are ready for job
srun: error: Unable to create step for job 2003701: Invalid generic resource (gres) specification
salloc: Relinquishing job allocation 2003701
salloc: Job allocation 2003701 has been revoked.

However a GPU GRES does work:

$ salloc -A PZS0708 --gres=gpu:1 srun --gres=gpu:1 hostname
salloc: Pending job allocation 2003699
salloc: job 2003699 queued and waiting for resources
salloc: job 2003699 has been allocated resources
salloc: Granted job allocation 2003699
salloc: Waiting for resource configuration
salloc: Nodes p0318 are ready for job
p0318.ten.osc.edu
salloc: Relinquishing job allocation 2003699
Comment 22 Marcin Stolarek 2021-10-14 07:04:39 MDT
Trey,

Unfortunately, you've run into a second issue at the same time. I was able to reproduce it, and I'm looking into the code (and its history) to find a proper solution.

I'll keep you posted.

cheers,
Marcin
Comment 23 Marcin Stolarek 2021-10-14 09:16:28 MDT
Created attachment 21747 [details]
v2 (from bug10498)

Trey,

Am I correct that you're on a test system? In that case, we shouldn't keep the ticket at severity 2.

At the same time, if it is a test system, could you please test the attached patch? I haven't gone through all the bits yet, but it's something I worked on for a different bug report, and from my current testing and understanding it resolves the issue we see here too.

cheers,
Marcin
Comment 24 Trey Dockendorf 2021-10-14 09:37:11 MDT
Correct, this is a test system where we are testing SLURM 21.08 before we attempt a production upgrade. You can lower the severity to whatever is appropriate.

I tested the most recent patch and still have the same issue:

$ salloc -A PZS0708 --gres=gpfs:ess srun --gres=gpfs:ess -vvv hostname
salloc: Pending job allocation 2003715
salloc: job 2003715 queued and waiting for resources
salloc: job 2003715 has been allocated resources
salloc: Granted job allocation 2003715
salloc: Waiting for resource configuration
salloc: Nodes p0006 are ready for job
srun: defined options
srun: -------------------- --------------------
srun: (null)              : p0006
srun: gres                : gres:gpfs:ess
srun: jobid               : 2003715
srun: job-name            : interactive
srun: nodes               : 1
srun: verbose             : 3
srun: -------------------- --------------------
srun: end of defined options
srun: debug:  propagating SLURM_PRIO_PROCESS=0
srun: debug:  propagating UMASK=0002
srun: debug2: srun PMI messages to port=39818
srun: debug:  auth/munge: init: Munge authentication plugin loaded
srun: jobid 2003715: nodes(1):`p0006', cpu counts: 1(x1)
srun: debug2: creating job with 1 tasks
srun: debug:  requesting job 2003715, user 20821, nodes 1 including ((null))
srun: debug:  cpus 1, tasks 1, name hostname, relative 65534
srun: error: Unable to create step for job 2003715: Invalid generic resource (gres) specification
salloc: Relinquishing job allocation 2003715
salloc: Job allocation 2003715 has been revoked.

Verified that using --gres=gpu:1 still works.
Comment 27 Marcin Stolarek 2021-10-15 11:42:28 MDT
Trey,

Just wanted to follow up with you on this. I now understand when the regression was introduced, and I'm looking for the best way to address the issue.
I hope I'll be able to share something with you soon, since I'd like to let you continue testing ASAP.

Since it's on a test system, I'm decreasing the severity to 3.

cheers,
Marcin
Comment 30 Marcin Stolarek 2021-10-18 03:20:44 MDT
Comment on attachment 21791 [details]
v1

Trey,

I attached a new patch set that should solve the issues you noticed. It hasn't passed our QA yet, but since you're working on a test system I'd like to get your feedback on it first.

cheers,
Marcin
Comment 31 Trey Dockendorf 2021-10-19 06:10:36 MDT
The single-node srun test works when used with salloc, but not with sbatch or with any srun using more than one node:

WORKS:
$ salloc -A PZS0708 --gres=gpfs:ess srun --gres=gpfs:ess hostname
salloc: Pending job allocation 2004197
salloc: job 2004197 queued and waiting for resources
salloc: job 2004197 has been allocated resources
salloc: Granted job allocation 2004197
salloc: Waiting for resource configuration
salloc: Nodes p0006 are ready for job
p0006.ten.osc.edu
salloc: Relinquishing job allocation 2004197

$ salloc -A PZS0708 -n 2 --gres=gpfs:ess srun --gres=gpfs:ess hostname
salloc: Pending job allocation 2004199
salloc: job 2004199 queued and waiting for resources
salloc: job 2004199 has been allocated resources
salloc: Granted job allocation 2004199
salloc: Waiting for resource configuration
salloc: Nodes p0002 are ready for job
p0002.ten.osc.edu
p0002.ten.osc.edu
salloc: Relinquishing job allocation 2004199



FAILS:

$ cat parallel.sbatch 
#!/bin/bash
#SBATCH -t 00:05:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --partition=parallel-40core
#SBATCH -o output/parallel-%j.out

echo "hostname"
srun --gres=gpfs:ess -n2 -N2 hostname

$ sbatch parallel.sbatch 
Submitted batch job 2004196

$ cat output/parallel-2004196.out 
hostname
srun: error: Unable to create step for job 2004196: Invalid generic resource (gres) specification

ALSO FAILS:

$ salloc -A PZS0708 -N 2 --gres=gpfs:ess srun --gres=gpfs:ess hostname
salloc: Pending job allocation 2004198
salloc: job 2004198 queued and waiting for resources
salloc: job 2004198 has been allocated resources
salloc: Granted job allocation 2004198
salloc: Waiting for resource configuration
salloc: Nodes p[0002,0006] are ready for job
srun: error: Unable to create step for job 2004198: Invalid generic resource (gres) specification
salloc: Relinquishing job allocation 2004198
Comment 32 Marcin Stolarek 2021-10-19 06:23:18 MDT
Created attachment 21813 [details]
v3

Trey,

I just tried reproducing the multi-node issue you've noticed, without success. I'll try to understand what the cause may be on your side.

In the meantime, I merged the commits into one patch.

cheers,
Marcin
Comment 33 Marcin Stolarek 2021-10-19 06:25:19 MDT
PS. Are you still on 21.08.1? There was a fix for GRES inheritance between 21.08.1 and 21.08.2 that may have an impact. I'd recommend switching to 21.08.2 (plus patches) if you're still on the previous minor release.
Comment 34 Trey Dockendorf 2021-10-19 06:28:13 MDT
I am on 21.08.2 plus patches.  The reason this case says 21.08.1 is that 21.08.2 wasn't an option yet in the dropdown when I opened the case.  I have updated the case dropdown to show 21.08.2.
Comment 35 Marcin Stolarek 2021-10-20 04:06:41 MDT
Created attachment 21832 [details]
v4

Trey,

Could you please apply the attached patch on a clean 21.08.2 and recheck? I wasn't able to reproduce the difference coming from -N2, but I found a trace through the --exclusive allocation handling that results in the error you faced.

If the error still occurs, please collect debug2 slurmctld logs with the Gres and SelectType DebugFlags enabled and share those with us.

cheers,
Marcin
Comment 36 Marcin Stolarek 2021-10-20 05:07:14 MDT
PS. I think the patch will solve the issue. I just realized that you're routing jobs requesting more than one node to the "parallel" partitions via the job_submit plugin. Those partitions are configured with OverSubscribe=EXCLUSIVE, which results in a call through the trace I found.
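
For context, a simplified and purely hypothetical sketch of that kind of routing (the real logic is in the attached job_submit.lua; the partition name and node threshold here are illustrative only):

-- Hypothetical illustration: jobs asking for more than one node are routed
-- to a "parallel" partition, which slurm.conf configures with
-- OverSubscribe=EXCLUSIVE.
function slurm_job_submit(job_desc, part_list, submit_uid)
  local nodes = job_desc.min_nodes
  -- min_nodes may be unset; only reroute an explicit multi-node request.
  if job_desc.partition == nil and nodes ~= nil and
     nodes ~= slurm.NO_VAL and nodes > 1 then
    job_desc.partition = "parallel-40core"
  end
  return slurm.SUCCESS
end
-- (slurm_job_modify stub omitted; same as in the earlier sketch)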
Comment 38 Andrew D'Angelo 2021-10-20 10:50:26 MDT
Hi,

We're seeing this behavior on our internal systems with Slurm 21.08.1 as well.
Is this fix likely to be released in a maintenance release?
We're working around this in our tools and would like to be able to check for a maximum version that has this bug (say, between 21.08 and 21.08.xx).

Thanks,
Andrew
Comment 39 Trey Dockendorf 2021-10-20 11:03:30 MDT
I verified that the latest patch solves the issues for all cases we have seen so far.  I will ask one of my colleagues to run his automated test suite, which is how many of these issues were initially discovered.
Comment 43 Quirin Lohr 2021-10-22 01:04:31 MDT
After applying patch v4, the scheduling seems to work, but the no_consume flag seems to have no effect.

Submitting a job with --gpus=rtx_a6000:1 works, while submitting a job with --gpus=1 --gres=VRAM:48G stays in state pending with reason Resources. This happens with both srun and sbatch.

Nodes are multi-GPU, and at least one job with requested VRAM is active on each node.

Requesting less VRAM will run the job on a node where that amount of gres/VRAM is "left" (e.g. one job already running with VRAM:12G on node1, which has 24G of VRAM, lets another job with VRAM <= 12G run on that node).

gres/VRAM is set to no_consume and allows users' jobs to be scheduled to GPUs with at least the requested VRAM.

This worked in 20.11



Nodes from slurm.conf:

NodeName=node1  CPUs=24 Boards=1 SocketsPerBoard=2 CoresPerSocket=12 ThreadsPerCore=1 RealMemory=230000  Weight=30 Feature=CPU_GEN:SBEP,CPU_SKU=E5-26,p6000      Gres=gpu:p6000:4,VRAM:no_consume:24G,cudacores:no_consume:3840
        
NodeName=node2  CPUs=36 Boards=1 SocketsPerBoard=2 CoresPerSocket=18 ThreadsPerCore=1 RealMemory=490000  Weight=20 Feature=CPU_GEN:SBEP,CPU_SKU=E5-26,titan      Gres=gpu:titan:7,VRAM:no_consume:12G,cudacores:no_consume:3584
 
[...]

NodeName=node14 CPUs=36 Boards=1 SocketsPerBoard=2 CoresPerSocket=18 ThreadsPerCore=1 RealMemory=1500000 Weight=63 Feature=CPU_GEN:CL,CPU_SKU=GOLD-62,rtx_a6000  Gres=gpu:rtx_a6000:8,VRAM:no_consume:48G,cudacores:no_consume:10752



gres.conf:
Autodetect=nvml
NodeName=node[1,7,9] Name=VRAM Count=24G 
NodeName=node[2-6] Name=VRAM Count=12G
NodeName=node[8,10] Name=VRAM Count=16G
NodeName=node[11-14] Name=VRAM Count=48G
Comment 44 Trey Dockendorf 2021-10-22 06:24:29 MDT
I confirmed the issue of the no_consume GRES not being respected (sort of; toward the end there is behavior that contradicts this):

$ sbatch -A PZS0708 --gres=gpfs -w p0317 -n 1 --wrap 'sleep 600'
Submitted batch job 2004679
$ sbatch -A PZS0708 --gres=gpfs -w p0317 -n 1 --wrap 'sleep 600'
Submitted batch job 2004680

$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           2004679 gpubackfi     wrap tdockend  R       0:05      1 p0317
           2004680 gpubackfi     wrap tdockend PD       0:00      1 (ReqNodeNotAvail, UnavailableNodes:p0317)


What is confounding about this is that after some time the issue seems to go away, so it seems not to be 100% repeatable for some reason.  In our environment, if you don't ask for a partition you get the default, which the job_submit filter changes to a long list of partitions based on the request, and that can often cause weird pending-job reasons, so I tried this:

$ sbatch -A PZS0708 --gres=gpfs:ess -w p0002 -p serial-40core -n 1 --wrap 'sleep 600'

Running the above twice did not leave jobs pending, but I'm not sure whether the way the job was requested is the reason the behavior went away, or whether it's just timing with some bug in the code that makes this issue not happen every time.  Here is another example; following on from the above, I submitted 2 more jobs:

$ sbatch -A PZS0708 --gres=gpfs -w p0317 -n 1 --wrap 'sleep 600'
Submitted batch job 2004681
$ sbatch -A PZS0708 --gres=gpfs -w p0317 -n 1 --wrap 'sleep 600'
Submitted batch job 2004682

$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           2004679 gpubackfi     wrap tdockend  R       3:28      1 p0317
           2004680 gpubackfi     wrap tdockend  R       3:28      1 p0317
           2004682 gpubackfi     wrap tdockend PD       0:00      1 (ReqNodeNotAvail, UnavailableNodes:p0317)
           2004681 gpubackfi     wrap tdockend PD       0:00      1 (ReqNodeNotAvail, UnavailableNodes:p0317)

However, after maybe 20 seconds, or close to a scheduler iteration, the jobs are running:

$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           2004681 gpubackfi     wrap tdockend  R       0:28      1 p0317
           2004682 gpubackfi     wrap tdockend  R       0:28      1 p0317
           2004679 gpubackfi     wrap tdockend  R       3:58      1 p0317
           2004680 gpubackfi     wrap tdockend  R       3:58      1 p0317

So maybe there is no bug, and instead the scheduling of no_consume GRES is just a little delayed and perhaps only getting picked up by the backfill scheduler or something?  I think this might be the case because when I look at job 2004681 it says "Scheduler=Backfill", but maybe that's also expected if backfill happened before a regular scheduler iteration.
Comment 45 Trey Dockendorf 2021-10-22 06:40:10 MDT
I attempted to reproduce Quirin Lohr's issue, but maybe my GRES setup isn't able to reproduce it:

$ sbatch -A PZS0708 --gres=gpfs:ess --gpus=1 -w p0317 -n 1 -p gpuserial-48core --wrap 'sleep 600'
Submitted batch job 2004683
$ sbatch -A PZS0708 --gres=gpfs:ess --gpus=1 -w p0317 -n 1 -p gpuserial-48core --wrap 'sleep 600'
Submitted batch job 2004684
$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           2004684 gpuserial     wrap tdockend  R       0:02      1 p0317
           2004683 gpuserial     wrap tdockend  R       0:05      1 p0317
Comment 54 Marcin Stolarek 2021-10-26 02:53:30 MDT
A series of commits, 2525601a01..768c6e9e57, was merged to the Slurm 21.08 branch of our main repository. This will be released in Slurm 21.08.3.

We were unable to reproduce the scheduling issues mentioned here by Quirin and Trey, but the final code changes are more complete than what we had shared. Could you please rerun the tests using the 21.08 branch[1]?

If you prefer I can prepare a patch with those changes for you.

cheers,
Marcin
[1]https://github.com/schedmd/slurm/tree/slurm-21.08
Comment 55 Trey Dockendorf 2021-10-26 07:58:27 MDT
I used the 2525601a01^..768c6e9e57 commit range to re-patch our builds, and so far things seem to work.
Comment 56 Quirin Lohr 2021-10-27 00:20:43 MDT
Short update: the problem solved itself after a reboot of the slurmctld/slurmdbd VM.

I had only restarted the services after installing the patched versions. When I rebooted, the backup ctld/dbd (also at patch level "v4") took over and the pending jobs miraculously started.

After the primary took back over, the problem could not be reproduced.

We're still on patch level "v4" with no problems.
Comment 57 Marcin Stolarek 2021-10-27 00:59:39 MDT
Thanks for checking. I'm closing the bug report now as fixed.

cheers,
Marcin