Created attachment 21685 [details]
gres.conf

I am testing the upgrade from 20.11.8 to 21.08.2 and have discovered I'm unable to submit jobs that request a GRES:

$ sbatch -A PZS0708 --gres=gpfs:1 --wrap 'sleep 600'
sbatch: error: Invalid generic resource (gres) specification
$ sbatch -A PZS0708 --gres=gpu:1 --wrap 'sleep 600'
sbatch: error: Invalid generic resource (gres) specification

I'm attaching our gres.conf and slurm.conf that did not have this issue on 20.11.8. The TRES entries are in the database too:

[root@slurmdbd01-test ~]# sacctmgr show tres
    Type            Name     ID
-------- --------------- ------
     cpu                      1
     mem                      2
  energy                      3
    node                      4
 billing                      5
      fs            disk      6
    vmem                      7
   pages                      8
    gres        gpfs:ess   1001
    gres    gpfs:project   1002
    gres    gpfs:scratch   1003
    gres             gpu   1004
    gres        gpu:v100   1005
    gres             ime   1006
    gres          pfsdir   1007
    gres      pfsdir:ess   1008
    gres  pfsdir:scratch   1009
    gres    gpu:v100-32g   1010
    gres   gpu:v100-quad   1011
    gres             vis   1012
    gres        gpu:p100   1013
    gres             mps   1014
 license          abaqus   1015
 license      abaqus@osc   1016
 license   abaquscae@osc   1017
 license abaqusexplicit+   1018
 license abaqusextended+   1019
 license       ansys@osc   1020
 license    ansyspar@osc   1021
 license comsolscript@o+   1022
 license        epik@osc   1023
 license       glide@osc   1024
 license     ligprep@osc   1025
 license      lsdyna@osc   1026
 license  macromodel@osc   1027
 license     qikprep@osc   1028
 license     starccm@osc   1029
 license  starccmpar@osc   1030
 license       stata@osc   1031
 license     usearch@osc   1032
 license     qikprop@osc   1033
Created attachment 21686 [details] slurm.conf
Here is the verbose sbatch output, and you can see the issue is that it tries to submit "gres:gres:gpu":

$ sbatch -A PZS0708 --gres=gpu --wrap 'sleep 600' -vvv
sbatch: defined options
sbatch: -------------------- --------------------
sbatch: account             : PZS0708
sbatch: gres                : gres:gres:gpu
sbatch: verbose             : 3
sbatch: wrap                : sleep 600
sbatch: -------------------- --------------------
sbatch: end of defined options
sbatch: debug2: spank: lua.so: init_post_opt = 0
sbatch: debug2: spank: private-tmpdir.so: init_post_opt = 0
sbatch: debug:  propagating SLURM_PRIO_PROCESS=0
sbatch: debug:  propagating UMASK=0002
sbatch: select/cons_res: common_init: select/cons_res loaded
sbatch: select/cons_tres: common_init: select/cons_tres loaded
sbatch: select/cray_aries: init: Cray/Aries node selection plugin loaded
sbatch: select/linear: init: Linear node selection plugin loaded with argument 276
sbatch: debug:  _get_next_gres: Failed to locate GRES gres
sbatch: error: Invalid generic resource (gres) specification
sbatch: debug2: spank: lua.so: exit = 0
sbatch: debug2: spank: private-tmpdir.so: exit = 0
Looks like the CLI filter Lua behavior changed. We have this in "slurm_cli_pre_submit":

if options["gres"] ~= nil then
   local new_gres = gres_defaults(options["gres"])
   options["gres"] = new_gres
   posix.setenv("SLURM_JOB_GRES", options["gres"])
end

The function gres_defaults:

function gres_defaults(gres)
   if gres == nil then
      return gres
   end
   -- split the request on commas, rewriting a bare "pfsdir" entry
   local new_gres = {}
   for g in string.gmatch(gres, "([^,]+)") do
      if g == "pfsdir" then
         g = "pfsdir:ess"
      end
      new_gres[#new_gres+1] = g
   end
   -- rejoin the entries into a comma-separated string
   gres = ""
   for i,g in ipairs(new_gres) do
      if gres ~= "" then
         gres = gres .. "," .. g
      else
         gres = g
      end
   end
   return gres
end

The entire point of this code is that if you request the "pfsdir" GRES it's transformed into the "pfsdir:ess" GRES. Is there some bug or new behavior in how setting GRES via CLI filter is handled?
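For illustration, the transform behaves like this on a made-up request (values not from our actual configs):

print(gres_defaults("pfsdir,gpu:1"))    -- "pfsdir:ess,gpu:1"
print(gres_defaults("pfsdir:scratch"))  -- unchanged; only a bare "pfsdir" entry is rewritten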
Dug into this, comparing 20.11 and 21.08: it looks like options["gres"] in 20.11 would be "gpu", but in 21.08 it's "gres:gpu". Is this intentional or a bug?
One more bit of data: this is also affecting job_submit Lua, so I'm guessing the GRES handling in Lua has either changed or is broken.
So there is seemingly a disconnect between sbatch and slurmctld because I got the -vvv output to show "gres:gpu" but now the scheduler is rejecting the job:

$ sbatch -A PZS0708 --gres=gpu --wrap 'sleep 600' -vvv
DEBUG: gres=gres:gpu
sbatch: defined options
sbatch: -------------------- --------------------
sbatch: account             : PZS0708
sbatch: gres                : gres:gpu
sbatch: verbose             : 3
sbatch: wrap                : sleep 600
sbatch: -------------------- --------------------
sbatch: end of defined options
sbatch: debug2: spank: lua.so: init_post_opt = 0
sbatch: debug2: spank: private-tmpdir.so: init_post_opt = 0
sbatch: debug:  propagating SLURM_PRIO_PROCESS=0
sbatch: debug:  propagating UMASK=0002
sbatch: select/cons_res: common_init: select/cons_res loaded
sbatch: select/cons_tres: common_init: select/cons_tres loaded
sbatch: select/cray_aries: init: Cray/Aries node selection plugin loaded
sbatch: select/linear: init: Linear node selection plugin loaded with argument 276
sbatch: debug:  auth/munge: init: Munge authentication plugin loaded
sbatch: error: Batch job submission failed: Invalid Trackable RESource (TRES) specification
sbatch: debug2: spank: lua.so: exit = 0
sbatch: debug2: spank: private-tmpdir.so: exit = 0

This is in slurmctld.log:

Oct 11 12:58:36 pitzer-slurm01-test slurmctld[109031]: _slurm_rpc_submit_batch_job: Invalid Trackable RESource (TRES) specification
Created attachment 21690 [details] job_submit.lua
Created attachment 21691 [details] job_submit_lib.lua
Created attachment 21692 [details] osc_common.lua
Here are debug logs from slurmctld:

Oct 11 13:00:50 pitzer-slurm01-test slurmctld[109031]: debug2: gpu,gpfs:ess:1 is not a gres
Oct 11 13:00:50 pitzer-slurm01-test slurmctld[109031]: _slurm_rpc_submit_batch_job: Invalid Trackable RESource (TRES) specification

We have a job_submit Lua that, based on the submit directory, will add a GPFS GRES. That now appears to not work. I've attached our job_submit Lua files. It's worth noting again that all this code worked just fine with 20.11.8; only when upgrading to 21.08.2 did things break with our Lua code.
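To give a sense of the shape of that logic, here is a hypothetical reduction (not the attached code; the paths and names are made up):

-- map the submit directory to a GPFS GRES and attach it to the job
local function gpfs_gres_for_dir(work_dir)
   if string.find(work_dir, "^/fs/ess") then
      return "gpfs:ess"
   elseif string.find(work_dir, "^/fs/scratch") then
      return "gpfs:scratch"
   end
   return nil
end

function slurm_job_submit(job_desc, part_list, submit_uid)
   local g = gpfs_gres_for_dir(job_desc.work_dir or "")
   if g ~= nil then
      job_desc.gres = g  -- accepted as-is on 20.11.8, rejected on 21.08.2
   end
   return slurm.SUCCESS
end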
Okay, I think I found out what is going on here and have some workarounds I will test deploying.

- The CLI filter Lua options now contain a "gres:" prefix in options["gres"], but you can't pass a GRES with that prefix back to the options, so in the CLI filter we have to strip the "gres:" prefix before submitting.
- The job_submit Lua requires the "gres:" prefix on all GRES added in Lua; otherwise the GRES is not found.

This is also an issue for some custom C plugins we deploy to add GRES to environment variables: they now get "gres:pfsdir:scratch" rather than just "pfsdir:scratch". That is something I think we can work around too if this is the new behavior.

It would be good to know from SchedMD whether what I'm seeing is expected or a bug introduced in SLURM 21.08. I see no mention of such a change in the 21.08 release notes.
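Roughly, the workarounds look like this (simplified sketches of what I'll be testing, not the exact code). On the CLI filter side, strip the prefix before rewriting:

function slurm_cli_pre_submit(options, pack_offset)
   if options["gres"] ~= nil then
      -- 21.08 hands us e.g. "gres:gpu", but rejects the prefixed form
      -- if passed back, so strip it before applying our defaults
      local stripped = string.gsub(options["gres"], "^gres:", "")
      options["gres"] = gres_defaults(stripped)
      posix.setenv("SLURM_JOB_GRES", options["gres"])
   end
   return slurm.SUCCESS
end

And on the job_submit side, add the prefix to anything we set:

-- prefix each comma-separated entry with "gres:" if it isn't already
local function with_gres_prefix(tres)
   local out = {}
   for entry in string.gmatch(tres, "([^,]+)") do
      if not string.find(entry, "^gres:") then
         entry = "gres:" .. entry
      end
      out[#out + 1] = entry
   end
   return table.concat(out, ",")
end
-- e.g. job_desc.tres_per_node = with_gres_prefix("gpfs:ess")  --> "gres:gpfs:ess"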
I've managed to work around the issues with sbatch but now I'm unable to use a GRES from srun. Job:

#!/bin/bash
#SBATCH -t 00:05:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --partition=parallel-40core
#SBATCH -o output/parallel-%j.out

env | sort
echo "hostname"
srun --gres=gpfs:ess -n2 -N2 hostname

This is the error in the logs:

hostname
srun: error: Unable to create step for job 2003686: Invalid generic resource (gres) specification

I've tried turning off all our CLI filters and job submit filters and the issue is still present. This issue is now preventing us from further testing SLURM 21.08.
In case it helps, I submitted another job with debug5 on the scheduler and get this:

Oct 13 09:15:50 pitzer-slurm01-test slurmctld[89791]: debug3: StepDesc: user_id=20821 JobId=2003688 node_count=2-2 cpu_count=2 num_tasks=2
Oct 13 09:15:50 pitzer-slurm01-test slurmctld[89791]: debug3:    cpu_freq_gov=4294967294 cpu_freq_max=4294967294 cpu_freq_min=4294967294 relative=65534 task_dist=0x1 plane=1
Oct 13 09:15:50 pitzer-slurm01-test slurmctld[89791]: debug3:    node_list=(null) constraints=(null)
Oct 13 09:15:50 pitzer-slurm01-test slurmctld[89791]: debug3:    host=p0002 port=35043 srun_pid=164316 name=hostname network=(null) exclusive=yes
Oct 13 09:15:50 pitzer-slurm01-test slurmctld[89791]: debug3:    mem_per_cpu=4556 resv_port_cnt=65534 immediate=0 no_kill=no
Oct 13 09:15:50 pitzer-slurm01-test slurmctld[89791]: debug3:    overcommit=no time_limit=0
Oct 13 09:15:50 pitzer-slurm01-test slurmctld[89791]: debug3:    TRES_per_step=cpu:2
Oct 13 09:15:50 pitzer-slurm01-test slurmctld[89791]: debug3:    TRES_per_node=gres:gpfs:ess
Oct 13 09:15:50 pitzer-slurm01-test slurmctld[89791]: debug2: cpu:2 is not a gres
Oct 13 09:15:50 pitzer-slurm01-test slurmctld[89791]: debug2: Processing RPC: REQUEST_COMPLETE_BATCH_SCRIPT from UID=0
Oct 13 09:15:50 pitzer-slurm01-test slurmctld[89791]: debug3: Processing RPC details: REQUEST_COMPLETE_BATCH_SCRIPT for JobId=2003688
Oct 13 09:15:50 pitzer-slurm01-test slurmctld[89791]: _job_complete: JobId=2003688 WEXITSTATUS 1
Comment on attachment 21729 [details]
v2

Trey,

After looking into this we think that we should handle the same behavior for the job_desc.gres field in job_submit/lua. However, we'll just document that for 'tres_per_{node,socket,task}' the addition of the "gres:" prefix in the job_submit.lua script is now required.

Could you please give the following patch a try?

I've checked the code and played with cli_filter, and I believe that those were not affected by the change[1].

cheers,
Marcin

[1] https://github.com/SchedMD/slurm/commit/6300d47c2d2485683dedac290b157a8bfe77f918
I verified that patch works so I no longer have to add the "gres:" prefix in job_submit Lua. However the issue of not being able to use GRES with srun is still present. Here is maybe a simpler way to reproduce, along with the previous comment's batch script:

$ salloc -A PZS0708 --gres=gpfs srun --gres=gpfs hostname
salloc: Pending job allocation 2003701
salloc: job 2003701 queued and waiting for resources
salloc: job 2003701 has been allocated resources
salloc: Granted job allocation 2003701
salloc: Waiting for resource configuration
salloc: Nodes p0006 are ready for job
srun: error: Unable to create step for job 2003701: Invalid generic resource (gres) specification
salloc: Relinquishing job allocation 2003701
salloc: Job allocation 2003701 has been revoked.

However a GPU GRES does work:

$ salloc -A PZS0708 --gres=gpu:1 srun --gres=gpu:1 hostname
salloc: Pending job allocation 2003699
salloc: job 2003699 queued and waiting for resources
salloc: job 2003699 has been allocated resources
salloc: Granted job allocation 2003699
salloc: Waiting for resource configuration
salloc: Nodes p0318 are ready for job
p0318.ten.osc.edu
salloc: Relinquishing job allocation 2003699
Trey,

Unfortunately, you've noticed a second issue at once. I was able to reproduce that and I'm looking into the code (and its history) to find a proper solution. I'll keep you posted.

cheers,
Marcin
Created attachment 21747 [details]
v2 (from bug10498)

Trey,

Am I correct that you're on a test system? In that case, we shouldn't keep the ticket at severity 2. At the same time, if it's a test system, could you please test the attached patch? I haven't gone through all the bits yet, but it's something I worked on in a different bug report, and from my current testing and understanding it resolves the issue we see here too.

cheers,
Marcin
Correct, this is a test system where we are testing SLURM 21.08 before we attempt a production upgrade. Can lower severity to whatever is appropriate.

I tested that most recent patch and still have the same issue:

$ salloc -A PZS0708 --gres=gpfs:ess srun --gres=gpfs:ess -vvv hostname
salloc: Pending job allocation 2003715
salloc: job 2003715 queued and waiting for resources
salloc: job 2003715 has been allocated resources
salloc: Granted job allocation 2003715
salloc: Waiting for resource configuration
salloc: Nodes p0006 are ready for job
srun: defined options
srun: -------------------- --------------------
srun: (null)              : p0006
srun: gres                : gres:gpfs:ess
srun: jobid               : 2003715
srun: job-name            : interactive
srun: nodes               : 1
srun: verbose             : 3
srun: -------------------- --------------------
srun: end of defined options
srun: debug:  propagating SLURM_PRIO_PROCESS=0
srun: debug:  propagating UMASK=0002
srun: debug2: srun PMI messages to port=39818
srun: debug:  auth/munge: init: Munge authentication plugin loaded
srun: jobid 2003715: nodes(1):`p0006', cpu counts: 1(x1)
srun: debug2: creating job with 1 tasks
srun: debug:  requesting job 2003715, user 20821, nodes 1 including ((null))
srun: debug:  cpus 1, tasks 1, name hostname, relative 65534
srun: error: Unable to create step for job 2003715: Invalid generic resource (gres) specification
salloc: Relinquishing job allocation 2003715
salloc: Job allocation 2003715 has been revoked.

Verified that using --gres=gpu:1 still works.
Trey,

Just wanted to follow up with you on that. I understand when the regression was introduced and I'm looking for the best way to address the issue. I hope I'll be able to share something with you soon since I'd like to let you continue testing ASAP.

Since it's on a test system, I'm decreasing the severity to 3.

cheers,
Marcin
Comment on attachment 21791 [details]
v1

Trey,

I attached a new patch set that should solve the issues you noticed. It hasn't passed our QA yet, but since you're working on a test system I'd like to get your feedback on it first.

cheers,
Marcin
The single node srun test works when used with salloc but not sbatch or any srun with more than one node:

WORKS:

$ salloc -A PZS0708 --gres=gpfs:ess srun --gres=gpfs:ess hostname
salloc: Pending job allocation 2004197
salloc: job 2004197 queued and waiting for resources
salloc: job 2004197 has been allocated resources
salloc: Granted job allocation 2004197
salloc: Waiting for resource configuration
salloc: Nodes p0006 are ready for job
p0006.ten.osc.edu
salloc: Relinquishing job allocation 2004197

$ salloc -A PZS0708 -n 2 --gres=gpfs:ess srun --gres=gpfs:ess hostname
salloc: Pending job allocation 2004199
salloc: job 2004199 queued and waiting for resources
salloc: job 2004199 has been allocated resources
salloc: Granted job allocation 2004199
salloc: Waiting for resource configuration
salloc: Nodes p0002 are ready for job
p0002.ten.osc.edu
p0002.ten.osc.edu
salloc: Relinquishing job allocation 2004199

FAILS:

$ cat parallel.sbatch
#!/bin/bash
#SBATCH -t 00:05:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --partition=parallel-40core
#SBATCH -o output/parallel-%j.out

echo "hostname"
srun --gres=gpfs:ess -n2 -N2 hostname
$ sbatch parallel.sbatch
Submitted batch job 2004196
$ cat output/parallel-2004196.out
hostname
srun: error: Unable to create step for job 2004196: Invalid generic resource (gres) specification

ALSO FAILS:

$ salloc -A PZS0708 -N 2 --gres=gpfs:ess srun --gres=gpfs:ess hostname
salloc: Pending job allocation 2004198
salloc: job 2004198 queued and waiting for resources
salloc: job 2004198 has been allocated resources
salloc: Granted job allocation 2004198
salloc: Waiting for resource configuration
salloc: Nodes p[0002,0006] are ready for job
srun: error: Unable to create step for job 2004198: Invalid generic resource (gres) specification
salloc: Relinquishing job allocation 2004198
Created attachment 21813 [details]
v3

Trey,

I just tried reproducing the multi-node issue you've noticed, without success. I'll try to understand what the cause may be on your side. In the meantime, I merged the commits into one patch.

cheers,
Marcin
PS. Are you still on 21.08.1? There was a fix for GRES inheritance between 21.08.1 and 21.08.2 that may have an impact. I'd recommend switching to 21.08.2 (+patches) if you're still on the previous minor release.
I am on 21.08.2 plus patches. The reason this case says 21.08.1 is that 21.08.2 wasn't yet an option in the dropdown when I opened the case. I have updated the case dropdown to show 21.08.2.
Created attachment 21832 [details]
v4

Trey,

Could you please apply the attached patch on a clean 21.08.2 and recheck? I wasn't able to reproduce the difference coming from -N2, but I found a trace through --exclusive allocation handling that results in the error you faced.

In case of the error, please collect debug2 slurmctld logs with the Gres and SelectType DebugFlags enabled and share those with us.

cheers,
Marcin
PS. I think the patch will solve the issue. I just realized that you're routing jobs requesting more than one node to the "parallel" partitions via the job_submit plugin. Those partitions are configured with OverSubscribe=EXCLUSIVE, which results in a call through the trace I found.
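For the log collection, something along these lines in slurm.conf should be enough (then apply with "scontrol reconfigure" or a slurmctld restart):

SlurmctldDebug=debug2
DebugFlags=Gres,SelectType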
Hi,

We're seeing this behavior on our internal systems with Slurm 21.08.1 as well. Is this fix likely to land in a maintenance release? We're working around this in our tools and would like to be able to check for a maximum version that has this bug (say, between 21.08 and 21.08.xx).

Thanks,
Andrew
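As a rough sketch of such a version gate, in Lua for consistency with the rest of this thread (this assumes the fix lands in 21.08.3, which a later comment confirms, and that `sbatch --version` prints something like "slurm 21.08.2"):

local function affected_by_gres_prefix_bug()
   -- parse e.g. "slurm 21.08.2" from the client
   local fh = io.popen("sbatch --version")
   local out = fh:read("*a")
   fh:close()
   local maj, min, mic = out:match("(%d+)%.(%d+)%.(%d+)")
   if maj == nil then
      return false
   end
   -- affected range: 21.08.0 up to (not including) 21.08.3
   return tonumber(maj) == 21 and tonumber(min) == 8 and tonumber(mic) < 3
end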
I verified the latest patch solves the issues for all the cases we have seen so far. I will ask one of my colleagues to run his automated test suite, which is how many of these issues were initially discovered.
After applying patch v4, scheduling seems to work, but the no_consume flag seems to have no effect. Submitting a job with --gpus=rtx_a6000:1 works, while submitting a job with --gpus=1 --gres=VRAM:48G stays in state pending with reason Resources. Both srun and sbatch.

Nodes are multi-GPU, and at least one job with requested VRAM is active on each node. Requesting less VRAM will run the job on a node where that amount of gres/VRAM is "left" (e.g. one job already running with VRAM:12G on node1, which has 24G VRAM, lets another job with VRAM <= 12G run on that node).

gres/VRAM is set to no_consume and allows users' jobs to be scheduled to GPUs with at least the requested VRAM. This worked in 20.11.

Nodes from slurm.conf:

NodeName=node1 CPUs=24 Boards=1 SocketsPerBoard=2 CoresPerSocket=12 ThreadsPerCore=1 RealMemory=230000 Weight=30 Feature=CPU_GEN:SBEP,CPU_SKU=E5-26,p6000 Gres=gpu:p6000:4,VRAM:no_consume:24G,cudacores:no_consume:3840
NodeName=node2 CPUs=36 Boards=1 SocketsPerBoard=2 CoresPerSocket=18 ThreadsPerCore=1 RealMemory=490000 Weight=20 Feature=CPU_GEN:SBEP,CPU_SKU=E5-26,titan Gres=gpu:titan:7,VRAM:no_consume:12G,cudacores:no_consume:3584
[...]
NodeName=node14 CPUs=36 Boards=1 SocketsPerBoard=2 CoresPerSocket=18 ThreadsPerCore=1 RealMemory=1500000 Weight=63 Feature=CPU_GEN:CL,CPU_SKU=GOLD-62,rtx_a6000 Gres=gpu:rtx_a6000:8,VRAM:no_consume:48G,cudacores:no_consume:10752

gres.conf:

Autodetect=nvml
NodeName=node[1,7,9] Name=VRAM Count=24G
NodeName=node[2-6] Name=VRAM Count=12G
NodeName=node[8,10] Name=VRAM Count=16G
NodeName=node[11-14] Name=VRAM Count=48G
I confirmed the issue of no_consume GRES not being respected (sort of; toward the end the behavior contradicts itself):

$ sbatch -A PZS0708 --gres=gpfs -w p0317 -n 1 --wrap 'sleep 600'
Submitted batch job 2004679
$ sbatch -A PZS0708 --gres=gpfs -w p0317 -n 1 --wrap 'sleep 600'
Submitted batch job 2004680
$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           2004679 gpubackfi     wrap tdockend  R       0:05      1 p0317
           2004680 gpubackfi     wrap tdockend PD       0:00      1 (ReqNodeNotAvail, UnavailableNodes:p0317)

What is confounding about this is that after some time the issue seems to go away, so it's not 100% repeatable for some reason. In our environment, if you don't ask for a partition you get the default, which the job_submit filter changes to a long list of partitions based on the request, and that can often cause weird pending-job reasons, so I tried this:

$ sbatch -A PZS0708 --gres=gpfs:ess -w p0002 -p serial-40core -n 1 --wrap 'sleep 600'

Run twice, the above did not leave jobs pending, but I'm not sure whether the way the job was requested made the behavior go away, or it was just timing with some bug in the code that makes this issue not happen every time. Here is another example; following on from the above, I submitted 2 more jobs:

$ sbatch -A PZS0708 --gres=gpfs -w p0317 -n 1 --wrap 'sleep 600'
Submitted batch job 2004681
$ sbatch -A PZS0708 --gres=gpfs -w p0317 -n 1 --wrap 'sleep 600'
Submitted batch job 2004682
$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           2004679 gpubackfi     wrap tdockend  R       3:28      1 p0317
           2004680 gpubackfi     wrap tdockend  R       3:28      1 p0317
           2004682 gpubackfi     wrap tdockend PD       0:00      1 (ReqNodeNotAvail, UnavailableNodes:p0317)
           2004681 gpubackfi     wrap tdockend PD       0:00      1 (ReqNodeNotAvail, UnavailableNodes:p0317)

However, after maybe 20 seconds, or close to a scheduler iteration, the jobs are running:

$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           2004681 gpubackfi     wrap tdockend  R       0:28      1 p0317
           2004682 gpubackfi     wrap tdockend  R       0:28      1 p0317
           2004679 gpubackfi     wrap tdockend  R       3:58      1 p0317
           2004680 gpubackfi     wrap tdockend  R       3:58      1 p0317

So maybe there is no bug, and instead the scheduling of no_consume GRES is just a little delayed, perhaps only getting picked up by the backfill scheduler or something? I think this might be the case because I noticed that when I look at job 2004681 it says "Scheduler=Backfill", but maybe that's also expected if backfill happened before the regular scheduler iteration.
I attempted to reproduce Quirin Lohr's issue, but maybe my GRES setup isn't able to reproduce it:

$ sbatch -A PZS0708 --gres=gpfs:ess --gpus=1 -w p0317 -n 1 -p gpuserial-48core --wrap 'sleep 600'
Submitted batch job 2004683
$ sbatch -A PZS0708 --gres=gpfs:ess --gpus=1 -w p0317 -n 1 -p gpuserial-48core --wrap 'sleep 600'
Submitted batch job 2004684
$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           2004684 gpuserial     wrap tdockend  R       0:02      1 p0317
           2004683 gpuserial     wrap tdockend  R       0:05      1 p0317
A series of commits, 2525601a01..768c6e9e57, was merged to our main repository on the Slurm 21.08 branch. This will be released in Slurm 21.08.3.

We were unable to reproduce the scheduling issues mentioned here by Quirin and Trey, but the code changes ended up being more complete than what we shared earlier. Could you please rerun the tests using the 21.08 branch[1]? If you prefer, I can prepare a patch with those changes for you.

cheers,
Marcin

[1] https://github.com/schedmd/slurm/tree/slurm-21.08
I used the 2525601a01^..768c6e9e57 commit range to re-patch our builds, and so far things seem to work.
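(For anyone wanting to repeat this: that range can be turned into patch files with standard git, e.g. `git format-patch 2525601a01^..768c6e9e57`, and the resulting patches applied during the build. The exact invocation is my guess, not from the thread.)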
Short update: the problem solved itself after a reboot of the slurmctld/slurmdbd VM. Previously I had only restarted the services after installing the patched versions. When I rebooted, the backup ctld/dbd (also at patch level "v4") took over and the pending jobs miraculously started. After the primary took back over, the problem could not be reproduced. Still at patch level "v4" and no problems.
Thanks for checking. I'm closing the bug report now as fixed.

cheers,
Marcin