Ticket 11044 - submit to multiple partitions with GRES specified with patch
Summary: submit to multiple partitions with GRES specified with patch
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 20.11.4
Hardware: Linux
Importance: --- C - Contributions
Assignee: Tim Wickberg
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2021-03-09 07:08 MST by Bas van der Vlies
Modified: 2021-09-10 04:57 MDT
CC List: 2 users

See Also:
Site: SURF
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
multipartition gres fix (1.06 KB, patch)
2021-03-09 07:08 MST, Bas van der Vlies
Also add patch for 20.02 version (1017 bytes, patch)
2021-03-24 11:39 MDT, Bas van der Vlies

Description Bas van der Vlies 2021-03-09 07:08:43 MST
Created attachment 18305
multipartition gres fix

At our site we have a lot of partitions for the different CPU/GPU types. To make it easy for the user, we have written a job_submit.lua script that submits to all CPU partitions or all GPU partitions. But this fails when we make use of a GRES specification.

In this cluster we do not have GPUs, so I defined a GRES type:
 * cpu_type

We have defined two non-consumable, count-only (Flags=CountOnly) GRES types:
 * e5_2650_v1 
 * e5_2650_v2

Two partitions:
 * cpu_e5_2650_v1 --> 1 node
 * cpu_e5_2650_v2 --> 1 node
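
For reference, the setup described above might look roughly like the following. This is only an illustrative sketch: the node names (node-v1, node-v2) and the exact slurm.conf/gres.conf lines are assumptions, not taken from the ticket.

```
# slurm.conf (illustrative sketch, not the site's actual configuration)
GresTypes=cpu_type
NodeName=node-v1 Gres=cpu_type:e5_2650_v1:1
NodeName=node-v2 Gres=cpu_type:e5_2650_v2:1
PartitionName=cpu_e5_2650_v1 Nodes=node-v1
PartitionName=cpu_e5_2650_v2 Nodes=node-v2

# gres.conf (illustrative sketch)
NodeName=node-v1 Name=cpu_type Type=e5_2650_v1 Count=1 Flags=CountOnly
NodeName=node-v2 Name=cpu_type Type=e5_2650_v2 Count=1 Flags=CountOnly
```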

The last partition checked is `cpu_e5_2650_v2`. This is important for this example.

Now we submit a job that requires GRES `e5_2650_v1`:
 * srun --exclusive --gres=cpu_type:e5_2650_v1 --pty /bin/bash
 * a second job with the same GRES type fails with:
```
srun: error: Unable to allocate resources: Requested node configuration is not available
``` 

When we use the other GRES type `e5_2650_v2`, the second job is queued, which is what I would also expect for the example above. So the error code of the last partition checked determines the error code that is returned.

When we use the GRES type `e5_2650_v1`, the last partition checked, `cpu_e5_2650_v2`, returns `ESLURM_REQUESTED_NODE_CONFIG_UNAVAILABLE = 2014`, and that is what is returned to the user. But the job could run in the first partition, where all nodes are merely busy (`ESLURM_NODES_BUSY = 2016`). We should return that state once we have examined all partitions.
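
To illustrate the intended behaviour, here is a minimal, self-contained C sketch. It is not the attached patch and not actual slurmctld code; the function name `pick_multi_partition_rc` and the per-partition result array are hypothetical, only the two error-code values are taken from this ticket.

```c
/*
 * Illustrative sketch only -- not the attached patch and not real
 * slurmctld code.  It demonstrates the proposed precedence: once a job
 * has been tested against all of its partitions, an ESLURM_NODES_BUSY
 * result from any partition should win over
 * ESLURM_REQUESTED_NODE_CONFIG_UNAVAILABLE from whichever partition
 * happened to be checked last.
 */
#include <stdbool.h>
#include <stdio.h>

/* Error-code values as quoted in this ticket (see slurm/slurm_errno.h). */
#define SLURM_SUCCESS                            0
#define ESLURM_REQUESTED_NODE_CONFIG_UNAVAILABLE 2014
#define ESLURM_NODES_BUSY                        2016

/* Pick the error code to report after all partitions were examined. */
static int pick_multi_partition_rc(const int *part_rc, int part_cnt)
{
	int rc = SLURM_SUCCESS;
	bool busy_seen = false;

	for (int i = 0; i < part_cnt; i++) {
		if (part_rc[i] == SLURM_SUCCESS)
			return SLURM_SUCCESS;	/* job can start right now */
		if (part_rc[i] == ESLURM_NODES_BUSY)
			busy_seen = true;
		rc = part_rc[i];	/* unpatched: only the last code survives */
	}

	/* Proposed behaviour: a "busy" partition means the job can be queued. */
	if (busy_seen)
		rc = ESLURM_NODES_BUSY;
	return rc;
}

int main(void)
{
	/* cpu_e5_2650_v1 is busy (2016); cpu_e5_2650_v2 can never fit (2014). */
	int per_partition_rc[] = { ESLURM_NODES_BUSY,
				   ESLURM_REQUESTED_NODE_CONFIG_UNAVAILABLE };

	/* Prints 2016, i.e. the job should be queued instead of rejected. */
	printf("reported error code: %d\n",
	       pick_multi_partition_rc(per_partition_rc, 2));
	return 0;
}
```

Running this prints 2016 (`ESLURM_NODES_BUSY`), which is the state the user should see in the example above.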

The attached patch implements this behaviour. I do not know if this is the right approach.
Comment 1 Bas van der Vlies 2021-03-24 11:39:22 MDT
Created attachment 18629
Also add patch for 20.02 version

This is the multi-partition fix for Slurm version 20.02. We are also using this version.
Comment 2 Bas van der Vlies 2021-09-10 04:57:12 MDT
Is there an update on this issue? Will it be addressed, or is there a fix in a newer Slurm version?