Ticket 1294 - Gres count value modified if type is specified
Summary: Gres count value modified if type is specified
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other
Version: 14.11.0
Hardware: Linux
Priority: ---
Severity: 6 - No support contract
Assignee: David Bigagli
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2014-12-04 01:10 MST by Nicolas Joly
Modified: 2014-12-04 04:05 MST
CC List: 2 users

See Also:
Site: -Other-
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments

Description Nicolas Joly 2014-12-04 01:10:51 MST
Hi,

On our cluster, we use Gres to let users record local disk usage for their jobs. For now we only have a single type of disk (200GB SSD), and this works fine.

njoly@lanfeust [slurm/etc]> cat gres.conf 
# Local disk space
NodeName=lanfeust Name=disk Count=204800
njoly@lanfeust [slurm/etc]> grep Gres slurm.conf 
GresTypes=disk
NodeName=lanfeust CoresPerSocket=8 Gres=disk

njoly@lanfeust [~]> scontrol show nodes
NodeName=lanfeust Arch=amd64 CoresPerSocket=8
   CPUAlloc=0 CPUErr=0 CPUTot=8 CPULoad=0.00 Features=(null)
   Gres=disk:200K
   NodeAddr=lanfeust NodeHostName=lanfeust Version=14.11
   OS=NetBSD RealMemory=8190 AllocMem=0 Sockets=1 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=1 Weight=1
   BootTime=2014-11-30T10:20:08 SlurmdStartTime=2014-12-04T15:38:50
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
njoly@lanfeust [~]> srun --gres=disk:204800 hostname
lanfeust.sis.pasteur.fr
njoly@lanfeust [~]> srun --gres=disk:204801 hostname
srun: error: Unable to allocate resources: Requested node configuration is not available

But in the near future we're going to have nodes with larger 1TB SATA disks. According to the gres.conf documentation, it seemed obvious that we could use the Type parameter to record the disk type ;)

njoly@lanfeust [slurm/etc]> cat gres.conf       
# Local disk space
NodeName=lanfeust Name=disk Type=SSD Count=204800

But in that case the reported value is lowered to 1024, and jobs cannot request more than one unit:

[2014-12-04T16:00:28.054] debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
[2014-12-04T16:00:28.054] error: gres_plugin_node_config_unpack: gres/disk has File plus very large Count (204800) for node lanfeust, resetting value to 1024
[2014-12-04T16:00:28.054] gres/disk: state for lanfeust
[2014-12-04T16:00:28.054]   gres_cnt found:1024 configured:1 avail:1024 alloc:0
[2014-12-04T16:00:28.054]   gres_bit_alloc:
[2014-12-04T16:00:28.054]   gres_used:(null)
[2014-12-04T16:00:28.054]   topo_cpus_bitmap[0]:NULL
[2014-12-04T16:00:28.054]   topo_gres_bitmap[0]:0-1023
[2014-12-04T16:00:28.054]   topo_gres_cnt_alloc[0]:0
[2014-12-04T16:00:28.054]   topo_gres_cnt_avail[0]:1024
[2014-12-04T16:00:28.054]   type[0]:SSD
[2014-12-04T16:00:28.054] debug2: _slurm_rpc_node_registration complete for lanfeust usec=285

njoly@lanfeust [~]> scontrol show nodes                        
NodeName=lanfeust Arch=amd64 CoresPerSocket=8
   CPUAlloc=0 CPUErr=0 CPUTot=8 CPULoad=0.00 Features=(null)
   Gres=disk:1K
   NodeAddr=lanfeust NodeHostName=lanfeust Version=14.11
   OS=NetBSD RealMemory=8190 AllocMem=0 Sockets=1 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=1 Weight=1
   BootTime=2014-11-30T10:20:08 SlurmdStartTime=2014-12-04T16:00:26
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

njoly@lanfeust [~]> srun --gres=disk:1 hostname
lanfeust.sis.pasteur.fr
njoly@lanfeust [~]> srun --gres=disk:2 hostname
srun: error: Unable to allocate resources: Requested node configuration is not available
Comment 1 Nicolas Joly 2014-12-04 01:24:28 MST
And the corresponding log for the failing job:

[2014-12-04T16:22:03.089] debug2: sched: Processing RPC: REQUEST_RESOURCE_ALLOCATION from uid=1000
[2014-12-04T16:22:03.089] debug3: JobDesc: user_id=1000 job_id=N/A partition=(null) name=hostname
[2014-12-04T16:22:03.089] debug3:    cpus=1-4294967294 pn_min_cpus=-1 core_spec=-1
[2014-12-04T16:22:03.089] debug3:    -N min-[max]: 1-[4294967294]:65534:65534:65534
[2014-12-04T16:22:03.089] debug3:    pn_min_memory_job=-1 pn_min_tmp_disk=-1
[2014-12-04T16:22:03.089] debug3:    immediate=0 features=(null) reservation=(null)
[2014-12-04T16:22:03.089] debug3:    req_nodes=(null) exc_nodes=(null) gres=disk:2
[2014-12-04T16:22:03.089] debug3:    time_limit=-1--1 priority=-1 contiguous=0 shared=-1
[2014-12-04T16:22:03.089] debug3:    kill_on_node_fail=-1 script=(null)
[2014-12-04T16:22:03.089] debug3:    argv="hostname"
[2014-12-04T16:22:03.089] debug3:    stdin=(null) stdout=(null) stderr=(null)
[2014-12-04T16:22:03.089] debug3:    work_dir=/home/njoly alloc_node:sid=lanfeust:19835
[2014-12-04T16:22:03.089] debug3:    resp_host=157.99.60.140 alloc_resp_port=63921  other_port=63922
[2014-12-04T16:22:03.089] debug3:    dependency=(null) account=(null) qos=(null) comment=(null)
[2014-12-04T16:22:03.089] debug3:    mail_type=0 mail_user=(null) nice=0 num_tasks=-1 open_mode=0 overcommit=-1 acctg_freq=(null)
[2014-12-04T16:22:03.089] debug3:    network=(null) begin=Unknown cpus_per_task=-1 requeue=-1 licenses=(null)
[2014-12-04T16:22:03.089] debug3:    end_time=Unknown signal=0@0 wait_all_nodes=-1
[2014-12-04T16:22:03.089] debug3:    ntasks_per_node=-1 ntasks_per_socket=-1 ntasks_per_core=-1
[2014-12-04T16:22:03.089] debug3:    mem_bind=65534:(null) plane_size:65534
[2014-12-04T16:22:03.089] debug3:    array_inx=(null)
[2014-12-04T16:22:03.101] debug3: found correct user
[2014-12-04T16:22:03.101] debug3: found correct association
[2014-12-04T16:22:03.101] debug3: found correct qos
[2014-12-04T16:22:03.101] debug3: before alteration asking for nodes 1-4294967294 cpus 1-4294967294
[2014-12-04T16:22:03.101] debug3: after alteration asking for nodes 1-4294967294 cpus 1-4294967294
[2014-12-04T16:22:03.101] gres: disk state for job 260
[2014-12-04T16:22:03.101]   gres_cnt:2 node_cnt:0 type:(null)
[2014-12-04T16:22:03.101] debug2: found 1 usable nodes from config containing lanfeust
[2014-12-04T16:22:03.101] debug3: _pick_best_nodes: job 260 idle_nodes 1 share_nodes 1
[2014-12-04T16:22:03.102] debug2: select_p_job_test for job 260
[2014-12-04T16:22:03.102] debug3: cons_res: _vns: node lanfeust lacks gres
[2014-12-04T16:22:03.102] debug2: select_p_job_test for job 260
[2014-12-04T16:22:03.102] debug2: select_p_job_test for job 260
[2014-12-04T16:22:03.102] _pick_best_nodes: job 260 never runnable
[2014-12-04T16:22:03.102] debug2: Spawning RPC agent for msg_type SRUN_JOB_COMPLETE
[2014-12-04T16:22:03.102] _slurm_rpc_allocate_resources: Requested node configuration is not available 
[2014-12-04T16:22:03.102] debug2: got 1 threads to send out
[2014-12-04T16:22:03.102] debug3: slurm_send_only_node_msg: sent 0
[2014-12-04T16:22:03.491] debug:  backfill: beginning
[2014-12-04T16:22:03.491] debug:  backfill: no jobs to backfill
[2014-12-04T16:22:06.009] debug2: Testing job time limits and checkpoints
Comment 2 Moe Jette 2014-12-04 02:27:57 MST
I believe the gres count field size is 32-bits, so you will probably need to allocate in units of GB rather than bytes. I can run with a gres count of 3G without difficulty.

$ grep Gres slurm.conf
GresTypes=disk
NodeName=jette CPUs=4 Sockets=1 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=7600 State=UNKNOWN Gres=disk:3g

$ grep disk gres.conf
Name=disk Count=3g

$ salloc --gres=disk:2g bash
salloc: Granted job allocation 775
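
(For illustration only — one way to read that suggestion is to let one GRES unit stand for 1GB of disk; applied to the node from the description it might look like the sketch below. The counts and the srun request are assumptions, not a tested configuration.)

# gres.conf -- one unit of the "disk" GRES represents 1GB of local disk
NodeName=lanfeust Name=disk Count=200

# slurm.conf
GresTypes=disk
NodeName=lanfeust CoresPerSocket=8 Gres=disk:200

# request 100GB of local disk space
$ srun --gres=disk:100 hostname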
Comment 3 Moe Jette 2014-12-04 03:14:20 MST
(In reply to Moe Jette from comment #2)
> I believe the gres count field size is 32-bits, so you will probably need to
> allocate in units if GB rather than bytes. I can run with a gres count of 3G
> without difficulty.
> 
> $ grep Gres slurm.conf
> GresTypes=disk
> NodeName=jette CPUs=4 Sockets=1 CoresPerSocket=4 ThreadsPerCore=1
> RealMemory=7600 State=UNKNOWN Gres=disk:3g
> 
> $ grep disk gres.conf
> Name=disk Count=3g
> 
> $ salloc --gres=disk:2g bash
> salloc: Granted job allocation 775

One more thing: a Gres with a "Type" needs to be tied to specific files, otherwise there is no way to identify what "Type" of GRES is allocated to a job that only asks for, say, "--gres=disk:1g" rather than "--gres=disk:ssd:1g". Rather than specifying a Gres with Name=disk Type=ssd, just use a name like "disk_ssd" or just "ssd".
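
(A minimal sketch of that naming approach for a cluster mixing SSD and SATA nodes — the second node name and all counts are illustrative assumptions, not from this ticket; counts again treat one unit as 1GB.)

# slurm.conf -- one GRES name per disk type instead of Name=disk plus Type
GresTypes=disk_ssd,disk_sata
NodeName=lanfeust CoresPerSocket=8 Gres=disk_ssd:200
NodeName=sata01 CoresPerSocket=8 Gres=disk_sata:1000

# gres.conf
NodeName=lanfeust Name=disk_ssd Count=200
NodeName=sata01 Name=disk_sata Count=1000

# jobs then name the disk type they want explicitly
$ srun --gres=disk_sata:500 hostname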
Comment 4 Nicolas Joly 2014-12-04 03:39:36 MST
(In reply to Moe Jette from comment #3)
> (In reply to Moe Jette from comment #2)
> > I believe the gres count field size is 32-bits, so you will probably need to
> > allocate in units if GB rather than bytes. I can run with a gres count of 3G
> > without difficulty.
> > 
> > $ grep Gres slurm.conf
> > GresTypes=disk
> > NodeName=jette CPUs=4 Sockets=1 CoresPerSocket=4 ThreadsPerCore=1
> > RealMemory=7600 State=UNKNOWN Gres=disk:3g
> > 
> > $ grep disk gres.conf
> > Name=disk Count=3g
> > 
> > $ salloc --gres=disk:2g bash
> > salloc: Granted job allocation 775
> 
> One more thing, a Gres with a "Type" needs to be tied to specific files,
> otherwise there is no way to identify what "Type" of GRES is allocated to a
> job that only ask for, say "--gres=disk:1g" rather than
> "--gres=disk:ssd:1g". Rather than specifying a Gres Name=disk Type=ssd, just
> use a name like this "disk_ssd" or just "ssd".

That's what I wanted to avoid ... I thought that the tuple Name+Type was the key of a GRES, but it seems not; and asking only for the name without a type will pick up whatever SLURM wants/can ...

So Type implies File ... OK, fine, but this may need to be documented somewhere.
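
(For reference, the Type-tied-to-File pattern described above is the one shown for GPUs in the gres.conf man page; roughly along these lines — the device paths are just placeholders:)

# gres.conf -- each Type is bound to concrete device files
Name=gpu Type=tesla File=/dev/nvidia0
Name=gpu Type=kepler File=/dev/nvidia1
# a job can then request a specific type, e.g. --gres=gpu:tesla:1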

Thanks for the explanation. We'll find an alternative solution.
Comment 5 Moe Jette 2014-12-04 03:50:21 MST
(In reply to Nicolas Joly from comment #4)
> That's what i wanted to avoid ... I thought that it was the tuple Name+Type
> that was the key of GRES, but it seems not;

It is the key. Did you write a gres/disk plugin or are you assuming the user will figure out which type of disk was assigned to the job if they don't request a specific gres type?
Comment 6 Nicolas Joly 2014-12-04 04:05:13 MST
(In reply to Moe Jette from comment #5)
> (In reply to Nicolas Joly from comment #4)
> > That's what i wanted to avoid ... I thought that it was the tuple Name+Type
> > that was the key of GRES, but it seems not;
> 
> It is the key. Did you write a gres/disk plugin or are you assuming the user
> will figure out which type of disk was assigned to the job if they don't
> request a specific gres type?

I don't have a plugin ... But I'll think about it.

If the user does not request a specific type, it means they don't really care, and IMO it does not matter much to report this kind of information. The job has the needed resource, that's all.