Hi,

On our cluster, we use GRES to let users record local disk usage on their jobs. For now we only have a single type of disk (200GB SSDs), and this works fine.

njoly@lanfeust [slurm/etc]> cat gres.conf
# Local disk space
NodeName=lanfeust Name=disk Count=204800

njoly@lanfeust [slurm/etc]> grep Gres slurm.conf
GresTypes=disk
NodeName=lanfeust CoresPerSocket=8 Gres=disk

njoly@lanfeust [~]> scontrol show nodes
NodeName=lanfeust Arch=amd64 CoresPerSocket=8 CPUAlloc=0 CPUErr=0 CPUTot=8 CPULoad=0.00
   Features=(null) Gres=disk:200K
   NodeAddr=lanfeust NodeHostName=lanfeust Version=14.11
   OS=NetBSD RealMemory=8190 AllocMem=0 Sockets=1 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=1 Weight=1
   BootTime=2014-11-30T10:20:08 SlurmdStartTime=2014-12-04T15:38:50
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

njoly@lanfeust [~]> srun --gres=disk:204800 hostname
lanfeust.sis.pasteur.fr
njoly@lanfeust [~]> srun --gres=disk:204801 hostname
srun: error: Unable to allocate resources: Requested node configuration is not available

But in the near future we are going to get nodes with larger 1TB SATA disks. According to the gres.conf documentation, it looked obvious that we could use the Type parameter to record the disk type ;)

njoly@lanfeust [slurm/etc]> cat gres.conf
# Local disk space
NodeName=lanfeust Name=disk Type=SSD Count=204800

But in that case, the reported value is lowered to 1024, and jobs cannot use more than one:

[2014-12-04T16:00:28.054] debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
[2014-12-04T16:00:28.054] error: gres_plugin_node_config_unpack: gres/disk has File plus very large Count (204800) for node lanfeust, resetting value to 1024
[2014-12-04T16:00:28.054] gres/disk: state for lanfeust
[2014-12-04T16:00:28.054]   gres_cnt found:1024 configured:1 avail:1024 alloc:0
[2014-12-04T16:00:28.054]   gres_bit_alloc:
[2014-12-04T16:00:28.054]   gres_used:(null)
[2014-12-04T16:00:28.054]   topo_cpus_bitmap[0]:NULL
[2014-12-04T16:00:28.054]   topo_gres_bitmap[0]:0-1023
[2014-12-04T16:00:28.054]   topo_gres_cnt_alloc[0]:0
[2014-12-04T16:00:28.054]   topo_gres_cnt_avail[0]:1024
[2014-12-04T16:00:28.054]   type[0]:SSD
[2014-12-04T16:00:28.054] debug2: _slurm_rpc_node_registration complete for lanfeust usec=285

njoly@lanfeust [~]> scontrol show nodes
NodeName=lanfeust Arch=amd64 CoresPerSocket=8 CPUAlloc=0 CPUErr=0 CPUTot=8 CPULoad=0.00
   Features=(null) Gres=disk:1K
   NodeAddr=lanfeust NodeHostName=lanfeust Version=14.11
   OS=NetBSD RealMemory=8190 AllocMem=0 Sockets=1 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=1 Weight=1
   BootTime=2014-11-30T10:20:08 SlurmdStartTime=2014-12-04T16:00:26
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

njoly@lanfeust [~]> srun --gres=disk:1 hostname
lanfeust.sis.pasteur.fr
njoly@lanfeust [~]> srun --gres=disk:2 hostname
srun: error: Unable to allocate resources: Requested node configuration is not available
And the corresponding log for the failing job:

[2014-12-04T16:22:03.089] debug2: sched: Processing RPC: REQUEST_RESOURCE_ALLOCATION from uid=1000
[2014-12-04T16:22:03.089] debug3: JobDesc: user_id=1000 job_id=N/A partition=(null) name=hostname
[2014-12-04T16:22:03.089] debug3: cpus=1-4294967294 pn_min_cpus=-1 core_spec=-1
[2014-12-04T16:22:03.089] debug3: -N min-[max]: 1-[4294967294]:65534:65534:65534
[2014-12-04T16:22:03.089] debug3: pn_min_memory_job=-1 pn_min_tmp_disk=-1
[2014-12-04T16:22:03.089] debug3: immediate=0 features=(null) reservation=(null)
[2014-12-04T16:22:03.089] debug3: req_nodes=(null) exc_nodes=(null) gres=disk:2
[2014-12-04T16:22:03.089] debug3: time_limit=-1--1 priority=-1 contiguous=0 shared=-1
[2014-12-04T16:22:03.089] debug3: kill_on_node_fail=-1 script=(null)
[2014-12-04T16:22:03.089] debug3: argv="hostname"
[2014-12-04T16:22:03.089] debug3: stdin=(null) stdout=(null) stderr=(null)
[2014-12-04T16:22:03.089] debug3: work_dir=/home/njoly alloc_node:sid=lanfeust:19835
[2014-12-04T16:22:03.089] debug3: resp_host=157.99.60.140 alloc_resp_port=63921 other_port=63922
[2014-12-04T16:22:03.089] debug3: dependency=(null) account=(null) qos=(null) comment=(null)
[2014-12-04T16:22:03.089] debug3: mail_type=0 mail_user=(null) nice=0 num_tasks=-1 open_mode=0 overcommit=-1 acctg_freq=(null)
[2014-12-04T16:22:03.089] debug3: network=(null) begin=Unknown cpus_per_task=-1 requeue=-1 licenses=(null)
[2014-12-04T16:22:03.089] debug3: end_time=Unknown signal=0@0 wait_all_nodes=-1
[2014-12-04T16:22:03.089] debug3: ntasks_per_node=-1 ntasks_per_socket=-1 ntasks_per_core=-1
[2014-12-04T16:22:03.089] debug3: mem_bind=65534:(null) plane_size:65534
[2014-12-04T16:22:03.089] debug3: array_inx=(null)
[2014-12-04T16:22:03.101] debug3: found correct user
[2014-12-04T16:22:03.101] debug3: found correct association
[2014-12-04T16:22:03.101] debug3: found correct qos
[2014-12-04T16:22:03.101] debug3: before alteration asking for nodes 1-4294967294 cpus 1-4294967294
[2014-12-04T16:22:03.101] debug3: after alteration asking for nodes 1-4294967294 cpus 1-4294967294
[2014-12-04T16:22:03.101] gres: disk state for job 260
[2014-12-04T16:22:03.101]   gres_cnt:2 node_cnt:0 type:(null)
[2014-12-04T16:22:03.101] debug2: found 1 usable nodes from config containing lanfeust
[2014-12-04T16:22:03.101] debug3: _pick_best_nodes: job 260 idle_nodes 1 share_nodes 1
[2014-12-04T16:22:03.102] debug2: select_p_job_test for job 260
[2014-12-04T16:22:03.102] debug3: cons_res: _vns: node lanfeust lacks gres
[2014-12-04T16:22:03.102] debug2: select_p_job_test for job 260
[2014-12-04T16:22:03.102] debug2: select_p_job_test for job 260
[2014-12-04T16:22:03.102] _pick_best_nodes: job 260 never runnable
[2014-12-04T16:22:03.102] debug2: Spawning RPC agent for msg_type SRUN_JOB_COMPLETE
[2014-12-04T16:22:03.102] _slurm_rpc_allocate_resources: Requested node configuration is not available
[2014-12-04T16:22:03.102] debug2: got 1 threads to send out
[2014-12-04T16:22:03.102] debug3: slurm_send_only_node_msg: sent 0
[2014-12-04T16:22:03.491] debug: backfill: beginning
[2014-12-04T16:22:03.491] debug: backfill: no jobs to backfill
[2014-12-04T16:22:06.009] debug2: Testing job time limits and checkpoints
I believe the gres count field size is 32 bits, so you will probably need to allocate in units of GB rather than bytes. I can run with a gres count of 3G without difficulty.

$ grep Gres slurm.conf
GresTypes=disk
NodeName=jette CPUs=4 Sockets=1 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=7600 State=UNKNOWN Gres=disk:3g

$ grep disk gres.conf
Name=disk Count=3g

$ salloc --gres=disk:2g bash
salloc: Granted job allocation 775
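For the lanfeust node above, a minimal sketch of that approach might look as follows, with counts expressed in GB instead of MB; the value 200 is derived from the 200GB SSD mentioned earlier and has not been tested on that cluster:

# gres.conf -- counts in GB (sketch)
# Local disk space
NodeName=lanfeust Name=disk Count=200

# slurm.conf
GresTypes=disk
NodeName=lanfeust CoresPerSocket=8 Gres=disk:200

# a job asking for 100GB of local disk would then run
$ srun --gres=disk:100 hostname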
(In reply to Moe Jette from comment #2)
> I believe the gres count field size is 32 bits, so you will probably need to
> allocate in units of GB rather than bytes. I can run with a gres count of 3G
> without difficulty.
>
> $ grep Gres slurm.conf
> GresTypes=disk
> NodeName=jette CPUs=4 Sockets=1 CoresPerSocket=4 ThreadsPerCore=1
> RealMemory=7600 State=UNKNOWN Gres=disk:3g
>
> $ grep disk gres.conf
> Name=disk Count=3g
>
> $ salloc --gres=disk:2g bash
> salloc: Granted job allocation 775

One more thing: a GRES with a "Type" needs to be tied to specific files, otherwise there is no way to identify which "Type" of GRES is allocated to a job that only asks for, say, "--gres=disk:1g" rather than "--gres=disk:ssd:1g". Rather than specifying a GRES with Name=disk Type=ssd, just use a name like "disk_ssd" or simply "ssd".
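A rough sketch of that naming scheme once the 1TB SATA nodes arrive, keeping counts in MB as in the original configuration; the sata01 node name and its core count are hypothetical:

# gres.conf (sketch)
NodeName=lanfeust Name=disk_ssd Count=204800
NodeName=sata01 Name=disk_sata Count=1048576

# slurm.conf
GresTypes=disk_ssd,disk_sata
NodeName=lanfeust CoresPerSocket=8 Gres=disk_ssd:204800
NodeName=sata01 CoresPerSocket=8 Gres=disk_sata:1048576

# jobs then name the disk type they want explicitly
$ srun --gres=disk_ssd:1024 hostname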
(In reply to Moe Jette from comment #3)
> (In reply to Moe Jette from comment #2)
> > I believe the gres count field size is 32 bits, so you will probably need to
> > allocate in units of GB rather than bytes. I can run with a gres count of 3G
> > without difficulty.
> >
> > $ grep Gres slurm.conf
> > GresTypes=disk
> > NodeName=jette CPUs=4 Sockets=1 CoresPerSocket=4 ThreadsPerCore=1
> > RealMemory=7600 State=UNKNOWN Gres=disk:3g
> >
> > $ grep disk gres.conf
> > Name=disk Count=3g
> >
> > $ salloc --gres=disk:2g bash
> > salloc: Granted job allocation 775
>
> One more thing: a GRES with a "Type" needs to be tied to specific files,
> otherwise there is no way to identify which "Type" of GRES is allocated to a
> job that only asks for, say, "--gres=disk:1g" rather than
> "--gres=disk:ssd:1g". Rather than specifying a GRES with Name=disk Type=ssd,
> just use a name like "disk_ssd" or simply "ssd".

That's what I wanted to avoid... I thought it was the Name+Type tuple that was the key of a GRES, but it seems not; asking only for the name without a type will pick up whatever SLURM wants/can.

So Type implies File... Fine, but this may need to be documented somewhere (see the file-backed GPU example sketched below).

Thanks for the explanation. We'll find an alternate solution.
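For reference, the file-backed case that Type seems designed for looks roughly like the GPU example in the gres.conf man page; the device paths and node names here are illustrative only:

# gres.conf on a GPU node: each Type entry is tied to device files,
# so the allocated device identifies which Type a job actually received
Name=gpu Type=kepler File=/dev/nvidia0
Name=gpu Type=kepler File=/dev/nvidia1
Name=gpu Type=tesla File=/dev/nvidia2
Name=gpu Type=tesla File=/dev/nvidia3

# slurm.conf
GresTypes=gpu
NodeName=tux[0-7] Gres=gpu:kepler:2,gpu:tesla:2

# a job can request a specific type, or just a count and let Slurm pick
$ srun --gres=gpu:kepler:1 hostname
$ srun --gres=gpu:1 hostname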
(In reply to Nicolas Joly from comment #4)
> That's what I wanted to avoid... I thought it was the Name+Type tuple that
> was the key of a GRES, but it seems not;

It is the key. Did you write a gres/disk plugin, or are you assuming the user will figure out which type of disk was assigned to the job if they don't request a specific GRES type?
(In reply to Moe Jette from comment #5)
> (In reply to Nicolas Joly from comment #4)
> > That's what I wanted to avoid... I thought it was the Name+Type tuple that
> > was the key of a GRES, but it seems not;
>
> It is the key. Did you write a gres/disk plugin, or are you assuming the
> user will figure out which type of disk was assigned to the job if they
> don't request a specific GRES type?

I don't have a plugin... but I'll think about it.

If a user does not request a specific type, it means they don't really care, and IMO it does not matter much to report that kind of information back. The job got the resource it needed, that's all.