| Field | Value | Field | Value |
|---|---|---|---|
| Summary | Gres count value modified if type is specified | | |
| Product | Slurm | Reporter | Nicolas Joly <njoly> |
| Component | Other | Assignee | David Bigagli <david> |
| Status | RESOLVED INFOGIVEN | QA Contact | |
| Severity | 6 - No support contract | | |
| Priority | --- | CC | brian, da |
| Version | 14.11.0 | Hardware | Linux |
| OS | Linux | Site | -Other- |
**Description**

**Nicolas Joly**, 2014-12-04 01:10:51 MST
And the corresponding log for the failing job:

```
[2014-12-04T16:22:03.089] debug2: sched: Processing RPC: REQUEST_RESOURCE_ALLOCATION from uid=1000
[2014-12-04T16:22:03.089] debug3: JobDesc: user_id=1000 job_id=N/A partition=(null) name=hostname
[2014-12-04T16:22:03.089] debug3: cpus=1-4294967294 pn_min_cpus=-1 core_spec=-1
[2014-12-04T16:22:03.089] debug3: -N min-[max]: 1-[4294967294]:65534:65534:65534
[2014-12-04T16:22:03.089] debug3: pn_min_memory_job=-1 pn_min_tmp_disk=-1
[2014-12-04T16:22:03.089] debug3: immediate=0 features=(null) reservation=(null)
[2014-12-04T16:22:03.089] debug3: req_nodes=(null) exc_nodes=(null) gres=disk:2
[2014-12-04T16:22:03.089] debug3: time_limit=-1--1 priority=-1 contiguous=0 shared=-1
[2014-12-04T16:22:03.089] debug3: kill_on_node_fail=-1 script=(null)
[2014-12-04T16:22:03.089] debug3: argv="hostname"
[2014-12-04T16:22:03.089] debug3: stdin=(null) stdout=(null) stderr=(null)
[2014-12-04T16:22:03.089] debug3: work_dir=/home/njoly alloc_node:sid=lanfeust:19835
[2014-12-04T16:22:03.089] debug3: resp_host=157.99.60.140 alloc_resp_port=63921 other_port=63922
[2014-12-04T16:22:03.089] debug3: dependency=(null) account=(null) qos=(null) comment=(null)
[2014-12-04T16:22:03.089] debug3: mail_type=0 mail_user=(null) nice=0 num_tasks=-1 open_mode=0 overcommit=-1 acctg_freq=(null)
[2014-12-04T16:22:03.089] debug3: network=(null) begin=Unknown cpus_per_task=-1 requeue=-1 licenses=(null)
[2014-12-04T16:22:03.089] debug3: end_time=Unknown signal=0@0 wait_all_nodes=-1
[2014-12-04T16:22:03.089] debug3: ntasks_per_node=-1 ntasks_per_socket=-1 ntasks_per_core=-1
[2014-12-04T16:22:03.089] debug3: mem_bind=65534:(null) plane_size:65534
[2014-12-04T16:22:03.089] debug3: array_inx=(null)
[2014-12-04T16:22:03.101] debug3: found correct user
[2014-12-04T16:22:03.101] debug3: found correct association
[2014-12-04T16:22:03.101] debug3: found correct qos
[2014-12-04T16:22:03.101] debug3: before alteration asking for nodes 1-4294967294 cpus 1-4294967294
[2014-12-04T16:22:03.101] debug3: after alteration asking for nodes 1-4294967294 cpus 1-4294967294
[2014-12-04T16:22:03.101] gres: disk state for job 260
[2014-12-04T16:22:03.101] gres_cnt:2 node_cnt:0 type:(null)
[2014-12-04T16:22:03.101] debug2: found 1 usable nodes from config containing lanfeust
[2014-12-04T16:22:03.101] debug3: _pick_best_nodes: job 260 idle_nodes 1 share_nodes 1
[2014-12-04T16:22:03.102] debug2: select_p_job_test for job 260
[2014-12-04T16:22:03.102] debug3: cons_res: _vns: node lanfeust lacks gres
[2014-12-04T16:22:03.102] debug2: select_p_job_test for job 260
[2014-12-04T16:22:03.102] debug2: select_p_job_test for job 260
[2014-12-04T16:22:03.102] _pick_best_nodes: job 260 never runnable
[2014-12-04T16:22:03.102] debug2: Spawning RPC agent for msg_type SRUN_JOB_COMPLETE
[2014-12-04T16:22:03.102] _slurm_rpc_allocate_resources: Requested node configuration is not available
[2014-12-04T16:22:03.102] debug2: got 1 threads to send out
[2014-12-04T16:22:03.102] debug3: slurm_send_only_node_msg: sent 0
[2014-12-04T16:22:03.491] debug: backfill: beginning
[2014-12-04T16:22:03.491] debug: backfill: no jobs to backfill
[2014-12-04T16:22:06.009] debug2: Testing job time limits and checkpoints
```

**Moe Jette** (comment #2)

I believe the gres count field size is 32-bits, so you will probably need to allocate in units of GB rather than bytes. I can run with a gres count of 3G without difficulty.

```
$ grep Gres slurm.conf
GresTypes=disk
NodeName=jette CPUs=4 Sockets=1 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=7600 State=UNKNOWN Gres=disk:3g

$ grep disk gres.conf
Name=disk Count=3g

$ salloc --gres=disk:2g bash
salloc: Granted job allocation 775
```
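The 32-bit count field that comment #2 describes can be sketched in plain Python (an illustration of unsigned 32-bit truncation, not Slurm source; `to_uint32` is a made-up helper):

```python
def to_uint32(count):
    """Simulate storing a count in an unsigned 32-bit field."""
    return count & 0xFFFFFFFF  # keep only the low 32 bits

# A disk request counted in bytes overflows the field...
bytes_5g = 5 * 1024**3                  # 5 GiB expressed in bytes
print(to_uint32(bytes_5g))              # 1073741824, not 5368709120

# ...while the same request counted in GB units fits easily,
# which is why "disk:3g"-style counts work without difficulty.
print(to_uint32(5))                     # 5
```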
**Moe Jette** (comment #3)

(In reply to Moe Jette from comment #2.)

One more thing: a Gres with a "Type" needs to be tied to specific files; otherwise there is no way to identify which "Type" of GRES was allocated to a job that only asks for, say, "--gres=disk:1g" rather than "--gres=disk:ssd:1g". Rather than specifying a Gres with Name=disk Type=ssd, just use a name like "disk_ssd", or simply "ssd".

**Nicolas Joly** (comment #4)

(In reply to Moe Jette from comment #3.)

That's what I wanted to avoid... I thought the tuple Name+Type was the key of a GRES, but it seems not; asking only for the name, without a type, will pick up whatever SLURM wants or can provide. So Type implies File... OK, fine, but this may need to be documented somewhere. Thanks for the explanation; we'll find an alternate solution.

**Moe Jette** (comment #5)

(In reply to Nicolas Joly from comment #4.)

> That's what I wanted to avoid ...
> I thought that it was the tuple Name+Type that was the key of GRES, but it seems not;

It is the key. Did you write a gres/disk plugin, or are you assuming the user will figure out which type of disk was assigned to the job if they don't request a specific gres type?

**Nicolas Joly** (comment #6)

(In reply to Moe Jette from comment #5.)

I don't have a plugin... but I'll think about it. If the user does not request a specific type, it means they don't really care, and IMO it does not matter much to report this kind of information; the job has the needed resource, that's all.
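To illustrate the advice in comment #3, here is a hedged sketch of a slurm.conf/gres.conf pair in which each Type is tied to a File, plus the suggested name-based alternative. All node names, device paths, and counts here are hypothetical, and whether a Count can be combined with File in this way depends on the Slurm version; treat this purely as a sketch of the idea, not a validated configuration:

```
# slurm.conf (hypothetical): declare the GRES and advertise it per node
GresTypes=disk
NodeName=node1 Gres=disk:ssd:3g,disk:hdd:10g

# gres.conf (hypothetical): each Type tied to a specific File, as
# comment #3 suggests, so Slurm can tell which type a job received
Name=disk Type=ssd Count=3g File=/dev/sdb
Name=disk Type=hdd Count=10g File=/dev/sdc

# Alternative without Type: fold the type into the GRES name itself
# GresTypes=disk_ssd,disk_hdd
# Name=disk_ssd Count=3g
# Name=disk_hdd Count=10g
```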