I noticed a job in the queue that seemed like it should be running:

$ \squeue -j 451777
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            451777 tsang-16, 48ca124s username PD       0:00      1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)

$ scontrol show job 451777
JobId=451777 JobName=48ca124sn_b2_e0.140_g1.00_458
   UserId=username(922644) GroupId=nscl(2002) MCS_label=N/A
   Priority=412 Nice=0 Account=tsang QOS=normal
   JobState=PENDING Reason=Nodes_required_for_job_are_DOWN,_DRAINED_or_reserved_for_jobs_in_higher_priority_partitions Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=1-12:00:00 TimeMin=N/A
   SubmitTime=2018-10-18T23:34:58 EligibleTime=2018-10-18T23:34:58
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2018-10-20T17:53:05
   Partition=tsang-16,general-long-14,general-long-16,general-long-18 AllocNode:Sid=dev-intel16:25792
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1-1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=1G,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=1G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/mnt/home/username/pBUU-Pawel/48ca124sn_b2_e0.140_g1.00_458.sh
   WorkDir=/mnt/home/username/pBUU-Pawel
   Comment=stdout=/mnt/home/username/pBUU-Pawel/48ca124sn_b2_e0.140_g1.00_458.sh.o%A
   StdErr=/mnt/home/username/pBUU-Pawel/48ca124sn_b2_e0.140_g1.00_458.sh.e451777
   StdIn=/dev/null
   StdOut=/mnt/home/username/pBUU-Pawel/48ca124sn_b2_e0.140_g1.00_458.sh.o451777
   Power=

The reason states that the job is waiting for certain nodes to become available; however, we have plenty of idle nodes where a 1-core/1 GB job could run. After looking at the association, it's clear this user has hit the MaxTRESPU limit:

User 922644 MaxJobsPU=N(520) MaxSubmitJobsPU=1000(1000)
    MaxTRESPU=cpu=520(520),mem=N(532480),energy=N(0),node=N(520),billing=N(520),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=N(0)

Should the reason say QOS limit instead of 'Nodes required for job are DOWN...'?
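For reference, one way to view a user's limits and current usage against the QoS is scontrol's assoc_mgr report; this is only a sketch, since the exact filters and output format vary by Slurm version:

$ scontrol show assoc_mgr users=username flags=qos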
Hi Steve,

I tested this on an empty Slurm installation, and when a MaxGRESPU limit is hit the reason is set correctly:

$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                72     debug    sleep     lipi PD       0:00      1 (QOSMaxGRESPerUser)
                73     debug    sleep     lipi PD       0:00      1 (QOSMaxGRESPerUser)
                74     debug    sleep     lipi PD       0:00      1 (QOSMaxGRESPerUser)
                71     debug    sleep     lipi  R       0:30      1 gamba3

Is this node in other partitions? Note that the reason says '(Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)', so there can be multiple causes behind it.

Can you run a test and try to reproduce the issue by creating a reservation for one node and running multiple jobs with a limited MaxTRESPU QoS onto that node? That would help determine whether this is exactly the problem, or whether the node was unavailable because backfill had already allocated it to another job.
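A rough sketch of that setup (node, user, and QoS names are placeholders; exact option spellings may differ slightly between versions):

$ scontrol create reservation reservationname=qos_test starttime=now duration=24:00:00 nodes=<nodename> users=<username>
$ sacctmgr add qos debug
$ sacctmgr modify qos where name=debug set MaxTRESPerUser=cpu=10
$ sacctmgr modify user where name=<username> set qos+=debug

Then submit several jobs with --reservation=qos_test -q debug so that together they exceed the cpu=10 limit.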
Hi Steve, could you take a look at my last comment? Thanks.
Felip,

I created a reservation for one node and a QOS with a MaxTRESPU of cpu=10:

$ scontrol show res fordste5_19
ReservationName=fordste5_19 StartTime=2018-11-05T08:31:53 EndTime=2018-11-06T08:31:53 Duration=1-00:00:00
   Nodes=skl-086 NodeCnt=1 CoreCnt=40 Features=(null) PartitionName=(null) Flags=SPEC_NODES
   TRES=cpu=40
   Users=fordste5 Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

$ sacctmgr show qos debug
      Name   Priority  GraceTime    Preempt PreemptMode Flags UsageThres UsageFactor GrpTRES GrpTRESMins GrpTRESRunMin GrpJobs GrpSubmit GrpWall MaxTRES MaxTRESPerNode MaxTRESMins MaxWall MaxTRESPU MaxJobsPU MaxSubmitPU MaxTRESPA MaxJobsPA MaxSubmitPA MinTRES
---------- ---------- ---------- ---------- ----------- ----- ---------- ----------- ------- ----------- ------------- ------- --------- ------- ------- -------------- ----------- ------- --------- --------- ----------- --------- --------- ----------- -------
     debug          0   00:00:00               cluster                     1.000000                                                                                               cpu=10

$ sacctmgr show assoc where user=fordste5 account=general
   Cluster    Account       User  Partition     Share GrpJobs GrpTRES GrpSubmit GrpWall GrpTRESMins MaxJobs       MaxTRES MaxTRESPerNode MaxSubmit MaxWall MaxTRESMins          QOS   Def QOS GrpTRESRunMin
---------- ---------- ---------- ---------- --------- ------- ------- --------- ------- ----------- ------- ------------- -------------- --------- ------- ----------- ------------ --------- -------------
   msuhpcc    general   fordste5                    1                                                        cpu=30000000                                               debug,normal    normal

I submitted four jobs to this reservation, each requesting five cores for one hour:

$ sbatch --reservation=fordste5_19 -A general -q debug --wrap='sleep 3600' -c 5 --time=1:00:00

Two of these jobs were started by the backfill scheduler, and the other two were kept pending with the reason "Priority":

$ /usr/bin/squeue -u fordste5 -o "%i %7q %8T %.5C %v %R"
JOBID   QOS     STATE    CPUS RESERVATION NODELIST(REASON)
1193807 debug   PENDING     5 fordste5_19 (Priority)
1193811 debug   PENDING     5 fordste5_19 (Priority)
1193801 debug   RUNNING     5 fordste5_19 skl-086
1193804 debug   RUNNING     5 fordste5_19 skl-086

It seems odd that the main scheduling loop didn't start these jobs. Does the main scheduler behave differently when reservations are involved?

Also, while the jobs are pending I see a lot of messages like the following in the logs, but only for one of them. Before the two jobs were started:

[2018-11-05T09:00:41.617] _build_node_list: No nodes satisfy JobId=1193801 requirements in partition general-long-14

And after they started:

[2018-11-05T09:04:15.422] _build_node_list: No nodes satisfy JobId=1193807 requirements in partition general-long-14

The partition list for all of these jobs is the same ("general-short-14,general-short-16,general-short-18,general-long-14,general-long-16,general-long-18"), so I would expect this message to appear for each job.

We've been tuning the scheduler lately, so I'll attach our latest slurm.conf.
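In case it matters, one way to confirm which scheduler started the running jobs is to grep the slurmctld log for the allocation messages; this is only a sketch, and the exact message wording and log path depend on version and logging configuration:

$ grep -E 'Allocate JobId=1193801|backfill: Started JobId=1193801' /var/log/slurm/slurmctld.log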
Created attachment 8201 [details] slurm.conf
I reproduced the situation:

[slurm@moll0 ~]$ squeue -o "%P %N %S %R"
PARTITION NODELIST START_TIME NODELIST(REASON)
general-short-18,general-short-14,general-short-16,general-long-14,general-long-16,general-long-18   N/A   (Priority)
general-short-18,general-short-14,general-short-16,general-long-14,general-long-16,general-long-18   N/A   (Priority)
general-short-18 moll1 2018-11-07T00:51:30 moll1
general-short-18 moll1 2018-11-07T00:51:30 moll1

[slurm@moll0 ~]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              4169 general-s     wrap    slurm PD       0:00      1 (Priority)
              4170 general-s     wrap    slurm PD       0:00      1 (Priority)
              4167 general-s     wrap    slurm  R       1:52      1 moll1
              4168 general-s     wrap    slurm  R       1:52      1 moll1

This is happening only in nodes which ask for multiple partitions. If you do the same but request just one partition, you will see the reason set correctly.

I see this message repeated continuously, just like you do:

[2018-11-07T00:53:08.783] _build_node_list: No nodes satisfy JobId=4169 requirements in partition general-short-14

My guess is that the scheduler takes the first job in the queue and tries to allocate resources for it while it is considering the first partition in the list (general-short-14). The resources are all busy, so none are found for this job and the informative message for that partition is printed. After that, no further attempt is made, since the job cannot run now, and therefore you don't see the message for the other partitions.

Let me investigate this behavior more in depth, but for now I can tell you it is not something that should worry you too much. I'll come back to you soon.
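For example, submitting the same kind of job restricted to a single partition should show the QoS-related reason directly; a sketch using the names from this reproduction (QoS and job size are illustrative):

[slurm@moll0 ~]$ sbatch -p general-short-18 -q debug -c 5 --time=1:00:00 --wrap='sleep 3600'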
> This is happening only *in nodes* which ask for multiple partitions.

In *jobs*, sorry.
(In reply to Steve Ford from comment #4)
> Created attachment 8201 [details]
> slurm.conf

Hi Steve,

Can you change bf_interval=240 to bf_interval=30 and see whether the reason is set correctly within 30 seconds?

bf_interval=# The number of seconds between backfill iterations. Higher values result in less overhead and less responsiveness. This option applies only to SchedulerType=sched/backfill. The default value is 30 seconds.

I've identified the cause, but I want to be sure that your problem is the same one I found. It seems that the main scheduler checks all the partitions requested by the job, even though they may not match the reservation. The backfill scheduler seems to do the right thing, though, and when it runs it sets the reason correctly. Otherwise, if the main scheduler runs first, the reason is set incorrectly. You may even see the reason oscillate between the two.

Thanks.
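In slurm.conf that change would look something like the lines below; this is just a sketch, so keep whatever other SchedulerParameters options you already have and only change bf_interval:

SchedulerType=sched/backfill
SchedulerParameters=bf_interval=30,<your other options>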
> Can you change bf_interval=240 to bf_interval=30, and see if in 30 seconds
> the reason is correctly set?

By the way, you have to restart slurmctld after changing the parameter, and then run a new test.
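For example, on the controller host (assuming a systemd-managed installation; adapt to how slurmctld is run at your site):

$ sudo systemctl restart slurmctld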
Hi Steve,

The issue has been fixed starting with 18.08.4, in commit c4fb37796e77c.

Thank you for reporting it,
Felip M