Ticket 5895 - Pending job reason is incorrect, should be MaxTRESPU
Summary: Pending job reason is incorrect, should be MaxTRESPU
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 18.08.1
Hardware: Linux Linux
Importance: --- 4 - Minor Issue
Assignee: Felip Moll
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2018-10-20 16:04 MDT by Steve Ford
Modified: 2018-11-19 09:02 MST
CC List: 2 users

See Also:
Site: MSU
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 18.08.4
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
slurm.conf (31.49 KB, text/plain)
2018-11-05 07:13 MST, Steve Ford
Details

Description Steve Ford 2018-10-20 16:04:04 MDT
I noticed a job in the queue that seemed like it should be running:

$ \squeue -j 451777
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            451777 tsang-16, 48ca124s   username PD       0:00      1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)

$ scontrol show job 451777
JobId=451777 JobName=48ca124sn_b2_e0.140_g1.00_458
   UserId=username(922644) GroupId=nscl(2002) MCS_label=N/A
   Priority=412 Nice=0 Account=tsang QOS=normal
   JobState=PENDING Reason=Nodes_required_for_job_are_DOWN,_DRAINED_or_reserved_for_jobs_in_higher_priority_partitions Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=1-12:00:00 TimeMin=N/A
   SubmitTime=2018-10-18T23:34:58 EligibleTime=2018-10-18T23:34:58
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2018-10-20T17:53:05
   Partition=tsang-16,general-long-14,general-long-16,general-long-18 AllocNode:Sid=dev-intel16:25792
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1-1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=1G,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=1G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/mnt/home/username/pBUU-Pawel/48ca124sn_b2_e0.140_g1.00_458.sh
   WorkDir=/mnt/home/username/pBUU-Pawel
   Comment=stdout=/mnt/home/username/pBUU-Pawel/48ca124sn_b2_e0.140_g1.00_458.sh.o%A 
   StdErr=/mnt/home/username/pBUU-Pawel/48ca124sn_b2_e0.140_g1.00_458.sh.e451777
   StdIn=/dev/null
   StdOut=/mnt/home/username/pBUU-Pawel/48ca124sn_b2_e0.140_g1.00_458.sh.o451777
   Power=

The reason states it's waiting for certain nodes to become available; however, we have plenty of idle nodes where a 1-core/1 GB job could run.

After looking at the association, it's clear this user has hit the MaxTRESPU limit:

      User 922644
        MaxJobsPU=N(520)
        MaxSubmitJobsPU=1000(1000)
        MaxTRESPU=cpu=520(520),mem=N(532480),energy=N(0),node=N(520),billing=N(520),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=N(0)
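
For reference, this per-user usage can be pulled from the controller with something like the following (a sketch; it assumes the limit comes from the job's 'normal' QOS):

$ scontrol show assoc_mgr qos=normal flags=qos
$ sacctmgr show qos normal format=Name,MaxJobsPU,MaxTRESPU%60

The value in parentheses in the assoc_mgr output is the current usage, which above sits right at the cpu=520 cap.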

Should the reason say QOS limit instead of 'nodes required for job are down'?
Comment 1 Felip Moll 2018-10-22 06:34:54 MDT
Hi Steve,

I tested it on an empty Slurm cluster, and when a per-user GRES limit (MaxTRESPU) is hit the reason is set correctly:

]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                72     debug    sleep     lipi PD       0:00      1 (QOSMaxGRESPerUser)
                73     debug    sleep     lipi PD       0:00      1 (QOSMaxGRESPerUser)
                74     debug    sleep     lipi PD       0:00      1 (QOSMaxGRESPerUser)
                71     debug    sleep     lipi  R       0:30      1 gamba3


Is this node in other partitions?

Note that the reason says:

 '(Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)'

So there can be multiple causes for this.

Can you run a test to try to reproduce the issue, by creating a reservation for one node and running multiple jobs with a MaxTRESPU-limited QOS on that node? This would help determine whether this is exactly the problem, or whether the node was simply unavailable because backfill had already allocated it to another job.
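
A sketch of that test, with placeholder node, user, and QOS names (adjust for your site):

$ scontrol create reservation reservationname=test_resv nodes=node001 users=someuser starttime=now duration=1:00:00
$ sacctmgr add qos testqos
$ sacctmgr modify qos testqos set MaxTRESPerUser=cpu=10
$ sacctmgr modify user where name=someuser set qos+=testqos
$ for i in 1 2 3 4; do sbatch --reservation=test_resv --qos=testqos -c 5 --time=1:00:00 --wrap='sleep 3600'; done
$ squeue -u someuser -o "%i %q %T %C %v %R"

With more CPUs requested than the cpu=10 cap allows, some of the jobs should stay pending, and the reason they show is what we want to check.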
Comment 2 Felip Moll 2018-10-31 12:45:34 MDT
Hi Steve, could you take a look at my last comment and follow up?

Thanks
Comment 3 Steve Ford 2018-11-05 07:11:30 MST
Felip,

I created a reservation for one node and a QOS with a MaxTRESPU of cpu=10.

$ scontrol show res fordste5_19
ReservationName=fordste5_19 StartTime=2018-11-05T08:31:53 EndTime=2018-11-06T08:31:53 Duration=1-00:00:00
   Nodes=skl-086 NodeCnt=1 CoreCnt=40 Features=(null) PartitionName=(null) Flags=SPEC_NODES
   TRES=cpu=40
   Users=fordste5 Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

$ sacctmgr show qos debug
      Name   Priority  GraceTime    Preempt PreemptMode                                    Flags UsageThres UsageFactor       GrpTRES   GrpTRESMins GrpTRESRunMin GrpJobs GrpSubmit     GrpWall       MaxTRES MaxTRESPerNode   MaxTRESMins     MaxWall     MaxTRESPU MaxJobsPU MaxSubmitPU     MaxTRESPA MaxJobsPA MaxSubmitPA       MinTRES 
---------- ---------- ---------- ---------- ----------- ---------------------------------------- ---------- ----------- ------------- ------------- ------------- ------- --------- ----------- ------------- -------------- ------------- ----------- ------------- --------- ----------- ------------- --------- ----------- ------------- 
     debug          0   00:00:00                cluster                                                        1.000000                                                                                                                                       cpu=10  

$ sacctmgr show assoc where user=fordste5 account=general
   Cluster    Account       User  Partition     Share GrpJobs       GrpTRES GrpSubmit     GrpWall   GrpTRESMins MaxJobs       MaxTRES MaxTRESPerNode MaxSubmit     MaxWall   MaxTRESMins                  QOS   Def QOS GrpTRESRunMin 
---------- ---------- ---------- ---------- --------- ------- ------------- --------- ----------- ------------- ------- ------------- -------------- --------- ----------- ------------- -------------------- --------- ------------- 
   msuhpcc    general   fordste5                    1                                              cpu=30000000                                                                                  debug,normal    normal

   
I submitted four jobs to this reservation that each requested five cores for one hour:

$ sbatch --reservation=fordste5_19 -A general -q debug --wrap='sleep 3600' -c 5 --time=1:00:00

Two of these jobs were started by the backfill scheduler and the other two were kept pending with the reason "Priority":

/usr/bin/squeue -u fordste5 -o "%i %7q %8T %.5C %v %R"
JOBID QOS     STATE     CPUS RESERVATION NODELIST(REASON)
1193807 debug   PENDING      5 fordste5_19 (Priority)
1193811 debug   PENDING      5 fordste5_19 (Priority)
1193801 debug   RUNNING      5 fordste5_19 skl-086
1193804 debug   RUNNING      5 fordste5_19 skl-086


It seems odd that the main scheduling loop didn't start these jobs. Does the main scheduler behave differently when reservations are involved?

Also, while these jobs are pending I'm seeing a lot of messages like the following in the logs, but only for one of them.

Before the two jobs were started:

[2018-11-05T09:00:41.617] _build_node_list: No nodes satisfy JobId=1193801 requirements in partition general-long-14

And after they started:

[2018-11-05T09:04:15.422] _build_node_list: No nodes satisfy JobId=1193807 requirements in partition general-long-14

The partition list for all of these jobs is the same: "general-short-14,general-short-16,general-short-18,general-long-14,general-long-16,general-long-18", so I would expect this message to be there for each job.
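
For what it's worth, this is roughly how I'm checking (the slurmctld log path is just where ours happens to live):

$ grep "_build_node_list: No nodes satisfy JobId=1193807" /var/log/slurm/slurmctld.log
$ grep "_build_node_list: No nodes satisfy JobId=1193811" /var/log/slurm/slurmctld.log

and only the first of the two pending job IDs turns up.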

We've been tuning the scheduler lately, so I'll attach our latest slurm.conf.
Comment 4 Steve Ford 2018-11-05 07:13:59 MST
Created attachment 8201 [details]
slurm.conf
Comment 5 Felip Moll 2018-11-06 17:02:29 MST
I reproduced the situation:

[slurm@moll0 ~]$ squeue -o " %P %N %S  %R %"
 PARTITION NODELIST START_TIME  NODELIST(REASON) 
 general-short-18,general-short-14,general-short-16,general-long-14,general-long-16,general-long-18  N/A  (Priority) 
 general-short-18,general-short-14,general-short-16,general-long-14,general-long-16,general-long-18  N/A  (Priority) 
 general-short-18 moll1 2018-11-07T00:51:30  moll1 
 general-short-18 moll1 2018-11-07T00:51:30  moll1 

[slurm@moll0 ~]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              4169 general-s     wrap    slurm PD       0:00      1 (Priority)
              4170 general-s     wrap    slurm PD       0:00      1 (Priority)
              4167 general-s     wrap    slurm  R       1:52      1 moll1
              4168 general-s     wrap    slurm  R       1:52      1 moll1


This is happening only in nodes which ask for multiple partitions. If you do the same
but ask for just one partition, you will see the reason set correctly.
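
A quick way to see the single-partition case with the reservation and QOS from your test (the partition name below is a placeholder; use whichever single partition the reserved node belongs to):

$ sbatch --reservation=fordste5_19 -A general -q debug -p general-long-16 -c 5 --time=1:00:00 --wrap='sleep 3600'
$ squeue -u fordste5 -o "%i %P %T %R"

With one partition the pending reason should come up as the per-user QOS limit (e.g. QOSMaxCpuPerUserLimit) rather than "Priority".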

I see this message repeated continuously, just as you do:
[2018-11-07T00:53:08.783] _build_node_list: No nodes satisfy JobId=4169 requirements in partition general-short-14

I am guessing this is because the scheduler takes the first job in the queue and tries to allocate resources
for it while considering the first partition in the list (general-short-14). The resources
are all busy, so none are found for this job, and the informative message for that partition is logged.
After that no further attempt is made, since the job cannot run now, which is why you don't see the message for the other partitions.

Let me investigate this behavior more in depth, but for now I can say it's not something that should worry you too much.
I'll get back to you soon.
Comment 6 Felip Moll 2018-11-06 17:03:28 MST
> This is happening only *in nodes* which ask for multiple partitions. 

In *jobs* , sorry.
Comment 7 Felip Moll 2018-11-07 10:21:22 MST
(In reply to Steve Ford from comment #4)
> Created attachment 8201 [details]
> slurm.conf

Hi Steve,

Can you change bf_interval=240 to bf_interval=30, and see if in 30 seconds the reason is correctly set?

bf_interval=#  The number of seconds between backfill iterations.  Higher values
result in less overhead and better responsiveness.  This option applies only to 
SchedulerType=sched/backfill.  The default value is 30 seconds.
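
In slurm.conf that is just the bf_interval entry on the SchedulerParameters line; a minimal sketch (your real line will keep whatever other scheduler options you already have):

SchedulerType=sched/backfill
SchedulerParameters=bf_interval=30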

I've identified the cause, but I want to be sure that your problem is the
same one I found. It seems that the main scheduler checks all the partitions
requested by the job even though they may not match the reservation.

The backfill scheduler seems to do the right thing, though, and when it runs it sets the reason correctly. Otherwise, if the main scheduler runs first, the reason is set incorrectly.

You could even see oscillations in the reason.
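
An easy way to watch for that, using one of the pending job IDs from your test as an example:

$ watch -n 10 'squeue -j 1193807 -o "%i %T %R"'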

Thanks.
Comment 8 Felip Moll 2018-11-07 10:22:10 MST
> Can you change bf_interval=240 to bf_interval=30, and see if in 30 seconds
> the reason is correctly set?


By the way, you have to restart slurmctld after changing the parameter,
and then run a new test.
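
On a systemd-based install that would be something like the following (unit name assumed to be the stock one):

$ systemctl restart slurmctld
$ scontrol show config | grep SchedulerParameters

The second command just confirms the new bf_interval value is active.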
Comment 14 Felip Moll 2018-11-19 09:02:39 MST
Hi Steve,

The issue has been fixed starting with 18.08.4, via commit c4fb37796e77c.

Thank you for reporting,
Felip M