Ticket 4895 - Wrong CPUs/Task value
Summary: Wrong CPUs/Task value
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 17.11.4
Hardware: Linux
Priority: ---
Severity: 3 - Medium Impact
Assignee: Dominik Bartkiewicz
Reported: 2018-03-09 11:19 MST by Stephane Thiell
Modified: 2024-05-01 06:46 MDT

Site: Stanford
Version Fixed: 17.11.7 18.08.0pre2


Description Stephane Thiell 2018-03-09 11:19:05 MST
Hi SchedMD!

We're seeing pending jobs with a wrong CPUs/Task value with 17.11.4 on Sherlock. See the line:

   NumNodes=1 NumCPUs=16 NumTasks=1 CPUs/Task=32 ReqB:S:C:T=0:0:*:*

in the full output below:

$ scontrol show job 7849143
JobId=7849143 JobName=combo_permut
   UserId=julienc(38982) GroupId=agitler(13103) MCS_label=N/A
   Priority=2270 Nice=0 Account=agitler QOS=normal
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=03:00:00 TimeMin=N/A
   SubmitTime=2018-03-08T22:10:26 EligibleTime=2018-03-08T22:10:26
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2018-03-09T10:07:13
   Partition=agitler,normal,owners AllocNode:Sid=sh-ln01:361294
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null) SchedNodeList=sh-113-09
   NumNodes=1 NumCPUs=16 NumTasks=1 CPUs/Task=32 ReqB:S:C:T=0:0:*:*
   TRES=cpu=16,mem=125G,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=16 MinMemoryCPU=4000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/scratch/users/julienc/Crispy/casTLE/combo_permut.sbatch
   ...

I can sometimes reproduce the issue even with srun when listing the agitler partition first (e.g. agitler,normal,owners), but not with normal,agitler,owners. It's not always reproducible; right now I can't seem to trigger it anymore. The issue isn't just cosmetic, as such jobs end up matched to nodes with more CPUs (like sh-113-09 in the example above, a node with more cores and memory).
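For reference, the reproduction described above might be attempted like this (partition names and the CPU/memory figures are taken from this report; the wrapped command and the job-id placeholder are assumptions):

```shell
# Submit to multiple partitions in the order that triggered the issue:
sbatch -p agitler,normal,owners -n 1 -c 16 --mem-per-cpu=4000M \
  --wrap "sleep 60"
# While the job is pending, inspect the computed CPUs/Task value:
scontrol show job <jobid> | grep -o 'CPUs/Task=[0-9]*'
```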

On the slurmctld side, the following messages can be seen:
Mar  9 10:14:36 sh-sl01 slurmctld[334976]: _pick_best_nodes: job 7849143 never runnable in partition agitler
Mar  9 10:14:36 sh-sl01 slurmctld[334976]: _pick_best_nodes: job 7849143 never runnable in partition normal

Thanks!
Stephane
Comment 1 Tim Wickberg 2018-03-12 13:49:33 MDT
Hi Stephane -

Can you attach your current slurm.conf file for the cluster? I'm going to see if Dominik can chase down a reason this could happen.

If you have a way to trigger this again, it might be helpful if you could attach logs captured while the TraceJobs and Backfill DebugFlags were turned on temporarily.
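For reference, those flags can be toggled at runtime with scontrol, without restarting slurmctld (a sketch; the flag names are the ones mentioned above):

```shell
# Enable the debug flags while reproducing the issue:
scontrol setdebugflags +TraceJobs
scontrol setdebugflags +Backfill
# ... reproduce and capture slurmctld logs ...
# Then revert:
scontrol setdebugflags -TraceJobs
scontrol setdebugflags -Backfill
```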
Comment 2 Dominik Bartkiewicz 2018-03-13 08:53:20 MDT
Hi

Without the data Tim mentioned I can't be sure, but I can reproduce similar (possibly the same) behavior. I'm not sure whether this is a bug or just an effect of submitting to multiple partitions with different MaxMemPerCPU values. I will look into this and let you know what we can do about it.

Dominik
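The MaxMemPerCPU interaction Dominik mentions can inflate CPUs/Task: when a job's --mem-per-cpu exceeds a partition's MaxMemPerCPU, Slurm scales the per-task CPU count up so the total memory request still fits under the per-CPU limit. A minimal sketch of that arithmetic, assuming a hypothetical MaxMemPerCPU of 2000M (not stated in this ticket), which would reproduce the observed jump from 16 to 32:

```python
import math

def adjusted_cpus_per_task(cpus_per_task, mem_per_cpu_mb, max_mem_per_cpu_mb):
    """If requested memory per CPU exceeds the partition's MaxMemPerCPU,
    scale the CPU count up so total memory fits within the per-CPU cap."""
    if mem_per_cpu_mb <= max_mem_per_cpu_mb:
        return cpus_per_task
    total_mem = mem_per_cpu_mb * cpus_per_task
    return math.ceil(total_mem / max_mem_per_cpu_mb)

# 16 CPUs at 4000M per CPU (the values from the job above) against an
# assumed MaxMemPerCPU of 2000M yields the observed CPUs/Task=32:
print(adjusted_cpus_per_task(16, 4000, 2000))  # -> 32
```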
Comment 3 Stephane Thiell 2018-03-13 11:27:15 MDT
Hi Tim and Dominik,

Thanks much for looking into this! I just sent you the current slurm.conf by email.

Stephane
Comment 31 Alejandro Sanchez 2018-05-10 02:42:10 MDT
Hi Stephane, this should have been fixed in commit:

https://github.com/SchedMD/slurm/commit/bf4cb0b1b01f3e

starting from 17.11.7. You can apply it at your earliest convenience by appending ".patch" to the GitHub URL and applying the resulting patch. We're going to mark this ticket as resolved/fixed; please reopen it if you find any new issue after applying the patch. Thanks.
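Fetching and applying the commit as suggested might look like the following (the patch filename and source-directory name are assumptions; the commit URL is the one given above):

```shell
# GitHub serves the commit as a patch when ".patch" is appended:
curl -L -o cpus-per-task-fix.patch \
  https://github.com/SchedMD/slurm/commit/bf4cb0b1b01f3e.patch
cd slurm-17.11.4          # assumed source tree location
patch -p1 < cpus-per-task-fix.patch
```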