Ticket 4895 - Wrong CPUs/Task value
Summary: Wrong CPUs/Task value
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 17.11.4
Hardware: Linux
Priority: ---
Severity: 3 - Medium Impact
Assignee: Dominik Bartkiewicz
Reported: 2018-03-09 11:19 MST by Stephane Thiell
Modified: 2024-05-01 06:46 MDT

Site: Stanford
Version Fixed: 17.11.7 18.08.0pre2


Description Stephane Thiell 2018-03-09 11:19:05 MST
Hi SchedMD!

We're seeing pending jobs with a wrong CPUs/Task value with 17.11.4 on Sherlock. See the line:

   NumNodes=1 NumCPUs=16 NumTasks=1 CPUs/Task=32 ReqB:S:C:T=0:0:*:*

in the full output below:

$ scontrol show job 7849143
JobId=7849143 JobName=combo_permut
   UserId=julienc(38982) GroupId=agitler(13103) MCS_label=N/A
   Priority=2270 Nice=0 Account=agitler QOS=normal
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=03:00:00 TimeMin=N/A
   SubmitTime=2018-03-08T22:10:26 EligibleTime=2018-03-08T22:10:26
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2018-03-09T10:07:13
   Partition=agitler,normal,owners AllocNode:Sid=sh-ln01:361294
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null) SchedNodeList=sh-113-09
   NumNodes=1 NumCPUs=16 NumTasks=1 CPUs/Task=32 ReqB:S:C:T=0:0:*:*
   TRES=cpu=16,mem=125G,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=16 MinMemoryCPU=4000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/scratch/users/julienc/Crispy/casTLE/combo_permut.sbatch
   ...

I can sometimes reproduce the issue even with srun when listing the agitler partition first (e.g. agitler,normal,owners), but not with normal,agitler,owners. It's not always reproducible; right now I can't seem to trigger it anymore. The issue isn't just cosmetic, as such jobs end up matched to nodes with more CPUs (like sh-113-09 in the example above, a node with more cores and memory).
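For reference, the reproduction described above might be attempted like this (partition names and the CPU/memory figures are taken from this report; the wrapped command and the job-id placeholder are assumptions):

```shell
# Submit to multiple partitions in the order that triggered the issue:
sbatch -p agitler,normal,owners -n 1 -c 16 --mem-per-cpu=4000M \
  --wrap "sleep 60"
# While the job is pending, inspect the computed CPUs/Task value:
scontrol show job <jobid> | grep -o 'CPUs/Task=[0-9]*'
```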

On the slurmctld side, the following messages can be seen:
Mar  9 10:14:36 sh-sl01 slurmctld[334976]: _pick_best_nodes: job 7849143 never runnable in partition agitler
Mar  9 10:14:36 sh-sl01 slurmctld[334976]: _pick_best_nodes: job 7849143 never runnable in partition normal

Thanks!
Stephane
Comment 1 Tim Wickberg 2018-03-12 13:49:33 MDT
Hi Stephane -

Can you attach your current slurm.conf file for the cluster? I'm going to see if Dominik can chase down a reason this could happen.

If you have a way to trigger this again, it might be helpful if you could attach logs captured while the TraceJobs and Backfill DebugFlags were turned on temporarily.
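For reference, those flags can be toggled at runtime with scontrol, without restarting slurmctld (a sketch; the flag names are the ones mentioned above):

```shell
# Enable the debug flags while reproducing the issue:
scontrol setdebugflags +TraceJobs
scontrol setdebugflags +Backfill
# ... reproduce and capture slurmctld logs ...
# Then revert:
scontrol setdebugflags -TraceJobs
scontrol setdebugflags -Backfill
```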
Comment 2 Dominik Bartkiewicz 2018-03-13 08:53:20 MDT
Hi

Without the data Tim mentioned I can't be sure, but I can reproduce similar (possibly the same) behavior. I'm not sure whether this is a bug or just an effect of submitting to multiple partitions with different MaxMemPerCPU values. I will look into this and let you know what we can do about it.

Dominik
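The MaxMemPerCPU interaction Dominik mentions can inflate CPUs/Task: when a job's --mem-per-cpu exceeds a partition's MaxMemPerCPU, Slurm scales the per-task CPU count up so the total memory request still fits under the per-CPU limit. A minimal sketch of that arithmetic, assuming a hypothetical MaxMemPerCPU of 2000M (not stated in this ticket), which would reproduce the observed jump from 16 to 32:

```python
import math

def adjusted_cpus_per_task(cpus_per_task, mem_per_cpu_mb, max_mem_per_cpu_mb):
    """If requested memory per CPU exceeds the partition's MaxMemPerCPU,
    scale the CPU count up so total memory fits within the per-CPU cap."""
    if mem_per_cpu_mb <= max_mem_per_cpu_mb:
        return cpus_per_task
    total_mem = mem_per_cpu_mb * cpus_per_task
    return math.ceil(total_mem / max_mem_per_cpu_mb)

# 16 CPUs at 4000M per CPU (the values from the job above) against an
# assumed MaxMemPerCPU of 2000M yields the observed CPUs/Task=32:
print(adjusted_cpus_per_task(16, 4000, 2000))  # -> 32
```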
Comment 3 Stephane Thiell 2018-03-13 11:27:15 MDT
Hi Tim and Dominik,

Thanks much for looking into this! I just sent you the current slurm.conf by email.

Stephane
Comment 31 Alejandro Sanchez 2018-05-10 02:42:10 MDT
Hi Stephane, this should have been fixed in commit:

https://github.com/SchedMD/slurm/commit/bf4cb0b1b01f3e

starting from 17.11.7. You can apply it at your earliest convenience by appending ".patch" to the GitHub URL and applying the resulting patch. We're going to mark this ticket as resolved/fixed; please reopen it if you find any new issue after applying the patch. Thanks.
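Fetching and applying the commit as suggested might look like the following (the patch filename and source-directory name are assumptions; the commit URL is the one given above):

```shell
# GitHub serves the commit as a patch when ".patch" is appended:
curl -L -o cpus-per-task-fix.patch \
  https://github.com/SchedMD/slurm/commit/bf4cb0b1b01f3e.patch
cd slurm-17.11.4          # assumed source tree location
patch -p1 < cpus-per-task-fix.patch
```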