Hi SchedMD!

We're seeing pending jobs with a wrong CPUs/Task value with 17.11.4 on Sherlock. See the line "NumNodes=1 NumCPUs=16 NumTasks=1 CPUs/Task=32 ReqB:S:C:T=0:0:*:*" in the full output below:

$ scontrol show job 7849143
JobId=7849143 JobName=combo_permut
   UserId=julienc(38982) GroupId=agitler(13103) MCS_label=N/A
   Priority=2270 Nice=0 Account=agitler QOS=normal
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=03:00:00 TimeMin=N/A
   SubmitTime=2018-03-08T22:10:26 EligibleTime=2018-03-08T22:10:26
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0 LastSchedEval=2018-03-09T10:07:13
   Partition=agitler,normal,owners AllocNode:Sid=sh-ln01:361294
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null) SchedNodeList=sh-113-09
   NumNodes=1 NumCPUs=16 NumTasks=1 CPUs/Task=32 ReqB:S:C:T=0:0:*:*
   TRES=cpu=16,mem=125G,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=16 MinMemoryCPU=4000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/scratch/users/julienc/Crispy/casTLE/combo_permut.sbatch
   ...

I can sometimes reproduce the issue even with srun when listing the agitler partition first (e.g. agitler,normal,owners), but I can't reproduce it with normal,agitler,owners. It's not always reproducible, though; right now I can't seem to reproduce it anymore. A sketch of the commands I've been using to try to reproduce it is at the end of this comment.

The issue isn't just cosmetic: such jobs end up being matched with nodes that have more CPUs (like sh-113-09 in the example above, which is a node with more cores/memory).

On the slurmctld side, the following messages can be seen:

Mar 9 10:14:36 sh-sl01 slurmctld[334976]: _pick_best_nodes: job 7849143 never runnable in partition agitler
Mar 9 10:14:36 sh-sl01 slurmctld[334976]: _pick_best_nodes: job 7849143 never runnable in partition normal

Thanks!
Stephane
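For reference, here is roughly what I've been trying when attempting to reproduce it (the -c and --mem-per-cpu values below are illustrative guesses, not the exact ones from job 7849143):

# Illustrative only: submit the same request with the partitions listed in both
# orders, then compare the CPUs/Task value that scontrol reports for each job.
$ sbatch -p agitler,normal,owners -n 1 -c 16 --mem-per-cpu=8000M --wrap="hostname"
$ sbatch -p normal,agitler,owners -n 1 -c 16 --mem-per-cpu=8000M --wrap="hostname"
$ scontrol show job <jobid> | grep -o 'CPUs/Task=[0-9]*'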
Hi Stephane - Can you attach your current slurm.conf file for the cluster? I'm going to see if Dominik can chase down a reason this could happen. If you have a way to trigger this again, it might be helpful if you could attach logs captured while the TraceJobs and Backfill DebugFlags were turned on temporarily.
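In case it helps, those flags can be toggled at runtime on the controller, roughly along these lines (just a sketch; adjust as needed for your setup):

# Enable the extra debug output temporarily:
$ scontrol setdebugflags +TraceJobs
$ scontrol setdebugflags +Backfill
# ...reproduce the issue and capture the slurmctld log, then turn them back off:
$ scontrol setdebugflags -TraceJobs
$ scontrol setdebugflags -Backfill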
Hi, Without the data Tim mentioned I can't be sure, but I can reproduce similar/the same behavior. I am not sure whether this is a bug or just an effect of submitting to multiple partitions with different MaxMemPerCPU (see the illustrative excerpt below). I will look at this and let you know what we can do about it. Dominik
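To illustrate the kind of setup I have in mind (a made-up slurm.conf excerpt, not Sherlock's actual configuration; the node ranges and limits are hypothetical):

# Two partitions with different MaxMemPerCPU limits. If a job's --mem-per-cpu
# exceeds a partition's MaxMemPerCPU, slurmctld may raise the job's cpus-per-task
# to keep the per-CPU memory under the limit, which could explain a CPUs/Task
# value that differs from NumCPUs when several partitions are listed.
PartitionName=agitler Nodes=sh-113-[01-16] MaxMemPerCPU=4000 State=UP
PartitionName=normal  Nodes=sh-101-[01-60] MaxMemPerCPU=8000 Default=YES State=UP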
Hi Tim and Dominik, Thanks much for looking into this! I just sent you the current slurm.conf by email. Stephane
Hi Stephane, this should be fixed by commit https://github.com/SchedMD/slurm/commit/bf4cb0b1b01f3e, which will be included starting from 17.11.7. In the meantime you can apply it at your earliest convenience by appending ".patch" to the GitHub URL. We're going to mark this as resolved/fixed; please reopen if you find any new issue after applying it. Thanks.
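For reference, fetching and applying the commit could look roughly like this (an illustrative sketch; adjust the paths to your 17.11.4 source tree and rebuild afterwards):

# Download the commit as a patch from GitHub and apply it to the source tree
$ wget https://github.com/SchedMD/slurm/commit/bf4cb0b1b01f3e.patch
$ cd slurm-17.11.4
$ patch -p1 < ../bf4cb0b1b01f3e.patch
# then rebuild and reinstall slurmctld/slurmd as usual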