Ticket 2472 - BadConstraints if constraints are modified
Summary: BadConstraints if constraints are modified
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 15.08.8
Hardware: Linux Linux
Importance: --- 3 - Medium Impact
Assignee: Danny Auble
QA Contact:
URL:
Duplicates: 2478
Depends on:
Blocks:
 
Reported: 2016-02-22 22:02 MST by CSC sysadmins
Modified: 2016-02-24 07:19 MST
CC List: 1 user

See Also:
Site: CSC - IT Center for Science
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 15.08.9 16.05.0-pre2
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Description CSC sysadmins 2016-02-22 22:02:19 MST
We have two kinds of nodes: snb = 16 cores and hsw = 24 cores. After updating Slurm from 14.11 to 15.08.8 (92ac0dcdbb78df968962acc5d006e1f3aeb6eb37), altering a job's constraints results in a BadConstraints state. Slurm sets NumNodes so that the job can run only on nodes which have 24 cores.


srun -C "hsw|snb" --mem-per-cpu=12000 -n 320 -p parallel --pty $SHELL

NumNodes=21 NumCPUs=320 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=320,mem=3840000,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*


scontrol update JobId=8500942 Features=snb
scontrol show job 8500942

JobId=8500942 JobName=bash
   JobState=PENDING Reason=BadConstraints Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:05:00 TimeMin=N/A
   SubmitTime=2016-02-23T14:34:44 EligibleTime=2016-02-23T14:34:44
   StartTime=2016-02-23T14:35:15 EndTime=2016-02-23T14:35:15
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=parallel AllocNode:Sid=taito-login3:59995
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null) SchedNodeList=c[609-615,620-624,627-635,637-638]
   NumNodes=21 NumCPUs=320 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=320,mem=3840000,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=320 MinMemoryCPU=12000M MinTmpDiskNode=0
   Features=snb Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/bin/bash
   WorkDir=/homeappl/home/ttervo
   Power= SICP=0
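
As a scripted reproduction (a minimal sketch of the same sequence, using sbatch --wrap instead of the interactive srun above so it can be run non-interactively; <jobid> stands for the job ID printed by sbatch):

    # Submit a job that may run on either node type, with no explicit -N node count.
    sbatch -C "hsw|snb" --mem-per-cpu=12000 -n 320 -p parallel --wrap="sleep 600"

    # Restrict the pending job to the 16-core snb nodes.
    scontrol update JobId=<jobid> Features=snb

    # The job now reports Reason=BadConstraints, with MinCPUsNode raised to 320.
    scontrol show job <jobid> | grep -E "Reason=|MinCPUsNode="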
Comment 1 Tim Wickberg 2016-02-24 05:19:01 MST
*** Ticket 2478 has been marked as a duplicate of this ticket. ***
Comment 2 Tim Wickberg 2016-02-24 05:32:20 MST
We're working on this; it's a regression in Slurm 15.08.6 and later that has shown up in a few different tickets this week.

Any update to a job submitted without a -N (node) count incorrectly results in MinCPUsNode=(total-number-of-cpus). In your example above it was set to 320, and I assume you don't have any nodes with that many CPUs available.
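
A quick way to spot jobs affected by this (a sketch, assuming the symptom is a pending job with Reason=BadConstraints and an inflated MinCPUsNode):

    # List pending jobs held with BadConstraints and print their CPU requirements.
    squeue -h -t PD -o "%i %r" | awk '$2 == "BadConstraints" {print $1}' |
    while read jobid; do
        echo "Job $jobid:"
        scontrol show job "$jobid" | grep -E "NumCPUs=|MinCPUsNode="
    done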
Comment 3 Danny Auble 2016-02-24 07:19:17 MST
Hey Tommi, this is fixed in commits bd9fa8300b1 and de28c13a159d. Please reopen if they don't fix the problem for you.