Ticket 4987

Summary: Maintenance Reservation "Reason"
Product: Slurm Reporter: Kaylea Nelson <kaylea.nelson>
Component: User Commands Assignee: Dominik Bartkiewicz <bart>
Status: RESOLVED FIXED QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: pbisbal
Version: 17.11.3   
Hardware: Linux   
OS: Linux   
Site: Yale
Version Fixed: 18.08.0pre2
Attachments: slurm.conf

Description Kaylea Nelson 2018-03-26 15:40:15 MDT
As suggested, we use maintenance reservations to prevent jobs from starting if they would be killed by the scheduled downtime. However, the "Reason" listed for jobs whose walltime request would extend into the reservation is

(ReqNodeNotAvail, UnavailableNodes:...)

and then it lists all down/drained nodes. Is there a way to make the text of this reason better align with the actual reason the job isn't running (e.g. "Upcoming Maintenance" or "Walltime Overlaps with Maintenance")?
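For reference, the UnavailableNodes field uses Slurm's compressed hostlist notation; `scontrol show hostnames 'c14n[01-04]'` expands it natively. A minimal Python sketch of the same expansion, assuming a single bracketed range group per expression (a simplification of the full Slurm syntax):

```python
import re

def expand_hostlist(expr):
    """Expand a simple Slurm hostlist like 'c14n[01-04,10]' into node names.

    Handles one bracketed group of comma-separated ranges per expression;
    'scontrol show hostnames' remains the authoritative tool.
    """
    m = re.fullmatch(r"([^\[\]]+)\[([^\]]+)\]", expr)
    if not m:
        return [expr]  # plain node name, no bracketed ranges
    prefix, ranges = m.groups()
    nodes = []
    for part in ranges.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            width = len(lo)  # preserve zero-padding, e.g. '01' -> 2 digits
            nodes.extend(f"{prefix}{i:0{width}d}"
                         for i in range(int(lo), int(hi) + 1))
        else:
            nodes.append(prefix + part)
    return nodes
```

For example, `expand_hostlist("c14n[01-04,10]")` yields c14n01 through c14n04 plus c14n10.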
Comment 1 Dominik Bartkiewicz 2018-03-27 04:46:49 MDT
Hi

Could you send me your slurm.conf, the output of "scontrol show res", and
"scontrol show job <example job>"?
We are already working on two related problems, and I need to know whether this is something new or another instance of an existing bug.

Dominik
Comment 2 Kaylea Nelson 2018-03-27 08:47:15 MDT
Created attachment 6483 [details]
slurm.conf

See attached for slurm.conf.

scontrol show res:

ReservationName=root_2 StartTime=2017-10-12T19:19:20
EndTime=2018-10-12T19:19:20 Duration=365-00:00:00
   Nodes=c14n[01-04] NodeCnt=4 CoreCnt=32 Features=(null)
PartitionName=(null) Flags=MAINT,SPEC_NODES
   TRES=cpu=32
   Users=root Accounts=(null) Licenses=(null) State=ACTIVE
BurstBuffer=(null) Watts=n/a

ReservationName=root_3 StartTime=2017-11-25T00:00:00
EndTime=2018-11-25T00:00:00 Duration=365-00:00:00
   Nodes=c19n12 NodeCnt=1 CoreCnt=8 Features=(null) PartitionName=(null)
Flags=MAINT,SPEC_NODES
   TRES=cpu=8
   Users=root Accounts=(null) Licenses=(null) State=ACTIVE
BurstBuffer=(null) Watts=n/a

ReservationName=root_4 StartTime=2018-04-02T08:00:00
EndTime=2019-04-02T08:00:00 Duration=365-00:00:00
   Nodes=c01n[02-16],c02n[01-16],c03n[01-16],c04n[01-16],c05n[
01-16],c06n[01-16],c07n[01-16],c08n[01-16],c09n[01-16],c10n[
01-16],c11n[01-16],c12n[01-16],c13n[01-16],c14n[01-16],c15n[
01-16],c16n[01-16],c17n[01-16],c18n[02-16],c19n[01-16],c20n[
01-16],c21n[01-16],c22n[01-16],c23n[01-16],c24n[01-16],c25n[
01-16],c26n[01-16],c27n[01-16],c28n[01-16],c29n[01-16],c30n[
01-16],c31n[01-16],c32n[01-07,09-16],c33n[01-16],c34n[01-16]
,c35n[01-16],c36n[01-16],c37n[01-16],c38n[01-16],c39n[01-16]
,c40n[01-16],c41n[01-16],c42n[01-16],c43n[01-16],c44n[01-16]
,c45n[01-04],c46n[01-05,07-08],c47n[01-04,06-07],c48n[01-07]
,c49n[01-08],c50n[01-08],c51n[01-08],c52n[01-08],c53n[01-08] NodeCnt=765
CoreCnt=6392 Features=(null) PartitionName=(null)
Flags=MAINT,IGNORE_JOBS,SPEC_NODES,ALL_NODES
   TRES=cpu=6392
   Users=root Accounts=(null) Licenses=(null) State=INACTIVE
BurstBuffer=(null) Watts=n/a

Example scontrol show job:
JobId=597372 JobName=t16z16defk14a14
   UserId=zm56(12261) GroupId=storelvmo(10081) MCS_label=N/A
   Priority=11204 Nice=0 Account=geo_storelvmo QOS=normal
   JobState=PENDING Reason=ReqNodeNotAvail,_UnavailableNodes:c03n16,
c06n02,c14n[01-04,10],c15n01,c16n06,c17n12,c19n12,c28n08,
c35n13,c45n[01-04],c46n[02,04-05],c47n[01,03,06-07],c48n[02-
03],c49n05,c50n[01-02,07],c51n[02,07],c52n[06-07],c53n[01,04-06]
Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=7-00:00:00 TimeMin=N/A
   SubmitTime=2018-03-26T16:35:36 EligibleTime=2018-03-26T16:35:36
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=geo AllocNode:Sid=omega2:22355
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=16-16 NumCPUs=128 NumTasks=128 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=128,mem=524288,node=16
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=32G MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=./t16z16defk14a14_merradust_n0.083_d2.39_2004_
uvonly_1.4xwsahara_1.run
   WorkDir=/lustre/home/client/geo/storelvmo/zm56/cesmruns/
t16z16defk14a14_merradust_n0.083_d2.39_2004_uvonly_1.4xwsahara_1
   StdErr=/lustre/home/client/geo/storelvmo/zm56/cesmruns/
t16z16defk14a14_merradust_n0.083_d2.39_2004_uvonly_1.
4xwsahara_1/slurm-597372.out
   StdIn=/dev/null
   StdOut=/lustre/home/client/geo/storelvmo/zm56/cesmruns/
t16z16defk14a14_merradust_n0.083_d2.39_2004_uvonly_1.
4xwsahara_1/slurm-597372.out
   Power=

---------------
Kaylea Nelson, PhD | kaylea.nelson@yale.edu
Computational Research Support Analyst
Yale Center for Research Computing <http://research.computing.yale.edu/>

Comment 3 Dominik Bartkiewicz 2018-04-19 09:33:32 MDT
Hi

Commit https://github.com/SchedMD/slurm/commit/fc4e5ac9e0563c8 should change the reason string to: "ReqNodeNotAvail,_May_be_reserved_for_other_job"

If you want, we can reclassify this ticket as an enhancement.
The feature described requires changes that can only be made in the next major release.

Dominik
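For monitoring scripts that parse pending reasons across Slurm versions, a rough sketch of distinguishing the old and new wordings (the `classify_reason` helper and its labels are this ticket's illustration, not part of Slurm):

```python
def classify_reason(reason):
    """Map a Slurm pending-reason string to a short human-readable label.

    scontrol prints spaces in the reason as underscores; the labels
    below are illustrative, not Slurm's own.
    """
    r = reason.replace("_", " ")
    if r.startswith("ReqNodeNotAvail"):
        if "reserved for maintenance" in r:       # reason added in 18.08
            return "blocked by maintenance reservation"
        if "May be reserved for other job" in r:  # post-fc4e5ac9 wording
            return "may overlap an upcoming reservation"
        return "required nodes unavailable"
    return r
```

For example, `classify_reason("ReqNodeNotAvail,_May_be_reserved_for_other_job")` returns "may overlap an upcoming reservation".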
Comment 4 Kaylea Nelson 2018-04-19 09:40:14 MDT
Hi Dominik,

Thanks, this will be really useful for regular reservations. However, is there a way to have a different reason string for reservations specifically flagged as MAINT, such as "ReqNodeNotAvail,_reserved_for_maintenance"? I feel the new text will still be very confusing to users when maintenance reservations are in place.

Thanks!

Comment 8 Tim Wickberg 2018-05-07 18:29:45 MDT
*** Ticket 5138 has been marked as a duplicate of this ticket. ***
Comment 11 Dominik Bartkiewicz 2018-06-11 05:42:17 MDT
Hi

In commit https://github.com/SchedMD/slurm/commit/4b9a7589067d88 we added a new
job pending reason, "ReqNodeNotAvail, reserved for maintenance".
It will be included in release 18.08.
I will close this ticket.

Dominik