Bug 4987 - Maintenance Reservation "Reason"
Summary: Maintenance Reservation "Reason"
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: User Commands
Version: 17.11.3
Hardware: Linux
Importance: --- 4 - Minor Issue
Assignee: Dominik Bartkiewicz
QA Contact:
URL:
Duplicates: 5138
Depends on:
Blocks:
 
Reported: 2018-03-26 15:40 MDT by Kaylea Nelson
Modified: 2018-06-11 06:22 MDT
CC List: 1 user

See Also:
Site: Yale
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 18.08.0pre2
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
slurm.conf (6.11 KB, application/octet-stream)
2018-03-27 08:47 MDT, Kaylea Nelson

Description Kaylea Nelson 2018-03-26 15:40:15 MDT
As suggested, we use maintenance reservations to prevent jobs from starting that will be killed by the scheduled downtime. However, the "Reason" that is listed for jobs whose walltime request would extend into the reservation is

(ReqNodeNotAvail, UnavailableNodes:...)

and then it lists all down/drained nodes. Is there a way to make the text of this reason better align with the actual reason the job isn't running (e.g. "Upcoming Maintenance" or "Walltime Overlaps with Maintenance")?
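
For context, the pattern in use here looks roughly like the following (the reservation name, window, and duration are placeholders for illustration, not this site's actual values). A maintenance reservation is created along the lines of:

scontrol create reservation reservationname=maint_example \
    starttime=2018-04-02T08:00:00 duration=1-00:00:00 \
    users=root flags=maint,ignore_jobs nodes=ALL

and the reason shown for held jobs can then be listed with something like:

squeue --states=PENDING -o "%.10i %.9P %.8T %r"

which is where the generic "(ReqNodeNotAvail, UnavailableNodes:...)" text appears.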
Comment 1 Dominik Bartkiewicz 2018-03-27 04:46:49 MDT
Hi

Could you send me your slurm.conf, the output of "scontrol show res", and
"scontrol show job <example job>"?
We are already working on two related problems, and I need to know whether this is something new or another instance of an existing bug.

Dominik
Comment 2 Kaylea Nelson 2018-03-27 08:47:15 MDT
Created attachment 6483 [details]
slurm.conf

See attached for slurm.conf.

scontrol show res:

ReservationName=root_2 StartTime=2017-10-12T19:19:20 EndTime=2018-10-12T19:19:20 Duration=365-00:00:00
   Nodes=c14n[01-04] NodeCnt=4 CoreCnt=32 Features=(null) PartitionName=(null) Flags=MAINT,SPEC_NODES
   TRES=cpu=32
   Users=root Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

ReservationName=root_3 StartTime=2017-11-25T00:00:00 EndTime=2018-11-25T00:00:00 Duration=365-00:00:00
   Nodes=c19n12 NodeCnt=1 CoreCnt=8 Features=(null) PartitionName=(null) Flags=MAINT,SPEC_NODES
   TRES=cpu=8
   Users=root Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

ReservationName=root_4 StartTime=2018-04-02T08:00:00 EndTime=2019-04-02T08:00:00 Duration=365-00:00:00
   Nodes=c01n[02-16],c02n[01-16],c03n[01-16],c04n[01-16],c05n[01-16],c06n[01-16],c07n[01-16],c08n[01-16],c09n[01-16],c10n[01-16],c11n[01-16],c12n[01-16],c13n[01-16],c14n[01-16],c15n[01-16],c16n[01-16],c17n[01-16],c18n[02-16],c19n[01-16],c20n[01-16],c21n[01-16],c22n[01-16],c23n[01-16],c24n[01-16],c25n[01-16],c26n[01-16],c27n[01-16],c28n[01-16],c29n[01-16],c30n[01-16],c31n[01-16],c32n[01-07,09-16],c33n[01-16],c34n[01-16],c35n[01-16],c36n[01-16],c37n[01-16],c38n[01-16],c39n[01-16],c40n[01-16],c41n[01-16],c42n[01-16],c43n[01-16],c44n[01-16],c45n[01-04],c46n[01-05,07-08],c47n[01-04,06-07],c48n[01-07],c49n[01-08],c50n[01-08],c51n[01-08],c52n[01-08],c53n[01-08] NodeCnt=765 CoreCnt=6392 Features=(null) PartitionName=(null) Flags=MAINT,IGNORE_JOBS,SPEC_NODES,ALL_NODES
   TRES=cpu=6392
   Users=root Accounts=(null) Licenses=(null) State=INACTIVE BurstBuffer=(null) Watts=n/a

Example scontrol show job:
JobId=597372 JobName=t16z16defk14a14
   UserId=zm56(12261) GroupId=storelvmo(10081) MCS_label=N/A
   Priority=11204 Nice=0 Account=geo_storelvmo QOS=normal
   JobState=PENDING Reason=ReqNodeNotAvail,_UnavailableNodes:c03n16,c06n02,c14n[01-04,10],c15n01,c16n06,c17n12,c19n12,c28n08,c35n13,c45n[01-04],c46n[02,04-05],c47n[01,03,06-07],c48n[02-03],c49n05,c50n[01-02,07],c51n[02,07],c52n[06-07],c53n[01,04-06] Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=7-00:00:00 TimeMin=N/A
   SubmitTime=2018-03-26T16:35:36 EligibleTime=2018-03-26T16:35:36
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=geo AllocNode:Sid=omega2:22355
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=16-16 NumCPUs=128 NumTasks=128 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=128,mem=524288,node=16
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=32G MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=./t16z16defk14a14_merradust_n0.083_d2.39_2004_uvonly_1.4xwsahara_1.run
   WorkDir=/lustre/home/client/geo/storelvmo/zm56/cesmruns/t16z16defk14a14_merradust_n0.083_d2.39_2004_uvonly_1.4xwsahara_1
   StdErr=/lustre/home/client/geo/storelvmo/zm56/cesmruns/t16z16defk14a14_merradust_n0.083_d2.39_2004_uvonly_1.4xwsahara_1/slurm-597372.out
   StdIn=/dev/null
   StdOut=/lustre/home/client/geo/storelvmo/zm56/cesmruns/t16z16defk14a14_merradust_n0.083_d2.39_2004_uvonly_1.4xwsahara_1/slurm-597372.out
   Power=
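
In other words, the job above appears to be held because its 7-day time limit, counted from the submit time, would run past the start of the root_4 maintenance reservation. A rough way to check this by hand (assuming GNU date; the values are taken from the output above):

submit="2018-03-26T16:35:36"
resv_start="2018-04-02T08:00:00"
# submit time plus the 7-day TimeLimit, in epoch seconds
job_end=$(( $(date -d "$submit" +%s) + 7*24*3600 ))
[ "$job_end" -gt "$(date -d "$resv_start" +%s)" ] && echo "time limit crosses the maintenance window"

2018-03-26T16:35:36 plus 7 days is 2018-04-02T16:35:36, which is after the reservation's 2018-04-02T08:00:00 start, so the scheduler holds the job rather than let it run into the maintenance window.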

Comment 3 Dominik Bartkiewicz 2018-04-19 09:33:32 MDT
Hi

Commit https://github.com/SchedMD/slurm/commit/fc4e5ac9e0563c8 should change the reason string to "ReqNodeNotAvail,_May_be_reserved_for_other_job".

If you want, we can reclassify this ticket as an enhancement now. The feature you describe requires changes that can only be made in the next major release.
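
For reference, once that commit is in place, the reason string on a pending job can be checked with, for example (using the example job ID from this ticket):

squeue -j 597372 -o "%.10i %.8T %r"

or with "scontrol show job 597372", which prints the same Reason= field.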

Dominik
Comment 4 Kaylea Nelson 2018-04-19 09:40:14 MDT
Hi Dominik,

Thanks, this will be really useful for regular reservations. However, is there a way to have a different reason string for reservations specifically flagged as MAINT, such as "ReqNodeNotAvail,_reserved_for_maintenance"? I feel like the new text will still be very confusing to users when there are maintenance reservations.

Thanks!

Comment 8 Tim Wickberg 2018-05-07 18:29:45 MDT
*** Bug 5138 has been marked as a duplicate of this bug. ***
Comment 11 Dominik Bartkiewicz 2018-06-11 05:42:17 MDT
Hi

In commit https://github.com/SchedMD/slurm/commit/4b9a7589067d88 we added a new
job pending reason, "ReqNodeNotAvail, reserved for maintenance".
It will be included in the 18.08 release.
I will close this ticket.

Dominik