Summary: | Maintenance Reservation "Reason"
---|---
Product: | Slurm
Reporter: | Kaylea Nelson <kaylea.nelson>
Component: | User Commands
Assignee: | Dominik Bartkiewicz <bart>
Status: | RESOLVED FIXED
Severity: | 4 - Minor Issue
CC: | pbisbal
Version: | 17.11.3
Version Fixed: | 18.08.0pre2
Hardware: | Linux
OS: | Linux
Site: | Yale
Attachments: | slurm.conf
Description

Kaylea Nelson 2018-03-26 15:40:15 MDT

Comment 1: Dominik Bartkiewicz

Hi

Could you send me slurm.conf plus the output of "scontrol show res" and "scontrol show job <example job>"?
We are already working on two related problems, and I need to know whether this is something new or another instance of an existing bug.

Dominik

Comment 2: Kaylea Nelson

Created attachment 6483 [details]
slurm.conf

See attached for slurm.conf.

scontrol show res:

ReservationName=root_2 StartTime=2017-10-12T19:19:20 EndTime=2018-10-12T19:19:20 Duration=365-00:00:00
   Nodes=c14n[01-04] NodeCnt=4 CoreCnt=32 Features=(null) PartitionName=(null)
   Flags=MAINT,SPEC_NODES TRES=cpu=32
   Users=root Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

ReservationName=root_3 StartTime=2017-11-25T00:00:00 EndTime=2018-11-25T00:00:00 Duration=365-00:00:00
   Nodes=c19n12 NodeCnt=1 CoreCnt=8 Features=(null) PartitionName=(null)
   Flags=MAINT,SPEC_NODES TRES=cpu=8
   Users=root Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

ReservationName=root_4 StartTime=2018-04-02T08:00:00 EndTime=2019-04-02T08:00:00 Duration=365-00:00:00
   Nodes=c01n[02-16],c02n[01-16],c03n[01-16],c04n[01-16],c05n[01-16],c06n[01-16],c07n[01-16],c08n[01-16],c09n[01-16],c10n[01-16],c11n[01-16],c12n[01-16],c13n[01-16],c14n[01-16],c15n[01-16],c16n[01-16],c17n[01-16],c18n[02-16],c19n[01-16],c20n[01-16],c21n[01-16],c22n[01-16],c23n[01-16],c24n[01-16],c25n[01-16],c26n[01-16],c27n[01-16],c28n[01-16],c29n[01-16],c30n[01-16],c31n[01-16],c32n[01-07,09-16],c33n[01-16],c34n[01-16],c35n[01-16],c36n[01-16],c37n[01-16],c38n[01-16],c39n[01-16],c40n[01-16],c41n[01-16],c42n[01-16],c43n[01-16],c44n[01-16],c45n[01-04],c46n[01-05,07-08],c47n[01-04,06-07],c48n[01-07],c49n[01-08],c50n[01-08],c51n[01-08],c52n[01-08],c53n[01-08]
   NodeCnt=765 CoreCnt=6392 Features=(null) PartitionName=(null)
   Flags=MAINT,IGNORE_JOBS,SPEC_NODES,ALL_NODES TRES=cpu=6392
   Users=root Accounts=(null) Licenses=(null) State=INACTIVE BurstBuffer=(null) Watts=n/a

Example scontrol show job:

JobId=597372 JobName=t16z16defk14a14 UserId=zm56(12261)
GroupId=storelvmo(10081) MCS_label=N/A
Priority=11204 Nice=0 Account=geo_storelvmo QOS=normal
JobState=PENDING Reason=ReqNodeNotAvail,_UnavailableNodes:c03n16,c06n02,c14n[01-04,10],c15n01,c16n06,c17n12,c19n12,c28n08,c35n13,c45n[01-04],c46n[02,04-05],c47n[01,03,06-07],c48n[02-03],c49n05,c50n[01-02,07],c51n[02,07],c52n[06-07],c53n[01,04-06] Dependency=(null)
Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=7-00:00:00 TimeMin=N/A
SubmitTime=2018-03-26T16:35:36 EligibleTime=2018-03-26T16:35:36
StartTime=Unknown EndTime=Unknown Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=geo AllocNode:Sid=omega2:22355
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null)
NumNodes=16-16 NumCPUs=128 NumTasks=128 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=128,mem=524288,node=16
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=32G MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
Command=./t16z16defk14a14_merradust_n0.083_d2.39_2004_uvonly_1.4xwsahara_1.run
WorkDir=/lustre/home/client/geo/storelvmo/zm56/cesmruns/t16z16defk14a14_merradust_n0.083_d2.39_2004_uvonly_1.4xwsahara_1
StdErr=/lustre/home/client/geo/storelvmo/zm56/cesmruns/t16z16defk14a14_merradust_n0.083_d2.39_2004_uvonly_1.4xwsahara_1/slurm-597372.out
StdIn=/dev/null
StdOut=/lustre/home/client/geo/storelvmo/zm56/cesmruns/t16z16defk14a14_merradust_n0.083_d2.39_2004_uvonly_1.4xwsahara_1/slurm-597372.out
Power=

Kaylea Nelson, PhD | kaylea.nelson@yale.edu
Computational Research Support Analyst
Yale Center for Research Computing <http://research.computing.yale.edu/>

Comment 3: Dominik Bartkiewicz

Hi

Commit https://github.com/SchedMD/slurm/commit/fc4e5ac9e0563c8 should change the reason string to: "ReqNodeNotAvail,_May_be_reserved_for_other_job"

If you want, we can now reclassify this ticket as an enhancement. The feature you describe requires changes that can only be made in the next major release.

Dominik

Comment 4: Kaylea Nelson

Hi Dominik,

Thanks, this will be really useful for regular reservations. However, is there a way to show a different reason string for reservations specifically flagged MAINT, such as "ReqNodeNotAvail,_reserved_for_maintenance"? I feel the new text will still be confusing to users when maintenance reservations are in place.

Thanks!

*** Ticket 5138 has been marked as a duplicate of this ticket. ***

Comment 5: Dominik Bartkiewicz

Hi

In commit https://github.com/SchedMD/slurm/commit/4b9a7589067d88 we added a new job pending reason, "ReqNodeNotAvail, reserved for maintenance". It will be in release 18.08. I will close this ticket.

Dominik
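For illustration, a maintenance window like the root_4 reservation discussed above can be created and inspected roughly as follows. This is only a sketch against a live Slurm cluster; the reservation name `maint_test`, the times, and the node list are placeholders, not values from this ticket:

```shell
# Create a maintenance reservation on specific nodes, held by root.
# MAINT marks it as a maintenance window; IGNORE_JOBS lets it be placed
# even where jobs are already running.
scontrol create reservation reservationname=maint_test \
    starttime=2018-04-02T08:00:00 duration=365-00:00:00 \
    flags=maint,ignore_jobs nodes=c14n[01-04] users=root

# Verify the reservation and its flags.
scontrol show reservation maint_test

# List pending jobs with their reason codes (%r). On Slurm >= 18.08, jobs
# blocked by a MAINT reservation report
# "ReqNodeNotAvail, reserved for maintenance" here; older releases showed
# the generic ReqNodeNotAvail text that prompted this ticket.
squeue --states=PENDING --format="%.10i %.8T %r"

# Remove the reservation once maintenance is complete.
scontrol delete reservationname=maint_test
```

These commands must run as a Slurm administrator (typically root or SlurmUser), since only operators may create or delete reservations.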