Bug 6787 - Add EndTime, CompletingTime to output of scontrol completing
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: User Commands
Version: 19.05.x
Hardware: Linux Linux
Importance: --- C - Contributions
Assignee: Tim Wickberg
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2019-04-01 10:42 MDT by Doug Jacobsen
Modified: 2019-11-25 12:29 MST

See Also:
Site: NERSC
Version Fixed: 19.05.0pre4


Attachments
patch that merges with both 18.08 and master branches (992 bytes, patch)
2019-04-01 10:42 MDT, Doug Jacobsen

Description Doug Jacobsen 2019-04-01 10:42:02 MDT
Created attachment 9764
patch that merges with both 18.08 and master branches

Hello,

I use `scontrol completing` frequently to identify nodes that are causing trouble (usually because of bugs in the OS kernel, I/O systems, and sometimes Slurm). I frequently end up looping over `scontrol completing` and then running `scontrol show job <jobid>` to determine whether each completing job is transient or not.

Please find attached a patch I would like for you to consider in 19.05 to augment the output of `scontrol completing` with EndTime and CompletingTime output to ease debugging in these situations (as well as ease automated monitoring).

Thank you,
Doug
Comment 1 Chris Samuel (NERSC) 2019-04-09 09:26:00 MDT
This is how it looks on Gerty, and it will be rolled out onto Cori in tomorrow's maintenance, as our operators will be relying on this information to speed detection of stuck jobs.

csamuel@gert01:~/TESTS/Slurm/Gerty/Basic> scancel -u csamuel
csamuel@gert01:~/TESTS/Slurm/Gerty/Basic> scontrol completing
JobId=1137366 EndTime=2019-04-08T16:45:18 CompletingTime=00:00:02 Nodes(COMPLETING)=nid00028
JobId=1137367 EndTime=2019-04-08T16:45:18 CompletingTime=00:00:02 Nodes(COMPLETING)=nid00028
JobId=1137368 EndTime=2019-04-08T16:45:18 CompletingTime=00:00:02 Nodes(COMPLETING)=nid00028
JobId=1137369 EndTime=2019-04-08T16:45:18 CompletingTime=00:00:02 Nodes(COMPLETING)=nid00028
JobId=1137370 EndTime=2019-04-08T16:45:18 CompletingTime=00:00:02 Nodes(COMPLETING)=nid00028
JobId=1137371 EndTime=2019-04-08T16:45:18 CompletingTime=00:00:02 Nodes(COMPLETING)=nid00028
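
For automated monitoring, output in the format above could be parsed along these lines. This is only a rough sketch, not part of the patch: the function names and the 5-minute threshold are made up for illustration, and it assumes CompletingTime is printed as HH:MM:SS, optionally with a leading day count (D-HH:MM:SS), which the sample above does not confirm.

```python
import re
from datetime import timedelta

# Matches one line of `scontrol completing` output as shown in this bug.
LINE_RE = re.compile(
    r"JobId=(?P<job_id>\d+)\s+"
    r"EndTime=(?P<end_time>\S+)\s+"
    r"CompletingTime=(?P<completing>\S+)\s+"
    r"Nodes\(COMPLETING\)=(?P<nodes>\S+)"
)


def parse_duration(text):
    """Convert HH:MM:SS (or, by assumption, D-HH:MM:SS) to a timedelta."""
    days = 0
    if "-" in text:
        day_part, text = text.split("-", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


def stuck_jobs(output, threshold=timedelta(minutes=5)):
    """Return (job_id, nodes, completing_time) for jobs at or over threshold.

    The threshold default is an arbitrary illustrative value.
    """
    stuck = []
    for line in output.splitlines():
        match = LINE_RE.search(line)
        if match is None:
            continue  # skip anything that is not a job line
        elapsed = parse_duration(match.group("completing"))
        if elapsed >= threshold:
            stuck.append((int(match.group("job_id")), match.group("nodes"), elapsed))
    return stuck
```

The text passed to `stuck_jobs` could come from something like `subprocess.run(["scontrol", "completing"], capture_output=True, text=True).stdout`, which would replace the per-job `scontrol show job <jobid>` loop.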
Comment 2 Tim Wickberg 2019-04-10 20:06:52 MDT
Thanks Doug. This is in for 19.05.0pre4 and above:

commit c39a971504d49ea99218b332ed23df510222a29a
Author:     Doug Jacobsen <dmjacobsen@lbl.gov>
AuthorDate: Thu Apr 11 09:48:38 2019 +0800

    Add EndTime and CompletingTime fields to 'scontrol completing'.
    
    Bug 6787.