Created attachment 9764
patch that merges with both 18.08 and master branches

Hello,

I use `scontrol completing` frequently to identify nodes that are causing trouble (usually because of bugs in the OS kernel, IO systems, and sometimes Slurm). I frequently end up looping over `scontrol completing` and then using `scontrol show job <jobid>` to determine whether the completing job is transient or not.

Please find attached a patch I would like you to consider for 19.05. It augments the output of `scontrol completing` with EndTime and CompletingTime fields to ease debugging in these situations (as well as automated monitoring).

Thank you,
Doug
This is how it looks on Gerty, and it will be rolled out onto Cori in tomorrow's maintenance, as our operators will be relying on this information to speed up detection of stuck jobs.

csamuel@gert01:~/TESTS/Slurm/Gerty/Basic> scancel -u csamuel
csamuel@gert01:~/TESTS/Slurm/Gerty/Basic> scontrol completing
JobId=1137366 EndTime=2019-04-08T16:45:18 CompletingTime=00:00:02 Nodes(COMPLETING)=nid00028
JobId=1137367 EndTime=2019-04-08T16:45:18 CompletingTime=00:00:02 Nodes(COMPLETING)=nid00028
JobId=1137368 EndTime=2019-04-08T16:45:18 CompletingTime=00:00:02 Nodes(COMPLETING)=nid00028
JobId=1137369 EndTime=2019-04-08T16:45:18 CompletingTime=00:00:02 Nodes(COMPLETING)=nid00028
JobId=1137370 EndTime=2019-04-08T16:45:18 CompletingTime=00:00:02 Nodes(COMPLETING)=nid00028
JobId=1137371 EndTime=2019-04-08T16:45:18 CompletingTime=00:00:02 Nodes(COMPLETING)=nid00028
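For the automated-monitoring use case, here is a rough sketch of how the new fields might be consumed. It is not part of the patch. The field layout is taken only from the sample output above, and the `stuck_jobs` helper, the 300-second threshold, and the handling of a `D-HH:MM:SS` day prefix are assumptions made for illustration.

```python
import re

# Field layout copied from the sample output above; other versions may differ.
LINE_RE = re.compile(
    r"JobId=(?P<job>\d+)\s+"
    r"EndTime=(?P<end>\S+)\s+"
    r"CompletingTime=(?P<ctime>\S+)\s+"
    r"Nodes\(COMPLETING\)=(?P<nodes>\S+)"
)


def to_seconds(text):
    """Convert HH:MM:SS to seconds (a D-HH:MM:SS day prefix is an assumption)."""
    days = 0
    if "-" in text:
        day_part, text = text.split("-", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def stuck_jobs(output, threshold_s=300):
    """Return (job_id, nodes, seconds) for jobs completing longer than threshold_s.

    threshold_s is a hypothetical cutoff; choose one that suits your site.
    """
    found = []
    for match in LINE_RE.finditer(output):
        secs = to_seconds(match.group("ctime"))
        if secs > threshold_s:
            found.append((match.group("job"), match.group("nodes"), secs))
    return found


if __name__ == "__main__":
    # First line is from the output above; the second is made up for the demo.
    sample = (
        "JobId=1137366 EndTime=2019-04-08T16:45:18 CompletingTime=00:00:02 "
        "Nodes(COMPLETING)=nid00028\n"
        "JobId=999 EndTime=2019-04-08T16:45:18 CompletingTime=00:12:30 "
        "Nodes(COMPLETING)=nid00001\n"
    )
    for job_id, nodes, secs in stuck_jobs(sample):
        print(f"job {job_id} has been completing for {secs}s on {nodes}")
```

In practice the `output` string would come from running `scontrol completing` itself, for example via `subprocess.run(["scontrol", "completing"], capture_output=True, text=True).stdout`. This replaces the manual loop over `scontrol show job <jobid>`, since the elapsed completing time is already on each line.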
Thanks Doug. This is in for 19.05.0pre4 and above:

commit c39a971504d49ea99218b332ed23df510222a29a
Author:     Doug Jacobsen <dmjacobsen@lbl.gov>
AuthorDate: Thu Apr 11 09:48:38 2019 +0800

    Add EndTime and CompletingTime fields to 'scontrol completing'.

    Bug 6787.