Bug 1868

Summary: New feature: cluster monitoring tool: schedtop v5.01
Product: Slurm Reporter: Dennis McRitchie <dmcr>
Component: Contributions Assignee: Unassigned Developer <dev-unassigned>
Status: OPEN --- QA Contact:
Severity: 5 - Enhancement    
Priority: --- CC: chris, dmcr, sts, tim
Version: 16.05.x   
Hardware: Linux   
OS: Linux   
Site: Princeton (PICSciE) Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: slurmtop main window screenshot
slurmtop interactive help screen screenshot
Springdale/RedHat 6 SRPM for schedtop v5.00
Springdale/RedHat 6 binary RPM for slurmtop v5.00
Tarball for schedtop v5.00
Springdale/RedHat 6 SRPM for schedtop v5.01
Springdale/RedHat 6 binary RPM for slurmtop v5.01
Tarball for schedtop v5.01
Springdale/RedHat 6 SRPM for schedtop v5.02
Springdale/RedHat 6 binary RPM for slurmtop v5.02
Tarball for schedtop v5.02

Description Dennis McRitchie 2015-08-16 07:53:26 MDT
Created attachment 2122 [details]
slurmtop main window screenshot

schedtop v5.00 is an enhancement to a long-lived and well-loved cluster monitoring tool for PBS known as pbstop. Based on pbstop v4.16 released by Fedora, schedtop v5.00 supports both SLURM and PBS, and also contains significant functional enhancements and bug fixes. See attached screenshots of the main window and the interactive help screen.

This release is provided in the form of a source schedtop RPM, a binary slurmtop RPM, and a schedtop tarball. The binary slurmtop RPM is created by typing:
> rpmbuild --rebuild --with slurm schedtop-5.00-1.sdl6.src.rpm 
(a "--with pbs" option exists to build a binary pbstop rpm)
The binary RPM can be installed as root.

slurmtop can also be built using the attached tarball by extracting the files and following the directions in the README file.

The changes since v4.16 of pbstop are as follows:

1) Added SLURM support, including support for subwindow node-level (offline, restore) and job-level (delete, hold, release, rerun) commands.
2) Refactored the code base to share common pbstop/slurmtop code in a new schedtop.pm module. The slurmtop and pbstop scripts contain the scheduler-specific code. The same division was applied to the POD documentation: the schedtop.pm, pbstop, and slurmtop documentation is split into several POD files that are assembled at build time.
3) Ported array job support, including array job compression (array job support courtesy of Gareth Williams at CSIRO).
4) Added array job support enhancements:
  4a) Enhanced array job support by displaying total number of allocated cores for compressed array jobs.
  4b) Array jobs are supported in pbstop when command-line utilities are used as the backend.
  4c) Fixed sort order problem with array job index zero.
5) Fixed perl-PBS (Perl API for PBS) build script to support Torque 3 and 4, including circumventing lack of swig support for Torque 3 and 4’s pbs_error.h file.
6) Re-implemented secondary "timeshare" grid to support servers with a very large number of cpus per node (i.e., nodes or servers requiring multiple terminal lines to display all their cpus such as SGI UV).
7) Major auto-configuration enhancements added for many cluster types and sizes: 
  7a) Unless explicitly set, show_cpu and maxnodegrid are automatically set to display all cluster nodes in the primary grid, except for those that cannot be displayed on a single terminal line. 
  7b) All nodes/servers whose cpu display will not fit on one terminal line are automatically assigned to be displayed in the secondary grid.
  7c) Either show_cpu or maxnodegrid can be explicitly set in order to force larger nodes into the secondary grid.
8) Autocolumns support was improved to better assign nodes to fit terminal width.
9) Added support for "-f" command-line option and "f" interactive command: toggle fill background with black.
10) Compact display ("no space", -n) was improved; also new interactive "N" toggle was added.
11) New interactive "L" toggle for limiting job view to specific queue was added.
12) New interactive "m" command to specify primary grid's max per-node CPU count was added. Can be reduced from the default to force larger CPU nodes into their own secondary grid.
13) Brought all POD (man page) documentation up-to-date, including new documentation for subwindow commands to offline and restore nodes, and delete, hold, release,  and re-run jobs.
14) Updated -h menu and interactive help screen to match man pages.
15) Better support for mixed busy/free node grid display: node's cpu status (busy/free) in grid now shown with cpu-level rather than node-level granularity; if job display disabled or user-specific job filtering in effect, nodes with 'free' status show cpu status accurately.
16) Helpful warning displayed if $maxrows is set too small to display all jobs (1500 default). Instructions for correcting value are provided in the message.
17) $maxcols changed to default to 300 (from 250) to accommodate wider terminals.
18) Grid legend moved directly under the grids for better visibility.
19) Window and subwindow formatting improvements.
20) Display expected run delay for queued jobs as negative elapsed time (slurmtop)
21) Highlight recently completed jobs (slurmtop)
22) Fixed bug with the 0-9 CPU number toggle in the primary grid, which broke CPU numbers > 9. This early feature is now deprecated, since it was not designed for nodes with 10 or more CPUs.
23) Display USC copyright for pbstop only.
24) Miscellaneous bug fixes.
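
As an illustration of the auto-configuration described in item 7, here is a minimal Python sketch of how nodes might be split between the primary and secondary grids. It is not the schedtop.pm code (which is Perl). The names `Node`, `split_grids`, `label_width`, and `max_primary_cpus` are hypothetical, and it assumes each CPU takes one terminal column after a fixed-width node label.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    cpus: int


def split_grids(nodes, term_width, label_width=8, max_primary_cpus=None):
    """Split nodes into (primary, secondary) grid lists.

    Assumption: one terminal column per CPU, after a node label of
    label_width columns. A node whose CPUs do not fit on a single
    terminal line goes to the secondary grid. An explicit
    max_primary_cpus (hypothetical stand-in for a user-set limit)
    can lower the cutoff to force larger nodes into the secondary grid.
    """
    per_line = term_width - label_width
    limit = per_line if max_primary_cpus is None else min(per_line, max_primary_cpus)
    primary = [n for n in nodes if n.cpus <= limit]
    secondary = [n for n in nodes if n.cpus > limit]
    return primary, secondary


if __name__ == "__main__":
    cluster = [Node("n1", 16), Node("n2", 32), Node("uv", 512)]
    primary, secondary = split_grids(cluster, term_width=80)
    print("primary:", [n.name for n in primary])
    print("secondary:", [n.name for n in secondary])
```

Lowering the cutoff (for example `max_primary_cpus=16`) mirrors the idea in items 7c and 12: nodes above the limit move to the secondary grid even though they would fit on one line.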

Please let me know if you have any questions.

Best regards,
Dennis
Comment 1 Dennis McRitchie 2015-08-16 07:55:24 MDT
Created attachment 2123 [details]
slurmtop interactive help screen screenshot
Comment 2 Dennis McRitchie 2015-08-16 07:57:51 MDT
Created attachment 2124 [details]
Springdale/RedHat 6 SRPM for schedtop v5.00
Comment 3 Dennis McRitchie 2015-08-16 07:59:27 MDT
Created attachment 2125 [details]
Springdale/RedHat 6 binary RPM for slurmtop v5.00
Comment 4 Dennis McRitchie 2015-08-16 08:00:22 MDT
Created attachment 2126 [details]
Tarball for schedtop v5.00
Comment 5 Dennis McRitchie 2015-08-16 08:12:50 MDT
Can be run under SLURM 14.11, but needs 15.08 to benefit from memory leak fixes in the SLURM Perl API.
Comment 6 Dennis McRitchie 2015-08-20 04:50:18 MDT
Created attachment 2132 [details]
Springdale/RedHat 6 SRPM for schedtop v5.01

schedtop/slurmtop has been updated to v5.01 to reflect a data structure name change in the SLURM v15.08 Perl API. There are no functional changes from v5.00.

schedtop v5.01 requires SLURM 15.08, whereas schedtop v5.00 required 14.11.
Comment 7 Dennis McRitchie 2015-08-20 04:54:16 MDT
Created attachment 2133 [details]
Springdale/RedHat 6 binary RPM for slurmtop v5.01
Comment 8 Dennis McRitchie 2015-08-20 04:55:51 MDT
Created attachment 2134 [details]
Tarball for schedtop v5.01
Comment 9 Dennis McRitchie 2015-08-31 07:15:25 MDT
Created attachment 2168 [details]
Springdale/RedHat 6 SRPM for schedtop v5.02

* Fri Aug 28 2015 Princeton University Research Computing release - <cses@princeton.edu> - 5.02
- Fixed regression in interactive spacebar refresh command that caused runtime warning.
- Don't attempt to display job's requested time when not specified by user. (slurmtop only)
- Don't display negative elapsed time for queued job if set to 1 year ahead. (Happens 
  when requested time is 'infinite' for that job or job it depends on.) (slurmtop only)
- Enhanced to optionally use partition instead of qos as queue name in job list by
  setting new configuration parameter.  POD doc updated accordingly. (slurmtop only)
- Remove libtorque-devel and perl-devel build dependencies when doing a "--with slurm" build.
- slurmtop/pbstop to 5.02, perl-PBS at 0.35
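
The queued-job elapsed-time behaviour described in this changelog can be sketched roughly as follows. This is a hypothetical Python illustration, not the slurmtop code. The function name `queued_elapsed`, the `HH:MM:SS` format, and the 364-day cutoff used to detect a start time "1 year ahead" are all assumptions.

```python
from datetime import datetime, timedelta

# Assumed cutoff: treat an expected start roughly a year away as "unknown".
FAR_FUTURE = timedelta(days=364)


def queued_elapsed(now, expected_start):
    """Return the expected run delay of a queued job as negative elapsed time.

    Returns an empty string when no start time is known, when the start
    time is about a year ahead (assumed to mean 'infinite' requested
    time), or when the expected start is not in the future.
    """
    if expected_start is None:
        return ""
    delay = expected_start - now
    if delay >= FAR_FUTURE or delay <= timedelta(0):
        return ""
    total = int(delay.total_seconds())
    hours, rem = divmod(total, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"-{hours:02d}:{minutes:02d}:{seconds:02d}"


if __name__ == "__main__":
    now = datetime(2015, 8, 28, 12, 0, 0)
    print(queued_elapsed(now, now + timedelta(hours=1, minutes=30)))
```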
Comment 10 Dennis McRitchie 2015-08-31 07:17:34 MDT
Created attachment 2169 [details]
Springdale/RedHat 6 binary RPM for slurmtop v5.02
Comment 11 Dennis McRitchie 2015-08-31 07:18:36 MDT
Created attachment 2170 [details]
Tarball for schedtop v5.02
Comment 12 Tim Wickberg 2015-10-29 10:11:19 MDT
Hi Dennis -

My assumption with your bug here is that you'd like to see this included under the contrib/ directory in the Slurm source? Or just have us point to it through the website on http://slurm.schedmd.com/download.html?

If so, are you expecting that this would become the main distribution source for slurmtop long-term, or would you continue maintaining it independently through Princeton?

cheers,
- Tim
Comment 13 Dennis McRitchie 2015-10-31 05:32:12 MDT
Hi Tim,

Yes, the intent was to have schedtop be distributed under the contrib directory. This utility is a bit unusual since it has a common code base that supports both Slurm and PBS. The source RPM or tarball is therefore capable of building either slurmtop or pbstop depending on the build options.

My hope is that it can be distributed with those capabilities intact, if only because this will encourage anyone that makes changes to it to place the changes in either the scheduler-specific script (slurmtop or pbstop) or in the common script (schedtop.pm) as appropriate.

Presumably, when distributed under the Slurm contrib directory it would automatically build slurmtop, of course!

Just FYI, I have also contributed schedtop to Adaptive for them to release as a pbstop upgrade, since pbstop has been around for a long time. Many users and sysadmins have found this a very useful utility over the years, which is why I upgraded it in a scheduler-independent fashion. As users and sysadmins migrate from PBS to Slurm, they will be able to keep using the same tool to monitor their jobs, which makes the transition easier.

I realize that it might make more sense to maintain it independently at Princeton, but I don't think that is going to be possible given that I retired recently. So while I'll attempt to make changes as I am able should I become aware of problems or feature requests, I am not in a position to commit to remaining the primary maintainer.

Let me know if you have any other questions.

Best,
Dennis
Comment 14 Christopher Samuel 2018-04-08 19:31:31 MDT
Hi Dennis,

Thanks for this work, just wondering if there was any news on it?  Seems like it's all gone quiet here.

Was there a source repo for the original code?  All I can find is an SVN repo which holds tarballs & RPMs.

All the best,
Chris
Comment 15 Dennis McRitchie 2018-04-10 12:01:51 MDT
Hi Chris,

Schedtop v5.02 was released in 2015, but I have since retired from Princeton University and am not aware of any further releases. However, https://svn.princeton.edu/schedtop/ does contain the 5.02 tarball, which has all the source code, a makefile, and a README with instructions. So if you wish to make changes, just download the tarball and extract its files. Then after making changes, you can simply rebuild it.

Best,
Dennis