Bug 4966 - Provide some way to retrieve most recently measured stats (especially memory) for tasks of a job
Summary: Provide some way to retrieve most recently measured stats (especially memory)...
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Accounting
Version: 17.11.5
Hardware: Linux
OS: Linux
Importance: --- 5 - Enhancement
Assignee: Felip Moll
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2018-03-21 04:01 MDT by Christopher Samuel
Modified: 2019-08-15 08:55 MDT
CC: 9 users

See Also:
Site: Swinburne
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: future
DevPrio: 4 - Medium
Emory-Cloud Sites: ---


Description Christopher Samuel 2018-03-21 04:01:00 MDT
Hi there,

We have some in-house tools to track information about jobs (CPU usage, IB stats, GPU usage) in a graphical way, and I've been asked whether it's possible to extend Slurm to tell us the most recently measured memory usage of tasks in some way.

It wouldn't be necessary to store this in slurmdbd (point-in-time values wouldn't make sense there), merely to be able to get the information in some way, even if that meant a privileged user querying a slurmd directly rather than slurmctld (the slurmd must already have this information in order to generate what it exposes via the API).

Having just discovered "scontrol listpids", perhaps something like "scontrol listtasks"?

It could return the list of tasks on a node, the associated job, step and task numbers, and what slurmd had measured for the stats on each of them, and it would avoid having to change the public API (if I'm understanding correctly).

For formatting, perhaps something like the format used by "squeue -o %all" could be handy?
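
Purely as a sketch of the idea (the command name, columns and values below are all invented, not an existing interface), I'm imagining output along these lines:

]$ scontrol listtasks
JOBID   STEPID  TASKID  PID     RSS     VMSIZE
1234    0       0       20415   1.2G    2.0G
1234    0       1       20416   1.1G    1.9G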

All the best,
Chris
Comment 1 Felip Moll 2018-03-21 06:21:15 MDT
Hi Chris,

As you probably know, the 'sstat' tool provides information about a job gathered in real time:

]$ sstat -a -j 81 --fields=JobID,AveRSS,MaxRSS,MaxRSSNode
       JobID     AveRSS     MaxRSS MaxRSSNode 
------------ ---------- ---------- ---------- 
81.extern             0          0     gamba1 
81.0                72K       722K     gamba1 


This is analogous to sacct, except that sacct takes its information from the database for completed jobs.

There are several fields you can query; please see the man page for the full list. Some of the most relevant for memory are listed below, followed by an example query:

              AveRSS        Average resident set size of all tasks in job.
              AveVMSize     Average Virtual Memory size of all tasks in job.
              MaxRSSNode    The node on which the maxrss occurred.
              MaxRSS        Maximum resident set size of all tasks in job.
              MaxRSSTask    The task ID where the maxrss occurred.
              MaxVMSize     Maximum Virtual Memory size of all tasks in job.
              MaxVMSizeNode The node on which the maxvmsize occurred.
              MaxVMSizeTask The task ID where the maxvmsize occurred.
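
For example, a query combining several of these memory fields could look like this (reusing job 81 from the example above; pick whichever fields you need):

]$ sstat -j 81 --format=JobID,AveRSS,MaxRSS,MaxRSSNode,MaxRSSTask,MaxVMSize,MaxVMSizeNode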


A second option is to use a profiling plugin. The InfluxDB profiling plugin was introduced some time ago:

http://hpckp.org/index.php/conference/2016/131-introduction-of-a-near-real-time-monitoring-plugin-in-the-slurm-open-source-software

https://slurm.schedmd.com/SLUG16/monitoring_influxdb_slug.pdf

Note, however, that this plugin comes with a performance impact that can be noticeable, so I recommend using it for sporadic cases only.
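
For reference, here is a minimal configuration sketch for enabling that plugin. The option names are the ones I believe are documented in slurm.conf and acct_gather.conf for this version, and the host/database values are placeholders for your site, so please double-check everything against your installed documentation:

# slurm.conf (assumed relevant options)
AcctGatherProfileType=acct_gather_profile/influxdb
JobAcctGatherFrequency=30

# acct_gather.conf (assumed relevant options; values are placeholders)
ProfileInfluxDBHost=influx.example.com:8086
ProfileInfluxDBDatabase=slurm_profiling
ProfileInfluxDBDefault=Task

Jobs can then request profiling explicitly with something like 'srun --profile=task ...', or ProfileInfluxDBDefault can enable it by default for all jobs.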



If this doesn't work for you, I would have to check with the dev team to get their opinion.
Comment 2 Christopher Samuel 2018-03-21 07:09:28 MDT
On Wednesday, 21 March 2018 11:21:15 PM AEDT bugs@schedmd.com wrote:

> If this doesn't work for you I would have to check with the dev. team to get
> their opinion.

Thanks for the suggestions, very kind!   Here are my concerns/issues:

1) sstat does not return the most recently measured memory information, only 
the max and average, which are not useful in this context.

2) sstat only works for jobs launched with srun, we would like to gather info 
for single CPU or SMP single node jobs as well.

3) sstat doesn't give you per-task info, which is our holy grail. 

4) whilst InfluxDB might be interesting, we need to get the data from Slurm into 
a tool that's already in use here and which was used in a previous version on 
our older cluster (running Torque/Moab).

5) the performance impact of the InfluxDB plugin sounds like a killer, as this 
would need to gather info on every running job.    We hope that by simply asking 
slurmd for what it has already measured for each task we will be able to avoid 
that penalty altogether.

Does this help explain things?

All the best,
Chris  (off to bed now as it's past midnight here!)
Comment 5 Christopher Samuel 2018-03-21 16:41:24 MDT
On 22/03/18 00:09, Chris Samuel wrote:

> 4) whilst influxdb might be interesting we need to get the data from
> Slurm into a tool that's already in use here and which was used in a
> previous version on our older cluster (running Torque/Moab).

This code is public (GPLv3) and is on GitHub here:

https://github.com/chanconrad/bobMonitor

All the best,
Chris
Comment 6 Felip Moll 2018-03-26 15:20:12 MDT
Hi Chris,

I commented internally and I should mark this bug as an enhancement.


> 1) sstat does not return the most recently measured memory information, only 
> the max and average, which are not useful in this context.

I have a local patch that could fix this. It just sums all the tasks total RSS and shows the related field in sstat, but it is at a task level, not at a pid level, since from jobacct_gather plugin every pid from the task is summed into a single field (tot_rss).

> 2) sstat only works for jobs launched with srun, we would like to gather
> info for single CPU or SMP single node jobs as well.

I don't get your point here; sstat also works for sbatch, multiple tasks, etc. The jobacct_gather
plugin grabs the same data whether the job was launched with srun, sbatch or salloc.

> 3) sstat doesn't give you per-task info, which is our holy grail. 

To provide this, we would need to extend the code to create an array or list of pids, insert the values from jobacct_gather into this list, and pass it over to sstat. That would require modifying the job_record struct, changing the protocol, and also changing the jobacct_gather plugins, so it certainly cannot be part of the 17.11 release.

What's your opinion on that?
Comment 7 Christopher Samuel 2018-03-26 18:16:14 MDT
On 27/03/18 08:20, bugs@schedmd.com wrote:

> Hi Chris,

Hi Felip!

> I commented internally and I should mark this bug as an enhancement.

Yup, sounds like a good idea.

>> 1) sstat does not return the most recently measured memory
>> information, only the max and average, which are not useful in
>> this context.
> 
> I have a local patch that could fix this. It just sums all the tasks
> total RSS and shows the related field in sstat, but it is at a task
> level, not at a pid level, since from jobacct_gather plugin every pid
> from the task is summed into a single field (tot_rss).

So that would show the memory usage for each task, individually, for the
job step?  We're trying to show users how much memory they're using on
each node (now I've had a chance to play with the new interface).

Even a per-node list of totals for RSS & VM for a step would be great.

>> 2) sstat only works for jobs launched with srun, we would like to
>> gather info for single CPU or SMP single node jobs as well.
> 
> I don't get your point here, sstat also works for sbatch, multiple
> tasks, etc. the jobacctgather grabs the same data be it a srun, a
> sbatch or salloc.

Ahh, I was sure in the past I'd been told that sstat only showed
info for steps launched with srun.  I checked a job that isn't
using srun and it doesn't list the batch step, just the extern
one.

Digging a bit further, even adding '-a' to list all steps doesn't
show it, but if you request the batch step explicitly it does appear.
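
For the record, an explicit request along these lines does show it (job ID made up for illustration):

]$ sstat -j 1234.batch --format=JobID,MaxRSS,MaxVMSize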

That's useful, but perhaps '-a' should show the batch step too?
Or have a --really-all-the-steps-please flag? :-)

>> 3) sstat doesn't give you per-task info, which is our holy grail.
> 
> To provide this, we should extend the code and create an array or
> list of pids, insert the values from jobacct_gather in this list and
> pass it over to sstat. This would require to modify the job_record
> struct, change the protocol and also change jobacct_gather plugins,
> so it certainly cannot be part of 17.11 version.
> 
> What's your opinion on that?

I've been told they're happy to wait on that functionality for a
little while, so if it is possible for the next major release instead
that would be great.

Would that affect the scalability of the protocol on large systems?

All the best,
Chris
Comment 8 Felip Moll 2018-03-27 05:48:38 MDT
> > I have a local patch that could fix this. It just sums all the tasks
> > total RSS and shows the related field in sstat, but it is at a task
> > level, not at a pid level, since from jobacct_gather plugin every pid
> > from the task is summed into a single field (tot_rss).
> 
> So that would show the memory usage for each task, individually, for the
> job step?  We're trying to show users how much memory they're using on
> each node (now I've had a chance to play with the new interface).
> 

Yes, sorry, my sentence should have been: "For each task it just sums all the task's pids' RSS values as a total RSS and shows the field in sstat."
But there are still some aspects to study, to see whether everything is correct and makes sense in all situations.

> Even a per-node list of totals for RSS & VM for a step would be great.

Even if it is possible, it would require more data manipulation, and I think it is more useful to have it sorted by job or by user, but of course we can consider the possibility.

> 
> >> 2) sstat only works for jobs launched with srun, we would like to
> >> gather info for single CPU or SMP single node jobs as well.
..
> That's useful, but perhaps '-a' should show the batch step too?
> Or have a --really-all-the-steps-please flag? :-)


I will look into that and see what we can do in that respect, thanks!


> 
> I've been told they're happy to wait on that functionality for a
> little while, so if it is possible for the next major release instead
> that would be great.
> 

Cool, we will try our best to integrate this once there's time for it.


> Would that affect the scalability of the protocol on large systems?

This is one concern, yes, because the amount of data that has to be stored in memory and sent over in an RPC can be significant.



Thank you,
Felip
Comment 9 Christopher Samuel 2018-05-04 01:50:50 MDT
Hi Felip,

Just to quickly update you.

1) It seems per-node stats would be sufficient for a job (which would make things more tractable, I think).

2) You can see the live interface for the job monitor here:

https://supercomputing.swin.edu.au/monitor/

A lot of the stats are currently grabbed from Ganglia, but ideally the things that Slurm supports could be grabbed straight from Slurm.

All the best!
Chris