Ticket 6004 - Update seff to reflect rss_max API change
Summary: Update seff to reflect rss_max API change
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: User Commands
Version: 18.08.3
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Danny Auble
QA Contact:
URL:
Duplicates: 6380
Depends on:
Blocks: 6380
 
Reported: 2018-11-08 08:57 MST by Paddy Doyle
Modified: 2019-01-24 07:50 MST
CC List: 4 users

See Also:
Site: -Other-
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 18.08.5 19.05.0pre2
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
Patch seff to reflect api change from rss_max (2.12 KB, patch)
2018-11-08 08:57 MST, Paddy Doyle
Add more constants to perl API (698 bytes, patch)
2018-12-18 13:05 MST, Josko Plazonic
Updated patch seff to reflect api change from rss_max (fix kb) (3.10 KB, patch)
2019-01-16 05:02 MST, Paddy Doyle
Fix contributors name list (1.14 KB, patch)
2019-01-24 05:32 MST, Paddy Doyle

Description Paddy Doyle 2018-11-08 08:57:52 MST
Created attachment 8255 [details]
Patch seff to reflect api change from rss_max

Hi all,

The 18.08.3 version of seff is showing warnings and is not calculating the memory usage correctly:

  Use of uninitialized value $lmem in numeric lt (<) at /usr/bin/seff line 130, <DATA> line 624.

As reported on the mailing list, it looks like the perlapi changed from using 'rss_max' to 'tres_usage_in_max' (changed in commit c72bc449), and this needs to be matched in seff.
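
Roughly, the change amounts to replacing a direct read of the old per-step field with a lookup in the new TRES string; the "before" line here is paraphrased rather than quoted verbatim from seff, and this is the gist of the attached patch (before the later /1024 tweak discussed further down):

  # before the API change: the memory high-water mark was a plain per-step stat
  my $lmem = $step->{'stats'}{'rss_max'};

  # after c72bc449: memory is packed into the tres_usage_in_max TRES string
  my $lmem = Slurmdb::find_tres_count_in_string(
      $step->{'stats'}{'tres_usage_in_max'}, TRES_MEM);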

The attached patch should fix it.

Thanks,
Paddy
Comment 1 Josko Plazonic 2018-12-18 09:16:05 MST
Hullo,

This largely works for us, with one addition: the value needs to be divided by 1024 to be in KB (so just a /1024):

my $lmem = Slurmdb::find_tres_count_in_string($step->{'stats'}{'tres_usage_in_max'}, TRES_MEM)/1024;
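
(For context: tres_usage_in_max is a comma-separated "id=count" string, something like "1=0,2=1385472,...", where id 2 is memory; the memory count appears to be in bytes, hence the /1024 to get back to the KB value seff expects. That example string is illustrative, not taken from an actual job.)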

Can you please merge this (with that tweak)? If it would help, please feel free to associate this bug with the Princeton site; we use this and need it fixed...

Thanks!
Comment 2 Josko Plazonic 2018-12-18 13:05:17 MST
Created attachment 8695 [details]
Add more constants to perl API

We also don't have access to vm_max, so we need those constants available in Perl.
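
A minimal sketch of what the extra constants would allow, assuming they are exported with names mirroring TRES_MEM (e.g. TRES_VMEM); the variable name is illustrative:

  # with the constant exported, the vmem high-water mark can be read the same way as rss
  my $lvmem = Slurmdb::find_tres_count_in_string(
      $step->{'stats'}{'tres_usage_in_max'}, TRES_VMEM) / 1024;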
Comment 3 Paddy Doyle 2019-01-16 05:02:51 MST
Created attachment 8933 [details]
Updated patch seff to reflect api change from rss_max (fix kb)

Updated the patch to include Josko's fix for converting to KB.
Comment 6 Danny Auble 2019-01-22 16:45:45 MST
Thanks guys, these are now pushed into 18.08.

I will note we have the idea to add seff functionality to sacct in new columns such as

TRESUtil[in|out] and TRESEff[in|out].

This will be in 19.05 or 20.02 depending on available developers for the task.

I will also note that, while looking at seff in general, it appears it will not handle parallel steps correctly, nor any batch/extern steps that are doing work.

I am not sure if this was expected, but I wanted to bring it to your attention.  It appears this has always been the case.  If the job only ever has one step then you get better results, but it is still assuming homogeneous tasks for memory: it takes the max of all tasks and then multiplies that by the number of tasks, which is wrong unless all tasks are exactly the same.  I would suggest using tres_usage_in_ave instead of tres_usage_in_max to get a more correct value here.
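
A rough sketch of the difference (reusing the helper from the patch above; $ntasks is a hypothetical task-count variable, and tres_usage_in_ave is assumed to be available in the step stats the same way tres_usage_in_max is):

  # max-based estimate (current seff behaviour): max across tasks times task count,
  # which overestimates total usage unless all tasks are identical
  my $mem_max_kb = Slurmdb::find_tres_count_in_string(
      $step->{'stats'}{'tres_usage_in_max'}, TRES_MEM) / 1024;
  my $est_max_kb = $mem_max_kb * $ntasks;

  # ave-based estimate: average across tasks times task count,
  # a closer approximation of the aggregate memory actually used
  my $mem_ave_kb = Slurmdb::find_tres_count_in_string(
      $step->{'stats'}{'tres_usage_in_ave'}, TRES_MEM) / 1024;
  my $est_ave_kb = $mem_ave_kb * $ntasks;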

Of course if you only ever have 1 task then things work as expected.

At the moment I am leaving the code the way it is as it has always been this way.

Please reopen if you would like to talk about this further.
Comment 7 Josko Plazonic 2019-01-22 19:41:10 MST
Hullo,

we are aware of seff's limitations as far as memory efficiency goes (for n>1), but we haven't addressed them, in part due to time limitations and in part because there are multiple answers and the current one is a good starting point for deciding "what's the minimum memory I need per task to allocate".  E.g. if your memory use is asymmetrical (as it can be with MPI programs, where rank 0 can consume more, sometimes much more), then you need to take the max as the guide for how much memory you need per task.  The average could easily be insufficient.  Of course, one could go with heterogeneous job allocations, but that's too messy for most of our audience.

So we'd like to see, in the per-job report, max-per-task, average and min numbers, and better handling of corner cases (we've seen memory efficiencies of 2000%+...).

Anyway, just something to think about.  Thanks for patching it.
Comment 8 Felip Moll 2019-01-23 03:11:28 MST
*** Ticket 6380 has been marked as a duplicate of this ticket. ***
Comment 9 Danny Auble 2019-01-23 10:41:49 MST
Josko, thanks for the information.  It wasn't clear from the output what was expected.  I can see what you are looking for now.

Just a heads up on this: we are looking to clear out contribs and send projects like this back to the original author, referencing them from our webpage instead of having them directly in contribs.

This will mostly happen this year at some point.
Comment 10 Paddy Doyle 2019-01-24 05:32:52 MST
Created attachment 9000 [details]
Fix contributors name list

Thanks for the merge Danny. And good to know about the future plans for contribs.

I looked back on my patch and realised I didn't follow the guidelines for keeping authors in alphabetical order! So trivial patch attached, if you want to put things back in order. (The nit-picker in me couldn't just let it go!)
Comment 11 Josko Plazonic 2019-01-24 07:50:15 MST
Hullo,

it would be nice if you carefully considered each one of these and either kept the more useful ones or moved them into core.  E.g. seff is something that we find essential for helping users (and us) gauge how well their jobs are running, and clearly you do too, as you are adding some of that functionality to sacct.  I guess if all of this (memory usage stats included) will be easily available in sacct, we can ditch seff and write a simple wrapper around sacct for email/post-job checking.

Thanks,

JP

(In reply to Danny Auble from comment #9)