Ticket 10322 - How to edit accounting usage for job after it completes?
Summary: How to edit accounting usage for job after it completes?
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Accounting
Version: 20.02.6
Hardware: Linux Linux
Importance: --- 3 - Medium Impact
Assignee: Scott Hilton
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2020-11-30 18:20 MST by Luke Yeager
Modified: 2021-01-15 11:31 MST

See Also:
Site: NVIDIA (PSLA)


Description Luke Yeager 2020-11-30 18:20:08 MST
There are some great tools for assigning discounts or penalties to jobs before they are allocated resources:

* QOS.UsageFactor
* Partition.TRESBillingWeights
* etc.

But what if I want to change the "cost" of a job *after* the fact? I might want to avoid penalizing users whose jobs failed for hardware reasons. Or I might want to charge extra for jobs that require node reboots to clean up afterwards. Either way, I want to be able to adjust how the job affects the user's current fairshare priority for queued jobs.

Are there any existing solutions for this?

The only solutions I've found so far are:

1. 'sacctmgr modify ... set RawUsage=0' - obviously this is too big of a hammer. I only want to edit a single job, and I might want to *increase* the usage for the job - not decrease it.

2. For clusters using "banking" (limits on GrpTRESMins and PriorityDecayHalfLife=0), you can essentially accomplish this by editing the limit after the fact (increasing the limit for a refund, decreasing it for a penalty). But we don't use that accounting strategy at our site.
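The arithmetic behind option 2 can be sketched as follows. This is only an illustration under stated assumptions: the function name `adjusted_limit` and its parameters are hypothetical, and the resulting value would still have to be applied to the association's GrpTRESMins limit by the admin.

```python
def adjusted_limit(current_limit_minutes: int, adjustment_minutes: int) -> int:
    """Return a new banking limit after a refund or a penalty.

    Hypothetical helper, not part of Slurm.
    A positive adjustment is a refund (the limit is raised, so the user
    gets minutes back); a negative adjustment is a penalty (the limit is
    lowered, so the user loses minutes).
    """
    new_limit = current_limit_minutes + adjustment_minutes
    if new_limit < 0:
        # A negative limit has no sensible meaning in this sketch.
        raise ValueError("penalty exceeds the current limit")
    return new_limit


# Refund a failed 90-minute job against a 10000-minute limit.
refunded = adjusted_limit(10000, 90)
# Charge an extra 30 minutes for a job that forced a node reboot.
penalized = adjusted_limit(10000, -30)
```

This only works because, with PriorityDecayHalfLife=0, recorded usage never decays, so a minute added to the limit is worth exactly a minute of usage.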

I think something like this would mostly get me what I want:

> $SLURM_TOOL modify job=1234 set EffectiveWallTime=+1:00:00
> $SLURM_TOOL modify job=1234 set EffectiveWallTime=0
I like this option because the syntax is easy to interpret. I think it would be complex to implement in the backend, because you would need to interpolate back from NOW to JOB_END and estimate how much that job is contributing to the current RawUsage (as adjusted by PriorityDecayHalfLife). You might also need to keep track of the last time there was a RawUsage=0 reset, to avoid editing usage that has already been removed. Seems tricky, but not impossible.
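To make the interpolation concrete, here is a rough sketch of the estimate. It is not Slurm's implementation. It assumes a plain exponential half-life decay, treats all of the job's usage as if it were recorded at JOB_END, and ignores RawUsage resets. The function names and parameters are hypothetical.

```python
def decayed_contribution(charged_seconds: float,
                         seconds_since_end: float,
                         half_life_seconds: float) -> float:
    """Estimate how much of a finished job's usage remains in RawUsage.

    Assumption: usage halves every half_life_seconds after the job ends.
    A half-life of 0 is treated as "no decay".
    """
    if half_life_seconds <= 0:
        return charged_seconds
    return charged_seconds * 0.5 ** (seconds_since_end / half_life_seconds)


def usage_adjustment(actual_seconds: float,
                     effective_seconds: float,
                     seconds_since_end: float,
                     half_life_seconds: float) -> float:
    """Delta to apply to the current RawUsage so the job counts as
    effective_seconds instead of actual_seconds (hypothetical helper)."""
    before = decayed_contribution(actual_seconds, seconds_since_end,
                                  half_life_seconds)
    after = decayed_contribution(effective_seconds, seconds_since_end,
                                 half_life_seconds)
    return after - before


WEEK = 7 * 24 * 3600
# A job charged 3600 seconds ended one half-life (7 days) ago;
# setting its effective wall time to 0 removes what is left of it.
delta = usage_adjustment(3600, 0, WEEK, WEEK)
```

Under these assumptions, half of the job's original charge is still present after one half-life, so the refund is half of what it would have been at job end.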


Another, simpler-to-implement option would be something like this:

> sacctmgr modify association where user=lyeager account=admin set RawUsage=+100
That's still a little tricky to implement because you'd need to figure out how to adjust 'usage_raw', AND 'used_jobs', AND 'usage.usage_tres_raw', etc. internally. But most of the burden would be on me, the admin, to figure out what the RawUsage actually means. Though, since I'd usually be using this in the job epilog, I wouldn't typically need to worry much about the half-life decay.


I'm hoping I've missed something and that there's already some way for me to edit historical jobs? Please let me know if so.
Comment 1 Scott Hilton 2020-12-02 13:40:01 MST
Luke,

I don't think what you are looking for is possible with Slurm at the moment. We want the accounting to be an accurate record of what happened, and we discourage changing it after the fact.

If you use WCKeys, there is a way to retroactively change the wckey of a job in Slurm 20.11.
> sacctmgr update jobid=<foo> set newwckey=correctkey

-Scott
Comment 2 Luke Yeager 2020-12-02 14:35:41 MST
Ok. I wasn't really intending to change the actual accounting data - I was just hoping to either add some extra metadata (EffectiveWallTime, not actual WallTime) or to tweak the raw fairshare counters directly. In either case, sacct would still work the same way.

Thanks for confirming that this is not possible at this time.
Comment 3 Scott Hilton 2021-01-15 11:31:47 MST
Closing bug