Bug 9956

Summary: RAPL plugin: incorrect *Watts and ConsumedEnergy values
Product: Slurm Reporter: Alexey Kozlov <alexey.kozlov>
Component: AccountingAssignee: Tim Wickberg <tim>
Status: OPEN --- QA Contact:
Severity: C - Contributions    
Priority: --- CC: sts, uemit.seren
Version: 21.08.x   
Hardware: Linux   
OS: Linux   
Site: -Other- Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA SIte: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: proposed patch

Description Alexey Kozlov 2020-10-07 15:57:18 MDT
AcctGatherEnergy RAPL plugin is using the same energy unit for all CPU and DRAM packages:

https://github.com/SchedMD/slurm/blob/master/src/plugins/acct_gather_energy/rapl/acct_gather_energy_rapl.c#L326

However, on many modern server architectures (Haswell, Skylake X/SP, CascadeLake SP), DRAM energy unit is distinct from the package energy unit stored in the MSR_RAPL_POWER_UNIT register. Instead, it has a fixed value of 1/15300.

The (gloomy) situation becomes clear when looking at the Linux powercap driver code, which gives correct measurements:    

https://github.com/torvalds/linux/blob/master/drivers/powercap/intel_rapl_common.c#L964

https://github.com/torvalds/linux/blob/master/drivers/powercap/intel_rapl_common.c#L1017

So apparently, the only viable solution would be to check CPU model and set DRAM energy unit accordingly.

As a result of this bug, AcctGatherEnergy reports power and energy values which are incorrect, and in my experiments they were usually inflated by as much as 30%-50%.
Comment 3 Alexey Kozlov 2020-10-12 12:55:22 MDT
Created attachment 16196 [details]
proposed patch

This patch fixes multiple bugs/issues in power computation:

- CurrentWatts: using CPU energy unit for DRAM domain resulted in wrong values on many systems (Intel Haswell/Skylake/CascadeLake)

- CurrentWatts: same energy unit was used for all packages -> might work for now, but could break anytime 

- AveWatts: incorrect value due to missing normalization by the polling interval

- AveWatts: inaccurate value due to using integer type to compute running average (at some point contribution of the current measurement becomes <1.0 -> AveWatts is frozen)