Ryzen 3900X at 65W PPT

The power consumption of CPUs rises linearly with the clock frequency and quadratically with the voltage; dynamic voltage and frequency scaling (DVFS) exploits this. A higher frequency usually needs a higher voltage, so the power consumption rises superlinearly with frequency (and performance). So, if we don't need the result ASAP, we can limit the power consumption and use less energy for the same computation.
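As a rough illustration of this argument, the following Python sketch uses the usual first-order model for dynamic power, P = C*V^2*f. The capacitance and the two voltages are made-up values chosen only to show the shape of the effect; they are not measurements of a Ryzen CPU, and the model ignores static (leakage) power and everything outside the cores.

```python
def dynamic_power(c_eff, voltage, freq_hz):
    """First-order dynamic power model: P = C * V^2 * f (watts)."""
    return c_eff * voltage ** 2 * freq_hz


def energy_for_work(c_eff, voltage, freq_hz, cycles):
    """Energy (joules) to execute a fixed number of cycles at one operating point."""
    seconds = cycles / freq_hz
    return dynamic_power(c_eff, voltage, freq_hz) * seconds


# Hypothetical operating points (all three constants are assumptions).
C_EFF = 1e-8      # effective switched capacitance in farads, arbitrary
CYCLES = 1e10     # fixed amount of work, in clock cycles

slow = energy_for_work(C_EFF, voltage=0.9, freq_hz=2.4e9, cycles=CYCLES)
fast = energy_for_work(C_EFF, voltage=1.3, freq_hz=3.9e9, cycles=CYCLES)

print(f"energy at low clock:  {slow:.1f} J")
print(f"energy at high clock: {fast:.1f} J")
print(f"ratio: {fast / slow:.2f}")
```

In this simple model the energy for a fixed number of cycles depends only on V^2, so the saving comes entirely from the lower voltage that the lower clock permits.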

The Ryzen 3900X is advertised as having 105W TDP (thermal design power, not really meaningful these days), which in practice means a power limit (PPT) of 142W, and indeed, when fully loaded, one of our machines drew roughly 142W more than when idle (48W at idle; ~190W loaded, all measured at the mains).

In the BIOS of the ASUS TUF Gaming B550M-Plus mainboard, the PPT can be reduced under AI Tweaker after setting Precision Boost to "manual". We first tried a value of 80, but it had no effect. Then we tried a value of 65, and indeed, the power consumption under load was then about 62W above idle (i.e., 110W total).

Results

Running a 6000x6000 matrix multiplication with libopenblas on all 24 threads of a Ryzen 3900X (a pretty power-hungry workload; the difference may be smaller for other workloads) gives the following results:
       Power                           Energy
     PPT   total     clock    time    PPT   total
     65W   ~110W   2390MHz   2.06s   134J    227J
    142W   ~190W   3890MHz   1.54s   219J    293J
  
So the lower-power setting is a factor of 1.34 slower for this workload than the default setting, but saves a factor of 1.63 in energy if you consider PPT. If you consider total power (relevant if you don't let the computer run idle for the rest of the time, i.e., you turn off the computer once the computations are done), the energy saving is a factor of 1.29.
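These factors follow from energy = power x time. A small Python sketch that recomputes them from the numbers in the table (the variable and function names are only for illustration, and the total power figures are the approximate mains readings):

```python
# Power (W) and run time (s) as given in the results table.
runs = {
    "65W":  {"ppt_w": 65,  "total_w": 110, "time_s": 2.06},
    "142W": {"ppt_w": 142, "total_w": 190, "time_s": 1.54},
}


def energy_j(power_w, time_s):
    """Energy in joules for a constant power draw over time_s seconds."""
    return power_w * time_s


low, high = runs["65W"], runs["142W"]

# How much slower the 65W setting is.
slowdown = low["time_s"] / high["time_s"]

# How much more energy the default setting uses, by PPT and by total power.
ppt_saving = (energy_j(high["ppt_w"], high["time_s"])
              / energy_j(low["ppt_w"], low["time_s"]))
total_saving = (energy_j(high["total_w"], high["time_s"])
                / energy_j(low["total_w"], low["time_s"]))

print(f"slowdown:              {slowdown:.2f}")
print(f"energy saving (PPT):   {ppt_saving:.2f}")
print(f"energy saving (total): {total_saving:.2f}")
```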

You may wonder why a clock rate difference by a factor of 1.63 results in only a factor of 1.34 difference in run-time. One part of the explanation is that apparently the synchronization overheads between the application threads don't scale with CPU speed, resulting in a lower utilization of the threads at the higher CPU frequency: 1638% CPU utilization at the default setting vs. 1779% at 65W, out of 2400% on this CPU. Another factor is that the uncore (L3, memory controllers) doesn't scale with the core frequency, so with a slower core frequency, accesses to the uncore consume fewer cycles. This results in a factor of about 4.15Gcycles/3.93Gcycles=1.06. That still leaves a gap for which I don't have an explanation.
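To see how much of the gap these two effects cover, the factors can be multiplied out. This is only a back-of-the-envelope sketch in Python; treating the effects as independent multiplicative factors is an assumption, not something the measurements establish.

```python
clock_ratio = 3890 / 2390          # clock rate, 142W vs. 65W setting
runtime_ratio = 2.06 / 1.54        # run time, 65W vs. 142W setting

# Part of the clock-rate advantage that does not show up in the run time.
gap = clock_ratio / runtime_ratio

utilization_factor = 1779 / 1638   # CPU utilization, 65W vs. 142W setting
cycle_factor = 4.15 / 3.93         # cycle count ratio quoted in the text

# Assumption: the two effects simply multiply.
explained = utilization_factor * cycle_factor
residual = gap / explained

print(f"gap to explain:   {gap:.3f}")
print(f"explained so far: {explained:.3f}")
print(f"unexplained rest: {residual:.3f}")
```

Under that assumption, a factor of roughly 1.06 remains unaccounted for.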


Anton Ertl