Hopper Power Efficiency
The SXM5 form factor H100 has a stated TDP (Thermal Design Power, the maximum sustained heat output that a component's cooling solution is designed to dissipate) of 700W. This generated its fair share of discourse on social media, with both proponents and opponents of the seemingly high TDP. Opponents point to, e.g., direct operating costs from the power draw and higher cooling costs, among other concerns. The TDP is relatively high, yet Nvidia states that the Hopper generation of GPUs is their most energy-efficient yet. How does one defend a TDP of 700W for a GPU when previous generations have had TDPs of around 300-400W?
The SXM5 variant uses HBM3 memory modules, whereas the PCIe Gen 5 variant uses the lower-bandwidth HBM2e modules. Utilizing the full bandwidth of HBM3 likely requires a higher memory clock than on the PCIe Gen 5 variant. The SXM5 variant is also, in all likelihood, going to have a faster boost clock than the PCIe Gen 5 variant. Increased memory and boost clock frequencies alone are, however, unlikely to be the factors pushing the TDP towards 700W, since dynamic power consumption scales only linearly with frequency when the voltage is held constant.
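The linear-scaling argument can be illustrated with the textbook CMOS switching-power approximation, P = a * C * V^2 * f. The sketch below is only an illustration of that approximation: the function name, the unitless baseline, and the 20% and 10% increments are hypothetical and are not H100 figures. The second case is included only to show why the "same voltage" qualifier matters.

```python
def dynamic_power(capacitance, voltage, frequency, activity=1.0):
    """Textbook CMOS switching-power approximation: P = a * C * V^2 * f."""
    return activity * capacitance * voltage ** 2 * frequency


# Hypothetical, unitless baseline: only the ratios are meaningful.
base = dynamic_power(capacitance=1.0, voltage=1.0, frequency=1.0)

# Raise the clock by 20% while holding the voltage constant.
freq_only = dynamic_power(capacitance=1.0, voltage=1.0, frequency=1.2)

# Raise the clock by 20% and the voltage by 10% (assumed values).
freq_and_volt = dynamic_power(capacitance=1.0, voltage=1.1, frequency=1.2)

print(f"+20% clock, same voltage: {freq_only / base:.2f}x power")      # 1.20x
print(f"+20% clock, +10% voltage: {freq_and_volt / base:.2f}x power")  # 1.45x
```

Under this approximation, a clock increase at fixed voltage raises dynamic power by the same factor as the clock, which by itself would not account for a near doubling of the TDP.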
At the time of writing, it is challenging for us to estimate the mean power draw of the SXM5 variant in a general AI or HPC workload. Measuring the performance-to-power ratio of the H100 when power-capped is another experiment that could be interesting to conduct.
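As a rough sketch of how such a measurement could be reduced to numbers, the snippet below computes a time-weighted mean power draw and a performance-per-watt figure from a list of power samples. The helper names and the sample trace are hypothetical placeholders, not measurements. How the samples are collected (for example, by polling the driver's power reading during a run) is outside the scope of the sketch.

```python
def mean_power_watts(samples):
    """Time-weighted mean power from (time_s, power_w) samples (trapezoidal rule)."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    energy_j = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        # Energy over the interval: average power times elapsed time.
        energy_j += 0.5 * (p0 + p1) * (t1 - t0)
    return energy_j / (samples[-1][0] - samples[0][0])


def perf_per_watt(work_done, samples):
    """Throughput per watt, which is equivalent to work per joule."""
    duration_s = samples[-1][0] - samples[0][0]
    throughput = work_done / duration_s
    return throughput / mean_power_watts(samples)


# Hypothetical placeholder trace: (seconds, watts). Not measured data.
trace = [(0.0, 300.0), (1.0, 500.0), (2.0, 500.0), (3.0, 300.0)]

print(f"mean power: {mean_power_watts(trace):.1f} W")
print(f"work per joule: {perf_per_watt(1300.0, trace):.3f}")
```

Repeating the same workload under several power caps and comparing the resulting work-per-joule values would give one view of how efficiency changes below the 700W limit.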
Note also that the Hopper architecture introduces additional capabilities for asynchrony. One of the significant benefits of asynchrony in this case is the potential for attaining high degrees of utilization. The increased asynchrony and potential for concurrency bode well for latency hiding \cite{Volkov:EECS-2016-143}. The Hopper architecture might look bad on paper with a TDP of 700W, but it may compare favorably once performance-per-watt is taken into account.