- 2x faster clock-for-clock performance per SM, a major contributor (together with the higher SM count and clock speeds) to 3x faster FP32 and FP64 throughput.
- Fourth-generation Tensor Cores that are up to 6x faster chip-to-chip compared to the A100, delivering 2x the MMA (Matrix Multiply-Accumulate) rates of the A100 SM on equivalent data types, and 4x when using the new FP8 data type compared to the A100's FP16.
- New DPX Instructions that accelerate Dynamic Programming algorithms by up to 7x over the A100 GPU (a brief sketch follows this list); a more detailed description of DPX follows in the next section.
- New Thread Block Cluster feature allowing programmatic control of locality at a granularity larger than a single Thread Block on a single SM. Note that this adds another synchronization layer, illustrated in the second sketch after this list; it will also be discussed further in the next section.
- New Asynchronous Execution features, including a new Tensor Memory Accelerator (TMA) designed to transfer large blocks of data efficiently between global and shared memory. TMA also supports asynchronous copies between Thread Blocks in a Cluster. In addition, there is a new Asynchronous Transaction Barrier for atomic data movement and synchronization.
- HBM3 memory subsystem providing nearly a 2x bandwidth increase over the previous generation. The H100 SXM5 is the world’s first GPU with HBM3 memory, delivering a class-leading 3 TB/sec of memory bandwidth.
- 50 MB L2 cache (versus the A100’s 40 MB L2), reducing trips to HBM3.
- Second-generation Multi-Instance GPU (MIG) technology providing approximately 3x more compute capacity and nearly 2x more memory bandwidth per GPU Instance compared to the A100.
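As a taste of what the DPX Instructions look like in practice, below is a minimal sketch of the inner cell update of a Smith-Waterman-style Dynamic Programming recurrence, written with the DPX intrinsics that CUDA 12 exposes (hardware-accelerated on Hopper, emulated on earlier architectures). The function and variable names are illustrative, not taken from any particular implementation.

```cuda
// H[i][j] = max(0, diag + score, up - gap, left - gap): the classic
// Smith-Waterman cell update, expressed with DPX fused add+max intrinsics.
__device__ int sw_cell(int diag, int up, int left, int score, int gap) {
    // __viaddmax_s32(a, b, c) computes max(a + b, c) in a single instruction.
    int best_gap = __viaddmax_s32(up, -gap, left - gap);   // max(up - gap, left - gap)
    int h        = __viaddmax_s32(diag, score, best_gap);  // max(diag + score, best_gap)
    return max(h, 0);                                      // Smith-Waterman clamps at zero
}
```

Each fused add+max replaces what would otherwise be separate add and compare-select instructions, which is where the claimed speedup for these recurrences comes from.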
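The extra synchronization layer introduced by Thread Block Clusters can be sketched with the cooperative-groups cluster API from CUDA 12. This is a minimal, hypothetical example (the kernel and variable names are ours) showing the new cluster-wide rendezvous and the distributed shared memory it enables:

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Two Thread Blocks per Cluster, declared at compile time.
__global__ void __cluster_dims__(2, 1, 1) cluster_kernel(float* out) {
    cg::cluster_group cluster = cg::this_cluster();

    __shared__ float partial;
    if (threadIdx.x == 0) partial = blockIdx.x + 1.0f;
    __syncthreads();   // familiar block-level synchronization

    cluster.sync();    // the new layer: all blocks in the cluster rendezvous

    // Distributed shared memory: block 0 reads block 1's shared variable.
    if (cluster.block_rank() == 0 && threadIdx.x == 0) {
        float* remote = cluster.map_shared_rank(&partial, 1);
        out[0] = partial + *remote;
    }
    cluster.sync();    // keep remote shared memory alive until the read completes
}
```

Note the two-level structure: `__syncthreads()` still synchronizes within a block, while `cluster.sync()` synchronizes across all blocks in the Cluster.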
Tensor Memory Accelerator
The newly added Tensor Memory Accelerator (TMA) enables asynchronous transfers of multidimensional blocks of data. A single elected thread within a thread group is responsible for interacting with the TMA: it passes along a Copy Descriptor detailing the information the TMA needs to correctly transfer a multidimensional block of data, or tensor. The remaining threads are free to execute other instructions while the TMA operation is underway.
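This elected-thread pattern can be sketched with the experimental bulk-tensor API that CUDA 12 exposes in `<cuda/barrier>`. In the sketch below, the Copy Descriptor is a `CUtensorMap` assumed to have been built on the host with `cuTensorMapEncodeTiled`; the tile size and coordinate parameters are illustrative.

```cuda
#include <cuda.h>          // CUtensorMap
#include <cuda/barrier>    // cuda::barrier, barrier_arrive_tx

namespace cde = cuda::device::experimental;
using barrier_t = cuda::barrier<cuda::thread_scope_block>;

constexpr int TILE = 64;   // illustrative tile size

__global__ void tma_tile_load(const __grid_constant__ CUtensorMap tensor_map,
                              int x, int y) {
    // TMA destinations in shared memory must be 128-byte aligned.
    __shared__ alignas(128) float tile[TILE][TILE];
#pragma nv_diag_suppress static_var_with_dynamic_init
    __shared__ barrier_t bar;

    if (threadIdx.x == 0) {
        init(&bar, blockDim.x);               // one arrival per thread
        cde::fence_proxy_async_shared_cta();  // make the barrier visible to the TMA
    }
    __syncthreads();

    barrier_t::arrival_token token;
    if (threadIdx.x == 0) {
        // The elected thread hands the descriptor to the TMA...
        cde::cp_async_bulk_tensor_2d_global_to_shared(&tile, &tensor_map, x, y, bar);
        // ...and tells the transaction barrier how many bytes to expect.
        token = cuda::device::barrier_arrive_tx(bar, 1, sizeof(tile));
    } else {
        token = bar.arrive();                 // everyone else just arrives
    }
    bar.wait(std::move(token));               // completes once all bytes have landed

    // tile[][] now holds the data; compute on it here.
}
```

This matches the description above: one thread issues the copy and declares the transaction count, the Asynchronous Transaction Barrier accounts for the arriving bytes, and the other threads only pay the cost of arriving at and waiting on the barrier.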
Fourth-Generation Tensor Cores
Fourth-generation Tensor Cores further improve upon the efficiency of the previous generation. Nvidia has now added support for an 8-bit floating-point data type: FP8. They support two flavors of FP8, namely E4M3 and E5M2, enabling a trade-off between dynamic range and precision. The numbers following the E and the M denote the number of exponent and mantissa bits, respectively. Generic computations that natively fit within FP8 ranges are few and far between, but in the cases where FP8 is sufficient, one can expect large performance improvements over, e.g., FP16. Nvidia expects its new DGX SuperPOD to deliver 1 exaFLOPS of sparse FP8 compute.
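The range-versus-precision trade-off is easy to see by round-tripping a few floats through the FP8 host/device types from `<cuda_fp8.h>` (available since CUDA 11.8). A minimal sketch, with values chosen to sit at the edges of each format:

```cuda
#include <cstdio>
#include <cuda_fp8.h>

int main() {
    // 3.1415 shows E4M3's extra mantissa bit (rounds to 3.25 vs E5M2's 3.0);
    // 57344 is E5M2's largest normal value and saturates to 448 in E4M3.
    float values[] = {3.1415f, 448.0f, 57344.0f};
    for (float v : values) {
        float e4m3 = static_cast<float>(__nv_fp8_e4m3(v));  // 4 exponent, 3 mantissa bits
        float e5m2 = static_cast<float>(__nv_fp8_e5m2(v));  // 5 exponent, 2 mantissa bits
        printf("%10.4f -> E4M3 %10.4f | E5M2 %10.4f\n", v, e4m3, e5m2);
    }
    return 0;
}
```

The constructors convert with round-to-nearest and saturation to the largest finite value, which is why out-of-range inputs clamp rather than overflow to infinity.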