Hopper GPU - H100 Overview
The Hopper GPU, introduced as the NVIDIA H100 Tensor Core GPU, is fabricated on the Taiwanese semiconductor manufacturer TSMC's 4N process, customized for NVIDIA, and contains 80 billion transistors. The H100 architecture includes several noteworthy advances.
The custom NVIDIA H100 SXM5 module houses the H100 GPU and HBM3 RAM chips, and also provides connections to other systems via fourth-generation NVLink and PCIe Gen 5 ports (Figure 3). Note that these modules, like the A100, are data center parts and therefore do not include display connectors, NVIDIA RT Cores for ray tracing acceleration, or an NVENC encoder.
The GH100 GPU consists of up to 144 Streaming Multiprocessors (SMs) per full GPU, which incorporate many performance and efficiency improvements over earlier generations.
Key new features useful for scientific computing include \cite{architecture}:
- 2x faster clock-for-clock performance per SM, which contributes to roughly 3x faster FP32 and FP64 throughput at the chip level.
- New fourth-generation Tensor Cores that are up to 6x faster chip-to-chip compared to the A100, delivering 2x the MMA (Matrix Multiply-Accumulate) computational rate of the A100 SM on equivalent data types, and 4x that rate when using the new FP8 data type compared to FP16 (an FP8 conversion sketch follows this list).
- New DPX instructions that accelerate dynamic programming algorithms by up to 7x over the A100 GPU. A more detailed description of DPX follows in the next section; a minimal usage sketch appears after this list.
- New Thread Block Cluster feature, allowing programmatic control of locality at a granularity larger than a single Thread Block on a single SM. Note that this adds another synchronization layer; it is also discussed in the next section, and a cluster sketch appears after this list.
- New Asynchronous Execution features, including a new Tensor Memory Accelerator (TMA) that can transfer large data blocks efficiently between global and shared memory. TMA also supports asynchronous copies between Thread Blocks in a Cluster, and a new Asynchronous Transaction Barrier enables atomic data movement and synchronization (an asynchronous-copy sketch follows this list).
- New Transformer Engine that accelerates Transformer model training and inference by dynamically choosing between FP8 and 16-bit calculations, delivering up to 9x faster AI training and up to 30x faster AI inference on large language models compared to the A100.
- HBM3 memory subsystem provides nearly a 2x bandwidth increase over the previous
generation. The H100 SXM5 GPU is the world’s first GPU with HBM3 memory delivering
a class-leading 3 TB/sec of memory bandwidth.
- 50 MB L2 cache (versus the A100's 40 MB L2), reducing trips to HBM3.
- Second-generation Multi-Instance GPU (MIG) technology provides approximately 3x
more compute capacity and nearly 2x more memory bandwidth per GPU Instance
compared to A100. Confidential Computing capability with MIG-level Trusted Execution
Environments (TEE) is now provided for the first time. Up to seven individual GPU
Instances are supported, each with dedicated NVDEC and NVJPG units. Each Instance
now includes its own set of performance monitors that work with NVIDIA developer tools.
- H100 is the world's first GPU with native Confidential Computing support, extending the Trusted Execution Environment to CPUs at full PCIe line rate.
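To make the FP8 point above concrete, the following minimal sketch (an illustration, not code from the cited whitepaper; it assumes CUDA 11.8 or newer for the cuda_fp8.h conversion intrinsics, and the kernel name and values are ours) round-trips FP32 values through the E4M3 FP8 encoding so the quantization step is visible:

```cuda
#include <cuda_fp8.h>   // FP8 conversion intrinsics (CUDA 11.8+)
#include <cuda_fp16.h>
#include <cstdio>

// Round-trip FP32 -> FP8 (E4M3) -> FP32 to expose the quantization step.
__global__ void fp8_roundtrip(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        __nv_fp8_storage_t s =
            __nv_cvt_float_to_fp8(in[i], __NV_SATFINITE, __NV_E4M3);
        __half h(__nv_cvt_fp8_to_halfraw(s, __NV_E4M3)); // FP8 -> FP16
        out[i] = __half2float(h);                        // FP16 -> FP32
    }
}

int main() {
    const int n = 4;
    float h_in[n] = {0.1f, 1.0f, 3.14159f, 448.0f}, h_out[n];
    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    fp8_roundtrip<<<1, n>>>(d_in, d_out, n);
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i)
        printf("%f -> %f\n", h_in[i], h_out[i]);
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```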
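Likewise, ahead of the detailed DPX discussion in the next section, this small device-function sketch (our illustration, assuming CUDA 12 and an sm_90 target, where the intrinsics map to single hardware instructions) expresses a Smith-Waterman-style dynamic-programming cell update with the DPX intrinsics __viaddmax_s32 and __vimax3_s32:

```cuda
// Smith-Waterman-style cell update using Hopper DPX intrinsics
// (CUDA 12+, compile with -arch=sm_90).
__device__ int sw_cell(int h_diag, int score, int e, int f) {
    // max(h_diag + score, e) in one fused DPX operation
    int m = __viaddmax_s32(h_diag, score, e);
    // three-way max against f and 0, clamping the cell at zero
    return __vimax3_s32(m, f, 0);
}
```

On pre-Hopper GPUs the same intrinsics compile to emulated instruction sequences, so code written this way remains portable.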
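The next sketch illustrates the Thread Block Cluster feature (illustrative code assuming CUDA 12 and compute capability 9.0; kernel name and sizes are ours): two blocks in a cluster exchange data through distributed shared memory, with cluster.sync() supplying the additional synchronization layer noted above:

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Two thread blocks per cluster (compile-time attribute; requires sm_90).
__global__ void __cluster_dims__(2, 1, 1) cluster_exchange(float* out) {
    __shared__ float smem[128];
    cg::cluster_group cluster = cg::this_cluster();
    unsigned int rank = cluster.block_rank();   // 0 or 1 within the cluster

    smem[threadIdx.x] = static_cast<float>(rank + threadIdx.x);
    cluster.sync();  // all blocks in the cluster have written shared memory

    // Distributed shared memory: map a pointer into the *other* block's
    // shared memory and read it directly.
    float* peer = cluster.map_shared_rank(smem, rank ^ 1);
    out[blockIdx.x * blockDim.x + threadIdx.x] = peer[threadIdx.x];

    cluster.sync();  // keep peer shared memory alive until all reads finish
}
```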
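Finally, the asynchronous-execution pattern can be sketched with the libcu++ cuda::barrier and cuda::memcpy_async APIs. This shows only the general arrive/wait structure; the bulk-tensor TMA path proper is exposed through a separate, lower-level API, and the tile size and names here are illustrative:

```cuda
#include <cuda/barrier>
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Stage one 256-element tile into shared memory asynchronously, then
// consume it; completion is signalled through an arrive/wait barrier
// rather than a blocking load. Assumes n is a multiple of TILE.
__global__ void async_tile_copy(const float* in, float* out, int n) {
    constexpr int TILE = 256;
    __shared__ float tile[TILE];
#pragma nv_diag_suppress static_var_with_dynamic_init
    __shared__ cuda::barrier<cuda::thread_scope_block> bar;

    auto block = cg::this_thread_block();
    if (block.thread_rank() == 0)
        init(&bar, block.size());  // one expected arrival per thread
    block.sync();

    // Asynchronous copy: global -> shared, tracked by the barrier
    cuda::memcpy_async(block, tile, in + blockIdx.x * TILE,
                       sizeof(float) * TILE, bar);
    bar.arrive_and_wait();         // block until the tile has landed

    int i = blockIdx.x * TILE + threadIdx.x;
    if (i < n) out[i] = 2.0f * tile[threadIdx.x];
}
```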
An overview comparing the V100, A100, and H100 architectures is shown in the following table.
Hopper Features useful for Scientific Computing