- The new Streaming Multiprocessor (SM) has many performance and efficiency improvements. Key new features (adapted from the NVIDIA Hopper whitepaper) include:
○ New fourth-generation Tensor Cores that are up to 6x faster chip-to-chip than A100, combining a per-SM speedup, the additional SM count, and the higher clocks of H100. On a per-SM basis, the Tensor Cores deliver 2x the MMA (Matrix Multiply-Accumulate) rate of the A100 SM on equivalent data types, and 4x the A100 rate when using the new FP8 data type compared to the previous generation's 16-bit floating-point options (a short FP8 sketch follows this feature list). The Sparsity feature exploits fine-grained structured sparsity in deep learning networks, doubling the throughput of standard Tensor Core operations.
○ New DPX Instructions accelerate Dynamic Programming algorithms by up to 7x over the A100 GPU. Two examples are the Smith-Waterman algorithm for genomics processing and the Floyd-Warshall algorithm used to find optimal routes for a fleet of robots through a dynamic warehouse environment. A more detailed description of DPX follows in the next section.
○ 3x faster IEEE FP64 and FP32 processing rates chip-to-chip compared to A100, due to 2x faster clock-for-clock performance per SM, plus the additional SM count and higher clocks of H100.
○ New Thread Block Cluster feature allows programmatic control of locality at a granularity larger than a single Thread Block on a single SM. This extends the CUDA programming model by adding another level to the programming hierarchy, which now includes Threads, Thread Blocks, Thread Block Clusters, and Grids. Clusters enable multiple Thread Blocks running concurrently across multiple SMs to synchronize and to collaboratively fetch and exchange data (a launch sketch follows this feature list).
○ New Asynchronous Execution features include a new Tensor Memory Accelerator (TMA) unit that can transfer large blocks of data very efficiently between global memory and shared memory. TMA also supports asynchronous copies between the Thread Blocks in a Cluster. A new Asynchronous Transaction Barrier combines atomic data movement with synchronization; note that this adds another synchronization layer to the model (a copy/barrier sketch follows this feature list).
- New Transformer Engine uses a combination of software and custom Hopper Tensor Core technology designed specifically to accelerate Transformer model training and inference. The Transformer Engine dynamically chooses between FP8 and 16-bit calculations, automatically handling re-casting and scaling between the two precisions in each layer, to deliver up to 9x faster AI training and up to 30x faster AI inference on large language models compared to the prior-generation A100.
- HBM3 memory subsystem provides nearly a 2x bandwidth increase over the previous
generation. The H100 SXM5 GPU is the world’s first GPU with HBM3 memory delivering
a class-leading 3 TB/sec of memory bandwidth.
- 50 MB L2 cache architecture caches large portions of models and datasets for
repeated access, reducing trips to HBM3.
- Second-generation Multi-Instance GPU (MIG) technology provides approximately 3x
more compute capacity and nearly 2x more memory bandwidth per GPU Instance
compared to A100. Confidential Computing capability with MIG-level Trusted Execution
Environments (TEE) is now provided for the first time. Up to seven individual GPU
Instances are supported, each with dedicated NVDEC and NVJPG units. Each Instance
now includes its own set of performance monitors that work with NVIDIA developer tools.
- New Confidential Computing support to protect user data, defend against hardware and software attacks, and better isolate and protect VMs from each other in virtualized and MIG environments. H100 is the world's first GPU with native Confidential Computing support, extending the CPU Trusted Execution Environment to the GPU at full PCIe line rate.
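
The FP8 sketch referenced in the Tensor Core item above: CUDA exposes the two Hopper FP8 formats (E4M3 and E5M2) as storage types in the cuda_fp8.h header. The following is a minimal sketch, assuming the CUDA 11.8+ __nv_fp8_e4m3 type and its float conversions; it illustrates only the data type and its rounding behavior, not the Tensor Core MMA path itself.

```cuda
// Minimal sketch of the FP8 (E4M3) storage type referenced in the Tensor Core
// item above. Assumes CUDA 11.8+ with <cuda_fp8.h>; conversion semantics are
// taken from the CUDA Math API documentation and may differ in detail.
#include <cuda_fp8.h>
#include <cstdio>

int main() {
    float x = 0.3333f;

    // Round a 32-bit float to the 8-bit E4M3 format (1 sign, 4 exponent,
    // 3 mantissa bits) and convert back to float to observe the rounding error.
    __nv_fp8_e4m3 x8 = __nv_fp8_e4m3(x);
    float back = static_cast<float>(x8);

    printf("fp32 %f -> fp8(e4m3) -> fp32 %f\n", x, back);
    return 0;
}
```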
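The Thread Block Cluster sketch referenced above: CUDA 12 exposes clusters through the __cluster_dims__ kernel attribute and cooperative_groups::this_cluster(), including distributed shared memory access via map_shared_rank. The kernel below is a minimal sketch under those assumptions; the kernel name, sizes, and exchange pattern are illustrative.

```cuda
// Minimal sketch of a Thread Block Cluster: two blocks per cluster exchange
// data through distributed shared memory. Based on the CUDA 12 cluster APIs
// (__cluster_dims__, cooperative_groups::this_cluster); names and sizes here
// are illustrative assumptions.
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void __cluster_dims__(2, 1, 1) exchange_kernel(int *out) {
    __shared__ int smem[256];
    cg::cluster_group cluster = cg::this_cluster();
    unsigned rank = cluster.block_rank();   // 0 or 1 within this cluster

    smem[threadIdx.x] = (int)(rank * blockDim.x + threadIdx.x);
    cluster.sync();                         // all blocks in the cluster have written

    // Read the partner block's shared memory via distributed shared memory.
    int *peer = cluster.map_shared_rank(smem, rank ^ 1);
    out[blockIdx.x * blockDim.x + threadIdx.x] = peer[threadIdx.x];

    cluster.sync();                         // keep smem alive until all reads finish
}
```

At launch, the grid dimension must be a multiple of the cluster size (here, 2 blocks), and the kernel must be compiled for sm_90.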
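The copy/barrier sketch referenced above: in CUDA C++ the asynchronous data-movement model is programmed with cuda::memcpy_async completing on a cuda::barrier from libcu++; on Hopper, suitably sized and aligned copies of this form can be serviced by the TMA unit. This is a minimal sketch of the pattern, assuming the <cuda/barrier> API; TILE and the kernel name are illustrative.

```cuda
// Minimal sketch of an asynchronous global-to-shared copy completed through a
// barrier, the programming pattern behind the TMA / asynchronous-barrier
// features above. Assumes the libcu++ <cuda/barrier> API; TILE and the kernel
// name are illustrative.
#include <cooperative_groups.h>
#include <cuda/barrier>
namespace cg = cooperative_groups;

constexpr int TILE = 1024;

__global__ void tile_sum(const float *in, float *out) {
    auto block = cg::this_thread_block();
    __shared__ alignas(16) float tile[TILE];
    __shared__ cuda::barrier<cuda::thread_scope_block> bar;

    if (block.thread_rank() == 0) init(&bar, block.size());
    block.sync();

    // Kick off the bulk copy; threads are free to do other work until they wait.
    cuda::memcpy_async(block, tile, in + blockIdx.x * TILE,
                       cuda::aligned_size_t<16>(sizeof(tile)), bar);
    bar.arrive_and_wait();                  // copy is complete past this point

    float partial = 0.f;
    for (int i = block.thread_rank(); i < TILE; i += block.size())
        partial += tile[i];
    atomicAdd(&out[blockIdx.x], partial);   // out assumed zero-initialized
}
```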
Hopper Features useful for Scientific Computing
- DPX Instructions \cite{bloga}
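
As a reference point for the discussion that follows, the kernel below sketches the min-plus relaxation at the core of Floyd-Warshall, one of the dynamic-programming patterns DPX targets. It is written in plain CUDA C++ without DPX intrinsics; the add-then-min in the inner step is the kind of operation Hopper can issue as a single fused DPX instruction. The names and the row-major layout are illustrative.

```cuda
// Plain-CUDA sketch of the Floyd-Warshall relaxation step, the kind of
// dynamic-programming recurrence the DPX instructions accelerate. dist is an
// n x n distance matrix in row-major order; the kernel relaxes all pairs
// through a fixed intermediate vertex k. Names and layout are illustrative.
__global__ void fw_relax(int *dist, int n, int k) {
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n || j >= n) return;

    // d(i,j) = min(d(i,j), d(i,k) + d(k,j)): an add followed by a min,
    // which Hopper's DPX instructions can fuse into one operation.
    int through_k = dist[i * n + k] + dist[k * n + j];
    if (through_k < dist[i * n + j])
        dist[i * n + j] = through_k;
}

// Host side (sketch): launch once per intermediate vertex k.
//   dim3 block(16, 16), grid((n + 15) / 16, (n + 15) / 16);
//   for (int k = 0; k < n; ++k) fw_relax<<<grid, block>>>(d_dist, n, k);
```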