Tensor Memory Accelerator

The newly added Tensor Memory Accelerator (TMA) enables asynchronous transfers of multidimensional blocks of data. A single elected thread within a thread group interacts with the TMA on behalf of its peers: it passes along a copy descriptor containing all the information the TMA needs to correctly transfer a multidimensional block of data, or tensor. The remaining threads are free to execute other instructions while the TMA operation is in flight.
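The division of labor described above can be illustrated with a CPU-side Python analogy (this is not CUDA code; names such as `bulk_copy` and the tuple standing in for the copy descriptor are purely illustrative): one elected worker hands a descriptor to an asynchronous copy engine, independent work overlaps with the transfer, and a final wait plays the role of the completion barrier.

```python
from concurrent.futures import ThreadPoolExecutor

def bulk_copy(src, dst, rows):
    # Stand-in for the TMA engine: transfers a block of rows from src to dst.
    for r in rows:
        dst[r] = list(src[r])

src = [[r * 10 + c for c in range(4)] for r in range(4)]
dst = [[0] * 4 for _ in range(4)]

with ThreadPoolExecutor(max_workers=1) as engine:
    # The "elected thread" submits the descriptor (src, dst, rows) to the
    # copy engine and immediately regains control.
    pending = engine.submit(bulk_copy, src, dst, range(4))
    # Other work proceeds while the transfer is in flight.
    partial_sums = [sum(row) for row in src]
    # Waiting on the result corresponds to arriving at the completion barrier.
    pending.result()
```

The key point the sketch captures is that only one participant issues the transfer, yet all participants may consume the data once the wait completes.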

Fourth-Generation Tensor Cores

Fourth-generation tensor cores further improve upon the efficiency of the previous generation. Nvidia has added support for an 8-bit floating-point datatype, FP8, in two flavors: E4M3 and E5M2. The digits following E and M give the number of exponent and mantissa bits, respectively, so E5M2 favors dynamic range while E4M3 favors precision. Generic computations that natively fit within FP8 ranges are few and far between, but where FP8 is sufficient, one can expect substantial performance improvements over, e.g., FP16. Nvidia expects its new DGX SuperPOD to deliver 1 exaFLOPS of sparse FP8 compute.
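The range-versus-precision trade-off follows directly from the bit layouts. A short calculation makes it concrete, using the standard FP8 conventions (E4M3 reserves its top exponent/mantissa combination for NaN; E5M2 reserves its top exponent for Inf/NaN):

```python
# E4M3 (bias 7): exponent field 1111 with mantissa 111 encodes NaN, so the
# largest finite value uses mantissa 110: (1 + 6/8) * 2**(15 - 7)
e4m3_max = 1.75 * 2**8       # 448.0

# E5M2 (bias 15): exponent field 11111 encodes Inf/NaN, so the largest
# finite value uses exponent field 11110: (1 + 3/4) * 2**(30 - 15)
e5m2_max = 1.75 * 2**15      # 57344.0

# Relative spacing between adjacent values (2**-mantissa_bits) shows the
# precision side of the trade-off: E4M3 resolves finer steps than E5M2.
e4m3_eps = 2.0 ** -3         # 0.125
e5m2_eps = 2.0 ** -2         # 0.25
```

E5M2 thus reaches values over two orders of magnitude larger than E4M3, while E4M3 represents each value with twice the relative precision.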