Abstract
This report summarizes the key innovations in Google’s Tensor Processing Unit version 4 (TPU v4), a domain-specific supercomputer for training machine learning models. TPU v4 introduces optical circuit switches for flexible topology and sparse cores dedicated to accelerating embeddings. The optical switches build a 4096-chip interconnect that improves reliability through reconfigurability while raising performance through topologies tailored to each job. Sparse cores provide a 5-7x speedup for embedding-heavy workloads at only about 5% incremental area and power. Across production workloads, TPU v4 delivers 2.1x higher performance than TPU v3 on average, with 2.7x greater performance per Watt.
Introduction
Machine learning models continue to rapidly advance in scale and algorithmic complexity. TPU v4 represents Google’s latest specialized hardware for accelerating the training of these demanding models as part of an integrated machine learning stack including algorithms, software infrastructure, and customized silicon. This report focuses on two vital architectural innovations in TPU v4:
Optical circuit switches enabling a flexible interconnect topology
Sparse cores dedicated to accelerating embedding lookups
Optical Circuit Switches for Flexible Topology
To scale from 1024 chips per supercomputer in TPU v3 to 4096 in TPU v4, optical circuit switches (OCSes) connect the TPU chips over optical links. The OCSes offer eight substantial benefits:
Scalability up to 4096 chips, a 4x increase over TPU v3
Improved reliability through reconfigurability around failed chips
Flexible topology tailored to optimize each job
1.2-2.3x higher performance from topology tuning
Reduced power versus electronic packet switching
Simplified scheduler for better utilization
Faster partial deployment of systems
Enhanced security isolation between jobs
The optical switches assemble the TPU v4 supercomputer from 64-chip building blocks (4x4x4 cubes of chips), with electrical links between chips within a block and optical links between blocks. Despite the transformative interconnect flexibility it enables, the OCS fabric adds less than 5% to overall system cost and less than 3% to power, allowing a 4096-chip system to be built with high efficiency and fault tolerance.
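To make the reconfigurability benefit concrete, the following Python sketch shows how a scheduler might assemble a job’s slice from healthy 64-chip cubes and program optical circuits between them. All names here (Cube, build_slice, the ring wiring) are hypothetical illustrations; the paper does not expose Google’s actual scheduler or OCS programming interface.

```python
# Hypothetical sketch: assembling a TPU v4 slice from healthy 64-chip cubes.
# `Cube`, `build_slice`, and the ring wiring are illustrative stand-ins,
# not Google's real scheduler or OCS API.

from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class Cube:
    coord: tuple   # position of this 4x4x4 (64-chip) block in the machine
    healthy: bool

def build_slice(cubes, requested_cubes):
    """Pick healthy cubes and return the optical circuits to program."""
    pool = [c for c in cubes if c.healthy]
    if len(pool) < requested_cubes:
        raise RuntimeError("not enough healthy cubes for this job")
    chosen = pool[:requested_cubes]
    # Program point-to-point optical circuits between consecutive cubes
    # (a simple ring here; real slices use 3D tori, and the paper notes
    # twisted-torus wirings can further raise bisection bandwidth).
    return [(a.coord, b.coord) for a, b in zip(chosen, chosen[1:] + chosen[:1])]

# 64 cubes = 4096 chips; mark one cube failed to show the rerouting path.
cubes = [Cube(c, healthy=(c != (0, 0, 1))) for c in product(range(4), repeat=3)]
circuits = build_slice(cubes, requested_cubes=8)
print(f"{len(circuits)} optical circuits programmed")
```

The point of the sketch is the failure path: a dead cube is simply excluded from the pool and the optical cross-connects are programmed around it, rather than idling the whole machine.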
Sparse Cores for Embedding Acceleration
A major portion of Google’s machine learning workload consists of deep learning recommendation models (DLRMs). DLRMs rely heavily on embedding lookups, which stress memory bandwidth and generate all-to-all communication traffic across the interconnect. To accelerate embeddings, TPU v4 contains dedicated sparse cores that operate as an interconnected sea of simple dataflow processors. The sparse cores improve embedding performance by 5-7x over CPUs and 3.1x over TPU v3, at only ~5% incremental area and power; without them, embeddings would have to live in host CPU memory, forfeiting that 5-7x gain for embedding-heavy models.
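The all-to-all pattern is easiest to see in software. The NumPy sketch below shards an embedding table by rows across four simulated devices and routes each lookup to its owning shard; it is an analogy for the communication pattern, not SparseCore’s actual dataflow, and the shard layout, device count, and function names are assumptions.

```python
# Minimal NumPy sketch of why embedding lookups stress the interconnect:
# the table is sharded across devices, so each batch of ids must be
# exchanged all-to-all before rows can be gathered locally.

import numpy as np

NUM_DEVICES = 4
VOCAB, DIM = 1024, 8
rng = np.random.default_rng(0)

# Shard the table by rows: device d owns rows [d*256, (d+1)*256).
table = rng.normal(size=(VOCAB, DIM)).astype(np.float32)
shards = np.split(table, NUM_DEVICES)

def sharded_lookup(ids):
    """Gather embedding rows for `ids`, simulating the all-to-all step."""
    rows_per_shard = VOCAB // NUM_DEVICES
    owner = ids // rows_per_shard                 # which device owns each id
    out = np.empty((len(ids), DIM), dtype=np.float32)
    for d in range(NUM_DEVICES):
        mask = owner == d
        # In hardware this is all-to-all traffic: ids travel to device d
        # and the gathered rows travel back.
        out[mask] = shards[d][ids[mask] - d * rows_per_shard]
    return out

batch = rng.integers(0, VOCAB, size=32)
assert np.allclose(sharded_lookup(batch), table[batch])
```

Every device must exchange indices and rows with every other device, which is why embedding-heavy models benefit from both the sparse cores and the interconnect’s bisection bandwidth.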
Production Workload Performance
Across eight production workloads, TPU v4 demonstrates significantly higher performance than TPU v3. As shown in Figure 12 of the referenced paper, speedups range from 1.5x to 3.5x at the same slice sizes. The DLRMs see especially large gains of 3.0-3.5x, benefiting both from the optical interconnect’s higher bisection bandwidth and from dedicated sparse-core acceleration. Surprisingly, one RNN workload runs 3.3x faster on TPU v4, helped by the increased scratchpad memory bandwidth.
Conclusion
The optical circuit switches and sparse cores in TPU v4 enable scaling to 4096 chips while improving reliability, efficiency, and suitability for demanding machine learning workloads. TPU v4 delivers substantially higher performance across Google’s production jobs than prior generations. The architecture exemplifies how specialized, co-designed systems can keep pace with the rapidly changing machine learning field across hardware generations.
Reference
Jouppi, N., Kurian, G., Li, S., Ma, P., Nagarajan, R., Nai, L., Patil, N., Subramanian, S., Swing, A., Towles, B., Young, C., Zhou, X., Zhou, Z., & Patterson, D. A. (2023). TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings. Proceedings of the 50th Annual International Symposium on Computer Architecture (ISCA '23), 1-14. https://doi.org/10.1145/3579371.3589350