Introduction
The relentless pursuit of Moore's Law scaling is encountering mounting challenges from the soaring cost and complexity of newer semiconductor process nodes. As depicted in Figure 1, the cost per yielded mm² of silicon is projected to keep rising with each node shrink. This economic reality is driving the industry toward a more disaggregated, modular approach to chip design built on "chiplets": relatively small dies, each manufactured in its economically optimal process, assembled into a multi-chip package.
The Need for Modular Architectures
Beyond the economics, the slowing of transistor scaling dictates the need for more specialized and accelerated compute engines optimized for domains like AI, graphics, and signal processing. General-purpose CPUs can no longer efficiently handle the highly parallel, bursty workloads these applications demand. Heterogeneous integration through chiplets provides an ideal way to mix and match cutting-edge accelerators with more cost-effective legacy nodes for components like I/O that do not benefit much from scaling.
Basics of Chiplet Design
The fundamental premise of chiplets is to split a large monolithic system-on-chip (SoC) into a collection of smaller dies, or "chiplets," which can be manufactured with higher yields (Figures 3-9). Despite the design overhead of the chiplet interfaces, this partitioned approach increases the overall yield of functional silicon per wafer. The super-linear relationship between die area and production cost further favors keeping die sizes small.
However, chiplet architectures incur overheads from the inter-chiplet communication interfaces and from per-die costs such as area for power regulation (Figures 10-11). Judicious system partitioning is required to ensure that the yield and cost benefits outweigh these overheads.
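The yield-versus-overhead tradeoff above can be sketched with a simple defect-limited yield model. The defect density, wafer cost, die sizes, and interface overhead below are illustrative assumptions, not figures from the talk:

```python
import math

# Illustrative yield/cost model (assumed numbers, not AMD data).
# Poisson defect-limited yield: Y = exp(-A * D0)
D0 = 0.002          # defect density per mm^2 (assumption)
WAFER_COST = 10000  # cost per wafer, arbitrary units
WAFER_AREA = 70000  # usable area of a 300 mm wafer in mm^2 (approx.)

def cost_per_good_die(area_mm2):
    """Cost of one functional die: wafer cost share divided by yield."""
    yield_frac = math.exp(-area_mm2 * D0)
    dies_per_wafer = WAFER_AREA // area_mm2
    return WAFER_COST / (dies_per_wafer * yield_frac)

# One 600 mm^2 monolithic die vs. four 160 mm^2 chiplets; the extra
# 40 mm^2 of total area models the die-to-die interface overhead.
mono = cost_per_good_die(600)
chiplet = 4 * cost_per_good_die(160)
print(f"monolithic: {mono:.1f}  chiplets: {chiplet:.1f}")
```

Because yield falls exponentially with area, four small dies remain cheaper than one large die even after paying the interface-area penalty; if the overhead grew large enough, the inequality would flip, which is the partitioning judgment the text describes.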
Classes of Chiplet Interconnects
Three main types of chiplet interconnects are employed, varying in performance and reach: advanced packaging, organic package, and stacked 3D interconnects (Figure 12).
Advanced packaging uses ultra-short reach (USR) links under 2mm long, enabling very wide parallel interfaces operating at moderate speeds with ultra-low energy (<0.6 pJ/bit). Figure 13 illustrates an example using microbump interconnects. However, this approach restricts chiplet placement due to tight distance constraints.
Organic package interconnects use longer (up to 25mm) but narrower high-speed serial links, enabling flexible die-to-die placements across the package substrate (Figure 14). The link energy is higher, at under 2 pJ/bit.
3D stacked die interconnects based on through-silicon vias (TSVs) provide unparalleled interconnect density and energy efficiency at sub-pJ/bit levels, but with limited reach between adjacent dice (Figure 15).
Figure 16 summarizes the complex tradeoffs in choosing the optimal chiplet architecture based on performance needs, bandwidth, packaging costs, and engineering complexity.
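One axis of that tradeoff is simple to quantify: link power is bandwidth times energy per bit. The energy figures below follow the three classes described above, while the bandwidth figures are assumptions chosen for illustration:

```python
# Back-of-envelope link power: P = bandwidth * energy-per-bit.
TBPS_IN_BITS = 8e12  # bits per second in one TB/s

links = {
    # name: (energy in pJ/bit, bandwidth in TB/s) -- illustrative values
    "advanced packaging (USR)": (0.5, 5.0),
    "organic package (serial)": (1.8, 0.5),
    "3D stacked (TSV)":         (0.2, 15.0),
}

for name, (pj_per_bit, tbps) in links.items():
    watts = pj_per_bit * 1e-12 * tbps * TBPS_IN_BITS
    print(f"{name}: {watts:.1f} W for {tbps} TB/s")
```

The arithmetic makes the constraint concrete: sustaining multi-TB/s bandwidth at organic-class pJ/bit energies would consume tens of watts in the links alone, which is why the higher-bandwidth designs discussed later require USR or 3D interconnects.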
Organic Chiplet Designs in AMD CPUs
AMD has been a pioneer in adopting chiplet packaging across its CPU product lines. The EPYC server processor exemplifies the cost benefits, with die cost scaling roughly linearly with core count rather than the quadratic growth seen for monolithic dies (Figure 17). Splitting into a multi-chiplet module enabled scaling up to 64 cores using a lower-cost, mature-node I/O die paired with multiple CPU core chiplets made on the latest node (Figure 18).
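The linear-versus-quadratic cost argument can be sketched as follows. Yield loss makes a monolithic die's cost grow super-linearly with area, while a chiplet design adds a fixed-cost I/O die plus one small die per core group; all constants here are illustrative assumptions, not AMD figures:

```python
import math

AREA_PER_8_CORES = 80.0  # mm^2 per 8-core chiplet (assumption)
IO_DIE_COST = 40.0       # fixed cost of the mature-node I/O die (assumption)
D0 = 0.002               # defects per mm^2 (assumption)

def die_cost(area):
    # Cost proportional to area, inflated by Poisson yield loss.
    return area / math.exp(-area * D0)

def monolithic_cost(cores):
    return die_cost(cores / 8 * AREA_PER_8_CORES)

def chiplet_cost(cores):
    n = cores // 8
    return IO_DIE_COST + n * die_cost(AREA_PER_8_CORES)

for cores in (16, 32, 64):
    print(cores, round(monolithic_cost(cores)), round(chiplet_cost(cores)))
```

Under these assumptions the chiplet cost grows by a constant per added chiplet, while the monolithic cost accelerates with core count, reproducing the divergence Figure 17 depicts.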
This strategy extended to AMD's Ryzen desktop processors too, yielding substantial cost savings compared to monolithic dies (Figure 19). Critically, the chiplet design enabled rapid cross-pollination of technology across product segments by reusing the same chiplets in both server and desktop systems (Figures 20-21).
Advanced Packaging for AMD GPUs
While organic substrate designs worked well for CPUs, AMD's latest "Navi 31" gaming GPUs pushed the limits of required bandwidth. GPUs have much higher connectivity needs than CPUs between the various compute clusters, caches, and memory interfaces (Figure 22).
The organic substrate proved inadequate to satisfy these terabyte-scale bandwidth requirements within a reasonable power budget. AMD developed advanced packaging techniques with ultra-short reach links to provide an order of magnitude higher bandwidth density than previous organic interconnects (Figure 23-24).
This breakthrough enabled the "Navi 31" architecture to integrate multiple memory cache dies (MCDs) with an optimized graphics compute die (GCD) made on the latest 5nm process node. The MCDs handle the poorly-scaling DRAM interfaces and cache, while the performance-critical graphics engines reside on the N5 GCD. The USR links provide an unprecedented 5.3 TB/s of total bandwidth between the chiplets (Figure 25).
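The scale of that 5.3 TB/s figure is easiest to appreciate as link power at different energy-per-bit points. The pJ/bit values below are assumptions that bracket the USR and organic classes discussed earlier, not measured Navi 31 numbers:

```python
# Link power for 5.3 TB/s of die-to-die bandwidth at several
# energy-per-bit operating points (illustrative values).
BW_BITS = 5.3 * 8e12  # 5.3 TB/s expressed in bits/s

for pj_per_bit in (0.4, 0.6, 2.0):  # USR range vs. an organic-class link
    watts = pj_per_bit * 1e-12 * BW_BITS
    print(f"{pj_per_bit} pJ/bit -> {watts:.1f} W")
```

At organic-class energies the die-to-die links alone would burn a large slice of a GPU's power budget, which is why the bandwidth target forced the move to USR advanced packaging.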
Moreover, AMD's chiplet designs maintained power efficiency by aggressive voltage scaling and clock gating of the USR links to reduce energy/bit by up to 80% versus organic links (Figure 26). The tight integration enabled by USR packaging also helped improve memory latency versus the monolithic predecessor (Figure 27).
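The leverage behind voltage scaling comes from the quadratic dependence of CMOS switching energy on supply voltage (E ∝ CV²). A minimal sketch, with assumed voltages chosen only for illustration:

```python
# CV^2 sketch: dynamic switching energy scales with the square of the
# supply voltage, so modest voltage reductions compound into large
# energy/bit savings before clock gating is even counted.
def rel_energy(v, v_nom=1.0):
    """Dynamic energy relative to nominal supply voltage."""
    return (v / v_nom) ** 2

# Dropping an assumed 1.0 V nominal supply to 0.6 V cuts dynamic
# energy to roughly a third of nominal.
print(rel_energy(0.6))
```

Combined with gating idle link clocks, this quadratic lever is the kind of mechanism that makes reductions on the order of the quoted 80% plausible.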
Overall, the "Navi 31" chiplet design boosted performance-per-watt by up to 54% compared to the prior monolithic generation (Figure 28). These innovations highlight how advanced packaging unlocks new architectural possibilities and accelerates product scalability.
3D Stacked Chiplet Architectures
At the cutting edge lies AMD's latest AI accelerator, the Instinct MI300, which incorporates multiple 3D-stacked chiplets (Figures 29-30). 3D hybrid bonding with TSVs achieves unprecedented interconnect densities, enabling 17 TB/s of vertical bandwidth between the chiplets.
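The density advantage of hybrid bonding over microbumps follows from connection count scaling with the inverse square of bump pitch. The pitches below are typical published values assumed for illustration, not MI300 specifications:

```python
# Interconnect density scales as 1/pitch^2, which is why fine-pitch
# hybrid bonding dwarfs microbump density. Pitches are assumptions.
def connections_per_mm2(pitch_um):
    """Vertical connections per mm^2 for a square grid at a given pitch."""
    return (1000 / pitch_um) ** 2

microbump = connections_per_mm2(36)  # advanced-packaging microbumps (~36 um)
hybrid = connections_per_mm2(9)      # 3D hybrid bonding (~9 um)
print(round(microbump), round(hybrid), round(hybrid / microbump))
```

Quartering the pitch yields a 16x density gain, and it is this kind of scaling, multiplied across the stack, that makes tens of TB/s of vertical bandwidth reachable.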
Managing the high bandwidth demands and thermal loads of AI/HPC workloads on such a highly integrated design requires extensive co-optimization between the architecture, packaging, and power delivery (Figures 31-32). AMD's modular chiplet approach facilitated mixing and matching components like re-using the same CCD across the MI300 and latest EPYC processors with a flexible system interface (Figure 33).
To construct the elaborate 3D stack with mirrored and rotated chiplets, AMD developed new design capabilities like heterogeneous chiplet interfaces supporting multiple bonding patterns (Figures 34-36). Complex symmetry requirements for power/ground TSVs and thermals had to be addressed.
Key Levers for Scaling Chiplet Architectures
As Figure 37 illustrates, the key engineering levers for boosting bandwidth density and interconnect efficiency span innovations in system partitioning, 3D stacking/bonding, and advanced packaging substrates. Industry ecosystems fostering open standards and a "chiplet marketplace" can further catalyze such heterogeneous integration.
While chiplet architectures involve higher design complexity, their inherent modularity accelerates hardware optimization and time-to-market for next-generation computing platforms. The ability to mix-and-match components from various process nodes provides unparalleled flexibility to scale performance, efficiency, and differentiation – ushering in a new era of domain-specific, data-centric computing.
Reference
[1] S. Naffziger, "Efficient Chiplets and Die-to-Die Communications," in Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC), 2024.