Introduction
The relentless growth in computing demands, especially for machine learning workloads, is putting increasing pressure on memory systems. Compute capability has been improving by roughly 3.1x every 2 years, far faster than the 1.4x-every-2-years growth in memory interconnect bandwidth (Figure 1) [1]. This widening compute-to-memory gap is a key bottleneck known as the "memory wall."
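To see how quickly these rates diverge, here is a minimal back-of-envelope sketch in Python; the growth factors are the ones quoted above, while the 10-year horizon is an assumption chosen purely for illustration.

```python
# Compound the quoted growth rates (3.1x per 2 years for compute,
# 1.4x per 2 years for memory interconnect bandwidth) over an assumed
# 10-year horizon to estimate the widening compute-to-memory gap.

def total_growth(factor_per_2y: float, years: float) -> float:
    """Overall growth after `years`, given a fixed factor every 2 years."""
    return factor_per_2y ** (years / 2)

years = 10
compute = total_growth(3.1, years)   # ~286x
memory = total_growth(1.4, years)    # ~5.4x
print(f"After {years} years: compute ~{compute:.0f}x, "
      f"memory bandwidth ~{memory:.0f}x, gap ~{compute / memory:.0f}x")
```

Under these assumptions the gap grows to roughly 50x over a decade.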
To address this, the industry is moving from monolithic 2D system-on-chip (SoC) designs to 2.5D chiplet-based architectures. Partitioning functionality into modular chiplets that are integrated either laterally in 2.5D or vertically in 3D provides more flexibility in mixing and matching different process technologies (Figures 2-4). In particular, it opens the door for integrating emerging memory technologies into high performance computing (HPC) systems in novel ways.
Opportunity for Emerging Memory
Advanced packaging technologies like TSMC's SoIC enable high-density 3D stacking of chiplets. For example, AMD's 3D V-Cache technology, used in their Ryzen processors, stacks an additional 64-128MB of cache memory on top of the compute die (Figure 5). This extra level 3 (L3) cache, sitting between the processor and main memory, provides a significant performance boost for many workloads.
The V-Cache is currently implemented with conventional SRAM, but it presents an exciting opportunity for emerging memory technologies that offer higher density than SRAM:
Spin-transfer torque MRAM (STT-MRAM) is a non-volatile memory well suited to replacing SRAM in last-level caches. While it faces challenges in write current and endurance, STT-MRAM can provide a ~3x density improvement over SRAM (a back-of-envelope sketch below illustrates what that factor buys).
Spin-orbit torque MRAM (SOT-MRAM) improves on STT-MRAM by separating read and write paths for better endurance. Voltage-gated SOT devices are a further optimization targeting high density.
Capacitor-less 1T-DRAM based on oxide semiconductors like IGZO could provide DRAM-like performance with a simple structure manufacturable in the back-end-of-line.
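As a rough illustration of what the ~3x density figure could mean for a stacked cache, the sketch below scales the 64-128MB SRAM V-Cache capacities mentioned earlier by that factor; the real gain depends on macro design, ECC, and peripheral overhead, so treat the numbers as indicative only.

```python
# Illustrative only: scale the stacked-SRAM cache capacities quoted for
# 3D V-Cache by the ~3x MRAM-vs-SRAM density figure from the text.
# Real gains depend on macro design, ECC, and peripheral circuitry.

mram_density_gain = 3.0                 # ~3x bit density vs SRAM

for sram_mb in (64, 128):               # baseline stacked SRAM capacities
    mram_mb = sram_mb * mram_density_gain
    print(f"{sram_mb} MB SRAM footprint -> ~{mram_mb:.0f} MB of MRAM")
```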
Comparative analysis shows that different applications place different requirements on the memory hierarchy and on 3D interconnect density (Figure 6). Mobile and server workloads can make good use of stacked L2/L3 caches with ~15-20μm-pitch interconnects. Graphics and gaming applications demand higher bandwidth, driving 3D interconnects with ~750K connections at a 2-4μm pitch for stacking L1/L2 caches and memory.
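The bandwidth implication of such dense interfaces can be estimated with a simple model: the connection count scales as bonded area divided by pitch squared, and aggregate bandwidth is the product of the data-carrying connections and the per-connection signaling rate. In the sketch below, the pitch and connection count echo the figures above, while the bonded area, data-signal fraction, and per-connection rate are assumptions for illustration only.

```python
# Rough model linking hybrid-bond pitch, bonded area, and aggregate
# 3D bandwidth. Pitch and connection count echo the figures in the
# text; area, data-signal fraction, and per-pad rate are assumptions.

pitch_um = 3.0            # assumed mid-range bond pitch (um)
bonded_area_mm2 = 6.75    # assumed interface area, chosen so the count
                          # lands near the ~750K figure quoted above

# Square grid: one connection per pitch-by-pitch cell.
connections = (bonded_area_mm2 * 1e6) / (pitch_um ** 2)

data_fraction = 0.5       # assumed share of pads carrying data signals
per_pad_gbps = 1.0        # assumed signaling rate per pad (Gb/s)

aggregate_tbps = connections * data_fraction * per_pad_gbps / 1000
print(f"~{connections / 1e3:.0f}K connections, "
      f"~{aggregate_tbps:.0f} Tb/s (~{aggregate_tbps / 8:.0f} TB/s aggregate)")
```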
Main Memory Trends
The main memory in HPC systems today is dominated by DRAM. While conventional DRAM scaling is slowing down, several alternatives are being explored to extend the roadmap (Figures 7-9):
Sub-20nm class DRAMs are pushing the limits of 2D scaling with EUV patterning
3D DRAM with multiple tiers can provide continued scaling while alleviating lithography challenges
Capacitor-less 1T-DRAM using oxide semiconductors like IGZO could simplify the cell structure
Ferroelectric capacitors using newly discovered doped HfO2 materials are a promising option to replace the dielectric in conventional 1T1C DRAM
An emerging alternative to DRAM for main memory is storage class memory (SCM) accessed over a cache-coherent interconnect such as CXL. By disaggregating and pooling SCM, multiple compute nodes can share access to a large memory space at high bandwidth. Two leading SCM technologies are described below, followed by a simple latency sketch:
Phase change memory (PCM): PCM exploits the resistivity difference between amorphous and crystalline states in chalcogenide materials. 3D XPoint, a 1 PCM cell + 1 OTS selector architecture developed by Intel and Micron, was the first SCM to market. However, it struggled with high write energy and limited endurance.
OTS selector-based memory: Ovonic threshold switching (OTS) devices exhibit a volatile, highly non-linear I-V characteristic that lets them serve as built-in selectors. By using the threshold switching effect in the OTS itself as the memory mechanism, a self-selecting 1S1R cell can be built with excellent density, latency, and energy efficiency (Figure 36). The main challenge is achieving sufficient retention time, since the stored state is volatile. Nevertheless, several companies have demonstrated impressive OTS memory array results (Figure 10).
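A simple way to reason about a CXL-attached SCM tier is an average-latency model analogous to AMAT for caches: the penalty of the slower tier is weighted by how often requests miss the local DRAM tier. The latency numbers in the sketch below are placeholders chosen only to show the shape of the trade-off, not figures from the source.

```python
# Two-tier main-memory latency model: local DRAM in front of a
# CXL-attached SCM pool. All latencies are illustrative placeholders.

def avg_latency_ns(dram_hit_ratio: float,
                   dram_ns: float = 100.0,    # assumed local DRAM latency
                   cxl_scm_ns: float = 400.0  # assumed CXL-attached SCM latency
                   ) -> float:
    """Average access latency weighted by where requests are served."""
    return dram_hit_ratio * dram_ns + (1.0 - dram_hit_ratio) * cxl_scm_ns

for hit in (0.95, 0.90, 0.80):
    print(f"DRAM-tier hit ratio {hit:.0%}: ~{avg_latency_ns(hit):.0f} ns on average")
```

Under these placeholder numbers, keeping 90-95% of accesses in the local DRAM tier limits the average penalty to a few tens of nanoseconds, which is one reason data placement between tiers matters so much for pooled SCM.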
Summary
The chiplet revolution is opening the door for the integration of emerging memories in novel ways across the memory hierarchy. In the near term, spin-based memories like STT-MRAM and SOT-MRAM are poised to replace SRAM in last-level caches, offering significant density advantages. High-bandwidth 3D interconnects are enabling tighter integration of these caches with compute chiplets.
For main memory, while DRAM will remain dominant for years to come, capacitor-less 1T-DRAM and ferroelectric DRAM are promising options to simplify the cell structure and enable further scaling. Storage class memories like OTS memory, accessed via CXL, may disrupt the traditional DRAM market if they can meet performance and cost targets.
3D integration also has implications for storage class memories. For example, a 3D NAND flash array with an integrated ferroelectric FET as a built-in selector could reduce programming voltages and shrink the cell size.
The field of emerging memories is rapidly evolving and the chiplet paradigm is providing an ideal platform to integrate them in new ways. Memory architects must carefully evaluate the density, latency, bandwidth, endurance, and cost of these new technologies to build balanced HPC memory hierarchies that can keep pace with skyrocketing compute performance.
Reference
[1] S. Couet and G. S. Kar, "Does Chiplets Open the Space for Emerging Memory in the HPC System?" in Proc. IEEE International Solid-State Circuits Conference (ISSCC), Leuven, Belgium, 2024.