IBM’s Co-Packaged Optics (CPO) Technology for Training and Running Generative AI Models in Data Centers and Other Computing Applications

Introduction

Generative artificial intelligence is driving a fundamental transformation in data center technology. Approximately three-quarters of data center traffic now stays inside the data center, which has sharply increased demand for high-speed data transmission. Traditional copper cables have long been the foundation for data transfer, but signal attenuation increasingly limits them over long distances. Co-packaged optics (CPO) addresses this limitation by reshaping how interconnect bandwidth density and energy efficiency are achieved [1].

Evolution of Computing Performance

Over the past two decades, computing performance has surged by an astonishing factor of 60,000, thanks to the continued scaling predicted by Moore’s law. However, a significant gap has emerged: I/O bandwidth has only improved by a factor of 30 in the same period. This growing discrepancy between computational capacity and data transmission capability has become a major challenge in modern data centers.
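
To make the size of this gap concrete, the short Python sketch below divides the two growth factors quoted above and converts each into an implied compound annual growth rate. The 20-year span is an assumption (a literal reading of "two decades"), so the annual rates are rough indications rather than reported figures.

```python
# Growth factors quoted in the text above.
COMPUTE_GROWTH = 60_000  # computing performance
IO_GROWTH = 30           # I/O bandwidth
YEARS = 20               # assumption: "two decades" read as exactly 20 years


def annual_rate(total_factor: float, years: float) -> float:
    """Compound annual growth rate implied by a total growth factor."""
    return total_factor ** (1.0 / years) - 1.0


gap = COMPUTE_GROWTH / IO_GROWTH
compute_rate = annual_rate(COMPUTE_GROWTH, YEARS)
io_rate = annual_rate(IO_GROWTH, YEARS)

print(f"compute outgrew I/O by a factor of {gap:,.0f}")
print(f"implied annual growth: compute {compute_rate:.0%}, I/O {io_rate:.0%}")
```

Under these figures, compute has outgrown I/O bandwidth by a factor of 2,000.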

Figure 1 illustrates the stark contrast in the expansion of hardware computing performance (HW FLOPS) and interconnect bandwidth from 1996 to 2023, highlighting the widening communication gap in modern computing systems.

Impact on AI Model Training

The limitations of current network infrastructure severely affect the efficiency of AI model training. Recent studies indicate that the network often becomes the bottleneck in GPU training, with about one-third of users experiencing GPU utilization below 15%. The cost is significant: training a single GPT-4 model consumes around 50 GWh of electricity, which underscores the urgent need for more efficient solutions.

Figure 2 displays changes in training throughput under different tensor parallelism (TP) configurations, with batch sizes of 1 and 2, showing up to a 5x performance drop due to communication bottlenecks.

CPO Technology Innovations

IBM’s breakthroughs in CPO technology have led to several major advancements in photonic integration. By significantly shortening electrical circuit lengths and leveraging advanced packaging technologies, this innovation has made substantial progress in addressing both bandwidth density and energy efficiency challenges.

Figure 3 provides top-down and cross-sectional views of key components such as the substrate, photonic integrated chip (PIC), and waveguide, showcasing the complex integration within CPO design.

Advanced Module Design

The integrated module design combines optical and electronic components, including a PIC (8 mm × 10 mm), a substrate (17 mm × 17 mm), and waveguides less than 12 mm long. This level of integration represents a significant step forward in packaging density and efficiency.
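
The waveguide fan-out described for Figure 4 in this section (50 μm pitch widening to 250 μm pitch) can be sketched geometrically. The Python example below is only an illustration: the channel count and the fan-out length are hypothetical values not given in the text, and each channel is treated as a straight line, whereas a real waveguide layout would likely use curved bends.

```python
import math

PITCH_IN_UM = 50.0    # pitch at the narrow end (from the Figure 4 description)
PITCH_OUT_UM = 250.0  # pitch at the wide end (from the Figure 4 description)
N_CHANNELS = 8        # hypothetical channel count, not given in the text
FANOUT_LEN_MM = 10.0  # hypothetical fan-out length, chosen to stay under 12 mm


def lateral_shift_um(index: int, n: int = N_CHANNELS) -> float:
    """Sideways displacement of channel `index` between the two ends,
    with both channel arrays centred on the same axis."""
    offset_from_centre = index - (n - 1) / 2
    return offset_from_centre * (PITCH_OUT_UM - PITCH_IN_UM)


shifts = [lateral_shift_um(i) for i in range(N_CHANNELS)]
worst = max(abs(s) for s in shifts)
# Straight-line approximation of the outermost channel's angle to the axis.
angle_deg = math.degrees(math.atan(worst / (FANOUT_LEN_MM * 1000.0)))

print("lateral shift per channel (um):", shifts)
print(f"outermost channel: {worst:.0f} um sideways, about {angle_deg:.1f} degrees")
```

With these assumed values the outermost channels move 700 μm sideways, which shows why the available waveguide length matters: a shorter fan-out region forces steeper routing for the same pitch change.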

Figure 4 presents an example of a rectangular version of the polymer waveguide (PWG), where waveguide channels fan out from a 50 μm pitch to a 250 μm pitch, demonstrating precise waveguide design.

Implementation and Assembly

The optical test vehicles (OTV-1a and OTV-1b) demonstrate the precise integration of optical and electronic components. The assembly process utilizes lead-free flip-chip technology and micro-BGA card connections, marking a significant advancement in manufacturing techniques. With meticulous design and optimization, the technology achieves minimal insertion loss variation across multiple reflow cycles.

Figure 5 shows CAD images depicting top and bottom views of the assembly, including the PIC, PWG, connectors, and substrate, as well as micro-BGA integration and substrate cutout design.

Figure 6 presents actual photos of the assembled modules realized in OTV-1b, displaying both top and bottom views of the integrated components.

Performance Achievements and Reliability

Current CPO technology exhibits substantial improvements over traditional methods. Bandwidth density has increased from 0.15–0.25 Tbps/mm to 2–10 Tbps/mm, and with optimization across 4–16 wavelengths, future targets range from 20–80 Tbps/mm. Interface density between PICs and waveguides has been enhanced sixfold. Rigorous testing has confirmed reliability through multiple reflow cycles, environmental testing, thermal cycling from -40°C to +125°C, and 1000-hour damp heat testing at 85°C and 85% relative humidity.
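
Because each bandwidth density figure above is a range, the improvement is also a range. The Python sketch below computes the smallest and largest improvement factors implied by the quoted numbers; it uses only values from the text, and the helper name `improvement_range` is introduced here purely for illustration.

```python
# Bandwidth density ranges quoted in the text, in Tbps/mm.
TRADITIONAL = (0.15, 0.25)
CURRENT_CPO = (2.0, 10.0)
FUTURE_TARGET = (20.0, 80.0)


def improvement_range(old, new):
    """Smallest and largest ratio between two (low, high) ranges."""
    return new[0] / old[1], new[1] / old[0]


cpo_lo, cpo_hi = improvement_range(TRADITIONAL, CURRENT_CPO)
fut_lo, fut_hi = improvement_range(CURRENT_CPO, FUTURE_TARGET)

print(f"current CPO vs traditional: {cpo_lo:.0f}x to {cpo_hi:.0f}x")
print(f"future target vs current CPO: {fut_lo:.0f}x to {fut_hi:.0f}x")
```

Comparing the low end of the new range with the high end of the old one gives the conservative bound, so the current improvement over traditional methods is at least 8x.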

Future Development

Next-generation CPO technology focuses on advancing several critical areas. Development efforts target waveguide spacing under 20 μm, increased waveguide channel density, and enhanced multi-wavelength compatibility. The technology roadmap includes multilayer connector/termination assembly schemes and improved manufacturing processes. These innovations aim to further reduce energy consumption while enhancing performance.

The successful integration of optical and electronic components, backed by reliability demonstrated in rigorous testing, makes CPO a pivotal enabling technology for next-generation high-performance computing systems. Its improvements in bandwidth density and energy efficiency offer a scalable path for addressing both current data center challenges and anticipated future computational demands.

Reference

[1] J. Knickerbocker, J. B. Heroux, G. Bonilla, H. Hsu, N. Liu, A. P. Ramos, F. Arguin, Y. Tribodeau, B. Terjani, M. Schultz, R. K. Ganti, L. Chu, C. Marushima, Y. Taira, S. Kohara, A. Horibe, H. Mori, and H. Numata, "Next generation Co-Packaged Optics Technology to Train & Run Generative AI Models in Data Centers and other computing applications," Technical Report, IBM Research, 2024.
