Abstract
Machine learning has emerged as a dominant technology in recent years, driving the development of photonic systems that implement neural network tasks. These photonic systems offer high-throughput operation with low power consumption, but require high data bandwidth to operate at maximum efficiency. In this tutorial, we explore the data bandwidths required by photonic neural network accelerators, which can reach nearly 1 Tbps for a single chip, necessitating the use of high-bandwidth memory.
Introduction to Photonic Tensor Cores
Neural networks have two basic components for computation: linear vector-vector or matrix-vector multiplication, and a non-linear activation function. Photonic processors have been designed and demonstrated specifically to implement these compute tasks, forming application-specific integrated circuits (ASICs) known as photonic tensor cores (PTCs).
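Functionally, a PTC evaluates exactly this linear-plus-nonlinear step. The short NumPy sketch below is purely illustrative: on real hardware the multiply-accumulate is carried out in the analog optical domain, and the choice of ReLU as the activation is our assumption, not something specified by the paper.

import numpy as np

def ptc_layer(weights, x):
    # Linear matrix-vector product (performed optically in a PTC)
    y = weights @ x
    # Non-linear activation (ReLU chosen here for illustration)
    return np.maximum(y, 0)

x = np.array([0.2, 0.7, 0.1])                        # 3-element input vector
W = np.random.default_rng(0).uniform(-1, 1, (3, 3))  # 3x3 kernel
print(ptc_layer(W, x))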
For optimal integration with electronic compute architectures, PTCs operate as photonic black boxes, where the internal workings are in the optical and analog domain, but the input/output (I/O) uses standard digital electronic signals.
Data Rates on a Single PTC
To make full use of the optical components of a PTC, it is crucial to understand the data rates required by its two operational modes: training and inference.
In inference mode, where the kernel weights are fixed and only the input changes, the slower of the input Mach-Zehnder modulators (MZMs) and output photodetectors (PDs) sets the operating frequency (f). Multiplying f by the bit depth (b) of each component and summing over all components gives the total data rate, as shown in Equation 1.
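Although Equation 1 is not reproduced here, the description above suggests a form along these lines (the symbols N_MZM, b_MZM, N_PD, and b_PD are our own shorthand, not necessarily the paper's notation):

DR_inference = f * (N_MZM * b_MZM + N_PD * b_PD),  with  f = min(f_MZM, f_PD)

where N_MZM and N_PD are the numbers of input modulators and output photodetectors, and b_MZM and b_PD are their bit depths.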
In training mode, specifically stochastic gradient descent (SGD), where the kernel is updated in every round, Equation 1 still applies with a third term added for the kernel weights.
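Under the same assumed notation, the SGD variant adds that kernel term:

DR_SGD = f * (N_MZM * b_MZM + N_PD * b_PD + N_W * b_W)

with N_W the number of kernel weights and b_W their bit depth.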
For batch training, where the kernel is unchanged for all but one sample, Equation 2 applies: the accelerator operates as in inference mode for k-1 samples and then updates all components on the final sample. As the batch size increases, the required data rate approaches that of inference.
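Reading Equation 2 as the batch-weighted average of the two cases above (again an assumed form, not necessarily the paper's exact expression), for a batch of k samples:

DR_batch = ((k - 1) * DR_inference + DR_SGD) / k

which tends toward DR_inference as k grows.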
For the PTC chip shown in Figure 1A, which implements a 3x3 kernel with a 3-input vector and 3-output vector (Figure 1B), an estimated 0.68 TOPS of multiply-and-accumulate operations can be reached at 6-bit depth. Using the system parameters, the required data rates are 840 Gbps for inference and 975 Gbps for SGD training (Figures 1C and 1D), while batch/mini-batch training falls between these two rates.
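The sketch below implements this bookkeeping in Python; the operating frequency and component counts are illustrative placeholders and are not the parameters behind the 840 Gbps and 975 Gbps figures.

def inference_rate(f_hz, n_mzm, n_pd, b):
    # Eq. 1 (assumed form): every input MZM and output PD moves b bits per cycle
    return f_hz * b * (n_mzm + n_pd)

def sgd_rate(f_hz, n_mzm, n_pd, n_weights, b):
    # Adds a third term for kernel weights that are rewritten every cycle
    return f_hz * b * (n_mzm + n_pd + n_weights)

def batch_rate(f_hz, n_mzm, n_pd, n_weights, b, k):
    # Eq. 2 (assumed form): k-1 inference-like samples plus one full update
    inf = inference_rate(f_hz, n_mzm, n_pd, b)
    sgd = sgd_rate(f_hz, n_mzm, n_pd, n_weights, b)
    return ((k - 1) * inf + sgd) / k

# Hypothetical 3x3 PTC: 3 input MZMs, 3 output PDs, 9 kernel weights, 6-bit depth
f = 20e9  # placeholder component rate, not the value used in the paper
print(inference_rate(f, 3, 3, 6) / 1e9, "Gbps (inference)")
print(sgd_rate(f, 3, 3, 9, 6) / 1e9, "Gbps (SGD training)")
print(batch_rate(f, 3, 3, 9, 6, k=32) / 1e9, "Gbps (mini-batch, k=32)")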
Chiplet Scaling
To explore an expanded system with chiplets and find the required data movement on the memory side, a SCALE-Sim simulation was implemented for a simple MNIST architecture (Figure 2A). The simulation results provide the number of memory accesses, utilization, total cycles required, and other relevant information.
Using these results, along with the operating frequency and bit depth, the latency for operation on the architecture and the total required bandwidth from memory can be computed (Figures 2B and 2D).
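A minimal sketch of that post-processing is shown below, assuming the simulator reports the total cycle count and the total number of memory accesses for a run; the function and parameter names are illustrative and are not part of SCALE-Sim's API.

def latency_and_bandwidth(total_cycles, total_mem_accesses, f_hz, bit_depth):
    # One simulator cycle corresponds to one step of the PTC at frequency f
    latency_s = total_cycles / f_hz
    # Each memory access moves one word of bit_depth bits
    total_bits = total_mem_accesses * bit_depth
    bandwidth_bps = total_bits / latency_s
    return latency_s, bandwidth_bps

# Example with made-up numbers, not the Figure 2 results
lat, bw = latency_and_bandwidth(total_cycles=120_000,
                                total_mem_accesses=500_000,
                                f_hz=10e9, bit_depth=6)
print(f"latency = {lat * 1e6:.1f} us, required bandwidth = {bw / 1e9:.1f} Gbps")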
Architectures with 8 and 16 chiplets in different configurations were tested to match the input and kernel feature mapping of the neural network. By comparing the latency and bandwidth requirements for all configurations, an optimization model for balancing the system can be created (Figure 2C).
The results show that increasing the number of chiplets increases the data requirement roughly exponentially while reducing the latency only linearly, producing elbow points in the trade-off curve where latency and required data throughput are balanced.
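One simple way to pick such a balanced configuration is to normalize both metrics and choose the point closest to the ideal corner; this is only a sketch of the idea, not the optimization model used in the paper, and the sweep data below are made up.

import numpy as np

def elbow_config(latencies, bandwidths):
    # Normalize each metric to [0, 1] and pick the config nearest the utopia point
    lat = np.asarray(latencies, dtype=float)
    bw = np.asarray(bandwidths, dtype=float)
    lat_n = (lat - lat.min()) / (lat.max() - lat.min())
    bw_n = (bw - bw.min()) / (bw.max() - bw.min())
    return int(np.argmin(np.hypot(lat_n, bw_n)))

# Hypothetical sweep over chiplet counts (not the Figure 2C data)
chiplets = [1, 2, 4, 8, 16]
latency_us = [40.0, 21.0, 11.5, 6.8, 4.9]
bandwidth_gbps = [60, 130, 290, 650, 1500]
best = elbow_config(latency_us, bandwidth_gbps)
print("balanced configuration:", chiplets[best], "chiplets")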
Conclusions
In this article, we have explored a practical implementation of a photonic neural network accelerator, the photonic tensor core (PTC). We have analyzed the data rates required to operate a single PTC with its optical components fully utilized, reaching nearly 1 Tbps for a single chip.
Additionally, we have examined the requirements on the memory side for a scaled chiplet architecture using the PTC, highlighting the need for high bandwidth memory integration to support the data throughput demands of these efficient photonic neural network accelerators.
As machine learning continues to advance, the development of photonic systems like PTCs will play a crucial role in enabling high throughput operations with low power consumption, revolutionizing the field of neural network accelerators.
Reference
[1] R. L. T. Schwartz, B. Jahannia, N. Peserico, H. Dalir, and V. J. Sorger, "Data Throughput for Efficient Photonic Neural Network Accelerators," in Proc. of IEEE International Conference on Neural Networks, Gainesville, FL, USA, 2024, pp. 1-6.