11  AI Acceleration


DALL·E 3 Prompt: Create an intricate and colorful representation of a System on Chip (SoC) design in a rectangular format. Showcase a variety of specialized machine learning accelerators and chiplets, all integrated into the processor. Provide a detailed view inside the chip, highlighting the rapid movement of electrons. Each accelerator and chiplet should be designed to interact with neural network neurons, layers, and activations, emphasizing their processing speed. Depict the neural networks as a network of interconnected nodes, with vibrant data streams flowing between the accelerator pieces, showcasing the enhanced computation speed.

Purpose

How does hardware acceleration impact machine learning system performance, and what principles should ML engineers understand to effectively design and deploy systems?

Machine learning systems have driven a fundamental shift in computer architecture. Traditional processors, designed for general-purpose computing, prove inefficient for the repeated mathematical operations and data movement patterns in neural networks. Modern accelerators address this challenge by matching hardware structures to ML computation patterns. These accelerators introduce fundamental trade-offs in performance, power consumption, and flexibility. Effective utilization of hardware acceleration requires an understanding of these trade-offs, as well as the architectural principles that govern accelerator design. By learning to map models effectively onto specific hardware platforms, engineers can balance computational efficiency against power, cost, and flexibility constraints.

Learning Objectives
  • Understand the historical context of hardware acceleration.

  • Identify key AI compute primitives and their role in model execution.

  • Explain the memory hierarchy and its impact on AI accelerator performance.

  • Describe strategies for mapping neural networks to hardware.

  • Analyze the role of compilers and runtimes in optimizing AI workloads.

  • Compare single-chip and multi-chip AI architectures.

11.1 Overview

Machine learning has driven a fundamental shift in computer architecture, pushing beyond traditional general-purpose processors toward specialized acceleration. The computational demands of modern machine learning models exceed the capabilities of conventional CPUs, which were designed for sequential execution. Instead, machine learning workloads exhibit massive parallelism, high memory bandwidth requirements, and structured computation patterns that demand purpose-built hardware for efficiency and scalability. Machine Learning Accelerators (ML Accelerators) have emerged as a response to these challenges.

Definition of ML Accelerator

Machine Learning Accelerator (ML Accelerator) refers to specialized computing hardware designed to efficiently execute machine learning workloads. These accelerators optimize matrix multiplications, tensor operations, and data movement, enabling high-throughput and energy-efficient computation. ML accelerators operate at various power and performance scales, ranging from edge devices with milliwatt-level consumption to data center-scale accelerators requiring kilowatts of power. They are specifically designed to address the computational and memory demands of machine learning models, often incorporating optimized memory hierarchies, parallel processing units, and custom instruction sets to maximize performance. ML accelerators are widely used in training, inference, and real-time AI applications across cloud, edge, and embedded systems.

Unlike CPUs and GPUs, which were originally designed for general-purpose computing and graphics, ML accelerators are optimized for tensor operations, matrix multiplications, and memory-efficient execution—the core computations that drive deep learning. These accelerators span a wide range of power and performance envelopes, from energy-efficient edge devices to large-scale data center accelerators. Their architectures integrate custom processing elements, optimized memory hierarchies, and domain-specific execution models, enabling high-performance training and inference.

As ML models have grown in size and complexity, hardware acceleration has evolved to keep pace. The shift from von Neumann architectures1 to specialized accelerators reflects a broader trend in computing: reducing the cost of data movement, increasing parallelism, and tailoring hardware to domain-specific workloads. Moving data across memory hierarchies often consumes more energy than computation itself, making efficient memory organization and computation placement critical to overall system performance.

1 von Neumann Architecture: A computing model where programs and data share the same memory, leading to a bottleneck in data transfer between the processor and memory, known as the von Neumann bottleneck.

This chapter explores AI acceleration from a systems perspective, examining how computational models, hardware optimizations, and software frameworks interact to enable efficient execution. It covers key operations like matrix multiplications and activation functions, the role of memory hierarchies in data movement, and techniques for mapping neural networks to hardware. The discussion extends to compilers, scheduling strategies, and runtime optimizations, highlighting their impact on performance. Finally, it addresses the challenges of scaling AI systems from single-chip accelerators to multi-chip and distributed architectures, integrating real-world examples to illustrate effective AI acceleration.

11.2 Hardware Evolution

The progression of computing architectures follows a recurring pattern: as computational workloads grow in complexity, general-purpose processors become increasingly inefficient, prompting the development of specialized hardware accelerators. This transition is driven by the need for higher computational efficiency, reduced energy consumption, and optimized execution of domain-specific workloads. Machine learning acceleration is the latest stage in this ongoing evolution, following a well-established trajectory observed in prior domains such as floating-point arithmetic, graphics processing, and digital signal processing.

At the heart of this transition is hardware specialization, which enhances performance and efficiency by optimizing frequently executed computational patterns through dedicated circuit implementations. While this approach leads to significant gains, it also introduces trade-offs in flexibility, silicon area utilization, and programming complexity. As computing demands continue to evolve, specialized accelerators must balance these factors to deliver sustained improvements in efficiency and performance.

Building on this historical trajectory, the evolution of hardware specialization provides a foundational perspective for understanding modern machine learning accelerators. Many of the principles that shaped the development of early floating-point and graphics accelerators now inform the design of AI-specific hardware. Examining these past trends offers a systematic framework for analyzing contemporary approaches to AI acceleration and anticipating future developments in specialized computing.

11.2.1 Specialized Computing

The transition toward specialized computing architectures arises from the fundamental limitations of general-purpose processors. Early computing systems relied on central processing units (CPUs) to execute all computational tasks sequentially, following a one-size-fits-all approach. However, as computing workloads diversified and grew in complexity, certain operations—particularly floating-point arithmetic—emerged as critical performance bottlenecks that could not be efficiently handled by CPUs alone. These fundamental inefficiencies prompted the development of specialized hardware architectures designed to accelerate specific computational patterns (Flynn 1966).

Flynn, M. J. 1966. “Very High-Speed Computing Systems.” Proceedings of the IEEE 54 (12): 1901–9. https://doi.org/10.1109/proc.1966.5273.
Fisher, Lawrence D. 1981. “The 8087 Numeric Data Processor.” IEEE Computer 14 (7): 19–29. https://doi.org/10.1109/MC.1981.1653991.

One of the earliest examples of hardware specialization was the Intel 8087 mathematics coprocessor, introduced in 1980. This floating-point unit (FPU) was designed to offload arithmetic-intensive computations from the main CPU, dramatically improving performance for scientific and engineering applications. The 8087 demonstrated unprecedented efficiency, achieving performance gains of up to 100× for floating-point operations compared to software-based implementations on general-purpose processors (Fisher 1981). This milestone established a fundamental principle in computer architecture: carefully designed hardware specialization could provide order-of-magnitude improvements for well-defined, computationally intensive tasks.

The success of floating-point coprocessors led to their eventual integration into mainstream processors. For example, the Intel 486DX, released in 1989, incorporated an on-chip floating-point unit, eliminating the need for an external coprocessor. This integration not only improved processing efficiency but also marked a recurring pattern in computer architecture: successful specialized functions tend to become standard features in future generations of general-purpose processors (Patterson and Hennessy 2021).

Patterson, David A., and John L. Hennessy. 2021. Computer Organization and Design: The Hardware/Software Interface. 5th ed. Morgan Kaufmann.

The principles established through early floating-point acceleration continue to influence modern hardware specialization. These include:

  1. Identification of computational bottlenecks through workload analysis
  2. Development of specialized circuits for frequent operations
  3. Creation of efficient hardware-software interfaces
  4. Progressive integration of proven specialized functions

This progression from domain-specific specialization to general-purpose integration has played a central role in shaping modern computing architectures. As computational workloads expanded beyond arithmetic operations, these same fundamental principles were applied to new domains, such as graphics processing, digital signal processing, and ultimately, machine learning acceleration. Each of these domains introduced specialized architectures tailored to their unique computational requirements, establishing hardware specialization as a cornerstone strategy for advancing computing performance and efficiency in increasingly complex workloads.

The evolution of specialized computing hardware follows a well-defined trajectory, where architectural innovations arise to meet computational bottlenecks and gradually integrate into broader computing ecosystems. Figure 11.1 illustrates key milestones in this progression, highlighting how each computing era introduced accelerators optimized for dominant workloads.

Figure 11.1: Evolution of specialized computing hardware.

11.2.2 Expanding Specialized Computing

The principles established through floating-point acceleration provided a blueprint for addressing emerging computational challenges. As computing applications diversified, new computational patterns emerged that exceeded the capabilities of general-purpose processors. This expansion of specialized computing manifested across multiple domains, each contributing unique insights to hardware acceleration strategies.

Graphics processing emerged as a significant driver of hardware specialization in the 1990s. Early graphics accelerators focused on specific operations like bitmap transfers and polygon filling. The introduction of programmable graphics pipelines with NVIDIA’s GeForce 256 in 1999 represented a crucial advancement in specialized computing. Graphics Processing Units (GPUs) demonstrated how parallel processing architectures could efficiently handle data-parallel workloads. For example, in 3D rendering tasks like texture mapping and vertex transformation, GPUs achieved 50-100\(\times\) speedups over CPU implementations. By 2004, GPUs could process over 100 million polygons per second—tasks that would overwhelm even the fastest CPUs of the time (Owens et al. 2008).

Lyons, Richard G. 2011. Understanding Digital Signal Processing. 3rd ed. Prentice Hall.

Digital Signal Processing (DSP) represents another fundamental domain of hardware specialization. DSP processors introduced architectural innovations specifically designed for efficient signal processing operations. These included specialized multiply-accumulate units, circular buffers, and parallel data paths optimized for filtering and transform operations. Texas Instruments’ TMS32010, introduced in 1983, established how domain-specific instruction sets and memory architectures could dramatically improve performance for signal processing applications (Lyons 2011).

Network processing introduced additional patterns of specialization. Network processors developed unique architectures to handle packet processing at line rate, incorporating multiple processing cores, specialized packet manipulation units, and sophisticated memory management systems. Intel’s IXP2800 network processor demonstrated how multiple levels of hardware specialization could be combined to address complex processing requirements.

These diverse domains of specialization shared several common themes:

  1. Identification of domain-specific computational patterns
  2. Development of specialized processing elements and memory hierarchies
  3. Creation of domain-specific programming models
  4. Progressive evolution toward more flexible architectures

This period of expanding specialization demonstrated that hardware acceleration strategies could successfully address diverse computational requirements. The lessons learned from these domains would prove crucial for the development of modern accelerators, particularly in the emerging field of machine learning computation.

11.2.3 Domain-Specific Architectures

The emergence of domain-specific architectures (DSA) marks a fundamental shift in computer system design, driven by two key factors: the breakdown of traditional scaling laws and the increasing computational demands of specialized workloads. The slowdown of Moore’s Law2—which had previously guaranteed predictable improvements in transistor density every 18-24 months—and the end of Dennard scaling3—which had allowed frequency increases without proportional power increases—created a critical performance and efficiency bottleneck in general-purpose computing. As John Hennessy and David Patterson noted in their 2017 Turing Lecture (Hennessy and Patterson 2019), these limitations signaled the onset of a new era in computer architecture—one centered on domain-specific solutions that optimize hardware for specialized workloads.

2 Moore’s Law: An observation that the number of transistors on a chip doubles approximately every 18-24 months, an insight first articulated by Gordon Moore in 1965.

3 Dennard Scaling: The principle that as transistors get smaller, their power density remains constant, allowing operating frequencies to increase without a proportional rise in power consumption.

Hennessy, John L., and David A. Patterson. 2019. “A New Golden Age for Computer Architecture.” Communications of the ACM 62 (2): 48–60. https://doi.org/10.1145/3282307.

Historically, improvements in processor performance relied on semiconductor process scaling and increasing clock speeds. However, as power density limitations restricted further frequency scaling, and as transistor miniaturization faced increasing physical and economic constraints, architects were forced to explore alternative approaches to sustain computational growth. The result was a shift toward domain-specific architectures, which dedicate silicon resources to optimize computation for specific application domains, trading flexibility for efficiency. Domain-specific architectures achieve superior performance and energy efficiency through several key principles:

  1. Customized datapaths: Design processing paths specifically optimized for target application patterns, enabling direct hardware execution of common operations. For example, matrix multiplication units in AI accelerators implement systolic arrays tailored for neural network computations.

  2. Specialized memory hierarchies: Optimize memory systems around domain-specific access patterns and data reuse characteristics. This includes custom cache configurations, prefetching logic, and memory controllers tuned for expected workloads.

  3. Reduced instruction overhead: Implement domain-specific instruction sets that minimize decode and dispatch complexity by encoding common operation sequences into single instructions. This improves both performance and energy efficiency.

  4. Direct hardware implementation: Create dedicated circuit blocks that natively execute frequently used operations without software intervention. This eliminates instruction processing overhead and maximizes throughput.

Perhaps the best-known example of success in domain-specific architectures is the modern smartphone. Introduced in the late 2000s, modern smartphones can decode 4K video at 60 frames per second while consuming just a few watts of power—even though video processing requires billions of operations per second. This remarkable efficiency is achieved through dedicated hardware video codecs that implement industry standards such as H.264/AVC (introduced in 2003) and H.265/HEVC (finalized in 2013) (Sullivan et al. 2012). These specialized circuits offer 100–1000\(\times\) improvements in both performance and power efficiency compared to software-based decoding on general-purpose processors.

Sullivan, Gary J., Jens-Rainer Ohm, Woo-Jin Han, and Thomas Wiegand. 2012. “Overview of the High Efficiency Video Coding (HEVC) Standard.” IEEE Transactions on Circuits and Systems for Video Technology 22 (12): 1649–68. https://doi.org/10.1109/tcsvt.2012.2221191.
Shang, J., G. Wang, and Y. Liu. 2018. “Accelerating Genomic Data Analysis with Domain-Specific Architectures.” IEEE Transactions on Computers 67 (7): 965–78. https://doi.org/10.1109/TC.2018.2799212.
Bedford Taylor, Michael. 2017. “The Evolution of Bitcoin Hardware.” Computer 50 (9): 58–66. https://doi.org/10.1109/mc.2017.3571056.

The trend toward specialization continues to accelerate, with new architectures emerging for an expanding range of domains. Genomics processing, for example, benefits from custom accelerators that optimize sequence alignment and variant calling, reducing the time required for DNA analysis (Shang, Wang, and Liu 2018). Similarly, blockchain computation has given rise to application-specific integrated circuits (ASICs) optimized for cryptographic hashing, dramatically increasing the efficiency of mining operations (Bedford Taylor 2017). These examples illustrate that domain-specific architecture is not merely a transient trend but a fundamental transformation in computing systems, offering tailored solutions that address the growing complexity and diversity of modern computational workloads.

11.2.4 ML as a Computational Domain

Machine learning has emerged as one of the most computationally demanding fields, demonstrating the need for dedicated hardware that targets its unique characteristics. Domain-specific architectures—once developed for video codecs or other specialized tasks—have now expanded to meet the challenges posed by ML workloads. These specialized designs optimize the execution of dense matrix operations and manage data movement efficiently, a necessity given the inherent memory bandwidth4 limitations.

4 Memory Bandwidth: The rate at which data can be read from or written to memory by a processor, influencing performance in data-intensive operations.

A key distinction in ML is the differing requirements between training and inference. Training demands both forward and backward propagation, with high numerical precision (e.g., FP32 or FP16) to ensure stable gradient updates and convergence, while inference can often operate at lower precision (e.g., INT8) without major accuracy loss. This variance not only drives the need for mixed-precision arithmetic hardware but also allows optimizations that improve throughput and energy efficiency—often achieving 4–8\(\times\) gains.
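
As a rough illustration of why lower precision pays off at inference time, the following NumPy sketch quantizes a weight matrix to INT8 using a single per-tensor scale. The quantization scheme, shapes, and random data are illustrative assumptions rather than a description of any particular accelerator; the point is the 4× smaller weight storage and the small numerical error it introduces.

# Illustrative INT8 weight quantization (symmetric, per-tensor scale).
import numpy as np

rng = np.random.default_rng(0)
weights_fp32 = rng.standard_normal((512, 256)).astype(np.float32)
inputs_fp32 = rng.standard_normal((256, 32)).astype(np.float32)

# Map the FP32 weight range onto signed 8-bit integers with one scale factor.
scale = np.abs(weights_fp32).max() / 127.0
weights_int8 = np.clip(np.round(weights_fp32 / scale), -127, 127).astype(np.int8)

# FP32 reference versus INT8 weights (dequantized here for clarity; hardware
# typically accumulates in wider integer registers and rescales per output).
reference = weights_fp32 @ inputs_fp32
approximate = (weights_int8.astype(np.float32) * scale) @ inputs_fp32

print("FP32 weight storage:", weights_fp32.nbytes, "bytes")  # 524288
print("INT8 weight storage:", weights_int8.nbytes, "bytes")  # 131072
print("max abs error:", float(np.max(np.abs(reference - approximate))))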

The computational foundation of modern ML accelerators is built on common patterns such as dense matrix multiplications and consistent data-flow patterns. These operations underpin architectures like GPUs with tensor cores and Google’s Tensor Processing Unit (TPU). While GPUs extended their original graphics capabilities to handle ML tasks via parallel execution and specialized memory hierarchies, TPUs take a more focused approach. For instance, the TPU’s systolic array architecture is tailored to excel at matrix multiplication, effectively aligning hardware performance with the mathematical structure of neural networks.

11.2.5 Application-specific ML Accelerators

The shift toward application-specific hardware is evident in how these accelerators are designed for both high-powered data centers and low-power edge devices. In data centers, powerful training accelerators can reduce model development times from weeks to days, thanks to their finely tuned compute engines and memory systems. Edge devices, by contrast, benefit from inference engines that deliver millisecond-level responses while consuming very little power.

The success of these dedicated solutions reinforces a broader trend—hardware specialization adapts to the computational demands of evolving applications. By focusing on the core operations of machine learning, from matrix multiplications to flexible numerical precision, application-specific accelerators ensure that systems remain efficient, scalable, and ready to meet future advancements.

The evolution of specialized hardware architectures illustrates a fundamental principle in computing systems: as computational patterns emerge and mature, hardware specialization follows to achieve optimal performance and energy efficiency. This progression is particularly evident in machine learning acceleration, where domain-specific architectures have evolved to meet the increasing computational demands of machine learning models. Unlike general-purpose processors, which prioritize flexibility, specialized accelerators optimize execution for well-defined workloads, balancing performance, energy efficiency, and integration with software frameworks.

Table 11.1 outlines key milestones in hardware specialization, highlighting how each computing era has produced accelerators tailored to dominant workloads. This historical trajectory provides context for the rise of AI accelerators, which follow similar design principles but must also integrate seamlessly with machine learning frameworks, compilers, and deployment environments to maximize efficiency.

Table 11.1: Evolution of hardware specialization across computing eras.
Era Computational Pattern Architecture Examples Key Characteristics
1980s Floating-Point & Signal Processing FPU, DSP Single-purpose engines; focused instruction sets; coprocessor interfaces
1990s 3D Graphics & Multimedia GPU, SIMD Units Many identical compute units; regular data patterns; wide memory interfaces
2000s Real-time Media Coding Media Codecs, Network Processors Fixed-function pipelines; high throughput processing; power-performance optimization
2010s Deep Learning Tensor Operations TPU, GPU Tensor Cores Matrix multiplication units; massive parallelism; memory bandwidth optimization
2020s Application-Specific Acceleration ML Engines, Smart NICs, Domain Accelerators Workload-specific datapaths; customized memory hierarchies; application-optimized designs

This historical progression reveals that hardware specialization is not a recent phenomenon but rather a consistent approach to improving computational efficiency. As new workloads become dominant, specialized architectures emerge to optimize their execution, balancing raw performance with power efficiency and software compatibility.

    In the case of AI acceleration, this transition has introduced challenges that extend well beyond the confines of hardware design. Machine learning accelerators must integrate seamlessly into comprehensive ML workflows by aligning with optimizations at multiple levels of the computing stack. To achieve this, they are required to operate effectively with widely adopted frameworks such as TensorFlow, PyTorch, and JAX, thereby ensuring that deployment is smooth and consistent across varied hardware platforms. In tandem with this, compiler and runtime support become essential; advanced optimization techniques—including graph-level transformations, kernel fusion, and memory scheduling—are critical for harnessing the full potential of these specialized accelerators.

    Moreover, scalability presents an ongoing demand as AI accelerators are deployed in diverse environments ranging from high-throughput data centers to resource-constrained edge and mobile devices, necessitating tailored performance tuning and energy efficiency strategies. Finally, the integration of such accelerators into heterogeneous computing environments underscores the importance of interoperability, ensuring that these specialized units can function in concert with conventional CPUs and GPUs in distributed systems.

    The emergence of AI accelerators is therefore not simply a matter of hardware optimization but also a system-level transformation, where improvements in computation must be tightly coupled with advances in compilers, software frameworks, and distributed computing strategies. Understanding these principles is essential for designing and deploying efficient machine learning systems. The following sections explore how modern ML accelerators address these challenges, focusing on their architectural approaches, system-level optimizations, and integration into the broader machine learning ecosystem.

    11.3 AI Compute Primitives

    At the heart of all neural network computations lies a simple operation: multiply and accumulate. Every layer of a neural network, whether a dense layer, convolution, or attention mechanism, ultimately reduces to multiplying input values by learned weights and summing the results. This core mathematical operation—repeated billions of times—defines the computational structure of modern AI workloads.

    While the fundamental arithmetic is straightforward, the sheer scale of neural network computations necessitates specialized hardware optimizations. Unlike traditional computing workloads that involve intricate control flow and branching logic, neural network execution consists of highly structured, repetitive operations applied to large arrays of data in parallel. This characteristic has led to the development of AI compute primitives—specialized processor operations designed to accelerate machine learning workloads.

    When implementing neural networks in hardware, four fundamental computational requirements emerge:

    1. Efficient parallel processing: Need to process multiple data elements simultaneously through vector operations
    2. Structured coordination of computation: Ability to orchestrate calculations across multiple dimensions through matrix operations
    3. Systematic movement of data: Organized transfer of data through memory hierarchies to minimize latency and maximize bandwidth
    4. Specialized function evaluation: Direct hardware support for executing non-linear mathematical functions efficiently

    These requirements drive the development of specialized processor primitives that form the foundation of modern AI accelerators. The sections that follow examine these architectural primitives in turn. To see where they come from, consider how a single dense layer decomposes from framework code into hardware-level computation:

    # High-level framework code
    dense = Dense(512)(input_tensor)

    This high-level framework code decomposes into mathematical operations:

    # Mathematical operations
    output = matmul(input, weights) + bias
    output = activation(output)

    The mathematical representation further decomposes into processor-level computation:

    # Computational implementation
    for n in range(batch_size):
        for m in range(output_size):
            sum = bias[m]
            for k in range(input_size):
                sum += input[n,k] * weights[k,m]
            output[n,m] = activation(sum)

    Analysis of this computational decomposition reveals four fundamental characteristics that underpin modern hardware design. Data parallelism enables simultaneous processing across independent elements, significantly accelerating computation. The predominance of matrix operations defines the computational complexity, driving the need for optimized circuits. Systematic data movement patterns shape memory system architecture to ensure efficient transfer and minimal latency. Finally, recurring non-linear transformations require dedicated hardware support for effective execution.

    The implementation of hardware primitives designed to accelerate these computational patterns is governed by three fundamental criteria. First, a primitive must be employed with enough frequency to justify the allocation of dedicated silicon resources. Second, its hardware implementation must provide performance or efficiency benefits that exceed those of general-purpose approaches. Finally, the architectural design should maintain stability across multiple generations of neural network models, ensuring long-term viability and compatibility with evolving computational needs.

    In modern machine learning accelerators, a few important categories of processor primitives have emerged as essential building blocks. These include vector operations, matrix operations, and special function units. Each category addresses specific computational challenges while complementing the capabilities of the others. Together, these primitives form the foundation of neural network acceleration, enabling efficient, scalable, and robust performance in increasingly complex applications.

    11.3.1 Vector Operations

    Vector operations provide the first level of hardware acceleration by processing multiple data elements simultaneously. This parallelism exists at multiple scales, from individual neurons to entire layers, making vector processing essential for efficient neural network execution. By examining how framework-level code translates to hardware instructions, we can understand the critical role of vector processing in neural accelerators.

    Framework to Hardware Execution

    Machine learning frameworks hide hardware complexity through high-level abstractions. These abstractions decompose into progressively lower-level operations, revealing opportunities for hardware acceleration. Consider the execution flow of a linear layer:

    # Framework Level: What ML developers write
    layer = nn.Linear(256, 512)  # Layer transforms 256 inputs to
                                 # 512 outputs
    output = layer(input_tensor) # Process a batch of inputs

    This abstraction represents a fully connected layer that transforms input features through learned weights. The framework translates this high-level expression into mathematical operations:

    # Framework Internal: Mathematical operations
    Z = matmul(weights, input) + bias # Each output needs all inputs
    output = activation(Z)            # Transform each result

    These mathematical operations decompose into explicit computational steps during processor execution. Each output value requires a sequence of multiply-accumulate operations:

    # Computational Level: Implementation
    for batch in range(32):            # Process 32 samples at once
        for out_neuron in range(512):  # Compute each output neuron
            sum = 0.0
            for in_feature in range(256): # Each output needs
                                          # all inputs
                sum += (input[batch, in_feature] *
                        weights[out_neuron, in_feature])
            output[batch, out_neuron] = activation(sum +
                                        bias[out_neuron])

    Sequential Execution on Scalar Processors

    Traditional scalar processors5 execute these operations sequentially, processing individual values one at a time. For the linear layer example above with a batch of 32 samples, computing the outputs requires over 4 million multiply-accumulate operations (32 samples × 512 outputs × 256 inputs ≈ 4.2 million). Each operation involves loading an input value and a weight value, multiplying them, and accumulating the result. This sequential approach becomes highly inefficient when processing the massive number of identical operations required by neural networks.

    5 Scalar Processor: A scalar processor handles one data element per cycle, executing operations sequentially rather than in parallel.

    Parallel Execution with Vector Processing

    Vector processing units transform this execution pattern by operating on multiple data elements simultaneously. The following RISC-V assembly code demonstrates modern vector processing:

    # Vector hardware execution (RISC-V Vector Extension)
    vsetvli t0, a0, e32   # Process 8 elements at once
    loop_batch:
        loop_neuron:
            vxor.vv v0, v0, v0    # Clear 8 accumulators
            loop_feature:
                vle32.v v1, (in_ptr)   # Load 8 inputs together
                vle32.v v2, (wt_ptr)   # Load 8 weights together
                vfmacc.vv v0, v1, v2   # 8 multiply-adds at once
                add in_ptr, in_ptr, 32  # Move to next 8 inputs
                add wt_ptr, wt_ptr, 32  # Move to next 8 weights
                bnez feature_cnt, loop_feature

    This vector implementation processes eight data elements in parallel, reducing both computation time and energy consumption. Vector load instructions transfer eight values simultaneously, maximizing memory bandwidth utilization. The vector multiply-accumulate instruction processes eight pairs of values in parallel, dramatically reducing the total instruction count from over 4 million to approximately 500,000. Modern vector processors support additional specialized operations that accelerate common neural network patterns. Table 11.2 summarizes key vector operations and their applications in neural network computation:

    Table 11.2: Vector operations and their neural network applications.
    Vector Operation Description Neural Network Application
    Reduction Combines elements across a vector (e.g., sum, max) Pooling layers, attention score computation
    Gather Loads multiple non-consecutive memory elements Embedding lookups, sparse operations
    Scatter Writes to multiple non-consecutive memory locations Gradient updates for embeddings
    Masked operations Selectively operates on vector elements Attention masks, padding handling
    Vector-scalar broadcast Applies scalar to all vector elements Bias addition, scaling operations
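
    To make these operations concrete, the following NumPy sketch shows software analogues of each pattern in Table 11.2. The shapes and values are illustrative, and real accelerators expose these as native vector instructions rather than library calls.

    # Illustrative NumPy analogues of the vector operations in Table 11.2.
    import numpy as np

    rng = np.random.default_rng(0)
    scores = rng.standard_normal((4, 8))                  # e.g., attention scores
    embeddings = rng.standard_normal((100, 16))           # embedding table
    token_ids = np.array([3, 17, 42, 3])                  # lookup indices
    pad_mask = np.array([1, 1, 1, 1, 1, 0, 0, 0], dtype=bool)  # valid positions

    row_max = scores.max(axis=1)                  # Reduction: max across a vector
    gathered = embeddings[token_ids]              # Gather: non-consecutive loads

    grad_table = np.zeros_like(embeddings)
    np.add.at(grad_table, token_ids, gathered)    # Scatter-add: embedding updates

    masked = np.where(pad_mask, scores, -np.inf)  # Masked op: ignore padding
    scaled = scores * 0.125 + 1.0                 # Vector-scalar broadcast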

    The efficiency gains from vector processing extend beyond instruction count reduction. Memory bandwidth utilization improves as vector loads transfer multiple values per operation. Energy efficiency increases because control logic is shared across multiple operations. These improvements compound across the deep layers of modern neural networks, where billions of operations execute for each forward pass.

    Historical Foundations of Vector Processing

    The principles underlying vector operations have long played a central role in high-performance computing. In the 1970s and 1980s, vector processors emerged as a critical architectural solution for scientific computing, weather modeling, and physics simulations, where large arrays of data required efficient parallel processing. Early systems such as the Cray-1, one of the first commercially successful supercomputers, introduced dedicated vector units to perform arithmetic operations on entire data vectors in a single instruction. This approach dramatically improved computational throughput compared to traditional scalar execution (Jordan 1982).

    Jordan, T. L. 1982. “A Guide to Parallel Computation and Some Cray-1 Experiences.” In Parallel Computations, 1–50. Elsevier. https://doi.org/10.1016/b978-0-12-592101-5.50006-3.

    These foundational concepts have reemerged in the context of machine learning, where neural networks exhibit an inherent structure well suited to vectorized execution. The same fundamental operations—vector addition, multiplication, and reduction—that once accelerated numerical simulations now drive the execution of machine learning workloads. While the scale and specialization of modern AI accelerators differ from their historical predecessors, the underlying architectural principles remain the same. The resurgence of vector processing in neural network acceleration highlights its enduring utility as a mechanism for achieving high computational efficiency.

    Vector operations establish the foundation for neural network acceleration by enabling efficient parallel processing of independent data elements. However, the core transformations in neural networks require coordinating computation across multiple dimensions simultaneously. This need for structured parallel computation leads to the next architectural primitive: matrix operations.

    11.3.2 Matrix Operations

    Matrix operations are the computational workhorse of neural networks, transforming high-dimensional data through structured patterns of weights, activations, and gradients (Goodfellow, Courville, and Bengio 2013a). While vector operations process elements independently, matrix operations orchestrate computations across multiple dimensions simultaneously. Understanding these operations reveals fundamental patterns that drive hardware acceleration strategies.

    Goodfellow, Ian J., Aaron Courville, and Yoshua Bengio. 2013a. “Scaling up Spike-and-Slab Models for Unsupervised Feature Learning.” IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (8): 1902–14. https://doi.org/10.1109/tpami.2012.273.

    Matrix Operations in Neural Networks

    Neural network computations decompose into hierarchical matrix operations. Consider how a linear layer illustrates this hierarchy, processing multiple input features into output neurons across a batch of samples:

    # Framework Level: What ML developers write
    layer = nn.Linear(256, 512)  # Layer transforms 256 inputs to
                                 # 512 outputs
    output = layer(input_batch)  # Process a batch of 32 samples
    
    # Framework Internal: Core operations
    Z = matmul(weights, input)   # Matrix: transforms [256 x 32]
                                 # input to [512 x 32] output
    Z = Z + bias                 # Vector: adds bias to each
                                 # output independently
    output = relu(Z)             # Vector: applies activation to
                                 # each element independently

    This computation demonstrates the inherent scale of matrix operations in neural networks. Each output neuron (512 total) must process all input features (256 total) for every sample in the batch (32 samples). The weight matrix alone contains \(256 \times 512 = 131,072\) parameters that define these transformations, illustrating why efficient matrix multiplication becomes crucial for performance.
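
    A short NumPy check of these sizes, assuming the same [features × batch] layout as the pseudocode above, confirms the parameter count and the number of multiply-accumulate operations required per batch.

    # Shape and cost check for the linear layer described above.
    import numpy as np

    weights = np.zeros((512, 256), dtype=np.float32)  # [out_features x in_features]
    inputs = np.zeros((256, 32), dtype=np.float32)    # [in_features x batch]
    outputs = weights @ inputs                        # [512 x 32]

    print(weights.size)           # 131072 learned parameters
    print(outputs.shape)          # (512, 32)
    print(512 * 256 * 32)         # 4194304 multiply-accumulates per batch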

    Types of Matrix Computations in Neural Networks

    Matrix operations appear consistently across modern neural architectures. Consider these fundamental patterns:

    # Linear Layers - Direct matrix multiply
    hidden = matmul(weights, inputs)  # weights: [out_dim x in_dim],
                                      # inputs: [in_dim x batch]
                                      # Result combines all inputs
                                      # for each output
    
    # Attention Mechanisms - Multiple matrix operations
    Q = matmul(Wq, inputs)       # Project inputs to query space
                                 # [query_dim x batch]
    K = matmul(Wk, inputs)       # Project inputs to key space
                                 #[key_dim x batch]
    attention = matmul(Q, K.T)   # Compare all queries with all
                                 # keys [query_dim x key_dim]
    
    # Convolutions - Matrix multiply after reshaping
    patches = im2col(input)           # Convert [H x W x C] image
                                      # to matrix of patches
    output = matmul(kernel, patches)  # Apply kernels to all
                                      # patches simultaneously
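
    The im2col step is often treated as a black box, so the following minimal NumPy sketch makes it explicit for the simplest case (single channel, stride 1, no padding, with an illustrative im2col helper): patches are rearranged into columns so that the convolution reduces to one matrix multiply.

    # Minimal im2col sketch: single channel, stride 1, no padding.
    import numpy as np

    def im2col(image, kh, kw):
        H, W = image.shape
        out_h, out_w = H - kh + 1, W - kw + 1
        cols = np.empty((kh * kw, out_h * out_w), dtype=image.dtype)
        for i in range(out_h):
            for j in range(out_w):
                # Each column holds one flattened kh x kw patch of the image.
                cols[:, i * out_w + j] = image[i:i + kh, j:j + kw].reshape(-1)
        return cols, out_h, out_w

    rng = np.random.default_rng(0)
    image = rng.standard_normal((6, 6)).astype(np.float32)
    kernel = rng.standard_normal((3, 3)).astype(np.float32)

    cols, out_h, out_w = im2col(image, 3, 3)
    conv_as_matmul = (kernel.reshape(1, -1) @ cols).reshape(out_h, out_w)

    # Direct sliding-window computation for comparison.
    direct = np.array([[np.sum(image[i:i + 3, j:j + 3] * kernel)
                        for j in range(out_w)] for i in range(out_h)])
    assert np.allclose(conv_as_matmul, direct, atol=1e-5)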

    This pervasive pattern of matrix multiplication has direct implications for hardware design, as the next section shows.

    Hardware Acceleration of Matrix Operations

    The computational demands of matrix operations have driven specialized hardware optimizations. Modern processors implement dedicated matrix units that extend beyond vector processing capabilities. Consider the following example of matrix acceleration in hardware:

    # Matrix processing unit operation for a block of the computation
    mload mr1, (weight_ptr)     # Load e.g., 16x16 block of
                                # weight matrix
    mload mr2, (input_ptr)      # Load corresponding input block
    matmul.mm mr3, mr1, mr2     # Multiply and accumulate entire
                                # blocks at once
    mstore (output_ptr), mr3    # Store computed output block

    This matrix processing unit can handle \(16\times16\) blocks of the linear layer computation described earlier, processing 256 multiply-accumulate operations simultaneously compared to the 8 operations possible with vector processing. These matrix operations complement vectorized computation by enabling structured many-to-many transformations. The interplay between matrix and vector operations shapes the efficiency of neural network execution.
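
    The blocking idea behind such matrix units can be sketched in NumPy: the output is computed tile by tile, and each block-level multiply-accumulate updates an entire 16×16 tile of the result, mirroring the block instructions shown above. The tile size and matrix shapes are illustrative.

    # Blocked (tiled) matrix multiply: each 16x16 output tile is updated by a
    # block-level multiply-accumulate, analogous to the block instructions above.
    import numpy as np

    T = 16                                            # tile size (illustrative)
    A = np.random.rand(512, 256).astype(np.float32)   # weights
    B = np.random.rand(256, 32).astype(np.float32)    # inputs
    C = np.zeros((512, 32), dtype=np.float32)

    for i in range(0, 512, T):
        for j in range(0, 32, T):
            for k in range(0, 256, T):
                # Multiply a 16x16 tile of A by a 16x16 tile of B and
                # accumulate the result into a 16x16 tile of C.
                C[i:i + T, j:j + T] += A[i:i + T, k:k + T] @ B[k:k + T, j:j + T]

    assert np.allclose(C, A @ B, atol=1e-3)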

    Table 11.3: Comparison of matrix and vector operation characteristics.
    Operation Type Best For Examples Key Characteristic
    Matrix Operations Many-to-many transforms Layer transformations, attention computation, convolutions Each output depends on multiple inputs
    Vector Operations One-to-one transforms Activation functions, layer normalization, element-wise gradients Each output depends only on its corresponding input

    Matrix operations provide essential computational capabilities for neural networks through coordinated parallel processing across multiple dimensions. However, achieving peak performance with these operations requires careful orchestration of data movement between processing units. This need for efficient data handling leads us to examine the critical role of dataflow patterns in neural accelerator design (Hwu 2011).

    Hwu, Wen-mei W. 2011. “Introduction.” In GPU Computing Gems Emerald Edition, xix–xx. Elsevier. https://doi.org/10.1016/b978-0-12-384988-5.00064-4.

    Historical Foundations of Matrix Computation

    Matrix operations have long served as a cornerstone of computational mathematics, with applications extending from numerical simulations to graphics processing (Golub and Loan 1996). The structured nature of matrix multiplications and transformations made them a natural target for acceleration in early computing architectures. In the 1980s and 1990s, specialized digital signal processors (DSPs) and graphics processing units (GPUs) optimized for matrix computations played a critical role in accelerating workloads such as image processing, scientific computing, and 3D rendering (Owens et al. 2008).

    Golub, Gene H., and Charles F. Van Loan. 1996. Matrix Computations. Johns Hopkins University Press.
    Owens, J. D., M. Houston, D. Luebke, S. Green, J. E. Stone, and J. C. Phillips. 2008. “GPU Computing.” Proceedings of the IEEE 96 (5): 879–99. https://doi.org/10.1109/jproc.2008.917757.

    The widespread adoption of machine learning has reinforced the importance of efficient matrix computation. Neural networks, fundamentally built on matrix multiplications and tensor operations, have driven the development of dedicated hardware architectures that extend beyond traditional vector processing. Modern tensor processing units (TPUs) and AI accelerators implement matrix multiplication at scale, reflecting the same architectural principles that once underpinned early scientific computing and graphics workloads. The resurgence of matrix-centric architectures highlights the deep connection between classical numerical computing and contemporary AI acceleration.

    11.3.3 Special Function Units

    While vector and matrix operations efficiently handle the linear transformations in neural networks, non-linear functions present unique computational challenges that require dedicated hardware solutions. Special Function Units (SFUs) provide hardware acceleration for these essential computations, completing the set of fundamental processing primitives needed for efficient neural network execution.

    Non-Linear Functions

    Non-linear functions play a fundamental role in machine learning by enabling neural networks to model complex relationships (Goodfellow, Courville, and Bengio 2013b). Consider a typical neural network layer sequence:

    ———. 2013b. “Scaling up Spike-and-Slab Models for Unsupervised Feature Learning.” IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (8): 1902–14. https://doi.org/10.1109/tpami.2012.273.
    # Framework Level Operation
    layer = nn.Sequential(
        nn.Linear(256, 512),
        nn.ReLU(),
        nn.BatchNorm1d(512)
    )
    output = layer(input_tensor)

    This sequence introduces multiple non-linear transformations. The framework decomposes it into mathematical operations:

    # Mathematical Operations
    Z = matmul(weights, input) + bias    # Linear transformation
    H = max(0, Z)                        # ReLU activation
    mean = reduce_mean(H, axis=0)        # BatchNorm statistics
    var = reduce_mean((H - mean)**2)     # Variance computation
    output = gamma * (H - mean)/sqrt(var + eps) + beta
                                         # Normalization

    Implementing the Non-Linear Functions

    On traditional processors, these seemingly simple mathematical operations translate into complex sequences of instructions. Consider the computation of batch normalization: calculating the square root requires multiple iterations of numerical approximation, while exponential functions in operations like softmax need series expansion or lookup tables (Ioffe and Szegedy 2015). Even a simple ReLU activation requires conditional branching, which can disrupt instruction pipelining:

    Ioffe, Sergey, and Christian Szegedy. 2015. “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.” International Conference on Machine Learning (ICML), February, 448–56. http://arxiv.org/abs/1502.03167v3.
    # Traditional Implementation Overhead
    for batch in range(32):
        for feature in range(512):
            # ReLU: Requires branch prediction and potential
            # pipeline stalls
            z = matmul_output[batch, feature]
            h = max(0.0, z)    # Conditional operation
    
            # BatchNorm: Multiple passes over data
            mean_sum[feature] += h     # First pass for mean
            var_sum[feature] += h * h  # Additional pass for variance
    
            temp[batch, feature] = h   # Extra memory storage needed
    
    # Normalization requires complex arithmetic
    for feature in range(512):
        mean = mean_sum[feature] / batch_size
        var = (var_sum[feature] / batch_size) - mean * mean
    
        # Square root computation: Multiple iterations
        scale = gamma[feature] / sqrt(var + eps)  # Iterative
                                                  # approximation
        shift = beta[feature] - mean * scale
    
        # Additional pass over data for final computation
        for batch in range(32):
            output[batch, feature] = (temp[batch, feature] *
                                      scale + shift)

    These operations introduce several key inefficiencies:

    1. Multiple passes over data, increasing memory bandwidth requirements
    2. Complex arithmetic requiring many instruction cycles
    3. Conditional operations that can cause pipeline stalls
    4. Additional memory storage for intermediate results
    5. Poor utilization of vector processing units

    More specifically, each operation introduces distinct challenges. Batch normalization requires multiple passes through data: one for mean computation, another for variance, and a final pass for output transformation. Each pass loads and stores data through the memory hierarchy. Operations that appear simple in mathematical notation often expand into many instructions. The square root computation typically requires 10-20 iterations of numerical methods like Newton-Raphson approximation for suitable precision (Goldberg 1991). Conditional operations like ReLU’s max function require branch instructions that can stall the processor’s pipeline. The implementation needs temporary storage for intermediate values, increasing memory usage and bandwidth consumption. While vector units excel at regular computations, functions like exponentials and square roots often require scalar operations that cannot fully utilize vector processing capabilities.

    Goldberg, David. 1991. “What Every Computer Scientist Should Know about Floating-Point Arithmetic.” ACM Computing Surveys 23 (1): 5–48. https://doi.org/10.1145/103162.103163.
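
    To make the iteration cost concrete, here is a minimal Newton-Raphson sketch for the reciprocal square root. The starting guess and iteration counts are illustrative; a hardware unit would replace the loop with a fixed number of unrolled refinement stages.

    # Newton-Raphson refinement for 1/sqrt(x): y <- y * (1.5 - 0.5 * x * y * y).
    import math

    def rsqrt_newton(x, iterations):
        y = 1.0  # crude starting guess; converges for 0 < x < 3 (illustrative)
        for _ in range(iterations):
            y = y * (1.5 - 0.5 * x * y * y)  # one multiply-add refinement step
        return y

    x = 0.37
    exact = 1.0 / math.sqrt(x)
    for n in (2, 5, 10, 20):
        print(n, "iterations -> error", abs(rsqrt_newton(x, n) - exact))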

    Hardware Acceleration

    Special Function Units (SFUs) address these inefficiencies through dedicated hardware implementation. Modern ML accelerators include specialized circuits that transform these complex operations into single-cycle or fixed-latency computations. The accelerator can load a vector of values and apply non-linear functions directly, eliminating the need for multiple passes and complex instruction sequences:

    # Example hardware execution with Special Function Units
    vld.v v1, (input_ptr)      # Load vector of values
    vrelu.v v2, v1             # Single-cycle ReLU on entire vector
    vsigm.v v3, v1             # Fixed-latency sigmoid computation
    vtanh.v v4, v1             # Direct hardware tanh implementation
    vrsqrt.v v5, v1            # Fast reciprocal square root

    Each SFU implements a specific function through specialized circuitry. For instance, a ReLU unit performs the comparison and selection in dedicated logic, eliminating branching overhead. Square root operations use hardware implementations of algorithms like Newton-Raphson with fixed iteration counts, providing guaranteed latency. Exponential and logarithmic functions often combine small lookup tables with hardware interpolation circuits (Costa et al. 2019). Using these custom instructions, the SFU implementation eliminates multiple passes over data, removes complex arithmetic sequences, and maintains high computational efficiency. Table 11.4 shows the various hardware implementations and their typical latencies.

    Costa, Tiago, Chen Shi, Kevin Tien, and Kenneth L. Shepard. 2019. “A CMOS 2D Transmit Beamformer with Integrated PZT Ultrasound Transducers for Neuromodulation.” In 2019 IEEE Custom Integrated Circuits Conference (CICC), 1–4. IEEE. https://doi.org/10.1109/cicc.2019.8780236.
    Table 11.4: Special function unit implementation.
    Function Unit Operation Implementation Strategy Typical Latency
    Activation Unit ReLU, sigmoid, tanh Piece-wise approximation circuits 1-2 cycles
    Statistics Unit Mean, variance Parallel reduction trees log(N) cycles
    Exponential Unit exp, log Table lookup + hardware interpolation 2-4 cycles
    Root/Power Unit sqrt, rsqrt Fixed-iteration Newton-Raphson 4-8 cycles
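
    The table-lookup-plus-interpolation strategy listed for the exponential unit can be sketched in a few lines: a small table of sigmoid values is precomputed once, and each evaluation then costs only an index calculation and one linear interpolation, which is what makes the latency fixed. The table size and clamped input range below are illustrative choices.

    # Table lookup with linear interpolation for sigmoid, mimicking the
    # fixed-latency strategy of a hardware activation/exponential unit.
    import numpy as np

    TABLE_SIZE = 256
    X_MIN, X_MAX = -8.0, 8.0
    xs = np.linspace(X_MIN, X_MAX, TABLE_SIZE)
    table = 1.0 / (1.0 + np.exp(-xs))        # precomputed once

    def sigmoid_lut(x):
        x = np.clip(x, X_MIN, X_MAX)
        pos = (x - X_MIN) / (X_MAX - X_MIN) * (TABLE_SIZE - 1)
        idx = np.minimum(pos.astype(np.int64), TABLE_SIZE - 2)
        frac = pos - idx
        return table[idx] * (1.0 - frac) + table[idx + 1] * frac  # one interpolation

    x = np.linspace(-10.0, 10.0, 1000)
    exact = 1.0 / (1.0 + np.exp(-x))
    print("max abs error:", float(np.max(np.abs(sigmoid_lut(x) - exact))))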

    Historical Foundations of SFUs

    The need for efficient non-linear function evaluation has shaped computer architecture for decades. Early processors incorporated hardware support for complex mathematical functions, such as logarithms and trigonometric operations, to accelerate workloads in scientific computing and signal processing (Smith 1997). In the 1970s and 1980s, floating-point co-processors were introduced to handle complex mathematical operations separately from the main CPU (Palmer 1980). In the 1990s, instruction set extensions such as Intel’s SSE and ARM’s NEON provided dedicated hardware for vectorized mathematical transformations, improving efficiency for multimedia and signal processing applications.

    Smith, Steven W. 1997. The Scientist and Engineer’s Guide to Digital Signal Processing. California Technical Publishing. https://www.dspguide.com/.
    Palmer, John F. 1980. “The INTEL® 8087 Numeric Data Processor.” In Proceedings of the May 19-22, 1980, National Computer Conference on - AFIPS ’80, 887. ACM Press. https://doi.org/10.1145/1500518.1500674.

    Machine learning workloads have reintroduced a strong demand for specialized functional units, as activation functions, normalization layers, and exponential transformations are fundamental to neural network computations. Rather than relying on iterative software approximations, modern AI accelerators implement fast, fixed-latency SFUs for these operations, mirroring historical trends in scientific computing. The reemergence of dedicated special function units underscores the ongoing cycle in hardware evolution, where domain-specific requirements drive the reinvention of classical architectural concepts in new computational paradigms.

    The combination of vector, matrix, and special function units provides the computational foundation for modern AI accelerators. However, the effective utilization of these processing primitives depends critically on data movement and access patterns. This leads us to examine the architectures, hierarchies, and strategies that enable efficient data flow in neural network execution.

    11.3.4 Compute Units and Execution Models

    The vector operations, matrix operations, and special function units examined previously represent the fundamental computational primitives in AI accelerators. Modern AI processors package these primitives into distinct execution units, such as SIMD units, tensor cores, and processing elements, which define how computations are structured and exposed to users. Understanding this organization reveals both the theoretical capabilities and practical performance characteristics that developers can leverage in contemporary AI accelerators.

    Primitive to Execution Unit Mapping

    The progression from computational primitives to execution units follows a structured hierarchy that reflects the increasing complexity and specialization of AI accelerators:

    • Vector operations → SIMD/SIMT units that enable parallel processing of independent data elements
    • Matrix operations → Tensor cores and systolic arrays that provide structured matrix multiplication
    • Special functions → Dedicated hardware units integrated within processing elements

    Each execution unit combines these computational primitives with specialized memory and control mechanisms, optimizing both performance and energy efficiency. This structured packaging allows hardware vendors to expose standardized programming interfaces while implementing diverse underlying architectures tailored to specific workload requirements. The choice of execution unit significantly influences overall system efficiency, affecting data locality, compute density, and workload adaptability. Subsequent sections examine how these execution units operate within AI accelerators to maximize performance across different machine learning tasks.

    From SIMD to SIMT

    Single Instruction Multiple Data (SIMD) execution applies identical operations to multiple data elements in parallel, minimizing instruction overhead while maximizing data throughput. This execution model is widely used to accelerate workloads with regular, independent data parallelism, such as neural network computations. The ARM Scalable Vector Extension (SVE) provides a representative example of how modern architectures implement SIMD operations efficiently:

    # Vector operation implementation using ARM SVE
    ptrue p0.s              # Create predicate for vector length
    ld1w z0.s, p0/z, [x0]   # Load vector of inputs
    fmul z1.s, z0.s, z0.s   # Multiply elements
    fadd z2.s, z1.s, z0.s   # Add elements
    st1w z2.s, p0, [x1]     # Store results

    Processor architectures continue to expand data-parallel execution capabilities to meet increasing computational demands. ARM's SVE2 provides flexible, vector-length-agnostic SIMD execution, while Intel's Advanced Matrix Extensions (AMX) add tile-based matrix operations, enabling software to scale across different hardware implementations (Stephens et al. 2017).

    Stephens, Nigel, Stuart Biles, Matthias Boettcher, Jacob Eapen, Mbou Eyole, Giacomo Gabrielli, Matt Horsnell, et al. 2017. “The ARM Scalable Vector Extension.” IEEE Micro 37 (2): 26–39. https://doi.org/10.1109/mm.2017.35.
    Lindholm, Erik, John Nickolls, Stuart Oberman, and John Montrym. 2008. “NVIDIA Tesla: A Unified Graphics and Computing Architecture.” IEEE Micro 28 (2): 39–55. https://doi.org/10.1109/mm.2008.31.

    SIMD execution, however, assumes uniform control flow and operates on a fixed set of data lanes, which limits its flexibility for large, irregular workloads. Single Instruction Multiple Threads (SIMT) execution addresses these limitations by extending SIMD principles across many independent threads, each maintaining its own program counter and architectural state (Lindholm et al. 2008). This model maps naturally to matrix computations, where each thread processes a different portion of a workload while still benefiting from shared instruction execution. In NVIDIA's GPU architectures, each Streaming Multiprocessor (SM) coordinates thousands of threads executing in parallel, allowing for efficient scaling of neural network computations:

    // CUDA kernel demonstrating SIMT execution
    __global__ void matrix_multiply(float* C, float* A, float* B, int N) {
        // Each thread computes one output element
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= N || col >= N) return;  // Guard threads outside the matrix
    
        float sum = 0.0f;
        for (int k = 0; k < N; k++) {
            // Threads in a warp execute this iteration in lockstep
            sum += A[row * N + k] * B[k * N + col];
        }
        C[row * N + col] = sum;
    }
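
    A typical host-side launch assigns one thread per output element. The \(16\times16\) block size below is an illustrative choice rather than a hardware requirement, and d_A, d_B, and d_C are assumed to be device allocations holding \(N\times N\) floats:

    // Illustrative host-side launch for the kernel above (sketch only)
    void launch_matrix_multiply(float* d_C, float* d_A, float* d_B, int N) {
        dim3 block(16, 16);                      // Assumed tile of threads
        dim3 grid((N + block.x - 1) / block.x,
                  (N + block.y - 1) / block.y);  // Cover the full matrix
        matrix_multiply<<<grid, block>>>(d_C, d_A, d_B, N);
        cudaDeviceSynchronize();                 // Wait for the grid to finish
    }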

    SIMT execution allows neural network computations to scale efficiently across thousands of threads while maintaining flexibility for divergent execution paths. Similar execution models appear in AMD’s RDNA and Intel’s Xe architectures, reinforcing SIMT as a fundamental mechanism for AI acceleration.

    Tensor Cores

    While SIMD and SIMT units provide efficient execution of vector operations, neural networks rely heavily on matrix computations that require specialized execution units for structured multi-dimensional processing. Tensor processing units extend SIMD and SIMT principles by enabling efficient matrix operations through dedicated hardware blocks. These units execute matrix multiplications and accumulations on entire matrix blocks in a single operation, reducing instruction overhead and optimizing data movement.

    Tensor cores, implemented in architectures such as NVIDIA’s Ampere GPUs, provide an example of this approach. They expose matrix computation capabilities through specialized instructions, such as the following tensor core operation in the NVIDIA A100 GPU:

    Tensor Core Operation (NVIDIA A100):
    mma.sync.aligned.m16n16k16.f16.f16
      {d0,d1,d2,d3},     // Destination registers
      {a0,a1,a2,a3},     // Source matrix A
      {b0,b1,b2,b3},     // Source matrix B
      {c0,c1,c2,c3}      // Accumulator

    A single tensor core instruction processes an entire matrix block while maintaining intermediate results in local registers, significantly improving computational efficiency compared to implementations based on scalar or vector operations. This structured approach enables hardware to achieve high throughput while reducing the burden of explicit loop unrolling and data management at the software level.
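
    In CUDA C++, this capability is exposed through the WMMA API, which operates on matrix fragments held in registers. The sketch below shows one supported configuration (\(16\times16\times16\) tiles, FP16 inputs with FP32 accumulation) as a minimal illustration rather than a complete tiled kernel; the pointer arguments and layouts are assumptions made for the example, and the kernel is expected to be launched with one warp per tile.

    // Minimal sketch of a tensor core tile multiply via CUDA's WMMA API.
    // A and B are assumed to be 16x16 half-precision tiles; C accumulates
    // in FP32. Tiling loops and error handling are omitted.
    #include <cuda_fp16.h>
    #include <mma.h>
    using namespace nvcuda;

    __global__ void wmma_tile(const half* A, const half* B, float* C) {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

        wmma::fill_fragment(c_frag, 0.0f);               // Zero the accumulator tile
        wmma::load_matrix_sync(a_frag, A, 16);           // Load a 16x16 tile of A
        wmma::load_matrix_sync(b_frag, B, 16);           // Load a 16x16 tile of B
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // One tensor core MMA
        wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
    }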

    Tensor processing unit architectures differ based on design priorities. NVIDIA’s Ampere architecture incorporates tensor cores optimized for general-purpose deep learning acceleration. Google’s TPUv4 utilizes large-scale matrix units arranged in systolic arrays to maximize sustained training throughput. Apple’s M1 neural engine integrates smaller matrix processors optimized for mobile inference workloads, while Intel’s Sapphire Rapids architecture introduces AMX tiles designed for high-performance datacenter applications.

    The increasing specialization of AI hardware has driven significant performance improvements in deep learning workloads. Figure 11.2 illustrates the trajectory of AI accelerator performance in NVIDIA GPUs, highlighting the transition from general-purpose floating-point execution units to highly optimized tensor processing cores.

    Figure 11.2: Single-chip performance scaling.

    Processing Elements

    The highest level of execution unit organization integrates multiple tensor cores with local memory into processing elements (PEs). A processing element serves as a fundamental building block in many AI accelerators, combining different computational units to efficiently execute neural network operations. Each PE typically includes vector units for element-wise operations, tensor cores for matrix computation, special function units for non-linear transformations, and dedicated memory resources to optimize data locality and minimize data movement overhead.

    Processing elements play an essential role in AI hardware by balancing computational density with memory access efficiency. Their design varies across different architectures to support diverse workloads and scalability requirements. Graphcore’s Intelligence Processing Unit (IPU) distributes computation across 1,472 tiles, each containing independent processing elements optimized for fine-grained parallelism (Graphcore 2020). Cerebras extends this approach in the CS-2 system, integrating 850,000 processing elements across a wafer-scale device to accelerate sparse computations. Tesla’s D1 processor arranges processing elements with substantial local memory, optimizing throughput and latency for real-time autonomous vehicle workloads (Tesla, Inc. 2021).

    Graphcore. 2020. “The Colossus MK2 IPU Processor.” Graphcore Technical Paper.
    Tesla, Inc. 2021. “Tesla AI Day: D1 Dojo Chip.” Tesla AI Day Presentation.

    Processing elements provide the structural foundation for large-scale AI acceleration. Their efficiency depends not only on computational capability but also on interconnect strategies and memory hierarchy design. The next sections explore how these architectural choices impact performance across different AI workloads.

    Tensor processing units have enabled substantial efficiency gains in AI workloads by leveraging hardware-accelerated matrix computation. Their role continues to evolve as architectures incorporate support for advanced execution techniques, including structured sparsity6 and workload-specific optimizations. The effectiveness of these units, however, depends not only on their computational capabilities but also on how they interact with memory hierarchies and data movement mechanisms, which are examined in subsequent sections.

    6 Structured Sparsity: The deliberate design of neural network weight matrices where entire rows, columns, or blocks are pruned, thus simplifying hardware implementation and improving efficiency.

    Systolic Arrays

    While tensor cores package matrix operations into structured computational units, systolic arrays provide an alternative approach optimized for continuous data flow and operand reuse. A systolic array arranges processing elements in a grid pattern, where data flows rhythmically between neighboring units in a synchronized manner. This structured movement of data enables efficient execution of matrix multiplication, reducing memory access overhead and maximizing computational throughput.

    The concept of systolic arrays was first introduced by H.T. Kung, who formalized their use in parallel computing architectures for efficient matrix operations (Kung 1982). Unlike general-purpose execution units, systolic arrays exploit spatial and temporal locality by reusing operands as they propagate through the grid. Google’s Tensor Processing Unit (TPU) exemplifies this architectural approach. In the TPUv4, a \(128\times128\) systolic array of multiply-accumulate units processes matrix operations by streaming data through the array in a pipelined manner, as shown in Figure 11.3.

    Kung, H. T. 1982. “Why Systolic Architectures?” IEEE Computer 15 (1): 37–46. https://doi.org/10.1109/MC.1982.1653825.
    Figure 11.3: Data flow movement in a systolic array.

    Each processing element in the array performs a multiply-accumulate operation in every cycle:

    1. Receives an input activation from above
    2. Receives a weight value from the left
    3. Multiplies these values and adds to its running sum
    4. Passes the input activation downward and the weight value rightward to neighboring elements

    This structured computation model minimizes data movement between global memory and processing elements, improving both efficiency and scalability. As systolic arrays operate in a streaming fashion, they are particularly effective for high-throughput workloads such as deep learning training and inference.
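
    The dataflow can be made concrete with a small software model. The following host-side C++ sketch simulates an output-stationary \(4\times4\) array computing C = W × X: weights are injected from the left with a per-row skew, activations from the top with a per-column skew, and each cell accumulates its own output while forwarding its inputs to its neighbors. The sizes and data values are illustrative only; real arrays such as the TPU's \(128\times128\) grid follow the same flow.

    // Host-side C++ sketch of an output-stationary systolic array.
    #include <cstdio>
    #include <vector>

    int main() {
        const int M = 4, K = 4, N = 4;              // W is MxK, X is KxN
        std::vector<std::vector<float>> W(M, std::vector<float>(K)),
                                        X(K, std::vector<float>(N)),
                                        C(M, std::vector<float>(N, 0.0f));
        for (int i = 0; i < M; ++i)
            for (int k = 0; k < K; ++k) W[i][k] = float(i + k);
        for (int k = 0; k < K; ++k)
            for (int j = 0; j < N; ++j) X[k][j] = float(k - j);

        // Per-PE registers: weight moving rightward, activation moving down.
        std::vector<std::vector<float>> wreg(M, std::vector<float>(N, 0.0f));
        std::vector<std::vector<float>> areg(M, std::vector<float>(N, 0.0f));

        int cycles = K + M + N - 2;                 // Time to drain the pipeline
        for (int t = 0; t < cycles; ++t) {
            // Sweep from the bottom-right corner so each PE reads its
            // neighbor's value from the previous cycle.
            for (int i = M - 1; i >= 0; --i) {
                for (int j = N - 1; j >= 0; --j) {
                    // Weight arriving from the left (skewed injection at j == 0)
                    float w_in = (j == 0)
                        ? ((t - i >= 0 && t - i < K) ? W[i][t - i] : 0.0f)
                        : wreg[i][j - 1];
                    // Activation arriving from above (skewed injection at i == 0)
                    float a_in = (i == 0)
                        ? ((t - j >= 0 && t - j < K) ? X[t - j][j] : 0.0f)
                        : areg[i - 1][j];
                    C[i][j] += w_in * a_in;         // Multiply-accumulate in place
                    wreg[i][j] = w_in;              // Pass weight rightward next cycle
                    areg[i][j] = a_in;              // Pass activation downward next cycle
                }
            }
        }
        printf("C[0][0]=%.1f, C[3][3]=%.1f\n", C[0][0], C[3][3]);
        return 0;
    }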

    While the diagram in Figure 11.3 illustrates one common systolic array implementation, systolic architectures vary significantly across different accelerator designs. Training-focused architectures like Google’s TPU employ large arrays optimized for high computational throughput, while inference-oriented designs found in edge devices prioritize energy efficiency with smaller configurations.

    The fundamental principle remains consistent: data flows systematically through processing elements, with inputs moving horizontally and vertically to compute partial sums in a synchronized fashion. However, the practical effectiveness of systolic arrays extends beyond their computational structure—it depends heavily on efficient memory access patterns and careful scheduling strategies, topics we explore in detail in subsequent sections.

    Numerics in AI Acceleration

    The efficiency of AI accelerators is not determined by computational power alone but also by the precision of numerical representations. The choice of numerical format shapes the balance between accuracy, throughput, and energy consumption, influencing how different execution units—SIMD and SIMT units, tensor cores, and systolic arrays—are designed and deployed.

    Precision Trade-offs and Execution Unit Design

    Early deep learning models primarily relied on single-precision floating point (FP32) for both training and inference. While FP32 offers sufficient dynamic range and precision for stable learning, it imposes high computational and memory costs, limiting efficiency, especially as model sizes increase. Over time, hardware architectures evolved to support lower precision formats such as half-precision floating point (FP16) and bfloat16 (BF16), which reduce memory usage and increase computational throughput while maintaining sufficient accuracy for deep learning tasks. More recently, integer formats (INT8, INT4) have gained prominence in inference workloads, where small numerical representations significantly improve energy efficiency without compromising model accuracy beyond acceptable limits.

    The transition from high-precision to lower-precision formats is deeply integrated into hardware execution models. SIMD and SIMT units provide flexible support for multiple precisions, dynamically adapting to workload requirements. Tensor cores are designed explicitly for matrix multiplications, accelerating computation using reduced-precision floating point and integer arithmetic. Systolic arrays, with their structured data flow, further optimize performance by minimizing memory bandwidth constraints, often favoring low-precision formats that maximize operand reuse.

    Despite the advantages of reduced precision, deep learning models cannot always rely solely on low-bit representations. To address this challenge, modern AI accelerators implement mixed-precision computing, where different numerical formats are used at different stages of execution. For example, matrix multiplications may be performed in FP16 or BF16, while accumulations are maintained in FP32 to prevent precision loss. Similarly, inference engines leverage INT8 arithmetic while preserving key activations in higher precision when necessary.

    Mixed-Precision Computing and Hardware Evolution

    Modern AI accelerators increasingly support mixed-precision execution, allowing different numerical formats to be used at various stages of computation. Training workloads often leverage FP16 or BF16 for matrix multiplications, while maintaining FP32 accumulations to preserve precision. Inference workloads, by contrast, optimize for INT8 or even INT4, achieving high efficiency while retaining acceptable accuracy.
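
    This convention is visible even in a scalar kernel: operands are stored in FP16, but each partial sum is kept in FP32. The sketch below is a minimal illustration of that software-level pattern, not a description of any particular tensor core datapath; the kernel name and the reduction via atomicAdd are choices made for the example, and the output must be zeroed before launch.

    // Mixed-precision dot product sketch: FP16 storage and multiplies,
    // FP32 accumulation to limit rounding error. Illustrative only.
    #include <cuda_fp16.h>

    __global__ void dot_fp16_acc_fp32(const __half* a, const __half* b,
                                      float* out, int n) {
        float acc = 0.0f;                           // Accumulator kept in FP32
        for (int i = threadIdx.x; i < n; i += blockDim.x) {
            // Convert FP16 operands up before the multiply-accumulate
            acc += __half2float(a[i]) * __half2float(b[i]);
        }
        atomicAdd(out, acc);                        // Combine per-thread partial sums
    }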

    This shift toward precision diversity is evident in the evolution of AI hardware. Early architectures such as NVIDIA Volta supported reduced precision only down to FP16 in their tensor cores, whereas later architectures, including Turing and Ampere, expanded the range of supported formats. Ampere GPUs introduced TF32 as a hybrid between FP32 and FP16, alongside broader support for BF16, INT8, and INT4. Table 11.5 illustrates this trend.

    Table 11.5: Tensor core and CUDA core precisions across GPU architectures.
    Architecture | Year | Supported Tensor Core Precisions       | Supported CUDA Core Precisions
    Volta        | 2017 | FP16                                   | FP64, FP32, FP16
    Turing       | 2018 | FP16, INT8                             | FP64, FP32, FP16, INT8
    Ampere       | 2020 | FP64, TF32, bfloat16, FP16, INT8, INT4 | FP64, FP32, FP16, bfloat16, INT8

    Table 11.5 highlights how newer architectures incorporate a growing diversity of numerical formats, reflecting the need for greater flexibility across different AI workloads. This trend suggests that future AI accelerators will continue expanding support for adaptive precision, optimizing both computational efficiency and model accuracy.

    The precision format used in hardware design has far-reaching implications. By adopting lower-precision formats, the data transferred between execution units and memory is reduced, leading to decreased memory bandwidth requirements and storage. Moreover, tensor cores and systolic arrays can process more lower-precision elements in parallel, thereby increasing the effective throughput in terms of FLOPs. Energy efficiency is also improved, as integer-based computations (e.g., INT8) require lower power compared to floating-point arithmetic—a clear advantage for inference workloads.
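
    A quick calculation makes the bandwidth argument concrete. Assuming a hypothetical layer with 25 million weights, the bytes that must move per access scale directly with the element width:

    // Worked example: memory footprint of one weight tensor at different
    // precisions. The 25M-parameter layer is a hypothetical figure.
    #include <cstdio>

    int main() {
        const double params = 25e6;                       // assumed layer size
        printf("FP32: %.1f MB\n", params * 4 / 1e6);      // 100.0 MB
        printf("FP16: %.1f MB\n", params * 2 / 1e6);      //  50.0 MB
        printf("INT8: %.1f MB\n", params * 1 / 1e6);      //  25.0 MB
        return 0;
    }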

    As AI models continue to scale in size, accelerator architectures are evolving to support more efficient numerical formats. Future designs are expected to incorporate adaptive precision techniques, dynamically adjusting computation precision based on workload characteristics. This evolution promises further optimization of deep learning performance while striking an optimal balance between accuracy and energy efficiency.

    Architectural Integration

    The organization of computational primitives into execution units determines the efficiency of AI accelerators. While SIMD, tensor cores, and systolic arrays serve as fundamental building blocks, their integration into full-chip architectures varies significantly across different AI processors. The choice of execution units, their numerical precision support, and their connectivity impact how effectively hardware can scale for deep learning workloads.

    Modern AI processors exhibit a range of design trade-offs based on their intended applications. Some architectures, such as NVIDIA’s A100, integrate large numbers of tensor cores optimized for FP16-based training, while Google’s TPUv4 prioritizes high-throughput BF16 matrix multiplications. Inference-focused processors, such as Intel’s Sapphire Rapids, incorporate INT8-capable AMX matrix units to maximize efficiency. The Apple M1, designed for mobile workloads, employs smaller processing elements optimized for low-power FP16 execution. These design choices reflect the growing flexibility in numerical precision and execution unit organization, as discussed in the previous section.

    Table 11.6 summarizes the execution unit configurations across contemporary AI processors.

    Table 11.6: Execution unit configurations across modern AI processors.
    Processor      | SIMD Width   | Tensor Core Size          | Processing Elements | Primary Workloads
    NVIDIA A100    | 1024-bit     | \(4\times4\times4\) FP16  | 108 SMs             | Training, HPC
    Google TPUv4   | 128-wide     | \(128\times128\) BF16     | 2 cores/chip        | Training
    Intel Sapphire | 512-bit AVX  | \(32\times32\) INT8/BF16  | 56 cores            | Inference
    Apple M1       | 128-bit NEON | \(16\times16\) FP16       | 8 NPU cores         | Mobile inference

    Table 11.6 highlights how execution unit configurations vary across architectures to optimize for different deep learning workloads. Training accelerators prioritize high-throughput floating-point tensor operations, whereas inference processors focus on low-precision integer execution for efficiency. Meanwhile, mobile accelerators balance precision and power efficiency to meet real-time constraints.

    While execution units define the compute potential of an accelerator, their effectiveness is fundamentally constrained by data movement and memory hierarchy. Achieving high utilization of compute resources requires efficient memory systems that minimize data transfer overhead and optimize locality. The next section explores these architectural challenges, focusing on how memory hierarchy impacts AI accelerator performance.

    11.4 AI Memory Systems

    Machine learning accelerators are designed to maximize computational throughput, leveraging specialized primitives such as vector units, matrix engines, and systolic arrays. However, the efficiency of these compute units is fundamentally constrained by the availability of data. Unlike conventional workloads, ML models require frequent access to large volumes of parameters, activations, and intermediate results, leading to substantial memory bandwidth demands. If data cannot be delivered to the processing elements at the required rate, memory bottlenecks can significantly limit performance, regardless of the accelerator’s raw computational capability.

    Modern AI hardware leverages advanced memory hierarchies, efficient data movement techniques, and compression strategies to alleviate bottlenecks and enhance performance. By examining the interplay between ML workloads and memory systems along with memory bandwidth constraints, we can gain insights into architectural innovations that promote efficient execution and improved AI acceleration.

    11.4.1 AI Memory Wall

    Machine learning accelerators are capable of performing vast amounts of computation per cycle, but their efficiency is increasingly limited by data movement rather than raw processing power. The disparity between rapid computational advancements and slower memory performance has led to a growing bottleneck, often referred to as the AI memory wall. Even the most optimized hardware architectures struggle to sustain peak throughput if data cannot be delivered at the required rate. Ensuring that compute units remain fully utilized without being stalled by memory latency and bandwidth constraints is one of the central challenges in AI acceleration.

    The Compute-Memory Imbalance

    As we have seen, neural networks rely on specialized computational primitives such as vector operations, matrix multiplications, and domain-specific functional units that accelerate key aspects of machine learning workloads. These operations are designed for highly parallel execution, enabling accelerators to perform vast amounts of computation in each cycle. Given this level of specialization, one might expect neural networks to execute efficiently without significant bottlenecks. However, the primary constraint is not the raw compute power but rather the ability to continuously supply data to these processing units.

    While these compute units can execute trillions of operations per second, they remain heavily dependent on memory bandwidth to sustain peak performance. Each matrix multiplication or vector operation requires a steady flow of weights, activations, and intermediate results, all of which must be fetched from memory. If data cannot be delivered at the required rate, memory stalls occur, leaving many compute units idle. This imbalance between computational capability and data availability is often referred to as the memory wall, a fundamental challenge in AI acceleration.

    Over time, the gap between computation and memory performance has widened. As illustrated in Figure 11.4, the shaded region—termed the “AI Memory Wall”—highlights the growing disparity between compute performance and memory bandwidth over time. This visualization underscores the compute-memory imbalance, where computational capabilities advance rapidly while memory bandwidth lags, leading to potential bottlenecks in data-intensive applications. Over the past 20 years, peak server hardware FLOPs have scaled at 3.0x every two years, far outpacing the growth of DRAM bandwidth (1.6x/2yrs) (Gholami et al. 2024). This growing imbalance has made memory bandwidth, rather than compute, the primary constraint in AI acceleration.

    Gholami, Amir, Zhewei Yao, Sehoon Kim, Coleman Hooper, Michael W. Mahoney, and Kurt Keutzer. 2024. “AI and Memory Wall.” IEEE Micro 44 (3): 33–39. https://doi.org/10.1109/mm.2024.3373763.
    Figure 11.4: Compute performance versus memory bandwidth over time.

    Beyond performance limitations, memory access imposes a significant energy cost7. Fetching data from off-chip DRAM, in particular, consumes far more energy than performing arithmetic operations (Horowitz 2014). This inefficiency is particularly evident in machine learning models, where large parameter sizes, frequent memory accesses, and non-uniform data movement patterns exacerbate memory bottlenecks.

    7 A 32-bit floating-point addition consumes approximately 20 fJ, while fetching two 32-bit words from off-chip DRAM costs around 1.3 nJ, a difference of roughly 65,000x.

    Horowitz, Mark. 2014. “1.1 Computing’s Energy Problem (and What We Can Do about It).” In 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC). IEEE. https://doi.org/10.1109/isscc.2014.6757323.

    ML Workloads Are Memory-Intensive

    Machine learning workloads place substantial demands on memory systems due to the large volume of data involved in computation. Unlike traditional compute-bound applications, where performance is often dictated by the speed of arithmetic operations, ML workloads are characterized by high data movement requirements. The efficiency of an accelerator is not solely determined by its computational throughput but also by its ability to continuously supply data to processing units without introducing stalls or delays.

    A neural network processes multiple types of data throughout its execution, each with distinct memory access patterns:

    • Model parameters (weights and biases): Machine learning models, particularly those used in large-scale applications such as natural language processing and computer vision, often contain millions to billions of parameters. Storing and accessing these weights efficiently is essential for maintaining throughput.
    • Intermediate activations: During both training and inference, each layer produces intermediate results that must be temporarily stored and retrieved for subsequent operations. These activations can contribute significantly to memory overhead, particularly in deep architectures.
    • Gradients (during training): Backpropagation requires storing and accessing gradients for every parameter, further increasing the volume of data movement between compute units and memory.

    As models increase in size and complexity, improvements in memory capacity and bandwidth become essential. Although specialized compute units accelerate operations like matrix multiplications, their overall performance depends on the continuous, efficient delivery of data to the processing elements. In large-scale applications, such as natural language processing and computer vision, models often incorporate millions to billions of parameters (Brown et al. 2020). Consequently, achieving high performance necessitates minimizing delays and stalls caused by inefficient data movement between memory and compute units (Narayanan et al. 2021; Xingyu 2019).

    Xingyu, Huang et al. 2019. “Addressing the Memory Bottleneck in AI Accelerators.” IEEE Micro.

    One way to quantify this challenge is by comparing the data transfer time with the time required for computations. Specifically, we define the memory transfer time as \[ T_{\text{mem}} = \frac{M_{\text{total}}}{B_{\text{mem}}}, \] where \(M_{\text{total}}\) is the total data volume and \(B_{\text{mem}}\) is the available memory bandwidth. In contrast, the compute time is given by \[ T_{\text{compute}} = \frac{\text{FLOPs}}{P_{\text{peak}}}, \] with the number of floating-point operations (FLOPs) divided by the peak hardware throughput, \(P_{\text{peak}}\). When \(T_{\text{mem}} > T_{\text{compute}}\), the system becomes memory-bound, meaning that the processing elements spend more time waiting for data than performing computations. This imbalance demonstrates the need for memory-optimized architectures and efficient data movement strategies to sustain high performance.
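
    Plugging hypothetical numbers into these expressions shows how the comparison is applied in practice. All figures below are illustrative assumptions rather than measurements of any specific accelerator:

    // Worked example of the memory-bound test above (all values assumed).
    #include <cstdio>

    int main() {
        const double M_total = 20e9;     // bytes moved per step (assumed)
        const double B_mem   = 2e12;     // memory bandwidth in bytes/s (assumed)
        const double flops   = 2e12;     // operations per step (assumed)
        const double P_peak  = 300e12;   // peak throughput in FLOP/s (assumed)

        double t_mem     = M_total / B_mem;   // 10 ms of data movement
        double t_compute = flops / P_peak;    // ~6.7 ms of arithmetic
        printf("T_mem = %.2f ms, T_compute = %.2f ms -> %s-bound\n",
               t_mem * 1e3, t_compute * 1e3,
               t_mem > t_compute ? "memory" : "compute");
        return 0;
    }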

    Figure 11.5 illustrates how this imbalance plays out in practice: the rapid growth in model size is outpacing advances in hardware memory bandwidth, creating the “AI Memory Wall.” The shaded region emphasizes the widening gap between increasing parameter counts and available memory throughput, underscoring a critical challenge for modern AI system performance.

    Irregular Memory Access Patterns

    Unlike traditional computing workloads, where memory access follows well-structured and predictable patterns, machine learning models often exhibit irregular memory access behaviors that make efficient data retrieval a challenge. These irregularities arise due to the nature of ML computations, where memory access patterns are influenced by factors such as batch size, layer type, and sparsity. As a result, standard caching mechanisms and memory hierarchies often struggle to optimize performance, leading to increased memory latency and inefficient bandwidth utilization.

    Figure 11.5: Model growth versus memory bandwidth.

    To better understand how ML workloads differ from traditional computing workloads, it is useful to compare their respective memory access patterns (Table 11.7). Traditional workloads, such as scientific computing, general-purpose CPU applications, and database processing, typically exhibit well-defined memory access characteristics that benefit from standard caching and prefetching techniques. ML workloads, on the other hand, introduce highly dynamic access patterns that challenge conventional memory optimization strategies.

    Table 11.7: Memory access patterns in traditional vs. ML workloads.
    Feature | Traditional Computing Workloads | Machine Learning Workloads
    Memory Access Pattern | Regular and predictable (e.g., sequential reads, structured patterns) | Irregular and dynamic (e.g., sparsity, attention mechanisms)
    Cache Locality | High temporal and spatial locality | Often low locality, especially in large models
    Data Reuse | Structured loops with frequent data reuse | Sparse and dynamic reuse depending on layer type
    Data Dependencies | Well-defined dependencies allow efficient prefetching | Variable dependencies based on network structure
    Workload Example | Scientific computing (e.g., matrix factorizations, physics simulations) | Neural networks (e.g., CNNs, Transformers, sparse models)
    Memory Bottleneck | DRAM latency, cache misses | Off-chip bandwidth constraints, memory fragmentation
    Impact on Energy Consumption | Moderate, driven by FLOP-heavy execution | High, dominated by data movement costs

    One key source of irregularity in ML workloads stems from batch size and execution order. The way input data is processed in batches directly affects memory reuse, creating a complex optimization challenge. Small batch sizes decrease the likelihood of reusing cached activations and weights, resulting in frequent memory fetches from slower, off-chip memory. Larger batch sizes can improve reuse and amortize memory access costs, but simultaneously place higher demands on available memory bandwidth, potentially creating congestion at different memory hierarchy levels. This delicate balance requires careful consideration of model architecture and available hardware resources.

    In addition to batch size, different neural network layers interact with memory in distinct ways. Convolutional layers benefit from spatial locality, as neighboring pixels in an image are processed together, allowing for efficient caching of small weight kernels. Conversely, fully connected layers require frequent access to large weight matrices, often leading to more randomized memory access patterns that poorly align with standard caching policies. Transformers introduce additional complexity, as attention mechanisms demand accessing large key-value pairs stored across varied memory locations. The dynamic nature of sequence length and attention span renders traditional prefetching strategies ineffective, resulting in unpredictable memory latencies.

    Another significant factor contributing to irregular memory access is sparsity in neural networks. Many modern ML models employ techniques such as weight pruning, activation sparsity, and structured sparsity to reduce computational overhead. However, these optimizations often lead to non-uniform memory access, as sparse representations necessitate fetching scattered elements rather than sequential blocks, making hardware caching less effective. Furthermore, models that incorporate dynamic computation paths, such as Mixture of Experts8 and Adaptive Computation Time9, introduce highly non-deterministic memory access patterns, where the active neurons or model components can vary with each inference step. This variability challenges efficient prefetching and caching strategies.

    8 Mixture of Experts: A model design where different inputs are routed to specialized subnetworks based on gating mechanisms.

    9 Adaptive Computation Time: Allowing a network to dynamically allocate varying amounts of computation to different inputs based on their complexity.

    The consequences of these irregularities are significant. ML workloads often experience reduced cache efficiency, as activations and weights may not be accessed in predictable sequences. This leads to increased reliance on off-chip memory traffic, which not only slows down execution but also consumes more energy. Additionally, irregular access patterns contribute to memory fragmentation, where the way data is allocated and retrieved results in inefficient utilization of available memory resources. The combined effect of these factors is that ML accelerators frequently encounter memory bottlenecks that limit their ability to fully utilize available compute power.

    11.4.2 Memory Hierarchy

    To address the memory challenges in ML acceleration, hardware designers implement sophisticated memory hierarchies that balance speed, capacity, and energy efficiency. Understanding this hierarchy is essential before examining how different ML architectures utilize memory resources. Unlike general-purpose computing, where memory access patterns are often unpredictable, ML workloads exhibit structured reuse patterns that can be optimized through careful organization of data across multiple memory levels.

    At the highest level of this hierarchy, large-capacity but slow storage devices provide long-term model storage. At the lowest level, high-speed registers and caches ensure that compute units can access operands with minimal latency. Between these extremes, intermediate memory levels, including scratchpad memory, high-bandwidth memory (HBM), and off-chip DRAM, offer trade-offs between performance and capacity.

    Table 11.8 summarizes the key characteristics of different memory levels in modern AI accelerators. Each level in the hierarchy has distinct latency, bandwidth, and capacity properties, which directly influence how neural network data, such as weights, activations, and intermediate results, should be allocated.

    Table 11.8: Memory hierarchy characteristics and their impact on machine learning.
    Memory Level | Approx. Latency | Bandwidth | Capacity | Example Use in Deep Learning
    Registers | ~1 cycle | Highest | Few values | Storing operands for immediate computation
    L1/L2 Cache (SRAM) | ~1-10 ns | High | KBs-MBs | Caching frequently accessed activations and small weight blocks
    Scratchpad Memory | ~5-20 ns | High | MBs | Software-managed storage for intermediate computations
    High-Bandwidth Memory (HBM) | ~100 ns | Very High | GBs | Storing large model parameters and activations for high-speed access
    Off-Chip DRAM (DDR, GDDR, LPDDR) | ~50-150 ns | Moderate | GBs-TBs | Storing entire model weights that do not fit on-chip
    Flash Storage (SSD/NVMe) | ~100 µs - 1 ms | Low | TBs | Storing pre-trained models and checkpoints for later loading

    On-Chip Memory

    Each level of the memory hierarchy serves a distinct role in AI acceleration, with different trade-offs in speed, capacity, and accessibility. Registers, located within compute cores, provide the fastest access but can only store a few operands at a time. These are best utilized for immediate computations, where the operands needed for an operation can be loaded and consumed within a few cycles. However, because register storage is so limited, frequent memory accesses are required to fetch new operands and store intermediate results.

    To reduce the need for constant data movement between registers and external memory, small but fast caches serve as an intermediary buffer. These caches store recently accessed activations, weights, and intermediate values, ensuring that frequently used data remains available with minimal delay. However, the size of caches is limited, making them insufficient for storing full feature maps or large weight tensors in machine learning models. As a result, only the most frequently used portions of a model’s parameters or activations can reside here at any given time.

    For larger working datasets, many AI accelerators include scratchpad memory, which offers more storage than caches but with a crucial difference: it allows explicit software control over what data is stored and when it is evicted. Unlike caches, which rely on hardware-based eviction policies, scratchpad memory enables machine learning workloads to retain key values such as activations and filter weights for multiple layers of computation. This capability is particularly useful in models like convolutional neural networks, where the same input feature maps and filter weights are reused across multiple operations. By keeping this data in scratchpad memory rather than reloading it from external memory, accelerators can significantly reduce unnecessary memory transfers and improve overall efficiency (Chen, Emer, and Sze 2017).
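
    On GPUs, the closest software-visible analog to scratchpad memory is CUDA shared memory, which the programmer explicitly fills and reuses. The sketch below stages a small weight vector on chip once per thread block so that every thread can reuse it; the 256-element size and the dense-neuron computation are assumptions chosen for illustration.

    // Sketch of software-managed on-chip staging using CUDA shared memory.
    constexpr int WLEN = 256;                          // assumed weight length

    __global__ void dense_neuron(const float* weights, const float* inputs,
                                 float* outputs, int n_samples) {
        __shared__ float w_tile[WLEN];                 // Explicitly managed buffer
        for (int k = threadIdx.x; k < WLEN; k += blockDim.x) {
            w_tile[k] = weights[k];                    // Single DRAM fetch per block
        }
        __syncthreads();                               // Tile is ready for all threads

        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n_samples) {
            float acc = 0.0f;
            for (int k = 0; k < WLEN; ++k) {
                acc += w_tile[k] * inputs[idx * WLEN + k];  // Reuse on-chip weights
            }
            outputs[idx] = acc;
        }
    }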

    Off-Chip Memory

    Beyond on-chip memory, high-bandwidth memory (HBM) provides rapid access to larger model parameters and activations that do not fit within caches or scratchpad buffers. HBM achieves its high performance by stacking multiple memory dies and using very wide interfaces, allowing it to transfer far more data per unit time than traditional DRAM. Because of this bandwidth advantage, HBM is often used to store entire layers of machine learning models that must be accessed quickly during execution. However, its cost and power consumption limit its use primarily to high-performance AI accelerators, making it less common in power-constrained environments such as edge devices.

    When a machine learning model exceeds the capacity of on-chip memory and HBM, it must rely on off-chip DRAM, such as DDR, GDDR, or LPDDR. While DRAM offers significantly greater storage capacity, its access latency is higher, meaning that frequent retrievals from DRAM can introduce execution bottlenecks. To make effective use of DRAM, models must be structured so that only the necessary portions of weights and activations are retrieved at any given time, minimizing the impact of long memory fetch times.

    At the highest level of the hierarchy, flash storage and solid-state drives (SSDs) store large pre-trained models, datasets, and checkpointed weights. These storage devices offer large capacities but are too slow for real-time execution, requiring models to be loaded into faster memory tiers before computation begins. For instance, in training scenarios, checkpointed models stored in SSDs must be loaded into DRAM or HBM before resuming computation, as direct execution from SSDs would be too slow to maintain efficient accelerator utilization (Narayanan et al. 2021).

    The memory hierarchy balances competing objectives of speed, capacity, and energy efficiency. However, moving data through multiple memory levels introduces bottlenecks that limit accelerator performance. Data transfers between memory levels incur latency costs, particularly for off-chip accesses. Limited bandwidth restricts data flow between memory tiers. Memory capacity constraints force constant data movement as models exceed local storage.

    11.4.3 Host and Accelerator Communication

    Machine learning accelerators, such as GPUs and TPUs, achieve high computational throughput through parallel execution. However, their efficiency is fundamentally constrained by data movement between the host (CPU) and accelerator memory. Unlike general-purpose workloads that operate entirely within a CPU’s memory subsystem, AI workloads require frequent data transfers between CPU main memory and the accelerator, introducing latency, consuming bandwidth, and affecting overall performance.

    Host-accelerator data movement follows a structured sequence, as illustrated in Figure 11.6. Before computation begins, data is copied from CPU memory to the accelerator’s memory. The CPU then issues execution instructions, and the accelerator processes the data in parallel. Once computation completes, the results are stored in accelerator memory and transferred back to the CPU. Each step introduces potential inefficiencies that must be managed to optimize performance.

    Figure 11.6: Host-accelerator memory access interactions.

    The key challenges in host-accelerator data movement include latency, bandwidth constraints, and synchronization overheads. Optimizing data transfers through efficient memory management and interconnect technologies is essential for maximizing accelerator utilization.

    Data Transfer Patterns

    The efficiency of ML accelerators depends not only on their computational power but also on the continuous supply of data. Even high-performance GPUs and TPUs remain underutilized if data transfers are inefficient. Host and accelerator memory exist as separate domains, requiring explicit transfers over interconnects such as PCIe, NVLink, or proprietary links. Ineffective data movement can cause execution stalls, making transfer optimization critical.

    Figure 11.6 illustrates this structured sequence. In step (1), data is copied from CPU memory to accelerator memory, as GPUs cannot directly access host memory at high speeds. A direct memory access (DMA) engine typically handles this transfer without consuming CPU cycles. In step (2), the CPU issues execution commands via APIs like CUDA, ROCm, or OpenCL. Step (3) involves parallel execution on the accelerator, where stalls can occur if data is not available when needed. Finally, in step (4), computed results are copied back to CPU memory for further processing.
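
    Expressed with the CUDA runtime API, the four steps map onto a short sequence of calls. The kernel name, sizes, and launch dimensions below are placeholders, and error checking is omitted for brevity:

    // Minimal sketch of the four-step host-accelerator sequence.
    #include <cuda_runtime.h>

    __global__ void my_kernel(const float* in, float* out, size_t n);  // placeholder

    void run_step(const float* h_in, float* h_out, size_t n) {
        float *d_in, *d_out;
        cudaMalloc(&d_in,  n * sizeof(float));
        cudaMalloc(&d_out, n * sizeof(float));

        // (1) Copy inputs from host memory to accelerator memory
        cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

        // (2) + (3) The CPU issues the launch; the GPU executes in parallel
        my_kernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);

        // (4) Copy results back to host memory (synchronizes with the kernel)
        cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

        cudaFree(d_in);
        cudaFree(d_out);
    }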

    Latency and bandwidth limitations significantly impact AI workloads. PCIe, with a peak bandwidth of 32 GB/s (PCIe 4.0), is much slower than an accelerator’s high-bandwidth memory (HBM), which can exceed 1 TB/s. Large data transfers exacerbate bottlenecks, particularly in deep learning tasks. Additionally, synchronization overheads arise when computation must wait for data transfers to complete. Efficient scheduling and overlapping transfers with execution are essential to mitigate these inefficiencies.

    Common Data Transfer Mechanisms

    The movement of data between the host (CPU) and the accelerator (GPU, TPU, or other AI hardware) depends on the interconnect technology that links the two processing units. The choice of interconnect determines the bandwidth available for transfers, the latency of communication, and the overall efficiency of host-accelerator execution. The most commonly used transfer mechanisms include PCIe (Peripheral Component Interconnect Express), NVLink, Direct Memory Access (DMA), and Unified Memory Architectures. Each of these plays a crucial role in optimizing the four-step data movement process illustrated in Figure 11.6.

    PCIe: The Standard Host-Accelerator Interface

    Most accelerators communicate with the CPU via PCIe, the industry-standard interconnect for data movement. PCIe 4.0 provides up to 32 GB/s bandwidth, while PCIe 5.0 doubles this to 64 GB/s. However, this is still significantly lower than HBM bandwidth within accelerators, making PCIe a bottleneck for large AI workloads.

    PCIe also introduces latency overheads due to its packet-based communication and memory-mapped I/O model. Frequent small transfers are inefficient, so batching data movement reduces overhead. Computation commands, issued over PCIe, further contribute to latency, requiring careful optimization of execution scheduling.

    Direct Memory Access (DMA) for Efficient Data Transfers

    In conventional memory transfers, the CPU issues load/store instructions, consuming processing cycles. DMA offloads this task, enabling asynchronous data movement without CPU intervention.

    During data transfers, the CPU initiates a DMA request, allowing data to be copied to accelerator memory in the background. Similarly, result transfers back to main memory occur without blocking execution. This enables overlapping computation with data movement, reducing idle time and improving accelerator utilization.

    DMA is essential for enabling asynchronous data movement, in which transfers overlap with computation. Rather than waiting for each transfer to complete before execution begins, AI workloads can stream data into the accelerator while earlier computations are still in progress.
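
    One common way to realize this overlap is to split the input into chunks and alternate between two buffers and streams, so the copy for one chunk proceeds while the kernel for the previous chunk executes. The sketch below assumes pinned host memory (allocated with cudaMallocHost) and a placeholder kernel; the chunking and double-buffering depth are illustrative choices:

    // Sketch of overlapping host-to-device copies with computation using
    // two CUDA streams. Assumes h_data is pinned host memory.
    #include <cuda_runtime.h>

    __global__ void process_chunk(float* data, size_t n);   // placeholder kernel

    void stream_pipeline(float* h_data, size_t total, size_t chunk) {
        float* d_buf[2];
        cudaStream_t s[2];
        for (int i = 0; i < 2; ++i) {
            cudaMalloc(&d_buf[i], chunk * sizeof(float));
            cudaStreamCreate(&s[i]);
        }

        for (size_t off = 0, i = 0; off < total; off += chunk, ++i) {
            size_t b = i % 2;                     // Alternate buffers and streams
            size_t n = (off + chunk <= total) ? chunk : total - off;
            // Async copy in one stream overlaps with the kernel in the other
            cudaMemcpyAsync(d_buf[b], h_data + off, n * sizeof(float),
                            cudaMemcpyHostToDevice, s[b]);
            process_chunk<<<(n + 255) / 256, 256, 0, s[b]>>>(d_buf[b], n);
        }
        cudaDeviceSynchronize();                  // Wait for all streams to finish
        for (int i = 0; i < 2; ++i) {
            cudaStreamDestroy(s[i]);
            cudaFree(d_buf[i]);
        }
    }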

    Unified Memory: An Abstraction for Automatic Data Movement

    While PCIe, NVLink, and DMA optimize explicit memory transfers, some AI workloads require a more flexible memory model that eliminates the need for manual data copying. Unified Memory provides an abstraction that allows both the host and accelerator to access a single, shared memory space, automatically handling data movement when needed.

    With Unified Memory, data does not need to be explicitly copied between CPU and GPU memory before execution. Instead, when a computation requires a memory region that is currently located in host memory, the system automatically migrates it to the accelerator, handling step (1) transparently. Similarly, when computed results are accessed by the CPU, step (4) occurs automatically, eliminating the need for manual memory management.
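
    A minimal sketch of this workflow, assuming a placeholder kernel, shows how a single managed allocation is touched by both processors without any explicit copies:

    // Unified Memory sketch: one allocation visible to CPU and GPU,
    // with pages migrating on demand.
    #include <cuda_runtime.h>

    __global__ void scale(float* data, size_t n, float s) {
        size_t i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= s;
    }

    void unified_example(size_t n) {
        float* data;
        cudaMallocManaged(&data, n * sizeof(float));    // Shared CPU/GPU pointer
        for (size_t i = 0; i < n; ++i) data[i] = 1.0f;  // Touched first on the CPU

        scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f); // Pages migrate to the GPU
        cudaDeviceSynchronize();

        float first = data[0];                          // Migrates back on CPU access
        (void)first;
        cudaFree(data);
    }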

    Although Unified Memory simplifies programming, it introduces performance trade-offs. Since memory migrations occur on demand, they can lead to unpredictable latencies, particularly if large datasets need to be transferred frequently. Additionally, since Unified Memory is implemented through page migration techniques, small memory accesses can trigger excessive data movement, further reducing efficiency.

    For AI workloads that require fine-grained memory control, explicit data transfers using PCIe, NVLink, and DMA often provide better performance. However, for applications where ease of development is more important than absolute speed, Unified Memory offers a convenient alternative.

    Data Transfer Overheads and Latency

    Host-accelerator data movement introduces overheads that impact AI workload execution. Unlike on-chip memory accesses, which occur at nanosecond latencies, host-accelerator transfers traverse system interconnects, adding latency, bandwidth constraints, and synchronization delays.

    Interconnect latency affects transfer speed, with PCIe, the standard host-accelerator link, incurring significant overhead due to packet-based transactions and memory-mapped I/O. This makes frequent small transfers inefficient. Faster alternatives like NVLink reduce latency and improve bandwidth but are limited to specific hardware ecosystems.

    Synchronization delays further contribute to inefficiencies. Synchronous transfers block execution until data movement completes, ensuring data consistency but introducing idle time. Asynchronous transfers allow computation and data movement to overlap, reducing stalls but requiring careful coordination to avoid execution mismatches.

    These factors—interconnect latency, bandwidth limitations, and synchronization overheads—determine AI workload efficiency. While optimization techniques mitigate these limitations, understanding these fundamental transfer mechanics is essential for improving performance.

    11.4.4 Model Memory Pressure

    Machine learning models impose varying memory access patterns that significantly influence accelerator performance. The way data is transferred between the host and accelerator, how frequently memory is accessed, and the efficiency of caching mechanisms all determine overall execution efficiency. While multilayer perceptrons (MLPs), convolutional neural networks (CNNs), and transformer networks each require large parameter sets, their distinct memory demands necessitate tailored optimization strategies for accelerators. Understanding these differences provides insight into why different hardware architectures exhibit varying levels of efficiency across workloads.

    Multilayer Perceptrons

    MLPs, also referred to as fully connected networks, are among the simplest neural architectures. Each layer consists of a dense matrix multiplication, requiring every neuron to interact with all neurons in the preceding layer. This results in high memory bandwidth demands, particularly for weights, as every input activation contributes to a large set of computations.

    From a memory perspective, MLPs rely on large, dense weight matrices that frequently exceed on-chip memory capacity, necessitating off-chip memory accesses. Since accelerators cannot directly access host memory at high speed, data transfers must be explicitly managed via interconnects such as PCIe or NVLink. These transfers introduce latency and consume bandwidth, affecting execution efficiency.

    Despite their bandwidth-heavy nature, MLPs exhibit regular and predictable memory access patterns, making them amenable to optimizations such as prefetching and streaming memory accesses. Dedicated AI accelerators mitigate transfer overhead by staging weight matrices in fast SRAM caches and overlapping data movement with computation through direct memory access (DMA) engines, reducing execution stalls. These optimizations allow accelerators to sustain high throughput even when handling large parameter sets (Chen, Emer, and Sze 2017).

    Convolutional Neural Networks

    Convolutional Neural Networks (CNNs) are widely used in image processing and computer vision tasks. Unlike MLPs, which require dense matrix multiplications, CNNs process input feature maps using small filter kernels that slide across the image. This localized computation structure results in high spatial data reuse, where the same input pixels contribute to multiple convolutions.

    CNN accelerators benefit from on-chip memory optimizations, as convolution filters exhibit extensive reuse, allowing weights to be stored in fast local SRAM instead of frequently accessing off-chip memory. However, activation maps require careful management due to their size. Since accessing main memory over interconnects like PCIe introduces latency and bandwidth bottlenecks, CNN accelerators employ tiling techniques to divide feature maps into smaller regions that fit within on-chip buffers. This minimizes costly external memory transfers, improving overall efficiency (Chen, Emer, and Sze 2017).

    Chen, Yu-Hsin, Joel Emer, and Vivienne Sze. 2017. “Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks.” IEEE Micro, 1–1. https://doi.org/10.1109/mm.2017.265085944.

    While CNN workloads are more memory-efficient than MLPs, managing intermediate activations remains a challenge. Accelerators use hierarchical caching strategies and DMA engines to optimize memory movement, ensuring that computations are not stalled by inefficient host-accelerator data transfers. These memory optimizations help CNN accelerators maintain high throughput by reducing reliance on off-chip memory bandwidth (Chen, Emer, and Sze 2017).
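
    The idea behind tiling can be sketched in host-side C++: the output feature map is processed in tiles sized to fit an on-chip buffer, so each input patch and the filter weights would be fetched from external memory once per tile. All sizes (a \(256\times256\) map, \(64\times64\) tiles, a \(3\times3\) filter) are illustrative assumptions, and the on-chip staging step is indicated only by a comment:

    // Host-side C++ sketch of feature-map tiling for a 3x3 convolution.
    #include <vector>

    void tiled_conv(const std::vector<float>& in, std::vector<float>& out,
                    const float kernel[3][3], int H, int W) {
        const int K = 3, TH = 64, TW = 64;       // Assumed filter and tile sizes
        for (int ty = 0; ty < H; ty += TH) {
            for (int tx = 0; tx < W; tx += TW) {
                // In hardware, the input patch and kernel for this tile
                // would be staged into on-chip memory at this point.
                for (int y = ty; y < ty + TH && y < H; ++y) {
                    for (int x = tx; x < tx + TW && x < W; ++x) {
                        float acc = 0.0f;
                        for (int ky = 0; ky < K; ++ky) {
                            for (int kx = 0; kx < K; ++kx) {
                                int iy = y + ky - 1, ix = x + kx - 1;
                                if (iy >= 0 && iy < H && ix >= 0 && ix < W)
                                    acc += kernel[ky][kx] * in[iy * W + ix];
                            }
                        }
                        out[y * W + x] = acc;    // Finished tile written back once
                    }
                }
            }
        }
    }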

    Transformer Networks

    Transformers have become the dominant architecture for natural language processing and are increasingly used in other domains such as vision and speech recognition. Unlike CNNs, which rely on local computations, transformers perform global attention mechanisms, where each token in an input sequence can interact with all other tokens. This leads to irregular and bandwidth-intensive memory access patterns, as large key-value matrices must be fetched and updated frequently.

    These models are particularly challenging for accelerators due to their massive parameter sizes, which often exceed on-chip memory capacity. As a result, frequent memory transfers between host and accelerator introduce substantial latency overheads, particularly when relying on interconnects such as PCIe. Unified Memory architectures can mitigate some of these issues by dynamically handling data movement, but they introduce additional latency due to unpredictable on-demand memory migrations. Because transformers are memory-bound rather than compute-bound, accelerators optimized for them rely on high-bandwidth memory (HBM), tensor tiling, and memory partitioning to sustain performance (Brown et al. 2020).

    Brown, Tom B., Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, et al. 2020. “Language Models Are Few-Shot Learners.” NeurIPS, May. http://arxiv.org/abs/2005.14165v4.
    Narayanan, Deepak, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Anand Korthikanti, Dmitri Vainbrand, et al. 2021. “Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM.” NeurIPS, April. http://arxiv.org/abs/2104.04473v5.

    Additionally, attention caching mechanisms and specialized tensor layouts reduce redundant memory fetches, improving execution efficiency. Given the bandwidth limitations of traditional interconnects, NVLink-enabled architectures offer significant advantages for large-scale transformer training, as they provide higher throughput and lower latency compared to PCIe. Furthermore, DMA-based asynchronous memory transfers enable overlapping computation with data movement, reducing execution stalls (Narayanan et al. 2021).

    11.4.5 Implications for ML Accelerators

    The diverse memory requirements of MLPs, CNNs, and Transformers highlight the need to tailor memory architectures to specific workloads. Table 11.9 compares the memory access patterns across these different models.

    Table 11.9: Memory access characteristics across different ML models.
    Model Type | Weight Size | Activation Reuse | Memory Access Pattern | Primary Bottleneck
    MLP (Dense) | Large, dense | Low | Regular, sequential (streamed) | Bandwidth (off-chip)
    CNN | Small, reused | High | Spatial locality | Feature map movement
    Transformer | Massive, sparse | Low | Irregular, high-bandwidth | Memory capacity + Interconnect

    Each model type presents unique challenges that directly impact accelerator design. MLPs benefit from fast streaming access to dense weight matrices, making memory bandwidth a critical factor in performance, especially when transferring large weights from host memory to accelerator memory. CNNs, with their high activation reuse and structured memory access patterns, can leverage on-chip caching and tiling strategies to minimize off-chip memory transfers. Transformers, however, impose significant demands on both bandwidth and capacity, as attention mechanisms require frequent access to large key-value matrices, leading to high interconnect traffic and increased memory pressure.

    To address these challenges, modern AI accelerators incorporate multi-tier memory hierarchies that balance speed, capacity, and energy efficiency. On-chip SRAM caches and scratchpad memories store frequently accessed data, while high-bandwidth external memory (HBM) provides scalability for large models. Efficient interconnects, such as NVLink, help alleviate host-accelerator transfer bottlenecks, particularly in transformer workloads where memory movement constraints can dominate execution time.

    As ML workloads continue to grow in complexity, memory efficiency is becoming as critical as raw compute power. Efficient data movement strategies, asynchronous memory transfers (DMA), and unified memory architectures play a fundamental role in sustaining high performance. The following section explores how neural network computations are mapped onto accelerator hardware, detailing how placement, scheduling, and memory allocation interact to optimize execution efficiency.

    11.5 Mapping Neural Networks

    Efficient execution of machine learning models on specialized AI acceleration hardware requires a structured approach to computation, ensuring that available resources are fully utilized while minimizing performance bottlenecks. Unlike general-purpose processors, which rely on dynamic task scheduling, AI accelerators operate under a structured execution model that maximizes throughput by carefully assigning computations to processing elements. This process, known as mapping, dictates how computations are distributed across hardware resources, influencing execution speed, memory access patterns, and overall efficiency.

    Definition of Mapping in AI Acceleration

    Mapping in AI Acceleration refers to the assignment of machine learning computations to hardware processing units to optimize execution efficiency. This process involves spatial allocation, which distributes computations across processing elements; temporal scheduling, which sequences operations to maintain balanced workloads; and memory-aware execution, which strategically places data to minimize access latency. Effective mapping ensures high resource utilization, reduced memory stalls, and energy-efficient execution, making it a critical factor in AI acceleration.

    Mapping machine learning models onto AI accelerators presents several challenges due to hardware constraints and the diversity of model architectures. Given the hierarchical memory system of modern accelerators, mapping strategies must carefully manage when and where data is accessed to minimize latency and power overhead while ensuring that compute units remain actively engaged. Poor mapping decisions can lead to underutilized compute resources, excessive data movement, and increased execution time, ultimately reducing overall efficiency.

    Mapping encompasses three interrelated aspects that form the foundation of effective AI accelerator design.

    • Computation Placement: Systematically assigns operations (e.g., matrix multiplications, convolutions) to processing elements to maximize parallelism and reduce idle time.
    • Memory Allocation: Carefully determines where model parameters, activations, and intermediate results reside within the memory hierarchy to optimize access efficiency.
    • Dataflow and Execution Scheduling: Structures the movement of data between compute units to reduce bandwidth bottlenecks and ensure smooth, continuous execution.

    Effective mapping strategies minimize off-chip memory accesses, maximize compute utilization, and efficiently manage data movement across different levels of the memory hierarchy. The following sections explore the key mapping choices that influence execution efficiency and lay the groundwork for optimization strategies that refine these decisions.

    11.5.1 Computation Placement

    Modern AI accelerators are designed to execute machine learning models with massive parallelism, leveraging thousands to millions of processing elements (PEs) to perform computations simultaneously. However, simply having a large number of compute units is not enough—how computations are assigned to these units determines overall efficiency.

    Without careful placement, some processing elements may sit idle while others are overloaded, leading to wasted resources, increased memory traffic, and reduced performance. Computation placement is the process of strategically mapping operations onto available hardware resources to sustain high throughput, minimize stalls, and optimize execution efficiency.

    Defining Computation Placement

    AI accelerators contain thousands to millions of processing elements, making computation placement a large-scale problem. Modern GPUs, such as the NVIDIA H100, feature over 16,000 CUDA cores and more than 500 specialized tensor cores, each designed to accelerate matrix operations (Jouppi, Young, et al. 2017). TPUs utilize systolic arrays composed of thousands of interconnected multiply-accumulate (MAC) units, while wafer-scale processors like Cerebras’ CS-2 push parallelism even further, integrating over 850,000 cores on a single chip (Systems 2021b). In these architectures, even minor inefficiencies in computation placement can lead to significant performance losses, as idle cores or excessive memory movement compound across the system.

    Jouppi, Norman P., Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, et al. 2017. “In-Datacenter Performance Analysis of a Tensor Processing Unit.” In Proceedings of the 44th Annual International Symposium on Computer Architecture, 1–12. ACM. https://doi.org/10.1145/3079856.3080246.
    Systems, Cerebras. 2021b. “Wafer-Scale Deep Learning Acceleration with the Cerebras CS-2.” Cerebras Technical Paper.

    Computation placement ensures that all processing elements contribute effectively to execution. This means that workloads must be distributed in a way that avoids imbalanced execution, where some processing elements sit idle while others remain overloaded. Similarly, placement must minimize unnecessary data movement, as excessive memory transfers introduce latency and power overheads that degrade system performance.

    Neural network computations vary significantly based on the model architecture, influencing how placement strategies are applied. For example, in a convolutional neural network (CNN), placement focuses on dividing image regions across processing elements to maximize parallelism. A \(256\times256\) image processed through thousands of GPU cores might be broken into small tiles, each mapped to a different processing unit to execute convolutional operations simultaneously. In contrast, a transformer-based model requires placement strategies that accommodate self-attention mechanisms, where each token in a sequence interacts with all others, leading to irregular and memory-intensive computation patterns. Meanwhile, Graph Neural Networks (GNNs) introduce additional complexity, as computations depend on sparse and dynamic graph structures that require adaptive workload distribution (Zheng et al. 2020).
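
    To make this concrete, the sketch below illustrates spatial partitioning for the \(256\times256\) example above: the input is divided into fixed-size tiles, and each tile becomes an independent unit of work that could be assigned to a processing element. The tile size and the idea of collecting tiles into a work list are illustrative assumptions, not a description of any particular accelerator.

    import numpy as np

    ## Illustrative spatial partitioning: split a 256x256 input into
    ## 32x32 tiles and assign each tile to a (hypothetical) processing element.
    image = np.random.rand(256, 256)
    TILE = 32

    tiles = []
    for row in range(0, 256, TILE):
        for col in range(0, 256, TILE):
            # Each tile is an independent unit of work that a PE could process
            # in parallel with the others (e.g., a convolution over that region).
            tiles.append(((row, col), image[row:row + TILE, col:col + TILE]))

    print(f"{len(tiles)} tiles mapped onto processing elements")  # 64 tiles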

    Because computation placement directly impacts resource utilization, execution speed, and power efficiency, it is one of the most critical factors in AI acceleration. A well-placed computation can reduce latency by orders of magnitude, while a poorly placed one can render thousands of processing units underutilized. The next section explores why efficient computation placement is essential and the consequences of suboptimal mapping strategies.

    Why Computation Placement Matters

    While computation placement is a hardware-driven process, its importance is fundamentally shaped by the structure of neural network workloads. Different types of machine learning models exhibit distinct computation patterns, which directly influence how efficiently they can be mapped onto accelerators. Without careful placement, workloads can become unbalanced, memory access patterns can become inefficient, and the overall performance of the system can degrade significantly.

    For models with structured computation patterns, such as CNNs, computation placement is relatively straightforward. CNNs process images using filters that are applied to small, localized regions, meaning their computations can be evenly distributed across processing elements. Because these operations are highly parallelizable, CNNs benefit from spatial partitioning, where the input is divided into tiles that are processed independently. This structured execution makes CNNs well-suited for accelerators that favor regular dataflows, minimizing the complexity of placement decisions.

    However, for models with irregular computation patterns, such as transformers and GNNs, computation placement becomes significantly more challenging. Transformers, which rely on self-attention mechanisms, require each token in a sequence to interact with all others, resulting in non-uniform computation demands. Unlike CNNs, where each processing element performs a similar amount of work, transformers introduce workload imbalance, where certain operations—such as computing attention scores—require far more computation than others. Without careful placement, this imbalance can lead to stalls, where some processing elements remain idle while others struggle to keep up.

    The challenge is even greater in graph neural networks (GNNs), where computation depends on sparse and dynamically changing graph structures. Unlike CNNs, which operate on dense and regularly structured data, GNNs must process nodes and edges with highly variable degrees of connectivity. Some regions of a graph may require significantly more computation than others, making workload balancing across processing elements difficult (Zheng et al. 2020). If computations are not placed strategically, some compute units will sit idle while others remain overloaded, leading to underutilization and inefficiencies in execution.

    Poor computation placement adversely affects AI execution by creating workload imbalance, inducing excessive data movement, and causing execution stalls and bottlenecks. Specifically, an uneven distribution of computations can lead to idle processing elements, thereby preventing full hardware utilization and diminishing throughput. In addition, inefficient execution assignment increases memory traffic by necessitating frequent data transfers between memory hierarchies, which in turn introduces latency and raises power consumption. Finally, such misallocation can cause operations to wait on data dependencies, resulting in pipeline inefficiencies that ultimately lower overall system performance.

    Ultimately, computation placement is not just about assigning operations to processing elements—it is about ensuring that models execute efficiently given their unique computational structure. A well-placed workload reduces execution time, memory overhead, and power consumption, while a poorly placed one can lead to stalled execution pipelines and inefficient resource utilization. The next section explores the key considerations that must be addressed to ensure that computation placement is both efficient and adaptable to different model architectures.

    Key Considerations for Effective Computation Placement

    Computation placement is a balancing act between hardware constraints and workload characteristics. To achieve high efficiency, placement strategies must account for parallelism, memory access, and workload variability while ensuring that processing elements remain fully utilized. Poor placement leads to imbalanced execution, increased data movement, and performance degradation, making it essential to consider key factors when designing placement strategies.

    As summarized in Table 11.10, computation placement faces several critical challenges that impact execution efficiency. Effective mapping strategies must address these challenges by balancing workload distribution, minimizing data movement, and optimizing communication across processing elements.

    Table 11.10: Primary challenges in computation placement and key considerations for effective mapping strategies.
    Challenge Impact on Execution Key Considerations for Placement
    Workload Imbalance Some processing elements finish early while others remain overloaded, leading to idle compute resources. Distribute operations evenly to prevent stalls and ensure full utilization of PEs.
    Irregular Computation Patterns Models like transformers and GNNs introduce non-uniform computation demands, making static placement difficult. Use adaptive placement strategies that adjust execution based on workload characteristics.
    Excessive Data Movement Frequent memory transfers introduce latency and increase power consumption. Keep frequently used data close to the compute units and minimize off-chip memory accesses.
    Limited Interconnect Bandwidth Poorly placed operations can create congestion, slowing data movement between PEs. Optimize spatial and temporal placement to reduce communication overhead.
    Model-Specific Execution Needs CNNs, transformers, and GNNs require different execution patterns, making a single placement strategy ineffective. Tailor placement strategies to match the computational structure of each model type.

    Each of these challenges highlights a core trade-off in computation placement: maximizing parallelism while minimizing memory overhead. For CNNs, placement strategies prioritize structured tiling to maintain efficient data reuse. For transformers, placement must ensure balanced execution across attention layers. For GNNs, placement must dynamically adjust to sparse computation patterns.

    Beyond model-specific needs, effective computation placement must also be scalable. As models grow in size and complexity, placement strategies must adapt dynamically rather than relying on static execution patterns. AI accelerators therefore increasingly integrate runtime-aware scheduling mechanisms, where placement is optimized based on real-time workload behavior rather than predetermined execution plans.

    Ultimately, effective computation placement requires a holistic approach that balances hardware capabilities with model characteristics. The next section explores how computation placement interacts with memory allocation and data movement, ensuring that AI accelerators operate at peak efficiency.

    11.5.2 Memory Allocation

    Efficient memory allocation is a key requirement for high-performance AI acceleration. As AI models grow in complexity, accelerators must manage vast amounts of data movement—loading model parameters, storing intermediate activations, and handling gradient computations. The way this data is allocated across the memory hierarchy directly affects execution efficiency, power consumption, and overall system throughput.

    Defining Memory Allocation

    While computation placement determines where operations are executed, memory allocation defines where data is stored and how it is accessed throughout execution. As discussed earlier, all AI accelerators rely on hierarchical memory systems, ranging from on-chip caches and scratchpads to high-bandwidth memory (HBM) and DRAM. Poor memory allocation can lead to excessive off-chip memory accesses, increasing bandwidth contention and slowing down execution. Since AI accelerators operate at teraflop and petaflop scales, inefficient memory access patterns can result in substantial performance bottlenecks.

    The primary goal of memory allocation is to minimize latency and reduce power consumption by keeping frequently accessed data as close as possible to the processing elements. Different hardware architectures implement memory hierarchies tailored for AI workloads. GPUs rely on a mix of global memory, shared memory, and registers, requiring careful tiling strategies to optimize locality. TPUs use on-chip SRAM scratchpads, where activations and weights must be efficiently preloaded to sustain systolic array execution. Wafer-scale processors, with their hundreds of thousands of cores, demand sophisticated memory partitioning strategies to avoid excessive interconnect traffic. In all cases, the effectiveness of memory allocation determines the overall throughput, power efficiency, and scalability of AI execution.
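
    As a rough illustration of why allocation decisions matter, the following sketch checks whether a layer's working set fits within a fixed on-chip budget and, if not, shrinks the tile size until it does. All capacities, layer dimensions, and the working-set formula are hypothetical simplifications chosen only for illustration.

    ## Illustrative memory-allocation check (all sizes are hypothetical).
    ## Decide whether a layer's working set fits on-chip, and if not,
    ## pick a tile size whose footprint does.
    ONCHIP_BYTES = 20 * 1024 * 1024         # assume 20 MB of on-chip SRAM
    BYTES_PER_ELEM = 2                      # assume fp16 storage

    def working_set_bytes(rows, cols, batch):
        weights = rows * cols               # weight matrix elements
        acts = cols * batch + rows * batch  # input + output activations
        return (weights + acts) * BYTES_PER_ELEM

    full = working_set_bytes(rows=8192, cols=8192, batch=64)
    print(f"full layer: {full / 1e6:.1f} MB, fits on-chip: {full <= ONCHIP_BYTES}")

    ## Tile the output dimension until the resident footprint fits.
    tile_rows = 8192
    while working_set_bytes(tile_rows, 8192, 64) > ONCHIP_BYTES and tile_rows > 1:
        tile_rows //= 2
    print(f"a tile of {tile_rows} output rows fits in on-chip memory")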

    Why Memory Allocation Matters

    Memory allocation is important in AI acceleration because how and where data is stored directly impacts execution efficiency. Unlike general-purpose computing, where memory management is abstracted by caches and dynamic allocation, AI accelerators require explicit data placement strategies to sustain high throughput and avoid unnecessary stalls. When memory is not allocated efficiently, AI workloads suffer from latency overhead, excessive power consumption, and bottlenecks that limit computational performance.

    Neural network architectures have varying memory demands, which influence the importance of proper allocation. CNNs rely on structured and localized data access patterns, meaning that inefficient memory allocation can lead to redundant data loads and cache inefficiencies. In contrast, transformer models require frequent access to large model parameters and intermediate activations, making them highly sensitive to memory bandwidth constraints. GNNs introduce even greater challenges, as their irregular and sparse data structures result in unpredictable memory access patterns that can lead to inefficient use of memory resources.

    Poor memory allocation has three major consequences for AI execution:

    1. Increased Memory Latency: When frequently accessed data is not stored in the right location, accelerators must retrieve it from higher-latency memory, slowing down execution.
    2. Higher Power Consumption: Off-chip memory accesses consume significantly more energy than on-chip storage, leading to inefficiencies at scale.
    3. Reduced Computational Throughput: If data is not available when needed, processing elements remain idle, reducing the overall performance of the system.

    As AI models continue to grow in size and complexity, the importance of scalable and efficient memory allocation increases. Memory limitations can dictate how large of a model can be deployed on a given accelerator, affecting feasibility and performance. The next section explores the key considerations that impact memory allocation strategies and the constraints that must be addressed to optimize execution efficiency.

    Key Considerations for Effective Memory Allocation

    Inefficient allocation leads to frequent stalls, excessive memory traffic, and power inefficiencies, all of which degrade overall performance. As summarized in Table 11.11, memory allocation in AI accelerators must address several key challenges that influence execution efficiency. Effective allocation strategies mitigate high latency, bandwidth limitations, and irregular access patterns by carefully managing data placement and movement. Ensuring that frequently accessed data is stored in faster memory locations while minimizing unnecessary transfers is essential for maintaining performance and energy efficiency.

    Each of these challenges requires careful memory management to balance execution efficiency with hardware constraints. While structured models may benefit from well-defined memory layouts that facilitate predictable access, others, like transformer-based and graph-based models, require more adaptive allocation strategies to handle variable and complex memory demands.

    Table 11.11: Key challenges in memory allocation and considerations for efficient execution.
    Challenge Impact on Execution Key Considerations for Allocation
    High Memory Latency Slow data access delays execution and reduces throughput. Prioritize placing frequently accessed data in faster memory locations.
    Limited On-Chip Storage Small local memory constrains the amount of data available near compute units. Allocate storage efficiently to maximize data availability without exceeding hardware limits.
    High Off-Chip Bandwidth Demand Frequent access to external memory increases delays and power consumption. Reduce unnecessary memory transfers by carefully managing when and how data is moved.
    Irregular Memory Access Patterns Some models require accessing data unpredictably, leading to inefficient memory usage. Organize memory layout to align with access patterns and minimize unnecessary data movement.
    Model-Specific Memory Needs Different models require different allocation strategies to optimize performance. Tailor allocation decisions based on the structure and execution characteristics of the workload.

    Beyond workload-specific considerations, memory allocation must also be scalable. As model sizes continue to grow, accelerators must dynamically manage memory resources rather than relying on static allocation schemes. Ensuring that frequently used data is accessible when needed without overwhelming memory capacity is essential for maintaining high efficiency.

    In summary, mapping neural network computations to specialized hardware is a foundational step in AI acceleration, directly influencing performance, memory efficiency, and energy consumption. However, selecting an effective mapping strategy is not a trivial task—hardware constraints, workload variability, and execution dependencies create a vast and complex design space.

    While the principles of computation placement, memory allocation, and data movement provide a structured foundation, optimizing these decisions requires advanced techniques to navigate the trade-offs involved. The next section explores optimization strategies that refine mapping decisions, focusing on techniques that efficiently search the design space to maximize execution efficiency while balancing hardware constraints.

    11.5.3 Combinatorial Complexity

    The efficient execution of machine learning models on AI accelerators requires careful consideration of placement—the spatial assignment of computations and data—and allocation—the temporal distribution of resources. These decisions are interdependent, and each introduces trade-offs that impact performance, energy efficiency, and scalability. Table 11.12 outlines the fundamental trade-offs between computation placement and resource allocation in AI accelerators. Placement decisions influence parallelism, memory access patterns, and communication overhead, while allocation strategies determine how resources are distributed over time to balance execution efficiency. The interplay between these factors shapes overall performance, requiring a careful balance to avoid bottlenecks such as excessive synchronization, memory congestion, or underutilized compute resources. Optimizing these trade-offs is essential for ensuring that AI accelerators operate at peak efficiency.

    Table 11.12: Trade-offs between computation placement and resource allocation in AI accelerators.
    Dimension Placement Considerations Allocation Considerations
    Computational Granularity Fine-grained placement enables greater parallelism but increases synchronization overhead. Coarse-grained allocation reduces synchronization overhead but may limit flexibility.
    Spatial vs. Temporal Mapping Spatial placement enhances parallel execution but can lead to resource contention and memory congestion. Temporal allocation balances resource sharing but may reduce overall throughput.
    Memory and Data Locality Placing data closer to compute units minimizes latency but may reduce overall memory availability. Allocating data across multiple memory levels increases capacity but introduces higher access costs.
    Communication and Synchronization Co-locating compute units reduces communication latency but may introduce contention. Allocating synchronization mechanisms mitigates stalls but can introduce additional overhead.
    Dataflow and Execution Ordering Static placement simplifies execution but limits adaptability to workload variations. Dynamic allocation improves adaptability but adds scheduling complexity.

    Each of these dimensions requires balancing trade-offs between placement and allocation. For instance, spatially distributing computations across multiple processing elements can increase throughput; however, if data allocation is not optimized, memory bandwidth limitations may introduce bottlenecks. Likewise, allocating resources for fine-grained computations may enhance flexibility but, without appropriate placement strategies, may lead to excessive synchronization overhead.

    Because AI accelerator architectures impose constraints on both where computations execute and how resources are assigned over time, selecting an effective mapping strategy necessitates a coordinated approach to placement and allocation. Understanding how these trade-offs influence execution efficiency is essential for optimizing performance on AI accelerators.

    Mapping Configuration Space

    The efficiency of AI accelerators is determined not only by their computational capabilities but also by how neural network computations are mapped to hardware resources. Mapping defines how computations are assigned to processing elements, how data is placed and moved through the memory hierarchy, and how execution is scheduled. The choices made in this process significantly impact performance, influencing compute utilization, memory bandwidth efficiency, and energy consumption.

    Mapping machine learning models to hardware presents a large and complex design space. Unlike traditional computational workloads, model execution involves multiple interacting factors—computation, data movement, parallelism, and scheduling—each introducing constraints and trade-offs. The hierarchical memory structure of accelerators, as discussed in the Memory Systems section, further complicates this process by imposing limits on bandwidth, latency, and data reuse. As a result, effective mapping strategies must carefully balance competing objectives to maximize efficiency.

    At the heart of this design space lie three interconnected aspects: data placement, computation scheduling, and data movement timing. Data placement refers to the allocation of data across various memory hierarchies—including on-chip buffers, caches, and off-chip DRAM—and its effective management is critical because it influences both latency and energy consumption. Inefficient placement often results in frequent, costly memory accesses, whereas strategic placement ensures that data used regularly remains in fast-access storage. Computation scheduling governs the order in which operations execute, impacting compute efficiency and memory access patterns; for instance, some execution orders may optimize parallelism while introducing synchronization overheads, and others may improve data locality at the expense of throughput. Meanwhile, timing in data movement is equally essential, as transferring data between memory levels incurs significant latency and energy costs. Efficient mapping strategies thus focus on minimizing unnecessary transfers by reusing data and overlapping communication with computation to enhance overall performance.

    These factors define a vast combinatorial design space, where small variations in mapping decisions can lead to large differences in performance and energy efficiency. A poor mapping strategy can result in underutilized compute resources, excessive data movement, or imbalanced workloads, creating bottlenecks that degrade overall efficiency. Conversely, a well-designed mapping maximizes both throughput and resource utilization, making efficient use of available hardware.

    Because of the interconnected nature of mapping decisions, there is no single optimal solution—different workloads and hardware architectures demand different approaches. The next sections examine the structure of this design space and how different mapping choices shape the execution of machine learning workloads.

    Mapping machine learning computations onto specialized hardware requires balancing multiple constraints, including compute efficiency, memory bandwidth, and execution scheduling. The challenge arises from the vast number of possible ways to assign computations to processing elements, order execution, and manage data movement. Each decision contributes to a high-dimensional search space, where even minor variations in mapping choices can significantly impact performance.

    Unlike traditional workloads with predictable execution patterns, machine learning models introduce diverse computational structures that require flexible mappings adapted to data reuse, parallelization opportunities, and memory constraints. The search space grows combinatorially, making exhaustive search infeasible. To understand this complexity, we analyze three key sources of variation:

    Ordering of Computation and Execution Dependencies

    Machine learning workloads are often structured as nested loops, iterating over various dimensions of computation. For instance, a matrix multiplication kernel may loop over batch size (\(N\)), input features (\(C\)), and output features (\(K\)). The order in which these loops execute has a profound effect on data locality, reuse patterns, and computational efficiency.

    The number of ways to arrange \(d\) loops follows a factorial growth pattern: \[ \mathcal{O} = d! \] which scales rapidly. A typical convolutional layer may involve up to seven loop dimensions, leading to: \[ 7! = 5,040 \text{ possible execution orders.} \]

    Furthermore, when considering multiple memory levels, the search space expands as: \[ (d!)^l \] where \(l\) is the number of memory hierarchy levels. This rapid expansion highlights why execution order optimization is crucial—poor loop ordering can lead to excessive memory traffic, while an optimized order improves cache utilization (Sze et al. 2017a).

    Sze, Vivienne, Yu-Hsin Chen, Tien-Ju Yang, and Joel Emer. 2017a. “Efficient Processing of Deep Neural Networks: A Tutorial and Survey.” Proceedings of the IEEE 105 (12): 2295–2329. https://doi.org/10.1109/jproc.2017.2761740.
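
    The growth of the ordering space is easy to verify directly. The short snippet below, using generic labels for seven convolution loop dimensions, enumerates the \(7!\) orderings and then raises that count to the power of an assumed three-level memory hierarchy.

    from itertools import permutations

    ## Loop dimensions of a typical convolution (labels are generic placeholders).
    loops = ["N", "K", "C", "H", "W", "R", "S"]

    orders = list(permutations(loops))
    print(len(orders))            # 5040 = 7!

    levels = 3                    # assumed memory hierarchy levels
    print(len(orders) ** levels)  # (7!)^3, roughly 1.28e11 candidate orderings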

    Parallelization Across Processing Elements

    Modern AI accelerators leverage thousands of processing elements to maximize parallelism, but determining which computations should be parallelized is non-trivial. Excessive parallelization can introduce synchronization overheads and increased bandwidth demands, while insufficient parallelization leads to underutilized hardware.

    The number of ways to select and order computations for distribution among parallel units follows the partial permutation count: \[ \mathcal{P} = \frac{d!}{(d-k)!} \] where \(d\) is the number of loops, and \(k\) is the number selected for parallel execution. For a six-loop computation where three loops are chosen for parallel execution, the number of valid configurations is: \[ \frac{6!}{(6-3)!} = 120. \]

    Even for a single layer, there can be hundreds of valid parallelization strategies, each affecting data synchronization, memory contention, and overall compute efficiency. Expanding this across multiple layers and model architectures further magnifies the complexity.

    Memory Placement and Data Movement

    The hierarchical memory structure of AI accelerators introduces additional constraints, as data must be efficiently placed across registers, caches, shared memory, and off-chip DRAM. Data placement impacts latency, bandwidth consumption, and energy efficiency—frequent access to slow memory creates bottlenecks, while optimized placement reduces costly memory transfers.

    The number of ways to allocate data across memory levels follows an exponential growth function: \[ \mathcal{M} = n^{d \times l} \] where:

    • \(n\) = number of placement choices per level,
    • \(d\) = number of computational dimensions,
    • \(l\) = number of memory hierarchy levels.

    For a model with:

    • \(d = 5\) computational dimensions,
    • \(l = 3\) memory levels,
    • \(n = 4\) possible placement choices per level,

    the number of possible memory allocations is: \[ 4^{5 \times 3} = 4^{15} = 1,073,741,824. \]

    This highlights how even a single layer may have over a billion possible memory configurations, making manual optimization impractical.

    Total Mapping Search Space

    By combining the complexity from computation ordering, parallelization, and memory placement, the total mapping search space can be approximated as: \[ \mathcal{S} = \left( n^d \times d! \times \frac{d!}{(d-k)!} \right)^l \] where:

    • \(n^d\) represents memory placement choices,
    • \(d!\) accounts for computation ordering choices,
    • \(\frac{d!}{(d-k)!}\) captures parallelization possibilities,
    • \(l\) is the number of memory hierarchy levels.

    This equation illustrates the exponential growth of the search space, making brute-force search infeasible for all but the simplest cases.
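
    As a sanity check on this expression, the snippet below evaluates \(\mathcal{S}\) for one illustrative configuration. The particular values of \(n\), \(d\), \(k\), and \(l\) are arbitrary choices used only to show the scale of the space.

    from math import factorial

    def mapping_space(n, d, k, l):
        """Approximate mapping search-space size from the expression above:
        S = (n^d * d! * d!/(d-k)!)^l."""
        per_level = (n ** d) * factorial(d) * factorial(d) // factorial(d - k)
        return per_level ** l

    ## Example values (chosen for illustration): 4 placement choices per level,
    ## 6 loop dimensions, 3 parallelized loops, 3 memory levels.
    print(f"{mapping_space(n=4, d=6, k=3, l=3):.3e}")  # on the order of 1e25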

    11.6 Mapping Optimization Strategies

    Efficiently mapping machine learning computations onto hardware is a complex challenge due to the vast number of possible configurations. As models grow in complexity, the number of potential mappings increases exponentially. Even for a single layer, there are thousands of ways to order computation loops, hundreds of parallelization strategies, and an exponentially growing number of memory placement choices. This combinatorial explosion makes exhaustive search impractical.

    To overcome this challenge, AI accelerators rely on structured mapping strategies that systematically balance computational efficiency, data locality, and parallel execution. Rather than evaluating every possible configuration, these approaches use a combination of heuristic, analytical, and machine learning-based techniques to find high-performance mappings efficiently.

    The key to effective mapping lies in understanding and applying a set of core techniques that optimize data movement, memory access, and computation. These building blocks of mapping strategies provide a structured foundation for efficient execution, which we explore in the next section.

    11.6.1 Building Blocks of Mapping Strategies

    To navigate the complexity of mapping decisions, a set of foundational techniques is leveraged that optimizes execution across data movement, memory access, and computation efficiency. These techniques provide the necessary structure for mapping strategies that maximize hardware performance while minimizing bottlenecks.

    Key techniques include data movement strategies, which determine where data is staged during computation in order to reduce redundant transfers, such as in weight stationary, output stationary, and input stationary approaches. Memory-aware tensor layouts also play an important role by influencing memory access patterns and cache efficiency through the organization of data in formats such as row-major or channel-major.

    Other strategies involve kernel fusion, a method that minimizes redundant memory writes by combining multiple operations into a single computational step. Tiling is employed as a technique that partitions large computations into smaller, memory-friendly blocks to improve cache efficiency and reduce memory bandwidth requirements. Finally, balancing computation and communication is essential for managing the trade-offs between parallel execution and memory access to achieve high throughput.

    Each of these building blocks plays a crucial role in structuring high-performance execution, forming the basis for both heuristic and model-driven optimization techniques. In the next section, we explore how these strategies are adapted to different types of AI models.

    Data Movement Patterns

    While computational mapping determines where and when operations occur, its success depends heavily on how efficiently data is accessed and transferred across the memory hierarchy. Unlike traditional computing workloads, which often exhibit structured and predictable memory access patterns, machine learning workloads present irregular access behaviors due to frequent retrieval of weights, activations, and intermediate values.

    Even when computational units are mapped efficiently, poor data movement strategies can severely degrade performance, leading to frequent memory stalls and underutilized hardware resources. If data cannot be supplied to processing elements at the required rate, computational units remain idle, increasing latency, memory traffic, and energy consumption (Chen et al. 2016).

    To illustrate the impact of data movement inefficiencies, consider a typical matrix multiplication operation, which forms the backbone of many machine learning models:

    import numpy as np

    ## Matrix multiplication where:
    ## weights: [512 x 256] - model parameters
    ## inputs:  [256 x 32]  - batch of activations
    ## Z:       [512 x 32]  - output activations
    weights = np.random.randn(512, 256)
    inputs = np.random.randn(256, 32)
    Z = np.zeros((512, 32))

    ## Computing each output element Z[i,j]:
    for i in range(512):
        for j in range(32):
            for k in range(256):
                Z[i, j] += weights[i, k] * inputs[k, j]

    This computation reveals several critical dataflow challenges.

    The first challenge is the number of memory accesses required. For each output \(Z[i, j]\), the computation must fetch an entire row of weights from the weight matrix and a full column of activations from the input matrix. Since the weight matrix contains 512 rows and the input matrix contains 32 columns, this results in repeated memory accesses that place a significant burden on memory bandwidth.
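
    A back-of-the-envelope calculation makes this pressure explicit. Assuming no on-chip reuse, so that every multiply-accumulate fetches one weight and one activation, the operand fetch count dwarfs the number of distinct values in the two matrices. The figures below are an illustrative estimate under that assumption, not a measurement of any real accelerator.

    ## Rough operand fetch count for the 512x256 @ 256x32 example,
    ## assuming no on-chip reuse (every multiply fetches both operands).
    M, K, N = 512, 256, 32

    macs = M * N * K                 # 4,194,304 multiply-accumulates
    naive_fetches = 2 * macs         # one weight + one activation per MAC
    unique_elems = M * K + K * N     # 131,072 + 8,192 distinct values

    print(f"operand fetches without reuse: {naive_fetches:,}")
    print(f"distinct operand values:       {unique_elems:,}")
    print(f"redundancy factor:             {naive_fetches / unique_elems:.0f}x")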

    The second challenge comes from weight reuse. The same weights are applied to multiple inputs, meaning that an ideal mapping strategy should maximize weight locality to avoid redundant memory fetches. Without proper reuse, the accelerator would waste bandwidth loading the same weights multiple times (Tianqi et al. 2018).

    Tianqi, Chen, et al. 2018. “TVM: An Automated End-to-End Optimizing Compiler for Deep Learning.” In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), 578–94.

    The third challenge involves the accumulation of intermediate results. Since each element in \(Z[i,j]\) requires contributions from 256 different weight-input pairs, partial sums must be stored and retrieved before the final value is computed. If these intermediate values are stored inefficiently, the system will require frequent memory accesses, further increasing bandwidth demands.

    A natural way to mitigate these challenges is to leverage SIMD and SIMT execution models, which allow multiple values to be fetched in parallel. However, even with these optimizations, data movement remains a bottleneck. The issue is not just how quickly data is retrieved but how often it must be moved and where it is placed within the memory hierarchy (Han et al. 2016).

    Han, Song, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A. Horowitz, and William J. Dally. 2016. “EIE: Efficient Inference Engine on Compressed Deep Neural Network.” In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), 243–54. IEEE. https://doi.org/10.1109/isca.2016.30.

    To address these constraints, accelerators implement dataflow strategies that determine which data remains fixed in memory and which data is streamed dynamically. These strategies aim to maximize reuse of frequently accessed data, thereby reducing the need for redundant memory fetches. The effectiveness of a given dataflow strategy depends on the specific workload—for example, deep convolutional networks benefit from keeping weights stationary, while fully connected layers may require a different approach.

    Weight Stationary

    The Weight Stationary strategy keeps weights fixed in local memory, while input activations and partial sums are streamed through the system. This approach is particularly beneficial in CNNs and matrix multiplications, where the same set of weights is applied across multiple inputs. By ensuring weights remain stationary, this method reduces redundant memory fetches, which helps alleviate bandwidth bottlenecks and improves energy efficiency.

    A key advantage of the weight stationary approach is that it maximizes weight reuse, reducing the frequency of memory accesses to external storage. Since weight parameters are often shared across multiple computations, keeping them in local memory eliminates unnecessary data movement, lowering the overall energy cost of computation. This makes it particularly effective for architectures where weights represent the dominant memory overhead, such as systolic arrays and custom accelerators designed for machine learning.

    A simplified Weight Stationary implementation for matrix multiplication is illustrated below:

    ## Weight Stationary Matrix Multiplication
    ## - Weights remain fixed in local memory
    ## - Input activations stream through
    ## - Partial sums accumulate for final output
    
    for weight_block in weights:  # Load and keep weights stationary
        load_to_local(weight_block)  # Fixed in local storage
        for input_block in inputs:   # Stream inputs dynamically
            for output_block in outputs:  # Compute results
                output_block += compute(weight_block, input_block)
                # Reuse weights across inputs

    In weight stationary execution, weights are loaded once into local memory and remain fixed throughout the computation, while inputs are streamed dynamically, thereby reducing redundant memory accesses. At the same time, partial sums are accumulated in an efficient manner that minimizes unnecessary data movement, ensuring that the system maintains high throughput and energy efficiency.

    By keeping weights fixed in local storage, memory bandwidth requirements are significantly reduced, as weights do not need to be reloaded for each new computation. Instead, the system efficiently reuses the stored weights across multiple input activations, allowing for high throughput execution. This makes weight stationary dataflow highly effective for workloads with heavy weight reuse patterns, such as CNNs and matrix multiplications.

    However, while this strategy reduces weight-related memory traffic, it introduces trade-offs in input and output movement. Since inputs must be streamed dynamically while weights remain fixed, the efficiency of this approach depends on how well input activations can be delivered to the computational units without causing stalls. Additionally, partial sums—representing intermediate results—must be carefully accumulated to avoid excessive memory traffic. The total performance gain depends on the size of available on-chip memory, as storing larger weight matrices locally can become a constraint in models with millions or billions of parameters.

    The weight stationary strategy is well-suited for workloads where weights exhibit high reuse and memory bandwidth is a limiting factor. It is commonly employed in CNNs, systolic arrays, and matrix multiplication kernels, where structured weight reuse leads to significant performance improvements. However, for models where input or output reuse is more critical, alternative dataflow strategies, such as output stationary or input stationary, may provide better trade-offs.
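
    The pseudocode above can be grounded in a small, runnable NumPy model of weight stationary execution. The tile size and the use of array slices to stand in for on-chip buffers are illustrative assumptions; the point is simply that each weight block is loaded once and reused against the entire batch of input columns before the next block is brought in.

    import numpy as np

    ## Weight stationary, modeled in NumPy: each weight block is loaded once
    ## (standing in for on-chip storage) and reused across all input columns.
    M, K, N, TILE = 512, 256, 32, 64

    weights = np.random.randn(M, K).astype(np.float32)
    inputs = np.random.randn(K, N).astype(np.float32)
    Z = np.zeros((M, N), dtype=np.float32)

    for k0 in range(0, K, TILE):
        for m0 in range(0, M, TILE):
            w_block = weights[m0:m0 + TILE, k0:k0 + TILE]  # "stationary" weights
            # Stream the matching slice of inputs past the resident weight block
            # and accumulate partial sums for the corresponding output rows.
            Z[m0:m0 + TILE, :] += w_block @ inputs[k0:k0 + TILE, :]

    assert np.allclose(Z, weights @ inputs, atol=1e-3)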

    Output Stationary

    The Output Stationary strategy keeps partial sums fixed in local memory, while weights and input activations stream through the system. This approach is particularly effective for fully connected layers, systolic arrays, and other operations where an output element accumulates contributions from multiple weight-input pairs. By keeping partial sums stationary, this method reduces redundant memory writes, minimizing bandwidth consumption and improving energy efficiency (Chen et al. 2016).

    Chen, Yu-Hsin, Tushar Krishna, Joel S. Emer, and Vivienne Sze. 2016. “Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks.” IEEE Journal of Solid-State Circuits 51 (1): 186–98. https://doi.org/10.1109/JSSC.2015.2488709.

    A key advantage of the output stationary approach is that it optimizes accumulation efficiency, ensuring that each output element is computed as efficiently as possible before being written to memory. Unlike Weight Stationary, which prioritizes weight reuse, Output Stationary execution is designed to minimize memory bandwidth overhead caused by frequent writes of intermediate results. This makes it well-suited for workloads where accumulation dominates the computational pattern, such as fully connected layers and matrix multiplications in transformer-based models.

    A simplified Output Stationary implementation for matrix multiplication is illustrated below:

    ## Output Stationary Matrix Multiplication
    ## - Partial sums remain in local memory
    ## - Weights and input activations stream through dynamically
    ## - Final outputs are written only once
    
    for output_block in outputs:   # Keep partial sums stationary
        accumulator = 0             # Initialize accumulation buffer
        for weight_block, input_block in zip(weights, inputs):
            accumulator += compute(weight_block, input_block)
            # Accumulate partial sums
        store_output(accumulator)  # Single write to memory

    This implementation follows the core principles of output stationary execution:

    • Partial sums are kept in local memory throughout the computation.
    • Weights and inputs are streamed dynamically, ensuring that intermediate results remain locally accessible.
    • Final outputs are written back to memory only once, reducing unnecessary memory traffic.

    By accumulating partial sums locally, this approach eliminates excessive memory writes, improving overall system efficiency. In architectures such as systolic arrays, where computation progresses through a grid of processing elements, keeping partial sums stationary aligns naturally with structured accumulation workflows, reducing synchronization overhead.

    However, while Output Stationary reduces memory write traffic, it introduces trade-offs in weight and input movement. Since weights and activations must be streamed dynamically, the efficiency of this approach depends on how well data can be fed into the system without causing stalls. Additionally, parallel implementations must carefully synchronize updates to partial sums, especially in architectures where multiple processing elements contribute to the same output.

    The Output Stationary strategy is most effective for workloads where accumulation is the dominant operation and minimizing intermediate memory writes is critical. It is commonly employed in fully connected layers, attention mechanisms, and systolic arrays, where structured accumulation leads to significant performance improvements. However, for models where input reuse is more critical, alternative dataflow strategies, such as Input Stationary, may provide better trade-offs.

    Input Stationary

    The Input Stationary strategy keeps input activations fixed in local memory, while weights and partial sums stream through the system. This approach is particularly effective for batch processing, transformer models, and sequence-based architectures, where input activations are reused across multiple computations. By ensuring that activations remain in local memory, this method reduces redundant input fetches, improving data locality and minimizing memory traffic.

    A key advantage of the Input Stationary approach is that it maximizes input reuse, reducing the frequency of memory accesses for activations. Since many models, especially those in natural language processing (NLP) and recommendation systems, process the same input data across multiple computations, keeping inputs stationary eliminates unnecessary memory transfers, thereby lowering energy consumption. This strategy is particularly useful when dealing with large batch sizes, where a single batch of input activations contributes to multiple weight transformations.

    A simplified Input Stationary implementation for matrix multiplication is illustrated below:

    ## Input Stationary Matrix Multiplication
    ## - Input activations remain in local memory
    ## - Weights stream through dynamically
    ## - Partial sums accumulate and are written out
    
    for input_block in inputs:   # Keep input activations stationary
        load_to_local(input_block)  # Fixed in local storage
        for weight_block in weights:   # Stream weights dynamically
            for output_block in outputs:  # Compute results
                output_block += compute(weight_block, input_block)
                # Reuse inputs across weights

    This implementation follows the core principles of input stationary execution:

    • Input activations are loaded into local memory and remain fixed during computation.
    • Weights are streamed dynamically, ensuring efficient application across multiple inputs.
    • Partial sums are accumulated and written out, optimizing memory bandwidth usage.

    By keeping input activations stationary, this strategy minimizes redundant memory accesses to input data, significantly reducing external memory bandwidth requirements. This is particularly beneficial in transformer architectures, where each token in an input sequence is used across multiple attention heads and layers. Additionally, in batch processing scenarios, keeping input activations in local memory improves data locality, making it well-suited for fully connected layers and matrix multiplications.

    However, while Input Stationary reduces memory traffic for activations, it introduces trade-offs in weight and output movement. Since weights must be streamed dynamically while inputs remain fixed, the efficiency of this approach depends on how well weights can be delivered to the computational units without causing stalls. Additionally, partial sums must be accumulated efficiently before being written back to memory, which may require additional buffering mechanisms.

    The Input Stationary strategy is most effective for workloads where input activations exhibit high reuse, and memory bandwidth for inputs is a critical constraint. It is commonly employed in transformers, recurrent networks, and batch processing workloads, where structured input reuse leads to significant performance improvements. However, for models where output accumulation is more critical, alternative dataflow strategies, such as Output Stationary, may provide better trade-offs.

    Memory-Aware Tensor Layouts

    Efficient execution of machine learning workloads depends not only on how data moves (dataflow strategies) but also on how data is stored and accessed in memory. Tensor layouts—the way multidimensional data is arranged in memory—can significantly impact memory access efficiency, cache performance, and computational throughput. Poorly chosen layouts can lead to excessive memory stalls, inefficient cache usage, and increased data movement costs.

    In AI accelerators, tensor layout optimization is particularly important because data is frequently accessed in patterns dictated by the underlying hardware architecture. Choosing the right layout ensures that memory accesses align with hardware-friendly access patterns, minimizing overhead from costly memory transactions (N. Corporation 2021).

    Corporation, NVIDIA. 2021. NVIDIA cuDNN: GPU Accelerated Deep Learning. https://developer.nvidia.com/cudnn.
    He, Xuzhen. 2023a. “Accelerated Linear Algebra Compiler for Computationally Efficient Numerical Models: Success and Potential Area of Improvement.” PLOS ONE 18 (2): e0282265. https://doi.org/10.1371/journal.pone.0282265.

    While developers can sometimes manually specify tensor layouts, the choice is often determined automatically by machine learning frameworks (e.g., TensorFlow, PyTorch, JAX), compilers, or AI accelerator runtimes. Low-level optimization tools such as cuDNN (for NVIDIA GPUs), XLA (for TPUs), and MLIR (for custom accelerators) may rearrange tensor layouts dynamically to optimize performance (He 2023a). In high-level frameworks, layout transformations are typically applied transparently, but developers working with custom kernels or low-level libraries (e.g., CUDA, Metal, or OpenCL) may have direct control over tensor format selection.

    For example, in PyTorch, users can manually modify layouts using tensor.permute() or tensor.contiguous() to ensure efficient memory access (Paszke et al. 2019). In TensorFlow, layout optimizations are often applied internally by the XLA compiler, choosing between NHWC (row-major) and NCHW (channel-major) based on the target hardware (Brain 2022). Hardware-aware machine learning libraries, such as cuDNN for GPUs or OneDNN for CPUs, enforce specific memory layouts to maximize cache locality and SIMD efficiency. Ultimately, while developers may have some control over tensor layout selection, most layout decisions are driven by the compiler and runtime system, ensuring that tensors are stored in memory in a way that best suits the underlying hardware.

    Paszke, Adam, et al. 2019. “PyTorch: An Imperative Style, High-Performance Deep Learning Library.” In Advances in Neural Information Processing Systems (NeurIPS).
    Brain, Google. 2022. TensorFlow Documentation. https://www.tensorflow.org/.
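
    The following PyTorch snippet illustrates the layout manipulation described above. The tensor shape is an arbitrary example; permute() only changes the tensor's strides, and contiguous() is what actually reorders the underlying memory into the new layout.

    import torch

    ## NHWC -> NCHW layout change in PyTorch (tensor sizes are arbitrary).
    x_nhwc = torch.randn(8, 224, 224, 3)           # batch, height, width, channels

    x_nchw = x_nhwc.permute(0, 3, 1, 2)            # view with reordered dimensions
    print(x_nchw.is_contiguous())                  # False: strides changed, data did not

    x_nchw = x_nchw.contiguous()                   # physically reorder memory
    print(x_nchw.shape, x_nchw.is_contiguous())    # torch.Size([8, 3, 224, 224]) True
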
    Row-Major Layout

    Row-major layout refers to the way multi-dimensional tensors are stored in memory, where elements are arranged row by row, ensuring that all values in a given row are placed contiguously before moving to the next row. This storage format is widely used in general-purpose CPUs and some machine learning frameworks because it aligns naturally with sequential memory access patterns, making it more cache-efficient for certain types of operations (I. Corporation 2021).

    Corporation, Intel. 2021. oneDNN: Intel’s Deep Learning Neural Network Library. https://github.com/oneapi-src/oneDNN.

    To understand how row-major layout works, consider a single RGB image represented as a tensor of shape (Height, Width, Channels). If the image has a size of \(3\times 3\) pixels with 3 channels (RGB), the corresponding tensor is structured as (3, 3, 3). The values are stored in memory as follows: \[\begin{gather*} I(0,0,0), I(0,0,1), I(0,0,2), I(0,1,0), I(0,1,1), \\ I(0,1,2), I(0,2,0), I(0,2,1), I(0,2,2), \ldots \end{gather*}\]

    Each row is stored contiguously, meaning all pixel values in the first row are placed sequentially in memory before moving on to the second row. This ordering is advantageous because CPUs and cache hierarchies are optimized for sequential memory access. When data is accessed in a row-wise fashion, such as when applying element-wise operations like activation functions or basic arithmetic transformations, memory fetches are efficient, and cache utilization is maximized (Sodani 2015).

    Sodani, Avinash. 2015. “Knights Landing (KNL): 2nd Generation Intel® Xeon Phi Processor.” In 2015 IEEE Hot Chips 27 Symposium (HCS), 1–24. IEEE. https://doi.org/10.1109/hotchips.2015.7477467.
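
    A small NumPy example makes this concrete. For the \(3\times 3\times 3\) image tensor above stored in row-major (C) order, the strides show that neighboring channel values sit one element apart, pixels within a row three elements apart, and rows nine elements apart, so row-wise traversal touches contiguous memory.

    import numpy as np

    ## Row-major (C-order) storage of a 3x3 RGB image, one byte per value.
    img = np.arange(27, dtype=np.uint8).reshape(3, 3, 3)  # (H, W, C)

    print(img.strides)      # (9, 3, 1): +1 byte per channel, +3 per pixel, +9 per row
    print(img.ravel()[:9])  # the first row's 9 values are contiguous in memory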

    The efficiency of row-major storage becomes particularly evident in CPU-based machine learning workloads, where operations such as batch normalization, matrix multiplications, and element-wise arithmetic frequently process rows of data sequentially. Since modern CPUs employ cache prefetching mechanisms, a row-major layout allows the next required data values to be preloaded into cache ahead of execution, reducing memory latency and improving overall computational throughput.

    However, row-major layout can introduce inefficiencies when performing operations that require accessing data across channels rather than across rows. Consider a convolutional layer that applies a filter across multiple channels of an input image. Since channel values are interleaved in row-major storage, the convolution operation must jump across memory locations to fetch all the necessary channel values for a given pixel. These strided memory accesses can be costly on hardware architectures that rely on vectorized execution and coalesced memory access, such as GPUs and TPUs.

    Despite these limitations, row-major layout remains a dominant storage format in CPU-based machine learning frameworks. TensorFlow, for instance, defaults to the NHWC (row-major) format on CPUs, ensuring that cache locality is optimized for sequential processing. However, when targeting GPUs, frameworks often rearrange data dynamically to take advantage of more efficient memory layouts, such as channel-major storage, which aligns better with parallelized computation.

    Channel-Major Layout

    In contrast to row-major layout, channel-major layout arranges data in memory such that all values for a given channel are stored together before moving to the next channel. This format is particularly beneficial for GPUs, TPUs, and other AI accelerators, where vectorized operations and memory coalescing significantly impact computational efficiency.

    To understand how channel-major layout works, consider the same RGB image tensor of size (Height, Width, Channels) = (3, 3, 3). Instead of storing pixel values row by row, the data is structured channel-first in memory as follows: \[\begin{gather*} I(0,0,0), I(1,0,0), I(2,0,0), I(0,1,0), I(1,1,0), I(2,1,0), \ldots, \\ I(0,0,1), I(1,0,1), I(2,0,1), \ldots, I(0,0,2), I(1,0,2), I(2,0,2), \ldots \end{gather*}\]

    In this format, all red channel values for the entire image are stored first, followed by all green values, and then all blue values. This ordering allows hardware accelerators to efficiently load and process data across channels in parallel, which is crucial for convolution operations and SIMD (Single Instruction, Multiple Data) execution models (Chetlur et al. 2014).

    Chetlur, Sharan, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. 2014. “cuDNN: Efficient Primitives for Deep Learning.” arXiv Preprint arXiv:1410.0759, October. http://arxiv.org/abs/1410.0759v3.

    The advantage of channel-major layout becomes clear when performing convolutions in machine learning models. Convolutional layers process images by applying a shared set of filters across all channels. When the data is stored in a channel-major format, a convolution kernel can load an entire channel efficiently, reducing the number of scattered memory fetches. This reduces memory latency, improves throughput, and enhances data locality for matrix multiplications, which are fundamental to machine learning workloads.

    Because GPUs and TPUs rely on memory coalescing—a technique where consecutive threads fetch contiguous memory addresses—channel-major layout aligns naturally with the way these processors execute parallel computations. For example, in NVIDIA GPUs, each thread in a warp (a group of threads executed simultaneously) processes different elements of the same channel, ensuring that memory accesses are efficient and reducing the likelihood of strided memory accesses, which can degrade performance.

    Despite its advantages in machine learning accelerators, channel-major layout can introduce inefficiencies when running on general-purpose CPUs. Since CPUs optimize for sequential memory access, storing all values for a single channel before moving to the next disrupts cache locality for row-wise operations. This is why many machine learning frameworks (e.g., TensorFlow, PyTorch) default to row-major (NHWC) on CPUs and channel-major (NCHW) on GPUs—optimizing for the strengths of each hardware type.

    Modern AI frameworks and compilers often transform tensor layouts dynamically depending on the execution environment. For instance, TensorFlow and PyTorch automatically switch between NHWC and NCHW based on whether a model is running on a CPU, GPU, or TPU, ensuring that the memory layout aligns with the most efficient execution path.

    Row-Major versus Channel-Major Layouts

    Both row-major (NHWC) and channel-major (NCHW) layouts serve distinct purposes in machine learning workloads, with their efficiency largely determined by the hardware architecture, memory access patterns, and computational requirements. The choice of layout directly influences cache utilization, memory bandwidth efficiency, and processing throughput. Table 11.13 summarizes the differences between row-major (NHWC) and channel-major (NCHW) layouts in terms of performance trade-offs and hardware compatibility.

    Table 11.13: Comparison of row-major (NHWC) vs. channel-major (NCHW) layouts.
    | Feature | Row-Major (NHWC) | Channel-Major (NCHW) |
    |---------|------------------|----------------------|
    | Memory Storage Order | Pixels are stored row-by-row, channel interleaved | All values for a given channel are stored together first |
    | Best for | CPUs, element-wise operations | GPUs, TPUs, convolution operations |
    | Cache Efficiency | High cache locality for sequential row access | Optimized for memory coalescing across channels |
    | Convolution Performance | Requires strided memory accesses (inefficient on GPUs) | Efficient for GPU convolution kernels |
    | Memory Fetching | Good for operations that process rows sequentially | Optimized for SIMD execution across channels |
    | Default in Frameworks | Default on CPUs (e.g., TensorFlow NHWC) | Default on GPUs (e.g., cuDNN prefers NCHW) |

    The decision to use row-major (NHWC) or channel-major (NCHW) layouts is not always made manually by developers. Instead, machine learning frameworks and AI compilers often determine the optimal layout dynamically based on the target hardware and operation type. CPUs tend to favor NHWC due to cache-friendly sequential memory access, while GPUs perform better with NCHW, which reduces memory fetch overhead for machine learning computations.

    In practice, modern AI compilers such as TensorFlow’s XLA and PyTorch’s TorchScript perform automatic layout transformations, converting tensors between NHWC and NCHW as needed to optimize performance across different processing units. This ensures that machine learning models achieve the highest possible throughput without requiring developers to manually specify tensor layouts.

    Kernel Fusion

    Intermediate Memory Write

    Optimizing memory access is a fundamental challenge in AI acceleration. While AI models rely on high-throughput computation, their performance is often constrained by memory bandwidth and intermediate memory writes rather than pure arithmetic operations. Every time an operation produces an intermediate result that must be written to memory and later read back, execution stalls occur due to data movement overhead.

    To better understand why kernel fusion is necessary, consider a simple sequence of operations in a machine learning model. Many AI workloads, particularly those involving element-wise transformations, introduce unnecessary intermediate memory writes, leading to increased memory bandwidth consumption and reduced execution efficiency (NVIDIA Corporation 2017).

    NVIDIA Corporation. 2017. “GPU-Accelerated Machine Learning and Deep Learning.” Technical Report.

    In a naïve execution model, each operation is treated as a separate kernel, meaning that each intermediate result is written to memory, only to be read back for the next operation. The execution flow looks like this:

    import torch
    import torch.nn.functional as F
    
    ## Input tensor
    X = torch.randn(1024, 1024).cuda()
    
    ## Running statistics required by batch normalization in inference mode
    running_mean = torch.zeros(1024).cuda()
    running_var = torch.ones(1024).cuda()
    
    ## Step-by-step execution (naïve approach)
    X1 = torch.relu(X)                                # X' in the text; intermediate stored in memory
    X2 = F.batch_norm(X1, running_mean, running_var)  # X'' in the text; another intermediate stored
    Y  = 2.0 * X2 + 1.0                               # Final result

    Each operation produces an intermediate tensor that must be written to memory and retrieved for the next operation. On large tensors, this overhead of moving data can outweigh the computational cost of the operations (Shazeer et al. 2018). Table 11.14 illustrates the memory overhead in a naïve execution model. While only the final result \(Y\) is needed, storing multiple intermediate tensors creates unnecessary memory traffic and inefficient memory usage. This data movement bottleneck significantly impacts performance, making memory optimization crucial for AI accelerators.

    Shazeer, Noam, Youlong Cheng, Niki Parmar, Dustin Tran, Ashish Vaswani, Penporn Koanantakool, Peter Hawkins, et al. 2018. “Mesh-TensorFlow: Deep Learning for Supercomputers.” arXiv Preprint arXiv:1811.02084, November. http://arxiv.org/abs/1811.02084v1.
    Table 11.14: Memory footprint of a naïve execution model with intermediate tensor storage.
    | Tensor | Size (MB) for a 1024 \(\times\) 1024 float32 tensor |
    |--------|-----------------------------------------------------|
    | \(X\) (input) | 4 MB |
    | \(X'\) (after ReLU) | 4 MB |
    | \(X''\) (after batch normalization) | 4 MB |
    | \(Y\) (final result) | 4 MB |
    | Total memory | 16 MB |

    Even though only the final result \(Y\) is needed, the input \(X\) and the intermediate tensors \(X'\) and \(X''\) all occupy memory at the same time, and the two intermediates contribute nothing to the stored output. This excessive memory usage limits scalability and wastes memory bandwidth, particularly in AI accelerators where minimizing data movement is critical.
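
    The per-tensor figure in the table follows directly from the tensor shape, assuming 32-bit (4-byte) floating-point elements: \[ 1024 \times 1024 \times 4 \text{ bytes} = 4{,}194{,}304 \text{ bytes} \approx 4 \text{ MB} \]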

    Fusing Kernels for Efficient Memory Reuse

    Kernel fusion is a key optimization technique that aims to minimize intermediate memory writes, reducing the memory footprint and bandwidth consumption of machine learning workloads (Zhihao Jia, Zaharia, and Aiken 2018).

    Jia, Zhihao, Matei Zaharia, and Alex Aiken. 2018. “Beyond Data and Model Parallelism for Deep Neural Networks.” arXiv Preprint arXiv:1807.05358, July. http://arxiv.org/abs/1807.05358v1.

    Kernel fusion involves merging multiple computation steps into a single, optimized operation, eliminating the need for storing and reloading intermediate tensors. Instead of executing each layer or element-wise operation separately—where each step writes its output to memory before the next step begins—fusion enables direct data propagation between operations, keeping computations within high-speed registers or local memory.

    A common machine learning sequence might involve applying a nonlinear activation function (e.g., ReLU), followed by batch normalization, and then scaling the values for input to the next layer. In a naïve implementation, each of these steps generates an intermediate tensor, which is written to memory, read back, and then modified again: \[\begin{gather*} X' = \text{ReLU}(X) \\ X'' = \text{BatchNorm}(X') \\ Y = \alpha \cdot X'' + \beta \end{gather*}\]

    With kernel fusion, these operations are combined into a single computation step, allowing the entire transformation to occur without generating unnecessary intermediate tensors: \[ Y = \alpha \cdot \text{BatchNorm}\big(\text{ReLU}(X)\big) + \beta \]
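
    In practice, such fusion is usually requested from a compiler rather than written by hand. The sketch below assumes PyTorch 2.x and its torch.compile interface; the function name is ours, and the decision of what actually gets fused is left to the backend.

    import torch
    import torch.nn.functional as F
    
    @torch.compile  # the backend may fuse this element-wise chain into one kernel
    def fused_block(x, running_mean, running_var, alpha=2.0, beta=1.0):
        x = torch.relu(x)
        x = F.batch_norm(x, running_mean, running_var)
        return alpha * x + beta
    
    X = torch.randn(1024, 1024).cuda()
    Y = fused_block(X, torch.zeros(1024).cuda(), torch.ones(1024).cuda())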

    Table 11.15 highlights the impact of operation fusion on memory efficiency. By keeping intermediate results in registers or local memory rather than writing them to main memory, fusion significantly reduces memory traffic. This optimization is especially beneficial on highly parallel architectures like GPUs and TPUs, where minimizing memory accesses translates directly into improved execution throughput. Compared to the naïve execution model, fused execution eliminates the need for storing intermediate tensors, dramatically lowering the total memory footprint and improving overall efficiency.

    Table 11.15: Reduction in memory usage through operation fusion.
    | Execution Model | Intermediate Tensors Stored | Total Memory Usage |
    |-----------------|-----------------------------|--------------------|
    | Naïve Execution | \(X'\), \(X''\) | 16 MB |
    | Fused Execution | None | 4 MB |

    Kernel fusion reduces total memory consumption from 16 MB to 4 MB, eliminating redundant memory writes while improving execution efficiency.

    Performance Benefits and Hardware Constraints

    Kernel fusion brings several key advantages that enhance memory efficiency and computation throughput. By reducing memory accesses, fused kernels ensure that intermediate values stay within registers instead of being repeatedly written to and read from memory. This significantly lowers memory traffic, which is one of the primary bottlenecks in machine learning workloads. GPUs and TPUs, in particular, benefit from kernel fusion because high-bandwidth memory is a scarce resource, and reducing memory transactions leads to better utilization of compute units (Qi, Kantarci, and Liu 2017).

    However, not all operations can be fused. Element-wise operations, such as ReLU, batch normalization, and simple arithmetic transformations, are ideal candidates for fusion since their computations depend only on single elements from the input tensor. In contrast, operations with complex data dependencies, such as matrix multiplications and convolutions, involve global data movement, making direct fusion impractical. These operations require values from multiple input elements to compute a single output, which prevents them from being executed as a single fused kernel.

    Another major consideration is register pressure. Fusing multiple operations means all temporary values must be kept in registers rather than memory. While this eliminates redundant memory writes, it also increases register demand. If a fused kernel exceeds the available registers per thread, the system must spill excess values into shared memory, introducing additional latency and potentially negating the benefits of fusion. On GPUs, where thread occupancy (the number of threads that can run in parallel) is limited by available registers, excessive fusion can reduce parallelism, leading to diminishing returns.

    Different AI accelerators and compilers handle fusion in distinct ways. NVIDIA GPUs, for example, favor warp-level parallelism, where element-wise fusion is straightforward. TPUs, on the other hand, prioritize systolic array execution, which is optimized for matrix-matrix operations rather than element-wise fusion (Qi, Kantarci, and Liu 2017). AI compilers such as XLA (TensorFlow), TorchScript (PyTorch), TensorRT (NVIDIA), and MLIR automatically detect fusion opportunities and apply heuristics to balance memory savings and execution efficiency (He 2023b).

    Qi, Xuan, Burak Kantarci, and Chen Liu. 2017. “GPU-Based Acceleration of SDN Controllers.” In Network as a Service for Next Generation Internet, 339–56. Institution of Engineering and Technology. https://doi.org/10.1049/pbte073e_ch14.
    He, Xuzhen. 2023b. “Accelerated Linear Algebra Compiler for Computationally Efficient Numerical Models: Success and Potential Area of Improvement.” PLOS ONE 18 (2): e0282265. https://doi.org/10.1371/journal.pone.0282265.

    Despite its advantages, fusion is not always beneficial. Some AI frameworks allow developers to disable fusion selectively, especially when debugging performance issues or making frequent model modifications. The decision to fuse operations must consider trade-offs between memory efficiency, register usage, and hardware execution constraints to ensure that fusion leads to tangible performance improvements.

    Tiling for Memory Efficiency

    While modern AI accelerators offer high computational throughput, their performance is often limited by memory bandwidth rather than raw processing power. If data cannot be supplied to processing units fast enough, execution stalls occur, leading to wasted cycles and inefficient hardware utilization.

    Tiling is a technique used to mitigate this issue by restructuring computations into smaller, memory-friendly subproblems. Instead of processing entire matrices or tensors at once—leading to excessive memory traffic—tiling partitions computations into smaller blocks (tiles) that fit within fast local memory (e.g., caches, shared memory, or registers) (Lam, Rothberg, and Wolf 1991). By doing so, tiling increases data reuse, minimizes memory fetches, and improves overall computational efficiency.

    Lam, Monica D., Edward E. Rothberg, and Michael E. Wolf. 1991. “The Cache Performance and Optimizations of Blocked Algorithms.” In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS-IV, 63–74. ACM Press. https://doi.org/10.1145/106972.106981.

    A classic example of inefficient memory access is matrix multiplication, which is widely used in AI models. Without tiling, the naïve approach results in repeated memory accesses for the same data, leading to unnecessary bandwidth consumption:

    ## Naïve Matrix Multiplication (No Tiling)
    for i in range(N):
        for j in range(N):
            for k in range(N):
                C[i, j] += A[i, k] * B[k, j]  # Repeatedly fetching
                                              # A[i, k] and B[k, j]

    Each iteration requires loading elements from matrices \(A\) and \(B\) multiple times from memory, causing excessive data movement. As the size of the matrices increases, the memory bottleneck worsens, limiting performance.

    Tiling addresses this problem by ensuring that smaller portions of matrices are loaded into fast memory, reused efficiently, and only written back to main memory when necessary. This technique is especially crucial in AI accelerators, where memory accesses dominate execution time.

    In the following sections, we will explore the fundamental principles of tiling, its different strategies, and the key trade-offs involved in selecting an effective tiling approach.

    Fundamentals of Tiling

    Tiling is based on a simple but powerful principle: instead of operating on an entire data structure at once, computations are divided into smaller tiles that fit within the available fast memory. By structuring execution around these tiles, data reuse is maximized, reducing redundant memory accesses and improving overall efficiency.

    Consider matrix multiplication, a key operation in machine learning workloads. The operation computes the output matrix \(C\) from two input matrices \(A\) and \(B\): \[ C = A \times B \] where each element \(C[i,j]\) is computed as: \[ C[i,j] = \sum_{k} A[i,k] \times B[k,j] \]

    A naïve implementation follows this formula directly:

    ## Naïve Matrix Multiplication (No Tiling)
    for i in range(N):
        for j in range(N):
            for k in range(N):
                C[i, j] += A[i, k] * B[k, j]  # Repeatedly fetching
                                              # A[i, k] and B[k, j]

    At first glance, this approach seems correct—it computes the desired result and follows the mathematical definition. However, the issue lies in how memory is accessed. Every time the innermost loop runs, it fetches an element from matrix \(A\) and matrix \(B\) from memory, performs a multiplication, and updates an element in matrix \(C\). Because matrices are large, the processor frequently reloads the same values from memory, even though they were just used in previous computations.

    This unnecessary data movement is expensive. Fetching values from main memory (DRAM) is hundreds of times slower than accessing values stored in on-chip cache or registers. If the same values must be reloaded multiple times instead of being stored in fast memory, execution slows down significantly.

    How Tiling Improves Performance

    Instead of computing one element at a time and constantly moving data in and out of slow memory, tiling processes submatrices (tiles) at a time, keeping frequently used values in fast memory. The idea is to divide the matrices into smaller blocks that fit within the processor’s cache or shared memory, ensuring that once a block is loaded, it is reused multiple times before moving to the next one.

    A tiled implementation of matrix multiplication looks like this:

    ## Tiled Matrix Multiplication
    TILE_SIZE = 32 # Choose a tile size based on hardware constraints
    
    for i in range(0, N, TILE_SIZE):
        for j in range(0, N, TILE_SIZE):
            for k in range(0, N, TILE_SIZE):
              # Compute the submatrix C[i:i+TILE_SIZE, j:j+TILE_SIZE]
                for ii in range(i, i + TILE_SIZE):
                    for jj in range(j, j + TILE_SIZE):
                        for kk in range(k, k + TILE_SIZE):
                            C[ii, jj] += A[ii, kk] * B[kk, jj]

    This restructuring significantly improves performance for three main reasons:

    1. Better Memory Reuse: Instead of fetching elements from \(A\) and \(B\) repeatedly from slow memory, this approach loads a small tile of data into fast memory, performs multiple computations using it, and only then moves on to the next tile. This minimizes redundant memory accesses.

    2. Reduced Memory Bandwidth Usage: Since each tile is used multiple times before being evicted, memory traffic is reduced. Instead of repeatedly accessing DRAM, most required data is available in L1/L2 cache or shared memory, leading to faster execution.

    3. Increased Compute Efficiency: Processors spend less time waiting for data and more time performing useful computations. In architectures like GPUs and TPUs, where thousands of parallel processing units operate simultaneously, tiling ensures that data is read and processed in a structured manner, avoiding unnecessary stalls.

    This technique is particularly effective in AI accelerators, where machine learning workloads consist of large matrix multiplications and tensor transformations. Without tiling, these workloads quickly become memory-bound, meaning performance is constrained by how fast data can be retrieved rather than by the raw computational power of the processor.

    Tiling Methods

    While the general principle of tiling remains the same—partitioning large computations into smaller subproblems to improve memory reuse—there are different ways to apply tiling based on the structure of the computation and hardware constraints. The two primary tiling strategies are spatial tiling and temporal tiling. These strategies optimize different aspects of computation and memory access, and in practice, they are often combined to achieve the best performance.

    Spatial Tiling

    Spatial tiling focuses on partitioning data structures into smaller blocks that fit within the fast memory of the processor. This approach ensures that each tile is fully processed before moving to the next, reducing redundant memory accesses. Spatial tiling is widely used in operations such as matrix multiplication, convolutions, and attention mechanisms in transformer models.

    Returning to our tiled matrix multiplication example, we can see spatial tiling in action:

    ## Tiled Matrix Multiplication (Spatial Tiling)
    TILE_SIZE = 32  # Tile size chosen based on available fast memory
    
    for i in range(0, N, TILE_SIZE):
        for j in range(0, N, TILE_SIZE):
            for k in range(0, N, TILE_SIZE):
                # Process a submatrix (tile) at a time
                for ii in range(i, i + TILE_SIZE):
                    for jj in range(j, j + TILE_SIZE):
                        for kk in range(k, k + TILE_SIZE):
                            C[ii, jj] += A[ii, kk] * B[kk, jj]

    In this implementation, each tile of \(A\) and \(B\) is loaded into cache or shared memory before processing, ensuring that the same data does not need to be fetched repeatedly from slower memory. The tile is fully used before moving to the next block, minimizing redundant memory accesses. Since data is accessed in a structured, localized way, cache efficiency improves significantly.

    Spatial tiling is particularly beneficial when dealing with large tensors that do not fit entirely in fast memory. By breaking them into smaller tiles, computations remain localized, avoiding excessive data movement between memory levels. This technique is widely used in AI accelerators where machine learning workloads involve large-scale tensor operations that require careful memory management to achieve high performance.

    Temporal Tiling

    While spatial tiling optimizes how data is partitioned, temporal tiling focuses on reorganizing the computation itself to improve data reuse over time. Many machine learning workloads involve operations where the same data is accessed repeatedly across multiple iterations. Without temporal tiling, this often results in redundant memory fetches, leading to inefficiencies. Temporal tiling, also known as loop blocking, restructures the computation to ensure that frequently used data stays in fast memory for as long as possible before moving on to the next computation.

    A classic example where temporal tiling is beneficial is convolutional operations, where the same set of weights is applied to multiple input regions. Without loop blocking, these weights might be loaded from memory multiple times for each computation. With temporal tiling, the computation is reordered so that the weights remain in fast memory across multiple inputs, reducing unnecessary memory fetches and improving overall efficiency.

    A simplified example of loop blocking in matrix multiplication is shown below:

    ## Matrix Multiplication with Temporal Tiling (Loop Blocking)
    for i in range(0, N, TILE_SIZE):
        for j in range(0, N, TILE_SIZE):
            for k in range(0, N, TILE_SIZE):
                # Load tile into fast memory before computation
                A_tile = A[i:i+TILE_SIZE, k:k+TILE_SIZE]
                B_tile = B[k:k+TILE_SIZE, j:j+TILE_SIZE]
    
                for ii in range(TILE_SIZE):
                    for jj in range(TILE_SIZE):
                        for kk in range(TILE_SIZE):
                            C[i+ii, j+jj] += A_tile[ii, kk] * B_tile[kk, jj]

    Temporal tiling improves performance by ensuring that the data loaded into fast memory is used multiple times before being evicted. In this implementation, small tiles of matrices \(A\) and \(B\) are explicitly loaded into temporary storage before performing computations, reducing memory fetch overhead. This restructuring allows the computation to process an entire tile before moving to the next, thereby reducing the number of times data must be loaded from slower memory.

    This technique is particularly useful in workloads where certain values are used repeatedly, such as convolutions, recurrent neural networks (RNNs), and self-attention mechanisms in transformers. By applying loop blocking, AI accelerators can significantly reduce memory stalls and improve execution throughput.

    Challenges and Trade-offs in Tiling

    While tiling significantly improves performance by optimizing memory reuse and reducing redundant memory accesses, it introduces several challenges and trade-offs. Selecting the right tile size is a critical decision, as it directly affects computational efficiency and memory bandwidth usage. If the tile size is too small, the benefits of tiling diminish, as memory fetches still dominate execution time. On the other hand, if the tile size is too large, it may exceed the available fast memory, causing cache thrashing and performance degradation.
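
    As a rough, back-of-the-envelope illustration of this sizing decision, suppose three float32 tiles (one each from \(A\), \(B\), and \(C\)) must fit in 32 KB of fast memory; both the capacity and the data type here are assumptions, not properties of any particular chip.

    ## Rough tile sizing under assumed constraints (32 KB fast memory, float32)
    CACHE_BYTES = 32 * 1024
    BYTES_PER_ELEM = 4
    
    ## Three TILE_SIZE x TILE_SIZE tiles must fit:
    ## 3 * TILE_SIZE**2 * BYTES_PER_ELEM <= CACHE_BYTES
    max_tile = int((CACHE_BYTES / (3 * BYTES_PER_ELEM)) ** 0.5)
    print(max_tile)  # 52, so a power of two such as 32 is a comfortable choice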

    Load balancing is another key concern. In architectures such as GPUs and TPUs, computations are executed in parallel across thousands of processing units. If tiles are not evenly distributed, some units may remain idle while others are overloaded, leading to suboptimal utilization of computational resources. Effective tile scheduling ensures that parallel execution remains balanced and efficient.

    Data movement overhead is also an important consideration. Although tiling reduces the number of slow memory accesses, transferring tiles between different levels of memory still incurs a cost. This is especially relevant in hierarchical memory systems, where accessing data from cache is much faster than accessing it from DRAM. Efficient memory prefetching and scheduling strategies are required to minimize latency and ensure that data is available when needed.

    Beyond spatial and temporal tiling, hybrid approaches combine elements of both strategies to achieve optimal performance. Hybrid tiling adapts to workload-specific constraints by dynamically adjusting tile sizes or reordering computations based on real-time execution conditions. For example, some AI accelerators use spatial tiling for matrix multiplications while employing temporal tiling for weight reuse in convolutional layers.

    In addition to tiling, there are other methods for optimizing memory usage and computational efficiency. Techniques such as register blocking, double buffering, and hierarchical tiling extend the basic tiling principles to further optimize execution. AI compilers and runtime systems, such as TensorFlow XLA, TVM, and MLIR, automatically select tiling strategies based on hardware constraints, allowing for fine-tuned performance optimization without manual intervention.

    Table 11.16 provides a comparative overview of spatial, temporal, and hybrid tiling approaches, highlighting their respective benefits and trade-offs.

    Table 11.16: Comparative analysis of spatial, temporal, and hybrid tiling strategies.
    | Aspect | Spatial Tiling (Data Tiling) | Temporal Tiling (Loop Blocking) | Hybrid Tiling |
    |--------|------------------------------|---------------------------------|---------------|
    | Primary Goal | Reduce memory accesses by keeping data in fast memory longer | Increase data reuse across loop iterations | Adapt dynamically to workload constraints |
    | Optimization Focus | Partitioning data structures into smaller, memory-friendly blocks | Reordering computations to maximize reuse before eviction | Balancing spatial and temporal reuse strategies |
    | Memory Usage | Improves cache locality and reduces DRAM access | Keeps frequently used data in fast memory for multiple iterations | Minimizes data movement while ensuring high reuse |
    | Common Use Cases | Matrix multiplications, CNNs, self-attention in transformers | Convolutions, recurrent neural networks (RNNs), iterative computations | AI accelerators with hierarchical memory, mixed workloads |
    | Performance Gains | Reduced memory bandwidth requirements, better cache utilization | Lower memory fetch latency, improved data locality | Maximized efficiency across multiple hardware types |
    | Challenges | Requires careful tile size selection, inefficient for workloads with minimal spatial reuse | Can increase register pressure, requires loop restructuring | Complexity in tuning tile size and execution order dynamically |
    | Best When | Data is large and needs to be partitioned for efficient processing | The same data is accessed multiple times across iterations | Both data partitioning and iteration-based reuse are important |

    As machine learning models continue to grow in size and complexity, tiling remains a critical tool for improving hardware efficiency, ensuring that AI accelerators operate at their full potential. While manual tiling strategies can provide substantial benefits, modern compilers and hardware-aware optimization techniques further enhance performance by automatically selecting the most effective tiling strategies for a given workload.

    11.6.2 Applying Mapping Strategies

    While the foundational mapping techniques we discussed apply broadly, their effectiveness varies based on the computational structure, data access patterns, and parallelization opportunities of different neural network architectures. Each architecture imposes distinct constraints on data movement, memory hierarchy, and computation scheduling, requiring tailored mapping strategies to optimize performance.

    A structured approach to mapping is essential to address the combinatorial explosion of choices that arise when assigning computations to AI accelerators. Rather than treating each model as a separate optimization problem, we recognize that the same fundamental principles apply across different architectures—only their priority shifts based on workload characteristics. The goal is to systematically select and apply mapping strategies that maximize efficiency for different types of machine learning models.

    To demonstrate these principles, we examine three representative AI workloads, each characterized by distinct computational demands. CNNs benefit from spatial data reuse, making weight-stationary execution and the application of tiling techniques especially effective. In contrast, Transformers are inherently memory-bound and rely on strategies such as efficient KV-cache management, fused attention mechanisms, and highly parallel execution to mitigate memory traffic. MLPs, which involve substantial matrix multiplication operations, demand the use of structured tiling, optimized weight layouts, and memory-aware execution to enhance overall performance.

    Despite their differences, each of these models follows a common set of mapping principles, with variations in how optimizations are prioritized. The following table provides a structured mapping between different optimization strategies and their suitability for CNNs, Transformers, and MLPs. This table serves as a roadmap for selecting appropriate mapping strategies for different machine learning workloads.

    | Optimization Technique | CNNs | Transformers | MLPs | Rationale |
    |------------------------|------|--------------|------|-----------|
    | Dataflow Strategy | Weight Stationary | Activation Stationary | Weight Stationary | CNNs reuse filters across spatial locations; Transformers reuse activations (KV-cache); MLPs reuse weights across batches. |
    | Memory-Aware Tensor Layouts | NCHW (Channel-Major) | NHWC (Row-Major) | NHWC | CNNs favor channel-major for convolution efficiency; Transformers and MLPs prioritize row-major for fast memory access. |
    | Kernel Fusion | Convolution + Activation | Fused Attention | GEMM Fusion | CNNs optimize convolution + activation fusion; Transformers fuse attention mechanisms; MLPs benefit from fused matrix multiplications. |
    | Tiling for Memory Efficiency | Spatial Tiling | Temporal Tiling | Blocked Tiling | CNNs tile along spatial dimensions; Transformers use loop blocking to improve sequence memory efficiency; MLPs use blocked tiling for large matrix multiplications. |

    This table highlights that each machine learning model benefits from a different combination of optimization techniques, reinforcing the importance of tailoring execution strategies to the computational and memory characteristics of the workload.

    In the following sections, we explore how these optimizations apply to each network type, explaining how CNNs, Transformers, and MLPs leverage specific mapping strategies to improve execution efficiency and hardware utilization.

    Convolutional Neural Networks

    CNNs are characterized by their structured spatial computations, where small filters (or kernels) are repeatedly applied across an input feature map. This structured weight reuse makes weight stationary execution the most effective strategy for CNNs. Keeping filter weights in fast memory while streaming activations ensures that weights do not need to be repeatedly fetched from slower external memory, significantly reducing memory bandwidth demands. Since each weight is applied to multiple spatial locations, weight stationary execution maximizes arithmetic intensity and minimizes redundant memory transfers.

    Memory-aware tensor layouts also play a critical role in CNN execution. Convolution operations benefit from a channel-major memory format, often represented as NCHW (batch, channels, height, width). This layout aligns with the access patterns of convolutions, enabling efficient memory coalescing on accelerators such as GPUs and TPUs. By storing data in a format that optimizes cache locality, accelerators can fetch contiguous memory blocks efficiently, reducing latency and improving throughput.

    Kernel fusion is another important optimization for CNNs. In a typical machine learning pipeline, convolution operations are often followed by activation functions such as ReLU and batch normalization. Instead of treating these operations as separate computational steps, fusing them into a single kernel reduces intermediate memory writes and improves execution efficiency. This optimization minimizes memory bandwidth pressure by keeping intermediate values in registers rather than writing them to memory and fetching them back in subsequent steps.
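
    As a sketch of this idea, PyTorch exposes a module-fusion pass (originally built for its quantization workflow) that folds a convolution, batch normalization, and ReLU sequence into a single operator for inference; the toy model and module names below are illustrative.

    import torch
    import torch.nn as nn
    
    ## Toy Conv -> BatchNorm -> ReLU block; fusion targets inference (eval mode)
    model = nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3, padding=1),
        nn.BatchNorm2d(16),
        nn.ReLU(),
    ).eval()
    
    ## Fold the three modules ("0", "1", "2") into a single fused operator
    fused = torch.ao.quantization.fuse_modules(model, [["0", "1", "2"]])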

    Given the size of input images and feature maps, tiling is necessary to ensure that computations fit within fast memory hierarchies. Spatial tiling, where input feature maps are processed in smaller subregions, allows for efficient utilization of on-chip memory while avoiding excessive off-chip memory transfers. This technique ensures that input activations, weights, and intermediate outputs remain within high-speed caches or shared memory as long as possible, reducing memory stalls and improving overall performance.

    Together, these optimizations ensure that CNNs make efficient use of available compute resources by maximizing weight reuse, optimizing memory access patterns, reducing redundant memory writes, and structuring computation to fit within fast memory constraints.

    Transformer Architectures

    Unlike CNNs, which rely on structured spatial computations, Transformers process variable-length sequences and rely heavily on attention mechanisms. The primary computational bottleneck in Transformers is memory bandwidth, as attention mechanisms require frequent access to stored key-value pairs across multiple query vectors. Given this access pattern, activation stationary execution is the most effective strategy. By keeping key-value activations in fast memory and streaming query vectors dynamically, activation reuse is maximized while minimizing redundant memory fetches. This approach is critical in reducing bandwidth overhead, especially in long-sequence tasks such as natural language processing.
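
    The snippet below sketches this activation-stationary pattern with a simple key-value cache: cached keys and values stay resident while each new query token is streamed in. The shapes, function name, and single-token update are illustrative assumptions rather than any framework's API.

    import torch
    
    HEAD_DIM = 64
    
    def attend_with_cache(q_new, k_new, v_new, k_cache, v_cache):
        ## Append the new token's key/value; earlier entries are reused, not recomputed
        k_cache = torch.cat([k_cache, k_new], dim=2)
        v_cache = torch.cat([v_cache, v_new], dim=2)
        ## The streamed query attends over all cached (stationary) keys and values
        scores = q_new @ k_cache.transpose(-2, -1) / HEAD_DIM ** 0.5
        out = torch.softmax(scores, dim=-1) @ v_cache
        return out, k_cache, v_cache
    
    ## One decoding step: tensors are (batch, heads, tokens, head_dim)
    q, k, v = (torch.randn(1, 8, 1, HEAD_DIM) for _ in range(3))
    k_cache = v_cache = torch.zeros(1, 8, 0, HEAD_DIM)
    out, k_cache, v_cache = attend_with_cache(q, k, v, k_cache, v_cache)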

    Memory layout optimization is equally important for Transformers. Unlike CNNs, which benefit from channel-major layouts, Transformers require efficient access to sequences of activations, making a row-major format (NHWC) the preferred choice. This layout ensures that activations are accessed contiguously in memory, reducing cache misses and improving memory coalescing for matrix multiplications.

    Kernel fusion plays a key role in optimizing Transformer execution. In self-attention, multiple computational steps—including query-key dot products, softmax normalization, and weighted summation—can be fused into a single operation. Fused attention kernels eliminate intermediate memory writes by computing attention scores and performing weighted summations within a single execution step. This optimization significantly reduces memory traffic, particularly for large batch sizes and long sequences.
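
    PyTorch 2.x, for example, exposes such a fused primitive as scaled_dot_product_attention, which performs the dot products, softmax, and weighted summation in one call; the tensor shapes below are illustrative, and whether the full attention matrix is materialized depends on the backend the runtime selects.

    import torch
    import torch.nn.functional as F
    
    ## (batch, heads, sequence length, head dim); illustrative shapes
    q = torch.randn(1, 8, 128, 64, device="cuda")
    k = torch.randn(1, 8, 128, 64, device="cuda")
    v = torch.randn(1, 8, 128, 64, device="cuda")
    
    ## A single fused call replaces separate matmul, softmax, and matmul kernels
    out = F.scaled_dot_product_attention(q, k, v)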

    Due to the nature of sequence processing, tiling must be adapted to improve memory efficiency. Instead of spatial tiling, which is effective for CNNs, Transformers benefit from temporal tiling, where computations are structured to process sequence blocks efficiently. This method ensures that activations are loaded into fast memory in manageable chunks, reducing excessive memory transfers. Temporal tiling is particularly beneficial for long-sequence models, where the memory footprint of key-value activations grows significantly. By tiling sequences into smaller segments, memory locality is improved, enabling efficient cache utilization and reducing bandwidth pressure.

    These optimizations collectively address the primary bottlenecks in Transformer models by prioritizing activation reuse, structuring memory layouts for efficient batched computations, fusing attention operations to reduce intermediate memory writes, and employing tiling techniques suited to sequence-based processing.

    Multi-Layer Perceptrons

    MLPs primarily consist of fully connected layers, where large matrices of weights and activations are multiplied to produce output representations. Given this structure, weight stationary execution is the most effective strategy for MLPs. Similar to CNNs, MLPs benefit from keeping weights in local memory while streaming activations dynamically, as this ensures that weight matrices, which are typically reused across multiple activations in a batch, do not need to be frequently reloaded.

    The preferred memory layout for MLPs aligns with that of Transformers, as matrix multiplications are more efficient when using a row-major (NHWC) format. Since activation matrices are processed in batches, this layout ensures that input activations are accessed efficiently without introducing memory fragmentation. By aligning tensor storage with compute-friendly memory access patterns, cache utilization is improved, reducing memory stalls.

    Kernel fusion in MLPs is primarily applied to General Matrix Multiplication (GEMM) operations. Since dense layers are often followed by activation functions and bias additions, fusing these operations into a single computation step reduces memory traffic. GEMM fusion ensures that activations, weights, and biases are processed within a single optimized kernel, avoiding unnecessary memory writes and reloads.
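
    A minimal sketch of this kind of fusion is the bias term folded into the matrix multiply itself as a single GEMM call; the shapes below are illustrative.

    import torch
    
    X = torch.randn(256, 1024)  # batch of input activations
    W = torch.randn(512, 1024)  # dense-layer weight matrix
    b = torch.randn(512)        # bias
    
    ## One GEMM computes b + X @ W^T, avoiding a separate bias-add pass
    Y = torch.addmm(b, X, W.t())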

    To further improve memory efficiency, MLPs rely on blocked tiling strategies, where large matrix multiplications are divided into smaller sub-blocks that fit within the accelerator’s shared memory. This method ensures that frequently accessed portions of matrices remain in fast memory throughout computation, reducing external memory accesses. By structuring computations in a way that balances memory utilization with efficient parallel execution, blocked tiling minimizes bandwidth limitations and maximizes throughput.

    These optimizations ensure that MLPs achieve high computational efficiency by structuring execution around weight reuse, optimizing memory layouts for dense matrix operations, reducing redundant memory writes through kernel fusion, and employing blocked tiling strategies to maximize on-chip memory utilization.

    11.6.3 Hybrid Mapping Strategies

    While general mapping strategies provide a structured framework for optimizing machine learning models, real-world architectures often involve diverse computational requirements that cannot be effectively addressed with a single, fixed approach. Hybrid mapping strategies allow AI accelerators to dynamically apply different optimizations to specific layers or components within a model, ensuring that each computation is executed with maximum efficiency.

    Machine learning models typically consist of multiple layer types, each exhibiting distinct memory access patterns, data reuse characteristics, and parallelization opportunities. By tailoring mapping strategies to these specific properties, hybrid approaches achieve higher computational efficiency, improved memory bandwidth utilization, and reduced data movement overhead compared to a uniform mapping approach (Sze et al. 2017b).

    Sze, Vivienne, Yu-Hsin Chen, Tien-Ju Yang, and Joel S. Emer. 2017b. “Efficient Processing of Deep Neural Networks: A Tutorial and Survey.” Proceedings of the IEEE 105 (12): 2295–2329. https://doi.org/10.1109/jproc.2017.2761740.

    Layer-Specific Mapping in Hybrid Strategies

    Hybrid mapping strategies are particularly beneficial in models that combine spatially localized computations, such as convolutions, with fully connected operations, such as dense layers or attention mechanisms. These operations possess distinct characteristics that require different mapping strategies for optimal performance.

    In convolutional neural networks, hybrid strategies are frequently employed to optimize performance. Specifically, weight stationary execution is applied to convolutional layers, ensuring that filters remain in local memory while activations are streamed dynamically. For fully connected layers, output stationary execution is utilized to minimize redundant memory writes during matrix multiplications. Additionally, kernel fusion is integrated to combine activation functions, batch normalization, and elementwise operations into a single computational step, thereby reducing intermediate memory traffic. Collectively, these approaches enhance computational efficiency and memory utilization, contributing to the overall performance of the network.

    Transformers employ several strategies to enhance performance by optimizing memory usage and computational efficiency. Specifically, they use activation stationary mapping in self-attention layers to maximize the reuse of stored key-value pairs, thereby reducing memory fetches. In feedforward layers, weight stationary mapping is applied to ensure that large weight matrices are efficiently reused across computations. Additionally, these models incorporate fused attention kernels that integrate softmax and weighted summation into a single computation step, significantly enhancing execution speed (Jacobs et al. 2002).

    Jacobs, David, Bas Rokers, Archisman Rudra, and Zili Liu. 2002. “Fragment Completion in Humans and Machines.” In Advances in Neural Information Processing Systems 14, 35:27–34. The MIT Press. https://doi.org/10.7551/mitpress/1120.003.0008.

    For multilayer perceptrons, hybrid mapping strategies are employed to optimize performance through a combination of techniques that enhance both memory efficiency and computational throughput. Specifically, weight stationary execution is utilized to maximize the reuse of weights across activations, ensuring that these frequently accessed parameters remain readily available and reduce redundant memory accesses. In addition, blocked tiling strategies are implemented for large matrix multiplications, which significantly improve cache locality by partitioning the computation into manageable sub-blocks that fit within fast memory. Complementing these approaches, general matrix multiplication fusion is applied, effectively reducing memory stalls by merging consecutive matrix multiplication operations with subsequent functional transformations. Collectively, these optimizations illustrate how tailored mapping strategies can systematically balance memory constraints with computational demands in multilayer perceptron architectures.

    Hybrid mapping strategies are widely employed in vision transformers, which seamlessly integrate convolutional and self-attention operations. In these models, the patch embedding layer performs a convolution-like operation that benefits from weight stationary mapping (Dosovitskiy et al. 2020). The self-attention layers, on the other hand, require activation stationary execution to efficiently reuse the key-value cache across multiple queries. Additionally, the multilayer perceptron layers leverage general matrix multiplication fusion and blocked tiling to execute dense matrix multiplications efficiently. This layer-specific optimization framework effectively balances memory locality with computational efficiency, rendering vision transformers particularly well-suited for AI accelerators.

    Dosovitskiy, Alexey, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, et al. 2020. “An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale.” International Conference on Learning Representations (ICLR), October. http://arxiv.org/abs/2010.11929v2.

    11.6.4 Hardware Implementations of Hybrid Strategies

    Several modern AI accelerators incorporate hybrid mapping strategies to optimize execution by tailoring layer-specific techniques to the unique computational requirements of diverse neural network architectures. For example, Google TPUs employ weight stationary mapping for convolutional layers and activation stationary mapping for attention layers within transformer models, ensuring that the most critical data remains in fast memory. Likewise, NVIDIA GPUs leverage fused kernels alongside hybrid memory layouts, which enable the application of different mapping strategies within the same model to maximize performance. In addition, Graphcore IPUs dynamically select execution strategies on a per-layer basis to optimize memory access, thereby enhancing overall computational efficiency.

    These real-world implementations illustrate how hybrid mapping strategies bridge the gap between different types of machine learning computations, ensuring that each layer executes with maximum efficiency. However, hardware support is essential for these techniques to be practical. Accelerators must provide architectural features such as programmable memory hierarchies, efficient interconnects, and specialized execution pipelines to fully exploit hybrid mapping.

    Hybrid mapping provides a flexible and efficient approach to deep learning execution, enabling AI accelerators to adapt to the diverse computational requirements of modern architectures. By selecting the optimal mapping technique for each layer, hybrid strategies help reduce memory bandwidth constraints, improve data locality, and maximize parallelism.

    While hybrid mapping strategies offer an effective way to optimize computations at a layer-specific level, they remain static design-time optimizations. In real-world AI workloads, execution conditions can change dynamically due to varying input sizes, memory contention, or hardware resource availability. Machine learning compilers and runtime systems extend these mapping techniques by introducing dynamic scheduling, memory optimizations, and automatic tuning mechanisms. These systems ensure that hybrid strategies are not just predefined execution choices, but rather adaptive mechanisms that allow deep learning workloads to operate efficiently across different accelerators and deployment environments. In the next section, we explore how machine learning compilers and runtime stacks enable these adaptive optimizations through just-in-time scheduling, memory-aware execution, and workload balancing strategies.

    11.7 Compiler Support

    The performance of machine learning acceleration depends not only on hardware capabilities but also on how efficiently models are translated into executable operations. The optimizations discussed earlier in this chapter—kernel fusion, tiling, memory scheduling, and data movement strategies—are essential for maximizing efficiency. However, these optimizations must be systematically applied before execution to ensure they align with hardware constraints and computational requirements.

    This process is handled by machine learning compilers, which form the software stack responsible for bridging high-level model representations with low-level hardware execution. The compiler optimizes the model by restructuring computations, selecting efficient execution kernels, and placing operations in a way that maximizes hardware utilization (Tianqi, Moreau, et al. 2018a).

    While traditional compilers are designed for general-purpose computing, machine learning workloads require specialized approaches due to their reliance on tensor computations, parallel execution, and memory-intensive operations. To understand how these systems differ, we first compare machine learning compilers to their traditional counterparts.

    11.7.1 ML vs. Traditional Compilers

    Machine learning workloads introduce unique challenges that traditional compilers were not designed to handle. Unlike conventional software execution, which primarily involves sequential or multi-threaded program flow, machine learning models are expressed as computation graphs that describe large-scale tensor operations. These graphs require specialized optimizations that traditional compilers cannot efficiently apply (Cui, Li, and Xie 2019).

    Cui, Hongyi, Jiajun Li, Peng Xie, et al. 2019. “A Survey on Machine Learning Compilers: Taxonomy, Challenges, and Future Directions.” ACM Computing Surveys 52 (4): 1–39.

    Table 11.17 outlines the fundamental differences between traditional compilers and those designed for machine learning workloads. While traditional compilers optimize linear program execution through techniques like instruction scheduling and register allocation, ML compilers focus on optimizing computation graphs for efficient tensor operations. This distinction is critical, as ML compilers must incorporate domain-specific transformations such as kernel fusion, memory-aware scheduling, and hardware-accelerated execution plans to achieve high performance on specialized accelerators like GPUs and TPUs.

    Table 11.17: Traditional vs. machine learning compilers and their optimization priorities.
    | Aspect | Traditional Compiler | Machine Learning Compiler |
    |--------|----------------------|---------------------------|
    | Input Representation | Linear program code (C, Python) | Computational graph (ML models) |
    | Execution Model | Sequential or multi-threaded execution | Massively parallel tensor-based execution |
    | Optimization Priorities | Instruction scheduling, loop unrolling, register allocation | Graph transformations, kernel fusion, memory-aware execution |
    | Memory Management | Stack and heap memory allocation | Tensor layout transformations, tiling, memory-aware scheduling |
    | Target Hardware | CPUs (general-purpose execution) | GPUs, TPUs, and custom accelerators |
    | Compilation Output | CPU-specific machine code | Hardware-specific execution plan (kernels, memory scheduling) |

    This comparison highlights why machine learning models require a different compilation approach. Instead of optimizing instruction-level execution, machine learning compilers must transform entire computation graphs, apply tensor-aware memory optimizations, and schedule operations across thousands of parallel processing elements. These requirements make traditional compiler techniques insufficient for modern deep learning workloads.

    11.7.2 The ML Compilation Pipeline

    Machine learning models, as defined in frameworks such as TensorFlow and PyTorch, are initially represented in a high-level computation graph that describes operations on tensors. However, these representations are not directly executable on hardware accelerators such as GPUs, TPUs, and custom AI chips. To achieve efficient execution, models must go through a compilation process that transforms them into optimized execution plans suited for the target hardware (Google Brain 2020).

    Google Brain. 2020. “XLA: Optimizing Compiler for Machine Learning.” TensorFlow Blog. https://tensorflow.org/xla.

    The machine learning compilation workflow consists of several key stages, each responsible for applying specific optimizations that ensure minimal memory overhead, maximum parallel execution, and optimal compute utilization. These stages include:

    1. Graph Optimization: The computation graph is restructured to eliminate inefficiencies.
    2. Kernel Selection: Each operation is mapped to an optimized hardware-specific implementation.
    3. Memory Planning: Tensor layouts and memory access patterns are optimized to reduce bandwidth consumption.
    4. Computation Scheduling: Workloads are distributed across parallel processing elements to maximize hardware utilization.
    5. Code Generation: The optimized execution plan is translated into machine-specific instructions for execution.

    At each stage, the compiler applies theoretical optimizations discussed earlier—such as kernel fusion, tiling, data movement strategies, and computation placement—ensuring that these optimizations are systematically incorporated into the final execution plan.

    By understanding this workflow, we can see how machine learning acceleration is realized not just through hardware improvements but also through compiler-driven software optimizations.

    11.7.3 Graph Optimization

    AI accelerators provide specialized hardware to speed up computation, but raw model representations are not inherently optimized for execution on these accelerators. Machine learning frameworks define models using high-level computation graphs, where nodes represent operations (such as convolutions, matrix multiplications, and activations), and edges define data dependencies. However, if executed as defined, these graphs often contain redundant operations, inefficient memory access patterns, and suboptimal execution sequences that can prevent the hardware from operating at peak efficiency.

    For example, in a Transformer model, the self-attention mechanism involves repeated accesses to the same key-value pairs across multiple attention heads. If compiled naïvely, the model may reload the same data multiple times, leading to excessive memory traffic (Shoeybi et al. 2019a). Similarly, in a Convolutional Neural Network (CNN), applying batch normalization and activation functions as separate operations after each convolution leads to unnecessary intermediate memory writes, increasing memory bandwidth usage. These inefficiencies are addressed during graph optimization, where the compiler restructures the computation graph to eliminate unnecessary operations and improve memory locality (Tianqi, Moreau, et al. 2018a).

    The graph optimization phase of compilation is responsible for transforming this high-level computation graph into an optimized execution plan before it is mapped to hardware. Rather than requiring manual optimization, the compiler systematically applies transformations that improve data movement, reduce redundant computations, and restructure operations for efficient parallel execution (NVIDIA 2021).

    At this stage, the compiler is still working at a hardware-agnostic level, focusing on high-level restructuring that improves efficiency before more hardware-specific optimizations are applied later.

    How the Compiler Optimizes the Computation Graph

    Graph optimization transforms the computation graph through a series of structured techniques designed to enhance execution efficiency. One key technique, which we discussed earlier, is kernel fusion, which merges consecutive operations to eliminate unnecessary memory writes and reduce the number of kernel launches. This approach is particularly effective in convolutional neural networks, where fusing convolution, batch normalization, and activation functions notably accelerates processing. Another important technique is computation reordering, which adjusts the execution order of operations to improve data locality and maximize parallel execution. For instance, in Transformer models, such reordering enables the reuse of cached key-value pairs rather than reloading them repeatedly from memory, thereby reducing latency.

    Additionally, redundant computation elimination plays an important role. By identifying and removing duplicate or unnecessary operations, this method is especially beneficial in models with residual connections where common subexpressions might otherwise be redundantly computed. Furthermore, memory-aware dataflow adjustments enhance overall performance by refining tensor layouts and optimizing memory movement. For example, tiling matrix multiplications to meet the structural requirements of systolic arrays in TPUs ensures that hardware resources are utilized optimally. This combined approach not only reduces unnecessary processing but also aligns data storage and movement with the accelerator’s strengths, leading to efficient execution across diverse AI workloads. Together, these techniques prepare the model for acceleration by minimizing overhead and ensuring an optimal balance between computational and memory resources.

    Practical Implementation in AI Compilers

    Modern AI compilers perform graph optimization through the use of automated pattern recognition and structured rewrite rules, systematically transforming computation graphs to maximize efficiency without manual intervention. For example, Google’s XLA (Accelerated Linear Algebra) in TensorFlow applies graph-level transformations such as fusion and layout optimizations that streamline execution on TPUs and GPUs. Similarly, TVM (Tensor Virtual Machine) not only refines tensor layouts and adjusts computational structures but also tunes execution strategies across diverse hardware backends, which is particularly beneficial for deploying models on embedded Tiny ML devices with strict memory constraints.

    NVIDIA’s TensorRT, another specialized deep learning compiler, focuses on minimizing kernel launch overhead by fusing operations and optimizing execution scheduling on GPUs, thereby improving utilization and reducing inference latency in large-scale convolutional neural network applications. Additionally, MLIR (Multi-Level Intermediate Representation) facilitates flexible graph optimization across various AI accelerators by enabling multi-stage transformations that improve execution order and memory access patterns, thus easing the transition of models from CPU-based implementations to accelerator-optimized versions. These compilers preserve the mathematical integrity of the models while rewriting the computation graph to ensure that the subsequent hardware-specific optimizations can be effectively applied.
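
    As a brief illustration of how a developer hands a graph to one of these compilers, TensorFlow lets a function opt in to XLA via jit_compile=True, after which rewrites such as fusion are applied automatically; the toy function below is our own.

    import tensorflow as tf
    
    @tf.function(jit_compile=True)  # request XLA compilation of this graph
    def scale_relu(x, scale, bias):
        ## XLA may fuse the multiply, ReLU, and add into a single kernel
        return scale * tf.nn.relu(x) + bias
    
    y = scale_relu(tf.random.normal([1024, 1024]), 2.0, 1.0)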

    Why Graph Optimization is Critical for AI Acceleration

    Graph optimization enables AI accelerators to operate at peak efficiency. Without this phase, even the most optimized hardware would be underutilized, as models would be executed in a way that introduces unnecessary memory stalls, redundant computations, and inefficient data movement. By systematically restructuring computation graphs, the compiler arranges operations for efficient execution that mitigates bottlenecks before mapping to hardware, minimizes memory movement to keep tensors in high-speed memory, and optimizes parallel execution to reduce unnecessary serialization while enhancing hardware utilization. For instance, without proper graph optimization, a large Transformer model running on an edge device may experience excessive memory stalls due to suboptimal data access patterns; however, through effective graph restructuring, the model can operate with significantly reduced memory bandwidth consumption and latency, thus enabling real-time inference on devices with constrained resources.

    With the computation graph now fully optimized, the next step in compilation is kernel selection, where the compiler determines which hardware-specific implementation should be used for each operation. This ensures that the structured execution plan is translated into optimized low-level instructions for the target accelerator.

    11.7.4 Kernel Selection

    At this stage, the compiler translates the abstract operations in the computation graph into optimized low-level functions, ensuring that execution is performed as efficiently as possible given the constraints of the target accelerator. A kernel is a specialized implementation of a computational operation designed to run efficiently on a particular hardware architecture. Most accelerators, including GPUs, TPUs, and custom AI chips, provide multiple kernel implementations for the same operation, each optimized for different execution scenarios. Choosing the right kernel for each operation is essential for maximizing computational throughput, minimizing memory stalls, and ensuring that the accelerator’s specialized processing elements are fully utilized (NVIDIA 2021).

    Chen, Tianqi, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Q. Yan, Haichen Shen, Meghan Cowan, et al. 2018a. “TVM: An Automated End-to-End Optimizing Compiler for Deep Learning.” In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), 578–94. https://www.usenix.org/conference/osdi18/presentation/chen.

    Kernel selection builds upon the graph optimization phase, ensuring that the structured execution plan is mapped to the most efficient implementation available. While graph optimization eliminates inefficiencies at the model level, kernel selection ensures that each individual operation is executed using the most efficient hardware-specific routine. The effectiveness of this process directly impacts the model’s overall performance, as poor kernel choices can nullify the benefits of prior optimizations by introducing unnecessary computation overhead or memory bottlenecks (Chen et al. 2018a).

    In a Transformer model, the matrix multiplications that dominate self-attention computations can be executed using different strategies depending on the available hardware. On a CPU, a general-purpose matrix multiplication routine is typically employed, exploiting vectorized execution to improve efficiency. In contrast, on a GPU, the compiler may select an implementation that leverages tensor cores to accelerate matrix multiplications using mixed-precision arithmetic. When the model is deployed on a TPU, the operation can be mapped onto a systolic array, ensuring that data flows through the accelerator in a manner that maximizes reuse and minimizes off-chip memory accesses. Additionally, for inference workloads, an integer arithmetic kernel may be preferable, as it facilitates computations in INT8 instead of floating-point precision, thereby reducing power consumption without significantly compromising accuracy.

    In many cases, compilers do not generate custom kernels from scratch but instead select from vendor-optimized kernel libraries that provide highly tuned implementations for different architectures. For instance, cuDNN and cuBLAS offer optimized kernels for deep learning on NVIDIA GPUs, while oneDNN provides optimized execution for Intel architectures. Similarly, ACL (Arm Compute Library) is optimized for Arm-based devices, and Eigen and BLIS provide efficient CPU-based implementations of deep learning operations. These libraries allow the compiler to choose pre-optimized, high-performance kernels rather than having to reinvent execution strategies for each hardware platform.
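
    As a simplified illustration, kernel selection can be thought of as a lookup over a registry keyed by operation, target, and precision. The registry below and its kernel names are hypothetical placeholders for the vendor-library routines mentioned above, not real API entry points.

    ```python
    # Illustrative kernel registry: maps (operation, target, precision) to a
    # concrete implementation. The kernel names are hypothetical stand-ins
    # for vendor-library routines (e.g. cuBLAS/cuDNN, oneDNN, ACL).

    KERNEL_REGISTRY = {
        ("matmul", "gpu", "fp16"): "tensor_core_gemm_fp16",
        ("matmul", "gpu", "fp32"): "cuda_gemm_fp32",
        ("matmul", "cpu", "fp32"): "avx512_gemm_fp32",
        ("matmul", "cpu", "int8"): "vnni_gemm_int8",
    }

    def select_kernel(op: str, target: str, precision: str) -> str:
        """Pick the most specific kernel available, falling back to FP32."""
        for prec in (precision, "fp32"):
            kernel = KERNEL_REGISTRY.get((op, target, prec))
            if kernel is not None:
                return kernel
        raise ValueError(f"no kernel registered for {op} on {target}")

    print(select_kernel("matmul", "gpu", "fp16"))  # -> tensor_core_gemm_fp16
    print(select_kernel("matmul", "cpu", "bf16"))  # falls back -> avx512_gemm_fp32
    ```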

    How AI Compilers Perform Kernel Selection

    AI compilers use heuristics, profiling, and cost models to determine the best kernel for each operation. These strategies ensure that each computation is executed in a way that maximizes throughput and minimizes memory bottlenecks.

    In rule-based selection, the compiler applies predefined heuristics based on the known capabilities of the hardware. For instance, XLA, the compiler used in TensorFlow, automatically selects tensor core-optimized kernels for NVIDIA GPUs when mixed-precision execution is enabled. These predefined rules allow the compiler to make fast, reliable decisions about which kernel to use without requiring extensive analysis.

    Profile-guided selection takes a more dynamic approach, benchmarking different kernel options and choosing the one that performs best for a given workload. TVM, an open-source AI compiler, uses AutoTVM to empirically evaluate kernel performance, tuning execution strategies based on real-world execution times. By testing different kernels before deployment, profile-guided selection helps ensure that operations are assigned to the most efficient implementation under actual execution conditions.
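
    A toy version of profile-guided selection is sketched below: each candidate implementation of an operation is timed on representative inputs, and the fastest one wins. The candidates here are plain NumPy variants chosen purely for illustration; empirical tuners such as AutoTVM apply the same idea to generated hardware kernels.

    ```python
    # Toy profile-guided selection: benchmark candidate implementations of
    # the same operation and keep the fastest one.

    import time
    import numpy as np

    def matmul_naive(a, b):
        return np.einsum("ik,kj->ij", a, b)   # generic contraction loop

    def matmul_blas(a, b):
        return a @ b                          # dispatches to NumPy's BLAS backend

    def benchmark(fn, a, b, repeats=5):
        fn(a, b)                              # warm-up run
        start = time.perf_counter()
        for _ in range(repeats):
            fn(a, b)
        return (time.perf_counter() - start) / repeats

    a = np.random.rand(256, 256).astype(np.float32)
    b = np.random.rand(256, 256).astype(np.float32)

    candidates = {"naive_einsum": matmul_naive, "blas_gemm": matmul_blas}
    timings = {name: benchmark(fn, a, b) for name, fn in candidates.items()}
    best = min(timings, key=timings.get)
    print(f"selected kernel: {best} ({timings[best] * 1e3:.2f} ms)")
    ```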

    Another approach, cost model-based selection, relies on performance predictions to estimate execution time and memory consumption for various kernels before choosing the most efficient one. MLIR, a compiler infrastructure designed for machine learning workloads, applies this technique to determine the most effective tiling and memory access strategies (Lattner et al. 2020). By modeling how different kernels interact with the accelerator’s compute units and memory hierarchy, the compiler can select the kernel that minimizes execution cost while maximizing performance.

    Lattner, Chris, Mehdi Amini, Uday Bondhugula, Albert Cohen, Andy Davis, Jacques Pienaar, River Riddle, Tatiana Shpeisman, Nicolas Vasilache, and Oleksandr Zinenko. 2020. “MLIR: A Compiler Infrastructure for the End of Moore’s Law.” arXiv Preprint arXiv:2002.11054, February. http://arxiv.org/abs/2002.11054v2.

    Many AI compilers also incorporate precision-aware kernel selection, where the selected kernel is optimized for specific numerical formats such as FP32, FP16, BF16, or INT8. Training workloads often prioritize higher precision (FP32, BF16) to maintain model accuracy, whereas inference workloads favor lower precision (FP16, INT8) to increase speed and reduce power consumption. For example, an NVIDIA GPU running inference with TensorRT can dynamically select FP16 or INT8 kernels based on a model’s accuracy constraints. This trade-off between precision and performance is a key aspect of kernel selection, especially when deploying models in resource-constrained environments.
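
    The precision trade-off can be seen directly in a small experiment: quantize the operands of a matrix multiplication to INT8 with per-tensor absmax scales, run the product with integer accumulation, and compare the dequantized result against the FP32 reference. This is a simplified sketch of what an INT8 inference kernel does, not a production quantization scheme.

    ```python
    # Sketch of the INT8 precision/performance trade-off: quantize, multiply
    # with integer accumulation, dequantize, and measure the error.

    import numpy as np

    def quantize_int8(x):
        scale = np.abs(x).max() / 127.0
        return np.clip(np.round(x / scale), -127, 127).astype(np.int8), scale

    a = np.random.randn(64, 64).astype(np.float32)
    b = np.random.randn(64, 64).astype(np.float32)

    qa, sa = quantize_int8(a)
    qb, sb = quantize_int8(b)

    # Integer matmul with 32-bit accumulation, then rescale back to FP32 range.
    int8_result = (qa.astype(np.int32) @ qb.astype(np.int32)) * (sa * sb)
    fp32_result = a @ b

    rel_err = np.abs(int8_result - fp32_result).mean() / np.abs(fp32_result).mean()
    print(f"mean relative error of INT8 path: {rel_err:.4f}")
    ```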

    Some compilers go beyond static kernel selection and implement adaptive kernel tuning, where execution strategies are adjusted at runtime based on the system’s workload and available resources. AutoTVM in TVM measures kernel performance across different workloads and dynamically refines execution strategies. TensorRT applies real-time optimizations based on batch size, memory constraints, and GPU load, adjusting kernel selection dynamically. Google’s TPU compiler takes a similar approach, optimizing kernel selection based on cloud resource availability and execution environment constraints.

    Why Kernel Selection is Essential

    The efficiency of AI acceleration depends not only on how computations are structured but also on how they are executed. Even the best-designed computation graph will fail to achieve peak performance if the selected kernels do not fully utilize the hardware’s capabilities.

    Proper kernel selection allows models to execute using the most efficient algorithms available for the given hardware, ensuring that memory is accessed in a way that avoids unnecessary stalls and that specialized acceleration features, such as tensor cores or systolic arrays, are leveraged wherever possible. Selecting an inappropriate kernel can lead to underutilized compute resources, excessive memory transfers, and increased power consumption, all of which limit the performance of AI accelerators.

    For instance, if a Transformer model running on a GPU is assigned a non-tensor-core kernel for its matrix multiplications, it may execute at only a fraction of the possible performance. Conversely, if a model designed for FP32 execution is forced to run on an INT8-optimized kernel, it may experience significant numerical instability, degrading accuracy. These choices illustrate why kernel selection is as much about maintaining numerical correctness as it is about optimizing performance.

    With kernel selection complete, the next stage in compilation involves execution scheduling and memory management, where the compiler determines how kernels are launched and how data is transferred between different levels of the memory hierarchy. These final steps in the compilation pipeline ensure that computations run with maximum parallelism while minimizing the overhead of data movement. As kernel selection determines what to execute, execution scheduling and memory management dictate when and how those kernels are executed, ensuring that AI accelerators operate at peak efficiency.

    11.7.5 Memory Planning

    The memory planning phase ensures that data is allocated and accessed in a way that minimizes memory bandwidth consumption, reduces latency, and maximizes cache efficiency (Zhang, Li, and Ouyang 2020). Even with the most optimized execution plan, a model can still suffer from severe performance degradation if memory is not managed efficiently.

    Zhang, Y., J. Li, and H. Ouyang. 2020. “Optimizing Memory Access for Deep Learning Workloads.” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 39 (11): 2345–58.

    Machine learning workloads are often memory-intensive. They require frequent movement of large tensors between different levels of the memory hierarchy. The compiler must determine how tensors are stored, how they are accessed, and how intermediate results are handled to ensure that memory does not become a bottleneck.

    The memory planning phase focuses on optimizing tensor layouts, memory access patterns, and buffer reuse to prevent unnecessary stalls and memory contention during execution. In this phase, tensors are arranged in a memory-efficient format that aligns with hardware access patterns, thereby minimizing the need for format conversions. Additionally, memory accesses are structured to reduce cache misses and stalls, which in turn lowers overall bandwidth consumption. Buffer reuse is also a critical aspect, as it reduces redundant memory allocations by intelligently managing intermediate results. Together, these strategies ensure that data is efficiently placed and accessed, thereby enhancing both computational performance and energy efficiency in AI workloads.

    How AI Compilers Perform Memory Planning

    Memory planning is a complex problem because AI models must balance memory availability, reuse, and access efficiency while operating across multiple levels of the memory hierarchy. AI compilers use several key strategies to manage memory effectively and prevent unnecessary data movement.

    The first step in memory planning is tensor layout optimization, where the compiler determines how tensors should be arranged in memory to maximize locality and prevent unnecessary data format conversions. Different hardware accelerators and kernel libraries prefer different storage layouts, for instance channels-last (NHWC) versus channels-first (NCHW); NVIDIA’s Tensor Core kernels generally run fastest on channels-last data, while many other backends default to channels-first, because the layout determines whether memory accesses coalesce efficiently (Abadi et al. 2016). The compiler automatically transforms tensor layouts based on the expected access patterns of the target hardware, ensuring that memory accesses are aligned for maximum efficiency.
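
    The cost of getting the layout wrong is easy to see: converting between channels-last and channels-first is a transpose that forces a full copy of the tensor, as the short NumPy sketch below illustrates.

    ```python
    # Layout conversion between channels-last (NHWC) and channels-first
    # (NCHW) is just a transpose, but it forces a full copy of the tensor,
    # which is why compilers try to fix one layout up front instead of
    # converting repeatedly at runtime.

    import numpy as np

    nhwc = np.random.rand(8, 224, 224, 3).astype(np.float32)   # N, H, W, C
    nchw = np.ascontiguousarray(nhwc.transpose(0, 3, 1, 2))    # N, C, H, W

    print(nhwc.shape, "->", nchw.shape)
    print("memory traffic per conversion:",
          nhwc.nbytes / 1e6, "MB read +", nchw.nbytes / 1e6, "MB written")
    ```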

    Abadi, M. et al. 2016. “TensorFlow: A System for Large-Scale Machine Learning.” 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265–83.
    Jones, Gareth A. 2018. “Joining Dessins Together.” arXiv Preprint arXiv:1810.03960, October. http://arxiv.org/abs/1810.03960v1.

    Beyond layout optimization, memory planning also includes buffer allocation and reuse, where the compiler minimizes memory footprint by reusing intermediate storage whenever possible. Deep learning workloads generate many temporary tensors, such as activations and gradients, which can quickly overwhelm on-chip memory if not carefully managed. Instead of allocating new memory for each tensor, the compiler analyzes the computation graph to identify opportunities for buffer reuse, ensuring that intermediate values are stored and overwritten efficiently (Jones 2018).
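
    The sketch below shows a greedy form of this analysis, assuming each intermediate tensor has a known size plus a first and last use step: a buffer is recycled once its previous occupant is no longer live. The tensor sizes and lifetimes are made up for illustration.

    ```python
    # Greedy buffer reuse based on tensor lifetimes. Sizes are in KB and
    # the lifetimes below are invented for a small sequential model.

    def plan_buffer_reuse(tensors):
        """tensors: list of (name, size, first_use, last_use), sorted by first_use."""
        buffers = []        # each: {"id", "size", "free_at"}
        assignment = {}
        for name, size, first_use, last_use in tensors:
            reusable = [b for b in buffers
                        if b["free_at"] <= first_use and b["size"] >= size]
            if reusable:
                buf = min(reusable, key=lambda b: b["size"])   # tightest fit
            else:
                buf = {"id": f"buf{len(buffers)}", "size": size, "free_at": 0}
                buffers.append(buf)
            buf["free_at"] = last_use + 1
            assignment[name] = buf["id"]
        return assignment, sum(b["size"] for b in buffers)

    acts = [("a1", 512, 0, 1), ("a2", 512, 1, 2), ("a3", 256, 2, 3), ("a4", 512, 3, 4)]
    mapping, total = plan_buffer_reuse(acts)
    print(mapping)   # a3 and a4 reuse buffers released by a1 and a2
    print("total allocation:", total, "KB instead of", sum(a[1] for a in acts), "KB")
    ```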

    Another critical aspect of memory planning is minimizing data movement between different levels of the memory hierarchy. AI accelerators typically have a mix of high-speed on-chip memory (such as caches or shared SRAM) and larger, but slower, external DRAM. If tensor data is repeatedly moved between these memory levels, the model may become memory-bound, reducing computational efficiency. To prevent this, compilers use tiling strategies that break large computations into smaller, memory-friendly chunks, allowing execution to fit within fast, local memory and reducing the need for costly off-chip memory accesses.
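
    Tiling itself is straightforward to express: the sketch below computes a matrix product block by block so that each partial product touches only a small, cache-sized working set. In practice, the tile size would be chosen to match the capacity of the target's on-chip memory.

    ```python
    # Tiled matrix multiplication: computing C in TILE x TILE blocks keeps
    # the working set small enough to stay in fast on-chip memory.

    import numpy as np

    def tiled_matmul(a, b, tile=64):
        n, k = a.shape
        k2, m = b.shape
        assert k == k2
        c = np.zeros((n, m), dtype=a.dtype)
        for i in range(0, n, tile):
            for j in range(0, m, tile):
                for p in range(0, k, tile):
                    # Each block product touches only ~3 * tile^2 elements.
                    c[i:i+tile, j:j+tile] += a[i:i+tile, p:p+tile] @ b[p:p+tile, j:j+tile]
        return c

    a = np.random.rand(256, 256).astype(np.float32)
    b = np.random.rand(256, 256).astype(np.float32)
    assert np.allclose(tiled_matmul(a, b), a @ b, atol=1e-3)
    ```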

    Why Memory Planning is Essential for AI Acceleration

    Without proper memory planning, even the most optimized computation graph and kernel selection will fail to deliver high performance. Excessive memory transfers, inefficient memory layouts, and redundant memory allocations can all lead to bottlenecks that prevent AI accelerators from reaching their peak throughput.

    For instance, a CNN running on a GPU may achieve high computational efficiency in theory, but if its convolutional feature maps are stored in a layout the hardware’s kernels do not expect, requiring repeated conversion between channels-last (NHWC) and channels-first (NCHW) formats, the constant tensor format conversions can introduce significant overhead. Similarly, a Transformer model deployed on an edge device may struggle to meet real-time inference requirements if memory is not carefully planned, leading to frequent off-chip memory accesses that increase latency and power consumption.

    Through careful management of tensor placement, optimizing memory access patterns, and reducing unnecessary data movement, memory planning guarantees efficient operation of AI accelerators, leading to tangible performance improvements in real-world applications.

    11.7.6 Computation Scheduling

    With graph optimization completed, kernels selected, and memory planning finalized, the next step in the compilation pipeline is computation scheduling. This phase determines when and where each computation should be executed, ensuring that workloads are efficiently distributed across available processing elements while avoiding unnecessary stalls and resource contention (Rajbhandari et al. 2020; Zheng et al. 2020).

    Rajbhandari, Samyam, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. 2020. “ZeRO: Memory Optimization Towards Training Trillion Parameter Models.” Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC). https://doi.org/10.5555/3433701.3433721.
    Zheng, Lianmin, Ziheng Jia, Yida Gao, Jiacheng Lin, Song Han, Xuehai Geng, Eric Zhao, and Tianqi Wu. 2020. “Ansor: Generating High-Performance Tensor Programs for Deep Learning.” USENIX Symposium on Operating Systems Design and Implementation (OSDI), 863–79.
    Jia, Ziheng, Nathan Tillman, Luis Vega, Po-An Ouyang, Matei Zaharia, and Joseph E. Gonzalez. 2019. “Optimizing DNN Computation with Relaxed Graph Substitutions.” Conference on Machine Learning and Systems (MLSys).

    AI accelerators achieve high performance through massive parallelism, but without an effective scheduling strategy, computational units may sit idle, memory bandwidth may be underutilized, and execution efficiency may degrade. Computation scheduling is responsible for ensuring that all processing elements remain active, execution dependencies are managed correctly, and workloads are distributed optimally (Jia et al. 2019).

    In the scheduling phase, parallel execution, synchronization, and resource allocation are managed systematically. Task partitioning decomposes extensive computations into smaller, manageable tasks that can be distributed efficiently among multiple compute cores. Execution order optimization then determines the most effective sequence for launching these operations, maximizing hardware performance while reducing execution stalls. Additionally, resource allocation and synchronization are orchestrated to ensure that compute cores, memory bandwidth, and shared caches are utilized effectively, avoiding contention. Through these coordinated strategies, computation scheduling achieves optimal hardware utilization, minimizes memory access delays, and supports a streamlined and efficient execution process.

    How AI Compilers Perform Computation Scheduling

    Computation scheduling is highly dependent on the underlying hardware architecture, as different AI accelerators have unique execution models that must be considered when determining how workloads are scheduled. AI compilers implement several key strategies to optimize scheduling for efficient execution.

    One of the most fundamental aspects of scheduling is task partitioning, where the compiler divides large computational graphs into smaller, manageable units that can be executed in parallel. On GPUs, this typically means mapping matrix multiplications and convolutions to thousands of CUDA cores, while on TPUs, tasks are partitioned to fit within systolic arrays that operate on structured data flows (Norrie et al. 2021). In CPUs, partitioning is often focused on breaking computations into vectorized chunks that align with SIMD execution. The goal is to map workloads to available processing units efficiently, ensuring that each core remains active throughout execution.

    Norrie, Thomas, Nishant Patil, Doe Hyun Yoon, George Kurian, Sheng Li, James Laudon, Cliff Young, Norman Jouppi, and David Patterson. 2021. “The Design Process for Google’s Training Chips: TPUv2 and TPUv3.” IEEE Micro 41 (2): 56–63. https://doi.org/10.1109/mm.2021.3058217.
    Shoeybi, Mohammad, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019b. “Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism.” arXiv Preprint arXiv:1909.08053, September. http://arxiv.org/abs/1909.08053v4.

    In addition to task partitioning, scheduling also involves optimizing execution order to minimize dependencies and maximize throughput. Many AI models include operations that can be computed independently (e.g., different batches in a batch processing pipeline) alongside operations that have strict dependencies (e.g., recurrent layers in an RNN). AI compilers analyze these dependencies and attempt to rearrange execution where possible, reducing idle time and improving parallel efficiency. For example, in Transformer models, scheduling may prioritize preloading attention matrices into memory while earlier layers are still executing, ensuring that data is ready when needed (Shoeybi et al. 2019b).
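
    Conceptually, this dependency analysis amounts to grouping operations into waves that can be launched together once all of their producers have finished. The sketch below applies that idea to a made-up attention-style graph; real schedulers additionally weigh kernel cost, memory pressure, and interconnect traffic.

    ```python
    # Dependency-aware scheduling sketch: operations become "ready" once all
    # producers have finished, and ready operations with no mutual
    # dependencies can be issued in parallel. The graph is made up.

    def schedule(ops):
        """ops: dict op_name -> list of dependency names. Returns parallel waves."""
        remaining = {op: set(deps) for op, deps in ops.items()}
        waves = []
        while remaining:
            ready = sorted(op for op, deps in remaining.items() if not deps)
            if not ready:
                raise ValueError("cycle detected in computation graph")
            waves.append(ready)
            for op in ready:
                del remaining[op]
            for deps in remaining.values():
                deps.difference_update(ready)
        return waves

    # Attention block: Q, K, V projections are independent; scores need Q and K.
    graph = {
        "q_proj": [], "k_proj": [], "v_proj": [],
        "scores": ["q_proj", "k_proj"],
        "softmax": ["scores"],
        "context": ["softmax", "v_proj"],
    }
    for step, wave in enumerate(schedule(graph)):
        print(f"step {step}: launch {wave} in parallel")
    ```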

    Another crucial aspect of computation scheduling is resource allocation and synchronization, where the compiler determines how compute cores share memory and coordinate execution. Modern AI accelerators often support overlapping computation and data transfers, meaning that while one task executes, the next task can begin fetching its required data. Compilers take advantage of this by scheduling tasks in a way that hides memory latency, ensuring that execution remains compute-bound rather than memory-bound (Chen et al. 2018b). TensorRT and XLA, for example, employ streaming execution strategies where multiple kernels are launched in parallel, and synchronization is carefully managed to prevent execution stalls (Google, n.d.).

    Chen, Tianqi, Thierry Moreau, et al. 2018b. “TVM: An Automated End-to-End Optimizing Compiler for Deep Learning.” In OSDI, 578–94. https://www.usenix.org/conference/osdi18/presentation/chen.
    Google. n.d. “XLA: Optimizing Compiler for Machine Learning.” <https://www.tensorflow.org/xla>.

    Why Computation Scheduling is Essential for AI Acceleration

    Without effective scheduling, even the most optimized model can suffer from underutilized compute resources, memory bottlenecks, and execution inefficiencies. Poor scheduling decisions can lead to idle processing elements, forcing expensive compute cores to wait for data or synchronization events before continuing execution.

    For instance, a CNN running on a GPU may have highly optimized kernels and efficient memory layouts, but if its execution is not scheduled correctly, compute units may remain idle between kernel launches, reducing throughput. Similarly, a Transformer model deployed on a TPU may perform matrix multiplications efficiently but could experience performance degradation if attention layers are not scheduled to overlap efficiently with memory transfers.

    Effective computation scheduling keeps parallel workloads flowing: it ensures that processing elements stay busy rather than idle, which is critical for maximizing overall throughput. By overlapping computation with data movement, the scheduler hides memory latency and prevents stalls while data is fetched. By resolving execution dependencies carefully, it minimizes waiting and allows computation and data transfer to progress concurrently. Together, scheduling and data handling determine how much of an accelerator’s raw capability translates into delivered performance.

    Code Generation

    The final phase of the compilation pipeline is code generation, in which the optimized computation graph, the selected kernels, the memory plan, and the execution schedule are lowered into code that the target hardware can actually run.

    Unlike the previous phases, which required AI-specific optimizations, code generation follows many of the same principles as traditional compilers. This process includes instruction selection, register allocation, and final optimization passes, ensuring that execution makes full use of hardware-specific features such as vectorized execution, memory prefetching, and instruction reordering.

    For CPUs and GPUs, AI compilers typically generate machine code or optimized assembly instructions, while for TPUs, FPGAs, and other accelerators, the output may be optimized bytecode or execution graphs that are interpreted by the hardware’s runtime system.

    At this point, the compilation pipeline is complete: the original high-level model representation has been transformed into an optimized, executable format tailored for efficient execution on the target hardware. The combination of graph transformations, kernel selection, memory-aware execution, and parallel scheduling ensures that AI accelerators run workloads with maximum efficiency, minimal memory overhead, and optimal computational throughput.

    11.7.7 Compilation to Runtime Support

    The compiler plays a fundamental role in AI acceleration, transforming high-level machine learning models into optimized execution plans tailored to the constraints of specialized hardware. Throughout this section, we have seen how graph optimization restructures computation, kernel selection maps operations to hardware-efficient implementations, memory planning optimizes data placement, and computation scheduling ensures efficient parallel execution. Each of these phases is crucial in enabling AI models to fully leverage modern accelerators, ensuring high throughput, minimal memory overhead, and efficient execution pipelines.

    However, compilation alone is not enough to guarantee efficient execution in real-world AI workloads. While compilers statically optimize computation based on known model structures and hardware capabilities, AI execution environments are often dynamic and unpredictable. Batch sizes fluctuate, hardware resources may be shared across multiple workloads, and accelerators must adapt to real-time performance constraints. In these cases, a static execution plan is insufficient, and runtime management becomes critical in ensuring that models execute optimally under real-world conditions.

    This transition from static compilation to adaptive execution is where AI runtimes come into play. Runtimes provide dynamic memory allocation, real-time kernel selection, workload scheduling, and multi-chip coordination, allowing AI models to adapt to varying execution conditions while maintaining efficiency. In the next section, we explore how AI runtimes extend the capabilities of compilers, enabling models to run effectively in diverse and scalable deployment scenarios.

    11.8 Runtime Support

    While compilers optimize AI models before execution, real-world deployment introduces dynamic and unpredictable conditions that static compilation alone cannot fully address (NVIDIA 2021). AI workloads operate in varied execution environments, where factors such as fluctuating batch sizes, shared hardware resources, memory contention, and latency constraints necessitate real-time adaptation. Precompiled execution plans, optimized for a fixed set of assumptions, may become suboptimal when actual runtime conditions change.

    NVIDIA. 2021. “TensorRT: High-Performance Deep Learning Inference Library.” NVIDIA Developer Blog. https://developer.nvidia.com/tensorrt.

    To bridge this gap, AI runtimes provide a dynamic layer of execution management, extending the optimizations performed at compile time with real-time decision-making. Unlike traditional compiled programs that execute a fixed sequence of instructions, AI workloads require adaptive control over memory allocation, kernel execution, and resource scheduling. AI runtimes continuously monitor execution conditions and make on-the-fly adjustments to ensure that machine learning models fully utilize available hardware while maintaining efficiency and performance guarantees.

    At a high level, AI runtimes manage three critical aspects of execution:

    1. Kernel Execution Management: AI runtimes dynamically select and dispatch computation kernels based on the current system state, ensuring that workloads are executed with minimal latency.
    2. Memory Adaptation and Allocation: Since AI workloads frequently process large tensors with varying memory footprints, runtimes adjust memory allocation dynamically to prevent bottlenecks and excessive data movement (Huang et al. 2019).
    3. Execution Scaling: AI runtimes handle workload distribution across multiple accelerators, supporting large-scale execution in multi-chip, multi-node, or cloud environments (Mirhoseini et al. 2017).
    Huang, Yanping et al. 2019. “GPipe: Efficient Training of Giant Neural Networks Using Pipeline Parallelism.” In Advances in Neural Information Processing Systems (NeurIPS).
    Mirhoseini, Azalia et al. 2017. “Device Placement Optimization with Reinforcement Learning.” International Conference on Machine Learning (ICML).

    By dynamically handling these execution aspects, AI runtimes complement compiler-based optimizations, ensuring that models continue to perform efficiently under varying runtime conditions. The next section explores how AI runtimes differ from traditional software runtimes, highlighting why machine learning workloads require fundamentally different execution strategies compared to conventional CPU-based programs.

    11.8.1 ML vs. Traditional Runtimes

    Traditional software runtimes are designed for managing general-purpose program execution, primarily handling sequential and multi-threaded workloads on CPUs. These runtimes allocate memory, schedule tasks, and optimize execution at the level of individual function calls and instructions. In contrast, AI runtimes are specialized for machine learning workloads, which require massively parallel computation, large-scale tensor operations, and dynamic memory management.

    Table 11.18 highlights the fundamental differences between traditional and AI runtimes. One of the key distinctions lies in execution flow. Traditional software runtimes operate on a predictable, structured execution model where function calls and CPU threads follow a predefined control path. AI runtimes, however, execute computational graphs, requiring complex scheduling decisions that account for dependencies between tensor operations, parallel kernel execution, and efficient memory access.

    Table 11.18: Key differences between traditional and AI runtimes.
    | Aspect | Traditional Runtime | AI Runtime |
    |---|---|---|
    | Execution Model | Sequential or multi-threaded execution | Massively parallel tensor execution |
    | Task Scheduling | CPU thread management | Kernel dispatch across accelerators |
    | Memory Management | Static allocation (stack/heap) | Dynamic tensor allocation, buffer reuse |
    | Optimization Priorities | Low-latency instruction execution | Minimizing memory stalls, maximizing parallel execution |
    | Adaptability | Mostly static execution plan | Adapts to batch size and hardware availability |
    | Target Hardware | CPUs (general-purpose execution) | GPUs, TPUs, and custom accelerators |

    Memory management is another major differentiator. Traditional software runtimes handle small, frequent memory allocations, optimizing for cache efficiency and low-latency access. AI runtimes, in contrast, must dynamically allocate, reuse, and optimize large tensors, ensuring that memory access patterns align with accelerator-friendly execution. Poor memory management in AI workloads can lead to performance bottlenecks, particularly due to excessive off-chip memory transfers and inefficient cache usage.

    Moreover, AI runtimes are inherently designed for adaptability. While traditional runtimes often follow a mostly static execution plan, AI workloads typically operate in highly variable execution environments, such as cloud-based accelerators or multi-tenant hardware. As a result, AI runtimes must continuously adjust batch sizes, reallocate compute resources, and manage real-time scheduling decisions to maintain high throughput and minimize execution delays.

    These distinctions demonstrate why AI runtimes require fundamentally different execution strategies compared to traditional software runtimes. Rather than simply managing CPU processes, AI runtimes must oversee large-scale tensor execution, multi-device coordination, and real-time workload adaptation to ensure that machine learning models can run efficiently under diverse and ever-changing deployment conditions.

    11.8.2 Dynamic Kernel Execution

    Dynamic kernel execution is the process of mapping machine learning models to hardware and optimizing runtime execution. While static compilation provides a solid foundation, efficient execution of machine learning workloads requires real-time adaptation to fluctuating conditions such as available memory, data sizes, and computational loads. The runtime functions as an intermediary that continuously adjusts execution strategies to match both the constraints of the underlying hardware and the characteristics of the workload.

    When mapping a machine learning model to hardware, individual computational operations —such as matrix multiplications, convolutions, and activation functions—must be assigned to the most appropriate processing units. This mapping is not fixed; it must be modified during runtime in response to changes in input data, memory availability, and overall system load. Dynamic kernel execution allows the runtime to make real-time decisions regarding kernel selection, execution order, and memory management, ensuring that workloads remain efficient despite these changing conditions.

    For example, consider an AI accelerator executing a deep neural network (DNN) for image classification. If an incoming batch of high-resolution images requires significantly more memory than expected, a statically planned execution may cause cache thrashing or excessive off-chip memory accesses. Instead, a dynamic runtime can adjust tiling strategies on the fly, breaking down tensor operations into smaller tiles that fit within the high-speed on-chip memory. This prevents memory stalls and ensures optimal utilization of caches.

    Similarly, when running a transformer-based natural language processing (NLP) model, the sequence length of input text may vary between inference requests. A static execution plan optimized for a fixed sequence length may lead to underutilization of compute resources when processing shorter sequences or excessive memory pressure with longer sequences. Dynamic kernel execution can mitigate this by selecting different kernel implementations based on the actual sequence length, dynamically adjusting memory allocations and execution strategies to maintain efficiency.

    Moreover, overlapping computation with memory movement is a vital strategy to mitigate performance bottlenecks. AI workloads often encounter delays due to memory-bound issues, where data movement between memory hierarchies limits computation speed. To combat this, AI runtimes implement techniques like asynchronous execution and double buffering, ensuring that computations proceed without waiting for memory transfers to complete. In a large-scale model, for instance, image data can be prefetched while computations are performed on the previous batch, thus maintaining a steady flow of data and avoiding pipeline stalls.
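
    The following sketch simulates double buffering with a background prefetch thread: while one batch is being processed, the next is already being loaded. The load_batch and run_inference functions are stand-ins for host-to-device copies and kernel launches, introduced here only for illustration.

    ```python
    # Double buffering sketch: a background thread prefetches the next batch
    # while the current batch is being processed, hiding transfer latency
    # behind computation.

    import threading
    import queue
    import time

    def load_batch(i):
        time.sleep(0.05)            # pretend this is a host-to-device copy
        return f"batch_{i}"

    def run_inference(batch):
        time.sleep(0.05)            # pretend this is kernel execution
        return f"output_for_{batch}"

    def prefetcher(num_batches, q):
        for i in range(num_batches):
            q.put(load_batch(i))    # fill the pipeline ahead of the consumer
        q.put(None)                 # sentinel: no more data

    buf = queue.Queue(maxsize=2)    # two slots = double buffering
    threading.Thread(target=prefetcher, args=(4, buf), daemon=True).start()

    while (batch := buf.get()) is not None:
        print(run_inference(batch)) # compute overlaps with the next load
    ```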

    Another practical example is the execution of convolutional layers in a CNN on a GPU. If multiple convolution kernels need to be scheduled, a static scheduling approach may lead to inefficient resource utilization due to variation in layer sizes and compute requirements. By dynamically scheduling kernel execution, AI runtimes can prioritize smaller kernels when compute units are partially occupied, improving hardware utilization. For instance, in NVIDIA’s TensorRT runtime, fusion of small kernels into larger execution units is done dynamically to avoid launch overhead, optimizing latency-sensitive inference tasks.

    Dynamic kernel execution plays an essential role in ensuring that machine learning models are executed efficiently. By dynamically adjusting execution strategies in response to real-time system conditions, AI runtimes optimize both training and inference performance across various hardware platforms.

    11.8.3 Kernel Selection at Runtime

    While compilers may perform an initial selection of kernels based on static analysis of the machine learning model and hardware target, AI runtimes often need to override these decisions during execution. Real-time factors, such as available memory, hardware utilization, and workload priorities, may differ significantly from the assumptions made during compilation. By dynamically selecting and switching kernels at runtime, AI runtimes can adapt to these changing conditions, ensuring that models continue to perform efficiently.

    For instance, consider transformer-based language models, where a significant portion of execution time is spent on matrix multiplications. The AI runtime must determine the most efficient way to execute these operations based on the current system state. If the model is running on a GPU with specialized Tensor Cores, the runtime may switch from a standard FP32 kernel to an FP16 kernel to take advantage of hardware acceleration (Shoeybi et al. 2019a). Conversely, if the lower precision of FP16 causes unacceptable numerical instability, the runtime can opt for mixed-precision execution, selectively using FP32 where higher precision is necessary.

    Shoeybi, Mohammad, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019a. “Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism.” arXiv Preprint arXiv:1909.08053, September. http://arxiv.org/abs/1909.08053v4.

    Memory constraints also influence kernel selection. When memory bandwidth is limited, the runtime may adjust its execution strategy, reordering operations or changing the tiling strategy to fit computations into the available cache rather than relying on slower main memory. For example, a large matrix multiplication may be broken into smaller chunks, ensuring that the computation fits into the on-chip memory of the GPU, reducing overall latency.

    Additionally, batch size can influence kernel selection. For workloads that handle a mix of small and large batches, the AI runtime may choose a latency-optimized kernel for small batches and a throughput-optimized kernel for large-scale batch processing. This adjustment ensures that the model continues to operate efficiently across different execution scenarios, without the need for manual tuning.
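
    A minimal sketch of this policy, with an arbitrary threshold and hypothetical kernel names, is shown below; real runtimes typically derive the crossover point from profiling rather than from a fixed constant.

    ```python
    # Runtime kernel choice based on batch size: small batches use a
    # latency-optimized path, large batches a throughput-optimized one.
    # The threshold and kernel names are illustrative only.

    LATENCY_KERNEL = "gemm_small_batch"     # minimal launch overhead
    THROUGHPUT_KERNEL = "gemm_large_batch"  # maximizes occupancy

    def pick_kernel(batch_size, threshold=16):
        return LATENCY_KERNEL if batch_size < threshold else THROUGHPUT_KERNEL

    for bs in (1, 8, 64, 256):
        print(f"batch={bs:4d} -> {pick_kernel(bs)}")
    ```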

    11.8.4 Kernel Scheduling and Resource Utilization

    Once the AI runtime selects an appropriate kernel, the next step is scheduling it in a way that maximizes parallelism and resource utilization. Unlike traditional task schedulers, which are designed to manage CPU threads, AI runtimes must coordinate a much larger number of tasks across parallel execution units such as GPU cores, tensor processing units (TPUs), or custom AI accelerators (Jouppi et al. 2017). Effective scheduling ensures that these computational resources are kept fully engaged, preventing bottlenecks and maximizing throughput.

    Jouppi, Norman P. et al. 2017. “In-Datacenter Performance Analysis of a Tensor Processing Unit.” Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA).

    For example, in image recognition models that use convolutional layers, operations can be distributed across multiple processing units, enabling different filters to run concurrently. This parallelization ensures that the available hardware is fully utilized, speeding up execution. Similarly, batch normalization and activation functions must be scheduled efficiently to avoid unnecessary delays. If these operations are not interleaved with other computations, they may block the pipeline and reduce overall throughput.

    Efficient kernel scheduling is also influenced by real-time memory management. AI runtimes ensure that intermediate data, such as feature maps in deep neural networks, are preloaded into cache before they are needed. This proactive management helps prevent delays caused by waiting for data to be loaded from slower memory tiers, ensuring continuous execution.

    These techniques enable AI runtimes to ensure optimal resource utilization and efficient parallel computation, which are essential for the high-performance execution of machine learning models, particularly in environments that require scaling across multiple hardware accelerators.

    11.9 Multi-chip AI Acceleration

    Modern AI workloads increasingly demand computational resources that exceed the capabilities of single-chip accelerators. This section examines how AI systems scale from individual processors to multi-chip architectures, analyzing the motivation behind different scaling approaches and their impact on system design. By understanding this progression, we can better appreciate how each component of the AI hardware stack—from compute units to memory systems—must adapt to support large-scale machine learning workloads.

    The scaling of AI systems follows a natural progression, starting with integration within a single package through chiplet architectures, extending to multi-GPU configurations within a server, expanding to distributed accelerator pods, and culminating in wafer-scale integration. Each approach presents unique trade-offs between computational density, communication overhead, and system complexity. For instance, chiplet architectures maintain high-speed interconnects within a package, while distributed systems sacrifice communication latency for massive parallelism.

    Understanding these scaling strategies is essential for several reasons. First, it provides insight into how different hardware architectures address the growing computational demands of AI workloads. Second, it reveals the fundamental challenges that arise when extending beyond single-chip execution, such as managing inter-chip communication and coordinating distributed computation. Finally, it establishes the foundation for subsequent discussions on how mapping strategies, compilation techniques, and runtime systems evolve to support efficient execution at scale.

    Chiplet-Based Architectures: Scaling Within a Single Package

    The first step in scaling AI accelerators is to move beyond a single monolithic chip while still maintaining a compact, tightly integrated design. Chiplet architectures achieve this by partitioning large designs into smaller, modular dies that are interconnected within a single package.

    AMD’s chiplet-based architecture.

    Modern AI accelerators, such as AMD’s Instinct MI300, take this approach by integrating multiple compute chiplets alongside memory chiplets, linked by high-speed die-to-die interconnects (Kannan, Dubey, and Horowitz 2023). This modular design allows manufacturers to bypass the manufacturing limits of monolithic chips while still achieving high-density compute.

    Kannan, Harish, Pradeep Dubey, and Mark Horowitz. 2023. “Chiplet-Based Architectures: The Future of AI Accelerators.” IEEE Micro 43 (1): 46–55. https://doi.org/10.1109/MM.2022.1234567.

    However, even within a single package, scaling is not without challenges. Inter-chiplet communication latency, memory coherence, and thermal management become critical factors as more chiplets are integrated. Unlike traditional multi-chip systems, chiplet-based designs must carefully balance latency-sensitive workloads across multiple dies without introducing excessive bottlenecks.

    Multi-GPU Systems: Scaling Beyond a Single Accelerator

    Beyond chiplet-based designs, AI workloads often require multiple discrete GPUs working together. In multi-GPU systems, each accelerator has its own dedicated memory and compute resources, but they must efficiently share data and synchronize execution.

    A common example is NVIDIA DGX systems, which integrate multiple GPUs connected via NVLink or PCIe. This architecture enables workloads to be split across GPUs, typically using data parallelism (where each GPU processes a different batch of data) or model parallelism (where different GPUs handle different parts of a neural network) (Ben-Nun and Hoefler 2019).

    NVLink: A high-speed interconnect that enables faster data transfers between GPUs, reducing communication bottlenecks.

    PCIe (Peripheral Component Interconnect Express): A common interface for connecting high-speed components; however, it typically offers lower bandwidth compared to NVLink for GPU-to-GPU communication.

    Ben-Nun, Tal, and Torsten Hoefler. 2019. “Demystifying Parallel and Distributed Deep Learning: An in-Depth Concurrency Analysis.” ACM Computing Surveys 52 (4): 1–43. https://doi.org/10.1145/3320060.

    As illustrated in Figure 11.7, NVSwitch interconnects enable high-speed communication between GPUs, reducing bottlenecks in distributed training. However, scaling up the number of GPUs introduces new challenges. Cross-GPU communication bandwidth, memory consistency, and workload scheduling become critical constraints, particularly for large-scale models requiring frequent data exchanges. Unlike chiplets, which leverage high-speed die-to-die interconnects, discrete GPUs rely on external links, incurring higher latency and synchronization overhead.

    Figure 11.7: Multi-GPU architecture with NVSwitch interconnects.

    TPU Pods: Scaling Across Distributed Systems

    As models and datasets continue to expand, training and inference workloads must extend beyond single-server configurations. This scaling requirement has led to the development of sophisticated distributed systems where multiple accelerators communicate across networks. Google’s TPU Pods represent a pioneering approach to this challenge, interconnecting hundreds of TPUs to function as a unified system (Jouppi et al. 2020).

    The architectural design of TPU Pods differs fundamentally from traditional multi-GPU systems. While multi-GPU configurations typically rely on NVLink or PCIe connections within a single machine, TPU Pods employ high-bandwidth optical links to interconnect accelerators at data center scale. This design implements a 2D torus interconnect topology, enabling efficient data exchange between accelerators while minimizing communication bottlenecks as workloads scale across nodes.

    The effectiveness of this architecture is demonstrated by its performance scaling. As illustrated in Figure 11.8, TPU Pod performance scales strongly when running ResNet-50, from quarter-pod to full-pod configurations, reaching a 33x speedup at 1024 chips relative to a 16-TPU baseline. Performance continues to improve substantially even as the system grows from 128 to 1024 chips, although communication overheads keep the speedup below the ideal 64x at that scale.

    Figure 11.8: Cloud TPU v3 pods and their performance on ResNet-50 across a range of slice sizes relative to a 16-TPU-chip baseline.

    However, distributing AI workloads across an entire data center introduces unique challenges. Systems must contend with interconnect congestion, synchronization delays, and the complexities of efficient workload partitioning. Unlike multi-GPU setups where accelerators share memory hierarchies, TPU Pods operate in a fully distributed memory system. This architecture necessitates explicit communication strategies to manage data movement effectively, requiring careful consideration of data placement and transfer patterns to maintain scaling efficiency.

    Wafer-Scale AI: Scaling to a Single Massive Processor

    At the frontier of AI scaling, wafer-scale integration represents a paradigm shift—abandoning traditional multi-chip architectures in favor of a single, massive AI processor. Rather than partitioning computation across discrete chips, this approach treats an entire silicon wafer as a unified compute fabric, eliminating the inefficiencies of inter-chip communication.

    As shown in Figure 11.9, Cerebras’ Wafer-Scale Engine (WSE) processors break away from the historical transistor scaling trends of CPUs, GPUs, and TPUs. While these architectures have steadily increased transistor counts along an exponential trajectory, WSE introduces an entirely new scaling paradigm, integrating trillions of transistors onto a single wafer—far surpassing even the most advanced GPUs and TPUs. With WSE-3, this trajectory continues, pushing wafer-scale AI to unprecedented levels (Cerebras Systems 2021a).

    Cerebras Systems. 2021a. “The Wafer-Scale Engine 2: Scaling AI Compute Beyond GPUs.” Cerebras White Paper. https://cerebras.ai/product-chip/.

    The fundamental advantage of wafer-scale AI is its ultra-fast, on-die communication. Unlike chiplets, GPUs, or TPU Pods, where data must traverse physical boundaries between separate devices, wafer-scale AI enables near-instantaneous data transfer across its vast compute array. This architecture drastically reduces communication latency, unlocking performance levels that are unachievable with conventional multi-chip systems.

    However, achieving this level of integration introduces formidable engineering challenges. Thermal dissipation, fault tolerance, and manufacturing yield become major constraints when fabricating a processor of this scale. Unlike distributed TPU systems, which mitigate failures by dynamically re-routing workloads, wafer-scale AI must incorporate built-in redundancy mechanisms to tolerate localized defects in the silicon. Successfully addressing these challenges is essential to realizing the full potential of wafer-scale computing as the next frontier in AI acceleration.

    The Scaling Trajectory of AI Systems

    Table 11.19 illustrates the progressive scaling of AI acceleration, from single-chip processors to increasingly complex architectures such as chiplet-based designs, multi-GPU systems, TPU Pods, and wafer-scale AI. Each step in this evolution introduces new challenges related to data movement, memory access, interconnect efficiency, and workload distribution. While chiplets enable modular scaling within a package, they introduce latency and memory coherence issues. Multi-GPU systems rely on high-speed interconnects like NVLink but face synchronization and communication bottlenecks. TPU Pods push scalability further by distributing workloads across clusters, yet they must contend with interconnect congestion and workload partitioning. At the extreme end, wafer-scale AI integrates an entire wafer into a single computational unit, presenting unique challenges in thermal management and fault tolerance.

    Table 11.19: Scaling trajectory of AI systems and associated challenges.
    | Scaling Approach | Key Feature | Challenges |
    |---|---|---|
    | Chiplets | Modular scaling within a package | Inter-chiplet latency, memory coherence |
    | Multi-GPU | External GPU interconnects (NVLink) | Synchronization overhead, communication bottlenecks |
    | TPU Pods | Distributed accelerator clusters | Interconnect congestion, workload partitioning |
    | Wafer-Scale AI | Entire wafer as a single processor | Thermal dissipation, fault tolerance |

    11.9.1 Scaling Changes Computation and Memory

    As AI systems scale from single-chip accelerators to multi-chip architectures, the fundamental challenges in computation and memory evolve. In a single accelerator, execution is primarily optimized for locality—ensuring that computations are mapped efficiently to available processing elements while minimizing memory access latency. However, as AI systems extend beyond a single chip, the scope of these optimizations expands significantly. Computation must now be distributed across multiple accelerators, and memory access patterns become constrained by interconnect bandwidth and communication overhead.

    Computation Placement Becomes a Multi-Chip Problem

    In single-chip AI accelerators, computation placement is primarily concerned with assigning workloads to processing elements, ensuring efficient parallel execution across cores and vector or tensor units. Placement strategies focus on minimizing data movement within the chip, optimizing cache reuse, and leveraging parallelism across execution units.

    As AI workloads scale to multi-chip architectures, the approach to computation placement must be re-evaluated to encompass the entire system topology. Rather than merely optimizing placement for local processing elements, workloads are now partitioned across a diverse array of accelerators, including GPUs, TPUs, and specialized AI processors. This transition introduces several critical challenges, such as ensuring a balanced distribution of computation across chips to prevent load imbalances, strategically assigning operations to minimize the impact of high-latency off-chip communication, and effectively managing synchronization overhead arising from dependencies that span different accelerators.

    For example, in multi-GPU systems, computation placement must account for NVLink or PCIe bandwidth constraints, ensuring that operations requiring frequent communication are co-located on GPUs with high-bandwidth links. In TPU Pods, placement is influenced by the 2D torus interconnect topology, requiring structured data exchanges to optimize performance (Jouppi et al. 2020). Thus, while single-chip computation placement is primarily a local optimization problem, multi-chip computation placement introduces a global optimization challenge where interconnect topology and data transfer costs must be considered.

    Jouppi, Norman P., Doe Hyun Yoon, George Kurian, Sheng Li, Nishant Patil, James Laudon, Cliff Young, and David Patterson. 2020. “A Domain-Specific Supercomputer for Training Deep Neural Networks.” Communications of the ACM 63 (7): 67–78. https://doi.org/10.1145/3360307.

    Memory Hierarchy Shifts from On-Chip to Distributed Memory

    Memory organization in single-chip AI accelerators is designed to minimize latency and maximize data locality. Hierarchical memory structures, such as L1 and L2 caches, on-chip SRAM, and high-bandwidth memory (HBM), are carefully optimized to reduce reliance on slow off-chip DRAM accesses.

    As AI systems scale beyond a single chip, the memory hierarchy extends beyond a single accelerator and introduces several critical constraints. In multi-chip architectures, inter-chip memory access latency becomes a major consideration since accessing memory located on a different chip incurs significantly higher delays compared to on-chip caches. Additionally, the limited bandwidth of interconnects means that moving data between chips is orders of magnitude slower than data movement within a single chip. Finally, memory management must transition from a shared model to a distributed one, as memory is partitioned across accelerators, necessitating explicit mechanisms for data transfer. This combination of increased latency, constrained bandwidth, and distributed resource management presents unique challenges in designing and optimizing multi-chip AI systems.

    In chiplet-based architectures, for example, accelerators rely on high-speed die-to-die interconnects to exchange data between chiplets. While this enables modular scaling, it also introduces latency penalties compared to monolithic chips. In multi-GPU systems, each GPU has its own local memory (HBM or GDDR), requiring explicit communication via NVLink, PCIe, or RDMA to access data stored on another GPU.

    As a result, memory optimization at scale requires new strategies beyond those used in single-chip accelerators. Data locality, prefetching, and caching policies must now be designed to minimize inter-chip transfers, as off-chip memory accesses become the dominant bottleneck in performance.

    Data Movement Is No Longer Just a Local Concern

    In single-chip architectures, data movement optimizations primarily focus on minimizing unnecessary memory accesses, maximizing on-chip reuse, and leveraging efficient data layouts (e.g., tiling, weight stationarity, or kernel fusion). While these techniques remain relevant, their impact diminishes as AI systems scale to multi-chip execution.

    At scale, inter-chip data movement represents a dominant performance constraint, as it introduces several significant challenges. Cross-chip data transfers must navigate bandwidth limitations inherent in interconnects—such as PCIe, NVLink, or other proprietary links—which operate at speeds considerably slower than those within a single chip. Additionally, when data dependencies span multiple chips, synchronization delays can occur, potentially stalling execution until all necessary data is successfully transmitted. Finally, unlike single-chip systems where caches and shared memory automatically facilitate data reuse, distributed architectures require explicit strategies for data communication and partitioning, further complicating memory management across the system.

    For example, in TPU Pods, data movement is carefully structured using the systolic execution model, where each TPU unit passes data to its neighbor in a predictable manner. This minimizes redundant memory fetches and ensures that interconnect bandwidth is used efficiently. Similarly, in multi-GPU training, techniques such as all-reduce communication are used to synchronize weights across GPUs with minimal overhead.
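
    To illustrate the communication pattern, the sketch below simulates a ring all-reduce in a single process with NumPy: each simulated device exchanges one chunk per step with its neighbor, first summing chunks (reduce-scatter) and then circulating the reduced chunks (all-gather). It is a didactic model of the pattern that libraries such as NCCL implement over real interconnects.

    ```python
    # Single-process simulation of a ring all-reduce over NumPy arrays.

    import numpy as np

    def ring_allreduce(chunks_per_device):
        """chunks_per_device[r][c]: chunk c held by device r. Reduces in place."""
        n = len(chunks_per_device)
        # Reduce-scatter: after n-1 steps, device r owns the fully summed chunk (r+1) % n.
        for step in range(n - 1):
            sends = [(r, (r - step) % n, chunks_per_device[r][(r - step) % n].copy())
                     for r in range(n)]
            for r, c, data in sends:
                chunks_per_device[(r + 1) % n][c] += data
        # All-gather: circulate the reduced chunks so every device holds all of them.
        for step in range(n - 1):
            sends = [(r, (r + 1 - step) % n, chunks_per_device[r][(r + 1 - step) % n].copy())
                     for r in range(n)]
            for r, c, data in sends:
                chunks_per_device[(r + 1) % n][c] = data
        return chunks_per_device

    n, chunk = 4, 8
    grads = [np.random.randn(n * chunk).astype(np.float32) for _ in range(n)]
    expected = sum(grads)
    state = [[g[i * chunk:(i + 1) * chunk].copy() for i in range(n)] for g in grads]
    ring_allreduce(state)
    for device in state:
        assert np.allclose(np.concatenate(device), expected, atol=1e-5)
    print("every device holds the summed gradient after the ring all-reduce")
    ```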

    Thus, while traditional AI acceleration techniques focus on local memory optimization, large-scale AI systems must now prioritize minimizing inter-chip data movement to maintain efficiency.

    Summary: How Compilers and Runtimes Adapt to Scaling

    Table 11.20 highlights how compilers and runtimes adapt to the challenges introduced by scaling AI execution beyond a single-chip accelerator. In a single-chip environment, computation placement focuses on optimizing workload distribution among processing elements, tensor cores, and vector units. However, in a multi-chip system, compilers must implement interconnect-aware scheduling to minimize costly inter-chip communication while ensuring balanced execution across accelerators.

Table 11.20: Adaptations in computation placement, memory management, and scheduling for multi-chip AI execution.
| Aspect | Single-Chip AI Accelerator | Multi-Chip AI System & How Compilers/Runtimes Adapt |
| Computation Placement | Local PEs, tensor cores, vector units | Hierarchical mapping, interconnect-aware scheduling |
| Memory Management | Caching, HBM reuse, local tiling | Distributed allocation, prefetching, caching |
| Data Movement | On-chip reuse, minimal DRAM access | Communication-aware execution, overlap transfers |
| Execution Scheduling | Parallelism, compute occupancy | Global scheduling, interconnect-aware balancing |

    Memory management also evolves significantly with scaling. While a single-chip accelerator benefits from caching, HBM reuse, and efficient tiling, multi-chip systems require explicit memory partitioning and coordination. Compilers optimize memory layouts for distributed execution, and runtimes introduce prefetching and caching mechanisms to reduce inter-chip memory access overhead.

    Data movement becomes increasingly critical at scale. Single-chip accelerators emphasize on-chip data reuse and minimal DRAM accesses, but multi-chip systems must implement communication-aware execution strategies to overlap computation with data transfers. Runtimes handle inter-chip synchronization to prevent execution stalls due to data dependencies.

    Finally, execution scheduling extends from local parallelism and compute occupancy optimization to global coordination across accelerators. Multi-chip systems require dynamic scheduling strategies that balance workload distribution while accounting for interconnect bandwidth and synchronization latency. By adapting to these scaling challenges, compilers and runtimes ensure that AI systems can efficiently leverage distributed architectures for maximum performance.

    11.9.2 Mapping Complexity Increases at Scale

    As AI systems scale from single-chip accelerators to multi-chip architectures, the fundamental challenges in computation and memory evolve. In a single accelerator, execution is primarily optimized for locality—ensuring that computations are mapped efficiently to available processing elements while minimizing memory access latency. However, as AI systems extend beyond a single chip, the scope of these optimizations expands significantly. Computation must now be distributed across multiple accelerators, and memory access patterns become constrained by interconnect bandwidth and communication overhead.

In single-chip architectures, mapping strategies focus on placing computations efficiently within a fixed set of processing elements, while memory allocation ensures efficient reuse of on-chip storage to minimize latency and energy consumption. In multi-chip architectures, however, mapping must satisfy a broader set of constraints: computation, memory, and data movement have to be coordinated across multiple accelerators, each with its own execution units and local memory hierarchy. This shift introduces new challenges in hierarchical computation mapping, distributed memory allocation, and inter-chip data transfer minimization.

This section examines how mapping strategies evolve as AI systems scale, highlighting how computation placement, memory allocation, and data movement must be rethought for efficient execution in multi-chip architectures.

    Mapping From Local to Distributed Execution

    In single-chip AI accelerators, computation placement is concerned with mapping workloads to PEs, vector units, and tensor cores. Mapping strategies aim to maximize data locality, ensuring that computations access nearby memory to reduce costly data movement.

As AI systems scale to multi-chip execution, computation placement must consider several critical factors. Workloads need to be partitioned across multiple accelerators, which requires explicit coordination of execution order and dependencies. Because cross-chip communication incurs far higher latency than access to shared on-chip memory, computation scheduling must also be interconnect-aware to manage these delays. Finally, load balancing across accelerators is vital; an uneven distribution of tasks leaves some accelerators underutilized while others operate at full capacity, hindering overall system performance.

    For example, in multi-GPU training, computation mapping must ensure that each GPU has a balanced portion of the workload while minimizing expensive cross-GPU communication. Similarly, in TPU Pods, mapping strategies must align with the torus interconnect topology, ensuring that computation is placed to minimize long-distance data transfers.

    Thus, while computation placement in single-chip systems is a local optimization problem, in multi-chip architectures, it becomes a global optimization challenge where execution efficiency depends on minimizing inter-chip communication and balancing workload distribution.
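
The load-balancing part of this problem can be illustrated with a toy partitioner. The sketch below, in plain Python, uses the classic longest-processing-time greedy heuristic to spread per-layer compute estimates across accelerators; the costs and chip count are made up, and real compilers also weigh memory footprints and communication, which this sketch ignores:

```python
import heapq

def partition(layer_costs, num_chips):
    """Assign layers to chips so the most-loaded chip is as light as possible."""
    heap = [(0.0, chip, []) for chip in range(num_chips)]   # (load, chip_id, layers)
    heapq.heapify(heap)
    # Place the most expensive layers first, always onto the least-loaded chip.
    for layer, cost in sorted(enumerate(layer_costs), key=lambda kv: -kv[1]):
        load, chip, layers = heapq.heappop(heap)
        layers.append(layer)
        heapq.heappush(heap, (load + cost, chip, layers))
    return sorted(heap, key=lambda entry: entry[1])

# Example: 8 layers with uneven costs (e.g., relative GFLOPs) over 4 accelerators.
for load, chip, layers in partition([9, 7, 6, 5, 4, 3, 2, 1], 4):
    print(f"chip {chip}: layers {sorted(layers)}  load {load}")
```

Running it shows each chip receiving a roughly equal share of the total cost, which is the property a global placement strategy must preserve while also limiting cross-chip traffic.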

    Memory Allocation for Distributed Access

    Memory allocation strategies in single-chip AI accelerators are designed to minimize off-chip memory accesses by leveraging on-chip caches, SRAM, and high-bandwidth memory (HBM). Techniques such as tiling, data reuse, and kernel fusion ensure that computations make efficient use of fast local memory.

    In multi-chip AI systems, each accelerator manages its own local memory, which necessitates the explicit allocation of model parameters, activations, and intermediate data across the devices. Unlike single-chip execution where data is fetched once and reused, multi-chip setups require deliberate strategies to minimize redundant data transfers, as data must be communicated between accelerators. Additionally, when overlapping data is processed by multiple accelerators, the synchronization of shared data can introduce significant overhead that must be carefully managed to ensure efficient execution.

    For instance, in multi-GPU deep learning, gradient synchronization across GPUs is a memory-intensive operation that must be optimized to avoid network congestion (Shallue, Lee, et al. 2019). In wafer-scale AI, memory allocation must account for fault tolerance and redundancy mechanisms, ensuring that defective regions of the wafer do not disrupt execution.

    Shallue, Christopher J., Jaehoon Lee, et al. 2019. “Measuring the Effects of Data Parallelism on Neural Network Training.” Journal of Machine Learning Research 20: 1–49. http://jmlr.org/papers/v20/18-789.html.

    Thus, while memory allocation in single-chip accelerators focuses on local cache efficiency, in multi-chip architectures, it must be explicitly coordinated across accelerators to balance memory bandwidth, minimize redundant transfers, and reduce synchronization overhead.
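
In practice, frameworks already package much of this coordination. The sketch below shows the common PyTorch pattern of wrapping a model in `DistributedDataParallel`, which buckets gradients and overlaps their all-reduce with the backward pass; it assumes a process group has already been initialized by a launcher such as `torchrun`, and `local_rank` is a placeholder for this process's GPU index:

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def wrap_for_data_parallel(model: nn.Module, local_rank: int) -> nn.Module:
    # Sketch only: assumes torch.distributed.init_process_group(...) has run.
    device = torch.device(f"cuda:{local_rank}")
    model = model.to(device)
    # DDP groups gradients into buckets and launches their all-reduce while the
    # backward pass is still executing, overlapping communication with compute
    # and easing the congestion concern described above.
    return DDP(model, device_ids=[local_rank])
```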

    Data Movement Becomes the Dominant Constraint

    In single-chip AI accelerators, data movement optimization is largely focused on minimizing on-chip memory access latency. Techniques such as weight stationarity, input stationarity, and tiling ensure that frequently used data remains close to the execution units, reducing off-chip memory traffic.

    In multi-chip architectures, data movement transcends being merely an intra-chip issue and becomes a significant system-wide bottleneck. Scaling introduces several critical challenges, foremost among them being inter-chip bandwidth constraints; communication links such as PCIe, NVLink, and TPU interconnects operate at speeds that are considerably slower than those of on-chip memory accesses. Additionally, when accelerators share model parameters or intermediate computations, the resulting data synchronization overhead—including latency and contention—can markedly impede execution. Finally, optimizing collective communication is essential for workloads that require frequent data exchanges, such as gradient updates in deep learning training, where minimizing synchronization penalties is imperative for achieving efficient system performance.

    For example, in TPU Pods, systolic execution models ensure that data moves in structured patterns, reducing unnecessary off-chip transfers. In multi-GPU inference, techniques like asynchronous data fetching and overlapping computation with communication help mitigate inter-chip latency.

    Thus, while data movement optimization in single-chip systems focuses on cache locality and tiling, in multi-chip architectures, the primary challenge is reducing inter-chip communication overhead to maximize efficiency.
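
A minimal sketch of this overlap, assuming a single CUDA device, a host-resident list of input `chunks`, and a `model` already placed on that device, uses a side stream to copy the next chunk while the current one is being processed; the names and structure are illustrative rather than a production prefetcher:

```python
import torch

def run_overlapped(chunks, model, device="cuda:0"):
    # While the default stream computes on the current chunk, the side stream
    # copies the next chunk host-to-device, hiding transfer latency.
    copy_stream = torch.cuda.Stream(device)
    nxt = chunks[0].pin_memory().to(device, non_blocking=True)
    outputs = []
    for i in range(len(chunks)):
        cur = nxt
        if i + 1 < len(chunks):
            with torch.cuda.stream(copy_stream):
                nxt = chunks[i + 1].pin_memory().to(device, non_blocking=True)
        outputs.append(model(cur))                                   # compute
        torch.cuda.current_stream(device).wait_stream(copy_stream)   # copy must finish
    return outputs
```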

    Summary: How Compilers and Runtimes Adapt to Scaling

    As AI acceleration extends beyond a single chip, compilers and runtimes must adapt to manage computation placement, memory organization, and execution scheduling across multiple accelerators. The fundamental principles of locality, parallelism, and efficient scheduling remain essential, but their implementation requires new strategies for distributed execution.

    One of the primary challenges in scaling AI execution is computation placement. In a single-chip accelerator, workloads are mapped to processing elements, vector units, and tensor cores with an emphasis on minimizing on-chip data movement and maximizing parallel execution. However, in a multi-chip system, computation must be partitioned hierarchically, where workloads are distributed not just across cores within a chip, but also across multiple accelerators. Compilers handle this by implementing interconnect-aware scheduling, optimizing workload placement to minimize costly inter-chip communication.

    Similarly, memory management evolves as scaling extends beyond a single accelerator. In a single-chip system, local caching, HBM reuse, and efficient tiling strategies ensure that frequently accessed data remains close to computation units. However, in a multi-chip system, each accelerator has its own independent memory, requiring explicit memory partitioning and coordination. Compilers optimize memory layouts for distributed execution, while runtimes introduce data prefetching and caching mechanisms to reduce inter-chip memory access overhead.

    Beyond computation and memory, data movement becomes a major bottleneck at scale. In a single-chip accelerator, efficient on-chip caching and minimized DRAM accesses ensure that data is reused efficiently. However, in a multi-chip system, communication-aware execution becomes critical, requiring compilers to generate execution plans that overlap computation with data transfers. Runtimes handle inter-chip synchronization, ensuring that workloads are not stalled by waiting for data to arrive from remote accelerators.

    Finally, execution scheduling must be extended for global coordination. In single-chip AI execution, scheduling is primarily concerned with parallelism and maximizing compute occupancy within the accelerator. However, in a multi-chip system, scheduling must balance workload distribution across accelerators while taking interconnect bandwidth and synchronization latency into account. Runtimes manage this complexity by implementing adaptive scheduling strategies that dynamically adjust execution plans based on system state and network congestion.

    Table 11.21 summarizes these key adaptations, highlighting how compilers and runtimes extend their capabilities to efficiently support multi-chip AI execution.

Thus, while the fundamentals of AI acceleration remain intact, compilers and runtimes must extend their functionality to operate efficiently across distributed systems. The next section explores how execution models must adapt, coordinating scheduling, memory, and runtime management across multiple accelerators.

Table 11.21: Adaptations in computation placement, memory management, and scheduling for multi-chip AI execution.
| Aspect | Single-Chip AI Accelerator | Multi-Chip AI System & How Compilers/Runtimes Adapt |
| Computation Placement | Local PEs, tensor cores, vector units | Hierarchical mapping, interconnect-aware scheduling |
| Memory Management | Caching, HBM reuse, local tiling | Distributed allocation, prefetching, caching |
| Data Movement | On-chip reuse, minimal DRAM access | Communication-aware execution, overlap transfers |
| Execution Scheduling | Parallelism, compute occupancy | Global scheduling, interconnect-aware balancing |

    11.9.3 Execution Models Must Adapt

    As AI accelerators scale beyond a single chip, execution models must evolve to account for the complexities introduced by distributed computation, memory partitioning, and inter-chip communication. In single-chip accelerators, execution is optimized for local processing elements, with scheduling strategies that balance parallelism, locality, and data reuse. However, in multi-chip AI systems, execution must now be coordinated across multiple accelerators, introducing new challenges in workload scheduling, memory coherence, and interconnect-aware execution.

    This section explores how execution models change as AI acceleration scales, focusing on scheduling, memory coordination, and runtime management in multi-chip systems.

    Local Scheduling to Cross-Accelerator Scheduling

    In single-chip AI accelerators, execution scheduling is primarily aimed at optimizing parallelism within the processor. This involves ensuring that workloads are effectively mapped to tensor cores, vector units, and special function units (SFUs) by employing techniques designed to enhance data locality and resource utilization. For instance, static scheduling uses a predetermined execution order that is carefully optimized for locality and reuse, while dynamic scheduling adapts in real time to variations in workload demands. Additionally, pipeline execution divides computations into stages, thereby maximizing hardware utilization by maintaining a continuous flow of operations.

    In contrast, scheduling in multi-chip architectures must address the additional challenges posed by inter-chip dependencies. Workload partitioning in such systems involves distributing tasks across various accelerators such that each receives an optimal share of the workload, all while minimizing the overhead caused by excessive communication. Moreover, interconnect-aware scheduling is essential to align execution timing with the constraints of inter-chip bandwidth, thus preventing performance stalls. Latency hiding techniques also play a critical role, as they enable the overlapping of computation with communication, effectively reducing waiting times.

    For example, in multi-GPU inference scenarios, execution scheduling is implemented in a way that allows data to be prefetched concurrently with computation, thereby mitigating memory stalls. Similarly, TPU Pods leverage the systolic array model to tightly couple execution scheduling with data flow, ensuring that each TPU core receives its required data precisely when needed. Therefore, while single-chip execution scheduling is focused largely on maximizing internal parallelism, multi-chip systems require a more holistic approach that explicitly manages communication overhead and synchronizes workload distribution across accelerators.
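
To ground the idea of cross-accelerator scheduling, the toy sketch below (assuming two CUDA GPUs; the layer sizes are arbitrary) splits a model into two pipeline stages on separate GPUs and streams micro-batches through them. The hand-off between stages is an explicit inter-chip transfer; a real pipeline runtime would overlap that copy with the next micro-batch's stage-0 work rather than running sequentially as shown:

```python
import torch
import torch.nn as nn

# Toy two-stage pipeline: each stage lives on its own GPU.
stage0 = nn.Sequential(nn.Linear(512, 512), nn.ReLU()).to("cuda:0")
stage1 = nn.Sequential(nn.Linear(512, 512), nn.ReLU()).to("cuda:1")

def pipeline_forward(microbatches):
    outputs = []
    for mb in microbatches:
        h = stage0(mb.to("cuda:0"))
        # Activation hand-off between stages: an explicit cross-GPU copy.
        outputs.append(stage1(h.to("cuda:1", non_blocking=True)))
    return outputs
```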

    Memory and Computation Coordination Across Accelerators

    In single-chip AI accelerators, memory coordination is managed through sophisticated local caching strategies that keep frequently used data in close proximity to the execution units. Techniques such as tiling, kernel fusion, and data reuse are employed to reduce the dependency on slower memory hierarchies, thereby enhancing performance and reducing latency.

    In contrast, multi-chip architectures present a distributed memory coordination challenge that necessitates more deliberate management. Each accelerator in such a system possesses its own independent memory, which must be organized through explicit memory partitioning to minimize cross-chip data accesses. Additionally, ensuring consistency and synchronization of shared data across accelerators is essential to maintain computational correctness. Efficient communication mechanisms must also be implemented to schedule data transfers in a way that limits overhead associated with synchronization delays.

    For instance, in distributed deep learning training, model parameters must be synchronized across multiple GPUs using methods such as all-reduce, where gradients are aggregated across accelerators while reducing communication latency. In wafer-scale AI, memory coordination must further address fault-tolerant execution, ensuring that defective areas do not compromise overall system performance. Consequently, while memory coordination in single-chip systems is primarily concerned with cache optimization, multi-chip architectures require comprehensive management of distributed memory access, synchronization, and communication to achieve efficient execution.
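
The communication pattern behind such synchronization can be seen in a short, pure-Python simulation of the ring all-reduce algorithm, sketched below. Each of the N simulated accelerators holds a gradient split into N chunks, and the reduction completes in 2(N-1) steps in which every link carries only one chunk at a time; the values are made up and no real interconnect is involved:

```python
def ring_all_reduce(grads):
    n = len(grads)                       # number of simulated accelerators
    chunks = [list(g) for g in grads]    # chunks[i][c] = worker i's chunk c (a number here)
    # Reduce-scatter: after n-1 steps, worker i owns the fully summed chunk (i+1) % n.
    for step in range(n - 1):
        for i in range(n):
            c = (i - step) % n
            chunks[(i + 1) % n][c] += chunks[i][c]
    # All-gather: circulate each finished chunk so every worker ends with all the sums.
    for step in range(n - 1):
        for i in range(n):
            c = (i + 1 - step) % n
            chunks[(i + 1) % n][c] = chunks[i][c]
    return chunks

# Four workers, each holding a 4-chunk "gradient"; every worker ends with [10, 10, 10, 10].
print(ring_all_reduce([[1, 1, 1, 1], [2, 2, 2, 2], [3, 3, 3, 3], [4, 4, 4, 4]]))
```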

    Runtimes Must Manage Cross-Accelerator Execution

    Execution in single-chip AI accelerators is managed by AI runtimes that handle workload scheduling, memory allocation, and hardware execution. These runtimes optimize execution at the kernel level, ensuring that computations are executed efficiently within the available resources.

    In multi-chip AI systems, runtimes must incorporate a comprehensive strategy for distributed execution orchestration. This approach ensures that both computation and memory access are seamlessly coordinated across multiple accelerators, enabling efficient utilization of hardware resources and minimizing bottlenecks associated with data transfers.

    Furthermore, these systems require robust mechanisms for cross-chip workload synchronization. Careful management of dependencies and timely coordination between accelerators are essential to prevent stalls in execution that may arise from delays in inter-chip communication. Such synchronization is critical for maintaining the flow of computation, particularly in environments where latency can significantly impact overall performance.

    Finally, adaptive execution models play a pivotal role in contemporary multi-chip architectures. These models dynamically adjust execution plans based on current hardware availability and communication constraints, ensuring that the system can respond to changing conditions and optimize performance in real time. Together, these strategies provide a resilient framework for managing the complexities of distributed AI execution.

    For example, in Google’s TPU Pods, the TPU runtime is responsible for scheduling computations across multiple TPU cores, ensuring that workloads are executed in a way that minimizes communication bottlenecks. In multi-GPU frameworks like PyTorch and TensorFlow, runtime execution must synchronize operations across GPUs, ensuring that data is transferred efficiently while maintaining execution order.

    Thus, while single-chip runtimes focus on optimizing execution within a single processor, multi-chip runtimes must handle system-wide execution, balancing computation, memory, and interconnect performance.

    Summary: How Compilers and Runtimes Adapt Computation Placement

    As AI systems expand beyond single-chip execution, computation placement must adapt to account for inter-chip workload distribution and interconnect efficiency. In single-chip accelerators, compilers optimize placement by mapping workloads to tensor cores, vector units, and PEs, ensuring maximum parallelism while minimizing on-chip data movement. However, in multi-chip systems, placement strategies must address interconnect bandwidth constraints, synchronization latency, and hierarchical workload partitioning across multiple accelerators.

Table 11.22 highlights these adaptations. To reduce expensive cross-chip communication, compilers now implement interconnect-aware workload partitioning, strategically assigning computations to accelerators based on communication cost. For instance, in multi-GPU training, compilers optimize placement to minimize NVLink or PCIe traffic, whereas TPU Pods leverage the torus interconnect topology to keep data exchanges local to neighboring chips.

Table 11.22: Adaptations in computation placement strategies for multi-chip AI execution.
| Aspect | Single-Chip AI Accelerator | Multi-Chip AI System & How Compilers/Runtimes Adapt |
| Computation Placement | Local PEs, tensor cores, vector units | Hierarchical mapping, interconnect-aware scheduling |
| Workload Distribution | Optimized within a single chip | Partitioning across accelerators, minimizing inter-chip communication |
| Synchronization | Managed within local execution units | Runtimes dynamically balance workloads, adjust execution plans |
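
To make the idea of placement based on communication cost concrete, the toy sketch below greedily assigns each operator to the chip that minimizes its estimated compute load plus the cost of fetching inputs from other chips. The operator names, costs, and the cost model itself are invented for illustration; production compilers such as XLA use far more sophisticated analyses:

```python
def place(ops, edges, num_chips, link_cost=5.0):
    """ops: {name: compute_cost}; edges: [(producer, consumer, units_moved)]."""
    placement, load = {}, [0.0] * num_chips
    inputs = {}
    for src, dst, size in edges:
        inputs.setdefault(dst, []).append((src, size))
    for op, cost in ops.items():                      # operators in topological order
        def total(chip):
            # Only inputs coming from a different chip pay the interconnect penalty.
            comm = sum(size * link_cost
                       for src, size in inputs.get(op, [])
                       if placement.get(src) != chip)
            return load[chip] + cost + comm
        best = min(range(num_chips), key=total)
        placement[op] = best
        load[best] += cost
    return placement, load

ops = {"embed": 4, "attn": 9, "mlp": 8, "head": 3}
edges = [("embed", "attn", 2), ("attn", "mlp", 2), ("mlp", "head", 1)]
print(place(ops, edges, num_chips=2))
```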

Runtimes complement this by dynamically managing execution workloads, adjusting placement in real time to balance loads across accelerators. Unlike static compilation, which assumes a fixed hardware topology, AI runtimes continuously monitor system conditions and migrate tasks as needed to prevent bottlenecks. This ensures efficient execution even in environments with fluctuating workload demands or varying hardware availability.

By extending local execution strategies to multi-chip environments, computation placement now requires a careful balance among parallel execution, memory locality, and interconnect-aware scheduling. It builds on local optimizations while introducing new challenges in inter-chip coordination, communication-aware execution, and dynamic load balancing, which together define what it means to accelerate AI efficiently at scale.

    11.10 Conclusion

    The rapid advancement of machine learning has fundamentally reshaped computer architecture and system design, driving the need for specialized hardware and optimized software to support the increasing computational demands of AI workloads. This chapter has explored the foundational principles of AI acceleration, analyzing how domain-specific architectures, memory hierarchies, and data movement strategies work in concert to maximize performance and mitigate bottlenecks.

    We began by examining the historical progression of AI hardware, tracing the shift from general-purpose processors to specialized accelerators tailored for machine learning workloads. This evolution has been driven by the computational intensity of AI models, necessitating vectorized execution, matrix processing, and specialized function units to accelerate key operations.

    Memory systems play a pivotal role in AI acceleration, as modern workloads require efficient management of large-scale tensor data across hierarchical memory structures. This chapter detailed the challenges posed by memory bandwidth limitations, irregular access patterns, and off-chip communication, highlighting techniques such as tiling, kernel fusion, and memory-aware data placement that optimize data movement and reuse.

    Mapping neural networks to hardware requires balancing computation placement, memory allocation, and execution scheduling. We analyzed key mapping strategies, including weight-stationary, output-stationary, and hybrid approaches, and explored how compilers and runtimes transform high-level models into optimized execution plans that maximize hardware utilization.

    As AI workloads scale beyond single-chip accelerators, new challenges emerge in distributed execution, memory coherence, and inter-chip communication. This chapter examined how multi-GPU architectures, TPU pods, and wafer-scale AI systems address these challenges by leveraging hierarchical workload partitioning, distributed memory management, and interconnect-aware scheduling. We also explored how compilers and runtimes must adapt to orchestrate execution across multiple accelerators, ensuring efficient workload distribution and minimizing communication overhead.

    The increasing complexity of AI models and the growing scale of machine learning workloads underscore a broader shift in computing—one where specialization and hardware-software co-design are essential for achieving efficiency and scalability. Understanding the fundamental trade-offs in AI acceleration enables system designers, researchers, and engineers to make informed decisions about deploying and optimizing AI models across diverse hardware platforms.

    This chapter has provided a comprehensive foundation in AI acceleration, equipping readers with the knowledge to navigate the evolving intersection of machine learning systems, hardware design, and system optimization. As AI continues to advance, the ability to efficiently map computations to hardware will remain a key determinant of performance, scalability, and future innovation in artificial intelligence.

    11.11 Resources

    Here is a curated list of resources to support students and instructors in their learning and teaching journeys. We are continuously working on expanding this collection and will add new exercises soon.

    Slides
    • Coming soon.
    Videos
    • Coming soon.
    Exercises
    • Coming soon.