4 DNN Architectures
Purpose
What recurring patterns emerge across modern deep learning architectures, and how do these patterns enable systematic approaches to AI system design?
Deep learning architectures represent a convergence of computational patterns that form the building blocks of modern AI systems. These foundational patterns — from convolutional structures to attention mechanisms — reveal how complex models arise from simple, repeatable components. The examination of these architectural elements provides insights into the systematic construction of flexible, efficient AI systems, establishing core principles that influence every aspect of system design and deployment. These structural insights illuminate the path toward creating scalable, adaptable solutions across diverse application domains.
- Map fundamental neural network concepts to deep learning architectures (dense, spatial, temporal, attention-based).
- Analyze how architectural patterns shape computational and memory demands.
- Evaluate system-level impacts of architectural choices on system attributes.
- Compare architectures’ hardware mapping and identify optimization strategies.
- Assess trade-offs between complexity and system needs for specific applications.
4.1 Overview
A deep learning architecture is a specific organization of neural network components—the neurons, weights, and connections introduced in Chapter 3—arranged to efficiently process different types of patterns in data. While the previous chapter established the fundamental building blocks of neural networks, in this chapter we examine how these components are structured into architectures that map efficiently to computer systems.
Neural network architectures have evolved to address specific pattern processing challenges. Whether processing arbitrary feature relationships, exploiting spatial patterns, managing temporal dependencies, or handling dynamic information flow, each architectural pattern emerged from particular computational needs. From a computer systems perspective, understanding these architectures means examining how their computational patterns map to system resources.
Most often, these architectures are discussed in terms of their algorithmic structures (MLPs, CNNs, RNNs, Transformers). In this chapter, however, we take a more fundamental approach: each section analyzes how specific pattern processing needs shape an architecture’s algorithmic structure, and how that structure in turn maps to computer system resources. The mapping from algorithmic requirements to computer system design involves several key considerations:
- Memory access patterns: How data moves through the memory hierarchy
- Computation characteristics: The nature and organization of arithmetic operations
- Data movement: Requirements for on-chip and off-chip data transfer
- Resource utilization: How computational and memory resources are allocated
For example, dense connectivity patterns generate different memory bandwidth demands than localized processing structures. Similarly, stateful processing creates distinct requirements for on-chip memory organization compared to stateless operations. Getting a firm grasp on these mappings is important for modern computer architects and system designers who must implement these algorithms efficiently in hardware.
4.2 Multi-Layer Perceptrons: Dense Pattern Processing
Multi-Layer Perceptrons (MLPs) represent the most direct extension of neural networks into deep architectures. Unlike more specialized networks, MLPs process each input element with equal importance, making them versatile but computationally intensive. Their architecture, while simple, establishes fundamental computational patterns that appear throughout deep learning systems. These patterns were initially formalized by the introduction of the Universal Approximation Theorem (UAT) (Cybenko 1992; Hornik, Stinchcombe, and White 1989), which states that a sufficiently large MLP with non-linear activation functions can approximate any continuous function on a compact domain, given suitable weights and biases.
When applied to the MNIST handwritten digit recognition challenge, an MLP reveals its computational power by transforming a complex \(28\times 28\) pixel image into a precise digit classification. By treating each of the 784 pixels as an equally weighted input, the network learns to decompose visual information through a systematic progression of layers, converting raw pixel intensities into increasingly abstract representations that capture the essential characteristics of handwritten digits.
4.2.1 Pattern Processing Needs
Deep learning systems frequently encounter problems where any input feature could potentially influence any output—there are no inherent constraints on these relationships. Consider analyzing financial market data, where any economic indicator might affect any market outcome, or natural language processing, where the meaning of a word could depend on any other word in the sentence. These scenarios demand an architectural pattern capable of learning arbitrary relationships across all input features.
Dense pattern processing addresses this fundamental need by enabling several key capabilities. First, it allows unrestricted feature interactions where each output can depend on any combination of inputs. Second, it facilitates learned feature importance, allowing the system to determine which connections matter rather than having them prescribed. Finally, it provides adaptive representation, enabling the network to reshape its internal representations based on the data.
For example, in the MNIST digit recognition task, while humans might focus on specific parts of digits (like loops in ‘6’ or crossings in ‘8’), we cannot definitively say which pixel combinations are important for classification. A ‘7’ written with a serif could share pixel patterns with a ‘2’, while variations in handwriting mean discriminative features might appear anywhere in the image. This uncertainty about feature relationships necessitates a dense processing approach where every pixel can potentially influence the classification decision.
4.2.2 Algorithmic Structure
To enable unrestricted feature interactions, MLPs implement a direct algorithmic solution: connect everything to everything. This is realized through a series of fully-connected layers, where each neuron connects to every neuron in adjacent layers. The dense connectivity pattern translates mathematically into matrix multiplication operations. As shown in Figure 4.1, each layer transforms its input through matrix multiplication followed by element-wise activation: \[ \mathbf{h}^{(l)} = f\big(\mathbf{W}^{(l)}\mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}\big) \]
The dimensions of these operations reveal the computational scale of dense pattern processing:
- Input vector: \(\mathbf{h}^{(0)} \in \mathbb{R}^{d_{\text{in}}}\) represents all potential input features
- Weight matrices: \(\mathbf{W}^{(l)} \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}\) capture all possible input-output relationships
- Output vector: \(\mathbf{h}^{(l)} \in \mathbb{R}^{d_{\text{out}}}\) produces transformed representations
In the MNIST example, this means:
- Each 784-dimensional input (\(28\times 28\) pixels) connects to every neuron in the first hidden layer
- A hidden layer with 100 neurons requires a \(784\times 100\) weight matrix
- Each weight in this matrix represents a learnable relationship between an input pixel and a hidden feature
This algorithmic structure directly addresses our need for arbitrary feature relationships but creates specific computational patterns that must be handled efficiently by computer systems.
4.2.3 Computational Mapping
The elegant mathematical representation of dense matrix multiplication maps to specific computational patterns that systems must handle. Let’s examine how this mapping progresses from mathematical abstraction to computational reality.
The first implementation, mlp_layer_matrix, directly mirrors our mathematical equation. It uses high-level matrix operations (matmul) to express the computation in a single line, hiding the underlying complexity. This is the style commonly used in deep learning frameworks, where optimized libraries handle the actual computation.
# Mathematical abstraction in code
def mlp_layer_matrix(X, W, b):
    # X: input matrix (batch_size × num_inputs)
    # W: weight matrix (num_inputs × num_outputs)
    # b: bias vector (num_outputs)
    H = activation(matmul(X, W) + b)  # One clean line of math
    return H
The second implementation, mlp_layer_compute, exposes the actual computational pattern through nested loops. This version shows us what really happens when we compute a layer’s output: we process each sample in the batch, computing each output neuron by accumulating weighted contributions from all inputs.
# Core computational pattern
def mlp_layer_compute(X, W, b):
    # Process each sample in the batch
    for batch in range(batch_size):
        # Compute each output neuron
        for out in range(num_outputs):
            # Initialize with bias
            Z[batch, out] = b[out]
            # Accumulate weighted inputs
            for in_ in range(num_inputs):
                Z[batch, out] += X[batch, in_] * W[in_, out]
    H = activation(Z)
    return H
This translation from mathematical abstraction to concrete computation exposes how dense matrix multiplication decomposes into nested loops of simpler operations. The outer loop processes each sample in the batch, while the middle loop computes values for each output neuron. Within the innermost loop, the system performs repeated multiply-accumulate operations, combining each input with its corresponding weight.
In the MNIST example, each output neuron requires 784 multiply-accumulate operations and at least 1,568 memory accesses (784 for inputs, 784 for weights). While actual implementations use sophisticated optimizations through libraries like BLAS or cuBLAS, these fundamental patterns drive key system design decisions.
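A quick back-of-the-envelope check of these counts, written as a small illustrative Python snippet (the variable names are ours, not part of any framework), confirms the numbers quoted above:

# Operation counts for one dense layer in the MNIST example (illustrative)
num_inputs, num_outputs = 784, 100
macs_per_neuron = num_inputs                  # one multiply-accumulate per input
reads_per_neuron = 2 * num_inputs             # each MAC reads one input and one weight
layer_macs = macs_per_neuron * num_outputs    # 78,400 for the whole hidden layer
print(macs_per_neuron, reads_per_neuron, layer_macs)  # 784 1568 78400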
4.2.4 System Implications
When analyzing how computational patterns impact computer systems, we typically examine three fundamental dimensions: memory requirements, computation needs, and data movement. This framework enables a systematic analysis of how algorithmic patterns influence system design decisions. We will use this framework for analyzing other network architectures, allowing us to compare and contrast their different characteristics.
Memory Requirements
For dense pattern processing, the memory requirements stem from storing and accessing weights, inputs, and intermediate results. In our MNIST example, connecting our 784-dimensional input layer to a hidden layer of 100 neurons requires 78,400 weight parameters. Each forward pass must access all these weights, along with input data and intermediate results. The all-to-all connectivity pattern means there’s no inherent locality in these accesses—every output needs every input and its corresponding weights.
These memory access patterns suggest opportunities for optimization through careful data organization and reuse. Modern processors handle these patterns differently—CPUs leverage their cache hierarchy for data reuse, while GPUs employ specialized memory hierarchies designed for high-bandwidth access. Deep learning frameworks abstract these hardware-specific details through optimized matrix multiplication implementations.
Computation Needs
The core computation revolves around multiply-accumulate operations arranged in nested loops. Each output value requires as many multiply-accumulates as there are inputs. For MNIST, this means 784 multiply-accumulates per output neuron. With 100 neurons in our hidden layer, we’re performing 78,400 multiply-accumulates for a single input image. While these operations are simple, their volume and arrangement create specific demands on processing resources.
This computational structure lends itself to particular optimization strategies in modern hardware. The dense matrix multiplication pattern can be efficiently parallelized across multiple processing units, with each handling different subsets of neurons. Modern hardware accelerators take advantage of this through specialized matrix multiplication units, while deep learning frameworks automatically convert these operations into optimized BLAS (Basic Linear Algebra Subprograms) calls. CPUs and GPUs can both exploit cache locality by carefully tiling the computation to maximize data reuse, though their specific approaches differ based on their architectural strengths.
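To make the tiling idea concrete, here is a minimal numpy sketch of a blocked (tiled) matrix multiplication; the tile size and shapes are illustrative, and production libraries such as BLAS implement far more sophisticated versions of the same idea. Each small block of X and W stays resident in cache while it is reused across an entire tile of outputs, which is exactly the data reuse that hardware exploits.

# A minimal sketch of loop tiling for the dense layer's matrix multiplication
import numpy as np

def tiled_matmul(X, W, tile=32):
    # X: (batch × num_inputs), W: (num_inputs × num_outputs)
    batch, n_in = X.shape
    n_out = W.shape[1]
    Z = np.zeros((batch, n_out))
    for i0 in range(0, batch, tile):
        for j0 in range(0, n_out, tile):
            for k0 in range(0, n_in, tile):
                # Multiply one small block at a time so it fits in cache
                Z[i0:i0+tile, j0:j0+tile] += (
                    X[i0:i0+tile, k0:k0+tile] @ W[k0:k0+tile, j0:j0+tile]
                )
    return Z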
Data Movement
The all-to-all connectivity pattern in MLPs creates significant data movement requirements. Each multiply-accumulate operation needs three pieces of data: an input value, a weight value, and the running sum. For our MNIST example layer, computing a single output value requires moving 784 inputs and 784 weights to wherever the computation occurs. This movement pattern repeats for each of the 100 output neurons, creating substantial data transfer demands between memory and compute units.
The predictable nature of these data movement patterns enables strategic data staging and transfer optimizations. Different architectures address this challenge through various mechanisms—CPUs use sophisticated prefetching and multi-level caches, while GPUs employ high-bandwidth memory systems and latency hiding through massive threading. Deep learning frameworks orchestrate these data movements through optimized memory management systems.
4.3 Convolutional Neural Networks: Spatial Pattern Processing
While MLPs treat each input element independently, many real-world data types exhibit strong spatial relationships. Images, for example, derive their meaning from the spatial arrangement of pixels—a pattern of edges and textures that form recognizable objects. Audio signals show temporal patterns of frequency components, and sensor data often contains spatial or temporal correlations. These spatial relationships suggest that treating every input-output connection with equal importance, as MLPs do, might not be the most effective approach.
4.3.1 Pattern Processing Needs
Spatial pattern processing addresses scenarios where the relationship between data points depends on their relative positions or proximity. Consider processing a natural image: a pixel’s relationship with its neighbors is important for detecting edges, textures, and shapes. These local patterns then combine hierarchically to form more complex features—edges form shapes, shapes form objects, and objects form scenes.
This hierarchical spatial pattern processing appears across many domains. In computer vision, local pixel patterns form edges and textures that combine into recognizable objects. Speech processing relies on patterns across nearby time segments to identify phonemes and words. Sensor networks analyze correlations between physically proximate sensors to understand environmental patterns. Medical imaging depends on recognizing tissue patterns that indicate biological structures.
Taking image processing as an example, if we want to detect a cat in an image, certain spatial patterns must be recognized: the triangular shape of ears, the round contours of the face, the texture of fur. Importantly, these patterns maintain their meaning regardless of where they appear in the image—a cat is still a cat whether it’s in the top-left or bottom-right corner. This suggests two key requirements for spatial pattern processing: the ability to detect local patterns and the ability to recognize these patterns regardless of their position.
This leads us to the convolutional neural network architecture (CNN), introduced by Y. LeCun et al. (1989). CNNs address spatial pattern processing through a fundamentally different connection pattern than MLPs. Instead of connecting every input to every output, CNNs use a local connection pattern where each output connects only to a small, spatially contiguous region of the input. This local receptive field moves across the input space, applying the same set of weights at each position—a process known as convolution.
4.3.2 Algorithmic Structure
The core operation in a CNN can be expressed mathematically as: \[ \mathbf{H}^{(l)}_{i,j,k} = f\left(\sum_{di}\sum_{dj}\sum_{c} \mathbf{W}^{(l)}_{di,dj,c,k}\mathbf{H}^{(l-1)}_{i+di,j+dj,c} + \mathbf{b}^{(l)}_k\right) \]
Here, \((i,j)\) represents spatial positions, \(k\) indexes output channels, \(c\) indexes input channels, and \((di,dj)\) spans the local receptive field. Unlike the dense matrix multiplication of MLPs, this operation:
- Processes local neighborhoods (typically \(3\times 3\) or \(5\times 5\))
- Reuses the same weights at each spatial position
- Maintains spatial structure in its output
For a concrete example, consider our MNIST digit classification task with \(28\times 28\) grayscale images. Each convolutional layer applies a set of filters (say \(3\times 3\)) that slide across the image, computing local weighted sums. If we use 32 filters, the layer produces a \(28\times 28\times 32\) output, where each spatial position contains 32 different feature measurements of its local neighborhood. This is in stark contrast to our MLP approach where we flattened the entire image into a 784-dimensional vector.
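As a quick sanity check on these shapes and sizes, the following illustrative Python arithmetic compares the convolutional layer (assuming stride 1 and “same” padding so the output stays \(28\times 28\)) with the earlier fully-connected layer:

# Shape and parameter comparison for the MNIST example (illustrative)
in_h, in_w, in_ch = 28, 28, 1
k, out_ch = 3, 32
output_shape = (in_h, in_w, out_ch)             # (28, 28, 32) feature map
conv_params = k * k * in_ch * out_ch + out_ch   # 288 weights + 32 biases
dense_params = 784 * 100 + 100                  # 78,500 for the 784 -> 100 MLP layer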
This algorithmic structure directly implements the requirements we identified for spatial pattern processing, creating distinct computational patterns that influence system design. For a detailed visual exploration of these network structures, the CNN Explainer project provides an interactive visualization that illuminates how different convolutional networks are constructed.
4.3.3 Computational Mapping
The elegant spatial structure of convolution operations maps to computational patterns quite different from the dense matrix multiplication of MLPs. Let’s examine how this mapping progresses from mathematical abstraction to computational reality.
The first implementation, conv_layer_spatial, uses high-level convolution operations to express the computation concisely. This is typical in deep learning frameworks, where optimized libraries handle the underlying complexity.
# Mathematical abstraction - simple and clean
def conv_layer_spatial(input, kernel, bias):
    output = convolution(input, kernel) + bias
    return activation(output)
The second implementation, conv_layer_compute, reveals the actual computational pattern: nested loops that process each spatial position, applying the same filter weights to local regions of the input.
# System reality - nested loops of computation
def conv_layer_compute(input, kernel, bias):
    # Loop 1: Process each image in batch
    for image in range(batch_size):
        # Loop 2&3: Move across image spatially
        for y in range(height):
            for x in range(width):
                # Loop 4: Compute each output feature
                for out_channel in range(num_output_channels):
                    result = bias[out_channel]
                    # Loop 5&6: Move across kernel window
                    for ky in range(kernel_height):
                        for kx in range(kernel_width):
                            # Loop 7: Process each input feature
                            for in_channel in range(num_input_channels):
                                # Get input value from correct window position
                                in_y = y + ky
                                in_x = x + kx
                                # Perform multiply-accumulate operation
                                result += input[image, in_y, in_x, in_channel] * \
                                          kernel[ky, kx, in_channel, out_channel]
                    # Store result for this output position
                    output[image, y, x, out_channel] = result
The seven nested loops reveal different aspects of the computation:
- Outer loops (1-3) manage position: which image and where in the image
- Middle loop (4) handles output features: computing different learned patterns
- Inner loops (5-7) perform the actual convolution: sliding the kernel window
Let’s take a closer look. The spatial loops (for y and for x) traverse each position in the output feature map—for our MNIST example, this means moving across all \(28\times 28\) positions. At each position, we compute values for each output channel (the for out_channel loop), which represents different learned features or patterns—our 32 different feature detectors.
The inner loops implement the actual convolution operation at each position. For each output value, we process a local \(3\times 3\) region of the input (the ky and kx loops) across all input channels (the for in_channel loop). This creates a sliding window effect, where the same \(3\times 3\) filter moves across the image, performing multiply-accumulates between the filter weights and the local input values. Unlike the MLP’s global connectivity, this local processing pattern means each output value depends only on a small neighborhood of the input.
For our MNIST example with \(3\times 3\) filters and 32 output channels, each output position requires only 9 multiply-accumulate operations per input channel, compared to the 784 operations needed in our MLP layer. However, this operation must be repeated for every spatial position \((28\times 28)\) and every output channel (32).
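Expressed as a short illustrative snippet (assuming a single input channel and a same-padded \(28\times 28\) output), the arithmetic shows that while each output needs far fewer operations, the total work across all positions and channels remains substantial:

# Multiply-accumulate counts for the example conv layer vs. the MLP layer
per_position_per_channel = 3 * 3 * 1    # 9 MACs, vs. 784 per output in the MLP layer
total_conv_macs = 9 * 28 * 28 * 32      # 225,792 across the whole feature map
total_dense_macs = 784 * 100            # 78,400 for the MLP hidden layer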
While using fewer operations per output, the spatial structure creates different patterns of memory access and computation that systems must handle efficiently. These patterns fundamentally influence system design, creating both challenges and opportunities for optimization, which we’ll examine next.
4.3.4 System Implications
When analyzing how computational patterns impact computer systems, we examine three fundamental dimensions: memory requirements, computation needs, and data movement. For CNNs, the spatial nature of processing creates distinctive patterns in each dimension that differ significantly from the dense connectivity of MLPs.
Memory Requirements
For convolutional layers, memory requirements center around two key components: filter weights and feature maps. Unlike MLPs that require storing full connection matrices, CNNs use small, reusable filters. In our MNIST example, a convolutional layer with 32 filters of size \(3\times 3\) requires storing only 288 weight parameters \((3\times 3\times 32)\), in contrast to the 78,400 weights needed for our MLP’s fully-connected layer. However, the system must store feature maps for all spatial positions, creating a different memory demand—a \(28\times 28\) input with 32 output channels requires storing 25,088 activation values \((28\times 28\times 32)\).
These memory access patterns suggest opportunities for optimization through weight reuse and careful feature map management. Modern processors handle these patterns by caching filter weights, which are reused across spatial positions, while streaming through feature map data. Deep learning frameworks typically implement this through specialized memory layouts that optimize for both filter reuse and spatial locality in feature map access. CPUs and GPUs approach this differently—CPUs leverage their cache hierarchy to keep frequently used filters resident, while GPUs use specialized memory architectures designed for the spatial access patterns of image processing.
Computation Needs
The core computation in CNNs involves repeatedly applying small filters across spatial positions. Each output value requires a local multiply-accumulate operation over the filter region. For our MNIST example with \(3\times 3\) filters and 32 output channels, computing one spatial position involves 288 multiply-accumulates \((3\times 3\times 32)\), and this must be repeated for all 784 spatial positions \((28\times 28)\). While each individual computation involves fewer operations than an MLP layer, the total computational load remains substantial due to spatial repetition.
This computational pattern presents different optimization opportunities than MLPs. The regular, repeated nature of convolution operations enables efficient hardware utilization through structured parallelism. Modern processors exploit this pattern in various ways. CPUs leverage SIMD instructions to process multiple filter positions simultaneously, while GPUs parallelize computation across spatial positions and channels. Deep learning frameworks further optimize this through specialized convolution algorithms that transform the computation to better match hardware capabilities.
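One widely used transformation of this kind is im2col, which unrolls each local window into a row so the whole convolution becomes a single dense matrix multiplication that optimized GEMM libraries can execute. The sketch below is a minimal illustration, assuming a single image, stride 1, and no padding:

# Lowering a convolution to a matrix multiplication via im2col (illustrative)
import numpy as np

def conv2d_im2col(image, kernel):
    # image: (H, W, C_in), kernel: (kh, kw, C_in, C_out)
    H, W, C_in = image.shape
    kh, kw, _, C_out = kernel.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    # Gather every kh x kw x C_in window into one row of a matrix
    cols = np.zeros((out_h * out_w, kh * kw * C_in))
    for y in range(out_h):
        for x in range(out_w):
            cols[y * out_w + x] = image[y:y+kh, x:x+kw, :].ravel()
    # The convolution is now a single dense matrix multiplication
    out = cols @ kernel.reshape(kh * kw * C_in, C_out)
    return out.reshape(out_h, out_w, C_out)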
Data Movement
The sliding window pattern of convolutions creates a distinctive data movement profile. Unlike MLPs where each weight is used once per forward pass, CNN filter weights are reused many times as the filter slides across spatial positions. For our MNIST example, each \(3\times 3\) filter weight is reused 784 times (once for each position in the \(28\times 28\) feature map). However, this creates a different challenge: the system must stream input features through the computation unit while keeping filter weights stable.
The predictable spatial access pattern enables strategic data movement optimizations. Different architectures handle this movement pattern through specialized mechanisms. CPUs maintain frequently used filter weights in cache while streaming through input features. GPUs employ memory architectures optimized for spatial locality and provide hardware support for efficient sliding window operations. Deep learning frameworks orchestrate these movements by organizing computations to maximize filter weight reuse and minimize redundant feature map accesses.
4.4 Recurrent Neural Networks: Sequential Pattern Processing
While MLPs handle arbitrary relationships and CNNs process spatial patterns, many real-world problems involve sequential data where the order and relationship between elements over time matters. Text processing requires understanding how words relate to previous context, speech recognition needs to track how sounds form coherent patterns, and time-series analysis must capture how values evolve over time. These sequential relationships suggest that treating each time step independently misses crucial temporal patterns.
4.4.1 Pattern Processing Needs
Sequential pattern processing addresses scenarios where the meaning of current input depends on what came before it. Consider natural language processing: the meaning of a word often depends heavily on previous words in the sentence. The word “bank” means something different in “river bank” versus “bank account.” Similarly, in speech recognition, a phoneme’s interpretation often depends on surrounding sounds, and in financial forecasting, future predictions require understanding patterns in historical data.
The key challenge in sequential processing is maintaining and updating relevant context over time. When reading text, humans don’t start fresh with each word—we maintain a running understanding that evolves as we process new information. Similarly, when processing time-series data, patterns might span different timescales, from immediate dependencies to long-term trends. This suggests we need an architecture that can both maintain state over time and update it based on new inputs.
These requirements demand specific capabilities from our processing architecture. The system must maintain internal state to capture temporal context, update this state based on new inputs, and learn which historical information is relevant for current predictions. Unlike MLPs and CNNs, which process fixed-size inputs, sequential processing must handle variable-length sequences while maintaining computational efficiency. This leads us to the recurrent neural network (RNN) architecture.
4.4.2 Algorithmic Structure
RNNs address sequential processing through a fundamentally different approach than MLPs or CNNs by introducing recurrent connections. Instead of just mapping inputs to outputs, RNNs maintain an internal state that is updated at each time step. This creates a memory mechanism that allows the network to carry information forward in time. This unique ability to model temporal dependencies was first explored by Elman (2002), who demonstrated how RNNs could find structure in time-dependent data.
The core operation in a basic RNN can be expressed mathematically as: \[ \mathbf{h}_t = f(\mathbf{W}_{hh}\mathbf{h}_{t-1} + \mathbf{W}_{xh}\mathbf{x}_t + \mathbf{b}_h) \] where \(\mathbf{h}_t\) represents the hidden state at time \(t\), \(\mathbf{x}_t\) is the input at time \(t\), \(\mathbf{W}_{hh}\) contains the recurrent weights, and \(\mathbf{W}_{xh}\) contains the input weights, as shown in the unfolded network structure in Figure 4.3.
For example, in processing a sequence of words, each word might be represented as a 100-dimensional vector (\(\mathbf{x}_t\)), and we might maintain a hidden state of 128 dimensions (\(\mathbf{h}_t\)). At each time step, the network combines the current input with its previous state to update its understanding of the sequence. This creates a form of memory that can capture patterns across time steps.
This recurrent structure directly implements our requirements for sequential processing through the introduction of recurrent connections, which maintain internal state and allow the network to carry information forward in time. Instead of processing all inputs independently, RNNs process sequences of data by iteratively updating a hidden state based on the current input and the previous hidden state, as depicted in Figure 4.3. This makes RNNs well-suited for tasks such as language modeling, speech recognition, and time-series forecasting.
4.4.3 Computational Mapping
The sequential structure of RNNs maps to computational patterns quite different from both MLPs and CNNs. Let’s examine how this mapping progresses from mathematical abstraction to computational reality.
The rnn_layer_step function shows how the operation looks when using high-level matrix operations found in deep learning frameworks. It handles a single time step, taking the current input x_t and previous hidden state h_prev, along with two weight matrices: W_hh for hidden-to-hidden connections and W_xh for input-to-hidden connections. Through matrix multiplication operations (matmul), it merges the previous state and current input to generate the next hidden state.
# Mathematical abstraction in code
def rnn_layer_step(x_t, h_prev, W_hh, W_xh, b):
    # x_t: input at time t (batch_size × input_dim)
    # h_prev: previous hidden state (batch_size × hidden_dim)
    # W_hh: recurrent weights (hidden_dim × hidden_dim)
    # W_xh: input weights (input_dim × hidden_dim)
    h_t = activation(matmul(h_prev, W_hh) + matmul(x_t, W_xh) + b)
    return h_t
This simplified view hides the nested loops and individual operations that actually carry out the computation. The detailed implementation below reveals this computational reality:
# Core computational pattern
def rnn_layer_compute(x_t, h_prev, W_hh, W_xh, b):
    # Initialize next hidden state
    h_t = np.zeros_like(h_prev)

    # Loop 1: Process each sequence in the batch
    for batch in range(batch_size):
        # Loop 2: Compute recurrent contribution (h_prev × W_hh)
        for i in range(hidden_dim):
            for j in range(hidden_dim):
                h_t[batch, i] += h_prev[batch, j] * W_hh[j, i]

        # Loop 3: Compute input contribution (x_t × W_xh)
        for i in range(hidden_dim):
            for j in range(input_dim):
                h_t[batch, i] += x_t[batch, j] * W_xh[j, i]

        # Loop 4: Add bias and apply activation
        for i in range(hidden_dim):
            h_t[batch, i] = activation(h_t[batch, i] + b[i])

    return h_t
The nested loops in rnn_layer_compute expose the core computational pattern of RNNs. Loop 1 processes each sequence in the batch independently, allowing for batch-level parallelism. Within each batch item, Loop 2 computes how the previous hidden state influences the next state through the recurrent weights W_hh. Loop 3 then incorporates new information from the current input through the input weights W_xh. Finally, Loop 4 adds biases and applies the activation function to produce the new hidden state.
For a sequence processing task with input dimension 100 and hidden state dimension 128, each time step requires two matrix multiplications: one \(128\times 128\) for the recurrent connection and one \(100\times 128\) for the input projection. While individual time steps can process in parallel across batch elements, the time steps themselves must process sequentially. This creates a unique computational pattern that systems must handle efficiently.
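A minimal sketch of processing a full sequence makes this sequential constraint explicit: each call to the rnn_layer_step function defined above must wait for the hidden state produced by the previous call, so only the batch dimension offers parallelism.

# Processing a whole sequence with rnn_layer_step (illustrative)
def rnn_process_sequence(X_seq, h_0, W_hh, W_xh, b):
    # X_seq: list of per-step inputs, each (batch_size × input_dim)
    h_t = h_0
    outputs = []
    for x_t in X_seq:                 # sequential dependency: steps cannot run in parallel
        h_t = rnn_layer_step(x_t, h_t, W_hh, W_xh, b)
        outputs.append(h_t)
    return outputs, h_t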
4.4.4 System Implications
For RNNs, the sequential nature of processing creates distinctive patterns in each dimension (memory requirements, computation needs, and data movement) that differ significantly from both MLPs and CNNs.
Memory Requirements
RNNs require storing two sets of weights (input-to-hidden and hidden-to-hidden) along with the hidden state. For our example with input dimension 100 and hidden state dimension 128, this means storing 12,800 weights for input projection \((100\times 128)\) and 16,384 weights for recurrent connections \((128\times 128)\). Unlike CNNs where weights are reused across spatial positions, RNN weights are reused across time steps. Additionally, the system must maintain the hidden state, which becomes a critical factor in memory usage and access patterns.
These memory access patterns create a different profile from MLPs and CNNs. Modern processors handle these patterns by keeping the weight matrices in cache while streaming through sequence elements. Deep learning frameworks optimize memory access by batching sequences together and carefully managing hidden state storage between time steps. CPUs and GPUs approach this through different strategies—CPUs leverage their cache hierarchy for weight reuse, while GPUs use specialized memory architectures designed for maintaining state across sequential operations.
Computation Needs
The core computation in RNNs involves repeatedly applying weight matrices across time steps. For each time step, we perform two matrix multiplications: one with the input weights and one with the recurrent weights. In our example, processing a single time step requires 12,800 multiply-accumulates for the input projection \((100\times 128)\) and 16,384 multiply-accumulates for the recurrent connection \((128\times 128)\).
This computational pattern differs from both MLPs and CNNs in a key way: while we can parallelize across batch elements, we cannot parallelize across time steps due to the sequential dependency. Each time step must wait for the previous step’s hidden state before it can begin computation. This creates a tension between the inherent sequential nature of the algorithm and the desire for parallel execution in modern hardware.
Modern processors handle these patterns through different approaches. CPUs pipeline operations within each time step while maintaining the sequential order across steps. GPUs batch multiple sequences together to maintain high throughput despite sequential dependencies. Deep learning frameworks optimize this further by techniques like sequence packing and unrolling computations across multiple time steps when possible.
Data Movement
The sequential processing in RNNs creates a distinctive data movement pattern that differs from both MLPs and CNNs. While MLPs need each weight only once per forward pass and CNNs reuse weights across spatial positions, RNNs reuse their weights across time steps while requiring careful management of the hidden state data flow.
For our example with a 128-dimensional hidden state, each time step must: load the previous hidden state (128 values), access both weight matrices (29,184 total weights from both input and recurrent connections), and store the new hidden state (128 values). This pattern repeats for every element in the sequence. Unlike CNNs where we can predict and prefetch data based on spatial patterns, RNN data movement is driven by temporal dependencies.
Different architectures handle this sequential data movement through specialized mechanisms. CPUs maintain weight matrices in cache while streaming through sequence elements and managing hidden state updates. GPUs employ memory architectures optimized for maintaining state information across sequential operations while processing multiple sequences in parallel. Deep learning frameworks orchestrate these movements by managing data transfers between time steps and optimizing batch operations.
4.5 Attention Mechanisms: Dynamic Pattern Processing
While previous architectures process patterns in fixed ways—MLPs with dense connectivity, CNNs with spatial operations, and RNNs with sequential updates—many tasks require dynamic relationships between elements that change based on content. Language understanding, for instance, needs to capture relationships between words that depend on meaning rather than just position. Graph analysis requires understanding connections that vary by node. These dynamic relationships suggest we need an architecture that can learn and adapt its processing patterns based on the data itself.
4.5.1 Pattern Processing Needs
Dynamic pattern processing addresses scenarios where relationships between elements aren’t fixed by architecture but instead emerge from content. Consider language translation: when translating “the bank by the river,” understanding “bank” requires attending to “river,” but in “the bank approved the loan,” the important relationship is with “approved” and “loan.” Unlike RNNs that process information sequentially or CNNs that use fixed spatial patterns, we need an architecture that can dynamically determine which relationships matter.
This requirement for dynamic processing appears across many domains. In protein structure prediction, interactions between amino acids depend on their chemical properties and spatial arrangements. In graph analysis, node relationships vary based on graph structure and node features. In document analysis, connections between different sections depend on semantic content rather than just proximity.
These scenarios demand specific capabilities from our processing architecture. The system must compute relationships between all pairs of elements, weigh these relationships based on content, and use these weights to selectively combine information. Unlike previous architectures with fixed connectivity patterns, dynamic processing requires the flexibility to modify its computation graph based on the input itself. This leads us to the Transformer architecture, which implements these capabilities through attention mechanisms.
4.5.2 Basic Attention Mechanism
Algorithmic Structure
Attention mechanisms form the foundation of dynamic pattern processing by computing weighted connections between elements based on their content (Bahdanau, Cho, and Bengio 2014). This approach allows for the processing of relationships that aren’t fixed by architecture but instead emerge from the data itself. At the core of an attention mechanism is a fundamental operation that can be expressed mathematically as: \[ \text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax} \left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}}\right)\mathbf{V} \]
In this equation, \(\mathbf{Q}\) (queries), \(\mathbf{K}\) (keys), and \(\mathbf{V}\) (values) represent learned projections of the input. For a sequence of length \(N\) with dimension \(d\), this operation creates an \(N\times N\) attention matrix, determining how each position should attend to all others.
The attention operation involves several key steps. First, it computes query, key, and value projections for each position in the sequence. Next, it generates an \(N\times N\) attention matrix through query-key interactions. These steps are illustrated in Figure 4.4. Finally, it uses these attention weights to combine value vectors, producing the output.
The key is that, unlike the fixed weight matrices found in previous architectures, as shown in Figure 4.5, these attention weights are computed dynamically for each input. This allows the model to adapt its processing based on the dynamic content at hand.
Computational Mapping
The dynamic structure of attention operations maps to computational patterns that differ significantly from those of previous architectures. To understand this mapping, let’s examine how it progresses from mathematical abstraction to computational reality:
# Mathematical abstraction in code
def attention_layer_matrix(Q, K, V):
    # Q, K, V: (batch_size × seq_len × d_model)
    scores = matmul(Q, K.transpose(-2, -1)) / sqrt(d_k)  # Compute attention scores
    weights = softmax(scores)                            # Normalize scores
    output = matmul(weights, V)                          # Combine values
    return output

# Core computational pattern
def attention_layer_compute(Q, K, V):
    # Initialize outputs
    scores = np.zeros((batch_size, seq_len, seq_len))
    outputs = np.zeros_like(V)

    # Loop 1: Process each sequence in batch
    for b in range(batch_size):
        # Loop 2: Compute attention for each query position
        for i in range(seq_len):
            # Loop 3: Compare with each key position
            for j in range(seq_len):
                # Compute attention score
                for d in range(d_model):
                    scores[b, i, j] += Q[b, i, d] * K[b, j, d]
                scores[b, i, j] /= sqrt(d_k)

        # Apply softmax to scores
        for i in range(seq_len):
            scores[b, i] = softmax(scores[b, i])

        # Loop 4: Combine values using attention weights
        for i in range(seq_len):
            for j in range(seq_len):
                for d in range(d_model):
                    outputs[b, i, d] += scores[b, i, j] * V[b, j, d]

    return outputs
The nested loops in attention_layer_compute reveal the true nature of attention’s computational pattern. The first loop processes each sequence in the batch independently. The second and third loops compute attention scores between all pairs of positions, creating a quadratic computation pattern with respect to sequence length. The fourth loop uses these attention weights to combine values from all positions, producing the final output.
System Implications
The attention mechanism creates distinctive patterns in memory requirements, computation needs, and data movement that set it apart from previous architectures.
Memory Requirements
In terms of memory requirements, attention mechanisms necessitate storage for attention weights, key-query-value projections, and intermediate feature representations. For a sequence length \(N\) and dimension d, each attention layer must store an \(N\times N\) attention weight matrix for each sequence in the batch, three sets of projection matrices for queries, keys, and values (each sized \(d\times d\)), and input and output feature maps of size \(N\times d\). The dynamic generation of attention weights for every input creates a memory access pattern where intermediate attention weights become a significant factor in memory usage.
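A rough, illustrative estimate of per-layer activation storage (single head, batch size 1, 32-bit values, and an assumed model dimension of 512) shows how the \(N\times N\) term grows with sequence length:

# Illustrative activation-memory estimate for one attention layer
d = 512                                  # assumed model dimension
for N in (512, 2048, 8192):              # sequence lengths
    attention_matrix = N * N             # attention weights, quadratic in N
    projections = 3 * N * d              # query, key, and value activations
    features = 2 * N * d                 # input and output feature maps
    megabytes = 4 * (attention_matrix + projections + features) / 1e6
    print(N, round(megabytes, 1))        # the N*N term eventually dominates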
Computation Needs
Computation needs in attention mechanisms center around two main phases: generating attention weights and applying them to values. For each attention layer, the system performs substantial multiply-accumulate operations across multiple computational stages. The query-key interactions alone require \(N\times N\times d\) multiply-accumulates, with an equal number needed for applying attention weights to values. Additional computations are required for the projection matrices and softmax operations. This computational pattern differs from previous architectures due to its quadratic scaling with sequence length and the need to perform fresh computations for each input.
Data Movement
Data movement in attention mechanisms presents unique challenges. Each attention operation involves projecting and moving query, key, and value vectors for each position, storing and accessing the full attention weight matrix, and coordinating the movement of value vectors during the weighted combination phase. This creates a data movement pattern where intermediate attention weights become a major factor in system bandwidth requirements. Unlike the more predictable access patterns of CNNs or the sequential access of RNNs, attention operations require frequent movement of dynamically computed weights across the memory hierarchy.
These distinctive characteristics of attention mechanisms in terms of memory, computation, and data movement have significant implications for system design and optimization, setting the stage for the development of more advanced architectures like Transformers.
4.5.3 Transformers and Self-Attention
Transformers, first introduced by Vaswani et al. (2017), represent a significant evolution in the application of attention mechanisms, introducing the concept of self-attention to create a powerful architecture for dynamic pattern processing. While the basic attention mechanism allows for content-based weighting of information from a source sequence, Transformers extend this idea by applying attention within a single sequence, enabling each element to attend to all other elements including itself.
Algorithmic Structure
The key innovation in Transformers lies in their use of self-attention layers. In a self-attention layer, the queries, keys, and values are all derived from the same input sequence. This allows the model to weigh the importance of different positions within the same sequence when encoding each position. For instance, in processing the sentence “The animal didn’t cross the street because it was too wide,” self-attention allows the model to link “it” with “street,” capturing long-range dependencies that are challenging for traditional sequential models.
Transformers typically employ multi-head attention, which involves multiple sets of query/key/value projections. Each set, or “head,” can focus on different aspects of the input, allowing the model to jointly attend to information from different representation subspaces. This multi-head structure provides the model with a richer representational capability, enabling it to capture various types of relationships within the data simultaneously.
The self-attention mechanism in Transformers can be expressed mathematically in a form similar to the basic attention mechanism: \[ \text{SelfAttention}(\mathbf{X}) = \text{softmax} \left(\frac{\mathbf{XW_Q}(\mathbf{XW_K})^T}{\sqrt{d_k}}\right)\mathbf{XW_V} \]
Here, \(\mathbf{X}\) is the input sequence, and \(\mathbf{W_Q}\), \(\mathbf{W_K}\), and \(\mathbf{W_V}\) are learned weight matrices for queries, keys, and values respectively. This formulation highlights how self-attention derives all its components from the same input, creating a dynamic, content-dependent processing pattern.
The Transformer architecture leverages this self-attention mechanism within a broader structure that typically includes feed-forward layers, layer normalization, and residual connections (see Figure 4.6). This combination allows Transformers to process input sequences in parallel, capturing complex dependencies without the need for sequential computation. As a result, Transformers have demonstrated remarkable effectiveness across a wide range of tasks, from natural language processing to computer vision, revolutionizing the landscape of deep learning architectures.
Computational Mapping
While Transformer self-attention builds upon the basic attention mechanism, it introduces distinct computational patterns that set it apart. To understand these patterns, we must examine the typical implementation of self-attention in Transformers:
def self_attention_layer(X, W_Q, W_K, W_V, d_k):
    # X: input tensor (batch_size × seq_len × d_model)
    # W_Q, W_K, W_V: weight matrices (d_model × d_k)
    Q = matmul(X, W_Q)
    K = matmul(X, W_K)
    V = matmul(X, W_V)

    scores = matmul(Q, K.transpose(-2, -1)) / sqrt(d_k)
    attention_weights = softmax(scores, dim=-1)
    output = matmul(attention_weights, V)

    return output

def multi_head_attention(X, W_Q, W_K, W_V, W_O, num_heads, d_k):
    outputs = []
    for i in range(num_heads):
        head_output = self_attention_layer(X, W_Q[i], W_K[i], W_V[i], d_k)
        outputs.append(head_output)

    concat_output = torch.cat(outputs, dim=-1)
    final_output = matmul(concat_output, W_O)

    return final_output
System Implications
This implementation reveals several key computational characteristics of Transformer self-attention. First, self-attention enables parallel processing across all positions in the sequence. This is evident in the matrix multiplications that compute Q, K, and V simultaneously for all positions. Unlike recurrent architectures that process inputs sequentially, this parallel nature allows for more efficient computation, especially on modern hardware designed for parallel operations.
Second, the attention score computation results in a matrix of size (seq_len × seq_len), leading to quadratic complexity with respect to sequence length. This quadratic relationship becomes a significant computational bottleneck when processing long sequences, a challenge that has spurred research into more efficient attention mechanisms.
Third, the multi-head attention mechanism effectively runs multiple self-attention operations in parallel, each with its own set of learned projections. While this increases the computational load linearly with the number of heads, it allows the model to capture different types of relationships within the same input, enhancing the model’s representational power.
Fourth, the core computations in self-attention are dominated by large matrix multiplications. For a sequence of length \(N\) and embedding dimension \(d\), the main operations involve matrices of sizes \((N\times d)\), \((d\times d)\), and \((N\times N)\). These intensive matrix operations are well-suited for acceleration on specialized hardware like GPUs, but they also contribute significantly to the overall computational cost of the model.
Finally, self-attention generates memory-intensive intermediate results. The attention weights matrix \((N\times N)\) and the intermediate results for each attention head create substantial memory requirements, especially for long sequences. This can pose challenges for deployment on memory-constrained devices and necessitates careful memory management in implementations.
These computational patterns create a unique profile for Transformer self-attention, distinct from previous architectures. The parallel nature of the computations makes Transformers well-suited for modern parallel processing hardware, but the quadratic complexity with sequence length poses challenges for processing long sequences. As a result, much research has focused on developing optimization techniques, such as sparse attention patterns or low-rank approximations, to address these challenges. Each of these optimizations presents its own trade-offs between computational efficiency and model expressiveness, a balance that must be carefully considered in practical applications.
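As one illustrative example of a sparse attention pattern, the sketch below restricts each position to a local window of neighbors by masking the score matrix before the softmax. Note that this naive version still materializes the full \(N\times N\) score matrix; practical implementations avoid computing the masked entries in order to realize the efficiency gain.

# Banded ("sliding window") attention mask, single sequence (illustrative)
import numpy as np

def sliding_window_attention(Q, K, V, window, d_k):
    # Q, K, V: (seq_len × d_k); window: half-width of the local neighborhood
    seq_len = Q.shape[0]
    scores = Q @ K.T / np.sqrt(d_k)
    # Mask out pairs that are farther apart than the window before the softmax
    idx = np.arange(seq_len)
    mask = np.abs(idx[:, None] - idx[None, :]) > window
    scores[mask] = -np.inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V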
4.6 Architectural Building Blocks
While we presented deep learning architectures as distinct approaches in the previous sections, they are better understood as compositions of fundamental building blocks that evolved over time. Much like how complex LEGO structures are built from basic bricks, modern neural networks combine and iterate on core computational patterns that emerged through decades of research (Yann LeCun, Bengio, and Hinton 2015). Each architectural innovation introduced new building blocks while finding novel ways to use existing ones.
These building blocks and their evolution provide insight into modern architectures. What began with the simple perceptron (Rosenblatt 1958) evolved into multi-layer networks (Rumelhart, Hinton, and Williams 1986), which then spawned specialized patterns for spatial and sequential processing. Each advancement maintained useful elements from its predecessors while introducing new computational primitives. Today’s sophisticated architectures, like Transformers, can be seen as carefully engineered combinations of these fundamental building blocks.
This progression reveals not just the evolution of neural networks, but also the discovery and refinement of core computational patterns that remain relevant. As we have seen through our exploration of different neural network architectures, deep learning has evolved significantly, with each new architecture bringing its own set of computational demands and system-level challenges.
Table 4.1 summarizes this evolution, highlighting the key primitives and system focus for each era of deep learning development. This table encapsulates the major shifts in deep learning architecture design and the corresponding changes in system-level considerations. From the early focus on dense matrix operations optimized for CPUs, we see a progression through convolutions leveraging GPU acceleration, to sequential operations necessitating sophisticated memory hierarchies, and finally to the current era of attention mechanisms requiring flexible accelerators and high-bandwidth memory.
| Era | Dominant Architecture | Key Primitives | System Focus |
|---|---|---|---|
| Early NN | MLP | Dense Matrix Ops | CPU optimization |
| CNN Revolution | CNN | Convolutions | GPU acceleration |
| Sequence Modeling | RNN | Sequential Ops | Memory hierarchies |
| Attention Era | Transformer | Attention, Dynamic Compute | Flexible accelerators, High-bandwidth memory |
As we dive deeper into each of these building blocks, we see how these primitives evolved and combined to create increasingly powerful and complex neural network architectures.
4.6.1 From Perceptron to Multi-Layer Networks
While we examined MLPs earlier as a mechanism for dense pattern processing, here we focus on how they established fundamental building blocks that appear throughout deep learning. The evolution from perceptron to MLP introduced several key concepts: the power of layer stacking, the importance of non-linear transformations, and the basic feedforward computation pattern.
The introduction of hidden layers between input and output created a template for feature transformation that appears in virtually every modern architecture. Even in sophisticated networks like Transformers, we find MLP-style feedforward layers performing feature processing. The concept of transforming data through successive non-linear layers has become a fundamental paradigm that transcends the specific architecture types.
Perhaps most importantly, the development of MLPs established the backpropagation algorithm, which to this day remains the cornerstone of neural network training. This key contribution has enabled the training of deep architectures and influenced how later architectures would be designed to maintain gradient flow.
These building blocks—layered feature transformation, non-linear activation, and gradient-based learning—set the foundation for more specialized architectures. Subsequent innovations often focused on structuring these basic components in new ways rather than replacing them entirely.
4.6.2 From Dense to Spatial Processing
The development of CNNs marked a significant architectural innovation—the realization that we could specialize the dense connectivity of MLPs for spatial patterns. While retaining the core concept of layer-wise processing, CNNs introduced several fundamental building blocks that would influence all future architectures.
The first key innovation was the concept of parameter sharing. Unlike MLPs where each connection had its own weight, CNNs showed how the same parameters could be reused across different parts of the input. This not only made the networks more efficient but introduced the powerful idea that architectural structure could encode useful priors about the data (Lecun et al. 1998).
Perhaps even more influential was the introduction of skip connections through ResNets (He et al. 2016). Originally designed to help train very deep CNNs, skip connections have become a fundamental building block that appears in virtually every modern architecture. They showed how direct paths through the network could help gradient flow and information propagation, a concept now central to Transformer designs.
CNNs also introduced batch normalization, a technique for stabilizing neural network training by normalizing intermediate features (Ioffe and Szegedy 2015); we will learn more about this in the AI Training chapter. This concept of feature normalization, while originating in CNNs, evolved into layer normalization and is now a key component in modern architectures.
These innovations—parameter sharing, skip connections, and normalization—transcended their origins in spatial processing to become essential building blocks in the deep learning toolkit.
4.6.3 The Evolution of Sequence Processing
While CNNs specialized MLPs for spatial patterns, sequence models adapted neural networks for temporal dependencies. RNNs introduced the fundamental concept of maintaining and updating state—a building block that influenced how networks could process sequential information (Elman 2002).
The development of LSTMs and GRUs brought sophisticated gating mechanisms to neural networks (Hochreiter and Schmidhuber 1997; Cho et al. 2014). These gates, themselves small MLPs, showed how simple feedforward computations could be composed to control information flow. This concept of using neural networks to modulate other neural networks became a recurring pattern in architecture design.
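A simplified, GRU-style sketch (omitting the reset gate; weight shapes are assumed) illustrates the idea of a gate as a small feedforward computation that modulates information flow:

# A gate as a small feedforward computation (simplified GRU-style update)
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_update(x_t, h_prev, W_z, U_z, b_z, W_h, U_h, b_h):
    # Update gate: a small feedforward layer deciding how much new content to admit
    z = sigmoid(x_t @ W_z + h_prev @ U_z + b_z)
    # Candidate state: another small feedforward layer proposing new content
    h_tilde = np.tanh(x_t @ W_h + h_prev @ U_h + b_h)
    # Blend the old state and the candidate according to the gate
    return (1 - z) * h_prev + z * h_tilde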
Perhaps most significantly, sequence models demonstrated the power of adaptive computation paths. Unlike the fixed patterns of MLPs and CNNs, RNNs showed how networks could process variable-length inputs by reusing weights over time. This insight—that architectural patterns could adapt to input structure—laid groundwork for more flexible architectures.
Sequence models also popularized the concept of attention through encoder-decoder architectures (Bahdanau, Cho, and Bengio 2014). Initially introduced as an improvement to machine translation, attention mechanisms showed how networks could learn to dynamically focus on relevant information. This building block would later become the foundation of Transformer architectures.
4.6.4 Modern Architectures: Synthesis and Innovation
Modern architectures, particularly Transformers, represent a sophisticated synthesis of these fundamental building blocks. Rather than introducing entirely new patterns, they innovate through clever combination and refinement of existing components. Consider the Transformer architecture: at its core, we find MLP-style feedforward networks processing features between attention layers. The attention mechanism itself builds on ideas from sequence models but removes the recurrent connection, instead using position embeddings inspired by CNN intuitions. Skip connections, inherited from ResNets, appear throughout the architecture, while layer normalization, evolved from CNN’s batch normalization, stabilizes training (Ba, Kiros, and Hinton 2016).
This composition of building blocks creates something greater than the sum of its parts. The self-attention mechanism, while building on previous attention concepts, enables a new form of dynamic pattern processing. The arrangement of these components—attention followed by feedforward layers, with skip connections and normalization—has proven so effective it’s become a template for new architectures.
Even recent innovations in vision and language models follow this pattern of recombining fundamental building blocks. Vision Transformers adapt the Transformer architecture to images while maintaining its essential components (Dosovitskiy et al. 2021). Large language models scale up these patterns while introducing refinements like grouped-query attention or sliding window attention, yet still rely on the core building blocks established through this architectural evolution (Brown et al. 2020).
To illustrate how these modern architectures synthesize and innovate upon previous approaches, consider the following comparison of primitive utilization across different neural network architectures:
| Primitive Type | MLP | CNN | RNN | Transformer |
|---|---|---|---|---|
| Computational | Matrix Multiplication | Convolution (Matrix Mult.) | Matrix Mult. + State Update | Matrix Mult. + Attention |
| Memory Access | Sequential | Strided | Sequential + Random | Random (Attention) |
| Data Movement | Broadcast | Sliding Window | Sequential | Broadcast + Gather |
As shown in Table 4.2, Transformers combine elements from previous architectures while introducing new patterns. They retain the core matrix multiplication operations common to all architectures but introduce a more complex memory access pattern with their attention mechanism. Their data movement patterns blend the broadcast operations of MLPs with the gather operations reminiscent of more dynamic architectures.
This synthesis of primitives in Transformers exemplifies how modern architectures innovate by recombining and refining existing building blocks, rather than inventing entirely new computational paradigms. This evolutionary process also provides insight into how future architectures may develop and helps guide the design of efficient systems to support them.
4.7 System-Level Building Blocks
Having examined different deep learning architectures, we can now distill their system requirements into fundamental primitives that underpin both hardware and software implementations. These primitives represent operations that cannot be broken down further without losing their essential characteristics. Just as complex molecules are built from basic atoms, sophisticated neural networks are constructed from these fundamental operations.
4.7.1 Core Computational Primitives
Three fundamental operations serve as the building blocks for all deep learning computations: matrix multiplication, sliding window operations, and dynamic computation. What makes these operations primitive is that they cannot be further decomposed without losing their essential computational properties and efficiency characteristics.
Matrix multiplication represents the most basic form of transforming sets of features. When we multiply a matrix of inputs by a matrix of weights, we’re computing weighted combinations—the fundamental operation of neural networks. For example, in our MNIST network, each 784-dimensional input vector multiplies with a \(784\times 100\) weight matrix. This pattern appears everywhere: MLPs use it directly for layer computations, CNNs reshape convolutions into matrix multiplications through im2col (turning a \(3\times 3\) convolution into a matrix operation), and Transformers use it extensively in their attention mechanisms.
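To make this primitive concrete, here is a minimal NumPy sketch of the dense-layer computation described above, using the 784-input, 100-unit MNIST layer from the text; the batch size of 32 and the ReLU non-linearity are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of the dense-layer primitive described above.
# Shapes follow the MNIST example in the text (784 inputs, 100 hidden units);
# the batch size of 32 and the ReLU are illustrative assumptions.
batch = np.random.rand(32, 784).astype(np.float32)      # 32 flattened 28x28 images
weights = np.random.rand(784, 100).astype(np.float32)   # 784x100 weight matrix
bias = np.zeros(100, dtype=np.float32)

# One matrix multiplication computes all weighted combinations for the batch.
hidden = batch @ weights + bias        # shape: (32, 100)
activations = np.maximum(hidden, 0)    # non-linear activation
print(activations.shape)               # (32, 100)
```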
In modern systems, matrix multiplication maps to specific hardware and software implementations. Hardware accelerators provide specialized tensor cores that can perform thousands of multiply-accumulates in parallel—NVIDIA’s A100 tensor cores can achieve up to 312 TFLOPS (at 16-bit precision) through massive parallelization of these operations. Software frameworks like PyTorch and TensorFlow automatically map these high-level operations to optimized matrix libraries (NVIDIA cuBLAS, Intel MKL) that exploit these hardware capabilities.
Sliding window operations compute local relationships by applying the same operation to chunks of data. In CNNs processing MNIST images, a \(3\times 3\) convolution filter slides across the \(28\times 28\) input, requiring \(26\times 26\) windows of computation. Modern hardware accelerators implement this through specialized memory access patterns and data buffering schemes that optimize data reuse. For example, Google’s TPU uses a \(128\times 128\) systolic array where data flows systematically through processing elements, allowing each input value to be reused across multiple computations without accessing memory. Software frameworks optimize these operations by transforming them into efficient matrix multiplications (a \(3\times 3\) convolution becomes a \(9\times N\) matrix multiplication) and carefully managing data layout in memory to maximize spatial locality.
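As a concrete illustration of this transformation, the following sketch uses PyTorch's `torch.nn.functional.unfold` (its im2col routine) to turn a \(3\times 3\) convolution over a \(28\times 28\) input into a \(9\times 676\) matrix multiplication; the single input channel and the eight filters are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Sketch of the im2col idea: a 3x3 convolution over a 28x28 image becomes
# a 9 x 676 matrix multiplication (676 = 26 x 26 output positions).
image = torch.randn(1, 1, 28, 28)          # one single-channel image
filters = torch.randn(8, 1, 3, 3)          # 8 filters of size 3x3 (assumed)

# im2col: extract every 3x3 window as a column -> shape (1, 9, 676)
cols = F.unfold(image, kernel_size=3)

# Convolution as a plain matrix multiplication: (8, 9) @ (9, 676) -> (8, 676)
out = filters.view(8, -1) @ cols[0]
out = out.view(1, 8, 26, 26)               # fold back into the spatial layout

# Matches the direct convolution up to floating-point tolerance.
print(torch.allclose(out, F.conv2d(image, filters), atol=1e-5))
```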
Dynamic computation, where the operation itself depends on the input data, emerged prominently with attention mechanisms but represents a fundamental capability needed for adaptive processing. In Transformer attention, each query dynamically determines its interaction weights with all keys—for a sequence of length 512, this means 512 different weight patterns must be computed on the fly. Unlike fixed patterns where we know the computation graph in advance, dynamic computation requires runtime decisions. This creates specific implementation challenges—hardware must provide flexible routing of data (modern GPUs use dynamic scheduling) and support variable computation patterns, while software frameworks need efficient mechanisms for handling data-dependent execution paths (PyTorch’s dynamic computation graphs, TensorFlow’s dynamic control flow).
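A minimal sketch of this data dependence uses single-head scaled dot-product attention over a sequence of 512 tokens; the head dimension of 64 is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

# Sketch of data-dependent computation in attention: the 512x512 weight
# pattern is not known ahead of time; it is computed from the inputs at runtime.
seq_len, d_head = 512, 64                # head dimension assumed for illustration
q = torch.randn(seq_len, d_head)
k = torch.randn(seq_len, d_head)
v = torch.randn(seq_len, d_head)

# Each query produces its own weighting over all 512 keys.
scores = q @ k.T / d_head**0.5           # (512, 512), computed on the fly
weights = F.softmax(scores, dim=-1)      # data-dependent routing pattern
output = weights @ v                     # (512, 64)
print(weights.shape, output.shape)
```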
These primitives combine in sophisticated ways in modern architectures. A Transformer layer processing a sequence of 512 tokens demonstrates this clearly: it uses matrix multiplications for feature projections (\(512\times 512\) operations implemented through tensor cores), may employ sliding windows for efficient attention over long sequences (using specialized memory access patterns for local regions), and requires dynamic computation for attention weights (computing \(512\times 512\) attention patterns at runtime). The way these primitives interact creates specific demands on system design—from memory hierarchy organization to computation scheduling.
The building blocks we’ve discussed help explain why certain hardware features exist (like tensor cores for matrix multiplication) and why software frameworks organize computations in particular ways (like batching similar operations together). As we move from computational primitives to consider memory access and data movement patterns, it’s important to recognize how these fundamental operations shape the demands placed on memory systems and data transfer mechanisms. The way computational primitives are implemented and combined has direct implications for how data needs to be stored, accessed, and moved within the system.
4.7.2 Memory Access Primitives
The efficiency of deep learning systems heavily depends on how they access and manage memory. In fact, memory access often becomes the primary bottleneck in modern ML systems—while a matrix multiplication unit might be capable of performing thousands of operations per cycle, it will sit idle if data isn’t available at the right time. For example, accessing data from DRAM typically takes hundreds of cycles, while on-chip computation takes only a few cycles.
Three fundamental memory access patterns dominate in deep learning architectures: sequential access, strided access, and random access. Each pattern creates different demands on the memory system and offers different opportunities for optimization.
Sequential access represents the simplest and most efficient pattern. Consider an MLP performing matrix multiplication with a batch of MNIST images: it needs to access both the \(784\times 100\) weight matrix and the input vectors sequentially. This pattern maps well to modern memory systems—DRAM can operate in burst mode for sequential reads (achieving up to 400 GB/s in modern GPUs), and hardware prefetchers can effectively predict and fetch upcoming data. Software frameworks optimize for this by ensuring data is laid out contiguously in memory and aligning data to cache line boundaries.
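The following short PyTorch sketch illustrates the layout issue: a transposed view of a weight matrix shares the same storage but walks it with a large stride, and `.contiguous()` copies it back into a sequentially laid-out buffer that streams well from memory.

```python
import torch

# Sketch of why data layout matters for sequential access.
weights = torch.randn(784, 100)
print(weights.stride())            # (100, 1): rows are contiguous in memory

transposed = weights.t()           # a view only, no copy
print(transposed.stride())         # (1, 100): strided, non-sequential reads

packed = transposed.contiguous()   # explicit copy into a sequential layout
print(packed.stride())             # (784, 1): rows contiguous again
```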
Strided access appears prominently in CNNs, where each output position needs to access a window of input values at regular intervals. For a CNN processing MNIST images with \(3\times 3\) filters, each output position requires accessing 9 input values with a stride matching the input width. While less efficient than sequential access, hardware supports this through pattern-aware caching strategies and specialized memory controllers. Software frameworks often transform these strided patterns into sequential access through data layout reorganization—the im2col transformation in deep learning frameworks converts convolution’s strided access into efficient matrix multiplications.
Random access poses the greatest challenge for system efficiency. In a Transformer processing a sequence of 512 tokens, each attention operation potentially needs to access any position in the sequence, creating unpredictable memory access patterns. Random access can severely impact performance through cache misses (potentially causing 100+ cycle stalls per access) and unpredictable memory latencies. Systems address this through large cache hierarchies (modern GPUs have several MB of L2 cache) and sophisticated prefetching strategies, while software frameworks employ techniques like attention pattern pruning to reduce random access requirements.
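One simple form of attention pattern pruning restricts each query to a local window of positions. The sketch below only illustrates the restricted access pattern (it still materializes the full score matrix, which practical implementations avoid); the window size of 64 is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

# Sketch of local attention: each query attends only to nearby positions,
# so the gather touches a small, predictable region of the 512-token sequence.
seq_len, d_head, window = 512, 64, 64            # window size assumed
q, k, v = (torch.randn(seq_len, d_head) for _ in range(3))

scores = q @ k.T / d_head**0.5                   # full (512, 512) scores
pos = torch.arange(seq_len)
local = (pos[:, None] - pos[None, :]).abs() <= window   # banded mask
scores = scores.masked_fill(~local, float('-inf'))      # prune distant positions

weights = F.softmax(scores, dim=-1)
output = weights @ v
print(output.shape)                              # (512, 64)
```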
These different memory access patterns contribute significantly to the overall memory requirements of each architecture. To illustrate this, Table 4.3 compares the memory complexity of MLPs, CNNs, RNNs, and Transformers.
| Architecture | Input Dependency | Parameter Storage | Activation Storage | Scaling Behavior |
|---|---|---|---|---|
| MLP | Linear | O(N × W) | O(B × W) | Predictable |
| CNN | Constant | O(K × C) | O(B × H\(_{img}\) × W\(_{img}\)) | Efficient |
| RNN | Linear | O(h\(^{2}\)) | O(B × T × h) | Challenging |
| Transformer | Quadratic | O(N × d) | O(B × N\(^{2}\)) | Problematic |
Where:
- \(N\): Input or sequence size
- \(W\): Layer width
- \(B\): Batch size
- \(K\): Kernel size
- \(C\): Number of channels
- \(H_{\text{img}}\): Height of input feature map (CNN)
- \(W_{\text{img}}\): Width of input feature map (CNN)
- \(h\): Hidden state size (RNN)
- \(T\): Sequence length
- \(d\): Model dimensionality
Table 4.3 reveals how memory requirements scale with different architectural choices. The quadratic scaling of activation storage in Transformers, for instance, highlights the need for large memory capacities and efficient memory management in systems designed for Transformer-based workloads. In contrast, CNNs exhibit more favorable memory scaling due to their parameter sharing and localized processing. These memory complexity considerations are crucial when making system-level design decisions, such as choosing memory hierarchy configurations and developing memory optimization strategies.
The impact of these patterns becomes clearer when we consider data reuse opportunities. In CNNs, each input pixel participates in multiple convolution windows (typically 9 times for a \(3\times 3\) filter), making effective data reuse fundamental for performance. Modern GPUs provide multi-level cache hierarchies (L1, L2, shared memory) to capture this reuse, while software techniques like loop tiling ensure data remains in cache once loaded.
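A minimal NumPy sketch of loop tiling for a matrix multiplication is shown below; the tile size of 64 and the 256-square matrices are illustrative assumptions, whereas production GEMM libraries choose tile sizes to match the actual cache hierarchy.

```python
import numpy as np

# Sketch of loop tiling: compute C = A @ B in blocks so each tile of A and B
# stays in cache while it is reused across many partial products.
def tiled_matmul(A, B, tile=64):
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            for k in range(0, K, tile):
                # Each small block is reused many times before being evicted.
                C[i:i+tile, j:j+tile] += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
    return C

A = np.random.rand(256, 256).astype(np.float32)
B = np.random.rand(256, 256).astype(np.float32)
print(np.allclose(tiled_matmul(A, B), A @ B, atol=1e-2))
```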
Working set size—the amount of data needed simultaneously for computation—varies dramatically across architectures. An MLP layer processing MNIST images might need only a few hundred KB (weights plus activations), while a Transformer processing long sequences can require several MB just for storing attention patterns. These differences directly influence hardware design choices, like the balance between compute units and on-chip memory, and software optimizations like activation checkpointing or attention approximation techniques.
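A rough back-of-the-envelope calculation, assuming FP32 storage, a batch of 32, and 8 attention heads (all illustrative assumptions), shows the gap between these working sets.

```python
# Working-set estimates at 4 bytes per FP32 value; batch size and head count
# are assumptions for illustration.
bytes_fp32 = 4

# MLP layer on MNIST: 784x100 weights plus a batch of 32 input/output activations.
mlp_weights = 784 * 100 * bytes_fp32             # ~314 KB
mlp_acts = 32 * (784 + 100) * bytes_fp32         # ~113 KB
print(f"MLP working set: ~{(mlp_weights + mlp_acts) / 1e3:.0f} KB")

# Transformer attention: one 512x512 score matrix per head per sequence.
attn_scores = 512 * 512 * bytes_fp32             # ~1 MB per head
print(f"Attention scores (8 heads): ~{8 * attn_scores / 1e6:.0f} MB")
```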
Having a good grasp of these memory access patterns is essential as architectures evolve. The shift from CNNs to Transformers, for instance, has driven the development of hardware with larger on-chip memories and more sophisticated caching strategies to handle increased working sets and more dynamic access patterns. Future architectures will likely continue to be shaped by their memory access characteristics as much as their computational requirements.
4.7.3 Data Movement Primitives
While computational and memory access patterns define what operations occur where, data movement primitives characterize how information flows through the system. These patterns are key because data movement often consumes more time and energy than computation itself—moving data from off-chip memory typically requires 100-1000x more energy than performing a floating-point operation.
Four fundamental data movement patterns are prevalent in deep learning architectures: broadcast, scatter, gather, and reduction. These patterns determine how data is distributed and collected across computational units.
Broadcast operations send the same data to multiple destinations simultaneously. In matrix multiplication with batch size 32, each weight must be broadcast to process different inputs in parallel. Modern hardware supports this through specialized interconnects—NVIDIA GPUs provide hardware multicast capabilities achieving up to 600 GB/s broadcast bandwidth, while TPUs use dedicated broadcast buses. Software frameworks optimize broadcasts by restructuring computations (like matrix tiling) to maximize data reuse.
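A minimal sketch of the broadcast pattern as it appears at the framework level: one copy of a weight matrix is logically replicated across a batch of 32 inputs (the sequence length and model dimension are illustrative assumptions).

```python
import torch

# Sketch of the broadcast pattern: the same weight matrix is applied to every
# sequence in a batch of 32; the framework replicates it across the batch.
batch, seq_len, d_model = 32, 512, 64      # illustrative sizes
x = torch.randn(batch, seq_len, d_model)
w = torch.randn(d_model, d_model)          # one copy of the weights

# A single batched matmul: w is logically broadcast to all 32 inputs.
y = x @ w                                  # (32, 512, 64)
print(y.shape)
```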
Scatter operations distribute different elements to different destinations. When parallelizing a \(512\times 512\) matrix multiplication across GPU cores, each core receives a subset of the computation. This parallelization is important for performance but challenging—memory conflicts and load imbalance can reduce efficiency by 50% or more. Hardware provides flexible interconnects (like NVIDIA’s NVLink offering 600 GB/s bi-directional bandwidth), while software frameworks employ sophisticated work distribution algorithms to maintain high utilization.
Gather operations collect data from multiple sources. In Transformer attention with sequence length 512, each query must gather information from 512 different key-value pairs. These irregular access patterns are challenging—random gathering can be \(10\times\) slower than sequential access. Hardware supports this through high-bandwidth interconnects and large caches, while software frameworks employ techniques like attention pattern pruning to reduce gathering overhead.
Reduction operations combine multiple values into a single result through operations like summation. When computing attention scores in Transformers or layer outputs in MLPs, efficient reduction is essential. Hardware implements tree-structured reduction networks (reducing latency from \(O(n)\) to \(O(\log n)\)), while software frameworks use optimized parallel reduction algorithms that can achieve near-theoretical peak performance.
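A minimal sketch of a tree-structured reduction over 512 values, halving the number of partial sums at each step, is shown below.

```python
import numpy as np

# Sketch of a tree-structured reduction: pairwise sums halve the number of
# partial results each step, so n values are combined in O(log n) steps
# rather than one O(n) sequential chain.
def tree_sum(values):
    vals = np.asarray(values, dtype=np.float64)
    while vals.size > 1:
        if vals.size % 2:                       # pad odd lengths with a zero
            vals = np.append(vals, 0.0)
        vals = vals[0::2] + vals[1::2]          # one parallel reduction step
    return vals[0]

scores = np.random.rand(512)
print(np.isclose(tree_sum(scores), scores.sum()))
```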
These patterns combine in sophisticated ways. A Transformer attention operation with sequence length 512 and batch size 32 involves the following data movements (see the sketch after this list):
- Broadcasting query vectors (\(512\times 64\) elements)
- Gathering relevant keys and values (\(512\times 512\times 64\) elements)
- Reducing attention scores (\(512\times 512\) elements per sequence)
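The following annotated sketch shows how these movements appear in a single attention computation with the sizes from the list above; the single head is a simplifying assumption.

```python
import torch
import torch.nn.functional as F

# Annotated sketch of the data movement primitives combining in one attention
# operation (sequence length 512, batch 32, head dimension 64; single head).
B, N, d = 32, 512, 64
q = torch.randn(B, N, d)
k = torch.randn(B, N, d)
v = torch.randn(B, N, d)

# Broadcast: each query vector is compared against all 512 keys.
scores = q @ k.transpose(1, 2) / d**0.5        # (32, 512, 512)

# Reduction: softmax normalizes each row of scores (a sum-reduction per query).
weights = F.softmax(scores, dim=-1)

# Gather: each output position collects value vectors from every position,
# combined by a weighted sum.
output = weights @ v                           # (32, 512, 64)
print(output.shape)
```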
The evolution from CNNs to Transformers has increased reliance on gather and reduction operations, driving hardware innovations like more flexible interconnects and larger on-chip memories. As models grow (some now exceeding 100 billion parameters), efficient data movement becomes increasingly critical, leading to innovations like near-memory processing and sophisticated data flow optimizations.
4.7.4 System Design Impact
The computational, memory access, and data movement primitives we’ve explored form the foundational requirements that shape the design of systems for deep learning. Understanding how these primitives influence hardware design, create common bottlenecks, and drive trade-offs is important for developing efficient and effective deep learning systems.
One of the most significant impacts of these primitives on system design is the push towards specialized hardware. The prevalence of matrix multiplications and convolutions in deep learning has led to the development of tensor processing units (TPUs) and tensor cores in GPUs, which are specifically designed to perform these operations efficiently. These specialized units can perform many multiply-accumulate operations in parallel, dramatically accelerating the core computations of neural networks.
Memory systems have also been profoundly influenced by the demands of deep learning primitives. The need to support both sequential and random access patterns efficiently has driven the development of sophisticated memory hierarchies. High-bandwidth memory (HBM) has become common in AI accelerators to support the massive data movement requirements, especially for operations like attention mechanisms in Transformers. On-chip memory hierarchies have grown in complexity, with multiple levels of caching and scratchpad memories to support the diverse working set sizes of different neural network layers.
The data movement primitives have particularly influenced the design of interconnects and on-chip networks. The need to support efficient broadcasts, gathers, and reductions has led to the development of more flexible and higher-bandwidth interconnects. Some AI chips now feature specialized networks-on-chip designed to accelerate common data movement patterns in neural networks.
Table 4.4 summarizes the system implications of these primitives:
| Primitive | Hardware Impact | Software Optimization | Key Challenges |
|---|---|---|---|
| Matrix Multiplication | Tensor Cores | Batching, GEMM libraries | Parallelization, precision |
| Sliding Window | Specialized datapaths | Data layout optimization | Stride handling |
| Dynamic Computation | Flexible routing | Dynamic graph execution | Load balancing |
| Sequential Access | Burst mode DRAM | Contiguous allocation | Access latency |
| Random Access | Large caches | Memory-aware scheduling | Cache misses |
| Broadcast | Specialized interconnects | Operation fusion | Bandwidth |
| Gather/Scatter | High-bandwidth memory | Work distribution | Load balancing |
Despite these advancements, several common bottlenecks persist in deep learning systems. Memory bandwidth often remains a key limitation, particularly for models with large working sets or those that require frequent random access. The energy cost of data movement, especially between off-chip memory and processing units, continues to be a significant concern. For large-scale models, the communication overhead in distributed training can become a bottleneck, limiting scaling efficiency.
System designers must navigate complex trade-offs in supporting different primitives, each with unique characteristics that influence system design and performance. For example, optimizing for the dense matrix operations common in MLPs and CNNs might come at the cost of flexibility needed for the more dynamic computations in attention mechanisms. Supporting large working sets for Transformers might require sacrificing energy efficiency.
Balancing these trade-offs requires careful consideration of the target workloads and deployment scenarios. Having a good grip on the nature of each primitive guides the development of both hardware and software optimizations in deep learning systems, allowing designers to make informed decisions about system architecture and resource allocation.
4.8 Conclusion
Deep learning architectures, despite their diversity, exhibit common patterns in their algorithmic structures that significantly influence computational requirements and system design. In this chapter, we explored the intricate relationship between high-level architectural concepts and their practical implementation in computing systems.
From the straightforward dense connections of MLPs to the complex, dynamic patterns of Transformers, each architecture builds upon a set of fundamental building blocks. These core computational primitives—such as matrix multiplication, sliding windows, and dynamic computation—recur across various architectures, forming a universal language of deep learning computation.
The identification of these shared elements provides a valuable framework for understanding and designing deep learning systems. Each primitive brings its own set of requirements in terms of memory access patterns and data movement, which in turn shape both hardware and software design decisions. This relationship between algorithmic intent and system implementation is crucial for optimizing performance and efficiency.
As the field of deep learning continues to evolve, the ability to efficiently support and optimize these fundamental building blocks will be key to the development of more powerful and scalable systems. Future advancements in deep learning are likely to stem not only from novel architectural designs but also from innovative approaches to implementing and optimizing these essential computational patterns.
In conclusion, understanding the mapping between neural architectures and their computational requirements is vital for pushing the boundaries of what’s possible in artificial intelligence. As we look to the future, the interplay between algorithmic innovation and systems optimization will continue to drive progress in this rapidly advancing field.