98% of this content was generated by an LLM.


🚀 Phases of Transformer Architecture in Terms of Computation & Data Movement

The Transformer model (used in GPT, BERT, LLaMA, etc.) consists of multiple stages of computation, each with unique NoC (Network-on-Chip) stress patterns. Understanding these phases is crucial for optimizing AI accelerators.

📌 1. High-Level Phases of Transformer Computation

The Transformer model operates in five major phases during training and inference:

| Phase | Computation Type | Data Movement Characteristics | Latency Bottlenecks |
| --- | --- | --- | --- |
| 1. Embedding & Input Projection | Lookup tables, matrix multiplications | Reads from memory, low bandwidth | Memory access time |
| 2. Multi-Head Self-Attention (MHSA) | Matrix multiplications (QKV), softmax | High bandwidth, many-to-many communication | NoC congestion |
| 3. Feedforward Layers (MLP) | Fully connected layers (FC), activation functions | Less bandwidth, structured memory access | Memory latency |
| 4. Layer Norm & Residual Connections | Element-wise operations, normalization | Small memory access, low NoC traffic | Minimal latency impact |
| 5. Output Projection & Softmax | Softmax, final probability computation | Heavy memory writes | Last-layer memory bottleneck |

🚀 Key Takeaway:

  • MHSA phase is the most NoC-intensive part due to massive all-to-all communication.
  • Feedforward (MLP) layers are compute-heavy but require structured memory access.

📌 2. Step-by-Step Transformer Data Flow

🚀 Phase 1: Embedding & Input Projection

🔹 Computation:

  1. Convert input tokens into dense vector embeddings.
  2. Perform matrix multiplications to project embeddings into the model’s hidden space.

🔹 Data Movement in NoC:

| Operation | NoC Traffic Type |
| --- | --- |
| Read embeddings from memory | Memory-to-core transfer (HBM) |
| Compute input projections | Local core communication |
| Store projected embeddings | Write to DRAM/HBM |

✅ NoC Behavior:

  • Low traffic → mostly memory-bound, not NoC-intensive.
  • Bottleneck: DRAM bandwidth if embeddings are large.
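To make the data flow concrete, here is a minimal NumPy sketch of this phase. The sizes and names (`vocab_size`, `d_model`, `W_in`) are illustrative assumptions, not values from any particular model; the memory hierarchy appears only in the comments.

```python
import numpy as np

# Toy sizes -- illustrative assumptions, not taken from any specific model
vocab_size, d_model, seq_len = 1_000, 512, 16

rng = np.random.default_rng(0)
embedding_table = rng.standard_normal((vocab_size, d_model))  # resides in DRAM/HBM
W_in = rng.standard_normal((d_model, d_model))                # input projection weights

token_ids = rng.integers(0, vocab_size, size=seq_len)

# 1. Read embeddings from memory: a gather, i.e. memory-to-core traffic
x = embedding_table[token_ids]        # (seq_len, d_model)

# 2. Compute input projections: a local matrix multiply on the core
x_proj = x @ W_in                     # (seq_len, d_model)

# 3. Store projected embeddings: written back to DRAM/HBM on a real accelerator
print(x_proj.shape)                   # (16, 512)
```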

🚀 Phase 2: Multi-Head Self-Attention (MHSA)

📌 Most NoC-Intensive Phase!

🔹 Computation:

  1. Compute Query (Q), Key (K), and Value (V) matrices.
  2. Perform QK^T (Attention Score Calculation).
  3. Apply Softmax & Weighted Sum of Values.

🔹 Data Movement in NoC:

| Operation | NoC Traffic Type |
| --- | --- |
| Broadcast Key (K) and Value (V) to all heads | All-to-all (many-to-many) |
| Compute QK^T | Memory-intensive tensor multiplication |
| Softmax normalization | Local core memory accesses |
| Weighted sum of values | High-bandwidth data movement |

✅ NoC Behavior:

  • Extreme congestion → the NoC must support high-bandwidth many-to-many traffic.
  • Major bottleneck → memory-bound attention operations slow down inference.
  • Optimization needed → hierarchical interconnects (NVLink, Infinity Fabric) reduce contention.
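The sketch below walks through the three steps above for a single layer in NumPy. The head count, sizes, and the absence of masking and an output projection are simplifying assumptions; the point is that the QK^T score tensor grows with the square of the sequence length, which is what drives the heavy data movement.

```python
import numpy as np

def multi_head_attention(x, W_q, W_k, W_v, n_heads):
    """Single-layer MHSA sketch (no masking, no output projection)."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads

    # 1. Compute Q, K, V and split them across heads. On an accelerator,
    #    distributing K/V to the cores handling each head is the
    #    many-to-many NoC traffic described above.
    Q = (x @ W_q).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    K = (x @ W_k).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    V = (x @ W_v).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    # 2. QK^T attention scores: an O(seq_len^2) intermediate per head
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)

    # 3. Softmax, then the weighted sum of values
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    out = weights @ V                                   # (n_heads, seq_len, d_head)
    return out.transpose(1, 0, 2).reshape(seq_len, d_model)

rng = np.random.default_rng(0)
d_model, seq_len, n_heads = 256, 64, 8                  # toy sizes
x = rng.standard_normal((seq_len, d_model))
W_q, W_k, W_v = (rng.standard_normal((d_model, d_model)) for _ in range(3))
print(multi_head_attention(x, W_q, W_k, W_v, n_heads).shape)  # (64, 256)
```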

🚀 Phase 3: Feedforward Layers (MLP)

📌 Compute-Intensive Phase

🔹 Computation:

  1. Linear transformation via fully connected (FC) layers.
  2. Non-linear activation functions (ReLU, GeLU, SiLU).

🔹 Data Movement in NoC:

| Operation | NoC Traffic Type |
| --- | --- |
| FC layer computation | Core-local memory access |
| Activation function (ReLU, GeLU) | Minimal memory movement |
| Store intermediate results | Write to HBM (if batch size is large) |

✅ NoC Behavior:

  • Structured memory access → less NoC congestion than attention.
  • Compute-bound bottleneck → optimized tensor cores help accelerate FC layers.
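A minimal sketch of the position-wise MLP follows, assuming a GeLU activation (tanh approximation) and an expansion factor of 4× `d_model`; both are common conventions rather than requirements.

```python
import numpy as np

def feedforward(x, W1, b1, W2, b2):
    """Position-wise MLP sketch: FC -> GeLU -> FC."""
    # 1. First fully connected layer: a compute-bound GEMM over core-local data
    h = x @ W1 + b1
    # 2. GeLU activation (tanh approximation): element-wise, minimal data movement
    h = 0.5 * h * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (h + 0.044715 * h**3)))
    # 3. Second fully connected layer projects back down to d_model
    return h @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 256, 1024, 64        # d_ff = 4 * d_model (common convention)
x = rng.standard_normal((seq_len, d_model))
y = feedforward(x,
                rng.standard_normal((d_model, d_ff)), np.zeros(d_ff),
                rng.standard_normal((d_ff, d_model)), np.zeros(d_model))
print(y.shape)                                # (64, 256)
```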

🚀 Phase 4: Layer Norm & Residual Connections

📌 Lightweight Memory Operations

🔹 Computation:

  1. Normalize activation outputs (LayerNorm).
  2. Add residual connection (skip connection).

🔹 Data Movement in NoC:

| Operation | NoC Traffic Type |
| --- | --- |
| Read intermediate activations | Memory-to-core transfer |
| Apply element-wise LayerNorm | Minimal NoC load |
| Perform residual sum | Low-bandwidth local computation |

✅ NoC Behavior:

  • Low NoC stress → no global communication required.
  • Minimal bottlenecks → mostly bound by memory latency.
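For reference, here is a small NumPy sketch of the residual add followed by LayerNorm. Everything in it is element-wise or a per-token reduction, which is why no cross-core traffic is needed; the sizes are illustrative.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each token's features to zero mean / unit variance, then scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
d_model, seq_len = 256, 64                               # toy sizes
residual = rng.standard_normal((seq_len, d_model))       # input to the sub-layer
sublayer_out = rng.standard_normal((seq_len, d_model))   # e.g. attention or MLP output

# Residual add + LayerNorm: element-wise ops and per-token reductions only,
# so no cross-core communication is required
y = layer_norm(residual + sublayer_out, gamma=np.ones(d_model), beta=np.zeros(d_model))
print(y.shape)                                           # (64, 256)
```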

🚀 Phase 5: Output Projection & Softmax

📌 Final Memory-Intensive Step

🔹 Computation:

  1. Compute final token probabilities using softmax.
  2. Select the next token during inference.

🔹 Data Movement in NoC:

| Operation | NoC Traffic Type |
| --- | --- |
| Compute output probabilities | High memory bandwidth needed |
| Store results for next token | Memory write operation |

✅ NoC Behavior:

  • Latency bottleneck in the last layer → softmax reads large activation data.
  • Memory bandwidth limited → with large batch sizes, DRAM access slows down processing.
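Below is a toy sketch of this final step: project the last hidden state onto the vocabulary, apply softmax, and pick the next token (greedy decoding here; sampling and top-k are common alternatives). The sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size = 256, 1_000        # toy sizes; real vocabularies are tens of thousands

hidden = rng.standard_normal(d_model)                 # final hidden state of the last position
W_out = rng.standard_normal((d_model, vocab_size))    # output ("unembedding") projection

# 1. Output projection: reading the d_model x vocab_size weight dominates memory traffic
logits = hidden @ W_out

# 2. Softmax over the full vocabulary
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# 3. Next-token selection (greedy decoding)
next_token = int(np.argmax(probs))
print(next_token, float(probs[next_token]))
```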

📌 3. How NoC Behavior Changes Over Time

The NoC traffic pattern changes dynamically across transformer layers.

✅ Transformer NoC Traffic Over Time

| Phase | Traffic Pattern | Bottleneck |
| --- | --- | --- |
| Embedding | Low traffic (read-heavy) | Memory latency |
| MHSA (Self-Attention) | All-to-all NoC congestion | Memory bandwidth & communication delays |
| MLP (Feedforward Layers) | Compute-heavy, structured NoC usage | Compute efficiency |
| LayerNorm & Residual | Minimal NoC traffic | None |
| Output Projection | Memory writes, softmax communication | DRAM bandwidth |

📌 Observations:

  • Early phases (Embedding, Attention) are memory-bound.
  • MHSA creates the most NoC congestion (all-to-all traffic).
  • MLP (Feedforward) is compute-heavy, but its NoC load is lower.
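A rough back-of-envelope estimate makes this trend visible. The sketch below counts only the dominant per-layer activation tensors in fp16 and ignores weight traffic, KV caching, and on-chip reuse, so the absolute numbers are assumptions; the relative ordering between phases is the point.

```python
# Per-layer activation traffic in bytes, fp16, counting only the dominant tensors
# (weight reads, KV caching, and on-chip reuse are ignored).
seq_len, d_model, n_heads, d_ff, dtype_bytes = 2048, 4096, 32, 16384, 2

traffic = {
    "MHSA QK^T scores":   n_heads * seq_len * seq_len * dtype_bytes,
    "MHSA Q/K/V tensors": 3 * seq_len * d_model * dtype_bytes,
    "MLP hidden acts":    seq_len * d_ff * dtype_bytes,
    "LayerNorm/residual": seq_len * d_model * dtype_bytes,
}
for name, nbytes in traffic.items():
    print(f"{name:>20}: {nbytes / 2**20:8.1f} MiB")
# The O(seq_len^2) attention scores dominate at long sequence lengths,
# which is why MHSA stresses the interconnect the most.
```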

📌 4. NoC Optimizations for Transformer Models

Since MHSA creates the most NoC congestion, AI accelerators optimize their interconnects:

✅ Techniques to Optimize NoC for Transformer Workloads

| Optimization | Benefit |
| --- | --- |
| Hierarchical Interconnects (NVLink, Infinity Fabric) | Reduces NoC congestion by distributing traffic. |
| 3D NoC Architectures | Reduces average hop count and improves bandwidth. |
| Express Virtual Channels (EVCs) | Allows priority paths for critical tensor transfers. |
| Sparse Attention Techniques | Reduces the total number of all-to-all memory accesses. |
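To illustrate the last row, the sketch below counts how many query-key pairs a simple sliding-window (local) attention pattern keeps compared with dense all-to-all attention. The window size and the specific pattern are illustrative assumptions; real sparse-attention schemes (e.g. Longformer, BigBird) combine several patterns.

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    """Causal local-attention mask: each query attends only to the previous `window` tokens."""
    i = np.arange(seq_len)[:, None]   # query positions
    j = np.arange(seq_len)[None, :]   # key positions
    return (j <= i) & (j > i - window)

seq_len, window = 2048, 128           # illustrative sizes
mask = sliding_window_mask(seq_len, window)

dense_pairs = seq_len * seq_len       # full all-to-all attention
sparse_pairs = int(mask.sum())        # query-key pairs actually computed / moved
print(f"dense: {dense_pairs:,}  sparse: {sparse_pairs:,}  "
      f"reduction: {dense_pairs / sparse_pairs:.1f}x")
```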

📌 5. Final Takeaways

✅ Self-Attention (MHSA) is the biggest NoC bottleneck due to all-to-all communication.
✅ MLP layers stress compute far more than the NoC (mostly structured memory accesses).