98% of this content was generated by an LLM.


🚀 Phases of Transformer Architecture in Terms of Computation & Data Movement

The Transformer model (used in GPT, BERT, LLaMA, etc.) consists of multiple stages of computation, each with unique NoC (Network-on-Chip) stress patterns. Understanding these phases is crucial for optimizing AI accelerators.

📌 1. High-Level Phases of Transformer Computation

The Transformer model operates in five major phases during training and inference:

| Phase | Computation Type | Data Movement Characteristics | Latency Bottlenecks |
|---|---|---|---|
| 1. Embedding & Input Projection | Lookup tables, matrix multiplications | Reads from memory, low bandwidth | Memory access time |
| 2. Multi-Head Self-Attention (MHSA) | Matrix multiplications (QKV), softmax | High bandwidth, many-to-many communication | NoC congestion |
| 3. Feedforward Layers (MLP) | Fully connected (FC) layers, activation functions | Less bandwidth, structured memory access | Memory latency |
| 4. Layer Norm & Residual Connections | Element-wise operations, normalization | Small memory access, low NoC traffic | Minimal latency impact |
| 5. Output Projection & Softmax | Softmax, final probability computation | Heavy memory writes | Last-layer memory bottleneck |

🚀 Key Takeaway:

  • MHSA phase is the most NoC-intensive part due to massive all-to-all communication.
  • Feedforward (MLP) layers are compute-heavy but require structured memory access.

📌 2. Step-by-Step Transformer Data Flow

🚀 Phase 1: Embedding & Input Projection

🔹 Computation:

  1. Convert input tokens into dense vector embeddings.
  2. Perform matrix multiplications to project embeddings into the model’s hidden space.

🔹 Data Movement in NoC:

| Operation | NoC Traffic Type |
|---|---|
| Read embeddings from memory | Memory-to-core transfer (HBM) |
| Compute input projections | Local core communication |
| Store projected embeddings | Write to DRAM/HBM |

✅ NoC Behavior:

  • Low traffic → mostly memory-bound, not NoC-intensive.
  • Bottleneck: DRAM bandwidth if embeddings are large.
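To make the phase concrete, here is a minimal PyTorch sketch of the embedding lookup and input projection. The vocabulary size, hidden size, and sequence length are illustrative assumptions, not taken from any specific model:

```python
import torch
import torch.nn as nn

# Illustrative sizes (assumptions, not from a real model).
vocab_size, d_model, seq_len = 32_000, 1024, 128

embedding = nn.Embedding(vocab_size, d_model)   # lookup table, read from memory (HBM)
input_proj = nn.Linear(d_model, d_model)        # projection into the hidden space

token_ids = torch.randint(0, vocab_size, (1, seq_len))  # (batch, seq_len)
x = embedding(token_ids)   # memory-bound: gathers rows of the embedding table
x = input_proj(x)          # core-local matmul, result written back to memory
print(x.shape)             # torch.Size([1, 128, 1024])
```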

🚀 Phase 2: Multi-Head Self-Attention (MHSA)

📌 Most NoC-Intensive Phase!

🔹 Computation:

  1. Compute Query (Q), Key (K), and Value (V) matrices.
  2. Perform QK^T (Attention Score Calculation).
  3. Apply Softmax & Weighted Sum of Values.

🔹 Data Movement in NoC:

| Operation | NoC Traffic Type |
|---|---|
| Broadcast Key (K) and Value (V) to all heads | All-to-all (many-to-many) |
| Compute QK^T | Memory-intensive tensor multiplication |
| Softmax normalization | Local core memory accesses |
| Weighted sum of values | High-bandwidth data movement |

✅ NoC Behavior:

  • Extreme congestion → the NoC must support high-bandwidth many-to-many traffic.
  • Major bottleneck → memory-bound attention operations slow down inference.
  • Optimization needed → hierarchical interconnects (NVLink, Infinity Fabric) reduce contention.
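A minimal single-layer PyTorch sketch of the three MHSA steps above; the head count and dimensions are illustrative assumptions:

```python
import math
import torch

# Illustrative sizes (assumptions).
batch, seq_len, d_model, n_heads = 1, 128, 512, 8
d_head = d_model // n_heads

x = torch.randn(batch, seq_len, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_model) for _ in range(3))

def split_heads(t):
    # (batch, seq, d_model) -> (batch, heads, seq, d_head)
    return t.view(batch, seq_len, n_heads, d_head).transpose(1, 2)

# 1. Q, K, V projections.
q, k, v = (split_heads(x @ w) for w in (w_q, w_k, w_v))

# 2. QK^T: every query attends to every key -> the all-to-all traffic pattern.
scores = q @ k.transpose(-2, -1) / math.sqrt(d_head)   # (batch, heads, seq, seq)

# 3. Softmax and weighted sum of values.
attn = torch.softmax(scores, dim=-1)
out = (attn @ v).transpose(1, 2).reshape(batch, seq_len, d_model)
print(out.shape)   # torch.Size([1, 128, 512])
```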

🚀 Phase 3: Feedforward Layers (MLP)

📌 Compute-Intensive Phase

🔹 Computation:

  1. Linear transformation via fully connected (FC) layers.
  2. Non-linear activation functions (ReLU, GeLU, SiLU).

🔹 Data Movement in NoC:

| Operation | NoC Traffic Type |
|---|---|
| FC layer computation | Core-local memory access |
| Activation function (ReLU, GeLU) | Minimal memory movement |
| Store intermediate results | Write to HBM (if batch size is large) |

✅ NoC Behavior:

  • Structured memory access → less NoC congestion than attention.
  • Compute-bound bottleneck → optimized tensor cores help accelerate FC layers.
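A minimal PyTorch sketch of the feedforward block; the 4× expansion factor is a common convention, assumed here for illustration:

```python
import torch
import torch.nn as nn

# Illustrative sizes (assumptions); d_ff = 4 * d_model is a common convention.
d_model, d_ff = 512, 2048

mlp = nn.Sequential(
    nn.Linear(d_model, d_ff),   # up-projection: large, structured matmul
    nn.GELU(),                  # element-wise activation, negligible data movement
    nn.Linear(d_ff, d_model),   # down-projection back to the hidden size
)

x = torch.randn(1, 128, d_model)
y = mlp(x)                      # compute-bound; memory access is sequential/structured
print(y.shape)                  # torch.Size([1, 128, 512])
```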

🚀 Phase 4: Layer Norm & Residual Connections

📌 Lightweight Memory Operations

🔹 Computation:

  1. Normalize activation outputs (LayerNorm).
  2. Add residual connection (skip connection).

🔹 Data Movement in NoC:

| Operation | NoC Traffic Type |
|---|---|
| Read intermediate activations | Memory-to-core transfer |
| Apply element-wise LayerNorm | Minimal NoC load |
| Perform residual sum | Low-bandwidth local computation |

✅ NoC Behavior:

  • Low NoC stress → no global communication required.
  • Minimal bottlenecks → mostly memory-latency bound.
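A minimal PyTorch sketch of the residual add plus LayerNorm; post-norm placement is assumed here for illustration (many models instead use pre-norm):

```python
import torch
import torch.nn as nn

d_model = 512                                 # illustrative size (assumption)
layer_norm = nn.LayerNorm(d_model)

x = torch.randn(1, 128, d_model)              # input to the sub-layer
sublayer_out = torch.randn(1, 128, d_model)   # e.g. attention or MLP output

# Element-wise add + per-token normalization; no cross-core communication needed.
y = layer_norm(x + sublayer_out)
print(y.shape)                                # torch.Size([1, 128, 512])
```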

🚀 Phase 5: Output Projection & Softmax

📌 Final Memory-Intensive Step

🔹 Computation:

  1. Compute final token probabilities using softmax.
  2. Select the next token during inference.

🔹 Data Movement in NoC:

| Operation | NoC Traffic Type |
|---|---|
| Compute output probabilities | High memory bandwidth needed |
| Store results for next token | Memory write operation |

✅ NoC Behavior:

  • Latency bottleneck in the last layer → softmax reads large activation data.
  • Memory bandwidth limited → if the batch size is large, DRAM access slows down processing.
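A minimal PyTorch sketch of the output projection and next-token selection; sizes are illustrative assumptions, and greedy decoding is shown (sampling is equally common):

```python
import torch
import torch.nn as nn

d_model, vocab_size = 512, 32_000            # illustrative sizes (assumptions)

lm_head = nn.Linear(d_model, vocab_size)     # large weight matrix -> heavy memory reads

hidden = torch.randn(1, 128, d_model)        # hidden states from the last layer
logits = lm_head(hidden[:, -1, :])           # only the last position is needed when decoding
probs = torch.softmax(logits, dim=-1)        # final probability distribution
next_token = torch.argmax(probs, dim=-1)     # greedy pick of the next token
print(next_token.shape)                      # torch.Size([1])
```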

📌 3. How NoC Behavior Changes Over Time

The NoC traffic pattern changes dynamically across transformer layers.

✅ Transformer NoC Traffic Over Time

| Phase | Traffic Pattern | Bottleneck |
|---|---|---|
| Embedding | Low traffic (read-heavy) | Memory latency |
| MHSA (Self-Attention) | All-to-all NoC congestion | Memory bandwidth & communication delays |
| MLP (Feedforward Layers) | Compute-heavy, structured NoC usage | Compute efficiency |
| LayerNorm & Residual | Minimal NoC traffic | None |
| Output Projection | Memory writes, softmax communication | DRAM bandwidth |

📌 Observations:

  • Early phases (Embedding, Attention) are memory-bound.
  • MHSA creates the most NoC congestion (all-to-all traffic); a rough estimate is sketched below.
  • MLP (Feedforward) is compute-heavy, but its NoC load is lower.
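As a rough back-of-envelope sketch (plain Python, assuming fp16 activations, a single batch element, and illustrative model dimensions), here is one way to compare per-layer activation traffic of MHSA against the MLP block. The numbers are only indicative, not measurements of any real accelerator:

```python
# Illustrative assumptions: fp16 activations, one batch element, one layer.
seq_len, d_model, n_heads, d_ff, bytes_per_elem = 2048, 4096, 32, 16384, 2

# MHSA: K/V broadcast to every head/core plus the seq x seq attention-score matrix.
kv_traffic    = 2 * seq_len * d_model * bytes_per_elem
score_traffic = n_heads * seq_len * seq_len * bytes_per_elem
mhsa_traffic  = kv_traffic + score_traffic

# MLP: activations streamed through the up- and down-projections (structured access).
mlp_traffic = seq_len * (d_ff + d_model) * bytes_per_elem

print(f"MHSA activation traffic ~ {mhsa_traffic / 1e6:.0f} MB")   # ~302 MB
print(f"MLP  activation traffic ~ {mlp_traffic / 1e6:.0f} MB")    # ~84 MB
```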

📌 4. NoC Optimizations for Transformer Models

Since MHSA creates the most NoC congestion, AI accelerators optimize their interconnects:

✅ Techniques to Optimize NoC for Transformer Workloads

| Optimization | Benefit |
|---|---|
| Hierarchical Interconnects (NVLink, Infinity Fabric) | Reduces NoC congestion by distributing traffic. |
| 3D NoC Architectures | Reduces average hop count and improves bandwidth. |
| Express Virtual Channels (EVCs) | Allows priority paths for critical tensor transfers. |
| Sparse Attention Techniques | Reduces the total number of all-to-all memory accesses (sketched below). |
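As an illustration of the last row, here is a minimal PyTorch sketch of a sliding-window (local) attention mask, one common sparse-attention technique; the sequence length and window size are illustrative assumptions:

```python
import torch

# Illustrative sizes (assumptions).
seq_len, window = 16, 4

idx = torch.arange(seq_len)
# Each query position i may only attend to keys j with |i - j| <= window,
# so the seq x seq all-to-all pattern shrinks to a banded, mostly-local one.
mask = (idx[:, None] - idx[None, :]).abs() <= window

print(mask.int())
print(f"dense accesses: {seq_len * seq_len}, sparse accesses: {int(mask.sum())}")
```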

📌 5. Final Takeaways

✅ Self-Attention (MHSA) is the biggest NoC bottleneck due to all-to-all communication.
✅ MLP layers stress compute but not the NoC as much (mostly structured memory accesses).