Enjoy reading, please leave a comment or two! :)

Table of Contents

1. Network-on-Chip (NoC) Architectures

πŸ’‘ What is NoC (Network-on-Chip)? NoC is an on-chip communication network that connects different processing elements (CPUs, GPUs, AI accelerators, memory, etc.). It replaces traditional bus-based architectures, which become a bottleneck as chip complexity increases.

1.1 Circuit Switching vs. Packet Switching

| Feature | Circuit Switching | Packet Switching (Used in NoC) |
|---|---|---|
| Path Setup | Reserves a dedicated path before transmission | No prior setup; packets are routed dynamically |
| Efficiency | Wastes bandwidth when idle | More efficient; utilizes links dynamically |
| Scalability | Poor, limited by dedicated paths | Highly scalable for multi-core systems |
| Usage in NoC? | ❌ Not used | βœ… Preferred |

β†’ NoC must ensure low-latency, deadlock-free routing, and thus uses packet switching.

1.2 NoC Topologies

1.2.1 Mesh Topology

Structure: A 2D grid where each node connects to its immediate neighbors (north, south, east, west).

Characteristics:

  • Scalability: Easily expandable by adding rows or columns.
  • Routing Simplicity: Supports simple routing algorithms like XY-routing.
  • Latency: Increases with the number of hops between distant nodes.
  • Power Consumption: Generally lower due to shorter average wire lengths.

Use Cases: Common in systems where simplicity and scalability are prioritized over minimal latency.

1.2.2 Torus Topology

Structure: Similar to Mesh but with additional wrap-around connections, linking edge nodes directly.

Characteristics:

  • Reduced Latency: Wrap-around links decrease the maximum hop count between nodes.
  • Increased Path Diversity: Multiple routing paths enhance fault tolerance.
  • Higher Power Consumption: Additional links consume more power and increase design complexity.

Use Cases: Suitable for applications requiring lower latency and higher fault tolerance, accepting increased power usage.

1.2.3 Butterfly Topology

Structure: A multi-stage, hierarchical network resembling a butterfly structure, often used in parallel computing.

Characteristics:

  • Low Latency: Provides logarithmic path lengths relative to the number of nodes.
  • Complex Routing: Requires sophisticated control logic for routing decisions.
  • Scalability Limitations: Expansion increases complexity significantly.

Use Cases: Ideal for applications demanding high-speed data transfer with predictable latency, such as certain parallel processing tasks.

1.2.4 Recent Developments in NoC Topologies

Advancements in NoC design focus on enhancing performance, scalability, and energy efficiency. Notable developments include:

  • 3D NoC Architectures: Integrating vertical (z-axis) connections into traditional 2D layouts, reducing average hop count and latency.
  • Circulant Topologies: Utilizing mathematical properties of circulant graphs to design D-dimensional NoCs, offering real-time guarantees and reduced communication latency.
  • Wireless NoC (WiNoC): Incorporating wireless communication channels between chiplets to enhance performance and reduce latency.
  • Dynamically Scalable NoC: Architectures that adjust interconnect resources at runtime based on workload demands, optimizing power consumption and performance.
  • Open-Source High-Bandwidth NoCs: Developments like FlooNoC provide wide physical links and support for parallel multi-stream data, achieving high energy efficiency and bandwidth.

1.2.5 Comparative Analysis of NoC Topologies

| Topology | Scalability | Latency | Power Consumption | Complexity | Fault Tolerance |
|---|---|---|---|---|---|
| Mesh | High | Moderate | Low | Low | Moderate |
| Torus | Moderate | Low | High | Moderate | High |
| Butterfly | Low | Low | Moderate | High | Low |
| 3D NoC | High | Very Low | Moderate | High | High |
| Circulant | High | Low | Moderate | High | High |
| Wireless NoC | High | Very Low | Low | High | Moderate |
| Dynamically Scalable | High | Adaptive | Low | High | High |

Note: The characteristics of each topology can vary based on specific design implementations and application requirements.

1.3 Routing Algorithms

Types of Routing Algorithms

| Type | Examples | Description |
|---|---|---|
| Deterministic | XY-routing | Always follows the same path; predictable but can cause congestion |
| Adaptive | Turn Model, West-First, DyXY | Dynamically selects paths based on congestion or faults |
| Minimal | XY-routing, West-First | Always chooses the shortest path |
| Non-Minimal | Valiant’s Routing | Uses detours to balance traffic |

1.3.1 Deterministic Routing: XY-Routing

  • Route first along the X-axis and then along the Y-axis.
  • Usually used in Mesh and Torus topologies.
  • Advantage: Simple and deadlock-free.
  • Disadvantage: Can cause congestion in hot-spot areas.

A simple implementation is given below:

#include <iostream>

struct Packet {
    int src_x, src_y;
    int dest_x, dest_y;
};

void xy_routing(Packet p) {
    int x = p.src_x, y = p.src_y;
    // Phase 1: move along the X-axis until the destination column is reached
    while (x != p.dest_x) {
        x += (p.dest_x > x) ? 1 : -1;
        std::cout << "Moving to (" << x << ", " << y << ")\n";
    }
    // Phase 2: move along the Y-axis to the destination row
    while (y != p.dest_y) {
        y += (p.dest_y > y) ? 1 : -1;
        std::cout << "Moving to (" << x << ", " << y << ")\n";
    }
}

int main() {
    Packet p = {0, 0, 3, 2};  
    xy_routing(p);
    return 0;
}

1.3.2 Adaptive Routing Techniques

  • Chooses the best path based on current congestion
  • Examples: Turn Model, West-First, DyXY

πŸ’‘ Why Do We Need Special Routing Strategies? A deadlock occurs when packets are permanently blocked because each is waiting on another. Adaptive routing can reduce the probability of deadlock, although it cannot eliminate it on its own, and it also improves performance by steering traffic around congestion. The most common adaptive routing strategies are the Turn Model, West-First, and DyXY.

1.3.2.1 Turn Model

  • A method to design deadlock-free routing algorithms.
  • How it works: By prohibiting certain 90Β° turns, it ensures packets never form cyclic dependencies.

Common Turn Models

| Routing Type | Forbidden Turns | Why It Works |
|---|---|---|
| West-First | No Westward turns | Ensures no cyclic dependencies |
| North-Last | No Northward turns | Guarantees deadlock-free paths |
| Negative-First | No turns in negative X/Y first | Prevents cyclic packet dependencies |

βœ… Turn Model ensures deadlock-free routing without requiring virtual channels!

1.3.2.2 West-First Routing (A Special Turn Model)

Complete any Westward movement first; after that, route toward the destination without ever turning back West.

πŸ’‘ What is West-First Routing?

  • A turn-model (partially adaptive) routing algorithm that avoids deadlocks.
  • Packets are never allowed to turn toward the West.
    • Exception: if the destination lies to the West of the source, the packet completes all Westward movement first, before making any other turns.
  • After the Westward phase (if needed), the packet can route freely toward its destination.

πŸ–ΌοΈ Visualizing West-First Routing in a 4Γ—4 NoC Let’s assume a 4Γ—4 mesh topology, where a packet must move from (3,3) to (0,0). πŸ”Ή Allowed Moves: West first, then standard XY-routing. πŸ”΄ Forbidden Moves: No turns toward the West after starting movement.

πŸš€ Example 1: West-First Routing in Action

| Step | Router Position (X, Y) | Movement |
|---|---|---|
| Start | (3,3) | Packet begins here |
| Move West | (2,3) | βœ… Moves left (West) |
| Move West | (1,3) | βœ… Moves left (West) |
| Move West | (0,3) | βœ… Moves left (West); West phase completed |
| Move South | (0,2) | βœ… XY-routing starts |
| Move South | (0,1) | βœ… Moves down (South) |
| Move South | (0,0) | βœ… Reached destination |

πŸ“ Rule Check: βœ”οΈ West movement completed first before moving South. βœ”οΈ No Westward turns after the first movement.

βœ… Key Takeaway:

  • If the destination is West of the source, the packet must go West FIRST before making any other movement.

Thanks LLM for the example explanation!


πŸ’‘ Why West-First Routing?

  • Deadlock prevention: Packets never turn back West, so no cycle of channel dependencies can form.
  • Simple and efficient: The only restriction is on turns into the West direction; all other turns remain allowed.

Example Implementation of West-First Routing

#include <iostream>

struct Packet {
    int src_x, src_y;
    int dest_x, dest_y;
};

void west_first_routing(Packet p) {
    int x = p.src_x, y = p.src_y;

    // Step 1: complete ALL Westward movement first (if needed)
    while (x > p.dest_x) {
        x--;
        std::cout << "Moving West to (" << x << ", " << y << ")\n";
    }

    // Step 2: route freely toward the destination; turning West again is forbidden
    while (x < p.dest_x) {
        x++;
        std::cout << "Moving East to (" << x << ", " << y << ")\n";
    }
    while (y != p.dest_y) {
        y += (p.dest_y > y) ? 1 : -1;
        std::cout << "Moving in Y direction to (" << x << ", " << y << ")\n";
    }
}

int main() {
    Packet p = {2, 2, 0, 3};
    west_first_routing(p);
    return 0;
}

βœ… Why It Works?

  • Always moves West first before doing XY-routing.
  • Prevents cyclic dependencies β†’ No deadlocks!

βœ… When to Use?

  • Works well for low-latency NoC designs.
  • Used in AI accelerators & GPUs to optimize tensor movement.

1.3.2.3 Dynamic XY Routing (DyXY)

πŸ’‘ What is DyXY (Dynamic XY) Routing?

  • A congestion-aware adaptive routing technique.
  • Unlike XY-routing, DyXY allows flexible routing by checking congestion levels before deciding the path. (Dynamically adjusts)
  • Often used in high-performance AI accelerators and GPUs.
    • AMD’s Infinity Fabric (used in Ryzen and the MI300 AI accelerators) reportedly uses adaptive routing of this kind to balance memory & core communication.
      • Why? Unlike CPUs, GPUs have bursty and unpredictable traffic, requiring adaptive routing to maintain low latency.

πŸ”Ή Key Rules 1️⃣ Move X-direction first (like XY-routing). 2️⃣ If congestion is detected in Y-path, choose an alternative Y-path. 3️⃣ Continue towards destination in the least congested direction.

DyXY is still mostly deterministic but introduces adaptive Y-routing to mitigate congestion.

βœ… Why is DyXY better than XY-routing?

  • XY-routing is deterministic β†’ Can suffer from congestion.
  • DyXY adapts to congestion β†’ More balanced NoC utilization.

πŸ’‘ Example situations in a NoC, handled by DyXY:

| NoC Event | DyXY Decision |
|---|---|
| High congestion detected in X-direction | Moves in Y-direction instead |
| Southward link underutilized | Uses that path dynamically |
| All paths congested | Triggers non-minimal routing |

Example: DyXY Routing vs. XY-Routing

Scenario: Packet moves from (1,1) β†’ (3,3). XY-routing always follows (1,1) β†’ (3,1) β†’ (3,3). If (3,1) is congested, DyXY may reroute via (1,1) β†’ (1,3) β†’ (3,3).

How does DyXY improve performance?

β†’ Avoids congested paths dynamically. β†’ More efficient for high-throughput NoCs.

Code: DyXY Routing

#include <iostream>
#include <cstdlib>  // rand, srand: simulated congestion
#include <ctime>    // time: seed for the random generator

struct Packet {
    int src_x, src_y;
    int dest_x, dest_y;
};

// Simulated congestion check (50% chance a link is busy)
bool is_congested(int x, int y) {
    return rand() % 2 == 1;
}

void dyxy_routing(Packet p) {
    int x = p.src_x, y = p.src_y;

    while (x != p.dest_x || y != p.dest_y) {
        int x_step = (p.dest_x > x) ? 1 : -1;
        if (x != p.dest_x && !is_congested(x + x_step, y)) {
            x += x_step;                   // prefer X first, like XY-routing
        } else if (y != p.dest_y) {
            y += (p.dest_y > y) ? 1 : -1;  // adapt: move in Y instead
        } else {
            x += x_step;                   // Y is done; take X even if congested
        }
        std::cout << "Moving to (" << x << ", " << y << ")\n";
    }
}

int main() {
    srand(time(nullptr));  // Seed the simulated congestion
    Packet p = {1, 1, 3, 3};
    dyxy_routing(p);
    return 0;
}

1.3.2.4 Valiant’s Routing

Valiant’s Routing is a non-minimal, load-balancing routing algorithm used in packet-switched networks and NoC architectures. It was first proposed by Leslie Valiant as a way to improve network throughput and balance traffic loads dynamically.

βœ… Key Features of Valiant’s Routing 1️⃣ Non-minimal routing β†’ Packets may take longer detours to avoid congestion. 2️⃣ Load balancing β†’ Distributes packets more evenly across the network. 3️⃣ Resilient to hot spots β†’ Reduces contention and prevents congestion build-up.

πŸ“Œ How Valiant’s Routing Works Instead of sending packets along the shortest path, Valiant’s Routing introduces a randomly chosen intermediate destination before reaching the final destination.

πŸ”Ή Step-by-Step Process 1️⃣ Step 1: Choose a random intermediate node (X’, Y’) uniformly distributed in the network. 2️⃣ Step 2: Route from source β†’ (X’, Y’) using shortest-path routing. 3️⃣ Step 3: Route from (X’, Y’) β†’ destination using shortest-path routing.

πŸ–ΌοΈ Example of Valiant’s Routing in a 4Γ—4 Mesh NoC Scenario: Packet needs to go from (0,0) β†’ (3,3)

πŸ”Ή XY-Routing (Minimal Path) (0,0) β†’ (1,0) β†’ (2,0) β†’ (3,0) β†’ (3,1) β†’ (3,2) β†’ (3,3) βœ… Shortest path, but can cause congestion.

πŸ”Ή Valiant’s Routing (Random Intermediate Hop at (1,2)) (0,0) β†’ (1,0) β†’ (1,1) β†’ (1,2) β†’ (2,2) β†’ (3,2) β†’ (3,3) βœ… Detour helps avoid congestion.

πŸ”Ή Valiant’s Routing (Random Intermediate Hop at (2,3)) (0,0) β†’ (1,0) β†’ (2,0) β†’ (2,1) β†’ (2,2) β†’ (2,3) β†’ (3,3) βœ… Another alternative path to prevent congestion.

πŸ“Œ Why Use Valiant’s Routing in NoC? πŸ’‘ Problem: Hot Spot Congestion in Deterministic Routing XY-Routing and other deterministic algorithms always take the same shortest path. If multiple packets use the same route, it causes bottlenecks (hot spots) in the network. βœ… Solution: Valiant’s Routing Distributes Traffic By randomizing paths, it balances network load and reduces congestion in high-traffic regions.

βœ… Used in: High-performance AI accelerators (e.g., Google TPU, Cerebras Wafer-Scale Engine). HPC (High-Performance Computing) networks like InfiniBand.

πŸ“Œ Valiant’s Routing vs. Other Routing Algorithms

| Routing Type | Path Selection | Congestion Avoidance | Latency | Use Case |
|---|---|---|---|---|
| XY-Routing | Fixed, shortest path | ❌ No | βœ… Low | Simple NoCs |
| West-First | Restricted turns | βœ… Yes | βœ… Low | Deadlock prevention |
| DyXY | Adaptive, based on congestion | βœ… Yes | βœ… Low | AI workloads |
| Valiant’s Routing | Randomized intermediate hop | βœ…βœ… Best | ❌ Higher | HPC, AI accelerators |

βœ… Key Takeaway

  • Valiant’s Routing is slower but prevents congestion best.
  • Great for large-scale AI accelerators and HPC networks.

1.3.3 Cutting-Edge NoC Routing Algorithms

Future NoCs will likely integrate AI-driven routing algorithms either in HW or SW to dynamically adjust paths in real time, optimizing for AI accelerator workloads. Here are a few recent examples:

| Algorithm | Description | Advantage |
|---|---|---|
| Deep Reinforcement Learning (DRL) Routing | Uses AI to learn the best NoC routing paths | Self-optimizing; adjusts to dynamic traffic |
| Hierarchical NoC Routing | Uses multi-level routing (global/local paths) | Reduces congestion in large NoCs |
| Photonic NoC Routing | Uses light-based interconnects | Ultra-low latency, low power |

1.3.3.1 Machine Learning-Based Adaptive Routing

πŸ’‘ Why This Is Important? AI-driven NoC architectures now use deep reinforcement learning (DRL) to optimize routing decisions. DRL-based routers predict congestion patterns and adapt before congestion happens.

1.4 Flow Control Mechanisms in NoC

| Type | Description | Used in/for |
|---|---|---|
| Wormhole Switching | Packets broken into flits, moving one flit at a time | AI accelerators, GPUs |
| Virtual Channels (VCs) | Multiple logical channels on the same physical link | Reducing congestion |
| Credit-Based Flow Control | Buffer space allocated before sending | Deadlock-free designs |

1.4.1 Virtual Channel Routing

πŸ’‘ What is Virtual Channel (VC) Routing?

  • Multiple logical channels share a single physical link.
  • Packets are separated into multiple virtual lanes, preventing head-of-line (HoL) blocking.
  • Each VC has its own buffer & flow control β†’ Packets can bypass congested packets in another VC.

πŸ–ΌοΈ How Virtual Channels Work πŸ”Ή Example: A 2Γ—2 NoC where packets from different cores travel in the same direction.

  • Without VCs β†’ If one packet is blocked, all others must wait.
  • With VCs β†’ Packets use different logical lanes β†’ No waiting!

βœ… Why Virtual Channels Matter?

  • Prevents congestion & deadlocks.
  • Improves throughput & resource utilization.
  • Used in NVIDIA & AMD GPUs for AI workloads.

❓ Virtual Channels vs Adaptive Routing

| Metric | Virtual Channels | Adaptive Routing |
|---|---|---|
| Latency | βœ… Lower (avoids blocking) | ❌ Higher under light load |
| Throughput | βœ… Higher (parallelism) | βœ… Higher (load balancing) |
| Congestion Handling | βœ… Good | βœ… Best |

βœ… Key Takeaway

  • Virtual Channels prevent HoL blocking and are great for regular traffic.
  • Adaptive Routing balances load and is better for dynamic AI workloads.

1.4.1.1 Express Virtual Channels (EVCs)

πŸ’‘ Why This Is Important?

  • EVCs bypass intermediate routers, reducing latency for high-priority packets.
  • Used in Google TPUs and NVIDIA GPUs to prioritize AI tensor movement.
  • Instead of routing each flit through every router in a Mesh NoC, EVCs allow packets to skip unnecessary intermediate hops.

πŸ”Ή Comparison:

| Feature | Standard Virtual Channels (VCs) | Express Virtual Channels (EVCs) |
|---|---|---|
| Path Selection | Stop at every router | Skip intermediate routers |
| Latency | Higher | Lower |
| Best Use Case | General packet traffic | High-priority AI workloads |

βœ… Where Is This Used?

  • Tensor data movement in AI accelerators.
  • High-speed interconnects for GPUs (NVLink, Infinity Fabric).

1.4.2 Wormhole Switching

πŸ’‘ How It Works Wormhole switching splits packets into small flow-control units (flits) and only buffers a few flits per router, unlike traditional packet-based switching. Here’s how it works:

Packets are divided into flits: 1️⃣ Header Flit: Reserves the path and determines routing. 2️⃣ Body Flits: Carry the actual data and follow the header. 3️⃣ Tail Flit: Marks the end of the packet.

β†’ Each router only stores a few flits at a time instead of buffering entire packets. β†’ This significantly reduces memory usage and makes the design area- and power-efficient.

Advantages of Wormhole Switching βœ… Low Buffer Requirements – Needs only a few flits of buffer per router. βœ… Low Latency – Flits immediately start moving as soon as a path is available. βœ… Power Efficient – Less buffer storage means lower dynamic power consumption.

Disadvantages of Wormhole Switching ❌ Head-of-Line (HoL) Blocking – If the header flit is blocked, all following flits get stuck. ❌ Deadlock Possibility – If multiple packets wait for each other, the network can freeze. ❌ Limited Throughput – A single congested packet blocks multiple flits from different packets.

πŸ“Œ Real-World Example AI workloads (like transformers in LLMs) involve massive data transfers between compute coresβ€”wormhole switching ensures efficient low-latency movement.

πŸ”Ή How Wormhole Switching Works 1️⃣ Step 1: A flit-based switch holds multiple flits per buffer. 2️⃣ Step 2: If one flit gets blocked, all flits behind it must wait, even if they belong to different packets. 3️⃣ Step 3: VCs partially solve this by separating competing packets into independent buffers.

βœ… Why This Matters? Even with VCs, a blocked VC still causes HoL blocking within that virtual channel.

1.5 NoC Congestion Control & Performance Metrics

1.5.1 What Causes Congestion in NoC?

πŸ’‘ Definition of NoC Congestion: Congestion occurs when packets experience excessive delays due to buffer overflow, link contention, or lack of routing flexibility. When congestion builds up in one part of the NoC, it can cause head-of-line (HoL) blocking, reducing system efficiency.

πŸ”Ή Key Causes of Congestion

| Cause | Explanation | Example |
|---|---|---|
| Hotspots | Too many packets target the same router or link | AI workloads causing localized memory traffic congestion |
| Limited Buffer Size | Routers cannot store all incoming packets | Buffers fill up, causing packet drops |
| Poor Routing Decisions | Fixed routing paths overload some links | XY-routing in a Mesh topology without adaptive routing |
| Lack of Virtual Channels | Packets compete for the same physical link | Only one packet moves at a time, blocking others |

1.5.2 Techniques for NoC Congestion Control

1️⃣ Virtual Channels (VCs) Splits a physical link into multiple logical lanes to allow packets to bypass blocked traffic. Prevents Head-of-Line (HoL) blocking by enabling different packets to move in parallel. βœ… Example: A 2Γ—2 NoC where multiple packets travel in the same direction. Without VCs β†’ One blocked packet stops all others. With VCs β†’ Other packets use separate virtual lanes, reducing delays.

2️⃣ Adaptive Routing Instead of following a fixed shortest path (like XY-routing), adaptive routing chooses less congested paths dynamically. Reduces congestion hotspots by distributing traffic more evenly. βœ… Example: If (1,0) is congested, an adaptive router may reroute through (0,1) instead.

3️⃣ Express Links Skip intermediate routers by providing direct long-range links. Reduces latency and congestion in large NoCs. βœ… Example: A Mesh NoC with express links allows packets to jump multiple hops instead of routing through every intermediate node.

4️⃣ Credit-Based Flow Control Before sending a packet, the sender checks if the receiver has buffer space available. Prevents buffer overflow and packet drops. βœ… Example: A router doesn’t send more packets than the next router can handle, avoiding buffer congestion.

5️⃣ Priority-Based NoC Routing Some packets (like AI inference requests) are prioritized over background traffic. Ensures real-time tasks meet deadlines while keeping congestion under control.

1.5.3 NoC Performance Metrics

To optimize NoC performance, we measure key metrics:

1️⃣ Latency Definition: The time a packet takes to travel from source to destination, measured in cycles per hop or nanoseconds. Lower latency = better performance.

2️⃣ Throughput Definition: The total number of packets delivered per cycle. Higher throughput = better NoC efficiency.

3️⃣ Power Consumption Definition: The total energy consumed by NoC communication. Reducing power is critical for AI accelerators and mobile chips.

βœ… Techniques to Reduce Power:

  • Power-gating inactive routers.
  • Reducing NoC voltage dynamically based on workload.

4️⃣ Buffer Occupancy Definition: How full NoC router buffers are over time. Higher occupancy = more congestion risk.


2. Performance Modeling & Simulation of NoC

2.1 Building Performance Models for NoC

Performance modeling is essential for designing efficient Network-on-Chip (NoC) architectures. A comprehensive NoC model typically includes:

  1. Topology Model: Representation of the network structure
  2. Router Model: Latency and throughput characteristics
  3. Traffic Model: Workload patterns and communication demands
  4. Power Model: Energy consumption estimates

2.2 Some NoC Simulators

BookSim

πŸ”₯ What is BookSim?

  • A cycle-accurate NoC simulator developed at Stanford University.
  • Used for AI accelerators & chip design exploration.
  • Supports various topologies (Mesh, Torus, Fat Tree).

πŸ“Œ BookSim’s Key Features βœ… Models packet-level routing, flow control, and congestion. βœ… Simulates Wormhole, Virtual-Channel, and Adaptive Routing. βœ… Evaluates latency, bandwidth, and power for different NoC designs.

πŸ”₯ Real-World Use Case: BookSim in AI Hardware πŸ”Ή Researchers use it to optimize NoC designs for AI workloads. πŸ”Ή It is also used in industry to explore NoC designs for AI accelerators.

Garnet (Gem5)

πŸ”₯ What is Garnet (Gem5)?

  • An NoC simulator integrated into the Gem5 full-system simulator.
  • More detailed than BookSim β†’ Provides cycle-accurate CPU-GPU NoC modeling.

πŸ“Œ Garnet’s Key Features βœ… Simulates realistic multi-core CPU-GPU interconnects. βœ… Used for evaluating NoC performance in high-performance computing (HPC) systems.

πŸ”₯ Real-World Use Case: Garnet in HPC πŸ”Ή Garnet is used in academic and industrial research for full-chip NoC simulations before fabricating data center chips, and for modeling accelerator interconnect performance.

➑️ Garnet can:

  • Analyze full-system CPU-GPU interconnect performance under real-world workloads, i.e., how CPU-GPU communication behaves when running actual applications.
  • Analyze the impact of cache coherency and memory access on NoC performance.
  • Analyze how NoC latency affects AI model execution.

Important Differences between BookSim and Garnet

| Feature | BookSim | Garnet (Gem5) |
|---|---|---|
| Focus | NoC routing & congestion analysis | Full-system CPU-GPU interconnect analysis |
| Level of Detail | Packet-level simulation | Cycle-accurate system-wide simulation |
| Best For | Comparing routing algorithms | Analyzing AI workloads on a full-system NoC |

βœ… Why It Matters?

➑️ Use BookSim to optimize NoC microarchitecture. ➑️ Use Garnet to study NoC behavior in AI accelerators.

3. Software Engineering & Concepts for NoC Simulators

3.1 Key Software Engineering Concepts for NoC Simulators

| Concept | Why It Matters for NoC Simulators |
|---|---|
| Modular Design | Each NoC component (router, buffer, link) should be independently testable & reusable. |
| Event-Driven Simulation | NoCs are modeled as event-driven systems (packets arrive, propagate, stall). |
| Concurrency & Parallelism | NoC simulators should leverage multi-threading for speed. |
| Abstraction Layers | Separate hardware modeling from routing logic to ensure flexibility. |
| Performance Optimization | Use efficient data structures (e.g., custom queues) for faster packet handling. |
| Profiling & Debugging | NoC simulators should include performance metrics & logging. |

3.2 Software Architecture of NoC Simulators

3.2.1 NoC Simulation Components

A typical NoC simulator consists of four major software components:

| Component | Role in NoC Simulation |
|---|---|
| Router Model | Implements buffering, flow control, routing, and arbitration. |
| Link Model | Represents NoC links (bandwidth, delay, contention effects). |
| Traffic Generator | Generates AI workload-based packet injections. |
| Performance Analyzer | Tracks latency, congestion, throughput, and power metrics. |

βœ… Example: BookSim NoC Simulator Architecture

| Component | Role in NoC Simulation |
|---|---|
| Router Class (Router.cpp) | Implements virtual channels, input buffers, & routing logic. |
| Traffic Class (TrafficManager.cpp) | Injects synthetic or real workload-based traffic. |
| Simulation Engine (Simulator.cpp) | Executes event-driven simulation steps. |

3.2.2 Event-Driven NoC Simulation

Most NoC simulators use an event-driven simulation model instead of traditional step-by-step execution.

βœ… How Event-Driven Simulation Works in NoC: 1️⃣ Each NoC event (packet arrival, buffer stall, routing decision) is scheduled on an event queue. 2️⃣ The event scheduler processes events in time order (priority queue). 3️⃣ Events are processed only when necessary, improving performance.

πŸ’‘ Example: Event Handling in NoC Simulation (Pseudocode)

#include <functional>
#include <queue>
#include <vector>

struct Event {
    int timestamp;
    std::function<void()> action;
};

// Order the priority queue so the EARLIEST timestamp is processed first
struct LaterFirst {
    bool operator()(const Event& a, const Event& b) const {
        return a.timestamp > b.timestamp;
    }
};

// Event Queue (min-heap sorted by timestamp)
std::priority_queue<Event, std::vector<Event>, LaterFirst> eventQueue;

// Add an event to the NoC simulation
void scheduleEvent(int time, std::function<void()> action) {
    eventQueue.push({time, std::move(action)});
}

// NoC Simulation Engine
void runSimulation() {
    while (!eventQueue.empty()) {
        auto event = eventQueue.top();
        eventQueue.pop();
        event.action();  // Execute the event at its scheduled time
    }
}

3.2.3 Key C++ Features for NoC Simulators

| Feature | Why It Matters for NoC Simulators |
|---|---|
| std::optional (C++17) | Avoids uninitialized packet states. |
| std::variant (C++17) | Flexible storage for routing decisions. |
| std::any (C++17) | Allows generic NoC event storage. |
| Parallel execution policies, std::execution::par (C++17) | Enable multithreading in NoC simulations. |
| Coroutines, <coroutine> (C++20) | Allow event-driven NoC scheduling with asynchronous packet handling. |
| std::span (C++20) | Efficiently manages flit buffers. |
| std::bit_cast (C++20) | Fast conversion between flits and integers for NoC encoding. |

3.2.4 Using C++ Coroutines for NoC Event Handling

πŸ”Ή Problem: Traditional NoC simulators handle packet transmission using blocking function calls, leading to inefficiencies.

πŸ”Ή Solution: Use C++20 coroutines to make packet transmission non-blocking and event-driven.

βœ… Example: NoC Packet Transmission Using Coroutines

#include <coroutine>
#include <iostream>

// Coroutine-based NoC Event (minimal promise type)
struct NoCEvent {
    struct promise_type {
        NoCEvent get_return_object() { return {}; }
        std::suspend_never initial_suspend() { return {}; }          // start running immediately
        std::suspend_never final_suspend() noexcept { return {}; }   // free the frame on completion
        void return_void() {}
        void unhandled_exception() {}  // ignore exceptions in this sketch
    };
};

// Simulated NoC Packet Transmission (coroutine)
NoCEvent transmitPacket(int packet_id) {
    std::cout << "Packet " << packet_id << " transmitted." << std::endl;
    co_return;
}

int main() {
    transmitPacket(42);  // Coroutine runs to completion without blocking on polling
    return 0;
}

βœ… Why Use Coroutines for NoC Simulation? βœ”οΈ Avoids blocking execution while waiting for packet arrivals. βœ”οΈ Enables asynchronous event scheduling (no need for polling loops).

3.2.5 Parallelism & Concurrency in NoC Simulators

Modern NoC simulators must handle thousands of packets simultaneously. This requires efficient multithreading and parallel processing.

πŸš€ Parallel NoC Simulation Using std::execution

C++17 introduced parallel execution policies to enable multi-threaded NoC event processing.

βœ… Example: NoC Parallel Packet Processing

#include <algorithm>
#include <execution>
#include <iostream>
#include <vector>

int main() {
    std::vector<int> packets = {1, 2, 3, 4, 5};

    // Process packets in parallel across available CPU cores
    // (note: output lines may interleave between threads)
    std::for_each(std::execution::par, packets.begin(), packets.end(), [](int packet) {
        std::cout << "Processing packet: " << packet << '\n';
    });
    return 0;
}

βœ… Why Use std::execution::par? βœ”οΈ Automatically distributes packet processing across multiple CPU cores. βœ”οΈ Reduces simulation runtime for large NoC workloads.

πŸ“Œ Performance Optimization Techniques for NoC Simulators Efficient NoC simulators must be optimized for performance.

πŸš€ NoC Simulator Performance Bottlenecks & Solutions

| Bottleneck | Optimization Technique |
|---|---|
| Slow event processing | Use an event-driven execution model (instead of polling). |
| Inefficient buffer allocation | Use std::span to manage flit buffers efficiently. |
| High memory usage | Use custom memory pools instead of heap allocations. |
| Cache misses in routing logic | Prefetch routing tables using SIMD-friendly data structures. |

βœ… Example: Optimized NoC Packet Queue Using std::deque

#include <deque>
#include <iostream>

std::deque<int> packetQueue;  // Optimized for FIFO access

void enqueuePacket(int packet) {
    packetQueue.push_back(packet);  // Fast O(1) insert
}

void processNextPacket() {
    if (!packetQueue.empty()) {
        int packet = packetQueue.front();
        packetQueue.pop_front();  // Fast O(1) removal
        std::cout << "Processing packet: " << packet << std::endl;
    }
}

βœ… Why Use std::deque for NoC Buffers? βœ”οΈ O(1) enqueue/dequeue β†’ Faster than std::vector. βœ”οΈ Low memory fragmentation.

4. AI Workload Analysis

4.1 Why Do AI Workloads Stress the NoC?

πŸ’‘ Key Reasons:

  • Massive Data Movement: AI workloads require huge amounts of tensor data movement between processing cores, memory, and accelerators.
  • Irregular Traffic Patterns: Unlike traditional HPC workloads, AI tasks exhibit bursty, unpredictable traffic.
  • High Memory Bandwidth Needs: Training/inference models require high-speed interconnects to access DRAM, HBM, or cache.
  • Synchronization Overhead: AI workloads depend on synchronized execution across multiple cores/GPUs, leading to NoC congestion and stalls.

βœ… Impact on NoC:

| AI Workload Issue | NoC Bottleneck |
|---|---|
| Large tensor data movement | High bandwidth demand |
| Unpredictable bursty traffic | NoC congestion & packet loss |
| Cache coherence across compute cores | Increased latency |
| Need for synchronization | Increased communication overhead |

4.2 How Do Different AI Workloads Impact the NoC? Link to heading

| AI Model Type | NoC Stress Pattern | Bottleneck | Example Models |
| --- | --- | --- | --- |
| Transformer Models (LLMs) | High memory bandwidth, all-to-all communication | NoC congestion from massive attention matrix communication | GPT-4, Llama, Falcon |
| CNNs (Convolutional Neural Networks) | Localized data movement, structured traffic | Memory latency & NoC hotspot formation | ResNet, EfficientNet |
| Graph Neural Networks (GNNs) | Irregular, sparse traffic patterns | Poor NoC utilization & congestion in specific regions | GCN, GAT |
| Diffusion Models | Bursty traffic, unpredictable spikes | Cache misses & NoC jitter | Stable Diffusion, DALL-E, MidJourney |

βœ… Key Observations:

  • LLMs stress NoC the most due to their massive attention matrix multiplications.
  • CNNs are structured and predictable, but require optimized memory access.
  • GNNs have sparse and irregular traffic, leading to load imbalance on the NoC.

4.3 Key NoC Bottlenecks for AI Workloads Link to heading

4.3.1 High Memory Bandwidth Requirements Link to heading

πŸ”Ή Why?

  • AI models, especially LLMs, require terabytes per second of DRAM/HBM bandwidth.
  • The NoC must provide fast access to weight matrices & activations without bottlenecks.

βœ… Example: AI Accelerator Memory Access (TPU, MI300, Hopper)

| Component | Peak Memory Bandwidth |
| --- | --- |
| NVIDIA Hopper H100 (HBM3) | 4.9 TB/s |
| Google TPU v4 | 2.2 TB/s |
| AMD MI300 | 5.2 TB/s |

πŸ’‘ What happens if NoC bandwidth is insufficient? ❌ Memory stalls β†’ the AI pipeline sits idle waiting for data. ❌ High NoC congestion β†’ increased packet drop rates and latency.

βœ… Solution: NoC Optimizations βœ”οΈ High-bandwidth interconnects (NVLink, Infinity Fabric). βœ”οΈ Efficient NoC routing (Valiant’s Routing, Adaptive DyXY). βœ”οΈ HBM-aware NoC design for direct memory access.

4.3.2 Bursty and Unpredictable Traffic Patterns Link to heading

πŸ”Ή Why?

  • Unlike HPC workloads, AI workloads generate bursts of traffic when layers process large tensors.
  • Transformer models create unpredictable spikes in NoC traffic during attention calculations.

βœ… Example: Burst Traffic in AI Training

| Time (ms) | Traffic Pattern |
| --- | --- |
| 0 - 50 | Low data movement |
| 50 - 60 | High burst (attention computation) |
| 60 - 80 | Medium data flow |
| 80 - 90 | Another burst (MLP layer computation) |

πŸ’‘ What happens if NoC cannot handle burst traffic? ❌ High packet loss β†’ Tensor updates get delayed. ❌ Increased NoC congestion β†’ Slows down AI training/inference.

βœ… Solution: NoC Optimizations βœ”οΈ Adaptive routing (Dynamic DyXY, Reinforcement Learning-based routing). βœ”οΈ Express Virtual Channels (EVCs) to bypass congested routers. βœ”οΈ Priority-based NoC scheduling to give preference to critical AI tasks.
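One way to reproduce this kind of burst pattern in a simulator is a simple two-state (on/off) traffic generator. The sketch below is illustrative only: the phase rates and switching probabilities are assumed values, not measurements from a real workload.

```cpp
#include <random>

// Two-state burst model: a LOW phase emits few packets per cycle,
// a BURST phase emits many. All rates and switching probabilities
// here are illustrative assumptions.
class BurstTrafficGen {
public:
    explicit BurstTrafficGen(unsigned seed) : rng_(seed) {}

    // Number of packets injected this cycle.
    int packetsThisCycle() {
        // Occasionally flip between LOW and BURST phases.
        std::bernoulli_distribution flip(bursting_ ? 0.2 : 0.05);
        if (flip(rng_)) bursting_ = !bursting_;
        // Draw this cycle's injection count from the phase's Poisson rate.
        std::poisson_distribution<int> arrivals(bursting_ ? 20.0 : 2.0);
        return arrivals(rng_);
    }

private:
    std::mt19937 rng_;
    bool bursting_ = false;
};
```

Feeding packetsThisCycle() into a simulator's injection stage produces quiet stretches punctuated by spikes, similar in shape to the attention/MLP bursts in the table above.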

4.3.3 Synchronization Overhead in AI Training Link to heading

πŸ”Ή Why?

  • AI workloads run on distributed compute units (e.g., GPUs, TPUs).
  • NoC must synchronize tensor updates across all cores.

βœ… Example: AI Model Synchronization

| Operation | Communication Overhead |
| --- | --- |
| Forward Pass | Minimal NoC traffic (mostly read-heavy) |
| Backward Pass (Gradients) | Heavy NoC traffic (large tensor updates) |
| Parameter Update (SGD, Adam) | Massive synchronization needed |

πŸ’‘ What happens if NoC cannot synchronize efficiently? ❌ Deadlocks in NoC β†’ Model training halts. ❌ Increased power consumption due to excessive message passing.

βœ… Solution: NoC Optimizations βœ”οΈ Hierarchical NoC topologies to reduce synchronization latency. βœ”οΈ Priority-based packet scheduling for gradient updates. βœ”οΈ Smart data compression techniques (quantization, sparsity).
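The data-compression bullet can be made concrete with a minimal int8 quantization sketch: gradients are scaled to 8-bit values before crossing the NoC, cutting payload size 4x versus float32. The symmetric max-abs scheme below is just the simplest variant, chosen for illustration.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Symmetric max-abs quantization: float32 -> int8 (4x smaller NoC payload).
struct Quantized {
    std::vector<int8_t> data;
    float scale;  // the receiver needs this to dequantize
};

Quantized quantize(const std::vector<float>& grads) {
    float maxAbs = 0.0f;
    for (float g : grads) maxAbs = std::max(maxAbs, std::fabs(g));
    float scale = (maxAbs > 0.0f) ? maxAbs / 127.0f : 1.0f;
    Quantized q{{}, scale};
    q.data.reserve(grads.size());
    for (float g : grads)
        q.data.push_back(static_cast<int8_t>(std::lround(g / scale)));
    return q;
}

std::vector<float> dequantize(const Quantized& q) {
    std::vector<float> out;
    out.reserve(q.data.size());
    for (int8_t v : q.data) out.push_back(v * q.scale);
    return out;
}
```

The trade-off is a small rounding error per gradient in exchange for a 4x reduction in synchronization traffic.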

4.3.4 AI-Specific Traffic Patterns: All-to-All Communication Link to heading

πŸ”Ή Why?

  • Transformer-based models require massive matrix multiplications across multiple GPUs/TPUs.
  • AI accelerators use all-to-all NoC communication, leading to congestion.

βœ… Example: NoC Traffic in Transformer Models

| Layer | Data Movement |
| --- | --- |
| Self-Attention | All-to-All (high NoC congestion) |
| Feedforward (MLP) | Many-to-One (moderate NoC usage) |
| Output Layer | One-to-All (medium NoC traffic) |

πŸ’‘ What happens if NoC cannot handle all-to-all traffic? ❌ Extreme congestion at AI accelerator interconnects. ❌ Limited scalability as AI models grow (GPT-5, Llama 3, etc.).

βœ… Solution: NoC Optimizations βœ”οΈ Hybrid topologies (Torus + Express Links) to reduce hops. βœ”οΈ Optical/Wireless NoC for ultra-low latency interconnects. βœ”οΈ Dynamic workload-aware routing to balance NoC traffic.
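The hop-reduction claim behind the hybrid-topology bullet can be checked with a small calculation, using the standard worst-case distance formulas for 2D meshes and tori:

```cpp
// Worst-case (corner-to-corner) hop count in an n x n 2D mesh:
// n-1 hops in each of the two dimensions.
int meshMaxHops(int n) { return 2 * (n - 1); }

// With wrap-around links, no two nodes are more than floor(n/2) hops
// apart in each dimension, so a torus roughly halves the worst case.
int torusMaxHops(int n) { return 2 * (n / 2); }
```

For an 8x8 network this gives 14 hops for the mesh but only 8 for the torus; express links shrink the count further by letting packets skip intermediate routers.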

πŸ“Œ Final Takeaways

πŸ”Ή Modern AI workloads stress the NoC through high bandwidth demand, unpredictable traffic, and synchronization overhead.
πŸ”Ή LLMs (GPT-4, Llama) create the most severe NoC congestion due to all-to-all tensor communication.
πŸ”Ή CNNs have structured traffic but require optimized memory access.
πŸ”Ή NoCs must evolve with smarter routing, better flow control, and high-speed interconnects.

Appendix Link to heading

Queueing Theory Link to heading

| Concept | Definition | NoC Relevance |
| --- | --- | --- |
| Little’s Law | L = Ξ»W (number of items = arrival rate Γ— wait time) | Predicts NoC buffer occupancy |
| M/M/1 Queue | Single server, exponential arrivals & service | Models packet delays in NoC routers |
| M/G/1 Queue | General service distribution | More accurate for bursty NoC traffic |

πŸ”₯ Little’s Law πŸ”Ή Formula:

L=Ξ»W

Where:

  • L = Average number of packets in the system
  • Ξ» = Arrival rate of packets
  • W = Average time a packet spends in the system

βœ… Example Application in NoC: If packets arrive at 5 per cycle and spend 4 cycles in the NoC, then:

L=5Γ—4=20

This means that, on average, 20 packets are present in the NoC at any given time.
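The arithmetic above is simple enough to encode directly; the function names below are illustrative, and the second function is just the same law rearranged:

```cpp
// Little's Law: L = lambda * W
// (average packets in flight = arrival rate x time in system).
double packetsInSystem(double lambda, double w) { return lambda * w; }

// Rearranged: given a measured buffer occupancy L and an arrival rate,
// infer the average time a packet spends inside the NoC.
double avgTimeInSystem(double L, double lambda) { return L / lambda; }
```

With the example's numbers, packetsInSystem(5.0, 4.0) returns 20; conversely, observing 20 resident packets at Ξ» = 5 implies a 4-cycle average latency, which is how Little’s Law is often used to infer delay from occupancy counters.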

πŸ”₯ Poisson Distribution (ELI5)

  • Describes random arrival events over time.
  • Used in NoC traffic modeling to simulate bursty vs. smooth packet arrivals.

βœ… Example: β†’ If packets arrive at an average rate of 10 per second, a Poisson model predicts how likely it is to receive 15 or 20 packets in the next second. β†’ Real-world application: Models random memory accesses in NoC routers.
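The likelihoods in that example come from the Poisson probability mass function, P(k; Ξ») = Ξ»^k e^(-Ξ») / k!. A small sketch, computed via logarithms so the factorial does not overflow:

```cpp
#include <cmath>

// Poisson pmf P(k; lambda) = lambda^k * e^(-lambda) / k!,
// evaluated in log space (lgamma gives log(k!)) to stay numerically safe.
double poissonPmf(int k, double lambda) {
    double logP = k * std::log(lambda) - lambda - std::lgamma(k + 1.0);
    return std::exp(logP);
}
```

With the example's Ξ» = 10 packets per second, poissonPmf(15, 10.0) is roughly 0.035, so a burst of 15 arrivals in one second has about a 3.5% chance; 20 arrivals is rarer still (under 0.2%).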

πŸ”₯ M/M/1 Queue (NoC Packet Queues) πŸ’‘ What is M/M/1? A single-server queueing model used to analyze NoC router delays.

M/M/1 means:

  • M (Memoryless Poisson Arrivals)
    • Markovian arrivals, where packets arrive randomly following a Poisson process.
  • M (Memoryless Exponential Service)
    • Markovian service, where packet service times are exponentially distributed.
  • 1 (Single NoC router queue)

βœ… Example: If a router receives packets at a rate of 10 per cycle but can only process 8 per cycle, a queue forms and, since arrivals outpace service, grows without bound.
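The standard M/M/1 formulas make this quantitative. A minimal sketch (the rates in the usage note are the example's illustrative values):

```cpp
// Standard M/M/1 results, valid only when rho = lambda / mu < 1:
//   rho : router utilization
//   L   : expected packets in the system, rho / (1 - rho)
//   W   : expected time a packet spends in the system, 1 / (mu - lambda)
double utilization(double lambda, double mu) { return lambda / mu; }

double expectedPackets(double lambda, double mu) {
    double rho = lambda / mu;
    return rho / (1.0 - rho);
}

double expectedTime(double lambda, double mu) {
    return 1.0 / (mu - lambda);
}
```

For the example's rates (10 arrivals vs. 8 served per cycle), utilization(10, 8) is 1.25: above 1, so no steady state exists and the backlog keeps growing. A stable variant with 6 arrivals per cycle gives rho = 0.75, an expected 3 packets in the router, and a 0.5-cycle average sojourn time.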

βœ… Why M/M/1 Matters for NoC?

It predicts NoC delays based on the packet arrival rate and helps optimize buffer size & flow control.

πŸ”₯ All Queue Types

| Queue Type | Arrival Process | Service Process | Servers | Example Use Case |
| --- | --- | --- | --- | --- |
| M/M/1 | Poisson (random) | Exponential (random) | 1 | NoC router queue |
| M/M/c | Poisson | Exponential | c servers | Multi-core CPU processing requests |
| M/G/1 | Poisson | General (any distribution) | 1 | NoC traffic bursts |
| G/M/1 | General (any distribution) | Exponential | 1 | Variable packet arrivals |
| M/D/1 | Poisson | Deterministic (fixed) | 1 | Real-time NoC packet scheduling |
| M/M/∞ | Poisson | Exponential | Infinite servers | Cloud computing load balancing |

πŸ“Œ Comparing M/M/1 vs. M/M/c

πŸ’‘ What is an M/M/c Queue? The same as M/M/1, but with c servers instead of 1; used for multi-core CPUs & parallel packet processing in NoC.

βœ… Example: If an NoC router has 4 ports handling traffic, we can model it as M/M/4.

βœ… Why M/M/c? It reduces waiting time by serving packets in parallel and is more realistic for AI accelerators & GPU interconnects.

πŸ“Œ Comparing M/M/1 vs. M/G/1

πŸ’‘ What is an M/G/1 Queue? The same as M/M/1, but service time follows ANY distribution (not just exponential); more accurate for NoC, since real-world packet service times are not perfectly exponential.

βœ… Example: If a NoC router processes tensor data from an AI workload, some packets take longer than others β†’ M/G/1 models this better.

βœ… Why M/G/1? More realistic for bursty NoC traffic; used in AI interconnect modeling.

πŸ“Œ Comparing M/M/1 vs. M/D/1

πŸ’‘ What is an M/D/1 Queue? The same as M/M/1, but service times are fixed (deterministic); used for real-time NoC scheduling where service time is constant.

βœ… Example: If an NoC router always takes exactly 2 cycles to process a packet, we use M/D/1 instead of M/M/1.

βœ… Why M/D/1? Predictable processing delays; used in hard real-time NoC systems (e.g., automotive chips).
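The benefit of deterministic service can be quantified: the Pollaczek-Khinchine formula gives the mean queueing delay, and for M/D/1 it is exactly half the M/M/1 value because deterministic service has zero variance. A minimal sketch (the rates in the usage note are illustrative):

```cpp
// Mean queueing delay Wq (time spent waiting, excluding service),
// from the Pollaczek-Khinchine formula with rho = lambda / mu:
//   M/M/1: Wq = rho / (mu * (1 - rho))
//   M/D/1: Wq = rho / (2 * mu * (1 - rho))  -- exactly half of M/M/1
double wqMM1(double lambda, double mu) {
    double rho = lambda / mu;
    return rho / (mu * (1.0 - rho));
}

double wqMD1(double lambda, double mu) {
    double rho = lambda / mu;
    return rho / (2.0 * mu * (1.0 - rho));
}
```

At half load (4 arrivals vs. 8 served per cycle), wqMM1 gives 0.125 cycles of queueing versus 0.0625 for wqMD1, which is one reason fixed-service-time routers are attractive for hard real-time NoCs.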

πŸ“Œ Which Queue Model is Best for NoC?

| Use Case | Best Queue Model |
| --- | --- |
| General NoC packet queuing | M/M/1 |
| Parallel NoC processing (multi-core) | M/M/c |
| NoC with bursty packet arrivals | M/G/1 |
| Real-time NoC scheduling | M/D/1 |

βœ… Key Takeaway: NoC routers typically use M/M/1 or M/G/1, depending on traffic patterns.