Enjoy reading, please leave a comment or two! :)

Table of Contents

1. Network-on-Chip (NoC) Architectures

💡 What is NoC (Network-on-Chip)? NoC is an on-chip communication network that connects different processing elements (CPUs, GPUs, AI accelerators, memory, etc.). It replaces traditional bus-based architectures, which become a bottleneck as chip complexity increases.

1.1 Circuit Switching vs. Packet Switching

| Feature | Circuit Switching | Packet Switching (Used in NoC) |
| --- | --- | --- |
| Path Setup | Reserves a dedicated path before transmission | No prior setup; packets are routed dynamically |
| Efficiency | Wastes bandwidth when idle | More efficient; utilizes links dynamically |
| Scalability | Poor, limited by dedicated paths | Highly scalable for multi-core systems |
| Usage in NoC? | ❌ Not used | ✅ Preferred |

→ NoC must ensure low-latency, deadlock-free routing, and therefore uses packet switching.

1.2 NoC Topologies

1.2.1 Mesh Topology

Structure: A 2D grid where each node connects to its immediate neighbors (north, south, east, west).

Characteristics:

  • Scalability: easily expandable by adding rows or columns.
  • Routing Simplicity: supports simple routing algorithms like XY-routing.
  • Latency: increases with the number of hops between distant nodes.
  • Power Consumption: generally lower due to shorter average wire lengths.

Use Cases: Common in systems where simplicity and scalability are prioritized over minimal latency.

1.2.2 Torus Topology

Structure: Similar to Mesh but with additional wrap-around connections, linking edge nodes directly.

Characteristics:

  • Reduced Latency: wrap-around links decrease the maximum hop count between nodes.
  • Increased Path Diversity: multiple routing paths enhance fault tolerance.
  • Higher Power Consumption: additional links consume more power and increase design complexity.

Use Cases: Suitable for applications requiring lower latency and higher fault tolerance, accepting increased power usage.

1.2.3 Butterfly Topology

Structure: A multi-stage, hierarchical network resembling a butterfly structure, often used in parallel computing.

Characteristics:

  • Low Latency: provides logarithmic path lengths relative to the number of nodes.
  • Complex Routing: requires sophisticated control logic for routing decisions.
  • Scalability Limitations: expansion increases complexity significantly.

Use Cases: Ideal for applications demanding high-speed data transfer with predictable latency, such as certain parallel processing tasks.

1.2.4 Recent Developments in NoC Topologies

Advancements in NoC design focus on enhancing performance, scalability, and energy efficiency. Notable developments include:

  • 3D NoC Architectures: integrating vertical (z-axis) connections into traditional 2D layouts, reducing average hop count and latency.
  • Circulant Topologies: utilizing mathematical properties of circulant graphs to design D-dimensional NoCs, offering real-time guarantees and reduced communication latency. (Reference)
  • Wireless NoC (WiNoC): incorporating wireless communication channels between chiplets to enhance performance and reduce latency.
  • Dynamically Scalable NoC: architectures that adjust interconnect resources at runtime based on workload demands, optimizing power consumption and performance. (Reference)
  • Open-Source High-Bandwidth NoCs: developments like FlooNoC provide wide physical links and support for parallel multi-stream data, achieving high energy efficiency and bandwidth. (Reference)

1.2.5 Comparative Analysis of NoC Topologies

| Topology | Scalability | Latency | Power Consumption | Complexity | Fault Tolerance |
| --- | --- | --- | --- | --- | --- |
| Mesh | High | Moderate | Low | Low | Moderate |
| Torus | Moderate | Low | High | Moderate | High |
| Butterfly | Low | Low | Moderate | High | Low |
| 3D NoC | High | Very Low | Moderate | High | High |
| Circulant | High | Low | Moderate | High | High |
| Wireless NoC | High | Very Low | Low | High | Moderate |
| Dynamically Scalable | High | Adaptive | Low | High | High |

Note: The characteristics of each topology can vary based on specific design implementations and application requirements.

1.3 Routing Algorithms

Types of Routing Algorithms

| Type | Examples | Description |
| --- | --- | --- |
| Deterministic | XY-routing | Always follows the same path; predictable but can cause congestion |
| Adaptive | Turn Model, West-First, DyXY | Dynamically selects paths based on congestion or faults |
| Minimal | XY-routing, West-First | Always chooses a shortest path |
| Non-Minimal | Valiant's Routing | Uses detours to balance traffic |

1.3.1 Deterministic Routing: XY-Routing

  • Route first along the X-axis and then along the Y-axis.
  • Usually used in Mesh and Torus topologies.
  • Advantage: Simple and deadlock-free.
  • Disadvantage: Can cause congestion in hot-spot areas.

A simple implementation is given below:

#include <iostream>

struct Packet {
    int src_x, src_y;
    int dest_x, dest_y;
};

void xy_routing(Packet p) {
    int x = p.src_x, y = p.src_y;
    while (x != p.dest_x) {
        x += (p.dest_x > x) ? 1 : -1;
        std::cout << "Moving to (" << x << ", " << y << ")\n";
    }
    while (y != p.dest_y) {
        y += (p.dest_y > y) ? 1 : -1;
        std::cout << "Moving to (" << x << ", " << y << ")\n";
    }
}

int main() {
    Packet p = {0, 0, 3, 2};  
    xy_routing(p);
    return 0;
}

1.3.2 Adaptive Routing Techniques

  • Chooses the best path based on congestion levels.
  • Examples: Turn Model, West-First, DyXY

💡 Why Do We Need Special Routing Strategies? A deadlock occurs when packets are permanently blocked because each is waiting on a resource another holds. Adaptive routing reduces the probability of deadlock (though it cannot eliminate it on its own) and improves performance under congestion. The most common adaptive routing strategies are the Turn Model, West-First, and DyXY.

1.3.2.1 Turn Model

  • A method for designing deadlock-free routing algorithms.
  • How it works: by prohibiting certain 90° turns, it ensures packets never form cyclic channel dependencies.

Common Turn Models

| Routing Type | Forbidden Turns | Why It Works |
| --- | --- | --- |
| West-First | Turns into the West direction | Westward hops must come first, so no cycle can close |
| North-Last | Turns out of the North direction | Northward hops must come last, guaranteeing deadlock-free paths |
| Negative-First | Turns from a positive (East/North) into a negative (West/South) direction | Negative-direction hops come first, preventing cyclic dependencies |

✅ The Turn Model ensures deadlock-free routing without requiring virtual channels!

1.3.2.2 West-First Routing (A Special Turn Model)

Rule of thumb: complete all westward movement first; after that, the packet may route adaptively in the remaining directions.

💡 What is West-First Routing?

  • A turn-model routing algorithm that avoids deadlocks.
  • Packets are never allowed to turn into the West direction mid-route.
    • Consequence: if the destination lies West of the source, the packet must move West first, before any other movement.
  • After the westward phase (if needed), the packet can route freely in the remaining directions.

🖼️ Visualizing West-First Routing in a 4×4 NoC. Assume a 4×4 mesh topology where a packet must move from (3,3) to (0,0). 🔹 Allowed moves: West first, then standard XY-routing. 🔴 Forbidden moves: no turns toward the West after the westward phase.

🚀 Example 1: West-First Routing in Action

| Step | Router Position (X, Y) | Movement |
| --- | --- | --- |
| Start | (3,3) | Packet begins here |
| Move West | (2,3) | ✅ Moves left (West) |
| Move West | (1,3) | ✅ Moves left (West) |
| Move West | (0,3) | ✅ Moves left (West); westward phase completed |
| Move South | (0,2) | ✅ XY-routing starts |
| Move South | (0,1) | ✅ Moves down (South) |
| Move South | (0,0) | ✅ Reached destination |

📝 Rule check: ✔️ West movement completed before moving South. ✔️ No westward turns after the westward phase.

✅ Key Takeaway:

  • If the destination is West of the source, the packet must go West FIRST before making any other movement.



💡 Why West-First Routing?

  • Deadlock prevention: packets can never turn into the West direction, so cyclic channel dependencies cannot form.
  • Simple and efficient: only westward turns are restricted, so the routing logic stays minimal.

Example Implementation of West-First Routing

#include <iostream>

struct Packet {
    int src_x, src_y;
    int dest_x, dest_y;
};

void west_first_routing(Packet p) {
    int x = p.src_x, y = p.src_y;

    // Step 1: Complete all westward movement first (if needed)
    while (x > p.dest_x) {
        x--;
        std::cout << "Moving West to (" << x << ", " << y << ")\n";
    }

    // Step 2: Route freely in the remaining (non-West) directions
    while (x < p.dest_x) {
        x++;
        std::cout << "Moving East to (" << x << ", " << y << ")\n";
    }
    while (y != p.dest_y) {
        y += (p.dest_y > y) ? 1 : -1;
        std::cout << "Moving in Y direction to (" << x << ", " << y << ")\n";
    }
}

int main() {
    Packet p = {2, 2, 0, 3};
    west_first_routing(p);
    return 0;
}

✅ Why It Works

  • All westward movement happens first, before XY-routing.
  • Prevents cyclic dependencies → no deadlocks!

✅ When to Use

  • Works well for low-latency NoC designs.
  • Used in AI accelerators & GPUs to optimize tensor movement.

1.3.2.3 Dynamic XY Routing (DyXY)

💡 What is DyXY (Dynamic XY) Routing?

  • A congestion-aware adaptive routing technique.
  • Unlike XY-routing, DyXY allows flexible routing by checking congestion levels before deciding the path (it adjusts dynamically).
  • Usually used in high-performance AI accelerators and GPUs.
    • AMD's Infinity Fabric (used in Ryzen and MI300 AI accelerators) uses adaptive DyXY-style routing to balance memory & core communication.
      • Why? Unlike CPUs, GPUs have bursty and unpredictable traffic, requiring adaptive routing to maintain low latency.

🔹 Key Rules: 1️⃣ Move in the X-direction first (like XY-routing). 2️⃣ If congestion is detected on the Y-path, choose an alternative Y-path. 3️⃣ Continue toward the destination in the least congested direction.

DyXY is still mostly deterministic but introduces adaptive Y-routing to mitigate congestion.

✅ Why is DyXY better than XY-routing?

  • XY-routing is deterministic → can suffer from congestion.
  • DyXY adapts to congestion → more balanced NoC utilization.

💡 Example situations in a NoC handled by DyXY:

| NoC Event | DyXY Decision |
| --- | --- |
| High congestion detected in X-direction | Moves in Y-direction instead |
| Southward link underutilized | Uses that path dynamically |
| All paths congested | Triggers non-minimal routing |

Example: DyXY Routing vs. XY-Routing

Scenario: A packet moves from (1,1) → (3,3). XY-routing always follows (1,1) → (3,1) → (3,3). If (3,1) is congested, DyXY may reroute via (1,1) → (1,3) → (3,3).

How does DyXY improve performance?

→ Avoids congested paths dynamically. → More efficient for high-throughput NoCs.

Code: DyXY Routing

#include <iostream>
#include <cstdlib> // For random congestion simulation
#include <ctime>   // For seeding rand()

struct Packet {
    int src_x, src_y;
    int dest_x, dest_y;
};

// Simulated congestion check for the link toward (x, y)
bool is_congested(int x, int y) {
    return rand() % 2; // Random congestion (50% chance)
}

void dyxy_routing(Packet p) {
    int x = p.src_x, y = p.src_y;

    while (x != p.dest_x || y != p.dest_y) {
        int dx = (p.dest_x > x) ? 1 : -1;
        if (x != p.dest_x && !is_congested(x + dx, y)) {
            x += dx;                       // X-first when the next X-link is free
        } else if (y != p.dest_y) {
            y += (p.dest_y > y) ? 1 : -1;  // otherwise make progress in Y
        } else {
            x += dx;                       // Y finished: take the congested X-link anyway
        }
        std::cout << "Moving to (" << x << ", " << y << ")\n";
    }
}

int main() {
    srand(time(0)); // Seed random congestion
    Packet p = {1, 1, 3, 3};
    dyxy_routing(p);
    return 0;
}

1.3.2.4 Valiant's Routing

Valiant's Routing is a non-minimal, load-balancing routing algorithm used in packet-switched networks and NoC architectures. It was first proposed by Leslie Valiant as a way to improve network throughput and balance traffic loads dynamically.

✅ Key Features of Valiant's Routing: 1️⃣ Non-minimal routing → packets may take longer detours to avoid congestion. 2️⃣ Load balancing → distributes packets more evenly across the network. 3️⃣ Resilient to hot spots → reduces contention and prevents congestion build-up.

📌 How Valiant's Routing Works: Instead of sending packets along the shortest path, Valiant's Routing introduces a randomly chosen intermediate destination before reaching the final destination.

🔹 Step-by-Step Process: 1️⃣ Step 1: choose a random intermediate node (X', Y') uniformly distributed in the network. 2️⃣ Step 2: route from source → (X', Y') using shortest-path routing. 3️⃣ Step 3: route from (X', Y') → destination using shortest-path routing.

🖼️ Example of Valiant's Routing in a 4×4 Mesh NoC. Scenario: a packet needs to go from (0,0) → (3,3).

🔹 XY-Routing (minimal path): (0,0) → (1,0) → (2,0) → (3,0) → (3,1) → (3,2) → (3,3). ✅ Shortest path, but can cause congestion.

🔹 Valiant's Routing (random intermediate hop at (1,2)): (0,0) → (1,0) → (1,1) → (1,2) → (2,2) → (3,2) → (3,3). ✅ The detour helps avoid congestion.

🔹 Valiant's Routing (random intermediate hop at (2,3)): (0,0) → (1,0) → (2,0) → (2,1) → (2,2) → (2,3) → (3,3). ✅ Another alternative path that prevents congestion.

📌 Why Use Valiant's Routing in NoC? 💡 Problem: hot-spot congestion in deterministic routing. XY-Routing and other deterministic algorithms always take the same shortest path; if multiple packets use the same route, bottlenecks (hot spots) form in the network. ✅ Solution: Valiant's Routing distributes traffic. By randomizing paths, it balances network load and reduces congestion in high-traffic regions.

✅ Used in: high-performance AI accelerators (e.g., Google TPU, Cerebras Wafer-Scale Engine) and HPC networks such as InfiniBand.

📌 Valiant's Routing vs. Other Routing Algorithms

| Routing Type | Path Selection | Congestion Avoidance | Latency | Use Case |
| --- | --- | --- | --- | --- |
| XY-Routing | Fixed, shortest path | ❌ No | ✅ Low | Simple NoCs |
| West-First | Restricted turns | ✅ Yes | ✅ Low | Deadlock prevention |
| DyXY Adaptive | Dynamic, based on congestion | ✅ Yes | ✅ Low | AI workloads |
| Valiant's Routing | Randomized intermediate hop | ✅✅ Best | ❌ Higher | HPC, AI accelerators |

✅ Key Takeaway

  • Valiant's Routing is slower but prevents congestion best.
  • Great for large-scale AI accelerators and HPC networks.

1.3.3 Cutting-Edge NoC Routing Algorithms

Future NoCs will likely integrate AI-driven routing algorithms, in hardware or software, to dynamically adjust paths in real time, optimizing for AI accelerator workloads. Here are a few recent examples:

| Algorithm | Description | Advantage |
| --- | --- | --- |
| Deep Reinforcement Learning (DRL) Routing | Uses AI to learn the best NoC routing paths | Self-optimizing; adjusts to dynamic traffic |
| Hierarchical NoC Routing | Uses multi-level routing (global/local paths) | Reduces congestion in large NoCs |
| Photonic NoC Routing | Uses light-based interconnects | Ultra-low latency, low power |

1.3.3.1 Machine Learning-Based Adaptive Routing

💡 Why Is This Important? AI-driven NoC architectures now use deep reinforcement learning (DRL) to optimize routing decisions. DRL-based routers predict congestion patterns and adapt before congestion happens.

1.4 Flow Control Mechanisms in NoC

| Type | Description | Used in/for |
| --- | --- | --- |
| Wormhole Switching | Packets broken into flits, moving one flit at a time | AI accelerators, GPUs |
| Virtual Channels (VCs) | Multiple logical channels on the same physical link | Reducing congestion |
| Credit-Based Flow Control | Buffer space allocated before sending | Deadlock-free designs |

1.4.1 Virtual Channel Routing

💡 What is Virtual Channel (VC) Routing?

  • Multiple logical channels share a single physical link.
  • Packets are separated into multiple virtual lanes, preventing head-of-line (HoL) blocking.
  • Each VC has its own buffer & flow control → packets can bypass congested packets in another VC.

🖼️ How Virtual Channels Work. 🔹 Example: a 2×2 NoC where packets from different cores travel in the same direction.

  • Without VCs → if one packet is blocked, all others must wait.
  • With VCs → packets use different logical lanes → no waiting!

✅ Why Virtual Channels Matter

  • Prevent congestion & deadlocks.
  • Improve throughput & resource utilization.
  • Used in NVIDIA & AMD GPUs for AI workloads.

❓ Virtual Channels vs. Adaptive Routing

| Metric | Virtual Channels | Adaptive Routing |
| --- | --- | --- |
| Latency | ✅ Lower (avoids blocking) | ❌ Higher under light load |
| Throughput | ✅ Higher (parallelism) | ✅ Higher (load balancing) |
| Congestion Handling | ✅ Good | ✅ Best |

✅ Key Takeaway

  • Virtual Channels prevent HoL blocking and are great for regular traffic.
  • Adaptive Routing balances load and is better for dynamic AI workloads.

1.4.1.1 Express Virtual Channels (EVCs)

💡 Why Is This Important?

  • EVCs bypass intermediate routers, reducing latency for high-priority packets.
  • Used in Google TPUs and NVIDIA GPUs to prioritize AI tensor movement.
  • Instead of routing each flit through every router in a Mesh NoC, EVCs allow packets to skip unnecessary intermediate hops.

🔹 Comparison:

| Feature | Standard Virtual Channels (VCs) | Express Virtual Channels (EVCs) |
| --- | --- | --- |
| Path Selection | Stop at every router | Skip intermediate routers |
| Latency | Higher | Lower |
| Best Use Case | General packet traffic | High-priority AI workloads |

✅ Where Is This Used?

  • Tensor data movement in AI accelerators.
  • High-speed interconnects for GPUs (NVLink, Infinity Fabric).

1.4.2 Wormhole Switching

💡 How It Works: Wormhole switching splits packets into small flow-control units (flits) and buffers only a few flits per router, unlike traditional store-and-forward packet switching.

Packets are divided into flits: 1️⃣ Header Flit: reserves the path and determines routing. 2️⃣ Body Flits: carry the actual data and follow the header. 3️⃣ Tail Flit: marks the end of the packet and releases the path.

→ Each router stores only a few flits at a time instead of buffering entire packets. → This significantly reduces memory usage and makes the design area- and power-efficient.

Advantages of Wormhole Switching: ✅ Low Buffer Requirements: needs only a few flits of buffer per router. ✅ Low Latency: flits start moving as soon as a path is available. ✅ Power Efficient: less buffer storage means lower dynamic power consumption.

Disadvantages of Wormhole Switching: ❌ Head-of-Line (HoL) Blocking: if the header flit is blocked, all following flits get stuck. ❌ Deadlock Possibility: if multiple packets wait for each other, the network can freeze. ❌ Limited Throughput: a single congested packet can block flits from multiple other packets.

📌 Real-World Example: AI workloads (like transformers in LLMs) involve massive data transfers between compute cores; wormhole switching ensures efficient low-latency movement.

🔹 Where HoL Blocking Comes From: 1️⃣ Step 1: a flit-based switch holds multiple flits per buffer. 2️⃣ Step 2: if one flit gets blocked, all flits behind it must wait, even if they belong to different packets. 3️⃣ Step 3: VCs partially solve this by separating competing packets into independent buffers.

✅ Why This Matters: even with VCs, a blocked VC still causes HoL blocking within that virtual channel.

1.5 NoC Congestion Control & Performance Metrics

1.5.1 What Causes Congestion in NoC?

💡 Definition of NoC Congestion: Congestion occurs when packets experience excessive delays due to buffer overflow, link contention, or lack of routing flexibility. When congestion builds up in one part of the NoC, it can cause head-of-line (HoL) blocking, reducing system efficiency.

🔹 Key Causes of Congestion

| Cause | Explanation | Example |
| --- | --- | --- |
| Hotspots | Too many packets target the same router or link | AI workloads causing localized memory traffic congestion |
| Limited Buffer Size | Routers cannot store all incoming packets | Buffers fill up, causing packet drops |
| Poor Routing Decisions | Fixed routing paths overload some links | XY-routing in a Mesh topology without adaptive routing |
| Lack of Virtual Channels | Packets compete for the same physical link | Only one packet moves at a time, blocking others |

1.5.2 Techniques for NoC Congestion Control

1️⃣ Virtual Channels (VCs): split a physical link into multiple logical lanes so packets can bypass blocked traffic, preventing head-of-line (HoL) blocking by letting different packets move in parallel. ✅ Example: in a 2×2 NoC where multiple packets travel in the same direction, one blocked packet stops all others without VCs; with VCs, other packets use separate virtual lanes, reducing delays.

2️⃣ Adaptive Routing: instead of following a fixed shortest path (like XY-routing), adaptive routing chooses less congested paths dynamically, reducing congestion hotspots by distributing traffic more evenly. ✅ Example: if (1,0) is congested, an adaptive router may reroute through (0,1) instead.

3️⃣ Express Links: skip intermediate routers by providing direct long-range links, reducing latency and congestion in large NoCs. ✅ Example: a Mesh NoC with express links allows packets to jump multiple hops instead of routing through every intermediate node.

4️⃣ Credit-Based Flow Control: before sending a packet, the sender checks whether the receiver has buffer space available, preventing buffer overflow and packet drops. ✅ Example: a router doesn't send more packets than the next router can handle, avoiding buffer congestion.

5️⃣ Priority-Based NoC Routing: some packets (like AI inference requests) are prioritized over background traffic, ensuring real-time tasks meet deadlines while keeping congestion under control.

1.5.3 NoC Performance Metrics

To optimize NoC performance, we measure key metrics:

1️⃣ Latency. Definition: the time a packet takes to travel from source to destination, measured in cycles per hop or nanoseconds. Lower latency = better performance.

2️⃣ Throughput. Definition: the total number of packets delivered per cycle. Higher throughput = better NoC efficiency.

3️⃣ Power Consumption. Definition: the total energy consumed by NoC communication. Reducing power is critical for AI accelerators and mobile chips.

✅ Techniques to Reduce Power:

  • Power-gating inactive routers.
  • Reducing NoC voltage dynamically based on workload.

4️⃣ Buffer Occupancy. Definition: how full NoC router buffers are over time. Higher occupancy = higher congestion risk.


2. Performance Modeling & Simulation of NoC

2.1 Building Performance Models for NoC

Performance modeling is essential for designing efficient Network-on-Chip (NoC) architectures. A comprehensive NoC model typically includes:

  1. Topology Model: Representation of the network structure
  2. Router Model: Latency and throughput characteristics
  3. Traffic Model: Workload patterns and communication demands
  4. Power Model: Energy consumption estimates

2.2 Some NoC Simulators

BookSim

🔥 What is BookSim?

  • A cycle-accurate NoC simulator developed at Stanford University.
  • Used for AI accelerators & chip design exploration.
  • Supports various topologies (Mesh, Torus, Fat Tree).

📌 BookSim's Key Features: ✅ models packet-level routing, flow control, and congestion. ✅ simulates wormhole, virtual-channel, and adaptive routing. ✅ evaluates latency, bandwidth, and power for different NoC designs.

🔥 Real-World Use Case: BookSim in AI Hardware. 🔹 Researchers use it to optimize NoC designs for AI workloads. 🔹 It is also used in industry to explore NoC designs for AI accelerators.

Garnet (Gem5)

🔥 What is Garnet (Gem5)?

  • An NoC simulator integrated into the Gem5 full-system simulator.
  • More detailed than BookSim → provides cycle-accurate CPU-GPU NoC modeling.

📌 Garnet's Key Features: ✅ simulates realistic multi-core CPU-GPU interconnects. ✅ used for evaluating NoC performance in high-performance computing (HPC) systems.

🔥 Real-World Use Case: Garnet in HPC. 🔹 AMD and Intel use Garnet for full-chip NoC simulations before fabricating data center chips. 🔹 Google uses it to model TPU interconnect performance.

➡️ Garnet can:

  • Analyze full-system CPU-GPU interconnect performance under real-world workloads, i.e., how CPU-GPU communication behaves under realistic load.
  • Analyze the impact of cache coherency and memory access on NoC performance.
  • Analyze how NoC latency affects AI model execution.

Important Differences between BookSim and Garnet

| Feature | BookSim | Garnet (Gem5) |
| --- | --- | --- |
| Focus | NoC routing & congestion analysis | Full-system CPU-GPU interconnect analysis |
| Level of Detail | Packet-level simulation | Cycle-accurate system-wide simulation |
| Best For | Comparing routing algorithms | Analyzing AI workloads on a full-system NoC |

✅ Why It Matters

➡️ Use BookSim to optimize NoC microarchitecture. ➡️ Use Garnet to study NoC behavior in AI accelerators.

3. Software Engineering & Concepts for NoC Simulators

3.1 Key Software Engineering Concepts for NoC Simulators

| Concept | Why It Matters for NoC Simulators |
| --- | --- |
| Modular Design | Each NoC component (router, buffer, link) should be independently testable & reusable. |
| Event-Driven Simulation | NoCs are modeled as event-driven systems (packets arrive, propagate, stall). |
| Concurrency & Parallelism | NoC simulators should leverage multi-threading for speed. |
| Abstraction Layers | Separate hardware modeling from routing logic to ensure flexibility. |
| Performance Optimization | Use efficient data structures (e.g., custom queues) for faster packet handling. |
| Profiling & Debugging | NoC simulators should include performance metrics & logging. |

3.2 Software Architecture of NoC Simulators

3.2.1 NoC Simulation Components

A typical NoC simulator consists of four major software components:

| Component | Role in NoC Simulation |
| --- | --- |
| Router Model | Implements buffering, flow control, routing, and arbitration. |
| Link Model | Represents NoC links (bandwidth, delay, contention effects). |
| Traffic Generator | Generates AI workload-based packet injections. |
| Performance Analyzer | Tracks latency, congestion, throughput, and power metrics. |

✅ Example: BookSim NoC Simulator Architecture

| Component | Role in NoC Simulation |
| --- | --- |
| Router Class (Router.cpp) | Implements virtual channels, input buffers, & routing logic. |
| Traffic Class (TrafficManager.cpp) | Injects synthetic or real workload-based traffic. |
| Simulation Engine (Simulator.cpp) | Executes event-driven simulation steps. |

3.2.2 Event-Driven NoC Simulation

Most NoC simulators use an event-driven simulation model instead of traditional step-by-step execution.

✅ How Event-Driven Simulation Works in NoC: 1️⃣ Each NoC event (packet arrival, buffer stall, routing decision) is scheduled on an event queue. 2️⃣ The event scheduler processes events in time order (priority queue). 3️⃣ Events are processed only when necessary, improving performance.

💡 Example: Event Handling in NoC Simulation (Pseudocode)

#include <functional>
#include <queue>

struct Event {
    int timestamp;
    std::function<void()> action;

    // Order events so the earliest timestamp is processed first
    // (reversed comparison: std::priority_queue is a max-heap by default)
    bool operator<(const Event& other) const {
        return timestamp > other.timestamp;
    }
};

// Event Queue (Priority Queue Sorted by Timestamp)
std::priority_queue<Event> eventQueue;

// Add an event to the NoC simulation
void scheduleEvent(int time, std::function<void()> action) {
    eventQueue.push({time, action});
}

// NoC Simulation Engine
void runSimulation() {
    while (!eventQueue.empty()) {
        auto event = eventQueue.top();
        eventQueue.pop();
        event.action();  // Execute event
    }
}

3.2.3 Key C++ Features for NoC Simulators

| Feature | Why It Matters for NoC Simulators |
| --- | --- |
| std::optional (C++17) | Avoids uninitialized packet states. |
| std::variant (C++17) | Used for flexible routing-decision storage. |
| std::any (C++17) | Allows generic NoC event storage. |
| Parallel execution policies, e.g. std::execution::par (C++17) | Enable multithreaded algorithms in NoC simulations. |
| Coroutines (C++20) | Allow event-driven NoC scheduling using async packet handling. |
| std::span (C++20) | Efficiently manages flit buffers. |
| std::bit_cast (C++20) | Fast conversion between flits and integers for NoC encoding. |

3.2.4 Using C++ Coroutines for NoC Event Handling

🔹 Problem: traditional NoC simulators handle packet transmission using blocking function calls, leading to inefficiencies.

🔹 Solution: use C++20 coroutines to make packet transmission non-blocking and event-driven.

✅ Example: NoC Packet Transmission Using Coroutines

#include <coroutine>
#include <iostream>

// Coroutine-based NoC Event
struct NoCEvent {
    struct promise_type {
        NoCEvent get_return_object() { return {}; }
        std::suspend_always initial_suspend() { return {}; }
        std::suspend_always final_suspend() noexcept { return {}; }
        void return_void() {}
        void unhandled_exception() {}  // required by the coroutine machinery
    };
};

// Simulated NoC Packet Transmission (Async)
NoCEvent transmitPacket(int packet_id) {
    std::cout << "Packet " << packet_id << " transmitted." << std::endl;
    co_return;
}

int main() {
    // Creates the coroutine in a suspended state; a scheduler would resume it
    auto event = transmitPacket(42);
    return 0;
}

✅ Why Use Coroutines for NoC Simulation? ✔️ Avoids blocking execution while waiting for packet arrivals. ✔️ Enables asynchronous event scheduling (no need for polling loops).

3.2.5 Parallelism & Concurrency in NoC Simulators

Modern NoC simulators must handle thousands of packets simultaneously. This requires efficient multithreading and parallel processing.

🚀 Parallel NoC Simulation Using std::execution: C++17 introduced parallel execution policies to enable multi-threaded NoC event processing.

✅ Example: NoC Parallel Packet Processing

#include <vector>
#include <algorithm>
#include <execution>
#include <iostream>

int main() {
    std::vector<int> packets = {1, 2, 3, 4, 5};

    // Process packets in parallel (order of output lines is unspecified)
    std::for_each(std::execution::par, packets.begin(), packets.end(), [](int packet) {
        std::cout << "Processing packet: " << packet << std::endl;
    });
    return 0;
}

✅ Why Use std::execution::par? ✔️ Automatically distributes packet processing across multiple CPU cores. ✔️ Reduces simulation runtime for large NoC workloads.

📌 Performance Optimization Techniques for NoC Simulators: efficient NoC simulators must be optimized for performance.

🚀 NoC Simulator Performance Bottlenecks & Solutions

| Bottleneck | Optimization Technique |
| --- | --- |
| Slow event processing | Use an event-driven execution model (instead of polling). |
| Inefficient buffer allocation | Use std::span to manage flit buffers efficiently. |
| High memory usage | Use custom memory pools instead of heap allocations. |
| Cache misses in routing logic | Prefetch routing tables using SIMD-friendly data structures. |

✅ Example: Optimized NoC Packet Queue Using std::deque

#include <deque>
#include <iostream>

std::deque<int> packetQueue;  // Optimized for FIFO access

void enqueuePacket(int packet) {
    packetQueue.push_back(packet);  // Fast O(1) insert
}

void processNextPacket() {
    if (!packetQueue.empty()) {
        int packet = packetQueue.front();
        packetQueue.pop_front();  // Fast O(1) removal
        std::cout << "Processing packet: " << packet << std::endl;
    }
}

✅ Why Use std::deque for NoC Buffers? ✔️ O(1) enqueue at the back and dequeue at the front (std::vector's front removal is O(n)). ✔️ Low memory fragmentation.

4. AI Workload Analysis

4.1 Why Do AI Workloads Stress the NoC?

💡 Key Reasons:

  • Massive Data Movement: AI workloads require huge amounts of tensor data movement between processing cores, memory, and accelerators.
  • Irregular Traffic Patterns: unlike traditional HPC workloads, AI tasks exhibit bursty, unpredictable traffic.
  • High Memory Bandwidth Needs: training and inference require high-speed interconnects to access DRAM, HBM, or cache.
  • Synchronization Overhead: AI workloads depend on synchronized execution across multiple cores/GPUs, leading to NoC congestion and stalls.

✅ Impact on NoC:

| AI Workload Issue | NoC Bottleneck |
| --- | --- |
| Large tensor data movement | High bandwidth demand |
| Unpredictable bursty traffic | NoC congestion & packet loss |
| Cache coherence across compute cores | Increased latency |
| Need for synchronization | Increased communication overhead |

4.2 How Do Different AI Workloads Impact the NoC?

| AI Model Type | NoC Stress Pattern | Bottleneck | Example Models |
| --- | --- | --- | --- |
| Transformer Models (LLMs) | High memory bandwidth, all-to-all communication | NoC congestion due to massive attention-matrix communication | GPT-4, Llama, Falcon |
| CNNs (Convolutional Neural Networks) | Localized data movement, structured traffic | Memory latency & NoC hotspot formation | ResNet, EfficientNet |
| Graph Neural Networks (GNNs) | Irregular, sparse traffic patterns | Poor NoC utilization & congestion in specific regions | GCN, GAT |
| Diffusion Models | Burst-mode traffic, unpredictable spikes | Cache misses & NoC jitter | Stable Diffusion, DALL-E, MidJourney |

✅ Key Observations:

  • LLMs stress the NoC the most due to their massive attention-matrix multiplications.
  • CNNs are structured and predictable, but require optimized memory access.
  • GNNs have sparse and irregular traffic, leading to load imbalance on the NoC.

4.3 Key NoC Bottlenecks for AI Workloads

4.3.1 High Memory Bandwidth Requirements

πŸ”Ή Why?

  • AI models, especially LLMs, require terabytes of DRAM/HBM bandwidth.
  • The NoC must provide fast access to weight matrices & activations without bottlenecks.

βœ… Example: AI Accelerator Memory Access (TPU, MI300, Hopper)

| Component | Memory Bandwidth |
|---|---|
| NVIDIA Hopper H100 (HBM3) | 4.9 TB/s |
| Google TPU v4 | 2.2 TB/s |
| AMD MI300 | 5.2 TB/s |

πŸ’‘ What happens if NoC bandwidth is insufficient?

❌ Memory stalls β†’ The AI pipeline stalls, waiting for data.
❌ High NoC congestion β†’ Increased packet drop rates and latency.

βœ… Solution: NoC Optimizations

βœ”οΈ High-bandwidth interconnects (NVLink, Infinity Fabric).
βœ”οΈ Efficient NoC routing (Valiant’s routing, adaptive DyXY).
βœ”οΈ HBM-aware NoC design for direct memory access.

4.3.2 Bursty and Unpredictable Traffic Patterns

πŸ”Ή Why?

  • Unlike HPC workloads, AI workloads generate bursts of traffic when layers process large tensors.
  • Transformer models create unpredictable spikes in NoC traffic during attention calculations.

βœ… Example: Burst Traffic in AI Training

| Time (ms) | Traffic Pattern |
|---|---|
| 0–50 | Low data movement |
| 50–60 | High burst (attention computation) |
| 60–80 | Medium data flow |
| 80–90 | Another burst (MLP layer computation) |

πŸ’‘ What happens if NoC cannot handle burst traffic?

❌ High packet loss β†’ Tensor updates get delayed.
❌ Increased NoC congestion β†’ Slows down AI training/inference.

βœ… Solution: NoC Optimizations

βœ”οΈ Adaptive routing (dynamic DyXY, reinforcement-learning-based routing).
βœ”οΈ Express Virtual Channels (EVCs) to bypass congested routers.
βœ”οΈ Priority-based NoC scheduling to give preference to critical AI tasks.

4.3.3 Synchronization Overhead in AI Training

πŸ”Ή Why?

  • AI workloads run on distributed compute units (e.g., GPUs, TPUs).
  • NoC must synchronize tensor updates across all cores.

βœ… Example: AI Model Synchronization

| Operation | Communication Overhead |
|---|---|
| Forward Pass | Minimal NoC traffic (mostly read-heavy) |
| Backward Pass (Gradients) | Heavy NoC traffic (large tensor updates) |
| Parameter Update (SGD, Adam) | Massive synchronization needed |

πŸ’‘ What happens if NoC cannot synchronize efficiently?

❌ Deadlocks in the NoC β†’ Model training halts.
❌ Increased power consumption due to excessive message passing.

βœ… Solution: NoC Optimizations

βœ”οΈ Hierarchical NoC topologies to reduce synchronization latency.
βœ”οΈ Priority-based packet scheduling for gradient updates.
βœ”οΈ Smart data compression techniques (quantization, sparsity).

4.3.4 AI-Specific Traffic Patterns: All-to-All Communication

πŸ”Ή Why?

  • Transformer-based models require massive matrix multiplications across multiple GPUs/TPUs.
  • AI accelerators use all-to-all NoC communication, leading to congestion.

βœ… Example: NoC Traffic in Transformer Models

| Layer | Data Movement |
|---|---|
| Self-Attention | All-to-All (high NoC congestion) |
| Feedforward (MLP) | Many-to-One (moderate NoC usage) |
| Output Layer | One-to-All (medium NoC traffic) |

πŸ’‘ What happens if NoC cannot handle all-to-all traffic?

❌ Extreme congestion at AI accelerator interconnects.
❌ Limited scalability as AI models grow (GPT-5, Llama 3, etc.).

βœ… Solution: NoC Optimizations

βœ”οΈ Hybrid topologies (Torus + Express Links) to reduce hops.
βœ”οΈ Optical/wireless NoC for ultra-low-latency interconnects.
βœ”οΈ Dynamic workload-aware routing to balance NoC traffic.

πŸ“Œ Final Takeaways

πŸ”Ή Modern AI workloads stress the NoC due to high bandwidth demand, unpredictable traffic, and synchronization overhead.
πŸ”Ή LLMs (GPT-4, Llama) create the most severe NoC congestion due to all-to-all tensor communication.
πŸ”Ή CNNs have structured traffic but require optimized memory access.
πŸ”Ή NoCs must evolve with smarter routing, better flow control, and high-speed interconnects.

Appendix

Queueing Theory

| Concept | Definition | NoC Relevance |
|---|---|---|
| Little’s Law | L = Ξ»W (number of items = arrival rate Γ— wait time) | Predicts NoC buffer occupancy |
| M/M/1 Queue | Single server, exponential arrival & service | Models packet delays in NoC routers |
| M/G/1 Queue | General service distribution | More accurate for NoC bursty traffic |

πŸ”₯ Little’s Law

πŸ”Ή Formula:

L = λW

Where:

  • L = Average number of packets in the system
  • Ξ» = Arrival rate of packets
  • W = Average time a packet spends in the system

βœ… Example Application in NoC: If packets arrive at 5 per cycle and spend 4 cycles in the NoC, then:

L = 5 × 4 = 20

This means 20 packets are present in the NoC at any given time.
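This arithmetic is trivial to encode (`littlesLaw` is a hypothetical helper name):

```cpp
// Little's Law: average number in system L = arrival rate lambda * wait time W.
double littlesLaw(double lambda, double w) {
    return lambda * w;
}
```

littlesLaw(5.0, 4.0) returns 20.0, matching the example; the same relation rearranged (W = L/Ξ») lets you infer latency from a measured buffer occupancy.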

πŸ”₯ Poisson Distribution (ELI5)

  • Describes random arrival events over time.
  • Used in NoC traffic modeling to simulate bursty vs. smooth packet arrivals.

βœ… Example: β†’ If packets arrive at an average rate of 10 per second, a Poisson model predicts how likely it is to receive 15 or 20 packets in the next second. β†’ Real-world application: Models random memory accesses in NoC routers.
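That prediction can be estimated by Monte Carlo sampling with the standard library's std::poisson_distribution (a sketch; `probAtLeast` is a hypothetical helper, and the fixed seed is only for reproducibility):

```cpp
#include <random>

// Estimate P(arrivals >= k) in one interval for a Poisson(mean) arrival
// process by sampling; mean = 10 matches the example above.
double probAtLeast(double mean, int k, int trials = 100000) {
    std::mt19937 gen(42);  // fixed seed so runs are reproducible
    std::poisson_distribution<int> arrivals(mean);
    int hits = 0;
    for (int i = 0; i < trials; ++i)
        if (arrivals(gen) >= k) ++hits;
    return static_cast<double>(hits) / trials;
}
```

With a mean of 10, the probability of seeing 15 or more arrivals in one interval comes out to roughly 8%, which is exactly the kind of tail a NoC buffer has to absorb.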

πŸ”₯ M/M/1 Queue (NoC Packet Queues) πŸ’‘ What is M/M/1? A single-server queueing model used to analyze NoC router delays.

M/M/1 means:

  • M (Memoryless Poisson Arrivals)
    • Markovian arrival where packets arrive randomly following a Possion process.
  • M (Memoryless Exponential Service)
    • Markovian service where the time between packet processing is exponentially distributed.
  • 1 (Single NoC router queue)

βœ… Example: If a router receives packets at a rate of 10 per cycle but can only process 8 per cycle, a queue forms and grows without bound (the system is unstable, since the arrival rate exceeds the service rate).

βœ… Why M/M/1 Matters for NoC?

Predicts NoC delays based on packet arrival rate. Helps optimize buffer size & flow control.
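These predictions come from the standard steady-state M/M/1 formulas: utilization ρ = Ξ»/ΞΌ, mean number in system L = ρ/(1 βˆ’ ρ), and mean time in system W = 1/(ΞΌ βˆ’ Ξ»), valid only when Ξ» < ΞΌ. A minimal sketch (`mm1` is a hypothetical helper; Ξ» = 6, ΞΌ = 8 are illustrative rates, chosen stable unlike the Ξ» = 10, ΞΌ = 8 example above):

```cpp
#include <stdexcept>

// Steady-state M/M/1 metrics; a steady state exists only when lambda < mu.
struct MM1 {
    double utilization;  // rho = lambda / mu
    double avgInSystem;  // L = rho / (1 - rho)
    double avgTime;      // W = 1 / (mu - lambda)
};

MM1 mm1(double lambda, double mu) {
    if (lambda >= mu) throw std::invalid_argument("unstable: lambda >= mu");
    double rho = lambda / mu;
    return {rho, rho / (1.0 - rho), 1.0 / (mu - lambda)};
}
```

For Ξ» = 6, ΞΌ = 8: ρ = 0.75, L = 3 packets, W = 0.5 cycles, and Little's Law checks out (L = Ξ»W = 6 Γ— 0.5 = 3).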

πŸ”₯ All Queue Types

| Queue Type | Arrival Process | Service Process | Servers | Example Use Case |
|---|---|---|---|---|
| M/M/1 | Poisson (random) | Exponential (random) | 1 | NoC router queue |
| M/M/c | Poisson | Exponential | c servers | Multi-core CPU processing requests |
| M/G/1 | Poisson | General (any distribution) | 1 | NoC traffic bursts |
| G/M/1 | General (any distribution) | Exponential | 1 | Variable packet arrivals |
| M/D/1 | Poisson | Deterministic (fixed) | 1 | Real-time NoC packet scheduling |
| M/M/∞ | Poisson | Exponential | Infinite servers | Cloud computing load balancing |

πŸ“Œ Comparing M/M/1 vs. M/M/c

πŸ’‘ What is an M/M/c Queue? Same as M/M/1, but with c servers instead of 1. Used for multi-core CPUs & parallel packet processing in NoC.

βœ… Example: If an NoC router has 4 ports handling traffic, we can model it as M/M/4.

βœ… Why M/M/c? Reduces waiting time by serving packets in parallel. More realistic for AI accelerators & GPU interconnects.

πŸ“Œ Comparing M/M/1 vs. M/G/1

πŸ’‘ What is an M/G/1 Queue? Same as M/M/1, but service time follows ANY distribution (not just exponential). More accurate for NoC, since real-world NoC packets do not have perfectly exponential service times.

βœ… Example: If a NoC router processes tensor data from an AI workload, some packets take longer than others β†’ M/G/1 models this better.

βœ… Why M/G/1? More realistic for bursty NoC traffic. Used in AI interconnect modeling.

πŸ“Œ Comparing M/M/1 vs. M/D/1

πŸ’‘ What is an M/D/1 Queue? Same as M/M/1, but service times are fixed (deterministic). Used for real-time NoC scheduling where service time is constant.

βœ… Example: If an NoC router always takes exactly 2 cycles to process a packet, we use M/D/1 instead of M/M/1.

βœ… Why M/D/1? Used in hard real-time NoC systems (e.g., automotive chips). Predictable processing delays.
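The gap between M/M/1 and M/D/1 can be made concrete with the Pollaczek–Khinchine formula for mean queueing delay (time waiting, excluding service): with deterministic service the variance term vanishes, leaving exactly half the M/M/1 delay (a sketch; `wqMM1` and `wqMD1` are hypothetical helpers, valid for Ξ» < ΞΌ):

```cpp
// Mean waiting time in queue (excluding service), from Pollaczek-Khinchine.
// M/M/1 special case: Wq = rho / (mu * (1 - rho)).
double wqMM1(double lambda, double mu) {
    double rho = lambda / mu;
    return rho / (mu * (1.0 - rho));
}

// M/D/1 special case: deterministic service has zero variance,
// so Wq = rho / (2 * mu * (1 - rho)) -- exactly half of M/M/1.
double wqMD1(double lambda, double mu) {
    double rho = lambda / mu;
    return rho / (2.0 * mu * (1.0 - rho));
}
```

For Ξ» = 6 packets/cycle and ΞΌ = 8 packets/cycle, wqMM1 gives 0.375 cycles and wqMD1 gives 0.1875 cycles, which is why fixed-latency router pipelines queue noticeably less than the M/M/1 model predicts.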

πŸ“Œ Which Queue Model is Best for NoC?

| Use Case | Best Queue Model |
|---|---|
| General NoC packet queuing | M/M/1 |
| Parallel NoC processing (multi-core) | M/M/c |
| NoC with bursty packet arrivals | M/G/1 |
| Real-time NoC scheduling | M/D/1 |

βœ… Key Takeaway: NoC routers typically use M/M/1 or M/G/1, depending on traffic patterns.