Enjoy reading, please leave a comment or two! :)
Table of Contents Link to heading
- Table of Contents
- 1. Network-on-Chip (NoC) Architectures
- 2. Performance Modeling & Simulation of NoC
- 3. Software Engineering & Concepts for NoC Simulators
- 4. AI Workload Analysis
- Appendix
1. Network-on-Chip (NoC) Architectures Link to heading
💡 What is NoC (Network-on-Chip)? NoC is an on-chip communication network that connects different processing elements (CPUs, GPUs, AI accelerators, memory, etc.). It replaces traditional bus-based architectures, which become a bottleneck as chip complexity increases.
1.1 Circuit Switching vs. Packet Switching Link to heading
Feature | Circuit Switching | Packet Switching (Used in NoC) |
---|---|---|
Path Setup | Reserves a dedicated path before transmission | No prior setup; packets are routed dynamically |
Efficiency | Wastes bandwidth when idle | More efficient, utilizes links dynamically |
Scalability | Poor, limited by dedicated paths | Highly scalable for multi-core systems |
Usage in NoC? | ❌ Not used | ✅ Preferred |
✅ NoC must ensure low-latency, deadlock-free routing, and thus uses packet switching.
1.2 NoC Topologies Link to heading
1.2.1 Mesh Topology Link to heading
Structure: A 2D grid where each node connects to its immediate neighbors (north, south, east, west).
Characteristics:
- Scalability: Easily expandable by adding rows or columns.
- Routing Simplicity: Supports simple routing algorithms like XY-routing.
- Latency: Increases with the number of hops between distant nodes.
- Power Consumption: Generally lower due to shorter average wire lengths.
Use Cases: Common in systems where simplicity and scalability are prioritized over minimal latency.
1.2.2 Torus Topology Link to heading
Structure: Similar to Mesh but with additional wrap-around connections, linking edge nodes directly.
Characteristics:
- Reduced Latency: Wrap-around links decrease the maximum hop count between nodes.
- Increased Path Diversity: Multiple routing paths enhance fault tolerance.
- Higher Power Consumption: Additional links consume more power and increase design complexity.
Use Cases: Suitable for applications requiring lower latency and higher fault tolerance, accepting increased power usage.
1.2.3 Butterfly Topology Link to heading
Structure: A multi-stage, hierarchical network resembling a butterfly structure, often used in parallel computing.
Characteristics:
- Low Latency: Provides logarithmic path lengths relative to the number of nodes.
- Complex Routing: Requires sophisticated control logic for routing decisions.
- Scalability Limitations: Expansion increases complexity significantly.
Use Cases: Ideal for applications demanding high-speed data transfer with predictable latency, such as certain parallel processing tasks.
1.2.4 Recent Developments in NoC Topologies Link to heading
Advancements in NoC design focus on enhancing performance, scalability, and energy efficiency. Notable developments include:
- 3D NoC Architectures: Integrating vertical (z-axis) connections into traditional 2D layouts, reducing average hop count and latency.
- Circulant Topologies: Utilizing mathematical properties of circulant graphs to design D-dimensional NoCs, offering real-time guarantees and reduced communication latency.
- Wireless NoC (WiNoC): Incorporating wireless communication channels between chiplets to enhance performance and reduce latency.
- Dynamically Scalable NoC: Architectures that adjust interconnect resources at runtime based on workload demands, optimizing power consumption and performance.
- Open-Source High-Bandwidth NoCs: Developments like FlooNoC provide wide physical links and support for parallel multi-stream data, achieving high energy efficiency and bandwidth.
1.2.5 Comparative Analysis of NoC Topologies Link to heading
Topology | Scalability | Latency | Power Consumption | Complexity | Fault Tolerance |
---|---|---|---|---|---|
Mesh | High | Moderate | Low | Low | Moderate |
Torus | Moderate | Low | High | Moderate | High |
Butterfly | Low | Low | Moderate | High | Low |
3D NoC | High | Very Low | Moderate | High | High |
Circulant | High | Low | Moderate | High | High |
Wireless NoC | High | Very Low | Low | High | Moderate |
Dynamic Scalable | High | Adaptive | Low | High | High |
Note: The characteristics of each topology can vary based on specific design implementations and application requirements.
1.3 Routing Algorithms Link to heading
Types of Routing Algorithms
Type | Examples | Description |
---|---|---|
Deterministic | XY-routing | Always follows the same path, predictable but can cause congestion |
Adaptive | Turn Model, West-First, DyXY | Dynamically selects paths based on congestion or faults |
Minimal | XY-routing, West-First | Always chooses the shortest path |
Non-Minimal | Valiant's Routing | Uses detours to balance traffic |
1.3.1 Deterministic Routing: XY-Routing Link to heading
- Route first along the X-axis and then along the Y-axis.
- Usually used in Mesh and Torus topologies.
- Advantage: Simple and deadlock-free.
- Disadvantage: Can cause congestion in hot-spot areas.
A simple implementation is given below:
#include <iostream>
struct Packet {
int src_x, src_y;
int dest_x, dest_y;
};
void xy_routing(Packet p) {
int x = p.src_x, y = p.src_y;
while (x != p.dest_x) {
x += (p.dest_x > x) ? 1 : -1;
std::cout << "Moving to (" << x << ", " << y << ")\n";
}
while (y != p.dest_y) {
y += (p.dest_y > y) ? 1 : -1;
std::cout << "Moving to (" << x << ", " << y << ")\n";
}
}
int main() {
Packet p = {0, 0, 3, 2};
xy_routing(p);
return 0;
}
1.3.2 Adaptive Routing Techniques Link to heading
- Choose the best path based on the congestion
- Examples: Turn Model, West-First, DyXY
💡 Why Do We Need Special Routing Strategies? Deadlock occurs when packets are permanently blocked because each is waiting on another. Adaptive routing strategies are used to improve performance and reduce the probability of deadlock, though adaptivity alone cannot eliminate it; deadlock freedom comes from restricting which turns are allowed. The most common adaptive routing strategies are the Turn Model, West-First, and DyXY.
1.3.2.1 Turn Model Link to heading
- A method to design deadlock-free routing algorithms.
- How it works: By prohibiting certain 90° turns, it ensures packets never form cyclic dependencies.
Common Turn Models
Routing Type | Forbidden Turns | Why It Works? |
---|---|---|
West-First | No Westward turns | Ensures no cyclic dependencies |
North-Last | No Northward turns | Guarantees deadlock-free paths |
Negative-First | Negative-direction (west/south) hops must be taken first | Prevents cyclic packet dependencies |
✅ Turn Model ensures deadlock-free routing without requiring virtual channels!
1.3.2.2 West-First Routing (A Special Turn Model) Link to heading
West-First rule: make all Westward hops first; afterwards, turning West is forbidden.
💡 What is West-First Routing?
- A turn-model routing algorithm that avoids deadlocks.
- Packets are never allowed to turn West after their first non-West move.
- Exception: If the packet starts East of its destination, it must move West first before following the XY-routing method.
- After traveling West first (if needed), the packet can move freely.
🖼️ Visualizing West-First Routing in a 4×4 NoC Let's assume a 4×4 mesh topology, where a packet must move from (3,3) to (0,0). 🔹 Allowed Moves: West first, then standard XY-routing. 🔴 Forbidden Moves: No turns toward the West after the initial movement.
📌 Example 1: West-First Routing in Action
Step | Router Position (X, Y) | Movement |
---|---|---|
Start | (3,3) | Packet begins here |
Move West | (2,3) | ⬅️ Moves left (West) |
Move West | (1,3) | ⬅️ Moves left (West) |
Move West | (0,3) | ⬅️ Moves left (West) (West part completed) |
Move South | (0,2) | ⬇️ XY-routing starts |
Move South | (0,1) | ⬇️ Moves down (South) |
Move South | (0,0) | ✅ Reached destination |
📌 Rule Check: ✔️ West movement completed first before moving South. ✔️ No Westward turns after the first movement.
✅ Key Takeaway:
- If the destination is West of the source, the packet must go West FIRST before making any other movement.
Thanks LLM for the example explanation!
💡 Why West-First Routing?
- Deadlock prevention: Packets cannot turn West after their initial hops, so they can't form a cycle.
- Simple and efficient: In this example only two directions are needed (West and South).
Example Implementation of West-First Routing
#include <iostream>
struct Packet {
int src_x, src_y;
int dest_x, dest_y;
};
void west_first_routing(Packet p) {
int x = p.src_x, y = p.src_y;
// Step 1: Move West first (if needed)
while (x > p.dest_x) {
x--;
std::cout << "Moving West to (" << x << ", " << y << ")\n";
}
    // Step 2: Free XY-routing for the remaining hops (East first if needed, then Y)
    while (x < p.dest_x) {
        x++;
        std::cout << "Moving East to (" << x << ", " << y << ")\n";
    }
    while (y != p.dest_y) {
        y += (p.dest_y > y) ? 1 : -1;
        std::cout << "Moving in Y direction to (" << x << ", " << y << ")\n";
    }
}
int main() {
Packet p = {2, 2, 0, 3};
west_first_routing(p);
return 0;
}
✅ Why It Works?
- Always moves West first before doing XY-routing.
- Prevents cyclic dependencies → No deadlocks!
✅ When to Use?
- Works well for low-latency NoC designs.
- Used in AI accelerators & GPUs to optimize tensor movement.
1.3.2.3 Dynamic XY Routing (DyXY) Link to heading
💡 What is DyXY (Dynamic XY) Routing?
- A congestion-aware adaptive routing technique.
- Unlike XY-routing, DyXY allows flexible routing by checking congestion levels before deciding the path. (Dynamically adjusts)
- Usually used in high-performance AI accelerators and GPUs.
- Adaptive, congestion-aware routing of this kind is reportedly used in interconnects such as AMD's Infinity Fabric (Ryzen, MI300 AI accelerators) to balance memory & core communication.
- Why? Unlike CPUs, GPUs have bursty and unpredictable traffic, requiring adaptive routing to maintain low latency.
🔹 Key Rules 1️⃣ Move in the X-direction first (like XY-routing). 2️⃣ If congestion is detected on the Y-path, choose an alternative Y-path. 3️⃣ Continue toward the destination in the least congested direction.
DyXY is still mostly deterministic but introduces adaptive Y-routing to mitigate congestion.
✅ Why is DyXY better than XY-routing?
- XY-routing is deterministic → can suffer from congestion.
- DyXY adapts to congestion → more balanced NoC utilization.
💡 Example situations in a NoC, handled by DyXY:
NoC Event | DyXY Decision |
---|---|
High congestion detected in X-direction | Moves in Y-direction instead |
Southward link underutilized | Uses that path dynamically |
All paths congested | Triggers non-minimal routing |
Example: DyXY Routing vs. XY-Routing Link to heading
Scenario: A packet moves from (1,1) → (3,3). XY-routing always follows (1,1) → (3,1) → (3,3). If (3,1) is congested, DyXY may reroute via (1,1) → (1,3) → (3,3).
How does DyXY improve performance?
✔️ Avoids congested paths dynamically. ✔️ More efficient for high-throughput NoCs.
**Code: DyXY Routing:**
#include <iostream>
#include <cstdlib> // rand() for random congestion simulation
#include <ctime>   // time() to seed the random generator
struct Packet {
int src_x, src_y;
int dest_x, dest_y;
};
// Simulated congestion check (random, 50% chance) for illustration only
bool is_congested(int x, int y) {
    return rand() % 2;
}
void dyxy_routing(Packet p) {
    int x = p.src_x, y = p.src_y;
    while (x != p.dest_x || y != p.dest_y) {
        int x_step = (p.dest_x > x) ? 1 : -1;
        if (x != p.dest_x && !is_congested(x + x_step, y)) {
            x += x_step;                    // X-first when the next X-hop is clear
        } else if (y != p.dest_y) {
            y += (p.dest_y > y) ? 1 : -1;   // adaptive Y-step around congestion
        } else {
            x += x_step;                    // Y finished: take the X-hop despite congestion
        }
        std::cout << "Moving to (" << x << ", " << y << ")\n";
    }
}
int main() {
srand(time(0)); // Seed random congestion
Packet p = {1, 1, 3, 3};
dyxy_routing(p);
return 0;
}
1.3.2.4 Valiant's Routing Link to heading
Valiant's Routing is a non-minimal, load-balancing routing algorithm used in packet-switched networks and NoC architectures. It was first proposed by Leslie Valiant as a way to improve network throughput and balance traffic loads dynamically.
✅ Key Features of Valiant's Routing 1️⃣ Non-minimal routing → Packets may take longer detours to avoid congestion. 2️⃣ Load balancing → Distributes packets more evenly across the network. 3️⃣ Resilient to hot spots → Reduces contention and prevents congestion build-up.
📌 How Valiant's Routing Works Instead of sending packets along the shortest path, Valiant's Routing introduces a randomly chosen intermediate destination before reaching the final destination.
🔹 Step-by-Step Process 1️⃣ Step 1: Choose a random intermediate node (X', Y') uniformly distributed in the network. 2️⃣ Step 2: Route from source → (X', Y') using shortest-path routing. 3️⃣ Step 3: Route from (X', Y') → destination using shortest-path routing.
🖼️ Example of Valiant's Routing in a 4×4 Mesh NoC Scenario: Packet needs to go from (0,0) → (3,3)
🔹 XY-Routing (Minimal Path) (0,0) → (1,0) → (2,0) → (3,0) → (3,1) → (3,2) → (3,3) ✅ Shortest path, but can cause congestion.
🔹 Valiant's Routing (Random Intermediate Hop at (1,2)) (0,0) → (1,0) → (1,1) → (1,2) → (2,2) → (3,2) → (3,3) ✅ Detour helps avoid congestion.
🔹 Valiant's Routing (Random Intermediate Hop at (2,3)) (0,0) → (1,0) → (2,0) → (2,1) → (2,2) → (2,3) → (3,3) ✅ Another alternative path to prevent congestion.
📌 Why Use Valiant's Routing in NoC? 💡 Problem: Hot Spot Congestion in Deterministic Routing XY-Routing and other deterministic algorithms always take the same shortest path. If multiple packets use the same route, it causes bottlenecks (hot spots) in the network. ✅ Solution: Valiant's Routing Distributes Traffic By randomizing paths, it balances network load and reduces congestion in high-traffic regions.
✅ Reportedly used in: high-performance AI accelerators (e.g., Google TPU, Cerebras Wafer-Scale Engine) and HPC (High-Performance Computing) networks like InfiniBand.
📌 Valiant's Routing vs. Other Routing Algorithms
Routing Type | Path Selection | Congestion Avoidance | Latency | Use Case |
---|---|---|---|---|
XY-Routing | Fixed, shortest path | ❌ No | ✅ Low | Simple NoCs |
West-First | Restricted turns | ✅ Yes | ✅ Low | Deadlock prevention |
DyXY Adaptive | Dynamic based on congestion | ✅ Yes | ✅ Low | AI workloads |
Valiant's Routing | Randomized intermediate hop | ✅✅ Best | ❌ Higher | HPC, AI accelerators |
✅ Key Takeaway
- Valiant's Routing is slower but prevents congestion best.
- Great for large-scale AI accelerators and HPC networks.
1.3.3 Cutting-Edge NoC Routing Algorithms Link to heading
Future NoCs will likely integrate AI-driven routing algorithms either in HW or SW to dynamically adjust paths in real time, optimizing for AI accelerator workloads. Here are a few recent examples:
Algorithm | Description | Advantage |
---|---|---|
Deep Reinforcement Learning (DRL) Routing | Uses AI to learn best NoC routing paths | Self-optimizing, adjusts to dynamic traffic |
Hierarchical NoC Routing | Uses multi-level routing (global/local paths) | Reduces congestion in large NoCs |
Photonic NoC Routing | Uses light-based interconnects | Ultra-low latency, low power |
1.3.3.1 Machine Learning-Based Adaptive Routing Link to heading
💡 Why This Is Important? AI-driven NoC architectures now use deep reinforcement learning (DRL) to optimize routing decisions. DRL-based routers predict congestion patterns and adapt before congestion happens.
1.4 Flow Control Mechanisms in NoC Link to heading
Type | Description | Used in/for |
---|---|---|
Wormhole Switching | Packets broken into flits, moving one flit at a time | AI accelerators, GPUs |
Virtual Channels (VCs) | Multiple logical channels on the same physical link | Reduces congestion |
Credit-Based Flow Control | Buffer space allocated before sending | Deadlock-free designs |
1.4.1 Virtual Channel Routing Link to heading
💡 What is Virtual Channel (VC) Routing?
- Multiple logical channels share a single physical link.
- Packets are separated into multiple virtual lanes, preventing head-of-line (HoL) blocking.
- Each VC has its own buffer & flow control → Packets can bypass congested packets in another VC.
🖼️ How Virtual Channels Work 🔹 Example: A 2×2 NoC where packets from different cores travel in the same direction.
- Without VCs → If one packet is blocked, all others must wait.
- With VCs → Packets use different logical lanes → No waiting!
✅ Why Virtual Channels Matter?
- Prevents congestion & deadlocks.
- Improves throughput & resource utilization.
- Used in NVIDIA & AMD GPUs for AI workloads.
✅ Virtual Channels vs Adaptive Routing
Metric | Virtual Channels | Adaptive Routing |
---|---|---|
Latency | ✅ Lower (avoids blocking) | ❌ Higher under light load |
Throughput | ✅ Higher (parallelism) | ✅ Higher (load balancing) |
Congestion Handling | ✅ Good | ✅✅ Best |
✅ Key Takeaway
- Virtual Channels prevent HoL blocking and are great for regular traffic.
- Adaptive Routing balances load and is better for dynamic AI workloads.
1.4.1.1 Express Virtual Channels (EVCs) Link to heading
💡 Why This Is Important?
- EVCs bypass intermediate routers, reducing latency for high-priority packets.
- Reportedly used in designs such as Google TPUs and NVIDIA GPUs to prioritize AI tensor movement.
- Instead of routing each flit through every router in a Mesh NoC, EVCs allow packets to skip unnecessary intermediate hops.
🔹 Comparison:
Feature | Standard Virtual Channels (VCs) | Express Virtual Channels (EVCs) |
---|---|---|
Path Selection | Stop at every router | Skip intermediate routers |
Latency | Higher | Lower |
Best Use Case | General packet traffic | High-priority AI workloads |
✅ Where Is This Used?
- Tensor data movement in AI accelerators.
- High-speed interconnects for GPUs (NVLink, Infinity Fabric).
1.4.2 Wormhole Switching Link to heading
💡 How It Works Wormhole switching splits packets into small flow-control units (flits) and only buffers a few flits per router, unlike traditional packet-based switching. Here's how it works:
Packets are divided into flits: 1️⃣ Header Flit: Reserves the path and determines routing. 2️⃣ Body Flits: Carry the actual data and follow the header. 3️⃣ Tail Flit: Marks the end of the packet.
✅ Each router only stores a few flits at a time instead of buffering entire packets. ✅ This significantly reduces memory usage and makes the design area- and power-efficient.
Advantages of Wormhole Switching ✅ Low Buffer Requirements → Needs only a few flits of buffer per router. ✅ Low Latency → Flits immediately start moving as soon as a path is available. ✅ Power Efficient → Less buffer storage means lower dynamic power consumption.
Disadvantages of Wormhole Switching ❌ Head-of-Line (HoL) Blocking → If the header flit is blocked, all following flits get stuck. ❌ Deadlock Possibility → If multiple packets wait for each other, the network can freeze. ❌ Limited Throughput → A single congested packet blocks multiple flits from different packets.
📌 Real-World Example AI workloads (like transformers in LLMs) involve massive data transfers between compute cores; wormhole switching ensures efficient low-latency movement.
🔹 How Wormhole Switching Works 1️⃣ Step 1: A flit-based switch holds multiple flits per buffer. 2️⃣ Step 2: If one flit gets blocked, all flits behind it must wait, even if they belong to different packets. 3️⃣ Step 3: VCs partially solve this by separating competing packets into independent buffers.
✅ Why This Matters? Even with VCs, a blocked VC still causes HoL blocking within that virtual channel.
1.5 NoC Congestion Control & Performance Metrics Link to heading
1.5.1 What Causes Congestion in NoC? Link to heading
💡 Definition of NoC Congestion Congestion occurs when packets experience excessive delays due to buffer overflow, link contention, or lack of routing flexibility. When congestion builds up in one part of the NoC, it can cause head-of-line (HoL) blocking, reducing system efficiency. 🔹 Key Causes of Congestion
Cause | Explanation | Example |
---|---|---|
Hotspots | Too many packets target the same router or link | AI workloads causing localized memory traffic congestion |
Limited Buffer Size | Routers cannot store all incoming packets | Buffers fill up, causing packet drops |
Poor Routing Decisions | Fixed routing paths overload some links | XY-routing in a Mesh topology without adaptive routing |
Lack of Virtual Channels | Packets compete for the same physical link | Only one packet moves at a time, blocking others |
1.5.2 Techniques for NoC Congestion Control Link to heading
1️⃣ Virtual Channels (VCs) Splits a physical link into multiple logical lanes to allow packets to bypass blocked traffic. Prevents Head-of-Line (HoL) blocking by enabling different packets to move in parallel. ✅ Example: A 2×2 NoC where multiple packets travel in the same direction. Without VCs → one blocked packet stops all others. With VCs → other packets use separate virtual lanes, reducing delays.
2️⃣ Adaptive Routing Instead of following a fixed shortest path (like XY-routing), adaptive routing chooses less congested paths dynamically. Reduces congestion hotspots by distributing traffic more evenly. ✅ Example: If (1,0) is congested, an adaptive router may reroute through (0,1) instead.
3️⃣ Express Links Skip intermediate routers by providing direct long-range links. Reduces latency and congestion in large NoCs. ✅ Example: A Mesh NoC with express links allows packets to jump multiple hops instead of routing through every intermediate node.
4️⃣ Credit-Based Flow Control Before sending a packet, the sender checks if the receiver has buffer space available. Prevents buffer overflow and packet drops. ✅ Example: A router doesn't send more packets than the next router can handle, avoiding buffer congestion.
5️⃣ Priority-Based NoC Routing Some packets (like AI inference requests) are prioritized over background traffic. Ensures real-time tasks meet deadlines while keeping congestion under control.
1.5.3 NoC Performance Metrics Link to heading
To optimize NoC performance, we measure key metrics:
1️⃣ Latency Definition: The time a packet takes to travel from source to destination. Measured in cycles per hop or nanoseconds. Lower latency = better performance.
2️⃣ Throughput Definition: The total number of packets delivered per cycle. Higher throughput = better NoC efficiency.
3️⃣ Power Consumption Definition: The total energy consumed by NoC communication. Reducing power is critical for AI accelerators and mobile chips.
✅ Techniques to Reduce Power:
- Power-gating inactive routers.
- Reducing NoC voltage dynamically based on workload.
4️⃣ Buffer Occupancy Definition: How full NoC router buffers are over time. Higher occupancy = more congestion risk.
2. Performance Modeling & Simulation of NoC Link to heading
2.1 Building Performance Models for NoC Link to heading
Performance modeling is essential for designing efficient Network-on-Chip (NoC) architectures. A comprehensive NoC model typically includes:
- Topology Model: Representation of the network structure
- Router Model: Latency and throughput characteristics
- Traffic Model: Workload patterns and communication demands
- Power Model: Energy consumption estimates
2.2 Some NoC Simulators Link to heading
BookSim Link to heading
🔥 What is BookSim?
- A cycle-accurate NoC simulator developed at Stanford University.
- Used for AI accelerators & chip design exploration.
- Supports various topologies (Mesh, Torus, Fat Tree).
📌 BookSim's Key Features ✅ Models packet-level routing, flow control, and congestion. ✅ Simulates Wormhole, Virtual-Channel, and Adaptive Routing. ✅ Evaluates latency, bandwidth, and power for different NoC designs.
🔥 Real-World Use Case: BookSim in AI Hardware 🔹 Researchers use it to optimize NoC designs for AI workloads. 🔹 It is also commonly used in industry for early design-space exploration of AI-accelerator NoCs.
Garnet (Gem5) Link to heading
🔥 What is Garnet (Gem5)?
- An NoC simulator integrated into the Gem5 full-system simulator.
- More detailed than BookSim → provides cycle-accurate CPU-GPU NoC modeling.
📌 Garnet's Key Features ✅ Simulates realistic multi-core CPU-GPU interconnects. ✅ Used for evaluating NoC performance in high-performance computing (HPC) systems.
🔥 Real-World Use Case: Garnet in HPC 🔹 Garnet is widely used in academic and industrial research to evaluate full-chip NoC designs before fabrication, including data-center CPU and AI-accelerator interconnects.
➡️ Garnet can:
- Analyze full-system CPU-GPU interconnect performance under real-world workloads, i.e., how CPU-GPU communication behaves under real workloads.
- Analyze the impact of cache coherency and memory access on NoC performance.
- Analyze how NoC latency affects AI model execution.
Important Differences between BookSim and Garnet Link to heading
Feature | BookSim | Garnet (Gem5) |
---|---|---|
Focus | NoC routing & congestion analysis | Full-system CPU-GPU interconnect analysis |
Level of Detail | Packet-level simulation | Cycle-accurate system-wide simulation |
Best For | Comparing routing algorithms | Analyzing AI workloads on full-system NoC |
✅ Why It Matters?
➡️ Use BookSim to optimize NoC microarchitecture. ➡️ Use Garnet to study NoC behavior in AI accelerators.
3. Software Engineering & Concepts for NoC Simulators Link to heading
3.1 Key Software Engineering Concepts for NoC Simulators Link to heading
Concept | Why It Matters for NoC Simulators |
---|---|
Modular Design | Each NoC component (router, buffer, link) should be independently testable & reusable. |
Event-Driven Simulation | NoCs are modeled as event-driven systems (packets arrive, propagate, stall). |
Concurrency & Parallelism | NoC simulators should leverage multi-threading for speed. |
Abstraction Layers | Separate hardware modeling from routing logic to ensure flexibility. |
Performance Optimization | Use efficient data structures (e.g., custom queues) for faster packet handling. |
Profiling & Debugging | NoC simulators should include performance metrics & logging. |
3.2 Software Architecture of NoC Simulators Link to heading
3.2.1 NoC Simulation Components Link to heading
A typical NoC simulator consists of four major software components:
Component | Role in NoC Simulation |
---|---|
Router Model | Implements buffering, flow control, routing, and arbitration. |
Link Model | Represents NoC links (bandwidth, delay, contention effects). |
Traffic Generator | Generates AI workload-based packet injections. |
Performance Analyzer | Tracks latency, congestion, throughput, power metrics. |
✅ Example: BookSim NoC Simulator Architecture
Component | Role in NoC Simulation |
---|---|
Router Class (Router.cpp) | Implements virtual channels, input buffers, & routing logic. |
Traffic Class (TrafficManager.cpp) | Injects synthetic or real workload-based traffic. |
Simulation Engine (Simulator.cpp) | Executes event-driven simulation steps. |
3.2.2 Event-Driven NoC Simulation Link to heading
Most NoC simulators use an event-driven simulation model instead of traditional step-by-step execution.
✅ How Event-Driven Simulation Works in NoC: 1️⃣ Each NoC event (packet arrival, buffer stall, routing decision) is scheduled on an event queue. 2️⃣ The event scheduler processes events in time order (priority queue). 3️⃣ Events are processed only when necessary, improving performance.
💡 Example: Event Handling in NoC Simulation
#include <functional>
#include <queue>
#include <vector>

struct Event {
    int timestamp;
    std::function<void()> action;
};

// Order events so the smallest timestamp is processed first (min-heap)
struct EventCompare {
    bool operator()(const Event& a, const Event& b) const {
        return a.timestamp > b.timestamp;
    }
};

// Event Queue (Priority Queue Sorted by Timestamp)
std::priority_queue<Event, std::vector<Event>, EventCompare> eventQueue;

// Add an event to the NoC simulation
void scheduleEvent(int time, std::function<void()> action) {
    eventQueue.push({time, std::move(action)});
}

// NoC Simulation Engine
void runSimulation() {
    while (!eventQueue.empty()) {
        auto event = eventQueue.top();
        eventQueue.pop();
        event.action(); // Execute event at its scheduled time
    }
}
3.2.3 Key C++ Features for NoC Simulators Link to heading
Feature | Why It Matters for NoC Simulators |
---|---|
std::optional (C++17) | Avoids uninitialized packet states. |
std::variant (C++17) | Used for flexible routing decision storage. |
std::any (C++17) | Allows generic NoC event storage. |
Parallel execution policies (std::execution::par, C++17) | Enable multithreading in NoC simulations. |
Coroutines (co_await/co_return, C++20) | Allow event-driven NoC scheduling using async packet handling. |
std::span (C++20) | Efficiently manages flit buffers. |
std::bit_cast (C++20) | Fast conversion between flits and integers for NoC encoding. |
3.2.4 Using C++ Coroutines for NoC Event Handling Link to heading
🔹 Problem: Traditional NoC simulators handle packet transmission using blocking function calls, leading to inefficiencies.
🔹 Solution: Use C++20 coroutines to make packet transmission non-blocking and event-driven.
✅ Example: NoC Packet Transmission Using Coroutines
#include <coroutine>
#include <iostream>
// Coroutine-based NoC Event
struct NoCEvent {
    struct promise_type {
        NoCEvent get_return_object() { return {}; }
        std::suspend_never initial_suspend() { return {}; }          // start running immediately
        std::suspend_never final_suspend() noexcept { return {}; }   // frame cleans itself up
        void return_void() {}
        void unhandled_exception() {}                                // required by the coroutine machinery
    };
};
// Simulated NoC Packet Transmission (Async)
NoCEvent transmitPacket(int packet_id) {
    std::cout << "Packet " << packet_id << " transmitted." << std::endl;
    co_return;
}
int main() {
    auto event = transmitPacket(42); // Runs to completion without blocking the caller
    return 0;
}
✅ Why Use Coroutines for NoC Simulation? ✔️ Avoids blocking execution while waiting for packet arrivals. ✔️ Enables asynchronous event scheduling (no need for polling loops).
3.2.5 Parallelism & Concurrency in NoC Simulators Link to heading
Modern NoC simulators must handle thousands of packets simultaneously. This requires efficient multithreading and parallel processing.
📌 Parallel NoC Simulation Using std::execution
C++17 introduced parallel execution policies to enable multi-threaded NoC event processing.
✅ Example: NoC Parallel Packet Processing
#include <iostream>
#include <vector>
#include <algorithm>
#include <execution>
std::vector<int> packets = {1, 2, 3, 4, 5};
// Process packets in parallel (output lines may interleave across threads)
std::for_each(std::execution::par, packets.begin(), packets.end(), [](int packet) {
    std::cout << "Processing packet: " << packet << std::endl;
});
✅ Why Use std::execution::par?
✔️ Automatically distributes packet processing across multiple CPU cores.
✔️ Reduces simulation runtime for large NoC workloads.
📌 Performance Optimization Techniques for NoC Simulators Efficient NoC simulators must be optimized for performance.
📌 NoC Simulator Performance Bottlenecks & Solutions
Bottleneck | Optimization Technique |
---|---|
Slow event processing | Use event-driven execution model (instead of polling). |
Inefficient buffer allocation | Use std::span to manage flit buffers efficiently. |
High memory usage | Use custom memory pools instead of heap allocations. |
Cache misses in routing logic | Prefetch routing tables using SIMD-friendly data structures. |
✅ Example: Optimized NoC Packet Queue Using std::deque
#include <deque>
#include <iostream>
std::deque<int> packetQueue; // Optimized for FIFO access
void enqueuePacket(int packet) {
packetQueue.push_back(packet); // Fast O(1) insert
}
void processNextPacket() {
if (!packetQueue.empty()) {
int packet = packetQueue.front();
packetQueue.pop_front(); // Fast O(1) removal
std::cout << "Processing packet: " << packet << std::endl;
}
}
✅ Why Use std::deque for NoC Buffers?
✔️ O(1) enqueue/dequeue → faster than std::vector, which pays O(n) to remove from the front.
✔️ Low memory fragmentation.
4. AI Workload Analysis Link to heading
4.1 Why Do AI Workloads Stress the NoC? Link to heading
💡 Key Reasons:
- Massive Data Movement: AI workloads require huge amounts of tensor data movement between processing cores, memory, and accelerators.
- Irregular Traffic Patterns: Unlike traditional HPC workloads, AI tasks exhibit bursty, unpredictable traffic.
- High Memory Bandwidth Needs: Training/inference models require high-speed interconnects to access DRAM, HBM, or cache.
- Synchronization Overhead: AI workloads depend on synchronized execution across multiple cores/GPUs, leading to NoC congestion and stalls.
✅ Impact on NoC:
AI Workload Issue | NoC Bottleneck |
---|---|
Large tensor data movement | High bandwidth demand |
Unpredictable bursty traffic | NoC congestion & packet loss |
Cache coherence across compute cores | Increased latency |
Need for synchronization | Increased communication overhead |
4.2 How Do Different AI Workloads Impact the NoC? Link to heading
AI Model Type | NoC Stress Pattern | Bottleneck | Example Models |
---|---|---|---|
Transformer Models (LLMs) | High memory bandwidth, all-to-all communication | NoC congestion due to massive attention matrix communication | GPT-4, Llama, Falcon |
CNNs (Convolutional Neural Networks) | Localized data movement, structured traffic | Memory latency & NoC hotspot formation | ResNet, EfficientNet |
Graph Neural Networks (GNNs) | Irregular, sparse traffic patterns | Poor NoC utilization & congestion in specific regions | GCN, GAT |
Diffusion Models (Stable Diffusion, DALL-E) | Burst mode traffic, unpredictable spikes | Cache misses & NoC jitter | Stable Diffusion, MidJourney |
β Key Observations:
- LLMs stress NoC the most due to their massive attention matrix multiplications.
- CNNs are structured and predictable, but require optimized memory access.
- GNNs have sparse and irregular traffic, leading to load imbalance on the NoC.
4.3 Key NoC Bottlenecks for AI Workloads Link to heading
4.3.1 High Memory Bandwidth Requirements Link to heading
πΉ Why?
- AI models, especially LLMs, require terabytes of DRAM/HBM bandwidth.
- The NoC must provide fast access to weight matrices & activations without bottlenecks.
β Example: AI Accelerator Memory Access (TPU, MI300, Hopper)
Component | Memory Bandwidth (TB/s) |
---|---|
NVIDIA Hopper H100 (HBM3) | 4.9 TB/s |
Google TPU v4 | 2.2 TB/s |
AMD MI300 | 5.2 TB/s |
π‘ What happens if NoC bandwidth is insufficient? β Memory stalls β The AI pipeline stalls, waiting for data. β High NoC congestion β Increased packet drop rates and latency.
β Solution: NoC Optimizations βοΈ High-bandwidth interconnects (NVLink, Infinity Fabric). βοΈ Efficient NoC routing (Valiantβs Routing, Adaptive DyXY). βοΈ HBM-aware NoC design for direct memory access.
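Valiant's routing, listed above, balances load by sending each packet to a random intermediate node first and only then on to its destination, trading up to 2x path length for congestion avoidance. A minimal sketch on a 2D mesh; the coordinate type and `xyRoute` helper are assumptions for illustration (the intermediate node is passed in explicitly here, where a real router would pick it at random):

```cpp
#include <utility>
#include <vector>

using Node = std::pair<int, int>;  // (x, y) mesh coordinates

// Dimension-ordered XY routing: move along x first, then y.
// Appends every hop from src (exclusive) to dst (inclusive).
static void xyRoute(Node src, Node dst, std::vector<Node>& path) {
    while (src.first != dst.first) {
        src.first += (dst.first > src.first) ? 1 : -1;
        path.push_back(src);
    }
    while (src.second != dst.second) {
        src.second += (dst.second > src.second) ? 1 : -1;
        path.push_back(src);
    }
}

// Valiant's scheme: route src -> intermediate -> dst.
// Spreading traffic through random midpoints avoids hotspots
// at the cost of a longer path.
std::vector<Node> valiantRoute(Node src, Node dst, Node intermediate) {
    std::vector<Node> path{src};
    xyRoute(src, intermediate, path);
    xyRoute(intermediate, dst, path);
    return path;
}
```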
4.3.2 Bursty and Unpredictable Traffic Patterns Link to heading
πΉ Why?
- Unlike HPC workloads, AI workloads generate bursts of traffic when layers process large tensors.
- Transformer models create unpredictable spikes in NoC traffic during attention calculations.
β Example: Burst Traffic in AI Training
Time (ms) | Traffic Pattern |
---|---|
0 - 50 | Low data movement |
50 - 60 | High burst (attention computation) |
60 - 80 | Medium data flow |
80 - 90 | Another burst (MLP layer computation) |
π‘ What happens if NoC cannot handle burst traffic? β High packet loss β Tensor updates get delayed. β Increased NoC congestion β Slows down AI training/inference.
β Solution: NoC Optimizations βοΈ Adaptive routing (Dynamic DyXY, Reinforcement Learning-based routing). βοΈ Express Virtual Channels (EVCs) to bypass congested routers. βοΈ Priority-based NoC scheduling to give preference to critical AI tasks.
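Priority-based NoC scheduling, mentioned above, can be sketched with a `std::priority_queue` that always serves the most urgent packet first. The `Packet` fields and priority values below are illustrative, not from a real router:

```cpp
#include <queue>
#include <vector>

// A NoC packet with a scheduling priority; fields are illustrative.
struct Packet {
    int id;
    int priority;  // higher value = more urgent (e.g. gradient traffic)
};

// Comparator so the highest-priority packet sits at the top (max-heap).
struct ByPriority {
    bool operator()(const Packet& a, const Packet& b) const {
        return a.priority < b.priority;
    }
};

using PriorityQueue =
    std::priority_queue<Packet, std::vector<Packet>, ByPriority>;

// Pop and return the most urgent packet's id (-1 if the queue is empty).
int scheduleNext(PriorityQueue& q) {
    if (q.empty()) return -1;
    int id = q.top().id;
    q.pop();
    return id;
}
```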
4.3.3 Synchronization Overhead in AI Training Link to heading
πΉ Why?
- AI workloads run on distributed compute units (e.g., GPUs, TPUs).
- NoC must synchronize tensor updates across all cores.
β Example: AI Model Synchronization
Operation | Communication Overhead |
---|---|
Forward Pass | Minimal NoC traffic (mostly read-heavy) |
Backward Pass (Gradients) | Heavy NoC traffic (large tensor updates) |
Parameter Update (SGD, Adam) | Massive synchronization needed |
π‘ What happens if NoC cannot synchronize efficiently? β Deadlocks in NoC β Model training halts. β Increased power consumption due to excessive message passing.
β Solution: NoC Optimizations βοΈ Hierarchical NoC topologies to reduce synchronization latency. βοΈ Priority-based packet scheduling for gradient updates. βοΈ Smart data compression techniques (quantization, sparsity).
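As a sketch of the data-compression idea above: gradients can be linearly quantized to int8 before crossing the NoC, shrinking each value from 4 bytes to 1. The symmetric scheme below is one common choice, not any specific framework's API; `Quantized`, `quantize`, and `dequantize` are illustrative names:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Symmetric linear quantization: scale maps the largest gradient
// magnitude onto the int8 range [-127, 127].
struct Quantized {
    std::vector<std::int8_t> data;
    float scale;  // dequantized value = data[i] * scale
};

Quantized quantize(const std::vector<float>& grads) {
    float maxAbs = 0.0f;
    for (float g : grads) maxAbs = std::max(maxAbs, std::fabs(g));
    float scale = (maxAbs > 0.0f) ? maxAbs / 127.0f : 1.0f;

    Quantized q{{}, scale};
    q.data.reserve(grads.size());
    for (float g : grads)
        q.data.push_back(static_cast<std::int8_t>(std::lround(g / scale)));
    return q;  // 4x fewer bytes cross the NoC per gradient
}

std::vector<float> dequantize(const Quantized& q) {
    std::vector<float> out;
    out.reserve(q.data.size());
    for (std::int8_t v : q.data) out.push_back(v * q.scale);
    return out;
}
```

The trade-off is a small rounding error per value, which most training recipes tolerate in exchange for 4x less interconnect traffic.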
4.3.4 AI-Specific Traffic Patterns: All-to-All Communication Link to heading
πΉ Why?
- Transformer-based models require massive matrix multiplications across multiple GPUs/TPUs.
- AI accelerators use all-to-all NoC communication, leading to congestion.
β Example: NoC Traffic in Transformer Models
Layer | Data Movement |
---|---|
Self-Attention | All-to-All (High NoC congestion) |
Feedforward (MLP) | Many-to-One (Moderate NoC usage) |
Output Layer | One-to-All (Medium NoC traffic) |
π‘ What happens if NoC cannot handle all-to-all traffic? β Extreme congestion at AI accelerator interconnects. β Limited scalability as AI models grow (GPT-5, Llama 3, etc.).
β Solution: NoC Optimizations βοΈ Hybrid topologies (Torus + Express Links) to reduce hops. βοΈ Optical/Wireless NoC for ultra-low latency interconnects. βοΈ Dynamic workload-aware routing to balance NoC traffic.
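One reason torus-based hybrids reduce hops: wrap-around links halve the worst-case distance along each dimension. A tiny sketch of that arithmetic (function names are illustrative):

```cpp
// Maximum hop distance along one dimension of n nodes.
// Mesh: edge to edge is n - 1 hops.
int maxHopsMeshDim(int n) { return n - 1; }

// Torus: the wrap-around link means you never travel more than
// halfway around the ring, so n / 2 hops.
int maxHopsTorusDim(int n) { return n / 2; }
```

For an 8x8 fabric, the worst case per dimension drops from 7 hops to 4, which compounds across dimensions for all-to-all traffic.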
π Final Takeaways πΉ Modern AI workloads stress NoC due to high bandwidth, unpredictable traffic, and synchronization issues. πΉ LLMs (GPT-4, Llama) create the most severe NoC congestion due to all-to-all tensor communication. πΉ CNNs have structured traffic but require optimized memory access. πΉ NoCs must evolve with smarter routing, better flow control, and high-speed interconnects.
Appendix Link to heading
Queueing Theory Link to heading
Concept | Definition | NoC Relevance |
---|---|---|
Littleβs Law | L = Ξ»W (Number of items = arrival rate Γ wait time) | Predicts NoC buffer occupancy |
M/M/1 Queue | Single server, exponential arrival & service | Models packet delays in NoC routers |
M/G/1 Queue | General service distribution | More accurate for NoC bursty traffic |
π₯ Littleβs Law πΉ Formula:
L=Ξ»W
Where:
- L = Average number of packets in the system
- Ξ» = Arrival rate of packets
- W = Average time a packet spends in the system
β Example Application in NoC: If packets arrive at 5 per cycle and spend 4 cycles in the NoC, then:
L=5Γ4=20
This means 20 packets are present in the NoC at any given time.
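The relationship is simple enough to encode directly; solving for either remaining quantity is just a rearrangement (helper names are illustrative):

```cpp
// Little's Law relates three long-run averages: L = lambda * W.
// Any one quantity follows from the other two.
double littleL(double lambda, double W) { return lambda * W; }  // packets in NoC
double littleW(double L, double lambda) { return L / lambda; }  // time in NoC
```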
π₯ Poisson Distribution (ELI5)
- Describes random arrival events over time.
- Used in NoC traffic modeling to simulate bursty vs. smooth packet arrivals.
β Example: β If packets arrive at an average rate of 10 per second, a Poisson model predicts how likely it is to receive 15 or 20 packets in the next second. β Real-world application: Models random memory accesses in NoC routers.
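This arrival model is easy to sketch with the standard library's `std::poisson_distribution`; the fixed seed below is only for reproducibility:

```cpp
#include <random>
#include <vector>

// Draw per-cycle packet arrival counts from a Poisson distribution.
// meanRate is the average number of packets per cycle (lambda).
std::vector<int> poissonArrivals(double meanRate, int cycles, unsigned seed) {
    std::mt19937 rng(seed);  // fixed seed -> reproducible traffic trace
    std::poisson_distribution<int> dist(meanRate);
    std::vector<int> counts(cycles);
    for (int& c : counts) c = dist(rng);
    return counts;
}
```

Feeding such a trace into a router model is a common way to stress-test buffers against bursty rather than perfectly smooth arrivals.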
π₯ M/M/1 Queue (NoC Packet Queues) π‘ What is M/M/1? A single-server queueing model used to analyze NoC router delays.
M/M/1 means:
- M (Memoryless Poisson Arrivals)
- Markovian arrival where packets arrive randomly following a Poisson process.
- M (Memoryless Exponential Service)
- Markovian service where the time between packet processing is exponentially distributed.
- 1 (Single NoC router queue)
β Example: If a router receives packets at a rate of 10 per cycle but can only process 8 per cycle, utilization exceeds 1 and the queue grows without bound; a stable M/M/1 queue requires the arrival rate Ξ» to stay strictly below the service rate.
β Why Does M/M/1 Matter for NoC?
Predicts NoC delays based on packet arrival rate. Helps optimize buffer size & flow control.
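For a stable queue (arrival rate below service rate), the standard M/M/1 steady-state formulas give utilization, average occupancy, and average delay. A small sketch with hypothetical rates; `MM1` and `mm1Metrics` are illustrative names:

```cpp
#include <stdexcept>

// Steady-state M/M/1 metrics. Requires lambda < mu (utilization < 1);
// otherwise the queue grows without bound.
struct MM1 {
    double rho;  // utilization = lambda / mu
    double L;    // average packets in system: rho / (1 - rho)
    double W;    // average time in system:   1 / (mu - lambda)
};

MM1 mm1Metrics(double lambda, double mu) {
    if (lambda >= mu)
        throw std::invalid_argument("unstable queue: lambda >= mu");
    double rho = lambda / mu;
    return {rho, rho / (1.0 - rho), 1.0 / (mu - lambda)};
}
```

Note the cross-check built into the model: L equals lambda times W, exactly as Little's Law predicts.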
π₯ All Queue Types
Queue Type | Arrival Process | Service Process | Servers | Example Use Case |
---|---|---|---|---|
M/M/1 | Poisson (Random) | Exponential (Random) | 1 | NoC router queue |
M/M/c | Poisson | Exponential | c servers | Multi-core CPU processing requests |
M/G/1 | Poisson | General (Any Distribution) | 1 | NoC traffic bursts |
G/M/1 | General (Any Distribution) | Exponential | 1 | Variable packet arrivals |
M/D/1 | Poisson | Deterministic (Fixed) | 1 | Real-time NoC packet scheduling |
M/M/β | Poisson | Exponential | Infinite servers | Cloud computing load balancing |
π Comparing M/M/1 vs. M/M/c π‘ What is an M/M/c Queue? Same as M/M/1, but has c servers instead of 1. Used for multi-core CPUs & parallel packet processing in NoC. β Example: If an NoC router has 4 ports handling traffic, we can model it as M/M/4. β Why M/M/c? Reduces waiting time by serving packets in parallel. More realistic for AI accelerators & GPU interconnects.
π Comparing M/M/1 vs. M/G/1 π‘ What is an M/G/1 Queue? Same as M/M/1, but service time follows ANY distribution (not just exponential). More accurate for NoC, since real-world NoC packets do not have perfectly exponential service times. β Example: If a NoC router processes tensor data from an AI workload, some packets take longer than others β M/G/1 models this better. β Why M/G/1? More realistic for bursty NoC traffic. Used in AI interconnect modeling.
π Comparing M/M/1 vs. M/D/1 π‘ What is an M/D/1 Queue? Same as M/M/1, but service times are fixed (deterministic). Used for real-time NoC scheduling where service time is constant. β Example: If an NoC router always takes exactly 2 cycles to process a packet, we use M/D/1 instead of M/M/1. β Why M/D/1? Used in hard real-time NoC systems (e.g., automotive chips). Predictable processing delays.
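The gap between M/M/1 and M/D/1 follows from the Pollaczek-Khinchine formula: with deterministic service, the mean queueing delay (time waiting before service) is exactly half that of exponential service at the same rates. A small sketch (function names are illustrative):

```cpp
// Mean waiting time in queue (excluding service), from the
// Pollaczek-Khinchine formula, for arrival rate lambda and
// service rate mu with rho = lambda / mu < 1:
//   M/M/1: Wq = rho / (mu * (1 - rho))
//   M/D/1: Wq = rho / (2 * mu * (1 - rho))  -- exactly half of M/M/1
double wqMM1(double lambda, double mu) {
    double rho = lambda / mu;
    return rho / (mu * (1.0 - rho));
}

double wqMD1(double lambda, double mu) {
    double rho = lambda / mu;
    return rho / (2.0 * mu * (1.0 - rho));
}
```

This is why fixed-cycle routers (the hard real-time case above) enjoy noticeably shorter queues than the M/M/1 model would predict.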
π Which Queue Model is Best for NoC?
Use Case | Best Queue Model |
---|---|
General NoC packet queuing | M/M/1 |
Parallel NoC processing (multi-core) | M/M/c |
NoC with bursty packet arrivals | M/G/1 |
Real-time NoC scheduling | M/D/1 |
β Key Takeaway: NoC routers typically use M/M/1 or M/G/1, depending on traffic patterns.