100% LLM generated content.
QoS and arbitration in SoC Interconnects.
🚦 1. QoS Mechanisms for Memory and Interconnect Traffic
✅ Why QoS?
In a heterogeneous SoC, different IPs have different needs:
| IP | Requirement |
|---|---|
| CPU | Moderate latency, high reactivity |
| NPU/GPU | High throughput, can tolerate latency |
| Camera ISP | Real-time deadlines, low jitter |
| DMA | Background, best-effort |
Without QoS, aggressive masters (e.g., GPU or NPU) can starve latency-sensitive clients (e.g., camera).
📚 Key QoS Mechanisms
🧩 1. Priority Levels
- Masters assign priority tags to requests (e.g., `QoS[3:0]`)
- Interconnect and memory arbiters serve higher-priority traffic first
⏲️ 2. Bandwidth Reservation / Budgeting
- Reserve minimum bandwidth for each master
- Use token buckets or time-slotting (e.g., TDMA)
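A minimal token-bucket sketch of bandwidth budgeting, written as a plain C software model (the byte-granular rates and the 1 μs tick are illustrative, not tied to any particular interconnect IP):

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Per-master bandwidth budget, expressed in bytes. */
typedef struct {
    uint64_t tokens;      /* current budget                       */
    uint64_t max_tokens;  /* bucket depth = largest allowed burst */
    uint64_t refill;      /* bytes added per refill tick          */
} token_bucket;

/* Called once per refill tick (e.g., every microsecond). */
static void tb_refill(token_bucket *tb)
{
    tb->tokens += tb->refill;
    if (tb->tokens > tb->max_tokens)
        tb->tokens = tb->max_tokens;
}

/* Returns true if a request of `bytes` may be issued now. */
static bool tb_consume(token_bucket *tb, uint64_t bytes)
{
    if (tb->tokens < bytes)
        return false;   /* throttled: wait for the next refill */
    tb->tokens -= bytes;
    return true;
}

int main(void)
{
    /* Reserve roughly 2 GB/s for a master: 2048 bytes per 1 us tick. */
    token_bucket npu = { .tokens = 0, .max_tokens = 8192, .refill = 2048 };
    tb_refill(&npu);
    printf("64B burst allowed: %d\n", tb_consume(&npu, 64));
    return 0;
}
```

In real designs the equivalent logic sits in per-master rate regulators at the interconnect ingress; the model only shows the policy.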
🧃 3. Traffic Shaping / Throttling
- Limit rate of high-throughput IPs to avoid queue congestion
- Smooth out bursts that could stall real-time traffic
📦 4. Virtual Channels
- Use logically separate queues per class (real-time, best-effort)
- Avoid head-of-line blocking between unrelated traffic types
🛡️ 5. Preemption / Aging
- Allow high-priority preemption of in-flight requests
- Use aging to prevent starvation of low-priority traffic
💥 2. Burst Sizes, AXI/ACE Transactions
✅ AXI Burst Transaction Basics
- AXI uses burst-based memory access to improve bandwidth efficiency
- Types:
- Fixed (same address repeated)
- Incrementing (default, linear addr steps)
- Wrapping (useful for cache-line-aligned accesses)
📏 Burst Length
- AXI3 allows bursts of up to 16 beats; AXI4 extends INCR bursts to 256 beats (1 beat = 1 data transfer)
- Larger bursts = better bandwidth efficiency: fewer address phases per byte moved (see the sketch after this list)
- Ideal for:
- NPU weights
- GPU framebuffers
- DMA copies
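To make the overhead point concrete, here is a small sketch of how the AXI AxLEN/AxSIZE fields translate into bytes moved per address phase (encodings as defined by AXI: AxLEN holds beats − 1, AxSIZE holds log2 of the beat size; the example values are purely illustrative):

```c
#include <stdint.h>
#include <stdio.h>

/* AxLEN encodes (beats - 1); AxSIZE encodes log2(bytes per beat). */
static uint32_t axi_burst_bytes(uint8_t axlen, uint8_t axsize)
{
    uint32_t beats          = (uint32_t)axlen + 1; /* 1..256 for AXI4 INCR */
    uint32_t bytes_per_beat = 1u << axsize;        /* 1..128               */
    return beats * bytes_per_beat;
}

int main(void)
{
    /* A 16-beat INCR burst of 16-byte beats moves 256 bytes with a single
     * address phase -- the kind of access an NPU weight fetch wants.      */
    printf("AxLEN=15, AxSIZE=4 -> %u bytes per burst\n",
           axi_burst_bytes(15, 4));
    return 0;
}
```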
🔁 ACE Extensions
- ACE = AXI + cache coherency extensions
- Adds:
- Barrier transactions
- Snoop requests (e.g., `ReadClean`, `MakeUnique`)
- Cache maintenance operations
⚖️ 3. Arbitration Logic
🔀 Purpose:
Arbitration decides which master gets access to a shared resource (e.g., interconnect switch, memory port).
📚 Types of Arbitration:
| Type | Description | Pros | Cons |
|---|---|---|---|
| Round-Robin | Equal turn-taking | Fair | Ignores urgency |
| Fixed Priority | Always favors certain masters | Simple | Can starve low-priority |
| TDMA (Time-Division) | Reserved slots | Predictable | Rigid |
| QoS-Aware Weighted Arbitration | Prioritized based on request class, with bandwidth targets | Balanced | Complex |
🛠️ Techniques:
- Token buckets for rate-limiting
- Latency-aware arbitration for real-time IPs
- Age-based fairness to avoid starvation (see the arbiter sketch after this list)
- Hierarchical arbitration for multi-level NoCs
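A behavioral sketch of QoS-aware arbitration with aging, written as a plain C model rather than RTL (the aging rate of one priority level per 64 waited cycles is an arbitrary choice for illustration):

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_MASTERS 3

typedef struct {
    bool     pending;  /* master has an outstanding request     */
    uint8_t  qos;      /* static QoS tag, 0 (low) .. 15 (high)  */
    uint32_t age;      /* cycles its oldest request has waited  */
} master_t;

/* Effective priority = QoS tag + aging bonus, so low-QoS masters
 * eventually win even under sustained high-QoS load.            */
static uint32_t effective_prio(const master_t *m)
{
    return m->qos + m->age / 64;   /* +1 level per 64 waited cycles */
}

/* Returns the granted master index, or -1 if nothing is pending. */
static int arbitrate(const master_t m[], int n)
{
    int winner = -1;
    uint32_t best = 0;
    for (int i = 0; i < n; i++) {
        if (!m[i].pending)
            continue;
        uint32_t p = effective_prio(&m[i]);
        if (winner < 0 || p > best) {
            winner = i;
            best = p;
        }
    }
    return winner;
}

int main(void)
{
    /* The DMA has waited long enough that its aged priority wins this cycle. */
    master_t m[NUM_MASTERS] = {
        { .pending = true, .qos = 0xF, .age = 0    },  /* ISP */
        { .pending = true, .qos = 0x8, .age = 0    },  /* GPU */
        { .pending = true, .qos = 0x2, .age = 1024 },  /* DMA */
    };
    printf("granted master: %d\n", arbitrate(m, NUM_MASTERS));
    return 0;
}
```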
🧪 Hands-On Scenario 1: Optimizing CHI Interconnect for CPU + NPU + Camera ISP
🎯 Scenario
You’re designing a CHI-based interconnect for an SoC with:
- CPU Cluster (4 cores)
- NPU (Neural Processing Unit) with high-bandwidth burst loads
- Camera ISP for real-time 4K video processing
🧱 Goal
Ensure:
- Camera meets 33ms frame deadline (30 FPS)
- NPU achieves max sustained throughput
- CPU gets responsive access
🔧 Step-by-Step Optimization Plan
1. 🧩 Classify Traffic
| IP | Traffic Class | Constraints |
|---|---|---|
| ISP | Real-Time (High QoS) | < 33ms latency |
| NPU | Throughput | Saturate memory |
| CPU | Moderate Priority | Low jitter preferred |
2. 🧠 CHI QoS Configuration
Assign QoS tags:
- ISP = `QoS[3:0] = 0xF`
- CPU = `QoS = 0xA`
- NPU = `QoS = 0x4`
Enable Virtual Channels for real-time traffic
Configure CHI arbitration to prioritize high-QoS VC for reads/writes
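A hypothetical MMIO sketch of the QoS assignment above. The base address, register offsets, and port indices are invented for illustration; real interconnects expose per-port QoS override registers, but the actual map comes from the vendor's TRM.

```c
#include <stdint.h>

/* Assumed (not real) NoC configuration space and per-port register stride. */
#define NOC_CFG_BASE            0x2A000000u
#define PORT_QOS_OVERRIDE(port) (NOC_CFG_BASE + 0x1000u * (port) + 0x10u)

#define PORT_ISP 0u
#define PORT_CPU 1u
#define PORT_NPU 2u

static inline void mmio_write32(uintptr_t addr, uint32_t val)
{
    *(volatile uint32_t *)addr = val;   /* target-side register write */
}

/* Program static QoS overrides at each ingress port. */
static void configure_qos_overrides(void)
{
    mmio_write32(PORT_QOS_OVERRIDE(PORT_ISP), 0xF); /* real-time: highest   */
    mmio_write32(PORT_QOS_OVERRIDE(PORT_CPU), 0xA); /* latency-sensitive    */
    mmio_write32(PORT_QOS_OVERRIDE(PORT_NPU), 0x4); /* throughput, can wait */
}
```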
3. 🚦 Memory System Behavior
- Use TDMA slots for ISP (e.g., every 10 μs; sketched after this list)
- Apply bandwidth cap on NPU (e.g., 2 GB/s max)
- Insert write-combining buffers for NPU burst handling
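A sketch of the TDMA reservation for the ISP, assuming a 10 μs period with the first 2 μs reserved (the 2 μs window size is an assumption; the list above only fixes the period):

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define TDMA_PERIOD_NS 10000u  /* 10 us arbitration period        */
#define ISP_SLOT_NS     2000u  /* first 2 us reserved for the ISP */

/* True while the reserved ISP window of the current period is open. */
static bool isp_slot_active(uint64_t now_ns)
{
    return (now_ns % TDMA_PERIOD_NS) < ISP_SLOT_NS;
}

/* Arbiter hook: inside the ISP slot only ISP requests are eligible;
 * outside it, normal QoS-weighted arbitration applies to everyone.  */
static bool eligible(bool is_isp, uint64_t now_ns)
{
    return isp_slot_active(now_ns) ? is_isp : true;
}

int main(void)
{
    printf("NPU eligible at t=1us: %d\n", eligible(false, 1000));  /* 0 */
    printf("NPU eligible at t=5us: %d\n", eligible(false, 5000));  /* 1 */
    return 0;
}
```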
4. 🔬 Monitoring Tools
Counters for:
- Interconnect latency per QoS level
- Memory access turnaround time
- Write buffer occupancy
Enable debug snoop tracing for:
- Snoop-induced stalls
- Cross-IP interference
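Averages hide the tail that breaks deadlines, so it helps to post-process per-transaction latencies into a histogram and read off P95. A small sketch (bucket width and range are arbitrary) that could run over latencies parsed from a trace:

```c
#include <stdint.h>
#include <stdio.h>

#define NBUCKETS  64
#define BUCKET_NS 250u   /* 250 ns buckets -> 0..16 us range */

static uint32_t hist[NBUCKETS];

static void record_latency(uint32_t ns)
{
    uint32_t b = ns / BUCKET_NS;
    if (b >= NBUCKETS)
        b = NBUCKETS - 1;          /* clamp outliers into the last bucket */
    hist[b]++;
}

/* Upper bound (in ns) on the requested percentile, e.g. 95.0 for P95. */
static uint32_t percentile_ns(double pct)
{
    uint64_t total = 0, seen = 0;
    for (int i = 0; i < NBUCKETS; i++)
        total += hist[i];
    for (int i = 0; i < NBUCKETS; i++) {
        seen += hist[i];
        if (total > 0 && (double)seen / (double)total >= pct / 100.0)
            return (uint32_t)(i + 1) * BUCKET_NS;
    }
    return NBUCKETS * BUCKET_NS;
}

int main(void)
{
    /* Synthetic sample: mostly fast writes, plus a long queueing tail. */
    for (int i = 0; i < 950; i++) record_latency(400);
    for (int i = 0; i < 50; i++)  record_latency(9000);
    printf("P95 latency <= %u ns\n", percentile_ns(95.0));
    return 0;
}
```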
5. 📈 Expected Performance Outcomes
| IP | Without QoS | With QoS Tuned |
|---|---|---|
| ISP | Misses deadline | Hits 33ms reliably |
| NPU | Starves ISP | Slightly throttled, steady |
| CPU | High jitter | Stable latency window |
🧠 Takeaways
- Use QoS tagging + arbitration to prioritize latency over bandwidth
- Match burst size and interconnect width for throughput-heavy IPs
- Use TDMA or bandwidth guards for isolation
- Profile VC congestion and per-class latency histograms to guide further tuning
🧪 Hands-On Scenario 2: Debugging a camera ISP with a CMN-600 interconnect
You’re a performance architect debugging a real-time camera ISP in an SoC using Arm’s CMN-600 interconnect. The ISP captures and processes 4K frames at 30 FPS (frame deadline: 33.3ms), but you’re seeing occasional frame drops in logs.
Let’s simulate the debug process you’d walk through in a post-silicon trace or a SystemC simulation.
📍 Step 1: Understand the Memory Path
Camera ISP → L2 Cache / TCM → CMN-600 Interconnect → DRAM Controller → LPDDR4 DRAM
Key suspects:
- Contention on CMN-600 NoC
- High snoop latency
- ISP writeback stalls
- DRAM bandwidth saturation
- CHI QoS misconfiguration
📊 Step 2: Gather Profiling Counters
Assume you have access to the CMN-600 performance monitors and the memory controller counters.
| Metric | Observed Value |
|---|---|
| ISP DRAM Write Latency (P95) | 75 μs |
| NPU DRAM Read Bandwidth | 4.5 GB/s |
| CPU Cache Snoop Latency (avg) | 110 cycles |
| DRAM Row Buffer Hit Rate | 30% |
| ISP Transaction QoS Tag | 0x4 |
→ Red flag: 75 μs P95 for a single DRAM write points to heavy queueing; accumulated across a frame's worth of ISP writes it threatens the 33.3 ms deadline
→ Red flag: ISP QoS tag is too low (0x4 ≈ best effort)
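A quick sanity check shows raw DRAM bandwidth is not the bottleneck, which narrows the hypotheses below to queueing and QoS (the 2 bytes/pixel intermediate format is an assumption; the scenario does not specify it):

```c
#include <stdio.h>

int main(void)
{
    const double width = 3840, height = 2160, fps = 30;
    const double bytes_per_pixel = 2.0;     /* assumed intermediate format */

    double bytes_per_frame = width * height * bytes_per_pixel;
    double gb_per_s        = bytes_per_frame * fps / 1e9;

    printf("ISP writes: %.1f MB/frame, %.2f GB/s sustained\n",
           bytes_per_frame / 1e6, gb_per_s);
    /* ~16.6 MB/frame and ~0.50 GB/s -- small next to the NPU's 4.5 GB/s,
     * so the frame drops point at queueing/QoS, not raw DRAM bandwidth. */
    return 0;
}
```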
🔎 Step 3: Hypothesis Formation
- QoS Priority Violation: ISP requests are treated as best-effort → delayed behind NPU bursts
- VC Congestion in CMN-600: Shared virtual channels → head-of-line blocking
- Poor DRAM scheduling: Row-buffer misses or long write-to-read turnaround
- Backpressure from interconnect: Write combining buffer full, stalling ISP
🛠️ Step 4: Fixes
✅ QoS Elevation:
- Set ISP CHI QoS = `0xF`
- Route real-time traffic through CMN-600 VC0
✅ Dedicated Virtual Channel:
- Map ISP writes to VC0
- Isolate NPU traffic to VC1
- Enable fixed-priority arbitration between VCs
✅ Bandwidth Cap on NPU:
- Throttle the NPU to 3 GB/s via an interconnect rate limiter
- Add burst shaping in the NPU DMA engine
✅ Memory Partitioning:
- Assign ISP buffers to Bank Groups 0–1
- Assign NPU buffers to Bank Groups 2–3
✅ Interconnect Clock Domain Boost:
- Raise the NoC frequency from 600 MHz to 800 MHz to cut backpressure latency
✅ Outcome:
| Metric | Before | After |
|---|---|---|
| ISP DRAM Write Latency (P95) | 75 μs | 21 μs |
| ISP QoS Priority | 0x4 | 0xF |
| Frame Drops per 10s | 4 | 0 |
| NPU Throughput Impact | – | -5% max |
💡 ISP write latency stabilizes below 25 μs, with zero frame drops.