100% LLM generated content.
QoS and Arbitration in SoC Interconnects
🚦 1. QoS Mechanisms for Memory and Interconnect Traffic
✅ Why QoS?
In a heterogeneous SoC, different IPs have different needs:
IP | Requirement |
---|---|
CPU | Moderate latency, high reactivity |
NPU/GPU | High throughput, can tolerate latency |
Camera ISP | Real-time deadlines, low jitter |
DMA | Background, best-effort |
Without QoS, aggressive masters (e.g., GPU or NPU) can starve latency-sensitive clients (e.g., camera).
📚 Key QoS Mechanisms
🧩 1. Priority Levels
- Masters assign priority tags to requests (e.g., `QoS[3:0]`)
- Interconnect and memory arbiters serve higher-priority traffic first
⏲️ 2. Bandwidth Reservation / Budgeting
- Reserve minimum bandwidth for each master
- Use token buckets or time-slotting (e.g., TDMA)
🧃 3. Traffic Shaping / Throttling
- Limit rate of high-throughput IPs to avoid queue congestion
- Smooth out bursts that could stall real-time traffic
📦 4. Virtual Channels
- Use logically separate queues per class (real-time, best-effort)
- Avoid head-of-line blocking between unrelated traffic types
🛡️ 5. Preemption / Aging
- Allow high-priority preemption of in-flight requests
- Use aging to prevent starvation of low-priority traffic
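Two of the mechanisms above, token-bucket shaping and age-based priority boosting, are easy to model in a few lines of C. This is a minimal sketch with illustrative parameters (bucket size, refill rate, aging step), not any vendor's implementation:

```c
/* Minimal model of a token-bucket shaper and age-based priority boost.
 * All parameters are illustrative. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint32_t tokens;       /* current budget (1 token = 1 data beat) */
    uint32_t bucket_size;  /* burst allowance                        */
    uint32_t refill_rate;  /* tokens added per cycle                 */
} token_bucket_t;

/* Called every cycle: refill, then decide whether a burst may issue. */
static bool shaper_allow(token_bucket_t *tb, uint32_t beats)
{
    tb->tokens += tb->refill_rate;
    if (tb->tokens > tb->bucket_size)
        tb->tokens = tb->bucket_size;
    if (tb->tokens < beats)
        return false;      /* throttle: not enough budget yet */
    tb->tokens -= beats;
    return true;
}

/* Aging: a waiting request gains effective priority over time,
 * which bounds worst-case starvation of low-QoS traffic. */
static uint32_t effective_priority(uint32_t qos, uint32_t age_cycles)
{
    uint32_t p = qos + (age_cycles >> 8);  /* +1 level per 256 cycles */
    return p > 0xF ? 0xF : p;              /* saturate at max level   */
}

int main(void)
{
    token_bucket_t npu = { .tokens = 0, .bucket_size = 64, .refill_rate = 2 };
    for (int cycle = 0; cycle < 40; cycle++)
        if (shaper_allow(&npu, 16))        /* 16-beat bursts */
            printf("burst admitted at cycle %d\n", cycle);
    printf("QoS 0x2 aged 1024 cycles -> 0x%X\n", effective_priority(0x2, 1024));
    return 0;
}
```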
💥 2. Burst Sizes, AXI/ACE Transactions
✅ AXI Burst Transaction Basics
- AXI uses burst-based memory access to improve bandwidth efficiency
- Types:
- Fixed (same address repeated)
- Incrementing (default, linear addr steps)
- Wrapping (useful for cache-line-aligned accesses)
📏 Burst Length
- AXI3 limits bursts to 16 beats; AXI4 extends INCR bursts to 256 beats (1 beat = one transfer of the data-bus width)
- Larger bursts amortize address and handshake overhead, improving bandwidth efficiency (see the sketch after this list)
- Ideal for:
- NPU weights
- GPU framebuffers
- DMA copies
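As a quick worked example (assuming a 128-bit data bus and a 4 KB transfer, both illustrative values), here is how a transfer size maps to beats and bursts under the AXI4 INCR limit:

```c
/* Beats and bursts needed for a transfer, given the bus width.
 * Bus width and transfer size are example values. */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    const uint32_t bus_bytes  = 16;    /* 128-bit AXI data bus            */
    const uint32_t xfer_bytes = 4096;  /* e.g. one DMA descriptor's worth */
    const uint32_t max_beats  = 256;   /* AXI4 INCR limit (16 in AXI3)    */

    uint32_t beats  = (xfer_bytes + bus_bytes - 1) / bus_bytes;
    uint32_t bursts = (beats + max_beats - 1) / max_beats;

    printf("%u beats -> %u burst(s) of up to %u beats\n",
           beats, bursts, max_beats);
    return 0;
}
```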
🔁 ACE Extensions
- ACE = AXI + cache coherency extensions
- Adds:
- Barrier transactions
- Snoop requests (e.g., `ReadClean`, `MakeUnique`)
- Cache maintenance operations
⚖️ 3. Arbitration Logic
🔀 Purpose:
Arbitration decides which master gets access to a shared resource (e.g., interconnect switch, memory port).
📚 Types of Arbitration:
Type | Description | Pros | Cons |
---|---|---|---|
Round-Robin | Equal turn-taking | Fair | Ignores urgency |
Fixed Priority | Always favors certain masters | Simple | Can starve low-priority |
TDMA (Time-Division) | Reserved slots | Predictable | Rigid |
QoS-Aware Weighted Arbitration | Prioritized based on request class, with bandwidth targets | Balanced | Complex |
🛠️ Techniques:
- Token buckets for rate-limiting
- Latency-aware arbitration for real-time IPs
- Age-based fairness to avoid starvation
- Hierarchical arbitration for multi-level NoCs
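These techniques can be combined. Below is a hypothetical C model of a QoS-aware arbiter that grants the highest-QoS requester and breaks ties round-robin from the last grant; it is model code only, not RTL:

```c
/* QoS-aware arbiter model: highest QoS wins, ties broken round-robin
 * starting after the last granted master. Illustrative only. */
#include <stdint.h>
#include <stdio.h>

#define NUM_MASTERS 4

typedef struct {
    uint8_t  req_valid[NUM_MASTERS];
    uint8_t  qos[NUM_MASTERS];   /* 0x0 (lowest) .. 0xF (highest) */
    uint32_t last_grant;
} arbiter_t;

/* Returns the granted master index, or -1 if nothing is requesting. */
static int arbitrate(arbiter_t *a)
{
    int best = -1, best_qos = -1;
    for (uint32_t i = 1; i <= NUM_MASTERS; i++) {
        uint32_t m = (a->last_grant + i) % NUM_MASTERS;  /* RR order */
        if (a->req_valid[m] && (int)a->qos[m] > best_qos) {
            best = (int)m;
            best_qos = a->qos[m];
        }
    }
    if (best >= 0)
        a->last_grant = (uint32_t)best;
    return best;
}

int main(void)
{
    /* master 0: ISP-like (QoS 0xF); masters 1 and 3: bulk traffic (QoS 0x4) */
    arbiter_t a = { .req_valid = {1, 1, 0, 1},
                    .qos       = {0xF, 0x4, 0x0, 0x4},
                    .last_grant = 0 };
    printf("grant -> master %d\n", arbitrate(&a));
    return 0;
}
```

An age-based boost (as in the earlier sketch) can be folded into the comparison to keep low-QoS masters from starving.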
🧪 Hands-On Scenario 1: Optimizing CHI Interconnect for CPU + NPU + Camera ISP
🎯 Scenario
You’re designing a CHI-based interconnect for an SoC with:
- CPU Cluster (4 cores)
- NPU (Neural Processing Unit) with high-bandwidth burst loads
- Camera ISP for real-time 4K video processing
🧱 Goal
Ensure:
- Camera meets 33ms frame deadline (30 FPS)
- NPU achieves max sustained throughput
- CPU gets responsive access
🔧 Step-by-Step Optimization Plan
1. 🧩 Classify Traffic
IP | Traffic Class | Constraints |
---|---|---|
ISP | Real-Time (High QoS) | < 33ms latency |
NPU | Throughput | Saturate memory |
CPU | Moderate Priority | Low jitter preferred |
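To sanity-check these constraints, a back-of-envelope estimate of the ISP's write bandwidth and frame budget (assuming a 16-bit-per-pixel 4K stream with one DRAM write pass per frame; both are assumptions for illustration):

```c
/* Rough ISP bandwidth and frame-budget estimate. Pixel format and
 * single-pass assumption are illustrative. */
#include <stdio.h>

int main(void)
{
    const double width = 3840, height = 2160;
    const double bytes_per_px = 2.0;     /* assumed 16-bit pixels */
    const double fps = 30.0;

    double bytes_per_frame = width * height * bytes_per_px;   /* ~16.6 MB  */
    double bw_gbs          = bytes_per_frame * fps / 1e9;     /* ~0.5 GB/s */
    double frame_budget_ms = 1000.0 / fps;                    /* 33.3 ms   */

    printf("ISP write BW ~= %.2f GB/s, frame budget = %.1f ms\n",
           bw_gbs, frame_budget_ms);
    return 0;
}
```

Even at only ~0.5 GB/s, the ISP can miss its deadline if its transactions queue behind multi-GB/s NPU bursts, which is exactly what QoS tagging prevents.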
2. 🧠 CHI QoS Configuration
Assign QoS tags:
- ISP = `QoS[3:0] = 0xF`
- CPU = `QoS = 0xA`
- NPU = `QoS = 0x4`
Enable Virtual Channels for real-time traffic.
Configure CHI arbitration to prioritize the high-QoS VC for reads and writes.
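One hypothetical way to capture this step is a boot-time configuration table; the struct, field names, and VC numbering below are assumptions for illustration, not actual CHI or CMN-600 registers:

```c
/* Hypothetical boot-time QoS/VC configuration table.
 * Field names and VC numbering are illustrative. */
#include <stdint.h>
#include <stdio.h>

typedef struct {
    const char *master;
    uint8_t     qos;   /* CHI QoS field, 0x0..0xF    */
    uint8_t     vc;    /* virtual channel assignment */
} qos_cfg_t;

static const qos_cfg_t soc_qos_table[] = {
    { "ISP", 0xF, 0 },  /* real-time: highest QoS, dedicated VC */
    { "CPU", 0xA, 1 },  /* latency-sensitive but elastic        */
    { "NPU", 0x4, 1 },  /* throughput class, shaped separately  */
};

int main(void)
{
    for (unsigned i = 0; i < sizeof soc_qos_table / sizeof soc_qos_table[0]; i++)
        printf("%-3s  QoS=0x%X  VC=%u\n", soc_qos_table[i].master,
               (unsigned)soc_qos_table[i].qos, (unsigned)soc_qos_table[i].vc);
    return 0;
}
```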
3. 🚦 Memory System Behavior
- Use TDMA slots for ISP (e.g., a reserved window every 10 µs)
- Apply a bandwidth cap on the NPU (e.g., 2 GB/s max); see the sketch after this list
- Insert write-combining buffers for NPU burst handling
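A sketch of how the TDMA window and the NPU cap could be checked per interconnect cycle. The 10 µs slot period and 2 GB/s cap come from the bullets above; the NoC clock, slot width, accounting window, and function names are assumptions:

```c
/* TDMA-slot check for the ISP and a sliding-window bandwidth cap for
 * the NPU. Clock, slot width, and window size are illustrative. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NOC_FREQ_HZ    800000000ULL   /* assumed NoC clock          */
#define TDMA_PERIOD_US 10             /* ISP slot every 10 us       */
#define ISP_SLOT_US    2              /* assumed 2 us reserved slot */

/* True while the current cycle falls inside the ISP's reserved slot. */
static bool isp_slot_active(uint64_t cycle)
{
    uint64_t cycles_per_us = NOC_FREQ_HZ / 1000000ULL;
    uint64_t period = cycles_per_us * TDMA_PERIOD_US;
    uint64_t slot   = cycles_per_us * ISP_SLOT_US;
    return (cycle % period) < slot;
}

/* Admit an NPU request only if it keeps the accounting window under 2 GB/s. */
static bool npu_under_cap(uint64_t bytes_in_window, uint64_t window_ns,
                          uint32_t req_bytes)
{
    const uint64_t cap_bytes_per_s = 2000000000ULL;            /* 2 GB/s */
    uint64_t budget = cap_bytes_per_s * window_ns / 1000000000ULL;
    return bytes_in_window + req_bytes <= budget;
}

int main(void)
{
    printf("cycle 100: ISP slot %s\n", isp_slot_active(100) ? "active" : "idle");
    printf("NPU 256 B request admitted: %d\n", npu_under_cap(1500, 1000, 256));
    return 0;
}
```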
4. 🔬 Monitoring Tools
Counters for:
- Interconnect latency per QoS level
- Memory access turnaround time
- Write buffer occupancy
Enable debug snoop tracing for:
- Snoop-induced stalls
- Cross-IP interference
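For the latency counters, a per-QoS histogram is usually enough to spot cross-IP interference. The bookkeeping below is what you might add to a SystemC model or a trace post-processor; the bucket scheme is an assumption, not a CMN-600 PMU interface:

```c
/* Per-QoS latency histogram with power-of-two buckets (bucket 0 holds
 * latencies up to 16 cycles, each later bucket doubles the bound). */
#include <stdint.h>
#include <stdio.h>

#define QOS_LEVELS  16
#define LAT_BUCKETS 8

static uint64_t lat_hist[QOS_LEVELS][LAT_BUCKETS];

static void record_latency(uint8_t qos, uint32_t latency_cycles)
{
    unsigned b = 0;
    while ((1u << (b + 4)) < latency_cycles && b < LAT_BUCKETS - 1)
        b++;
    lat_hist[qos & 0xF][b]++;
}

int main(void)
{
    record_latency(0xF, 40);   /* ISP access: lands in the 33..64 bucket   */
    record_latency(0x4, 500);  /* NPU access: lands in the 257..512 bucket */
    for (int b = 0; b < LAT_BUCKETS; b++)
        printf("QoS 0xF bucket %d: %llu\n", b,
               (unsigned long long)lat_hist[0xF][b]);
    return 0;
}
```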
5. 📈 Expected Performance Outcomes
IP | Without QoS | With QoS Tuned |
---|---|---|
ISP | Misses deadline | Hits 33ms reliably |
NPU | Starves ISP | Slightly throttled, steady |
CPU | High jitter | Stable latency window |
🧠 Takeaways
- Use QoS tagging + arbitration to prioritize latency over bandwidth
- Match burst size and interconnect width for throughput-heavy IPs
- Use TDMA or bandwidth guards for isolation
- Profile VC congestion and per-QoS latency histograms to guide tuning
🧪 Hands-On Scenario 2: Debugging a camera ISP with a CMN-600 interconnect
You’re a performance architect debugging a real-time camera ISP in an SoC using Arm’s CMN-600 interconnect. The ISP captures and processes 4K frames at 30 FPS (frame deadline: 33.3ms), but you’re seeing occasional frame drops in logs.
Let’s simulate the debug process you’d walk through in a post-silicon trace or a SystemC simulation.
📍 Step 1: Understand the Memory Path
Camera ISP → L2 Cache / TCM → CMN-600 Interconnect → DRAM Controller → LPDDR4 DRAM
Key suspects:
- Contention on CMN-600 NoC
- High snoop latency
- ISP writeback stalls
- DRAM bandwidth saturation
- CHI QoS misconfiguration
📊 Step 2: Gather Profiling Counters
Assume you have access to the CMN-600 performance monitors and memory-controller counters.
Metric | Observed Value |
---|---|
ISP DRAM Write Latency (P95) | 75 μs |
NPU DRAM Read Bandwidth | 4.5 GB/s |
CPU Cache Snoop Latency (avg) | 110 cycles |
DRAM Row Buffer Hit Rate | 30% |
ISP Transaction QoS Tag | 0x4 |
→ Red flag: a 75 μs P95 per write is far too high; accumulated across the thousands of writeback bursts in a frame, it exceeds the 33.3 ms budget (see the sketch below)
→ Red flag: ISP QoS tag too low (0x4 = best effort)
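Why 75 μs per write blows a 33.3 ms budget even though it looks small: frame writeback is thousands of transactions deep, and only a handful can be in flight at once. The burst size and outstanding depth below are assumptions chosen to illustrate the effect:

```c
/* Rough frame-writeback time given per-transaction P95 latency and a
 * limited number of outstanding writes. Burst size and outstanding
 * depth are assumptions. */
#include <stdio.h>

int main(void)
{
    const double frame_bytes = 3840.0 * 2160.0 * 2.0;  /* ~16.6 MB         */
    const double burst_bytes = 4096.0;                 /* 4 KB writebacks  */
    const double outstanding = 4.0;                    /* in-flight writes */
    const double p95_lat_us  = 75.0;

    double bursts   = frame_bytes / burst_bytes;
    double frame_ms = bursts / outstanding * p95_lat_us / 1000.0;

    printf("~%.0f bursts -> ~%.1f ms per frame (budget: 33.3 ms)\n",
           bursts, frame_ms);
    return 0;
}
```

Under these assumptions the frame takes roughly 76 ms to drain, so even occasional 75 μs outliers translate directly into dropped frames.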
🔎 Step 3: Hypothesis Formation
- QoS Priority Violation: ISP requests are treated as best-effort → delayed behind NPU bursts
- VC Congestion in CMN-600: Shared virtual channels → head-of-line blocking
- Poor DRAM scheduling: Row-buffer misses or long write-to-read turnaround
- Backpressure from interconnect: Write combining buffer full, stalling ISP
🛠️ Step 4: Fixes
✅ QoS Elevation: Set ISP CHI QoS to `0xF`. Use CMN-600 `VC0` for real-time traffic.
✅ Dedicated Virtual Channel: Map ISP writes to `VC0`. Isolate NPU traffic to `VC1`. Enable fixed-priority arbitration between VCs.
✅ Bandwidth Cap on NPU: Throttle NPU via interconnect rate limiter to 3 GB/s. Insert burst shaping in DMA engine.
✅ Memory Partitioning: Assign ISP to Bank Group 0–1. Assign NPU to Bank Group 2–3.
✅ Interconnect Clock Domain Boost: Increase NoC frequency from 600 MHz to 800 MHz, reducing backpressure latency (see the sketch below).
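The clock boost only helps paths whose cost is a fixed cycle count. Purely as an illustration, here is what a 110-cycle path (the observed snoop latency) costs at each frequency:

```c
/* Nanosecond cost of a fixed-cycle path at 600 MHz vs. 800 MHz. */
#include <stdio.h>

int main(void)
{
    const double path_cycles = 110.0;  /* e.g. snoop/backpressure path */
    printf("600 MHz: %.1f ns, 800 MHz: %.1f ns\n",
           path_cycles / 0.6, path_cycles / 0.8);
    return 0;
}
```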
✅ Outcome:
Metric | Before | After |
---|---|---|
ISP DRAM Write Latency (P95) | 75 μs | 21 μs |
ISP QoS Priority | 0x4 | 0xF |
Frame Drops per 10s | 4 | 0 |
NPU Throughput Impact | – | -5% max |
💡 ISP stabilizes at < 25μs latency, zero frame drops.