100% LLM generated content.

QoS and arbitration in SoC Interconnects.

🚦 1. QoS Mechanisms for Memory and Interconnect Traffic

✅ Why QoS?

In a heterogeneous SoC, different IPs have different needs:

| IP | Requirement |
| --- | --- |
| CPU | Moderate latency, high reactivity |
| NPU/GPU | High throughput, can tolerate latency |
| Camera ISP | Real-time deadlines, low jitter |
| DMA | Background, best-effort |

Without QoS, aggressive masters (e.g., GPU or NPU) can starve latency-sensitive clients (e.g., camera).


📚 Key QoS Mechanisms

🧩 1. Priority Levels

  • Masters assign priority tags to requests (e.g., QoS[3:0])
  • Interconnect and memory arbiters serve higher-priority traffic first
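As a rough illustration of priority-based arbitration (a sketch, not any specific interconnect's logic; the struct and field names are made up), an arbiter can simply grant the pending request with the highest QoS tag:

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical request carrying an AXI-style 4-bit QoS tag (0 = lowest, 15 = highest).
struct Request {
    int      master_id;
    uint8_t  qos;       // QoS[3:0]
    uint64_t address;
};

// Strict-priority arbitration: serve the pending request with the highest QoS tag.
// Ties fall back to whichever request was seen first; a real design would add
// round-robin or aging on top to avoid starving equal-priority masters.
std::optional<Request> arbitrate(const std::vector<Request>& pending) {
    std::optional<Request> winner;
    for (const auto& req : pending) {
        if (!winner || req.qos > winner->qos) {
            winner = req;
        }
    }
    return winner;
}
```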

⏲️ 2. Bandwidth Reservation / Budgeting

  • Reserve minimum bandwidth for each master
  • Use token buckets or time-slotting (e.g., TDMA)
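A minimal sketch of the token-bucket accounting such a budget implies; the rates and names are assumptions for illustration, not taken from real hardware:

```cpp
#include <algorithm>
#include <cstdint>

// Minimal token bucket: 'rate' tokens (bytes) are credited per cycle up to a
// 'burst' capacity; a request may issue only if enough tokens cover its size.
struct TokenBucket {
    double tokens;
    double rate;   // bytes credited per cycle (the reserved bandwidth)
    double burst;  // maximum accumulated credit (allowed burstiness)

    void tick() {
        tokens = std::min(tokens + rate, burst);
    }

    bool try_consume(uint32_t bytes) {
        if (tokens < bytes) return false;  // over budget: stall or deprioritize
        tokens -= bytes;
        return true;
    }
};
```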

🧃 3. Traffic Shaping / Throttling

  • Limit rate of high-throughput IPs to avoid queue congestion
  • Smooth out bursts that could stall real-time traffic

📦 4. Virtual Channels

  • Use logically separate queues per class (real-time, best-effort)
  • Avoid head-of-line blocking between unrelated traffic types
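A minimal sketch of why separate per-class queues avoid head-of-line blocking: a stalled best-effort packet never sits in front of a real-time one. Class and field names here are hypothetical:

```cpp
#include <cstdint>
#include <deque>
#include <optional>

enum class TrafficClass { RealTime = 0, BestEffort = 1 };

struct Packet { int master_id; uint32_t bytes; };

// Two logically separate queues sharing one physical link: a stalled best-effort
// packet at the head of VC1 cannot block a real-time packet waiting in VC0.
class VirtualChannels {
public:
    void push(TrafficClass tc, const Packet& p) { vc_[static_cast<int>(tc)].push_back(p); }

    // Real-time VC is drained first; best-effort only advances when VC0 is empty.
    std::optional<Packet> pop_next() {
        for (auto& q : vc_) {
            if (!q.empty()) { Packet p = q.front(); q.pop_front(); return p; }
        }
        return std::nullopt;
    }

private:
    std::deque<Packet> vc_[2];
};
```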

🛡️ 5. Preemption / Aging

  • Allow high-priority preemption of in-flight requests
  • Use aging to prevent starvation of low-priority traffic
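One common aging scheme adds a wait-time bonus to the static priority, so starved low-priority requests eventually win. A sketch, with a made-up aging constant:

```cpp
#include <cstdint>
#include <vector>

struct AgedRequest {
    uint8_t  qos;          // static priority tag
    uint32_t wait_cycles;  // how long the request has been pending
};

// Effective priority grows with waiting time, so low-QoS traffic eventually wins
// even against a steady stream of high-QoS requests. AGE_SHIFT controls how fast.
constexpr uint32_t AGE_SHIFT = 6;  // +1 effective priority every 64 cycles (illustrative tuning)

uint32_t effective_priority(const AgedRequest& r) {
    return r.qos + (r.wait_cycles >> AGE_SHIFT);
}

int pick_winner(const std::vector<AgedRequest>& pending) {
    int best = -1;
    uint32_t best_prio = 0;
    for (size_t i = 0; i < pending.size(); ++i) {
        uint32_t p = effective_priority(pending[i]);
        if (best < 0 || p > best_prio) { best = static_cast<int>(i); best_prio = p; }
    }
    return best;  // index of request to serve, or -1 if none pending
}
```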

💥 2. Burst Sizes, AXI/ACE Transactions

✅ AXI Burst Transaction Basics

  • AXI uses burst-based memory access to improve bandwidth efficiency
  • Types:
    • Fixed (same address repeated)
    • Incrementing (default, linear addr steps)
    • Wrapping (useful for cache-line-aligned accesses)

📏 Burst Length

  • AXI3 allows bursts of up to 16 beats; AXI4 extends INCR bursts to 256 beats (1 beat = one data transfer)
  • Larger bursts = better bandwidth, less overhead
  • Ideal for:
    • NPU weights
    • GPU framebuffers
    • DMA copies
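To make the burst types concrete, here is a small helper (illustrative only, not from any AXI library) that generates per-beat addresses for incrementing and wrapping bursts; the wrap boundary is what makes WRAP bursts convenient for critical-word-first cache line fills:

```cpp
#include <cstdint>
#include <vector>

// Generate the address of each beat in an AXI burst.
// len        = number of beats (AxLEN + 1), size_bytes = 2^AxSIZE bytes per beat.
// WRAP bursts wrap at a boundary of len * size_bytes (len must be 2, 4, 8 or 16).
std::vector<uint64_t> burst_addresses(uint64_t start, unsigned len,
                                      unsigned size_bytes, bool wrap) {
    std::vector<uint64_t> beats;
    uint64_t total = static_cast<uint64_t>(len) * size_bytes;
    uint64_t lower = wrap ? (start / total) * total : 0;   // wrap boundary
    uint64_t addr  = start;
    for (unsigned i = 0; i < len; ++i) {
        beats.push_back(addr);
        addr += size_bytes;
        if (wrap && addr >= lower + total) addr = lower;   // wrap around
    }
    return beats;
}
// Example: burst_addresses(0x48, 4, 8, true) -> 0x48, 0x50, 0x58, 0x40
```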

🔁 ACE Extensions

  • ACE = AXI + cache coherency extensions
  • Adds:
    • Barrier transactions
    • Coherent and snoop transactions (e.g., ReadClean, MakeUnique)
    • Cache maintenance operations

⚖️ 3. Arbitration Logic

🔀 Purpose:

Arbitration decides which master gets access to a shared resource (e.g., interconnect switch, memory port).


📚 Types of Arbitration:

| Type | Description | Pros | Cons |
| --- | --- | --- | --- |
| Round-Robin | Equal turn-taking | Fair | Ignores urgency |
| Fixed Priority | Always favors certain masters | Simple | Can starve low-priority |
| TDMA (Time-Division) | Reserved slots | Predictable | Rigid |
| QoS-Aware Weighted Arbitration | Prioritized based on request class, with bandwidth targets | Balanced | Complex |
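To ground the table, here is a sketch of the simplest scheme, plain round-robin; the QoS-aware variants layer priority tags and bandwidth accounting on top of a rotating grant pointer like this one:

```cpp
#include <vector>

// Plain round-robin: the grant pointer rotates so every requesting master gets
// an equal turn, regardless of urgency (the "ignores urgency" drawback above).
class RoundRobinArbiter {
public:
    explicit RoundRobinArbiter(int num_masters) : n_(num_masters), last_(num_masters - 1) {}

    // request[i] == true means master i has a pending transaction.
    // Returns the granted master index, or -1 if nobody is requesting.
    int grant(const std::vector<bool>& request) {
        for (int offset = 1; offset <= n_; ++offset) {
            int candidate = (last_ + offset) % n_;
            if (request[candidate]) { last_ = candidate; return candidate; }
        }
        return -1;
    }

private:
    int n_;
    int last_;  // last granted master; the search starts just after it
};
```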

🛠️ Techniques:

  • Token buckets for rate-limiting
  • Latency-aware arbitration for real-time IPs
  • Age-based fairness to avoid starvation
  • Hierarchical arbitration for multi-level NoCs

🧪 Hands-On Scenario 1: Optimizing CHI Interconnect for CPU + NPU + Camera ISP

🎯 Scenario

You’re designing a CHI-based interconnect for an SoC with:

  • CPU Cluster (4 cores)
  • NPU (Neural Processing Unit) with high-bandwidth burst loads
  • Camera ISP for real-time 4K video processing

🧱 Goal

Ensure:

  • Camera meets 33ms frame deadline (30 FPS)
  • NPU achieves max sustained throughput
  • CPU gets responsive access

🔧 Step-by-Step Optimization Plan

1. 🧩 Classify Traffic

| IP | Traffic Class | Constraints |
| --- | --- | --- |
| ISP | Real-Time (High QoS) | < 33 ms latency |
| NPU | Throughput | Saturate memory |
| CPU | Moderate Priority | Low jitter preferred |

2. 🧠 CHI QoS Configuration

  • Assign QoS tags:
    • ISP = QoS[3:0] = 0xF
    • CPU = QoS = 0xA
    • NPU = QoS = 0x4
  • Enable Virtual Channels for real-time traffic
  • Configure CHI arbitration to prioritize high-QoS VC for reads/writes
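A table-driven way to capture this plan in a performance model (the struct and its virtual-channel field are hypothetical abstractions, not CHI register programming):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical per-master QoS plan mirroring the assignments above.
// Real CHI hardware takes the 4-bit QoS value on each request flit; the
// virtual-channel mapping here is an abstraction of interconnect configuration.
struct MasterQosConfig {
    std::string name;
    uint8_t     qos;             // CHI QoS[3:0]
    int         virtual_channel;
};

const std::vector<MasterQosConfig> kQosPlan = {
    {"ISP", 0xF, 0},  // real-time: highest priority, dedicated VC0
    {"CPU", 0xA, 1},  // latency-sensitive but tolerant
    {"NPU", 0x4, 1},  // throughput traffic, lowest priority
};
```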


3. 🚦 Memory System Behavior

  • Use TDMA slots for ISP (e.g., every 10us)
  • Apply bandwidth cap on NPU (e.g., 2 GB/s max)
  • Insert write-combining buffers for NPU burst handling
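A sketch of the TDMA gating idea, assuming the 10 µs period above with a 2 µs slot reserved for the ISP (the slot width is an illustrative choice, not part of the scenario):

```cpp
#include <cstdint>

// Illustrative TDMA gate: within every 'kPeriodNs' window the first 'kSlotNs'
// is reserved for the ISP. During that slot, other masters are held off; outside
// it, everyone (including the ISP) competes through normal arbitration.
constexpr uint64_t kPeriodNs = 10'000;  // 10 us period from the plan above
constexpr uint64_t kSlotNs   = 2'000;   // 2 us reserved for the ISP (assumed value)

bool in_isp_slot(uint64_t now_ns) {
    return (now_ns % kPeriodNs) < kSlotNs;
}

bool may_issue(bool is_isp, uint64_t now_ns) {
    return is_isp || !in_isp_slot(now_ns);
}
```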

4. 🔬 Monitoring Tools

  • Counters for:
    • Interconnect latency per QoS level
    • Memory access turnaround time
    • Write buffer occupancy
  • Enable debug snoop tracing for:
    • Snoop-induced stalls
    • Cross-IP interference
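These counters are typically post-processed offline; a minimal sketch of turning sampled per-transaction latencies into the P95 figures used below (function and variable names are hypothetical):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Compute a latency percentile (e.g., p = 0.95) from sampled per-transaction
// latencies collected via the performance counters. This is offline analysis
// of a trace, not something the hardware computes itself.
uint64_t latency_percentile(std::vector<uint64_t> samples_ns, double p) {
    if (samples_ns.empty()) return 0;
    std::sort(samples_ns.begin(), samples_ns.end());
    size_t idx = static_cast<size_t>(p * (samples_ns.size() - 1));
    return samples_ns[idx];
}
// Usage (hypothetical trace): latency_percentile(isp_write_latencies, 0.95),
// then compare the result against the per-frame latency budget.
```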

5. 📈 Expected Performance Outcomes

| IP | Without QoS | With QoS Tuned |
| --- | --- | --- |
| ISP | Misses deadline | Hits 33 ms reliably |
| NPU | Starves ISP | Slightly throttled, steady |
| CPU | High jitter | Stable latency window |

🧠 Takeaways

  • Use QoS tagging + arbitration to prioritize latency over bandwidth
  • Match burst size and interconnect width for throughput-heavy IPs
  • Use TDMA or bandwidth guards for isolation
  • Profile VC congestion, latency histograms to optimize

🧪 Hands-On Scenario 2: Debugging a camera ISP with a CMN-600 interconnect

You’re a performance architect debugging a real-time camera ISP in an SoC using Arm’s CMN-600 interconnect. The ISP captures and processes 4K frames at 30 FPS (frame deadline: 33.3ms), but you’re seeing occasional frame drops in logs.

Let’s simulate the debug process you’d walk through in a post-silicon trace or a SystemC simulation.

📍 Step 1: Understand the Memory Path

Camera ISP → L2 Cache / TCM → CMN-600 Interconnect → DRAM Controller → LPDDR4 DRAM

Key suspects:

  • Contention on CMN-600 NoC
  • High snoop latency
  • ISP writeback stalls
  • DRAM bandwidth saturation
  • CHI QoS misconfiguration

📊 Step 2: Gather Profiling Counters

Assume you have access to CMN-600 perf monitors and memory controller counters.

| Metric | Observed Value |
| --- | --- |
| ISP DRAM Write Latency (P95) | 75 μs |
| NPU DRAM Read Bandwidth | 4.5 GB/s |
| CPU Cache Snoop Latency (avg) | 110 cycles |
| DRAM Row Buffer Hit Rate | 30% |
| ISP Transaction QoS Tag | 0x4 |

Red flags: the P95 ISP write latency is far above target and eats into the 33.3 ms frame budget, and the ISP QoS tag is too low (0x4, effectively best effort).

🔎 Step 3: Hypothesis Formation

  1. QoS Priority Violation: ISP requests are treated as best-effort → delayed behind NPU bursts
  2. VC Congestion in CMN-600: Shared virtual channels → head-of-line blocking
  3. Poor DRAM scheduling: Row-buffer misses or long write-to-read turnaround
  4. Backpressure from interconnect: Write combining buffer full, stalling ISP

🛠️ Step 4: Fixes

  • QoS Elevation: Set ISP CHI QoS = 0xF. Use CMN-600 VC0 for real-time traffic.
  • Dedicated Virtual Channel: Map ISP writes to VC0; isolate NPU traffic to VC1. Enable VC arbitration based on fixed priority.
  • Bandwidth Cap on NPU: Throttle the NPU via the interconnect rate limiter to 3 GB/s. Insert burst shaping in the DMA engine.
  • Memory Partitioning: Assign the ISP to Bank Group 0–1 and the NPU to Bank Group 2–3.
  • Interconnect Clock Domain Boost: Increase NoC frequency from 600 MHz to 800 MHz to reduce backpressure latency.
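One way to keep these decisions testable in a SystemC or analytical model is to hold them in a plain configuration structure; a sketch with hypothetical field names, none of which correspond to actual CMN-600 registers:

```cpp
#include <cstdint>

// Plain data structure capturing the tuning decisions above so they can be
// swept in a performance model. Programming the real interconnect goes through
// its configuration space as documented by Arm; nothing here maps to it directly.
struct InterconnectTuning {
    uint8_t  isp_qos           = 0xF;   // elevated from 0x4
    int      isp_vc            = 0;     // dedicated real-time VC
    int      npu_vc            = 1;     // isolated best-effort VC
    double   npu_bw_cap_gbps   = 3.0;   // rate limiter on NPU traffic
    int      isp_bank_group_lo = 0, isp_bank_group_hi = 1;
    int      npu_bank_group_lo = 2, npu_bank_group_hi = 3;
    uint32_t noc_freq_mhz      = 800;   // up from 600 MHz
};
```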

Outcome:

| Metric | Before | After |
| --- | --- | --- |
| ISP DRAM Write Latency (P95) | 75 μs | 21 μs |
| ISP QoS Priority | 0x4 | 0xF |
| Frame Drops per 10 s | 4 | 0 |
| NPU Throughput Impact | – | −5% max |

💡 ISP stabilizes at < 25μs latency, zero frame drops.