Table of Contents Link to heading
- Table of Contents
- The 90s
- Noteworthy GPUs
- Evolution of GPU Architecture
- Dictionary
This is a post about the history of GPUs, from the early days to the latest generations.
The 90s Link to heading
The following table shows the releases of different GPUs over the 90s and 2000s: the list I grew up with.
Year | Company | Graphics Card | Key Features |
---|---|---|---|
1995 | 3dfx | Voodoo Graphics | First dedicated 3D accelerator (no 2D support) |
1995 | Matrox | Millennium | High-end 2D performance but weak 3D |
1995 | S3 | ViRGE | One of the first consumer 3D GPUs (slow) |
1996 | 3dfx | Voodoo Rush | Voodoo Graphics + 2D support (but slow) |
1996 | NVIDIA | NV1 | First NVIDIA card, quadratic texture mapping (failed) |
1997 | 3dfx | Voodoo2 | SLI support (two cards linked together), multiple texture units |
1997 | NVIDIA | Riva 128 (NV3) | First successful NVIDIA card, 2D + 3D integration |
1997 | ATI | Rage Pro | ATI’s first serious 3D accelerator |
1998 | 3dfx | Voodoo Banshee | Integrated 2D + 3D (but only one texture unit) |
1998 | NVIDIA | Riva TNT (NV4) | First dual-pipeline GPU, 32-bit color |
1998 | S3 | Savage3D | Trilinear filtering, S3 Texture Compression (S3TC) |
1999 | 3dfx | Voodoo3 | Higher clock speed but no 32-bit color |
1999 | NVIDIA | Riva TNT2 (NV5) | Higher clocks, AGP 4X support |
1999 | ATI | Rage 128 | 32-bit color, DirectX 6 support |
1999 | Matrox | G400 | First DualHead multi-monitor support |
1999 | NVIDIA | GeForce 256 (NV10) | First “GPU” (Hardware T&L), DDR memory |
2000 | 3dfx | Voodoo5 5500 | FSAA (Anti-aliasing), dual GPUs on one card |
2000 | NVIDIA | GeForce 2 GTS (NV15) | Per-pixel shading (fixed-function NSR), very high fill rate |
2000 | ATI | Radeon DDR | First ATI Radeon-branded GPU, 32-bit rendering |
2000 | S3 | Savage 2000 | Failed due to poor drivers |
2001 | NVIDIA | GeForce 3 (NV20) | First programmable pixel & vertex shaders (DX8) |
2001 | ATI | Radeon 8500 | Competed with GeForce 3, introduced TruForm (N-Patches) |
2001 | 3dfx | (Acquired by NVIDIA) | 3dfx shuts down after bankruptcy |
2002 | NVIDIA | GeForce 4 Ti (NV25) | Improved shaders, fastest DX8 card |
2002 | ATI | Radeon 9700 Pro | First DirectX 9 GPU, superior to GeForce 4 |
2003 | NVIDIA | GeForce FX 5800 (NV30) | First DX9 NVIDIA card, but too hot & loud |
2003 | ATI | Radeon 9800 Pro | Faster than GeForce FX, best DX9 card of the time |
2004 | ATI | Radeon X800 XT | Competed with GeForce 6800, lacked SM3.0 |
2005 | NVIDIA | GeForce 7800 GTX | High-end DirectX 9.0c (SM3.0) GPU with HDR support |
2005 | ATI | Radeon X1800 XT | High-performance alternative to 7800 GTX |
In the late 1990s, when I was just 5-10 years old, the GPU industry was highly competitive, with 3dfx, ATI, Matrox, and S3 leading the market. NVIDIA’s earlier Riva 128 (NV3) had introduced integrated 2D/3D acceleration, but its single-pipeline design and limited 4 MB of SDRAM held it back.
I still remember the awe I felt for the Voodoo2.
Essentially:
3dfx (1996–2000):
- Pioneered 3D gaming (Voodoo 1, Voodoo2).
- Fell behind in innovation (no 32-bit color, weak T&L).
- Acquired by NVIDIA in 2001.
- Revolutionized 3D graphics but failed to adapt.
NVIDIA (1997–Present):
- TNT series (1998–1999): First dual-pipeline GPU.
- GeForce 256 (1999): First true GPU (Hardware T&L).
- Dominated from 2000 onward with GeForce 2, 4, and later the 6800 and 7800 series (the GeForce FX generation being the notable stumble).
ATI (1997–2006, later AMD Radeon):
- Rage series struggled vs. NVIDIA.
- Radeon 9700 Pro (2002) beat GeForce 4.
- Radeon 9800 Pro won vs. GeForce FX (2003–2004).
S3 Graphics & Matrox (Declined after 2000):
- S3 Savage3D (1998) introduced texture compression (S3TC) but had bad drivers.
- Matrox G400 (1999) introduced dual-monitor tech, but was weak in 3D gaming.
In this post, we look at the history of GPUs and the ones that left a mark on my life. Starting with: the TNT (NVIDIA Riva TNT architecture).
Noteworthy GPUs Link to heading
NVIDIA Riva TNT Link to heading
The “TNT” suffix refers to the chip’s ability to work on two texels at once (TwiN Texel). Riva stands for “Real-time Interactive Video and Animation accelerator”.
The NVIDIA Riva TNT, released in 1998, was NVIDIA’s first GPU with a dual-pipeline architecture. It directly competed against 3dfx Voodoo2, offering:
- Integrated 2D + 3D acceleration
- Dual pixel pipelines (two pixels processed simultaneously)
- 32-bit color support
- Full DirectX 6 and OpenGL compatibility
Let’s explore the internal architecture, data flow, and the detailed graphics pipeline clearly.
The TNT card had a black heatsink, and it was a rather simple design. It looked like this.
NVIDIA TNT (NV4) Block Diagram Link to heading
Here’s a simplified block diagram showing the internal flow of data in the TNT chip:
CPU
│ (geometry data via AGP 2X/PCI)
▼
┌───────────────────────────────────────┐
│ AGP/PCI Interface │
└───────────────────────────────────────┘
│
▼
┌───────────────────────────────────────┐
│ Triangle Setup Engine │
│ (Triangle Setup, Clipping, Culling)  │
└───────────────────────────────────────┘
│
▼
┌───────────────────────────────────────┐
│ Rasterization Unit │
│ (Dual Pixel Pipelines) │
└───────────────────────────────────────┘
│ │
▼ ▼
┌─────────────────┐ ┌─────────────────┐
│ Pipeline 1 │ │ Pipeline 2 │
│ ├─ Texture Unit │ │ ├─ Texture Unit │
│ └─ Pixel Shader │ │ └─ Pixel Shader │
└─────────────────┘ └─────────────────┘
│ │
▼ ▼
┌───────────────────────────────────────┐
│ Raster Operation (ROP) │
│ (Z-buffering, Alpha blending) │
└───────────────────────────────────────┘
│
▼
┌───────────────────────────────────────┐
│ 128-bit Memory Interface (SDRAM/SGRAM)│
│ (Frame buffer & Textures) │
└───────────────────────────────────────┘
│
▼
Display (CRT Monitor)
Let’s carefully break down how the TNT GPU processed data from CPU to display:
🚩 Step 1: CPU & AGP/PCI Interface
- CPU handles geometry transformations (3D math).
- Sends transformed triangle data (vertices) over the AGP 2X (or PCI) bus.
- AGP provided higher bandwidth (up to 533 MB/s) than PCI (133 MB/s), improving performance.
CPU → AGP 2X → TNT GPU
Does anyone remember the AGP bus? :) I remember when AGP 8X was released.
🚩 Step 2: Triangle Setup Engine
- Receives transformed vertices from the CPU.
- Triangle setup converts vertex data into screen-space triangles.
- Performs backface culling (discarding triangles facing away from the camera); a small code sketch of this test follows the diagram below.
- Performs clipping (removes triangles outside view).
Triangle Setup Engine:
├─ Clipping
├─ Culling
└─ Screen-Space Triangle Generation
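To make the culling step concrete, here is a tiny sketch in Python (my own illustration of the general idea, not the TNT’s actual hardware logic): the sign of a screen-space triangle’s doubled area tells you its winding order, and triangles that wind the “wrong” way are discarded before rasterization.

```python
def signed_area(v0, v1, v2):
    """Twice the signed area of a screen-space triangle; the sign encodes winding."""
    return (v1[0] - v0[0]) * (v2[1] - v0[1]) - (v2[0] - v0[0]) * (v1[1] - v0[1])

def backface_cull(triangles):
    """Keep only triangles whose winding marks them as front-facing.
    Assumed convention: counter-clockwise (positive area) = facing the camera."""
    return [tri for tri in triangles if signed_area(*tri) > 0]

tris = [((0, 0), (10, 0), (0, 10)),   # CCW -> kept
        ((0, 0), (0, 10), (10, 0))]   # CW  -> culled
print(len(backface_cull(tris)))        # prints 1
```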
🚩 Step 3: Rasterization Unit (Dual Pipelines)
- The triangle data is passed into the Rasterization Unit, which converts triangles into pixels (fragments).
- Dual-pipeline design:
- Two pixels processed simultaneously per clock cycle.
- Significantly improved fill-rate performance compared to single-pipeline designs.
Rasterization Unit:
├─ Pipeline #1 ──► Texture Unit ──► Pixel Shader
└─ Pipeline #2 ──► Texture Unit ──► Pixel Shader
🚩 Step 4: Texture Mapping (Per-Pipeline Texture Units)
- Each pipeline has one texture mapping unit (TMU).
- Applies textures to each pixel:
- Performs bilinear or trilinear filtering (bilinear filtering is sketched in code below).
- Handles mipmapping to improve texture quality at varying distances.
Texture Unit:
├─ Texture Fetch from Memory
├─ Bilinear/Trilinear Filtering
└─ Mipmapping Selection
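For the filtering bullet above, here is a minimal software version of bilinear filtering (purely illustrative; the TMU does this in fixed-function hardware, and real texels are RGB rather than the grayscale values used here): the four texels surrounding the sample point are blended by the fractional part of the coordinates.

```python
import math

def bilinear_sample(texture, u, v):
    """Sample a grayscale texture (list of rows) at floating-point texel
    coordinates (u, v) with bilinear filtering, clamping at the edges."""
    h, w = len(texture), len(texture[0])
    x0 = max(0, min(int(math.floor(u)), w - 1))
    y0 = max(0, min(int(math.floor(v)), h - 1))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = u - math.floor(u), v - math.floor(v)
    top = texture[y0][x0] * (1 - fx) + texture[y0][x1] * fx
    bottom = texture[y1][x0] * (1 - fx) + texture[y1][x1] * fx
    return top * (1 - fy) + bottom * fy

tex = [[0, 255],
       [0, 255]]
print(bilinear_sample(tex, 0.5, 0.5))  # 127.5 -- halfway between the two columns
```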
Each pixel then passes through a basic Pixel Shader stage (though programmable pixel shaders were not available yet in TNT—it was fixed-function shading):
Pixel Shader (fixed-function):
├─ Lighting calculation (basic)
└─ Color blending with texture
🚩 Step 5: Raster Operation (ROP) Stage
- After texture mapping, each pixel goes to the Raster Operation stage (ROP).
- Performs Z-buffering (depth test) and alpha blending (transparency):
- Z-buffering: checks pixel depth and discards pixels behind already rendered ones.
- Alpha blending: allows semi-transparent objects (a combined depth-test and blend sketch follows below).
ROP Stage:
├─ Z-buffer (Depth Test)
├─ Alpha blending (Transparency)
└─ Final pixel color determination
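Condensed into a few lines of Python (a sketch of the general idea rather than the TNT’s exact blend modes), the per-pixel ROP work from the two bullets above looks roughly like this:

```python
def rop_write(framebuffer, zbuffer, x, y, src_color, src_alpha, src_depth):
    """Depth-test a fragment, then alpha-blend it over the stored pixel.
    Colors are (r, g, b) in 0..255; larger depth = farther from the camera.
    (Simplification: real renderers often skip the depth write for blended
    fragments -- kept here to show both ROP duties in one place.)"""
    if src_depth >= zbuffer[y][x]:       # hidden behind an earlier fragment
        return
    dst = framebuffer[y][x]
    framebuffer[y][x] = tuple(
        round(src_alpha * s + (1.0 - src_alpha) * d) for s, d in zip(src_color, dst)
    )
    zbuffer[y][x] = src_depth

# 1x1 "screen": draw 50% transparent red over a blue background.
fb = [[(0, 0, 255)]]
zb = [[float("inf")]]
rop_write(fb, zb, 0, 0, src_color=(255, 0, 0), src_alpha=0.5, src_depth=10.0)
print(fb[0][0])  # (128, 0, 128)
```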
🚩 Step 6: Frame Buffer and Memory Interface
- Pixel data is finally stored in the Frame Buffer (SDRAM/SGRAM).
- TNT GPU featured a 128-bit wide memory interface, significantly improving bandwidth.
- Supported up to 16MB of video memory, allowing higher resolutions and 32-bit color.
Memory Interface (128-bit SDRAM/SGRAM):
├─ Frame Buffer (Final rendered pixels)
└─ Texture Storage (Texture caching)
🚩 Step 7: Display Output
- The completed frame (stored in frame buffer) is continuously sent to the display monitor.
- Supports high-resolution output up to 1600×1200 pixels (though often limited by memory and performance at these high resolutions).
Frame Buffer → CRT Monitor (Display Output)
Why the NVIDIA TNT Was Significant Link to heading
The TNT GPU marked a shift from single-pipeline GPUs (e.g., 3dfx Voodoo) towards multi-pipeline architectures:
✅ Dual pixel pipelines → double the fill rate.
✅ 32-bit color → superior color quality vs. 3dfx Voodoo2’s 16-bit.
✅ Integrated 2D + 3D, no separate 2D card required.
Although limited by clock speed (only around 90 MHz), it established the groundwork for NVIDIA’s future GPUs (TNT2, GeForce 256).
Up next: Voodoo2, which was a very popular card at the time.
3dfx Voodoo2 Link to heading
The year is 1998. The coolest thing about the Voodoo2 was its SLI (Scan-Line Interleave) option: you could put two Voodoo2 cards in your computer and play games at a higher resolution and with better performance.
The 3dfx Voodoo2 was a legendary GPU released in 1998, following the success of the original Voodoo Graphics (1996). It was widely used in arcade machines and high-end gaming PCs during the late 1990s and was the first GPU to support SLI (Scan-Line Interleave), allowing two cards to work together for better performance.
Key Innovations of Voodoo2:
- Dual Texture Mapping Units (TMUs) – Allowed multi-texturing in a single pass.
- SLI (Scan-Line Interleave) Support – Two Voodoo2 cards could be linked for double the rendering power.
- Dedicated 3D Accelerator – Unlike NVIDIA’s Riva TNT, Voodoo2 had no 2D support, requiring a separate 2D card.
Voodoo2 Block Diagram: Here is a simplified architecture of the Voodoo2 graphics pipeline:
┌────────────────────────────────┐
│ CPU (Game Logic) │
└────────────────────────────────┘
│
▼
┌────────────────────────────────┐
│ PCI Interface (33 MHz) │
└────────────────────────────────┘
│
▼
┌────────────────────────────────┐
│ 2D Graphics Card (Required) │ <-- Voodoo2 only did 3D!
└────────────────────────────────┘
│
▼
┌────────────────────────────────┐
│ Voodoo2 Graphics Pipeline │
├────────────────────────────────┤
│ Triangle Setup & Rasterizer │ <-- Converts triangles into pixels
├────────────────────────────────┤
│ Texture Mapping Unit (TMU 1) │ <-- First texture
├────────────────────────────────┤
│ Texture Mapping Unit (TMU 2) │ <-- Second texture (multi-texturing)
├────────────────────────────────┤
│ Frame Buffer & Z-Buffer │ <-- Stores final pixel colors & depth
└────────────────────────────────┘
│
▼
┌────────────────────────────────┐
│ VGA Pass-through (2D) │ <-- Sends final image to monitor
└────────────────────────────────┘
What made Voodoo2 special?
- Dual TMUs (Texture Mapping Units) → Allowed multi-texturing in a single pass.
- SLI Support (Scan-Line Interleave) → Allowed two cards to split rendering work.
- Generous Memory → 8 MB or 12 MB of EDO RAM (4 MB frame buffer plus 2–4 MB of texture memory per TMU).
Step 1: CPU Handles Geometry & Sends to Voodoo2 The CPU handled vertex transformations & lighting (T&L was not yet in hardware). It sent transformed triangles to the Voodoo2 over the PCI bus (33 MHz, 133 MB/s bandwidth).
Step 2: Rasterization (Triangle Setup) The Rasterizer took triangle data from the CPU and converted it into pixels (fragments). It determined which pixels are covered by each triangle. No vertex shaders yet!
┌────────────────────┐
│ Triangle Setup │
├────────────────────┤
│ Rasterization │
└────────────────────┘
Step 3: Texture Mapping (Multi-Texturing) Voodoo2 introduced Dual TMUs (Texture Mapping Units). This meant that one polygon could receive two textures in a single pass, doubling texture performance compared to Voodoo1.
┌────────────────────┐
│ Texture Mapping │
├────────────────────┤
│ TMU 1 (Texture 1) │
├────────────────────┤
│ TMU 2 (Texture 2) │
└────────────────────┘
Multi-texturing was a game changer – used for lightmaps, bump maps, and reflections.
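A toy example of what a single multi-texturing pass computes (a sketch of the common “modulate” combine mode, not Glide or Voodoo2-specific code): TMU 1 supplies the base texture sample, TMU 2 supplies the lightmap sample, and the two are multiplied per channel.

```python
def modulate(base_texel, lightmap_texel):
    """Combine a base texture sample with a lightmap sample (both 0..255 RGB)
    the way a 'modulate' multi-texture stage does: per-channel multiply."""
    return tuple((b * l) // 255 for b, l in zip(base_texel, lightmap_texel))

wall = (200, 180, 160)        # fetched by TMU 1
lightmap = (128, 128, 128)    # fetched by TMU 2 (about half brightness)
print(modulate(wall, lightmap))  # (100, 90, 80) -- a dimly lit wall, in one pass
```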
Step 4: Z-Buffering & Alpha Blending Z-buffering ensured correct depth sorting (far objects were hidden behind closer ones). Alpha blending allowed semi-transparent effects (e.g., glass, smoke).
┌────────────────────┐
│ Z-Buffering │ <-- Depth Testing (Hidden Surface Removal)
├────────────────────┤
│ Alpha Blending │ <-- Transparency Effects
├────────────────────┤
│ Frame Buffer Write │ <-- Stores Final Pixels
└────────────────────┘
Step 5: Final Output (VGA Pass-Through) Since Voodoo2 did not support 2D rendering, it had a pass-through cable. The final 3D image was sent to the 2D graphics card, which then displayed it on the monitor.
┌─────────────────────────┐
│ VGA Pass-Through Output │
└─────────────────────────┘
This is why Voodoo2 needed a separate 2D card like Matrox Millennium!
3dfx SLI (Scan-Line Interleave)
- Voodoo2 was the first consumer GPU to support SLI (Scan-Line Interleave).
- Two Voodoo2 cards could be linked together to increase performance.
- Each GPU would render every other scanline, effectively doubling the rendering power.
SLI Diagram:
┌──────────────┐ ┌──────────────┐
│ Voodoo2 #1 │ │ Voodoo2 #2 │
│ (Odd lines) │ │ (Even lines)│
└──────┬───────┘ └──────┬───────┘
▼ ▼
┌──────────────────────────┐
│ Final Framebuffer │ <-- Image combined from both GPUs
└──────────────────────────┘
SLI doubled the fill rate and allowed resolutions up to 1024×768 (a big deal in 1998). This set the foundation for NVIDIA SLI & AMD CrossFire in the 2000s.
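A toy model of the scan-line split (my own sketch of the idea, not how the hardware actually synchronized its video sweeps): each card renders every other scanline, and the final frame is simply the interleave of the two halves.

```python
def render_scanline(y, width, gpu_id):
    """Stand-in for one card rendering a single scanline; here each pixel is
    just tagged with the card that produced it, purely for illustration."""
    return [(gpu_id, y)] * width

def sli_render(width, height):
    """Scan-Line Interleave: card 0 takes the even lines, card 1 the odd ones,
    and the two half-frames are merged line by line."""
    return [render_scanline(y, width, gpu_id=y % 2) for y in range(height)]

frame = sli_render(width=4, height=4)
print([row[0][0] for row in frame])   # [0, 1, 0, 1] -- alternating cards
```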
Comparison: Voodoo2 vs. Competitors
Feature | Voodoo1 (1996) | Voodoo2 (1998) | NVIDIA Riva TNT (1998) |
---|---|---|---|
Pipelines | 1 | 1 | 2 |
TMUs | 1 | 2 | 1 |
Max RAM | 8 MB | 12 MB | 16 MB |
Max Resolution | 800×600 | 1024×768 | 1600×1200 |
SLI Support | ❌ No | ✅ Yes | ❌ No |
2D Support | ❌ No | ❌ No | ✅ Yes |
- Voodoo2 was amazing for multi-texturing and SLI, but had no 2D support.
- NVIDIA TNT was more advanced (integrated 2D & 3D, 32-bit color), leading to 3dfx’s downfall.
Conclusion
- 3dfx Voodoo2 was revolutionary, but lacked 32-bit color & full 2D support.
- SLI was ahead of its time, inspiring NVIDIA’s future GPUs.
- 3dfx failed to adapt to unified 2D/3D architectures, leading to its demise.
ATI R300 (Radeon 9700 Pro) Architecture Link to heading
The ATI R300 GPU (Radeon 9700 Pro, 2002) was one of the most groundbreaking GPUs in history, introducing:
- First DirectX 9 GPU – Enabled advanced programmable shaders (Shader Model 2.0).
- 8-Pipeline Architecture – Twice the pixel pipelines of the NVIDIA GeForce 4 Ti.
- 256-bit Memory Interface – First consumer GPU with 256-bit GDDR memory, delivering unprecedented bandwidth.
This GPU forced NVIDIA to redesign their GeForce FX (NV30), as it outperformed everything at the time.
R300 Architecture Block Diagram Here’s a simplified view of the Radeon 9700 Pro pipeline:
┌────────────────────────────────┐
│ CPU (Game Logic) │
└────────────────────────────────┘
│
▼
┌────────────────────────────────┐
│ AGP 8X Bus (Transfers Data) │
└────────────────────────────────┘
│
▼
┌────────────────────────────────┐
│ R300 Graphics Pipeline │
├────────────────────────────────┤
│ Vertex Shader (Programmable) │ <-- First DX9 GPU with SM2.0
├────────────────────────────────┤
│ Triangle Setup & Rasterizer │
├────────────────────────────────┤
│ 8 Pixel Pipelines (1 TMU each) │ <-- Twice the pipelines of GeForce 4
├────────────────────────────────┤
│ Z-Buffer & Stencil Buffer │
├────────────────────────────────┤
│ 256-bit GDDR Memory Interface │ <-- First consumer GPU with 256-bit bus
├────────────────────────────────┤
│ Frame Buffer Output │
└────────────────────────────────┘
What made R300 special?
- First GPU with full DirectX 9 support (Shader Model 2.0).
- 8 pixel pipelines, each with its own texture unit (vs. the GeForce 4 Ti’s 4 pipelines).
- Introduced Hierarchical Z-Buffer and Fast-Z Clear to optimize performance.
- Massive 256-bit memory interface (first of its kind in gaming GPUs).
Radeon 9700 Pro Graphics Pipeline Let’s break down how the R300 processed a 3D scene step by step.
Step 1: Vertex Processing (Programmable SM2.0 Vertex Shaders) R300 introduced fully programmable Shader Model 2.0 vertex shaders. Compared to the shorter DirectX 8 (SM1.x) vertex shaders of the GeForce 4 Ti, developers could write far longer and more flexible transformation and lighting programs. This enabled realistic animations, per-vertex deformations, and advanced lighting effects.
┌──────────────────────┐
│ Vertex Shader (SM2.0)│ <-- Custom transformations, deformations
├──────────────────────┤
│ Geometry Processing │
├──────────────────────┤
│ Triangle Setup │
└──────────────────────┘
This was a big leap over the GeForce 4 Ti’s shorter DirectX 8 (SM1.x) shaders.
Step 2: Rasterization & Pixel Processing (8 Pipelines!) R300 had 8 pixel pipelines (GeForce 4 Ti only had 4), each with one texture mapping unit (TMU). The SM2.0 pixel shaders could sample up to 16 textures per pass, enabling multi-texturing in a single pass (light maps, bump maps, shadows).
┌─────────────────────────┐
│ 8 Pixel Pipelines │ <-- High parallelism
├─────────────────────────┤
│ One TMU per Pipeline    │ <-- Up to 16 textures per pass
├─────────────────────────┤
│ Pixel Shader (SM2.0) │ <-- Programmable pixel effects
└─────────────────────────┘
The GeForce 4 Ti’s pixel shaders were short SM1.x programs → R300’s SM2.0 shaders were longer and ran at floating-point precision. This allowed more realistic lighting, shadows, and materials.
Step 3: Z-Buffering & Stencil Buffering (Fast-Z Clear) Hierarchical Z-Buffering (HZB) – Optimized depth sorting before pixel shading, reducing workload. Fast-Z Clear – Quickly reset depth values between frames → Improved efficiency. Stencil Buffer – Used for shadows, reflections, and outlines.
┌─────────────────────────┐
│ Hierarchical Z-Buffer │ <-- Fast hidden surface removal
├─────────────────────────┤
│ Stencil Buffer │ <-- Used for shadows, mirrors
├─────────────────────────┤
│ Early Z-Culling │ <-- Prevents processing hidden pixels
└─────────────────────────┘
This massively reduced fill-rate bottlenecks and improved FPS.
Step 4: 256-bit GDDR Memory Controller (Industry First) First consumer GPU with a 256-bit memory bus. Max bandwidth: 19.8 GB/s (GeForce 4 Ti only had 10.4 GB/s). Allowed higher resolutions and anti-aliasing without major performance drops.
┌───────────────────────┐
│ 256-bit Memory Bus │ <-- First GPU with this bandwidth
├───────────────────────┤
│ High-Speed GDDR Memory│
├───────────────────────┤
│ Frame Buffer Storage │
└───────────────────────┘
This gave Radeon 9700 Pro a major lead over GeForce 4 Ti.
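The bandwidth numbers quoted above fall straight out of bus width times effective memory clock. A quick back-of-the-envelope check in Python (using the commonly quoted effective clocks of 620 MHz for the 9700 Pro and 650 MHz for the GeForce 4 Ti 4600):

```python
def peak_bandwidth_gb_s(bus_width_bits, effective_clock_mhz):
    """Peak theoretical bandwidth = bytes per transfer * transfers per second."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * effective_clock_mhz * 1e6 / 1e9

print(round(peak_bandwidth_gb_s(256, 620), 1))  # Radeon 9700 Pro   -> 19.8 GB/s
print(round(peak_bandwidth_gb_s(128, 650), 1))  # GeForce 4 Ti 4600 -> 10.4 GB/s
```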
Radeon 9700 Pro dominated the market because:
- 8 pipelines vs. GeForce 4 Ti’s 4 → Twice the fill rate.
- 256-bit memory interface vs. 128-bit → More bandwidth.
- Shader Model 2.0 → More complex pixel & vertex shading effects.
- Even NVIDIA’s GeForce FX 5800 struggled to compete (it was loud & hot).
I do remember FX 5800, the loudest GPU at the time.
Conclusion
- Radeon 9700 Pro was the first truly next-gen GPU (DirectX 9, SM2.0).
- 256-bit memory interface was revolutionary and set the standard for future GPUs.
- Forced NVIDIA to rethink their strategy after GeForce FX 5800 failed.
The first GPU that I truly fell in love with? The Radeon 9800 XT. I still remember the red PCB and the huge heatsink; it looked like this.
Radeon 9700 vs. Radeon 9800 – What’s the Difference? Link to heading
Feature | Radeon 9700 Pro (R300, 2002) | Radeon 9800 Pro (R350, 2003) | Radeon 9800 XT (R360, 2003) |
---|---|---|---|
GPU Core | R300 | R350 (Refined R300) | R360 (Further Refined R350) |
Process Node | 150nm | 150nm | 150nm |
Pipelines | 8 | 8 | 8 |
TMUs per Pipe | 1 | 1 | 1 |
Core Clock | 325 MHz | 380 MHz | 412 MHz |
Memory Clock | 620 MHz (DDR) | 680 MHz (DDR) | 730 MHz (DDR) |
Memory Bus | 256-bit | 256-bit | 256-bit |
Memory Bandwidth | 19.8 GB/s | 21.8 GB/s | 23.4 GB/s |
DirectX Support | 9.0 (SM2.0) | 9.0 (SM2.0) | 9.0 (SM2.0) |
Pixel Shader Version | 2.0 | 2.0 | 2.0 |
Vertex Shader Version | 2.0 | 2.0 | 2.0 |
AA & AF Performance | Good | Better | Best |
- 9800 Pro and 9800 XT were essentially overclocked and optimized versions of the 9700 Pro.
- The 9800 XT (R360) stayed on the 150nm process; it was a further tuned R350 binned for higher clock speeds.
Evolution of GPU Architecture Link to heading
Early GPUs (such as the 3dfx Voodoo series, NVIDIA Riva TNT, and ATI Rage) followed a fixed-function pipeline, meaning they had dedicated hardware blocks for tasks like:
- Vertex transformation (adjusting 3D models for perspective)
- Rasterization (converting shapes into pixels)
- Texture mapping (applying textures onto 3D objects)
- Lighting calculations (basic shading)
These GPUs worked very efficiently for rendering predefined graphics but lacked flexibility for custom computations. They were essentially hardwired state machines: great for graphics but terrible for general-purpose computing.
An overview of the evolution of GPUs:
Era | Key Feature | Example GPUs |
---|---|---|
1995–2000 | Fixed-Function Pipeline, Hardware T&L | NVIDIA GeForce 256, ATI Rage 128 |
2001–2006 | Programmable Shaders (SM1.x–SM3.0) | NVIDIA GeForce 3, Radeon 9700 Pro |
2006–2012 | Unified Shader Architecture (SM4.0–SM5.0) | GeForce 8, Radeon HD 5000 |
2018–Present | Ray Tracing (RT Cores), AI (Tensor Cores) | GeForce RTX 20 series, Radeon RX 6000 |
Dictionary Link to heading
GPU Terminology Link to heading
- Rasterization: Converting triangles into pixels.
- Shader: Program running on GPU (vertex, pixel, geometry).
- Vertex Shader: Manipulates vertex positions (geometry).
- Pixel Shader (Fragment Shader): Determines pixel colors.
- Texture: Image mapped onto geometry surfaces.
- TMU (Texture Mapping Unit): GPU hardware fetching and filtering textures.
- ROP (Raster Operation): Handles depth tests (Z-buffer), blending, writing pixels to framebuffer.
- Z-buffering (Depth Buffer): Ensures correct visibility by depth comparison.
- Stencil Buffer: Defines pixel rendering masks.
- Framebuffer: Stores rendered pixels awaiting display.
- Compute Shader: GPU programs performing general computations.
Hardware Transform & Lighting (H/W T&L), 1999 Link to heading
Two of the most important milestones in the evolution of GPUs were:
- Hardware T&L (Transform & Lighting) – Introduced by NVIDIA GeForce 256 (1999).
- Shader Model 3.0 – Introduced by NVIDIA GeForce 6 series (2004, DirectX 9.0c).
Hardware T&L (Transform & Lighting) is the ability of a GPU to process 3D transformations and lighting calculations directly, instead of relying on the CPU. Before GeForce 256 (1999), all geometry transformations and lighting were done on the CPU. This limited performance because:
- The CPU was already handling game logic, physics, AI, and sound.
- As games became more complex, CPU-bound T&L computations became a bottleneck.
→ Hardware T&L offloaded these calculations to the GPU, significantly improving performance.
Every 3D object is represented by vertices in 3D space. Before rendering, we must:
- Transform them (move, rotate, scale).
- Apply lighting effects to simulate realism.
The T&L Pipeline (Pre-GPU Era) Link to heading
1995–1999, before Hardware T&L Link to heading
- The CPU calculates vertex transformations (object → world → camera space).
- The CPU calculates lighting per vertex (Phong, Lambertian, etc.).
- The CPU sends final transformed vertices to the GPU, which only does rasterization.
- This was slow because all vertices were processed on the CPU! (A simplified sketch of this per-vertex work follows below.)
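To see what this meant in practice, here is a deliberately simplified sketch of the per-vertex work the CPU had to repeat every frame (one Y-axis rotation standing in for the full object → world → camera chain, plus a single directional light; nothing here is engine-specific):

```python
import math

def rotate_y(v, angle):
    """Rotate a 3D point around the Y axis -- a stand-in for the full
    object -> world -> camera transform chain run per vertex on the CPU."""
    x, y, z = v
    c, s = math.cos(angle), math.sin(angle)
    return (c * x + s * z, y, -s * x + c * z)

def lambert(normal, light_dir):
    """Classic per-vertex N dot L (Lambertian) lighting, clamped at zero."""
    return max(0.0, sum(n * l for n, l in zip(normal, light_dir)))

light_dir = (0.0, 0.0, 1.0)
# Vertices on a unit sphere, so each vertex's normal equals its position.
vertices = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
angle = math.radians(30)
for v in vertices:
    v_cam = rotate_y(v, angle)
    brightness = lambert(rotate_y(v, angle), light_dir)  # rotate the normal too
    print(tuple(round(c, 3) for c in v_cam), round(brightness, 3))
```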
The Impact of Hardware T&L (1999–2002) Link to heading
With GeForce 256 (NV10, 1999), NVIDIA introduced dedicated T&L hardware inside the GPU. Now:
- The GPU performs all transformation & lighting calculations.
- The CPU is freed to handle AI, physics, etc.
- Massive speed-up in rendering complex 3D scenes.
Games That Used Hardware T&L
- Quake III Arena (1999) – Huge performance boost with GeForce 256.
- Max Payne (2001) – Required Hardware T&L for full effects.
- Morrowind (2002) – Used advanced lighting powered by T&L.
Shader Model 3.0, 2004 Link to heading
Shader Model 3.0 (SM 3.0) is a DirectX 9.0c feature that introduced programmable shaders with improved flexibility and performance. It was first supported by:
- NVIDIA GeForce 6 series (2004, NV40 GPU).
- ATI Radeon X1000 series (2005).
Unlike Hardware T&L, which was fixed-function, Shader Model 3.0 allowed fully programmable shading with:
- Longer shader programs (more complex effects).
- Branching and loops in shaders (better performance).
- Higher precision in pixel calculations (HDR lighting).
Shaders are small programs that run on the GPU to control how objects appear on the screen.
Types of Shaders in Shader Model 3.0
- Vertex Shaders – Modify the position of vertices in 3D space.
- Pixel Shaders (Fragment Shaders) – Control how pixels are shaded (lighting, reflections, textures).
- Geometry Shaders (Introduced in SM4.0, DX10) – Create new geometry from existing ones.
Key Features of Shader Model 3.0
Feature | Shader Model 2.0 (DX9.0) | Shader Model 3.0 (DX9.0c) |
---|---|---|
Instruction Limit | 64 | 512+ |
Dynamic Branching | ❌ No | ✅ Yes |
Vertex Texture Fetch | ❌ No | ✅ Yes |
Longer Shader Programs | ❌ Limited | ✅ Supported |
HDR (High Dynamic Range) | ✅ Limited | ✅ Fully supported |
Why Shader Model 3.0 Was a Big Deal
- More Realistic Graphics – Games looked better with per-pixel lighting and soft shadows.
- Better Performance – Dynamic branching reduced unnecessary calculations (see the sketch after this list).
- HDR Support – Enabled High Dynamic Range lighting (used in Far Cry, Half-Life 2).
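To illustrate the branching point with plain code (a CPU-side analogy in Python, not real shader code): the early-out below skips the expensive lighting path for pixels outside a light’s radius, which is exactly the kind of per-pixel decision SM 3.0 pixel shaders could finally make efficiently.

```python
import math

def shade_pixel(pixel, light_pos, light_radius):
    """Analogy for an SM3.0-style pixel shader with dynamic branching:
    bail out cheaply when the pixel is out of the light's range."""
    dist = math.hypot(pixel[0] - light_pos[0], pixel[1] - light_pos[1])
    if dist > light_radius:            # dynamic branch: cheap early-out
        return 0.0
    return 1.0 - dist / light_radius   # "expensive" path, only where needed

lit = sum(
    shade_pixel((x, y), light_pos=(8, 8), light_radius=4) > 0
    for y in range(16) for x in range(16)
)
print(lit, "of 256 pixels took the expensive path")
```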
Games That Used Shader Model 3.0
- Far Cry (2004) – Better water reflections, HDR lighting.
- Splinter Cell: Chaos Theory (2005) – Used SM 3.0 for realistic shadows.
- Battlefield 2 (2005) – Required SM 3.0 for advanced graphics.
Z-Buffering and Depth Management in 3D Graphics Link to heading
When rendering 3D scenes, we need a way to determine which objects are visible and which should be hidden behind others. This process is called hidden surface removal (HSR). One of the most widely used techniques for this is Z-buffering (or depth buffering).
What is Z-Buffering? Link to heading
Z-buffering is a per-pixel depth management technique used in rasterization-based rendering. It helps determine which pixels should be drawn and which should be discarded based on their distance from the camera. Every pixel in the frame buffer has a corresponding depth value (Z-value) stored in a Z-buffer (depth buffer). When a new pixel is drawn, its Z-value is compared with the stored value:
- If the new pixel is closer to the camera (lower Z-value) → It replaces the old pixel.
- If the new pixel is farther away (higher Z-value) → It is discarded.
Key Properties of Z-Buffering:
- Per-Pixel Accuracy – Works at the finest granularity.
- Efficient for Arbitrary Geometry – Handles complex overlapping objects.
- No Pre-Sorting Required – Unlike other techniques like the Painter’s Algorithm.
How Z-Buffering Works? Link to heading
Step-by-Step Process:
- Initialize the Z-buffer: Each pixel in the Z-buffer is initialized to a large value (e.g., the far clipping plane depth). Frame buffer is initialized to background color.
- Rasterize Each Triangle:
- For every pixel covered by the triangle, compute the Z-depth (distance from the camera).
- Compare this Z-depth to the stored depth in the Z-buffer.
- If new depth < stored depth → Overwrite color & depth.
- If new depth >= stored depth → Discard pixel.
- Final Image Composition: After processing all triangles, the frame buffer contains the final rendered image. (A minimal code sketch of this loop follows.)
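Here is a minimal software rasterizer that follows those steps literally (a sketch for illustration: screen-space vertices, barycentric-interpolated depth, no perspective correction, counter-clockwise winding assumed):

```python
def edge(a, b, p):
    """Signed area term: positive if p lies to the left of the edge a -> b."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def draw_triangle(framebuffer, zbuffer, tri, color):
    """Rasterize one triangle with per-pixel Z-buffering.
    tri is three (x, y, z) screen-space vertices in counter-clockwise order."""
    h, w = len(framebuffer), len(framebuffer[0])
    (x0, y0, z0), (x1, y1, z1), (x2, y2, z2) = tri
    area = edge((x0, y0), (x1, y1), (x2, y2))
    if area <= 0:
        return                                       # degenerate or back-facing
    for y in range(h):
        for x in range(w):
            p = (x + 0.5, y + 0.5)
            w0 = edge((x1, y1), (x2, y2), p)
            w1 = edge((x2, y2), (x0, y0), p)
            w2 = edge((x0, y0), (x1, y1), p)
            if w0 < 0 or w1 < 0 or w2 < 0:
                continue                             # pixel outside the triangle
            z = (w0 * z0 + w1 * z1 + w2 * z2) / area  # interpolated depth
            if z < zbuffer[y][x]:                    # closer than what's stored?
                zbuffer[y][x] = z                    # overwrite depth...
                framebuffer[y][x] = color            # ...and color

W, H = 8, 8
framebuffer = [["." for _ in range(W)] for _ in range(H)]
zbuffer = [[float("inf")] * W for _ in range(H)]     # step 1: init to "far away"
draw_triangle(framebuffer, zbuffer, ((0, 0, 5), (8, 0, 5), (0, 8, 5)), "A")  # far
draw_triangle(framebuffer, zbuffer, ((0, 0, 2), (8, 0, 2), (8, 8, 2)), "B")  # near
print("\n".join("".join(row) for row in framebuffer))  # "B" wins where they overlap
```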
Comparison with Other Hidden Surface Removal Techniques Link to heading
Let’s compare Z-buffering with alternative visibility techniques.
Method | Accuracy | Sorting Required? | Memory Usage | Performance Impact | Best Use Case |
---|---|---|---|---|---|
Z-Buffering | Per-pixel | No | High (depth buffer) | Moderate (per-pixel comparisons) | General 3D rendering (games, CAD) |
Painter’s Algorithm | Per-object/triangle | Yes (Back-to-front sorting) | Low | High (due to sorting & overdrawing) | Simple scenes with few overlapping objects |
Binary Space Partitioning (BSP) | Per-object/triangle | Yes (Precomputed BSP tree) | Medium | Very high (preprocessing overhead) | Static scenes (Doom-style rendering) |
Ray Tracing | Per-pixel | No | High | Very high (traces rays for each pixel) | High-quality reflections/shadows (Offline rendering, RTX GPUs) |
Painter’s Algorithm (Back-to-Front Sorting)
- Sort all polygons by depth (far → near).
- Draw them in order, so closer objects naturally overwrite farther ones.
- Issues:
- Sorting overhead is expensive.
- Transparency is hard to handle.
- Doesn’t work well for intersecting objects.
Binary Space Partitioning (BSP)
- Preprocess the scene into a BSP tree.
- At runtime, traverse the tree to determine drawing order.
- Used in classic games like DOOM (1993).
- Issues:
- Works best for static geometry (dynamic objects break the tree).
- Preprocessing is expensive.
Ray Tracing (Alternative to Rasterization)
- Instead of rasterizing triangles, traces rays from the camera into the scene.
- Handles shadows, reflections, and refractions naturally.
- Used in modern RTX GPUs.
- Issues:
- Computationally expensive without hardware acceleration.
- Needs denoising techniques to remove noise in real-time applications.
Conclusion Z-buffering is the best balance between accuracy and performance for real-time rendering.
Z-Buffer Precision Issues & Solutions Link to heading
Z-buffer precision is limited by the number of bits allocated per pixel. Common depths are 16-bit, 24-bit, and 32-bit.
Precision Problem: Z-Fighting When two surfaces are very close together, limited Z-buffer precision causes fluctuations in depth values, leading to flickering artifacts.
Example Overlapping polygons on a car’s dashboard in a video game may “flicker” as the camera moves.
Solutions
- Use a 24-bit or 32-bit Z-buffer instead of 16-bit.
- Adjust Near and Far Plane Clipping:
- Keep the near plane as far out as possible (the sketch after this list puts numbers on this).
- Avoid using an excessively large far plane (e.g., 0.1m near → 10,000m far is bad).
- Use Floating-Point Depth Buffers (if supported).
- Enable Depth Biasing (Polygon Offset) to slightly separate overlapping surfaces.
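To put rough numbers on the near-plane advice (my own back-of-the-envelope, assuming the standard non-reversed perspective depth mapping and a fixed-point 24-bit buffer):

```python
def depth_step_world_units(d, near, far, bits=24):
    """Approximate world-space distance covered by one depth-buffer step at view
    distance d, for the standard mapping depth(d) = far*(d-near) / (d*(far-near)).
    Its derivative is far*near / ((far-near)*d^2), so one quantization step of
    1/(2^bits - 1) corresponds to the distance returned below."""
    step = 1.0 / (2 ** bits - 1)
    return step * (far - near) * d * d / (far * near)

# Same object 1 km away, same 10 km far plane, different near planes:
print(round(depth_step_world_units(1000.0, near=0.1, far=10000.0), 3))  # ~0.6 m
print(round(depth_step_world_units(1000.0, near=1.0, far=10000.0), 3))  # ~0.06 m
```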
Optimizations Since Z-buffering requires memory reads & writes for every pixel, it can be slow. Here are some optimizations:
Early Z-Culling (Hierarchical Z-Buffering)
- Before fragment shading, discard pixels that fail the depth test.
- Modern GPUs use hierarchical Z-buffers to reject large chunks of pixels early.
Reverse Z-Buffering
- Map the near plane to depth 1.0 and the far plane to 0.0 (the reverse of the usual convention) and pair it with a floating-point depth buffer.
- Floating-point values are densest near zero, which now corresponds to the distant part of the scene, so precision is spread far more evenly across the whole depth range.
- Used in many modern DirectX engines, and in OpenGL via clip control.
Tiling & Deferred Rendering
- GPUs like PowerVR (used in mobile devices) use tile-based rendering.
- The scene is split into tiles, and depth tests are performed in small chunks.
Z-Buffering Summary Link to heading
🔹 Z-buffering is the most widely used technique for real-time hidden surface removal. 🔹 It provides per-pixel accuracy, but has precision issues that require depth buffer optimizations. 🔹 Compared to other methods, it is more scalable and general-purpose, making it dominant in modern GPUs. 🔹 Future techniques (e.g., Ray Tracing) may complement Z-buffering for hybrid rendering.
I have used LLMs to generate some content for this post. Involvement of LLMs w.r.t. the content is approx. 50%.