Table of Contents Link to heading
- Table of Contents
- The 90s
- Noteworthy GPUs
- Evolution of GPU Architecture
- Dictionary
This is a post about the history of GPUs, from the early days to the latest generations.
The 90s Link to heading
The following table shows the releases of different GPUs over the 90s and 2000s: the list I grew up with.
Year | Company | Graphics Card | Key Features |
---|---|---|---|
1995 | 3dfx | Voodoo Graphics | First dedicated 3D accelerator (no 2D support) |
1995 | Matrox | Millennium | High-end 2D performance but weak 3D |
1995 | S3 | ViRGE | One of the first consumer 3D GPUs (slow) |
1996 | 3dfx | Voodoo Rush | Voodoo Graphics + 2D support (but slow) |
1996 | NVIDIA | NV1 | First NVIDIA card, quadratic texture mapping (failed) |
1997 | 3dfx | Voodoo2 | SLI support (two cards linked together), multiple texture units |
1997 | NVIDIA | Riva 128 (NV3) | First successful NVIDIA card, 2D + 3D integration |
1997 | ATI | Rage Pro | ATI’s first serious 3D accelerator |
1998 | 3dfx | Voodoo Banshee | Integrated 2D + 3D (but only one texture unit) |
1998 | NVIDIA | Riva TNT (NV4) | First dual-pipeline GPU, 32-bit color |
1998 | S3 | Savage3D | Trilinear filtering, S3 Texture Compression (S3TC) |
1999 | 3dfx | Voodoo3 | Higher clock speed but no 32-bit color |
1999 | NVIDIA | Riva TNT2 (NV5) | Higher clocks, AGP 4X support |
1999 | ATI | Rage 128 | 32-bit color, DirectX 6 support |
1999 | Matrox | G400 | First DualHead multi-monitor support |
1999 | NVIDIA | GeForce 256 (NV10) | First “GPU” (Hardware T&L), DDR memory |
2000 | 3dfx | Voodoo5 5500 | FSAA (Anti-aliasing), dual GPUs on one card |
2000 | NVIDIA | GeForce 2 GTS (NV15) | Per-pixel shading (fixed-function NSR), very high fill rate |
2000 | ATI | Radeon DDR | First ATI Radeon-branded GPU, 32-bit rendering |
2000 | S3 | Savage 2000 | Failed due to poor drivers |
2001 | NVIDIA | GeForce 3 (NV20) | First programmable pixel & vertex shaders (DX8) |
2001 | ATI | Radeon 8500 | Competed with GeForce 3, introduced TruForm (N-Patches) |
2001 | 3dfx | (Acquired by NVIDIA) | 3dfx shuts down after bankruptcy |
2002 | NVIDIA | GeForce 4 Ti (NV25) | Improved shaders, fastest DX8 card |
2002 | ATI | Radeon 9700 Pro | First DirectX 9 GPU, superior to GeForce 4 |
2003 | NVIDIA | GeForce FX 5800 (NV30) | First DX9 NVIDIA card, but too hot & loud |
2003 | ATI | Radeon 9800 Pro | Faster than GeForce FX, best DX9 card of the time |
2004 | ATI | Radeon X800 XT | Competed with GeForce 6800, lacked SM3.0 |
2005 | NVIDIA | GeForce 7800 GTX | High-end DirectX 9.0c (SM3.0) GPU with HDR support |
2005 | ATI | Radeon X1800 XT | High-performance alternative to 7800 GTX |
In the late 1990s, when I was just 5-10 years old, the GPU industry was highly competitive, with 3dfx, ATI, Matrox, and S3 leading the market. NVIDIA’s earlier Riva 128 (NV3) had introduced integrated 2D/3D acceleration, but its single-pipeline design and limited 4 MB of SDRAM held it back.
I still remember the awe I felt for the Voodoo2.
Essentially:
3dfx (1996–2000):
- Pioneered 3D gaming (Voodoo 1, Voodoo2).
- Fell behind in innovation (no 32-bit color, weak T&L).
- Acquired by NVIDIA in 2001.
- Revolutionized 3D graphics but failed to adapt.
NVIDIA (1997–Present):
- TNT series (1998–1999): First dual-pipeline GPU.
- GeForce 256 (1999): First true GPU (Hardware T&L).
- Dominated from 2000 onward with GeForce 2, 4, and later the 6800 and 7800 series (the GeForce FX generation being the notable stumble).
ATI (1997–2006, later AMD Radeon):
- Rage series struggled vs. NVIDIA.
- Radeon 9700 Pro (2002) beat GeForce 4.
- Radeon 9800 Pro won vs. GeForce FX (2003–2004).
S3 Graphics & Matrox (Declined after 2000):
- S3 Savage3D (1998) introduced texture compression (S3TC) but had bad drivers.
- Matrox G400 (1999) introduced dual-monitor tech, but was weak in 3D gaming.
In this post, we look at the history of GPUs and the ones that left a mark on my life. Starting with: the TNT (NVIDIA Riva TNT architecture).
Noteworthy GPUs Link to heading
NVIDIA Riva TNT Link to heading
The “TNT” suffix refers to the chip’s ability to work on two texels at once (TwiN Texel). Riva stands for “Real-time Interactive Video and Animation accelerator”.
The NVIDIA Riva TNT, released in 1998, was NVIDIA’s first GPU with a dual-pipeline architecture. It directly competed against 3dfx Voodoo2, offering:
- Integrated 2D + 3D acceleration
- Dual pixel pipelines (two pixels processed simultaneously)
- 32-bit color support
- Full DirectX 6 and OpenGL compatibility
Let’s explore the internal architecture, data flow, and the detailed graphics pipeline clearly.
The TNT card had a black heatsink, and it was a rather simple design. It looked like this.
NVIDIA TNT (NV4) Block Diagram Link to heading
Here’s a simplified block diagram showing the internal flow of data in the TNT chip:
CPU
│ (geometry data via AGP 2X/PCI)
▼
┌───────────────────────────────────────┐
│ AGP/PCI Interface │
└───────────────────────────────────────┘
│
▼
┌───────────────────────────────────────┐
│ Triangle Setup Engine │
│ (Triangle Setup, Clipping, Culling)  │
└───────────────────────────────────────┘
│
▼
┌───────────────────────────────────────┐
│ Rasterization Unit │
│ (Dual Pixel Pipelines) │
└───────────────────────────────────────┘
│ │
▼ ▼
┌─────────────────┐ ┌─────────────────┐
│ Pipeline 1 │ │ Pipeline 2 │
│ ├─ Texture Unit │ │ ├─ Texture Unit │
│ └─ Pixel Shader │ │ └─ Pixel Shader │
└─────────────────┘ └─────────────────┘
│ │
▼ ▼
┌───────────────────────────────────────┐
│ Raster Operation (ROP) │
│ (Z-buffering, Alpha blending) │
└───────────────────────────────────────┘
│
▼
┌───────────────────────────────────────┐
│ 128-bit Memory Interface (SDRAM/SGRAM)│
│ (Frame buffer & Textures) │
└───────────────────────────────────────┘
│
▼
Display (CRT Monitor)
Let’s carefully break down how the TNT GPU processed data from CPU to display:
🚩 Step 1: CPU & AGP/PCI Interface
- CPU handles geometry transformations (3D math).
- Sends transformed triangle data (vertices) over the AGP 2X (or PCI) bus.
- AGP provided higher bandwidth (up to 533 MB/s) than PCI (133 MB/s), improving performance.
CPU → AGP 2X → TNT GPU
Does anyone remember the AGP bus? :) I remember when AGP 8X was released.
🚩 Step 2: Triangle Setup Engine
- Receives transformed vertices from the CPU.
- Triangle setup converts vertex data into screen-space triangles.
- Performs backface culling (discarding triangles facing away from the camera); a small code sketch of this test follows the diagram below.
- Performs clipping (removes triangles outside view).
Triangle Setup Engine:
├─ Clipping
├─ Culling
└─ Screen-Space Triangle Generation
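To make the culling step concrete, here is a tiny sketch in Python (my own illustration of the general idea, not the TNT’s actual hardware logic): the sign of a screen-space triangle’s doubled area tells you its winding order, and triangles that wind the “wrong” way are discarded before rasterization.

```python
def signed_area(v0, v1, v2):
    """Twice the signed area of a screen-space triangle; the sign encodes winding."""
    return (v1[0] - v0[0]) * (v2[1] - v0[1]) - (v2[0] - v0[0]) * (v1[1] - v0[1])

def backface_cull(triangles):
    """Keep only triangles whose winding marks them as front-facing.
    Assumed convention: counter-clockwise (positive area) = facing the camera."""
    return [tri for tri in triangles if signed_area(*tri) > 0]

tris = [((0, 0), (10, 0), (0, 10)),   # CCW -> kept
        ((0, 0), (0, 10), (10, 0))]   # CW  -> culled
print(len(backface_cull(tris)))        # prints 1
```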
🚩 Step 3: Rasterization Unit (Dual Pipelines)
- The triangle data is passed into the Rasterization Unit, which converts triangles into pixels (fragments).
- Dual-pipeline design:
- Two pixels processed simultaneously per clock cycle.
- Significantly improved fill-rate performance compared to single-pipeline designs.
Rasterization Unit:
├─ Pipeline #1 ──► Texture Unit ──► Pixel Shader
└─ Pipeline #2 ──► Texture Unit ──► Pixel Shader
🚩 Step 4: Texture Mapping (Per-Pipeline Texture Units)
- Each pipeline has one texture mapping unit (TMU).
- Applies textures to each pixel:
- Performs bilinear or trilinear filtering (bilinear filtering is sketched in code below).
- Handles mipmapping to improve texture quality at varying distances.
Texture Unit:
├─ Texture Fetch from Memory
├─ Bilinear/Trilinear Filtering
└─ Mipmapping Selection
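For the filtering bullet above, here is a minimal software version of bilinear filtering (purely illustrative; the TMU does this in fixed-function hardware, and real texels are RGB rather than the grayscale values used here): the four texels surrounding the sample point are blended by the fractional part of the coordinates.

```python
import math

def bilinear_sample(texture, u, v):
    """Sample a grayscale texture (list of rows) at floating-point texel
    coordinates (u, v) with bilinear filtering, clamping at the edges."""
    h, w = len(texture), len(texture[0])
    x0 = max(0, min(int(math.floor(u)), w - 1))
    y0 = max(0, min(int(math.floor(v)), h - 1))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = u - math.floor(u), v - math.floor(v)
    top = texture[y0][x0] * (1 - fx) + texture[y0][x1] * fx
    bottom = texture[y1][x0] * (1 - fx) + texture[y1][x1] * fx
    return top * (1 - fy) + bottom * fy

tex = [[0, 255],
       [0, 255]]
print(bilinear_sample(tex, 0.5, 0.5))  # 127.5 -- halfway between the two columns
```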
Each pixel then passes through a basic Pixel Shader stage (though programmable pixel shaders were not available yet in TNT—it was fixed-function shading):
Pixel Shader (fixed-function):
├─ Lighting calculation (basic)
└─ Color blending with texture
🚩 Step 5: Raster Operation (ROP) Stage
- After texture mapping, each pixel goes to the Raster Operation stage (ROP).
- Performs Z-buffering (depth test) and alpha blending (transparency):
- Z-buffering: checks pixel depth and discards pixels behind already rendered ones.
- Alpha blending: allows semi-transparent objects (a combined depth-test and blend sketch follows below).
ROP Stage:
├─ Z-buffer (Depth Test)
├─ Alpha blending (Transparency)
└─ Final pixel color determination
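Condensed into a few lines of Python (a sketch of the general idea rather than the TNT’s exact blend modes), the per-pixel ROP work from the two bullets above looks roughly like this:

```python
def rop_write(framebuffer, zbuffer, x, y, src_color, src_alpha, src_depth):
    """Depth-test a fragment, then alpha-blend it over the stored pixel.
    Colors are (r, g, b) in 0..255; larger depth = farther from the camera.
    (Simplification: real renderers often skip the depth write for blended
    fragments -- kept here to show both ROP duties in one place.)"""
    if src_depth >= zbuffer[y][x]:       # hidden behind an earlier fragment
        return
    dst = framebuffer[y][x]
    framebuffer[y][x] = tuple(
        round(src_alpha * s + (1.0 - src_alpha) * d) for s, d in zip(src_color, dst)
    )
    zbuffer[y][x] = src_depth

# 1x1 "screen": draw 50% transparent red over a blue background.
fb = [[(0, 0, 255)]]
zb = [[float("inf")]]
rop_write(fb, zb, 0, 0, src_color=(255, 0, 0), src_alpha=0.5, src_depth=10.0)
print(fb[0][0])  # (128, 0, 128)
```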
🚩 Step 6: Frame Buffer and Memory Interface
- Pixel data is finally stored in the Frame Buffer (SDRAM/SGRAM).
- TNT GPU featured a 128-bit wide memory interface, significantly improving bandwidth.
- Supported up to 16MB of video memory, allowing higher resolutions and 32-bit color.
Memory Interface (128-bit SDRAM/SGRAM):
├─ Frame Buffer (Final rendered pixels)
└─ Texture Storage (Texture caching)
🚩 Step 7: Display Output
- The completed frame (stored in frame buffer) is continuously sent to the display monitor.
- Supports high-resolution output up to 1600×1200 pixels (though often limited by memory and performance at these high resolutions).
Frame Buffer → CRT Monitor (Display Output)
Why the NVIDIA TNT Was Significant Link to heading
The TNT GPU marked a shift from single-pipeline GPUs (e.g., 3dfx Voodoo) towards multi-pipeline architectures:
✅ Dual pixel pipelines → double the fill rate.
✅ 32-bit color → superior color quality vs. 3dfx Voodoo2’s 16-bit.
✅ Integrated 2D + 3D, no separate 2D card required.
Although limited by clock speed (only around 90 MHz), it established the groundwork for NVIDIA’s future GPUs (TNT2, GeForce 256).
Up next: Voodoo2, which was a very popular card at the time.
3dfx Voodoo2 Link to heading
The year is 1998. The coolest thing about the Voodoo2 was its SLI (Scan-Line Interleave) option: you could put two Voodoo2 cards in your computer and play games at a higher resolution and with better performance.
The 3dfx Voodoo2 was a legendary GPU released in 1998, following the success of the original Voodoo Graphics (1996). It was widely used in arcade machines and high-end gaming PCs during the late 1990s and was the first GPU to support SLI (Scan-Line Interleave), allowing two cards to work together for better performance.
Key Innovations of Voodoo2:
- Dual Texture Mapping Units (TMUs) – Allowed multi-texturing in a single pass.
- SLI (Scan-Line Interleave) Support – Two Voodoo2 cards could be linked for double the rendering power.
- Dedicated 3D Accelerator – Unlike NVIDIA’s Riva TNT, Voodoo2 had no 2D support, requiring a separate 2D card.
Voodoo2 Block Diagram: Here is a simplified architecture of the Voodoo2 graphics pipeline:
┌────────────────────────────────┐
│ CPU (Game Logic) │
└────────────────────────────────┘
│
▼
┌────────────────────────────────┐
│ PCI Interface (33 MHz) │
└────────────────────────────────┘
│
▼
┌────────────────────────────────┐
│ 2D Graphics Card (Required) │ <-- Voodoo2 only did 3D!
└────────────────────────────────┘
│
▼
┌────────────────────────────────┐
│ Voodoo2 Graphics Pipeline │
├────────────────────────────────┤
│ Triangle Setup & Rasterizer │ <-- Converts triangles into pixels
├────────────────────────────────┤
│ Texture Mapping Unit (TMU 1) │ <-- First texture
├────────────────────────────────┤
│ Texture Mapping Unit (TMU 2) │ <-- Second texture (multi-texturing)
├────────────────────────────────┤
│ Frame Buffer & Z-Buffer │ <-- Stores final pixel colors & depth
└────────────────────────────────┘
│
▼
┌────────────────────────────────┐
│ VGA Pass-through (2D) │ <-- Sends final image to monitor
└────────────────────────────────┘
What made Voodoo2 special?
- Dual TMUs (Texture Mapping Units) → Allowed multi-texturing in a single pass.
- SLI Support (Scan-Line Interleave) → Allowed two cards to split rendering work.
- Generous Memory → 8 MB or 12 MB of EDO RAM (4 MB frame buffer plus 2–4 MB of texture memory per TMU).
Step 1: CPU Handles Geometry & Sends to Voodoo2 The CPU handled vertex transformations & lighting (T&L was not yet in hardware). It sent transformed triangles to the Voodoo2 over the PCI bus (33 MHz, 133 MB/s bandwidth).
Step 2: Rasterization (Triangle Setup) The Rasterizer took triangle data from the CPU and converted it into pixels (fragments). It determined which pixels are covered by each triangle. No vertex shaders yet!
┌────────────────────┐
│ Triangle Setup │
├────────────────────┤
│ Rasterization │
└────────────────────┘
Step 3: Texture Mapping (Multi-Texturing) Voodoo2 introduced Dual TMUs (Texture Mapping Units). This meant that one polygon could receive two textures in a single pass, doubling texture performance compared to Voodoo1.
┌────────────────────┐
│ Texture Mapping │
├────────────────────┤
│ TMU 1 (Texture 1) │
├────────────────────┤
│ TMU 2 (Texture 2) │
└────────────────────┘
Multi-texturing was a game changer – used for lightmaps, bump maps, and reflections.
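A toy example of what a single multi-texturing pass computes (a sketch of the common “modulate” combine mode, not Glide or Voodoo2-specific code): TMU 1 supplies the base texture sample, TMU 2 supplies the lightmap sample, and the two are multiplied per channel.

```python
def modulate(base_texel, lightmap_texel):
    """Combine a base texture sample with a lightmap sample (both 0..255 RGB)
    the way a 'modulate' multi-texture stage does: per-channel multiply."""
    return tuple((b * l) // 255 for b, l in zip(base_texel, lightmap_texel))

wall = (200, 180, 160)        # fetched by TMU 1
lightmap = (128, 128, 128)    # fetched by TMU 2 (about half brightness)
print(modulate(wall, lightmap))  # (100, 90, 80) -- a dimly lit wall, in one pass
```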
Step 4: Z-Buffering & Alpha Blending Z-buffering ensured correct depth sorting (far objects were hidden behind closer ones). Alpha blending allowed semi-transparent effects (e.g., glass, smoke).
┌────────────────────┐
│ Z-Buffering │ <-- Depth Testing (Hidden Surface Removal)
├────────────────────┤
│ Alpha Blending │ <-- Transparency Effects
├────────────────────┤
│ Frame Buffer Write │ <-- Stores Final Pixels
└────────────────────┘
Step 5: Final Output (VGA Pass-Through) Since Voodoo2 did not support 2D rendering, it had a pass-through cable. The final 3D image was sent to the 2D graphics card, which then displayed it on the monitor.
┌─────────────────────────┐
│ VGA Pass-Through Output │
└─────────────────────────┘
This is why Voodoo2 needed a separate 2D card like Matrox Millennium!
3dfx SLI (Scan-Line Interleave)
- Voodoo2 was the first consumer GPU to support SLI (Scan-Line Interleave).
- Two Voodoo2 cards could be linked together to increase performance.
- Each GPU would render every other scanline, effectively doubling the rendering power.
SLI Diagram:
┌──────────────┐ ┌──────────────┐
│ Voodoo2 #1 │ │ Voodoo2 #2 │
│ (Odd lines) │ │ (Even lines)│
└──────┬───────┘ └──────┬───────┘
▼ ▼
┌──────────────────────────┐
│ Final Framebuffer │ <-- Image combined from both GPUs
└──────────────────────────┘
SLI doubled the fill rate and allowed resolutions up to 1024×768 (a big deal in 1998). This set the foundation for NVIDIA SLI & AMD CrossFire in the 2000s.
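A toy model of the scan-line split (my own sketch of the idea, not how the hardware actually synchronized its video sweeps): each card renders every other scanline, and the final frame is simply the interleave of the two halves.

```python
def render_scanline(y, width, gpu_id):
    """Stand-in for one card rendering a single scanline; here each pixel is
    just tagged with the card that produced it, purely for illustration."""
    return [(gpu_id, y)] * width

def sli_render(width, height):
    """Scan-Line Interleave: card 0 takes the even lines, card 1 the odd ones,
    and the two half-frames are merged line by line."""
    return [render_scanline(y, width, gpu_id=y % 2) for y in range(height)]

frame = sli_render(width=4, height=4)
print([row[0][0] for row in frame])   # [0, 1, 0, 1] -- alternating cards
```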
Comparison: Voodoo2 vs. Competitors
Feature | Voodoo1 (1996) | Voodoo2 (1998) | NVIDIA Riva TNT (1998) |
---|---|---|---|
Pipelines | 1 | 1 | 2 |
TMUs | 1 | 2 | 1 |
Max RAM | 8 MB | 12 MB | 16 MB |
Max Resolution | 800×600 | 1024×768 | 1600×1200 |
SLI Support | ❌ No | ✅ Yes | ❌ No |
2D Support | ❌ No | ❌ No | ✅ Yes |
- Voodoo2 was amazing for multi-texturing and SLI, but had no 2D support.
- NVIDIA TNT was more advanced (integrated 2D & 3D, 32-bit color), leading to 3dfx’s downfall.
Conclusion
- 3dfx Voodoo2 was revolutionary, but lacked 32-bit color & full 2D support.
- SLI was ahead of its time, inspiring NVIDIA’s future GPUs.
- 3dfx failed to adapt to unified 2D/3D architectures, leading to its demise.
ATI R300 (Radeon 9700 Pro) Architecture Link to heading
The ATI R300 GPU (Radeon 9700 Pro, 2002) was one of the most groundbreaking GPUs in history, introducing:
- First DirectX 9 GPU – Enabled advanced programmable shaders (Shader Model 2.0).
- 8-Pipeline Architecture – Twice the pixel pipelines of the NVIDIA GeForce 4 Ti.
- 256-bit Memory Interface – First consumer GPU with 256-bit GDDR memory, delivering unprecedented bandwidth.
This GPU forced NVIDIA to redesign their GeForce FX (NV30), as it outperformed everything at the time.
R300 Architecture Block Diagram Here’s a simplified view of the Radeon 9700 Pro pipeline:
┌────────────────────────────────┐
│ CPU (Game Logic) │
└────────────────────────────────┘
│
▼
┌────────────────────────────────┐
│ AGP 8X Bus (Transfers Data) │
└────────────────────────────────┘
│
▼
┌────────────────────────────────┐
│ R300 Graphics Pipeline │
├────────────────────────────────┤
│ Vertex Shader (Programmable) │ <-- First DX9 GPU with SM2.0
├────────────────────────────────┤
│ Triangle Setup & Rasterizer │
├────────────────────────────────┤
│ 8 Pixel Pipelines (1 TMU each) │ <-- Twice the pipelines of GeForce 4
├────────────────────────────────┤
│ Z-Buffer & Stencil Buffer │
├────────────────────────────────┤
│ 256-bit GDDR Memory Interface │ <-- First consumer GPU with 256-bit bus
├────────────────────────────────┤
│ Frame Buffer Output │
└────────────────────────────────┘
What made R300 special?
- First GPU with full DirectX 9 support (Shader Model 2.0).
- 8 pixel pipelines, each with its own texture unit (vs. the GeForce 4 Ti’s 4 pipelines).
- Introduced Hierarchical Z-Buffer and Fast-Z Clear to optimize performance.
- Massive 256-bit memory interface (first of its kind in gaming GPUs).
Radeon 9700 Pro Graphics Pipeline Let’s break down how the R300 processed a 3D scene step by step.
Step 1: Vertex Processing (Programmable SM2.0 Vertex Shaders) R300 introduced fully programmable Shader Model 2.0 vertex shaders. Compared to the shorter DirectX 8 (SM1.x) vertex shaders of the GeForce 4 Ti, developers could write far longer and more flexible transformation and lighting programs. This enabled realistic animations, per-vertex deformations, and advanced lighting effects.
┌──────────────────────┐
│ Vertex Shader (SM2.0)│ <-- Custom transformations, deformations
├──────────────────────┤
│ Geometry Processing │
├──────────────────────┤
│ Triangle Setup │
└──────────────────────┘
This was a big leap over the GeForce 4 Ti’s shorter DirectX 8 (SM1.x) shaders.
Step 2: Rasterization & Pixel Processing (8 Pipelines!) R300 had 8 pixel pipelines (GeForce 4 Ti only had 4), each with one texture mapping unit (TMU). The SM2.0 pixel shaders could sample up to 16 textures per pass, enabling multi-texturing in a single pass (light maps, bump maps, shadows).
┌─────────────────────────┐
│ 8 Pixel Pipelines │ <-- High parallelism
├─────────────────────────┤
│ One TMU per Pipeline    │ <-- Up to 16 textures per pass
├─────────────────────────┤
│ Pixel Shader (SM2.0) │ <-- Programmable pixel effects
└─────────────────────────┘
The GeForce 4 Ti’s pixel shaders were short SM1.x programs → R300’s SM2.0 shaders were longer and ran at floating-point precision. This allowed more realistic lighting, shadows, and materials.
Step 3: Z-Buffering & Stencil Buffering (Fast-Z Clear) Hierarchical Z-Buffering (HZB) – Optimized depth sorting before pixel shading, reducing workload. Fast-Z Clear – Quickly reset depth values between frames → Improved efficiency. Stencil Buffer – Used for shadows, reflections, and outlines.
┌─────────────────────────┐
│ Hierarchical Z-Buffer │ <-- Fast hidden surface removal
├─────────────────────────┤
│ Stencil Buffer │ <-- Used for shadows, mirrors
├─────────────────────────┤
│ Early Z-Culling │ <-- Prevents processing hidden pixels
└─────────────────────────┘
This massively reduced fill-rate bottlenecks and improved FPS.
Step 4: 256-bit GDDR Memory Controller (Industry First) First consumer GPU with a 256-bit memory bus. Max bandwidth: 19.8 GB/s (GeForce 4 Ti only had 10.4 GB/s). Allowed higher resolutions and anti-aliasing without major performance drops.
┌───────────────────────┐
│ 256-bit Memory Bus │ <-- First GPU with this bandwidth
├───────────────────────┤
│ High-Speed GDDR Memory│
├───────────────────────┤
│ Frame Buffer Storage │
└───────────────────────┘
This gave Radeon 9700 Pro a major lead over GeForce 4 Ti.
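The bandwidth numbers quoted above fall straight out of bus width times effective memory clock. A quick back-of-the-envelope check in Python (using the commonly quoted effective clocks of 620 MHz for the 9700 Pro and 650 MHz for the GeForce 4 Ti 4600):

```python
def peak_bandwidth_gb_s(bus_width_bits, effective_clock_mhz):
    """Peak theoretical bandwidth = bytes per transfer * transfers per second."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * effective_clock_mhz * 1e6 / 1e9

print(round(peak_bandwidth_gb_s(256, 620), 1))  # Radeon 9700 Pro   -> 19.8 GB/s
print(round(peak_bandwidth_gb_s(128, 650), 1))  # GeForce 4 Ti 4600 -> 10.4 GB/s
```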
Radeon 9700 Pro dominated the market because:
- 8 pipelines vs. GeForce 4 Ti’s 4 → Twice the fill rate.
- 256-bit memory interface vs. 128-bit → More bandwidth.
- Shader Model 2.0 → More complex pixel & vertex shading effects.
- Even NVIDIA’s GeForce FX 5800 struggled to compete (it was loud & hot).
I do remember FX 5800, the loudest GPU at the time.
Conclusion
- Radeon 9700 Pro was the first truly next-gen GPU (DirectX 9, SM2.0).
- 256-bit memory interface was revolutionary and set the standard for future GPUs.
- Forced NVIDIA to rethink their strategy after GeForce FX 5800 failed.
The first GPU that I truly fell in love with? The Radeon 9800 XT. I still remember the red PCB and the huge heatsink; it looked like this.
Radeon 9700 vs. Radeon 9800 – What’s the Difference? Link to heading
Feature | Radeon 9700 Pro (R300, 2002) | Radeon 9800 Pro (R350, 2003) | Radeon 9800 XT (R360, 2003) |
---|---|---|---|
GPU Core | R300 | R350 (Refined R300) | R360 (Further Refined R350) |
Process Node | 150nm | 150nm | 150nm |
Pipelines | 8 | 8 | 8 |
TMUs per Pipe | 1 | 1 | 1 |
Core Clock | 325 MHz | 380 MHz | 412 MHz |
Memory Clock | 620 MHz (DDR) | 680 MHz (DDR) | 730 MHz (DDR) |
Memory Bus | 256-bit | 256-bit | 256-bit |
Memory Bandwidth | 19.8 GB/s | 21.8 GB/s | 23.4 GB/s |
DirectX Support | 9.0 (SM2.0) | 9.0 (SM2.0) | 9.0 (SM2.0) |
Pixel Shader Version | 2.0 | 2.0 | 2.0 |
Vertex Shader Version | 2.0 | 2.0 | 2.0 |
AA & AF Performance | Good | Better | Best |
- 9800 Pro and 9800 XT were essentially overclocked and optimized versions of the 9700 Pro.
- The 9800 XT (R360) stayed on the 150nm process; it was a further tuned R350 binned for higher clock speeds.
Evolution of GPU Architecture Link to heading
Early GPUs (such as the 3dfx Voodoo series, NVIDIA Riva TNT, and ATI Rage) followed a fixed-function pipeline, meaning they had dedicated hardware blocks for tasks like:
- Vertex transformation (adjusting 3D models for perspective)
- Rasterization (converting shapes into pixels)
- Texture mapping (applying textures onto 3D objects)
- Lighting calculations (basic shading)
These GPUs worked very efficiently for rendering predefined graphics but lacked flexibility for custom computations. They were essentially hardwired state machines: great for graphics but terrible for general-purpose computing.
An overview of the evolution of GPUs:
Era | Key Feature | Example GPUs |
---|---|---|
1995–2000 | Fixed-Function Pipeline, Hardware T&L | NVIDIA GeForce 256, ATI Rage 128 |
2001–2006 | Programmable Shaders (SM1.x–SM3.0) | NVIDIA GeForce 3, Radeon 9700 Pro |
2006–2012 | Unified Shader Architecture (SM4.0–SM5.0) | GeForce 8, Radeon HD 5000 |
2018–Present | Ray Tracing (RT Cores), AI (Tensor Cores) | GeForce RTX 20 series, Radeon RX 6000 |
Dictionary Link to heading
GPU Terminology Link to heading
- Rasterization: Converting triangles into pixels.
- Shader: Program running on GPU (vertex, pixel, geometry).
- Vertex Shader: Manipulates vertex positions (geometry).
- Pixel Shader (Fragment Shader): Determines pixel colors.
- Texture: Image mapped onto geometry surfaces.
- TMU (Texture Mapping Unit): GPU hardware fetching and filtering textures.
- ROP (Raster Operation): Handles depth tests (Z-buffer), blending, writing pixels to framebuffer.
- Z-buffering (Depth Buffer): Ensures correct visibility by depth comparison.
- Stencil Buffer: Defines pixel rendering masks.
- Framebuffer: Stores rendered pixels awaiting display.
- Compute Shader: GPU programs performing general computations.
Hardware Transform & Lighting (H/W T&L), 1999 Link to heading
Two of the most important milestones in the evolution of GPUs were:
- Hardware T&L (Transform & Lighting) – Introduced by NVIDIA GeForce 256 (1999).
- Shader Model 3.0 – Introduced by NVIDIA GeForce 6 series (2004, DirectX 9.0c).
Hardware T&L (Transform & Lighting) is the ability of a GPU to process 3D transformations and lighting calculations directly, instead of relying on the CPU. Before GeForce 256 (1999), all geometry transformations and lighting were done on the CPU. This limited performance because:
- The CPU was already handling game logic, physics, AI, and sound.
- As games became more complex, CPU-bound T&L computations became a bottleneck.
→ Hardware T&L offloaded these calculations to the GPU, significantly improving performance.
Every 3D object is represented by vertices in 3D space. Before rendering, we must:
- Transform them (move, rotate, scale).
- Apply lighting effects to simulate realism.
The T&L Pipeline (Pre-GPU Era) Link to heading
1995–1999, before Hardware T&L Link to heading
- The CPU calculates vertex transformations (object → world → camera space).
- The CPU calculates lighting per vertex (Phong, Lambertian, etc.).
- The CPU sends final transformed vertices to the GPU, which only does rasterization.
- This was slow because all vertices were processed on the CPU! (A simplified sketch of this per-vertex work follows below.)
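To see what this meant in practice, here is a deliberately simplified sketch of the per-vertex work the CPU had to repeat every frame (one Y-axis rotation standing in for the full object → world → camera chain, plus a single directional light; nothing here is engine-specific):

```python
import math

def rotate_y(v, angle):
    """Rotate a 3D point around the Y axis -- a stand-in for the full
    object -> world -> camera transform chain run per vertex on the CPU."""
    x, y, z = v
    c, s = math.cos(angle), math.sin(angle)
    return (c * x + s * z, y, -s * x + c * z)

def lambert(normal, light_dir):
    """Classic per-vertex N dot L (Lambertian) lighting, clamped at zero."""
    return max(0.0, sum(n * l for n, l in zip(normal, light_dir)))

light_dir = (0.0, 0.0, 1.0)
# Vertices on a unit sphere, so each vertex's normal equals its position.
vertices = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
angle = math.radians(30)
for v in vertices:
    v_cam = rotate_y(v, angle)
    brightness = lambert(rotate_y(v, angle), light_dir)  # rotate the normal too
    print(tuple(round(c, 3) for c in v_cam), round(brightness, 3))
```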
The Impact of Hardware T&L (1999–2002) Link to heading
With GeForce 256 (NV10, 1999), NVIDIA introduced dedicated T&L hardware inside the GPU. Now:
- The GPU performs all transformation & lighting calculations.
- The CPU is freed to handle AI, physics, etc.
- Massive speed-up in rendering complex 3D scenes.
Games That Used Hardware T&L
- Quake III Arena (1999) – Huge performance boost with GeForce 256.
- Max Payne (2001) – Required Hardware T&L for full effects.
- Morrowind (2002) – Used advanced lighting powered by T&L.
Shader Model 3.0, 2004 Link to heading
Shader Model 3.0 (SM 3.0) is a DirectX 9.0c feature that introduced programmable shaders with improved flexibility and performance. It was first supported by:
- NVIDIA GeForce 6 series (2004, NV40 GPU).
- ATI Radeon X1000 series (2005).
Unlike Hardware T&L, which was fixed-function, Shader Model 3.0 allowed fully programmable shading with:
- Longer shader programs (more complex effects).
- Branching and loops in shaders (better performance).
- Higher precision in pixel calculations (HDR lighting).
Shaders are small programs that run on the GPU to control how objects appear on the screen.
Types of Shaders in Shader Model 3.0
- Vertex Shaders – Modify the position of vertices in 3D space.
- Pixel Shaders (Fragment Shaders) – Control how pixels are shaded (lighting, reflections, textures).
- Geometry Shaders (Introduced in SM4.0, DX10) – Create new geometry from existing ones.
Key Features of Shader Model 3.0
Feature | Shader Model 2.0 (DX9.0) | Shader Model 3.0 (DX9.0c) |
---|---|---|
Instruction Limit | 64 | 512+ |
Dynamic Branching | ❌ No | ✅ Yes |
Vertex Texture Fetch | ❌ No | ✅ Yes |
Longer Shader Programs | ❌ Limited | ✅ Supported |
HDR (High Dynamic Range) | ✅ Limited | ✅ Fully supported |
Why Shader Model 3.0 Was a Big Deal
- More Realistic Graphics – Games looked better with per-pixel lighting and soft shadows.
- Better Performance – Dynamic branching reduced unnecessary calculations (see the sketch after this list).
- HDR Support – Enabled High Dynamic Range lighting (used in Far Cry, Half-Life 2).
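To illustrate the branching point with plain code (a CPU-side analogy in Python, not real shader code): the early-out below skips the expensive lighting path for pixels outside a light’s radius, which is exactly the kind of per-pixel decision SM 3.0 pixel shaders could finally make efficiently.

```python
import math

def shade_pixel(pixel, light_pos, light_radius):
    """Analogy for an SM3.0-style pixel shader with dynamic branching:
    bail out cheaply when the pixel is out of the light's range."""
    dist = math.hypot(pixel[0] - light_pos[0], pixel[1] - light_pos[1])
    if dist > light_radius:            # dynamic branch: cheap early-out
        return 0.0
    return 1.0 - dist / light_radius   # "expensive" path, only where needed

lit = sum(
    shade_pixel((x, y), light_pos=(8, 8), light_radius=4) > 0
    for y in range(16) for x in range(16)
)
print(lit, "of 256 pixels took the expensive path")
```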
Games That Used Shader Model 3.0
- Far Cry (2004) – Better water reflections, HDR lighting.
- Splinter Cell: Chaos Theory (2005) – Used SM 3.0 for realistic shadows.
- Battlefield 2 (2005) – Required SM 3.0 for advanced graphics.
Z-Buffering and Depth Management in 3D Graphics Link to heading
When rendering 3D scenes, we need a way to determine which objects are visible and which should be hidden behind others. This process is called hidden surface removal (HSR). One of the most widely used techniques for this is Z-buffering (or depth buffering).
What is Z-Buffering? Link to heading
Z-buffering is a per-pixel depth management technique used in rasterization-based rendering. It helps determine which pixels should be drawn and which should be discarded based on their distance from the camera. Every pixel in the frame buffer has a corresponding depth value (Z-value) stored in a Z-buffer (depth buffer). When a new pixel is drawn, its Z-value is compared with the stored value:
- If the new pixel is closer to the camera (lower Z-value) → It replaces the old pixel.
- If the new pixel is farther away (higher Z-value) → It is discarded.
Key Properties of Z-Buffering:
- Per-Pixel Accuracy – Works at the finest granularity.
- Efficient for Arbitrary Geometry – Handles complex overlapping objects.
- No Pre-Sorting Required – Unlike other techniques like the Painter’s Algorithm.
How Z-Buffering Works? Link to heading
Step-by-Step Process:
- Initialize the Z-buffer: Each pixel in the Z-buffer is initialized to a large value (e.g., the far clipping plane depth). Frame buffer is initialized to background color.
- Rasterize Each Triangle:
- For every pixel covered by the triangle, compute the Z-depth (distance from the camera).
- Compare this Z-depth to the stored depth in the Z-buffer.
- If new depth < stored depth → Overwrite color & depth.
- If new depth >= stored depth → Discard pixel.
- Final Image Composition: After processing all triangles, the frame buffer contains the final rendered image. (A minimal code sketch of this loop follows.)
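Here is a minimal software rasterizer that follows those steps literally (a sketch for illustration: screen-space vertices, barycentric-interpolated depth, no perspective correction, counter-clockwise winding assumed):

```python
def edge(a, b, p):
    """Signed area term: positive if p lies to the left of the edge a -> b."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def draw_triangle(framebuffer, zbuffer, tri, color):
    """Rasterize one triangle with per-pixel Z-buffering.
    tri is three (x, y, z) screen-space vertices in counter-clockwise order."""
    h, w = len(framebuffer), len(framebuffer[0])
    (x0, y0, z0), (x1, y1, z1), (x2, y2, z2) = tri
    area = edge((x0, y0), (x1, y1), (x2, y2))
    if area <= 0:
        return                                       # degenerate or back-facing
    for y in range(h):
        for x in range(w):
            p = (x + 0.5, y + 0.5)
            w0 = edge((x1, y1), (x2, y2), p)
            w1 = edge((x2, y2), (x0, y0), p)
            w2 = edge((x0, y0), (x1, y1), p)
            if w0 < 0 or w1 < 0 or w2 < 0:
                continue                             # pixel outside the triangle
            z = (w0 * z0 + w1 * z1 + w2 * z2) / area  # interpolated depth
            if z < zbuffer[y][x]:                    # closer than what's stored?
                zbuffer[y][x] = z                    # overwrite depth...
                framebuffer[y][x] = color            # ...and color

W, H = 8, 8
framebuffer = [["." for _ in range(W)] for _ in range(H)]
zbuffer = [[float("inf")] * W for _ in range(H)]     # step 1: init to "far away"
draw_triangle(framebuffer, zbuffer, ((0, 0, 5), (8, 0, 5), (0, 8, 5)), "A")  # far
draw_triangle(framebuffer, zbuffer, ((0, 0, 2), (8, 0, 2), (8, 8, 2)), "B")  # near
print("\n".join("".join(row) for row in framebuffer))  # "B" wins where they overlap
```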
Comparison with Other Hidden Surface Removal Techniques Link to heading
Let’s compare Z-buffering with alternative visibility techniques.
Method | Accuracy | Sorting Required? | Memory Usage | Performance Impact | Best Use Case |
---|---|---|---|---|---|
Z-Buffering | Per-pixel | No | High (depth buffer) | Moderate (per-pixel comparisons) | General 3D rendering (games, CAD) |
Painter’s Algorithm | Per-object/triangle | Yes (Back-to-front sorting) | Low | High (due to sorting & overdrawing) | Simple scenes with few overlapping objects |
Binary Space Partitioning (BSP) | Per-object/triangle | Yes (Precomputed BSP tree) | Medium | Very high (preprocessing overhead) | Static scenes (Doom-style rendering) |
Ray Tracing | Per-pixel | No | High | Very high (traces rays for each pixel) | High-quality reflections/shadows (Offline rendering, RTX GPUs) |
Painter’s Algorithm (Back-to-Front Sorting)
- Sort all polygons by depth (far → near).
- Draw them in order, so closer objects naturally overwrite farther ones.
- Issues:
- Sorting overhead is expensive.
- Transparency is hard to handle.
- Doesn’t work well for intersecting objects.
Binary Space Partitioning (BSP)
- Preprocess the scene into a BSP tree.
- At runtime, traverse the tree to determine drawing order.
- Used in classic games like DOOM (1993).
- Issues:
- Works best for static geometry (dynamic objects break the tree).
- Preprocessing is expensive.
Ray Tracing (Alternative to Rasterization)
- Instead of rasterizing triangles, traces rays from the camera into the scene.
- Handles shadows, reflections, and refractions naturally.
- Used in modern RTX GPUs.
- Issues:
- Computationally expensive without hardware acceleration.
- Needs denoising techniques to remove noise in real-time applications.
Conclusion Z-buffering is the best balance between accuracy and performance for real-time rendering.
Z-Buffer Precision Issues & Solutions Link to heading
Z-buffer precision is limited by the number of bits allocated per pixel. Common depths are 16-bit, 24-bit, and 32-bit.
Precision Problem: Z-Fighting When two surfaces are very close together, limited Z-buffer precision causes fluctuations in depth values, leading to flickering artifacts.
Example Overlapping polygons on a car’s dashboard in a video game may “flicker” as the camera moves.
Solutions
- Use a 24-bit or 32-bit Z-buffer instead of 16-bit.
- Adjust Near and Far Plane Clipping:
- Keep the near plane as far out as possible (the sketch after this list puts numbers on this).
- Avoid using an excessively large far plane (e.g., 0.1m near → 10,000m far is bad).
- Use Floating-Point Depth Buffers (if supported).
- Enable Depth Biasing (Polygon Offset) to slightly separate overlapping surfaces.
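To put rough numbers on the near-plane advice (my own back-of-the-envelope, assuming the standard non-reversed perspective depth mapping and a fixed-point 24-bit buffer):

```python
def depth_step_world_units(d, near, far, bits=24):
    """Approximate world-space distance covered by one depth-buffer step at view
    distance d, for the standard mapping depth(d) = far*(d-near) / (d*(far-near)).
    Its derivative is far*near / ((far-near)*d^2), so one quantization step of
    1/(2^bits - 1) corresponds to the distance returned below."""
    step = 1.0 / (2 ** bits - 1)
    return step * (far - near) * d * d / (far * near)

# Same object 1 km away, same 10 km far plane, different near planes:
print(round(depth_step_world_units(1000.0, near=0.1, far=10000.0), 3))  # ~0.6 m
print(round(depth_step_world_units(1000.0, near=1.0, far=10000.0), 3))  # ~0.06 m
```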
Optimizations Since Z-buffering requires memory reads & writes for every pixel, it can be slow. Here are some optimizations:
Early Z-Culling (Hierarchical Z-Buffering)
- Before fragment shading, discard pixels that fail the depth test.
- Modern GPUs use hierarchical Z-buffers to reject large chunks of pixels early.
Reverse Z-Buffering
- Map the near plane to depth 1.0 and the far plane to 0.0 (the reverse of the usual convention) and pair it with a floating-point depth buffer.
- Floating-point values are densest near zero, which now corresponds to the distant part of the scene, so precision is spread far more evenly across the whole depth range.
- Used in many modern DirectX engines, and in OpenGL via clip control.
Tiling & Deferred Rendering
- GPUs like PowerVR (used in mobile devices) use tile-based rendering.
- The scene is split into tiles, and depth tests are performed in small chunks.
Z-Buffering Summary Link to heading
🔹 Z-buffering is the most widely used technique for real-time hidden surface removal. 🔹 It provides per-pixel accuracy, but has precision issues that require depth buffer optimizations. 🔹 Compared to other methods, it is more scalable and general-purpose, making it dominant in modern GPUs. 🔹 Future techniques (e.g., Ray Tracing) may complement Z-buffering for hybrid rendering.
I have used LLMs to generate some content for this post. Involvement of LLMs w.r.t. the content is approx. 50%.