The silicon beneath inference: a GPU architecture primer for AI engineers
Every production inference deployment sits on a stack of abstractions. Frameworks like vLLM and TensorRT-LLM present clean APIs. CUDA kernels hide the hardware. But when latency spikes, throughput plateaus, or costs spiral, the abstraction layers become walls. The engineers who break through those walls are the ones who understand what sits beneath: the silicon itself.
This matters more than it once did. As models grow larger and inference costs dominate training costs in production, the GPU is no longer just “the thing that makes AI fast.” It is the constraint surface against which every optimization decision is made. Batch size, quantization strategy, attention mechanism, model parallelism: each of these choices interacts with specific properties of the hardware. Understanding those properties turns guesswork into engineering.
What follows is a practitioner’s map of GPU architecture for AI inference, covering the three major silicon vendors (NVIDIA, AMD, Intel), the memory hierarchy that governs performance, and the techniques that bridge the gap between raw hardware capability and practical throughput.
How GPUs think about work
The memory hierarchy determines everything
A GPU is, at its core, a machine for moving data. The compute units (CUDA cores, Tensor Cores, or their equivalents) can perform arithmetic far faster than the memory system can feed them. This imbalance - the gap between compute throughput and memory bandwidth - is the central tension of GPU architecture. Every design decision radiates from it.
The memory hierarchy exists to manage this tension. It consists of progressively larger, slower layers of storage, each acting as a staging area for the one above it. Registers sit closest to the compute units, with access times of roughly one clock cycle and bandwidth exceeding 19 TB/s per streaming multiprocessor (SM). Shared memory and L1 cache, which occupy the same physical SRAM on NVIDIA GPUs, offer 128 to 228 KB per SM with latencies of 20 to 30 cycles. The L2 cache (40 to 60 MB on current-generation GPUs) is shared across all SMs, with latencies around 150 to 200 cycles. HBM (High Bandwidth Memory), the main GPU memory where model weights reside, delivers 2 to 8 TB/s of bandwidth but at latencies of 400 to 800 cycles. System memory on the host CPU, reached via PCIe or NVLink, adds microseconds of latency.
Figure 1: GPU memory hierarchy showing approximate latencies and bandwidths. Width indicates relative capacity.
For inference, the critical insight is this: model weights live in HBM. During the decode phase of autoregressive generation (when the model produces tokens one at a time), the GPU must read all of the model’s weights for every single token. The arithmetic intensity of this operation - the ratio of compute operations to bytes moved - is extremely low. The result is that most inference workloads are memory-bandwidth-bound, not compute-bound. The Tensor Cores sit idle, waiting for data.
Throughput over latency: the GPU’s design philosophy
CPUs optimize for latency. They invest transistors in branch predictors, speculative execution, deep out-of-order pipelines, and large caches, all aimed at making a single thread run as fast as possible. A modern server CPU might run tens to a few hundred hardware threads, each completing individual operations in nanoseconds.
GPUs take the opposite approach. They optimize for throughput by running thousands of threads simultaneously, each individually slower than a CPU thread. Where a CPU hides memory latency through caching and speculation, a GPU hides it through parallelism: when one group of threads (a “warp” of 32 threads on NVIDIA hardware) stalls waiting for a memory fetch, the hardware scheduler instantly switches to another warp that has data ready. This technique - latency hiding through massive thread occupancy - is why GPUs can sustain high utilization even with 400-cycle memory access times.
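The amount of parallelism this requires can be estimated with Little’s law (concurrency = throughput x latency). The sketch below is a back-of-envelope calculation, not a hardware model: the 1.7 GHz clock and the 128-byte transaction size are assumptions introduced here for illustration, and the function name `bytes_in_flight` is hypothetical.

```python
def bytes_in_flight(bandwidth_bytes_s, latency_cycles, clock_hz):
    """Little's law: data that must be outstanding to keep the bus busy."""
    latency_s = latency_cycles / clock_hz
    return bandwidth_bytes_s * latency_s

# H100 bandwidth from this article; clock speed is an assumed round number.
needed = bytes_in_flight(3.35e12, latency_cycles=400, clock_hz=1.7e9)
transactions = needed / 128  # assumed 128-byte memory transactions

print(f"{needed / 1e3:.0f} KB in flight, about {transactions:.0f} transactions")
```

Under these assumptions, several thousand memory requests must be outstanding at any moment to saturate HBM, which is why a GPU keeps so many warps resident.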
The parallel structure maps naturally to neural network inference. A matrix multiplication (the dominant operation in transformer models) decomposes into thousands of independent multiply-accumulate operations that can run on separate threads. Each thread computes a small tile of the output matrix, and the results combine to form the final answer. This is not a coincidence: GPU architectures and deep learning co-evolved, each shaping the other.
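The tiling idea can be sketched in plain Python. This is an illustration of the decomposition only, not how a GPU kernel is written: the function name `tiled_matmul` and the tile size are assumptions chosen for readability, and each output tile stands in for the work that one group of threads would own.

```python
def tiled_matmul(a, b, tile=2):
    """Multiply a (m x k) by b (k x n), one output tile at a time."""
    m, k, n = len(a), len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    # Each (i0, j0) pair is an independent output tile. On a GPU, separate
    # groups of threads would compute these tiles concurrently.
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            # Walk the shared k dimension in tile-sized steps, as a kernel
            # would when staging slices of a and b through on-chip memory.
            for k0 in range(0, k, tile):
                for i in range(i0, min(i0 + tile, m)):
                    for j in range(j0, min(j0 + tile, n)):
                        acc = out[i][j]
                        for kk in range(k0, min(k0 + tile, k)):
                            acc += a[i][kk] * b[kk][j]  # multiply-accumulate
                        out[i][j] = acc
    return out

a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
print(tiled_matmul(a, b))
```

Because no output tile depends on another, the two outer loops can run in any order or all at once, which is the property the hardware exploits.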
Three architectures, one problem
NVIDIA: the incumbent’s depth
NVIDIA’s data center GPU lineage runs from Ampere (A100, 2020) through Hopper (H100/H200, 2022/2024) to Blackwell (B200, launched 2025). Each generation has increased both compute density and memory bandwidth, but the architectural skeleton remains consistent: an array of streaming multiprocessors (SMs), each containing CUDA cores for general-purpose parallel computation, Tensor Cores for matrix math, and a hierarchy of on-chip memory.
The H100 GPU contains 132 SMs. Each SM holds 128 FP32 CUDA cores, four fourth-generation Tensor Cores, a Transformer Engine that dynamically selects between FP8 and FP16 precision per layer, four warp schedulers, a 256 KB register file, and up to 228 KB of configurable L1 cache and shared memory. The SMs share a 50 MB L2 cache. The GPU connects to 80 GB of HBM3 at 3.35 TB/s.
Figure 2: Internal structure of an H100 SM. Each SM contains four processing partitions, each with its own warp scheduler and 32 CUDA cores. Tensor Cores and the Transformer Engine handle matrix operations for AI workloads.
The Tensor Core is the critical unit for inference. Unlike a CUDA core, which executes one multiply-add per cycle, a fourth-generation Tensor Core performs a 16x8x16 matrix multiply-accumulate in a single operation, yielding hundreds of operations per cycle. In FP8 precision (the sweet spot for most inference), the H100 delivers 1,979 TFLOPS of dense compute. The H200, an incremental upgrade, keeps the same compute engine but expands memory to 141 GB of HBM3e at 4.8 TB/s, directly addressing the bandwidth bottleneck for large-model inference.
The B200 (Blackwell architecture) uses a dual-die chiplet design with 208 billion transistors, delivers 4,500 TFLOPS in FP8 and 9,000 TFLOPS in FP4, and offers 192 GB of HBM3e at 8 TB/s. Blackwell’s fifth-generation Tensor Cores add native FP4 support, which effectively doubles inference throughput for compatible workloads by halving the bytes that must traverse the memory bus per operation.
AMD: the memory advantage
AMD’s data center GPU strategy centers on the Instinct MI300 series, built on the CDNA 3 architecture. Where NVIDIA uses the term “streaming multiprocessor,” AMD uses “compute unit” (CU). Each CU contains 64 stream processors and four matrix cores that execute MFMA (matrix fused multiply-add) instructions. The MI300X packs 304 CUs, totaling 19,456 stream processors.
AMD’s competitive edge lies in memory. The MI300X ships with 192 GB of HBM3 at 5.3 TB/s, a substantial lead over the H100’s 80 GB at 3.35 TB/s. The MI325X extends this to 256 GB of HBM3e at 6 TB/s. For large language models, memory capacity determines whether a model fits on a single GPU or needs sharding across multiple devices (with the associated interconnect overhead). A 70B-parameter model in FP16 requires roughly 140 GB of memory for weights alone, which fits comfortably on an MI300X, barely squeezes into a 141 GB H200 with no room for KV cache, and does not fit on an 80 GB H100 at all.
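The capacity arithmetic is simple enough to check directly. The sketch below uses the HBM capacities quoted in this article and counts weights only; the function names are hypothetical, and a real deployment also needs room for KV cache, activations, and runtime overhead.

```python
# HBM capacities in GB, as quoted in this article.
GPU_HBM_GB = {"H100": 80, "H200": 141, "MI300X": 192, "MI325X": 256}

def weight_gb(params_billion, bytes_per_param):
    """Weight footprint in GB (1 GB = 1e9 bytes), weights only."""
    return params_billion * bytes_per_param

def headroom_gb(gpu, params_billion, bytes_per_param):
    """HBM left after loading weights; negative means the model must be sharded."""
    return GPU_HBM_GB[gpu] - weight_gb(params_billion, bytes_per_param)

for gpu in GPU_HBM_GB:
    print(gpu, headroom_gb(gpu, params_billion=70, bytes_per_param=2))
```

Whatever headroom remains after the weights is what the KV cache has to live in, which ties capacity directly to the achievable batch size.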
The tradeoff is software maturity. AMD’s ROCm stack has narrowed the gap with CUDA significantly, with native PyTorch and TensorFlow support and HIP as a CUDA translation layer. But analysis from 2025 shows MI300X achieving only 37 to 66 percent of H100/H200 performance on many workloads, largely due to software overhead rather than hardware limitation. The hardware is competitive; the ecosystem still trails.
Intel: the value proposition
Intel’s AI accelerator story has consolidated around Gaudi 3 (from the acquired Habana Labs team), with the Xe-HPC architecture (Data Center GPU Max, “Ponte Vecchio”) playing a secondary role. Gaudi 3 uses a dual-die package with 64 tensor processor cores (TPCs) and eight matrix multiplication engines (MMEs). Each TPC is a VLIW vector processor with 256-byte-wide SIMD execution units. The chip delivers 1,835 TFLOPS in BF16 and FP8, roughly on par with the H100’s dense (non-sparse) FP8 throughput.
Gaudi 3’s distinguishing feature is its networking architecture: 24 integrated 200 Gbps Ethernet ports per chip, enabling direct scale-out without external network switches. Its primary limitation is memory: 128 GB of HBM2e (not HBM3 or HBM3e) at 3.67 TB/s. The older memory generation constrains both capacity and bandwidth relative to competitors. Intel positions Gaudi 3 on price, often at half the cost of an H100, making it attractive for organizations where software flexibility (Gaudi supports PyTorch directly through the SynapseAI SDK) and total cost of ownership matter more than peak throughput.
Figure 3: Compute unit architecture comparison across NVIDIA Hopper, AMD CDNA 3, and Intel Gaudi 3.
The numbers side by side
Table 1: Memory specifications across GPU families
| GPU | Architecture | HBM capacity | HBM type | Bandwidth | TDP |
|---|---|---|---|---|---|
| NVIDIA A100 SXM | Ampere | 80 GB | HBM2e | 2.0 TB/s | 400W |
| NVIDIA H100 SXM | Hopper | 80 GB | HBM3 | 3.35 TB/s | 700W |
| NVIDIA H200 SXM | Hopper | 141 GB | HBM3e | 4.8 TB/s | 700W |
| NVIDIA B200 | Blackwell | 192 GB | HBM3e | 8.0 TB/s | 1000W |
| AMD MI300X | CDNA 3 | 192 GB | HBM3 | 5.3 TB/s | 750W |
| AMD MI325X | CDNA 3 | 256 GB | HBM3e | 6.0 TB/s | 1000W |
| Intel Gaudi 3 | Gaudi | 128 GB | HBM2e | 3.67 TB/s | 600-900W |
Table 2: Compute throughput (dense, non-sparse TFLOPS)
| GPU | FP64 | FP32/TF32 | FP16/BF16 | FP8 | FP4 |
|---|---|---|---|---|---|
| NVIDIA A100 | 19.5 | 156 (TF32) | 312 | n/a | n/a |
| NVIDIA H100 | 67 | 495 (TF32) | 990 | 1,979 | n/a |
| NVIDIA H200 | 67 | 495 (TF32) | 990 | 1,979 | n/a |
| NVIDIA B200 | 40 | ~1,125 (TF32) | ~2,250 | 4,500 | 9,000 |
| AMD MI300X | 81.7 | 653.7 (TF32) | 1,307 | 2,615 | n/a |
| AMD MI325X | 81.7 | 653.7 (TF32) | 1,307 | 2,615 | n/a |
| Intel Gaudi 3 | n/a | n/a | 1,835 (BF16) | 1,835 | n/a |
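NVIDIA numbers are Tensor Core throughput. AMD numbers are Matrix Core throughput, except the FP64 column, which lists the vector rate (the MI300X’s FP64 matrix rate is 163.4 TFLOPS). Gaudi 3 BF16 and FP8 are at parity (same rate) - unusual, as most architectures double throughput from BF16 to FP8.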
From silicon to inference optimization
Two phases, two bottlenecks
LLM inference splits into two distinct computational phases, each with a different performance profile. The prefill phase processes the entire input prompt in parallel. Because the GPU operates on a large input matrix (all prompt tokens at once), the arithmetic intensity is high: many multiply-accumulate operations per byte of memory read. Tensor Cores stay busy. This phase is compute-bound.
The decode phase generates output tokens one at a time, autoregressively. Each token requires reading the full model weights from HBM but performs relatively little computation per byte loaded. The arithmetic intensity drops to around 1 FLOP per byte at batch size 1 in FP16, far below the roofline ridge point. This phase is memory-bandwidth-bound. For most production inference workloads, the decode phase dominates total latency.
Figure 4: LLM inference pipeline from model loading through token generation. The decode phase (step 4) dominates latency and is memory-bandwidth-bound.
The memory-bandwidth bottleneck, mathematically
A concrete example makes the bottleneck tangible. Take a 70B-parameter model stored in FP16 (2 bytes per parameter). The model weights occupy 140 GB. During the decode phase, generating a single token requires reading all 140 GB of weights from HBM and performing roughly 70 billion multiply-add operations, or 140 GFLOP (2 FLOPs per parameter). The weights exceed a single H100’s 80 GB, so treat the figures below as an idealized roofline rather than a deployable single-GPU configuration.
On an H100 (3.35 TB/s bandwidth, 990 TFLOPS FP16):
Time to read weights = 140 GB / 3,350 GB/s = 41.8 ms
Time to compute = 140 GFLOP / 990 TFLOPS = 0.14 ms
Arithmetic intensity = 140 GFLOP / 140 GB = 1 FLOP/byte
Roofline ridge point = 990 TFLOPS / 3.35 TB/s = 295 FLOP/byte
At 1 FLOP/byte, we are deep in the memory-bound region.
Theoretical max decode throughput = 1 / 0.0418s = ~24 tokens/second (batch size 1)
The arithmetic intensity of 1 FLOP/byte is roughly 300x below the roofline ridge point of 295 FLOP/byte. The compute units sit more than 99% idle during single-token decode. This explains why the H200’s 43% bandwidth increase (4.8 TB/s vs 3.35 TB/s) translates almost directly into a proportional inference speedup for memory-bound workloads, despite having identical compute.
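The same calculation can be written as a small function so that other models and GPUs can be plugged in. This is a minimal sketch under stated assumptions: it counts 2 FLOPs (one multiply, one add) per parameter per token, so a 70B model costs 140 GFLOP per decoded token; it ignores KV cache reads and attention; and the name `decode_step` is hypothetical.

```python
def decode_step(params, bytes_per_param, bandwidth_bytes_s, peak_flops):
    """Idealized single-token decode: weights are read once, batch size 1."""
    weight_bytes = params * bytes_per_param
    flops = 2 * params  # one multiply and one add per weight per token
    t_mem = weight_bytes / bandwidth_bytes_s
    t_compute = flops / peak_flops
    return {
        "t_mem_ms": t_mem * 1e3,
        "t_compute_ms": t_compute * 1e3,
        "intensity": flops / weight_bytes,           # FLOP per byte
        "ridge": peak_flops / bandwidth_bytes_s,     # FLOP per byte
        "tokens_per_s": 1 / max(t_mem, t_compute),   # slower side wins
    }

h100 = decode_step(70e9, 2, 3.35e12, 990e12)
h200 = decode_step(70e9, 2, 4.8e12, 990e12)
print(h100)
print(f"H200 speedup: {h200['tokens_per_s'] / h100['tokens_per_s']:.2f}x")
```

Because the memory time is the larger of the two by a wide margin, the H200’s speedup in this model equals its bandwidth ratio; the identical compute peak never enters the result.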
Batching: amortizing the weight reads
Batching is the primary lever for improving GPU utilization during decode. If the GPU serves multiple requests simultaneously, it reads the weights from HBM once and applies them to all requests in the batch. The compute scales linearly with batch size, but the memory traffic stays roughly constant (ignoring KV cache growth). Larger batches push the workload from memory-bound toward compute-bound.
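Table 3: Idealized throughput and utilization by batch size (70B model, FP16, H100 SXM)
| Batch size | Arithmetic intensity | Bottleneck | Approx. tokens/s | Compute util. |
|---|---|---|---|---|
| 1 | 1 FLOP/byte | Memory BW | ~24 | <1% |
| 8 | 8 FLOP/byte | Memory BW | ~190 | ~3% |
| 32 | 32 FLOP/byte | Memory BW | ~770 | ~11% |
| 128 | 128 FLOP/byte | Memory BW | ~3,100 | ~43% |
| 256 | 256 FLOP/byte | Approaching ridge | ~6,100 | ~87% |
| 512 | 512 FLOP/byte | Compute | ~7,100 | ~100% |
Estimates are roofline ceilings for a decode-phase, weight-dominant workload, counting 2 FLOPs and 2 bytes per parameter, so arithmetic intensity equals batch size. Real throughput is lower and varies with sequence length, KV cache size, and attention overhead. FP16 weights for a 70B model (140 GB) exceed a single H100’s 80 GB, so the figures describe an idealized roofline rather than a deployable single-GPU configuration. The transition from memory-bound to compute-bound occurs near batch size 295 on H100, where arithmetic intensity reaches the ridge point.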
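The batching effect can be modeled in a few lines. The sketch below is a simplified roofline model, not a benchmark: it assumes weights are read once per decode step regardless of batch size, counts 2 FLOPs per parameter per sequence, and ignores KV cache traffic and attention. The function name `batched_decode` and its defaults (70B parameters, FP16, H100 figures from this article) are illustrative.

```python
def batched_decode(batch, params=70e9, bytes_per_param=2,
                   bandwidth=3.35e12, peak_flops=990e12):
    """Idealized decode step time and throughput at a given batch size."""
    t_mem = params * bytes_per_param / bandwidth  # weights read once per step
    t_compute = batch * 2 * params / peak_flops   # compute grows with batch
    step = max(t_mem, t_compute)
    return {
        "tokens_per_s": batch / step,
        "compute_util": t_compute / step,
        "bound": "memory" if t_mem >= t_compute else "compute",
    }

for b in (1, 8, 32, 128, 256, 512):
    r = batched_decode(b)
    print(b, round(r["tokens_per_s"]), f"{r['compute_util']:.1%}", r["bound"])
```

In this model throughput grows linearly with batch size until the compute time overtakes the weight-read time, after which adding requests only lengthens each step.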
The catch: larger batch sizes require more memory for KV caches. Each request’s KV cache grows linearly with sequence length and model depth. For a 70B model with a 4096-token context, the KV cache per request is roughly 2.5 GB in FP16 (the exact figure depends on the layer count and the number of key-value heads). At batch size 32, that is 80 GB of KV cache alone, as much as an H100’s entire 80 GB of HBM before any weights are loaded. This is why the H200’s 141 GB and the MI300X’s 192 GB matter: they allow larger batch sizes before exhausting VRAM.
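The per-request KV cache size follows from a short formula: two tensors (keys and values) per layer, each holding one vector per KV head per token. The sketch below uses a hypothetical 70B-class configuration (80 layers, head dimension 128); those values and the KV head counts are assumptions for illustration, not figures from this article. The result is sensitive to the number of KV heads: 16 reproduces a figure close to the 2.5 GB quoted above, while a grouped-query layout with 8 KV heads halves it.

```python
GIB = 1024 ** 3

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """KV cache size for one request at a given sequence length."""
    # Factor of 2: one key vector and one value vector per layer per token.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

# Hypothetical 70B-class shapes; only kv_heads varies between the cases.
for kv_heads in (8, 16, 64):
    size = kv_cache_bytes(layers=80, kv_heads=kv_heads, head_dim=128, seq_len=4096)
    print(f"{kv_heads} KV heads: {size / GIB:.2f} GiB per request")
```

Multiplying the per-request figure by the batch size gives the KV budget that must fit alongside the weights.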
Quantization: trading precision for bandwidth
Quantization reduces the number of bytes per parameter, directly increasing effective memory bandwidth. Moving from FP16 (2 bytes) to INT8 (1 byte) halves the data that must traverse the memory bus, roughly doubling decode throughput for memory-bound workloads. FP8 achieves similar savings while maintaining the floating-point format’s dynamic range. FP4 (supported on NVIDIA Blackwell) quarters the original footprint.
The quality tradeoff is nuanced. FP8 quantization for inference causes minimal accuracy degradation on most production tasks (classification, summarization, standard chat completion). INT4 quantization (GPTQ, AWQ) introduces measurable quality loss but enables models that otherwise would not fit in GPU memory. The engineering decision is a three-way tradeoff between model quality, throughput, and hardware cost.
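To make the byte savings and the rounding error concrete, here is a minimal sketch of symmetric absmax INT8 quantization, the simplest scheme of this kind. It is not GPTQ, AWQ, or any framework’s implementation; the function names are hypothetical, and production schemes typically quantize per channel or per group rather than per tensor.

```python
def quantize_int8(values):
    """Symmetric absmax quantization: map floats to integers in [-127, 127]."""
    largest = max(abs(v) for v in values)
    scale = largest / 127 if largest > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats; error is at most half a quantization step."""
    return [x * scale for x in q]

weights = [0.5, -1.27, 0.02, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q, scale)
print([round(w - r, 6) for w, r in zip(weights, restored)])
```

Each weight now occupies one byte instead of two, plus one shared scale, which is where the bandwidth saving comes from; the printed differences are the precision given up in exchange.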
The framework layer
Several inference frameworks translate these hardware-level optimizations into practical APIs. TensorRT-LLM (NVIDIA) provides the tightest hardware integration, with custom CUDA kernels for attention, quantization-aware compilation, and Transformer Engine support. It targets maximum throughput on NVIDIA hardware. vLLM introduced PagedAttention, which manages KV cache memory in fixed-size blocks (like OS virtual memory pages), eliminating fragmentation and enabling higher batch sizes. It supports both NVIDIA and AMD GPUs. Ollama packages llama.cpp behind a user-friendly interface for local deployment, primarily on consumer hardware. Each framework represents a different point on the ease-of-use vs. performance frontier.
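The block-based bookkeeping behind paged KV caching can be illustrated with a toy allocator. This is a conceptual sketch only and does not reflect vLLM’s actual classes or API: the class name `PagedKVAllocator`, its methods, and the block size are all hypothetical, and it tracks block ids without storing any tensors.

```python
class PagedKVAllocator:
    """Toy block table: maps each request to a list of fixed-size KV blocks."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # physical block ids
        self.tables = {}   # request id -> list of physical block ids
        self.lengths = {}  # request id -> number of tokens stored

    def append_token(self, req):
        table = self.tables.setdefault(req, [])
        n = self.lengths.get(req, 0)
        # Allocate a new block only when the last one is full (or none exists).
        if n == len(table) * self.block_size:
            if not self.free:
                raise MemoryError("no free KV blocks")
            table.append(self.free.pop())
        self.lengths[req] = n + 1

    def release(self, req):
        # Finished requests return their blocks to the pool for reuse.
        self.free.extend(self.tables.pop(req, []))
        self.lengths.pop(req, None)

alloc = PagedKVAllocator(num_blocks=4, block_size=2)
for _ in range(3):
    alloc.append_token("a")
alloc.append_token("b")
print(alloc.tables, len(alloc.free))
```

Because blocks are handed out on demand and need not be contiguous, a request wastes at most one partially filled block, instead of a reservation sized for its maximum possible length.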
Production tradeoffs
Identifying the bottleneck: roofline thinking
The roofline model provides a framework for reasoning about GPU performance. It plots achievable FLOPS as a function of arithmetic intensity (FLOP/byte of memory traffic). Below the ridge point, performance scales with memory bandwidth; above it, performance plateaus at peak compute. For the H100, the ridge point is approximately 295 FLOP/byte in FP16. For the MI300X, it is about 247 FLOP/byte.
At small batch sizes (1 to 8), FP16 decode operates at roughly 1 to 8 FLOP/byte, firmly in the memory-bound regime. Prefill, by contrast, can reach 100+ FLOP/byte with long prompts. The practical implication: optimization effort during decode should focus on reducing memory traffic (quantization, KV cache compression, speculative decoding), while prefill optimization targets compute efficiency (custom GEMM kernels, FlashAttention).
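The roofline itself is a single `min`. The sketch below uses the dense FP16 peaks and bandwidths quoted in this article; the function names are hypothetical, and the model deliberately ignores caches, kernel efficiency, and everything else that keeps real workloads below the roof.

```python
def attainable_tflops(intensity, peak_tflops, bandwidth_tb_s):
    """Roofline: the lower of the compute roof and the bandwidth slope."""
    # FLOP/byte times TB/s gives TFLOP/s, so the units line up directly.
    return min(peak_tflops, intensity * bandwidth_tb_s)

def ridge_point(peak_tflops, bandwidth_tb_s):
    """Arithmetic intensity at which a workload stops being memory-bound."""
    return peak_tflops / bandwidth_tb_s

# (dense FP16 TFLOPS, TB/s) as quoted in this article.
gpus = {"H100": (990, 3.35), "MI300X": (1307, 5.3)}
for name, (peak, bw) in gpus.items():
    print(name, round(ridge_point(peak, bw), 1),
          [attainable_tflops(i, peak, bw) for i in (1, 8, 100, 1000)])
```

Reading the output left to right shows attainable performance climbing with intensity along the bandwidth slope and then flattening at the compute peak.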
Hardware-software co-design in practice
The convergence of hardware features and software techniques tells a coherent story. NVIDIA’s Transformer Engine dynamically switches between FP8 and FP16 per layer because some layers tolerate lower precision while others do not; the hardware adapts where static quantization would force a single choice. FlashAttention restructures the attention computation to maximize data reuse in shared memory (SRAM), avoiding the O(n²) HBM reads that standard attention requires. PagedAttention addresses a software problem (KV cache memory fragmentation) that arises from a hardware constraint (fixed VRAM capacity).
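The piece of FlashAttention that makes block-wise processing possible is the online softmax: a running maximum and running sums that are rescaled as each new block of scores arrives, so the full score row never has to be materialized. The sketch below shows only that numerical trick for a single query with scalar values; it is not the FlashAttention kernel, and the function name and block size are assumptions for illustration.

```python
import math

def online_softmax_weighted_sum(scores, values, block=2):
    """Compute sum_i softmax(scores)_i * values_i in one pass over blocks."""
    running_max = float("-inf")
    denom = 0.0  # running sum of exp(score - running_max)
    acc = 0.0    # running weighted sum of values, scaled like denom
    for start in range(0, len(scores), block):
        s_blk = scores[start:start + block]
        v_blk = values[start:start + block]
        new_max = max(running_max, max(s_blk))
        # Rescale everything accumulated so far to the new maximum.
        correction = math.exp(running_max - new_max)
        denom *= correction
        acc *= correction
        for s, v in zip(s_blk, v_blk):
            w = math.exp(s - new_max)
            denom += w
            acc += w * v
        running_max = new_max
    return acc / denom

scores = [0.1, 2.0, -1.0, 3.5, 0.7]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
print(online_softmax_weighted_sum(scores, values))
```

Only one block of scores plus three running scalars is live at any time, which is what lets the real kernel keep its working set in on-chip SRAM.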
These are not independent optimizations. They compose. A well-tuned inference stack running on an H200 might use FP8 quantization (halving weight memory), PagedAttention (maximizing batch size within 141 GB), FlashAttention (reducing attention memory traffic), and continuous batching (avoiding idle cycles between requests). Each technique addresses a different layer of the hardware-software stack, and their combined effect is multiplicative.
The price-performance decision
Table 4: Price-performance comparison (approximate, as of early 2026)
| GPU | Purchase price | Cloud (on-demand/hr) | 70B FP8 tok/s | $/M tokens (est.) |
|---|---|---|---|---|
| NVIDIA H100 SXM | $25-30K | $1.90-3.50 | ~50 (bs=1) | $0.55-0.80 |
| NVIDIA H200 SXM | $30-40K | $3.50-5.00 | ~70 (bs=1) | $0.50-0.70 |
| NVIDIA B200 | $35-50K | $6.00+ | ~150 (bs=1) | $0.35-0.50 |
| AMD MI300X | $22-28K | $0.95-2.50 | ~40 (bs=1) | $0.45-0.65 |
| Intel Gaudi 3 | $12-16K | $2.00-3.00 | ~35 (bs=1) | $0.55-0.75 |
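Prices are approximate and vary by provider, contract terms, and availability. Token throughput estimates are for batch-size-1 single-GPU inference of a 70B model in FP8. The cost-per-million-token column cannot be derived from those batch-size-1 rates (50 tokens/s at $1.90 to $3.50 per hour works out to roughly $11 to $19 per million tokens); that column presumes batched serving at high utilization, and it depends heavily on batch size, utilization, and workload mix. Cloud pricing dropped significantly through 2025 as supply increased.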
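The relationship between hourly price, aggregate throughput, and cost per token is worth making explicit, since it shows how strongly batching drives the economics. The sketch below is plain arithmetic under the assumption of 100% utilization; the function names and the example inputs are illustrative, not vendor figures.

```python
def cost_per_million_tokens(price_per_hour, tokens_per_second):
    """Dollar cost of one million generated tokens at full utilization."""
    tokens_per_hour = tokens_per_second * 3600
    return price_per_hour / tokens_per_hour * 1_000_000

def required_tokens_per_second(price_per_hour, target_cost_per_million):
    """Aggregate throughput needed to hit a target cost per million tokens."""
    return price_per_hour / target_cost_per_million * 1_000_000 / 3600

# Batch size 1 at 50 tokens/s versus a batched aggregate of 1,000 tokens/s.
print(round(cost_per_million_tokens(2.50, 50), 2))
print(round(cost_per_million_tokens(2.50, 1000), 2))
print(round(required_tokens_per_second(2.50, 0.65)))
```

At an assumed $2.50 per hour, reaching a cost well under a dollar per million tokens takes on the order of a thousand tokens per second in aggregate, which is only reachable with batching.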
The table reveals a pattern: price per token does not always favor the fastest GPU. The MI300X, despite lower raw throughput, achieves competitive cost per token because of its lower acquisition cost and superior memory capacity (which enables larger batches). Gaudi 3, at roughly half the H100’s purchase price, delivers reasonable throughput for organizations prioritizing total cost of ownership over peak performance. The B200 dominates on raw throughput but commands a premium and higher power consumption (1000W, often requiring liquid cooling).
The right hardware choice depends on the workload profile. Latency-sensitive applications (real-time chat, code completion) favor the fastest per-request GPU regardless of cost. Throughput-oriented workloads (batch processing, offline analysis) favor cost-per-token, where AMD and Intel become competitive. Mixed workloads often land on the H200, which offers a balance of capacity, bandwidth, and ecosystem maturity.
What comes next
The trajectory is clear. Memory bandwidth grows faster than compute, because the industry has recognized that inference workloads are bandwidth-starved. NVIDIA’s Vera Rubin, announced at CES 2026 and now in production, brings HBM4 with 22 TB/s of bandwidth and 288 GB of capacity per GPU, with cloud availability expected in the second half of 2026. AMD’s MI355X (CDNA 4), which shipped in H2 2025, delivers 8 TB/s with 288 GB of HBM3e. Each generation narrows the gap between what compute units can consume and what the memory system can deliver.
But hardware alone does not solve the problem. The engineers who build effective inference systems will be those who understand the interplay: how quantization decisions interact with Tensor Core precision support, how batch sizes interact with KV cache memory pressure, how attention mechanisms interact with SRAM capacity. The silicon is the constraint surface. The optimization happens at the boundary between hardware and software, where architectural knowledge translates directly into lower latency, higher throughput, and reduced cost.
That boundary is where the interesting engineering lives.