February 9, 2026 · AI & Hardware · 10 min read

Understanding Unified Memory: Why It Matters for AI Workloads

When I started running large language models locally, I hit a wall that every hobbyist AI enthusiast hits: VRAM. My RTX 3080 Ti has 12 GB of video memory, which sounds like a lot until you try to load a 30-billion parameter model that needs 16 GB just for the weights. That's when I started paying attention to unified memory architectures — and understanding why they've changed the equation for local AI.

The Traditional Memory Split

In a conventional desktop, you have two separate memory pools. System RAM (DDR4 or DDR5, usually 16-64 GB) sits on the motherboard and serves the CPU. Video RAM (GDDR6X, typically 8-24 GB on consumer cards) sits on the GPU and serves graphics and compute workloads. These pools are connected by the PCIe bus, which on a Gen 4 x16 slot gives you roughly 32 GB/s of bandwidth in each direction.

For gaming, this split makes sense. The GPU needs fast, dedicated memory for textures and frame buffers, and the CPU needs its own pool for game logic and asset streaming. The PCIe bus is fast enough that data transfers between them don't bottleneck normal rendering.

For AI inference, this split is a disaster.

Why VRAM Is the Bottleneck

Large language models are memory-bound, not compute-bound, during inference. When you generate a token, the model needs to read billions of parameters from memory, multiply them by your input, and write the result. The actual math is fast — modern GPUs can do trillions of floating-point operations per second. The bottleneck is getting the data to the compute units fast enough.
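
You can turn that observation into a back-of-envelope number: every weight has to be streamed from memory once per generated token, so the best case is roughly tokens per second = memory bandwidth divided by model size in bytes. Here is a minimal sketch of that estimate, using the approximate Q4 model sizes and bandwidth figures that come up later in this post; real throughput lands well below these ceilings because of KV-cache reads, activations, and scheduling overhead.

    # Back-of-envelope decode-speed ceiling for a memory-bound LLM:
    # every weight byte is read once per token, so
    #   tokens/s <= memory_bandwidth / model_size.
    def decode_ceiling(model_gb: float, bandwidth_gb_s: float) -> float:
        return bandwidth_gb_s / model_gb

    # Approximate Q4 weight sizes and memory pools discussed in this post.
    models = {"7B": 4, "13B": 8, "30B": 17, "70B": 40}
    pools = {"GDDR6X": 912, "M2 Ultra unified": 800, "DDR4 dual-channel": 50}

    for name, gb in models.items():
        row = ", ".join(f"{pool}: {decode_ceiling(gb, bw):6.1f} tok/s"
                        for pool, bw in pools.items())
        print(f"{name:>3} (~{gb} GB): {row}")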

An RTX 3080 Ti has about 912 GB/s of memory bandwidth to its 12 GB of GDDR6X. That's incredibly fast, but only 12 GB of data can sit behind it. A 30B parameter model in Q4 quantization needs about 17 GB. So what happens when the model doesn't fit?

You have two options. You can split the model across GPU and CPU memory (called "offloading"), which means some layers run on the GPU at 912 GB/s and others run on the CPU at roughly 50 GB/s (DDR4-3200 dual-channel). The slow layers become the bottleneck, and your token generation speed plummets. Or you can use a smaller model that fits entirely in VRAM, sacrificing quality for speed.
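
To see why the slow layers dominate, here is a toy model of offloading. It is a sketch under a simplifying assumption: per-token time is just bytes read divided by the bandwidth of whichever pool holds each layer, ignoring PCIe transfers and compute.

    # Toy offloading model for a 30B-class model (~17 GB in Q4).
    MODEL_GB = 17   # approximate Q4 weight size
    GPU_BW = 912    # GB/s, GDDR6X
    CPU_BW = 50     # GB/s, dual-channel DDR4-3200

    def offloaded_tokens_per_second(gpu_fraction: float) -> float:
        """Decode-rate ceiling when gpu_fraction of the weights live in VRAM."""
        gpu_time = (MODEL_GB * gpu_fraction) / GPU_BW
        cpu_time = (MODEL_GB * (1 - gpu_fraction)) / CPU_BW
        return 1.0 / (gpu_time + cpu_time)

    for frac in (1.0, 0.9, 0.7, 0.5, 0.0):
        print(f"{frac:4.0%} of weights on GPU -> ~{offloaded_tokens_per_second(frac):5.1f} tok/s ceiling")

Even with about 70 percent of the weights in VRAM (roughly what 12 GB holds of a 17 GB model), the ceiling drops into single digits, because the CPU-resident layers account for almost all of the per-token time.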

Neither option is great. This is the problem unified memory solves.

How Unified Memory Works

In a unified memory architecture, the CPU and GPU share the same physical memory pool. Apple Silicon is the most prominent example: an M1 Max has 32 or 64 GB of memory that both the CPU cores and GPU cores can access directly, without copying data over a bus.

The key metrics are pool size and bandwidth. An M2 Ultra has 192 GB of unified memory with 800 GB/s of bandwidth. That means you can load a 70B parameter model (about 40 GB in Q4) entirely into memory that the GPU cores can access at high speed. No offloading, no PCIe bottleneck, no compromises.

Compare this to my desktop setup: 12 GB of VRAM at 912 GB/s plus 32 GB of system RAM at 50 GB/s. The Apple Silicon machine has far more memory usable for AI at high bandwidth, even though its peak bandwidth (800 GB/s) is slightly lower than dedicated GDDR6X (912 GB/s).

Real-World Performance Comparison

I've run the same models on both my RTX 3080 Ti desktop and a friend's M2 Max MacBook Pro (32 GB). Here are the results that surprised me:

For a 7B model (fits entirely in 12 GB VRAM): my RTX 3080 Ti wins handily. About 80 tokens/second versus 35 tokens/second on the M2 Max. The dedicated VRAM bandwidth advantage is clear when the model fits.

For a 13B model (needs ~8 GB in Q4, fits in both): the RTX 3080 Ti still wins, about 45 tokens/second versus 25 tokens/second. Same story — dedicated bandwidth matters.

For a 30B model (17 GB in Q4, doesn't fit in 12 GB VRAM): the M2 Max pulls ahead. About 15 tokens/second versus 8 tokens/second on my desktop with CPU offloading. The unified memory advantage kicks in exactly when you exceed the VRAM ceiling.

For a 70B model (40 GB in Q4): the M2 Max runs it at about 6 tokens/second. My desktop can barely run it at all — with aggressive CPU offloading, maybe 1-2 tokens/second, and it uses all 32 GB of system RAM doing it.
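
If you want to experiment with the same trade-off yourself, the relevant knob in llama.cpp-based tooling is how many layers get offloaded to the GPU. Here is a minimal llama-cpp-python sketch; the model paths are hypothetical, and it assumes the library is installed with a CUDA build on the desktop and a Metal build on the Mac.

    from llama_cpp import Llama

    # Desktop (12 GB VRAM): offload only as many layers as fit;
    # the remaining layers are evaluated by the CPU from system RAM.
    desktop = Llama(
        model_path="models/30b-q4.gguf",  # hypothetical path
        n_gpu_layers=40,                  # partial offload, tuned to VRAM
        n_ctx=4096,
    )

    # Apple Silicon (unified memory): the Metal backend reads weights
    # directly from the shared pool, so every layer can go to the GPU.
    mac = Llama(
        model_path="models/70b-q4.gguf",  # hypothetical path
        n_gpu_layers=-1,                  # -1 = offload all layers
        n_ctx=4096,
    )

    result = mac("Explain unified memory in one sentence.", max_tokens=64)
    print(result["choices"][0]["text"])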

The NVIDIA Response: Multi-GPU and NVLink

NVIDIA's answer to the VRAM problem is simple: buy more GPUs. Two RTX 3090s give you 48 GB of VRAM, which is enough for most models. But NVLink (the high-speed interconnect between GPUs) disappeared from consumer cards after the RTX 3090; on anything newer you rely on PCIe for inter-GPU communication, which adds latency.

The datacenter-class H100 has 80 GB of HBM3 at 3.35 TB/s — absolute monster specifications. But it costs $25,000-$40,000 per card. For hobbyists and small developers, that's not a real option.

This is where the value proposition of unified memory machines becomes clear. A Mac Studio with M2 Ultra (192 GB) costs about €4,500. It can run a 120B parameter model locally. Try doing that on any consumer NVIDIA setup at any price point.

What This Means for Developers

If you're building AI applications that need local inference — for privacy, latency, or cost reasons — your hardware choice depends entirely on model size (a rough sizing sketch follows the list):

  • 7B-13B models: A single GPU with 12+ GB VRAM is the best option. Faster, cheaper, and you get CUDA ecosystem benefits (PyTorch, llama.cpp CUDA builds, etc.).
  • 30B-70B models: Unified memory machines (Apple Silicon or AMD APUs with large memory) offer the best price-performance. The model fits in fast memory without offloading.
  • 70B+ models: Either cloud APIs, multi-GPU rigs, or high-end Apple Silicon. There's no cheap option here.
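
As promised, here is a rough sizing sketch for deciding which bucket a model falls into. It assumes about 0.6 bytes per parameter for Q4-style quantization (which roughly reproduces the sizes quoted above) and uses hypothetical thresholds of 12 GB VRAM and 64 GB unified memory; adjust both to your own hardware.

    # Rough Q4 footprint: ~0.6 bytes per parameter, which roughly matches
    # the sizes quoted in this post (7B ≈ 4 GB, 13B ≈ 8 GB, 30B ≈ 17 GB, 70B ≈ 40 GB).
    def q4_footprint_gb(params_billions: float) -> float:
        return params_billions * 0.6

    def hardware_tier(params_billions: float,
                      vram_gb: float = 12,      # hypothetical discrete-GPU VRAM
                      unified_gb: float = 64    # hypothetical unified-memory pool
                      ) -> str:
        size = q4_footprint_gb(params_billions)
        if size <= vram_gb:
            return f"~{size:.0f} GB: fits in VRAM, a single discrete GPU wins"
        if size <= unified_gb:
            return f"~{size:.0f} GB: exceeds VRAM, unified memory avoids offloading"
        return f"~{size:.0f} GB: cloud APIs, multi-GPU, or very large unified memory"

    for b in (7, 13, 30, 70, 120):
        print(f"{b:>4}B -> {hardware_tier(b)}")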

In my own setup, I run a 7B model on my RTX 3080 Ti for fast, routine tasks (code completion, quick questions) and hand larger models off to a Mac when I need deeper reasoning. It's not elegant, but it matches the workload to the hardware's strengths.

The Future: Where Is This Going?

AMD's MI300A puts CPU and GPU cores on one package with 128 GB of HBM3 shared between them (its GPU-only sibling, the MI300X, carries 192 GB). Datacenter pricing, but the architecture is telling. Intel has pointed in the same direction with Falcon Shores. The trend is clear: unified memory is becoming the standard for AI hardware.

On the consumer side, AMD's APUs are slowly increasing their memory bandwidth, and there are rumors of DDR5-based systems that could offer 100+ GB/s to both CPU and GPU. That's still 8x slower than an M2 Ultra's 800 GB/s, but it's twice as fast as today's dual-channel DDR4 offloading path.

For now, if you're serious about running large models locally and you don't want to build a multi-GPU rig, Apple Silicon with maxed-out RAM is the pragmatic choice. It's not the fastest per-token, but it runs models that simply won't fit elsewhere at this price point.
