
LLM VRAM calculator

How much GPU memory does a model actually need? Pick a model, a quantization, and a context length — this computes the real footprint (weights + KV cache + overhead), tells you if it fits your card, and shows your maximum context.

Sample result (16 GB card, 8,192-token context):

  • Verdict: fits, with 9.28 GB of headroom
  • Total: 6.72 GB
  • Speed: 38–61 tok/s
  • Model weights: 4.85 GB
  • KV cache @ 8,192 ctx: 1.07 GB
  • Framework overhead: ~0.80 GB
  • Max context in 16 GB: 78,978 tokens

All figures are estimates (±10–20%). Speed is for single-stream decode (batch 1), which is bandwidth-bound; real tok/s varies widely by backend (llama.cpp, vLLM, MLX), flash-attention, and context length. If a model doesn’t fit, partial CPU offload lets it run, but far slower.

How VRAM for an LLM is calculated

Three things occupy your GPU when you run a language model locally. The calculator above adds them up:

1. Model weights

Weights are usually the dominant cost. Their size is the parameter count times the bits per weight of your quantization, divided by eight to convert bits to bytes:

weights (GB) = parameters (billions) × bits-per-weight ÷ 8

An 8B model at Q4_K_M (~4.83 bits/weight) is about 4.8 GB; at FP16 (16 bits) it’s ~16 GB. That more-than-threefold reduction is why quantization makes local models practical (see distillation vs. quantization).
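As a quick check of the formula, here is a minimal Python sketch. The function name `weights_gb` is illustrative and not taken from the calculator; the 4.83 bits/weight figure for Q4_K_M is the approximate value quoted above, and 1 GB is taken as 1e9 bytes.

```python
def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Weight footprint in GB (1 GB = 1e9 bytes)."""
    # bits per weight / 8 = bytes per weight; billions of params -> GB
    return params_billions * bits_per_weight / 8


q4 = weights_gb(8, 4.83)   # 8B model at Q4_K_M (approximate bits/weight)
fp16 = weights_gb(8, 16)   # same model at FP16
print(f"Q4_K_M: {q4:.2f} GB, FP16: {fp16:.2f} GB")
```

Real GGUF files can differ slightly from this figure, since some tensors are stored at a different precision than the headline quantization.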

2. KV cache

The model caches a key vector and a value vector for every token in the context window, in every layer and for every KV head. This is the part most people forget, and at long context it can rival the weights:

KV cache (GB) = 2 × layers × kv-heads × head-dim × bytes-per-element × tokens ÷ 1e9

The leading 2 counts keys and values, and bytes-per-element is 2 for an FP16 cache. Modern models use grouped-query attention (GQA), with fewer key/value heads than attention heads, which shrinks the cache in proportion: 8 KV heads instead of 32 is a 4× reduction. Quantizing the KV cache from FP16 to Q8 or Q4 (the “KV cache precision” control above) roughly halves or quarters it, letting you push context much further.
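The formula can be sketched in a few lines of Python. The architecture used here (32 layers, 8 KV heads, head dimension 128, versus 32 KV heads without GQA) is an assumption typical of an 8B GQA model, not something read from the calculator, and the Q8 case ignores the small per-block scale overhead that real quantized caches carry.

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                tokens: int, bytes_per_elem: float = 2.0) -> float:
    """KV cache size in GB (1 GB = 1e9 bytes)."""
    # Factor 2: one key vector and one value vector per token, per layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens / 1e9


# Assumed 8B-class architecture: 32 layers, head_dim 128, 8,192-token context.
gqa_fp16 = kv_cache_gb(32, 8, 128, 8192)        # GQA, FP16 cache
mha_fp16 = kv_cache_gb(32, 32, 128, 8192)       # no GQA: 32 KV heads
gqa_q8 = kv_cache_gb(32, 8, 128, 8192, 1.0)     # GQA, ~1 byte per element

print(f"GQA FP16: {gqa_fp16:.2f} GB")
print(f"MHA FP16: {mha_fp16:.2f} GB")
print(f"GQA Q8:   {gqa_q8:.2f} GB")
```

Under these assumptions the GQA FP16 case comes out at about 1.07 GB, in line with the KV cache figure in the sample result near the top of this page.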

3. Framework overhead

This covers the CUDA context (or its equivalent on other backends), activation buffers, and compute buffers. Together they take roughly a gigabyte, depending on runtime (llama.cpp, vLLM, MLX) and settings.
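Putting the three parts together, a rough sketch of the fit check and the maximum-context calculation might look like the following. The function names are hypothetical, the inputs are the rounded figures from the sample result near the top of this page, and the per-token KV size assumes 32 layers, 8 KV heads, head dimension 128, and an FP16 cache.

```python
def total_gb(weights_gb: float, kv_gb: float, overhead_gb: float) -> float:
    """Total footprint: weights + KV cache + framework overhead."""
    return weights_gb + kv_gb + overhead_gb


def max_context_tokens(vram_gb: float, weights_gb: float,
                       overhead_gb: float, kv_bytes_per_token: int) -> int:
    """Largest context whose KV cache fits in the VRAM left after weights and overhead."""
    free_bytes = (vram_gb - weights_gb - overhead_gb) * 1e9
    return max(0, int(free_bytes // kv_bytes_per_token))


# Assumed: 2 (K and V) x 32 layers x 8 KV heads x head dim 128 x 2 bytes (FP16)
KV_BYTES_PER_TOKEN = 2 * 32 * 8 * 128 * 2

used = total_gb(4.85, 1.07, 0.80)
print(f"used: {used:.2f} GB, headroom on 16 GB: {16 - used:.2f} GB")
print("max context:", max_context_tokens(16, 4.85, 0.80, KV_BYTES_PER_TOKEN))
```

With these rounded inputs the maximum context lands at roughly 79,000 tokens, close to the 78,978 shown in the sample result; the small gap presumably comes from the calculator using unrounded values.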

How tokens per second is estimated

For a single stream (batch 1) running locally, generation speed is almost entirely memory-bandwidth bound: to produce each token the GPU must read the model’s weights from VRAM once. So a good first-order estimate is:

tokens/sec ≈ memory bandwidth (GB/s) ÷ data read per token (GB)

where the data read per token is the size of the active weights plus the KV cache at the current context length. This is why an RTX 4090 (~1 TB/s) generates nearly four times faster than an RTX 4060 (~272 GB/s) on the same model, and why speed drops as context grows (the KV cache gets read every step). The calculator shows a range because real bandwidth utilization runs ~50–80% depending on the backend, flash-attention, and settings.
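Here is a minimal sketch of that estimate, including the 50–80% utilization range. The function name is illustrative, and the 448 GB/s bandwidth is an assumed figure for a 16 GB card chosen for the example, not a value taken from the calculator.

```python
def tokens_per_sec_range(bandwidth_gbs: float, weights_gb: float, kv_gb: float,
                         utilization=(0.5, 0.8)) -> tuple:
    """First-order decode speed range for batch 1 (bandwidth-bound)."""
    read_per_token_gb = weights_gb + kv_gb   # weights plus KV cache, read once per token
    return tuple(bandwidth_gbs * u / read_per_token_gb for u in utilization)


# Assumed 448 GB/s card; 4.85 GB of weights and 1.07 GB of KV cache as in the sample result.
low, high = tokens_per_sec_range(448, 4.85, 1.07)
print(f"{low:.0f}-{high:.0f} tok/s")

# Same model on two cards: the speed ratio is simply the bandwidth ratio.
fast, _ = tokens_per_sec_range(1000, 4.85, 1.07)
slow, _ = tokens_per_sec_range(272, 4.85, 1.07)
print(f"speedup: {fast / slow:.1f}x")
```

Because the utilization factor cancels out when comparing two cards, the ratio depends only on bandwidth; in practice different cards and backends will not reach identical utilization.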

Mixture-of-Experts (MoE) models are the interesting case: all the weights must fit in VRAM, but only the active experts are read per token — so a 235B-A22B model needs the memory of a 235B but generates at the speed of a ~22B. The calculator accounts for this (the “Active” field).
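The MoE asymmetry can be illustrated with a short sketch. Everything here is an assumption for illustration: the function name, the 4.83 bits/weight quantization, the 1,000 GB/s bandwidth of a hypothetical system large enough to hold the model, and the 65% utilization. The KV cache is left out of the per-token read to keep the comparison simple.

```python
def moe_estimate(total_params_b: float, active_params_b: float,
                 bits_per_weight: float, bandwidth_gbs: float,
                 utilization: float = 0.65) -> tuple:
    """Return (resident GB, GB read per token, tok/s) for an MoE model."""
    resident_gb = total_params_b * bits_per_weight / 8   # every expert must be in memory
    read_gb = active_params_b * bits_per_weight / 8      # only active experts are read
    tok_per_sec = bandwidth_gbs * utilization / read_gb
    return resident_gb, read_gb, tok_per_sec


resident, read, tps = moe_estimate(235, 22, 4.83, 1000)
print(f"resident: {resident:.1f} GB, read per token: {read:.1f} GB, ~{tps:.0f} tok/s")

# A dense 235B model would read all of its weights for every token.
_, _, dense_tps = moe_estimate(235, 235, 4.83, 1000)
print(f"MoE is ~{tps / dense_tps:.1f}x faster than a dense model of the same size")
```

The speedup over a dense model of the same total size is just the ratio of total to active parameters, about 10.7× for 235B total and 22B active.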

Rules of thumb

  • ~8 GB — a 3–4B model at Q4_K_M, or a 7–8B tight.
  • ~16 GB — a 7–8B comfortably (room for context), or a 14B tight.
  • ~24 GB — a 14B with long context, or a 32B at Q4_K_M.
  • ~48 GB — a 70B at Q4_K_M.
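These tiers can be sanity-checked by computing how much VRAM is left for the KV cache after weights and overhead. The sketch below assumes Q4_K_M at ~4.83 bits/weight and 0.8 GB of framework overhead; the helper name `room_for_kv_gb` is hypothetical.

```python
Q4_K_M_BITS = 4.83   # approximate bits/weight quoted earlier
OVERHEAD_GB = 0.8    # assumed framework overhead


def room_for_kv_gb(vram_gb: float, params_billions: float) -> float:
    """VRAM left for the KV cache after Q4_K_M weights and overhead."""
    weights_gb = params_billions * Q4_K_M_BITS / 8
    return vram_gb - weights_gb - OVERHEAD_GB


# (card size in GB, model size in billions of parameters) pairs from the list above
pairs = [(8, 4), (8, 8), (16, 8), (16, 14), (24, 14), (24, 32), (48, 70)]
for vram_gb, params_b in pairs:
    left = room_for_kv_gb(vram_gb, params_b)
    print(f"{vram_gb:>2} GB card, {params_b:>2}B model: {left:5.2f} GB left for KV cache")
```

The "tight" pairings are the ones that leave only a few gigabytes for context, which limits how long a prompt you can use before spilling out of VRAM.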

Want a model picked for your hardware instead of doing the math? The Stillhouse lists the best open models by RAM tier and task.

Results are estimates (±10–20%). Actual usage depends on your runtime, flash-attention, batch size, and whether layers are offloaded to CPU.