VRAM Estimator

Model Specs

Model Preset

Quantization Bit Rate

Standard GGUF compression choice

Context length (tokens)

Batch Size

Understanding VRAM requirements

The total VRAM needed to run a local LLM comes from three components:

Model weights: The bulk of memory usage. Calculated as parameters × bytes_per_weight
KV cache: Memory for storing key-value attention pairs for the context window. Scales with context length × batch size
Overhead: CUDA runtime, activations, optimizer states (if training). Typically 10-15% extra

Quantization formats explained

FP16/BF16: Full inference precision. Best quality, highest VRAM. Standard for API providers.
Q8 (INT8): 8-bit quantization. Nearly no quality loss, 2x smaller than FP16.
Q6_K: 6-bit with k-means quantization. Excellent quality-to-size ratio.
Q5_K_M: 5-bit with mixed precision. Sweet spot for most use cases.
Q4_K_M: Most popular. 4-bit with mixed precision. Noticeable but acceptable quality trade-off.
Q2_K: Heavily quantized. Use only when VRAM is extremely constrained.

Apple Silicon note

Mac M1/M2/M3 chips use unified memory — the same pool serves both CPU and GPU. This means you can effectively use all your system RAM as "VRAM" when running with Metal/MPS backends via Ollama or llama.cpp. A MacBook Pro with 36GB RAM can run a 30B model at Q4 quantization without issue.

Related tools

LLM Cost Calculator — Compare API costs vs running locally
AI Token Calculator — Count tokens for any model

Model Specs

GPU Profile Compatibility

Understanding VRAM requirements

Quantization formats explained

Apple Silicon note

Related tools

Related reading