Free Tool

VRAM Estimator

Know before you download. Estimate GPU memory requirements for running local LLMs at different quantization levels.

Model Specs

Standard GGUF compression choice

Weights footprint

3.73GB

KV Cache memory

0.07GB

Total VRAM Required

4.99GB

GPU Profile Compatibility

Estimation for running Llama 3 8B with Q4_K_M (4-bit)

RTX 4060 8GBConsumer Desktop
8GB
62%
RTX 3060 12GBConsumer Desktop
12GB
42%
RTX 4080 16GBConsumer Desktop
16GB
31%
RTX 3090 / 4090 24GBWorkstation Desktop
24GB
21%
Mac Studio M1/M2 Max (Unified)Apple Unified Memory
32GB
16%
A10G 24GB (Cloud Inference)Cloud Accelerator
24GB
21%
L4 24GB (Cloud Inference)Cloud Accelerator
24GB
21%
A100 40GB (Cloud Server)Cloud Dedicated
40GB
12%
Mac Studio M1/M2 Ultra (Unified)Apple Unified Memory
64GB
8%
A100 80GB (Cloud Server)Cloud Dedicated
80GB
6%
H100 80GB (Cloud Server)Cloud Dedicated
80GB
6%
Mac Studio M3 Max (Unified)Apple Unified Memory
128GB
4%

Estimates calculate baseline runtime allocations and static KV bounds. Actual GPU memory behaviors vary based on PyTorch frameworks, context scaling, and compilation parameters.

Understanding VRAM requirements

The total VRAM needed to run a local LLM comes from three components:

  • Model weights: The bulk of memory usage. Calculated as parameters × bytes_per_weight
  • KV cache: Memory for storing key-value attention pairs for the context window. Scales with context length × batch size
  • Overhead: CUDA runtime, activations, optimizer states (if training). Typically 10-15% extra

Quantization formats explained

  • FP16/BF16: Full inference precision. Best quality, highest VRAM. Standard for API providers.
  • Q8 (INT8): 8-bit quantization. Nearly no quality loss, 2x smaller than FP16.
  • Q6_K: 6-bit with k-means quantization. Excellent quality-to-size ratio.
  • Q5_K_M: 5-bit with mixed precision. Sweet spot for most use cases.
  • Q4_K_M: Most popular. 4-bit with mixed precision. Noticeable but acceptable quality trade-off.
  • Q2_K: Heavily quantized. Use only when VRAM is extremely constrained.

Apple Silicon note

Mac M1/M2/M3 chips use unified memory — the same pool serves both CPU and GPU. This means you can effectively use all your system RAM as "VRAM" when running with Metal/MPS backends via Ollama or llama.cpp. A MacBook Pro with 36GB RAM can run a 30B model at Q4 quantization without issue.

Related tools

Related reading