Running Local LLMs on macOS with Ollama: A Complete Setup Guide

Step-by-step guide to running Llama 3/4, Mistral, Phi-3/4, and Gemma 3 locally on macOS using Ollama. Covers Apple Silicon performance, model selection, and API integration.

Updated May 31, 2026 · NextReach Studio
local-llm · ollama · macos

Running language models locally used to mean a Linux tower with an Nvidia GPU and a weekend of configuration. On Apple Silicon Macs, it’s now a legitimate option for daily development work — with real throughput, offline operation, and zero per-token cost. This guide covers the actual setup, the tradeoffs, and how to integrate local models into development workflows without friction.

Why Local? The Honest Case

Local LLMs make sense in specific situations:

  • Offline development — planes, trains, unreliable hotel WiFi
  • Sensitive codebases — you don’t want proprietary code in someone else’s logs
  • High-volume tasks — batch processing where cloud API costs add up fast
  • Experimentation — testing prompt behavior without a billing anxiety loop

They don’t make sense as a primary tool if you need state-of-the-art reasoning on hard problems. Llama 3.1 8B is genuinely good for many tasks. It’s not Claude Opus or GPT-4o. Know what you’re trading before you commit to a workflow.

Hardware Baseline

Apple Silicon’s unified memory architecture makes it unusually good for running LLMs. The GPU and CPU share the same memory pool, so large models can use most of the system RAM rather than being constrained by a separate pool of VRAM.

Practical minimums:

| RAM | What runs well |
|-----|---------------|
| 8GB | Phi-3 Mini (3.8B), Phi-4 Mini (3.8B), Llama 3.2 3B |
| 16GB | Llama 3.1 8B, Llama 4 8B, Mistral 7B, Gemma 2 9B, Gemma 3 12B |
| 32GB | Llama 3.1 70B (quantized), Llama 4 17B (quantized), Mixtral 8x7B |
| 64GB | Llama 3.1 70B (full), Llama 4 17B (full), most 70B models |

M1/M2/M3/M4 chips all run Ollama well. The Max and Ultra variants offer particularly strong inference performance thanks to their higher memory bandwidth. For most workloads, the amount of RAM matters more than the chip generation.

Installing Ollama

The simplest installation is via Homebrew:

brew install ollama

Or download the macOS app directly from ollama.com. The app version installs a menu bar icon and manages the server automatically. The Homebrew version gives you more control over when the server runs.

Start the server manually (Homebrew install):

ollama serve

Verify it’s running:

curl http://localhost:11434/api/tags

You should get a JSON response with an empty models array if you haven’t pulled anything yet.
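The same check can be scripted, which is handy at the top of a tool that depends on the server. This is a minimal sketch using only the standard library; the helper names model_names and list_models are illustrative and not part of Ollama, and the URL is the default address used in this guide.

```python
import json
import urllib.request
from typing import Optional

OLLAMA_URL = "http://localhost:11434"  # default address assumed throughout this guide


def model_names(tags_payload: dict) -> list[str]:
    """Extract model names from a parsed /api/tags response."""
    return [m["name"] for m in tags_payload.get("models", [])]


def list_models(base_url: str = OLLAMA_URL, timeout: float = 2.0) -> Optional[list[str]]:
    """Return installed model names, or None if the server cannot be reached."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return model_names(json.load(resp))
    except (OSError, ValueError):
        # OSError covers refused connections and timeouts; ValueError covers bad JSON.
        return None
```

Calling list_models() returns an empty list on a fresh install and None when nothing is listening on the port, so a script can tell "no models yet" apart from "server not started".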

Pulling Models

Models are pulled by name. Ollama’s library handles versioning and quantization:

# Llama 3.1 8B — good all-around choice for 16GB systems
ollama pull llama3.1

# Llama 4 8B — latest generation, improved reasoning and context handling
ollama pull llama4

# Phi-3 Mini — excellent for 8GB systems, surprisingly capable
ollama pull phi3

# Phi-4 14B — Microsoft's latest, strong reasoning at a compact size
ollama pull phi4

# Mistral 7B — strong at code and structured output
ollama pull mistral

# Gemma 2 9B — Google's model, good instruction following
ollama pull gemma2

# Gemma 3 12B — Google's latest, improved multilingual and long-context performance
ollama pull gemma3

# Code Llama — specifically tuned for code generation (note: superseded by CodeGemma and DeepSeek Coder)
ollama pull codellama

Quantization levels matter. By default, Ollama pulls a Q4 quantized version (4-bit), which balances quality and memory usage. You can pull specific quantizations:

# Q8 quantization — better quality, needs more RAM
ollama pull llama3.1:8b-instruct-q8_0

# Q4_K_M: a k-quant variant, usually slightly better than the older Q4_0 format
ollama pull llama3.1:8b-instruct-q4_K_M

For most development tasks, the default Q4 is fine. The quality difference between Q4 and Q8 is noticeable for nuanced reasoning but negligible for code generation and structured extraction.
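To see what a quantization level means for memory, the weight size can be estimated as parameter count times bits per weight. The sketch below is a rule of thumb, not Ollama's own accounting: the bits-per-weight values are approximations (Q4_0 and Q8_0 follow from their block layout, the Q4_K_M figure is a rough average), and weights_gb is a made-up helper name.

```python
# Approximate bits stored per weight, including per-block scale metadata.
# These values are assumptions for illustration, not figures reported by Ollama.
BITS_PER_WEIGHT = {"q4_0": 4.5, "q4_K_M": 4.85, "q8_0": 8.5, "f16": 16.0}


def weights_gb(params_billions: float, quant: str) -> float:
    """Estimate the size of the model weights in gigabytes (10^9 bytes)."""
    return params_billions * BITS_PER_WEIGHT[quant] / 8


for quant in BITS_PER_WEIGHT:
    print(f"8B model at {quant}: ~{weights_gb(8, quant):.1f} GB of weights")
```

Leave headroom on top of this estimate for the context cache, the runtime, and macOS itself.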

Running Models

Quick test from the terminal:

ollama run llama3.1 "Write a Python function to validate an email address"

Interactive chat mode:

ollama run llama3.1
>>> What's the difference between a mutex and a semaphore?

Exit with /bye or Ctrl+D.

The REST API

This is where local LLMs become genuinely useful in workflows. Ollama serves a native REST API on port 11434:

curl http://localhost:11434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1",
    "messages": [
      {"role": "user", "content": "Explain Rust ownership in 3 sentences"}
    ],
    "stream": false
  }'

The OpenAI-compatible endpoint works with most existing tooling:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ollama" \
  -d '{
    "model": "llama3.1",
    "messages": [
      {"role": "user", "content": "Explain Rust ownership in 3 sentences"}
    ]
  }'
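The native /api/chat endpoint streams by default when "stream" is left out: the reply arrives as one JSON object per line, each carrying a fragment of the message, with "done": true on the last one. A minimal standard-library sketch of reading that stream follows; collect_stream and chat are illustrative names, and the model and URL are the ones assumed elsewhere in this guide.

```python
import json
import urllib.request
from typing import Iterable


def collect_stream(lines: Iterable[bytes]) -> str:
    """Join the content fragments from a streamed /api/chat response."""
    parts = []
    for raw in lines:
        if not raw.strip():
            continue  # skip blank lines, if any
        chunk = json.loads(raw)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break
    return "".join(parts)


def chat(prompt: str, model: str = "llama3.1",
         base_url: str = "http://localhost:11434") -> str:
    """Send one user message to /api/chat and return the full streamed reply."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/api/chat",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return collect_stream(response)  # the response object iterates line by line
```

Keeping the parsing in its own function means it can be exercised with canned lines, without a running server.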

Python Integration

Using the OpenAI Python client with Ollama requires only changing the base URL:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required but ignored by Ollama
)

response = client.chat.completions.create(
    model="llama3.1",
    messages=[
        {"role": "system", "content": "You are a code reviewer. Be concise."},
        {"role": "user", "content": "Review this Python function for issues:\n\n```python\ndef divide(a, b):\n    return a / b\n```"},
    ],
    temperature=0.2,
)

print(response.choices[0].message.content)

For streaming:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",
)

stream = client.chat.completions.create(
    model="mistral",
    messages=[{"role": "user", "content": "Write a bash script to backup a directory"}],
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)

TypeScript / Node Integration

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:11434/v1",
  apiKey: "ollama",
});

async function reviewCode(code: string): Promise<string> {
  const response = await client.chat.completions.create({
    model: "llama3.1",
    messages: [
      {
        role: "system",
        content: "Review code for bugs and security issues. Be specific and brief.",
      },
      {
        role: "user",
        content: `Review this code:\n\n\`\`\`\n${code}\n\`\`\``,
      },
    ],
    temperature: 0.1, // Low temperature for consistent code review
  });

  return response.choices[0].message.content ?? "";
}

// Usage
const feedback = await reviewCode(`
function parseUserInput(input) {
  return eval(input); // process user formula
}
`);

console.log(feedback);

Building a Local Embedding Pipeline

Ollama also serves embedding models, which you can use for semantic search and RAG workflows without any cloud API:

ollama pull nomic-embed-text

Then call the embeddings endpoint from Python:

import httpx
import numpy as np

def embed(text: str, model: str = "nomic-embed-text") -> list[float]:
    response = httpx.post(
        "http://localhost:11434/api/embeddings",
        json={"model": model, "prompt": text},
    )
    return response.json()["embedding"]

def cosine_similarity(a: list[float], b: list[float]) -> float:
    a_arr = np.array(a)
    b_arr = np.array(b)
    return float(np.dot(a_arr, b_arr) / (np.linalg.norm(a_arr) * np.linalg.norm(b_arr)))

# Simple semantic search example
documents = [
    "Drizzle ORM works with PostgreSQL and SQLite",
    "Next.js supports both App Router and Pages Router",
    "Rust's borrow checker prevents memory safety issues",
    "TypeScript adds static typing to JavaScript",
]

query = "type safety in JavaScript"
query_embedding = embed(query)
doc_embeddings = [embed(doc) for doc in documents]

scores = [cosine_similarity(query_embedding, doc_emb) for doc_emb in doc_embeddings]
ranked = sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)

for doc, score in ranked:
    print(f"{score:.3f}  {doc}")
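For more than a handful of documents, it usually pays to embed and normalize them once, then score each query with a plain dot product. The sketch below shows that idea in pure Python under the assumption that the vectors have already been obtained (for example from an embedding call like the one above); normalize, build_index, and top_k are illustrative names.

```python
import math


def normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def build_index(docs: list[str], vectors: list[list[float]]) -> list[tuple[str, list[float]]]:
    """Pair each document with its unit-length embedding (computed once)."""
    return [(doc, normalize(vec)) for doc, vec in zip(docs, vectors)]


def top_k(query_vec: list[float], index: list[tuple[str, list[float]]],
          k: int = 3) -> list[tuple[float, str]]:
    """Return the k best (score, document) pairs, highest score first."""
    q = normalize(query_vec)
    scored = [(sum(a * b for a, b in zip(q, vec)), doc) for doc, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

The index can be saved to disk between runs, so only the query needs a fresh embedding request.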

Performance Tuning

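Ollama uses Metal GPU acceleration on Apple Silicon automatically, so there is nothing to switch on. The setting most worth tuning is the context window, which trades memory for how much text the model can attend to:

# Set the default context window for the server (uses more memory)
OLLAMA_CONTEXT_LENGTH=8192 ollama serve

# Or set it for one interactive session
ollama run llama3.1
>>> /set parameter num_ctx 8192

# Reduce context if RAM is limited
>>> /set parameter num_ctx 2048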
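The reason a larger context costs memory is the key/value cache, which grows linearly with the number of tokens. A rough sketch of the arithmetic, assuming a 16-bit cache and Llama 3.1 8B's published architecture (32 layers, 8 KV heads, head size 128); kv_cache_bytes is an illustrative helper, not something Ollama exposes.

```python
def kv_cache_bytes(num_ctx: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for num_ctx tokens."""
    # Factor 2: one key vector and one value vector per layer, per KV head.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * num_ctx


# Assumed model shape: Llama 3.1 8B with a 16-bit (2-byte) cache.
for ctx in (2048, 8192, 32768):
    gib = kv_cache_bytes(ctx, n_layers=32, n_kv_heads=8, head_dim=128) / 2**30
    print(f"num_ctx={ctx:>6}: ~{gib:.2f} GiB of KV cache")
```

Under these assumptions the cache costs about 128 KiB per token, so going from 2048 to 8192 tokens adds roughly 0.75 GiB on top of the weights.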

Modelfile for custom system prompts:

# Save as Modelfile
FROM llama3.1

SYSTEM """
You are a senior TypeScript developer reviewing pull requests.
Always check for: type safety issues, missing error handling, and performance problems.
Be concise. Use bullet points. If code is fine, say so.
"""

PARAMETER temperature 0.1
PARAMETER num_ctx 4096

Build and run the custom model:

ollama create code-reviewer -f Modelfile
ollama run code-reviewer

This creates a persistent custom model you can call by name.
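When a setting should apply to a single call rather than be baked into a model, the native API accepts the same parameters in an "options" object on the request. A small sketch of building such a request body; chat_payload is a hypothetical helper, and the option names shown are the ones already used in the Modelfile above.

```python
import json
from typing import Optional


def chat_payload(model: str, prompt: str, system: Optional[str] = None,
                 **options) -> bytes:
    """Build a JSON body for POST /api/chat with per-request model options."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})
    body = {"model": model, "messages": messages, "stream": False}
    if options:
        body["options"] = options  # for example temperature or num_ctx
    return json.dumps(body).encode("utf-8")


payload = chat_payload(
    "llama3.1",
    "Review this diff for missing error handling.",
    system="You are a senior TypeScript developer. Be concise.",
    temperature=0.1,
    num_ctx=4096,
)
```

The resulting bytes can be posted to http://localhost:11434/api/chat with any HTTP client. A Modelfile is the better fit when the same prompt and parameters are reused across tools.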

Typical Throughput on Apple Silicon

Approximate tokens-per-second on different chips (Q4 quantized, 2048 context):

| Model | M1 Pro 16GB | M2 Max 32GB | M3 Max 64GB | M4 Max 64GB |
|-------|-------------|-------------|-------------|-------------|
| Phi-3 Mini 3.8B | ~35 t/s | ~55 t/s | ~70 t/s | ~90 t/s |
| Llama 3.1 8B | ~18 t/s | ~35 t/s | ~45 t/s | ~60 t/s |
| Llama 4 8B | ~15 t/s | ~30 t/s | ~40 t/s | ~55 t/s |
| Mistral 7B | ~20 t/s | ~38 t/s | ~48 t/s | ~62 t/s |
| Llama 3.1 70B | n/a | ~8 t/s | ~15 t/s | ~22 t/s |

These are estimates — actual performance varies with context length and concurrent load. 15–20 t/s for an 8B model is fast enough for interactive use. Below ~8 t/s starts to feel slow for chat but is fine for batch processing.
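A useful sanity check on numbers like these: token generation is largely limited by memory bandwidth, because each new token requires reading roughly all of the weights once. That gives a simple ceiling of bandwidth divided by model size. The sketch below is a back-of-envelope bound, not a benchmark; the bandwidth values are Apple's published figures for those chips, the 4.9 GB model size is an assumed value for an 8B model at 4-bit quantization, and max_tokens_per_second is an illustrative name.

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on decode speed if every token reads all weights once."""
    return bandwidth_gb_s / model_gb


# Published memory bandwidth in GB/s; model size is an assumption for illustration.
CHIP_BANDWIDTH = {"M1 Pro": 200, "M2 Max": 400}

for chip, bandwidth in CHIP_BANDWIDTH.items():
    ceiling = max_tokens_per_second(bandwidth, model_gb=4.9)
    print(f"{chip}: at most ~{ceiling:.0f} t/s for a 4.9 GB model")
```

Real throughput lands well below the ceiling, but the ratio explains why a chip with twice the bandwidth tends to generate about twice as fast.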

Integrating with Your Dev Environment

Shell function for quick queries:

# Add to ~/.zshrc
llm() {
  local prompt="$*"
  # Build the payload with jq so quotes and newlines in the prompt are escaped
  jq -n --arg p "$prompt" \
    '{model: "llama3.1", messages: [{role: "user", content: $p}], stream: false}' \
    | curl -s http://localhost:11434/api/chat \
        -H "Content-Type: application/json" \
        -d @- \
    | jq -r '.message.content'
}

# Usage
llm "What's the bash syntax for a for loop over an array?"

Using with jq for structured output:

# Ask for JSON, parse with jq
curl -s http://localhost:11434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral",
    "messages": [{
      "role": "user", 
      "content": "Return a JSON object with 3 fields: language, useCase, example. Language: TypeScript"
    }],
    "stream": false,
    "format": "json"
  }' | jq '.message.content | fromjson'

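The "format": "json" parameter constrains the model's output to valid JSON, which is useful for pipelines. Keep the request for JSON in the prompt as well: the parameter controls the syntax, not which fields the model chooses to return.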
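Valid JSON is not the same as the JSON you asked for, so a pipeline should still check the fields before using them. A minimal sketch of that check, assuming the non-streaming /api/chat response shape used above; parse_structured is an illustrative helper name.

```python
import json


def parse_structured(response: dict, required: set[str]) -> dict:
    """Decode the JSON string in an /api/chat response and check its keys."""
    content = response["message"]["content"]
    data = json.loads(content)  # raises json.JSONDecodeError on malformed output
    if not isinstance(data, dict):
        raise ValueError(f"expected a JSON object, got {type(data).__name__}")
    missing = required - set(data)
    if missing:
        raise ValueError(f"model output is missing fields: {sorted(missing)}")
    return data
```

On a ValueError, a common choice is to retry the request once with the missing field names repeated in the prompt.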

Model Selection Guide

For most development tasks, start with one of these:

  • General coding + chat: llama4 (8B) or llama3.1 (8B)
  • Code generation specifically: deepseek-coder or codellama (note: codellama is increasingly superseded by CodeGemma and DeepSeek Coder)
  • Low RAM / fast responses: phi4, phi3, or llama3.2:3b
  • Structured output / JSON: mistral (good instruction following) or gemma3
  • Long documents: llama3.1 with num_ctx raised to 8192, or gemma3 with its extended context support

The LLM Cost Calculator on this site doesn’t apply to local models (there’s no per-token cost), but the Context Window Calculator helps you understand how much of a document you can fit in the model’s context at once.

A Note on Model Updates

Ollama doesn’t auto-update models. Run this to refresh to the latest version of a model:

ollama pull llama3.1

Models are stored at ~/.ollama/models. If disk space is tight:

# List models with sizes
ollama list

# Remove a model
ollama rm codellama
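The /api/tags endpoint used earlier to verify the server also reports each model's size in bytes, which makes it easy to find what is worth removing. A short sketch that ranks models from a parsed response; largest_models is an illustrative name, and the sample payload is made up.

```python
def largest_models(tags_payload: dict, top: int = 5) -> list[tuple[str, float]]:
    """Return (name, size in GB) for the largest installed models."""
    sized = [(m["name"], m["size"] / 1e9) for m in tags_payload.get("models", [])]
    return sorted(sized, key=lambda item: item[1], reverse=True)[:top]


# Made-up payload in the shape /api/tags returns.
sample = {"models": [
    {"name": "phi3:latest", "size": 2_200_000_000},
    {"name": "llama3.1:latest", "size": 4_900_000_000},
]}
for name, gb in largest_models(sample):
    print(f"{gb:6.1f} GB  {name}")
```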

What Local Can’t Replace (Yet)

Be clear-eyed about the limits. Local 8B models consistently underperform frontier models at:

  • Multi-step reasoning over long contexts
  • Complex refactoring that requires understanding architecture-level tradeoffs
  • Writing comprehensive tests that anticipate edge cases in complex business logic
  • Anything requiring broad, up-to-date knowledge (they’re frozen at training time)

The right mental model: use local LLMs the way you’d use a capable junior developer. They can handle clearly scoped tasks well. They need more guidance on ambiguous problems, and you should review their output carefully.

For the tasks where that’s fine — and there are many — running local cuts API costs to zero and keeps sensitive code off external servers entirely.

May 2026 Update: New Model Landscape

Since the original publication, the local LLM landscape has evolved significantly. Ollama now supports the latest model architectures including Llama 4 (Meta’s multimodal generation), Phi-4 (Microsoft’s compact reasoning model), and Gemma 3 (Google’s improved multilingual model). These newer models offer substantially better reasoning at comparable RAM requirements, making local inference more viable than ever for development workflows.

Ollama also added support for newer GGUF format variants, safetensors-based model loading, and improved Metal GPU acceleration on M4 Macs. Run ollama --version and update to the latest release to take advantage of these improvements.