AI API Cost Optimization: Reduce Your Bill by 60% Without Sacrificing Quality

If you’re calling LLM APIs at any meaningful scale, the bill lands fast. A team running Claude Sonnet for code review, documentation, and test generation across a 10-engineer org can easily hit $2,000–5,000/month without realizing it. Most of that cost is preventable.

This is not a “use a cheaper model” guide. That advice is obvious and often wrong — swapping Claude Sonnet for GPT-4o mini on complex tasks causes quality degradation that creates downstream engineering debt. Instead, this guide covers architectural and prompt-level techniques that reduce costs without meaningful quality loss. Real numbers included.

May 2026 update: Prompt caching is now available across all major providers (Anthropic, OpenAI, Google, Mistral), making it the single highest-leverage optimization. The routing strategies have also been updated to account for Claude 4 Sonnet, GPT-5, and Gemini 2.5 Flash — all of which shift the cost-quality tradeoffs.

First: Know Where Your Money Actually Goes

Before optimizing, measure. Most teams discover their cost distribution looks nothing like they assumed.

A typical developer tooling API bill breaks down like this:

| Usage Pattern | % of Calls | % of Cost |
|---|---|---|
| Large context reads (>50K tokens) | 8% | 42% |
| Standard feature implementation | 35% | 30% |
| Test and docs generation | 40% | 18% |
| Simple completions / autocomplete | 17% | 10% |

The insight: 8% of your calls eat 42% of your budget. Optimizing large-context patterns is where you’ll find the biggest wins.
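
One way to get this breakdown from your own traffic is to log token counts per call and aggregate them by usage pattern. The sketch below assumes a hypothetical log format (dicts with `pattern`, `model`, `input_tokens`, `output_tokens`) and a hand-maintained price table. Neither comes from a provider SDK, and the prices are placeholders to replace with your actual rates.

```python
from collections import defaultdict

# Assumed USD prices per 1M tokens as (input, output); replace with your real rates.
PRICES = {
    "sonnet": (3.00, 15.00),
    "haiku": (0.80, 4.00),
}

def call_cost(record: dict) -> float:
    """Cost in USD of one logged call."""
    input_price, output_price = PRICES[record["model"]]
    return (record["input_tokens"] * input_price
            + record["output_tokens"] * output_price) / 1_000_000

def cost_breakdown(records: list[dict]) -> dict[str, dict[str, float]]:
    """Return each pattern's share of calls and share of cost, in percent."""
    calls = defaultdict(int)
    cost = defaultdict(float)
    for record in records:
        calls[record["pattern"]] += 1
        cost[record["pattern"]] += call_cost(record)

    total_calls = sum(calls.values())
    total_cost = sum(cost.values())
    return {
        pattern: {
            "pct_calls": 100 * calls[pattern] / total_calls,
            "pct_cost": 100 * cost[pattern] / total_cost,
        }
        for pattern in calls
    }

# Hypothetical log: one large-context call and three small ones
log = [
    {"pattern": "large_context", "model": "sonnet", "input_tokens": 60_000, "output_tokens": 1_000},
    {"pattern": "autocomplete", "model": "haiku", "input_tokens": 500, "output_tokens": 50},
    {"pattern": "autocomplete", "model": "haiku", "input_tokens": 500, "output_tokens": 50},
    {"pattern": "autocomplete", "model": "haiku", "input_tokens": 500, "output_tokens": 50},
]
for pattern, shares in cost_breakdown(log).items():
    print(f"{pattern}: {shares['pct_calls']:.0f}% of calls, {shares['pct_cost']:.1f}% of cost")
```

Even in this toy log the single large-context call accounts for almost all of the spend, which is the same shape as the table above.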

Use the LLM Cost Calculator and AI Token Calculator to model your current and projected costs before diving into optimization.

Technique 1: Prompt Caching

Prompt caching is the single highest-leverage optimization available in 2026. As of mid-2026 it is available from every major provider: Anthropic (since late 2024), Google (Gemini API), OpenAI (now expanded beyond its initial limited version), and Mistral AI, so you can apply it regardless of which provider you use.

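How it works: When you send the same prefix (system prompt + static context) across multiple requests, the provider caches the processed tokens after the first call. Subsequent calls with the same prefix pay a heavily discounted rate for the cached tokens and the full input price only for the new, unique tokens. The match is on an exact prefix, so static content must come first and anything that varies must come after it.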

Real savings: Anthropic charges 10% of the normal input price for cache reads and 125% for the cache writes that happen on a miss (with the default 5-minute TTL). If your system prompt is 8,000 tokens and you make 1,000 API calls/day:

Without caching:
8,000 tokens × 1,000 calls × $3.00/1M = $24.00/day

With caching (95% cache hit rate):
8,000 tokens × 50 cache writes × $3.75/1M = $1.50
8,000 tokens × 950 cache reads × $0.30/1M = $2.28
Total: $3.78/day

Savings: $20.22/day → about $607/month on system prompts alone
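
The same arithmetic as a reusable function, so you can plug in your own prefix size, call volume, and hit rate. The function name is illustrative. The `write_multiplier` parameter models the premium Anthropic charges for writes to its default 5-minute cache (1.25x the input price); pass `write_multiplier=1.0` to bill misses at the plain input price instead.

```python
def daily_prefix_cost(
    prefix_tokens: int,
    calls_per_day: int,
    hit_rate: float,
    input_price_per_m: float,
    read_multiplier: float = 0.10,   # cache reads billed at 10% of the input price
    write_multiplier: float = 1.25,  # assumed premium on cache writes (misses)
) -> float:
    """Daily USD cost of the static prefix with caching enabled."""
    hits = calls_per_day * hit_rate
    misses = calls_per_day - hits
    per_token = input_price_per_m / 1_000_000
    return prefix_tokens * per_token * (
        hits * read_multiplier + misses * write_multiplier
    )

uncached = 8_000 * 1_000 * 3.00 / 1_000_000
cached = daily_prefix_cost(8_000, 1_000, hit_rate=0.95, input_price_per_m=3.00)
print(f"uncached ${uncached:.2f}/day, cached ${cached:.2f}/day")
```

The hit rate is the input to watch: it depends on how often calls arrive within the cache TTL, so measure it from the `usage` fields of real responses rather than assuming it.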

Implementation with Anthropic Python SDK

import anthropic

client = anthropic.Anthropic()

# Cache the static parts of your prompt
response = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=4096,
    system=[
        {
            "type": "text",
            "text": "You are an expert code reviewer specializing in Python...",
            "cache_control": {"type": "ephemeral"}  # Cache this block
        },
        {
            "type": "text",
            # Large static context (e.g., coding standards doc, codebase conventions)
            "text": STATIC_CODEBASE_CONTEXT,  # Could be 10K+ tokens
            "cache_control": {"type": "ephemeral"}  # Cache this too
        }
    ],
    messages=[
        {
            "role": "user",
            "content": f"Review this PR: {dynamic_pr_content}"  # Only this varies
        }
    ]
)

# Check cache performance
print(f"Cache creation tokens: {response.usage.cache_creation_input_tokens}")
print(f"Cache read tokens: {response.usage.cache_read_input_tokens}")
print(f"Uncached input tokens: {response.usage.input_tokens}")

What to Cache

| Content Type | Cache? | Reason |
|---|---|---|
| System prompt | Always | Highest reuse rate |
| Coding standards / style guides | Yes | Rarely changes |
| Codebase architecture docs | Yes | Changes weekly at most |
| The actual code being reviewed | No | Always unique |
| User message | No | Always unique |

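Cache TTL: Anthropic’s ephemeral cache lasts 5 minutes by default, and the timer resets each time the cached prefix is read, so a continuous workload keeps the cache warm on its own. For bursty workloads, either re-warm the cache before it expires or opt into the 1-hour TTL, which costs more per cache write. Prefixes shorter than the model’s minimum cacheable length (1,024 tokens for Claude Sonnet models in Anthropic’s documentation) are not cached at all.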
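
Whether caching pays off at all depends on how many calls reuse a prefix before it expires, because the first call pays a write premium. A rough break-even sketch, assuming Anthropic’s multipliers for the 5-minute cache (1.25x the input price to write, 0.10x to read). The function name is illustrative.

```python
def relative_prefix_cost(calls_in_window: int,
                         write_multiplier: float = 1.25,
                         read_multiplier: float = 0.10) -> float:
    """Cost of a cached prefix relative to sending it uncached every time.

    Assumes the first call in the window writes the cache and every later
    call reads it before the TTL runs out. Values below 1.0 mean caching wins.
    """
    if calls_in_window < 1:
        raise ValueError("calls_in_window must be at least 1")
    cached = write_multiplier + read_multiplier * (calls_in_window - 1)
    uncached = float(calls_in_window)
    return cached / uncached

for n in (1, 2, 10):
    print(n, round(relative_prefix_cost(n), 3))
```

Under these assumptions a prefix used only once per window costs more with caching than without, and from the second call onward caching is cheaper.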

Technique 2: Model Routing

Model routing means sending different task types to different models based on a complexity classifier. It’s the most impactful architectural change for high-volume applications.

The principle: Not every task needs Claude Sonnet. Test generation, docstring writing, and simple completions can run on Haiku or Gemini Flash at a fraction of the cost: roughly 4x cheaper per input token for Haiku and 40x for Flash at the prices used below.

Build a Simple Classifier

from enum import Enum

class TaskComplexity(Enum):
    SIMPLE = "simple"
    MEDIUM = "medium"  
    COMPLEX = "complex"

def classify_task(prompt: str, context_tokens: int) -> TaskComplexity:
    # Large context always bumps to complex
    if context_tokens > 40_000:
        return TaskComplexity.COMPLEX
    
    # Keywords suggesting complexity
    complex_keywords = [
        "debug", "root cause", "race condition", "architecture",
        "design", "refactor", "security", "performance bottleneck",
        "concurrency", "memory leak"
    ]
    
    simple_keywords = [
        "docstring", "comment", "rename", "format", "test for",
        "generate unit test", "explain this function", "add type hints"
    ]
    
    prompt_lower = prompt.lower()
    
    if any(kw in prompt_lower for kw in complex_keywords):
        return TaskComplexity.COMPLEX
    elif any(kw in prompt_lower for kw in simple_keywords):
        return TaskComplexity.SIMPLE
    else:
        return TaskComplexity.MEDIUM

def get_model(complexity: TaskComplexity) -> str:
    return {
        TaskComplexity.SIMPLE: "gemini-2.0-flash",        # $0.075/1M input
        TaskComplexity.MEDIUM: "claude-3-5-haiku-20241022",  # $0.80/1M input
        TaskComplexity.COMPLEX: "claude-sonnet-4-20250514"   # $3.00/1M input
    }[complexity]

Real Cost Comparison With vs. Without Routing

Assume a team making 10,000 API calls/day with average 2,000 input + 800 output tokens:

Without routing (all Claude Sonnet):

Input: 10,000 × 2,000 × $3.00/1M = $60.00/day
Output: 10,000 × 800 × $15.00/1M = $120.00/day
Total: $180/day → $5,400/month

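With routing (40% simple, 40% medium, 20% complex), pricing input and output tokens separately at each tier’s rates:

Simple (Flash):   4,000 calls × (2,000 × $0.075 + 800 × $0.30)/1M = $1.56/day
Medium (Haiku):   4,000 calls × (2,000 × $0.80 + 800 × $4.00)/1M = $19.20/day
Complex (Sonnet): 2,000 calls × (2,000 × $3.00 + 800 × $15.00)/1M = $36.00/day
Total: $56.76/day → $1,703/month

That’s a 68% reduction just from routing, with better quality on complex tasks than you’d get running everything through a single mid-tier model. Treat it as an upper bound: the example assumes every call carries the same number of tokens, while in real workloads the complex calls carry far more context than the simple ones, so realized savings are usually well below this figure. Refine the classifier over time to close the gap.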
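
To redo this comparison with your own traffic mix, here is a small helper that prices input and output tokens separately for each tier. The `PRICES` table and both function names are assumptions for illustration; check the rates against your providers’ current price lists.

```python
# USD per 1M tokens as (input, output). Assumed list prices; check current rates.
PRICES = {
    "flash": (0.075, 0.30),
    "haiku": (0.80, 4.00),
    "sonnet": (3.00, 15.00),
}

def daily_cost(calls: int, input_tokens: int, output_tokens: int, model: str) -> float:
    """Daily USD cost for `calls` requests of a fixed size on one model."""
    input_price, output_price = PRICES[model]
    per_call = (input_tokens * input_price + output_tokens * output_price) / 1_000_000
    return calls * per_call

def routed_cost(total_calls: int, mix: dict[str, float],
                input_tokens: int, output_tokens: int) -> float:
    """Daily cost when `mix` maps each model to its share of calls."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("mix shares must sum to 1")
    return sum(
        daily_cost(round(total_calls * share), input_tokens, output_tokens, model)
        for model, share in mix.items()
    )

baseline = daily_cost(10_000, 2_000, 800, "sonnet")
routed = routed_cost(10_000, {"flash": 0.4, "haiku": 0.4, "sonnet": 0.2}, 2_000, 800)
print(f"baseline ${baseline:.2f}/day, routed ${routed:.2f}/day, "
      f"saving {100 * (1 - routed / baseline):.0f}%")
```

Because the helper uses one request size for every tier, it overstates savings whenever complex calls are larger than simple ones. Feeding it per-tier token averages from your logs gives a more honest estimate.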

Technique 3: Aggressive Context Trimming

Output tokens cost about 4–5x more than input tokens at the prices quoted in this guide, but input is where most of the volume is, and overstuffed input context is both expensive and often bad for quality.

The most common waste: sending the entire conversation history on every turn when only the last 2–3 exchanges are relevant.

Sliding Window Context

from dataclasses import dataclass
from typing import List
import tiktoken

@dataclass
class Message:
    role: str
    content: str

def trim_context(
    messages: List[Message],
    system_prompt: str,
    max_input_tokens: int = 8_000,
    model: str = "gpt-4o"
) -> List[Message]:
    """
    Keep only recent messages that fit within token budget.
    Always preserves the first user message (task context).
    """
    enc = tiktoken.encoding_for_model(model)
    
    system_tokens = len(enc.encode(system_prompt))
    budget = max_input_tokens - system_tokens - 500  # Buffer for response
    
    # Always keep the first message
    pinned = [messages[0]] if messages else []
    pinned_tokens = sum(len(enc.encode(m.content)) for m in pinned)
    
    # Fill from recent messages backwards
    recent = []
    recent_tokens = 0
    
    for msg in reversed(messages[1:]):
        msg_tokens = len(enc.encode(msg.content))
        if recent_tokens + msg_tokens + pinned_tokens > budget:
            break
        recent.insert(0, msg)
        recent_tokens += msg_tokens
    
    return pinned + recent

Summarize Instead of Truncate

For long conversations, summarizing older turns is better than dropping them:

async def summarize_old_turns(
    messages: List[Message],
    keep_recent: int = 6
) -> List[Message]:
    """Compress old messages into a summary to preserve context cheaply."""
    if len(messages) <= keep_recent:
        return messages
    
    old = messages[:-keep_recent]
    recent = messages[-keep_recent:]
    
    old_text = "\n".join(f"{m.role}: {m.content}" for m in old)
    
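    # Use the cheapest model for summarization. `gemini_flash` is assumed to be
    # an already-configured async chat client (e.g. a LangChain chat model).
    summary_response = await gemini_flash.ainvoke(
        f"Summarize this conversation history in 200 words, preserving key decisions:\n\n{old_text}"
    )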
    
    summary_message = Message(
        role="user",
        content=f"[Previous conversation summary: {summary_response.content}]"
    )
    
    return [summary_message] + recent

Technique 4: Response Length Control

Every token you don’t generate is money you don’t spend. Most prompts that don’t constrain output length will get verbose responses.

Add these instructions to every prompt:

“Be concise. Answer in under [N] words.”
“Return only the code, no explanation.”
“List format, no prose.”

Impact of output length constraints:

| Prompt Type | Unconstrained Output | Constrained Output | Cost Reduction |
|---|---|---|---|
| Code review | ~1,200 tokens | ~400 tokens | 67% |
| Bug explanation | ~800 tokens | ~200 tokens | 75% |
| Test generation | ~600 tokens | ~450 tokens | 25% |
| Docs writing | ~900 tokens | ~600 tokens | 33% |

# Verbose (expensive)
prompt_bad = "Review this code for bugs."

# Concise (cheap)
prompt_good = """Review this code for bugs.
Output format:
- List only real bugs (not style issues)
- One line per bug: [LINE_NUMBER]: [BUG_DESCRIPTION]
- If no bugs, return: "No bugs found"
- Maximum 10 items
"""
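
Prompt instructions are a soft limit. Pairing them with the API’s `max_tokens` parameter gives a hard ceiling on what a single call can cost. A sketch under assumptions: the per-task budgets are taken from the constrained column of the table above, the 25% headroom is an arbitrary starting point, and `build_request`, `TASK_BUDGETS`, and `HEADROOM` are hypothetical names, not part of any SDK.

```python
# Hypothetical per-task output budgets (tokens); tune these for your workload.
TASK_BUDGETS = {
    "code_review": 400,
    "bug_explanation": 200,
    "test_generation": 450,
    "docs": 600,
}
HEADROOM = 1.25  # Allow 25% over the target so answers are rarely cut off

def build_request(task: str, prompt: str, model: str) -> dict:
    """Build keyword arguments for a chat call with a hard output cap."""
    budget = TASK_BUDGETS[task]
    return {
        "model": model,
        "max_tokens": int(budget * HEADROOM),
        "messages": [
            {
                "role": "user",
                "content": f"{prompt}\n\nBe concise. Return only what was asked for.",
            }
        ],
    }

request = build_request(
    "bug_explanation",
    "Explain why this loop never exits.",
    "claude-3-5-haiku-20241022",
)
print(request["max_tokens"])  # 250
```

With the Anthropic SDK the dict can be passed as `client.messages.create(**request)`. If a response comes back with `stop_reason` equal to `"max_tokens"`, the answer was truncated and the budget for that task is too tight.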

Technique 5: Batch Processing

If your use case includes non-real-time tasks (nightly doc generation, weekly code review reports, bulk test creation), use batch APIs. Anthropic and OpenAI both offer ~50% discounts for batch processing.

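import anthropic

client = anthropic.Anthropic()

# files_to_document: dict mapping file path -> source code (assumed to exist).
# custom_id must be a short string of letters, digits, "_" and "-", so raw file
# paths are not valid. Use an index and keep a lookup table back to the path.
id_to_path = {f"doc-gen-{i}": path for i, path in enumerate(files_to_document)}

# Create a batch with one request per file
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": custom_id,
            "params": {
                "model": "claude-3-5-haiku-20241022",  # Or a Sonnet model for complex docs
                "max_tokens": 1024,
                "messages": [
                    {
                        "role": "user",
                        "content": f"Generate a docstring for:\n{files_to_document[path]}"
                    }
                ]
            }
        }
        for custom_id, path in id_to_path.items()
    ]
)

print(f"Batch ID: {batch.id}")
# Results are available within 24 hours at 50% cost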

Batch vs. real-time cost:

| Task | Real-time Cost | Batch Cost | Savings |
|---|---|---|---|
| 1,000 doc generations (Haiku) | $0.80 | $0.40 | 50% |
| 500 test suites (Haiku) | $1.20 | $0.60 | 50% |
| 100 code reviews (Sonnet) | $18.00 | $9.00 | 50% |

If latency doesn’t matter for a task, always batch.
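
Submitting the batch is only half the job, since results have to be fetched once processing ends. A minimal polling sketch, assuming `client` is an `anthropic.Anthropic()` instance. The helper name `collect_batch_results` and the 60-second polling interval are illustrative choices.

```python
import time

def collect_batch_results(client, batch_id: str, poll_seconds: float = 60.0) -> dict[str, str]:
    """Wait for a Message Batch to finish and map custom_id -> generated text.

    Requests that errored, expired, or were canceled are skipped.
    """
    # Poll until the batch reports that processing has ended
    while True:
        batch = client.messages.batches.retrieve(batch_id)
        if batch.processing_status == "ended":
            break
        time.sleep(poll_seconds)

    # Stream the per-request results and keep only the successful ones
    outputs = {}
    for entry in client.messages.batches.results(batch_id):
        if entry.result.type == "succeeded":
            outputs[entry.custom_id] = entry.result.message.content[0].text
    return outputs
```

Called as `collect_batch_results(client, batch.id)`, this blocks until the batch is done, so it belongs in a background job rather than a request handler. Results arrive in no guaranteed order, which is why they are keyed by `custom_id`.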

Technique 6: Structured Output to Reduce Post-Processing

When you need JSON or specific formats, using the model’s built-in structured output support is cheaper than asking for prose and parsing it yourself (which often requires retry calls when parsing fails).

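# OpenAI structured output
from typing import Literal

from openai import OpenAI
from pydantic import BaseModel

class CodeReview(BaseModel):
    bugs: list[str]
    suggestions: list[str]
    severity: Literal["low", "medium", "high"]  # Enforced by the schema
    summary: str

client = OpenAI()
completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": f"Review: {code}"}],  # `code`: source text to review
    response_format=CodeReview,
)

message = completion.choices[0].message
if message.refusal:
    # The model can still decline to answer; that is not a parse failure
    raise RuntimeError(f"Model refused: {message.refusal}")
review = message.parsed
# No parse-retry logic needed: a returned object always matches the schema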

Structured output eliminates:

Prose wrapper tokens (“Here is the JSON you requested…”)
Retry API calls when your parser fails (1–5% failure rate without structured output)
Post-processing compute

Technique 7: Semantic Caching at the Application Layer

Provider-level caching handles identical prompt prefixes. Application-layer semantic caching handles prompts that are merely similar, which is often more relevant.

import hashlib
import json

import numpy as np
import redis
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
cache = redis.Redis()

def embed(prompt: str) -> list[float]:
    # Unit-length vectors make the dot product equal to cosine similarity
    return model.encode(prompt, normalize_embeddings=True).tolist()

def semantic_cache_get(prompt: str, threshold: float = 0.92) -> str | None:
    prompt_embedding = embed(prompt)

    # Linear scan over cached prompts: fine for small caches, use a vector index at scale
    for key in cache.scan_iter(match="prompt:*"):
        raw = cache.get(key)
        if raw is None:  # Entry expired between scan and read
            continue
        cached = json.loads(raw)
        similarity = np.dot(prompt_embedding, cached["embedding"])

        if similarity > threshold:
            return cached["response"]

    return None

def semantic_cache_set(prompt: str, response: str):
    # hashlib gives a key that is stable across processes (built-in hash() is not)
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    cache.setex(
        f"prompt:{key}",
        3600,  # 1-hour TTL
        json.dumps({"embedding": embed(prompt), "response": response})
    )

Effective for: repetitive tasks like “explain this function,” “write a test for this,” or “add error handling to this code” where the code pattern varies slightly but the task is structurally identical.

Cache hit rates in practice: 15–35% for developer tooling applications, 40–60% for customer-facing coding assistants with repeated user patterns.

Note: Semantic caching becomes even more valuable with higher-cost models like Claude 4 Sonnet and GPT-5. Because each hit avoids the entire call, savings track the hit rate: at the 40–60% hit rates above, a semantic cache can cut effective costs for a repetitive analysis task by roughly 30–50% on top of provider-level prompt caching.

Putting It All Together: A Realistic Savings Estimate

| Technique | Typical Savings | Implementation Effort |
|---|---|---|
| Prompt caching | 30–60% on repeated prefixes | Low (1 day) |
| Model routing | 15–40% overall | Medium (3–5 days) |
| Context trimming | 10–25% on input costs | Low (1–2 days) |
| Output constraints | 20–50% on output costs | Low (hours) |
| Batch processing | 50% on async tasks | Low (1 day) |
| Structured output | 5–15% (reduced retries) | Low (hours) |
| Semantic caching | 15–35% hit rate | Medium (2–3 days) |

Combined and applied to a $3,000/month bill, these techniques realistically deliver a 55–65% reduction to $1,050–$1,350/month. The most impactful are prompt caching, model routing, and output constraints — start there.
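
The per-technique figures cannot simply be added up, because each technique applies to whatever is left of the bill after the others, and several only touch input or output costs. A rough way to sanity-check a combined estimate is to compound the reductions. The three whole-bill reductions used below are hypothetical inputs, not measurements, and the function name is illustrative.

```python
def combined_reduction(reductions: list[float]) -> float:
    """Overall reduction when each technique cuts the bill left by the previous one.

    Each value is a fraction of the *remaining* bill (0.30 means 30%).
    """
    remaining = 1.0
    for r in reductions:
        if not 0.0 <= r <= 1.0:
            raise ValueError("each reduction must be between 0 and 1")
        remaining *= 1.0 - r
    return 1.0 - remaining

# Hypothetical whole-bill effects of three techniques
overall = combined_reduction([0.30, 0.25, 0.20])
print(f"{overall:.0%}")  # 58%
```

Adding the same three numbers would suggest 75%, which is why stacked estimates tend to look better on paper than on the invoice.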

Quick Wins Checklist

[ ] Add cache_control: ephemeral to your system prompts (Anthropic)
[ ] Set temperature: 0 for all code tasks
[ ] Add “Be concise. Answer in under X words.” to every prompt
[ ] Route test generation and docs to Gemini Flash or Claude Haiku
[ ] Move all non-real-time batch tasks to Batch API
[ ] Trim conversation history to last 5–6 turns
[ ] Use structured output (JSON mode / Pydantic) to eliminate parse retries
[ ] Log token usage per call to identify expensive outliers