AI API Cost Optimization: Reduce Your Bill by 60% Without Sacrificing Quality
Proven techniques for cutting LLM API costs including prompt caching, batching, model routing, and smarter context management. With real cost calculations.
If you’re calling LLM APIs at any meaningful scale, the bill lands fast. A team running Claude Sonnet for code review, documentation, and test generation across a 10-engineer org can easily hit $2,000–5,000/month without realizing it. Most of that cost is preventable.
This is not a “use a cheaper model” guide. That advice is obvious and often wrong — swapping Claude Sonnet for GPT-4o mini on complex tasks causes quality degradation that creates downstream engineering debt. Instead, this guide covers architectural and prompt-level techniques that reduce costs without meaningful quality loss. Real numbers included.
May 2026 update: Prompt caching is now available across all major providers (Anthropic, OpenAI, Google, Mistral), making it the single highest-leverage optimization. The routing strategies have also been updated to account for Claude 4 Sonnet, GPT-5, and Gemini 2.5 Flash — all of which shift the cost-quality tradeoffs.
First: Know Where Your Money Actually Goes
Before optimizing, measure. Most teams discover their cost distribution looks nothing like they assumed.
A typical developer tooling API bill breaks down like this:
| Usage Pattern | % of Calls | % of Cost |
|---|---|---|
| Large context reads (>50K tokens) | 8% | 42% |
| Standard feature implementation | 35% | 30% |
| Test and docs generation | 40% | 18% |
| Simple completions / autocomplete | 17% | 10% |
The insight: 8% of your calls eat 42% of your budget. Optimizing large-context patterns is where you’ll find the biggest wins.
Use the LLM Cost Calculator and AI Token Calculator to model your current and projected costs before diving into optimization.
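To get a breakdown like the one above from your own traffic, you can aggregate logged token counts by usage category. The following is a minimal sketch, not a definitive implementation: the log record keys (`category`, `input_tokens`, `output_tokens`), the `cost_breakdown` helper, and the prices are all assumptions for illustration.

```python
from collections import defaultdict

# Assumed prices in USD per million tokens; replace with your provider's rates.
PRICES = {"input": 3.00, "output": 15.00}

def cost_breakdown(calls):
    """Return {category: (share_of_calls, share_of_cost)} from logged calls.

    Each call is a dict with the hypothetical keys
    'category', 'input_tokens', and 'output_tokens'.
    """
    counts = defaultdict(int)
    costs = defaultdict(float)
    for call in calls:
        cost = (call["input_tokens"] * PRICES["input"]
                + call["output_tokens"] * PRICES["output"]) / 1_000_000
        counts[call["category"]] += 1
        costs[call["category"]] += cost
    total_calls = sum(counts.values())
    total_cost = sum(costs.values())
    return {
        cat: (counts[cat] / total_calls, costs[cat] / total_cost)
        for cat in counts
    }

# Tiny made-up log: one large-context call and three autocomplete calls.
sample_log = [
    {"category": "large_context", "input_tokens": 60_000, "output_tokens": 1_000},
    {"category": "autocomplete", "input_tokens": 500, "output_tokens": 50},
    {"category": "autocomplete", "input_tokens": 500, "output_tokens": 50},
    {"category": "autocomplete", "input_tokens": 500, "output_tokens": 50},
]
for category, (call_share, cost_share) in cost_breakdown(sample_log).items():
    print(f"{category}: {call_share:.0%} of calls, {cost_share:.0%} of cost")
```

Even in this toy log, the single large-context call is a quarter of the calls but the overwhelming majority of the cost, which is the pattern to look for in real data.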
Technique 1: Prompt Caching
Prompt caching is the single highest-leverage optimization available in 2026. As of mid-2026 it is offered by all major providers: Anthropic (since late 2024), Google (Gemini API), OpenAI (expanded beyond its initial limited version), and Mistral AI. You can apply it regardless of which provider you use.
How it works: When you send the same prefix (system prompt + static context) across multiple requests, the provider caches the processed tokens after the first call. Subsequent calls with the same prefix bill the cached tokens at a steep discount and charge full price only for the new tokens that follow.
Real savings: Anthropic charges 10% of the normal input price for cache hits, and a 25% premium over the normal input price on the request that writes the cache. If your system prompt is 8,000 tokens and you make 1,000 API calls/day at $3.00/1M input:
Without caching:
8,000 tokens × 1,000 calls × $3.00/1M = $24.00/day
With caching (95% cache hit rate):
8,000 tokens × 50 cache writes × $3.75/1M = $1.50
8,000 tokens × 950 cache hits × $0.30/1M = $2.28
Total: $3.78/day
Savings: $20.22/day → about $607/month on system prompts alone
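The arithmetic above can be wrapped in a small function so you can plug in your own prefix size, call volume, and hit rate. This is a sketch under stated assumptions: `daily_prefix_cost` is a hypothetical helper, the 0.10 read multiplier comes from the text, and the 1.25 write multiplier reflects Anthropic's cache-write premium for the 5-minute cache.

```python
def daily_prefix_cost(prefix_tokens, calls_per_day, hit_rate,
                      input_price=3.00, read_multiplier=0.10,
                      write_multiplier=1.25):
    """Daily USD cost of a repeated prompt prefix.

    input_price is USD per million input tokens. Cache misses pay
    input_price * write_multiplier; hits pay input_price * read_multiplier.
    """
    hits = calls_per_day * hit_rate
    misses = calls_per_day - hits
    per_token = input_price / 1_000_000
    write_cost = prefix_tokens * misses * per_token * write_multiplier
    read_cost = prefix_tokens * hits * per_token * read_multiplier
    return write_cost + read_cost

# No caching: every call pays the plain input price.
uncached = daily_prefix_cost(8_000, 1_000, hit_rate=0.0, write_multiplier=1.0)
# Caching with a 95% hit rate and the write premium applied.
cached = daily_prefix_cost(8_000, 1_000, hit_rate=0.95)
print(f"uncached ${uncached:.2f}/day, cached ${cached:.2f}/day, "
      f"saving ${(uncached - cached) * 30:.0f}/month")
```

Setting `write_multiplier=1.0` shows the result when the write premium is ignored ($3.48/day instead of $3.78/day); the difference is small at high hit rates but grows quickly as the hit rate drops.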
Implementation with Anthropic Python SDK
import anthropic

client = anthropic.Anthropic()  # Reads ANTHROPIC_API_KEY from the environment

# Placeholders for illustration; load these from your own sources
STATIC_CODEBASE_CONTEXT = "<coding standards and codebase conventions>"
dynamic_pr_content = "<diff of the PR under review>"

# Cache the static parts of your prompt
response = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=4096,
    system=[
        {
            "type": "text",
            "text": "You are an expert code reviewer specializing in Python...",
            # Cache this block (prefixes shorter than the model's minimum
            # cacheable length are simply sent uncached)
            "cache_control": {"type": "ephemeral"}
        },
        {
            "type": "text",
            # Large static context (e.g., coding standards doc, codebase conventions)
            "text": STATIC_CODEBASE_CONTEXT,  # Could be 10K+ tokens
            "cache_control": {"type": "ephemeral"}  # Cache this too
        }
    ],
    messages=[
        {
            "role": "user",
            "content": f"Review this PR: {dynamic_pr_content}"  # Only this varies
        }
    ]
)

# Check cache performance
print(f"Cache creation tokens: {response.usage.cache_creation_input_tokens}")
print(f"Cache read tokens: {response.usage.cache_read_input_tokens}")
print(f"Uncached input tokens: {response.usage.input_tokens}")
What to Cache
| Content Type | Cache? | Reason |
|---|---|---|
| System prompt | Always | Highest reuse rate |
| Coding standards / style guides | Yes | Rarely changes |
| Codebase architecture docs | Yes | Changes weekly at most |
| The actual code being reviewed | No | Always unique |
| User message | No | Always unique |
Cache TTL: Anthropic’s ephemeral cache lasts 5 minutes by default, and each cache hit refreshes that window. Continuous workloads therefore stay warm on their own; for bursty workloads, send a small request with the same prefix before the 5 minutes elapse to keep the cache alive.
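One way to decide when a keep-alive request is worthwhile is to track when the cached prefix was last used. The sketch below assumes each use of the prefix resets the 5-minute TTL; `cache_state` and its `margin` parameter are hypothetical names, not part of any SDK.

```python
def cache_state(last_use, now, ttl=300, margin=30):
    """Classify a cached prefix by age, in seconds.

    'warm'     -> safely inside the TTL, nothing to do
    'expiring' -> within `margin` seconds of expiry, send a keep-alive now
    'expired'  -> the next request will pay the cache-write price again
    """
    age = now - last_use
    if age >= ttl:
        return "expired"
    if age >= ttl - margin:
        return "expiring"
    return "warm"

# Timestamps are plain seconds here; in practice pass time.time() values.
for age in (100, 280, 300):
    print(age, cache_state(last_use=0, now=age))
```

A keep-alive only pays off if another real request is likely soon; for prefixes that go quiet for long stretches, letting the cache expire and paying one write later is usually cheaper.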
Technique 2: Model Routing
Model routing means sending different task types to different models based on a complexity classifier. It’s the most impactful architectural change for high-volume applications.
The principle: Not every task needs Claude Sonnet. Test generation, docstring writing, and simple completions can run on Haiku or Gemini Flash at a fraction of the cost: at the input prices used below, roughly 4x lower for Haiku and 40x lower for Flash.
Build a Simple Classifier
from enum import Enum

class TaskComplexity(Enum):
    SIMPLE = "simple"
    MEDIUM = "medium"
    COMPLEX = "complex"

def classify_task(prompt: str, context_tokens: int) -> TaskComplexity:
    # Large context always bumps to complex
    if context_tokens > 40_000:
        return TaskComplexity.COMPLEX

    # Keywords suggesting complexity
    complex_keywords = [
        "debug", "root cause", "race condition", "architecture",
        "design", "refactor", "security", "performance bottleneck",
        "concurrency", "memory leak"
    ]
    simple_keywords = [
        "docstring", "comment", "rename", "format", "test for",
        "generate unit test", "explain this function", "add type hints"
    ]

    prompt_lower = prompt.lower()
    # Complex keywords win when a prompt matches both lists
    if any(kw in prompt_lower for kw in complex_keywords):
        return TaskComplexity.COMPLEX
    elif any(kw in prompt_lower for kw in simple_keywords):
        return TaskComplexity.SIMPLE
    else:
        return TaskComplexity.MEDIUM

def get_model(complexity: TaskComplexity) -> str:
    return {
        TaskComplexity.SIMPLE: "gemini-2.0-flash",  # $0.075/1M input
        TaskComplexity.MEDIUM: "claude-3-5-haiku-20241022",  # $0.80/1M input
        TaskComplexity.COMPLEX: "claude-4-sonnet-20260515"  # $3.00/1M input
    }[complexity]
Real Cost Comparison With vs. Without Routing
Assume a team making 10,000 API calls/day with average 2,000 input + 800 output tokens:
Without routing (all Claude Sonnet):
Input: 10,000 × 2,000 × $3.00/1M = $60.00/day
Output: 10,000 × 800 × $15.00/1M = $120.00/day
Total: $180/day → $5,400/month
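With routing (40% simple, 40% medium, 20% complex), pricing input and output separately as in the baseline and assuming output rates of $0.30/1M for Flash and $4.00/1M for Haiku:
Simple (Flash): 4,000 calls × (2,000 × $0.075 + 800 × $0.30)/1M = $1.56/day
Medium (Haiku): 4,000 calls × (2,000 × $0.80 + 800 × $4.00)/1M = $19.20/day
Complex (Sonnet): 2,000 calls × (2,000 × $3.00 + 800 × $15.00)/1M = $36.00/day
Total: $56.76/day → $1,703/month
That’s a 68% reduction in this example, and complex tasks still run on the strongest model rather than a single mid-tier compromise. The figure depends heavily on the split: the fewer calls that genuinely qualify as simple or medium, the smaller the saving. Refine the classifier over time to route as much traffic down-tier as quality allows.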
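To test other routing splits, the comparison can be parameterized. This sketch prices input and output tokens separately for every tier, the same way the all-Sonnet baseline is computed. The `PRICES` table and `daily_cost` helper are hypothetical, and the Flash and Haiku output rates ($0.30 and $4.00 per million) are assumptions you should replace with current list prices.

```python
# Assumed (input, output) prices in USD per million tokens.
PRICES = {
    "flash": (0.075, 0.30),
    "haiku": (0.80, 4.00),
    "sonnet": (3.00, 15.00),
}

def daily_cost(mix, calls_per_day=10_000, input_tokens=2_000, output_tokens=800):
    """Daily USD cost for a routing mix.

    mix maps a model name in PRICES to the fraction of calls routed to it;
    the fractions must sum to 1.
    """
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("routing fractions must sum to 1")
    total = 0.0
    for model, fraction in mix.items():
        in_price, out_price = PRICES[model]
        per_call = (input_tokens * in_price + output_tokens * out_price) / 1_000_000
        total += calls_per_day * fraction * per_call
    return total

baseline = daily_cost({"sonnet": 1.0})
routed = daily_cost({"flash": 0.4, "haiku": 0.4, "sonnet": 0.2})
print(f"baseline ${baseline:.2f}/day, routed ${routed:.2f}/day, "
      f"saving {1 - routed / baseline:.0%}")
```

Try a pessimistic mix such as `{"flash": 0.1, "haiku": 0.3, "sonnet": 0.6}` before committing engineering time: the saving shrinks quickly as the share of complex calls grows.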
Technique 3: Aggressive Context Trimming
Output tokens cost several times more than input tokens (5x for Claude Sonnet at the prices above). But overstuffed input context is also expensive and often hurts quality.
The most common waste: sending the entire conversation history on every turn when only the last 2–3 exchanges are relevant.
Sliding Window Context
from dataclasses import dataclass
from typing import List

import tiktoken

@dataclass
class Message:
    role: str
    content: str

def trim_context(
    messages: List[Message],
    system_prompt: str,
    max_input_tokens: int = 8_000,
    model: str = "gpt-4o"
) -> List[Message]:
    """
    Keep only recent messages that fit within token budget.
    Always preserves the first user message (task context).
    """
    # tiktoken is OpenAI's tokenizer; counts for other providers are approximate
    enc = tiktoken.encoding_for_model(model)
    system_tokens = len(enc.encode(system_prompt))
    budget = max_input_tokens - system_tokens - 500  # Buffer for response

    # Always keep the first message
    pinned = [messages[0]] if messages else []
    pinned_tokens = sum(len(enc.encode(m.content)) for m in pinned)

    # Fill from recent messages backwards
    recent = []
    recent_tokens = 0
    for msg in reversed(messages[1:]):
        msg_tokens = len(enc.encode(msg.content))
        if recent_tokens + msg_tokens + pinned_tokens > budget:
            break
        recent.insert(0, msg)
        recent_tokens += msg_tokens

    return pinned + recent
Summarize Instead of Truncate
For long conversations, summarizing older turns is better than dropping them:
# Assumption: gemini_flash is a LangChain-style chat model exposing ainvoke().
# Message and List come from the sliding-window example above.
from langchain_google_genai import ChatGoogleGenerativeAI

gemini_flash = ChatGoogleGenerativeAI(model="gemini-2.0-flash")

async def summarize_old_turns(
    messages: List[Message],
    keep_recent: int = 6
) -> List[Message]:
    """Compress old messages into a summary to preserve context cheaply."""
    if len(messages) <= keep_recent:
        return messages

    old = messages[:-keep_recent]
    recent = messages[-keep_recent:]
    old_text = "\n".join(f"{m.role}: {m.content}" for m in old)

    # Use cheapest model for summarization
    summary_response = await gemini_flash.ainvoke(
        f"Summarize this conversation history in 200 words, preserving key decisions:\n\n{old_text}"
    )
    summary_message = Message(
        role="user",
        content=f"[Previous conversation summary: {summary_response.content}]"
    )
    return [summary_message] + recent
Technique 4: Response Length Control
Every token you don’t generate is money you don’t spend. Most prompts that don’t constrain output length will get verbose responses.
Add these instructions to every prompt:
- “Be concise. Answer in under [N] words.”
- “Return only the code, no explanation.”
- “List format, no prose.”
Impact of output length constraints:
| Prompt Type | Unconstrained Output | Constrained Output | Cost Reduction |
|---|---|---|---|
| Code review | ~1,200 tokens | ~400 tokens | 67% |
| Bug explanation | ~800 tokens | ~200 tokens | 75% |
| Test generation | ~600 tokens | ~450 tokens | 25% |
| Docs writing | ~900 tokens | ~600 tokens | 33% |
# Unconstrained: invites a long, free-form answer (expensive)
prompt_bad = "Review this code for bugs."
# Constrained: a fixed format and item cap keep the output short (cheap)
prompt_good = """Review this code for bugs.
Output format:
- List only real bugs (not style issues)
- One line per bug: [LINE_NUMBER]: [BUG_DESCRIPTION]
- If no bugs, return: "No bugs found"
- Maximum 10 items
"""
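To translate the token reductions in the table into dollars, multiply the tokens saved by your call volume and output price. This is a rough sketch: `output_savings` is a hypothetical helper, the $15.00/1M output price is the Sonnet rate used earlier, and the 1,000 calls/day volume is an assumption for illustration.

```python
def output_savings(unconstrained_tokens, constrained_tokens,
                   calls_per_day, output_price=15.00, days=30):
    """Return (fractional reduction, USD saved over `days`).

    output_price is USD per million output tokens.
    """
    saved_tokens = (unconstrained_tokens - constrained_tokens) * calls_per_day * days
    reduction = 1 - constrained_tokens / unconstrained_tokens
    return reduction, saved_tokens * output_price / 1_000_000

# Code review row from the table: ~1,200 tokens down to ~400 tokens.
reduction, dollars = output_savings(1_200, 400, calls_per_day=1_000)
print(f"{reduction:.0%} fewer output tokens, about ${dollars:.0f}/month saved")
```

Prompt instructions are a soft limit; pairing them with a sensible `max_tokens` value gives a hard ceiling on what any single response can cost.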
Technique 5: Batch Processing
If your use case includes non-real-time tasks (nightly doc generation, weekly code review reports, bulk test creation), use batch APIs. Anthropic and OpenAI both offer ~50% discounts for batch processing.
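import anthropic

client = anthropic.Anthropic()

# Placeholder input: map of file path -> source code to document
files_to_document = {"src/utils.py": "def add(a, b):\n    return a + b"}

# File paths can contain characters that custom_id may reject (such as "/"),
# so use index-based IDs and keep a lookup back to the path
id_to_path = {f"doc-gen-{i}": path for i, path in enumerate(files_to_document)}

# Create a batch of up to 10,000 requests
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": custom_id,
            "params": {
                "model": "claude-3-5-haiku-20241022",  # Or claude-4-sonnet-20260515 for complex docs
                "max_tokens": 1024,
                "messages": [
                    {
                        "role": "user",
                        "content": f"Generate a docstring for:\n{files_to_document[path]}"
                    }
                ]
            }
        }
        for custom_id, path in id_to_path.items()
    ]
)
print(f"Batch ID: {batch.id}")
# Results are available within 24 hours at 50% cost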
Batch vs. real-time cost:
| Task | Real-time Cost | Batch Cost | Savings |
|---|---|---|---|
| 1,000 doc generations (Haiku) | $0.80 | $0.40 | 50% |
| 500 test suites (Haiku) | $1.20 | $0.60 | 50% |
| 100 code reviews (Sonnet) | $18.00 | $9.00 | 50% |
If latency doesn’t matter for a task, always batch.
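Submitting a batch is only half the workflow; you also need to collect the results once processing ends. The sketch below assumes `client` is an `anthropic.Anthropic()` instance and uses the SDK's `messages.batches.retrieve` and `messages.batches.results` calls. The `wait_for_batch` helper and its polling interval are hypothetical, and the client is passed in as a parameter so the function can be exercised with a stub.

```python
import time

def wait_for_batch(client, batch_id, poll_seconds=60, sleep=time.sleep):
    """Poll until the batch has ended, then return {custom_id: text}.

    Only succeeded requests are returned; errored, canceled, or expired
    entries are skipped and should be retried or logged by the caller.
    """
    while client.messages.batches.retrieve(batch_id).processing_status != "ended":
        sleep(poll_seconds)

    outputs = {}
    for entry in client.messages.batches.results(batch_id):
        if entry.result.type == "succeeded":
            # First content block of the generated message
            outputs[entry.custom_id] = entry.result.message.content[0].text
    return outputs
```

A call such as `wait_for_batch(client, batch.id)` blocks until the batch finishes, so in a nightly job it is usually simpler to store the batch ID and run the collection step from a later scheduled task.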
Technique 6: Structured Output to Reduce Post-Processing
When you need JSON or specific formats, using the model’s built-in structured output support is cheaper than asking for prose and parsing it yourself (which often requires retry calls when parsing fails).
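# OpenAI structured output
from typing import Literal

from openai import OpenAI
from pydantic import BaseModel

class CodeReview(BaseModel):
    bugs: list[str]
    suggestions: list[str]
    severity: Literal["low", "medium", "high"]
    summary: str

client = OpenAI()
code = "def add(a, b): return a - b"  # Placeholder: the code under review

completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": f"Review: {code}"}],
    response_format=CodeReview,
)
review = completion.choices[0].message.parsed
# parsed conforms to the schema, so no parse-and-retry loop is needed.
# It is None if the model refused the request, so check before use.
if review is None:
    print(completion.choices[0].message.refusal)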
Structured output eliminates:
- Prose wrapper tokens (“Here is the JSON you requested…”)
- Retry API calls when your parser fails (1–5% failure rate without structured output)
- Post-processing compute
Technique 7: Semantic Caching at the Application Layer
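Provider-level caching handles identical prompt prefixes and only discounts the input tokens; the model still runs. Application-layer semantic caching handles similar prompts and skips the API call entirely on a hit, which is often more relevant.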
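import hashlib
import json

import numpy as np
import redis
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
cache = redis.Redis()

def semantic_cache_get(prompt: str, threshold: float = 0.92) -> str | None:
    # Normalized embeddings make the dot product equal to cosine similarity
    prompt_embedding = model.encode(prompt, normalize_embeddings=True)
    # Linear scan over cached prompts: fine for small caches, use a vector
    # index for large ones. scan_iter avoids blocking Redis the way KEYS does.
    for key in cache.scan_iter("prompt:*"):
        raw = cache.get(key)
        if raw is None:  # Entry expired between scan and get
            continue
        cached = json.loads(raw)
        similarity = float(np.dot(prompt_embedding, cached["embedding"]))
        if similarity > threshold:
            return cached["response"]
    return None

def semantic_cache_set(prompt: str, response: str):
    embedding = model.encode(prompt, normalize_embeddings=True).tolist()
    # Built-in hash() is randomized per process for strings; use a stable digest
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    cache.setex(
        f"prompt:{key}",
        3600,  # 1-hour TTL
        json.dumps({"embedding": embedding, "response": response})
    )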
Effective for: repetitive tasks like “explain this function,” “write a test for this,” or “add error handling to this code” where the code pattern varies slightly but the task is structurally identical.
Cache hit rates in practice: 15–35% for developer tooling applications, 40–60% for customer-facing coding assistants with repeated user patterns.
Note: Semantic caching becomes even more valuable with higher-cost models like Claude 4 Sonnet and GPT-5. If you’re using a frontier model for repetitive analysis tasks, a semantic cache can cut effective costs by 30–50% on top of provider-level prompt caching.
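The lookup logic can be tried without Redis or an embedding model. The sketch below is a dependency-free illustration, not a production design: `embed` is a toy bag-of-words stand-in for a real embedding model, `SemanticCache` is a hypothetical in-memory class, and the 0.8 threshold is tuned to the toy embedding rather than to real sentence embeddings.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

class SemanticCache:
    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def get(self, prompt):
        """Return the response of the most similar cached prompt, or None."""
        query = embed(prompt)
        best_score, best_response = 0.0, None
        for embedding, response in self.entries:
            score = cosine(query, embedding)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def set(self, prompt, response):
        self.entries.append((embed(prompt), response))

cache = SemanticCache()
cache.set("explain this function parse_config", "It parses the config file.")
print(cache.get("explain this function parse_config please"))  # near-duplicate
print(cache.get("write a migration script"))                   # unrelated
```

Returning the best match above the threshold, rather than the first one found, reduces the chance of serving an answer meant for a different prompt. With real embeddings, validate the threshold on your own traffic, since a hit on the wrong code snippet is worse than a miss.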
Putting It All Together: A Realistic Savings Estimate
| Technique | Typical Savings | Implementation Effort |
|---|---|---|
| Prompt caching | 30–60% on repeated prefixes | Low (1 day) |
| Model routing | 15–40% overall | Medium (3–5 days) |
| Context trimming | 10–25% on input costs | Low (1–2 days) |
| Output constraints | 20–50% on output costs | Low (hours) |
| Batch processing | 50% on async tasks | Low (1 day) |
| Structured output | 5–15% (reduced retries) | Low (hours) |
| Semantic caching | 15–35% hit rate | Medium (2–3 days) |
Combined and applied to a $3,000/month bill, these techniques realistically deliver a 55–65% reduction to $1,050–$1,350/month. The most impactful are prompt caching, model routing, and output constraints — start there.
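The per-technique savings do not simply add up, because each technique acts on the bill that remains after the others. A quick way to sanity-check a combined estimate is to multiply the remaining fractions. The numbers below are hypothetical whole-bill reductions chosen for illustration, and `combined_reduction` is an assumed helper name.

```python
def combined_reduction(reductions):
    """Overall fractional reduction when each step cuts what remains."""
    remaining = 1.0
    for r in reductions:
        remaining *= 1.0 - r
    return 1.0 - remaining

bill = 3_000
# Hypothetical whole-bill effects of caching, routing, and output constraints.
steps = [0.30, 0.25, 0.20]
total = combined_reduction(steps)
print(f"naive sum: {sum(steps):.0%}, compounded: {total:.0%}, "
      f"new bill: ${bill * (1 - total):,.0f}/month")
```

Here three cuts that naively sum to 75% compound to 58%, which lands inside the 55–65% range quoted above. Use whole-bill percentages as inputs: a technique that saves 50% on a category making up 20% of spend contributes only 10%.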
Quick Wins Checklist
- [ ] Add `cache_control: ephemeral` to your system prompts (Anthropic)
- [ ] Set `temperature: 0` for all code tasks
- [ ] Add “Be concise. Answer in under X words.” to every prompt
- [ ] Route test generation and docs to Gemini Flash or Claude Haiku
- [ ] Move all non-real-time batch tasks to Batch API
- [ ] Trim conversation history to last 5–6 turns
- [ ] Use structured output (JSON mode / Pydantic) to eliminate parse retries
- [ ] Log token usage per call to identify expensive outliers