LLM API Providers Directory: Every Major Provider Compared

Choosing an LLM API provider in 2026 is not just picking the smartest model — it’s a systems decision that affects your latency, reliability, compliance posture, and monthly burn rate. This directory covers every provider worth considering, with honest assessments of where each one excels and where it falls short.

Pricing is as of May 2026 and reflects standard public API rates. Many providers have enterprise pricing that differs significantly. Use the LLM Cost Calculator to model costs for your specific volume.
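Because every price in this directory is quoted per 1M tokens, a first-pass monthly estimate is simple arithmetic. The sketch below is a minimal illustration, not the LLM Cost Calculator itself: the function name `monthly_cost` and the traffic figures are assumptions made up for the example, while the rates are the Claude 4 Sonnet list prices quoted in the Anthropic section.

```python
def monthly_cost(requests, input_tokens, output_tokens, input_rate, output_rate):
    """Estimate monthly spend in dollars.

    input_tokens / output_tokens are averages per request;
    input_rate / output_rate are dollars per 1M tokens.
    """
    input_cost = requests * input_tokens / 1_000_000 * input_rate
    output_cost = requests * output_tokens / 1_000_000 * output_rate
    return input_cost + output_cost


# Hypothetical workload: 100,000 requests/month, 2,000 input and 500 output
# tokens per request, priced at $3.00 input / $15.00 output per 1M tokens.
estimate = monthly_cost(100_000, 2_000, 500, 3.00, 15.00)
print(f"${estimate:,.2f}")
```

Output tokens often dominate the bill even when they are a minority of the traffic, so it is worth measuring real input/output ratios before comparing providers.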

May 2026 update: This directory has been updated to include Claude 4 Sonnet, GPT-5, Gemini 3, Llama 4, and DeepSeek. Previous-generation models remain listed for reference and backward compatibility.

How to Use This Directory

Each provider listing includes:

Models available — current flagship and budget options
Pricing — per 1M input/output tokens
Context window — maximum tokens per request
Rate limits — default tier limits
Strengths — where this provider genuinely wins
Weaknesses — honest limitations
Best for — specific use cases where this provider is the right call

Anthropic

Website: api.anthropic.com
Status: Production-ready

Models

| Model | Context | Input ($/1M) | Output ($/1M) |
|---|---|---|---|
| Claude 4 Sonnet | 200K | $3.00 | $15.00 |
| Claude 3.7 Sonnet | 200K | $3.00 | $15.00 |
| Claude 3.5 Sonnet | 200K | $3.00 | $15.00 |
| Claude 3.5 Haiku | 200K | $0.80 | $4.00 |

Claude 4 Sonnet is Anthropic’s latest (May 2026), offering improved instruction-following, reduced hallucinations, and better recall of decisions across long sessions compared to Claude 3.7 Sonnet. It maintains the same pricing and 200K context window while delivering measurable gains on SWE-bench (74.1% vs 70.3%).

Rate Limits (Tier 2 / Standard)

| Metric | Limit |
|---|---|
| Requests per minute | 2,000 |
| Tokens per minute | 100,000 |
| Tokens per day | 2,500,000 |
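With limits like these, the token budget can bind long before the request budget does. A rough check, assuming a hypothetical average of 5,000 tokens per request (input plus output); `sustainable_rpm` is an illustrative helper, not part of any SDK:

```python
def sustainable_rpm(rpm_limit, tpm_limit, tokens_per_request):
    """Requests per minute that fit under both the request and token limits."""
    return min(rpm_limit, tpm_limit // tokens_per_request)


# Limits from the table above; 5,000 tokens per request is an assumed workload.
effective = sustainable_rpm(2_000, 100_000, 5_000)
print(effective)
```

Under that assumption the 100,000 tokens-per-minute cap, not the 2,000 requests-per-minute cap, sets the real ceiling.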

Pricing Features

Prompt caching: Cache-hit tokens are billed at 10% of the normal input rate ($0.30/1M for Sonnet, $0.08/1M for Haiku)
Batch API: 50% discount for async batch processing, results within 24 hours
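To see what these discounts mean in practice, the sketch below computes a blended input price for a given cache hit rate. It is a simplification under stated assumptions: it uses the 10% cache-hit rate quoted for Sonnet, ignores any surcharge for writing to the cache, and does not assume that caching and batch discounts stack. `effective_input_rate` is a hypothetical helper name.

```python
def effective_input_rate(base_rate, cached_fraction, cache_multiplier=0.10):
    """Blended $/1M input price when a fraction of input tokens are cache hits."""
    uncached = (1 - cached_fraction) * base_rate
    cached = cached_fraction * base_rate * cache_multiplier
    return uncached + cached


# Sonnet input at $3.00/1M with an assumed 80% of input tokens served from cache.
blended = effective_input_rate(3.00, 0.80)
# Batch API alternative: flat 50% discount on the same base rate.
batch = 3.00 * 0.5

print(f"cached: ${blended:.2f}/1M, batch: ${batch:.2f}/1M")
```

For workloads with a long, stable system prompt, a high cache hit rate can undercut the batch discount without giving up real-time responses.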

Strengths

Best-in-class performance on SWE-bench (real software engineering tasks) — Claude 4 Sonnet leads at 74.1%
Most reliable instruction-following among all providers
Extended thinking mode for complex reasoning tasks
Best JSON/structured output reliability
Strong safety profile with predictable refusal behavior
Claude 4 Sonnet further improves decision recall across long sessions

Weaknesses

No image generation
No audio/speech models
Rate limits on lower tiers can be restrictive
No embedding models (need a separate provider)

Best For

Complex code generation and debugging
Agentic workflows requiring reliable tool use
Any task where output quality directly affects user-facing product
Teams with compliance requirements (SOC 2 Type II certified)

Integration

import anthropic

client = anthropic.Anthropic(api_key="your-key")

message = client.messages.create(
    model="claude-4-sonnet-20260515",  # or "claude-3-7-sonnet-20250219" for previous gen
    max_tokens=4096,
    messages=[{"role": "user", "content": "Review this code: ..."}]
)

OpenAI

Website: platform.openai.com
Status: Production-ready

Models

| Model | Context | Input ($/1M) | Output ($/1M) |
|---|---|---|---|
| GPT-5 | 256K | $3.00 | $12.00 |
| GPT-4o | 128K | $2.50 | $10.00 |
| GPT-4o mini | 128K | $0.15 | $0.60 |
| o3 | 200K | $10.00 | $40.00 |
| o3-mini | 200K | $1.10 | $4.40 |
| GPT-4.1 | 1M | $2.00 | $8.00 |

GPT-5 (released in 2026) is OpenAI’s latest flagship, offering a 256K context window and improved reasoning over GPT-4o. It strikes a balance between GPT-4o’s affordability and o3’s reasoning depth — performing well on both code generation (92.1% HumanEval+) and complex reasoning (81.5% GPQA Diamond).

Rate Limits (Tier 2)

| Metric | Limit |
|---|---|
| Requests per minute | 5,000 |
| Tokens per minute | 450,000 |
| Requests per day | Unlimited (Tier 4+) |

Pricing Features

Prompt caching: Automatic (no code changes needed); tokens in a matching cached prefix are billed at a 50% discount
Batch API: 50% discount, 24-hour turnaround

Strengths

Largest model family — from ultra-cheap mini to reasoning-specialized o3 and the new GPT-5 flagship
GPT-5 offers improved reasoning (81.5% GPQA Diamond) while being significantly cheaper than o3
Broadest ecosystem support (every framework, tool, and library integrates with OpenAI first)
Best fine-tuning pipeline: the most mature tooling, documentation, and cost transparency
Strong multimodal capability (vision, audio with GPT-4o and GPT-5)
Assistants API for stateful, multi-turn applications

Weaknesses

o3’s $40/1M output cost is prohibitive for high-volume use
GPT-4o mini quality noticeably lower than Claude Haiku for complex tasks
Rate limits on lower tiers require careful management
API reliability has had well-documented incidents in 2025
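Rate-limit errors and transient outages are commonly handled on the client side with retries and exponential backoff. The sketch below is generic and deliberately does not call the OpenAI SDK: `RateLimited`, `with_backoff`, and `flaky_call` are hypothetical names, and real code would catch the SDK's own rate-limit exception instead.

```python
import random
import time


class RateLimited(Exception):
    """Stand-in for a provider's HTTP 429 error (hypothetical)."""


def with_backoff(call, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Run call(), retrying on RateLimited with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the error to the caller
            # Wait 1x, 2x, 4x, ... the base delay, plus up to 10% random jitter
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))


# Demo with a fake call that fails twice, then succeeds.
attempts = {"count": 0}


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RateLimited()
    return "ok"


result = with_backoff(flaky_call, base_delay=0.01)
print(result, attempts["count"])
```

The jitter keeps many clients that were throttled at the same moment from retrying in lockstep.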

Best For

Teams wanting one API for everything (text, vision, audio, embeddings, fine-tuning)
Applications needing the broadest third-party integration support
When you need fine-tuned models — OpenAI’s pipeline is the most mature
Reasoning-heavy tasks where o3 quality justifies cost

Integration

from openai import OpenAI

client = OpenAI(api_key="your-key")

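response = client.chat.completions.create(
    model="gpt-5",  # or "gpt-4o" for lower cost
    # JSON mode rejects requests unless the word "JSON" appears in the messages
    messages=[{"role": "user", "content": "Debug this function and reply in JSON..."}],
    response_format={"type": "json_object"},  # Force JSON output
)
print(response.choices[0].message.content)  # JSON string returned by the model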

Google (Gemini API / Vertex AI)

Website: ai.google.dev (Gemini API), cloud.google.com/vertex-ai (Vertex)
Status: Production-ready

Models

| Model | Context | Input ($/1M) | Output ($/1M) |
|---|---|---|---|
| Gemini 3 Pro | 2M | $1.50 (under 200K) | $12.00 |
| Gemini 3 Pro | 2M | $3.00 (over 200K) | $18.00 |
| Gemini 2.5 Pro | 2M | $1.25 (under 200K) | $10.00 |
| Gemini 2.5 Pro | 2M | $2.50 (over 200K) | $15.00 |
| Gemini 2.5 Flash | 1M | $0.15 | $0.60 |
| Gemini 2.0 Flash | 1M | $0.075 | $0.30 |
| Gemini 2.0 Flash-Lite | 1M | $0.018 | $0.075 |

Gemini 3 Pro is Google’s latest reasoning model (May 2026), with further improvements in instruction-following, code generation, and multilingual performance. It retains the 2M token context window and adds better structured output handling. Gemini 2.5 Flash fills the gap between 2.0 Flash and 2.5 Pro, offering better quality than 2.0 Flash at $0.15/1M input, which makes it a price/performance sweet spot for high-volume tasks.

Rate Limits (Pay-as-you-go)

| Model | RPM | TPM |
|---|---|---|
| Gemini 2.5 Pro | 150 | 2,000,000 |
| Gemini 2.0 Flash | 2,000 | 4,000,000 |

Strengths

Lowest cost per token at every quality tier — Flash is exceptionally cheap
Largest context window available (2M tokens on 2.5 Pro and 3 Pro)
Best performance on reasoning benchmarks (GPQA Diamond, MATH)
Native Google Search grounding for RAG-free real-time knowledge
Generous free tier for development
Gemini 2.5 Flash adds a quality mid-point at $0.15/1M input between 2.0 Flash and 2.5 Pro

Weaknesses

Slightly less consistent instruction-following vs. Claude on complex structured tasks
Vertex AI setup is more complex for teams without GCP infrastructure
Rate limits on Gemini 2.5 Pro (150 RPM) can block high-frequency workloads
Context above 200K tokens is billed at higher rates on both Pro models: 2x on input and 1.5x on output
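The long-context surcharge is easy to underestimate, so here is a small worked example using the Gemini 3 Pro rates from the table above. One assumption to flag: the sketch treats a prompt over 200K tokens as moving the whole request to the higher tier, which should be confirmed against Google's current pricing page. `gemini_pro_cost` is a hypothetical helper.

```python
def gemini_pro_cost(input_tokens, output_tokens, *,
                    low=(1.50, 12.00), high=(3.00, 18.00), threshold=200_000):
    """Dollar cost of one request under two-tier pricing.

    low / high are (input, output) rates in $/1M tokens.
    Assumption: a prompt over the threshold bills the whole request at the high tier.
    """
    input_rate, output_rate = high if input_tokens > threshold else low
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


short_cost = gemini_pro_cost(150_000, 2_000)  # under the 200K threshold
long_cost = gemini_pro_cost(400_000, 2_000)   # over the 200K threshold

print(f"150K prompt: ${short_cost:.3f}, 400K prompt: ${long_cost:.3f}")
```

Under these assumptions a prompt about 2.7x longer costs roughly 5x more, so trimming context to stay under the threshold can pay off quickly.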

Best For

High-volume, cost-sensitive applications (Gemini 2.0 Flash input costs $0.075/1M vs. $3.00/1M for Claude Sonnet, a 40x difference)
Tasks requiring very long context (entire codebases, large document sets)
Applications that benefit from real-time web access via grounding
Teams already in the GCP ecosystem

Integration

import google.generativeai as genai

genai.configure(api_key="your-key")
model = genai.GenerativeModel("gemini-2.0-flash")  # or "gemini-3-pro" for latest reasoning

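# Placeholder source to test; replace with your own function
code = "def add(a, b):\n    return a + b"

response = model.generate_content(
    f"Generate unit tests for this function:\n\n{code}",  # f-string so {code} is interpolated
    generation_config=genai.GenerationConfig(
        temperature=0,
        max_output_tokens=2048,
    ),
)
print(response.text)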

Meta (via Inference Providers)

Direct access: Not available — Meta releases weights, not an API
Via: Groq, Together AI, Fireworks AI, Replicate, AWS Bedrock

Models (Llama Family)

| Model | Parameters | Context | Typical Cost ($/1M) |
|---|---|---|---|
| Llama 4 17B | 17B | 256K | $0.10–$0.40 |
| Llama 3.3 70B | 70B | 128K | $0.20–$0.90 |
| Llama 3.1 405B | 405B | 128K | $1.50–$5.00 |
| Llama 3.2 11B Vision | 11B | 128K | $0.08–$0.20 |
| CodeLlama 70B | 70B | 100K | $0.20–$0.60 |

Llama 4 17B is Meta’s latest open-weight model, offering a 256K context window and improved multimodal capabilities. Despite its smaller parameter count (17B vs 70B for Llama 3.3), it delivers competitive reasoning quality thanks to architectural improvements, including interleaved MoE layers and improved training data curation. Ideal for self-hosting on modest hardware.

Cost varies by inference provider. See provider-specific pricing below.

Strengths

Open weights — you can fine-tune and run on your own infrastructure
No data leaves your infrastructure when self-hosted
Competitive quality for the cost, especially Llama 3.3 70B
No rate limits when self-hosted

Weaknesses

Requires infrastructure management when self-hosted
Managed providers add cost and their own rate limits
Llama models trail Claude/GPT-4o on complex reasoning and instruction-following
No official SLA or support for API stability

Best For

Privacy-sensitive applications where data cannot leave your infrastructure
High-volume tasks where open-model quality is sufficient
Teams wanting to fine-tune on proprietary code/data
Research and experimentation without API billing

Groq

Website: console.groq.com
Status: Production-ready (focuses on inference speed, not model ownership)

Models Hosted

| Model | Input ($/1M) | Output ($/1M) | Speed |
|---|---|---|---|
| Llama 4 17B | $0.30 | $0.40 | ~1,200 tokens/sec |
| Llama 3.3 70B | $0.59 | $0.79 | ~800 tokens/sec |
| Mixtral 8x7B | $0.24 | $0.24 | ~500 tokens/sec |
| Gemma 3 12B | $0.25 | $0.25 | ~1,100 tokens/sec |
| Gemma 2 9B | $0.20 | $0.20 | ~1,200 tokens/sec |

Strengths

Fastest inference available — LPU (Language Processing Unit) hardware delivers 5–10x faster throughput than GPU-based providers
No rate limit pain for throughput-heavy workloads
Simple pricing, no egress fees
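Throughput figures translate directly into how long a user waits for a full response. The sketch below converts the approximate tokens/sec numbers from the table above into generation time for an assumed 600-token reply; it ignores network latency and time to first token, and `generation_seconds` is an illustrative helper.

```python
def generation_seconds(output_tokens, tokens_per_second):
    """Time to stream a response at a steady decode speed."""
    return output_tokens / tokens_per_second


# Approximate speeds from the table above; 600 output tokens is an assumed reply size.
speeds = {"Llama 4 17B": 1_200, "Llama 3.3 70B": 800, "Mixtral 8x7B": 500}
timings = {model: generation_seconds(600, tps) for model, tps in speeds.items()}

for model, seconds in timings.items():
    print(f"{model}: {seconds:.2f}s for 600 tokens")
```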

Weaknesses

Limited model selection — only runs open-weight models
Not suitable for tasks requiring frontier model quality
Context windows capped at 128K even for Llama 3.1 405B

Best For

Real-time applications where latency matters (coding assistants, chat)
High-volume inference with open models
Testing and development with fast iteration loops

Together AI

Website: api.together.ai
Status: Production-ready

Notable Models

| Model | Input ($/1M) | Output ($/1M) |
|---|---|---|
| Llama 4 17B | $0.30 | $0.30 |
| Llama 3.1 405B | $3.50 | $3.50 |
| Llama 3.3 70B | $0.54 | $0.54 |
| DeepSeek R1 | $0.55 | $2.19 |
| DeepSeek V3 | $0.27 | $1.10 |
| Mistral 7B | $0.10 | $0.10 |
| DeepSeek Coder V2 | $0.14 | $0.28 |

Strengths

Largest selection of open-weight models on a single API
Fine-tuning support for most hosted models
Serverless and dedicated deployment options
Good documentation and SDK support

Best For

Experimenting with and comparing open models
Fine-tuning workflows on open models
Teams that want open-weight model quality with managed infrastructure

DeepSeek

Website: platform.deepseek.com
Status: Production-ready

Models

| Model | Context | Input ($/1M) | Output ($/1M) |
|---|---|---|---|
| DeepSeek R1 | 128K | $0.55 | $2.19 |
| DeepSeek V3 | 128K | $0.27 | $1.10 |

Strengths

DeepSeek R1 offers strong reasoning at a fraction of the cost of o3 — comparable GPQA Diamond scores (78.6%) at ~5% of the price
V3 is an excellent budget choice for coding tasks, competitive with GPT-4o mini in the same budget tier ($0.27/$1.10 vs. $0.15/$0.60 per 1M input/output tokens)
Both models support OpenAI-compatible API format for easy integration
Available via third-party inference providers (Together AI, Fireworks) as well as the direct API
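In practice, "OpenAI-compatible" means the same request body can be sent to either provider by swapping the base URL, API key, and model name. The sketch below builds (but does not send) such a request using only the standard library. The DeepSeek base URL `https://api.deepseek.com` and the model name `deepseek-reasoner` are assumptions that should be checked against DeepSeek's current documentation; `build_chat_request` is a hypothetical helper.

```python
import json
import urllib.request


def build_chat_request(base_url, api_key, model, prompt):
    """Build an OpenAI-style chat completions request without sending it."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Same helper, two providers: only the base URL, key, and model name change.
openai_req = build_chat_request("https://api.openai.com/v1", "your-key", "gpt-4o", "Hello")
deepseek_req = build_chat_request("https://api.deepseek.com", "your-key", "deepseek-reasoner", "Hello")

print(deepseek_req.full_url)
# To actually send it: urllib.request.urlopen(deepseek_req), with a real API key.
```

Keeping the provider-specific values in configuration rather than code is what makes switching providers, or running them side by side, cheap.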

Weaknesses

Smaller context window (128K) compared to Gemini or Claude
R1 can be verbose in its reasoning traces, increasing token costs for the thinking budget
Less ecosystem support and fewer third-party integrations than OpenAI/Anthropic
Data residency limited to US/Asia — not ideal for EU compliance

Best For

Cost-sensitive reasoning tasks where o3 pricing is prohibitive
Code generation and structured output at budget-friendly rates
Teams comfortable with OpenAI-compatible APIs and minimal vendor lock-in

Mistral AI

Website: console.mistral.ai
Status: Production-ready

Models

| Model | Context | Input ($/1M) | Output ($/1M) |
|---|---|---|---|
| Mistral Large 3 | 256K | $2.50 | $8.00 |
| Mistral Large 2 | 128K | $2.00 | $6.00 |
| Mistral Small 3 | 128K | $0.10 | $0.30 |
| Codestral | 256K | $0.20 | $0.60 |

Mistral Large 3 (released 2026) extends the context window to 256K and improves reasoning quality, narrowing the gap with Claude 4 Sonnet on complex coding tasks. Pricing remains competitive.

Strengths

Codestral is purpose-built for code generation and performs above its weight class
Strong European data residency options (GDPR-friendly by default)
Competitive pricing vs. comparable-quality models
Fill-in-the-middle (FIM) support for code completion tasks

Weaknesses

Smaller ecosystem than Anthropic/OpenAI
Mistral Large 2 (previous gen) trails Claude Sonnet and GPT-4o on complex reasoning; Large 3 closes this gap significantly

Best For

European teams with GDPR data residency requirements
Code completion and autocomplete (Codestral with FIM support)
Cost-sensitive teams wanting European-hosted inference

from mistralai import Mistral

client = Mistral(api_key="your-key")

# Codestral fill-in-the-middle for code completion
response = client.fim.complete(
    model="codestral-latest",
    prompt="def calculate_total(items):\n    ",
    suffix="\n    return total"
)

AWS Bedrock

Website: aws.amazon.com/bedrock
Status: Production-ready (enterprise)

Models Available

Claude (all Anthropic models), Llama (all Meta models), Mistral, Titan, Cohere, AI21 Jurassic — essentially a managed multi-provider API.

Strengths

Single billing relationship for multiple providers
Enterprise compliance: HIPAA, SOC 2, FedRAMP (for eligible models)
Native AWS IAM for access control
Guardrails built in for content filtering
VPC endpoints for private network access

Weaknesses

Adds latency vs. direct provider APIs
Sometimes lags direct providers on new model availability
More complex setup than direct provider SDKs

Best For

Enterprise teams already on AWS infrastructure
Healthcare, finance, or government applications needing compliance coverage
Teams wanting to avoid multiple vendor billing relationships

Embedding Providers

Don’t forget: you need embeddings for RAG pipelines. These are separate from generation APIs.

| Provider | Model | Dimensions | Cost ($/1M tokens) |
|---|---|---|---|
| OpenAI | text-embedding-3-large | 3072 | $0.13 |
| OpenAI | text-embedding-3-small | 1536 | $0.02 |
| Cohere | embed-v5.0 | 1024 | $0.10 |
| Voyage AI | voyage-3-large | 1024 | $0.06 |
| Google | text-embedding-004 | 768 | Free (Gemini API) |
| Anthropic | (uses third-party — no native embedding API) | — | — |

For most applications, text-embedding-3-small at $0.02/1M is the right default. Use voyage-3-large if you need best-in-class retrieval quality.
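In a RAG pipeline, the embedding model turns both the documents and the query into vectors, and retrieval ranks documents by vector similarity. The sketch below shows only that ranking step, using cosine similarity. The three-dimensional vectors are made-up stand-ins for real embeddings (which have 768 to 3072 dimensions per the table above), and `top_k` is a hypothetical helper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, docs, k=2):
    """Return the k document ids most similar to the query vector."""
    ranked = sorted(docs, key=lambda doc_id: cosine_similarity(query, docs[doc_id]),
                    reverse=True)
    return ranked[:k]


# Toy vectors standing in for embeddings returned by an embedding API.
docs = {
    "billing-faq": [0.9, 0.1, 0.0],
    "api-reference": [0.1, 0.9, 0.2],
    "changelog": [0.0, 0.2, 0.9],
}
query = [0.2, 0.8, 0.1]

print(top_k(query, docs))
```

At production scale the sorted scan is usually replaced by a vector database or an approximate nearest-neighbor index, but the similarity measure stays the same.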

Quick Provider Selection Guide

| If you need… | Use |
|---|---|
| Best code generation quality | Anthropic (Claude 4 Sonnet) |
| Cheapest high-volume inference | Google (Gemini Flash) |
| Best reasoning value | DeepSeek (R1) |
| Widest model selection / fine-tuning | OpenAI |
| Fastest inference latency | Groq |
| Data privacy / self-hosting | Meta Llama 4 + Ollama or Together AI |
| EU data residency | Mistral or Anthropic (EU endpoints) |
| Enterprise compliance (AWS) | AWS Bedrock |
| Code completion / FIM | Mistral Codestral |
| Long context (200K+ tokens) | Google Gemini 2.5/3 Pro |
| Complex reasoning | OpenAI o3, Gemini 3 Pro, or DeepSeek R1 |
