Multi-Provider LLM Orchestration: Architecture & Implementation
Relying on a single LLM provider is risky. Claude is excellent. OpenAI is robust. But no single provider is always available, always the cheapest, or the best fit for every task.
Production systems need multi-provider orchestration: request routing, automatic failover, cost optimization, and rate limiting across providers. Here's how to build it.
Why Multi-Provider Architecture?
Three compelling reasons:
- Availability. If the Claude API goes down, fall back to OpenAI. No service degradation.
- Cost. Claude 3.5 Sonnet costs $3/M input tokens. Gemini costs $0.0075/M (400x cheaper). Route long-context tasks to Gemini.
- Performance. Claude is best for reasoning. GPT-4o is best for code. Gemini 2.0 is best for images. Route to the right tool.
"Single-provider LLM systems are single points of failure. Production requires redundancy."
The Architecture: Four Layers
┌─────────────────────────────────┐
│          User Request           │
└────────────────┬────────────────┘
                 │
┌────────────────▼────────────────┐
│             Router              │ ← Decides which provider
├─────────────────────────────────┤
│ - Analyze request features      │
│ - Check provider health         │
│ - Apply rate limits             │
└────────────────┬────────────────┘
                 │
┌────────────────▼────────────────┐
│      Provider Abstraction       │ ← Unified interface
├─────────────────────────────────┤
│ - Claude client                 │
│ - OpenAI client                 │
│ - Gemini client                 │
└────────────────┬────────────────┘
                 │
┌────────────────▼────────────────┐
│          Provider APIs          │
└─────────────────────────────────┘
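Everything below assumes the Provider Abstraction layer gives each client the same shape, so the router and failover logic never care which vendor they're talking to. A minimal sketch (the field names and the normalized response shape are assumptions, not any vendor's SDK):

// Each client conforms to one interface: { name, healthy, call(request) }.
// The vendor wrappers are stubbed here; in practice each one calls the
// vendor's SDK and normalizes the response to something like { text, tokens }.
function makeProvider(name, sendFn) {
  return { name, healthy: true, call: sendFn }
}

const stubCall = (name) => async (request) => ({ text: `${name} reply`, tokens: 0 })

const providers = {
  claude: makeProvider("claude", stubCall("claude")),
  openai: makeProvider("openai", stubCall("openai")),
  gemini: makeProvider("gemini", stubCall("gemini")),
}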
Layer 1: The Router (Routing Logic)
The router decides which provider handles each request, based on:
- Input length. If the input is over 100k tokens, use Gemini (cheaper). Under 10k, use Claude (faster).
- Task type. Code generation → OpenAI. Reasoning → Claude. Summarization → Gemini.
- Provider health. Is the Claude API responding normally? If it's degraded, route to a backup.
- Rate limits. Are we at quota for this provider? Try the next one.
- Cost budget. Monthly spend at 80% of budget? Route to a cheaper provider.
function selectProvider(request) {
  // 1. Check health: if the primary is degraded, route to the backup
  if (!providers.claude.healthy) {
    return providers.openai
  }
  // 2. Route by task type
  if (request.task === "code_generation") {
    return providers.openai
  }
  // 3. Route by cost: long contexts go to the cheaper provider
  if (request.inputTokens > 100000) {
    return providers.gemini
  }
  // 4. Default
  return providers.claude
}
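The `healthy` flag has to come from somewhere. One simple approach (a sketch; the threshold is an arbitrary assumption) is to derive it from recent call outcomes. This also defines the `recordSuccess` and `recordFailure` hooks the failover code below relies on:

const failureCounts = new Map()
const FAILURE_THRESHOLD = 3 // consecutive failures before marking unhealthy

function recordSuccess(provider) {
  failureCounts.set(provider.name, 0)
  provider.healthy = true
}

function recordFailure(provider) {
  const count = (failureCounts.get(provider.name) || 0) + 1
  failureCounts.set(provider.name, count)
  if (count >= FAILURE_THRESHOLD) {
    provider.healthy = false // the router will skip this provider
  }
}

A production version would also re-probe unhealthy providers periodically so they can recover.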
Layer 2: Failover (Resilience)
When a provider fails, automatically retry with another:
async function callWithFailover(request) {
  const chain = [
    providers.claude, // primary
    providers.openai, // secondary
    providers.gemini, // tertiary
  ]
  for (const provider of chain) {
    try {
      const result = await provider.call(request)
      recordSuccess(provider)
      return result
    } catch (error) {
      recordFailure(provider)
      if (!shouldRetry(error)) {
        throw error // unrecoverable, e.g. a malformed request
      }
      // otherwise fall through to the next provider in the chain
    }
  }
  throw new Error("All providers exhausted")
}
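`shouldRetry` separates transient failures from permanent ones, and the exponential backoff mentioned in the takeaways keeps us from hammering a struggling provider between attempts. A sketch, assuming errors are normalized to carry an HTTP-style `status` field:

function shouldRetry(error) {
  // Retry on rate limits (429) and transient server errors (5xx);
  // a 4xx validation error will fail on every provider, so don't.
  return error.status === 429 || error.status >= 500
}

async function backoff(attempt) {
  // Exponential backoff with jitter: ~1s, 2s, 4s, ... capped at 30s
  const base = Math.min(1000 * 2 ** attempt, 30000)
  const jitter = Math.random() * base * 0.1
  await new Promise((resolve) => setTimeout(resolve, base + jitter))
}

Wiring it in means tracking an attempt index in the loop above and awaiting `backoff(attempt)` before trying the next provider.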
Layer 3: Rate Limiting & Quotas
Each provider has limits. Track them:
class RateLimiter {
  constructor(limits) {
    // limits: { providerName: { max } }; used/lastReset are tracked here
    this.limits = {}
    for (const [name, { max }] of Object.entries(limits)) {
      this.limits[name] = { max, used: 0, lastReset: Date.now() }
    }
  }

  async acquire(provider, tokens) {
    const limit = this.limits[provider]
    // Reset the daily window first, so a stale counter can't block fresh quota
    if (Date.now() - limit.lastReset > 24 * 60 * 60 * 1000) {
      limit.used = 0
      limit.lastReset = Date.now()
    }
    // Check quota, then claim it
    if (limit.used + tokens > limit.max) {
      throw new Error(`Rate limit exceeded for ${provider}`)
    }
    limit.used += tokens
  }
}
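Usage is a check-before-call pattern. A sketch, with illustrative daily token budgets:

const limiter = new RateLimiter({
  claude: { max: 10_000_000 },
  openai: { max: 5_000_000 },
  gemini: { max: 50_000_000 },
})

async function callWithQuota(provider, request) {
  // Throws if this provider is out of quota; the caller catches the
  // error and falls through to the next provider in the chain.
  await limiter.acquire(provider.name, request.inputTokens)
  return provider.call(request)
}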
Layer 4: Cost Optimization
Track spending and optimize:
- Cache responses. Identical requests shouldn't cost twice. Use Redis.
- Batch requests. When possible, combine multiple queries into one (lower cost per request).
- Prompt caching. Claude and GPT-4o support prompt caching. Reuse expensive system prompts.
- Cheaper models for simple tasks. Use Claude 3.5 Haiku for classification. Save Sonnet for complex reasoning.
async function callOptimized(request) {
  // 1. Check the cache first: identical requests shouldn't cost twice
  const cached = await cache.get(request.hash)
  if (cached) {
    return cached
  }
  // 2. Route intelligently
  const provider = selectProvider(request)
  const result = await provider.call(request)
  // 3. Cache the result for future identical requests
  await cache.set(request.hash, result, TTL)
  // 4. Log the cost for observability
  recordCost(provider, result.tokens)
  return result
}
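The `cache` object above can be a thin wrapper over Redis, per the list above. A sketch using the node-redis client, assuming an ESM module (the key prefix and hashing scheme are my own choices):

import { createClient } from "redis"
import { createHash } from "node:crypto"

const redis = createClient()
await redis.connect()

const TTL = 60 * 60 // one hour, in seconds; tune per workload

const cache = {
  async get(hash) {
    const hit = await redis.get(`llm:${hash}`)
    return hit ? JSON.parse(hit) : null
  },
  async set(hash, result, ttlSeconds) {
    await redis.set(`llm:${hash}`, JSON.stringify(result), { EX: ttlSeconds })
  },
}

// request.hash can be a digest of the prompt plus any parameters that
// affect the output (model, temperature, system prompt):
function hashRequest(request) {
  return createHash("sha256").update(JSON.stringify(request)).digest("hex")
}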
Real Example: Incident Resolution at alt.bank
Our incident resolution agent needed a multi-provider setup:
- Quick checks (< 1000 tokens): Use Claude 3.5 Haiku ($0.80/M input tokens)
- Root cause analysis (1k-10k tokens): Use Claude 3.5 Sonnet ($3/M)
- Log analysis (> 10k tokens): Use Gemini 2.0 ($0.0075/M)
- Failover: if Claude is down, fall back to OpenAI
Result: 70% cost reduction while improving latency through smart routing.
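That tiering maps directly onto a small routing table. A sketch with the thresholds from this setup (the model IDs are illustrative, not exact API strings):

function selectModelTier(inputTokens) {
  if (inputTokens < 1000) {
    return { provider: "claude", model: "claude-3.5-haiku" } // quick checks
  }
  if (inputTokens <= 10000) {
    return { provider: "claude", model: "claude-3.5-sonnet" } // root cause analysis
  }
  return { provider: "gemini", model: "gemini-2.0" } // long log analysis
}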
Observability: What to Log
Log everything for optimization:
{
  "request_id": "uuid",
  "selected_provider": "claude",
  "failover_attempts": 0,
  "input_tokens": 2500,
  "output_tokens": 800,
  "cost_usd": 0.012,
  "latency_ms": 450,
  "cache_hit": false,
  "timestamp": "2026-05-09T10:30:00Z"
}
Analyze this data weekly. Find patterns (say, "Gemini is consistently slower on code tasks") and adjust routing accordingly.
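Even a simple weekly rollup of those records surfaces the patterns. A sketch, assuming an array of log records in the JSON shape above:

function summarizeWeek(records) {
  const byProvider = new Map()
  for (const r of records) {
    const s = byProvider.get(r.selected_provider) ?? { calls: 0, cost: 0, latencies: [] }
    s.calls += 1
    s.cost += r.cost_usd
    s.latencies.push(r.latency_ms)
    byProvider.set(r.selected_provider, s)
  }
  for (const [provider, s] of byProvider) {
    s.latencies.sort((a, b) => a - b)
    const p95 = s.latencies[Math.floor(s.latencies.length * 0.95)]
    console.log(`${provider}: ${s.calls} calls, $${s.cost.toFixed(2)}, p95 ${p95}ms`)
  }
}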
Key Takeaways
- Single provider = single point of failure. Use multi-provider for production.
- Route based on task type, input size, provider health, and cost.
- Implement automatic failover with exponential backoff.
- Cache responses to reduce redundant calls.
- Use prompt caching (Claude, GPT-4o) for expensive system prompts.
- Log everything; optimize based on data.
- Monitor cost weekly—LLM bills compound fast without controls.
Multi-provider orchestration is complex but necessary for production LLM systems. Start simple, add layers as you scale.
Building LLM infrastructure?
I specialize in production AI systems, multi-provider orchestration, and cost optimization.