Intelligent model selection and routing patterns for AI orchestration. Route requests to the right model based on cost, capability, latency, and reliability constraints.
Model routing embodies the economic shift in industrialized intelligence: models become fungible, interchangeable components. Route work to the cheapest capable provider, not the "best" model.
Core insight: Intelligence pricing determines architecture. When intelligence becomes cheap and fungible, routing optimization becomes a primary value driver. Read the full framework in Philosophy: The Industrialization of Intelligence.
Route each request to the cheapest model that can reliably handle it.
Not the best model. Not the smartest model. The cheapest model that meets your quality threshold. This is the fundamental shift from "AI as magic" to "AI as infrastructure."
Route requests based on cost optimization: try cheaper models first, falling back to expensive models only when necessary.
```typescript
// Vercel AI SDK - Cost-optimized waterfall routing
import { generateText } from 'ai';
import { openai } from '@ai-sdk/openai';
import { anthropic } from '@ai-sdk/anthropic';

async function generateWithCostOptimization(prompt: string) {
  const models = [
    { provider: openai('gpt-4o-mini'), cost: 0.15, name: 'GPT-4o Mini' },
    { provider: openai('gpt-4o'), cost: 2.5, name: 'GPT-4o' },
    { provider: anthropic('claude-3-5-sonnet-20241022'), cost: 3.0, name: 'Claude 3.5 Sonnet' },
  ];

  for (const model of models) {
    try {
      const result = await generateText({
        model: model.provider,
        prompt,
        maxTokens: 500,
      });

      // Validate quality (simple check - customize for your needs)
      if (result.text.length > 50 && !result.text.includes('[ERROR]')) {
        console.log(`✅ Success with ${model.name} ($${model.cost}/1M tokens)`);
        return { text: result.text, model: model.name, cost: model.cost };
      }
    } catch (error) {
      console.log(`❌ Failed with ${model.name}, trying next...`);
    }
  }

  throw new Error('All models failed');
}
```

Cost Reality Check: GPT-4o Mini at $0.15/1M input tokens is 20x cheaper than Claude 3.5 Sonnet at $3.00/1M. For many tasks, the quality difference doesn't justify a 20x cost multiplier.
Route based on task complexity and required capabilities. Match task characteristics to model strengths.
```typescript
// Route based on task complexity
import { generateText } from 'ai';
import { openai } from '@ai-sdk/openai';
import { anthropic } from '@ai-sdk/anthropic';

type TaskComplexity = 'simple' | 'moderate' | 'complex' | 'reasoning';

interface RoutingDecision {
  model: any;
  name: string;
  reason: string;
}

function selectModelByComplexity(complexity: TaskComplexity): RoutingDecision {
  switch (complexity) {
    case 'simple':
      // Classification, extraction, simple Q&A
      return {
        model: openai('gpt-4o-mini'),
        name: 'GPT-4o Mini',
        reason: 'Fast and cheap for simple tasks',
      };
    case 'moderate':
      // Summarization, moderate writing, basic reasoning
      return {
        model: openai('gpt-4o'),
        name: 'GPT-4o',
        reason: 'Balanced cost/performance for moderate complexity',
      };
    case 'complex':
      // Long-form writing, complex analysis, nuanced tasks
      return {
        model: anthropic('claude-3-5-sonnet-20241022'),
        name: 'Claude 3.5 Sonnet',
        reason: 'Superior performance on complex creative and analytical tasks',
      };
    case 'reasoning':
      // Mathematical reasoning, logic, step-by-step problem solving
      return {
        model: openai('o1-mini'),
        name: 'OpenAI o1-mini',
        reason: 'Specialized reasoning model with extended thinking time',
      };
  }
}

async function generateByComplexity(
  prompt: string,
  complexity: TaskComplexity
) {
  const { model, name, reason } = selectModelByComplexity(complexity);
  console.log(`Routing to ${name}: ${reason}`);

  const result = await generateText({
    model,
    prompt,
    maxTokens: 1000,
  });

  return { text: result.text, model: name };
}
```

| Task Type | Recommended Model | Why |
|---|---|---|
| Classification | GPT-4o Mini | Fast, cheap, accurate enough |
| Data extraction | GPT-4o Mini | Structured output at low cost |
| Creative writing | Claude 3.5 Sonnet | Superior creative capabilities |
| Code generation | GPT-4o / Claude 3.5 | Both excel, choose by ecosystem |
| Math/reasoning | o1-mini / o1 | Specialized reasoning capabilities |
| Long context | Claude 3.5 Sonnet | 200K context, better recall |
| Real-time chat | GPT-4o | Fast response, good streaming |
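A quick usage sketch of the `generateByComplexity` helper above; the prompts are illustrative:

```typescript
// Cheap model for a simple classification task
const label = await generateByComplexity(
  'Classify the sentiment of: "The checkout flow keeps crashing."',
  'simple'
);

// Specialized reasoning model for a math problem
const proof = await generateByComplexity(
  'Show that the sum of two odd integers is even.',
  'reasoning'
);
```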
Route based on response time requirements. User-facing features need fast models; background jobs can use slower, cheaper models.
```typescript
// Route based on latency requirements
import { generateText } from 'ai';
import { openai } from '@ai-sdk/openai';
import { anthropic } from '@ai-sdk/anthropic';

type LatencyRequirement = 'realtime' | 'interactive' | 'background';

interface ModelPerformance {
  model: any;
  name: string;
  avgLatency: number; // milliseconds
  cost: number; // per 1M tokens
}

const modelsByLatency: Record<LatencyRequirement, ModelPerformance> = {
  realtime: {
    // <500ms target - user typing, autocomplete
    model: openai('gpt-4o-mini'),
    name: 'GPT-4o Mini',
    avgLatency: 300,
    cost: 0.15,
  },
  interactive: {
    // <2s target - chat responses, form submissions
    model: openai('gpt-4o'),
    name: 'GPT-4o',
    avgLatency: 1200,
    cost: 2.5,
  },
  background: {
    // >2s acceptable - reports, batch processing
    model: anthropic('claude-3-5-sonnet-20241022'),
    name: 'Claude 3.5 Sonnet',
    avgLatency: 2500,
    cost: 3.0,
  },
};

async function generateByLatency(
  prompt: string,
  requirement: LatencyRequirement
) {
  const modelConfig = modelsByLatency[requirement];
  console.log(
    `Latency requirement: ${requirement} (~${modelConfig.avgLatency}ms)`
  );

  const result = await generateText({
    model: modelConfig.model,
    prompt,
    maxTokens: 500,
  });

  return { text: result.text, model: modelConfig.name };
}
```

Latency Reality: Model response time varies significantly. GPT-4o Mini averages 300-500ms for short completions. Claude 3.5 Sonnet can take 2-4s for similar tasks. For user-facing features, latency often matters more than quality.
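When latency is a hard constraint, you can enforce the budget directly instead of trusting averages. A minimal sketch using `generateText`'s `abortSignal` option; the 1500ms budget is an illustrative number, not a recommendation:

```typescript
import { generateText } from 'ai';
import { openai } from '@ai-sdk/openai';

async function generateWithLatencyBudget(prompt: string, budgetMs = 1500) {
  try {
    return await generateText({
      model: openai('gpt-4o'),
      prompt,
      maxTokens: 500,
      // Abort the request if it exceeds the latency budget
      abortSignal: AbortSignal.timeout(budgetMs),
    });
  } catch {
    // Budget blown (or request failed): retry on the fastest, cheapest model
    return await generateText({
      model: openai('gpt-4o-mini'),
      prompt,
      maxTokens: 500,
    });
  }
}
```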
Combine cost, capability, and latency constraints. This is what you actually use in production.
```typescript
// Production-ready router combining multiple factors
import { generateText } from 'ai';
import { openai } from '@ai-sdk/openai';
import { anthropic } from '@ai-sdk/anthropic';

interface RoutingContext {
  taskType: 'classification' | 'generation' | 'reasoning' | 'creative';
  latencyBudget: number; // milliseconds
  costBudget: number; // per request
  userTier: 'free' | 'pro' | 'enterprise';
}

interface ModelConfig {
  provider: any;
  name: string;
  cost: number; // per 1M tokens
  avgLatency: number; // ms
  capabilities: string[];
}

const MODEL_REGISTRY: ModelConfig[] = [
  {
    provider: openai('gpt-4o-mini'),
    name: 'GPT-4o Mini',
    cost: 0.15,
    avgLatency: 300,
    // Handles simple generation too, so free-tier requests have a match
    capabilities: ['classification', 'extraction', 'generation'],
  },
  {
    provider: openai('gpt-4o'),
    name: 'GPT-4o',
    cost: 2.5,
    avgLatency: 1200,
    capabilities: ['generation', 'reasoning', 'creative', 'code'],
  },
  {
    provider: anthropic('claude-3-5-sonnet-20241022'),
    name: 'Claude 3.5 Sonnet',
    cost: 3.0,
    avgLatency: 2500,
    capabilities: ['creative', 'reasoning', 'long-context', 'nuanced'],
  },
  {
    provider: openai('o1-mini'),
    name: 'OpenAI o1-mini',
    cost: 3.0,
    avgLatency: 5000,
    capabilities: ['reasoning', 'math', 'logic'],
  },
];

function selectModel(context: RoutingContext): ModelConfig {
  // Filter by capability
  let candidates = MODEL_REGISTRY.filter((m) =>
    m.capabilities.includes(context.taskType)
  );

  // Filter by latency budget
  candidates = candidates.filter((m) => m.avgLatency <= context.latencyBudget);

  // Filter by cost budget (estimate: 500 tokens @ $X/1M)
  const estimatedTokens = 500;
  candidates = candidates.filter((m) => {
    const estimatedCost = (m.cost / 1_000_000) * estimatedTokens;
    return estimatedCost <= context.costBudget;
  });

  // Apply user tier constraints
  if (context.userTier === 'free') {
    // Free tier: only cheapest models
    candidates = candidates.filter((m) => m.cost < 1.0);
  }

  // Sort by cost (cheapest first) and return the first match
  candidates.sort((a, b) => a.cost - b.cost);

  if (candidates.length === 0) {
    throw new Error(
      `No model available for constraints: ${JSON.stringify(context)}`
    );
  }

  return candidates[0];
}

async function generateWithRouting(
  prompt: string,
  context: RoutingContext
) {
  const model = selectModel(context);

  console.log(`
🎯 Model Selection:
  Task: ${context.taskType}
  Model: ${model.name}
  Cost: $${model.cost}/1M tokens
  Latency: ~${model.avgLatency}ms
  User Tier: ${context.userTier}
`);

  const result = await generateText({
    model: model.provider,
    prompt,
    maxTokens: 500,
  });

  return {
    text: result.text,
    model: model.name,
    estimatedCost: (model.cost / 1_000_000) * 500,
  };
}

// Usage example
const result = await generateWithRouting(
  'Summarize this article: ...',
  {
    taskType: 'generation',
    latencyBudget: 2000, // 2s max
    costBudget: 0.01, // 1 cent max
    userTier: 'free',
  }
);
```

Model routing is fundamentally about economics. Here's what the numbers actually look like:
| Model | Input ($/1M) | Output ($/1M) | Cost per 1K Request* | Relative Cost |
|---|---|---|---|---|
| GPT-4o Mini | $0.15 | $0.60 | $0.38 | 1x (baseline) |
| GPT-4o | $2.50 | $10.00 | $6.25 | 16x |
| Claude 3.5 Sonnet | $3.00 | $15.00 | $9.00 | 24x |
| OpenAI o1-mini | $3.00 | $12.00 | $7.50 | 20x |
| GPT-4 Turbo | $10.00 | $30.00 | $20.00 | 53x |
* Estimated cost per 1,000 requests assuming 500 input tokens, 500 output tokens per request
Using Claude 3.5 Sonnet for every request when GPT-4o Mini would suffice costs 24x more. At 1M requests/month, that's $9,000 vs $380. Smart routing isn't optional—it's survival.
Scenario: SaaS app with 100K requests/day (3M/month)
• No routing (all Claude 3.5): 3M × $0.009 = $27,000/month
• Simple routing (70% GPT-4o Mini, 30% Claude 3.5):
2.1M × $0.00038 + 900K × $0.009 = $8,898/month
💰 Savings: $18,102/month ($217,224/year)
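The same arithmetic as a small helper, if you want to model your own traffic mix. The per-request costs are the rounded figures from the table above (500 input + 500 output tokens); the 70/30 split is the scenario's assumption:

```typescript
// Rounded per-request costs from the pricing table above
const COST_PER_REQUEST = {
  'gpt-4o-mini': 0.00038,
  'claude-3-5-sonnet': 0.009,
};

function monthlyCost(
  requestsPerMonth: number,
  mix: Array<{ model: keyof typeof COST_PER_REQUEST; share: number }>
): number {
  return mix.reduce(
    (total, { model, share }) =>
      total + requestsPerMonth * share * COST_PER_REQUEST[model],
    0
  );
}

// 3M requests/month, 70% routed to GPT-4o Mini, 30% to Claude 3.5 Sonnet
const blended = monthlyCost(3_000_000, [
  { model: 'gpt-4o-mini', share: 0.7 },
  { model: 'claude-3-5-sonnet', share: 0.3 },
]); // ≈ $8,898/month vs $27,000 with no routing
```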
Models fail. APIs have outages. Always have a fallback plan.
```typescript
// Waterfall fallback: try each model in order until one succeeds
import { generateText } from 'ai';
import { openai } from '@ai-sdk/openai';
import { anthropic } from '@ai-sdk/anthropic';

async function generateWithFallback(prompt: string) {
  const models = [
    openai('gpt-4o-mini'),
    openai('gpt-4o'),
    anthropic('claude-3-5-sonnet-20241022'),
  ];

  for (const model of models) {
    try {
      return await generateText({ model, prompt, maxTokens: 500 });
    } catch (error) {
      console.error(`Model failed, trying next: ${error}`);
    }
  }

  throw new Error('All models failed');
}
```

Different models have different rate limits. Route around congestion.
Rate limits vary by provider and account tier. If you hit rate limits on your primary model, automatically route to a secondary model instead of failing, as in the sketch below.
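A minimal sketch of that pattern, assuming the AI SDK's `APICallError` exposes the HTTP status so a 429 can be told apart from other failures; adapt the check to your stack:

```typescript
import { generateText, APICallError } from 'ai';
import { openai } from '@ai-sdk/openai';
import { anthropic } from '@ai-sdk/anthropic';

async function generateAroundRateLimits(prompt: string) {
  try {
    return await generateText({
      model: openai('gpt-4o'),
      prompt,
      maxTokens: 500,
    });
  } catch (error) {
    // 429 = rate limited: route to the secondary provider instead of failing
    if (APICallError.isInstance(error) && error.statusCode === 429) {
      console.warn('Primary rate-limited, routing to secondary provider');
      return await generateText({
        model: anthropic('claude-3-5-sonnet-20241022'),
        prompt,
        maxTokens: 500,
      });
    }
    throw error;
  }
}
```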
Track which models are actually being used and their performance.
```typescript
// Add telemetry to routing decisions
import { generateText } from 'ai';

// Stand-in for your analytics client (PostHog, Segment, etc.)
declare const analytics: {
  track: (event: string, props: Record<string, unknown>) => void;
};

// Reuses selectModel and RoutingContext from the multi-factor router above
async function generateWithTelemetry(prompt: string, context: RoutingContext) {
  const model = selectModel(context);
  const startTime = Date.now();

  try {
    const result = await generateText({
      model: model.provider,
      prompt,
      maxTokens: 500,
    });

    const latency = Date.now() - startTime;

    // Log to your analytics
    analytics.track('model_routing', {
      model: model.name,
      taskType: context.taskType,
      latency,
      success: true,
      userTier: context.userTier,
      estimatedCost: (model.cost / 1_000_000) * 500,
    });

    return result;
  } catch (error) {
    analytics.track('model_routing', {
      model: model.name,
      taskType: context.taskType,
      success: false,
      error: error instanceof Error ? error.message : String(error),
    });
    throw error;
  }
}
```

Don't guess which routing strategy works best. Measure it.
Run A/B tests with different routing strategies and measure user engagement, satisfaction scores, and retention. A minimal bucketing sketch follows.
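A deterministic bucketing sketch so each user consistently sees one strategy; the strategy names and 50/50 split are illustrative:

```typescript
import { createHash } from 'node:crypto';

type RoutingStrategy = 'cost-waterfall' | 'multi-factor';

// Hash the user ID so the same user always lands in the same bucket
function assignRoutingStrategy(userId: string): RoutingStrategy {
  const firstByte = createHash('sha256').update(userId).digest()[0];
  return firstByte < 128 ? 'cost-waterfall' : 'multi-factor';
}

// Tag your analytics events with the assigned strategy to compare
// engagement, satisfaction, and retention across buckets
```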
Claude 3.5 Sonnet is amazing, but using it for simple classification tasks is burning money. Most tasks don't need the best model.
When your primary model has an outage (and it will), your entire app goes down. Always have a backup model ready.
Saving $0.002 per request doesn't matter if users bounce because your app feels slow. Factor latency into routing.
You can't optimize what you don't measure. Track which models are used, their latency, cost, and success rates.
Start simple: cost-based waterfall routing. Add complexity only when you have data showing you need it.
Try cheap models first, fallback to expensive ones only when necessary. This single pattern handles 80% of routing needs.
Check whether the cheaper model's output meets your quality threshold before accepting it. Simple checks: length, format, keyword presence (see the sketch after these practices).
User-facing: optimize for latency. Background: optimize for cost. Don't use the same routing strategy for both.
Track model usage, latency, cost, and success rates. You'll need this data to optimize routing decisions.
A/B test different routing approaches. Measure user satisfaction, not just cost savings. Sometimes expensive models are worth it.
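A minimal quality gate along those lines; the thresholds and required keywords are placeholders to tune per task:

```typescript
interface QualityCheck {
  minLength: number;
  mustMatch?: RegExp; // e.g. expected output format
  requiredKeywords?: string[];
}

function meetsQualityThreshold(text: string, check: QualityCheck): boolean {
  if (text.length < check.minLength) return false;
  if (check.mustMatch && !check.mustMatch.test(text)) return false;
  if (check.requiredKeywords?.some((kw) => !text.includes(kw))) return false;
  return true;
}

// Accept the cheap model's answer only if it passes; otherwise escalate
// to the next model in the waterfall
```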
Related:
• Sequential, parallel, and hierarchical multi-agent workflows. Orchestrate routed models into complex systems.
• Implementation examples using Vercel AI SDK 5.0. Model routing, streaming, and multi-provider integration.
• Strategic framework explaining why model routing matters: fungible intelligence, economics, and infrastructure thinking.
• Coming soon: managing state across routed models. Memory, context, and temporal continuity.