Model Routing Strategies

Intelligent model selection and routing patterns for AI orchestration. Route requests to the right model based on cost, capability, latency, and reliability constraints.

Strategic Foundation: Intelligence as Fungible Infrastructure

Model routing embodies the economic shift in industrialized intelligence: models become fungible, interchangeable components. Route work to the cheapest capable provider, not the "best" model.

Core insight: Intelligence pricing determines architecture. When intelligence becomes cheap and fungible, routing optimization becomes a primary value driver. Read the full framework in Philosophy: The Industrialization of Intelligence.

The Core Principle

Route each request to the cheapest model that can reliably handle it.

Not the best model. Not the smartest model. The cheapest model that meets your quality threshold. This is the fundamental shift from "AI as magic" to "AI as infrastructure."

Routing Strategies

1. Cost-Based Routing (The Default)

Route requests based on cost optimization: try cheaper models first and fall back to expensive models only when necessary.

Pattern: Waterfall Routing

// Vercel AI SDK - Cost-optimized waterfall routing
import { generateText } from 'ai';
import { openai } from '@ai-sdk/openai';
import { anthropic } from '@ai-sdk/anthropic';

async function generateWithCostOptimization(prompt: string) {
  const models = [
    { provider: openai('gpt-4o-mini'), cost: 0.15, name: 'GPT-4o Mini' },
    { provider: openai('gpt-4o'), cost: 2.5, name: 'GPT-4o' },
    { provider: anthropic('claude-3-5-sonnet-20241022'), cost: 3.0, name: 'Claude 3.5 Sonnet' },
  ];

  for (const model of models) {
    try {
      const result = await generateText({
        model: model.provider,
        prompt,
        maxTokens: 500,
      });

      // Validate quality (simple check - customize for your needs)
      if (result.text.length > 50 && !result.text.includes('[ERROR]')) {
        console.log(`✅ Success with ${model.name} ($${model.cost}/1M tokens)`);
        return { text: result.text, model: model.name, cost: model.cost };
      }
    } catch (error) {
      console.log(`❌ Failed with ${model.name}, trying next...`);
      continue;
    }
  }

  throw new Error('All models failed');
}

Cost Reality Check: GPT-4o Mini at $0.15/1M input tokens is 20x cheaper than Claude 3.5 Sonnet at $3.00/1M. For many tasks, the quality difference doesn't justify a 20x cost multiplier.

2. Capability-Based Routing

Route based on task complexity and required capabilities. Match task characteristics to model strengths.

Pattern: Task Classification Router

// Route based on task complexity
import { generateText } from 'ai';
import { openai } from '@ai-sdk/openai';
import { anthropic } from '@ai-sdk/anthropic';

type TaskComplexity = 'simple' | 'moderate' | 'complex' | 'reasoning';

interface RoutingDecision {
  model: any;
  name: string;
  reason: string;
}

function selectModelByComplexity(complexity: TaskComplexity): RoutingDecision {
  switch (complexity) {
    case 'simple':
      // Classification, extraction, simple Q&A
      return {
        model: openai('gpt-4o-mini'),
        name: 'GPT-4o Mini',
        reason: 'Fast and cheap for simple tasks',
      };

    case 'moderate':
      // Summarization, moderate writing, basic reasoning
      return {
        model: openai('gpt-4o'),
        name: 'GPT-4o',
        reason: 'Balanced cost/performance for moderate complexity',
      };

    case 'complex':
      // Long-form writing, complex analysis, nuanced tasks
      return {
        model: anthropic('claude-3-5-sonnet-20241022'),
        name: 'Claude 3.5 Sonnet',
        reason: 'Superior performance on complex creative and analytical tasks',
      };

    case 'reasoning':
      // Mathematical reasoning, logic, step-by-step problem solving
      return {
        model: openai('o1-mini'),
        name: 'OpenAI o1-mini',
        reason: 'Specialized reasoning model with extended thinking time',
      };
  }
}

async function generateByComplexity(
  prompt: string,
  complexity: TaskComplexity
) {
  const { model, name, reason } = selectModelByComplexity(complexity);

  console.log(`Routing to ${name}: ${reason}`);

  const result = await generateText({
    model,
    prompt,
    maxTokens: 1000,
  });

  return { text: result.text, model: name };
}

Capability Mapping Guide:

| Task Type | Recommended Model | Why |
| --- | --- | --- |
| Classification | GPT-4o Mini | Fast, cheap, accurate enough |
| Data extraction | GPT-4o Mini | Structured output at low cost |
| Creative writing | Claude 3.5 Sonnet | Superior creative capabilities |
| Code generation | GPT-4o / Claude 3.5 | Both excel, choose by ecosystem |
| Math/reasoning | o1-mini / o1 | Specialized reasoning capabilities |
| Long context | Claude 3.5 Sonnet | 200K context, better recall |
| Real-time chat | GPT-4o | Fast response, good streaming |
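The mapping above can also be kept as data instead of a switch statement, which makes it easier to update as models and pricing change. A minimal sketch, using plain string identifiers (swap in real provider instances like `openai('gpt-4o-mini')` in an actual app):

```typescript
// Data-driven capability map: task type -> recommended model.
// Model names here are illustrative strings, not provider instances.
type TaskType =
  | 'classification'
  | 'extraction'
  | 'creative-writing'
  | 'code-generation'
  | 'reasoning'
  | 'long-context'
  | 'realtime-chat';

const CAPABILITY_MAP: Record<TaskType, { model: string; why: string }> = {
  classification: { model: 'gpt-4o-mini', why: 'Fast, cheap, accurate enough' },
  extraction: { model: 'gpt-4o-mini', why: 'Structured output at low cost' },
  'creative-writing': { model: 'claude-3-5-sonnet', why: 'Superior creative capabilities' },
  'code-generation': { model: 'gpt-4o', why: 'Both GPT-4o and Claude excel; choose by ecosystem' },
  reasoning: { model: 'o1-mini', why: 'Specialized reasoning capabilities' },
  'long-context': { model: 'claude-3-5-sonnet', why: '200K context, better recall' },
  'realtime-chat': { model: 'gpt-4o', why: 'Fast response, good streaming' },
};

function recommendModel(task: TaskType): string {
  return CAPABILITY_MAP[task].model;
}
```

Because the table is data, re-pointing a task type at a cheaper model is a one-line change rather than a code change.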

3. Latency-Based Routing

Route based on response time requirements. User-facing features need fast models, background jobs can use slower, cheaper models.

Pattern: Latency-Aware Router

// Route based on latency requirements
import { generateText } from 'ai';
import { openai } from '@ai-sdk/openai';
import { anthropic } from '@ai-sdk/anthropic';

type LatencyRequirement = 'realtime' | 'interactive' | 'background';

interface ModelPerformance {
  model: any;
  name: string;
  avgLatency: number; // milliseconds
  cost: number; // per 1M tokens
}

const modelsByLatency: Record<LatencyRequirement, ModelPerformance> = {
  realtime: {
    // <500ms target - user typing, autocomplete
    model: openai('gpt-4o-mini'),
    name: 'GPT-4o Mini',
    avgLatency: 300,
    cost: 0.15,
  },
  interactive: {
    // <2s target - chat responses, form submissions
    model: openai('gpt-4o'),
    name: 'GPT-4o',
    avgLatency: 1200,
    cost: 2.5,
  },
  background: {
    // >2s acceptable - reports, batch processing
    model: anthropic('claude-3-5-sonnet-20241022'),
    name: 'Claude 3.5 Sonnet',
    avgLatency: 2500,
    cost: 3.0,
  },
};

async function generateByLatency(
  prompt: string,
  requirement: LatencyRequirement
) {
  const modelConfig = modelsByLatency[requirement];

  console.log(
    `Latency requirement: ${requirement} (~${modelConfig.avgLatency}ms)`
  );

  const result = await generateText({
    model: modelConfig.model,
    prompt,
    maxTokens: 500,
  });

  return { text: result.text, model: modelConfig.name };
}

Latency Reality: Model response time varies significantly. GPT-4o Mini averages 300-500ms for short completions. Claude 3.5 Sonnet can take 2-4s for similar tasks. For user-facing features, latency often matters more than quality.

4. Hybrid Routing (Production Pattern)

Combine cost, capability, and latency constraints. This is what you actually use in production.

Pattern: Production Router with Multi-Factor Decision

// Production-ready router combining multiple factors
import { generateText } from 'ai';
import { openai } from '@ai-sdk/openai';
import { anthropic } from '@ai-sdk/anthropic';

interface RoutingContext {
  taskType: 'classification' | 'generation' | 'reasoning' | 'creative';
  latencyBudget: number; // milliseconds
  costBudget: number; // per request
  userTier: 'free' | 'pro' | 'enterprise';
}

interface ModelConfig {
  provider: any;
  name: string;
  cost: number; // per 1M tokens
  avgLatency: number; // ms
  capabilities: string[];
}

const MODEL_REGISTRY: ModelConfig[] = [
  {
    provider: openai('gpt-4o-mini'),
    name: 'GPT-4o Mini',
    cost: 0.15,
    avgLatency: 300,
    capabilities: ['classification', 'extraction', 'simple-generation'],
  },
  {
    provider: openai('gpt-4o'),
    name: 'GPT-4o',
    cost: 2.5,
    avgLatency: 1200,
    capabilities: ['generation', 'reasoning', 'creative', 'code'],
  },
  {
    provider: anthropic('claude-3-5-sonnet-20241022'),
    name: 'Claude 3.5 Sonnet',
    cost: 3.0,
    avgLatency: 2500,
    capabilities: ['creative', 'reasoning', 'long-context', 'nuanced'],
  },
  {
    provider: openai('o1-mini'),
    name: 'OpenAI o1-mini',
    cost: 3.0,
    avgLatency: 5000,
    capabilities: ['reasoning', 'math', 'logic'],
  },
];

function selectModel(context: RoutingContext): ModelConfig {
  // Filter by capability
  let candidates = MODEL_REGISTRY.filter((m) =>
    m.capabilities.includes(context.taskType)
  );

  // Filter by latency budget
  candidates = candidates.filter((m) => m.avgLatency <= context.latencyBudget);

  // Filter by cost budget (estimate: 500 tokens @ $X/1M)
  const estimatedTokens = 500;
  candidates = candidates.filter((m) => {
    const estimatedCost = (m.cost / 1_000_000) * estimatedTokens;
    return estimatedCost <= context.costBudget;
  });

  // Apply user tier constraints
  if (context.userTier === 'free') {
    // Free tier: only cheapest models
    candidates = candidates.filter((m) => m.cost < 1.0);
  }

  // Sort by cost (cheapest first) and return the first match
  candidates.sort((a, b) => a.cost - b.cost);

  if (candidates.length === 0) {
    throw new Error(
      `No model available for constraints: ${JSON.stringify(context)}`
    );
  }

  return candidates[0];
}

async function generateWithRouting(
  prompt: string,
  context: RoutingContext
) {
  const model = selectModel(context);

  console.log(`
🎯 Model Selection:
   Task: ${context.taskType}
   Model: ${model.name}
   Cost: $${model.cost}/1M tokens
   Latency: ~${model.avgLatency}ms
   User Tier: ${context.userTier}
  `);

  const result = await generateText({
    model: model.provider,
    prompt,
    maxTokens: 500,
  });

  return {
    text: result.text,
    model: model.name,
    estimatedCost: (model.cost / 1_000_000) * 500,
  };
}

// Usage example
const result = await generateWithRouting(
  'Summarize this article: ...',
  {
    taskType: 'generation',
    latencyBudget: 2000, // 2s max
    costBudget: 0.01, // 1 cent max
    userTier: 'free',
  }
);

Token Economics

Model routing is fundamentally about economics. Here's what the numbers actually look like:

| Model | Input ($/1M) | Output ($/1M) | Cost per 1K Requests* | Relative Cost |
| --- | --- | --- | --- | --- |
| GPT-4o Mini | $0.15 | $0.60 | $0.38 | 1x (baseline) |
| GPT-4o | $2.50 | $10.00 | $6.25 | 16x |
| Claude 3.5 Sonnet | $3.00 | $15.00 | $9.00 | 24x |
| OpenAI o1-mini | $3.00 | $12.00 | $7.50 | 20x |
| GPT-4 Turbo | $10.00 | $30.00 | $20.00 | 53x |

* Estimated cost per 1,000 requests, assuming 500 input and 500 output tokens per request
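The per-1K figures follow directly from the token prices. A small helper, under the same 500-input/500-output assumption, makes the arithmetic explicit:

```typescript
// Cost of 1,000 requests given per-1M-token prices and per-request token counts.
function costPer1kRequests(
  inputPricePer1M: number,
  outputPricePer1M: number,
  inputTokens = 500,
  outputTokens = 500
): number {
  const perRequest =
    (inputPricePer1M / 1_000_000) * inputTokens +
    (outputPricePer1M / 1_000_000) * outputTokens;
  return perRequest * 1000;
}

// GPT-4o Mini ($0.15 in / $0.60 out) -> ~$0.375 per 1K requests ($0.38 in the table)
// GPT-4o ($2.50 in / $10.00 out)     -> $6.25 per 1K requests
```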

The 24x Cost Trap

Using Claude 3.5 Sonnet for every request when GPT-4o Mini would suffice costs 24x more. At 1M requests/month, that's $9,000 vs $380. Smart routing isn't optional—it's survival.

Real-World Routing Impact Example:

Scenario: SaaS app with 100K requests/day (3M/month)

No routing (all Claude 3.5): 3M × $0.009 = $27,000/month

Simple routing (70% GPT-4o Mini, 30% Claude 3.5):

2.1M × $0.00038 + 900K × $0.009 = $8,898/month

💰 Savings: $18,102/month ($217,224/year)
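The arithmetic above can be checked with a small blended-cost helper (per-request costs taken from the token-economics table):

```typescript
// Monthly cost for a traffic mix: share of requests per model x per-request cost.
function monthlyBlendedCost(
  totalRequests: number,
  mix: Array<{ share: number; costPerRequest: number }>
): number {
  return mix.reduce(
    (sum, m) => sum + totalRequests * m.share * m.costPerRequest,
    0
  );
}

// All traffic on Claude 3.5 Sonnet ($0.009/request)
const allSonnet = monthlyBlendedCost(3_000_000, [
  { share: 1.0, costPerRequest: 0.009 },
]); // $27,000/month

// 70% GPT-4o Mini ($0.00038/request), 30% Claude 3.5 Sonnet
const routed = monthlyBlendedCost(3_000_000, [
  { share: 0.7, costPerRequest: 0.00038 },
  { share: 0.3, costPerRequest: 0.009 },
]); // $8,898/month
```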

Production Considerations

1. Fallback Strategy

Models fail. APIs have outages. Always have a fallback plan.

// Fallback with retry logic
import { generateText } from 'ai';
import { openai } from '@ai-sdk/openai';
import { anthropic } from '@ai-sdk/anthropic';

async function generateWithFallback(prompt: string) {
  const models = [
    openai('gpt-4o-mini'),
    openai('gpt-4o'),
    anthropic('claude-3-5-sonnet-20241022'),
  ];

  for (const model of models) {
    try {
      return await generateText({ model, prompt, maxTokens: 500 });
    } catch (error) {
      console.error(`Model failed, trying next: ${error}`);
      continue;
    }
  }

  throw new Error('All models failed');
}

2. Rate Limiting Awareness

Different models have different rate limits. Route around congestion.

Typical Rate Limits (varies by tier):

  • GPT-4o Mini: 30,000 RPM (requests per minute)
  • GPT-4o: 10,000 RPM
  • Claude 3.5 Sonnet: 4,000 RPM
  • OpenAI o1-mini: 500 RPM

If you hit rate limits on your primary model, automatically route to a secondary model instead of failing.
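One way to sketch that: track each model's requests in a sliding one-minute window and route to the first model with headroom. The limits used below are illustrative (real limits vary by account tier), and `pickAvailableModel` is a hypothetical helper:

```typescript
// Sliding-window RPM tracker: route to the first model under its limit.
interface RateLimitedModel {
  name: string;
  rpmLimit: number;
  timestamps: number[]; // request times within the last minute
}

function pickAvailableModel(
  models: RateLimitedModel[],
  now: number = Date.now()
): RateLimitedModel | null {
  for (const m of models) {
    // Drop timestamps older than 60s, then check for headroom.
    m.timestamps = m.timestamps.filter((t) => now - t < 60_000);
    if (m.timestamps.length < m.rpmLimit) {
      m.timestamps.push(now);
      return m;
    }
  }
  return null; // all models saturated -- queue or shed load
}
```

When the primary model is saturated, requests spill over to the secondary automatically instead of erroring out.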

3. Monitoring and Observability

Track which models are actually being used and their performance.

// Add telemetry to routing decisions
// (selectModel and RoutingContext come from the hybrid router above;
// `analytics` is your analytics client, e.g. PostHog or Segment)
import { generateText } from 'ai';

async function generateWithTelemetry(prompt: string, context: RoutingContext) {
  const model = selectModel(context);
  const startTime = Date.now();

  try {
    const result = await generateText({
      model: model.provider,
      prompt,
      maxTokens: 500,
    });

    const latency = Date.now() - startTime;

    // Log to your analytics (PostHog, Segment, etc.)
    analytics.track('model_routing', {
      model: model.name,
      taskType: context.taskType,
      latency,
      success: true,
      userTier: context.userTier,
      estimatedCost: (model.cost / 1_000_000) * 500,
    });

    return result;
  } catch (error) {
    analytics.track('model_routing', {
      model: model.name,
      taskType: context.taskType,
      success: false,
      error: error instanceof Error ? error.message : String(error),
    });
    throw error;
  }
}

4. A/B Testing Routing Strategies

Don't guess which routing strategy works best. Measure it.

What to test:

  • Does aggressive cost optimization hurt user satisfaction?
  • Is latency or quality more important to users?
  • Do pro users notice quality difference vs free users?
  • What's the minimum viable model for each task type?

Run A/B tests with different routing strategies and measure user engagement, satisfaction scores, and retention.
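A common mechanism for such tests is deterministic bucketing: hash the user id so each user consistently sees the same routing strategy across sessions. A minimal sketch using an FNV-1a hash (any stable hash works; `assignVariant` is a hypothetical helper):

```typescript
// Deterministically assign each user to a routing-strategy variant.
function fnv1a(str: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    hash ^= str.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // 32-bit FNV prime multiply
  }
  return hash;
}

function assignVariant(userId: string, variants: string[]): string {
  return variants[fnv1a(userId) % variants.length];
}
```

Because the assignment depends only on the user id, you can compare engagement and retention per variant without storing per-user flags.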

Common Mistakes

❌ Always using the "best" model

Claude 3.5 Sonnet is amazing, but using it for simple classification tasks is burning money. Most tasks don't need the best model.

❌ No fallback strategy

When your primary model has an outage (and it will), your entire app goes down. Always have a backup model ready.

❌ Ignoring latency in routing decisions

Saving $0.002 per request doesn't matter if users bounce because your app feels slow. Factor latency into routing.

❌ Not monitoring routing decisions

You can't optimize what you don't measure. Track which models are used, their latency, cost, and success rates.

❌ Over-engineering the router

Start simple: cost-based waterfall routing. Add complexity only when you have data showing you need it.

Best Practices

✅ Start with cost-based waterfall routing

Try cheap models first, falling back to expensive ones only when necessary. This single pattern handles 80% of routing needs.

✅ Add quality validation

Check if cheaper model output meets quality threshold before accepting it. Simple checks: length, format, keyword presence.
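Those checks can live in one small validator that the waterfall consults before accepting a cheap model's output. A sketch with illustrative thresholds:

```typescript
// Cheap heuristic quality gate: reject obviously bad outputs before
// escalating to a more expensive model.
interface QualityCheck {
  minLength?: number;
  mustInclude?: string[]; // required keywords
  mustNotInclude?: string[]; // e.g. error markers, refusal phrases
}

function passesQualityCheck(text: string, check: QualityCheck): boolean {
  if (check.minLength !== undefined && text.length < check.minLength) return false;
  if (check.mustInclude?.some((kw) => !text.includes(kw))) return false;
  if (check.mustNotInclude?.some((kw) => text.includes(kw))) return false;
  return true;
}
```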

✅ Separate user-facing from background tasks

User-facing: optimize for latency. Background: optimize for cost. Don't use the same routing strategy for both.

✅ Build in observability from day one

Track model usage, latency, cost, and success rates. You'll need this data to optimize routing decisions.

✅ Test routing strategies with real data

A/B test different routing approaches. Measure user satisfaction, not just cost savings. Sometimes expensive models are worth it.

Key Takeaways

  • Route to the cheapest capable model, not the best. This single principle can save 10-50x in costs.
  • Start simple: Cost-based waterfall routing handles most use cases. Add complexity only when needed.
  • Factor in latency: User-facing features need fast models. Background jobs can use slower, cheaper models.
  • Always have fallbacks: Models fail. APIs have outages. Never depend on a single model.
  • Monitor everything: Track model usage, costs, latency, and success rates. Optimize based on data.
  • Test routing strategies: A/B test different approaches. Measure user satisfaction, not just cost savings.