Hybrid AI Stack: Reduce AI API Costs by 60-80% with Intelligent Request Routing
TL;DR: Run lightweight LLMs locally on CPU for simple tasks, use cloud APIs only for complex requests. Save 60-80% on AI costs while maintaining quality. Production-ready Docker stack with full monitoring.
The Problem
AI API costs add up fast:
- Simple question: $0.0009 per request
- Medium explanation: $0.0036 per request
- Complex code generation: $0.0090 per request
At 200,000 requests/month, you’re paying $774/month for cloud-only AI.
The insight: 70% of AI requests are simple enough for local models to handle—for free.
The Solution: Hybrid AI Stack
Intelligently route requests between:
- Local models (CPU-based): TinyLlama, Phi-2, Mistral - $0.00 per request
- Cloud APIs: Claude Sonnet - $0.003-0.015 per request
Architecture

```text
                 User Request
                      |
                      v
             API Gateway (:8080)
                      |
                Smart Router
               /      |      \
              /       |       \
   Complexity <0.3  0.3-0.6   >0.6
         |            |          |
         v            v          v
    TinyLlama       Phi-2    Claude Sonnet
   (Local CPU)   (Local CPU)  (Cloud API)
    $0.00/req     $0.00/req   $0.003-0.015
```
Smart Routing Algorithm
The router analyzes every request in real-time:
````python
def estimate_complexity(prompt: str) -> float:
    """
    Analyze a prompt and return a complexity score (0-1).

    Factors:
    - Length (+0.1 to +0.5)
    - Complex keywords (+0.1 each): "implement", "design", "refactor", ...
    - Simple keywords (-0.1 each): "what is", "list", "summarize", ...
    - Code presence (+0.3)
    """
    score = 0.0

    # Length scoring
    if len(prompt) < 100:
        score += 0.1
    elif len(prompt) < 500:
        score += 0.3
    else:
        score += 0.5

    # Keyword analysis
    complex_keywords = ['analyze', 'design', 'implement', 'refactor', 'optimize']
    simple_keywords = ['what is', 'list', 'summarize', 'define']
    for keyword in complex_keywords:
        if keyword in prompt.lower():
            score += 0.1
    for keyword in simple_keywords:
        if keyword in prompt.lower():
            score -= 0.1

    # Code detection
    if '```' in prompt or 'function' in prompt or 'class' in prompt:
        score += 0.3

    return min(max(score, 0), 1)
````
```python
def select_model(complexity: float) -> str:
    """Route to the optimal model based on complexity."""
    if complexity < 0.3:
        return 'tinyllama'      # Free, fast, 700MB RAM
    elif complexity < 0.6:
        return 'phi2'           # Free, higher quality, 1.6GB RAM
    else:
        return 'claude-sonnet'  # Paid, best quality
```
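As a quick sanity check, here is how the two functions compose. The example prompts are illustrative; the scores follow directly from the heuristic above:

```python
# Simple question: short prompt (+0.1) with a "what is"
# keyword (-0.1) scores 0.0 → TinyLlama
prompt = "What is Python?"
print(select_model(estimate_complexity(prompt)))  # tinyllama

# Code-heavy request: short (+0.1), "analyze" and "refactor"
# keywords (+0.2), "class" triggers code detection (+0.3) → 0.6 → Claude
prompt = "Analyze and refactor this class for readability."
print(select_model(estimate_complexity(prompt)))  # claude-sonnet
```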
Full Stack Components
Core Services:
- API Gateway (Flask + Gunicorn): HTTP entry point on port 8080
- Smart Router (Python): Complexity estimation and routing logic
- Ollama (Local LLM server): Runs TinyLlama, Phi-2, Mistral on CPU
- Redis: Response caching for duplicate queries
Automation:
- n8n (Workflow engine): Orchestration, monitoring, batch processing
- PostgreSQL: n8n workflow storage
Monitoring:
- Prometheus: Metrics collection and alerting
- Grafana: Real-time dashboards for cost tracking
- Taskwarrior: Task tracking for every routing decision
Real-World Cost Scenarios
Scenario 1: Established SaaS (50,000 requests/month)
Request Distribution:
- Simple (45%): 22,500 requests
- Medium (35%): 17,500 requests
- Complex (20%): 10,000 requests
Cloud-Only Approach
```text
Simple:   22,500 × $0.0009 = $20.25
Medium:   17,500 × $0.0036 = $63.00
Complex:  10,000 × $0.0090 = $90.00
─────────────────────────────────────
Total:    $173.25/month
```
Hybrid Approach (Tier 2 VPS: $52/mo)
```text
VPS Cost:                      $52.00
Simple → TinyLlama (22,500):    $0.00  (saved $20.25)
Medium → Phi-2 (17,500):        $0.00  (saved $63.00)
Complex → Claude (10,000):     $90.00
─────────────────────────────────────
Total:                        $142.00/month
```

Savings: $31.25/month (18% reduction) ✅
Scenario 2: High-Volume Enterprise (200,000 requests/month)
Request Distribution:
- Simple (40%): 80,000 requests @ $0.0009
- Medium (35%): 70,000 requests @ $0.0036
- Complex (25%): 50,000 requests @ $0.009
Cloud-Only Approach
```text
Simple:   80,000 × $0.0009 =  $72.00
Medium:   70,000 × $0.0036 = $252.00
Complex:  50,000 × $0.0090 = $450.00
──────────────────────────────────────
Total:    $774.00/month
```
Hybrid Approach (Tier 3 VPS: $120/mo)
```text
VPS Cost:                            $120.00
Simple → TinyLlama (80,000):           $0.00  (saved $72.00)
Medium → Phi-2 (70,000):               $0.00  (saved $252.00)
Complex (60%) → Mistral-7B (30,000):   $0.00  (saved $270.00)
Complex (40%) → Claude (20,000):     $180.00
──────────────────────────────────────
Total:                               $300.00/month
```

Savings: $474.00/month (61% reduction) ✅✅✅
VPS Tiers & Break-Even Analysis
Tier | Monthly Cost | Models | RAM | Break-Even Volume |
---|---|---|---|---|
Tier 1 | $26 | TinyLlama | 2GB | ~3,000 requests |
Tier 2 | $52 | TinyLlama + Phi-2 | 4GB | ~30,000 requests ⭐ |
Tier 3 | $120 | All + Mistral-7B | 8GB | ~100,000 requests |
Tier 4 | $310 | GPU-accelerated | 16GB + GPU | ~500,000 requests |
Recommendation: Start with Tier 2 ($52/mo) - optimal for most use cases.
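The break-even math behind the table is straightforward: a tier pays for itself once the API spend it displaces exceeds the VPS cost. A minimal sketch, assuming the Scenario 1 request mix (45% simple, 35% medium) routes fully to local models:

```python
# Per-request API spend displaced by local routing (Claude pricing)
SIMPLE_COST, MEDIUM_COST = 0.0009, 0.0036
saved_per_request = 0.45 * SIMPLE_COST + 0.35 * MEDIUM_COST  # ≈ $0.001665

# Tier 2 ($52/mo) pays for itself once displaced spend exceeds the VPS cost
break_even = 52 / saved_per_request
print(f"{break_even:,.0f} requests/month")  # ~31,000, matching the ~30K in the table
```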
5-Minute Quick Start
Prerequisites
- Ubuntu 22.04+ (or compatible Linux)
- 4GB+ RAM (16GB recommended)
- Sudo access
- Docker installed
Installation
```bash
# Clone the repository
git clone https://github.com/jeremylongshore/Hybrid-ai-stack-intent-solutions.git
cd hybrid-ai-stack

# Run one-command installation
./install.sh

# Edit .env and add your API key
nano .env
# Set: ANTHROPIC_API_KEY=sk-ant-your-key-here

# Deploy everything
./deploy-all.sh docker
```
That’s it! The system is now running.
Test the API
```bash
# Simple question (routes to local TinyLlama)
curl -X POST http://localhost:8080/api/v1/chat \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is Python?"}'

# Complex task (routes to Claude)
curl -X POST http://localhost:8080/api/v1/chat \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Write a Python function to implement a binary search tree with insertion, deletion, and balancing."}'

# Check routing statistics
curl http://localhost:8080/api/v1/stats
```
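If you prefer calling the gateway from code, here is a minimal Python client sketch. The endpoint comes from the curl examples above; the shape of the JSON response is an assumption, so check your gateway's actual schema:

```python
import requests

def chat(prompt: str) -> dict:
    """Send a prompt to the gateway; the smart router picks the model."""
    resp = requests.post(
        "http://localhost:8080/api/v1/chat",
        json={"prompt": prompt},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()

result = chat("What is Python?")
print(result)  # e.g. the model used and its reply, per your gateway's schema
```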
Cost Optimization Strategies
Strategy 1: Aggressive Local Routing
Increase local model usage to 80%+:
```bash
# Edit .env
COMPLEXITY_THRESHOLD_TINY=0.4   # Default: 0.3
COMPLEXITY_THRESHOLD_PHI=0.7    # Default: 0.6
```
Impact: Savings increase 15-20% with slight quality trade-off.
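Under the hood, raising these thresholds simply widens the score bands that map to local models. A sketch of how the router might read them (the variable names mirror the .env keys; the exact wiring in the repo may differ):

```python
import os

TINY_MAX = float(os.getenv("COMPLEXITY_THRESHOLD_TINY", "0.3"))
PHI_MAX = float(os.getenv("COMPLEXITY_THRESHOLD_PHI", "0.6"))

def select_model(complexity: float) -> str:
    # With TINY=0.4 / PHI=0.7, scores up to 0.7 stay local
    if complexity < TINY_MAX:
        return "tinyllama"
    elif complexity < PHI_MAX:
        return "phi2"
    return "claude-sonnet"
```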
Strategy 2: Response Caching
Eliminate duplicate API calls:
```bash
# Enable Redis caching in .env
REDIS_CACHE_ENABLED=true
REDIS_CACHE_TTL=86400  # 24 hours
```

Expected cache hit rate: 10-30%, which trims the remaining API spend by the same fraction.

Example (50K requests/month):

```text
Without caching:          $142.00/month
With 20% cache hit rate:  $124.00/month  (20% of the $90 Claude spend avoided)
Additional savings:       $18.00/month
```

Note that caching only displaces API spend; the $52 VPS cost is fixed.
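Conceptually the cache is a hash-keyed lookup in front of the router. A minimal sketch with redis-py; the key scheme and the `route_and_answer` helper are assumptions for illustration, not the repo's exact implementation:

```python
import hashlib
import redis

r = redis.Redis(host="localhost", port=6379)
TTL = 86400  # 24 hours, matching REDIS_CACHE_TTL

def cached_chat(prompt: str) -> str:
    key = "chat:" + hashlib.sha256(prompt.encode()).hexdigest()
    if (hit := r.get(key)) is not None:
        return hit.decode()            # cache hit: zero API cost
    answer = route_and_answer(prompt)  # hypothetical call into the router
    r.setex(key, TTL, answer)          # cache for subsequent duplicates
    return answer
```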
Strategy 3: Smart Model Fallback
Try cheaper models first, escalate only if needed:
```python
def smart_fallback_request(prompt):
    """Try TinyLlama → Phi-2 → Claude until quality is acceptable."""
    # Try TinyLlama (free)
    response = try_tinyllama(prompt)
    if quality_acceptable(response):
        return response  # Cost: $0

    # Try Phi-2 (free)
    response = try_phi2(prompt)
    if quality_acceptable(response):
        return response  # Cost: $0

    # Fall back to Claude (paid)
    return try_claude(prompt)  # Cost: $0.003-0.015
```
Savings: Can increase local usage to 90%+
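The sketch above leaves `quality_acceptable` undefined; in practice it has to be cheap, since an LLM-based judge would erase the savings. One possible heuristic, with thresholds that are illustrative assumptions:

```python
def quality_acceptable(response: str) -> bool:
    """Cheap heuristics to reject obviously weak local responses."""
    if len(response.strip()) < 20:             # empty or trivially short
        return False
    if "i don't know" in response.lower():     # explicit non-answer
        return False
    words = response.split()
    if words and len(set(words)) / len(words) < 0.3:  # repetition loops
        return False
    return True
```

One trade-off to keep in mind: a request that escalates through all three tiers pays the local inference latency twice before reaching Claude.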
Monitoring & Observability
Prometheus Metrics
```promql
# Total API cost (cumulative)
sum(api_gateway_cost_total)

# Cost by model
sum by (model) (api_gateway_cost_total)

# Average cost per request
sum(api_gateway_cost_total) / sum(api_gateway_requests_total)

# Projected monthly cost (hourly spend × 730 hours/month)
sum(increase(api_gateway_cost_total[1h])) * 730
```
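These queries assume the gateway exports two counters. A sketch of the instrumentation side with prometheus_client; the metric names are chosen to match the queries above, though the repo's labels may differ:

```python
from prometheus_client import Counter, start_http_server

# prometheus_client appends "_total" to counter samples, so these are
# scraped as api_gateway_cost_total / api_gateway_requests_total
COST = Counter("api_gateway_cost", "Cumulative API spend (USD)", ["model"])
REQUESTS = Counter("api_gateway_requests", "Requests routed", ["model"])

def record(model: str, cost_usd: float) -> None:
    REQUESTS.labels(model=model).inc()
    COST.labels(model=model).inc(cost_usd)

start_http_server(9100)  # expose /metrics for Prometheus to scrape
```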
Grafana Dashboard
Key panels:
- Total Monthly Cost: Real-time cost tracking
- Cost by Model: Pie chart showing routing distribution
- Savings vs Cloud-Only: Calculated panel showing actual savings
- Request Volume: Time series of requests by model
Taskwarrior Cost Tracking
```bash
# View total costs this month
task project:vps_ai.router +routing cost.any: list

# Monthly cost trend
tw_cost_report
```
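Each routing decision can be logged as a task straight from the gateway. A sketch, assuming a `cost` UDA is configured in .taskrc; the exact attributes the repo records may differ:

```python
import subprocess

def log_routing_decision(model: str, cost: float) -> None:
    """Record one routing decision as a Taskwarrior task."""
    subprocess.run(
        ["task", "add",
         "project:vps_ai.router", "+routing",
         f"cost:{cost:.4f}",
         f"routed to {model}"],
        check=True,
    )
```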
Performance Characteristics
Latency
Model | Typical Latency | Use Case |
---|---|---|
TinyLlama | 0.5-2s | Simple Q&A, classifications |
Phi-2 | 1-3s | Explanations, summaries |
Claude Sonnet | 2-5s | Code gen, complex analysis |
Throughput
Component | Max RPS | Bottleneck |
---|---|---|
API Gateway | 100+ | CPU-bound |
Ollama (CPU) | 10-20 | Model inference |
Claude API | 50+ | Rate limits |
Resource Usage (Tier 2)
Service | RAM | CPU | Disk |
---|---|---|---|
Ollama | ~2GB | 1-2 cores | 5GB models |
API Gateway | ~200MB | 0.5 cores | Minimal |
n8n | ~300MB | 0.25 cores | 1GB |
Monitoring | ~500MB | 0.25 cores | 2GB |
Total | ~3GB | ~2 cores | ~8GB |
Technology Stack
Backend
- Python 3.11+: Core language
- Flask 3.1: Web framework
- Gunicorn: WSGI server
- Anthropic SDK: Claude API client
Infrastructure
- Docker: Containerization
- Docker Compose: Orchestration
- Ollama: Local LLM server
- Redis: Caching layer
- PostgreSQL: n8n database
Monitoring
- Prometheus: Metrics collection
- Grafana: Visualization
- Taskwarrior: Task tracking
Models
- TinyLlama 1.1B: Ultra-lightweight (700MB)
- Phi-2 2.7B: Quality lightweight (1.6GB)
- Mistral 7B: High-quality (4GB)
- Claude Sonnet 4: Cloud API
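If you ever need to pull the local models manually (install.sh presumably handles this for you), the standard Ollama commands would be as follows; the tags assume the default Ollama library names, where `phi` is the Phi-2 tag:

```bash
ollama pull tinyllama   # TinyLlama 1.1B (~700MB)
ollama pull phi         # Phi-2 2.7B (~1.6GB)
ollama pull mistral     # Mistral 7B (~4GB)
```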
Production Deployment
Docker Deployment
```bash
# Start all services
./deploy-all.sh docker

# Check status
docker-compose ps

# View logs
docker-compose logs -f api-gateway

# Stop services
docker-compose down
```
Cloud Deployment (AWS/GCP)
```bash
# Deploy to AWS with Terraform
cd terraform/aws
terraform init
terraform plan
terraform apply

# Deploy to GCP
cd terraform/gcp
terraform init
terraform plan
terraform apply
```
ROI Calculator
Calculate your potential savings:
```python
#!/usr/bin/env python3
"""ROI Calculator for Hybrid AI Stack"""

def calculate_roi(
    monthly_requests: int,
    simple_percent: float,
    medium_percent: float,
    complex_percent: float,
    tier: int,
):
    # Costs per request (Claude)
    SIMPLE_COST = 0.0009
    MEDIUM_COST = 0.0036
    COMPLEX_COST = 0.009

    # VPS costs per tier
    VPS_COSTS = {1: 26, 2: 52, 3: 120, 4: 310}

    # Calculate request distribution
    simple_reqs = monthly_requests * simple_percent
    medium_reqs = monthly_requests * medium_percent
    complex_reqs = monthly_requests * complex_percent

    # Cloud-only cost
    cloud_only = (
        simple_reqs * SIMPLE_COST +
        medium_reqs * MEDIUM_COST +
        complex_reqs * COMPLEX_COST
    )

    # Hybrid cost: assumes 80% of simple and 70% of medium requests
    # are served locally; complex requests stay on Claude
    vps_cost = VPS_COSTS[tier]
    hybrid_api_cost = (
        simple_reqs * 0.2 * SIMPLE_COST +
        medium_reqs * 0.3 * MEDIUM_COST +
        complex_reqs * COMPLEX_COST
    )
    hybrid_total = vps_cost + hybrid_api_cost

    # Calculate savings
    savings = cloud_only - hybrid_total
    savings_percent = (savings / cloud_only) * 100

    print(f"\n{'='*60}")
    print(f" ROI Analysis: {monthly_requests:,} requests/month")
    print(f"{'='*60}")
    print(f"\n💵 Cloud-Only Cost: ${cloud_only:,.2f}/month")
    print(f"\n🔄 Hybrid Approach (Tier {tier}):")
    print(f"   VPS Cost: ${vps_cost:,.2f}")
    print(f"   API Cost: ${hybrid_api_cost:,.2f}")
    print(f"   Total:    ${hybrid_total:,.2f}")
    print(f"\n💰 Savings: ${savings:,.2f}/month ({savings_percent:.1f}%)")

# Example usage
calculate_roi(50000, 0.45, 0.35, 0.2, tier=2)
```
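With the example inputs this prints a cloud-only cost of $173.25 against a hybrid total of $164.95 ($52 VPS + $112.95 API), about 5% savings. That is deliberately more conservative than Scenario 1's $142, because the calculator assumes only 80% of simple and 70% of medium requests route locally, where the scenario assumed 100%; adjust the 0.2 and 0.3 factors to match your observed routing mix.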
Key Takeaways
- 60-80% cost reduction possible with intelligent routing
- Break-even at ~30,000 requests/month for Tier 2
- Local models handle 70-80% of requests for free
- Production-ready with Docker, monitoring, automation
- Open source and fully customizable
GitHub Repository
Full source code: https://github.com/jeremylongshore/Hybrid-ai-stack-intent-solutions
Documentation:
- Architecture guide
- Deployment instructions
- Cost optimization strategies
- Monitoring setup
- n8n workflow examples
What’s Next
- Start with Tier 2 ($52/mo) for optimal ROI
- Monitor routing decisions in Grafana
- Tune complexity thresholds based on your use case
- Enable caching for additional 10-30% savings
- Scale up to Tier 3 when you exceed 100K requests/month
Questions or feedback: jeremy@intentsolutions.io
GitHub: @jeremylongshore
Open source project maintained by Intent Solutions as an educational resource for cost-effective AI infrastructure.