# LLM Cost Management for Data Pipelines: When to Use Claude, OpenAI, or Ollama
LLM costs in production pipelines scale differently from anything else in your data infrastructure. A poorly architected pipeline that sends every event through GPT-4o can burn through thousands of dollars per day. A well-architected one running the same workload might cost a tenth of that — by routing each task to the model that's just capable enough for the job.
This post covers the cost architecture decisions that keep AI pipelines economically viable at scale.
## The Cost Stack
LLM pricing in 2025 spans roughly three orders of magnitude:
| Model tier | Example models | Input cost (per 1M tokens) | Output cost | Best for |
|---|---|---|---|---|
| Heavy | Claude Opus 4.6, GPT-4o | $15–25 | $75–100 | Complex reasoning, long context |
| Mid | Claude Sonnet 4.6, GPT-4o mini | $3–8 | $12–25 | Most production tasks |
| Light | Claude Haiku 4.5, GPT-3.5 | $0.25–1 | $1.25–4 | Classification, routing, simple extraction |
| Self-hosted | Ollama (Llama 3.1, Mistral) | ~$0 (compute only) | ~$0 | High-volume, repeatable, latency-tolerant |
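To make that spread concrete, here is a back-of-envelope per-call cost at assumed token counts (2,000 input, 500 output per call; the prices are illustrative mid-points of the table ranges above, not quotes):

```python
# Per-call cost at each tier, assuming 2,000 input and 500 output tokens.
# Prices are illustrative mid-points of the ranges in the table above.
PRICING = {  # USD per 1M tokens: (input, output)
    "heavy": (20.0, 85.0),
    "mid":   (5.0, 18.0),
    "light": (0.5, 2.5),
}

def per_call_cost(tier: str, input_tokens: int = 2000, output_tokens: int = 500) -> float:
    inp, out = PRICING[tier]
    return (input_tokens * inp + output_tokens * out) / 1_000_000

for tier in PRICING:
    print(f"{tier:>6}: ${per_call_cost(tier):.5f} per call")
```

At 1M calls/day, that works out to roughly $82,500/day at the heavy tier versus $2,250 at the light tier, before any self-hosted routing.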
The cost difference between a heavy model and self-hosted Ollama is 100–1000×. That difference is the budget for architectural complexity.
## The Decision Framework
For every LLM call in your pipeline, ask three questions:
- How much context does this task require? Long context (>32K tokens) rules out some light models.
- How much reasoning does this task require? Structured extraction is not the same as strategic analysis.
- How often does this task run? A task running 1M times/day at $0.01/call is $10K/day.
The answers map task volume and complexity to a tier:

- High volume + simple task → Ollama (self-hosted)
- High volume + moderate task → Light tier (Haiku, GPT-3.5)
- Low volume + complex task → Heavy tier (Opus, GPT-4o)
- Production reliability required → Mid tier as default
- On-prem / data sensitivity → Ollama regardless of complexity
## Routing Pattern: The LLM Gateway
Rather than hard-coding model choices in each pipeline step, a routing layer selects the model based on task characteristics:
```python
from enum import Enum
from dataclasses import dataclass

import anthropic
import ollama


class ModelTier(Enum):
    HEAVY = "heavy"
    MID = "mid"
    LIGHT = "light"
    LOCAL = "local"


@dataclass
class TaskProfile:
    task_type: str         # "classification", "extraction", "reasoning", "generation"
    estimated_tokens: int  # rough input+output estimate
    requires_reliability: bool
    data_sensitivity: bool  # if True, cannot leave on-prem


def select_model(profile: TaskProfile) -> tuple[str, str]:
    """Returns (provider, model_name)."""
    # Data sensitivity always wins — must stay on-prem
    if profile.data_sensitivity:
        return ("ollama", "llama3.1:8b")

    # High-volume, simple classification → local or light
    if profile.task_type == "classification" and profile.estimated_tokens < 2000:
        if profile.requires_reliability:
            return ("anthropic", "claude-haiku-4-5-20251001")
        return ("ollama", "mistral:7b")

    # Extraction tasks → light-to-mid depending on complexity
    if profile.task_type == "extraction":
        if profile.estimated_tokens < 5000:
            return ("anthropic", "claude-haiku-4-5-20251001")
        return ("anthropic", "claude-sonnet-4-6")

    # Complex reasoning → mid as default, heavy only if justified
    if profile.task_type == "reasoning":
        if profile.estimated_tokens > 50000:
            return ("anthropic", "claude-opus-4-6")
        return ("anthropic", "claude-sonnet-4-6")

    # Default: mid tier
    return ("anthropic", "claude-sonnet-4-6")


def llm_call(prompt: str, profile: TaskProfile) -> str:
    provider, model = select_model(profile)

    if provider == "ollama":
        response = ollama.chat(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        return response["message"]["content"]

    client = anthropic.Anthropic()
    response = client.messages.create(
        model=model,
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text
```
## Cost Monitoring: Track Every Call
Without observability, LLM costs are a black box until the invoice arrives.
```python
import time
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class LLMCallRecord:
    task_type: str
    model: str
    provider: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    cost_usd: float
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


# Approximate pricing (update as models change)
COST_PER_1M_TOKENS = {
    "claude-opus-4-6":           {"input": 15.0, "output": 75.0},
    "claude-sonnet-4-6":         {"input": 3.0,  "output": 15.0},
    "claude-haiku-4-5-20251001": {"input": 0.25, "output": 1.25},
    "llama3.1:8b":               {"input": 0.0,  "output": 0.0},  # compute only
    "mistral:7b":                {"input": 0.0,  "output": 0.0},
}


def tracked_llm_call(prompt: str, profile: TaskProfile, cost_log: list) -> str:
    provider, model = select_model(profile)
    start = time.time()
    result = llm_call(prompt, profile)

    # Estimate tokens (replace with actual token counts from the API response)
    input_tokens = len(prompt.split()) * 1.3
    output_tokens = len(result.split()) * 1.3

    pricing = COST_PER_1M_TOKENS.get(model, {"input": 0, "output": 0})
    cost = (input_tokens * pricing["input"] + output_tokens * pricing["output"]) / 1_000_000

    cost_log.append(LLMCallRecord(
        task_type=profile.task_type,
        model=model,
        provider=provider,
        input_tokens=int(input_tokens),
        output_tokens=int(output_tokens),
        latency_ms=(time.time() - start) * 1000,
        cost_usd=cost,
    ))
    return result
```
Log these records to your warehouse. A simple daily query surfaces where the cost is going:
```sql
SELECT
    task_type,
    model,
    COUNT(*) AS call_count,
    SUM(input_tokens) AS total_input_tokens,
    SUM(output_tokens) AS total_output_tokens,
    SUM(cost_usd) AS total_cost_usd,
    AVG(latency_ms) AS avg_latency_ms
FROM llm_call_log
WHERE DATE(timestamp) = CURRENT_DATE - 1
GROUP BY 1, 2
ORDER BY total_cost_usd DESC
```
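For pipelines that haven't landed the log in a warehouse yet, the same rollup works in memory. A sketch, using plain dicts with hypothetical values in place of real `LLMCallRecord` rows:

```python
from collections import defaultdict

# Hypothetical sample of call records (dicts standing in for LLMCallRecord)
call_log = [
    {"task_type": "classification", "model": "mistral:7b", "cost_usd": 0.0},
    {"task_type": "extraction", "model": "claude-haiku-4-5-20251001", "cost_usd": 0.002},
    {"task_type": "extraction", "model": "claude-haiku-4-5-20251001", "cost_usd": 0.003},
    {"task_type": "reasoning", "model": "claude-sonnet-4-6", "cost_usd": 0.05},
]

# Aggregate call count and total cost per (task_type, model) pair
totals: dict = defaultdict(lambda: {"calls": 0, "cost": 0.0})
for rec in call_log:
    key = (rec["task_type"], rec["model"])
    totals[key]["calls"] += 1
    totals[key]["cost"] += rec["cost_usd"]

# Sort by cost descending, like the ORDER BY in the SQL above
for (task, model), agg in sorted(totals.items(), key=lambda kv: -kv[1]["cost"]):
    print(f"{task:<15} {model:<28} {agg['calls']:>3} calls  ${agg['cost']:.4f}")
```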
## Caching: The Free Cost Reduction
Many LLM calls in data pipelines ask the same question about the same inputs. Caching those responses turns repeat calls into free lookups.
```python
import hashlib

import redis

cache = redis.Redis(host="redis", port=6379)


def cached_llm_call(prompt: str, profile: TaskProfile, ttl_seconds: int = 3600) -> str:
    cache_key = hashlib.sha256(f"{profile.task_type}:{prompt}".encode()).hexdigest()

    cached = cache.get(cache_key)
    if cached:
        return cached.decode()

    result = llm_call(prompt, profile)
    cache.setex(cache_key, ttl_seconds, result)
    return result
```
For classification and routing tasks that run on the same input repeatedly (e.g., categorizing the same product description multiple times), cache hit rates of 40–70% are common. Every hit is a call you don't pay for, so spend on those tasks drops by roughly the hit rate.
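The expected saving is just the hit rate times the pre-cache spend on that task. A sketch that ignores Redis hosting cost, which is usually negligible by comparison:

```python
def cache_savings(monthly_cost_usd: float, hit_rate: float) -> float:
    """Expected monthly spend after caching, given a cache hit rate."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return monthly_cost_usd * (1.0 - hit_rate)

# A task costing $500/month with a 55% hit rate drops to about $225/month
print(f"${cache_savings(500.0, 0.55):.2f}")  # prints $225.00
```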
## Prompt Efficiency
Tokens cost money. Verbose prompts and padded context directly inflate the bill.
**Reduce prompt length:**
- Remove preamble ("You are a helpful AI assistant...") for task-specific models
- Use structured output schemas to avoid lengthy format instructions repeated per call
- Trim retrieved context to only the most relevant chunks (cosine similarity threshold, not top-K blindly)
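A minimal sketch of threshold-based context trimming, using plain-Python cosine similarity over toy 2-D embeddings (in production you'd use your vector store's scores; the 0.75 threshold is an assumption to tune):

```python
from math import sqrt

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

def trim_context(query_vec: list[float], chunks, threshold: float = 0.75) -> list[str]:
    """Keep only chunks whose embedding clears the similarity threshold,
    instead of blindly taking top-K (which pads the prompt with weak matches)."""
    return [text for text, vec in chunks if cosine(query_vec, vec) >= threshold]

# Toy 2-D embeddings: only the first chunk is actually relevant to the query
chunks = [
    ("pricing table for 2025", [1.0, 0.1]),
    ("unrelated changelog entry", [0.1, 1.0]),
]
print(trim_context([1.0, 0.0], chunks))  # → ['pricing table for 2025']
```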
**Batch where possible:** Instead of one LLM call per record, batch multiple records into a single call:

```python
# Expensive: 100 calls for 100 records
for record in records:
    classify(record)

# Cheaper: 2 calls for 100 records at batch_size=50 (if the task allows)
batch_classify(records, batch_size=50)
```
Batching trades latency for cost — acceptable for non-real-time pipeline steps.
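Since `batch_classify` above is a hypothetical helper, the one concrete piece worth showing is the chunking itself, which is just slicing:

```python
from collections.abc import Iterator, Sequence

def chunked(records: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Yield successive slices of `records`, at most `batch_size` each."""
    for i in range(0, len(records), batch_size):
        yield records[i : i + batch_size]

records = list(range(130))
batches = list(chunked(records, 50))
print([len(b) for b in batches])  # → [50, 50, 30]
```

Each batch then becomes one LLM call, with the per-record prompts concatenated into a single structured request.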
## Monthly Budget by Architecture
Rough estimates for a mid-sized data operation (10M records/month, 5 agent pipelines):
| Architecture | Estimated monthly LLM cost |
|---|---|
| All calls → Claude Opus | $8,000–15,000 |
| All calls → Claude Sonnet | $1,500–3,000 |
| Routed (Haiku for classification, Sonnet for reasoning) | $400–800 |
| Routed + Ollama for high-volume simple tasks | $150–350 |
| Routed + Ollama + caching | $80–200 |
The difference between "all calls to Opus" and "routed with caching" is 40–75×. Both produce equivalent output quality for most tasks.
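A sketch of how such an estimate is assembled, with an assumed call mix and illustrative per-call costs (none of these numbers come from a real bill):

```python
# Hypothetical monthly call mix after routing, with illustrative per-call costs
mix = [
    # (task, calls/month, per-call cost in USD)
    ("classification (Ollama)", 8_000_000, 0.0),     # self-hosted: compute only
    ("extraction (Haiku)",      1_500_000, 0.0001),
    ("reasoning (Sonnet)",        400_000, 0.0006),
    ("deep analysis (Opus)",        5_000, 0.01),
]

# Blended monthly API cost is just the volume-weighted sum
monthly = sum(calls * cost for _, calls, cost in mix)
print(f"blended monthly cost: ${monthly:,.0f}")  # prints blended monthly cost: $440
```

The result lands inside the routed range in the table above; the exercise is worth repeating with your own volumes before committing to an architecture.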
Book a strategy session to architect your AI cost strategy.