
LLM Cost Management for Data Pipelines: When to Use Claude, OpenAI, or Ollama

6 min read · Metadata Morph · AI & Data Engineering Team

LLM costs in production pipelines scale differently from anything else in your data infrastructure. A poorly architected pipeline that sends every event through GPT-4o can burn through thousands of dollars per day. A well-architected one running the same workload might cost a tenth of that — by routing each task to the model that's just capable enough for the job.

This post covers the cost architecture decisions that keep AI pipelines economically viable at scale.

The Cost Stack

LLM pricing in 2025 spans roughly three orders of magnitude:

Model tier  | Example models                 | Input cost (per 1M tokens) | Output cost | Best for
Heavy       | Claude Opus 4.6, GPT-4o        | $15–25                     | $75–100     | Complex reasoning, long context
Mid         | Claude Sonnet 4.6, GPT-4o mini | $3–8                       | $12–25      | Most production tasks
Light       | Claude Haiku 4.5, GPT-3.5      | $0.25–1                    | $1.25–4     | Classification, routing, simple extraction
Self-hosted | Ollama (Llama 3.1, Mistral)    | ~$0 (compute only)         | ~$0         | High-volume, repeatable, latency-tolerant

The cost difference between a heavy model and self-hosted Ollama is 100–1000×. That difference is the budget for architectural complexity.

The Decision Framework

For every LLM call in your pipeline, ask three questions:

  1. How much context does this task require? Long context (>32K tokens) rules out some light models.
  2. How much reasoning does this task require? Structured extraction is not the same as strategic analysis.
  3. How often does this task run? A task running 1M times/day at $0.01/call is $10K/day.

The answers map to a tier:

Task volume × complexity → model selection

High volume + simple task → Ollama (self-hosted)
High volume + moderate task → Light tier (Haiku, GPT-3.5)
Low volume + complex task → Heavy tier (Opus, GPT-4o)
Production reliability required → Mid tier as default
On-prem / data sensitivity → Ollama regardless of complexity
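The arithmetic behind question 3 is worth making concrete. A minimal sketch (the function name and token counts are illustrative; pricing is the mid tier from the table above):

```python
def daily_cost(calls_per_day: int, avg_input_tokens: int, avg_output_tokens: int,
               input_price_per_1m: float, output_price_per_1m: float) -> float:
    """Projected daily spend for one recurring pipeline task."""
    token_cost = (avg_input_tokens * input_price_per_1m
                  + avg_output_tokens * output_price_per_1m)
    return calls_per_day * token_cost / 1_000_000

# 1M calls/day at ~1K input / 400 output tokens on mid-tier pricing ($3 / $15)
print(daily_cost(1_000_000, 1000, 400, 3.0, 15.0))  # → 9000.0
```

That is $9K/day, roughly the $0.01/call figure above, for a task that might route perfectly well to a light or local model.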

Routing Pattern: The LLM Gateway

Rather than hard-coding model choices in each pipeline step, a routing layer selects the model based on task characteristics:

from enum import Enum
from dataclasses import dataclass

import anthropic
import ollama


class ModelTier(Enum):
    HEAVY = "heavy"
    MID = "mid"
    LIGHT = "light"
    LOCAL = "local"


@dataclass
class TaskProfile:
    task_type: str             # "classification", "extraction", "reasoning", "generation"
    estimated_tokens: int      # rough input+output estimate
    requires_reliability: bool
    data_sensitivity: bool     # if True, cannot leave on-prem


def select_model(profile: TaskProfile) -> tuple[str, str]:
    """Returns (provider, model_name)."""

    # Data sensitivity always wins — must stay on-prem
    if profile.data_sensitivity:
        return ("ollama", "llama3.1:8b")

    # High-volume, simple classification → local or light
    if profile.task_type == "classification" and profile.estimated_tokens < 2000:
        if profile.requires_reliability:
            return ("anthropic", "claude-haiku-4-5-20251001")
        return ("ollama", "mistral:7b")

    # Extraction tasks → light-to-mid depending on complexity
    if profile.task_type == "extraction":
        if profile.estimated_tokens < 5000:
            return ("anthropic", "claude-haiku-4-5-20251001")
        return ("anthropic", "claude-sonnet-4-6")

    # Complex reasoning → mid as default, heavy only if justified
    if profile.task_type == "reasoning":
        if profile.estimated_tokens > 50000:
            return ("anthropic", "claude-opus-4-6")
        return ("anthropic", "claude-sonnet-4-6")

    # Default: mid tier
    return ("anthropic", "claude-sonnet-4-6")


def llm_call(prompt: str, profile: TaskProfile) -> str:
    provider, model = select_model(profile)

    if provider == "ollama":
        response = ollama.chat(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        return response["message"]["content"]

    client = anthropic.Anthropic()
    response = client.messages.create(
        model=model,
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text

Cost Monitoring: Track Every Call

Without observability, LLM costs are a black box until the invoice arrives.

import time
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class LLMCallRecord:
    task_type: str
    model: str
    provider: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    cost_usd: float
    timestamp: datetime = field(default_factory=datetime.utcnow)


# Approximate pricing (update as models change)
COST_PER_1M_TOKENS = {
    "claude-opus-4-6":           {"input": 15.0, "output": 75.0},
    "claude-sonnet-4-6":         {"input": 3.0,  "output": 15.0},
    "claude-haiku-4-5-20251001": {"input": 0.25, "output": 1.25},
    "llama3.1:8b":               {"input": 0.0,  "output": 0.0},  # compute only
    "mistral:7b":                {"input": 0.0,  "output": 0.0},
}


def tracked_llm_call(prompt: str, profile: TaskProfile, cost_log: list) -> str:
    provider, model = select_model(profile)
    start = time.time()

    result = llm_call(prompt, profile)

    # Estimate tokens (replace with actual token counts from the API response)
    input_tokens = len(prompt.split()) * 1.3
    output_tokens = len(result.split()) * 1.3

    pricing = COST_PER_1M_TOKENS.get(model, {"input": 0, "output": 0})
    cost = (input_tokens * pricing["input"] + output_tokens * pricing["output"]) / 1_000_000

    cost_log.append(LLMCallRecord(
        task_type=profile.task_type,
        model=model,
        provider=provider,
        input_tokens=int(input_tokens),
        output_tokens=int(output_tokens),
        latency_ms=(time.time() - start) * 1000,
        cost_usd=cost,
    ))

    return result

Log these records to your warehouse. A simple daily query surfaces where the cost is going:

SELECT
    task_type,
    model,
    COUNT(*) AS call_count,
    SUM(input_tokens) AS total_input_tokens,
    SUM(output_tokens) AS total_output_tokens,
    SUM(cost_usd) AS total_cost_usd,
    AVG(latency_ms) AS avg_latency_ms
FROM llm_call_log
WHERE DATE(timestamp) = CURRENT_DATE - 1
GROUP BY 1, 2
ORDER BY total_cost_usd DESC

Caching: The Free Cost Reduction

Many LLM calls in data pipelines ask the same question about the same inputs. Serving repeats from a cache is the closest thing to free cost reduction.

import hashlib

import redis

cache = redis.Redis(host="redis", port=6379)


def cached_llm_call(prompt: str, profile: TaskProfile, ttl_seconds: int = 3600) -> str:
    cache_key = hashlib.sha256(f"{profile.task_type}:{prompt}".encode()).hexdigest()
    cached = cache.get(cache_key)

    if cached:
        return cached.decode()

    result = llm_call(prompt, profile)
    cache.setex(cache_key, ttl_seconds, result)
    return result

For classification and routing tasks that run on the same input repeatedly (e.g., categorizing the same product description multiple times), cache hit rates of 40–70% are common. Every hit is a call you never pay for, so those costs drop roughly in proportion to the hit rate.
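The expected savings follow directly from the hit rate, since each hit replaces a paid call. A minimal sketch (the function name is illustrative):

```python
def expected_cached_cost(base_cost_usd: float, hit_rate: float) -> float:
    """Expected spend when a fraction `hit_rate` of calls is served from cache."""
    return base_cost_usd * (1.0 - hit_rate)

# A $1,000/month classification workload at a 50% hit rate
print(expected_cached_cost(1000.0, 0.5))  # → 500.0
```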

Prompt Efficiency

Tokens cost money. Verbose prompts and padded context directly inflate the bill.

Reduce prompt length:

  • Remove preamble ("You are a helpful AI assistant...") for task-specific models
  • Use structured output schemas to avoid lengthy format instructions repeated per call
  • Trim retrieved context to only the most relevant chunks (cosine similarity threshold, not top-K blindly)
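The third point can be sketched as a threshold filter, assuming retrieved chunks arrive pre-scored with cosine similarity (the names and the 0.75 default are illustrative):

```python
def trim_context(scored_chunks: list[tuple[str, float]],
                 threshold: float = 0.75, max_chunks: int = 5) -> list[str]:
    """Keep only chunks whose similarity clears the threshold, capped at max_chunks."""
    # Rank best-first, then drop anything below the relevance bar
    ranked = sorted(scored_chunks, key=lambda pair: pair[1], reverse=True)
    return [text for text, score in ranked if score >= threshold][:max_chunks]

chunks = [("refund policy", 0.91), ("shipping times", 0.62), ("returns form", 0.80)]
print(trim_context(chunks))  # → ['refund policy', 'returns form']
```

Unlike blind top-K, a low-relevance query here contributes zero chunks and zero extra tokens rather than K mediocre ones.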

Batch where possible: Instead of one LLM call per record, batch multiple records into a single call:

# Expensive: 100 calls for 100 records
for record in records:
    classify(record)

# Cheaper: 1 call for 100 records (if the task allows)
batch_classify(records, batch_size=50)

Batching trades latency for cost — acceptable for non-real-time pipeline steps.
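`batch_classify` above is a placeholder; one way to sketch it, with the LLM call injected as a parameter so the chunking logic stands on its own (prompt wording and labels are illustrative):

```python
from typing import Callable

def batch_classify(records: list[str], llm: Callable[[str], str],
                   batch_size: int = 50) -> list[str]:
    """One LLM call per chunk of records instead of one call per record."""
    labels: list[str] = []
    for i in range(0, len(records), batch_size):
        chunk = records[i:i + batch_size]
        prompt = ("Classify each line as SPAM or HAM. "
                  "Reply with exactly one label per line.\n" + "\n".join(chunk))
        labels.extend(llm(prompt).strip().splitlines())
    return labels
```

For 100 records at `batch_size=50` this issues 2 calls instead of 100, paying the fixed prompt overhead twice rather than once per record.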

Monthly Budget by Architecture

Rough estimates for a mid-sized data operation (10M records/month, 5 agent pipelines):

Architecture                                            | Estimated monthly LLM cost
All calls → Claude Opus                                 | $8,000–15,000
All calls → Claude Sonnet                               | $1,500–3,000
Routed (Haiku for classification, Sonnet for reasoning) | $400–800
Routed + Ollama for high-volume simple tasks            | $150–350
Routed + Ollama + caching                               | $80–200

The difference between "all calls to Opus" and "routed with caching" is 40–75×. Both produce equivalent output quality for most tasks.

Book a strategy session to architect your AI cost strategy.