# LLM Cost Management for Data Pipelines: When to Use Claude, OpenAI, or Ollama
LLM costs in production pipelines scale differently from anything else in your data infrastructure. A poorly architected pipeline that sends every event through GPT-4o can burn through thousands of dollars per day. A well-architected one running the same workload might cost a tenth of that — by routing each task to the model that's just capable enough for the job.
This post covers the cost architecture decisions that keep AI pipelines economically viable at scale.
## The Cost Stack
LLM pricing in 2025 spans roughly three orders of magnitude:
| Model tier | Example models | Input cost (per 1M tokens) | Output cost | Best for |
|---|---|---|---|---|
| Heavy | Claude Opus 4.6, GPT-4o | $15–25 | $75–100 | Complex reasoning, long context |
| Mid | Claude Sonnet 4.6, GPT-4o mini | $3–8 | $12–25 | Most production tasks |
| Light | Claude Haiku 4.5, GPT-3.5 | $0.25–1 | $1.25–4 | Classification, routing, simple extraction |
| Self-hosted | Ollama (Llama 3.1, Mistral) | ~$0 (compute only) | ~$0 | High-volume, repeatable, latency-tolerant |
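To make that spread concrete, here is a back-of-envelope per-call cost at assumed token counts (2,000 input, 500 output per call; the prices are illustrative mid-points of the table ranges above, not quotes):

```python
# Per-call cost at each tier, assuming 2,000 input and 500 output tokens.
# Prices are illustrative mid-points of the ranges in the table above.
PRICING = {  # USD per 1M tokens: (input, output)
    "heavy": (20.0, 85.0),
    "mid":   (5.0, 18.0),
    "light": (0.5, 2.5),
}

def per_call_cost(tier: str, input_tokens: int = 2000, output_tokens: int = 500) -> float:
    inp, out = PRICING[tier]
    return (input_tokens * inp + output_tokens * out) / 1_000_000

for tier in PRICING:
    print(f"{tier:>6}: ${per_call_cost(tier):.5f} per call")
```

At 1M calls/day, that works out to roughly $82,500/day at the heavy tier versus $2,250 at the light tier, before any self-hosted routing.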
The cost difference between a heavy model and self-hosted Ollama is 100–1000×. That difference is the budget for architectural complexity.
## The Decision Framework
For every LLM call in your pipeline, ask three questions:
- How much context does this task require? Long context (>32K tokens) rules out some light models.
- How much reasoning does this task require? Structured extraction is not the same as strategic analysis.
- How often does this task run? A task running 1M times/day at $0.01/call is $10K/day.
The answers map task volume and complexity to a tier:

- High volume + simple task → Ollama (self-hosted)
- High volume + moderate task → Light tier (Haiku, GPT-3.5)
- Low volume + complex task → Heavy tier (Opus, GPT-4o)
- Production reliability required → Mid tier as default
- On-prem / data sensitivity → Ollama regardless of complexity
## Routing Pattern: The LLM Gateway
Rather than hard-coding model choices in each pipeline step, a routing layer selects the model based on task characteristics:
```python
from enum import Enum
from dataclasses import dataclass

import anthropic
import ollama


class ModelTier(Enum):
    HEAVY = "heavy"
    MID = "mid"
    LIGHT = "light"
    LOCAL = "local"


@dataclass
class TaskProfile:
    task_type: str         # "classification", "extraction", "reasoning", "generation"
    estimated_tokens: int  # rough input+output estimate
    requires_reliability: bool
    data_sensitivity: bool  # if True, cannot leave on-prem


def select_model(profile: TaskProfile) -> tuple[str, str]:
    """Returns (provider, model_name)."""
    # Data sensitivity always wins — must stay on-prem
    if profile.data_sensitivity:
        return ("ollama", "llama3.1:8b")

    # High-volume, simple classification → local or light
    if profile.task_type == "classification" and profile.estimated_tokens < 2000:
        if profile.requires_reliability:
            return ("anthropic", "claude-haiku-4-5-20251001")
        return ("ollama", "mistral:7b")

    # Extraction tasks → light-to-mid depending on complexity
    if profile.task_type == "extraction":
        if profile.estimated_tokens < 5000:
            return ("anthropic", "claude-haiku-4-5-20251001")
        return ("anthropic", "claude-sonnet-4-6")

    # Complex reasoning → mid as default, heavy only if justified
    if profile.task_type == "reasoning":
        if profile.estimated_tokens > 50000:
            return ("anthropic", "claude-opus-4-6")
        return ("anthropic", "claude-sonnet-4-6")

    # Default: mid tier
    return ("anthropic", "claude-sonnet-4-6")


def llm_call(prompt: str, profile: TaskProfile) -> str:
    provider, model = select_model(profile)

    if provider == "ollama":
        response = ollama.chat(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        return response["message"]["content"]

    client = anthropic.Anthropic()
    response = client.messages.create(
        model=model,
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text
```
## Cost Monitoring: Track Every Call
Without observability, LLM costs are a black box until the invoice arrives.
```python
import time
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class LLMCallRecord:
    task_type: str
    model: str
    provider: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    cost_usd: float
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


# Approximate pricing (update as models change)
COST_PER_1M_TOKENS = {
    "claude-opus-4-6":           {"input": 15.0, "output": 75.0},
    "claude-sonnet-4-6":         {"input": 3.0,  "output": 15.0},
    "claude-haiku-4-5-20251001": {"input": 0.25, "output": 1.25},
    "llama3.1:8b":               {"input": 0.0,  "output": 0.0},  # compute only
    "mistral:7b":                {"input": 0.0,  "output": 0.0},
}


def tracked_llm_call(prompt: str, profile: TaskProfile, cost_log: list) -> str:
    provider, model = select_model(profile)
    start = time.time()
    result = llm_call(prompt, profile)

    # Estimate tokens (replace with actual token counts from the API response)
    input_tokens = len(prompt.split()) * 1.3
    output_tokens = len(result.split()) * 1.3

    pricing = COST_PER_1M_TOKENS.get(model, {"input": 0, "output": 0})
    cost = (input_tokens * pricing["input"] + output_tokens * pricing["output"]) / 1_000_000

    cost_log.append(LLMCallRecord(
        task_type=profile.task_type,
        model=model,
        provider=provider,
        input_tokens=int(input_tokens),
        output_tokens=int(output_tokens),
        latency_ms=(time.time() - start) * 1000,
        cost_usd=cost,
    ))
    return result
```
Log these records to your warehouse. A simple daily query surfaces where the cost is going:
```sql
SELECT
    task_type,
    model,
    COUNT(*) AS call_count,
    SUM(input_tokens) AS total_input_tokens,
    SUM(output_tokens) AS total_output_tokens,
    SUM(cost_usd) AS total_cost_usd,
    AVG(latency_ms) AS avg_latency_ms
FROM llm_call_log
WHERE DATE(timestamp) = CURRENT_DATE - 1
GROUP BY 1, 2
ORDER BY total_cost_usd DESC
```
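For pipelines that haven't landed the log in a warehouse yet, the same rollup works in memory. A sketch, using plain dicts with hypothetical values in place of real `LLMCallRecord` rows:

```python
from collections import defaultdict

# Hypothetical sample of call records (dicts standing in for LLMCallRecord)
call_log = [
    {"task_type": "classification", "model": "mistral:7b", "cost_usd": 0.0},
    {"task_type": "extraction", "model": "claude-haiku-4-5-20251001", "cost_usd": 0.002},
    {"task_type": "extraction", "model": "claude-haiku-4-5-20251001", "cost_usd": 0.003},
    {"task_type": "reasoning", "model": "claude-sonnet-4-6", "cost_usd": 0.05},
]

# Aggregate call count and total cost per (task_type, model) pair
totals: dict = defaultdict(lambda: {"calls": 0, "cost": 0.0})
for rec in call_log:
    key = (rec["task_type"], rec["model"])
    totals[key]["calls"] += 1
    totals[key]["cost"] += rec["cost_usd"]

# Sort by cost descending, like the ORDER BY in the SQL above
for (task, model), agg in sorted(totals.items(), key=lambda kv: -kv[1]["cost"]):
    print(f"{task:<15} {model:<28} {agg['calls']:>3} calls  ${agg['cost']:.4f}")
```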
## Caching: The Free Cost Reduction
Many LLM calls in data pipelines ask the same question about the same inputs. Caching those responses turns repeat calls into free lookups.
```python
import hashlib

import redis

cache = redis.Redis(host="redis", port=6379)


def cached_llm_call(prompt: str, profile: TaskProfile, ttl_seconds: int = 3600) -> str:
    cache_key = hashlib.sha256(f"{profile.task_type}:{prompt}".encode()).hexdigest()

    cached = cache.get(cache_key)
    if cached:
        return cached.decode()

    result = llm_call(prompt, profile)
    cache.setex(cache_key, ttl_seconds, result)
    return result
```
For classification and routing tasks that run on the same input repeatedly (e.g., categorizing the same product description multiple times), cache hit rates of 40–70% are common. Every hit is a call you don't pay for, so spend on those tasks drops by roughly the hit rate.
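The expected saving is just the hit rate times the pre-cache spend on that task. A sketch that ignores Redis hosting cost, which is usually negligible by comparison:

```python
def cache_savings(monthly_cost_usd: float, hit_rate: float) -> float:
    """Expected monthly spend after caching, given a cache hit rate."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return monthly_cost_usd * (1.0 - hit_rate)

# A task costing $500/month with a 55% hit rate drops to about $225/month
print(f"${cache_savings(500.0, 0.55):.2f}")  # prints $225.00
```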
## Prompt Efficiency
Tokens cost money. Verbose prompts and padded context directly inflate the bill.
**Reduce prompt length:**
- Remove preamble ("You are a helpful AI assistant...") for task-specific models
- Use structured output schemas to avoid lengthy format instructions repeated per call
- Trim retrieved context to only the most relevant chunks (cosine similarity threshold, not top-K blindly)
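A minimal sketch of threshold-based context trimming, using plain-Python cosine similarity over toy 2-D embeddings (in production you'd use your vector store's scores; the 0.75 threshold is an assumption to tune):

```python
from math import sqrt

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

def trim_context(query_vec: list[float], chunks, threshold: float = 0.75) -> list[str]:
    """Keep only chunks whose embedding clears the similarity threshold,
    instead of blindly taking top-K (which pads the prompt with weak matches)."""
    return [text for text, vec in chunks if cosine(query_vec, vec) >= threshold]

# Toy 2-D embeddings: only the first chunk is actually relevant to the query
chunks = [
    ("pricing table for 2025", [1.0, 0.1]),
    ("unrelated changelog entry", [0.1, 1.0]),
]
print(trim_context([1.0, 0.0], chunks))  # → ['pricing table for 2025']
```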
**Batch where possible:** Instead of one LLM call per record, batch multiple records into a single call:

```python
# Expensive: 100 calls for 100 records
for record in records:
    classify(record)

# Cheaper: 2 calls for 100 records at batch_size=50 (if the task allows)
batch_classify(records, batch_size=50)
```
Batching trades latency for cost — acceptable for non-real-time pipeline steps.
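Since `batch_classify` above is a hypothetical helper, the one concrete piece worth showing is the chunking itself, which is just slicing:

```python
from collections.abc import Iterator, Sequence

def chunked(records: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Yield successive slices of `records`, at most `batch_size` each."""
    for i in range(0, len(records), batch_size):
        yield records[i : i + batch_size]

records = list(range(130))
batches = list(chunked(records, 50))
print([len(b) for b in batches])  # → [50, 50, 30]
```

Each batch then becomes one LLM call, with the per-record prompts concatenated into a single structured request.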
## Monthly Budget by Architecture
Rough estimates for a mid-sized data operation (10M records/month, 5 agent pipelines):
| Architecture | Estimated monthly LLM cost |
|---|---|
| All calls → Claude Opus | $8,000–15,000 |
| All calls → Claude Sonnet | $1,500–3,000 |
| Routed (Haiku for classification, Sonnet for reasoning) | $400–800 |
| Routed + Ollama for high-volume simple tasks | $150–350 |
| Routed + Ollama + caching | $80–200 |
The difference between "all calls to Opus" and "routed with caching" is 40–75×. Both produce equivalent output quality for most tasks.
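A sketch of how such an estimate is assembled, with an assumed call mix and illustrative per-call costs (none of these numbers come from a real bill):

```python
# Hypothetical monthly call mix after routing, with illustrative per-call costs
mix = [
    # (task, calls/month, per-call cost in USD)
    ("classification (Ollama)", 8_000_000, 0.0),     # self-hosted: compute only
    ("extraction (Haiku)",      1_500_000, 0.0001),
    ("reasoning (Sonnet)",        400_000, 0.0006),
    ("deep analysis (Opus)",        5_000, 0.01),
]

# Blended monthly API cost is just the volume-weighted sum
monthly = sum(calls * cost for _, calls, cost in mix)
print(f"blended monthly cost: ${monthly:,.0f}")  # prints blended monthly cost: $440
```

The result lands inside the routed range in the table above; the exercise is worth repeating with your own volumes before committing to an architecture.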
Book a strategy session to architect your AI cost strategy.