17 posts tagged with "data-engineering"

The $250K Employee You Can Replace with an MCP Agent

· 14 min read
Metadata Morph
AI & Data Engineering Team

Every company has at least one of these roles: a highly skilled, well-compensated professional who spends 60% of their day doing something a well-designed system could do automatically. Reading a log. Routing a ticket. Copying a number from one system into another. Writing the same report they wrote last month.

That is not a talent problem. It is an architecture problem. And MCP agents are how you fix it.

Important: this is not about replacing real Data Engineers. The engineers who design systems, solve novel problems, architect pipelines, and make judgment calls under uncertainty are not what we are automating. We are targeting the repetitive, rule-based, high-volume work that consumes a disproportionate share of their week — the work that prevents them from doing what they were actually hired to do.

This post covers where the highest-impact automation opportunities are across the business — and then builds the DBA case in full detail, because database administration is one of the most expensive, most automatable, and most overlooked targets in the enterprise.

Stop Guessing: How to Migrate Presto to BigQuery Without Breaking Your Analytics

· 7 min read
Metadata Morph
AI & Data Engineering Team

Migrating your analytics from Presto to BigQuery is a strategic move — better scalability, serverless pricing, deeper integration with the Google Cloud ecosystem. But the migration itself is where teams lose weeks of engineering time and, worse, end up with reports their stakeholders can no longer trust.

Most Presto-to-BigQuery migrations don't fail on the big stuff. They fail on the small, invisible things: a function that flips its argument order, a type name that changes, an approximation function that's been renamed. The queries still parse without errors. They still return results. The results are just wrong — and nobody notices until a dashboard is questioned in a board meeting.

This post walks through the automated migration pipeline we use at Metadata Morph to move Presto query libraries to BigQuery safely and at scale — using SQLGlot for dialect translation, AST-based testing to validate structure, and DuckDB to prove the converted queries return identical results before anything touches your warehouse or your stakeholders.

Building an AI Data Layer on Top of Your Existing Data Lake and Warehouse

· 6 min read
Metadata Morph
AI & Data Engineering Team

Your data lake and warehouse already hold the answers your business needs. The missing layer isn't more data — it's an intelligent orchestration layer that lets AI agents query, reason, and act on that data reliably.

This post walks through a production-ready architecture that uses dbt as a semantic manifest, Model Context Protocol (MCP) servers as the access layer, and multiple specialized agents to turn your existing Snowflake, Redshift, or BigQuery investment into an active, AI-driven intelligence system.
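The "semantic manifest" half of that architecture can be sketched simply: dbt writes model and column descriptions to `target/manifest.json`, and flattening them into text gives an agent grounded context for SQL generation. A minimal illustration, with a trimmed synthetic manifest (real manifests carry many more keys):

```python
# Synthetic stand-in for a slice of dbt's target/manifest.json.
manifest = {
    "nodes": {
        "model.analytics.fct_orders": {
            "name": "fct_orders",
            "description": "One row per completed order.",
            "columns": {"order_id": {"description": "Primary key."}},
        }
    }
}

def model_context(manifest: dict) -> str:
    """Flatten dbt model metadata into a text block an agent can use
    to ground its queries against the warehouse."""
    lines = []
    for node in manifest["nodes"].values():
        lines.append(f"model {node['name']}: {node['description']}")
        for col, meta in node["columns"].items():
            lines.append(f"  column {col}: {meta['description']}")
    return "\n".join(lines)

context = model_context(manifest)
print(context)
```

Injected into the agent's system prompt, this is what turns "guess a table name" into "query the documented model".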

Building Your First AI Agent with MCP: A Practical Guide

· 5 min read
Metadata Morph
AI & Data Engineering Team

Most "AI" projects are still just API calls wrapped in if/else logic. True agentic AI gives the model real tools — file access, database queries, API calls — and lets it decide how to use them to accomplish a goal.

Model Context Protocol (MCP), developed by Anthropic, is the emerging open standard for connecting AI agents to those tools in a secure, structured way. In this guide you'll configure two MCP servers, write a simple agent, and automate a daily reporting task — using Claude, OpenAI, or a self-hosted Ollama model.
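For a sense of what "configure two MCP servers" looks like, here is an illustrative client configuration in the shape used by Claude Desktop; the filesystem path and database URL are placeholders you would replace with your own:

```json
{
  "mcpServers": {
    "filesystem": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "/path/to/reports"]
    },
    "database": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-postgres", "postgresql://localhost/analytics"]
    }
  }
}
```

Once registered, the agent sees each server's tools and decides when to read a file or run a query on its own.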

High-Speed Network Security Log Analysis with msgspec and AI Agents

· 6 min read
Metadata Morph
AI & Data Engineering Team

Security logs are among the highest-volume, most time-sensitive data in any organization. A single mid-sized network generates millions of log events per hour — firewall denies, DNS queries, authentication events, lateral movement signals. Traditional SIEM tools drown in the volume. Manual analysis is impossible at scale.

This post shows how to combine msgspec for high-performance log parsing with an AI agent that correlates events, identifies threat patterns, and generates structured incident reports — without the overhead of a full SIEM platform.

Multi-Agent Orchestration: When One Agent Isn't Enough

· 6 min read
Metadata Morph
AI & Data Engineering Team

A single agent with access to all your tools sounds like the simplest architecture. In practice, it's the architecture that breaks first. As tool count grows, context windows fill up, prompts become unwieldy, and the agent starts making worse decisions because it's trying to do too many things at once.

Multi-agent systems solve this by decomposing complex workflows into specialized agents with focused responsibilities, coordinated by an orchestrator. The result is more reliable, more observable, and — counter-intuitively — cheaper to operate.
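The orchestrator pattern reduces to a small dispatch loop. A hypothetical sketch (the agent functions are stubs standing in for real LLM-backed agents, each of which would see only its own tools and prompt):

```python
from typing import Callable

def sql_agent(task: str) -> str:
    # Stub: a real implementation would hold only database tools.
    return f"[sql-agent] generated query for: {task}"

def report_agent(task: str) -> str:
    # Stub: a real implementation would hold only formatting tools.
    return f"[report-agent] drafted summary for: {task}"

AGENTS: dict[str, Callable[[str], str]] = {
    "query": sql_agent,
    "summarize": report_agent,
}

def orchestrate(steps: list[tuple[str, str]]) -> list[str]:
    """Each step is (capability, task); the orchestrator routes it to
    the specialist owning that capability and collects the results."""
    return [AGENTS[capability](task) for capability, task in steps]

results = orchestrate([
    ("query", "weekly revenue by region"),
    ("summarize", "explain the revenue trend"),
])
print(results)
```

Because each specialist's prompt and toolset stay small, context windows stay short per call, which is where the counter-intuitive cost savings come from.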

LLM Cost Management for Data Pipelines: When to Use Claude, OpenAI, or Ollama

· 6 min read
Metadata Morph
AI & Data Engineering Team

LLM costs in production pipelines scale differently from anything else in your data infrastructure. A poorly architected pipeline that sends every event through GPT-4o can burn through thousands of dollars per day. A well-architected one running the same workload might cost a tenth of that — by routing each task to the model that's just capable enough for the job.

This post covers the cost architecture decisions that keep AI pipelines economically viable at scale.
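The central decision is a routing one. As a rough sketch, with tier names and relative prices that are assumptions for illustration rather than published rates:

```python
# Assumed tiers: (name, approx $/1M tokens) -- illustrative only.
TIERS = [
    ("local-ollama", 0.0),    # self-hosted: classification, extraction
    ("small-hosted", 0.15),   # mini/haiku-class: summaries, rewrites
    ("frontier", 5.00),       # reserved for reasoning-heavy tasks
]

def route(task_kind: str) -> str:
    """Map a task category to the cheapest tier that is capable enough."""
    if task_kind in {"classify", "extract", "dedupe"}:
        return "local-ollama"
    if task_kind in {"summarize", "rewrite"}:
        return "small-hosted"
    return "frontier"  # multi-step reasoning, code generation, etc.

print(route("classify"), route("summarize"), route("root-cause"))
```

Even this naive router captures the headline effect: if 80% of pipeline events are classification or extraction, they never touch the expensive tier at all.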

Automated KPI Commentary: Teaching an AI Agent to Write the 'So What'

· 5 min read
Metadata Morph
AI & Data Engineering Team

Every metrics review has the same pattern: someone pulls up the dashboard, sees that revenue is up 8% week-over-week, and then spends 20 minutes writing a sentence explaining why. Then they do it again for conversion rate. Then for churn. Then for CAC.

The numbers are already in your warehouse. The context — seasonality, campaigns, product launches, prior period comparisons — is also already in your warehouse. The gap is the synthesis, and that's exactly what a KPI commentary agent closes.
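The synthesis step can be sketched as: compute the delta, attach the known context, and ask the model for the one sentence a human would have written. A hypothetical example with placeholder metric values and context notes:

```python
def wow_change(current: float, prior: float) -> float:
    """Week-over-week percentage change."""
    return (current - prior) / prior * 100

def commentary_prompt(metric: str, current: float, prior: float,
                      context_notes: list[str]) -> str:
    """Assemble the context an LLM needs to write the 'so what'."""
    delta = wow_change(current, prior)
    notes = "; ".join(context_notes)
    return (
        f"{metric} is {current:,.0f}, {delta:+.1f}% week-over-week. "
        f"Known context: {notes}. "
        "Write one sentence explaining the likely driver."
    )

prompt = commentary_prompt(
    "Revenue", 540_000, 500_000,
    ["spring campaign launched Monday", "prior week included a holiday"],
)
print(prompt)
```

Everything in that prompt comes from the warehouse; the agent only supplies the sentence.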

dbt Testing Strategies Before Feeding Data to LLMs: Preventing Garbage-In, Garbage-Out

· 5 min read
Metadata Morph
AI & Data Engineering Team

An AI agent is only as reliable as the data it reasons from. Feed it nulls, duplicates, or stale data and it will produce confident, coherent, and wrong answers — often without any obvious signal that something is off. The LLM doesn't know what it doesn't know.

dbt's testing framework is the right place to enforce data quality before data reaches your agents. This post covers a layered testing strategy that catches the most common failure modes before they become AI failures.
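As a flavor of what that first layer looks like, here is an illustrative `schema.yml` using dbt's built-in generic tests; the model and column names are placeholders:

```yaml
version: 2
models:
  - name: fct_daily_metrics
    columns:
      - name: metric_date
        tests:
          - not_null
          - unique          # one row per day at this grain
      - name: status
        tests:
          - accepted_values:
              values: ['active', 'churned', 'trial']
      - name: account_id
        tests:
          - not_null
          - relationships:
              to: ref('dim_accounts')
              field: account_id
```

Gating agent-facing models on `dbt test` passing means a null-riddled or orphaned-key table never reaches a prompt in the first place.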

Real-Time Agent Context with Kafka: Sub-Second Data Freshness for AI Pipelines

· 5 min read
Metadata Morph
AI & Data Engineering Team

Batch pipelines are sufficient for most analytical workloads. They're not sufficient for AI agents making time-sensitive decisions. An anomaly detection agent that works on yesterday's data misses the incident happening right now. A customer churn agent fed weekly snapshots can't act on a user who disengaged three hours ago.

Real-time streaming closes this gap. With Kafka as the event backbone and Flink for stream processing, your agents can operate on data that is seconds old rather than hours or days.