
Building an AI Data Layer on Top of Your Existing Data Lake and Warehouse

· 6 min read
Metadata Morph
AI & Data Engineering Team

Your data lake and warehouse already hold the answers your business needs. The missing layer isn't more data — it's an intelligent orchestration layer that lets AI agents query, reason, and act on that data reliably.

This post walks through a production-ready architecture that uses dbt as a semantic manifest, Model Context Protocol (MCP) servers as the access layer, and multiple specialized agents to turn your existing Snowflake, Redshift, or BigQuery investment into an active, AI-driven intelligence system.

The Problem: Data is Trapped

Most companies have built solid data foundations — a lake for raw ingestion, a warehouse for curated models, dbt for transformations. But that data is still passive. Analysts write queries. Reports are scheduled. Dashboards go stale.

The opportunity is to make that foundation agentic: give AI agents secure, structured access to your existing warehouse and let them answer questions, detect anomalies, and trigger workflows — without rebuilding your stack.

The AI Data Layer Architecture

┌─────────────────────────────────────────────────────────────────┐
│ BUSINESS LAYER │
│ Slack · Email · Dashboard · API Consumers │
└──────────────────────────┬──────────────────────────────────────┘

┌──────────────────────────▼────────────────────────────────────┐
│ AGENT ORCHESTRATOR │
│ LangGraph · Multi-agent routing │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐ │
│ │ Analytics │ │ Anomaly │ │ Report & Narrative │ │
│ │ Agent │ │ Agent │ │ Agent │ │
│ └──────┬──────┘ └──────┬──────┘ └──────────┬──────────┘ │
└──────────┼────────────────┼────────────────────┼──────────────┘
│ │ │
┌──────────▼────────────────▼──────────────────────▼────────────┐
│ MCP SERVER LAYER │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐ │
│ │ Warehouse │ │ dbt │ │ Notification │ │
│ │ MCP │ │ Manifest │ │ MCP │ │
│ │ (SQL access) │ │ MCP │ │ (Slack / Email) │ │
│ └──────┬───────┘ └──────┬───────┘ └──────────────────┘ │
└──────────┼─────────────────┼──────────────────────────────────┘
│ │
┌──────────▼─────────────────▼──────────────────────────────────┐
│ DATA FOUNDATION │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐ │
│ │ Raw Lake │ │ dbt Models │ │ Semantic Layer │ │
│ │ (S3 / GCS) │ │ (Warehouse) │ │ (Metrics Store) │ │
│ └──────────────┘ └──────────────┘ └──────────────────┘ │
└───────────────────────────────────────────────────────────────┘

Why dbt is the Perfect Semantic Manifest

dbt already knows everything about your data: model names, column descriptions, relationships, tests, and freshness checks. The manifest.json that dbt generates at compile time is a machine-readable map of your entire data warehouse.
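Because manifest.json is plain JSON, a few lines of Python are enough to pull out the model inventory an agent needs. A minimal sketch (the helper name and sample usage are ours, not part of dbt or any library):

```python
import json

def list_models(manifest: dict) -> dict:
    """Return {model_name: description} for every dbt model in a manifest."""
    return {
        node["name"]: node.get("description", "")
        for node in manifest["nodes"].values()
        if node["resource_type"] == "model"
    }

# Usage, assuming the standard dbt target directory:
# with open("/opt/dbt/target/manifest.json") as f:
#     print(list_models(json.load(f)))
```

The dbt manifest MCP server does essentially this lookup on the agent's behalf, plus column-level detail from catalog.json.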

Instead of hardcoding table names and column logic into agent prompts, we expose the dbt manifest through a dedicated MCP server. Agents query the manifest to understand what data exists and how it's structured — then query the warehouse with that context.

The MCP configuration below wires up all four access points: warehouse SQL, the dbt manifest, notifications, and report output.

{
  "mcpServers": {
    "warehouse": {
      "command": "uvx",
      "args": ["mcp-server-snowflake"],
      "env": {
        "SNOWFLAKE_ACCOUNT": "${SF_ACCOUNT}",
        "SNOWFLAKE_USER": "${SF_USER}",
        "SNOWFLAKE_PASSWORD": "${SF_PASSWORD}",
        "SNOWFLAKE_DATABASE": "ANALYTICS",
        "SNOWFLAKE_WAREHOUSE": "COMPUTE_WH"
      }
    },
    "dbt_manifest": {
      "command": "uvx",
      "args": ["mcp-server-dbt"],
      "env": {
        "DBT_MANIFEST_PATH": "/opt/dbt/target/manifest.json",
        "DBT_CATALOG_PATH": "/opt/dbt/target/catalog.json"
      }
    },
    "notifications": {
      "command": "uvx",
      "args": ["mcp-server-slack"],
      "env": {
        "SLACK_BOT_TOKEN": "${SLACK_TOKEN}"
      }
    },
    "filesystem": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "/data/reports"]
    }
  }
}

The Three Specialized Agents

A single monolithic agent trying to do everything becomes fragile and expensive. Instead, we deploy three focused agents routed by a LangGraph orchestrator.

1. Analytics Agent

Answers natural-language questions about business metrics. It first queries the dbt manifest MCP to find the right model and columns, then runs the SQL against the warehouse.

System prompt excerpt:

You are an analytics agent with access to a Snowflake data warehouse.
Before writing any SQL, always query the dbt_manifest tool to find
the correct model name and verified column definitions.
Never guess table or column names.
Return results as structured JSON.

Example trigger: "What were the top 5 revenue-generating products last quarter, broken down by region?"
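The manifest-first discipline in that prompt can be sketched in plain Python. `lookup_model` and `top_products_sql` are hypothetical stand-ins for the dbt manifest tool call and the SQL the agent would hand to the warehouse MCP; the column names are illustrative:

```python
def lookup_model(manifest: dict, name: str) -> dict:
    """Resolve a model node, and its verified columns, from the dbt manifest."""
    for node in manifest["nodes"].values():
        if node["resource_type"] == "model" and node["name"] == name:
            return node
    raise KeyError(f"model {name!r} not found in manifest")

def top_products_sql(model: dict) -> str:
    """Build SQL only from columns the manifest confirms exist."""
    required = {"product_name", "region", "revenue"}
    missing = required - set(model["columns"])
    if missing:
        # Refuse to guess: surface the gap instead of hallucinating columns.
        raise ValueError(f"model lacks columns: {sorted(missing)}")
    return (
        f"SELECT product_name, region, SUM(revenue) AS revenue "
        f"FROM {model['name']} "
        f"WHERE order_date >= DATEADD(quarter, -1, CURRENT_DATE) "
        f"GROUP BY 1, 2 ORDER BY revenue DESC LIMIT 5"
    )
```

The key design point is that the failure mode is an exception, not a plausible-looking query against a table that doesn't exist.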

2. Anomaly Detection Agent

Runs on a schedule. Queries key metric models, compares against rolling baselines, and fires alerts when thresholds are breached.

Example workflow:

  1. Query fct_daily_orders for yesterday's figures
  2. Query mart_metrics_history for the 30-day rolling average and standard deviation
  3. Flag any metric more than 2 standard deviations from baseline
  4. Post anomaly report to #data-alerts Slack channel with the SQL used to detect it
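Steps 2 and 3 reduce to a z-score check. A minimal sketch, assuming the metric history arrives as a list of daily values (the function name and tuple return are ours):

```python
from statistics import mean, stdev

def flag_anomaly(history: list[float], today: float, threshold: float = 2.0):
    """Return (is_anomaly, z_score) for today's value against the
    rolling baseline, flagging beyond `threshold` standard deviations."""
    baseline, sd = mean(history), stdev(history)
    z = (today - baseline) / sd if sd else 0.0
    return abs(z) > threshold, z
```

In production you would pull `history` from mart_metrics_history and `today` from fct_daily_orders, then attach the detecting SQL to the Slack alert as in step 4.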

3. Report & Narrative Agent

Consumes outputs from the Analytics and Anomaly agents and generates human-readable executive summaries. It writes structured JSON reports to the filesystem MCP and posts narrative summaries to Slack or email.

This agent does not query the warehouse directly — it only processes structured data passed by the other agents. This separation keeps the narrative generation cheap (small context window) and reliable (no SQL errors).
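One way to enforce that separation is a narrow, typed payload contract between agents. The field names below are illustrative, and the deterministic formatter stands in for the LLM narrative step:

```python
from typing import TypedDict

class AnomalyFinding(TypedDict):
    metric: str
    value: float
    baseline: float
    z_score: float

def summarize(findings: list[AnomalyFinding]) -> str:
    """Deterministic fallback summary of upstream agent output;
    in production an LLM turns the same payload into a narrative."""
    if not findings:
        return "No anomalies detected."
    return "\n".join(
        f"{f['metric']}: {f['value']:.0f} vs baseline {f['baseline']:.0f} "
        f"(z={f['z_score']:.1f})"
        for f in findings
    )
```

Because the report agent only ever sees this structure, a malformed warehouse result fails upstream, in the agent that owns SQL, rather than mid-narrative.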

LangGraph Orchestration

LangGraph manages the state machine that routes queries between agents and handles retries.

from langgraph.graph import StateGraph, START, END
from typing import TypedDict, Literal

class DataLayerState(TypedDict):
    query: str
    agent_type: Literal["analytics", "anomaly", "report"]
    warehouse_result: dict
    narrative: str
    alert_fired: bool

def route_query(state: DataLayerState):
    """Classify the incoming query and route to the right agent."""
    query = state["query"].lower()
    if any(w in query for w in ["anomaly", "alert", "unusual", "spike", "drop"]):
        return "anomaly_agent"
    if any(w in query for w in ["report", "summary", "executive", "weekly"]):
        return "report_agent"
    return "analytics_agent"

graph = StateGraph(DataLayerState)
graph.add_node("analytics_agent", run_analytics_agent)
graph.add_node("anomaly_agent", run_anomaly_agent)
graph.add_node("report_agent", run_report_agent)
graph.add_conditional_edges(START, route_query)
graph.add_edge("analytics_agent", END)
graph.add_edge("anomaly_agent", "report_agent")
graph.add_edge("report_agent", END)

app = graph.compile()

The anomaly agent always hands off to the report agent — anomalies need narrative context, not raw numbers.

Keeping the Data Layer Honest

The biggest risk in agentic data systems is agents querying stale or untested models. Two guardrails prevent this:

1. dbt test gating — Before the orchestrator runs, a preflight check queries the dbt manifest MCP for any models that failed their last dbt test run. Those models are excluded from the agent's accessible table list for that run.

2. Freshness enforcement — The warehouse MCP wrapper checks dbt source freshness metadata. If a source hasn't been updated within its declared freshness threshold, the agent is instructed to flag the result as potentially stale rather than presenting it as current.
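Guardrail 2 boils down to comparing a source's last load time against its declared threshold. A simplified sketch (the function name and labels are ours, not part of dbt's API):

```python
from datetime import datetime, timedelta, timezone

def freshness_status(loaded_at: datetime, warn_after: timedelta) -> str:
    """Classify a source against its declared dbt freshness threshold,
    so stale results get flagged instead of presented as current."""
    age = datetime.now(timezone.utc) - loaded_at
    return "stale" if age > warn_after else "fresh"
```

The warehouse MCP wrapper would read `loaded_at` from dbt's source freshness metadata and prepend a staleness warning to any result built on a "stale" source.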

Deployment Footprint

This architecture requires no new data infrastructure — it runs alongside your existing stack:

Component: Where it runs

MCP servers: Lightweight Python processes on any VM or container
LangGraph orchestrator: Python service (Docker, ECS, Cloud Run)
dbt manifest: Generated by your existing dbt CI pipeline, shared via S3/GCS
Warehouse: Your existing Snowflake / Redshift / BigQuery (read-only access)
Agent outputs: Slack, email, S3 report files, or your existing BI tool via API

The agents consume your warehouse with read-only credentials. They never write back to the lake or warehouse — outputs go through the notification and filesystem MCP servers only.

What This Unlocks

Once the AI Data Layer is live, your team stops waiting for analyst availability to answer data questions. The analytics agent handles the routine queries. The anomaly agent replaces scheduled threshold alerts with reasoning-based detection that understands context. The report agent turns raw numbers into board-ready narratives in seconds.

Your data lake and warehouse don't need to change. The intelligence layer sits on top.


Ready to build your AI Data Layer? Book a strategy session with the Metadata Morph team and we'll map this architecture to your existing stack.