
Stop Guessing: How to Migrate Presto to BigQuery Without Breaking Your Analytics

· 7 min read
Metadata Morph
AI & Data Engineering Team

Migrating your analytics from Presto to BigQuery is a strategic move — better scalability, serverless pricing, deeper integration with the Google Cloud ecosystem. But the migration itself is where teams lose weeks of engineering time and, worse, end up with reports their stakeholders can no longer trust.

Most Presto-to-BigQuery migrations don't fail on the big stuff. They fail on the small, invisible things: a function that flips its argument order, a type name that changes, an approximation function that's been renamed. The queries still parse without errors. They still return results. The results are just wrong — and nobody notices until a dashboard is questioned in a board meeting.

This post walks through the automated migration pipeline we use at Metadata Morph to move Presto query libraries to BigQuery safely and at scale — using SQLGlot for dialect translation, AST-based testing to validate structure, and DuckDB to prove the converted queries return identical results before anything touches your warehouse or your stakeholders.

Why SQL Migrations Are a Business Risk, Not Just a Technical One

Most engineering teams budget for the migration work itself — converting queries, updating connection strings, repointing dashboards. They don't budget for the silent data quality failures that follow. When a metric is wrong by 3% because DATE_DIFF argument order flipped, that error doesn't throw an exception. It quietly corrupts every report that metric feeds until someone questions a number at the wrong moment.

SQL dialects share roughly 80% of their surface area. The last 20% is where migrations fail — and it's rarely obvious. Consider this:

-- Presto
DATE_DIFF('day', created_at, NOW())

-- BigQuery
DATE_DIFF(CURRENT_DATE(), created_at, DAY)

Argument order flips. The date part moves from a string to a keyword. Both are valid SQL in their respective dialects. Neither throws a parse error in the wrong context. They silently return wrong results.

At small scale, manual review catches these. At 200+ queries with a migration deadline and a business unit waiting on their dashboards, it doesn't. What you need is automation that doesn't just translate — it verifies.

The Three-Layer Migration Approach

We structure the migration pipeline in three layers, each catching a different class of error:

┌──────────────────────────────────────────┐
│            MIGRATION PIPELINE            │
│                                          │
│  1. SQLGlot Transpiler                   │
│     Presto SQL → BigQuery SQL            │
│     Handles dialect differences at scale │
│                                          │
│  2. AST-Based Test Suite                 │
│     Structural validation of the output  │
│     Catches regressions across 500+ files│
│                                          │
│  3. DuckDB Semantic Validator            │
│     Runs both versions against real data │
│     Proves row-level equivalence locally │
└──────────────────────────────────────────┘

This matters because passing syntax checks is not the same as returning correct results. Each layer closes a gap the previous one leaves open.

Layer 1: Automated Dialect Translation with SQLGlot

SQLGlot is an open-source SQL parser and transpiler that understands 20+ dialects. The translation itself is a single call — the engineering work is in wrapping it for batch processing and failure recovery:

import sqlglot
from sqlglot import ErrorLevel
from dataclasses import dataclass
from typing import Optional


@dataclass
class TranspileResult:
    source_sql: str
    bigquery_sql: Optional[str]
    error: Optional[str]
    success: bool


def transpile_presto_to_bigquery(sql: str) -> TranspileResult:
    try:
        results = sqlglot.transpile(
            sql, read="presto", write="bigquery", error_level=ErrorLevel.WARN
        )
        return TranspileResult(
            source_sql=sql,
            bigquery_sql=results[0] if results else None,
            error=None,
            success=bool(results),  # empty output counts as a failure, not a pass
        )
    except Exception as e:
        return TranspileResult(source_sql=sql, bigquery_sql=None, error=str(e), success=False)

The TranspileResult wrapper is what makes batch conversion viable. When one query out of 500 fails, you log it and keep going — you don't crash the entire run and lose the work completed so far.

SQLGlot handles the common dialect gaps automatically:

Presto                          BigQuery
DATE_DIFF('day', a, b)          DATE_DIFF(b, a, DAY)
ARRAY_JOIN(arr, ', ')           ARRAY_TO_STRING(arr, ', ')
APPROX_DISTINCT(col)            APPROX_COUNT_DISTINCT(col)
CAST(id AS VARCHAR)             CAST(id AS STRING)
JSON_EXTRACT(col, '$.key')      JSON_EXTRACT_SCALAR(col, '$.key')

Layer 2: Structural Validation with AST Testing

String comparison is the wrong tool for SQL testing. Whitespace, quoting style, and alias casing produce false failures on semantically identical queries. Instead, we parse both the converted output and the expected output into abstract syntax trees, then compare the normalized structures:

def assert_converts(test, presto_sql: str, expected_bq: str) -> None:
    result = transpile_presto_to_bigquery(presto_sql)
    test.assertTrue(result.success, f"Transpilation failed: {result.error}")

    actual_ast = sqlglot.parse_one(result.bigquery_sql, dialect="bigquery")
    expected_ast = sqlglot.parse_one(expected_bq, dialect="bigquery")

    # Compare canonical renderings of both ASTs, so whitespace and
    # quoting differences don't produce false failures
    test.assertEqual(
        actual_ast.sql(dialect="bigquery"),
        expected_ast.sql(dialect="bigquery"),
    )

This also catches a subtle failure mode: SQLGlot can occasionally produce output that is plausible-looking but not parseable in the target dialect. A roundtrip assertion — transpile, then parse the output back through the BigQuery parser — catches those cases before they reach your warehouse.

Layer 3: Semantic Equivalence with DuckDB

Structural tests prove the query looks right. They don't prove it returns the right rows. For that you need execution.

The key insight is that DuckDB doesn't speak Presto or BigQuery natively — but SQLGlot can transpile into DuckDB dialect too. So we use a triangular pattern: transpile both the original and converted query into DuckDB dialect, run both against the same in-memory dataset, and assert the results match exactly:

Presto SQL ────► DuckDB dialect ──► Execute ──► rows_a
     │                                             │
     ▼                                             ▼
SQLGlot transpile                             assertEqual
     │                                             ▲
     ▼                                             │
BigQuery SQL ──► DuckDB dialect ──► Execute ──► rows_b

This runs entirely locally — no cloud credentials, no BigQuery costs, no waiting for job queues. You get proof of correctness in seconds.

def _run(conn, sql: str, dialect: str):
    duckdb_sql = sqlglot.transpile(sql, read=dialect, write="duckdb")[0]
    return conn.execute(duckdb_sql).fetchall()


def assert_equivalent(self, presto_sql: str) -> list:
    result = transpile_presto_to_bigquery(presto_sql)
    self.assertTrue(result.success)

    presto_rows = _run(self.conn, presto_sql, "presto")
    bq_rows = _run(self.conn, result.bigquery_sql, "bigquery")

    self.assertEqual(presto_rows, bq_rows)
    return bq_rows

We test against fixtures engineered to surface the edge cases that silently break migrations: NULL values, empty arrays, zero-value aggregations, and JSON strings. Clean fixture data catches clean bugs. Edge case data catches the failures that only show up in production.

What This Gives Your Team

The business value here isn't just speed — it's confidence. Running this pipeline on CI means every converted query is validated before it ever touches BigQuery. Your data team can migrate at scale without putting analyst trust on the line. Your stakeholders keep getting the numbers they rely on. And your engineers aren't stuck doing manual spot-checks for weeks after go-live.

Your migration checklist becomes:

  • Run transpiler with --error-level warn, review flagged queries
  • AST tests pass for all converted files
  • Semantic equivalence confirmed against representative fixtures
  • Spot-check APPROX_DISTINCT results — HyperLogLog implementations differ slightly across dialects by design
  • Final BigQuery dry-run before flipping traffic

A migration that previously required weeks of careful manual review — with the ever-present risk of a wrong number slipping through — becomes a repeatable, auditable CI pipeline that runs in minutes.

What Requires Manual Review

Automated transpilation handles the 80%. A few things genuinely require human judgment:

  • UDFs — User-defined functions are not portable. They need to be rewritten in the target dialect.
  • Query hints and session variables — SET SESSION statements and execution hints don't translate. Strip them before converting.
  • Deep JSON path expressions — Complex JSON_EXTRACT chains with array indexing need a second look.
  • APPROX_DISTINCT numeric accuracy — The translation is correct, but BigQuery and Presto use different HyperLogLog implementations. Expect small numeric differences on large cardinalities.
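For the session-variable case, the pre-processing pass can be as simple as this sketch. It uses naive semicolon splitting, so it will mis-handle semicolons inside string literals; that is usually fine for a query library, not for arbitrary SQL:

```python
def strip_session_statements(sql_script: str) -> str:
    """Drop Presto SET SESSION statements before transpiling."""
    kept = []
    for stmt in sql_script.split(";"):
        stmt = stmt.strip()
        # Keep every non-empty statement that isn't a session setting
        if stmt and not stmt.upper().startswith("SET SESSION"):
            kept.append(stmt)
    return ";\n".join(kept) + (";" if kept else "")


script = """
SET SESSION query_max_run_time = '30m';
SELECT id FROM users;
"""
print(strip_session_statements(script))
# prints "SELECT id FROM users;"
```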

Knowing where automation stops and judgment starts is what separates a fast, confident migration from a painful one that drags on for months.


The Bottom Line

Warehouse migrations are not just engineering projects — they're business continuity decisions. Every day your team is mid-migration is a day your analytics are in a state of partial trust. The goal isn't just to get the queries running in BigQuery. It's to get there without breaking the reports your business depends on, and without burning out the engineers responsible for making it work.

SQLGlot handles the dialect translation at scale. DuckDB closes the gap between "it parses" and "it returns the right rows." Together they give you a migration pipeline you can run on CI, audit at any point, and trust when you flip the switch.

If you're planning a warehouse migration or need help building a validation pipeline for your query library, book a strategy session with the Metadata Morph team — we'll assess your query footprint, identify the highest-risk dialect gaps, and build a migration plan your team can execute with confidence.