What Actually Breaks in Zero-Copy Iceberg Migrations (and How We Operated Through It)
Abstract
Zero-copy migration, which registers existing data files into Iceberg and so avoids the O(n) cost of rewriting the data, is a compelling promise, but it is operationally treacherous at scale. In this talk, we break down a production migration of 6,000+ Hive tables (~1 exabyte) to Iceberg, focusing on the failure modes that only surface under heavy live traffic.
We’ll cover what actually broke and how we operated through it: metadata conflicts from ORC files with and without Iceberg field IDs, mismatched partition metadata, HDFS lock storms caused by large rename waves, engine-level gaps between Spark and Trino, and the mechanics of zero-downtime cutovers using dual ingestion, atomic per-partition moves, and quick rollback.
Rather than an idealized path, this talk focuses on the guardrails and controls required to make zero-copy Iceberg migrations survivable in production. Attendees will leave with a practical framework for deciding when zero-copy is viable and for operating it safely at scale.
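
As a flavor of the guardrails in question, here is a minimal sketch of a pre-registration check for the first failure mode listed above: partitions that hold ORC files both with and without Iceberg field IDs. It is an illustration under stated assumptions, not the tooling used in the migration. The `OrcFile` record, the `has_field_ids` flag (assumed to be collected by some earlier inspection step), and both function names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class OrcFile:
    """Hypothetical inventory record for one ORC data file.

    How has_field_ids is determined is out of scope for this sketch;
    it is assumed to come from an earlier file-inspection step.
    """
    partition: str
    path: str
    has_field_ids: bool


def classify_partitions(files: Iterable[OrcFile]) -> Dict[str, str]:
    """Label each partition 'with_ids', 'without_ids', or 'mixed'."""
    seen = defaultdict(set)
    for f in files:
        seen[f.partition].add(f.has_field_ids)

    labels = {}
    for partition, flags in seen.items():
        if flags == {True}:
            labels[partition] = "with_ids"
        elif flags == {False}:
            labels[partition] = "without_ids"
        else:
            # Both kinds of file are present in the same partition.
            labels[partition] = "mixed"
    return labels


def partitions_needing_review(files: Iterable[OrcFile]) -> List[str]:
    """Return the mixed partitions, sorted, so they can be held back."""
    labels = classify_partitions(files)
    return sorted(p for p, label in labels.items() if label == "mixed")


inventory = [
    OrcFile("dt=2024-01-01", "a.orc", True),
    OrcFile("dt=2024-01-01", "b.orc", False),
    OrcFile("dt=2024-01-02", "c.orc", False),
    OrcFile("dt=2024-01-03", "d.orc", True),
]
held_back = partitions_needing_review(inventory)
```

In this sketch only the partition containing both kinds of file is held back for review; uniform partitions pass through. Whether a uniform partition without field IDs is actually safe to register depends on how column resolution is configured for the table, which this check does not attempt to decide.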