Medallion Architecture on Databricks: From Raw Data to AI-Ready Datasets
Abstract
The medallion architecture—Bronze, Silver, Gold—is a data organisation pattern that structures data by quality and readiness rather than by domain or source. Popularised by Databricks and Delta Lake, it provides a principled framework for managing the journey from raw ingestion to AI-ready, governed datasets. This note explains how each layer functions, what transformations occur at each stage, and why this pattern has become the de facto standard for enterprise data platforms preparing for analytics and machine learning workloads.
Why Data Layers Matter
Raw data arriving from operational systems is inherently untrustworthy for analytical or AI consumption. It contains duplicates, schema inconsistencies, null values without business meaning, and historical artefacts from upstream system migrations. Consuming this data directly creates fragile pipelines that break on schema evolution, produce incorrect analytical results, and make AI model training unreliable.
The core insight of the medallion architecture is separation of concerns by data quality tier. Rather than transforming raw data into analytical form in a single step—a pattern that creates brittle, hard-to-debug pipelines—the medallion approach introduces intermediate quality layers that isolate transformation logic, make lineage explicit, and allow downstream consumers to trust the quality guarantees of the layer they read from.
Data layering patterns predate the lakehouse: Kimball's staging area, operational data store, and data warehouse layers all express a similar intuition about separating raw capture from analytical serving. What the medallion architecture adds is a design native to cloud object storage at scale. Delta Lake's ACID transactions, implemented through optimistic concurrency control over a transaction log, enable reliable concurrent reads and writes across layers. Plain files on a traditional filesystem or an early-generation data lake offer no such guarantee, so a reader could observe a partially written table, which would make this pattern impractical.
The Bronze Layer
The Bronze layer is the landing zone for raw data from all source systems. Its defining characteristic is minimal transformation: data arrives and is stored as close to its source form as possible, with the addition of ingestion metadata (arrival timestamp, source system identifier, batch ID). The Bronze layer is append-only by default—existing records are never modified, which preserves a complete historical record of every source event.
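The Bronze contract described above can be sketched in plain Python. This is a minimal illustration rather than Databricks code: the field names (`_ingested_at`, `_source_system`, `_batch_id`) and the `BronzeTable` class are hypothetical, chosen only to show that the payload is stored untouched, metadata is added alongside it, and the store exposes no update or delete operation.

```python
from datetime import datetime, timezone
from typing import Optional


def to_bronze_record(raw_payload: str, source_system: str, batch_id: str,
                     arrived_at: Optional[datetime] = None) -> dict:
    """Wrap a raw payload with ingestion metadata without altering the payload."""
    arrived_at = arrived_at or datetime.now(timezone.utc)
    return {
        "raw_payload": raw_payload,              # stored exactly as received
        "_ingested_at": arrived_at.isoformat(),  # arrival timestamp
        "_source_system": source_system,         # source system identifier
        "_batch_id": batch_id,                   # batch ID
    }


class BronzeTable:
    """Toy append-only store: rows can be added, never updated or deleted."""

    def __init__(self) -> None:
        self._rows = []

    def append(self, rows: list) -> None:
        self._rows.extend(rows)

    def rows(self) -> tuple:
        # Return an immutable view so callers cannot modify history in place.
        return tuple(self._rows)
```

Replaying the same source event simply appends a second row; deciding which copy wins is deferred to the Silver layer.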
The primary technical concern at Bronze is reliable, scalable ingestion. Azure Data Factory, Databricks Auto Loader, and Apache Kafka are common ingestion mechanisms. Auto Loader is particularly well suited to the Bronze pattern because it incrementally processes new files as they arrive in cloud object storage such as ADLS, and it can infer the schema and evolve it as new columns appear, which removes most manual schema management.
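The idea of incremental file processing can be modelled without any Databricks dependency. The sketch below is not Auto Loader and does not use its API; `run_incremental_load` and the in-memory `checkpoint` set are hypothetical stand-ins for a file listing and a persisted checkpoint.

```python
from typing import Callable, Iterable, List, Set


def run_incremental_load(listing: Iterable[str], checkpoint: Set[str],
                         process: Callable[[str], None]) -> List[str]:
    """Process only files not yet recorded in the checkpoint."""
    new_files = sorted(path for path in listing if path not in checkpoint)
    for path in new_files:
        process(path)
        # Record the file only after it was processed successfully, so a
        # failure part-way through leaves it eligible for the next run.
        checkpoint.add(path)
    return new_files
```

Calling the function twice with the same listing processes nothing the second time, which is the property that makes repeated or overlapping ingestion runs safe.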
Partitioning strategy at Bronze should be driven by ingestion cadence rather than query patterns. For event streams, partitioning by arrival date and hour is standard; for batch loads, partitioning by source system and load date makes replay and debugging straightforward. Writing Bronze in Delta format rather than raw Parquet is generally recommended even at this earliest layer: Delta's transaction log enables reliable schema evolution tracking and supports time travel for incident investigation without requiring additional tooling or data copies.
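As a small illustration of the two partitioning schemes, the helpers below derive Hive-style partition paths. The column names (`arrival_date`, `arrival_hour`, `source_system`, `load_date`) are assumptions made for the example, not a prescribed layout.

```python
from datetime import date, datetime, timezone


def stream_partition(arrived_at: datetime) -> str:
    """Partition event streams by arrival date and hour, normalised to UTC.

    Expects a timezone-aware datetime; a naive one would be interpreted
    as local time by astimezone().
    """
    ts = arrived_at.astimezone(timezone.utc)
    return f"arrival_date={ts:%Y-%m-%d}/arrival_hour={ts:%H}"


def batch_partition(source_system: str, load_date: date) -> str:
    """Partition batch loads by source system and load date."""
    return f"source_system={source_system}/load_date={load_date:%Y-%m-%d}"
```

With this layout, replaying one failed batch means rewriting a single `source_system=.../load_date=...` directory rather than scanning the whole table.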
The Silver Layer
The Silver layer is where raw data becomes trustworthy. Transformations at this layer focus on deduplication, null handling, type casting, and the application of business-level validation rules. A Silver table is the first representation of an entity that a data consumer can reasonably rely on—it has a stable schema, documented quality guarantees, and is not subject to the upstream variability of source systems.
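A minimal sketch of these Silver transformations, in plain Python rather than Spark, might look as follows. The `order_id`, `amount`, and `currency` fields and the validation rule (no negative amounts) are hypothetical; the point is the order of operations: cast, validate, handle nulls, then keep the latest version of each key.

```python
from typing import List, Optional


def clean_order(raw: dict) -> Optional[dict]:
    """Cast types and apply validation rules; return None for invalid rows."""
    try:
        order_id = int(raw["order_id"])
        amount = float(raw["amount"])
        ingested_at = raw["_ingested_at"]
    except (KeyError, TypeError, ValueError):
        return None
    if amount < 0:  # assumed business rule: amounts are never negative
        return None
    # Null handling: give a missing currency an explicit, documented value.
    currency = (raw.get("currency") or "UNKNOWN").strip().upper()
    return {"order_id": order_id, "amount": amount,
            "currency": currency, "_ingested_at": ingested_at}


def to_silver(bronze_rows: List[dict]) -> List[dict]:
    """Deduplicate on order_id, keeping the most recently ingested version."""
    latest = {}
    for raw in bronze_rows:
        row = clean_order(raw)
        if row is None:
            continue
        current = latest.get(row["order_id"])
        # ISO-8601 strings in one consistent format compare chronologically.
        if current is None or row["_ingested_at"] > current["_ingested_at"]:
            latest[row["order_id"]] = row
    return sorted(latest.values(), key=lambda r: r["order_id"])
```

In this sketch invalid rows are discarded; a production pipeline would normally route them somewhere observable instead.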
The Silver layer is also where cross-source joins typically occur. If a customer entity exists in both a CRM and an ERP system, the Silver layer is the appropriate place to resolve the entity and produce a canonical customer record. This requires careful identity resolution logic and, ideally, a data contract that specifies the merge rules and their business justification.
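To make the idea of explicit merge rules concrete, here is a deliberately simple sketch. Every rule in it is an assumption for illustration: records are matched on a normalised email address, the CRM is treated as the authority for the customer name, and the ERP as the authority for the billing address. Real identity resolution is usually fuzzier than an exact key match.

```python
from typing import List


def normalise_email(email: str) -> str:
    return email.strip().lower()


def resolve_customers(crm_rows: List[dict], erp_rows: List[dict]) -> List[dict]:
    """Produce one canonical record per customer from CRM and ERP rows."""
    erp_by_email = {normalise_email(r["email"]): r for r in erp_rows}
    canonical = []
    for crm in crm_rows:
        key = normalise_email(crm["email"])
        erp = erp_by_email.pop(key, None)  # matched ERP rows are consumed
        canonical.append({
            "email": key,
            "name": crm["name"],  # assumed rule: CRM wins for name
            "billing_address": erp["billing_address"] if erp else None,  # ERP wins
            "crm_id": crm["crm_id"],
            "erp_id": erp["erp_id"] if erp else None,
        })
    # Customers known only to the ERP still get a canonical record.
    for key, erp in erp_by_email.items():
        canonical.append({
            "email": key,
            "name": erp.get("name"),
            "billing_address": erp["billing_address"],
            "crm_id": None,
            "erp_id": erp["erp_id"],
        })
    return canonical
```

Keeping both source identifiers on the canonical record preserves lineage back to each system, which is what a data contract would need in order to justify the merge.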
Delta Live Tables (DLT) expectations are the most natural quality enforcement mechanism for Silver in a Databricks environment. Expectations are declared as constraint annotations on the pipeline definition, and each one specifies what happens to failing records: keep them and record the violation in the pipeline's data quality metrics (the default), drop them, or fail the update. Quarantining is not a built-in action, but it is a common pattern: a second table applies the inverse of the rules and collects the rejected rows for inspection. The right choice depends on data criticality. For regulatory reporting pipelines, failing loudly is correct; for analytics aggregations, quarantining failures and alerting on volume is often preferable to stopping the flow.
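The trade-off between violation policies can be modelled in a few lines of plain Python. This is not the DLT API; `apply_expectation` and `ExpectationFailed` are hypothetical names, and the three policies (`warn`, `drop`, `fail`) loosely mirror retaining, dropping, and failing on invalid records. Returning the failed rows separately is what makes a quarantine table possible.

```python
from typing import Callable, List, Tuple


class ExpectationFailed(Exception):
    """Raised when a 'fail' expectation sees at least one violating row."""


def apply_expectation(rows: List[dict], name: str,
                      predicate: Callable[[dict], bool],
                      on_violation: str = "warn") -> Tuple[List[dict], List[dict], dict]:
    """Return (kept_rows, failed_rows, metrics) for one expectation."""
    if on_violation not in ("warn", "drop", "fail"):
        raise ValueError(f"unknown policy: {on_violation}")
    passed, failed = [], []
    for row in rows:
        (passed if predicate(row) else failed).append(row)
    if failed and on_violation == "fail":
        raise ExpectationFailed(f"{name}: {len(failed)} row(s) violated")
    # 'warn' keeps every row and only reports; 'drop' keeps passing rows.
    kept = list(rows) if on_violation == "warn" else passed
    metrics = {"expectation": name, "passed": len(passed), "failed": len(failed)}
    return kept, failed, metrics
```

For an analytics pipeline, one might run with `on_violation="drop"`, write the returned failed rows to a quarantine table, and alert when `metrics["failed"]` exceeds an agreed threshold.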
The Gold Layer
The Gold layer contains business-ready datasets: aggregated metrics, denormalised fact tables, feature stores, and curated datasets built for specific analytical or AI consumption patterns. Gold tables are typically shaped to answer specific questions rather than representing raw entities, which means they are more specific in scope but faster to query and easier to understand for non-engineering consumers.
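As an example of a table shaped around one question ("what was revenue per day and currency?"), the following sketch aggregates hypothetical Silver order rows. The `order_date`, `currency`, and `amount` fields are assumptions for illustration.

```python
from collections import defaultdict
from typing import List


def daily_revenue(silver_orders: List[dict]) -> List[dict]:
    """Aggregate order-level rows to one row per (order_date, currency)."""
    totals = defaultdict(float)
    for order in silver_orders:
        totals[(order["order_date"], order["currency"])] += order["amount"]
    return [
        {"order_date": day, "currency": currency, "revenue": round(total, 2)}
        for (day, currency), total in sorted(totals.items())
    ]
```

The grain changes from one row per order to one row per day and currency: narrower in scope than the Silver table, but directly usable by a dashboard.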
A key design principle for Gold is consumer-driven design. The schema and grain of a Gold table should be driven by how the consuming team—a data analyst, a machine learning engineer, a business intelligence tool—needs to read the data. Building Gold tables in isolation from their consumers leads to datasets that are technically correct but practically unused.
Gold tables should carry explicit operational contracts: freshness guarantees (updated daily by 07:00 UTC, for example), documented owners with on-call responsibilities, and a deprecation process that gives downstream consumers advance notice before a table is removed or restructured. Without this discipline, Gold datasets accumulate technical debt as silently as any other artefact, and the consequences are more severe: a broken build is caught immediately, whereas a stale or silently restructured Gold table surfaces as incorrect dashboards and misleading model training data.
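A freshness guarantee is easier to enforce when it is written down as data. The sketch below is one hypothetical way to do that: `GoldContract` and `breaches_freshness` are illustrative names, and the rule assumes that "updated daily by 07:00 UTC" means an update must have landed on the current UTC date once the deadline has passed.

```python
from dataclasses import dataclass
from datetime import datetime, time, timezone


@dataclass(frozen=True)
class GoldContract:
    table: str
    owner: str
    fresh_by_utc: time  # daily deadline, for example time(7, 0)


def breaches_freshness(contract: GoldContract, last_updated: datetime,
                       now: datetime) -> bool:
    """True if today's deadline has passed without an update today.

    Both datetimes are expected to be timezone-aware and in UTC.
    """
    deadline = datetime.combine(now.date(), contract.fresh_by_utc, tzinfo=timezone.utc)
    day_start = datetime.combine(now.date(), time(0, 0), tzinfo=timezone.utc)
    return now >= deadline and last_updated < day_start
```

A scheduled job could evaluate this for every registered contract and page `contract.owner` on a breach.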
Preparing AI-Ready Datasets
AI Readiness Criteria
An AI-ready dataset is not simply a clean dataset. AI workloads impose additional requirements that go beyond those of traditional analytics: feature completeness, temporal consistency for time-series models, low label noise for supervised learning, and stable distribution for models deployed in production. Meeting these requirements typically requires Gold-layer extensions specifically designed for ML consumption.
Feature stores built on top of the Gold layer are the most mature pattern for AI readiness. They enforce point-in-time correctness (preventing target leakage), provide versioned feature sets that reproduce training conditions exactly, and decouple feature engineering from model training schedules. Databricks Feature Store integrates directly with MLflow, creating a traceable lineage from raw Bronze data to trained model.
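Point-in-time correctness is the least intuitive of these guarantees, so a small sketch may help. It is plain Python, not the Databricks Feature Store API, and the data shapes are assumptions: each label row has an `entity_id` and a timestamp `ts`, and `feature_history` maps an entity to a list of `(timestamp, value)` pairs.

```python
from bisect import bisect_right
from typing import Dict, List, Tuple


def point_in_time_join(labels: List[dict],
                       feature_history: Dict[str, List[Tuple[int, float]]]) -> List[dict]:
    """Attach to each label the latest feature value known at label time."""
    joined = []
    for label in labels:
        history = sorted(feature_history.get(label["entity_id"], []),
                         key=lambda pair: pair[0])
        timestamps = [ts for ts, _ in history]
        # Number of feature rows with timestamp <= the label timestamp.
        idx = bisect_right(timestamps, label["ts"])
        value = history[idx - 1][1] if idx > 0 else None
        joined.append({**label, "feature": value})
    return joined
```

A naive join on `entity_id` alone would attach the most recent feature value to every label, including values recorded after the outcome was known, which is exactly the leakage this lookup avoids. Some teams go further and require the feature timestamp to be strictly earlier than the label timestamp.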
Maintaining AI-ready datasets over time requires monitoring beyond data quality checks. Feature distributions shift as upstream systems evolve or business processes change—a feature that was informative during model training may become uninformative or misleading in production months later. Distribution drift detection, implemented as a scheduled comparison of training-time and current feature statistics, should be a first-class operational concern for any team serving ML workloads from the Gold layer, not an afterthought addressed only when model performance visibly degrades.
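One possible statistic for such a comparison is the Population Stability Index (PSI). The text above does not prescribe a metric, so this choice is an assumption, as is the equal-width binning over the training range. For binned proportions $e_i$ (training) and $a_i$ (current), PSI is $\sum_i (a_i - e_i)\ln(a_i / e_i)$.

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index between a training and a current sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins
    if width == 0:
        width = 1.0  # degenerate case: all training values are identical

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            # Values outside the training range fall into the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor avoids log(0) and division by zero for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

A widely used rule of thumb treats values above roughly 0.2 as a shift worth investigating, but the threshold should be tuned per feature rather than taken as fixed.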
Common Implementation Risks
The most common failure mode for medallion implementations is layer proliferation: teams create Silver-Prime, Gold-Refined, and Platinum layers as workarounds for upstream quality issues rather than fixing the root cause. This results in a data landscape that is as confusing as the pre-medallion state, with the added complexity of a naming convention that implies quality guarantees it no longer provides.
A second significant risk is ownership ambiguity. The medallion pattern requires clear ownership of each layer and each table within it. Without this, Bronze tables accumulate stale data with no owner to deprecate it, Silver tables develop schema drift that no one catches until a downstream pipeline breaks, and Gold datasets are modified ad-hoc without versioning.
At the pipeline level, two failure patterns are particularly common. The first is non-idempotent merges: if a Silver transformation is re-run after a partial failure, incorrectly specified MERGE INTO conditions can produce duplicate or inconsistent records. A Delta Lake merge can be made idempotent, so that re-running the same batch leaves the table unchanged, but only when the match predicate identifies each record uniquely and the source batch is deduplicated first; this requires deliberate design. The second is small-file accumulation: high-frequency streaming ingest creates many small files that degrade query performance significantly over time. Compaction should be built into the operational runbook from initial deployment rather than applied reactively when query times spike, either through scheduled OPTIMIZE runs (optionally with ZORDER BY on frequently filtered columns) or by enabling auto optimize (optimized writes and auto compaction) on the affected tables.
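The idempotency requirement can be illustrated with a toy upsert. This is a plain-Python model of the behaviour one wants from a merge, not Delta Lake's MERGE INTO; the `key` and `version` column names are hypothetical.

```python
from typing import Dict, List


def merge_into(target: Dict[object, dict], updates: List[dict],
               key: str, version: str) -> Dict[object, dict]:
    """Upsert rows by key, overwriting only when the incoming version is newer.

    Matching on a unique key and comparing versions means that replaying
    the same batch after a partial failure leaves the target unchanged.
    """
    for row in updates:
        existing = target.get(row[key])
        if existing is None or row[version] > existing[version]:
            target[row[key]] = dict(row)
    return target
```

By contrast, a merge whose match condition misses the key (or a plain append) would insert the replayed rows again, which is the duplicate-record failure described above.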
My Opinion / Critique
The medallion architecture is the right default for most enterprise data platforms. Its value is not in any particular technical innovation—the individual components (Delta Lake, Auto Loader, Delta Live Tables) are independent capabilities—but in the discipline it imposes on data transformation. The pattern is opinionated enough to prevent the worst anti-patterns, and flexible enough to accommodate most enterprise data landscapes.
My main critique is that the pattern is often applied too rigidly. In practice, not every dataset needs all three layers. For small, low-velocity reference datasets with stable schemas, a direct Silver materialisation from source is more pragmatic than a Bronze landing zone that adds latency and storage cost with no meaningful benefit. Pattern adherence should be a default, not a dogma.
Daniel Conejo Sobrino
Enterprise Data Engineer