Medallion Architecture for Lakehouses Guide

What Is the Medallion Architecture?

The medallion architecture organizes lakehouse data into 3 quality tiers: Bronze (Raw) — data exactly as received from the source. No transformation, no cleansing. The "single source of truth" for what the source system sent. Purpose: auditability, reprocessing, debugging. Silver (Cleaned) — deduplicated, validated, conformed, and enriched. Silver is the "trusted, reusable" layer — multiple gold tables built from the same silver tables. Gold (Business-Ready) — aggregated and modeled for specific use cases: dimensional models for Power BI, feature sets for ML models, and domain-specific data products. Each layer adds value through transformation — but the previous layer is always preserved for auditability and reprocessing.

Bronze stores what happened. Silver stores what it means. Gold stores what it's for. Each layer adds value — and the previous layer is always preserved.

Bronze Layer Design

Bronze design principles: append-only (never modify bronze data — every load appended with timestamp, creating complete history), schema-on-read (store in native format — schema enforcement at silver), partitioning (by ingestion date — efficient queries on recent data and lifecycle management), and metadata (every record: source system, ingestion timestamp, batch ID — full traceability). Technology: Fabric lakehouse or Databricks Delta Lake. Autoloader or ADF for ingestion. Target file sizes: 100MB-1GB for optimal query performance. Bronze is the safety net — when a silver transformation has a bug or business logic changes, you reprocess from bronze instead of re-extracting from the source system.

Silver Layer Transformation

Silver transformations: deduplication (identify and remove duplicates using primary key + source timestamp — keep latest version, log duplicates), validation (apply data quality checks: NOT NULL on required fields, referential integrity, value range, format. Failed records → quarantine table with failure reason — not silently dropped), conforming (standardize: dates to ISO 8601, currencies to consistent precision, country codes to ISO 3166, identifiers unified across sources), type casting (convert strings to proper types: "2024-01-15" → DATE, "1234.56" → DECIMAL(10,2)), and joining (enrich by joining across sources: orders + customers + products → enriched order record). Silver tables are the reusable foundation — build once, consume many times. Every gold table traces back to silver tables, and every silver table traces back to bronze.

Gold Layer Business Models

Gold tables are purpose-built: dimensional models (star schema for BI: fact_orders + dim_customer + dim_product + dim_date → Power BI semantic model serving 10 dashboards. Fast analytical queries, intuitive drill-down, consistent metrics), feature stores (ML feature tables: customer_features with purchase frequency, recency, CLV, churn score → consumed by ML pipeline. Point-in-time correct, low-latency, reusable), data products (customer_360 from CRM + support + billing + product usage → customer success team. Published with SLA and documentation), and aggregations (daily_revenue_by_region, monthly_inventory_levels → executive dashboards. Pre-computed for fast refresh).

Implementation on Fabric and Databricks

Aspect	Fabric	Databricks
Bronze	Fabric pipelines, Dataflows, shortcuts	Autoloader, Delta Live Tables
Silver	Spark notebooks, Dataflows Gen2	Delta Live Tables, Spark notebooks
Gold	SQL endpoint, Direct Lake in Power BI	SQL warehouse, JDBC/ODBC
Quality	Purview DQ, Great Expectations	DLT expectations, Great Expectations
Governance	Purview catalog + lineage	Unity Catalog + lineage

Both platforms implement medallion natively — the architecture is platform-agnostic, the tooling differs. Choice depends on: cloud investment (Azure → Fabric), team skills (Spark → Databricks), and Power BI priority (Direct Lake → Fabric).

Medallion Governance

Per-layer governance: Bronze: access restricted to data engineers. Business users never query bronze. Retention: per regulation or 3 years default. Silver: access for engineers (write), analysts (read), scientists (read). Quality SLAs published. Schema changes require review. Gold: access for all authorized business users (read via Power BI or SQL). Published as data products with documentation, SLA, owner, and version. Changes require: impact analysis, testing, and consumer communication. The governance escalates with the layer: bronze is engineering territory, silver is shared, gold is business-facing — each with appropriate controls.

Common Medallion Architecture Mistakes

Five mistakes that undermine the medallion architecture: 1. Skipping bronze. "We'll just load cleaned data directly into silver." When the transformation logic has a bug: there's no raw data to reprocess. When the business changes a definition: there's no historical raw data to re-derive. Bronze is the insurance policy — it costs storage but saves months of re-extraction. 2. Silver becomes a dump. Silver has 500 tables, half undocumented, quality varies wildly. Solution: treat silver like a product — every table has an owner, quality SLA, and documentation. Tables that don't meet the standard are quarantined, not published. 3. Gold tables are just views. Gold should be materialized tables optimized for consumption — not views that run complex queries every time a dashboard refreshes. Views on silver work for prototyping; production gold tables should be pre-computed for performance. 4. No lifecycle management. Bronze data accumulates forever. Silver tables from abandoned projects persist. Gold tables from deprecated dashboards consume resources. Solution: retention policies per layer (bronze: 3 years, silver: 2 years for inactive tables, gold: active tables only — archived when the consuming dashboard is retired). 5. Everyone queries bronze. Business analysts running ad-hoc queries on bronze data: no quality guarantees, raw formats, performance issues. Solution: enforce access policies — bronze for engineers only, silver for analysts and scientists, gold for everyone.

Medallion Architecture for Real-Time Data

The medallion architecture works for streaming data as well as batch: bronze streaming (raw events appended to a bronze Delta table via Spark Structured Streaming or Kafka-to-lakehouse pipeline — events stored exactly as received, partitioned by ingestion hour), silver streaming (continuous transformation: deduplication (using event ID watermark), validation (schema and business rules), and enrichment (joining with slowly-changing dimension tables) — silver tables updated within minutes of source events), and gold streaming (real-time aggregations: rolling window metrics, live dashboards, and event-driven triggers — gold tables serving Power BI Direct Lake for sub-second dashboard refresh). The streaming medallion architecture enables: real-time dashboards with quality-assured data (not just fast data — correct data, fast), historical reprocessing capability (bronze retains all raw events for replay), and consistent architecture for both batch and streaming (same layer semantics, same governance, same tooling — just different update frequency).

Medallion Layer Design: Practical Guidelines

Layer design decisions that determine long-term maintainability: Bronze naming convention (bronze_[source]_[table] — e.g., bronze_erp_sales_orders, bronze_crm_accounts. The source prefix is critical: when you have 3 sources with "orders" tables, the source prefix prevents confusion), Silver naming convention (silver_[domain]_[entity] — e.g., silver_sales_orders, silver_customers. Domain-oriented, not source-oriented. The Silver layer abstracts the source system — consumers don't need to know the data came from ERP vs CRM), Gold naming convention (gold_[business_use]_[entity] — e.g., gold_revenue_daily, gold_customer_360, gold_inventory_snapshot. Business-use-oriented — the name tells the consumer what the table is for, not where it came from), partition strategy (Bronze: partition by ingestion date (enables efficient retention and reprocessing). Silver: partition by business date (enables efficient analytical queries). Gold: partition by the most common query filter (date for time-series, region for geographic queries)), and file format (Delta Lake for all layers — provides: ACID transactions, time travel, schema enforcement, and MERGE capability. Parquet is acceptable for Bronze if Delta isn't available — but Delta is strongly preferred for Silver and Gold where MERGE operations are essential).

Data Quality Gates Between Layers

Quality gates prevent bad data from propagating through the medallion layers: Bronze → Silver gate (schema validation: all expected columns present with correct types. Row count: within expected range based on historical average. Primary key: not null, unique after deduplication. Failed records: quarantined in a bronze_quarantine_[table] table, not promoted to Silver), Silver → Gold gate (business rule validation: all business calculations produce valid results. Referential integrity: all foreign keys resolve to valid parent records in Silver. Completeness: critical fields have null rates below threshold. Cross-table consistency: revenue in the orders table matches revenue in the revenue summary. Failed validation: Gold table not updated — previous valid Gold data preserved for consumers. Alert sent to DataOps team). The gates ensure: Bronze can contain anything (raw data with potential quality issues). Silver contains only validated, deduplicated, correctly-typed data. Gold contains only business-verified, reconciled data ready for decision-making.

Performance Optimization for Medallion Architecture

Performance optimization by layer: Bronze (append-only writes — no MERGE, no deduplication at ingestion time. Just write fast. Optimize for write throughput. File compaction: scheduled OPTIMIZE command to consolidate small files — preventing the "small files problem" that degrades read performance), Silver (MERGE operations for upserts — Z-ORDER BY the primary key column for efficient MERGE joins. Vacuum old file versions after 7 days — reclaim storage from Delta Lake time travel history. Auto-optimize with auto-compaction and optimized writes enabled), Gold (Z-ORDER BY the most common query filter columns — typically date and category dimensions. Liquid clustering on Databricks for adaptive optimization. Materialized views or pre-aggregated tables for frequently accessed dashboard queries — reducing compute cost for repetitive analytical queries). Monitoring: track query performance per Gold table. Queries taking longer than 30 seconds → investigate: missing Z-ORDER? excessive file count? suboptimal partition strategy? Performance optimization is ongoing — as data volume grows, optimization parameters must be adjusted.

Medallion Architecture Performance Optimization

Performance optimization per layer: bronze (file size optimization: target 100MB-1GB per file. Small files (under 1MB) create: slow reads, excessive metadata overhead, and poor query performance. Use Spark's auto-compaction or schedule compaction jobs to merge small files. Partition by ingestion date — enables efficient time-range queries and lifecycle management), silver (Z-ORDER or OPTIMIZE on frequently filtered columns: customer_id, order_date, product_id. These clustering operations co-locate related data on disk, improving query performance 5-10x for filtered queries. Schedule OPTIMIZE weekly for large tables), and gold (pre-compute aggregations instead of querying silver at read time. Materialized gold tables with scheduled refresh outperform silver views by 10-100x for dashboard queries. For Power BI Direct Lake: gold tables in Delta format provide sub-second dashboard refresh — matching import mode performance with DirectQuery-like freshness). Storage optimization: bronze (cold/archive tier after 90 days — bronze is rarely queried after initial processing), silver (standard tier — queried regularly for exploration and gold table builds), gold (hot tier — queried constantly by dashboards, APIs, and ML inference).

Data Products on the Medallion Architecture

Gold tables evolve into data products — purpose-built datasets published for specific consumers: data product definition (name, description, owner, SLA, schema, refresh frequency, quality metrics, and access policy — documented in the data catalog), data product types (analytics products: star schema for Power BI. Feature products: feature tables for ML models. Integration products: datasets consumed by downstream applications via SQL endpoint or API. Report products: pre-aggregated datasets optimized for specific reports or dashboards), data product lifecycle (draft → published → deprecated → retired. Draft: schema and content being developed. Published: production-quality, consumed by users. Deprecated: still available but no longer enhanced — consumers should migrate to the replacement. Retired: removed). The data product catalog in Purview or Unity Catalog enables: discovery (business users browse available data products), self-service access (request access through the catalog, approved by the data owner), impact analysis (which consumers depend on this product? — critical for change management), and quality monitoring (quality scores published alongside each data product — consumers know the quality before they use the data).

Medallion Architecture: Cost Analysis and Optimization

The medallion architecture creates 3 copies of data — cost implications: storage cost (Bronze: raw data, largest volume. Silver: cleaned, typically 80-90% of Bronze volume (duplicates removed). Gold: aggregated, typically 10-30% of Silver volume. Total: 1.8-2.2x the raw data volume. Mitigation: Bronze archived to cool/archive tier after 90 days — reducing hot storage cost 60-80%), compute cost (each layer transition requires processing: Bronze→Silver is the most compute-intensive (deduplication, type casting, validation). Silver→Gold is typically lighter (aggregation, joins). Mitigation: incremental processing — only new/changed records processed, not the full table every run), and management cost (each layer requires: pipeline development, monitoring, and maintenance. Mitigation: standardized templates — create a medallion pipeline template that handles: ingestion to Bronze, transformation to Silver, and aggregation to Gold — new tables onboarded using the template in hours, not days). Net cost impact: the medallion architecture costs 20-30% more than a single-layer architecture. The value: 10x reduction in data issue resolution time (reprocess from Bronze instead of re-ingesting from source), clear quality guarantees per layer, and governance that satisfies auditors.

Medallion Architecture for Real-Time Data

The medallion architecture extends to streaming data: streaming Bronze (raw events from Kafka/Event Hubs appended to the Bronze Delta table. Each event includes: event data + metadata (event_id, timestamp, source, partition). Retention: 30-90 days in hot storage, then archived), streaming Silver (Spark Structured Streaming or Flink processes Bronze events: deduplication (using event_id), schema validation, type casting, and enrichment (joining with reference data). MERGE into Silver Delta table. Latency: seconds to minutes depending on processing complexity), and streaming Gold (aggregated streaming metrics: events per minute, running totals, windowed averages. Materialized in Gold Delta tables for dashboard consumption. Power BI DirectLake connects to Gold for near-real-time dashboards). The streaming medallion pattern enables: real-time operational dashboards (Gold refreshes every minute), historical analysis (Silver provides detailed event history), and debugging/recovery (Bronze preserves every raw event for replay).

The Xylity Approach

We implement medallion architecture on Fabric and Databricks with the layered-quality methodology — bronze for auditability, silver for clean reusable data, gold for purpose-built consumption. Our data engineers and Fabric architects build lakehouse architectures with data lineage from source to dashboard and quality improvement at each layer.

Continue building your understanding with these related resources from our consulting practice.

Data Engineering

Data Engineering consulting.

Explore →

/services/data-engineering/

Consulting.

Explore →

/staff-augmentation/data-professionals/hire-data-engineers/

Pre-qualified engineers.

Explore →