In This Article
- Why Data Warehouses Still Matter in the Lakehouse Era
- Dimensional Modeling: Star Schema That Serves Analysis
- Data Zones: Raw to Staging to Presentation
- Warehouse Governance: Quality, Access, Lifecycle
- Performance: Partitioning, Indexing, Materialization
- Cloud Platforms: Fabric, Snowflake, BigQuery, Databricks SQL
- Scaling From 1TB to 100TB
- Warehouse Implementation Approach
- Slowly Changing Dimensions: Implementation Patterns
- Data Warehouse Automation (DWA)
- Testing the Data Warehouse
- The Modern ELT Pattern: Transforming Inside the Warehouse
- Warehouse Development Methodology: Agile vs Waterfall
- The Xylity Approach
- Go Deeper
Why Data Warehouses Still Matter in the Lakehouse Era
The lakehouse (Fabric, Databricks) combines lake flexibility with warehouse performance. Does this make warehouses obsolete? No — understanding why clarifies when to use each.
Warehouses excel at: Structured, governed, SQL-queryable data for business users. Dimensional models designed for specific analytical patterns. Performance-optimized for BI dashboards, financial reports, and operational analytics. Governance controls (row-level security, column masking, audit logging) that regulated industries require.
Lakehouses excel at: Semi-structured and unstructured data (JSON, Parquet, images, logs). Schema-on-read flexibility for exploration. Data science workloads that need raw data for feature engineering. Streaming data arriving continuously.
Most enterprises need both: the data lake/lakehouse for raw storage, data science, and unstructured workloads, plus the warehouse for governed, modeled, business-ready data. The warehouse isn't replaced by the lakehouse; it's complemented by it.
Dimensional Modeling: Star Schema That Serves Analysis
Dimensional modeling organizes data into fact tables (measurable events) and dimension tables (descriptive context). The star schema matches how business users think: "show me sales (fact) by product (dimension) by region (dimension) over time (dimension)."
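A minimal sketch of the pattern, using DuckDB in memory as a stand-in warehouse; the table and column names are illustrative, not a prescribed design:

```python
import duckdb

con = duckdb.connect()  # in-memory stand-in for the warehouse
con.execute("""
    -- Dimensions: descriptive context, one row per member
    CREATE TABLE dim_product (product_key INT, product_name VARCHAR, category VARCHAR);
    CREATE TABLE dim_region  (region_key INT, region_name VARCHAR);
    -- Fact: measurable events; grain = one row per sale line
    CREATE TABLE fact_sales  (sale_date DATE, product_key INT, region_key INT,
                              amount DECIMAL(12,2));
    INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware'), (2, 'Gadget', 'Hardware');
    INSERT INTO dim_region  VALUES (10, 'EMEA'), (20, 'APAC');
    INSERT INTO fact_sales  VALUES ('2024-01-05', 1, 10, 99.00),
                                   ('2024-01-07', 2, 20, 149.00),
                                   ('2024-02-03', 1, 20, 99.00);
""")

# "Show me sales (fact) by product (dimension) by region (dimension) over time"
rows = con.execute("""
    SELECT p.category, r.region_name, date_trunc('month', f.sale_date) AS month,
           SUM(f.amount) AS sales
    FROM fact_sales f
    JOIN dim_product p USING (product_key)
    JOIN dim_region  r USING (region_key)
    GROUP BY ALL ORDER BY month
""").fetchall()
print(rows)
```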
Fact Table Design
Grain definition — the most important decision. The grain determines detail level. You can aggregate up from fine-grained data; you can never drill below the grain. If any user needs transaction-level analysis, the grain is transaction-level — even if most users work with daily summaries.
Fact types: Transactional (one row per event), periodic snapshot (one row per entity per period — monthly balance, daily inventory), and accumulating snapshot (one row per process instance — order lifecycle from placement through delivery). Most warehouses need all three.
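The distinction is easiest to see in code. A sketch, in the same illustrative DuckDB setup, that derives a periodic snapshot (one row per product per day) from a transactional fact:

```python
import duckdb

con = duckdb.connect()
con.execute("""
    -- Transactional fact: one row per stock movement
    CREATE TABLE fact_inventory_txn (txn_date DATE, product_key INT, qty_change INT);
    INSERT INTO fact_inventory_txn VALUES
        ('2024-01-01', 1, 100), ('2024-01-02', 1, -30), ('2024-01-04', 1, -10);
""")

# Periodic snapshot: one row per product per day, carrying the running balance
con.execute("""
    CREATE TABLE fact_inventory_daily AS
    SELECT d.dt::DATE AS snapshot_date, t.product_key,
           SUM(t.qty_change) AS on_hand          -- balance as of the snapshot date
    FROM generate_series(DATE '2024-01-01', DATE '2024-01-04', INTERVAL 1 DAY) AS d(dt)
    JOIN fact_inventory_txn t ON t.txn_date <= d.dt
    GROUP BY ALL
""")
print(con.execute(
    "SELECT * FROM fact_inventory_daily ORDER BY snapshot_date").fetchall())
```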
Dimension Design
Slowly changing dimensions (SCD): Type 1 (overwrite — no history), Type 2 (new row with effective dates — full history), Type 3 (previous + current value columns). Most enterprises use Type 2 for critical dimensions — preserving historical context for trend analysis.
Conformed dimensions: Shared across fact tables with identical definitions. The "Customer" dimension used by sales facts is the same one used by support facts — same key, same attributes, same SCD. Enables cross-process analytics: "customers who purchased X AND submitted a support ticket."
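A sketch of why conformance pays off, with invented names; the point is that both fact tables resolve to the same customer key:

```python
import duckdb

con = duckdb.connect()
con.execute("""
    CREATE TABLE dim_customer (customer_key INT, customer_name VARCHAR);
    CREATE TABLE fact_sales   (customer_key INT, product VARCHAR, amount DECIMAL(10,2));
    CREATE TABLE fact_support (customer_key INT, ticket_id INT);
    INSERT INTO dim_customer VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO fact_sales   VALUES (1, 'X', 500.00), (2, 'Y', 300.00);
    INSERT INTO fact_support VALUES (1, 101);
""")

# Cross-process question: customers who purchased X AND submitted a ticket.
# Because both facts share the conformed dimension, the answer is a simple join.
print(con.execute("""
    SELECT DISTINCT c.customer_name
    FROM dim_customer c
    JOIN fact_sales   s ON s.customer_key = c.customer_key AND s.product = 'X'
    JOIN fact_support t ON t.customer_key = c.customer_key
""").fetchall())  # [('Acme',)]
```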
The bus matrix maps which facts connect to which dimensions. Rows are facts (Sales, Orders, Support). Columns are dimensions (Customer, Product, Time, Geography). Each "X" indicates a link. The bus matrix ensures: all facts that should share a dimension do (conformed), no facts are missing needed dimensions, and design covers all analytical patterns before development begins.
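The matrix itself can live as checkable metadata rather than a slide. A hypothetical sketch, with fact and dimension names invented for illustration:

```python
# Hypothetical bus matrix: rows are facts, columns are conformed dimensions
BUS_MATRIX = {
    "fact_sales":   {"Customer", "Product", "Time", "Geography"},
    "fact_orders":  {"Customer", "Product", "Time"},
    "fact_support": {"Customer", "Time"},
}

def shared_dimensions(fact_a: str, fact_b: str) -> set[str]:
    """Dimensions two facts can be drilled across together."""
    return BUS_MATRIX[fact_a] & BUS_MATRIX[fact_b]

# Sales and Support can only be compared by Customer and Time; any request
# for "sales vs tickets by product" exposes a missing dimension link early.
print(shared_dimensions("fact_sales", "fact_support"))  # {'Customer', 'Time'}
```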
Data Zones: Raw to Staging to Presentation
Raw zone: Source data as-is, no transformation. Preserved for reprocessing, audit trail, and debugging. Append-only, never modified. Retention: 12+ months.
Staging zone: Cleansed, validated, conformed. Quality rules execute here — rejecting invalid records, flagging for review. Types standardized (dates to ISO, currencies to base). Clean but not yet dimensionally modeled.
Presentation zone: Dimensionally modeled, performance-optimized. Star schemas with facts and dimensions. Partitioned, indexed, materialized views. This is what Power BI, analysts, and self-service users access. Access controls enforce RLS and column security.
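A sketch of the quality gate between raw and staging, assuming simple dict-shaped records and invented field names; raw rows stay untouched, staging receives only validated rows, and failures land in a reject list for review:

```python
from datetime import date

raw_rows = [  # raw zone: source data as-is, append-only
    {"order_id": "1001", "order_date": "2024-03-05", "amount": "99.90"},
    {"order_id": "",     "order_date": "2024-03-06", "amount": "10.00"},  # missing key
    {"order_id": "1003", "order_date": "not-a-date", "amount": "25.00"},  # bad date
]

staged, rejected = [], []
for row in raw_rows:
    try:
        if not row["order_id"]:
            raise ValueError("order_id is required")
        staged.append({
            "order_id": int(row["order_id"]),
            "order_date": date.fromisoformat(row["order_date"]),  # standardize to ISO
            "amount": float(row["amount"]),                       # standardize type
        })
    except ValueError as err:
        rejected.append({**row, "reject_reason": str(err)})       # flag for review

print(len(staged), "staged;", len(rejected), "rejected")
```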
Warehouse Governance: Quality, Access, Lifecycle
Ingestion governance: Quality gates prevent bad data from loading.
Access governance: RBAC + RLS enforced at the warehouse level.
Lifecycle governance: Retention policies, archive schedules, schema change management.
Schema change management: The schema is a contract with every report, model, and query. Changes follow: propose → assess impact (what breaks?) → test in non-production → deploy through pipeline → notify consumers. Unmanaged changes break downstream systems.
Performance: Partitioning, Indexing, Materialization
Partitioning: Partition large fact tables by date. A 500M-row table partitioned by month lets a "last quarter" query read only the three relevant partitions: roughly 120M rows instead of all 500M. Scanning ~76% fewer rows cuts I/O, and usually query time, roughly in proportion.
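A runnable sketch of the idea, using DuckDB with Hive-partitioned Parquet as a stand-in for warehouse partitioning (paths and names are illustrative):

```python
import duckdb

con = duckdb.connect()

# Write one year of synthetic fact rows, partitioned into month=YYYY-MM folders
con.execute("""
    COPY (
        SELECT DATE '2024-01-01' + i AS order_date,
               strftime(DATE '2024-01-01' + i, '%Y-%m') AS month,
               random() * 100 AS amount
        FROM generate_series(0, 364) AS t(i)
    ) TO 'fact_sales' (FORMAT PARQUET, PARTITION_BY (month), OVERWRITE_OR_IGNORE true)
""")

# A filter on the partition column prunes to 3 of 12 directories before any scan
print(con.execute("""
    SELECT COUNT(*) AS n, ROUND(SUM(amount)) AS total
    FROM read_parquet('fact_sales/**/*.parquet', hive_partitioning = true)
    WHERE month IN ('2024-10', '2024-11', '2024-12')
""").fetchall())
```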
Materialized views: Pre-compute results for common queries. "Monthly sales by region by product" is computed nightly and answered in about 3 seconds, instead of being recomputed on the fly from 500M rows (45 seconds, multiplied across 50 concurrent users). Trades storage and refresh cost for query performance.
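DuckDB (used here as the stand-in) has no materialized views, so this sketch refreshes a plain aggregate table on a schedule, which is the same pattern a materialized view automates:

```python
import duckdb

con = duckdb.connect()
con.execute("""
    CREATE TABLE fact_sales (month VARCHAR, region VARCHAR, product VARCHAR, amount DOUBLE);
    INSERT INTO fact_sales VALUES
        ('2024-01', 'EMEA', 'Widget', 99.0), ('2024-01', 'EMEA', 'Widget', 45.0);
""")

def refresh_monthly_sales(con: duckdb.DuckDBPyConnection) -> None:
    """Nightly job: recompute the aggregate that dashboards actually hit."""
    con.execute("""
        CREATE OR REPLACE TABLE agg_monthly_sales AS
        SELECT month, region, product, SUM(amount) AS sales, COUNT(*) AS line_items
        FROM fact_sales
        GROUP BY ALL
    """)

refresh_monthly_sales(con)  # dashboards then read agg_monthly_sales, not the fact
print(con.execute("SELECT * FROM agg_monthly_sales").fetchall())
```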
Indexing: Column-store for analytical queries (default in cloud warehouses). Clustered indexes on dimension join keys. Non-clustered on frequently filtered columns. Monitor query performance and add indexes for specific slow patterns.
Cloud Platforms: Fabric, Snowflake, BigQuery, Databricks SQL
| Platform | Best For | Compute Model | Differentiator |
|---|---|---|---|
| Fabric Warehouse | Microsoft ecosystem | Shared capacity (CU) | Unified with OneLake, Power BI |
| Snowflake | Multi-cloud, workload isolation | Independent warehouses | Separate storage/compute |
| BigQuery | Google Cloud, serverless | Pay per query TB | Zero infrastructure |
| Databricks SQL | Lakehouse-native SQL | Auto-scaling clusters | Delta Lake integration |
Selection: Microsoft ecosystem → Fabric. Multi-cloud + isolation → Snowflake. Google → BigQuery. Databricks platform → Databricks SQL. Platform follows ecosystem.
Scaling From 1TB to 100TB
1-10TB: Basic optimization suffices. Focus on model quality and governance. Performance issues are usually modeling problems.
10-50TB: Aggregation tables, partition pruning, and materialized views become necessary. Concurrency becomes a concern; separate compute for ETL and BI.
50-100TB+: Separate workload pools. Summarization layers (detailed data in lakehouse, summaries in warehouse). Incremental processing — only changed data. The lakehouse-warehouse combination becomes essential.
Warehouse Implementation Approach
Phase 1: Foundation (4-6 weeks)
Deploy platform. Implement zone architecture. Migrate 2-3 pilot subject areas (Sales, Inventory). Prove architecture with real data and queries.
Phase 2: Core Build (8-12 weeks)
Build remaining subject areas in waves. Dimensional modeling, ETL/ELT pipelines, quality gates. Validate query results, performance, concurrency.
Phase 3: Optimization (4-6 weeks)
Performance tuning (partitions, materialized views, indexes). Enable incremental/CDC loads. Deploy row-level security. Cut over BI tools.
Slowly Changing Dimensions: Implementation Patterns
SCD Type 2, the most common enterprise pattern, requires careful implementation. Each dimension row carries:
- surrogate_key: system-generated, used for fact table joins
- business_key: the natural key from the source (e.g., customer ID)
- effective_date: when this version became active
- expiry_date: when this version was superseded (the current row holds 9999-12-31)
- is_current: boolean flag marking the current version

When a customer moves from New York to London, a new row is inserted with the London address (effective_date = today, is_current = true), and the previous row is updated (expiry_date = today, is_current = false). Old fact table records still join to the New York row, which is correct for historical analysis; new facts join to the London row. This preserves historical truth while reflecting current state: "how many customers are in London today?" and "how many customers were in New York when they made this purchase?" both return correct answers.
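A minimal sketch of the expire-and-insert step, using DuckDB in memory as a stand-in warehouse; table, column, and key names are illustrative, not a prescribed schema:

```python
import duckdb
from datetime import date

con = duckdb.connect()  # in-memory stand-in for the warehouse
con.execute("""
    CREATE SEQUENCE sk START 1;
    CREATE TABLE dim_customer (
        surrogate_key  INT DEFAULT nextval('sk'),  -- system-generated join key
        business_key   VARCHAR,                    -- natural key from the source
        city           VARCHAR,
        effective_date DATE,
        expiry_date    DATE,                       -- 9999-12-31 on the current row
        is_current     BOOLEAN
    );
    INSERT INTO dim_customer (business_key, city, effective_date, expiry_date, is_current)
    VALUES ('C-42', 'New York', DATE '2020-01-01', DATE '9999-12-31', true);
""")

def scd2_update(con, business_key: str, new_city: str, as_of: date) -> None:
    """Expire the current version and insert the new one in a single transaction."""
    con.execute("BEGIN TRANSACTION")
    con.execute("""
        UPDATE dim_customer SET expiry_date = ?, is_current = false
        WHERE business_key = ? AND is_current
    """, [as_of, business_key])
    con.execute("""
        INSERT INTO dim_customer (business_key, city, effective_date, expiry_date, is_current)
        VALUES (?, ?, ?, DATE '9999-12-31', true)
    """, [business_key, new_city, as_of])
    con.execute("COMMIT")

scd2_update(con, "C-42", "London", date(2024, 6, 1))
# Old facts keep joining to the New York surrogate key; new facts get London's
print(con.execute("SELECT * FROM dim_customer ORDER BY effective_date").fetchall())
```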
Data Warehouse Automation (DWA)
Manual warehouse development (hand-coding ETL, DDL, and transformation logic for each table) doesn't scale beyond 50-100 tables. Data warehouse automation tools generate:
- DDL from metadata: define the source-to-target mapping; the tool generates CREATE TABLE and ALTER statements.
- ETL code from patterns: define the loading pattern (SCD Type 2, snapshot, incremental); the tool generates the pipeline code.
- Documentation from implementation: data dictionary, lineage, and transformation logic are auto-documented from the generated code.

DWA tools (WhereScape, TimeXtender, dbt for the transformation layer) reduce development time by 40-60% and eliminate the hand-coding errors that produce inconsistent implementations across tables. The warehouse becomes metadata-driven: change the mapping, regenerate the code, redeploy. No manual code editing for standard patterns.
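A toy sketch of the metadata-driven idea; the mapping structure and names are invented, and real DWA tools cover far more (loads, SCD handling, lineage):

```python
# Hypothetical source-to-target mapping: the mapping is data, the DDL is output
MAPPING = {
    "table": "dim_customer",
    "columns": [
        ("surrogate_key",  "BIGINT"),
        ("business_key",   "VARCHAR(50)"),
        ("city",           "VARCHAR(100)"),
        ("effective_date", "DATE"),
        ("expiry_date",    "DATE"),
        ("is_current",     "BOOLEAN"),
    ],
}

def generate_ddl(mapping: dict) -> str:
    """Render CREATE TABLE from metadata: change the mapping, regenerate, redeploy."""
    cols = ",\n    ".join(f"{name} {dtype}" for name, dtype in mapping["columns"])
    return f"CREATE TABLE {mapping['table']} (\n    {cols}\n);"

print(generate_ddl(MAPPING))
```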
Testing the Data Warehouse
Warehouse testing validates three levels:
- Unit tests: each transformation produces correct output for known input (dbt tests for column-level validation, referential integrity, uniqueness).
- Integration tests: the end-to-end pipeline produces correct results from source to presentation (row count reconciliation, aggregate comparison, sample record verification).
- Regression tests: changes to existing pipelines don't break existing outputs (automated comparison of pre-change and post-change results for a standard query set).

Testing is automated in the CI/CD pipeline: every pipeline change triggers the test suite before deployment. Untested warehouse changes are the #1 cause of incorrect dashboard numbers.
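A sketch of the integration-level reconciliation check, written as a pytest-style test against an in-memory DuckDB; the table names and the trivial "pipeline" are stand-ins:

```python
import duckdb

def test_row_count_and_total_reconcile():
    """Presentation-layer totals must reconcile with the staging source."""
    con = duckdb.connect()
    con.execute("""
        CREATE TABLE stg_orders (order_id INT, amount DOUBLE);
        INSERT INTO stg_orders VALUES (1, 10.0), (2, 20.0), (3, 30.0);
        -- stand-in for the real pipeline under test
        CREATE TABLE fact_orders AS SELECT * FROM stg_orders;
    """)
    src = con.execute("SELECT COUNT(*), SUM(amount) FROM stg_orders").fetchone()
    tgt = con.execute("SELECT COUNT(*), SUM(amount) FROM fact_orders").fetchone()
    assert src == tgt, f"source {src} != target {tgt}"

test_row_count_and_total_reconcile()  # in practice, run by pytest in CI/CD
print("reconciliation passed")
```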
The Modern ELT Pattern: Transforming Inside the Warehouse
Traditional ETL (Extract, Transform, Load) transforms data outside the warehouse, in ETL tools like Informatica or SSIS. Modern ELT (Extract, Load, Transform) loads raw data into the warehouse first, then transforms it using the warehouse's compute engine. Why ELT wins in the cloud:
- The cloud warehouse has virtually unlimited compute (scale up for transformation, scale down after).
- Transformation in SQL is auditable and version-controllable (dbt models are SQL files in Git).
- Eliminating the ETL server removes infrastructure and licensing cost.

dbt (data build tool) is the standard for ELT transformation: SQL-based models that define transformations, tests that validate output, documentation auto-generated from model metadata, and lineage captured from model dependencies. dbt runs inside the warehouse compute; no separate transformation server is needed.
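A compressed sketch of the E-L-T ordering, with DuckDB standing in for the warehouse engine; the CSV and its columns are invented for illustration:

```python
from pathlib import Path
import duckdb

# A stand-in source extract
Path("orders.csv").write_text(
    "order_id,order_date,amount\n1,2024-03-01,99.90\n2,2024-03-02,10.00\n")

con = duckdb.connect()

# EXTRACT + LOAD: land the file in the warehouse raw and untyped
con.execute("""
    CREATE TABLE raw_orders AS
    SELECT * FROM read_csv('orders.csv', header = true, all_varchar = true)
""")

# TRANSFORM: inside the warehouse engine, as version-controllable SQL
# (the same shape a dbt model takes)
con.execute("""
    CREATE TABLE stg_orders AS
    SELECT CAST(order_id AS INT)           AS order_id,
           CAST(order_date AS DATE)        AS order_date,
           CAST(amount AS DECIMAL(12, 2))  AS amount
    FROM raw_orders
    WHERE order_id IS NOT NULL
""")
print(con.execute("SELECT * FROM stg_orders").fetchall())
```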
Warehouse Development Methodology: Agile vs Waterfall
Traditional warehouse development followed waterfall: 6-month requirements phase, 6-month build, 3-month testing. Modern warehouse development uses agile sprints: 2-week sprints delivering 1-2 subject areas or fact tables per sprint. Each sprint produces: working ETL pipelines, validated dimension and fact tables, and connected BI reports that business users can test. Business feedback from sprint reviews drives the next sprint's priorities — "the Sales fact needs these 3 additional measures" becomes next sprint's work, not a 6-month change request. dbt + Git enable this: each sprint's transformations are version-controlled, tested, and deployable. Roll back any sprint that introduces regression. The warehouse evolves incrementally rather than being "finished" in a big-bang delivery. For most enterprises, the agile approach delivers first business value in 4-6 weeks instead of 12-18 months — building organizational confidence while iteratively expanding the warehouse scope.
The Xylity Approach
We design warehouses with the layered architecture — zones, dimensional modeling, conformed dimensions, and performance optimization. Our data engineers, data architects, and Fabric architects build alongside your team — transferring modeling, governance, and optimization capability.
Go Deeper
Continue building your understanding with these related resources from our consulting practice.
Build a Warehouse That Serves the Business
Dimensional modeling, zone architecture, performance optimization. Enterprise warehouse design that turns raw data into governed analytical assets.
Start Your Data Warehouse Project →