The Data Blind Spot
Application systems have observability: uptime monitoring, error rates, latency metrics, and alerting. Data systems traditionally have none of this. The data pipeline succeeds (the job completed without errors) but the data is wrong: a source schema changed and the pipeline silently dropped 40% of rows, a source system sent duplicate records, or the transformation produced NULLs where it shouldn't have. Data observability closes this gap by monitoring the data itself, not just the pipeline. When the pipeline says "success" but the data says "something is wrong," observability catches the discrepancy before a business user sees a broken dashboard and loses trust in the entire data platform.
5 Pillars of Data Observability
| Pillar | What It Monitors | Failure It Detects |
|---|---|---|
| Freshness | When was data last updated? | Pipeline failure, delayed source, stale reports |
| Volume | How many rows arrived? | Source outage, filter errors, partial loads |
| Schema | Has the structure changed? | Column added/removed, type changes, breaking changes |
| Distribution | Do values look normal? | Data corruption, source issues, transformation bugs |
| Lineage | Where did data come from, and where does it go? | Downstream impact of failures, root cause of quality issues |
Pillar 1: Freshness Monitoring
Freshness = time since the table was last updated. Monitoring: for each critical table, define an SLA: "the orders table must be updated every 2 hours." Alert when the table hasn't been updated within the SLA window. Implementation: check the maximum timestamp in the "updated_at" or "loaded_at" column; if MAX(loaded_at) is more than 2 hours old → stale data alert. Freshness monitoring catches: pipeline failures that didn't generate error logs (the pipeline silently stopped), source system delays (the source API is responding slowly, delaying downstream data), and schedule misconfigurations (the pipeline is scheduled for 6 AM but the source data isn't available until 7 AM, so the pipeline runs on yesterday's data every day).
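A minimal sketch of what a freshness check can look like in Python, assuming a "loaded_at" column and a warehouse query that returns the latest load timestamp; the table names and SLA values below are illustrative, not prescriptive:

```python
from datetime import datetime, timedelta, timezone

# Illustrative freshness SLAs per critical table.
FRESHNESS_SLAS = {
    "sales.orders": timedelta(hours=2),
    "finance.invoices": timedelta(hours=24),
}

def check_freshness(table: str, last_loaded: datetime, sla: timedelta) -> dict:
    """Flag a table as stale when its latest load is older than the SLA.

    `last_loaded` would come from the warehouse, e.g.:
        SELECT MAX(loaded_at) FROM sales.orders
    """
    age = datetime.now(timezone.utc) - last_loaded
    return {"table": table, "age": age, "sla": sla, "stale": age > sla}

# Example: the last load finished 3 hours ago against a 2-hour SLA.
result = check_freshness(
    "sales.orders",
    last_loaded=datetime.now(timezone.utc) - timedelta(hours=3),
    sla=FRESHNESS_SLAS["sales.orders"],
)
if result["stale"]:
    print(f"ALERT: {result['table']} is stale ({result['age']} since last load, SLA {result['sla']})")
```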
Pillar 2: Volume Monitoring
Volume = number of rows in the latest load. Monitoring: establish a baseline for each table (the orders table typically receives 5,000-7,000 rows per day). Alert when volume deviates significantly from the baseline (below 3,000 suggests a possible source outage; above 12,000 suggests possible duplicate data). Implementation: track daily row counts, compute a rolling average and standard deviation, and alert when today's count falls outside 2-3 standard deviations from the average. Volume monitoring catches: partial loads (the pipeline loaded 2,000 of 6,000 expected rows because the source API timed out mid-extraction), duplicate data (the pipeline ran twice due to a scheduler error, producing 12,000 rows instead of 6,000), and source outages (a source system was down during the extraction window, so zero rows loaded and the pipeline reported "success" because it didn't error).
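A simple version of this check needs nothing more than the historical daily counts. The sketch below flags today's load when it falls outside the configured number of standard deviations; the history and threshold are illustrative assumptions, not values from a real pipeline:

```python
import statistics

def volume_anomaly(daily_counts: list[int], todays_count: int, z_threshold: float = 3.0) -> dict:
    """Flag today's row count when it deviates more than z_threshold standard
    deviations from the historical daily baseline."""
    mean = statistics.mean(daily_counts)
    stdev = statistics.stdev(daily_counts)
    z_score = (todays_count - mean) / stdev if stdev else 0.0
    return {
        "baseline_mean": round(mean),
        "z_score": round(z_score, 1),
        "anomaly": abs(z_score) > z_threshold,
    }

# Example: ~6,000 rows/day historically, only 2,000 arrived today (a partial load).
history = [5800, 6100, 6400, 5900, 6200, 6050, 6300, 5950, 6150, 6000]
print(volume_anomaly(history, todays_count=2000))
# -> anomaly: True, because today's count is far below the baseline
```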
Pillar 3: Schema Change Detection
Schema monitoring detects: column additions (a source system added a new column — the pipeline ignores it, but downstream consumers may need it), column removals (a source column was dropped — the pipeline fails or produces NULLs in the corresponding lakehouse column), type changes (a column changed from string to integer — the pipeline's transformation logic may break or silently truncate data), and column renames (the source renamed "customer_id" to "cust_id" — the pipeline can't find the expected column). Implementation: snapshot the schema of each source table daily. Compare today's schema to yesterday's. Alert on any difference with: what changed, which tables are affected, and which downstream dashboards/models consume this table. Purview provides automated schema detection for cataloged data sources. Great Expectations and Monte Carlo provide schema monitoring as part of their observability platforms.
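The comparison itself is straightforward once the snapshots exist. Here is a sketch of the diff step, assuming each snapshot is a simple column-name-to-type mapping captured from the source; the column names below are made up for illustration:

```python
def diff_schemas(yesterday: dict[str, str], today: dict[str, str]) -> dict:
    """Compare two {column_name: data_type} snapshots and report schema drift."""
    added = sorted(set(today) - set(yesterday))
    removed = sorted(set(yesterday) - set(today))
    retyped = sorted(col for col in set(today) & set(yesterday) if today[col] != yesterday[col])
    return {
        "added": added,
        "removed": removed,
        "type_changed": retyped,
        "changed": bool(added or removed or retyped),
    }

# Example: the source renamed customer_id to cust_id and retyped order_amount.
yesterday = {"order_id": "bigint", "customer_id": "string", "order_amount": "string"}
today = {"order_id": "bigint", "cust_id": "string", "order_amount": "decimal(10,2)"}
print(diff_schemas(yesterday, today))
# {'added': ['cust_id'], 'removed': ['customer_id'], 'type_changed': ['order_amount'], 'changed': True}
```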
Pillar 4: Distribution Monitoring
Distribution monitoring detects: unexpected NULLs (the "email" column is 99% populated historically but today's load is 40% NULL — something changed in the source), value range anomalies (the "order_amount" column has historically ranged $10-$10,000 but today includes a $500,000 entry — data error or legitimate outlier?), categorical distribution shifts (the "status" column has historically been 70% "active" / 30% "inactive" but today is 20% "active" / 80% "inactive" — a source system change or a real business shift?), and uniqueness violations (the "order_id" column should be unique but today has 500 duplicates — the pipeline loaded the same data twice). Implementation: for each critical column, define expected distributions: NULL rate, value range, distinct count, and statistical distribution. Compare each load to the expectations. Alert on significant deviations. Tools like Monte Carlo, Great Expectations, and data quality frameworks automate distribution monitoring across hundreds of tables.
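Purpose-built tools handle this at scale, but the underlying checks are easy to picture. A hedged sketch using pandas, with thresholds and column names assumed for illustration rather than taken from any particular platform:

```python
import pandas as pd

def distribution_checks(df: pd.DataFrame) -> list[str]:
    """Compare today's load against expected distributions and return alert messages."""
    alerts = []

    # Unexpected NULLs: email has historically been ~99% populated.
    null_rate = df["email"].isna().mean()
    if null_rate > 0.05:
        alerts.append(f"email NULL rate {null_rate:.0%} exceeds 5% threshold")

    # Value range: order_amount normally falls between $10 and $10,000.
    out_of_range = df[(df["order_amount"] < 10) | (df["order_amount"] > 10_000)]
    if not out_of_range.empty:
        alerts.append(f"{len(out_of_range)} order_amount values outside $10-$10,000")

    # Uniqueness: order_id must be unique.
    duplicates = df["order_id"].duplicated().sum()
    if duplicates:
        alerts.append(f"{duplicates} duplicate order_id values")

    return alerts

# Example load with a NULL email, an extreme amount, and a duplicate order_id.
load = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "email": ["a@x.com", None, "c@x.com", "d@x.com"],
    "order_amount": [120.0, 500_000.0, 75.0, 15.0],
})
for alert in distribution_checks(load):
    print("ALERT:", alert)
```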
Pillar 5: Lineage-Aware Alerting
When a data quality issue is detected, lineage answers: where did the bad data come from, and what does it affect downstream? Purview lineage shows the chain: source table → pipeline → lakehouse table → semantic model → Power BI dashboard. When the source table triggers a freshness alert, lineage identifies all downstream consumers, and the alerting system notifies the data engineer (fix the pipeline), the dashboard owner (your dashboard may show stale data), and the business user (the revenue report may not reflect today's transactions). Without lineage-aware alerting, the data engineer fixes the pipeline but the business user never learns their dashboard was affected for 3 hours; they made a decision based on stale data.
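A simplified illustration of the fan-out logic: the lineage graph below is hard-coded, whereas in practice it would be read from Purview, and the owner mapping is an assumption invented for the example:

```python
from collections import deque

# Simplified lineage graph: asset -> downstream consumers.
# In practice this would be pulled from the Purview lineage API rather than hard-coded.
LINEAGE = {
    "source.crm_orders": ["pipeline.load_orders"],
    "pipeline.load_orders": ["lakehouse.orders"],
    "lakehouse.orders": ["semantic_model.sales"],
    "semantic_model.sales": ["dashboard.revenue"],
}

# Illustrative owner mapping used to route notifications.
OWNERS = {
    "pipeline.load_orders": "data-engineering@example.com",
    "dashboard.revenue": "sales-analytics@example.com",
}

def downstream_impact(node: str) -> list[str]:
    """Breadth-first walk of the lineage graph to find every downstream asset."""
    impacted, queue = [], deque(LINEAGE.get(node, []))
    while queue:
        current = queue.popleft()
        if current not in impacted:
            impacted.append(current)
            queue.extend(LINEAGE.get(current, []))
    return impacted

# A freshness alert on the CRM source fans out to every consumer and its owner.
for asset in downstream_impact("source.crm_orders"):
    owner = OWNERS.get(asset, "data-platform-team@example.com")
    print(f"Notify {owner}: upstream freshness issue may affect {asset}")
```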
Tool Comparison
| Tool | Type | Strengths |
|---|---|---|
| Monte Carlo | Commercial SaaS | ML-powered anomaly detection, automated monitoring, incident management |
| Great Expectations | Open-source | Flexible expectation framework, CI/CD integration, free |
| Soda | Open-source + commercial | SQL-based checks, easy to adopt, SodaCL language |
| Purview Data Quality | Azure-native | Integrated with Purview catalog and lineage, no additional tool |
| dbt tests | Built into dbt | Tests defined alongside transformations, part of the pipeline |
Implementation Approach
Week 1-2: Critical Tables
Identify the 10-20 most critical tables (the ones that feed executive dashboards and financial reports). Implement freshness and volume monitoring on these tables. Alert the data engineering team on violations.
Week 3-4: Schema and Distribution
Add schema change detection on all source tables. Add distribution monitoring on critical columns (financial amounts, key identifiers, status fields). Establish baseline distributions from historical data.
Month 2-3: Lineage and Expansion
Connect observability alerts to lineage (Purview) for impact analysis. Expand monitoring to all production tables. Establish SLAs per table with business stakeholders. Build the data quality dashboard that business users can view.
Data Observability ROI
Data observability ROI is measured by incidents prevented, trust maintained, and engineering time saved.

Incidents prevented: without observability, a silent pipeline failure averages 8 hours before detection (discovered when a business user sees a wrong dashboard). With observability, detection takes 5-15 minutes. For 12 silent failures/year: 12 × (8 hours - 0.25 hours) × $200/hour of impact = $18,600/year in faster detection. The bigger value, though, is the 3 incidents that would have caused business harm (a wrong financial report sent to the board, incorrect inventory leading to overselling, stale data causing a wrong pricing decision), each worth $50-500K in prevented business impact.

Trust maintained: when business users discover data issues before the data team does, trust erodes. Each trust-eroding incident takes 3-6 months to recover from, during which business users build shadow analytics in Excel, defeating the purpose of the data platform. Data observability prevents trust erosion by catching issues before business users see them.

Engineering time: without observability, the data engineer investigates a quality issue by querying 10 tables to find where the issue started, checking 5 pipeline logs, and running manual reconciliation, averaging 4 hours per investigation. With observability, the alert includes the affected table, the specific quality dimension that failed, and the lineage showing upstream causes, so investigation time drops to 30 minutes. For 20 investigations/month: 20 × 3.5 hours saved × $100/hour = $7,000/month = $84,000/year.

Total observability ROI: $100-500K/year for a mid-size data platform, against an implementation cost of $50-100K and $20-40K/year in tooling.
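The arithmetic behind those figures is simple to reproduce. The sketch below uses the same illustrative numbers as the paragraphs above, so the inputs are assumptions rather than benchmarks:

```python
# Reproduces the illustrative ROI arithmetic from the figures above.
silent_failures_per_year = 12
hours_to_detect_without = 8
hours_to_detect_with = 0.25
hourly_impact = 200  # $/hour of business impact while data is wrong

faster_detection_value = (
    silent_failures_per_year
    * (hours_to_detect_without - hours_to_detect_with)
    * hourly_impact
)  # 12 * 7.75 * 200 = $18,600/year

investigations_per_month = 20
hours_saved_per_investigation = 3.5  # 4 hours manual -> 30 minutes with lineage context
engineer_rate = 100  # $/hour

engineering_savings = (
    investigations_per_month * hours_saved_per_investigation * engineer_rate * 12
)  # 20 * 3.5 * 100 * 12 = $84,000/year

print(f"Faster detection: ${faster_detection_value:,.0f}/year")
print(f"Engineering time saved: ${engineering_savings:,.0f}/year")
```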
Building a Data Observability Culture
Data observability is a tool, but the culture around it determines whether it's effective:

- Publish quality scores. Make data quality visible to business stakeholders, not hidden in engineering dashboards. When the VP of Sales can see that CRM data quality is 89%, they have the context to prioritize quality improvement.
- Set expectations. Every critical dataset has an SLA: freshness, quality, availability. SLA violations are treated as incidents: investigated, resolved, and prevented from recurring.
- Reward detection. When the observability system catches a silent failure before business impact, celebrate it. "The observability system detected a schema change in the orders source at 6:15 AM, the data engineer fixed it by 6:45 AM, and the 7 AM dashboard refresh was accurate" is a success story, not an incident report.
- Practice continuous improvement. Hold a monthly review of the false positive rate (alerts that weren't real issues: tune the thresholds), the false negative rate (issues that observability missed: add new checks), and the time-to-detection trend (is the team getting faster at catching issues?).
Building a Data Observability Practice
Month 1: Foundation
Deploy observability tooling (Monte Carlo, Great Expectations, or custom). Onboard top 10 critical tables: freshness monitoring, volume anomaly detection, and basic quality checks. Configure alerting to the DataOps team Slack channel.
Month 2-3: Scale
Onboard all Gold and Silver tables. Add schema change detection. Implement lineage-aware alerting (connect observability to Purview lineage). Define and publish data SLAs for Tier 1 datasets. Build the quality scorecard dashboard.
Month 4+: Optimize
Tune alerting thresholds (reduce false positives from initial conservative settings). Add ML-based anomaly detection for volume and quality trends. Integrate observability with incident management (PagerDuty/ServiceNow). Publish SLA compliance reports to business stakeholders monthly.
Data Observability vs Data Quality: The Difference
Data quality focuses on: are the values correct? Data observability focuses on: is the data platform healthy? The distinction matters. Data quality catches invalid email addresses, negative order amounts, and duplicate customer records (issues with the data content). Data observability catches a pipeline that didn't run, a table that hasn't refreshed in 6 hours, a row count that dropped 50% overnight, or a schema that changed unexpectedly (issues with the data infrastructure). Organizations need both: quality checks embedded in pipelines (catching content issues during processing) and observability monitoring across the platform (catching infrastructure issues that quality checks can't detect because the pipeline didn't run at all). The most dangerous failure mode: the pipeline didn't run, no quality check executed, and the dashboard shows yesterday's data without any indication that it's stale. Observability catches this; quality checks don't, because they only run when the pipeline runs.
The Xylity Approach
We implement data observability with the 5-pillar methodology — freshness, volume, schema, distribution, and lineage-aware alerting. Our data engineers and DataOps engineers deploy monitoring that catches silent data failures before they reach business users — because pipeline success doesn't mean data correctness.
Detect Silent Data Failures Before Business Users Do
Freshness, volume, schema, distribution, lineage. Data observability that catches what pipeline monitoring misses.
Start Your Data Observability →