Why Data Quality Needs KPIs
Without KPIs, data quality is: invisible (nobody knows whether quality is good, bad, or improving), unaccountable (nobody is responsible for quality because there's no measurement to be accountable for), and unfundable (the business case for quality improvement requires: "quality is X today, we need to reach Y, it will cost Z" — without X, there's no business case). Data quality KPIs make quality: visible (published on the data catalog alongside the data — consumers see quality scores before using the data), accountable (data stewards have quality targets — measured, reported, and included in performance reviews), and fundable (quality improvement projects have: measurable baseline, defined target, and ROI calculation based on the business impact of poor quality).
6 Quality Dimensions with KPIs
| Dimension | Definition | KPI | Target |
|---|---|---|---|
| Completeness | % of required fields that are populated | Non-null rate for required columns | 99%+ |
| Accuracy | % of values that are correct | Validated against source or business rules | 99%+ |
| Freshness | Time since last update | Minutes/hours since last refresh | Per SLA (1-4 hours) |
| Consistency | Same value across systems | Cross-system reconciliation match rate | 99%+ |
| Uniqueness | No unintended duplicates | Duplicate rate in key columns | Under 0.1% |
| Validity | Values conform to business rules | Business rule pass rate | 99%+ |
Each dimension is measured per table, and per column where applicable. Scores are aggregated into: a table quality score (weighted average across dimensions), a domain quality score (average across tables in a domain), and an enterprise quality score (average across all domains). The enterprise score provides executive visibility; the table score drives operational improvement.
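As a rough illustration, the roll-up is only a few lines of code. A minimal sketch, where the dimension weights, table names, and example scores are assumptions rather than prescribed values:

```python
# Minimal sketch of the score roll-up. Weights and example scores are
# illustrative assumptions, not prescribed values.

DIMENSION_WEIGHTS = {
    "completeness": 0.25, "accuracy": 0.25, "freshness": 0.15,
    "consistency": 0.15, "uniqueness": 0.10, "validity": 0.10,
}

def table_score(dimension_scores: dict) -> float:
    """Weighted average of the six dimension scores for one table."""
    return sum(DIMENSION_WEIGHTS[d] * s for d, s in dimension_scores.items())

def rollup(scores: list) -> float:
    """Simple average, used for both domain and enterprise roll-ups."""
    return sum(scores) / len(scores)

fact_revenue = table_score({
    "completeness": 99.6, "accuracy": 99.2, "freshness": 97.0,
    "consistency": 99.1, "uniqueness": 99.9, "validity": 99.4,
})
finance = rollup([fact_revenue, 98.1])        # domain score across its tables
enterprise = rollup([finance, 96.2, 94.8])    # enterprise score across domains
print(round(fact_revenue, 1), round(finance, 1), round(enterprise, 1))
```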
Defining Data Quality SLAs
SLAs formalize quality commitments between the data team and business consumers. SLA structure: for each critical dataset, define quality dimension → metric → target → measurement frequency → escalation path. Example: "Customer table completeness: email field non-null rate ≥ 99%. Measured hourly. If below 99% for 2 consecutive hours: P2 alert to data steward." SLA tiers: Tier 1 (critical business data): 99%+ quality, 1-hour freshness, P1 escalation on breach. Tier 2 (important operational data): 97%+ quality, 4-hour freshness, P2 escalation. Tier 3 (supplementary data): 95%+ quality, 24-hour freshness, P3 escalation. Tier assignment: based on business impact — data that drives financial reporting, customer-facing applications, or regulatory compliance = Tier 1. Data that supports internal operations = Tier 2. Historical or reference data = Tier 3.
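One way to make SLAs machine-checkable is to store them as configuration next to the pipeline. A minimal sketch, assuming hypothetical dataset names, metric names, and escalation labels:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class QualitySLA:
    dataset: str
    dimension: str
    metric: str
    target_pct: float     # minimum acceptable value, in percent
    frequency: str        # how often the metric is measured
    escalation: str       # priority and recipient on breach

SLAS = [
    QualitySLA("dim_customer", "completeness", "email_non_null_rate",
               99.0, "hourly", "P2 -> data steward"),        # Tier 1 example
    QualitySLA("fact_shipment", "freshness", "refresh_within_sla_rate",
               97.0, "hourly", "P3 -> weekly review"),        # Tier 2 example
]

def check(sla: QualitySLA, measured_pct: float) -> Optional[str]:
    """Return an escalation message if the measurement breaches the SLA.
    (The 'below target for 2 consecutive hours' condition is omitted here.)"""
    if measured_pct < sla.target_pct:
        return (f"{sla.dataset}.{sla.metric} = {measured_pct:.2f}% "
                f"(target {sla.target_pct}%) -> {sla.escalation}")
    return None

print(check(SLAS[0], 98.4))   # breach, so the P2 escalation message is returned
```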
Quality Scorecards: Design and Implementation
The quality scorecard is a Power BI dashboard that shows: enterprise quality score (single number: weighted average across all domains — the data equivalent of a credit score. Displayed on the executive dashboard alongside business KPIs), domain scores (quality score per domain: Finance 98.5%, Sales 96.2%, Operations 94.8% — immediately identifies which domains need attention), table detail (drill-down to individual tables: completeness, accuracy, freshness, consistency scores. Trend over time: improving or degrading?), SLA compliance (% of SLAs met this period. Breached SLAs listed with: duration, impact, and resolution status), and quality incidents (data quality issues logged, categorized, and tracked — similar to application incident management). The scorecard is reviewed: weekly by the data team (operational improvement), monthly by data stewards (governance accountability), and quarterly by executives (strategic investment decisions).
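The scorecard's figures can be derived from a nightly quality metrics table. A minimal pandas sketch, with assumed table and column names:

```python
import pandas as pd

# Assumed structure of the nightly quality metrics table feeding the scorecard.
metrics = pd.DataFrame({
    "domain":  ["Finance", "Finance", "Sales", "Operations"],
    "table":   ["fact_revenue", "fact_gl", "dim_customer", "fact_shipment"],
    "score":   [98.9, 98.1, 96.2, 94.8],     # table-level quality scores
    "sla_met": [True, True, False, True],    # SLA outcome for the period
})

domain_scores = metrics.groupby("domain")["score"].mean().round(1)
enterprise_score = round(metrics["score"].mean(), 1)
sla_compliance = 100 * metrics["sla_met"].mean()

print(domain_scores.to_dict())                  # per-domain scores for drill-down
print(enterprise_score, f"{sla_compliance:.0f}% of SLAs met")
```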
Quality Governance: Ownership and Accountability
Data steward responsibilities: Each critical dataset has a named business data steward who: defines quality rules (what constitutes "correct" for each field), sets quality targets (the SLA for their domain), reviews quality scorecard weekly, investigates and resolves quality incidents, and approves changes to data definitions. Data engineer responsibilities: Implement quality checks in pipelines, build and maintain the quality monitoring framework, remediate technical quality issues (pipeline bugs, transformation errors), and report quality metrics to stewards. Escalation path: Quality below SLA → data engineer investigates (is it a pipeline issue?). If yes → engineer fixes. If no (source data issue) → escalate to data steward → steward coordinates with source system owner.
Automating Quality Measurement
Quality measurement automation: pipeline-embedded checks (Great Expectations or dbt tests run after every pipeline load — checks embedded in the pipeline, not separate from it. Failed checks: block the pipeline from writing to Gold tables, preserving the last-known-good state for business users), scheduled quality jobs (thorough quality scoring runs nightly — computing all dimensions across all tables. Results stored in a quality metrics table, consumed by the scorecard dashboard), anomaly detection (ML-based anomaly detection on quality metrics — detecting gradual degradation that individual threshold checks miss. "Completeness has dropped from 99.5% to 98.8% over 3 weeks" — the rate of change matters, not just the absolute value), and quality alerting (integrated with: Teams/Slack (real-time alerts), PagerDuty (P1 escalation for critical quality breaches), and email (weekly quality summary to stewards)). Automation investment: 2-4 weeks for initial framework setup. Ongoing: 2-4 hours/week for: new table onboarding, threshold tuning, and false positive investigation.
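The anomaly-detection idea can be illustrated with a simple trend check: fit a line to the recent history of a metric and alert when the slope falls too fast, even if every individual value still clears its threshold. The window size and slope cut-off below are assumptions to tune, and a production setup might use a proper ML-based detector instead of this linear fit:

```python
import numpy as np

def degradation_alert(daily_scores: list, window: int = 21,
                      max_daily_drop: float = -0.02) -> bool:
    """Flag a metric whose fitted trend over the window falls too quickly."""
    recent = np.asarray(daily_scores[-window:], dtype=float)
    if len(recent) < window:
        return False                  # not enough history to judge a trend
    slope = np.polyfit(np.arange(window), recent, deg=1)[0]
    return slope < max_daily_drop

# Completeness sliding from 99.5% to 98.8% over three weeks: every daily value
# still clears a 98% threshold, but the trend check raises the alert.
history = np.linspace(99.5, 98.8, 21).tolist()
print(degradation_alert(history))     # True
```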
Continuous Quality Improvement
Quality improvement is iterative: identify (the scorecard shows: Finance domain quality dropped from 98.5% to 96.2%. Drill-down: the revenue table's accuracy score dropped because a new product category isn't mapping correctly), root cause (the mapping table doesn't include the new product category added last month — new products default to "Unknown" instead of their actual category), fix (update the mapping table. Add a quality check: "assert no products with category 'Unknown' that are more than 7 days old" — preventing the same issue from recurring), verify (quality score recovers to 98.5%+. The new check catches future mapping gaps within 24 hours instead of weeks), and prevent (establish a process: when new product categories are created in the source system, the data governance team is notified to update downstream mappings within 48 hours). This cycle runs continuously — each issue fixed makes the quality framework stronger, and the quality score trends upward over time.
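The preventive check from the fix step might look like the following sketch, assuming a product dimension with `category`, `created_at`, and `product_id` columns:

```python
import pandas as pd

def unknown_category_check(products: pd.DataFrame, max_age_days: int = 7) -> None:
    """Fail the pipeline if any product stays in the fallback category too long."""
    cutoff = pd.Timestamp.now() - pd.Timedelta(days=max_age_days)
    stale = products[(products["category"] == "Unknown")
                     & (products["created_at"] < cutoff)]
    assert stale.empty, (
        f"{len(stale)} products unmapped for over {max_age_days} days: "
        f"{stale['product_id'].head(10).tolist()}"
    )
```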
Data Quality Automation: Shifting Quality Left
Quality checks at the end of the pipeline are too late — bad data has already been processed and potentially served to users. Shifting quality left means: quality at ingestion (validate data as it arrives from the source — reject or quarantine records that fail basic checks: required fields present, data types valid, referential integrity maintained. Bad data never enters the lakehouse), quality at transformation (validate after each transformation step — the join didn't produce duplicates, the aggregation sums match detail, the filter didn't drop valid records. Each transformation is a potential error injection point), quality at serving (validate before data is exposed to consumers — the Power BI semantic model's measures produce expected results, the API response matches the schema, the ML feature values are within expected ranges). Implementation: Great Expectations expectations embedded in the data pipeline at each stage. Soda checks as pipeline gates. dbt tests in every model. The goal: every piece of data passes quality checks before it moves to the next layer — bronze → silver → gold with quality gates at each transition.
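A framework-agnostic sketch of such a promotion gate, standing in for the Great Expectations / Soda / dbt checks named above; the check functions, column names, and output path are assumptions:

```python
from typing import Callable, List
import pandas as pd

Check = Callable[[pd.DataFrame], bool]

def required_fields_present(df: pd.DataFrame) -> bool:
    return bool(df[["customer_id", "email"]].notna().all().all())

def no_duplicate_keys(df: pd.DataFrame) -> bool:
    return not df["customer_id"].duplicated().any()

def promote(df: pd.DataFrame, checks: List[Check], target_path: str) -> None:
    """Write to the next layer only if every quality check passes."""
    failures = [c.__name__ for c in checks if not c(df)]
    if failures:
        # Block the write so the last-known-good data stays in place downstream.
        raise ValueError(f"Promotion to {target_path} blocked by: {failures}")
    df.to_parquet(target_path)

# promote(silver_customers, [required_fields_present, no_duplicate_keys],
#         "/lake/gold/customers.parquet")
```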
Data Quality for AI/ML: The Often-Overlooked Foundation
"Garbage in, garbage out" applies more to ML models than to any other use case. ML-specific data quality requirements: label quality (for supervised learning: are the training labels accurate? A 5% label error rate in training data effectively caps model accuracy at roughly 95% — no algorithm can overcome incorrect training labels), feature completeness (features with high NULL rates produce: unreliable predictions for records with missing features, or biased predictions if NULL correlates with the target variable), temporal consistency (training data must use features as they existed at prediction time — using today's feature values to predict yesterday's outcomes creates data leakage), distribution stability (if the training data distribution doesn't match the production data distribution: model accuracy degrades. Quality monitoring must detect distribution drift in both features and predictions), and bias detection (protected attributes: gender, race, age — quality checks verify that model performance is equitable across protected groups). Data quality for ML isn't just about clean data — it's about data that's correct, complete, temporally consistent, and fair.
Data Quality ROI: Quantifying the Cost of Poor Quality
Poor data quality costs: operational waste (employees manually verifying and correcting data — 15-25% of knowledge worker time spent on data-related rework), bad decisions (the demand forecast based on dirty data over-orders $500K of inventory that doesn't sell — the data quality issue becomes a business loss), customer impact (incorrect customer addresses → failed deliveries → customer complaints → churn. Each failed delivery: $15-50 in direct cost + customer satisfaction damage), compliance risk (inaccurate financial reporting → audit findings → regulatory penalties. SOX material weakness from data quality issues → stock price impact), and analytics failure (ML models trained on dirty data produce unreliable predictions — the churn model predicts 30% churn because duplicate customers inflate the calculation, not because customers are actually leaving). Quantification: identify the top 3 data quality issues by business impact. Calculate: cost per occurrence × frequency × duration. Typical finding: the top 3 issues cost $500K-2M annually. The quality improvement investment: $100-300K. ROI: 3-10x in year one.
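A worked example of the quantification formula, using the failed-delivery cost range above; the delivery volume is an assumed figure:

```python
# Worked example: cost per occurrence x frequency x duration.
cost_per_failed_delivery = 35        # USD, midpoint of the $15-50 range above
failed_deliveries_per_month = 800    # assumed frequency caused by bad addresses
months_unresolved = 12               # duration the issue persists
annual_cost = cost_per_failed_delivery * failed_deliveries_per_month * months_unresolved
print(f"${annual_cost:,}")           # $336,000 per year from one quality issue
```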
Quality Metrics by Data Domain
| Domain | Critical Metrics | Quality Target |
|---|---|---|
| Customer | Duplicate rate, email validity, address completeness | Duplicates under 1%, email valid 98%+ |
| Financial | Balance accuracy, GL completeness, period integrity | 100% balance reconciliation, 99.9%+ completeness |
| Product | Description completeness, price accuracy, category validity | 99%+ completeness, price matches source |
| Order | Order-line integrity, amount positivity, status validity | 100% referential integrity, 99.9%+ valid amounts |
| Employee | Active status accuracy, compensation completeness, manager hierarchy | 100% active status correct, 99%+ compensation |
Each domain has different quality priorities: financial data requires 100% accuracy (a $0.01 error is a defect). Customer data requires deduplication focus (duplicates cause: double-counting in analytics, multiple communications to the same person, and inconsistent customer experience). Product data requires completeness (missing descriptions and images reduce conversion rates). The quality program prioritizes by domain based on business impact — not by treating all data equally.
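As an example of a domain-specific metric, the customer duplicate rate from the table above can be approximated by counting repeated normalized emails; real matching typically also uses fuzzy name and address comparison, which this sketch omits:

```python
import pandas as pd

def duplicate_rate(customers: pd.DataFrame) -> float:
    """Percentage of customer rows whose normalized email already appeared."""
    normalized = customers["email"].str.strip().str.lower()
    return 100 * normalized.duplicated(keep="first").mean()

customers = pd.DataFrame({"email": ["a@x.com", "A@x.com ", "b@x.com", "c@x.com"]})
print(f"{duplicate_rate(customers):.1f}%")   # 25.0%, far above the under-1% target
```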
The Xylity Approach
We implement data quality management with the 6-dimension KPI framework — completeness, accuracy, freshness, consistency, uniqueness, and validity — measured automatically, published on scorecards, and governed by SLAs. Our data engineers and DataOps engineers build quality measurement that transforms "the data quality is fine" into "the data quality is 98.2% and improving."
Go Deeper
Continue building your understanding with these related resources from our consulting practice.
Data Quality You Can Measure, Not Just Hope For
6-dimension KPIs, SLAs, automated scorecards. Data quality framework that transforms opinions into metrics.
Start Your Data Quality Program →