The Master Data Crisis: 3 Versions of Every Customer

Marketing reports 2.3 million customers. CRM shows 1.8 million. Billing has 1.4 million active accounts. Finance reports 950,000 "revenue-generating customers." Nobody is wrong — they count different records from different systems with different definitions. Marketing includes prospects. CRM has duplicates (same customer entered 3 times by different reps). Billing counts accounts, not customers (one customer may have 3). Finance counts only customers with revenue in the trailing 12 months.

The consequence: customer lifetime value is wrong (calculated on duplicates), marketing wastes spend (3 copies of every campaign), support can't see the complete picture (partial data per system), and AI models trained on duplicated data produce inflated predictions. Every downstream process inherits the chaos — and it compounds as systems multiply.

Master data chaos isn't a data problem — it's a business problem that produces wrong counts, duplicated spend, incomplete context, and unreliable AI. MDM doesn't just clean data — it creates a single version of truth every system references. — Xylity Data Engineering Practice

What MDM Solves

| Problem | Without MDM | With MDM |
| --- | --- | --- |
| Duplicate records | Same customer entered 3x across systems | One golden record linked to all system records |
| Inconsistent definitions | "Customer" means different things per system | One definition, one count, one authority |
| Incomplete view | Each system has partial info | Golden record aggregates all attributes from all sources |
| Downstream errors | Duplicates inflate analytics, marketing, AI | Clean master data feeds all consumers |
| Regulatory risk | Can't identify all records for GDPR DSR | Master ID links all records — complete response |

Implementation Styles: Registry, Consolidation, Coexistence

| Style | How It Works | Sources Changed? | Best For | Complexity |
| --- | --- | --- | --- | --- |
| Registry | Cross-references maintained; data stays in sources | No | Initial MDM, analytics, GDPR | Low |
| Consolidation | Golden record consolidated from sources | No | Analytics, BI, warehouse, AI training | Medium |
| Coexistence | Golden record synchronized back to sources | Yes | Operational MDM — all systems see same data | High |

Start with Registry or Consolidation. Both provide the golden record for analytics and compliance without modifying sources — lower risk, lower resistance. Once the golden record is proven (stewards validate, consumers adopt), consider Coexistence to sync clean data back. Jumping straight to Coexistence requires source changes, real-time sync, and conflict resolution — too complex for a first initiative.

Entity Resolution: Matching, Merging, Surviving

1. Matching (Is It the Same Entity?)

Deterministic matching: Exact match on a unique identifier — same email, tax ID, or phone. High precision but low recall: it misses near-variants of the same identifier (e.g., jane.doe@example.com vs jdoe@example.com, or a phone stored with and without formatting).
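A minimal sketch of deterministic matching, assuming Python and illustrative record shapes: normalize the identifier, then group on exact equality.

```python
from collections import defaultdict

def normalize_email(email: str) -> str:
    """Lowercase and strip whitespace: a common first normalization step."""
    return email.strip().lower()

def normalize_phone(phone: str) -> str:
    """Keep digits only, dropping formatting like (555) 123-4567."""
    return "".join(ch for ch in phone if ch.isdigit())

def deterministic_match(records: list) -> dict:
    """Group records that share an exact normalized email; return
    only the groups with more than one record (the duplicates)."""
    groups = defaultdict(list)
    for rec in records:
        groups[normalize_email(rec["email"])].append(rec)
    return {k: v for k, v in groups.items() if len(v) > 1}

records = [
    {"id": "CRM-1", "email": "Jane.Doe@Acme.com",  "phone": "(555) 123-4567"},
    {"id": "ERP-7", "email": "jane.doe@acme.com ", "phone": "555-123-4567"},
    {"id": "SUP-3", "email": "j.doe@acme.com",     "phone": "555-999-0000"},
]
dupes = deterministic_match(records)
# CRM-1 and ERP-7 collapse to one group; SUP-3 is a near-variant and is missed
```

Note the low-recall behavior the text describes: "j.doe@acme.com" is plausibly the same person, but exact matching cannot see it.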

Probabilistic matching: Fuzzy comparison across multiple attributes — name similarity (Levenshtein, Jaro-Winkler), address normalization, phone normalization. Each attribute contributes probability. Composite score determines: high confidence (auto-merge), low confidence (different entities), or uncertain (human review queue). Catches "ACME Corp" = "Acme Corporation" at the same address.

ML-powered matching: Train a model on verified match/non-match pairs. The model learns which attribute combinations predict matches in YOUR data. Outperforms rule-based by 10-15% on recall — more duplicates found, fewer false positives. Requires 5,000-10,000 labeled pairs for training.
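A deliberately tiny illustration of the idea, assuming hand-built pair features and a from-scratch logistic regression; real implementations train gradient-boosted or similar models on thousands of steward-labeled pairs.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def train(features, labels, epochs=200, lr=0.5):
    """Per-sample gradient descent on logistic loss. Each feature vector
    describes a candidate PAIR of records, not a single record."""
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

# Illustrative pair features: [name_similarity, address_similarity, same_phone]
X = [[0.95, 0.90, 1.0], [0.90, 0.20, 1.0], [0.30, 0.10, 0.0], [0.20, 0.90, 0.0]]
y = [1, 1, 0, 0]  # steward-verified match / non-match labels
w, b = train(X, y)

# Score a new candidate pair that resembles the verified matches
prob = sigmoid(sum(wi * xi for wi, xi in zip(w, [0.92, 0.85, 1.0])) + b)
```

The point of the ML approach is that the model learns which feature combinations matter in your data, instead of a human hand-tuning the weights.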

2. Merging (Combine Into One Record)

When records match, conflicts must be resolved: "ACME Corp" vs "Acme Corporation" — which survives? The address from CRM (updated 2 months ago) vs the address from ERP (updated yesterday) — which wins? Survivorship rules define the winner per attribute.

3. Survivorship (Which Values Win)

| Attribute | Survivorship Rule | Rationale |
| --- | --- | --- |
| Legal name | ERP (most authoritative) | Matches contracts and invoices |
| Email | Most recently updated | Email changes frequently; latest is most current |
| Phone | CRM (validated by sales) | Sales reps verify during interactions |
| Address | Most recent + USPS validation | Validate against postal database after selection |
| Industry | Enrichment service (D&B, ZoomInfo) | Third-party data more current than manual entry |

The Golden Record: Design and Governance

Master ID: Each golden record gets a unique synthetic ID. Customer #12345 (CRM), Account ACME-001 (ERP), Contact 78901 (support) all link to Master ID XYL-C-00001. This key connects all views — enabling customer 360, complete GDPR responses, and deduplicated analytics.
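A minimal sketch of the cross-reference idea; the registry class, its API, and the synthetic ID format are illustrative assumptions, not a specific product's interface.

```python
class MasterIdRegistry:
    """Maps (system, local_id) pairs to a synthetic master ID."""

    def __init__(self):
        self._xref = {}   # (system, local_id) -> master_id
        self._next = 1

    def register(self, system, local_id, master_id=None):
        """Mint a new master ID, or link a source record to an existing one."""
        if master_id is None:
            master_id = f"XYL-C-{self._next:05d}"
            self._next += 1
        self._xref[(system, local_id)] = master_id
        return master_id

    def lookup(self, system, local_id):
        return self._xref.get((system, local_id))

    def all_records_for(self, master_id):
        """Every source record behind one master ID, e.g. for a GDPR DSR."""
        return [key for key, mid in self._xref.items() if mid == master_id]

reg = MasterIdRegistry()
mid = reg.register("CRM", "12345")        # new golden record
reg.register("ERP", "ACME-001", mid)      # link to the same master ID
reg.register("SUPPORT", "78901", mid)
# reg.all_records_for(mid) now returns all three source records
```

The cross-reference table is the core asset of the Registry style: even before any merging, it answers "which records, in which systems, are this customer?"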

Attribute governance: Each golden record attribute has: authoritative source, survivorship rule, quality threshold, and update frequency. Governance ensures the golden record is consistently correct — not random selection from conflicting sources.

Stewardship workflow: Records that can't be auto-resolved (uncertain matches, conflicting values) route to stewards. They see: candidate pair, match probability, conflicting values, and recommended resolution. Decisions feed back into the matching algorithm, improving auto-resolution over time.

MDM by Domain: Customer, Product, Vendor, Location

Customer MDM (highest priority): Resolve duplicates across CRM, ERP, billing, support, marketing. Creates: deduplicated marketing, complete support context, accurate analytics, and GDPR compliance.

Product MDM: Unify across PLM, ERP, e-commerce, marketing. Ensures: consistent descriptions across channels, accurate dimensions for shipping, correct pricing across sales channels. Critical for retail (10,000+ SKUs), manufacturing (complex BOMs), and healthcare (device tracking).

Vendor MDM: Deduplicate across procurement, accounts payable, compliance. Enables: consolidated spend analytics, consistent compliance screening, contract terms tracking. Prevents: paying the same vendor through 3 different records with 3 different terms.

Location MDM: Standardize and validate addresses across all systems. Address validation (USPS, Google Maps) corrects entry errors. Geocoding enables spatial analytics. Critical for: logistics (delivery routing), retail (store network), insurance (geographic risk), compliance (data residency).

Technology: Informatica, Azure, and Custom

| Platform | Best For | Matching | Cost |
| --- | --- | --- | --- |
| Informatica MDM | Enterprise-scale, complex, multi-domain | Advanced probabilistic + ML | $200K-500K/year |
| Azure + Purview + Custom | Azure-native, Fabric-integrated | Custom in Spark/Python | $30K-100K/year |
| Reltio | Cloud-native, real-time, API-first | ML-powered | $100K-300K/year |

For Azure-native: Build on Fabric — Spark notebooks for entity resolution, Purview for catalog and governance, lakehouse for golden record storage. Avoids platform licensing ($200-500K/year) and integrates natively with the analytical platform.

MDM Implementation Roadmap

Month 1: Assessment and Design

Identify primary domain (usually Customer). Inventory all source systems. Profile each for: record count, key attributes, quality scores, duplication rates. Design golden record schema, survivorship rules, matching strategy.

Month 2-3: Build and Match

Extract from top 3-5 sources. Run entity resolution: deterministic on email/phone, probabilistic on name+address, ML for complex cases. Steward review of uncertain matches. Create initial golden record. Validate count and attribute correctness.

Month 4-5: Consume and Validate

Feed golden record into: data warehouse (deduplicated dimension), Power BI (accurate counts), ML models (clean training data). Validate with stakeholders. Iterate matching rules based on false positive/negative feedback.

Month 6: Operationalize

Automate: new records matched on ingestion. Establish stewardship cadence. Deploy quality monitoring. Plan expansion to additional domains in Year 2.

MDM and Data Mesh: Centralized Masters in a Decentralized World

Data mesh advocates decentralized data ownership — each domain owns its data products. MDM seems to contradict this: a centralized golden record that overrides domain-specific records. The reconciliation: MDM operates at the entity level (Customer, Product, Vendor) while data mesh operates at the domain level (Sales, Marketing, Operations). Domain teams own their domain-specific data products. The MDM function provides the shared entity reference that domains consume — the "Customer" golden record is a cross-domain data product that every domain references for entity identification, deduplication, and attribute resolution.

In data mesh terms, MDM is a platform capability (like infrastructure) that domain teams consume — not a centralized control that overrides domain ownership. The domain still owns "Sales Customer Data" — but it references the MDM golden record for customer identity, preventing the duplication that makes cross-domain analytics impossible.

Data Quality and MDM: The Circular Dependency

MDM depends on data quality — entity resolution accuracy depends on the quality of input data (clean names and addresses match more accurately than dirty ones). Simultaneously, data quality depends on MDM — deduplication is a quality dimension that MDM addresses. The practical resolution: run quality profiling and basic cleansing before entity resolution (clean the inputs), then run MDM (resolve entities), then measure quality on the golden record (validate the output). The quality-MDM-quality loop operates continuously: quality improvements → better matching → cleaner golden records → better downstream quality. Each iteration improves both quality and master data accuracy.

Measuring MDM Success: 5 KPIs

1. Duplicate rate reduction: Pre-MDM duplicate rate vs. post-MDM (target: 95%+ reduction).
2. Match accuracy: Precision (false positive rate below 2%) and recall (false negative rate below 5%) on sampled validation.
3. Golden record coverage: Percentage of source records linked to a golden record (target: 98%+).
4. Steward queue efficiency: Average resolution time for uncertain matches (target: under 48 hours).
5. Consumer adoption: Percentage of downstream systems consuming the golden record instead of source-direct data (target: 80%+ within 12 months).

Track these monthly — they demonstrate MDM value to leadership and identify areas needing improvement.
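A rough sketch of how match accuracy might be computed from a steward-validated sample; the sample data below is illustrative.

```python
def match_quality(validated_pairs):
    """Precision and recall from a validated sample of match decisions.
    Each item is (predicted_match: bool, actual_match: bool)."""
    tp = sum(1 for pred, actual in validated_pairs if pred and actual)
    fp = sum(1 for pred, actual in validated_pairs if pred and not actual)
    fn = sum(1 for pred, actual in validated_pairs if not pred and actual)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Illustrative sample: 95 true matches found, 2 false merges, 5 missed matches
sample = [(True, True)] * 95 + [(True, False)] * 2 + [(False, True)] * 5
precision, recall = match_quality(sample)
# precision ~ 0.979 (about a 2% false positive rate), recall = 0.95
```

Measuring on a sampled, steward-validated set matters: neither metric can be read off the automated pipeline alone, because the pipeline does not know which of its decisions were wrong.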

Real-Time vs Batch MDM: Choosing the Processing Model

Batch MDM (most common start): Entity resolution runs nightly or weekly on accumulated records. New records are matched and merged in the next batch run. Latency: hours to days. Best for: analytical MDM (golden record feeds the warehouse), organizations starting their MDM journey (simpler operations), and domains with low record creation velocity (vendors, locations).

Real-time MDM (operational use cases): Entity resolution runs as records are created or updated — matching occurs within seconds. Required for: operational MDM where the golden record drives business processes (e.g., CRM deduplication at data entry), compliance scenarios requiring immediate entity identification (sanctions screening, fraud detection), and high-velocity domains (customer creation at 1,000+ records/day). Real-time MDM requires: streaming infrastructure (Event Hubs, Kafka), pre-computed match indexes (blocking keys for fast candidate generation), and low-latency matching algorithms (deterministic matching for real-time, with probabilistic matching as a follow-up batch process).

Most enterprises start with batch MDM and add real-time for specific high-priority use cases after the batch process proves the matching accuracy.
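A sketch of blocking-key candidate generation, the kind of pre-computed match index real-time matching relies on; the key rules (name prefix, ZIP code) are illustrative.

```python
from collections import defaultdict

def blocking_keys(record):
    """Cheap keys that group likely candidates, so matching compares a
    handful of records instead of scanning the full dataset."""
    keys = {"name3:" + record["name"].lower()[:3]}
    if record.get("zip"):
        keys.add("zip:" + record["zip"])
    return keys

def build_index(records):
    """Pre-compute the blocking index over existing golden records."""
    index = defaultdict(list)
    for rec in records:
        for key in blocking_keys(rec):
            index[key].append(rec["id"])
    return index

def candidates(index, record):
    """Union of records sharing any blocking key with the incoming record."""
    found = set()
    for key in blocking_keys(record):
        found.update(index.get(key, []))
    return found

index = build_index([
    {"id": "A", "name": "ACME Corp",        "zip": "98101"},
    {"id": "B", "name": "Acme Corporation", "zip": "98101"},
    {"id": "C", "name": "Zenith Ltd",       "zip": "10001"},
])
new = {"id": "D", "name": "Acme Inc", "zip": "98101"}
# candidates(index, new) -> {"A", "B"}: only these need full matching
```

Only the shortlisted candidates then go through the expensive probabilistic or ML scoring, which is what makes sub-second matching feasible.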

The Xylity Approach

We implement MDM with the golden record architecture — entity resolution, golden record in Fabric, stewardship workflows, and analytical integration. Our data engineers and data architects build MDM alongside your team — starting with Customer, proving value through deduplicated analytics, and expanding to additional domains.


One Customer, One Record, One Truth

Entity resolution, golden records, survivorship rules. MDM that creates the single version of truth every system references.

Start Your MDM Implementation →