In This Article
- The Cost of Fragmented Data: Why Integration Is the First Priority
- Five Integration Architecture Patterns
- Pattern 1: Batch ETL/ELT — The Foundation
- Pattern 2: API-Led Integration
- Pattern 3: Event-Driven Architecture
- Pattern 4: Change Data Capture (CDC)
- Pattern 5: Hybrid — Combining Patterns for Enterprise Reality
- Platform Selection: Azure Data Factory, Fabric, MuleSoft, or Informatica
- Pipeline Design Principles: Idempotency, Observability, Recovery
- Integration Roadmap: Decision-Driven, Not Source-Driven
- Go Deeper
The Cost of Fragmented Data: Why Integration Is the First Priority
A retail enterprise operates 14 source systems: ERP (SAP), CRM (Salesforce), e-commerce platform (Shopify Plus), POS system (Square), marketing automation (HubSpot), inventory management (NetSuite), warehouse management (Manhattan Associates), loyalty program (custom), HR system (Workday), finance (Oracle EPM), web analytics (GA4), support (Zendesk), supply chain (Blue Yonder), and a product information management system (Salsify). Each system has its own definition of "customer," "product," and "order." The marketing team's customer count is 2.3 million. The CRM shows 1.8 million. The loyalty program has 950,000. The finance team reports 1.4 million "active customers." Nobody is wrong — they're each counting different things from different systems with different definitions.
This is the fragmented data estate. Every analytical question that crosses system boundaries — "what's the lifetime value of customers acquired through digital channels?" — requires manual data reconciliation before analysis can begin. Analysts spend 60-70% of their time finding, joining, and reconciling data. The remaining 30-40% is the actual analysis. The data analytics investment can't deliver value because the integration layer that feeds it doesn't exist.
Data integration solves this by creating a unified data layer where "customer" means one thing, "order" has one definition, and analytical questions that cross system boundaries can be answered without 3 weeks of data reconciliation. The architecture patterns described here determine how data flows from 14 source systems into that unified layer — reliably, efficiently, and at the freshness each use case requires.
Five Integration Architecture Patterns
Enterprise data integration isn't one pattern — it's a portfolio of patterns, each suited to different data velocity, volume, and use case requirements. Most enterprises need 3-4 patterns operating simultaneously.
| Pattern | Data Velocity | Best For | Latency | Complexity |
|---|---|---|---|---|
| Batch ETL/ELT | Scheduled (hourly/daily) | Warehousing, reporting, historical analytics | Hours | Low-Medium |
| API-Led | On-demand (request/response) | Application integration, microservices, portals | Milliseconds | Medium |
| Event-Driven | Continuous (as events occur) | Real-time analytics, notifications, workflow triggers | Seconds | Medium-High |
| CDC | Near-real-time (change propagation) | Database replication, data lake ingestion, cache sync | Seconds-Minutes | Medium |
| Hybrid | Mixed | Enterprise reality — different patterns for different use cases | Varies | High |
Pattern 1: Batch ETL/ELT — The Foundation
Batch integration — extracting data from source systems on a schedule, transforming it, and loading it into the analytical platform — is the foundation that 80% of enterprise analytics runs on. It's the right pattern when the analytical use case tolerates latency (daily financial reporting, weekly business reviews, monthly trend analysis) and data volumes are large enough to benefit from bulk processing.
ETL vs. ELT: The Architecture Decision
ETL (Extract-Transform-Load) transforms data before loading into the target. The transformation engine (Informatica, SSIS, Talend) processes data in a staging area, applies business rules, and loads clean data into the warehouse. Best when: the target warehouse has limited compute, transformations are complex, and data governance requires quality gates before data reaches the warehouse.
ELT (Extract-Load-Transform) loads raw data first, then transforms using the target platform's compute. Microsoft Fabric, Snowflake, and Databricks make ELT the default for cloud analytics — the platform's distributed compute handles transformations faster and cheaper than a separate ETL engine. Best when: the target platform has elastic compute, you want raw data preserved for flexibility, and transformations benefit from SQL-based processing.
The industry has largely shifted to ELT for new implementations. ETL remains relevant for legacy warehouse environments and scenarios where data must be transformed before it touches the analytical platform (regulatory requirements, PII masking before loading).
If your data platform is cloud-native (Fabric, Snowflake, Databricks, BigQuery), default to ELT. Load raw data into the landing zone, transform in the platform using SQL or Spark. If your warehouse is on-premises or has limited compute, ETL with a dedicated engine offloads transformation work. The decision is platform-driven, not philosophical.
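As a rough sketch of the ELT flow on a Spark-based platform (Fabric, Databricks, or similar), the example below lands a raw extract untouched and then applies transformations with the platform's own compute. The file path, table names, and columns are hypothetical.

```python
# ELT sketch: load the raw extract as-is, then transform with the platform's compute.
# Path, table names, and columns are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("orders_elt").getOrCreate()

# Extract + Load: land raw data untouched so it can always be re-processed.
raw_orders = spark.read.json("Files/landing/ecommerce/orders/2024-06-01/")
raw_orders.write.mode("overwrite").saveAsTable("landing_orders")

# Transform: apply business rules inside the platform using Spark SQL functions.
curated = (
    spark.table("landing_orders")
    .withColumn("order_date", F.to_date("created_at"))
    .withColumn("order_total", F.col("total_price").cast("decimal(18,2)"))
    .filter(F.col("cancelled_at").isNull())       # drop cancelled orders
    .dropDuplicates(["order_id"])                 # safe to re-run
)
curated.write.mode("overwrite").saveAsTable("curated_orders")
```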
Pattern 2: API-Led Integration
API-led integration connects systems through well-defined service interfaces — request/response communication where one system calls another system's API to read or write data. This is the pattern for operational integration: the e-commerce platform checks inventory in real time, the CRM retrieves order history from the ERP, the customer portal shows support tickets from Zendesk.
Three-Layer API Architecture
System APIs expose source system data through standardized interfaces. Each source system gets a thin API layer that abstracts the underlying database or proprietary interface. The CRM system API returns customer data in a standard format regardless of whether the CRM is Salesforce, Dynamics, or HubSpot.
Process APIs orchestrate business logic across multiple system APIs. The "customer 360" process API calls the CRM system API for demographics, the ERP system API for transactions, and the support system API for tickets — composing a unified customer view that no single source system contains.
Experience APIs serve specific front-end needs — mobile app, web portal, partner portal. Each experience API consumes process APIs and shapes the response for its specific consumer. The mobile app gets a lightweight JSON payload. The internal dashboard gets a richer dataset. Same underlying data, different presentations.
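A minimal sketch of the middle layer, assuming three hypothetical system API endpoints and response shapes; a real process API would add authentication, error handling, and caching.

```python
# Process API sketch: compose three system APIs into a unified "customer 360" view.
# Endpoint URLs and response fields are hypothetical.
import requests

SYSTEM_APIS = {
    "crm":     "https://integration.example.internal/sys/crm/customers",
    "erp":     "https://integration.example.internal/sys/erp/transactions",
    "support": "https://integration.example.internal/sys/support/tickets",
}

def get_customer_360(customer_id: str) -> dict:
    """Orchestrate system APIs; an experience API would reshape this per consumer."""
    profile = requests.get(f"{SYSTEM_APIS['crm']}/{customer_id}", timeout=5).json()
    orders  = requests.get(SYSTEM_APIS["erp"], params={"customerId": customer_id}, timeout=5).json()
    tickets = requests.get(SYSTEM_APIS["support"], params={"customerId": customer_id}, timeout=5).json()

    return {
        "customer_id": customer_id,
        "profile": profile,
        "order_count": len(orders),
        "open_tickets": [t for t in tickets if t.get("status") == "open"],
    }
```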
MuleSoft, Azure API Management, AWS API Gateway, and Kong are the primary platforms. For enterprises already on the Microsoft stack, Azure API Management integrates natively with Azure services and enterprise integration patterns.
Pattern 3: Event-Driven Architecture
Event-driven integration communicates through events — "order placed," "customer updated," "inventory below threshold" — published to a message broker and consumed by any system that needs to react. Unlike API integration (point-to-point, synchronous), event-driven integration is publish-subscribe (one-to-many, asynchronous). A single "order placed" event can trigger inventory update, shipping notification, loyalty point accrual, and real-time analytics — without the order system knowing or caring about any of these downstream consumers.
Event Streaming Platforms
Apache Kafka is the industry standard for high-throughput event streaming. Billions of events per day, durable storage, exactly-once semantics, and a rich ecosystem of connectors. The trade-off: operational complexity. Kafka clusters require sizing, partition management, and monitoring.
Azure Event Hubs is the managed alternative for Azure-native organizations. Compatible with the Kafka protocol (existing Kafka producers/consumers work without code changes), fully managed (no cluster operations), and integrated with Azure Stream Analytics and Fabric Real-Time Intelligence for stream processing.
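A minimal publish-side sketch, assuming a Kafka-compatible broker; the broker address, topic, and event fields are illustrative, and an Event Hubs Kafka endpoint additionally requires SASL connection settings not shown here.

```python
# Publishing an "order placed" business event to a Kafka-compatible broker.
# Broker address, topic name, and event shape are illustrative only.
import json
from kafka import KafkaProducer   # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="broker.example.internal:9092",
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {
    "event_type": "order.placed",
    "order_id": "SO-1042",
    "customer_id": "C-9871",
    "total": 129.90,
    "occurred_at": "2024-06-01T14:32:05Z",
}

# Keying by order_id keeps all events for the same order on the same partition.
producer.send("orders", key=event["order_id"], value=event)
producer.flush()
```

The producer neither knows nor cares which consumers react to the event; inventory, shipping, loyalty, and analytics each subscribe independently.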
When to use event-driven: When multiple consumers need to react to the same business event. When systems must be decoupled (the producer shouldn't know or depend on consumers). When data pipelines need near-real-time freshness for streaming analytics or real-time dashboards.
Pattern 4: Change Data Capture (CDC)
CDC captures row-level changes (inserts, updates, deletes) from source databases and propagates them to target systems in near-real-time. Unlike batch extraction (which re-reads entire tables periodically), CDC captures only what changed since the last sync — reducing source system load, network bandwidth, and target processing time.
CDC Implementation Approaches
Log-based CDC reads the database's transaction log (binlog in MySQL, WAL in PostgreSQL, the transaction log in SQL Server) to capture changes. This is the preferred approach because it imposes negligible load on the source database: the log is already being written for recovery purposes, and CDC tools simply read it as a secondary consumer. Debezium (open-source), Azure Data Factory CDC, Fivetran, and Qlik Replicate implement log-based CDC.
Query-based CDC periodically queries the source table for rows where a "last modified" timestamp exceeds the previous sync point. Simpler to implement but has limitations: requires a reliable timestamp column on every table, can't detect deletes (no row to query), and imposes query load on the source database during each sync.
Trigger-based CDC uses database triggers to capture changes into a staging table. Captures all change types (insert, update, delete) but adds overhead to every write operation on the source table — which can degrade transactional performance for high-volume tables.
| CDC Approach | Source Impact | Captures Deletes? | Latency | Best For |
|---|---|---|---|---|
| Log-based | Negligible — reads existing log | Yes | Seconds | Production databases with minimal-impact requirements |
| Query-based | Moderate — periodic queries | No (soft deletes only) | Minutes | Simple implementations, tolerant sources |
| Trigger-based | High — every write triggers | Yes | Immediate | Legacy systems without log access |
For enterprise data pipeline architecture, log-based CDC with Debezium (publishing to Kafka) or Azure Data Factory's native CDC capability is the standard recommendation. It provides near-real-time data freshness with minimal source-system impact, the combination every CDO wants.
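To illustrate the consuming side, the sketch below reads Debezium change events from Kafka and applies them to a target table. It assumes Debezium's JSON converter with schemas disabled, so each message carries the before/after/op fields directly; the topic name and the two loader helpers are hypothetical placeholders.

```python
# Sketch: consume Debezium change events from Kafka and apply them to a target table.
# Assumes Debezium's JSON converter with schemas disabled; topic and helpers are placeholders.
import json
from kafka import KafkaConsumer   # pip install kafka-python

def upsert_row(table: str, row: dict) -> None:
    """Placeholder: in practice, a merge/upsert into the lakehouse table."""
    print(f"UPSERT {table}: {row}")

def delete_row(table: str, row: dict) -> None:
    """Placeholder: in practice, a delete or soft-delete in the target."""
    print(f"DELETE {table}: {row}")

consumer = KafkaConsumer(
    "erp.sales.orders",                                   # Debezium topic: <prefix>.<schema>.<table>
    bootstrap_servers="broker.example.internal:9092",
    group_id="lakehouse-ingest",
    enable_auto_commit=False,
    value_deserializer=lambda v: json.loads(v.decode("utf-8")) if v else None,
)

for message in consumer:
    change = message.value
    if change is None:                                    # tombstone that follows a delete
        continue
    if change["op"] in ("c", "r", "u"):                   # create, snapshot read, update
        upsert_row("orders", change["after"])
    elif change["op"] == "d":
        delete_row("orders", change["before"])
    consumer.commit()                                     # commit only after applying the change
```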
Pattern 5: Hybrid — Combining Patterns for Enterprise Reality
No single integration pattern serves all enterprise needs. The realistic architecture keeps batch ELT as the foundation for scheduled analytical loads and layers the other patterns on top of it, matched to use case requirements:
CDC for Near-Real-Time Analytics
CDC on the 3-5 highest-velocity source tables (transactions, inventory, customer interactions) propagates changes to the Fabric lakehouse in near-real-time. Feeds operational dashboards that need current-hour data without waiting for the nightly batch.
API for Operational Systems
APIs connect customer-facing systems — the portal checks real-time inventory, the mobile app retrieves order status, the partner system pushes transactions. Synchronous, low-latency, request/response.
Events for Cross-System Reactions
Business events (order placed, customer churned, threshold breached) published to the event bus trigger downstream actions — alerts, workflow updates, ML model re-scoring, and real-time dashboard updates. Asynchronous, decoupled, one-to-many.
The hybrid architecture matches integration pattern to use case requirement. Analytical workloads get batch (cost-effective, high-volume). Operational systems get APIs (low-latency, synchronous). Near-real-time analytics get CDC (low-impact, continuous). Cross-system coordination gets events (decoupled, reactive).
Platform Selection: Azure Data Factory, Fabric, MuleSoft, or Informatica
| Platform | Strongest Pattern | Best For | Limitation |
|---|---|---|---|
| Azure Data Factory | Batch ELT, CDC | Azure-native orgs, 100+ connectors, cost-effective batch | Limited real-time (no native streaming) |
| Microsoft Fabric Data Factory | Batch ELT, Dataflows, CDC | Unified with Fabric lakehouse, simplified pipeline management | Newer platform, evolving feature set |
| MuleSoft | API-led integration | API-first architectures, complex B2B integration, Salesforce orgs | Expensive licensing, steep learning curve |
| Informatica IDMC | Enterprise ETL, MDM, data quality | Complex transformation, data governance, legacy migration | Cost, complexity for simpler use cases |
| Fivetran / Airbyte | SaaS-to-warehouse replication | Quick setup for 500+ SaaS connectors, no-code | Limited transformation, basic CDC |
| Debezium + Kafka | Log-based CDC, event streaming | High-volume CDC, event-driven architecture, multi-cloud | Operational complexity, requires Kafka expertise |
For Microsoft-stack enterprises, the recommended approach: Fabric Data Factory for batch ELT (unified with the analytical platform), Azure API Management for API-led integration, Azure Event Hubs for event-driven patterns (managed, Kafka-compatible), and Azure Data Factory CDC for near-real-time replication. This combination covers all four core patterns from a single cloud ecosystem.
Pipeline Design Principles: Idempotency, Observability, Recovery
Integration pipelines run at scale: thousands of executions per day, millions of rows per run. At that scale, failures are not exceptional; they are expected conditions the pipeline must handle gracefully.
Idempotency: Running a pipeline twice on the same data produces the same result as running it once. No duplicates, no missing records, no corrupted state. Idempotent pipelines use merge/upsert patterns (not append-only), deduplicate on business keys, and maintain watermarks that track exactly which data has been processed.
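A minimal idempotent-load sketch, assuming a Spark environment with Delta-format tables (as in a Fabric lakehouse); the table names, business key, and watermark column are illustrative.

```python
# Idempotency sketch: merge on the business key so re-running the same batch
# produces the same result. Assumes a Delta-format target table; names are illustrative.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("orders_upsert").getOrCreate()

batch = spark.table("landing_orders").dropDuplicates(["order_id"])   # dedupe on business key

target = DeltaTable.forName(spark, "curated_orders")
(
    target.alias("t")
    .merge(batch.alias("s"), "t.order_id = s.order_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)

# Advance the watermark only after the merge succeeds; a failed run re-reads the same slice.
new_watermark = batch.agg(F.max("modified_at")).first()[0]
```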
Observability: Every pipeline execution produces telemetry — rows extracted, rows loaded, execution duration, data quality scores, and any anomalies detected. Dashboards show pipeline health. Alerts fire when pipelines fail, slow down, or process unexpected volumes. Without observability, pipeline failures are discovered when the Power BI dashboard shows stale data — which is the worst time to discover it.
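As one illustration, a pipeline run can emit its metrics as a single structured log record that the monitoring platform turns into dashboards and alerts; the metric names, counts, and the 1% rejection threshold below are made up.

```python
# Observability sketch: emit run-level telemetry as one structured log record.
import json
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("pipeline.orders_daily")

started = time.time()
rows_extracted, rows_loaded, rows_rejected = 125_430, 125_401, 29     # produced by the run

metrics = {
    "pipeline": "orders_daily",
    "rows_extracted": rows_extracted,
    "rows_loaded": rows_loaded,
    "rows_rejected": rows_rejected,
    "duration_seconds": round(time.time() - started, 1),
    "status": "succeeded" if rows_rejected / rows_extracted < 0.01 else "degraded",
}
logger.info(json.dumps(metrics))   # shipped to the monitoring platform for dashboards and alerts
```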
Recovery: When a pipeline fails (and it will — source system outages, network issues, schema changes, data quality exceptions), what happens? The pipeline should log the failure with diagnostic detail, alert the operations team, and support replay from the point of failure — not require a full re-run. Data engineering best practices include checkpoint-based recovery, dead-letter queues for rejected records, and automated retry with exponential backoff.
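A sketch of the retry piece, assuming a simple in-process loader; the backoff parameters, loader stub, and dead-letter placeholder are illustrative.

```python
# Recovery sketch: automated retry with exponential backoff plus a dead-letter path.
import random
import time

def with_retry(func, max_attempts: int = 5, base_delay: float = 2.0):
    """Call func(); on failure, wait base_delay * 2^(attempt-1) plus jitter, then retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception as exc:
            if attempt == max_attempts:
                raise                                      # surface to alerting after the last attempt
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 1)
            print(f"attempt {attempt} failed ({exc}); retrying in {delay:.1f}s")
            time.sleep(delay)

def send_to_dead_letter(payload) -> None:
    """Placeholder: in practice, write to a DLQ topic or rejected-records table."""
    print(f"dead-lettered: {payload}")

def load_batch():
    """Placeholder for the actual extract/load step."""
    return "ok"

try:
    with_retry(load_batch)
except Exception as exc:
    send_to_dead_letter({"step": "load_batch", "error": str(exc)})
```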
Enterprise data pipelines should achieve a 99%+ success rate. For a pipeline running 100 times daily, that's at most one failure per day. For critical pipelines (financial reporting, regulatory), target 99.9%. Achieve this through idempotent design, thorough error handling, and monitoring that detects issues before they cascade.
Integration Roadmap: Decision-Driven, Not Source-Driven
The traditional integration approach starts with source systems: "let's integrate the ERP, then the CRM, then the marketing platform." This produces a data lake containing integrated data that nobody asked for — 200 tables from 14 systems, technically correct, analytically useless because nobody designed the integration for specific analytical questions.
The decision-driven approach starts with the analytical or operational use cases: which decisions need integrated data, which sources do those decisions require, and at what freshness? This produces integration that serves specific outcomes from day one.
Month 1: Use Case Discovery
Identify the top 5-8 analytical and operational use cases that require cross-system data. For each: which source systems, which entities (customer, product, order), which granularity, what freshness? This produces the integration requirements — not a source system catalog, but a use-case-driven data requirements document.
Months 2-3: Core Entity Integration
Build the integration for the 2-3 core entities (Customer, Product, Order/Transaction) from the sources the top use cases require. Design master data management patterns — how to reconcile "customer" across CRM, ERP, and e-commerce into a unified customer entity. Deploy to the analytical platform with quality validation.
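As a small illustration of one common MDM pattern, deterministic matching with source-priority survivorship, the sketch below unifies customer records on a normalized email key; the column names, sample data, and priority order are assumptions.

```python
# Sketch: deterministic customer matching with source-priority survivorship.
import pandas as pd

crm  = pd.DataFrame({"email": ["Ana@x.com"], "name": ["Ana Diaz"], "source": ["crm"]})
erp  = pd.DataFrame({"email": ["ana@x.com"], "name": ["A. Diaz"],  "source": ["erp"]})
ecom = pd.DataFrame({"email": ["ana@x.com"], "name": ["Ana D."],   "source": ["ecommerce"]})

customers = pd.concat([crm, erp, ecom], ignore_index=True)
customers["match_key"] = customers["email"].str.strip().str.lower()   # deterministic match rule

priority = {"crm": 0, "erp": 1, "ecommerce": 2}                       # survivorship: prefer CRM
unified = (
    customers.assign(rank=customers["source"].map(priority))
    .sort_values("rank")
    .drop_duplicates(subset="match_key", keep="first")                # one golden record per match key
    .drop(columns="rank")
)
print(unified)
```

Real implementations add fuzzy matching, manual stewardship queues, and attribute-level survivorship, but the principle is the same: one agreed rule set producing one unified customer entity.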
Months 4-5: Analytical Layer
Build the dimensional model that serves the analytical use cases — star schema facts and dimensions designed for the specific questions. Connect Power BI semantic models. Validate that the integrated data produces correct analytical results. The data warehouse design follows the analytical requirements, not the source system structure.
Month 6: Operational Integration + Expansion
Add API and event-driven patterns for operational use cases (real-time inventory, customer portal, partner integration). Begin integrating additional sources for Wave 2 use cases. The architecture scales horizontally — each new source follows the established patterns.
The Xylity Approach
We implement data integration with use-case-driven scoping — starting from the decisions that need integrated data, not from the source systems that contain it. Our ETL developers and data engineers build the integration layer alongside your team, transferring pipeline design, monitoring, and operational discipline. The output is a functioning integration platform that serves analytical and operational use cases — not a data lake nobody uses.
Go Deeper
Continue building your understanding with these related resources from our consulting practice.
Integrate Data for Decisions, Not Storage
Five patterns — batch, API, event-driven, CDC, hybrid — matched to the use cases that drive your business. Integration architecture that serves outcomes from day one.
Start Your Data Integration Engagement →