In This Article
- The Cost of Fragmented Data: Why Integration Is the First Priority
- Five Integration Architecture Patterns
- Pattern 1: Batch ETL/ELT — The Foundation
- Pattern 2: API-Led Integration
- Pattern 3: Event-Driven Architecture
- Pattern 4: Change Data Capture (CDC)
- Pattern 5: Hybrid — Combining Patterns for Enterprise Reality
- Platform Selection: Azure Data Factory, Fabric, MuleSoft, or Informatica
- Pipeline Design Principles: Idempotency, Observability, Recovery
- Integration Roadmap: Decision-Driven, Not Source-Driven
- Go Deeper
The Cost of Fragmented Data: Why Integration Is the First Priority
A retail enterprise operates 14 source systems: ERP (SAP), CRM (Salesforce), e-commerce platform (Shopify Plus), POS system (Square), marketing automation (HubSpot), inventory management (NetSuite), warehouse management (Manhattan Associates), loyalty program (custom), HR system (Workday), finance (Oracle EPM), web analytics (GA4), support (Zendesk), supply chain (Blue Yonder), and a product information management system (Salsify). Each system has its own definition of "customer," "product," and "order." The marketing team's customer count is 2.3 million. The CRM shows 1.8 million. The loyalty program has 950,000. The finance team reports 1.4 million "active customers." Nobody is wrong — they're each counting different things from different systems with different definitions.
This is the fragmented data estate. Every analytical question that crosses system boundaries — "what's the lifetime value of customers acquired through digital channels?" — requires manual data reconciliation before analysis can begin. Analysts spend 60-70% of their time finding, joining, and reconciling data. The remaining 30-40% is the actual analysis. The data analytics investment can't deliver value because the integration layer that feeds it doesn't exist.
Data integration solves this by creating a unified data layer where "customer" means one thing, "order" has one definition, and analytical questions that cross system boundaries can be answered without 3 weeks of data reconciliation. The architecture patterns described here determine how data flows from 14 source systems into that unified layer — reliably, efficiently, and at the freshness each use case requires.
Five Integration Architecture Patterns
Enterprise data integration isn't one pattern — it's a portfolio of patterns, each suited to different data velocity, volume, and use case requirements. Most enterprises need 3-4 patterns operating simultaneously.
| Pattern | Data Velocity | Best For | Latency | Complexity |
|---|---|---|---|---|
| Batch ETL/ELT | Scheduled (hourly/daily) | Warehousing, reporting, historical analytics | Hours | Low-Medium |
| API-Led | On-demand (request/response) | Application integration, microservices, portals | Milliseconds | Medium |
| Event-Driven | Continuous (as events occur) | Real-time analytics, notifications, workflow triggers | Seconds | Medium-High |
| CDC | Near-real-time (change propagation) | Database replication, data lake ingestion, cache sync | Seconds-Minutes | Medium |
| Hybrid | Mixed | Enterprise reality — different patterns for different use cases | Varies | High |
Pattern 1: Batch ETL/ELT — The Foundation
Batch integration — extracting data from source systems on a schedule, transforming it, and loading it into the analytical platform — is the foundation that 80% of enterprise analytics runs on. It's the right pattern when the analytical use case tolerates latency (daily financial reporting, weekly business reviews, monthly trend analysis) and data volumes are large enough to benefit from bulk processing.
ETL vs. ELT: The Architecture Decision
ETL (Extract-Transform-Load) transforms data before loading into the target. The transformation engine (Informatica, SSIS, Talend) processes data in a staging area, applies business rules, and loads clean data into the warehouse. Best when: the target warehouse has limited compute, transformations are complex, and data governance requires quality gates before data reaches the warehouse.
ELT (Extract-Load-Transform) loads raw data first, then transforms using the target platform's compute. Microsoft Fabric, Snowflake, and Databricks make ELT the default for cloud analytics — the platform's distributed compute handles transformations faster and cheaper than a separate ETL engine. Best when: the target platform has elastic compute, you want raw data preserved for flexibility, and transformations benefit from SQL-based processing.
The industry has largely shifted to ELT for new implementations. ETL remains relevant for legacy warehouse environments and scenarios where data must be transformed before it touches the analytical platform (regulatory requirements, PII masking before loading).
If your data platform is cloud-native (Fabric, Snowflake, Databricks, BigQuery), default to ELT. Load raw data into the landing zone, transform in the platform using SQL or Spark. If your warehouse is on-premises or has limited compute, ETL with a dedicated engine offloads transformation work. The decision is platform-driven, not philosophical.
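As a rough sketch of the ELT flow on a Spark-based platform (Fabric, Databricks, or similar), the example below lands a raw extract untouched and then applies transformations with the platform's own compute. The file path, table names, and columns are hypothetical.

```python
# ELT sketch: load the raw extract as-is, then transform with the platform's compute.
# Path, table names, and columns are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("orders_elt").getOrCreate()

# Extract + Load: land raw data untouched so it can always be re-processed.
raw_orders = spark.read.json("Files/landing/ecommerce/orders/2024-06-01/")
raw_orders.write.mode("overwrite").saveAsTable("landing_orders")

# Transform: apply business rules inside the platform using Spark SQL functions.
curated = (
    spark.table("landing_orders")
    .withColumn("order_date", F.to_date("created_at"))
    .withColumn("order_total", F.col("total_price").cast("decimal(18,2)"))
    .filter(F.col("cancelled_at").isNull())       # drop cancelled orders
    .dropDuplicates(["order_id"])                 # safe to re-run
)
curated.write.mode("overwrite").saveAsTable("curated_orders")
```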
Pattern 2: API-Led Integration
API-led integration connects systems through well-defined service interfaces — request/response communication where one system calls another system's API to read or write data. This is the pattern for operational integration: the e-commerce platform checks inventory in real time, the CRM retrieves order history from the ERP, the customer portal shows support tickets from Zendesk.
Three-Layer API Architecture
System APIs expose source system data through standardized interfaces. Each source system gets a thin API layer that abstracts the underlying database or proprietary interface. The CRM system API returns customer data in a standard format regardless of whether the CRM is Salesforce, Dynamics, or HubSpot.
Process APIs orchestrate business logic across multiple system APIs. The "customer 360" process API calls the CRM system API for demographics, the ERP system API for transactions, and the support system API for tickets — composing a unified customer view that no single source system contains.
Experience APIs serve specific front-end needs — mobile app, web portal, partner portal. Each experience API consumes process APIs and shapes the response for its specific consumer. The mobile app gets a lightweight JSON payload. The internal dashboard gets a richer dataset. Same underlying data, different presentations.
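A minimal sketch of the middle layer, assuming three hypothetical system API endpoints and response shapes; a real process API would add authentication, error handling, and caching.

```python
# Process API sketch: compose three system APIs into a unified "customer 360" view.
# Endpoint URLs and response fields are hypothetical.
import requests

SYSTEM_APIS = {
    "crm":     "https://integration.example.internal/sys/crm/customers",
    "erp":     "https://integration.example.internal/sys/erp/transactions",
    "support": "https://integration.example.internal/sys/support/tickets",
}

def get_customer_360(customer_id: str) -> dict:
    """Orchestrate system APIs; an experience API would reshape this per consumer."""
    profile = requests.get(f"{SYSTEM_APIS['crm']}/{customer_id}", timeout=5).json()
    orders  = requests.get(SYSTEM_APIS["erp"], params={"customerId": customer_id}, timeout=5).json()
    tickets = requests.get(SYSTEM_APIS["support"], params={"customerId": customer_id}, timeout=5).json()

    return {
        "customer_id": customer_id,
        "profile": profile,
        "order_count": len(orders),
        "open_tickets": [t for t in tickets if t.get("status") == "open"],
    }
```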
MuleSoft, Azure API Management, AWS API Gateway, and Kong are the primary platforms. For enterprises already on the Microsoft stack, Azure API Management integrates natively with Azure services and enterprise integration patterns.
Pattern 3: Event-Driven Architecture
Event-driven integration communicates through events — "order placed," "customer updated," "inventory below threshold" — published to a message broker and consumed by any system that needs to react. Unlike API integration (point-to-point, synchronous), event-driven integration is publish-subscribe (one-to-many, asynchronous). A single "order placed" event can trigger inventory update, shipping notification, loyalty point accrual, and real-time analytics — without the order system knowing or caring about any of these downstream consumers.
Event Streaming Platforms
Apache Kafka is the industry standard for high-throughput event streaming. Billions of events per day, durable storage, exactly-once semantics, and a rich ecosystem of connectors. The trade-off: operational complexity. Kafka clusters require sizing, partition management, and monitoring.
Azure Event Hubs is the managed alternative for Azure-native organizations. Compatible with the Kafka protocol (existing Kafka producers/consumers work without code changes), fully managed (no cluster operations), and integrated with Azure Stream Analytics and Fabric Real-Time Intelligence for stream processing.
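A minimal publish-side sketch, assuming a Kafka-compatible broker; the broker address, topic, and event fields are illustrative, and an Event Hubs Kafka endpoint additionally requires SASL connection settings not shown here.

```python
# Publishing an "order placed" business event to a Kafka-compatible broker.
# Broker address, topic name, and event shape are illustrative only.
import json
from kafka import KafkaProducer   # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="broker.example.internal:9092",
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {
    "event_type": "order.placed",
    "order_id": "SO-1042",
    "customer_id": "C-9871",
    "total": 129.90,
    "occurred_at": "2024-06-01T14:32:05Z",
}

# Keying by order_id keeps all events for the same order on the same partition.
producer.send("orders", key=event["order_id"], value=event)
producer.flush()
```

The producer neither knows nor cares which consumers react to the event; inventory, shipping, loyalty, and analytics each subscribe independently.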
When to use event-driven: When multiple consumers need to react to the same business event. When systems must be decoupled (the producer shouldn't know or depend on consumers). When data pipelines need near-real-time freshness for streaming analytics or real-time dashboards.
Pattern 4: Change Data Capture (CDC)
CDC captures row-level changes (inserts, updates, deletes) from source databases and propagates them to target systems in near-real-time. Unlike batch extraction (which re-reads entire tables periodically), CDC captures only what changed since the last sync — reducing source system load, network bandwidth, and target processing time.
CDC Implementation Approaches
Log-based CDC reads the database's transaction log (binlog in MySQL, WAL in PostgreSQL, the transaction log in SQL Server) to capture changes. This is the preferred approach because it imposes negligible load on the source database: the log is already being written for recovery purposes, and CDC tools simply read it as a secondary consumer. Debezium (open-source), Azure Data Factory CDC, Fivetran, and Qlik Replicate implement log-based CDC.
Query-based CDC periodically queries the source table for rows where a "last modified" timestamp exceeds the previous sync point. Simpler to implement but has limitations: requires a reliable timestamp column on every table, can't detect deletes (no row to query), and imposes query load on the source database during each sync.
Trigger-based CDC uses database triggers to capture changes into a staging table. Captures all change types (insert, update, delete) but adds overhead to every write operation on the source table — which can degrade transactional performance for high-volume tables.
| CDC Approach | Source Impact | Captures Deletes? | Latency | Best For |
|---|---|---|---|---|
| Log-based | Negligible — reads existing log | Yes | Seconds | Production databases with minimal-impact requirements |
| Query-based | Moderate — periodic queries | No (soft deletes only) | Minutes | Simple implementations, tolerant sources |
| Trigger-based | High — every write triggers | Yes | Immediate | Legacy systems without log access |
For enterprise data pipeline architecture, log-based CDC with Debezium (publishing to Kafka) or Azure Data Factory's native CDC capability is the standard recommendation. It provides near-real-time data freshness with minimal source-system impact, the combination every CDO wants.
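To illustrate the consuming side, the sketch below reads Debezium change events from Kafka and applies them to a target table. It assumes Debezium's JSON converter with schemas disabled, so each message carries the before/after/op fields directly; the topic name and the two loader helpers are hypothetical placeholders.

```python
# Sketch: consume Debezium change events from Kafka and apply them to a target table.
# Assumes Debezium's JSON converter with schemas disabled; topic and helpers are placeholders.
import json
from kafka import KafkaConsumer   # pip install kafka-python

def upsert_row(table: str, row: dict) -> None:
    """Placeholder: in practice, a merge/upsert into the lakehouse table."""
    print(f"UPSERT {table}: {row}")

def delete_row(table: str, row: dict) -> None:
    """Placeholder: in practice, a delete or soft-delete in the target."""
    print(f"DELETE {table}: {row}")

consumer = KafkaConsumer(
    "erp.sales.orders",                                   # Debezium topic: <prefix>.<schema>.<table>
    bootstrap_servers="broker.example.internal:9092",
    group_id="lakehouse-ingest",
    enable_auto_commit=False,
    value_deserializer=lambda v: json.loads(v.decode("utf-8")) if v else None,
)

for message in consumer:
    change = message.value
    if change is None:                                    # tombstone that follows a delete
        continue
    if change["op"] in ("c", "r", "u"):                   # create, snapshot read, update
        upsert_row("orders", change["after"])
    elif change["op"] == "d":
        delete_row("orders", change["before"])
    consumer.commit()                                     # commit only after applying the change
```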
Pattern 5: Hybrid — Combining Patterns for Enterprise Reality
No single integration pattern serves all enterprise needs. The realistic architecture keeps batch ELT as the foundation for scheduled analytical loads and layers the other patterns on top of it, matched to use case requirements:
CDC for Near-Real-Time Analytics
CDC on the 3-5 highest-velocity source tables (transactions, inventory, customer interactions) propagates changes to the Fabric lakehouse in near-real-time. Feeds operational dashboards that need current-hour data without waiting for the nightly batch.
API for Operational Systems
APIs connect customer-facing systems — the portal checks real-time inventory, the mobile app retrieves order status, the partner system pushes transactions. Synchronous, low-latency, request/response.
Events for Cross-System Reactions
Business events (order placed, customer churned, threshold breached) published to the event bus trigger downstream actions — alerts, workflow updates, ML model re-scoring, and real-time dashboard updates. Asynchronous, decoupled, one-to-many.
The hybrid architecture matches integration pattern to use case requirement. Analytical workloads get batch (cost-effective, high-volume). Operational systems get APIs (low-latency, synchronous). Near-real-time analytics get CDC (low-impact, continuous). Cross-system coordination gets events (decoupled, reactive).
Platform Selection: Azure Data Factory, Fabric, MuleSoft, or Informatica
| Platform | Strongest Pattern | Best For | Limitation |
|---|---|---|---|
| Azure Data Factory | Batch ELT, CDC | Azure-native orgs, 100+ connectors, cost-effective batch | Limited real-time (no native streaming) |
| Microsoft Fabric Data Factory | Batch ELT, Dataflows, CDC | Unified with Fabric lakehouse, simplified pipeline management | Newer platform, evolving feature set |
| MuleSoft | API-led integration | API-first architectures, complex B2B integration, Salesforce orgs | Expensive licensing, steep learning curve |
| Informatica IDMC | Enterprise ETL, MDM, data quality | Complex transformation, data governance, legacy migration | Cost, complexity for simpler use cases |
| Fivetran / Airbyte | SaaS-to-warehouse replication | Quick setup for 500+ SaaS connectors, no-code | Limited transformation, basic CDC |
| Debezium + Kafka | Log-based CDC, event streaming | High-volume CDC, event-driven architecture, multi-cloud | Operational complexity, requires Kafka expertise |
For Microsoft-stack enterprises, the recommended approach: Fabric Data Factory for batch ELT (unified with the analytical platform), Azure API Management for API-led integration, Azure Event Hubs for event-driven patterns (managed, Kafka-compatible), and Azure Data Factory CDC for near-real-time replication. This combination covers all four core patterns from a single cloud ecosystem.
Pipeline Design Principles: Idempotency, Observability, Recovery
Integration pipelines run at scale: thousands of executions per day, millions of rows per run. At that scale, failures are not exceptional; they are expected conditions the pipeline must handle gracefully.
Idempotency: Running a pipeline twice on the same data produces the same result as running it once. No duplicates, no missing records, no corrupted state. Idempotent pipelines use merge/upsert patterns (not append-only), deduplicate on business keys, and maintain watermarks that track exactly which data has been processed.
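A minimal idempotent-load sketch, assuming a Spark environment with Delta-format tables (as in a Fabric lakehouse); the table names, business key, and watermark column are illustrative.

```python
# Idempotency sketch: merge on the business key so re-running the same batch
# produces the same result. Assumes a Delta-format target table; names are illustrative.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("orders_upsert").getOrCreate()

batch = spark.table("landing_orders").dropDuplicates(["order_id"])   # dedupe on business key

target = DeltaTable.forName(spark, "curated_orders")
(
    target.alias("t")
    .merge(batch.alias("s"), "t.order_id = s.order_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)

# Advance the watermark only after the merge succeeds; a failed run re-reads the same slice.
new_watermark = batch.agg(F.max("modified_at")).first()[0]
```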
Observability: Every pipeline execution produces telemetry — rows extracted, rows loaded, execution duration, data quality scores, and any anomalies detected. Dashboards show pipeline health. Alerts fire when pipelines fail, slow down, or process unexpected volumes. Without observability, pipeline failures are discovered when the Power BI dashboard shows stale data — which is the worst time to discover it.
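As one illustration, a pipeline run can emit its metrics as a single structured log record that the monitoring platform turns into dashboards and alerts; the metric names, counts, and the 1% rejection threshold below are made up.

```python
# Observability sketch: emit run-level telemetry as one structured log record.
import json
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("pipeline.orders_daily")

started = time.time()
rows_extracted, rows_loaded, rows_rejected = 125_430, 125_401, 29     # produced by the run

metrics = {
    "pipeline": "orders_daily",
    "rows_extracted": rows_extracted,
    "rows_loaded": rows_loaded,
    "rows_rejected": rows_rejected,
    "duration_seconds": round(time.time() - started, 1),
    "status": "succeeded" if rows_rejected / rows_extracted < 0.01 else "degraded",
}
logger.info(json.dumps(metrics))   # shipped to the monitoring platform for dashboards and alerts
```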
Recovery: When a pipeline fails (and it will — source system outages, network issues, schema changes, data quality exceptions), what happens? The pipeline should log the failure with diagnostic detail, alert the operations team, and support replay from the point of failure — not require a full re-run. Data engineering best practices include checkpoint-based recovery, dead-letter queues for rejected records, and automated retry with exponential backoff.
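A sketch of the retry piece, assuming a simple in-process loader; the backoff parameters, loader stub, and dead-letter placeholder are illustrative.

```python
# Recovery sketch: automated retry with exponential backoff plus a dead-letter path.
import random
import time

def with_retry(func, max_attempts: int = 5, base_delay: float = 2.0):
    """Call func(); on failure, wait base_delay * 2^(attempt-1) plus jitter, then retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception as exc:
            if attempt == max_attempts:
                raise                                      # surface to alerting after the last attempt
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 1)
            print(f"attempt {attempt} failed ({exc}); retrying in {delay:.1f}s")
            time.sleep(delay)

def send_to_dead_letter(payload) -> None:
    """Placeholder: in practice, write to a DLQ topic or rejected-records table."""
    print(f"dead-lettered: {payload}")

def load_batch():
    """Placeholder for the actual extract/load step."""
    return "ok"

try:
    with_retry(load_batch)
except Exception as exc:
    send_to_dead_letter({"step": "load_batch", "error": str(exc)})
```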
Enterprise data pipelines should achieve a 99%+ success rate. For a pipeline running 100 times daily, that's at most one failure per day. For critical pipelines (financial reporting, regulatory), target 99.9%. Achieve this through idempotent design, thorough error handling, and monitoring that detects issues before they cascade.
Integration Roadmap: Decision-Driven, Not Source-Driven
The traditional integration approach starts with source systems: "let's integrate the ERP, then the CRM, then the marketing platform." This produces a data lake containing integrated data that nobody asked for — 200 tables from 14 systems, technically correct, analytically useless because nobody designed the integration for specific analytical questions.
The decision-driven approach starts with the analytical or operational use cases: which decisions need integrated data, which sources do those decisions require, and at what freshness? This produces integration that serves specific outcomes from day one.
Month 1: Use Case Discovery
Identify the top 5-8 analytical and operational use cases that require cross-system data. For each: which source systems, which entities (customer, product, order), which granularity, what freshness? This produces the integration requirements — not a source system catalog, but a use-case-driven data requirements document.
Months 2-3: Core Entity Integration
Build the integration for the 2-3 core entities (Customer, Product, Order/Transaction) from the sources the top use cases require. Design master data management patterns — how to reconcile "customer" across CRM, ERP, and e-commerce into a unified customer entity. Deploy to the analytical platform with quality validation.
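As a small illustration of one common MDM pattern, deterministic matching with source-priority survivorship, the sketch below unifies customer records on a normalized email key; the column names, sample data, and priority order are assumptions.

```python
# Sketch: deterministic customer matching with source-priority survivorship.
import pandas as pd

crm  = pd.DataFrame({"email": ["Ana@x.com"], "name": ["Ana Diaz"], "source": ["crm"]})
erp  = pd.DataFrame({"email": ["ana@x.com"], "name": ["A. Diaz"],  "source": ["erp"]})
ecom = pd.DataFrame({"email": ["ana@x.com"], "name": ["Ana D."],   "source": ["ecommerce"]})

customers = pd.concat([crm, erp, ecom], ignore_index=True)
customers["match_key"] = customers["email"].str.strip().str.lower()   # deterministic match rule

priority = {"crm": 0, "erp": 1, "ecommerce": 2}                       # survivorship: prefer CRM
unified = (
    customers.assign(rank=customers["source"].map(priority))
    .sort_values("rank")
    .drop_duplicates(subset="match_key", keep="first")                # one golden record per match key
    .drop(columns="rank")
)
print(unified)
```

Real implementations add fuzzy matching, manual stewardship queues, and attribute-level survivorship, but the principle is the same: one agreed rule set producing one unified customer entity.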
Months 4-5: Analytical Layer
Build the dimensional model that serves the analytical use cases — star schema facts and dimensions designed for the specific questions. Connect Power BI semantic models. Validate that the integrated data produces correct analytical results. The data warehouse design follows the analytical requirements, not the source system structure.
Month 6: Operational Integration + Expansion
Add API and event-driven patterns for operational use cases (real-time inventory, customer portal, partner integration). Begin integrating additional sources for Wave 2 use cases. The architecture scales horizontally — each new source follows the established patterns.
The Xylity Approach
We implement data integration with use-case-driven scoping — starting from the decisions that need integrated data, not from the source systems that contain it. Our ETL developers and data engineers build the integration layer alongside your team, transferring pipeline design, monitoring, and operational discipline. The output is a functioning integration platform that serves analytical and operational use cases — not a data lake nobody uses.
Go Deeper
Continue building your understanding with these related resources from our consulting practice.
Integrate Data for Decisions, Not Storage
Five patterns — batch, API, event-driven, CDC, hybrid — matched to the use cases that drive your business. Integration architecture that serves outcomes from day one.
Start Your Data Integration Engagement →