When Real-Time Integration Creates Value

A fintech company processes 2 million transactions daily. Its fraud detection system runs on batch analytics — transactions processed overnight, flagged the next morning. By the time a fraudulent pattern is detected, the fraudster has completed 15-20 transactions over 6-8 hours. At $200 average per fraudulent transaction, the overnight delay costs $3,000-4,000 per fraud incident. With 50+ fraud incidents monthly, the batch delay costs $150,000-200,000 per month in preventable losses.

Real-time integration — CDC capturing transaction inserts as they happen, streaming to the fraud detection model for sub-second scoring — would detect the pattern after the 2nd or 3rd transaction instead of the 15th. The investment in streaming infrastructure ($8,000-12,000/month) pays back 10-15x in prevented fraud losses. This is a clear case where real-time integration creates measurable value.

But not every use case justifies real-time. The analytics dashboard that leadership reviews weekly doesn't need real-time data — hourly or daily refresh is sufficient. The financial reports that close monthly don't need streaming — daily batch integration with reconciliation is appropriate. The question isn't "can we do real-time?" but "does the decision latency require real-time, and does the financial impact justify the infrastructure cost?"

Real-time integration is an investment decision, not a technology decision. The question: does the cost of data latency exceed the cost of eliminating it? — Xylity Data Engineering Practice

CDC Deep Dive: Log-Based Capture at Scale

Log-based CDC reads the database's transaction log — the record of every insert, update, and delete the database performs for its own recovery purposes. CDC tools read this log as a secondary consumer, converting log entries into structured change events that downstream systems can process.

Debezium: The Open-Source Standard

Debezium is the most widely adopted log-based CDC platform. It connects to MySQL (binlog), PostgreSQL (WAL), SQL Server (transaction log), Oracle (LogMiner), and MongoDB (oplog). Each source connector reads the database log, converts changes to a standardized JSON format, and publishes to Kafka topics — one topic per table. Downstream consumers (data lake ingestion, real-time analytics, search index sync) subscribe to the topics they need.

CDC at Enterprise Scale: Challenges

Schema evolution. Source tables change — columns added, renamed, data types modified. The CDC pipeline must handle schema evolution without data loss or consumer breakdowns. Debezium supports schema registries (Confluent Schema Registry, Apicurio) that version schemas and enforce compatibility rules (backward, forward, full). Consumers can read new schema versions without code changes if the schema evolution is compatible.
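The backward-compatibility rule a schema registry enforces can be sketched in a few lines: a consumer on the new schema must still be able to decode records written with the old one. The schema representation below is a simplified stand-in for illustration, not the Avro or registry wire format.

```python
# Sketch of a backward-compatibility check in the spirit of a schema
# registry rule: a consumer using the NEW schema must be able to read
# records written with the OLD schema. Simplified schema shape.

def is_backward_compatible(old_fields: dict, new_fields: dict) -> bool:
    """old_fields / new_fields map field name -> {"type": ..., "default": ...}."""
    for name, spec in new_fields.items():
        if name not in old_fields:
            # A field added without a default cannot be filled in
            # when decoding old records -> incompatible.
            if "default" not in spec:
                return False
        elif old_fields[name]["type"] != spec["type"]:
            return False  # a type change breaks decoding of old records
    return True  # dropped fields are fine: new readers simply ignore them

old = {"id": {"type": "long"}, "email": {"type": "string"}}
ok = is_backward_compatible(old, {**old, "plan": {"type": "string", "default": "free"}})
bad = is_backward_compatible(old, {**old, "plan": {"type": "string"}})
```

Adding a column with a default passes; adding one without a default fails — which is exactly why "add nullable or defaulted columns" is the standard advice for CDC source tables.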

Initial snapshot. CDC captures changes from the point it starts — not historical data. The initial snapshot loads the current state of each table before CDC begins capturing changes. For a 500GB table, the initial snapshot can take hours. Production CDC deployments must handle snapshot processing without impacting source database performance — typically by throttling snapshot reads and scheduling during off-peak hours.

Handling deletes. Log-based CDC captures delete events, but downstream systems need to decide how to handle them. The analytical data lake typically implements soft deletes (marking records as deleted with a timestamp) rather than physical deletes — preserving the historical record for audit and analysis while reflecting the current state.
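The soft-delete approach can be sketched as a small apply loop over change events. The event shape here is illustrative, loosely modeled on Debezium's op codes (c = create, u = update, d = delete); it is not the actual connector payload.

```python
# Sketch of applying CDC change events to an analytical table with soft
# deletes: a delete event marks the row rather than removing it, so the
# historical record survives for audit while current state is queryable.

def apply_event(table: dict, event: dict) -> None:
    key = event["key"]
    if event["op"] in ("c", "u"):           # create or update: upsert the row
        table[key] = {**event["row"], "_deleted": False, "_updated_at": event["ts"]}
    elif event["op"] == "d":                # delete: keep the row, flag it
        if key in table:
            table[key]["_deleted"] = True
            table[key]["_updated_at"] = event["ts"]

table = {}
apply_event(table, {"op": "c", "key": 1, "row": {"name": "Ada"}, "ts": 100})
apply_event(table, {"op": "d", "key": 1, "row": None, "ts": 101})
# Row 1 survives with _deleted=True; "current state" queries filter it out.
```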

| CDC Platform | Source Support | Target Delivery | Best For |
| --- | --- | --- | --- |
| Debezium + Kafka | MySQL, PostgreSQL, SQL Server, Oracle, MongoDB, Cassandra | Kafka topics → any consumer | High-volume, multi-consumer, event-driven architectures |
| Azure Data Factory CDC | SQL Server, Azure SQL, PostgreSQL, MySQL, Cosmos DB | ADLS, Fabric lakehouse, Synapse | Azure-native, simpler operational model, batch-oriented CDC |
| Fivetran / Airbyte CDC | Major relational databases + SaaS APIs | Snowflake, BigQuery, Databricks, Fabric | Quick setup, managed service, 500+ connectors |
| Oracle GoldenGate | Oracle, SQL Server, MySQL, PostgreSQL | Kafka, databases, cloud storage | Oracle-heavy environments, bidirectional replication |

Event Streaming Architecture: Kafka and Event Hubs

Event streaming captures business events as they happen and delivers them to any system that needs to react. Unlike CDC (which captures database changes), event streaming captures application-level events — "order placed," "payment processed," "user clicked checkout," "inventory threshold breached." These events carry business meaning, not just data changes.

Apache Kafka Architecture

Kafka organizes events into topics (logical channels). Producers publish events to topics. Consumers subscribe to topics and process events. Topics are partitioned for horizontal scalability — a topic with 12 partitions can be consumed by up to 12 parallel consumers within a single consumer group, each processing a subset of events. Consumer groups enable multiple consumer applications to independently read the same topic at their own pace.
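The two mechanics above — keys mapping deterministically to partitions, and partitions divided among a group's consumers — can be sketched in plain Python. Kafka's real partitioner uses murmur2 hashing and a richer assignment protocol; the CRC hash and round-robin assignment here are illustrative stand-ins.

```python
# Sketch of Kafka-style partitioning: events with the same key always
# land on the same partition (preserving per-key ordering), and each
# partition is owned by exactly one consumer within a group.

import zlib

NUM_PARTITIONS = 12

def partition_for(key: str) -> int:
    # Deterministic hash -> same key, same partition, every time.
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

def assign(partitions: int, consumers: list) -> dict:
    # Round-robin assignment: at most `partitions` consumers do work,
    # which is why partition count caps group parallelism.
    out = {c: [] for c in consumers}
    for p in range(partitions):
        out[consumers[p % len(consumers)]].append(p)
    return out

same = partition_for("order-42") == partition_for("order-42")
assignment = assign(NUM_PARTITIONS, ["c1", "c2", "c3"])
```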

Kafka's durability model retains events for a configurable period (default 7 days, configurable to weeks, months, or indefinitely). This means a new consumer can join and process events from the past — useful for reprocessing after a bug fix, backfilling a new analytical system, or replaying events for testing. This retention model is what distinguishes Kafka from traditional message queues (RabbitMQ, ActiveMQ) where messages are deleted after consumption.
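The retention model can be made concrete with an in-memory stand-in for a single partition: events stay in the log after consumption, addressed by offset, so a late-joining consumer can replay history while a live consumer reads only the tail. This is a conceptual sketch, not a client API.

```python
# Sketch of log retention and replay: consumption does not delete
# events (unlike a traditional message queue), so any consumer can
# start reading from any retained offset.

class PartitionLog:
    def __init__(self):
        self._events = []  # retained until a retention policy trims them

    def append(self, event) -> int:
        self._events.append(event)
        return len(self._events) - 1   # the event's offset

    def read_from(self, offset: int) -> list:
        # Each consumer tracks its own offset; replay is just
        # reading from an earlier one.
        return self._events[offset:]

log = PartitionLog()
for e in ["order_placed", "payment_processed", "order_shipped"]:
    log.append(e)

live_consumer = log.read_from(2)      # tail of the log only
backfill_consumer = log.read_from(0)  # full replay for a new system
```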

Azure Event Hubs: Kafka-Compatible Managed Streaming

Azure Event Hubs provides a managed event streaming service compatible with the Kafka protocol. Existing Kafka producers and consumers work with Event Hubs without code changes — you change the connection string, not the application. Event Hubs eliminates cluster management (no brokers to size, no ZooKeeper to maintain, no partition rebalancing to monitor). For Azure-native organizations using Fabric and Azure services, Event Hubs is the recommended streaming platform — native integration with Stream Analytics, Fabric Real-Time Intelligence, and Azure Functions.

Stream Processing: Flink, Spark Streaming, and Stream Analytics

Raw events need processing before they're analytically useful. Stream processing applies transformations, aggregations, and pattern detection to event streams in real time.

Azure Stream Analytics: SQL-based stream processing. Write queries that aggregate, filter, and join event streams using familiar SQL syntax. Best for: windowed aggregations (5-minute averages, hourly counts), simple pattern detection, Azure-native environments. Limitation: only basic support for stateful processing of complex event patterns.
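The windowed aggregations mentioned above (a Stream Analytics query would express them with a TumblingWindow clause) reduce to a simple idea: floor each event's timestamp to its window start, then aggregate per bucket. A minimal Python sketch, with an illustrative event shape:

```python
# Sketch of a 5-minute tumbling-window average: each event is bucketed
# by flooring its timestamp to the window start, then values are
# averaged per bucket. Windows are non-overlapping and aligned.

from collections import defaultdict

WINDOW_SECONDS = 300  # 5-minute tumbling window

def window_start(ts: int) -> int:
    return ts - (ts % WINDOW_SECONDS)

def tumbling_averages(events):
    """events: iterable of (timestamp_seconds, value) -> {window_start: avg}."""
    sums, counts = defaultdict(float), defaultdict(int)
    for ts, value in events:
        w = window_start(ts)
        sums[w] += value
        counts[w] += 1
    return {w: sums[w] / counts[w] for w in sums}

events = [(0, 10.0), (120, 20.0), (310, 40.0)]
averages = tumbling_averages(events)   # {0: 15.0, 300: 40.0}
```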

Apache Flink: The most capable stream processing engine for complex event processing. Supports event-time processing (handling out-of-order events correctly), complex pattern detection (CEP library), and large-scale stateful processing. Best for: fraud detection, complex alerting rules, applications where event ordering and exactly-once processing matter. Limitation: operational complexity — Flink clusters require expertise.

Spark Structured Streaming: Spark's stream processing engine — treats streams as continuously appending DataFrames. Integrates with the broader Spark ecosystem (ML, SQL, DataFrames). Best for: organizations already using Spark/Databricks, applications that need both batch and streaming on the same platform. Limitation: the default micro-batch architecture means latency is typically seconds, not sub-second.

Fabric Real-Time Intelligence: The Microsoft Approach

Microsoft Fabric Real-Time Intelligence provides an integrated streaming analytics experience within the Fabric platform. The components:

Eventstream: Captures events from Event Hubs, Kafka, custom apps, IoT Hub, and Change Data Capture sources. No-code event routing to multiple destinations — KQL databases, lakehouses, and custom processing.

Eventhouse (KQL Database): Stores events for real-time analytical queries using Kusto Query Language (KQL). Optimized for time-series analytics — query millions of events in seconds. Think of it as a specialized database for event data, separate from the lakehouse/warehouse used for batch analytics.

Real-Time Dashboard: Dashboards connected to Eventhouse that refresh automatically as new events arrive. Unlike Power BI Import dashboards (refresh on schedule), Real-Time Dashboards query live data on every interaction.

Activator: Data-driven alerts and actions. Define conditions on streaming data (inventory below threshold, transaction amount exceeds limit, error rate spikes) and trigger automated responses (email, Teams notification, Power Automate flow). This closes the loop from detection to action without human dashboard monitoring.

When to Use Fabric Real-Time Intelligence vs. Custom Kafka/Flink

Use Fabric Real-Time Intelligence when: your organization is on the Microsoft/Fabric stack, event volumes are moderate (thousands to millions per day), and you want integrated real-time analytics without managing streaming infrastructure. Use Kafka + Flink when: event volumes are extreme (billions per day), you need complex event processing, or you require multi-cloud portability.

The Exactly-Once Challenge: Deduplication and Idempotency

In any distributed system, messages can be delivered at most once (possibly lost), at least once (possibly duplicated), or exactly once. Exactly-once is the hardest to achieve — and the one enterprise analytics requires. A duplicate transaction in the data warehouse inflates revenue. A missing transaction deflates it. Both produce wrong analytics.

Achieving Exactly-Once in Practice

Idempotent producers: Kafka producers with idempotency enabled assign sequence numbers to each message. If a producer retries (network timeout), the broker deduplicates based on the sequence number. This prevents producer-side duplicates.
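The broker-side deduplication this describes can be sketched with a toy broker that remembers the highest sequence number accepted per producer and drops retried messages it has already seen. This is a simplified model of the idea, not Kafka's actual protocol.

```python
# Sketch of idempotent-producer deduplication: a retry after a network
# timeout resends the same sequence number, and the broker drops it
# instead of appending a duplicate.

class Broker:
    def __init__(self):
        self.log = []
        self._last_seq = {}   # producer_id -> highest sequence accepted

    def produce(self, producer_id: str, seq: int, message) -> bool:
        if seq <= self._last_seq.get(producer_id, -1):
            return False      # duplicate retry: acknowledge, don't append
        self._last_seq[producer_id] = seq
        self.log.append(message)
        return True

broker = Broker()
broker.produce("p1", 0, "txn-100")
broker.produce("p1", 0, "txn-100")   # retry after a timed-out ack
broker.produce("p1", 1, "txn-101")
# The log holds each transaction exactly once despite the retry.
```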

Consumer deduplication: Even with idempotent producers, consumer-side processing can introduce duplicates (consumer crashes after processing but before committing the offset). The defense: idempotent processing — design the consumer so processing the same message twice produces the same result. For database writes, use UPSERT (insert or update based on business key) instead of INSERT. For aggregations, use event timestamps and deduplication windows.
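The UPSERT-based idempotent write can be sketched with SQLite's ON CONFLICT clause (warehouse engines offer equivalent MERGE/UPSERT syntax). The table and event shape are illustrative; the point is that replaying the same event leaves the table unchanged.

```python
# Sketch of idempotent consumer processing: an UPSERT keyed on the
# business key means a crash-and-replay of the same event produces
# the same result, not a duplicate row.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY, amount REAL, status TEXT)")

def process(event: dict) -> None:
    # A plain INSERT would duplicate on replay; the UPSERT is idempotent.
    conn.execute(
        """INSERT INTO orders (order_id, amount, status)
           VALUES (:order_id, :amount, :status)
           ON CONFLICT(order_id) DO UPDATE SET
               amount = excluded.amount, status = excluded.status""",
        event,
    )

event = {"order_id": "o-1", "amount": 99.5, "status": "paid"}
process(event)
process(event)   # replayed after an uncommitted offset -> still one row

row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```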

Transactional outbox pattern: For applications that need to both update a database and publish an event atomically, the outbox pattern writes both to the database in a single transaction. A CDC connector (Debezium) reads the outbox table and publishes events — guaranteeing that if the database write succeeds, the event is published, and if it fails, neither happens.
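The pattern's core guarantee — business row and event row commit or roll back together — can be sketched with a single SQLite transaction. Table names and the payload shape are illustrative; a CDC connector such as Debezium would then read the outbox table and publish its rows.

```python
# Sketch of the transactional outbox: the order and its event are
# written in ONE database transaction, so there is no window where
# the database changed but the event was lost (or vice versa).

import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id TEXT PRIMARY KEY, amount REAL);
    CREATE TABLE outbox (event_id INTEGER PRIMARY KEY AUTOINCREMENT,
                         topic TEXT, payload TEXT);
""")

def place_order(order_id: str, amount: float) -> None:
    with conn:  # one transaction: both inserts succeed or neither does
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, amount))
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("orders.placed", json.dumps({"order_id": order_id, "amount": amount})),
        )

place_order("o-7", 42.0)
pending = conn.execute("SELECT topic, payload FROM outbox").fetchall()
```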

Operational Reality: Monitoring, Failure, and Recovery

Streaming infrastructure runs 24/7 — unlike batch pipelines that run on schedule and are "off" between runs. This changes the operational model fundamentally.

Monitoring: Consumer lag (how far behind is the consumer from the latest event?), throughput (events per second processed), error rate (events that failed processing), and end-to-end latency (time from event occurrence to analytical availability). Consumer lag is the primary health indicator — increasing lag means the consumer can't keep up with the producer, and the gap will grow until the consumer fails or the producer is throttled.
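The lag computation itself is simple arithmetic: head offset minus committed offset, per partition. A minimal sketch with illustrative offsets — in production these come from the broker's consumer-group APIs:

```python
# Sketch of the primary streaming health check: per-partition consumer
# lag, plus the alert condition "total lag grew since the last check"
# (a consumer that can't keep up with its producer).

def consumer_lag(head_offsets: dict, committed_offsets: dict) -> dict:
    """Both dicts map partition -> offset; returns partition -> lag."""
    return {p: head_offsets[p] - committed_offsets.get(p, 0)
            for p in head_offsets}

def is_falling_behind(previous_lag: dict, current_lag: dict) -> bool:
    # Increasing total lag means the gap is growing, not shrinking.
    return sum(current_lag.values()) > sum(previous_lag.values())

before = consumer_lag({0: 1000, 1: 1000}, {0: 990, 1: 995})   # lag 10 + 5
after = consumer_lag({0: 1400, 1: 1300}, {0: 1000, 1: 1250})  # lag 400 + 50
alert = is_falling_behind(before, after)
```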

Failure handling: What happens when the stream processing job crashes? Dead-letter queues capture events that failed processing. Checkpointing saves processing state periodically so recovery restarts from the last checkpoint, not from the beginning. Automated restart with health monitoring — the job restarts automatically after a crash, and if it crashes repeatedly, alerts escalate to the operations team.
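Dead-lettering and checkpointing can be sketched together in one processing loop: a poison message is parked with context instead of crashing the job, and the resume offset is saved after each event so a restart continues from the last checkpoint. In-memory stand-ins throughout.

```python
# Sketch of failure handling in a stream consumer: failed events go to
# a dead-letter queue, and the checkpoint records the next offset to
# process so recovery resumes there, not from the beginning.

def run(events, start_offset, checkpoint, dead_letters, handler):
    for offset in range(start_offset, len(events)):
        try:
            handler(events[offset])
        except Exception as exc:
            # Park the poison message with context; keep the stream moving.
            dead_letters.append({"offset": offset, "event": events[offset],
                                 "error": str(exc)})
        checkpoint["offset"] = offset + 1   # resume point after a crash

events = ["ok-1", "bad", "ok-2"]
checkpoint, dlq = {"offset": 0}, []

def handler(e):
    if e == "bad":
        raise ValueError("unparseable event")

run(events, checkpoint["offset"], checkpoint, dlq, handler)
# All three events are accounted for: two processed, one dead-lettered.
```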

Backpressure: When a consumer can't process events as fast as the producer generates them, backpressure mechanisms prevent data loss. Kafka handles this natively through consumer lag — events accumulate in the topic until the consumer catches up. Stream processing engines (Flink, Spark Streaming) implement backpressure by slowing input consumption to match processing capacity.

Cost Management: Streaming Doesn't Have to Break the Budget

Streaming infrastructure costs are continuous (24/7 compute) unlike batch (pay for the run, idle between runs). Cost management practices:

Right-size the infrastructure. A Kafka cluster sized for peak volume runs at 20% utilization 80% of the time. Use autoscaling where available (Event Hubs, managed Kafka services) or schedule capacity changes for predictable volume patterns (high during business hours, low overnight).

Tier the retention. Hot retention (Kafka topics, Eventhouse) for 7-30 days. Cold retention (data lake) for long-term archive. Most consumers need recent events — only reprocessing scenarios need months of history, and the data lake provides that at 10-50x lower cost than streaming storage.

Evaluate whether batch suffices. Before building a streaming pipeline, validate that the use case requires real-time. If near-real-time (15-minute CDC refresh) serves the decision latency, the infrastructure cost is 70-80% lower than true streaming. Reserve streaming for the use cases where sub-minute latency directly affects business outcomes.

The Xylity Approach

We implement real-time data integration with latency-matched architecture — streaming where decisions require sub-second data, CDC for near-real-time, and batch where daily freshness suffices. Our streaming and data engineers build the infrastructure alongside your team — Kafka, Event Hubs, Debezium CDC, Fabric Real-Time Intelligence — transferring the operational knowledge that makes streaming reliable in production.


Real-Time Data Where It Matters

CDC, event streaming, stream processing — matched to the decision latency that justifies the investment. Real-time integration that's operationally sustainable.

Start Your Real-Time Integration Engagement →