Why CDC Replaces Nightly Batch

| Dimension | Nightly Batch ETL | CDC |
| --- | --- | --- |
| Data freshness | 24 hours (stale by 6 AM) | Near-real-time (seconds to minutes) |
| Source impact | Heavy (full table scan during the batch window) | Minimal (reads the transaction log, not the tables) |
| Data volume transferred | Full table every night (50M rows) | Only changes (50K rows) |
| Processing time | 2-6 hours for a full extraction | Continuous (events processed in seconds) |
| Complexity | Low (well-understood pattern) | Medium (requires log access and schema evolution handling) |
| Hard deletes | Detectable only via full comparison | Captured from the transaction log |

CDC is the right choice when: data freshness matters (reports need same-day or near-real-time data), source system performance is critical (the batch window conflicts with production hours), or data volume makes full extraction expensive (tables with 100M+ rows where only 0.1% changes daily). Batch remains the right choice when: daily freshness is sufficient, source systems have a clear batch window, and the team lacks CDC expertise.

CDC doesn't replace your entire ETL pipeline — it replaces the extraction step. The data still needs transformation, quality checks, and loading. CDC makes extraction incremental and near-real-time instead of full and nightly.

Three Types of CDC

- Log-based CDC: reads the database transaction log (WAL in PostgreSQL, binlog in MySQL, the transaction log in SQL Server). Captures every change without querying the tables, with minimal source impact. The gold standard. Tools: Debezium, AWS DMS, Azure Data Factory CDC connector.
- Query-based CDC: periodically queries the source table using a timestamp or version column ("SELECT * WHERE updated_at > last_run"; see the sketch below). Simple to implement, but it requires a reliable timestamp column, misses hard deletes, and adds query load. Use it when log-based CDC isn't available.
- Trigger-based CDC: database triggers fire on INSERT/UPDATE/DELETE and write change records to a shadow table. Adds overhead to every write operation. Use it when log-based CDC isn't available and the source has low write volume.
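As a concrete illustration of the query-based pattern, the sketch below performs an incremental pull keyed on an updated_at watermark. The JDBC URL, schema, table, and column names, and the hard-coded watermark are placeholders; in practice the watermark would be persisted in a control table and the connection secured properly.

```python
# Minimal query-based CDC sketch: incremental pull by timestamp watermark.
# The JDBC URL, table, columns, and watermark handling are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("query-based-cdc").getOrCreate()

# Watermark from the previous run (store this in a control table in practice).
last_run = "2024-06-01 02:00:00"

incremental = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://source-db:5432/erp")   # hypothetical source
    .option("dbtable",
            f"(SELECT * FROM sales.orders WHERE updated_at > '{last_run}') AS changes")
    .option("user", "cdc_reader")
    .option("password", "***")
    .load()
)

# Land the changes for downstream merge; note that hard deletes are NOT captured this way.
incremental.write.mode("append").format("delta").save("Tables/staging_orders_changes")
```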

Debezium: Open-Source Log-Based CDC

Debezium is the most widely used open-source CDC platform:

- How it works: Debezium connects to the database's transaction log, reads change events, and publishes them to Kafka topics. Each table gets its own topic. Each event contains the before state, the after state, the operation type (insert/update/delete), source metadata (database, table, timestamp), and the transaction context (a consumer-side sketch follows below).
- Supported databases: PostgreSQL, MySQL/MariaDB, SQL Server, Oracle, MongoDB, Cassandra, Db2, covering 95%+ of enterprise databases.
- Deployment: runs as a Kafka Connect connector, deployed on the Kafka Connect cluster. No changes to the source database beyond enabling log access.
- Schema evolution: Debezium publishes schema changes to a schema history topic. When a column is added or renamed in the source, the change event includes the updated schema, and downstream consumers can handle the evolution using Avro schema registry compatibility rules.
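To show what those envelope fields look like to a downstream job, here is a consumer-side sketch, assuming the connector publishes plain JSON payloads (no schema wrapper). The Kafka address and topic name are placeholders.

```python
# Sketch: read Debezium change events with Spark Structured Streaming and pull the
# typical envelope fields (op, ts_ms, before, after) out of the JSON payload.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, get_json_object

spark = SparkSession.builder.appName("debezium-consumer").getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "erp.sales.orders")          # hypothetical database.schema.table topic
    .load()
    .select(col("value").cast("string").alias("json"))
    .select(
        get_json_object("json", "$.op").alias("op"),        # c = insert, u = update, d = delete
        get_json_object("json", "$.ts_ms").alias("ts_ms"),  # source change timestamp (ms)
        get_json_object("json", "$.before").alias("before"),  # row state before the change
        get_json_object("json", "$.after").alias("after"),    # row state after the change
    )
)
```

In practice the after payload would then be parsed into typed columns (for example with from_json and the table's schema, often driven by the schema registry) before it is merged into the lakehouse.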

Debezium architecture: Source database (transaction log enabled) → Debezium connector (reads the log, converts changes to events) → Kafka (stores events durably) → Stream processor or Spark (transforms events) → Fabric lakehouse or Databricks (stores the final state in Delta Lake). Source prerequisites: for SQL Server, enable CDC at the database and table level; for PostgreSQL, configure logical replication; for MySQL, enable binary logging. Each database requires specific configuration for log access — the Debezium documentation covers each database type.

Fabric Mirroring: Managed CDC

Microsoft Fabric Mirroring provides managed CDC without Kafka or Debezium:

- How it works: Fabric connects to the source database, reads the transaction log, and replicates changes to the Fabric lakehouse in near-real-time. The mirrored data appears as Delta tables, queryable via Spark, SQL, or Power BI. No Kafka, no Debezium, no connector management.
- Supported sources: Azure SQL Database, SQL Server, Cosmos DB, and Snowflake, with more added regularly.
- Zero infrastructure: no Kafka cluster, no connector deployment, no schema registry. Fabric manages the entire CDC pipeline.
- Automatic schema sync: schema changes in the source are automatically reflected in the mirrored Delta tables.

When to use Fabric Mirroring vs Debezium: choose Fabric Mirroring when the target is a Fabric lakehouse, the source is supported, and you want zero operational overhead. Choose Debezium when you need Kafka as the event backbone (multiple consumers), the source isn't supported by Mirroring, or you're multi-cloud.

CDC to Lakehouse: Delta Lake Merge

CDC events arrive as insert, update, or delete operations, and the Delta Lake MERGE operation applies them: MERGE INTO target USING changes ON target.id = changes.id, then WHEN MATCHED AND the change is a delete THEN DELETE, WHEN MATCHED AND the change is an update THEN UPDATE, WHEN NOT MATCHED AND the change is an insert THEN INSERT. Processing patterns:

- Micro-batch merge: process CDC events every 5-15 minutes — accumulate changes in staging, then MERGE. Simple, reliable, 5-15 minute latency (see the sketch below).
- Streaming merge: Spark Structured Streaming reads from Kafka/Event Hubs and applies the MERGE continuously. Sub-minute latency but higher complexity.
- Append-only with snapshot: append all CDC events raw, then periodically create a snapshot using the latest event per key. Simpler than MERGE but requires snapshot computation.

The micro-batch merge is the recommended starting pattern — it balances freshness (5-15 minutes) with simplicity.
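Here is a minimal micro-batch merge sketch using the Delta Lake Python API. It assumes the staged change rows have already been flattened to the target's columns plus Debezium-style op and ts_ms fields; the paths and the key column id are placeholders.

```python
# Micro-batch merge sketch: apply staged CDC events to a Delta target table.
# Paths, the key column (id), and the op codes (c/u/d, Debezium-style) are assumptions.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, row_number

spark = SparkSession.builder.appName("cdc-merge").getOrCreate()

changes = spark.read.format("delta").load("Tables/staging_orders_changes")

# Keep only the latest event per key so one MERGE applies one net change per row.
latest = (
    changes.withColumn(
        "rn", row_number().over(Window.partitionBy("id").orderBy(col("ts_ms").desc()))
    )
    .filter("rn = 1")
    .drop("rn")
)

target = DeltaTable.forPath(spark, "Tables/orders")

(
    target.alias("t")
    .merge(latest.alias("c"), "t.id = c.id")
    .whenMatchedDelete(condition="c.op = 'd'")          # delete events remove the row
    .whenMatchedUpdateAll(condition="c.op != 'd'")      # inserts/updates overwrite it
    .whenNotMatchedInsertAll(condition="c.op != 'd'")   # new keys are inserted
    .execute()
)
```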

CDC Operations and Recovery

Monitoring:

- Replication lag: the time between a change in the source and its availability in the lakehouse. Alert if lag exceeds the SLA (a lag-check sketch follows below).
- Event throughput: events per second. A sudden drop indicates connector failure.
- Error rate: failed events, usually caused by a schema mismatch.
- Source log retention: if CDC falls behind and the log is truncated, the connector can't catch up and a full snapshot is required.

Recovery:

- Connector restart: resumes from the last committed offset, with no data loss.
- Full re-snapshot: if the log is lost. Resource-intensive, but it recovers from any state.
- Schema change handling: when the source schema changes mid-stream, the connector captures the change and downstream consumers must handle the new schema.
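A simple sketch of the replication-lag check, assuming staged events carry a Debezium-style ts_ms timestamp; the path, SLA threshold, and alert hook are placeholders.

```python
# Replication-lag check sketch: compare the newest captured event timestamp with the
# current time and alert when the gap exceeds the SLA.
import time
from pyspark.sql import SparkSession
from pyspark.sql.functions import max as spark_max

spark = SparkSession.builder.appName("cdc-lag-check").getOrCreate()

LAG_SLA_SECONDS = 15 * 60  # alert if the lakehouse is more than 15 minutes behind

newest_event_ms = (
    spark.read.format("delta")
    .load("Tables/staging_orders_changes")
    .agg(spark_max("ts_ms").alias("newest"))
    .first()["newest"]
)

lag_seconds = time.time() - (newest_event_ms / 1000.0)
if lag_seconds > LAG_SLA_SECONDS:
    # Replace with your alerting integration (webhook, pager, etc.).
    print(f"ALERT: CDC replication lag is {lag_seconds:.0f}s, SLA is {LAG_SLA_SECONDS}s")
```

Note that a source with no recent changes also shows up as lag in this naive check; production monitoring usually combines it with connector status and heartbeat events.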

Selection Framework

| Scenario | Recommended Tool | Why |
| --- | --- | --- |
| Azure SQL → Fabric lakehouse | Fabric Mirroring | Zero-ops, native integration |
| PostgreSQL → Databricks | Debezium + Kafka | Log-based, PostgreSQL support, multi-consumer |
| SQL Server → multiple targets | Debezium + Kafka | Kafka enables multiple consumers from one CDC stream |
| Oracle → any target | Debezium or GoldenGate | Debezium for Kafka integration; GoldenGate for Oracle-native |
| Simple, few tables | Query-based (ADF CDC) | No log access needed, simple |

Advanced CDC Patterns

- Multi-database CDC consolidation: multiple source databases (ERP, CRM, HRIS) each run a Debezium connector, all publishing to Kafka, and a single Spark job merges changes from all sources into the unified lakehouse. The Kafka topic naming convention (database.schema.table) ensures each source's changes are identifiable and routable.
- CDC for data warehousing: CDC events feed the data warehouse incrementally instead of a nightly full refresh. The SCD (Slowly Changing Dimension) Type 2 pattern applies: CDC update events create new rows in dimension tables with effective dates, preserving historical state while keeping the current state available.
- CDC with data quality: quality checks are applied to CDC events before the MERGE. Invalid events (null required fields, referential integrity violations, business rule failures) are routed to a dead letter queue instead of corrupting the target table (see the sketch below). This prevents the common problem of CDC propagating source data quality issues into the lakehouse at real-time speed.
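A sketch of that quality gate, assuming staged change rows have been flattened to the table's columns; the rules, column names, and paths are illustrative.

```python
# Quality-gate sketch: route invalid CDC events to a dead letter table before the merge.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("cdc-quality-gate").getOrCreate()

changes = spark.read.format("delta").load("Tables/staging_orders_changes")

# Example rules: required fields present and amounts non-negative.
is_valid = col("id").isNotNull() & col("customer_id").isNotNull() & (col("amount") >= 0)

valid = changes.filter(is_valid)
invalid = changes.filter(~is_valid)

# Invalid events land in a dead letter table for inspection and replay.
invalid.write.mode("append").format("delta").save("Tables/dlq_orders_changes")

# Only valid events proceed to the Delta Lake MERGE step shown earlier.
valid.write.mode("append").format("delta").save("Tables/validated_orders_changes")
```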

CDC Performance Tuning

CDC performance optimization:

- Source database: ensure the transaction log retains enough history for the CDC connector to recover from a restart (24-72 hours minimum). For SQL Server, ensure the CDC capture job runs frequently enough to prevent log growth issues (every 5-15 seconds for high-volume tables).
- Debezium connector: use snapshot mode "initial" for the first-time sync and "schema_only" after the initial sync is complete, which prevents unnecessary full snapshots on connector restart. Adjust max.batch.size to balance throughput against latency: larger batches give higher throughput at higher latency (an illustrative config follows below).
- Kafka: topic partitions should match the source table's key distribution; a single partition for a high-volume table becomes a bottleneck. A replication factor of 3 for production ensures durability without excessive overhead. Enable LZ4 compression on topics for roughly 50% storage reduction with minimal CPU overhead.
- Consumer (lakehouse merge): merge every 5-15 minutes for most use cases; more frequent merges increase compute cost without a proportional freshness improvement. A checkpoint interval of 30-60 seconds for streaming merges balances recovery time against overhead.

These tuning parameters are specific to your data volume and latency requirements — start with defaults, measure, and optimize based on observed performance.
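For illustration, the sketch below registers a connector through the Kafka Connect REST API and shows where the snapshot mode and max.batch.size settings live. Host names, credentials, and the table list are placeholders, required properties such as the database name and schema history topic are omitted for brevity, and allowed values vary by Debezium version, so check the connector documentation for your release.

```python
# Illustrative Debezium SQL Server connector registration via the Kafka Connect REST API.
import json
import requests

connector = {
    "name": "erp-orders-cdc",
    "config": {
        "connector.class": "io.debezium.connector.sqlserver.SqlServerConnector",
        "database.hostname": "erp-sql.internal",
        "database.port": "1433",
        "database.user": "cdc_reader",
        "database.password": "***",
        "table.include.list": "dbo.orders,dbo.order_lines",
        "snapshot.mode": "initial",   # switch to a schema-only mode after the first sync
        "max.batch.size": "4096",     # larger batches: more throughput, more latency
        "topic.prefix": "erp",        # topics become erp.<schema>.<table>
        # Database name and schema history topic settings omitted for brevity.
    },
}

resp = requests.post(
    "http://connect:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()
```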

CDC for Multi-Database Environments

Enterprise environments have 10-50+ databases that need CDC:

- Centralized Kafka backbone: one Kafka cluster serves as the CDC event backbone for all databases. Each database has its own Debezium connector publishing to database-specific topics, and consumers subscribe to the topics they need (a subscription sketch follows below).
- Standardized event format: regardless of the source database type (PostgreSQL, MySQL, SQL Server, Oracle), the CDC events are normalized to a common format in the consumer layer. This means the lakehouse merge logic is the same regardless of the source database type.
- Connector management: 10+ Debezium connectors require centralized monitoring (a Kafka Connect dashboard showing connector status, lag per connector, and error rate), automated alerting (connector failure raises an alert within 5 minutes), and automated restart (self-healing connectors that restart after transient failures without manual intervention).
- Schema registry organization: namespace schemas by database and table ("erp.sales.orders"), preventing naming conflicts and enabling consumers to discover available CDC streams by browsing the registry.
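A small sketch of that backbone from the consumer side: one streaming job subscribes to every topic for the erp source by pattern and keeps the topic name for routing. The pattern and addresses are assumptions.

```python
# Sketch: one consumer subscribes to all CDC topics for the "erp" database by pattern,
# then routes events by topic (database.schema.table) to per-table merge logic.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("erp-cdc-consumer").getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribePattern", "erp\\..*")   # all erp.<schema>.<table> topics
    .load()
    .select(col("topic"), col("value").cast("string").alias("payload"))
)

# Downstream: split by the topic column and apply the same normalize-and-merge logic
# per table, e.g. events.filter(col("topic") == "erp.sales.orders").
```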

CDC Implementation Checklist

Before deploying CDC to production, verify:

- Source readiness: transaction log enabled and sized for 24-72 hours of retention. CDC-specific permissions granted to the connector user. Schema documentation up to date (the connector needs to know which tables and columns to capture).
- Kafka/Event Hubs readiness: topics created with appropriate partition counts, matching the source table's key cardinality. Schema registry configured. Dead letter topic created. Monitoring configured for consumer lag and error rate.
- Target readiness: Delta Lake tables created with the correct schema. MERGE logic tested with inserts, updates, deletes, and schema evolution scenarios. Scheduled or streaming job configured and tested.
- Operational readiness: alerting configured for connector failure, replication lag exceeding the SLA, and dead letter queue growth. Runbook documented for connector restart, full re-snapshot, and schema change handling. On-call rotation includes CDC monitoring.
- Testing: functional (insert, update, delete captured correctly), volume (sustained throughput at expected peak), recovery (connector restart with no data loss), and schema evolution (a column added to the source table is handled gracefully downstream).

This checklist prevents the most common CDC deployment failures — each item has caused production incidents when skipped.

The Xylity Approach

We implement CDC with the right-tool methodology — Fabric Mirroring for Azure-native zero-ops CDC, Debezium for multi-source multi-consumer architectures, and query-based CDC when log access isn't available. Our streaming and data engineering teams deploy CDC pipelines that replace nightly batch with near-real-time change capture — 99.9% less data transferred with sub-15-minute freshness.

Continue building your understanding with these related resources from our consulting practice.

Near-Real-Time Data Without Nightly Batch

Debezium, Fabric Mirroring, Delta Lake merge. CDC that captures only changes — 99.9% less data, sub-15-minute freshness.

Start Your CDC Implementation →