In This Article
Why CDC Replaces Nightly Batch
| Dimension | Nightly Batch ETL | CDC |
|---|---|---|
| Data freshness | 24 hours (stale by 6 AM) | Near-real-time (minutes to seconds) |
| Source impact | Heavy (full table scan during batch window) | Minimal (reads transaction log, not tables) |
| Data volume transferred | Full table every night (50M rows) | Only changes (50K rows) |
| Processing time | 2-6 hours for full extraction | Continuous (events processed in seconds) |
| Complexity | Low (well-understood pattern) | Medium (requires log access, schema evolution) |
| Hard deletes | Detectable via comparison | Captured from transaction log |
CDC is the right choice when: data freshness matters (reports need same-day or near-real-time data), source system performance is critical (the batch window conflicts with production hours), or data volume makes full extraction expensive (tables with 100M+ rows where only 0.1% changes daily). Batch remains the right choice when: daily freshness is sufficient, source systems have a clear batch window, and the team lacks CDC expertise.
3 Types of CDC
- Log-based CDC: reads the database transaction log (WAL in PostgreSQL, binlog in MySQL, the transaction log in SQL Server). Captures every change without querying the database, with minimal source impact. The gold standard. Tools: Debezium, AWS DMS, Azure Data Factory CDC connector.
- Query-based CDC: periodically queries the source table using a timestamp or version column ("SELECT * WHERE updated_at > last_run"). Simple to implement, but it requires a reliable timestamp column, misses hard deletes, and adds query load. Use when log-based CDC isn't available.
- Trigger-based CDC: database triggers fire on INSERT/UPDATE/DELETE, writing change records to a shadow table. Adds overhead to every write operation. Use when log-based CDC isn't available and the source has low write volume.
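Query-based CDC's polling loop can be sketched in plain Python. This is a minimal illustration, not a production implementation: the in-memory table, column names, and timestamps are stand-ins for a real source table.

```python
from datetime import datetime

# Illustrative stand-in for a source table with an updated_at column.
rows = [
    {"id": 1, "name": "alpha", "updated_at": datetime(2024, 1, 1, 8, 0)},
    {"id": 2, "name": "beta",  "updated_at": datetime(2024, 1, 1, 9, 30)},
    {"id": 3, "name": "gamma", "updated_at": datetime(2024, 1, 1, 11, 0)},
]

def poll_changes(table, last_run):
    """Query-based CDC: fetch only rows modified since the last run.

    Equivalent to: SELECT * FROM table WHERE updated_at > :last_run.
    Note: hard deletes never appear here -- a deleted row is simply gone.
    """
    changes = [r for r in table if r["updated_at"] > last_run]
    # Advance the watermark to the newest change seen this run.
    new_watermark = max((r["updated_at"] for r in changes), default=last_run)
    return changes, new_watermark

changes, watermark = poll_changes(rows, datetime(2024, 1, 1, 9, 0))
print([r["id"] for r in changes])  # [2, 3]
```

The watermark persistence (storing `last_run` between polls) is the part that most often breaks in practice: it must be committed atomically with the load of the extracted rows.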
Debezium: Open-Source Log-Based CDC
Debezium is the most widely used open-source CDC platform: how it works (Debezium connects to the database's transaction log → reads change events → publishes them to Kafka topics. Each table gets its own topic. Each event contains: before state, after state, operation type (insert/update/delete), source metadata (database, table, timestamp), and the transaction context), supported databases (PostgreSQL, MySQL/MariaDB, SQL Server, Oracle, MongoDB, Cassandra, Db2 — covering 95%+ of enterprise databases), deployment (runs as a Kafka Connect connector — deployed on the Kafka Connect cluster. No changes to the source database beyond enabling log access), and schema evolution (Debezium publishes schema changes to a schema history topic. When a column is added or renamed in the source, the change event includes the updated schema — downstream consumers can handle the evolution using Avro schema registry compatibility rules).
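A consumer's handling of the change-event envelope can be sketched as follows. The event below is a trimmed, illustrative payload following the before/after/op/source shape described above; real Debezium events carry additional source metadata.

```python
# Trimmed Debezium-style change event (illustrative field contents).
event = {
    "before": {"id": 42, "status": "pending"},
    "after":  {"id": 42, "status": "shipped"},
    "op": "u",  # c = create, u = update, d = delete, r = snapshot read
    "source": {"db": "erp", "table": "orders", "ts_ms": 1700000000000},
}

def interpret(evt):
    """Map a Debezium envelope to an (operation, row_state) pair."""
    ops = {"c": "insert", "u": "update", "d": "delete", "r": "snapshot"}
    op = ops[evt["op"]]
    # Deletes carry only the before-image; everything else has an after-image.
    state = evt["before"] if evt["op"] == "d" else evt["after"]
    return op, state

op, state = interpret(event)
print(op, state["status"])  # update shipped
```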
Debezium architecture: Source database (transaction log enabled) → Debezium connector (reads log, converts to events) → Kafka (stores events durably) → Stream processor or Spark (transforms events) → Fabric lakehouse or Databricks (stores final state in Delta Lake). For SQL Server: enable CDC at database and table level. For PostgreSQL: configure logical replication. For MySQL: enable binary logging. Each database requires specific configuration for log access — Debezium documentation covers each database type.
Fabric Mirroring: Managed CDC
Microsoft Fabric Mirroring provides managed CDC without Kafka or Debezium: how it works (Fabric connects to the source database → reads the transaction log → replicates changes to the Fabric lakehouse in near-real-time. The mirrored data appears as Delta tables — queryable via Spark, SQL, or Power BI. No Kafka, no Debezium, no connector management), supported sources (Azure SQL Database, SQL Server, Cosmos DB, Snowflake — with more added regularly), zero infrastructure (no Kafka cluster, no connector deployment, no schema registry — Fabric manages the entire CDC pipeline), and automatic schema sync (schema changes in the source automatically reflected in the mirrored Delta tables). When to use Fabric Mirroring vs Debezium: Fabric Mirroring when: the target is Fabric lakehouse, the source is supported, and you want zero operational overhead. Debezium when: you need Kafka as the event backbone (multiple consumers), the source isn't supported by Mirroring, or you're multi-cloud.
CDC to Lakehouse: Delta Lake Merge
CDC events arrive as: insert, update, or delete operations. The Delta Lake MERGE operation applies these changes: MERGE INTO target USING changes ON target.id = changes.id — WHEN MATCHED AND delete THEN DELETE, WHEN MATCHED AND update THEN UPDATE, WHEN NOT MATCHED AND insert THEN INSERT. Processing patterns: micro-batch merge (process CDC events every 5-15 minutes — accumulate changes in staging, then MERGE. Simple, reliable, 5-15 minute latency), streaming merge (Spark Structured Streaming reads from Kafka/Event Hubs and applies MERGE continuously — sub-minute latency but higher complexity), and append-only with snapshot (append all CDC events raw. Periodically create a snapshot using the latest event per key — simpler than MERGE but requires snapshot computation). The micro-batch merge is the recommended starting pattern — it balances freshness (5-15 minutes) with simplicity.
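The merge semantics can be illustrated with a small in-memory sketch: a dict keyed by id stands in for the Delta table. In production this logic runs as a Spark MERGE statement, not Python, but the per-key outcome is the same.

```python
def apply_cdc(target, events):
    """Apply insert/update/delete CDC events to a keyed target,
    mirroring the MERGE semantics described above."""
    for e in events:
        if e["op"] == "delete":
            target.pop(e["id"], None)   # WHEN MATCHED AND delete THEN DELETE
        else:
            target[e["id"]] = e["row"]  # upsert covers both insert and update
    return target

# Illustrative starting state and one micro-batch of accumulated changes.
target = {1: {"name": "alpha"}, 2: {"name": "beta"}}
events = [
    {"op": "update", "id": 1, "row": {"name": "alpha-v2"}},
    {"op": "delete", "id": 2},
    {"op": "insert", "id": 3, "row": {"name": "gamma"}},
]
print(apply_cdc(target, events))  # {1: {'name': 'alpha-v2'}, 3: {'name': 'gamma'}}
```

Note one subtlety the sketch glosses over: within a micro-batch, events for the same key must be applied in commit order (or deduplicated to the latest event per key) before the MERGE, or a stale update can overwrite a newer one.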
CDC Operations and Recovery
Monitoring: replication lag (time between a change in the source and its availability in the lakehouse; alert if lag exceeds SLA), event throughput (events per second; a sudden drop indicates connector failure), error rate (failed events, usually schema mismatch), and source log retention (if CDC falls behind and the log is truncated, the connector can't catch up and a full re-snapshot is required). Recovery: connector restart (resumes from the last committed offset with no data loss), full re-snapshot (if the log is lost; resource-intensive but recovers from any state), and schema change handling (on a mid-stream source schema change, the connector captures it and downstream consumers must handle the new schema).
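A minimal lag check can be sketched as below, assuming event timestamps in epoch milliseconds. The function names and the 15-minute SLA are illustrative, not part of any tool's API.

```python
def replication_lag_seconds(last_applied_source_commit_ms, now_ms):
    """Lag = wall-clock time since the newest source change that has
    been applied to the lakehouse. Timestamps in epoch milliseconds."""
    return max(0.0, (now_ms - last_applied_source_commit_ms) / 1000)

def should_alert(lag_seconds, sla_seconds):
    """Fire an alert when replication lag exceeds the SLA."""
    return lag_seconds > sla_seconds

# Illustrative: newest applied change committed 930 seconds ago, 15-minute SLA.
lag = replication_lag_seconds(1_700_000_000_000, 1_700_000_930_000)
print(lag, should_alert(lag, sla_seconds=900))  # 930.0 True
```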
Selection Framework
| Scenario | Recommended Tool | Why |
|---|---|---|
| Azure SQL → Fabric lakehouse | Fabric Mirroring | Zero-ops, native integration |
| PostgreSQL → Databricks | Debezium + Kafka | Log-based, PostgreSQL support, multi-consumer |
| SQL Server → multiple targets | Debezium + Kafka | Kafka enables multiple consumers from one CDC stream |
| Oracle → any target | Debezium or GoldenGate | Debezium for Kafka integration; GoldenGate for Oracle-native |
| Simple, few tables | Query-based (ADF CDC) | No log access needed, simple |
Advanced CDC Patterns
Multi-database CDC consolidation: Multiple source databases (ERP, CRM, HRIS) each with Debezium connectors → all publishing to Kafka → a single Spark job that MERGES changes from all sources into the unified lakehouse. The Kafka topic naming convention (database.schema.table) ensures each source's changes are identifiable and routable. CDC for data warehousing: CDC events feed the data warehouse incrementally instead of nightly full refresh. The SCD (Slowly Changing Dimension) Type 2 pattern: CDC update events create new rows in dimension tables with effective dates — preserving historical state while keeping the current state available. CDC with data quality: Quality checks applied to CDC events before MERGE — invalid events (null required fields, referential integrity violations, business rule failures) routed to dead letter queue instead of corrupting the target table. This prevents the common problem of CDC propagating source data quality issues into the lakehouse at real-time speed.
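The SCD Type 2 handling of a CDC update event can be sketched as follows. The column names (valid_from, valid_to, is_current) and the high-date sentinel are common conventions, not fixed requirements.

```python
from datetime import date

def scd2_apply_update(dim_rows, key, new_attrs, effective,
                      high_date=date(9999, 12, 31)):
    """SCD Type 2: close the current row for `key` and append a new
    version, preserving history while keeping current state queryable."""
    for row in dim_rows:
        if row["key"] == key and row["is_current"]:
            row["valid_to"] = effective   # close out the old version
            row["is_current"] = False
    dim_rows.append({"key": key, **new_attrs,
                     "valid_from": effective, "valid_to": high_date,
                     "is_current": True})
    return dim_rows

# Illustrative: a CDC update moves customer C-1 from retail to enterprise.
dim = [{"key": "C-1", "segment": "retail",
        "valid_from": date(2023, 1, 1), "valid_to": date(9999, 12, 31),
        "is_current": True}]
scd2_apply_update(dim, "C-1", {"segment": "enterprise"}, date(2024, 6, 1))
print(len(dim), dim[1]["segment"])  # 2 enterprise
```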
CDC Performance Tuning
CDC performance optimization spans four layers:

- Source database: ensure transaction-log retention covers connector recovery after a restart (24-72 hours minimum). For SQL Server, run the CDC capture job frequently enough to prevent log growth issues, every 5-15 seconds for high-volume tables.
- Debezium connector: use snapshot mode "initial" for the first-time sync and "schema_only" after the initial sync completes, preventing unnecessary full snapshots on connector restart. Adjust max.batch.size to balance throughput against latency: larger batches mean higher throughput but higher latency.
- Kafka: match topic partition counts to the source table's key distribution (a single partition for a high-volume table becomes a bottleneck), use a replication factor of 3 in production for durability without excessive overhead, and enable LZ4 compression on topics (roughly 50% storage reduction with minimal CPU overhead).
- Consumer (lakehouse merge): merge every 5-15 minutes for most use cases; more frequent merges increase compute cost without proportional freshness improvement. For streaming merges, checkpoint every 30-60 seconds to balance recovery time against overhead.

These tuning parameters are specific to your data volume and latency requirements: start with defaults, measure, and optimize based on observed performance.
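As a concrete reference point, a Kafka Connect registration payload for a Debezium SQL Server connector might look like the fragment below. Hostnames, user names, topic names, and table lists are illustrative; consult the Debezium documentation for the full property list and version-specific property names.

```json
{
  "name": "erp-orders-connector",
  "config": {
    "connector.class": "io.debezium.connector.sqlserver.SqlServerConnector",
    "database.hostname": "sqlserver.internal",
    "database.port": "1433",
    "database.user": "cdc_user",
    "database.password": "<secret>",
    "database.names": "erp",
    "topic.prefix": "erp",
    "table.include.list": "sales.orders",
    "snapshot.mode": "initial",
    "max.batch.size": "2048",
    "schema.history.internal.kafka.topic": "schema-history.erp",
    "schema.history.internal.kafka.bootstrap.servers": "kafka:9092"
  }
}
```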
CDC for Multi-Database Environments
Enterprise environments have 10-50+ databases that need CDC: centralized Kafka backbone (one Kafka cluster serves as the CDC event backbone for all databases — each database has its own Debezium connector publishing to database-specific topics. Consumers subscribe to the topics they need), standardized event format (regardless of source database type — PostgreSQL, MySQL, SQL Server, Oracle — the CDC events are normalized to a common format in the consumer layer. This means: the lakehouse merge logic is the same regardless of source database type), connector management (10+ Debezium connectors require: centralized monitoring (Kafka Connect dashboard showing: connector status, lag per connector, error rate), automated alerting (connector failure → alert within 5 minutes), and automated restart (self-healing connectors that restart after transient failures without manual intervention)), and schema registry organization (namespace schemas by database and table: "erp.sales.orders" — preventing naming conflicts and enabling consumers to discover available CDC streams by browsing the registry).
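Routing by the database.schema.table topic convention is a one-line split in the consumer layer. This sketch assumes topic names follow that convention exactly; a real consumer would also validate the name before routing.

```python
def route(topic):
    """Split a Debezium-style topic name into (database, schema, table)
    so one consumer can route events from many source databases."""
    database, schema, table = topic.split(".", 2)
    return database, schema, table

print(route("erp.sales.orders"))  # ('erp', 'sales', 'orders')
```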
CDC Implementation Checklist
Before deploying CDC to production, verify:

- Source readiness: transaction log enabled and sized for 24-72 hours of retention; CDC-specific permissions granted to the connector user; schema documentation up to date (the connector needs to know which tables and columns to capture).
- Kafka/Event Hubs readiness: topics created with partition counts matching the source table's key cardinality; schema registry configured; dead letter topic created; monitoring in place for consumer lag and error rate.
- Target readiness: Delta Lake tables created with the correct schema; MERGE logic tested against inserts, updates, deletes, and schema evolution scenarios; scheduled or streaming job configured and tested.
- Operational readiness: alerting configured for connector failure, replication lag exceeding SLA, and dead letter queue growth; runbooks documented for connector restart, full re-snapshot, and schema change handling; on-call rotation includes CDC monitoring.
- Testing: functional (insert, update, delete captured correctly), volume (sustained throughput at expected peak), recovery (connector restart with no data loss), and schema evolution (a column added to the source table is handled gracefully downstream).

This checklist prevents the most common CDC deployment failures; each item has caused production incidents when skipped.
The Xylity Approach
We implement CDC with the right-tool methodology: Fabric Mirroring for Azure-native zero-ops CDC, Debezium for multi-source multi-consumer architectures, and query-based CDC when log access isn't available. Our data engineers deploy CDC pipelines that replace nightly batch with near-real-time change capture: 99.9% less data transferred, with sub-15-minute freshness.
Go Deeper
Continue building your understanding with these related resources from our consulting practice.
Near-Real-Time Data Without Nightly Batch
Debezium, Fabric Mirroring, Delta Lake merge. CDC that captures only changes — 99.9% less data, sub-15-minute freshness.
Start Your CDC Implementation →