When Real-Time Matters
Real-time adds complexity and cost. It's justified for:
- Time-sensitive decisions: fraud detection, where a 5-second delay means the fraudulent transaction completes and the money is gone.
- Operational monitoring: IoT sensor data, where the temperature spike that signals equipment failure needs sub-minute detection, not next-day batch analysis.
- Customer experience: real-time recommendations during a browsing session, dynamic pricing based on current demand, personalized content based on current behavior.
- Regulatory requirements: real-time trade reporting in financial services, real-time transaction monitoring for AML compliance.

Real-time is NOT justified when the business decision doesn't change within the batch window: for daily sales reports, monthly financial close, and quarterly business reviews, batch is simpler, cheaper, and equally effective. The first question before building real-time: does the business decision change if the data is 5 minutes old instead of 24 hours old? If both produce the same decision, batch is 3-5x cheaper.
Event Streaming Platforms: Kafka vs Event Hubs
| Dimension | Apache Kafka | Azure Event Hubs |
|---|---|---|
| Deployment | Self-managed or Confluent Cloud | Fully managed Azure service |
| Throughput | Millions of events/second (configurable) | Millions of events/second (tier-based) |
| Retention | Configurable (hours to infinite with tiered storage) | 1-90 days depending on tier (7-day default), or Capture to long-term storage |
| Ecosystem | Kafka Connect, Schema Registry, KSQL, huge community | Fabric integration, Schema Registry, Azure-native |
| Operations | Complex (ZooKeeper/KRaft, partition management) | Zero operational overhead (managed) |
| Cost | Compute + storage (variable, can optimize at scale) | Throughput units (predictable, can be higher at scale) |
| Best for | Multi-cloud, complex topologies, Kafka expertise | Azure-centric, rapid deployment, no Kafka ops team |
Selection: Azure-centric organizations → Event Hubs (zero ops overhead, native Fabric integration). Multi-cloud or Kafka-experienced teams → Confluent Cloud or self-managed Kafka. Event Hubs also exposes a Kafka-compatible endpoint, so Kafka client libraries can connect to it directly, enabling migration between the two platforms with configuration changes only, not application code changes.
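A minimal sketch of that migration path, assuming the confluent-kafka Python client and a hypothetical namespace `myns`; only the connection configuration changes, while the produce calls stay the same.

```python
# Pointing an existing Kafka producer at Event Hubs via its Kafka-compatible
# endpoint. Namespace, topic, and payload are illustrative.
from confluent_kafka import Producer

producer = Producer({
    # Event Hubs serves the Kafka protocol on port 9093 of the namespace host.
    "bootstrap.servers": "myns.servicebus.windows.net:9093",
    "security.protocol": "SASL_SSL",
    "sasl.mechanism": "PLAIN",
    # Event Hubs accepts the literal username "$ConnectionString" with the
    # namespace connection string as the password.
    "sasl.username": "$ConnectionString",
    "sasl.password": "Endpoint=sb://myns.servicebus.windows.net/;SharedAccessKeyName=...;SharedAccessKey=...",
})

# Application code is unchanged: same produce/flush calls as against Kafka.
producer.produce("orders", key="customer-42", value=b'{"total": 99.5}')
producer.flush()
```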
Stream Processing Engines
| Engine | Programming Model | Best For |
|---|---|---|
| Apache Flink | Java/Scala/Python, stateful processing | Complex event processing, windowed aggregations, exactly-once |
| Spark Structured Streaming | Spark DataFrame API | Teams already using Spark, micro-batch acceptable (100ms+) |
| Kafka Streams | Java library (no separate cluster) | Simple transformations, Kafka-native, no infrastructure |
| Fabric Real-Time Intelligence | KQL, low-code | Azure-native, dashboards, alerting, no Spark needed |
| Azure Stream Analytics | SQL-like (SAQL) | Simple windowed queries, IoT patterns, no coding |
Selection principle: For complex stateful processing (fraud detection, session analysis) → Flink. For teams already using Databricks/Spark → Structured Streaming. For simple enrichment and routing → Kafka Streams. For Azure-native with low-code → Fabric Real-Time Intelligence or Stream Analytics. The processing engine should match the team's skills and the complexity of the processing logic — not the most powerful engine available.
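To make the micro-batch model concrete, here is a minimal PySpark Structured Streaming sketch: events-per-minute counts from a hypothetical Kafka topic named `events`. The broker address, topic, and schema are illustrative, and the Kafka source requires the spark-sql-kafka connector package.

```python
# Windowed aggregation in Spark Structured Streaming (micro-batch).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, window
from pyspark.sql.types import StringType, StructType, TimestampType

spark = SparkSession.builder.appName("events-per-minute").getOrCreate()

# Assumed JSON payload: {"event_type": "...", "event_time": "..."}
schema = (StructType()
          .add("event_type", StringType())
          .add("event_time", TimestampType()))

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# Tumbling one-minute window; the watermark bounds state for late events.
per_minute = (events
              .withWatermark("event_time", "5 minutes")
              .groupBy(window(col("event_time"), "1 minute"), col("event_type"))
              .count())

(per_minute.writeStream
 .outputMode("update")
 .format("console")
 .trigger(processingTime="10 seconds")  # micro-batch cadence
 .start()
 .awaitTermination())
```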
4 Real-Time Architecture Patterns
- Pattern 1 — Event Streaming: Events published to Kafka/Event Hubs → consumed by multiple subscribers. Each subscriber processes the event independently. Use for: event-driven microservices, audit logging, real-time notifications.
- Pattern 2 — Stream Processing: Events → stream processor (windowed aggregation, enrichment, filtering) → output topic or database. Use for: real-time dashboards, alerting, derived metrics (events per minute, moving average).
- Pattern 3 — Event Sourcing: All state changes stored as immutable events → current state derived by replaying events (see the sketch after this list). Use for: audit trails, temporal queries ("what was the state at time T?"), regulatory compliance.
- Pattern 4 — CQRS (Command Query Responsibility Segregation): Write model (events) separated from read model (materialized views). Commands produce events → events materialized into read-optimized stores → queries served from materialized views. Use for: high-read/low-write workloads, complex domain models, event-driven systems.
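A minimal sketch of the Pattern 3 replay, with illustrative account events; it shows both current-state derivation and a temporal query.

```python
# Event sourcing: state is never stored directly; it is rebuilt by folding
# over the immutable event log. Event shapes are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    sequence: int      # total order of state changes
    event_type: str    # "deposited" or "withdrawn"
    amount: float

def replay(events, as_of=None):
    """Derive the balance from the log; as_of answers 'what was the
    state at time T?' by stopping the replay early."""
    balance = 0.0
    for e in sorted(events, key=lambda e: e.sequence):
        if as_of is not None and e.sequence > as_of:
            break
        balance += e.amount if e.event_type == "deposited" else -e.amount
    return balance

log = [Event(1, "deposited", 100.0), Event(2, "withdrawn", 30.0)]
assert replay(log) == 70.0            # current state
assert replay(log, as_of=1) == 100.0  # temporal query
```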
Exactly-Once Processing
Message processing guarantees:
- At-most-once: messages may be lost, never duplicated. Acceptable for metrics that tolerate small gaps.
- At-least-once: messages never lost, may be duplicated. Acceptable for idempotent operations where processing the same message twice produces the same result.
- Exactly-once: messages never lost, never duplicated. Required for financial transactions, inventory updates, and any operation where duplication causes business harm.

Achieving exactly-once: Kafka supports it via idempotent producers + transactional consumers (available since Kafka 0.11). Flink supports it via checkpointing + two-phase commit. The practical approach: design for at-least-once processing with idempotent operations (unique event IDs plus check-before-process logic). This achieves effective exactly-once behavior with simpler infrastructure than true exactly-once semantics, and it works across all streaming platforms.
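A minimal sketch of that check-before-process idempotency, assuming JSON events carrying a producer-stamped event_id; the in-memory set stands in for a durable dedup store such as a database table with a unique constraint.

```python
import json

processed_ids: set[str] = set()  # stand-in for a durable dedup store

def apply_business_logic(event: dict) -> None:
    ...  # hypothetical side effect: write, notify, update inventory

def handle(raw: bytes) -> None:
    event = json.loads(raw)
    event_id = event["event_id"]     # unique ID stamped by the producer
    if event_id in processed_ids:    # check before process
        return                       # duplicate redelivery is a safe no-op
    apply_business_logic(event)
    # Record success only after processing, so a crash in between causes a
    # redelivery (at-least-once), never a silent loss.
    processed_ids.add(event_id)
```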
Operational Considerations
Partitioning strategy: Partitions determine parallelism (more partitions = more consumers processing in parallel). Partition by the key that determines processing order: customer ID for customer events, account ID for financial transactions. Over-partitioning wastes resources; under-partitioning limits throughput. Rule of thumb: start with partitions = 2× expected peak consumer count.

Schema evolution: Event schemas change over time as fields are added or types change. Use Avro or Protobuf with a schema registry (Confluent Schema Registry or Azure Schema Registry); the registry enforces compatibility rules (backward, forward, or full) and prevents breaking changes.

Dead letter queues: Events that fail processing (bad format, business rule violation, downstream service unavailable) are routed to a dead letter topic for investigation and reprocessing. Never drop failed events: a dropped event is silent data loss or an undetected error. A combined keying and dead-letter sketch follows below.

Monitoring: consumer lag (how far behind the consumer is; high lag means the consumer can't keep up with the event rate), throughput (events per second processed), error rate (% of events that fail processing), and end-to-end latency (time from event production to downstream action).
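A sketch of two of these practices together, keyed production and dead-letter routing, using the confluent-kafka client; topic names and the process() body are illustrative.

```python
import json
from confluent_kafka import Consumer, Producer

producer = Producer({"bootstrap.servers": "broker:9092"})

# Producer side: same key -> same partition -> per-customer ordering.
producer.produce("orders", key="customer-42", value=json.dumps({"total": 99.5}))

# Consumer side: failures are preserved on a DLQ topic, never dropped.
consumer = Consumer({
    "bootstrap.servers": "broker:9092",
    "group.id": "order-processor",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["orders"])

def process(order: dict) -> None:
    ...  # hypothetical business logic

for _ in range(1000):            # bounded loop to keep the sketch finite
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    try:
        process(json.loads(msg.value()))
    except Exception as exc:
        # Keep the original key and payload plus failure context for replay.
        producer.produce(
            "orders.dlq",
            key=msg.key(),
            value=msg.value(),
            headers={"error": str(exc), "source_topic": msg.topic()},
        )

producer.flush()
```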
Implementation Roadmap
Week 1-4: Foundation
Deploy Kafka or Event Hubs. Set up schema registry. Build first producer (one source system emitting events). Build first consumer (simple processing + write to lakehouse). Validate: events flow end-to-end with monitoring.
Week 5-8: Processing
Deploy stream processing engine. Implement first business use case (e.g., real-time dashboard or alerting). Add dead letter queue for failed events. Implement schema evolution strategy.
Week 9-12: Production
Add remaining producers (3-5 source systems). Implement monitoring and alerting for consumer lag and error rate. Performance testing at expected peak volume. Documentation and runbook for on-call team.
Real-Time Cost Optimization
Streaming infrastructure runs 24/7, so cost optimization is critical:
- Right-size partitions: more partitions means more brokers/TUs; only add partitions when throughput demands it.
- Tiered storage: Kafka tiered storage or Event Hubs Capture moves old events to cheap storage instead of keeping them on expensive broker disks.
- Auto-scaling consumers: scale consumers based on lag; spin down during low-volume periods, scale up during peaks.
- Compression: enable producer compression (Snappy or LZ4) to cut network and storage cost 50-70% with minimal CPU overhead (see the config sketch below).
- Data lifecycle: set retention based on business need (7 days for operational events, 30 days for analytics events, permanent for compliance events). Over-retention wastes storage.

Typical streaming cost: $2,000-10,000/month for moderate throughput (100K-1M events/day). High-throughput (10M+ events/day): $10,000-50,000/month depending on processing complexity and retention.
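Producer compression is a one-line config change. A sketch with the confluent-kafka client; the batching settings are illustrative tuning knobs that tend to improve compression ratios.

```python
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "broker:9092",
    "compression.type": "lz4",  # or "snappy"; whole batches are compressed
    "linger.ms": 20,            # brief wait to form larger batches
    "batch.size": 131072,       # larger batches compress more effectively
})
```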
Real-Time Data Use Cases by Industry
| Industry | Use Case | Latency Requirement | Pattern |
|---|---|---|---|
| Financial Services | Fraud detection, trade compliance | Sub-second | Event streaming + Flink |
| Manufacturing | Equipment monitoring, quality alerts | 1-5 seconds | IoT Hub + Stream Analytics |
| Retail | Inventory updates, dynamic pricing | 1-15 seconds | Kafka + micro-batch |
| Healthcare | Patient monitoring, clinical alerts | Sub-second | Event Hubs + Fabric RT |
| Logistics | Fleet tracking, delivery ETAs | 5-30 seconds | Kafka + Spark Streaming |
| Telecom | Network monitoring, CDR processing | Sub-second | Kafka + Flink |
The latency requirement determines the architecture: sub-second → dedicated stream processing (Flink or Kafka Streams). 1-15 seconds → micro-batch (Spark Structured Streaming). 15-60 seconds → near-real-time batch (frequent scheduled jobs). Each time the latency requirement is relaxed, infrastructure cost and operational complexity drop. Don't build sub-second infrastructure for a 15-second use case.
Event Schema Design: Getting the Foundation Right
Event schemas are the contract between producers and consumers. Design principles:
- Include metadata: every event carries event_id (UUID), event_type, timestamp, source_system, and correlation_id. This metadata enables deduplication, ordering, tracing, and debugging.
- Use explicit versioning: a schema_version field in every event lets consumers handle different versions gracefully during evolution.
- Prefer flat over nested: deeply nested events are harder to process in stream processors. Flatten where possible; nest only when the nesting has semantic meaning.
- Include the full entity state: for CDC events, include both before and after states so consumers can determine what changed without maintaining their own state.
- Use Avro or Protobuf, not JSON: they provide schema enforcement, compact serialization (50-70% smaller than JSON), and schema evolution with compatibility checks.

Schema registry: every schema registered and versioned. Compatibility mode: backward (new consumers can read old events) is the safest default for production systems. A schema sketch follows below.
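A sketch of this envelope as an Avro record serialized with fastavro; the metadata fields follow the principles above, while the payload fields and values are illustrative. In production the schema would live in a schema registry rather than inline.

```python
import io
import time
import uuid

from fastavro import parse_schema, schemaless_writer

ORDER_EVENT_SCHEMA = parse_schema({
    "type": "record",
    "name": "OrderEvent",
    "fields": [
        {"name": "event_id",       "type": "string"},  # UUID, enables deduplication
        {"name": "event_type",     "type": "string"},
        {"name": "schema_version", "type": "int"},     # explicit versioning
        {"name": "timestamp",      "type": "long"},    # epoch millis
        {"name": "source_system",  "type": "string"},
        {"name": "correlation_id", "type": "string"},  # ties related events for tracing
        # Flat payload fields rather than deep nesting:
        {"name": "order_id",       "type": "string"},
        {"name": "total",          "type": "double"},
    ],
})

event = {
    "event_id": str(uuid.uuid4()),
    "event_type": "order.created",
    "schema_version": 1,
    "timestamp": int(time.time() * 1000),
    "source_system": "web-checkout",
    "correlation_id": str(uuid.uuid4()),
    "order_id": "ORD-1001",
    "total": 99.50,
}

# Compact binary encoding: no field names on the wire, schema enforced.
buf = io.BytesIO()
schemaless_writer(buf, ORDER_EVENT_SCHEMA, event)
payload = buf.getvalue()
```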
Monitoring and Alerting for Streaming Systems
Streaming system monitoring requires different practices than batch monitoring:
- Consumer lag: the most critical metric; lag = events waiting to be processed. Alert thresholds: warning at 5 minutes of lag, critical at 15 minutes. Lag increasing steadily means consumer throughput is below producer throughput; scale consumers or optimize processing. A lag-computation sketch follows this list.
- End-to-end latency: measure time from event production to downstream action, not just consumer lag but total pipeline latency including Kafka transit, consumer processing, database write, and downstream notification. Track P50, P95, and P99 latencies.
- Error rate: alert at >0.1% of events failing processing, and investigate immediately: schema mismatch, downstream service failure, or data quality issue? A growing dead letter queue means silent data loss is in progress.
- Producer health: producer throughput dropping unexpectedly indicates a source system issue; producer errors indicate a Kafka connectivity or authentication problem.
- Infrastructure health: broker disk usage, partition balance, consumer group membership, ZooKeeper/KRaft health. These metrics predict failures before they impact data flow.

Dashboard: a Grafana dashboard with consumer lag per consumer group, end-to-end latency (P95), error rate, throughput (events/second), and dead letter queue depth, reviewed by the streaming operations team at the start of every shift.
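A sketch of the lag computation behind those alert thresholds, using the confluent-kafka client; the group, topic, and partition count are illustrative.

```python
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "broker:9092",
    "group.id": "order-processor",
    "enable.auto.commit": False,
})

partitions = [TopicPartition("orders", p) for p in range(6)]
for tp in consumer.committed(partitions, timeout=10):
    # Lag = newest offset on the broker minus the group's committed offset.
    _low, high = consumer.get_watermark_offsets(tp, timeout=10)
    committed = tp.offset if tp.offset >= 0 else 0  # no commit yet -> full lag
    print(f"partition {tp.partition}: lag={high - committed}")
```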
The Xylity Approach
We build real-time data architectures with a pattern-first methodology: selecting the right streaming platform (Kafka or Event Hubs), the right processing engine (Flink, Spark, or Fabric), and the right architecture pattern (streaming, processing, event sourcing, or CQRS) for your specific use case. Our streaming data engineers and data architects deploy real-time systems that process millions of events with sub-second latency and exactly-once guarantees.
Go Deeper
Continue building your understanding with these related resources from our consulting practice.
Real-Time Events to Real-Time Decisions
Kafka, Event Hubs, Flink, Spark Streaming. Real-time architecture for time-sensitive decisions.
Start Your Real-Time Architecture →