When Real-Time Matters
Real-time adds complexity and cost. It's justified for:
- Time-sensitive decisions: fraud detection, where a 5-second delay means the fraudulent transaction completes and the money is gone.
- Operational monitoring: IoT sensor data, where the temperature spike that signals equipment failure needs sub-minute detection, not next-day batch analysis.
- Customer experience: real-time recommendations during a browsing session, dynamic pricing based on current demand, personalized content based on current behavior.
- Regulatory requirements: real-time trade reporting in financial services, real-time transaction monitoring for AML compliance.

Real-time is NOT justified when the business decision doesn't change within the batch window: for daily sales reports, monthly financial close, and quarterly business reviews, batch is simpler, cheaper, and equally effective. The first question before building real-time: does the business decision change if the data is 5 minutes old instead of 24 hours old? If both produce the same decision, batch is 3-5x cheaper.
Event Streaming Platforms: Kafka vs Event Hubs
| Dimension | Apache Kafka | Azure Event Hubs |
|---|---|---|
| Deployment | Self-managed or Confluent Cloud | Fully managed Azure service |
| Throughput | Millions of events/second (configurable) | Millions of events/second (tier-based) |
| Retention | Configurable (hours to infinite with tiered storage) | 1-90 days depending on tier (7-day default), or Capture to long-term storage |
| Ecosystem | Kafka Connect, Schema Registry, KSQL, huge community | Fabric integration, Schema Registry, Azure-native |
| Operations | Complex (ZooKeeper/KRaft, partition management) | Zero operational overhead (managed) |
| Cost | Compute + storage (variable, can optimize at scale) | Throughput units (predictable, can be higher at scale) |
| Best for | Multi-cloud, complex topologies, Kafka expertise | Azure-centric, rapid deployment, no Kafka ops team |
Selection: Azure-centric organizations → Event Hubs (zero ops overhead, native Fabric integration). Multi-cloud or Kafka-experienced teams → Confluent Cloud or self-managed Kafka. Event Hubs also exposes a Kafka-compatible endpoint, so Kafka client libraries can connect to it directly, enabling migration between the two platforms with configuration changes only, not application code changes.
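A minimal sketch of that migration path, assuming the confluent-kafka Python client and a hypothetical namespace `myns`; only the connection configuration changes, while the produce calls stay the same.

```python
# Pointing an existing Kafka producer at Event Hubs via its Kafka-compatible
# endpoint. Namespace, topic, and payload are illustrative.
from confluent_kafka import Producer

producer = Producer({
    # Event Hubs serves the Kafka protocol on port 9093 of the namespace host.
    "bootstrap.servers": "myns.servicebus.windows.net:9093",
    "security.protocol": "SASL_SSL",
    "sasl.mechanism": "PLAIN",
    # Event Hubs accepts the literal username "$ConnectionString" with the
    # namespace connection string as the password.
    "sasl.username": "$ConnectionString",
    "sasl.password": "Endpoint=sb://myns.servicebus.windows.net/;SharedAccessKeyName=...;SharedAccessKey=...",
})

# Application code is unchanged: same produce/flush calls as against Kafka.
producer.produce("orders", key="customer-42", value=b'{"total": 99.5}')
producer.flush()
```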
Stream Processing Engines
| Engine | Programming Model | Best For |
|---|---|---|
| Apache Flink | Java/Scala/Python, stateful processing | Complex event processing, windowed aggregations, exactly-once |
| Spark Structured Streaming | Spark DataFrame API | Teams already using Spark, micro-batch acceptable (100ms+) |
| Kafka Streams | Java library (no separate cluster) | Simple transformations, Kafka-native, no infrastructure |
| Fabric Real-Time Intelligence | KQL, low-code | Azure-native, dashboards, alerting, no Spark needed |
| Azure Stream Analytics | SQL-like (SAQL) | Simple windowed queries, IoT patterns, no coding |
Selection principle: For complex stateful processing (fraud detection, session analysis) → Flink. For teams already using Databricks/Spark → Structured Streaming. For simple enrichment and routing → Kafka Streams. For Azure-native with low-code → Fabric Real-Time Intelligence or Stream Analytics. The processing engine should match the team's skills and the complexity of the processing logic — not the most powerful engine available.
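To make the micro-batch model concrete, here is a minimal PySpark Structured Streaming sketch: events-per-minute counts from a hypothetical Kafka topic named `events`. The broker address, topic, and schema are illustrative, and the Kafka source requires the spark-sql-kafka connector package.

```python
# Windowed aggregation in Spark Structured Streaming (micro-batch).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, window
from pyspark.sql.types import StringType, StructType, TimestampType

spark = SparkSession.builder.appName("events-per-minute").getOrCreate()

# Assumed JSON payload: {"event_type": "...", "event_time": "..."}
schema = (StructType()
          .add("event_type", StringType())
          .add("event_time", TimestampType()))

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# Tumbling one-minute window; the watermark bounds state for late events.
per_minute = (events
              .withWatermark("event_time", "5 minutes")
              .groupBy(window(col("event_time"), "1 minute"), col("event_type"))
              .count())

(per_minute.writeStream
 .outputMode("update")
 .format("console")
 .trigger(processingTime="10 seconds")  # micro-batch cadence
 .start()
 .awaitTermination())
```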
4 Real-Time Architecture Patterns
- Pattern 1 — Event Streaming: Events published to Kafka/Event Hubs → consumed by multiple subscribers. Each subscriber processes the event independently. Use for: event-driven microservices, audit logging, real-time notifications.
- Pattern 2 — Stream Processing: Events → stream processor (windowed aggregation, enrichment, filtering) → output topic or database. Use for: real-time dashboards, alerting, derived metrics (events per minute, moving average).
- Pattern 3 — Event Sourcing: All state changes stored as immutable events → current state derived by replaying events (see the sketch after this list). Use for: audit trails, temporal queries ("what was the state at time T?"), regulatory compliance.
- Pattern 4 — CQRS (Command Query Responsibility Segregation): Write model (events) separated from read model (materialized views). Commands produce events → events materialized into read-optimized stores → queries served from materialized views. Use for: high-read/low-write workloads, complex domain models, event-driven systems.
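A minimal sketch of the Pattern 3 replay, with illustrative account events; it shows both current-state derivation and a temporal query.

```python
# Event sourcing: state is never stored directly; it is rebuilt by folding
# over the immutable event log. Event shapes are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    sequence: int      # total order of state changes
    event_type: str    # "deposited" or "withdrawn"
    amount: float

def replay(events, as_of=None):
    """Derive the balance from the log; as_of answers 'what was the
    state at time T?' by stopping the replay early."""
    balance = 0.0
    for e in sorted(events, key=lambda e: e.sequence):
        if as_of is not None and e.sequence > as_of:
            break
        balance += e.amount if e.event_type == "deposited" else -e.amount
    return balance

log = [Event(1, "deposited", 100.0), Event(2, "withdrawn", 30.0)]
assert replay(log) == 70.0            # current state
assert replay(log, as_of=1) == 100.0  # temporal query
```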
Exactly-Once Processing
Message processing guarantees:
- At-most-once: messages may be lost, never duplicated. Acceptable for metrics that tolerate small gaps.
- At-least-once: messages never lost, may be duplicated. Acceptable for idempotent operations where processing the same message twice produces the same result.
- Exactly-once: messages never lost, never duplicated. Required for financial transactions, inventory updates, and any operation where duplication causes business harm.

Achieving exactly-once: Kafka supports it via idempotent producers + transactional consumers (available since Kafka 0.11). Flink supports it via checkpointing + two-phase commit. The practical approach: design for at-least-once processing with idempotent operations (unique event IDs plus check-before-process logic). This achieves effective exactly-once behavior with simpler infrastructure than true exactly-once semantics, and it works across all streaming platforms.
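A minimal sketch of that check-before-process idempotency, assuming JSON events carrying a producer-stamped event_id; the in-memory set stands in for a durable dedup store such as a database table with a unique constraint.

```python
import json

processed_ids: set[str] = set()  # stand-in for a durable dedup store

def apply_business_logic(event: dict) -> None:
    ...  # hypothetical side effect: write, notify, update inventory

def handle(raw: bytes) -> None:
    event = json.loads(raw)
    event_id = event["event_id"]     # unique ID stamped by the producer
    if event_id in processed_ids:    # check before process
        return                       # duplicate redelivery is a safe no-op
    apply_business_logic(event)
    # Record success only after processing, so a crash in between causes a
    # redelivery (at-least-once), never a silent loss.
    processed_ids.add(event_id)
```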
Operational Considerations
Partitioning strategy: Partitions determine parallelism (more partitions = more consumers processing in parallel). Partition by the key that determines processing order: customer ID for customer events, account ID for financial transactions. Over-partitioning wastes resources; under-partitioning limits throughput. Rule of thumb: start with partitions = 2× expected peak consumer count.

Schema evolution: Event schemas change over time as fields are added or types change. Use Avro or Protobuf with a schema registry (Confluent Schema Registry or Azure Schema Registry); the registry enforces compatibility rules (backward, forward, or full) and prevents breaking changes.

Dead letter queues: Events that fail processing (bad format, business rule violation, downstream service unavailable) are routed to a dead letter topic for investigation and reprocessing. Never drop failed events: a dropped event is silent data loss or an undetected error. A combined keying and dead-letter sketch follows below.

Monitoring: consumer lag (how far behind the consumer is; high lag means the consumer can't keep up with the event rate), throughput (events per second processed), error rate (% of events that fail processing), and end-to-end latency (time from event production to downstream action).
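A sketch of two of these practices together, keyed production and dead-letter routing, using the confluent-kafka client; topic names and the process() body are illustrative.

```python
import json
from confluent_kafka import Consumer, Producer

producer = Producer({"bootstrap.servers": "broker:9092"})

# Producer side: same key -> same partition -> per-customer ordering.
producer.produce("orders", key="customer-42", value=json.dumps({"total": 99.5}))

# Consumer side: failures are preserved on a DLQ topic, never dropped.
consumer = Consumer({
    "bootstrap.servers": "broker:9092",
    "group.id": "order-processor",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["orders"])

def process(order: dict) -> None:
    ...  # hypothetical business logic

for _ in range(1000):            # bounded loop to keep the sketch finite
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    try:
        process(json.loads(msg.value()))
    except Exception as exc:
        # Keep the original key and payload plus failure context for replay.
        producer.produce(
            "orders.dlq",
            key=msg.key(),
            value=msg.value(),
            headers={"error": str(exc), "source_topic": msg.topic()},
        )

producer.flush()
```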
Implementation Roadmap
Week 1-4: Foundation
Deploy Kafka or Event Hubs. Set up schema registry. Build first producer (one source system emitting events). Build first consumer (simple processing + write to lakehouse). Validate: events flow end-to-end with monitoring.
Week 5-8: Processing
Deploy stream processing engine. Implement first business use case (e.g., real-time dashboard or alerting). Add dead letter queue for failed events. Implement schema evolution strategy.
Week 9-12: Production
Add remaining producers (3-5 source systems). Implement monitoring and alerting for consumer lag and error rate. Performance testing at expected peak volume. Documentation and runbook for on-call team.
Real-Time Cost Optimization
Streaming infrastructure runs 24/7, so cost optimization is critical:
- Right-size partitions: more partitions means more brokers/TUs; only add partitions when throughput demands it.
- Tiered storage: Kafka tiered storage or Event Hubs Capture moves old events to cheap storage instead of keeping them on expensive broker disks.
- Auto-scaling consumers: scale consumers based on lag; spin down during low-volume periods, scale up during peaks.
- Compression: enable producer compression (Snappy or LZ4) to cut network and storage cost 50-70% with minimal CPU overhead (see the config sketch below).
- Data lifecycle: set retention based on business need (7 days for operational events, 30 days for analytics events, permanent for compliance events). Over-retention wastes storage.

Typical streaming cost: $2,000-10,000/month for moderate throughput (100K-1M events/day). High-throughput (10M+ events/day): $10,000-50,000/month depending on processing complexity and retention.
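Producer compression is a one-line config change. A sketch with the confluent-kafka client; the batching settings are illustrative tuning knobs that tend to improve compression ratios.

```python
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "broker:9092",
    "compression.type": "lz4",  # or "snappy"; whole batches are compressed
    "linger.ms": 20,            # brief wait to form larger batches
    "batch.size": 131072,       # larger batches compress more effectively
})
```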
Real-Time Data Use Cases by Industry
| Industry | Use Case | Latency Requirement | Pattern |
|---|---|---|---|
| Financial Services | Fraud detection, trade compliance | Sub-second | Event streaming + Flink |
| Manufacturing | Equipment monitoring, quality alerts | 1-5 seconds | IoT Hub + Stream Analytics |
| Retail | Inventory updates, dynamic pricing | 1-15 seconds | Kafka + micro-batch |
| Healthcare | Patient monitoring, clinical alerts | Sub-second | Event Hubs + Fabric RT |
| Logistics | Fleet tracking, delivery ETAs | 5-30 seconds | Kafka + Spark Streaming |
| Telecom | Network monitoring, CDR processing | Sub-second | Kafka + Flink |
The latency requirement determines the architecture: sub-second → dedicated stream processing (Flink or Kafka Streams). 1-15 seconds → micro-batch (Spark Structured Streaming). 15-60 seconds → near-real-time batch (frequent scheduled jobs). Each time the latency requirement is relaxed, infrastructure cost and operational complexity drop. Don't build sub-second infrastructure for a 15-second use case.
Event Schema Design: Getting the Foundation Right
Event schemas are the contract between producers and consumers. Design principles:
- Include metadata: every event carries event_id (UUID), event_type, timestamp, source_system, and correlation_id. This metadata enables deduplication, ordering, tracing, and debugging.
- Use explicit versioning: a schema_version field in every event lets consumers handle different versions gracefully during evolution.
- Prefer flat over nested: deeply nested events are harder to process in stream processors. Flatten where possible; nest only when the nesting has semantic meaning.
- Include the full entity state: for CDC events, include both before and after states so consumers can determine what changed without maintaining their own state.
- Use Avro or Protobuf, not JSON: they provide schema enforcement, compact serialization (50-70% smaller than JSON), and schema evolution with compatibility checks.

Schema registry: every schema registered and versioned. Compatibility mode: backward (new consumers can read old events) is the safest default for production systems. A schema sketch follows below.
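A sketch of this envelope as an Avro record serialized with fastavro; the metadata fields follow the principles above, while the payload fields and values are illustrative. In production the schema would live in a schema registry rather than inline.

```python
import io
import time
import uuid

from fastavro import parse_schema, schemaless_writer

ORDER_EVENT_SCHEMA = parse_schema({
    "type": "record",
    "name": "OrderEvent",
    "fields": [
        {"name": "event_id",       "type": "string"},  # UUID, enables deduplication
        {"name": "event_type",     "type": "string"},
        {"name": "schema_version", "type": "int"},     # explicit versioning
        {"name": "timestamp",      "type": "long"},    # epoch millis
        {"name": "source_system",  "type": "string"},
        {"name": "correlation_id", "type": "string"},  # ties related events for tracing
        # Flat payload fields rather than deep nesting:
        {"name": "order_id",       "type": "string"},
        {"name": "total",          "type": "double"},
    ],
})

event = {
    "event_id": str(uuid.uuid4()),
    "event_type": "order.created",
    "schema_version": 1,
    "timestamp": int(time.time() * 1000),
    "source_system": "web-checkout",
    "correlation_id": str(uuid.uuid4()),
    "order_id": "ORD-1001",
    "total": 99.50,
}

# Compact binary encoding: no field names on the wire, schema enforced.
buf = io.BytesIO()
schemaless_writer(buf, ORDER_EVENT_SCHEMA, event)
payload = buf.getvalue()
```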
Monitoring and Alerting for Streaming Systems
Streaming system monitoring requires different practices than batch monitoring:
- Consumer lag: the most critical metric; lag = events waiting to be processed. Alert thresholds: warning at 5 minutes of lag, critical at 15 minutes. Lag increasing steadily means consumer throughput is below producer throughput; scale consumers or optimize processing. A lag-computation sketch follows this list.
- End-to-end latency: measure time from event production to downstream action, not just consumer lag but total pipeline latency including Kafka transit, consumer processing, database write, and downstream notification. Track P50, P95, and P99 latencies.
- Error rate: alert at >0.1% of events failing processing, and investigate immediately: schema mismatch, downstream service failure, or data quality issue? A growing dead letter queue means silent data loss is in progress.
- Producer health: producer throughput dropping unexpectedly indicates a source system issue; producer errors indicate a Kafka connectivity or authentication problem.
- Infrastructure health: broker disk usage, partition balance, consumer group membership, ZooKeeper/KRaft health. These metrics predict failures before they impact data flow.

Dashboard: a Grafana dashboard with consumer lag per consumer group, end-to-end latency (P95), error rate, throughput (events/second), and dead letter queue depth, reviewed by the streaming operations team at the start of every shift.
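A sketch of the lag computation behind those alert thresholds, using the confluent-kafka client; the group, topic, and partition count are illustrative.

```python
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "broker:9092",
    "group.id": "order-processor",
    "enable.auto.commit": False,
})

partitions = [TopicPartition("orders", p) for p in range(6)]
for tp in consumer.committed(partitions, timeout=10):
    # Lag = newest offset on the broker minus the group's committed offset.
    _low, high = consumer.get_watermark_offsets(tp, timeout=10)
    committed = tp.offset if tp.offset >= 0 else 0  # no commit yet -> full lag
    print(f"partition {tp.partition}: lag={high - committed}")
```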
The Xylity Approach
We build real-time data architectures with a pattern-first methodology: selecting the right streaming platform (Kafka or Event Hubs), the right processing engine (Flink, Spark, or Fabric), and the right architecture pattern (streaming, processing, event sourcing, or CQRS) for your specific use case. Our streaming data engineers and data architects deploy real-time systems that process millions of events with sub-second latency and exactly-once guarantees.
Go Deeper
Continue building your understanding with these related resources from our consulting practice.
Real-Time Events to Real-Time Decisions
Kafka, Event Hubs, Flink, Spark Streaming. Real-time architecture for time-sensitive decisions.
Start Your Real-Time Architecture →