Flink Forward Asia

The Delta Join in Apache Flink: Architectural Decoupling for Hyper-Scale Stream

Written by Joanna He | Nov 18, 2025 9:36:43 AM

The Paradigm Shift in Flink Stream Joins

What Delta Join Solves

Apache Flink has always been great at stateful stream processing, but here's the thing: traditional stream joins hit a wall when dealing with massive amounts of data and high-cardinality keys. The problem? You need to keep all the historical data in Flink's state forever to ensure correctness, and that's just not sustainable.

Delta Join (FLIP-486) changes everything. Instead of buffering everything internally, it turns joins into a stateless lookup mechanism that directly queries data in tables like Apache Fluss or Apache Paimon.

The Real Impact of Delta Join

Here's what Delta Join actually does: it separates compute from history. Instead of keeping everything in Flink's state, the operator just looks things up to external storage when needed. No more explosive state growth.

The numbers tell the story. Production deployments of Taobao team (Alibaba's ecommerce business) have seen:

  • 50TB of join state eliminated - imagine that
  • 10x cost reduction (2300 CU down to 200 CU) while maintaining the same throughput
  • 80%+ CPU and memory savings
  • 87% faster job recovery times
  • Second-level checkpointing instead of waiting forever

This isn't just incremental improvement - it's a complete game changer for large-scale stream processing.

The Unbounded State Crisis: Why Traditional Joins Fail at Scale

Why Regular Joins Break at Scale

Regular Joins in Flink are incredibly flexible - they handle Insert, Update, Delete operations perfectly. But here's the catch: you have to keep ALL the history from both streams in Flink's state forever.

Since streaming jobs run indefinitely, that state just keeps growing. Forever. And in high-cardinality scenarios? That's a disaster waiting to happen.

The problems pile up fast:

  1. Resource Death: Your Task Managers get crushed under the weight of all that state
  2. Checkpoint Nightmares: Checkpointing takes forever, jobs become unstable, timeouts everywhere
  3. Recovery Hell: Restoring 100TB+ of state from storage? Plan for coffee breaks

What We Had Before Delta Join

Before Delta Join, Flink had some limited options:

  1. Interval Joins: Only work with time windows and append-only tables. Great if your use case fits those constraints, but most real-world scenarios don't.
  2. Temporal/Lookup Joins: Good for joining a stream with a dimension table, but they're one-way only. You can't use them for stream-to-stream joins where both sides need historical access.

The core issue? Regular Joins forced Flink to duplicate data that was already stored elsewhere. It was like keeping copies of your database in memory just in case - inefficient and unsustainable.

Architectural Deep Dive into Delta Join

The Delta Join Philosophy: Compute vs History

Delta Join (FLIP-486) is all about separation of concerns. The principle is simple: "probe on demand, minimal job state, eventual consistency."

When an event hits either side of the join, the operator doesn't dig through internal history. Instead, it queries an external index in real-time. No more hoarding data - just look it up when you need it.

How StreamingDeltaJoinOperator Actually Works

The StreamingDeltaJoinOperator is the engine that makes this all happen. Here's the key components:

  • LRU Caches (Both Sides): Check these first before hitting external storage. Hot data stays in memory, cold data gets evicted. Simple but effective.
  • Asynchronous Probing: No waiting around - when you miss the cache, fire off the lookup and keep processing.
  • AsyncDeltaJoinRunner: Two instances (one per side) manage the cache and external I/O operations.

Important: Delta Join isn't truly stateless - it's a hybrid model. The operator still keeps LRU caches and coordination state for consistency. The performance now depends on cache hit rates and external lookup latency.

Keeping Things Correct: Async Ordering

Here's the tricky part: when you have parallel async lookups, you can get updates for the same key arriving out of order. That's bad news for correctness.

Delta Join uses FLIP-519's KeyedAsyncWaitOperator to solve this. It ensures that all operations for a single key happen in sequence, while different keys can still run in parallel. Smart approach - you get the throughput benefits without sacrificing correctness.

External State Storage: Integration with the Real-Time Lakehouse

Why Fluss is Perfect for Delta Join

Apache Fluss (incubating) was built specifically for this use case. It's a disaggregated table storage engine designed from the ground up for Apache Flink.

Here's what makes it work:

  • Distributed Architecture: Coordinator + tablet servers with RocksDB backing for mutable tables
  • Dual Structure: KV store + log tablet = point-in-time lookups + CDC streams
  • Prefix Lookups: This is the killer feature. You can query using partial composite keys (e.g., just customer_id instead of the full customer_id, order_id, item_id). Most systems require exact matches - Fluss is much more flexible.

The Future: Apache Paimon Integration

While Apache Fluss was the initial, purpose-built enabler for Delta Join, the Flink community recognizes the need to expand this concept to leverage open-source data lake formats. Continuous optimization of join performance remains a strategic focus. The roadmap for Flink SQL explicitly includes plans to add support for other storage systems, notably Apache Paimon, to facilitate near-real-time Delta Join capabilities.

Apache Paimon offers:

  • Primary-key tables with real-time streaming updates
  • Query access within minutes
  • Flexible merge engines (deduplication, partial updates, aggregation)
  • Integration with Spark, Hive, Trino

The goal: make Delta Join work across the entire Lakehouse ecosystem, not just Fluss.

Quantitative Impact and Operational Stability

The numbers don't lie - Delta Join delivers serious operational improvements.

The Big Wins

  • State Elimination: No more 100TB+ state files. No more checkpoint timeouts. No more job failures.
  • Resource Savings: 80%+ CPU and memory reduction. One case went from 2300 CU to 200 CU - that's a 10x cost cut while maintaining the same throughput.

Operational Stability Improvements

  • Checkpointing: No more waiting forever for checkpoints. Delta Join enables second-level checkpointing - that's how fast it is.
  • Recovery: 87% faster bootstrap times. When things go wrong, you get back up and running quickly.

Bonus: Since your join history lives in external storage, you can use it for other things too. One team reduced data reprocessing from 4 hours to 30 minutes by using Sort-Merge Joins on the externalized tables.

Implementation, Configuration, and Use Cases

Using Delta Join in Your SQL

The best part? Delta Join works with standard SQL. No special syntax needed - just write your regular JOIN queries like SELECT * FROM orders INNER JOIN Product ON orders.productId = Product.id.

The optimizer is smart enough to automatically choose Delta Join when:

  • The SQL pattern supports transforming the Regular Join into a Delta Join
  • You have the right external storage configured

In recent Flink versions, this often happens automatically.

Configuration Knobs

You can tune these parameters to optimize performance:

  • table.optimizer.delta-join.strategy='AUTO' - Let Flink decide whether a regular join can be converted to a delta join when applicable. Default is AUTO.
  • table.exec.delta-join.cache-enabled='true' - Whether to enable the cache of delta join. Default is true.
  • table.exec.delta-join.left.cache-size=10000 and table.exec.delta-join.right.cache-size=10000- The cache size used to cache the lookup results of the left and right table in delta join.
  • table.exec.async-lookup.buffer-capacity=100 - The max number of async i/o operation that the delta join can trigger to lookup

When to Use Delta Join

Delta Join shines in these scenarios:

  1. High-Cardinality Stream Enrichment: Joining massive event streams (clicks, transactions) with large, frequently-updated dimension tables (customer profiles, catalogs). No state explosion.
  2. Real-Time Traceability: Since all join history lives in external storage, you can debug and audit exactly what data was used for any calculation.
  3. Complex Change Tracking: Handle Insert/Update/Delete operations on dynamic tables while keeping Flink's internal state minimal.

Conclusions and Outlook

Delta Join (FLIP-486) isn't just a new feature - it's a fundamental shift in how we think about stream processing at scale.

The trade-off is clear: instead of managing massive state within Flink's checkpoint domain, you pay the latency cost of external lookups. For enterprise-scale real-time applications, this is a no-brainer. You get:

  • Massive operational stability improvements
  • 10x cost reduction with >80% resource efficiency
  • Lightning-fast job recovery

Right now, it works great with specialized systems like Fluss (with those awesome prefix lookups). The future roadmap includes Apache Paimon and other Lakehouse formats, which means Delta Join will become a standard capability across the entire streaming ecosystem.

This is how Flink stays ahead as the go-to engine for large-scale, stateful stream processing.

See Also

An in-depth case study from Alibaba showcasing how Taobao leverages Apache Fluss for scalable streaming storage and real-time analytics.