Apache Fluss vs. Apache Paimon: Two Engines for the Real-Time Lakehouse

Written by Joanna He | Nov 20, 2025 5:24:06 AM

If you’ve been tracking the evolution of real-time data architectures, you’ve probably heard the buzz around Apache Fluss (Incubating) and Apache Paimon. Both came out of the Apache Flink community (Paimon is a top-level Apache project, Fluss is in the Apache Incubator), and both aim to solve real pain points in building modern data platforms. But they’re not competitors; they’re more like teammates with very different jobs.

So, what’s the real difference? When should you use one over the other? And how do they fit into your lakehouse?

Let’s cut through the noise and break it down for engineers who actually build and run these systems.

TL;DR – The Big Picture

Apache Paimon is a stream-native lake format. Think of it as a next-gen Delta Lake or Iceberg, but built from the ground up for streaming workloads, with strong ACID guarantees, time travel, schema evolution, and deep Flink integration. It stores data in Parquet/ORC on object storage (S3, OSS, etc.), with minute-level freshness.
Apache Fluss is an ultra-low-latency, columnar streaming storage engine. It’s designed for sub-second query latencies—ideal as a real-time “hot” layer. It uses Apache Arrow for in-memory columnar storage and integrates tightly with Flink for operations like Delta Joins, bypassing Flink’s state bloat.

Learn more about Delta Join in the article below:

Flink 2.1 SQL: Unlocking Real-time Data & AI Integration for Scalable Stream Processing

Together, they form a tiered Streamhouse: Fluss handles the “right now,” Paimon handles the “recent past + history.” You write once (into Fluss), and tier seamlessly to Paimon. Your apps query a unified view without knowing the difference.

Why Do We Need Either?

Before diving in, let’s talk about the problem space.

Flink is amazing at processing streams, but it has two major pain points:

State explosion: Maintaining streaming joins in Flink state leads to massive checkpoints, slow recovery, and high memory usage.

Architectural fragmentation: You often end up with Kafka (for raw events), a database or cache (for low-latency lookups), and a data lake (for analytics)—three systems to manage, with data duplicated and inconsistent.

Both Fluss and Paimon aim to simplify this—but from opposite ends of the latency spectrum.

Apache Paimon: The “Lakehouse Layer” for Streaming

What is it?

Paimon started life as Flink Table Store—a native storage engine for Flink dynamic tables. It’s now an Apache top-level project and functions as a streaming data lake format that natively supports streaming and batch.

How does it work?

Data is stored as Parquet/ORC files in object storage (S3, HDFS, etc.).
It uses an LSM-tree (Log-Structured Merge Tree) under the hood to handle high-throughput writes and upserts efficiently.
Every write creates an immutable snapshot, enabling time travel, schema evolution, and ACID transactions.
It supports primary key tables with full CRUD operations (insert, update, delete, merge).
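
To make the LSM-plus-snapshot idea concrete, here is a minimal sketch in plain Python. It is a toy model, not Paimon's implementation: the class name ToyLsmTable, the TOMBSTONE marker, and the one-run-per-commit layout are all assumptions made for illustration, and real compaction across levels is left out entirely.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional

TOMBSTONE = object()  # hypothetical marker for a deleted key inside a run


@dataclass
class ToyLsmTable:
    """Toy primary-key table: every commit adds one immutable sorted run."""

    runs: List[Dict[str, object]] = field(default_factory=list)

    def commit(self, upserts: Dict[str, object], deletes: Iterable[str] = ()) -> int:
        """Write a batch as a new sorted run and return the new snapshot id."""
        entries = dict(upserts)
        for key in deletes:
            entries[key] = TOMBSTONE
        self.runs.append(dict(sorted(entries.items())))
        return len(self.runs)  # snapshot ids start at 1

    def read(self, snapshot_id: Optional[int] = None) -> Dict[str, object]:
        """Merge runs from oldest to newest; the newest value per key wins."""
        upto = len(self.runs) if snapshot_id is None else snapshot_id
        merged: Dict[str, object] = {}
        for run in self.runs[:upto]:
            merged.update(run)
        return {key: value for key, value in merged.items() if value is not TOMBSTONE}


table = ToyLsmTable()
first = table.commit({"u1": "bronze", "u2": "silver"})
table.commit({"u1": "gold"}, deletes=["u2"])

latest = table.read()           # current state after the upsert and delete
as_of_first = table.read(first)  # time travel to the first snapshot
```

Because runs are never modified after a commit, reading an older snapshot only means merging fewer runs, which is the property that makes time travel cheap in this kind of design.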

Why use it?

Unified batch + streaming: One table serves both Flink streaming jobs and Spark/Trino batch queries.
CDC made easy: Flink CDC can stream MySQL/Postgres changes directly into Paimon with schema sync.
No more dual pipelines: Replace complex Flink jobs that de-duplicate or aggregate data just to produce a clean changelog. Paimon can generate that changelog natively when the table sets 'changelog-producer' = 'lookup'.
Cheap & durable: Uses low-cost object storage with compression—great for long-term retention.
Ecosystem friendly: Works with Spark, Hive, Trino, StarRocks, etc.
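
What "generating a changelog natively" means is easier to see in code. The sketch below is a conceptual model only, assuming a plain dict as the table state and a hypothetical helper named lookup_changelog; it does not reflect how or when Paimon performs the lookup internally. The row-kind labels (+I, -U, +U) follow Flink's changelog conventions.

```python
from typing import Dict, Iterable, List, Tuple

Change = Tuple[str, str, int]  # (row kind, primary key, value)


def lookup_changelog(state: Dict[str, int], upserts: Iterable[Tuple[str, int]]) -> List[Change]:
    """Turn raw upserts into a full changelog by looking up the previous value."""
    changes: List[Change] = []
    for key, value in upserts:
        previous = state.get(key)
        if previous is None:
            changes.append(("+I", key, value))      # first time we see the key
        elif previous != value:
            changes.append(("-U", key, previous))   # retract the old value
            changes.append(("+U", key, value))      # emit the new value
        # identical rewrites produce no change records at all
        state[key] = value
    return changes


state: Dict[str, int] = {}
changelog = lookup_changelog(state, [("a", 1), ("a", 2), ("a", 2), ("b", 5)])
```

Downstream consumers get before and after images for every update without a separate de-duplication job, which is the pipeline this feature is meant to replace.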

Apache Fluss: The “Real-Time Accelerator”

What is it?

Fluss (German for “river”) is a distributed, columnar streaming storage engine built for sub-second analytics. It’s not a message queue like Kafka—it’s an analytical store that looks like a queue but performs like a database.

How does it work?

Stores data in Apache Arrow IPC format—a zero-copy, in-memory columnar format. This is the key innovation.
Supports two table types: Log Tables (append-only) and Primary Key Tables (upserts/deletes).
Data is organized into tables and partitions (not Kafka partitions), aligned with downstream lake formats like Paimon to enable efficient tiering.
Backed by RocksDB for primary key lookups, enabling high-QPS serving.
Designed to run as a disaggregated cluster (tablet servers + coordinators), separate from your Flink cluster.
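
The two table types and the columnar layout can be modeled in a few lines of Python. This is a sketch under stated assumptions, not the Fluss client API: ToyLogTable and ToyPrimaryKeyTable are hypothetical names, Python lists stand in for Arrow column vectors, and a dict stands in for the RocksDB index.

```python
from typing import Dict, List, Optional, Sequence

Row = Dict[str, object]


class ToyLogTable:
    """Append-only table stored column by column (lists stand in for Arrow vectors)."""

    def __init__(self, columns: Sequence[str]):
        self.columns: Dict[str, List[object]] = {name: [] for name in columns}
        self.size = 0

    def append(self, row: Row) -> int:
        """Append one row and return its offset; existing rows never change."""
        for name, values in self.columns.items():
            values.append(row.get(name))
        self.size += 1
        return self.size - 1

    def scan(self, projection: Sequence[str], from_offset: int = 0) -> List[Row]:
        """Touch only the requested columns, starting at from_offset."""
        picked = {name: self.columns[name][from_offset:] for name in projection}
        return [dict(zip(picked, values)) for values in zip(*picked.values())]


class ToyPrimaryKeyTable:
    """Upsert table: a dict stands in for the key-value index used for lookups."""

    def __init__(self, key: str):
        self.key = key
        self.index: Dict[object, Row] = {}

    def upsert(self, row: Row) -> None:
        self.index[row[self.key]] = dict(row)

    def delete(self, key_value: object) -> None:
        self.index.pop(key_value, None)

    def lookup(self, key_value: object) -> Optional[Row]:
        return self.index.get(key_value)


events = ToyLogTable(["user_id", "page", "ts"])
events.append({"user_id": 1, "page": "/home", "ts": 100})
events.append({"user_id": 2, "page": "/cart", "ts": 101})
slim = events.scan(["user_id", "ts"])  # the "page" column is never read
```

The scan only slices the columns named in the projection, which is why a columnar layout lets the storage side skip unused columns instead of shipping whole rows.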

Why use it?

Column pruning at the source: Because it’s columnar, Fluss can push down projections. If your Flink job only reads 3 out of 20 columns, Fluss sends only those 3—reducing I/O, network, and CPU by up to 10x (Alibaba’s numbers).
Solves Flink state bloat: Replace dual-stream joins with Delta Joins. Instead of caching a huge dimension table in Flink state, query Fluss directly. At Taobao, this shrank a job carrying 100 TB of Flink state and cut its checkpoint time from 90s to 1s.
Sub-second freshness: Data is queryable within milliseconds of ingestion.
Tiering built-in: Fluss can automatically tier data to Paimon (or Iceberg) after a configured interval (e.g., 3 minutes), creating a warm/cold layer.
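
The Delta Join idea from the list above can be sketched roughly as follows: the join keeps no state of its own, and each incoming row is matched by looking up the other side in external storage. This is a toy model rather than Flink's operator or the Fluss API; IndexedTable and delta_join are hypothetical names, and the in-memory index stands in for a Fluss table that can be looked up by join key.

```python
from collections import defaultdict
from typing import Dict, List

Row = Dict[str, object]


class IndexedTable:
    """Stand-in for an external table that supports lookups by join key."""

    def __init__(self, join_key: str):
        self.join_key = join_key
        self.by_key = defaultdict(list)

    def insert(self, row: Row) -> None:
        self.by_key[row[self.join_key]].append(row)

    def lookup(self, key: object) -> List[Row]:
        return list(self.by_key.get(key, []))


def delta_join(left: IndexedTable, right: IndexedTable, side: str, row: Row) -> List[Row]:
    """Store the new row in its own table, then join it against the other side by lookup."""
    own, other = (left, right) if side == "left" else (right, left)
    own.insert(row)
    matches = other.lookup(row[own.join_key])
    if side == "left":
        return [{**row, **match} for match in matches]
    return [{**match, **row} for match in matches]


orders = IndexedTable("user_id")
users = IndexedTable("user_id")

no_match_yet = delta_join(orders, users, "left", {"user_id": 1, "order": "A"})
joined = delta_join(orders, users, "right", {"user_id": 1, "name": "Ann"})
```

All rows live in the two tables, not in the join itself, so in this model there is nothing large for the processing job to checkpoint. The trade-off is one lookup per incoming row, which is why the storage layer needs fast keyed access.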

How They Work Together: The Tiered Streamhouse

Here’s where it gets powerful.

Imagine this flow:

Ingest: Your app writes events to a Fluss table (user_events).
Process: A Flink job reads from Fluss and does enrichment, aggregation, etc.
Tier: A background Flink job continuously moves data from Fluss → Paimon after 3 minutes.
Query: Your dashboard queries the same logical table name but gets a union of Fluss (data not yet tiered, roughly the last 3 min) + Paimon (everything else).

This is called Union Read. To your query engine, it’s one table with second-level freshness and unlimited history.
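
A rough way to picture Union Read is a tiering watermark: the lake holds everything up to some offset, and the hot layer supplies whatever comes after it. The sketch below assumes that model; the function names tier and union_read, the offset and ts fields, and the timestamp cutoff are all hypothetical and only illustrate the flow above.

```python
from typing import Dict, List

Row = Dict[str, int]  # each row carries an "offset" and an event time "ts"


def tier(hot: List[Row], lake: List[Row], tiered_offset: int, cutoff_ts: int) -> int:
    """Copy hot rows at or before cutoff_ts into the lake; return the new tiered offset."""
    for row in hot:  # hot rows are assumed to be ordered by offset
        if row["offset"] <= tiered_offset:
            continue  # already in the lake
        if row["ts"] > cutoff_ts:
            break  # too recent; it stays in the hot layer only
        lake.append(row)
        tiered_offset = row["offset"]
    return tiered_offset


def union_read(hot: List[Row], lake: List[Row], tiered_offset: int) -> List[Row]:
    """Historical rows from the lake plus hot rows the lake has not seen yet."""
    fresh = [row for row in hot if row["offset"] > tiered_offset]
    return lake + fresh


hot = [{"offset": i, "ts": ts} for i, ts in enumerate([0, 60, 120, 200])]
lake: List[Row] = []

tiered_offset = tier(hot, lake, tiered_offset=-1, cutoff_ts=100)
view = union_read(hot, lake, tiered_offset)
```

Filtering the hot layer by the tiered offset is what keeps rows that exist in both layers from showing up twice in the unified view.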

No code changes. No data duplication. No consistency issues.

Fluss handles the “hot” path; Paimon handles the “warm + cold” path. And because both use aligned bucketing, the tiering is efficient and partition-aware.

Key insight: Fluss isn’t replacing Paimon—it’s accelerating it for the critical last few minutes of data.

Apache Fluss vs Apache Paimon: When to Choose What?

Use Paimon alone if:

You’re happy with minute-level freshness.
You need ACID, time travel, or schema evolution.
You want broad ecosystem support (Spark, Trino, etc.).
Your Flink jobs don’t suffer from state explosion.

Paimon is a drop-in upgrade for Delta/Iceberg if you’re all-in on Flink streaming. It’s mature, stable, and solves the “streaming lakehouse” problem elegantly.

Use Fluss + Paimon if:

You need sub-second data freshness.
You have massive streaming joins causing Flink state bloat.
You’re building real-time user-facing features (e.g., live recommendations).
You’re okay with adding one more system (Fluss cluster) for huge performance gains.

This combo is for teams pushing the limits of real-time. If your SLA is “data must be queryable in <5 seconds,” Fluss is your answer.

Don’t use Fluss if:

You just need a Kafka replacement for event streaming.
You don’t have ultra-low-latency requirements.
You can’t justify the operational cost of another distributed system.

Remember: Fluss is not a message queue. It’s a columnar streaming store that supports analytical workloads.

Capability	Apache Fluss	Apache Paimon
Latency	Sub-second (milliseconds to seconds)	Minute-level (up to ~1 minute)
Data Model	Append-only (Log Tables), Primary Key (Update/Delete)	Primary Key, Append, Bucketed Append, Partial Update
Storage Format	Native: Apache Arrow (IPC)	Native: Parquet, ORC; Pluggable for future formats
ACID Transactions	read-your-writes consistency	Yes, via two-phase commit and snapshot isolation
Schema Evolution	Not yet supported	Yes, fully supported (add, remove columns)
Time Travel	Yes, via a tiered data lake table	Yes, via immutable snapshots and tags
Join Support	Optimized for Lookup Joins via primary-key lookups and Delta Joins via index-key lookups	Supports various merge engines for enrichment/aggregation
Query Engines	Primarily Flink; limited support beyond	Broad ecosystem support: Spark, Trino, StarRocks, Hive, etc.
Integration Focus	Tiering to Paimon/Iceberg; Delta Join for Flink	Native Flink/Spark connectors; Flink CDC integration

Real-World Impact (From Alibaba/Taobao)

The numbers speak for themselves:

100% reduction in Flink state size by replacing dual-stream joins with Delta Joins against Fluss.
Checkpoint time dropped from 90s → 1s.
80%+ reduction in CPU and memory for enrichment pipelines.
10x query performance on analytical workloads due to column pruning.

These aren’t theoretical gains—they’re from production systems handling petabytes of data.

How Taobao uses Apache Fluss (Incubating) for Real-Time Processing in Search and RecSys

The Future Roadmap

Both projects are moving fast:

Fluss is working on:

Native Iceberg tiering (recently released in Fluss 0.8).
Zero-disk architecture (write directly to object storage).
Python client and broader query engine support.

Paimon is adding:

Iceberg-compatible snapshots (recently supported in Apache Paimon 1.2.0).
Multimodal AI support (via Lance integration).
Variant data type for semi-structured data (like JSON).
Better concurrency control for multi-writer scenarios.

The vision is clear: Fluss as the real-time front door, Paimon as the universal storage layer.

Bottom Line for Engineers

Paimon = your lakehouse table for streaming data. It’s durable, ACID, and queryable by everyone. Start here if you’re building a real-time data platform.
Fluss = your real-time accelerator. Use it when Paimon’s minute-level latency isn’t enough, or when Flink state is killing your cluster.

They’re not rivals—they’re designed to work together. Think of Fluss as the “cache” and Paimon as the “source of truth.” And thanks to Union Read, your apps don’t need to know the difference.

If you’re serious about real-time analytics at scale, this tiered Streamhouse architecture (Flink + Fluss + Paimon) is one of the most promising patterns we’ve seen in years. It reduces complexity, cuts costs, and delivers the latency that modern apps demand.

So evaluate both—but don’t see it as an either/or. See it as a stack.

Want to try it?

Paimon: Get started with the open-source version at https://paimon.apache.org
Fluss: Check out the code and docs on GitHub: https://github.com/apache/fluss

Prefer a no-setup, cloud-native experience? You can spin up a fully managed environment in minutes on Alibaba Cloud:

Use Realtime Compute for Apache Flink to run your streaming jobs with native Paimon and Fluss included.
Pair it with Data Lake Formation (DLF) to auto-provision Paimon tables, manage catalogs, and enable seamless tiering from Fluss.

Both services are available in the Alibaba Cloud console—ideal for a quick test without managing infrastructure. Give it a shot and see the tiered Streamhouse in action!
