
Xiaomi's Real-Time Lakehouse Implementation: Best Practices with Apache Paimon

Written by Yujiang Zhong | Nov 20, 2025 5:22:48 AM

This article is compiled from the presentation by Mr. Zhong Yujiang, a software development engineer at Xiaomi, at the Flink Forward Asia 2024 Streaming Lakehouse (Part I) session. The content consists of the following three parts:

  • Background Introduction
  • Building a Near-Real-Time Data Lakehouse Based on Paimon
  • Future Outlook

Challenges in Xiaomi's Real-Time Lakehouse Evolution

The first part is the background introduction, which briefly describes the architecture of typical real-time data warehouses and the reasons for introducing Apache Paimon.

Legacy Architecture: Flink + Talos + Iceberg Limitations

The previous real-time lakehouse computing stack primarily consisted of Flink + Talos + Iceberg.

The above diagram illustrates our previous real-time lakehouse computing architecture. The upper half represents the real-time pipeline, primarily consisting of Flink, Talos, and Iceberg. Talos is an internally developed message queue. The data in this pipeline typically originates from transaction data generated by online systems or logs reported from terminals. After the data is collected and reported to Talos via the collection platform, users can perform custom data transformation operations to generate a real-time data warehouse for downstream system consumption. The downstream systems are usually OLAP BI application platforms or real-time query systems.

The lower half of the diagram depicts the offline pipeline. Due to stability or computational resource constraints in the real-time pipeline, some attributes may be discarded, and complete cycle data may not be retained, resulting in incomplete outcomes from real-time calculations. Therefore, we utilize the offline pipeline to compute the final correct data results.

Key Architectural Challenges: Cost, Complexity, and Data Redundancy

This architecture primarily faces the following pain points:

  1. High Cost of Real-Time Computing: The lakehouse mainly uses Iceberg, but Iceberg has limited support for stream-processing semantics. Update-heavy operations that are common in stream processing, such as streaming joins, real-time deduplication, and partial column updates, are not supported by Iceberg. Consequently, much of this work has to be done at the computation layer, inside Flink jobs, which leads to higher resource consumption and poorer job stability.
  2. Complex Architecture with Poor Job Stability: The real-time processing architecture is relatively complex, which lowers stability. As seen in the previous diagram, in addition to Iceberg and the message queue, the real-time data warehouse may also introduce an external key-value (KV) system as needed. This additional KV system increases operational costs for users and is inconvenient to use, since it cannot be queried directly through the OLAP engine.
  3. High Storage Costs: Maintaining both real-time and offline pipelines, along with the data redundancy in the KV system, results in high storage costs, and per-unit storage in the KV system is typically much more expensive than in the lakehouse.

Strategic Goals: Unified Architecture and Cost Optimization

To address these challenges in real-time data warehousing, there are several key goals:

  • Reduce Computing Costs

There is a desire to better integrate stream processing with the data lakehouse, aiming to minimize the additional resource consumption brought about by real-time processing.

  • Simplify Architecture and Improve Stability

The goal is to keep the real-time pipeline as straightforward as possible to reduce the development burden on users while enhancing job stability.

  • Unify Data Pipelines

Efforts should be made to avoid the development and operational costs associated with having separate real-time and offline pipelines, while also reducing data redundancy.

Streamlining Real-Time Lakehouse Architecture with Apache Paimon

The second part introduces building a near-real-time data lakehouse based on Paimon, discusses the capabilities of Paimon, and presents classic application scenarios along with the benefits brought by its implementation.

Apache Paimon Core Capabilities: LSM Storage and Stream-Batch Hybrid Processing

The understanding of Paimon can be briefly summarized as follows:

(1) Lakehouse Table Format: Like Iceberg and other data lake projects, Paimon is a lakehouse table format, so its underlying storage can be HDFS or S3, offering strong scalability at low cost. Like traditional data warehouses, Paimon supports ACID transactions and schema evolution, ensuring reliable and safe data updates while allowing the schema to be adjusted as business needs change.

(2) LSM Storage Structure: On the basis of the Lakehouse architecture, Paimon innovatively introduces the LSM (Log-Structured Merge-Tree) storage structure, bringing three main capabilities:

  • Streaming Upsert Functionality: Enables real-time data deduplication and efficient CDC (Change Data Capture) updates.
  • Partial Update: Supports updates to individual columns.
  • Aggregation: Near-real-time data aggregation.

In summary, Paimon effectively supports stream computing semantics on top of the Lakehouse structure by introducing the LSM data structure while also providing good support for batch processing.
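As a concrete illustration of these capabilities, the following is a minimal Flink SQL sketch of creating a Paimon catalog and a primary-key table; the warehouse path, table, and column names are placeholders rather than anything from the presentation.

```sql
-- Create a Paimon catalog backed by lake storage (the path is a placeholder).
CREATE CATALOG paimon_catalog WITH (
  'type' = 'paimon',
  'warehouse' = 'hdfs:///path/to/warehouse'
);

USE CATALOG paimon_catalog;

-- A primary-key table: records sharing a key are merged by LSM compaction,
-- which is what provides streaming upsert on top of cheap lake storage.
CREATE TABLE user_profile (
  user_id    BIGINT,
  city       STRING,
  updated_at TIMESTAMP(3),
  PRIMARY KEY (user_id) NOT ENFORCED
) WITH (
  'bucket' = '4'   -- fixed bucket count for the LSM layout
);

-- Schema evolution: adding a column is a metadata-only change; existing files are not rewritten.
ALTER TABLE user_profile ADD (age INT);
```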

Use Case 1: Implementing Data Flattening Using Paimon Partial-Update

The application scenario can be described as using Paimon's partial column update functionality to achieve data flattening. Before the introduction of Paimon, jobs typically used the classic two-stream join solution. The job would consume two event streams, perform filtering and transformation operations, and then execute the two-stream join. Due to the large volume of data, the state could easily reach the TB level. To reduce the state in Flink jobs, longer-period data would typically be stored in external KV systems like HBase or Pegasus, while only the state data for the most recent few hours would be kept in the Flink job.

This approach brings two problems. First, the large state leads to job instability. Second, the use of additional KV systems increases development and operational costs, and managing KV systems is not convenient. Analyzing the root cause of these problems reveals that the efficiency of the two-stream join is very low. Due to the large volume of two-stream join data, most of the Flink state data is cached on local disks. If the in-memory data cache is exhausted during the Join, random reads from the disk are required. Given the overall high data traffic, the frequency of random disk reads is very high. Additionally, when longer-period data needs to be queried, external KV systems must be accessed, which not only brings network overhead but also leads to data redundancy.

To address these issues, Paimon's partial column update feature can be used. Paimon supports a merge engine known as Partial-Update, which merges multiple records sharing the same primary key and takes the last non-null value for each column. This merge is not performed inside the Flink computation; it happens during the compaction of Paimon tables. Because Paimon uses the tiered storage structure of the LSM tree, records with the same key in different layers can be merged cheaply during compaction.
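A minimal Flink SQL sketch of this pattern, with assumed table and column names (`order_stream` and `payment_stream` stand in for the two event streams): instead of a stateful two-stream join, both streams write different columns of the same primary-key table, and Paimon merges them during compaction.

```sql
-- Wide table using the partial-update merge engine (all names are illustrative).
CREATE TABLE order_wide (
  order_id   BIGINT,
  amount     DECIMAL(10, 2),  -- filled by the order stream
  pay_status STRING,          -- filled by the payment stream
  PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
  'merge-engine' = 'partial-update',
  'bucket' = '8'
);

-- Each stream supplies only its own columns; for every order_id, the last non-null
-- value per column wins when records are merged at compaction time.
INSERT INTO order_wide
SELECT order_id, amount, CAST(NULL AS STRING) AS pay_status FROM order_stream
UNION ALL
SELECT order_id, CAST(NULL AS DECIMAL(10, 2)) AS amount, pay_status FROM payment_stream;
```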

By using Paimon's Partial-Update feature, the random disk reads involved in the previous two-stream join are replaced with sequential reads during compaction, eliminating the random-read bottleneck. The benefits of using Paimon Partial-Update are as follows:

Firstly, it eliminates the disk random read issue associated with Streaming Join. Secondly, data storage can be unified and converged into a data lakehouse, meaning there is no longer a need for external HBase or other KV systems, thus saving related costs. In practical cases, simply eliminating a portion of HBase storage can save approximately ¥50,000/month (approx. $7,000).

Additionally, system stability is enhanced because Flink's state no longer needs to store terabytes of data. Finally, job logic is simplified. By using Paimon, tasks can be accomplished using SQL alone, without the need to write custom timer logic for Flink optimizations as before.

Use Case 2: CDC Implementation with Streaming Upsert

The second application scenario is Paimon's Streaming Upsert feature, which is mainly applied in two aspects: firstly, the Changelog data generated during stream processing is written into Paimon tables; secondly, incremental update data from online databases is integrated into Paimon tables through CDC (Change Data Capture) technology.
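The following is a sketch of the CDC side of this scenario; the upstream changelog table (`mysql_users_cdc`, e.g. defined with a CDC connector) and all column names are assumptions for illustration, not part of the original presentation.

```sql
-- Primary-key Paimon table that absorbs insert/update/delete changelog records as upserts.
CREATE TABLE ods_users (
  id    BIGINT,
  name  STRING,
  email STRING,
  PRIMARY KEY (id) NOT ENFORCED
);

-- Continuously sync the database changelog into the lakehouse table.
INSERT INTO ods_users
SELECT id, name, email FROM mysql_users_cdc;
```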

Before introducing Paimon, traditional solutions mainly included two approaches:

  1. Storing raw Changelog data: This method stores the raw Changelog data directly in an append-only manner and aggregates it by primary key and timestamp through a View at query time, so users query the aggregated view. The advantage is ease of maintenance, since writes are append-only and maintenance only involves cleaning up older Changelog data. However, this method scales poorly, especially in frequently updated scenarios, because old log data has to be cleaned up often; in addition, the aggregation performed on every query introduces some latency.
  2. Offline batch data import: This is an offline approach, and its drawback is low data freshness since each import requires full data. This leads to untimely data updates, failing to meet real-time requirements.

Beyond these traditional methods, we also explored an Iceberg Upsert solution. Flink supports writing to Iceberg in upsert mode. During writing, Iceberg decouples updates from data writes by producing Equality Delete and Position Delete files, deferring the data merging workload to the read and Compaction stages. The advantage of this approach is efficient data ingestion.

However, this method also presents some challenges. First, Equality Delete files currently lack sorting and layering mechanisms, which leads to high read-time load and requires frequent Compaction to eliminate the Delete files. Second, each write may produce three kinds of files, Data files, Equality Delete files, and Position Delete files, tripling the file count and aggravating the small-file problem. Finally, incremental reads are difficult to support: because data updates are applied lazily, computing incremental data takes significant effort, and Iceberg currently does not support incremental reads of upserted data.

Paimon effectively addresses these issues through its storage structure. Paimon adopts the Log-Structured Merge-Tree (LSM) architecture, in which merging proceeds layer by layer through sequential merges between adjacent levels, so each merge does not need to rewrite all historical files. This differs from Iceberg's Equality Delete, which applies across all historical files and can therefore require rewriting almost all of them in each operation.

A comparative test was conducted by writing the same batch of data into two tables in upsert mode and using Compaction tasks to keep the Delete files of both tables below a certain threshold, in order to measure space amplification. The results showed that the space amplification caused by Paimon's Compaction was significantly lower than Iceberg's, approximately 6.7 for Paimon versus 16.1 for Iceberg (lower indicates better efficiency).

This advantage gives Paimon a significant superiority in data update and storage efficiency, especially in scenarios requiring frequent updates and merges. Paimon can more effectively manage storage resources, reducing unnecessary file rewriting and space wastage.

Paimon supports multiple Changelog Producers, catering to different business scenarios and requirements. In Streaming Upsert, many computational scenarios do not require users to consume a precise Changelog, but when they do, Paimon provides several Changelog modes (a configuration sketch follows this list):

  • INPUT: Suitable for writing CDC data, where the input already carries a complete changelog.
  • LOOKUP: Suitable for smaller data volumes with stricter Changelog latency requirements.
  • FULL-COMPACTION: Suitable for larger data volumes that can tolerate higher latency, typically with delays of 10 to 20 minutes.
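As referenced above, a configuration sketch follows; the table name, columns, and option values are illustrative. The changelog producer is a table option, and downstream Flink jobs can then consume the table as a changelog stream.

```sql
-- Choose how the table materializes its changelog for downstream streaming readers.
CREATE TABLE dwd_orders (
  order_id BIGINT,
  status   STRING,
  PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
  'changelog-producer' = 'lookup'   -- alternatives: 'input', 'full-compaction'
);

-- A downstream streaming job that reads only changes produced after it starts.
SELECT * FROM dwd_orders /*+ OPTIONS('scan.mode' = 'latest') */;
```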

The benefits of replacing Iceberg with Paimon mainly include:

  • Reduced Compaction resource consumption: Paimon's LSM structure is more efficient in merging data, thereby reducing resource usage.
  • Decreases the Number of Small Files Written: Paimon eliminates the need to write two additional Delete files, thus mitigating the small file issue.
  • Improves Streaming Read Experience: Paimon's LSM structure lends itself to incremental reads, enabling more efficient handling of incremental data in streaming scenarios.

Use Case 3: Optimizing Joins with Dimension Tables

In the Lookup Join scenarios for dimension tables, traditional pipelines commonly utilize three technical components: HBase, Pegasus, and Iceberg. We have extended the Lookup interface in the Flink Iceberg connector, enabling incremental loading functionality when updating the Lookup tables. However, the previous implementation stored all data in memory, which, although cost-effective, was poorly scalable. When data volumes become slightly larger, memory resources can become overwhelmed.

Therefore, when data volumes increase, it is typically necessary to import data into KV storage systems like HBase or Pegasus. As dedicated KV storage systems, HBase and Pegasus offer strong scalability. By expanding the nodes of the storage service, the performance of Join operations can be improved. However, the drawback is that this approach incurs high costs, which may be difficult to bear for some businesses.

Paimon tables can also be used as lookup (dimension) tables: data is loaded from table storage on demand and cached in RocksDB on local disks. However, this approach also has some drawbacks.

Firstly, local disk performance can still become a bottleneck for Lookup operations. When the cached data resides on local disks, each lookup triggers a random disk read, which significantly limits Lookup Join performance on compute nodes with HDD disks.

Additionally, the current implementation requires each node to load the entire dataset. This means that when data is updated, particularly after a full Compaction, nodes need to load all data upon startup, leading to prolonged startup times, which is evidently inefficient.

In analyzing this issue, we found that the need to load the full dataset arises from not partitioning the input stream appropriately. By shuffling the data stream with the same bucket strategy used by the Lookup table, each parallel subtask no longer has to load the entire table. With this optimization, the data is automatically partitioned so that each node only loads the data of specific buckets, significantly reducing the data volume loaded per Lookup node.
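The following is a minimal sketch of a lookup join against a Paimon dimension table; all names (`orders_stream`, `dim_product`) and the bucket count are assumptions for illustration. Under the bucket-aligned optimization described above, each parallel lookup task would only need to load the buckets it is responsible for.

```sql
-- Dimension table with a fixed bucket layout.
CREATE TABLE dim_product (
  product_id BIGINT,
  category   STRING,
  PRIMARY KEY (product_id) NOT ENFORCED
) WITH (
  'bucket' = '16'
);

-- The probe side needs a processing-time attribute for the lookup join.
CREATE TEMPORARY VIEW orders AS
SELECT *, PROCTIME() AS proc_time FROM orders_stream;

SELECT o.order_id, o.product_id, d.category
FROM orders AS o
LEFT JOIN dim_product FOR SYSTEM_TIME AS OF o.proc_time AS d
  ON o.product_id = d.product_id;
```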

We also identified that local disk performance can easily become a bottleneck for Lookup operations, as most reads in Lookup operations are essentially high-frequency random disk reads. Disk I/O can quickly become a performance bottleneck and is easily affected by the load of other tasks. For instance, in mixed-deployment clusters, if it is mixed with HDFS, data tasks on HDFS Data Nodes may occupy a large proportion of I/O, which can easily affect Flink Lookup performance. We attempted to optimize this by storing disk data on dedicated remote SSDs. In our scenario, network I/O is usually not the bottleneck; rather, disk I/O is. Additionally, SSD performance is significantly better than HDD. By transferring disk load to remote SSDs, local tasks are less likely to be affected by other tasks on the node.

Using Paimon for Lookup operations can replace some tasks originally performed through Join operations with HBase and Pegasus KV systems, bringing multiple benefits:

  • Unified Data Management: Data can be consolidated into a data lakehouse. In Paimon tables, SQL can be used for querying and writing, greatly facilitating data management and operations.
  • Simplified Data Pipeline: Since there is no longer a need to maintain parts of the KV system, the data pipeline is simplified, reducing system maintenance complexity and lowering development and operational costs.

However, Paimon also has some limitations:

  • Performance and Flexibility Constraints: Paimon's performance and flexibility still have significant gaps compared to dedicated KV systems, especially in scenarios with high update frequencies where it underperforms compared to KV systems. Currently, Paimon is better suited for dimension tables with very low update frequencies.
  • Scalability Dependency: Paimon's scalability relies on the horizontal expansion of Lookup nodes. Unlike KV systems, which only need to expand their storage resources, using Paimon Lookup requires expanding Flink's compute nodes, and may even necessitate re-partitioning Paimon's Buckets and increasing their number.

Future Outlook: Next-Generation Lakehouse Roadmap

Looking ahead, the plan is to continue exploring Paimon's applications in stream processing scenarios, with the hope of promoting it to more users and use cases. There are already several typical application scenarios, but there is still a desire to further expand its usage scope.

Additionally, there is a plan to build automated maintenance services. Currently, Paimon supports the automatic handling of snapshot expiration and TTL tasks within jobs, making its use very convenient. However, these tasks are coupled with data update tasks. To address this issue, the goal is to implement automated scheduling on the platform side, thereby reducing task coupling and improving efficiency. Prior to this, an automatic monitoring and maintenance governance service was constructed for Iceberg, and the hope is to support Paimon within this governance service in the future.
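For reference, snapshot expiration is currently driven by table options such as the following (a sketch with illustrative values on a hypothetical table), and the expiration work runs inside the writing job, which is exactly the coupling mentioned above.

```sql
-- Retention settings enforced by the writing job when it expires old snapshots.
ALTER TABLE dwd_orders SET (
  'snapshot.time-retained'    = '2 h',
  'snapshot.num-retained.min' = '10',
  'snapshot.num-retained.max' = '100'
);
```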

Moreover, there are plans to enhance Paimon's Catalog capabilities. Paimon needs to support not only OLAP and SQL queries on its tables but also platform-level access to its metadata. Drawing on Iceberg's approach of introducing a REST Catalog that exposes metadata through REST APIs, the hope is that Paimon can build similar capabilities to support a broader range of application scenarios.