This article introduces how Apache Hudi optimizes data layout through its Clustering service, walking through the Clustering architecture, its configuration, and the query performance gains observed in practice.
1. Background
Apache Hudi brings stream processing to big data, delivering fresher data while being an order of magnitude more efficient than traditional batch processing. In a data lake/warehouse there is a tradeoff between ingestion speed and query performance. Ingestion usually prefers small files to improve parallelism and make data available for query as soon as possible, but query performance degrades when there are many small files. In addition, data is typically co-located by arrival time during ingestion, whereas the query engine performs better when frequently queried data is stored together. Most systems therefore support some form of independent optimization to address the limitations of an unoptimized data layout. This blog introduces a service called Clustering [RFC-19] that reorganizes data to improve query performance without affecting ingestion speed.
2. Clustering architecture
Hudi provides different write operations, such as insert/upsert/bulk_insert, through its write client API. To trade off file size against ingestion speed, Hudi exposes the hoodie.parquet.small.file.limit configuration to define a small-file threshold. Users can set it to 0 to force new data into new file groups, or to a higher value so that new data is "padded" into existing small file groups until they reach the specified size, at the cost of higher ingestion latency.
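As a quick illustration, here is a minimal sketch of passing this knob as a write option (the field names, table name and path below are placeholders assumed for the example, and df is a DataFrame of incoming records; they are not from the original article):
// Minimal sketch: assumed record key / precombine / partition fields and output path.
// hoodie.parquet.small.file.limit = 0 means no existing file is treated as "small",
// so ingests never pad old file groups and ingestion stays fast.
df.write.format("org.apache.hudi").
  option("hoodie.datasource.write.recordkey.field", "uuid").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.partitionpath.field", "partitionpath").
  option("hoodie.table.name", "example_table").
  option("hoodie.parquet.small.file.limit", "0").
  mode("append").
  save("dfs://location")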
To support fast ingestion without sacrificing query performance, we introduce the Clustering service, which rewrites data to optimize the layout of files in the Hudi data lake.
The Clustering service can run synchronously or asynchronously, and it introduces a new REPLACE action type that marks Clustering operations on the Hudi metadata timeline.
Overall, Clustering is divided into two parts:
Scheduling Clustering: create a Clustering plan using a pluggable Clustering policy.
Executing Clustering: process the plan using an execution policy to create new files and replace the old files.
2.1 Scheduling Clustering
Scheduling Clustering takes the following steps:
Identify files eligible for Clustering: based on the selected Clustering policy, the scheduling logic identifies the files that meet the Clustering criteria.
Group the eligible files according to specific criteria. The data size of each group should be a multiple of targetFileSize. Grouping is part of the policy defined in the plan, and there is an option to cap the group size in order to improve parallelism and avoid shuffling large amounts of data.
Save the Clustering plan to the timeline in Avro metadata format. A sketch of the plan-level options is shown below.
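For illustration only, the scheduling side can be tuned through write options like the following; the strategy class name and the numeric values are assumptions based on Hudi's size-based plan strategy, not values quoted from this article:
// Sketch of scheduling-side configuration (assumed class name and values; verify against your Hudi version).
val clusteringPlanOpts = Map(
  // pluggable plan strategy that picks and groups small files by size
  "hoodie.clustering.plan.strategy.class" ->
    "org.apache.hudi.client.clustering.plan.strategy.SparkSizeBasedClusteringPlanStrategy",
  // files smaller than this are candidates for clustering (~600 MB here)
  "hoodie.clustering.plan.strategy.small.file.limit" -> "629145600",
  // target size of the rewritten files (~1 GB here)
  "hoodie.clustering.plan.strategy.target.file.max.bytes" -> "1073741824",
  // cap the number of groups per plan to bound how much data is rewritten at once
  "hoodie.clustering.plan.strategy.max.num.groups" -> "30"
)
These options can then be passed to the writer with .options(clusteringPlanOpts) alongside the regular write configuration.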
2.2 Executing Clustering
Read the Clustering plan and obtain the clusteringGroups, which mark the file groups that need to be clustered.
For each group, instantiate the appropriate strategy class with strategyParams (for example, sortColumns), and apply that strategy to rewrite the data.
Create a REPLACE commit and update the metadata in HoodieReplaceCommitMetadata.
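As a rough sketch of the execution side (the strategy class name below is an assumption based on Hudi's sort-and-size execution strategy and is not quoted from this article), the strategy and its parameters can be supplied as options as well:
// Sketch of execution-side configuration (assumed class name; verify against your Hudi version).
val clusteringExecOpts = Map(
  // strategy that sorts records by the given columns while rewriting each group to target-sized files
  "hoodie.clustering.execution.strategy.class" ->
    "org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy",
  // strategyParams: sort columns applied while rewriting the data
  "hoodie.clustering.plan.strategy.sort.columns" -> "column1,column2"
)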
The Clustering service is built on Hudi's MVCC design: new data can continue to be inserted while a Clustering operation runs in the background rewriting the data layout, with snapshot isolation guaranteed between concurrent readers and writers.
Note: Clustering cannot yet run against data that is receiving concurrent updates; support for concurrent updates is planned for the future.
2.3 Clustering configuration
Inline Clustering can be enabled with a few Spark write options, as shown in the following example:
import org.apache.hudi.QuickstartUtils._
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._

val df = // generate data frame
df.write.format("org.apache.hudi").
  options(getQuickstartWriteConfigs).
  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
  option(TABLE_NAME, "tableName").
  option("hoodie.parquet.small.file.limit", "0").
  option("hoodie.clustering.inline", "true").
  option("hoodie.clustering.inline.max.commits", "4").
  option("hoodie.clustering.plan.strategy.target.file.max.bytes", "1073741824").
  option("hoodie.clustering.plan.strategy.small.file.limit", "629145600").
  option("hoodie.clustering.plan.strategy.sort.columns", "column1,column2"). // optional, if sorting is needed as part of rewriting data
  mode(Append).
  save("dfs://location")
For setting up more advanced asynchronous Clustering pipelines, refer to the example here.
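As a very rough sketch (the async options below are assumptions about Hudi's asynchronous clustering configuration exposed in newer releases, not part of the original example; check the Hudi documentation for your version), asynchronous Clustering can be toggled in a similar way so the write commit is not blocked:
// Sketch only: schedule clustering asynchronously instead of running it inline
// (assumed option names and placeholder field names/path).
df.write.format("org.apache.hudi").
  option("hoodie.datasource.write.recordkey.field", "uuid").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.partitionpath.field", "partitionpath").
  option("hoodie.table.name", "tableName").
  option("hoodie.clustering.inline", "false").        // do not block the ingest commit
  option("hoodie.clustering.async.enabled", "true").  // assumed knob for async scheduling
  mode("append").
  save("dfs://location")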
3. Table query performance
We created a dataset from one partition of a production table, with about 20 million records and roughly 200GB on disk, in which each session_id spans multiple rows. Users always query this data with a session predicate, yet the data for a single session is spread across many data files because ingestion groups data by arrival time. The experiment below shows that Clustering on session improves data locality and reduces query execution time by more than 50%.
The query SQL is as follows:
spark.sql("select * from table where session_id=123")
3.1 Before Clustering
The query took 2.2 minutes to complete. Note that the number of output rows in the "scan parquet" part of the query plan includes all 20 million rows of the table.
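As a small aside (not part of the original experiment write-up), the scan behavior can be cross-checked: explain() prints the physical plan, while the actual "number of output rows" metric for the parquet scan is visible in the SQL tab of the Spark UI after the query runs.
// Print the physical plan for the session query; runtime row counts appear in the Spark UI SQL tab.
spark.sql("select * from table where session_id=123").explain()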
3.2 After Clustering
The query plan is similar to the one above, but thanks to improved data locality and predicate pushdown, Spark can prune far more rows. After Clustering, the same query outputs only about 110,000 rows (out of 20 million) when scanning the parquet files, which cuts the query time from 2.2 minutes to under a minute.
The following table summarizes the query performance improvement observed in experiments run with Spark3:
Table State    Query runtime    Num Records Processed    Num files on disk    Size of each file
Unclustered    130,673 ms       ~20M                     13,642               ~150 MB
Clustered      55,963 ms        ~110K                    294                  ~600 MB
Query run time was reduced by 60% after Clustering, and similar results were observed on other sample datasets. See the sample query plans and the RFC-19 performance evaluation for more details.
We expect even larger speedups on big tables, where, unlike the example above, the query run time is almost entirely dominated by actual I/O rather than by query planning.
This concludes the introduction to optimizing Apache Hudi's data layout with Clustering. Thanks for reading.