ByConity Technology Explained: Hive External Tables and Data Lakes

2025-01-18 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)11/24 Report--

As demand for big data processing grows, lower-cost storage and a more unified analytical view are becoming increasingly important. Because the data warehouse is the core decision support system of an enterprise, how it accesses external data storage is a question that must be considered during technology selection. For this reason, ByConity 0.2.0 released a series of capabilities for connecting to external storage, providing initial support for Hive external tables and data lake formats.

Support for Hive external tables

As enterprises rely more on data-driven decisions, the Hive data warehouse has become one of the preferred tools for many organizations. Combining Hive with ByConity in query scenarios provides more comprehensive decision support and a more complete data management model. Starting with version 0.2.0, ByConity can therefore access Hive data through external tables.

Principle and use

The main table engine of ByConity is CnchMergeTree. Connecting to external storage requires a different external table engine for each kind of storage. For example, a Hive external table is created with the CnchHive engine, which reads Hive data stored in Parquet and ORC formats.

Given the Hive Metastore URI, the Hive database, and the Hive table, ByConity fetches and parses the Hive table metadata and automatically infers the table structure (column names, types, partitions). At query time, the server lists the remote file system to find the files that need to be read and then sends those files to the workers, which read the data from the remote file system. The overall execution process is basically the same as for CnchMergeTree.

By configuring disk_cache, workers can store remote files in a local disk cache to speed up subsequent reads.
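The query flow described above (the server lists files, then hands them to workers) can be sketched in a few lines of Python. This is only an illustration: the function name, the file paths, and the hash-modulo assignment policy are assumptions, since the article does not say how ByConity distributes files.

```python
from collections import defaultdict
from zlib import crc32


def assign_files(files, workers):
    """Server side: map each listed remote file to one worker.

    The hash-modulo policy is an assumption made for this sketch.
    """
    plan = defaultdict(list)
    for path in sorted(files):
        idx = crc32(path.encode("utf-8")) % len(workers)
        plan[workers[idx]].append(path)
    return dict(plan)


# Hypothetical result of listing the remote file system for one Hive table.
listed = [
    "s3://bucket/sales/dt=2023-01-01/part-0.parquet",
    "s3://bucket/sales/dt=2023-01-01/part-1.parquet",
    "s3://bucket/sales/dt=2023-01-02/part-0.parquet",
]
workers = ["worker-0", "worker-1"]

plan = assign_files(listed, workers)
for worker, paths in sorted(plan.items()):
    print(worker, paths)
```

A deterministic mapping like this sends the same file to the same worker on every query, which is what lets a worker-local disk cache be hit on the next read.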

Performance optimization

In addition, CnchHive implements several important performance optimizations so that external table performance reaches the same level as Presto / Trino:

Support for partition pruning and fragment-level pruning

Partition pruning and fragment-level pruning are performance optimization techniques for Hive queries. With partition pruning, a query scans only the partitions that match the query conditions instead of the full table, which greatly reduces query execution time. For some file formats, such as Parquet, the amount of data read can be reduced further by reading the min/max values of each row group in a file and pruning the row groups that cannot match.
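The two pruning steps can be illustrated with a minimal Python sketch. The data structures and the example filter are hypothetical and are not taken from ByConity's implementation; the point is only that partitions are dropped by partition key first, and row groups are then dropped when their min/max range cannot overlap the filter.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    min_value: int  # minimum of the filtered column in this row group
    max_value: int  # maximum of the filtered column in this row group


def prune_partitions(partitions, wanted_dt):
    """Keep only partitions whose partition key matches dt = wanted_dt."""
    return [p for p in partitions if p["dt"] == wanted_dt]


def prune_row_groups(row_groups, lo, hi):
    """Keep row groups whose [min, max] range overlaps the filter range [lo, hi]."""
    return [rg for rg in row_groups if rg.max_value >= lo and rg.min_value <= hi]


partitions = [
    {"dt": "2023-01-01", "row_groups": [RowGroup(1, 100), RowGroup(101, 200)]},
    {"dt": "2023-01-02", "row_groups": [RowGroup(1, 50), RowGroup(51, 300)]},
]

# Hypothetical query: WHERE dt = '2023-01-02' AND amount BETWEEN 120 AND 150
selected = prune_partitions(partitions, "2023-01-02")
to_read = [rg for p in selected for rg in prune_row_groups(p["row_groups"], 120, 150)]
print(to_read)
```

Of the four row groups in the table, only one survives both steps and has to be read.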

Integration of Hive statistics with the optimizer

CnchHive integrates Hive statistics with the optimizer, which can then choose an execution plan automatically based on the statistics of the data. This makes query execution more intelligent and efficient and reduces the work of adjusting query plans by hand. Integrating statistics with the optimizer significantly improves query performance in the benchmark.
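As a rough illustration of how statistics can drive a plan choice, the sketch below picks the hash-join build side from estimated row counts. This is one textbook example of a statistics-based decision, not a description of ByConity's optimizer. The function name and the row counts are made up.

```python
def choose_build_side(row_counts, left, right):
    """Pick the table with fewer estimated rows as the hash-join build side."""
    return left if row_counts[left] <= row_counts[right] else right


# Hypothetical row counts, such as table statistics might provide.
row_counts = {"fact_sales": 3_000_000, "dim_date": 70_000}

build = choose_build_side(row_counts, "fact_sales", "dim_date")
print(build)
```

Without statistics the optimizer would have to guess; with them, the smaller table is hashed and the larger one is streamed.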

Benchmark (ByConity vs Trino)

TPC-DS (Transaction Processing Performance Council Decision Support) is a standardized decision support benchmark used to evaluate the performance of data warehouse systems. The CnchHive engine released in ByConity 0.2.0 runs the complete TPC-DS benchmark successfully and, thanks to its query plan optimizations, also performs well.

Test information:

Deployment model: Kubernetes deployment on AWS EC2 r5.12xlarge instances

Physical resource size: 4 workers (48 CPUs, 256 GB memory)

The parameters used in the test:

enable_optimizer: turns on the optimizer

dialect_type=ANSI: uses standard ANSI SQL

s3_use_read_ahead: turns off the S3 read-ahead function

remote_read_min_bytes_for_seek: if the gap between two reads is less than 1 MB, no seek is performed

disk_cache_mode=SKIP_DISK_CACHE: turns off the worker's local disk cache to simulate a pure cold-read scenario

parquet_parallel_read=1: uses parallel reading of Parquet files

enable_optimizer_fallback=0: returns an error directly if the optimizer fails to execute; used for test scenarios

exchange_enable_multipath_reciever=0: a tuning parameter for the execution layer

Figure note: the vertical axis is in milliseconds; the horizontal axis shows the TPC-DS query number.

Support for Hudi external tables

Main concepts of Hudi

In real business scenarios, the requirements for data lake data fall into two categories: read-oriented and write-oriented. Apache Hudi therefore provides two types of tables:

Copy On Write table: COW for short. This kind of Hudi table stores data in a columnar file format (such as Parquet). When data is updated, the entire Parquet file is rewritten, which is suitable for read-oriented workloads.

Merge On Read table: MOR for short. This kind of Hudi table stores data using a combination of columnar file formats (such as Parquet) and row-based file formats (such as Avro). Generally, MOR tables use columnar storage for historical data and row-based storage for incremental and updated data. When data is updated, it is first written to row-based files, which are then compacted, synchronously or asynchronously according to a configurable policy, to generate columnar files. This is suitable for write-oriented workloads.

Hudi provides different query methods for these two different types of tables and scenarios:

Additional note: a Read Optimized Query is an optimization of the snapshot query for MOR tables. It reduces the query latency caused by merging log data at query time, at the cost of the freshness of the queried data.
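The difference between a snapshot query and a read optimized query on a MOR table can be sketched with plain dictionaries. This is a simplified model, not Hudi's API: the function names are hypothetical, records are keyed strings, and deletes are ignored.

```python
def snapshot_read(base_rows, log_rows):
    """Merge columnar base data with newer row-based log data; the log wins per key."""
    merged = dict(base_rows)
    merged.update(log_rows)
    return merged


def read_optimized_read(base_rows, log_rows):
    """Read only the compacted base data and skip the logs (faster, possibly stale)."""
    return dict(base_rows)


def compact(base_rows, log_rows):
    """Compaction folds the logs into new base data and empties the log."""
    return snapshot_read(base_rows, log_rows), {}


base = {"k1": "v1", "k2": "v2"}           # data already in columnar files
log = {"k2": "v2-updated", "k3": "v3"}    # updates not yet compacted

print(snapshot_read(base, log))
print(read_optimized_read(base, log))

new_base, new_log = compact(base, log)
print(read_optimized_read(new_base, new_log))
```

Before compaction the read optimized result misses the update to k2 and the new key k3. After compaction both query types return the same data.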

Principle and use

Overview of the principle

ByConity implements snapshot queries on COW tables. MOR tables can be read once the JNI Reader is enabled. Hudi supports syncing table metadata to Hive Metastore, so ByConity can discover Hudi tables through Hive Metastore.

Ordinary COW tables can be queried directly using the CnchHive engine.

When the JNI Reader is enabled, ByConity can read Hudi COW and MOR tables through the CnchHudi table engine.

For Hudi MOR tables, ByConity introduces a JNI module that calls the Hudi Java client to read the data. The data read on the Java side is written to an in-memory Arrow table, and this in-memory data is exchanged between Java and C++ through the Arrow C Data Interface. The C++ side then converts the Arrow table into Block data for subsequent processing.

Quick start with the Hudi Docker demo

Set up the Hudi Docker environment by following https://hudi.apache.org/docs/docker_demo/. Once the ByConity cluster can connect to the Hive Metastore, you can create Hudi external tables in ByConity and query them.

Multi-Catalog

Transparent Catalog design

Multi-Catalog is designed to make it easier to connect to multiple external data catalogs and so strengthen ByConity's data lake analysis and external table query capabilities. In the data architecture, the core data objects are still only databases and tables. Catalog information is embedded into the database name during processing, and each database is handled according to its naming pattern. This design stays transparently compatible with previously created database and table metadata and only adds the new external data catalogs.
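The idea of embedding the catalog into the database name and routing by naming pattern can be sketched as follows. The separator, the function names, and the catalog and table names are all hypothetical; the article does not state the actual encoding ByConity uses.

```python
SEP = "$$"  # hypothetical separator between catalog and database name


def qualify(name, default_db="default"):
    """Turn 'catalog.db.table', 'db.table' or 'table' into (internal_db_name, table)."""
    parts = name.split(".")
    if len(parts) == 3:
        catalog, db, table = parts
        return f"{catalog}{SEP}{db}", table
    if len(parts) == 2:
        return parts[0], parts[1]
    if len(parts) == 1:
        return default_db, parts[0]
    raise ValueError(f"unsupported table name: {name}")


def route(internal_db):
    """Decide which metadata source serves a database, based on its naming pattern."""
    if SEP in internal_db:
        catalog, db = internal_db.split(SEP, 1)
        return ("external", catalog, db)
    return ("internal", None, internal_db)


db_name, table = qualify("hive_cat.tpcds.call_center")
print(route(db_name), table)

db_name, table = qualify("tpcds.call_center")
print(route(db_name), table)
```

Because a plain database name contains no catalog marker, existing two-part names keep resolving exactly as before, which is the transparency the design aims for.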

For example, after a Hive catalog has been created, a query whose table name includes the name of that Hive catalog follows the external catalog logic and obtains the database and table information from Hive Metastore. The query is as follows.

Multi-Catalog convenience

The Multi-Catalog design allows users to connect to multiple different storage and metadata services at the same time in the same Hive instance, without having to create a separate Hive instance for each storage system. This reduces the complexity of data management and querying and helps organizations manage and use their diverse data resources. Currently supported external catalogs are Hive, Apache Hudi, and AWS Glue.

Users can create a Catalog based on Hive and S3 storage using the

Then use three-part naming to access the Hive external table directly

You can also run queries to view information about external catalogs

Future planning

As more and more data needs to be integrated across different data stores, ByConity will continue to enrich its ability to connect to data lakes and external storage and to strengthen integration with upstream and downstream tools. The 2023 roadmap can be viewed in the discussion on GitHub: https://github.com/ByConity/ByConity/issues/26

We will also focus on the following: supporting more data lake formats such as Iceberg and Delta Lake; introducing a native reader to improve the efficiency of reading Parquet files; and optimizing the file allocation strategy so that the workload is distributed more evenly across workers.
