Example Analysis of the RCFile Data Storage Format in Hive


This article introduces the RCFile data storage format in Hive through example analysis. It should be a useful reference for interested readers; I hope you gain a lot from it.

Facebook introduced the Hive data warehouse at the 2010 ICDE (IEEE International Conference on Data Engineering) conference. Hive stores massive data in the Hadoop system and provides a set of database-like data storage and processing mechanisms. It uses a SQL-like language to manage and process data automatically: after statement parsing and transformation, it generates Hadoop-based MapReduce tasks and completes the data processing by executing those tasks. Figure 1 shows the system structure of the Hive data warehouse.

Figure 1 System structure of the Hive data warehouse

MapReduce-based data warehouses play an important role in very large-scale data analysis. For typical Web service providers, these analyses help them quickly understand dynamic user behavior and changing user needs. The data storage structure is one of the key factors that affect data warehouse performance. The commonly used file storage formats in Hadoop systems are TextFile, which stores text, and SequenceFile, which stores binary data; both are row-oriented formats. The paper "RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems", published by Facebook engineers, introduces an efficient data storage structure, RCFile (Record Columnar File), and applies it to Facebook's data warehouse Hive. Compared with the data storage structures of traditional databases, RCFile more effectively meets the four key requirements of a MapReduce-based data warehouse: fast data loading, fast query processing, highly efficient storage space utilization, and strong adaptivity to highly dynamic workload patterns.

Requirements for data warehouses

Based on analysis of Facebook's system characteristics and user data, a data warehouse in the MapReduce computing environment has four key requirements for its data storage structure.

Fast data loading

For Facebook's production data warehouse, loading data (writing data) quickly is critical. More than 20 TB of data is uploaded to Facebook's data warehouse every day. Because network and disk traffic during data loading interferes with normal query execution, it is necessary to shorten the data loading time.

Fast query processing

To serve real-time website requests and support the heavy read load submitted by highly concurrent users, query response time is critical. This requires the underlying storage structure to maintain high-speed query processing as the number of queries grows.

Highly efficient storage space utilization

Rapidly growing user activity constantly demands scalable storage capacity and computing power, and limited disk space requires that the storage of massive data be managed sensibly. In practice, the answer to this problem is to maximize disk space utilization.

Strong adaptivity to highly dynamic workload patterns

The same data set is served to users of different applications and analyzed in a variety of ways. Some analyses are routine processes executed periodically according to a fixed pattern, while others are queries initiated from internal platforms. Most workloads do not follow any regular pattern, so under limited storage space the underlying system must be highly adaptable to unpredictable, dynamic data processing rather than being tuned to one special workload pattern.

MapReduce storage policy

The key challenge in designing and implementing an efficient data storage structure for a MapReduce-based data warehouse is to meet the above four requirements in the MapReduce computing environment. In traditional database systems, three kinds of data storage structures have been widely studied: row storage, column storage, and the PAX hybrid storage structure. Each has its own strengths, but simply transplanting these database-oriented storage structures into a MapReduce-based data warehouse system cannot satisfy all of the requirements.

Row storage

As shown in figure 2, the advantages of a row storage structure on a Hadoop-based system are fast data loading and strong adaptability to dynamic workloads, because row storage guarantees that all fields of the same record are on the same cluster node, that is, in the same HDFS block. The disadvantages of row storage are equally obvious. It cannot support fast query processing: when a query touches only a few columns of a wide table, it cannot skip reading the unneeded columns. In addition, because fields from columns with different data types and value ranges are stored together, row storage does not easily achieve a high compression ratio, so space utilization is hard to improve substantially. A better compression ratio can be obtained with entropy coding and by exploiting column correlations, but such a complex storage implementation increases decompression overhead.

Figure 2 Example of row storage in an HDFS block
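
To make the column-skipping limitation concrete, here is a minimal Python sketch. It uses a CSV buffer as a stand-in for a row-oriented file and is purely illustrative; it is not Hive's TextFile or SequenceFile code. Projecting a single column still forces the reader to parse every complete record.

```python
import csv
import io

# Row storage: each record is written contiguously (here, one CSV line per record).
rows = [("alice", 30, "US"), ("bob", 25, "UK"), ("carol", 41, "DE")]
buf = io.StringIO()
csv.writer(buf).writerows(rows)

# To read only the second column we must still parse every whole record,
# because the fields of a record are interleaved in the byte stream.
buf.seek(0)
ages = [int(record[1]) for record in csv.reader(buf)]
print(ages)  # [30, 25, 41]
```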

Column storage

Column storage divides a table vertically and stores each column (or group of columns) separately, as shown in figure 3. A query can then skip reading the columns it does not need, and values from the same column compress well. Its weakness is that the fields of one record may be spread across different HDFS blocks on different nodes, so reconstructing a record can require expensive network transfers.

Figure 3 Example of column storage in an HDFS block

PAX hybrid storage

The PAX storage model (and the Data Morphing technique derived from it) uses a hybrid layout to improve CPU cache performance. For a record whose fields come from different columns, PAX places all of the fields on a single disk page. Within each disk page, PAX uses a mini page to store all fields belonging to one column, and a page header to store pointers to the mini pages. Like row storage, PAX adapts well to a variety of dynamic queries. However, it cannot meet the needs of large-scale distributed systems for high storage utilization and fast query processing, for the following reasons. First, PAX does not address data compression, which matters little for cache optimization but is critical for large-scale data processing systems; its layout does, however, open the possibility of column-wise data compression. Second, PAX cannot improve I/O performance, because it does not change the actual content of a page; this limitation makes fast query processing difficult during large-scale data scans. Third, PAX uses a fixed-size page as its basic unit of data organization, and with such a fixed size it cannot efficiently store fields of widely varying size in a massive data processing system.

This article describes the implementation of the RCFile data storage structure on the Hadoop system. The structure emphasizes three points: first, a table stored as RCFile is horizontally partitioned into multiple row groups, and each row group is then vertically partitioned so that each column is stored separately; second, RCFile applies column-wise data compression within each row group and provides a lazy decompression technique to avoid unnecessary column decompression during query execution; third, RCFile supports a flexible row group size, which involves a tradeoff between data compression performance and query performance.

Design and implementation of RCFile

The RCFile (Record Columnar File) storage structure follows the design principle of "partition horizontally first, then partition vertically", which comes from PAX. It combines the advantages of row storage and column storage. First, like row storage, RCFile guarantees that the data of one row resides on one node, so the cost of tuple reconstruction is low. Second, like column storage, RCFile can exploit column-wise data compression and skip unnecessary column reads. Figure 4 shows an example of RCFile storage within an HDFS block.

Figure 4 Example of RCFile storage in an HDFS block

Data format

RCFile is designed and implemented on top of the HDFS distributed file system. As shown in figure 4, RCFile stores a table in the following data format.

RCFile is laid out according to the HDFS architecture: a table can span multiple HDFS blocks.

Within each HDFS block, RCFile organizes records in units of row groups. That is, all the records stored in an HDFS block are divided into row groups. For a given table, all row groups have the same size, and an HDFS block may contain one or more row groups.

A row group consists of three parts. The first part is a synchronization marker at the head of the row group, mainly used to separate two consecutive row groups within an HDFS block. The second part is the metadata header of the row group, which stores information about the row group's contents: the number of records in the row group, the number of bytes in each column, and the number of bytes in each field of each column. The third part is the table data segment, that is, the actual column-stored data; here, all fields of the same column are stored consecutively. As figure 4 shows, the data segment first stores all the fields of column A, then all the fields of column B, and so on.
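
The layout described above can be summarized with a small, purely illustrative Python sketch (the dictionary structure and the function name build_row_group are hypothetical; the real RCFile binary format is more involved):

```python
ROW_GROUP_RECORDS = 2  # records per row group; the real limit is byte-based (about 4 MB)

records = [
    ("a1", "b1", "c1"),
    ("a2", "b2", "c2"),
    ("a3", "b3", "c3"),
]

def build_row_group(rows):
    columns = list(zip(*rows))  # vertical partition inside the row group
    return {
        "sync_marker": b"SYNC",  # separates two consecutive row groups in an HDFS block
        "metadata_header": {
            "num_records": len(rows),
            "column_bytes": [sum(len(f) for f in col) for col in columns],
            "field_lengths": [[len(f) for f in col] for col in columns],
        },
        # data segment: all fields of column A, then all fields of column B, and so on
        "data_segment": [list(col) for col in columns],
    }

# horizontal partition of the table into row groups
row_groups = [build_row_group(records[i:i + ROW_GROUP_RECORDS])
              for i in range(0, len(records), ROW_GROUP_RECORDS)]
print(row_groups[0]["data_segment"])  # [['a1', 'a2'], ['b1', 'b2'], ['c1', 'c2']]
```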

Compression mode

In each row group of RCFile, the metadata header and the table data segment are compressed separately.

For the metadata header, RCFile uses the RLE (Run Length Encoding) algorithm to compress the data. Because the length values of all fields in the same column are stored consecutively in this part, the RLE algorithm can find long runs of repeated values, especially when field lengths are fixed.

The table data segment is not compressed as a single unit; instead, each column is compressed independently using the Gzip algorithm. RCFile uses the heavyweight Gzip algorithm to obtain a better compression ratio, and it does not use RLE here because the column data is not sorted. In addition, thanks to the lazy decompression strategy, RCFile does not need to decompress all columns when processing a row group, so the relatively high cost of Gzip decompression is reduced.
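
As a rough illustration of these two compression paths (a toy sketch, not Hive's implementation): field lengths in the metadata header are run-length encoded, while each column of the data segment is gzip-compressed on its own.

```python
import gzip
from itertools import groupby

def rle_encode(values):
    # Fixed-width columns produce long runs of identical length values,
    # e.g. [8, 8, 8, 8, 8] -> [(8, 5)], which is where RLE pays off.
    return [(value, len(list(group))) for value, group in groupby(values)]

field_lengths = [8, 8, 8, 8, 8, 11, 11]
print(rle_encode(field_lengths))  # [(8, 5), (11, 2)]

# Each column is compressed independently, so a reader can later choose to
# decompress only the columns that a query actually touches.
column_a = b"".join(b"row-%04d" % i for i in range(1000))
column_b = b"".join(b"%011d" % (i % 3) for i in range(1000))
compressed = [gzip.compress(column_a), gzip.compress(column_b)]
print([len(c) for c in compressed])
```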

Although RCFile currently uses the same compression algorithm for all columns of the table data, it might be better to compress different columns with different algorithms. One piece of future work for RCFile may be to adaptively choose the best compression algorithm for each column according to its data type and data distribution.

Data addition

RCFile does not support arbitrary data writing operations and only provides an append interface, because the underlying HDFS currently only supports data appending to the end of the file. The data append method is described below.

RCFile creates and maintains an in-memory column holder for each column. When a record is appended, all of its fields are scattered, and each field is appended to its corresponding column holder. In addition, RCFile records the metadata for each field in the metadata header.

RCFile provides two parameters to control how many records are cached in memory before being written to disk. One parameter is the limit on the number of records, and the other is the size limit of the memory cache.

When either limit is reached, RCFile first compresses the metadata header and writes it to disk, then compresses each column holder separately and writes the compressed column holders out as one row group in the underlying file system.
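
A minimal sketch of this append path is shown below. The class name ToyRCFileAppender and its parameters are hypothetical; they only illustrate the mechanism and are not Hive's API.

```python
import gzip

class ToyRCFileAppender:
    def __init__(self, num_columns, record_limit=1000, memory_limit=4 * 1024 * 1024):
        self.record_limit = record_limit      # limit on buffered record count
        self.memory_limit = memory_limit      # limit on buffered bytes
        self.column_holders = [[] for _ in range(num_columns)]
        self.buffered_records = 0
        self.buffered_bytes = 0
        self.flushed_row_groups = []

    def append(self, record):
        # Scatter the record: each field goes to its column's holder.
        for holder, field in zip(self.column_holders, record):
            holder.append(field)
            self.buffered_bytes += len(field)
        self.buffered_records += 1
        if (self.buffered_records >= self.record_limit
                or self.buffered_bytes >= self.memory_limit):
            self.flush()

    def flush(self):
        if self.buffered_records == 0:
            return
        row_group = {
            # the metadata header would be compressed (e.g. with RLE) before writing
            "metadata": {"num_records": self.buffered_records,
                         "field_lengths": [[len(f) for f in holder]
                                           for holder in self.column_holders]},
            # each column holder is compressed separately
            "columns": [gzip.compress(b"".join(holder))
                        for holder in self.column_holders],
        }
        self.flushed_row_groups.append(row_group)   # stand-in for a write to HDFS
        self.column_holders = [[] for _ in self.column_holders]
        self.buffered_records = 0
        self.buffered_bytes = 0

appender = ToyRCFileAppender(num_columns=2, record_limit=2)
appender.append((b"a1", b"b1"))
appender.append((b"a2", b"b2"))          # hits the record limit and flushes
print(len(appender.flushed_row_groups))  # 1
```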

Data reading and Lazy decompression

In the MapReduce framework, a mapper processes the row groups in an HDFS block sequentially. When processing a row group, RCFile does not need to read the entire content of the row group into memory.

Instead, it reads only the metadata header and the columns required by the given query, so it can skip unnecessary columns and gain the I/O advantage of column storage. For example, the table tbl(c1, c2, c3, c4) has four columns, and we run the query "SELECT c1 FROM tbl WHERE c4 = 1". For each row group, RCFile reads only the content of columns c1 and c4. After the metadata header and the required column data have been loaded into memory, they need to be decompressed. The metadata header is always decompressed and kept in memory until RCFile moves on to the next row group. However, RCFile does not decompress all of the loaded columns; instead, it uses a lazy decompression technique.

Lazy decompression means that a column is not decompressed in memory until RCFile determines that its data is actually useful for query execution. Because queries use a variety of WHERE conditions, lazy decompression is very useful. If the WHERE condition is not satisfied by any record in a row group, RCFile does not decompress the columns that do not appear in the WHERE condition. For example, in the query above, column c4 is decompressed in every row group; however, if no field of column c4 in a particular row group has the value 1, column c1 of that row group does not need to be decompressed.
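
The following sketch puts column pruning and lazy decompression together for a query shaped like "SELECT c1 FROM tbl WHERE c4 = 1". It reuses the toy row-group shape from the append sketch above (a metadata dictionary plus one gzip-compressed blob per column) and is illustrative only, not Hive's reader code.

```python
import gzip

def scan_row_group(row_group):
    field_lengths = row_group["metadata"]["field_lengths"]

    def decompress_column(idx):
        data = gzip.decompress(row_group["columns"][idx])
        fields, offset = [], 0
        for length in field_lengths[idx]:
            fields.append(data[offset:offset + length])
            offset += length
        return fields

    # Column pruning: only c1 (index 0) and c4 (index 3) are ever read;
    # c2 and c3 are skipped entirely.
    c4 = decompress_column(3)          # the WHERE column must always be decompressed
    matching = [i for i, value in enumerate(c4) if value == b"1"]
    if not matching:
        return []                      # lazy decompression: c1 is never touched
    c1 = decompress_column(0)
    return [c1[i] for i in matching]

# Tiny 4-column row group with two records: (a1, b1, c1, 0) and (a2, b2, c2, 1).
cols = [[b"a1", b"a2"], [b"b1", b"b2"], [b"c1", b"c2"], [b"0", b"1"]]
rg = {"metadata": {"field_lengths": [[len(f) for f in col] for col in cols]},
      "columns": [gzip.compress(b"".join(col)) for col in cols]}
print(scan_row_group(rg))  # [b'a2']
```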

Row group size

I/O performance is a major concern of RCFile, so row groups need to be large and adjustable in size. The choice of row group size involves the following considerations.

A large row group compresses data more efficiently than a small one. However, observations of Facebook's daily workloads show that once the row group size reaches a certain threshold, increasing it further does not improve the compression ratio achieved by the Gzip algorithm.

Larger row groups improve data compression efficiency and reduce storage, so if reducing storage space is a strong requirement, small row groups are not recommended. Note, however, that once the row group size exceeds 4 MB, the compression ratio of the data levels off.

Although a larger row group helps reduce the storage size of a table, it may hurt read performance, because it diminishes the benefit of lazy decompression. A larger row group also occupies more memory, which affects other MapReduce jobs running concurrently. Weighing storage space against query efficiency, Facebook chose 4 MB as the default row group size, while still allowing users to configure this parameter themselves.
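
The levelling-off behaviour is easy to reproduce on synthetic data with a quick, purely illustrative experiment. Exact numbers depend heavily on the data and on gzip's window size, so this shows only the general trend, not Facebook's measurements.

```python
import gzip
import random

random.seed(0)
# Roughly 12 MB of synthetic column data with plenty of repetition.
column = b"".join(b"%06d" % random.randint(0, 99) for _ in range(2_000_000))

# Compress the same data in chunks of increasing size: the ratio improves at
# first, then stays roughly constant as the chunk (row group) grows.
for size in (64 * 1024, 1 * 1024 * 1024, 4 * 1024 * 1024, 8 * 1024 * 1024):
    chunks = [column[i:i + size] for i in range(0, len(column), size)]
    compressed = sum(len(gzip.compress(chunk)) for chunk in chunks)
    print(f"row group {size >> 10:>5} KB -> compression ratio {len(column) / compressed:.2f}")
```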

Thank you for reading this article carefully. I hope this example analysis of the RCFile data storage format in Hive has been helpful to you.
