This article introduces the Hive file format RCFile and how it is applied. The content is fairly detailed; interested readers may find it a useful reference.
As an open-source implementation of MapReduce, Hadoop has always counted among its advantages the ability to parse file formats dynamically at run time and to load data several times faster than an MPP database. However, the MPP database community has long criticized Hadoop for its high serialization and deserialization costs, which stem from file formats that are not built for any specific purpose.
1. A brief introduction to Hadoop file formats
At present, the following file formats are popular in Hadoop:
(1) SequenceFile
SequenceFile is a binary file format provided by the Hadoop API that serializes data to a file as key/value pairs. The binary file is serialized and deserialized internally using Hadoop's standard Writable interface, and it is compatible with MapFile in the Hadoop API. SequenceFile in Hive inherits from the Hadoop API's SequenceFile, but its key is left empty and the actual data is stored in the value, in order to avoid MR's sort step during the map phase. If you write a SequenceFile with the Java API and want Hive to read it, be sure to store the data in the value field; otherwise you will need to implement custom InputFormat and OutputFormat classes to read that SequenceFile.
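To make that last point concrete, here is a minimal sketch of writing such a file with the Java API so that Hive can read it. The output path, the row contents, and the choice of NullWritable as the "empty" key are my own illustrative assumptions, not taken from the article:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class WriteSeqFileForHive {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/tmp/hive_demo/part-00000");   // hypothetical path

        // Hive ignores the key, so keep it empty (NullWritable here) and put the
        // whole delimited row into the value.
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, out, NullWritable.class, Text.class);
        try {
            writer.append(NullWritable.get(), new Text("1\t2013-09-30\t13800000000"));
            writer.append(NullWritable.get(), new Text("2\t2013-09-30\t13900000000"));
        } finally {
            writer.close();
        }
    }
}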
(2) RCFile
RCFile is a column-oriented data format developed by Hive. It follows the design principle of "partition horizontally first, then vertically by column": when a query does not care about certain columns, those columns are skipped during IO. Note, however, that in the map phase the remote copy of an RCFile still transfers the entire data block, and after the block has been copied to the local directory, RCFile does not jump straight over the unwanted columns to the columns that need to be read; it does so by scanning the header of each row group, because the header at the level of the whole HDFS block does not record in which row group each column starts and ends. As a result, when all columns are read, RCFile does not perform as well as SequenceFile.
An example of row-oriented storage within an HDFS block
An example of column-oriented storage within an HDFS block
An example of RCFile storage within an HDFS block
(3) Avro
Avro is a binary file format designed for data-intensive workloads. Its file format is more compact, and Avro provides better serialization and deserialization performance when reading large amounts of data. Avro data files carry their schema with them, so developers do not need to implement their own Writable objects at the API level. Several Hadoop subprojects have recently added support for the Avro data format, such as Pig, Hive, Flume, Sqoop, and HCatalog.
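As a small illustration of the self-describing schema (the record schema, field names, and output file below are invented for the example, not taken from the article), writing an Avro data file with the generic Java API looks roughly like this:

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumWriter;

public class AvroWriteSketch {
    public static void main(String[] args) throws Exception {
        // The schema travels with the data file, so readers need no extra Writable classes.
        String schemaJson = "{\"type\":\"record\",\"name\":\"HttpLog\","
                + "\"fields\":[{\"name\":\"phone\",\"type\":\"string\"},"
                + "{\"name\":\"tm\",\"type\":\"long\"}]}";
        Schema schema = new Schema.Parser().parse(schemaJson);

        GenericRecord rec = new GenericData.Record(schema);
        rec.put("phone", "13800000000");
        rec.put("tm", 1380470400000L);

        DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<GenericRecord>(schema);
        DataFileWriter<GenericRecord> fileWriter = new DataFileWriter<GenericRecord>(datumWriter);
        fileWriter.create(schema, new File("httplog.avro")); // schema is embedded in the file header
        fileWriter.append(rec);
        fileWriter.close();
    }
}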
(4) Text format
In addition to the three binary formats above, text-format data is also frequently encountered in Hadoop, for example TextFile, XML, and JSON. Besides consuming more disk space, text formats generally carry a parsing overhead that is dozens of times higher than binary formats, and XML and JSON are even more expensive to parse than TextFile, so it is strongly recommended not to use these formats for storage in production systems. If you need to output these formats, do the conversion on the client side. Text format is often used for log collection and database import; Hive's default configuration also uses text format, and it is easy to forget to compress it, so make sure you use the right format. Another drawback of text formats is that they carry no types or schema. Take numerical data such as sales amounts or profits, or date-time data: if saved as text, the strings vary in length and may contain negative signs, so MR cannot sort them correctly, and they often have to be preprocessed into a binary format that carries a schema. This adds an unnecessary preprocessing step and wastes storage resources.
(5) External formats
Hadoop actually supports virtually any file format, as long as you can implement the corresponding RecordWriter and RecordReader. Database-style formats are also often stored in Hadoop, for example HBase, MySQL, Cassandra, and MongoDB formats. These are generally used to avoid large amounts of data movement and to allow fast loading. Their serialization and deserialization are handled by the clients of those database formats, the file locations and data layout (Data Layout) are not controlled by Hadoop, and their file splits are not cut according to the HDFS block size (blocksize).
2. Why do you need RCFile
Facebook introduced its data warehouse Hive at the 2010 ICDE (IEEE International Conference on Data Engineering) conference. Hive stores massive amounts of data on the Hadoop system and provides a set of database-like data storage and processing mechanisms. It uses a SQL-like language to manage and process data: after statement parsing and transformation it ultimately generates Hadoop-based MapReduce tasks, and data processing is completed by executing these tasks. The figure below shows the system architecture of the Hive data warehouse.
The storage scalability challenges Facebook faces in its data warehouse are unique. It stores more than 300PB of data in its Hive-based data warehouse, which grows by about 600TB of new data every day, and the amount of data stored in this warehouse tripled over the past year. Given this growth trend, storage efficiency is the most important concern for Facebook's data warehouse infrastructure now and for some time to come. The paper "RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems", published by Facebook engineers, introduces an efficient data storage structure, RCFile (Record Columnar File), and applies it to Facebook's data warehouse Hive. Compared with the data storage structures of traditional databases, RCFile more effectively satisfies the four key requirements of a MapReduce-based data warehouse: fast data loading, fast query processing, highly efficient storage space utilization, and strong adaptivity to highly dynamic workload patterns. RCFile is widely used in Facebook's data analysis system Hive. First, RCFile offers data loading speed and load adaptability comparable to row storage; second, RCFile's read optimization avoids unnecessary column reads during table scans, and tests show that in most cases it performs better than the other structures; third, RCFile compresses data along the column dimension, so it can effectively improve storage space utilization.
To improve storage space utilization, data generated by Facebook's production applications has been stored in the RCFile structure since 2010, and data sets previously saved in row-oriented structures (SequenceFile/TextFile) have also been migrated to the RCFile format. In addition, Yahoo has integrated RCFile into its Pig data analysis system, and RCFile is also being used in another Hadoop-based data management system, Howl (http://wiki.apache.org/pig/Howl). Moreover, according to discussion in the Hive development community, RCFile has been successfully integrated into other MapReduce-based data analysis platforms. There is good reason to believe that RCFile, as a data storage standard, will continue to play an important role in large-scale data analysis in MapReduce environments.
3. Introduction to RCFile
The first storage format used when data in Facebook's data warehouse is loaded into a table was the Record Columnar File format (RCFile), developed by Facebook itself. RCFile is a hybrid columnar storage format: it allows row-wise queries while providing the compression efficiency of column storage. Its core idea is to first split a Hive table horizontally into multiple row groups, and then split each row group vertically by column, so that the data of each column within a row group is stored as a contiguous block on disk.
When all the columns of a row group have been written to disk, RCFile compresses the data column by column using an algorithm such as zlib or lzo. When column data is read, a lazy decompression strategy is used: if a user's query involves only some of the columns of a table, RCFile skips decompressing and deserializing the columns that are not needed. On a representative example chosen from Facebook's data warehouse, RCFile achieves a compression ratio of about 5x.
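To make the lazy decompression idea concrete, the sketch below reads an RCFile directly with Hive's low-level reader and asks for only two columns. The path and the column indexes are invented, and the signatures of RCFile.Reader and ColumnProjectionUtils used here follow the Hive 0.x line and may differ in later versions:

import java.util.ArrayList;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.RCFile;
import org.apache.hadoop.hive.serde2.ColumnProjectionUtils;
import org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

public class ReadRCFileColumns {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Ask for columns 0 and 2 only; the other columns' compressed bytes are
        // skipped rather than decompressed (lazy decompression).
        ArrayList<Integer> wantedCols = new ArrayList<Integer>();
        wantedCols.add(0);
        wantedCols.add(2);
        ColumnProjectionUtils.setReadColumnIDs(conf, wantedCols);

        Path file = new Path("/tmp/rcfile_demo/part-00000");   // hypothetical path
        FileSystem fs = file.getFileSystem(conf);

        RCFile.Reader reader = new RCFile.Reader(fs, file, conf);
        LongWritable rowId = new LongWritable();
        BytesRefArrayWritable row = new BytesRefArrayWritable();
        while (reader.next(rowId)) {
            reader.getCurrentRow(row);            // only the projected columns are materialized
            Text col0 = new Text();
            col0.set(row.get(0).getData(), row.get(0).getStart(), row.get(0).getLength());
            System.out.println(col0);
        }
        reader.close();
    }
}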
4. Beyond RCFile, what is the next step?
As the amount of data stored in the data warehouse continued to grow, engineers on the Facebook team began to study techniques for improving compression efficiency. The research focuses on column-level encoding methods, such as run-length encoding, dictionary encoding, frame-of-reference encoding, and numeric encodings that can reduce logical redundancy at the column level before the general-purpose compression step. Facebook has also experimented with new column types (for example, JSON is a widely used format inside Facebook; storing JSON data in a structured way both supports efficient queries and reduces the redundancy of JSON metadata storage). Facebook's experiments show that, applied properly, column-level encoding can significantly improve the compression ratio of RCFile.
At the same time, Hortonworks was trying similar ideas to improve Hive's storage format. Hortonworks' engineering team designed and implemented ORCFile (including the storage format and the read/write interfaces), which provided a good starting point for the design and implementation of a new storage format for Facebook's data warehouse.
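As a toy illustration of the first of those encodings, run-length encoding (this sketch is my own, not Facebook's implementation), a column of repeated values can be stored as (value, count) pairs:

import java.util.ArrayList;
import java.util.List;

public class RunLengthSketch {
    /** Encodes a column of values as (value, runLength) pairs. */
    static List<long[]> encode(long[] column) {
        List<long[]> runs = new ArrayList<long[]>();
        int i = 0;
        while (i < column.length) {
            int j = i;
            while (j < column.length && column[j] == column[i]) {
                j++;
            }
            runs.add(new long[] { column[i], j - i });  // value, run length
            i = j;
        }
        return runs;
    }

    public static void main(String[] args) {
        // A sorted or low-cardinality column compresses very well this way.
        long[] dates = { 20130930, 20130930, 20130930, 20131001, 20131001 };
        for (long[] run : encode(dates)) {
            System.out.println("value=" + run[0] + " count=" + run[1]);
        }
    }
}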
For an introduction to ORCFile, please see here: http://yanbohappy.sinaapp.com/?p=478
As for performance evaluation, I do not have the environment to run benchmarks for the moment, so here is a screenshot from a speaker's slides at a Hive technology summit:
5. How to generate RCFile files
After everything said above, you should already know that RCFile is mainly used to improve Hive's query efficiency. So how do you generate files in this format?
(1) Convert directly in Hive, using an INSERT from a TextFile table
For example:
insert overwrite table http_RCTable partition (dt='2013-09-30') select palleid, tm, idate, phone from tmp_testp where dt='2013-09-30';
(2) Generate it with MapReduce
So far MapReduce does not provide a built-in API for RCFile, but other projects in the Hadoop ecosystem do, such as Pig, Hive, and HCatalog. The reason is that for typical MapReduce application scenarios RCFile has no significant advantage over other file formats such as TextFile.
To avoid reinventing the wheel, the MapReduce code below that generates an RCFile calls the relevant Hive and HCatalog classes. Note that when you test the code, your Hadoop, Hive, and HCatalog versions should match; otherwise... you know what happens.
For example, since I use hive-0.10.0+198-1.cdh5.4.0, I should download the matching version from: http://archive.cloudera.com/cdh5/cdh/4/
PS: the code below has been tested and works.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable;
import org.apache.hadoop.hive.serde2.columnar.BytesRefWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.hcatalog.rcfile.RCFileMapReduceInputFormat;
import org.apache.hcatalog.rcfile.RCFileMapReduceOutputFormat;

public class TextToRCFile extends Configured implements Tool {

    public static class Map extends Mapper<Object, Text, NullWritable, BytesRefArrayWritable> {

        private byte[] fieldData;
        private int numCols;
        private BytesRefArrayWritable bytes;

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // Number of columns of the target RCFile, passed in through the job configuration.
            numCols = context.getConfiguration().getInt("hive.io.rcfile.column.number.conf", 0);
            bytes = new BytesRefArrayWritable(numCols);
        }

        public void map(Object key, Text line, Context context) throws IOException, InterruptedException {
            bytes.clear();
            String[] cols = line.toString().split("\\|");   // input rows are '|'-delimited text
            System.out.println("SIZE:" + cols.length);
            for (int i = 0; i < numCols; i++) {
                fieldData = cols[i].getBytes("UTF-8");
                BytesRefWritable cu = new BytesRefWritable(fieldData, 0, fieldData.length);
                bytes.set(i, cu);
            }
            // The key is unused; the whole row goes into the columnar value.
            context.write(NullWritable.get(), bytes);
        }
    }
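The listing above is cut off at this point on the source page, so the job driver is missing. A minimal, assumed completion that wires the mapper to HCatalog's RCFileMapReduceOutputFormat could look roughly like this (argument handling is simplified and may differ from the original post):

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 3) {
            System.err.println("Usage: TextToRCFile <numCols> <input path> <output path>");
            return 2;
        }
        int numCols = Integer.parseInt(otherArgs[0]);
        // Make the column count visible to the mapper's setup() above.
        conf.setInt("hive.io.rcfile.column.number.conf", numCols);

        Job job = new Job(conf, "TextToRCFile");
        job.setJarByClass(TextToRCFile.class);
        job.setMapperClass(Map.class);
        job.setMapOutputKeyClass(NullWritable.class);
        job.setMapOutputValueClass(BytesRefArrayWritable.class);
        job.setNumReduceTasks(0);                       // map-only job

        job.setOutputFormatClass(RCFileMapReduceOutputFormat.class);
        RCFileMapReduceOutputFormat.setColumnNumber(job.getConfiguration(), numCols);

        FileInputFormat.addInputPath(job, new Path(otherArgs[1]));
        RCFileMapReduceOutputFormat.setOutputPath(job, new Path(otherArgs[2]));

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new TextToRCFile(), args));
    }
}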