
What are the Hadoop file formats?


This article introduces the file formats commonly used in Hadoop, compares their storage and parsing overhead, and offers some thoughts on future Hadoop file formats. It should be a useful reference for readers choosing a storage format.

File formats in Hadoop

1. SequenceFile

SequenceFile is a binary file format provided by the Hadoop API that serializes data as key/value pairs. Records are serialized and deserialized internally with Hadoop's standard Writable interface, and the format is compatible with MapFile in the Hadoop API. The SequenceFile used by Hive inherits from the Hadoop API's SequenceFile, but its key is left empty and the actual record is stored in the value, which lets the map phase of MR avoid sorting on the key. If you write a SequenceFile with the Java API and want Hive to read it, make sure to store the data in the value field; otherwise you will need custom InputFormat and OutputFormat classes to read the file.

Figure 1: SequenceFile file structure
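As an illustration, here is a minimal sketch (not from the original article) of writing a Hive-readable SequenceFile with the Hadoop Java API, assuming a NullWritable key and a Text value; the output path and the sample record are hypothetical.

// Minimal sketch: write a SequenceFile that Hive can read (empty key, data in the value).
// The output path and the sample record are hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class HiveSequenceFileWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path out = new Path("/tmp/orders.seq");
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(out),
                SequenceFile.Writer.keyClass(NullWritable.class),
                SequenceFile.Writer.valueClass(Text.class))) {
            // Hive only looks at the value, so the whole delimited row goes there.
            writer.append(NullWritable.get(), new Text("1\tO\t173665.47"));
        }
    }
}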

2. RCFile

RCFile is a column-oriented data format developed for Hive. It follows a "partition horizontally first, then vertically" design: the data is first split into row groups, and each row group is then stored column by column. When a query does not need certain columns, those columns can be skipped at the IO level. Note, however, that in the map phase the copy from the remote node still transfers the whole data block. Only after the block has been copied locally does RCFile skip the unneeded columns, and it does not jump straight to the wanted columns; it scans the header of each row group, because the header at the HDFS block level does not record in which row group each column starts and ends. As a result, when all columns have to be read, RCFile performs worse than SequenceFile.

Figure 2: RCFile file structure
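The idea of skipping columns within a row group can be sketched as follows. This is a conceptual illustration, not the actual RCFile reader; the layout (per-column byte lengths recorded in a row-group header) is simplified.

// Conceptual sketch of column skipping inside one row group: the header stores the
// serialized length of every column, so unwanted columns can be skipped without parsing.
import java.io.DataInputStream;
import java.io.IOException;

public class RowGroupColumnSkipper {
    /** Reads only the column at wantedIndex; skips the bytes of every other column. */
    public static byte[] readColumn(DataInputStream in, int[] columnLengths, int wantedIndex)
            throws IOException {
        for (int col = 0; col < columnLengths.length; col++) {
            if (col == wantedIndex) {
                byte[] data = new byte[columnLengths[col]];
                in.readFully(data);
                return data;
            }
            // IO-level skip, no deserialization (skipBytes may skip fewer bytes in general;
            // sufficient for this sketch).
            in.skipBytes(columnLengths[col]);
        }
        throw new IllegalArgumentException("column index out of range: " + wantedIndex);
    }
}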

3. Avro

Avro is a binary file format designed for data-intensive applications. Its files are compact, and Avro offers good serialization and deserialization performance when reading large amounts of data. Avro data files carry their schema with them, so developers do not need to implement their own Writable objects at the API level. Many Hadoop subprojects support the Avro data format, including Pig, Hive, Flume, Sqoop, and HCatalog.

Figure 3: Avro MR file format
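As an illustration, here is a minimal sketch (not from the original article) of writing an Avro data file with the generic Java API; the schema, field names, and output file are hypothetical.

// Minimal sketch: define a schema and write one record to an Avro data file.
// The schema, field names, and output file are hypothetical.
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroOrderWriter {
    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Order\",\"fields\":["
          + "{\"name\":\"o_custkey\",\"type\":\"int\"},"
          + "{\"name\":\"o_orderstatus\",\"type\":\"string\"}]}");
        GenericRecord record = new GenericData.Record(schema);
        record.put("o_custkey", 1);
        record.put("o_orderstatus", "O");
        // The schema is embedded in the file header, so readers need no extra metadata.
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, new File("orders.avro"));
            writer.append(record);
        }
    }
}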

4. Text format

In addition to the three binary formats above, text-format data such as TextFile, XML, and JSON is also frequently encountered in Hadoop. Besides consuming more disk space, text formats generally cost tens of times more to parse than binary formats, and XML and JSON are even more expensive than plain TextFile, so using them for storage in a production system is strongly discouraged. If you need to output these formats, do the conversion on the client side. Text format is commonly used for log collection and database import, and it is also Hive's default storage format, where it is easy to forget to enable compression, so make sure the format is configured correctly. Another drawback of text format is that it carries no types or schema: numeric data such as sales amounts or profits, or date-time values, are stored as strings of varying length, possibly with negative signs, so MR cannot sort them directly. They usually have to be preprocessed into a binary format with a schema, which adds an unnecessary preprocessing step and wastes storage.
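A small sketch (not from the original article) of why untyped text sorts incorrectly: comparing numbers as strings orders them lexicographically, which breaks for values of different lengths or with negative signs.

// Sketch: lexicographic (text) ordering vs numeric ordering of the same values.
import java.util.Arrays;

public class TextSortPitfall {
    public static void main(String[] args) {
        String[] asText = {"9", "10", "-5", "2"};
        Arrays.sort(asText);                              // string comparison
        System.out.println(Arrays.toString(asText));      // [-5, 10, 2, 9]  (wrong order)

        Integer[] asNumbers = {9, 10, -5, 2};
        Arrays.sort(asNumbers);                           // numeric comparison
        System.out.println(Arrays.toString(asNumbers));   // [-5, 2, 9, 10]
    }
}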

5. External format

Hadoop can support virtually any file format, as long as the corresponding RecordWriter and RecordReader are implemented. Database formats are also commonly stored in Hadoop, for example HBase, MySQL, Cassandra, and MongoDB. These are generally used to avoid large data movements and to allow fast loading. Their serialization and deserialization are handled by the clients of those databases, and the file locations and data layout (Data Layout) are not controlled by Hadoop; their file splits are not cut along the HDFS block size (blocksize) either.
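As a sketch of what plugging in an external format involves, here is a skeletal custom RecordReader for the old mapred API; the class name is hypothetical and the actual fetching logic is left as a stub.

// Skeleton of a custom RecordReader for an external data source (hypothetical class).
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.RecordReader;

public class ExternalFormatRecordReader implements RecordReader<LongWritable, Text> {
    private long pos = 0;

    @Override
    public boolean next(LongWritable key, Text value) throws IOException {
        // Stub: fetch the next record from the external source, fill key/value,
        // advance pos, and return true; return false when the source is exhausted.
        return false;
    }

    @Override public LongWritable createKey()   { return new LongWritable(); }
    @Override public Text createValue()         { return new Text(); }
    @Override public long getPos() throws IOException { return pos; }
    @Override public float getProgress() throws IOException { return 0f; }
    @Override public void close() throws IOException { /* release client connections */ }
}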

Comparison and Analysis of File Storage Size

We use the TPC-H benchmark data to illustrate the storage overhead of the different file formats. Because the data is public, interested readers can repeat the experiment below themselves. The Orders table in text format is 1.61 GB (1,732,690,045 bytes) in its original form. We loaded it into Hadoop, converted it to the formats above with Hive, and measured the resulting file sizes both uncompressed and with LZO compression.

Table name          Size (bytes)   Size   Compression   File format
Orders_text1        1732690045     1.61G  none          TextFile
Orders_tex2         772681211      736M   LZO           TextFile
Orders_seq1         1935513587     1.80G  none          SequenceFile
Orders_seq2         822048201      783M   LZO           SequenceFile
Orders_rcfile1      1648746355     1.53G  none          RCFile
Orders_rcfile2      686927221      655M   LZO           RCFile
Orders_avro_table1  1568359334     1.46G  none          Avro
Orders_avro_table2  652962989      622M   LZO           Avro

Table 1: comparison of file sizes in different formats

The results show that SequenceFile is larger than the original plain-text TextFile in both cases: about 11% larger uncompressed and about 6.4% larger with compression. This follows from how SequenceFile is defined: it stores metadata in its header, whose size varies slightly with the compression mode; compression is generally applied at the block level, each block records the lengths of its keys and values, and a sync marker is inserted roughly every 4 KB. The TextFile format only needs a delimiter between columns and a line terminator between rows, so a TextFile is smaller than the corresponding SequenceFile. On the other hand, TextFile does not record column lengths, so the reader must examine every character to decide whether it is a delimiter or a line terminator; as a result, TextFile's deserialization overhead can be tens of times higher than that of binary formats.
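A small sketch (not from the original article) of the two parsing styles: a delimited text row has to be scanned character by character, while a length-prefixed binary field can be read in one call.

// Sketch: character-by-character splitting of a delimited text row versus
// reading a length-prefixed binary field in a single readFully call.
import java.io.DataInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class ParsingStyles {
    /** Text style: every character must be inspected to find the delimiters. */
    public static List<String> splitRow(String row, char delimiter) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < row.length(); i++) {
            char c = row.charAt(i);
            if (c == delimiter) {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }

    /** Binary style: the stored length tells us exactly how many bytes to read. */
    public static byte[] readField(DataInputStream in) throws IOException {
        byte[] field = new byte[in.readInt()];   // length prefix
        in.readFully(field);
        return field;
    }
}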

The RCFile format also stores the length of every field in every column, but it keeps these lengths together in the header metadata block of each row group and then stores the actual column values contiguously. RCFile rewrites this header metadata for every row group (the row group size is controlled by hive.io.rcfile.record.buffer.size, 4 MB by default), which is necessary for the new row group's column lengths but redundant where the column definitions simply repeat. One might therefore expect RCFile files to be larger than SequenceFile, but RCFile applies Run Length Encoding to the field lengths in the header, which makes it slightly smaller than SequenceFile. Run Length Encoding compresses fixed-length types such as Integer, Double, and Long very well. A special case is the Timestamp type introduced in Hive 0.8: without milliseconds it can be expressed as "YYYY-MM-DD HH:MM:SS" and has a fixed length of 8 bytes, but with milliseconds it becomes "YYYY-MM-DD HH:MM:SS.fffffffff" and the fractional part is variable-length.
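A minimal sketch (not from the original article) of run-length encoding applied to a sequence of field lengths, the kind of metadata RCFile compresses in its row-group headers; the encoding shown here is a generic (value, count) scheme, not RCFile's exact on-disk form.

// Sketch: run-length encode a list of field lengths as (value, runLength) pairs.
// Fixed-width columns (e.g. all lengths equal to 8) collapse to a single pair.
import java.util.ArrayList;
import java.util.List;

public class RunLengthEncoder {
    public static List<int[]> encode(int[] lengths) {
        List<int[]> runs = new ArrayList<>();
        int i = 0;
        while (i < lengths.length) {
            int value = lengths[i];
            int run = 1;
            while (i + run < lengths.length && lengths[i + run] == value) {
                run++;
            }
            runs.add(new int[] {value, run});
            i += run;
        }
        return runs;
    }

    public static void main(String[] args) {
        // 1000 rows of an 8-byte fixed-length column become a single (8, 1000) run.
        int[] lengths = new int[1000];
        java.util.Arrays.fill(lengths, 8);
        System.out.println(encode(lengths).size());   // prints 1
    }
}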

The Avro file format is also organized in blocks, but it defines the schema of the whole data set once in the file header, unlike RCFile, which repeats the column definitions in every row group. In addition, Avro uses smaller data types such as Short or Byte where possible, so Avro's data blocks come out smaller than RCFile's.

Analysis of serialization and deserialization overhead

We can use Java's hprof profiler to examine the CPU and memory overhead of Hadoop tasks at run time. The following settings are made on the Hive command line:

hive> set mapred.task.profile=true;
hive> set mapred.task.profile.params=-agentlib:hprof=cpu=samples,heap=sites,depth=6,force=n,thread=y,verbose=n,file=%s

When the map task finishes, the profile logs it produces are written to the $logs/userlogs/job- directory. You can also find them through JobTracker's web interface, on the logs page or jobtracker.jsp.

Let's run a simple SQL statement to observe the serialization and deserialization overhead of the RCFile format:

hive> select O_CUSTKEY from orders_rc2 where O_ORDERSTATUS = ...

Here O_CUSTKEY is of type Integer and O_ORDERSTATUS is of type String. Memory and CPU consumption appears at the end of the log output.

The following table shows the CPU cost:

Rank  Self    Accum   Count  Trace   Method
20    0.48%   79.64%  65     315554  org.apache.hadoop.hive.ql.io.RCFile$Reader.getCurrentRow
28    0.24%   82.07%  32     315292  org.apache.hadoop.hive.serde2.columnar.ColumnarStruct.init
55    0.10%   85.98%  14     315788  org.apache.hadoop.hive.ql.io.RCFileRecordReader.getPos
56    0.10%   86.08%  14     315797  org.apache.hadoop.hive.ql.io.RCFileRecordReader.next

Table 2: CPU cost

The Trace column (fifth column) can be matched against the TRACE output in the log to see the call stack for each function. For example, the method ranked 20th in CPU consumption corresponds to TRACE 315554:

TRACE 315554: (thread=200001)
org.apache.hadoop.hive.ql.io.RCFile$Reader.getCurrentRow (RCFile.java:1434)
org.apache.hadoop.hive.ql.io.RCFileRecordReader.next (RCFileRecordReader.java:88)
org.apache.hadoop.hive.ql.io.RCFileRecordReader.next (RCFileRecordReader.java:39)
org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.doNext (CombineHiveRecordReader.java:98)
org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.doNext (CombineHiveRecordReader.java:42)
org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next (HiveContextAwareRecordReader.java:67)

The most visible cost is in RCFile, which incurs unnecessary array movement in order to reconstruct rows. RCFile has to build a RowContainer to restore each row: it reads one row at a time into the RowContainer and then assigns the values to the corresponding columns. Because RCFile stays compatible with SequenceFile and can merge two blocks, and because it does not know in which row group a column ends, it must keep track of the current position in the array, which leads to a format definition similar to the following:

Array<...>

This data format could instead be serialized and deserialized in a column-oriented way, for example:

Map<...>

Deserializing this way avoids the unnecessary array movement, provided, of course, that we know in which row group each column starts and ends. This approach would improve the efficiency of the whole deserialization process.
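To make the contrast concrete, here is a sketch (my own illustration, not the article's exact definitions) of a row-oriented container versus a column-oriented container keyed by column id.

// Sketch: a row-oriented layout stores one entry per row, so reading one column
// still walks every row; a column-oriented layout keyed by column id lets a reader
// pull just the columns it needs.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RowVsColumnLayout {
    // Row-oriented: an array of rows, each row holding all of its column values.
    static List<List<byte[]>> rowOriented = new ArrayList<>();

    // Column-oriented: a map from column id to the values of that column, in row order.
    static Map<Integer, List<byte[]>> columnOriented = new HashMap<>();

    /** Reading one column from the column-oriented layout touches no other columns. */
    static List<byte[]> readColumn(int columnId) {
        return columnOriented.getOrDefault(columnId, new ArrayList<>());
    }
}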

Thoughts on Hadoop file formats

1. Efficient compression

At present Hadoop lacks data formats that encode (Encoding) and decode (Decoding) efficiently based on the characteristics of the data, in particular formats that support highly efficient schemes such as Run Length Encoding and bitmap encoding. HIVE-2065 discussed more efficient compression, but reached no conclusion on how to choose the column order. On column ordering, see Daniel Lemire's paper "Reordering Columns for Smaller Indexes" [1]; Lemire is also the author of the bitmap-compression library introduced in Hive 0.8. The paper concludes that when several columns of a table are to be compressed, they should be arranged in ascending order of selectivity, that is, columns with fewer distinct values come first. This is in fact the data layout Vertica has used for years. Other compression-related issues include HIVE-2604 and HIVE-2600.
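A small sketch (not from the original article) of the ordering rule described above: sort columns by their number of distinct values, ascending, so low-cardinality columns come first.

// Sketch: order column names by ascending cardinality (number of distinct values).
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ColumnOrdering {
    /** columns maps a column name to its values; returns names sorted by cardinality. */
    public static List<String> orderForCompression(Map<String, List<String>> columns) {
        return columns.entrySet().stream()
                .sorted(Comparator.comparingInt((Map.Entry<String, List<String>> e) ->
                        new HashSet<>(e.getValue()).size()))
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, List<String>> columns = Map.of(
            "o_orderstatus", Arrays.asList("O", "F", "O", "O"),   // 2 distinct values
            "o_custkey",     Arrays.asList("1", "7", "3", "9"));  // 4 distinct values
        System.out.println(orderForCompression(columns));         // [o_orderstatus, o_custkey]
    }
}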

2. Serialization and deserialization based on columns and blocks

The current Hadoop framework keeps sorting data by key whether or not the sorted result is actually needed. In addition to the column-based sorting, serialization, and deserialization discussed above, Hadoop file formats should support some form of block-based sorting and serialization/deserialization, applied only when the data requires it. This has been mentioned as an MR optimization in Google's Tenzing paper.

"Block Shuffle: normally, MR uses line-based encoding and decoding when Shuffle. In order to process each row one by one, the data must be sorted first. However, this approach is not efficient when sorting is not necessary. We have implemented a block-based shuffle approach based on row-based shuffle, which handles about 1m of compressed block at a time. By treating the entire block as one line, we can avoid row-based serialization and deserialization on the MR framework, which is more than three times faster than row-based shuffle. "

3. Data filtering (Skip List)

Besides the usual partitions and indexes, filtering on sorted block ranges is another common technique in column databases. Google's Tenzing paper also describes a data format called ColumnIO, which records the minimum and maximum values of each block in its header; if the range in a block's header cannot contain the data being queried, the block is skipped entirely. The Hive community has discussed how to skip unneeded blocks, but without sorting there has been no good implementation; neither the RCFile format nor Hive's index mechanism currently offers an efficient way to skip blocks based on header metadata.
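A minimal sketch (not from the original article) of the min/max skipping idea: each block carries the minimum and maximum of a column in its header, and a scan skips any block whose range cannot contain the predicate value.

// Sketch: skip a block when its [min, max] range cannot contain the value we are
// looking for; only overlapping blocks are actually read and decompressed.
import java.util.ArrayList;
import java.util.List;

public class MinMaxBlockSkipping {
    static class Block {
        final long min, max;
        final long[] values;     // stand-in for the block's (possibly compressed) payload
        Block(long min, long max, long[] values) { this.min = min; this.max = max; this.values = values; }
    }

    /** Returns the blocks that might contain target; all others are skipped unread. */
    static List<Block> candidateBlocks(List<Block> blocks, long target) {
        List<Block> candidates = new ArrayList<>();
        for (Block b : blocks) {
            if (target >= b.min && target <= b.max) {
                candidates.add(b);
            }
        }
        return candidates;
    }
}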

4. Delayed materialization

A truly good column store should be able to operate directly on compressed blocks, without decompressing and sorting them first; this greatly reduces the decompress-deserialize-sort overhead in the MR framework or in a row database. The Block Shuffle described in Google's Tenzing paper is one form of delayed materialization. More aggressive delayed materialization can operate directly on compressed data and even inside inner loops, as described in Section 5.2 of "Integrating Compression and Execution in Column-Oriented Database Systems" [5]. However, since this also interacts with UDF integration, whether it would make the file interface too complex is still debated.

5. Integration with the Hadoop framework

Text or binary, these are only the final storage formats; the intermediate data produced while Hadoop runs is not under our control. This includes the data a MR job produces between map and reduce, and the data passed between an upstream reduce and a downstream map in a DAG of jobs. If the intermediate format is not columnar, it incurs unnecessary IO and CPU overhead: for example, the spill files produced in the map phase must be copied and then sort-merged in the reduce phase. If this intermediate format were also column-oriented, with each large block cut into several small blocks and a min/max index added to each small block's header, much of the decompress-deserialize-sort overhead in the sort-merge (Merge) steps could be avoided, reducing overall job run time.

Other file formats

The Hadoop community has studied other file formats as well. For example, IBM investigated column-oriented data formats and published "Column-Oriented Storage Techniques for MapReduce" [4], which reports that IBM's CIF (Column InputFormat) format incurs 20 times less IO than RCFile for serialization and deserialization. Spreading the columns of a row across different HDFS blocks, an idea also mentioned in the RCFile work, was considered too, but was ultimately set aside because reassembling rows from columns scattered across remote machines can introduce significant delays. More recently, Avro has been working on a column-oriented data format, but its integration with Hive is not yet complete.

