What are the file formats in Hadoop

2025-07-15 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article introduces the file formats used in Hadoop and the trade-offs between them.

Beginner's Guide to Hadoop File Formats

A few weeks ago, I wrote an article about Hadoop that covered its different components and the important role it plays in data engineering. In this article, I'll summarize the different file formats in Hadoop. It will be short and quick. If you want to know how Hadoop works and why it matters to data engineers, please see my earlier article on Hadoop; otherwise, feel free to skip it.

The file formats in Hadoop are roughly divided into two categories: row-oriented and column-oriented:

Row-oriented: the data of one row is stored contiguously. Examples: SequenceFile, MapFile, Avro Datafile. With this layout, even if only a few fields of a row are needed, the entire row must be read into memory. Lazy deserialization can alleviate the problem to some extent, but it cannot eliminate the overhead of reading the entire row from disk. Row-oriented storage suits workloads that process whole rows at a time.

Column-oriented: the file is divided into columns of data, and the values of each column are stored together. Examples: Parquet, RCFile, ORCFile. A column-oriented format can skip unneeded columns when reading, so it suits queries that access only a small number of fields per row. However, reading and writing this format requires more memory, because rows must be buffered in memory (to gather a column's values across multiple rows). It is also a poor fit for streaming writes: if a write fails, the current file cannot be recovered, whereas a row-oriented file can be resynchronized to the last sync point. This is why Flume uses a row-oriented storage format.

> Picture 1. (Left Side) Show the Logical Table and Picture 2. (Right Side) Row-Oriented Layout (Sequ

> Picture 3. Column-oriented Layout (RC File)
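To make the two layouts concrete, here is a minimal Python sketch. It is only an in-memory illustration: the table, field names, and helper functions are made up for this example and do not correspond to any Hadoop API.

```python
# Toy logical table: three rows of (id, name, age). All values are made up.
rows = [
    (1, "alice", 30),
    (2, "bob", 25),
    (3, "carol", 41),
]

# Row-oriented layout: the fields of one row sit next to each other.
row_layout = [field for row in rows for field in row]

# Column-oriented layout: all values of one column sit next to each other.
column_layout = [list(column) for column in zip(*rows)]


def ages_from_rows(flat, width=3, age_index=2):
    """The wanted field is interleaved with every row's other fields."""
    return [flat[i + age_index] for i in range(0, len(flat), width)]


def ages_from_columns(columns, age_index=2):
    """The wanted field is one contiguous run; other columns are untouched."""
    return columns[age_index]


print(row_layout)
print(column_layout)
print(ages_from_rows(row_layout), ages_from_columns(column_layout))
```

On disk, the row layout forces a reader to pull in the names and ids while looking for ages, whereas the column layout lets it read only the age run.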

If the difference between row-oriented and column-oriented layouts is still unclear, don't worry; you can visit this link to see the difference between them.

Here are some related file formats that are widely used on Hadoop systems:

Sequence file

The storage format varies depending on whether it is compressed and whether record compression or block compression is used:

> The Internal structure of a sequence file with no compression and with record compression.

No compression: each record is stored as record length, key length, key, and value, in that order. Lengths are in bytes. Keys and values are serialized using the specified serialization.

Record compression: only the value is compressed, and the compression codec is recorded in the header.

Block compression: multiple records are compressed together to take advantage of the similarity between records and save space. Sync markers are written between blocks. The minimum block size is set by the io.seqfile.compress.blocksize property.

> The internal structure of a sequence file with block compression
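The difference between the three modes can be sketched in a few lines of Python. This is a simplified illustration, not the real on-disk format: it assumes 4-byte big-endian length fields, uses zlib as a stand-in codec, and omits the file header and sync markers. The function names and sample records are hypothetical.

```python
import struct
import zlib


def pack_record(key: bytes, value: bytes) -> bytes:
    """Frame one record as: record length, key length, key, value."""
    return struct.pack(">ii", len(key) + len(value), len(key)) + key + value


def unpack_records(data: bytes):
    """Walk the framed bytes and recover the (key, value) pairs."""
    records, offset = [], 0
    while offset < len(data):
        record_len, key_len = struct.unpack_from(">ii", data, offset)
        offset += 8  # skip the two 4-byte length fields
        key = data[offset:offset + key_len]
        value = data[offset + key_len:offset + record_len]
        records.append((key, value))
        offset += record_len
    return records


# Made-up records whose values are very similar to each other.
records = [
    (f"user-{i:04d}".encode(), b'{"status": "active", "country": "US"}')
    for i in range(200)
]

# No compression: records are simply framed one after another.
uncompressed = b"".join(pack_record(k, v) for k, v in records)

# Record compression: each value is compressed on its own.
record_compressed = b"".join(
    pack_record(k, zlib.compress(v)) for k, v in records
)

# Block compression (simplified): many records are compressed together,
# so the codec can exploit the repetition across records.
block_compressed = zlib.compress(uncompressed)

print(len(uncompressed), len(record_compressed), len(block_compressed))
```

With values this short and this repetitive, compressing each value separately gains little, while compressing the whole batch shrinks it considerably. That is the space saving block compression aims for.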

Map file

MapFile is a variant of SequenceFile: a sorted SequenceFile with an index added. The index is stored as a separate file, typically with one entry for every 128 records, and can be loaded into memory for fast lookups. The data file stores records in the order defined by the key. MapFile records must be written in key order; otherwise, an IOException is thrown.
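The idea of a sparse index over sorted data can be sketched as follows. This is a toy in-memory model, assuming one index entry per 128 records; the class name `ToyMapFile` and its methods are hypothetical and are not the Hadoop MapFile API. Python's `OSError` stands in for Java's IOException.

```python
import bisect

INDEX_INTERVAL = 128  # one index entry per 128 records (assumed here)


class ToyMapFile:
    def __init__(self):
        self.data = []             # stands in for the sorted data file
        self.index_keys = []       # stands in for the separate index file
        self.index_positions = []  # position in `data` of each indexed key

    def append(self, key, value):
        # Keys must arrive in sorted order, otherwise the index is useless.
        if self.data and key < self.data[-1][0]:
            raise OSError(f"key out of order: {key!r}")
        if len(self.data) % INDEX_INTERVAL == 0:
            self.index_keys.append(key)
            self.index_positions.append(len(self.data))
        self.data.append((key, value))

    def get(self, key):
        # Binary-search the small in-memory index for the last entry <= key.
        slot = bisect.bisect_right(self.index_keys, key) - 1
        if slot < 0:
            return None
        start = self.index_positions[slot]
        # Then scan at most one index interval of the data file.
        for k, v in self.data[start:start + INDEX_INTERVAL]:
            if k == key:
                return v
            if k > key:
                break
        return None


toy = ToyMapFile()
for i in range(1000):
    toy.append(i * 2, f"v{i * 2}")

print(len(toy.index_keys), toy.get(500), toy.get(501))
```

A lookup touches the small index plus at most 128 data records, instead of scanning the whole file.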

Derived types of MapFile:

SetFile: a special MapFile for storing a set of Writable keys. The keys must be written in sorted order.

ArrayFile: the key is an integer that represents the position in the array, and the value is a Writable.

BloomMapFile: optimizes the MapFile get() method with a dynamic Bloom filter. The filter is held in memory, and the regular get() method is called to perform the read only if the filter indicates that the key may exist.
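The BloomMapFile idea (check a cheap in-memory filter before doing the expensive read) can be sketched like this. The sketch uses a plain fixed-size Bloom filter rather than the dynamic one mentioned above, and a dict in place of the on-disk file; `ToyBloomFilter`, `ToyBloomMap`, and the bit and hash counts are hypothetical choices for illustration.

```python
import hashlib


class ToyBloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: str):
        # Derive several bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


class ToyBloomMap:
    def __init__(self):
        self.store = {}        # stands in for the on-disk MapFile
        self.filter = ToyBloomFilter()
        self.disk_reads = 0    # counts how often the expensive path runs

    def put(self, key: str, value):
        self.store[key] = value
        self.filter.add(key)

    def get(self, key: str):
        if not self.filter.might_contain(key):
            return None        # skip the read entirely
        self.disk_reads += 1
        return self.store.get(key)


bloom_map = ToyBloomMap()
bloom_map.put("alpha", 1)
bloom_map.put("beta", 2)
print(bloom_map.get("alpha"), bloom_map.get("missing"), bloom_map.disk_reads)
```

A Bloom filter can report false positives but never false negatives, so a negative answer safely skips the read, while a positive answer still needs the regular lookup to confirm.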

Column-oriented file formats in the Hadoop ecosystem include RCFile, ORCFile, and Parquet. The column-oriented counterpart of Avro is Trevni.

RC file

RCFile is Hive's Record Columnar File. This type of file first divides the data horizontally into row groups, and then stores the data column by column within each row group. Its structure is as follows:

> Data Layout of RC File in an HDFS block
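The "row groups first, then columns inside each group" layout can be sketched in Python. This is only an in-memory illustration with made-up data; the function names and the group size of two rows are assumptions for the example, not RCFile parameters.

```python
def to_row_groups(rows, rows_per_group):
    """Split rows into groups, then store each group column by column."""
    groups = []
    for start in range(0, len(rows), rows_per_group):
        chunk = rows[start:start + rows_per_group]
        groups.append([list(column) for column in zip(*chunk)])
    return groups


def read_column(groups, column_index):
    """Read one column by visiting only that column's run in each group."""
    return [value for group in groups for value in group[column_index]]


rows = [
    (1, "a", 10.0),
    (2, "b", 20.0),
    (3, "c", 30.0),
    (4, "d", 40.0),
    (5, "e", 50.0),
]

groups = to_row_groups(rows, rows_per_group=2)
print(groups)
print(read_column(groups, 1))
```

All fields of a given row stay inside one row group, while a reader that needs a single column can still skip the other columns within each group.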

For comparison, here are the pure row-oriented and pure column-oriented layouts:

> Row-Store in an HDFS Block

> Column Group in HDFS Block

ORC file

ORCFile (Optimized Row Columnar file) provides a more efficient file format than RCFile. Internally, it divides the data into stripes with a default size of 250 MB. Each stripe contains index data, row data, and a footer. The index stores the minimum and maximum values of each column and the row positions within each column.

> ORC File Layout
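One use of the per-stripe minimum and maximum values is skipping stripes that cannot contain matching rows. The following Python sketch illustrates that idea on a single integer column. It is a toy model: the stripe size of 25 rows, the dict layout, and the function names are assumptions for illustration and do not reflect the actual ORC file structure or reader API.

```python
def build_stripes(values, stripe_size):
    """Split a column into stripes and record min/max for each stripe."""
    stripes = []
    for start in range(0, len(values), stripe_size):
        chunk = values[start:start + stripe_size]
        stripes.append({"min": min(chunk), "max": max(chunk), "rows": chunk})
    return stripes


def scan_greater_than(stripes, threshold):
    """Return matching values and how many stripes had to be read."""
    matches, stripes_read = [], 0
    for stripe in stripes:
        if stripe["max"] <= threshold:
            continue  # the index proves no row here can match
        stripes_read += 1
        matches.extend(v for v in stripe["rows"] if v > threshold)
    return matches, stripes_read


values = list(range(100))
stripes = build_stripes(values, stripe_size=25)
matches, stripes_read = scan_greater_than(stripes, threshold=80)
print(len(stripes), stripes_read, matches[:3])
```

Here only one of the four stripes is read, because the other three have a maximum value that is not above the threshold.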

In Hive, any of the following statements selects ORCFile as the storage format:

CREATE TABLE... STORED AS ORC

ALTER TABLE... SET FILEFORMAT ORC

SET hive.default.fileformat=ORC

Parquet

Parquet is a general-purpose column-oriented storage format based on Google's Dremel. It is especially good at handling deeply nested data.

> The internal Structure of Parquet File

For nested structures, Parquet flattens the data into column storage and records the nesting with repetition levels and definition levels (R and D). When reading, it uses these levels together with the metadata to rebuild the records and their structure. The following records serve as an example for R and D:

AddressBook {
  contacts: {
    phoneNumber: "555 987 6543"
  }
  contacts: {
  }
}
AddressBook {
}
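As a rough illustration, the Python sketch below computes R and D for the contacts.phoneNumber column of the two records above. It assumes a schema in which contacts is a repeated group and phoneNumber is an optional field (the schema is not given in this article), so the maximum repetition level is 1 and the maximum definition level is 2. The function name and the list-of-dicts representation are hypothetical.

```python
def phone_number_levels(address_books):
    """Return (R, D, value) triples for the contacts.phoneNumber column.

    Each address book is a list of contacts; each contact is a dict that
    may or may not carry a "phoneNumber" entry.
    """
    triples = []
    for contacts in address_books:
        if not contacts:
            # New record (R=0) and not even `contacts` is defined (D=0).
            triples.append((0, 0, None))
            continue
        for position, contact in enumerate(contacts):
            # R=0 starts a new record; R=1 repeats within `contacts`.
            r = 0 if position == 0 else 1
            phone = contact.get("phoneNumber")
            # D=2: phoneNumber is defined; D=1: only `contacts` is defined.
            d = 2 if phone is not None else 1
            triples.append((r, d, phone))
    return triples


records = [
    [{"phoneNumber": "555 987 6543"}, {}],  # first AddressBook
    [],                                     # second AddressBook
]

for r, d, value in phone_number_levels(records):
    print(r, d, value)
```

Under these assumptions the column holds three entries: (R=0, D=2) with the phone number, (R=1, D=1) for the empty second contact, and (R=0, D=0) for the empty address book. These are enough to rebuild both records.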
