
How many storage formats does Hive have?


This article looks at the storage formats Hive supports. It is meant as a practical reference; follow along for the details.

Hive file storage format

1. TextFile

TextFile is the default format.

Storage layout: row storage.

High disk overhead and high data parsing overhead.

Compressed text files cannot be split or merged by Hive.

2. SequenceFile

A binary file format provided by the Hadoop API that serializes data into files as key/value pairs.

Storage layout: row storage.

Splittable; supports compression.

Block compression is generally the preferred choice.

Its advantage is compatibility with MapFile in the Hadoop API.
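As a minimal sketch of how such a table might be created and populated (table and column names here are illustrative, and some_text_table is assumed to already exist):

create table seq_demo (id int, name string) stored as sequencefile;
set hive.exec.compress.output=true;
set io.seqfile.compression.type=BLOCK;  -- choose block compression
insert overwrite table seq_demo select id, name from some_text_table;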

3. RCFile

Storage layout: data is partitioned into row groups (blocks) by row, and stored by column within each row group.

Good compression; fast column access.

Reading a record touches as few blocks as possible.

To read the required columns, only the header of each row group needs to be read.

For full-table reads, performance may not show an obvious advantage over SequenceFile.

4. ORC

Storage layout: data is partitioned into row groups (blocks) by row, and stored by column within each row group.

Good compression; fast column access.

ORC is an improved version of RCFile and is more efficient.
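For instance, a hedged sketch of an ORC table with Snappy compression (table and column names are illustrative; "orc.compress" is a standard ORC table property):

create table orc_demo (id int, name string)
stored as orc
tblproperties ("orc.compress"="SNAPPY");  -- alternatives include ZLIB and NONE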

5. Custom formats

Users can define their own input and output formats by implementing InputFormat and OutputFormat.
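As a sketch of the declaration syntax only (the com.example class names are hypothetical placeholders, not real classes):

create table custom_demo (id int, name string)
stored as inputformat 'com.example.MyInputFormat'   -- hypothetical class
outputformat 'com.example.MyOutputFormat';          -- hypothetical class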

Summary:

TextFile consumes considerable storage space, and compressed text files cannot be split or merged. Its query efficiency is the lowest, but data can be stored into it directly, so it loads the fastest.

SequenceFile consumes the most storage space. Its compressed files can be split and merged, and its query efficiency is high, but data must be loaded by converting from a text file.

RCFile uses the least storage space and has the highest query efficiency. Data must be loaded by converting from a text file, and loading is the slowest.

Personal advice: avoid TextFile and SequenceFile where possible; ORC is the best choice.
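To make the "load by converting from a text file" step concrete, here is a hedged sketch of the usual staging pattern (all table, column, and path names are illustrative):

create table staging (id int, name string)
row format delimited fields terminated by ','
stored as textfile;
load data inpath '/tmp/data.csv' into table staging;  -- illustrative HDFS path
create table final_orc (id int, name string) stored as orc;
insert overwrite table final_orc select * from staging;  -- the conversion happens here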

Creating RCFile, SequenceFile, or Text Files

If you do not have an existing file to use, begin by creating one.

To create an RCFile, SequenceFile, or Text File table:

In the impala-shell interpreter, issue a command similar to:

create table rcfile_table (column_specs) stored as rcfile;
create table sequencefile_table (column_specs) stored as sequencefile;
create table textfile_table (column_specs) stored as textfile;
-- If the STORED AS clause is omitted, the default is a comma-delimited TEXTFILE.
create table default_table (column_specs);

Because Impala can query some kinds of tables that it cannot currently write to, after creating tables of certain file formats, you might use the Hive shell to load the data. See Understanding File Formats for details.
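For example (a sketch; table and column names are illustrative), an RCFile table created in impala-shell can be populated from the Hive shell:

-- in impala-shell:
create table rc_demo (id int, name string) stored as rcfile;
-- then in the Hive shell:
insert overwrite table rc_demo select id, name from source_table;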

Enabling Compression for RCFile and SequenceFile Tables

You may want to enable compression on existing tables. Enabling compression provides performance gains in most cases and is supported for RCFile and SequenceFile tables. For example, to enable Snappy compression, you would specify the following additional settings when loading data through the Hive shell:

hive> SET hive.exec.compress.output=true;
hive> SET mapred.max.split.size=256000000;
hive> SET mapred.output.compression.type=BLOCK;
hive> SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
hive> insert overwrite table NEW_TABLE select * from OLD_TABLE;

If you are converting partitioned tables, you must complete additional steps. In such a case, specify additional settings similar to the following:

hive> create table NEW_TABLE (YOUR COLS) partitioned by (PARTITION COLS) stored as NEW_FORMAT;
hive> SET hive.exec.dynamic.partition.mode=nonstrict;
hive> SET hive.exec.dynamic.partition=true;
hive> insert overwrite table NEW_TABLE partition (COMMA SEPARATED PARTITION COLS) select * from OLD_TABLE;

Remember that Hive does not require that you specify a source format for it. Consider the case of converting a table with two partition columns called "year" and "month" to a Snappy compressed SequenceFile. Combining the components outlined previously to complete this table conversion, you would specify settings similar to the following:

hive> create table TBL_SEQ (int_col int, string_col string) stored as SEQUENCEFILE;
hive> SET hive.exec.compress.output=true;
hive> SET mapred.max.split.size=256000000;
hive> SET mapred.output.compression.type=BLOCK;
hive> SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
hive> SET hive.exec.dynamic.partition.mode=nonstrict;
hive> SET hive.exec.dynamic.partition=true;
hive> insert overwrite table TBL_SEQ select * from TBL;

To complete a similar process for a table that includes partitions, you would specify settings similar to the following:

hive> create table TBL_SEQ (int_col int, string_col string) partitioned by (year int) stored as SEQUENCEFILE;
hive> SET hive.exec.compress.output=true;
hive> SET mapred.max.split.size=256000000;
hive> SET mapred.output.compression.type=BLOCK;
hive> SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
hive> SET hive.exec.dynamic.partition.mode=nonstrict;
hive> SET hive.exec.dynamic.partition=true;
hive> insert overwrite table TBL_SEQ partition (year) select * from TBL;

Note:

The compression type is specified in the following phrase:

SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec

You could elect to specify alternative codecs such as GzipCodec here.
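For example, switching to Gzip uses the standard Hadoop Gzip codec class:

hive> SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;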

Of Hive's three built-in file formats TEXTFILE, SEQUENCEFILE, and RCFILE, TEXTFILE and SEQUENCEFILE are both row-oriented. RCFILE is based on the idea of mixing row and column storage: the data is first divided into N row groups by row, and within each row group every column is stored separately. Hive can also support custom formats; for more information, see: Hive file storage format.

HDFS-based row storage adapts well to fast data loading and dynamic workloads, because row storage keeps all fields of the same record on the same cluster node. However, it does not deliver fast query response times: when a query touches only a few of the columns, row storage cannot skip the unneeded columns and jump directly to the required ones. It also has bottlenecks in storage utilization, because a table mixes columns of different types and value distributions, so row storage does not achieve a high compression ratio. RCFILE is a columnar storage format built on top of SEQUENCEFILE. Besides meeting the requirements of fast data loading and good adaptation to dynamic workloads, it also removes some of SEQUENCEFILE's bottlenecks.

Here is a brief introduction to each format:

TextFile:

Hive default format, no data compression, high disk overhead, high data parsing overhead.

It can be combined with compression codecs such as Gzip, Bzip2, and Snappy (Hive detects the codec automatically and decompresses during queries), but with these codecs Hive cannot split the data, so it cannot process the file in parallel.

SequenceFile:

SequenceFile is a binary file format provided by the Hadoop API that serializes data into files as key/value pairs. Internally it uses Hadoop's standard Writable interface for serialization and deserialization. It is compatible with MapFile in the Hadoop API. Hive's SequenceFile inherits from the Hadoop API's SequenceFile, but its key is left empty and the actual row is stored in the value, in order to avoid MR's sorting step during the map phase.

File structure diagram of SequenceFile:

Header (common file header) format:

SEQ: 3-byte magic header
Num: 1-byte version number
keyClassName: key class name
valueClassName: value class name
compression (boolean): whether compression is enabled for the file
blockCompression (boolean): whether block compression is used
compressionCodec: the compression codec
metadata: file metadata
sync: synchronization marker that ends the header

(Figure: Block-Compressed SequenceFile format.)

The advantage of row-oriented storage on a Hadoop system is fast data loading and good adaptation to dynamic workloads, because row storage keeps all fields of the same record on the same cluster node, that is, in the same HDFS block. The disadvantages are equally clear: it cannot support fast query processing, because when a query touches only a few of many columns it cannot skip the unneeded column reads; and because each row mixes columns with different data values, row storage does not achieve a high compression ratio, so space utilization cannot be greatly improved.

Column storage

(Figure: an example of column storage in an HDFS block.)

Consider an example of storing a table on HDFS by column group: columns A and B are stored in the same column group, while columns C and D are stored in separate column groups. Column storage avoids reading unneeded columns during a query, and compressing the similar data within a column achieves a higher compression ratio. However, because tuple reconstruction is expensive, it cannot provide fast query processing on a Hadoop system: column storage does not guarantee that all fields of the same record are stored on the same cluster node. In the column-storage example, the four fields of a record end up in three HDFS blocks on different nodes, so reconstructing records causes a large amount of data transfer over the cluster network. Pre-grouping multiple columns can reduce this overhead, but it adapts poorly to highly dynamic workload patterns.

RCFile combines the fast data loading of row storage with the space savings of column storage: first, RCFile guarantees that the data of one row resides on one node, so the cost of tuple reconstruction is very low; second, like column storage, RCFile can exploit per-column data compression and skip unneeded column reads.
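As a sketch of how the row-group trade-off can be tuned (hive.io.rcfile.record.buffer.size is the relevant setting in many Hive versions, and the table name is illustrative):

set hive.io.rcfile.record.buffer.size=8388608;  -- 8 MB row groups; 4 MB is a common default
create table rc_tuned (id int, name string) stored as rcfile;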

(Figure: example of RCFile storage in an HDFS block.)

Data testing

Number of data records in source table: 67236221

Step 1: create tables in the three file formats; see the Hive file storage format reference for the table syntax.

Sql code

-- TextFile
set hive.exec.compress.output=true;
set mapred.output.compress=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;
INSERT OVERWRITE table hzr_test_text_table PARTITION (product='xxx',dt='2013-04-22')
SELECT xxx,xxx.... FROM xxxtable WHERE product='xxx' AND dt='2013-04-22';

-- SequenceFile
set hive.exec.compress.output=true;
set mapred.output.compress=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;
set io.seqfile.compression.type=BLOCK;
INSERT OVERWRITE table hzr_test_sequence_table PARTITION (product='xxx',dt='2013-04-22')
SELECT xxx,xxx.... FROM xxxtable WHERE product='xxx' AND dt='2013-04-22';

-- RCFile
set hive.exec.compress.output=true;
set mapred.output.compress=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;
INSERT OVERWRITE table hzr_test_rcfile_table PARTITION (product='xxx',dt='2013-04-22')
SELECT xxx,xxx.... FROM xxxtable WHERE product='xxx' AND dt='2013-04-22';

Step 2: measure the time and storage space consumed by insert overwrite table tablename select ....
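One hedged way to check the resulting storage footprint from inside the Hive shell (the warehouse path is illustrative):

dfs -du -h /user/hive/warehouse/hzr_test_text_table;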

Type          Insert time (s)    Storage space (GB)
Sequence      97.291             7.13
RCFile        120.901            5.73
TextFile      290.517            6.80

(Figure: comparison of insert time and count(1) time.)

Step 3: query response time

Test 1:

Sql code

Scenario 1: test the query efficiency of whole-row records:

Select * from hzr_test_sequence_table where game='XXX'

Select * from hzr_test_rcfile_table where game='XXX'

Select * from hzr_test_text_table where game='XXX'

Scenario 2: test the query efficiency of specific columns:

Select game,game_server from hzr_test_sequence_table where game = 'XXX'

Select game,game_server from hzr_test_rcfile_table where game = 'XXX'

Select game,game_server from hzr_test_text_table where game = 'XXX'

File format   Whole-row query time (s)   Specific-column query time (s)
Sequence      42.241                     39.918
RCFile        37.395                     36.248
Text          43.164                     41.632

(Figure: query time comparison across the three formats.)

Test 2:

The purpose of this test is to verify whether RCFILE's data reading mode and its lazy decompression offer performance advantages. The reading mode touches only the metadata and the relevant columns, saving I/O; lazy decompression decompresses only the relevant column data and skips rows that fail the WHERE condition, which benefits both I/O and overall efficiency.

Option 1:

Number of records: 698020

Sql code

Insert overwrite local directory 'XXX/XXXX' select game,game_server from hzr_test_xxx_table where game = 'XXX'

Option 2:

Number of records: 67236221

Sql code

Insert overwrite local directory 'xxx/xxxx' select game,game_server from hzr_test_xxx_table

Option 3:

Number of records:

Sql code

Insert overwrite local directory 'xxx/xxx'

Select game from hzr_xxx_rcfile_table

File type      Option 1 (s)   Option 2 (s)   Option 3 (s)
TextFile       54.895         69.428         167.667
SequenceFile   137.096        77.03          123.667
RCFile         44.28          57.037         89.9

The table above shows that RCFILE's query efficiency is higher than SEQUENCEFILE's on both large and small data sets, and that RCFILE still outperforms SEQUENCEFILE when reading specific columns.

Thank you for reading! That concludes this article on how many storage formats Hive has. I hope the content above has been helpful and leaves you knowing a bit more; if you found the article useful, feel free to share it.
