In production, we can choose a file format for a Hive table by comparing the storage footprint and query speed of each format, in order to save space and improve query performance.
Official reference document: https://cwiki.apache.org/confluence/display/HIVE
Conclusions:
Compression effect:
Best: bzip2. It has the highest compression ratio, but compressing takes a long time.
Next: ORC and Parquet compress to roughly the same size; ORC or Parquet is recommended for production.
Query performance: the dataset here is too small for the results to be conclusive; experience from production suggests that Parquet's query performance is better than ORC's.
The storage formats supported by Hive are:
Text File
SequenceFile
RCFile
Avro Files
ORC Files
Parquet
Note: Hive's default format is TextFile, which can be checked with set hive.default.fileformat:
> set hive.default.fileformat;
hive.default.fileformat=TextFile
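The default can also be changed per session. As a minimal sketch (assuming a Hive CLI or Beeline session; the accepted values depend on your Hive version), setting it to ORC makes tables created without an explicit stored as clause use ORC:
> set hive.default.fileformat=ORC;
# tables created from now on without a stored as clause default to ORC
create table t_default_orc (id int, name string);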
The following compares each file format:
Create a TextFile table
# the original data is in TextFile format, roughly 65 MB in size
[hadoop@hadoop001 ~]$ hadoop fs -du -s -h /input/*
64.9 M  194.7 M  /input/part-r-00000
# create the table and load the TextFile data
CREATE EXTERNAL TABLE textfile (cdn string, region string, level string, time string, ip string, domain string, url string, traffic bigint) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
load data local inpath '/home/hadoop/part-r-00000' overwrite into table textfile;
You can see that the data is 64.9 MB in size.
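To verify the sizes yourself, one approach (the warehouse path below is the common default and is an assumption; check the Location field reported for your cluster) is:
# show the table's HDFS location and storage details
describe formatted textfile;
# then check the file sizes under that location from inside the Hive CLI
dfs -du -s -h /user/hive/warehouse/textfile;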
Create a bzip2-compressed table from the table above:
Hive supports compressing table data at creation time. The configuration is as follows:
Enable compression: set hive.exec.compress.output=true;
View the current compression codec: set mapreduce.output.fileoutputformat.compress.codec;
Set the compression codec: set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec;
# create a bzip2-compressed table
create table textfile_bzip2 as select * from textfile;
You can see that with compression enabled, the result is only 13.84 MB and the files have a .bz2 extension.
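Note that these compression settings are session-wide and will also apply to any table created later in the same session. A minimal precaution, assuming the same session is reused for the comparison tables below, is to switch compression back off first:
set hive.exec.compress.output=false;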
Create a SequenceFile table
# create the SequenceFile table
CREATE EXTERNAL TABLE seqfile (cdn string, region string, level string, time string, ip string, domain string, url string, traffic bigint) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' stored as sequencefile;
# load the data
insert into table seqfile select * from textfile;
Note: load data cannot be used directly here, because the source file is a TextFile while the target table is a SequenceFile; instead, load through the TextFile table with insert into ... select.
You can see that the SequenceFile table is larger than the original file, because SequenceFile stores extra header and sync information alongside the records. This format is not used in production.
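To confirm which format a table actually uses, you can inspect the InputFormat and OutputFormat fields of describe formatted (the class names shown below are what Hive typically reports for SequenceFile tables and may vary by version):
describe formatted seqfile;
# InputFormat: org.apache.hadoop.mapred.SequenceFileInputFormat
# OutputFormat: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat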
Create a RCFile table
# create the RCFile table
CREATE EXTERNAL TABLE rcfile (cdn string, region string, level string, time string, ip string, domain string, url string, traffic bigint) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' stored as rcfile;
# load the data
insert into table rcfile select * from textfile;
RCFile only saves about 10% of the storage space, so this format is not used in production either.
Creating ORC Files: ORC builds on RCFile and optimizes it further for columnar storage.
Official introduction to ORC: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC
# create a table in ORC format; by default ORC compresses with ZLIB
CREATE EXTERNAL TABLE orcfile (cdn string, region string, level string, time string, ip string, domain string, url string, traffic bigint) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' stored as orc;
# load the data
insert into table orcfile select * from textfile;
# create a table in ORC format without ZLIB compression, specified with "orc.compress" = "NONE"
create table orcfile_none stored as orc tblproperties ("orc.compress" = "NONE") as select * from textfile;
ORC with ZLIB compression:
ORC without ZLIB compression:
Summary: by comparison, ZLIB compression saves a bit more space.
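Besides ZLIB and NONE, orc.compress also accepts SNAPPY, which trades some compression ratio for faster reads and writes; a minimal sketch (the orcfile_snappy name is illustrative):
create table orcfile_snappy stored as orc tblproperties ("orc.compress" = "SNAPPY") as select * from textfile;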
Create a table in Parquet format
# create a Parquet table without compression
create table parquetfile stored as parquet as select * from textfile;
# create a Parquet table with gzip compression
set parquet.compression=gzip;
create table parquetfile_gzip stored as parquet as select * from textfile;
Note: by comparison, Parquet with gzip compression saves a great deal of space.
Parquet without gzip compression: you can see that very little space is saved.
Parquet with gzip compression: you can see that the data is much smaller after compression.
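The same parquet.compression property used above for gzip also accepts snappy, a common production choice for Parquet because it decompresses quickly; a sketch (the parquetfile_snappy name is illustrative):
set parquet.compression=snappy;
create table parquetfile_snappy stored as parquet as select * from textfile;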
Query performance comparison:
Query statement: select count(*) from textfile | rcfile | orcfile | parquetfile where ip='210.35.230.31';
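Spelled out per table, the four comparison queries are (the IP value comes from the sample dataset above):
select count(*) from textfile where ip='210.35.230.31';
select count(*) from rcfile where ip='210.35.230.31';
select count(*) from orcfile where ip='210.35.230.31';
select count(*) from parquetfile where ip='210.35.230.31';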
Records read per query:
textfile: scans the whole table, reading 68,085,397 records
rcfile: reads 1,973,371 records
orcfile: reads 2,883,851 records
parquetfile: reads 8,622,602 records
The columnar formats only need to read the ip column rather than every field of every record, which is why they read far fewer records than TextFile's full-table scan.