This article describes the storage formats that Hive supports. The editor finds it very practical and shares it here as a reference; follow along to take a look.
Hive file storage format
1. TextFile
TextFile is the default format.
Storage method: row storage.
High disk overhead; data parsing is also expensive.
Hive cannot split or merge compressed text files.
2. SequenceFile
A binary format; data is serialized into the file as <key, value> pairs.
Storage method: row storage.
Splittable and compressible.
Block compression is generally preferred.
Its advantage is compatibility with MapFile in the Hadoop API.
3. RCFile
Storage method: data is partitioned into blocks by row, and within each block the data is stored by column.
Good compression; fast column access.
Reading a record touches as few blocks as possible.
Only the header of each row group needs to be read to fetch the required columns.
For full scans that read all columns, it may have no obvious performance advantage over SequenceFile.
4. ORC
Storage method: data is partitioned into blocks by row, and within each block the data is stored by column.
Good compression; fast column access.
It is an improved version of RCFile and is more efficient than RCFile.
5. Custom format
Users can define their own storage format by implementing InputFormat and OutputFormat, as sketched below.
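A minimal sketch of declaring a custom format in HiveQL; the class names here are hypothetical placeholders, not real implementations:
create table custom_table (id int, name string)
  stored as inputformat 'com.example.MyInputFormat'
  outputformat 'com.example.MyOutputFormat';
In practice the OutputFormat class must implement Hive's output-format contract, so the placeholder classes above would need to be replaced with your own implementations.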
Summary:
TextFile consumes a lot of storage space, and compressed text cannot be split and merged, so its query efficiency is the lowest; however, data can be stored in it directly, so it loads fastest.
SequenceFile consumes the most storage space; its compressed files can be split and merged, so query efficiency is high, but data must be loaded by converting from text files.
RCFile uses the least storage space and has the highest query efficiency; data must be loaded by converting from text files, and loading is the slowest.
Personal advice: avoid TextFile and SequenceFile if you can; ORC is the best choice.
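Since the advice above recommends ORC, here is a minimal sketch of the "convert via a text table" pattern mentioned in the summary; the table and column names are hypothetical, and "orc.compress"="SNAPPY" assumes Snappy is available in your cluster:
-- staging table holding the raw text data (hypothetical names)
create table logs_text (id int, msg string)
  row format delimited fields terminated by ','
  stored as textfile;
-- ORC table with compression enabled through a table property
create table logs_orc (id int, msg string)
  stored as orc
  tblproperties ("orc.compress"="SNAPPY");
-- convert by reading the text table and rewriting it as ORC
insert overwrite table logs_orc select id, msg from logs_text;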
Creating RCFile, SequenceFile, or Text Files
If you do not have an existing file to use, begin by creating one.
To create an RCFile, SequenceFile, or Text File table:
In the impala-shell interpreter, issue a command similar to:
create table rcfile_table (column_specs) stored as rcfile;
create table sequencefile_table (column_specs) stored as sequencefile;
create table textfile_table (column_specs) stored as textfile;
-- If the STORED AS clause is omitted, the default is a comma-delimited TEXTFILE.
create table default_table (column_specs);
Because Impala can query some kinds of tables that it cannot currently write to, after creating tables of certain file formats, you might use the Hive shell to load the data. See Understanding File Formats for details.
Enabling Compression for RCFile and SequenceFile Tables
You may want to enable compression on existing tables. Enabling compression provides performance gains in most cases and is supported for RCFile and SequenceFile tables. For example, to enable Snappy compression, you would specify the following additional settings when loading data through the Hive shell:
hive> SET hive.exec.compress.output=true;
hive> SET mapred.max.split.size=256000000;
hive> SET mapred.output.compression.type=BLOCK;
hive> SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
hive> insert overwrite table NEW_TABLE select * from OLD_TABLE;
If you are converting partitioned tables, you must complete additional steps. In such a case, specify additional settings similar to the following:
hive> create table NEW_TABLE (YOUR COLS) partitioned by (PARTITION COLS) stored as NEW_FORMAT;
hive> SET hive.exec.dynamic.partition.mode=nonstrict;
hive> SET hive.exec.dynamic.partition=true;
hive> insert overwrite table NEW_TABLE partition (COMMA SEPARATED PARTITION COLS) select * from OLD_TABLE;
Remember that Hive does not require you to specify the source format. Consider the case of converting a table with two partition columns called "year" and "month" to a Snappy compressed SequenceFile. Combining the components outlined previously to complete this table conversion, you would specify settings similar to the following:
hive> create table TBL_SEQ (int_col int, string_col string) stored as SEQUENCEFILE;
hive> SET hive.exec.compress.output=true;
hive> SET mapred.max.split.size=256000000;
hive> SET mapred.output.compression.type=BLOCK;
hive> SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
hive> SET hive.exec.dynamic.partition.mode=nonstrict;
hive> SET hive.exec.dynamic.partition=true;
hive> insert overwrite table TBL_SEQ select * from TBL;
To complete a similar process for a table that includes partitions, you would specify settings similar to the following:
hive> create table TBL_SEQ (int_col int, string_col string) partitioned by (year int) stored as SEQUENCEFILE;
hive> SET hive.exec.compress.output=true;
hive> SET mapred.max.split.size=256000000;
hive> SET mapred.output.compression.type=BLOCK;
hive> SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
hive> SET hive.exec.dynamic.partition.mode=nonstrict;
hive> SET hive.exec.dynamic.partition=true;
hive> insert overwrite table TBL_SEQ partition (year) select * from TBL;
Note:
The compression type is specified in the following phrase:
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec
You could elect to specify alternative codecs such as GzipCodec here.
Of Hive's three main file formats, TEXTFILE, SEQUENCEFILE and RCFILE, both TEXTFILE and SEQUENCEFILE are row-based storage formats. RCFILE is based on the idea of mixing row and column storage: the data is first divided into N row groups by row, and within each row group every column is stored separately. Hive also supports custom formats; for more information, see the Hive file storage format section above.
HDFS-based row storage adapts well to fast data loading and dynamic workloads, because row storage guarantees that all fields of the same record are on the same cluster node. However, it does not meet the requirement of fast query response time: when a query touches only a few of the columns, it cannot skip the unneeded columns and jump directly to the needed ones. It also has bottlenecks in storage space utilization, because a table contains columns of different types and value distributions, and row storage makes it hard to obtain a high compression ratio. RCFILE is a columnar storage format implemented on top of SEQUENCEFILE. In addition to meeting the requirements of fast data loading and good adaptation to dynamic workloads, it also resolves some of SEQUENCEFILE's bottlenecks.
Here is a brief introduction to each of these formats:
TextFile:
Hive default format, no data compression, high disk overhead, high data parsing overhead.
It can be combined with Gzip, Bzip2, Snappy, and other codecs (Hive detects the codec and decompresses automatically at query time), but files compressed in this way cannot be split by Hive, so the data cannot be processed in parallel.
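For example, a minimal sketch of loading a gzip-compressed text file into a TextFile table; the path, table, and column names are hypothetical:
create table raw_logs (line string)
  stored as textfile;
-- Hive recognizes the .gz extension and decompresses at query time,
-- but the file cannot be split, so it is read by a single task.
load data inpath '/data/logs/2013-04-22.log.gz' into table raw_logs;
select count(1) from raw_logs;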
SequenceFile:
SequenceFile is a binary file format provided by the Hadoop API that serializes data into a file as <key, value> pairs. The binary data is serialized and deserialized internally using Hadoop's standard Writable interface. It is compatible with MapFile in the Hadoop API. Hive's SequenceFile inherits from the Hadoop API's SequenceFile, but its key is left empty and the actual row is stored in the value, in order to avoid MapReduce's sort step during the map phase.
File structure of a SequenceFile:
Header (common header format): SEQ magic (3 bytes), a 1-byte version number, keyClassName, valueClassName, compression (boolean, whether compression is enabled in the file), blockCompression (boolean, whether block compression is used), the compression codec, Metadata (file metadata), and Sync (the marker that ends the header).
Block-Compressed SequenceFile format
The advantage of row-oriented storage on a Hadoop system lies in fast data loading and good adaptation to dynamic workloads, because row storage guarantees that all fields of the same record are on the same cluster node, i.e. in the same HDFS block. The disadvantages of row storage are equally obvious: it cannot support fast query processing, because when a query touches only a few of many columns it cannot skip reading the unneeded columns; and because columns with different data values are mixed together, row storage makes it hard to obtain a very high compression ratio, i.e. space utilization cannot be improved much.
Column storage
An example of column storage in an HDFS block
An example of storing a table by column group on HDFS: columns A and B are stored in the same column group, while columns C and D are stored in separate column groups. Column storage avoids reading unnecessary columns during a query, and compressing the similar data within a column achieves a higher compression ratio. However, because the overhead of tuple reconstruction is high, it cannot provide fast query processing on a Hadoop system: column storage does not guarantee that all fields of the same record are stored on the same cluster node. In this example, the four fields of a record are stored in three HDFS blocks on different nodes, so reconstructing a record causes a large amount of data to be transferred across the cluster network. Although pre-grouping multiple columns can reduce this overhead, it does not adapt well to highly dynamic workload patterns.
RCFile combines the query speed of row storage with the space-saving characteristics of column storage: first, RCFile guarantees that the data of the same row is located on the same node, so the overhead of tuple reconstruction is very low; second, like column storage, RCFile can exploit column-level data compression and skip unnecessary column reads.
Example of RCFile storage in HDFS block:
Data testing
Number of data records in source table: 67236221
Step 1: create tables in the three file formats; see the Hive file storage format section above for the table creation syntax.
Sql code
-- TextFile
Set hive.exec.compress.output=true;
Set mapred.output.compress=true;
Set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
Set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;
INSERT OVERWRITE table hzr_test_text_table PARTITION (product='xxx',dt='2013-04-22')
SELECT xxx,xxx.... FROM xxxtable WHERE product='xxx' AND dt='2013-04-22';

-- SequenceFile
Set hive.exec.compress.output=true;
Set mapred.output.compress=true;
Set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
Set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;
Set io.seqfile.compression.type=BLOCK;
INSERT OVERWRITE table hzr_test_sequence_table PARTITION (product='xxx',dt='2013-04-22')
SELECT xxx,xxx.... FROM xxxtable WHERE product='xxx' AND dt='2013-04-22';

-- RCFile
Set hive.exec.compress.output=true;
Set mapred.output.compress=true;
Set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
Set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;
INSERT OVERWRITE table hzr_test_rcfile_table PARTITION (product='xxx',dt='2013-04-22')
SELECT xxx,xxx.... FROM xxxtable WHERE product='xxx' AND dt='2013-04-22';
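The CREATE TABLE statements for the three test tables are not listed above; a minimal sketch of what they might look like (the column definitions here are hypothetical placeholders) is:
-- hypothetical column definitions; the real test columns are not listed in the source
create table hzr_test_text_table (col1 string, col2 string)
  partitioned by (product string, dt string)
  stored as textfile;
create table hzr_test_sequence_table (col1 string, col2 string)
  partitioned by (product string, dt string)
  stored as sequencefile;
create table hzr_test_rcfile_table (col1 string, col2 string)
  partitioned by (product string, dt string)
  stored as rcfile;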
Step 2: measure the time taken by insert overwrite table tablename select .... and the resulting storage space.
Type            Insert time (s)    Storage space (GB)
Sequence        97.291             7.13
RCFile          120.901            5.73
TextFile        290.517            6.80
Comparison of insert time and count(1) time:
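The count(1) timings referenced above would presumably come from queries such as the following, using the test tables defined earlier:
select count(1) from hzr_test_text_table;
select count(1) from hzr_test_sequence_table;
select count(1) from hzr_test_rcfile_table;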
Step 3: query response time
Test one
Sql code
Scenario 1, test the query efficiency of reading entire rows:
Select * from hzr_test_sequence_table where game='XXX'
Select * from hzr_test_rcfile_table where game='XXX'
Select * from hzr_test_text_table where game='XXX'
Scenario 2, test the query efficiency of reading specific columns:
Select game,game_server from hzr_test_sequence_table where game = 'XXX'
Select game,game_server from hzr_test_rcfile_table where game = 'XXX'
Select game,game_server from hzr_test_text_table where game = 'XXX'
File format     Query entire rows (s)    Query specific columns (s)
Sequence        42.241                   39.918
RCFile          37.395                   36.248
Text            43.164                   41.632
Time comparison of the two scenarios:
Test 2:
The purpose of this test is to verify whether RCFILE's data reading mode and its lazy decompression mode bring performance advantages. The reading mode only reads the metadata and the relevant columns, saving I/O; lazy decompression only decompresses the relevant column data and does not decompress data for rows that fail the WHERE conditions, which benefits both I/O and efficiency.
Option 1:
Number of records: 698020
Sql code
Insert overwrite local directory 'XXX/XXXX' select game, game_server from hzr_test_xxx_table where game = 'XXX'
Option 2:
Number of records: 67236221
Sql code
Insert overwrite local directory 'xxx/xxxx' select game,game_server from hzr_test_xxx_table
Option 3:
Number of records:
Sql code
Insert overwrite local directory 'xxx/xxx'
Select game from hzr_xxx_rcfile_table
File type       Scheme 1 (s)    Scheme 2 (s)    Scheme 3 (s)
TextFile        54.895          69.428          167.667
SequenceFile    137.096         77.03           123.667
RCFile          44.28           57.037          89.9
The table above shows that RCFILE's query efficiency is higher than SEQUENCEFILE's on both small and large data sets, and RCFILE remains better than SEQUENCEFILE when reading specific columns.
Thank you for reading! This concludes the article on "how many storage formats are there in Hive?". I hope the content above is helpful and that you learned something new from it. If you found the article good, please share it so more people can see it.