Data Compression Methods in Hive

2025-04-10 Update. From: SLTechnology News&Howtos, Shulou (Shulou.com), 06/02 report

This article explains the data compression methods available in Hive: the compression codecs that MapReduce supports, the configuration parameters that control them, and how to enable compression for map output and for final output.

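1. Compression codecs supported by MR

| Compression format | Tool | Algorithm | File extension | Splittable |
| --- | --- | --- | --- | --- |
| DEFLATE | None | DEFLATE | .deflate | No |
| Gzip | gzip | DEFLATE | .gz | No |
| Bzip2 | bzip2 | bzip2 | .bz2 | Yes |
| LZO | lzop | LZO | .lzo | Yes, if indexed |
| LZ4 | None | LZ4 | .lz4 | No |
| Snappy | None | Snappy | .snappy | No |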
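Whether a format can be split matters because a file in a non-splittable format has to be read by a single map task, however large it is. The Python sketch below gives a rough picture of that effect. It assumes a 128 MB split size and ignores Hadoop's exact split-boundary rules; the function name and the file sizes are illustrative and do not come from the article.

```python
import math

# Formats from the table above that can be split as plain files.
# LZO is left out because it is splittable only after an index is built.
SPLITTABLE_EXTENSIONS = {".bz2"}


def estimate_map_splits(file_name: str, size_mb: float, split_mb: float = 128) -> int:
    """Approximate how many input splits (map tasks) one compressed file yields."""
    extension = "." + file_name.rsplit(".", 1)[-1]
    if extension not in SPLITTABLE_EXTENSIONS:
        # A non-splittable file is processed as a single split.
        return 1
    return max(1, math.ceil(size_mb / split_mb))


print(estimate_map_splits("logs.gz", 1024))   # 1
print(estimate_map_splits("logs.bz2", 1024))  # 8
```

Under these assumptions, a 1 GB gzip file is handled by one map task, while a bzip2 file of the same size can be spread over eight.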

To support multiple compression / decompression algorithms, Hadoop introduces encoders / decoders (codecs), as shown in the following table:

| Compression format | Corresponding encoder / decoder |
| --- | --- |
| DEFLATE | org.apache.hadoop.io.compress.DefaultCodec |
| Gzip | org.apache.hadoop.io.compress.GzipCodec |
| Bzip2 | org.apache.hadoop.io.compress.BZip2Codec |
| LZO | com.hadoop.compression.lzo.LzopCodec |
| LZ4 | org.apache.hadoop.io.compress.Lz4Codec |
| Snappy | org.apache.hadoop.io.compress.SnappyCodec |
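Hadoop chooses the codec for an input file from its file extension. The Python sketch below pictures that lookup using the extensions and class names listed above. It illustrates the idea only and is not Hadoop's implementation; `codec_for` and the sample paths are hypothetical.

```python
from typing import Optional

# File extension -> codec class name, taken from the tables above.
CODEC_BY_EXTENSION = {
    ".deflate": "org.apache.hadoop.io.compress.DefaultCodec",
    ".gz": "org.apache.hadoop.io.compress.GzipCodec",
    ".bz2": "org.apache.hadoop.io.compress.BZip2Codec",
    ".lzo": "com.hadoop.compression.lzo.LzopCodec",
    ".lz4": "org.apache.hadoop.io.compress.Lz4Codec",
    ".snappy": "org.apache.hadoop.io.compress.SnappyCodec",
}


def codec_for(path: str) -> Optional[str]:
    """Return the codec class name whose extension matches the file name, if any."""
    for extension, codec in CODEC_BY_EXTENSION.items():
        if path.endswith(extension):
            return codec
    return None  # no matching extension: treat the file as uncompressed


print(codec_for("/data/score/part-00000.gz"))
print(codec_for("/data/score/part-00000.txt"))  # None
```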

Comparison of compression performance

| Compression algorithm | Original file size | Compressed file size | Compression speed | Decompression speed |
| --- | --- | --- | --- | --- |
| Gzip | 8.3GB | 1.8GB | 17.5MB/s | 58MB/s |
| Bzip2 | 8.3GB | 1.1GB | 2.4MB/s | 9.5MB/s |
| LZO | 8.3GB | 2.9GB | 49.3MB/s | 74.6MB/s |

For Snappy, the project page (http://google.github.io/snappy/) states:

On a single core of a Core i7 processor in 64-bit mode, Snappy compresses at about 250 MB/sec or more and decompresses at about 500 MB/sec or more.
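The figures for Gzip, Bzip2, and LZO are easier to compare as a compression ratio and an approximate compression time. The Python sketch below works them out from the numbers quoted above. It assumes 1 GB = 1024 MB and a single compression stream, and the helper name `summarize` is illustrative.

```python
# Figures from the comparison above:
# (original GB, compressed GB, compression MB/s, decompression MB/s).
RESULTS = {
    "Gzip": (8.3, 1.8, 17.5, 58.0),
    "Bzip2": (8.3, 1.1, 2.4, 9.5),
    "LZO": (8.3, 2.9, 49.3, 74.6),
}


def summarize(original_gb, compressed_gb, compress_mb_s):
    """Return (compressed size as % of original, minutes needed to compress)."""
    percent_of_original = 100 * compressed_gb / original_gb
    # Assumption: 1 GB = 1024 MB and a single compression stream.
    minutes = original_gb * 1024 / compress_mb_s / 60
    return round(percent_of_original, 1), round(minutes, 1)


for name, (orig, comp, c_speed, _d_speed) in RESULTS.items():
    percent, minutes = summarize(orig, comp, c_speed)
    print(f"{name}: {percent}% of original size, about {minutes} min to compress")
```

On these numbers Bzip2 produces the smallest file (about 13% of the original) but takes roughly an hour, while LZO finishes in about three minutes and leaves the largest file (about 35%).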

2. Compression configuration parameters

To enable compression in Hadoop, configure the following parameters (in the mapred-site.xml file unless noted otherwise):

| Parameter | Default value | Stage | Suggestion |
| --- | --- | --- | --- |
| io.compression.codecs (configured in core-site.xml) | org.apache.hadoop.io.compress.DefaultCodec, org.apache.hadoop.io.compress.GzipCodec, org.apache.hadoop.io.compress.BZip2Codec, org.apache.hadoop.io.compress.Lz4Codec | Input compression | Hadoop uses the file extension to determine whether a codec is supported. |
| mapreduce.map.output.compress | false | Mapper output | Set to true to enable compression. |
| mapreduce.map.output.compress.codec | org.apache.hadoop.io.compress.DefaultCodec | Mapper output | Compress data at this stage using the LZO, LZ4, or Snappy codecs. |
| mapreduce.output.fileoutputformat.compress | false | Reducer output | Set to true to enable compression. |
| mapreduce.output.fileoutputformat.compress.codec | org.apache.hadoop.io.compress.DefaultCodec | Reducer output | Use standard tools or codecs such as gzip and bzip2. |
| mapreduce.output.fileoutputformat.compress.type | RECORD | Reducer output | Compression type used for SequenceFile output: NONE, RECORD, or BLOCK. |
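These parameters are written into Hadoop configuration files as name/value properties. As a rough illustration, the Python sketch below renders two of them in the `<configuration>` / `<property>` layout those files use. The chosen values (Snappy for map output) and the helper name `to_configuration_xml` are illustrative assumptions, not settings the article prescribes.

```python
import xml.etree.ElementTree as ET

# Illustrative values only: compress map output with Snappy.
SETTINGS = {
    "mapreduce.map.output.compress": "true",
    "mapreduce.map.output.compress.codec":
        "org.apache.hadoop.io.compress.SnappyCodec",
}


def to_configuration_xml(settings: dict) -> str:
    """Render settings as Hadoop-style <configuration><property> XML."""
    root = ET.Element("configuration")
    for name, value in settings.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


print(to_configuration_xml(SETTINGS))
```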

3. Enable Map output phase compression

Enabling compression in the map output phase reduces the amount of data transferred between the map and reduce tasks of a job.
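Before turning to the configuration, the effect can be pictured with a small Python sketch that compresses some made-up map output and compares sizes. The record layout is hypothetical, and zlib (DEFLATE, the algorithm behind DefaultCodec) stands in for Snappy, which is not part of the Python standard library, so the ratio shown is only indicative.

```python
import zlib

# Hypothetical map output: repetitive key/value lines of the kind a
# query grouping or sorting by s_id might emit.
records = [f"s_id={i % 50:04d}\tscore={i % 100}\n" for i in range(20000)]
raw = "".join(records).encode("utf-8")

# zlib implements DEFLATE; it is used here only as a stand-in codec.
compressed = zlib.compress(raw)

print(f"raw: {len(raw)} bytes, compressed: {len(compressed)} bytes")
print(f"data to transfer is {100 * len(compressed) / len(raw):.1f}% of the original")
```

The reduce side has to decompress the data again, which is why the fast codecs (LZO, LZ4, Snappy) are the usual choice for this stage.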

Case practice:

1) Enable compression of Hive's intermediate data

hive (default)> set hive.exec.compress.intermediate=true;

2) Enable map output compression in MapReduce

hive (default)> set mapreduce.map.output.compress=true;

3) Set the codec used for map output compression in MapReduce

hive (default)> set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

4) Execute the query

hive (default)> select count(1) from score;

4. Enable Reduce output phase compression

When Hive writes output to a table, that output can also be compressed. The property hive.exec.compress.output controls this. You may want to keep the default value of false in the default configuration file, so that output is uncompressed plain text by default, and turn on output compression by setting the property to true in a query or an execution script.

Case practice:

1) Enable compression of Hive's final output data

hive (default)> set hive.exec.compress.output=true;

2) Enable compression of the MapReduce final output

hive (default)> set mapreduce.output.fileoutputformat.compress=true;

3) Set the codec used for the MapReduce final output

hive (default)> set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

4) Set the MapReduce final output compression type to block compression

hive (default)> set mapreduce.output.fileoutputformat.compress.type=BLOCK;

5) Test whether the output is a compressed file

hive (default)> insert overwrite local directory '/export/servers/snappy' select * from score distribute by s_id sort by s_id desc;
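One way to check the result is to look at the file names in the output directory: with Snappy output compression enabled, the result files would be expected to carry a .snappy extension. The Python sketch below lists the files and flags those whose extension is one of the compression extensions covered in section 1. The helper name `compressed_files` is illustrative, and the check looks only at file names, not at file contents.

```python
from pathlib import Path

# Compression file extensions covered in section 1 of this article.
COMPRESSED_EXTENSIONS = {".deflate", ".gz", ".bz2", ".lzo", ".lz4", ".snappy"}


def compressed_files(directory: str) -> dict:
    """Map each visible regular file in the directory to True if its
    name ends with a known compression extension."""
    result = {}
    for path in sorted(Path(directory).iterdir()):
        # Skip subdirectories and hidden files.
        if path.is_file() and not path.name.startswith("."):
            result[path.name] = path.suffix in COMPRESSED_EXTENSIONS
    return result


if __name__ == "__main__":
    output_dir = "/export/servers/snappy"  # directory used in the query above
    if Path(output_dir).is_dir():
        for name, is_compressed in compressed_files(output_dir).items():
            print(name, "compressed" if is_compressed else "not compressed")
```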
