What are the compression and decompression methods in Hadoop

2025-01-22 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/02 Report--

This article introduces the compression and decompression methods available in Hadoop and how to use them. Many people have questions about compression and decompression in Hadoop in daily work, so the editor has collected the material and organized it into simple, practical steps. Hopefully it will help answer your doubts; follow along to study!

One: Compression in Hadoop

Compression reduces the number of bytes read from and written to the underlying HDFS, reduces disk IO, and improves the efficiency of network transfer, because disk IO and network bandwidth are valuable resources in Hadoop. This matters especially when running MR programs, where a lot of time goes to network data transfer, shuffle, and merge. Compression is therefore an optimization strategy for improving Hadoop's running efficiency: used properly it improves efficiency, but used improperly it can reduce it.

1.1: Principles of compression

1. Compute-intensive tasks: these use the CPU heavily for computation, so use compression sparingly.

2. IO-intensive tasks: use compression more, keeping in mind that compression and decompression themselves consume CPU resources.

1.2: Compression codecs supported by MR

DEFLATE: does not support splitting

Gzip: does not support splitting

Bzip2: supports splitting

LZO: not Hadoop-native (requires separate installation); supports splitting

Snappy: does not support splitting
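The tradeoff from 1.1 (compression saves IO bytes but costs CPU) can be sketched with the JDK's java.util.zip.Deflater, which implements the same DEFLATE algorithm listed above. This is a self-contained illustration, not Hadoop's codec API; the class name CompressionTradeoff is made up for the example.

```java
import java.util.zip.Deflater;

public class CompressionTradeoff {

    // Compress the input at the given level and return the compressed size in bytes.
    static int compressedSize(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(input);
        deflater.finish();
        byte[] buf = new byte[4096];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buf);
        }
        deflater.end();
        return total;
    }

    public static void main(String[] args) {
        // Highly repetitive data compresses well at any level.
        byte[] data = "a".repeat(10000).getBytes();
        int fast = compressedSize(data, Deflater.BEST_SPEED);
        int best = compressedSize(data, Deflater.BEST_COMPRESSION);
        // Both levels shrink the data dramatically; BEST_COMPRESSION spends more
        // CPU to squeeze out remaining bytes. That is exactly the CPU-versus-IO
        // tradeoff that guides codec choice in Hadoop.
        System.out.println("input=" + data.length + " fast=" + fast + " best=" + best);
    }
}
```

The fewer bytes written, the less disk IO and network transfer, at the cost of CPU time spent inside deflate().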

In order to support multiple compression/decompression algorithms, Hadoop introduces encoders/decoders (codecs):

org.apache.hadoop.io.compress.DefaultCodec

org.apache.hadoop.io.compress.GzipCodec

org.apache.hadoop.io.compress.BZip2Codec

com.hadoop.compression.lzo.LzopCodec

org.apache.hadoop.io.compress.SnappyCodec

1.3: comparison of compression performance

1.4: choice of compression mode

1.4.1: Gzip compression

Advantages: relatively fast compression/decompression, and handling files in Gzip format is the same as handling text directly.

Disadvantages: split is not supported.

Application scenarios:

Consider Gzip when each file after compression is within about 130 MB (within one block size).
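As a self-contained sketch of a gzip round trip, using the JDK's java.util.zip rather than Hadoop's GzipCodec (the class name GzipRoundTrip is made up for the example):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipRoundTrip {

    // Gzip-compress a byte array in memory.
    static byte[] gzip(byte[] input) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gos = new GZIPOutputStream(bos)) {
            gos.write(input);
        }
        return bos.toByteArray();
    }

    // Decompress a gzip byte array back to the original bytes.
    static byte[] gunzip(byte[] compressed) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPInputStream gis = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = gis.read(buf)) > 0) {
                bos.write(buf, 0, n);
            }
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] original = "hadoop gzip example ".repeat(500).getBytes();
        byte[] packed = gzip(original);
        byte[] unpacked = gunzip(packed);
        // The round trip restores the data exactly; the packed form is much smaller.
        System.out.println("original=" + original.length + " compressed=" + packed.length
                + " roundTripOk=" + Arrays.equals(original, unpacked));
    }
}
```

Because a gzip stream must be decompressed from its beginning, a single large gzip file cannot be split among several map tasks, which is why the block-size guideline above matters.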

1.4.2: Bzip2 compression

Advantages: a higher compression ratio than Gzip; supports split.

Disadvantages: slow compression / decompression

Application scenarios: suitable when speed matters less but a high compression ratio is needed; when the output data is relatively large and the processed data needs to be compressed and archived to reduce storage space; or when a single large text file needs to be compressed while split must still be supported.

1.4.3: LZO compression

Advantages: fast compression/decompression with a reasonable compression ratio; supports split and is among the most popular compression formats for Hadoop; it must be installed separately on the Linux system.

Disadvantages: the compression ratio is lower than Gzip's, and Hadoop itself does not support it. To support split, you need to build an index and specify the InputFormat as the LZO format.

Application scenario: large text files that remain larger than 200 MB after compression; the larger the single file, the more obvious LZO's advantage.

1.4.4: Snappy compression

Advantages: fast compression speed with a reasonable compression ratio.

Disadvantages: does not support split; the compression ratio is lower than Gzip's; Hadoop itself does not support it, so it must be installed separately.

Application scenario: when the Map output of a MapReduce job is relatively large, as the compression format for the intermediate data between Map and Reduce, or as the output of one MapReduce job that serves as the input of another.

Compression can be enabled at any stage of a MapReduce job.

Two: MapReduce data compression

Compression before Map input: Hadoop automatically checks the file extension; if the extension matches a codec, it uses that codec to compress and decompress the file.

Compression of Mapper output: this can effectively speed up the shuffle process, which consumes the most resources.

Note: LZO is a general-purpose codec for Hadoop. Its design goal is a compression speed comparable to hard-disk read speed, so speed is the priority factor and compression ratio comes second. LZO's compression speed is about 5 times that of Gzip, and its decompression speed about 2 times that of Gzip.

Compression of Reducer output: compression reduces the amount of data to store, saving disk space.

Three: parameter configuration of compression

io.compression.codecs (configured in core-site.xml) (before map input)

mapreduce.map.output.compress (configured in mapred-site.xml) (map to reduce)

mapreduce.map.output.compress.codec (configured in mapred-site.xml)

mapreduce.output.fileoutputformat.compress (configured in mapred-site.xml) (reduce output)

mapreduce.output.fileoutputformat.compress.codec (configured in mapred-site.xml)

mapreduce.output.fileoutputformat.compress.type (configured in mapred-site.xml)

If compression is set in the configuration files, everything on the cluster is compressed; if it is set only in the current program, it takes effect only for that program.
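As an illustration, the map-output and job-output parameters above could be set cluster-wide in mapred-site.xml like this (a sketch; the codec choices assume Snappy and Bzip2 support is available on the cluster):

```xml
<!-- mapred-site.xml: compress intermediate map output with Snappy (sketch) -->
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
<!-- compress the final job output with Bzip2 -->
<property>
  <name>mapreduce.output.fileoutputformat.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress.codec</name>
  <value>org.apache.hadoop.io.compress.BZip2Codec</value>
</property>
```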

3.1: Setting the reduce output compression format

// set Reduce output compression
FileOutputFormat.setCompressOutput(job, true);
// compress the result with BZip2Codec
FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);
// or with SnappyCodec
FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);

3.2: Setting the map output compression format

// enable map output compression
conf.setBoolean("mapreduce.map.output.compress", true);
conf.setClass("mapreduce.map.output.compress.codec", BZip2Codec.class, CompressionCodec.class);

Four: file compression and decompression case

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.CompressionInputStream;
import org.apache.hadoop.io.compress.CompressionOutputStream;

public class FileCompress {

    public static void main(String[] args) throws IOException {
        // two parameters: path and compression format
        // compress("E:\\ a.txt", "org.apache.hadoop.io.compress.BZip2Codec");
        decompress("E:\\ a.txt.bz2");
    }

    private static void decompress(String path) throws IOException {
        // 1: check whether the file can be decompressed.
        // CompressionCodecFactory is a factory that finds the correct codec for a given filename.
        CompressionCodecFactory factory = new CompressionCodecFactory(new Configuration());
        // CompressionCodec encapsulates a streaming compression/decompression pair.
        CompressionCodec codec = factory.getCodec(new Path(path));
        if (codec == null) {
            System.out.println("cannot find codec for file " + path);
            return;
        }
        // 2: get the normal input stream, then wrap it in a decompressing input stream
        FileInputStream fis = new FileInputStream(new File(path));
        CompressionInputStream cis = codec.createInputStream(fis);
        // 3: get the output stream
        FileOutputStream fos = new FileOutputStream(new File(path + ".decodec"));
        // 4: copy the decompressed input stream to the output stream
        IOUtils.copyBytes(cis, fos, new Configuration());
        // 5: close the resources
        IOUtils.closeStream(fos);
        IOUtils.closeStream(cis);
        IOUtils.closeStream(fis);
    }

    private static void compress(String path, String method) throws IOException {
        // 1: get the input stream
        FileInputStream fis = new FileInputStream(path);
        // 2: get the compression codec
        CompressionCodecFactory factory = new CompressionCodecFactory(new Configuration());
        CompressionCodec codec = factory.getCodecByName(method);
        // 3: get the normal output stream, then wrap it in a compressing output stream;
        //    the codec supplies the default file extension
        FileOutputStream fos = new FileOutputStream(new File(path + codec.getDefaultExtension()));
        CompressionOutputStream cos = codec.createOutputStream(fos);
        // 4: copy the input stream to the compressing output stream
        IOUtils.copyBytes(fis, cos, new Configuration());
        // 5: close the resources
        IOUtils.closeStream(cos);
        IOUtils.closeStream(fos);
        IOUtils.closeStream(fis);
    }
}

At this point, the study of "what are the compression and decompression methods in Hadoop" is over; hopefully it has resolved your doubts. Pairing theory with practice helps you learn better, so go and try it! If you want to keep learning more related knowledge, please continue to follow the site, where the editor will keep bringing you more practical articles!
