I. The significance of data compression in Hadoop

1. Basic overview
Compression technology reduces the number of bytes read from and written to the underlying HDFS, reduces the network bandwidth consumed during data transfer, and saves disk space. In MapReduce, the shuffle and merge phases face heavy IO pressure, so compression helps there as well. Note, however, that compression increases the CPU load. It is therefore necessary to weigh whether to compress at all, and which compression algorithm to use, against the characteristics of the workload.
2. Basic principles of compression application
Compute-intensive jobs: use less compression, because compression and decompression consume CPU.
IO-intensive jobs: use more compression to reduce the amount of data read and written.
When choosing a compression algorithm, pay attention to the compression ratio: the higher the ratio, the longer compression and decompression take.
II. Compression formats supported by MapReduce

1. Comparison of compression formats

Format  | Built into Hadoop     | Algorithm | Extension | Splittable | Program changes needed
DEFAULT | yes                   | DEFLATE   | .deflate  | no         | none; handled like plain text
gzip    | yes                   | DEFLATE   | .gz       | no         | none; handled like plain text
bzip2   | yes                   | bzip2     | .bz2      | yes        | none; handled like plain text
LZO     | no, must be installed | LZO       | .lzo      | yes        | an index file must be built and the input format specified
snappy  | no, must be installed | Snappy    | .snappy   | no         | none; handled like plain text

2. The codec corresponding to each compression algorithm

Format  | Codec
DEFAULT | org.apache.hadoop.io.compress.DefaultCodec
gzip    | org.apache.hadoop.io.compress.GzipCodec
bzip2   | org.apache.hadoop.io.compress.BZip2Codec
LZO     | com.hadoop.compression.lzo.LzopCodec
snappy  | org.apache.hadoop.io.compress.SnappyCodec
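To connect the two tables, here is a small sketch (not from the original article) that instantiates the built-in codecs by the class names above and prints the default extensions from the first table, using the same ReflectionUtils approach as the practical example in section IV:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class CodecExtensions {
    public static void main(String[] args) throws ClassNotFoundException {
        Configuration conf = new Configuration();
        String[] codecClasses = {
                "org.apache.hadoop.io.compress.DefaultCodec",
                "org.apache.hadoop.io.compress.GzipCodec",
                "org.apache.hadoop.io.compress.BZip2Codec"
        };
        for (String name : codecClasses) {
            // instantiate the codec by class name, as the framework does
            CompressionCodec codec = (CompressionCodec)
                    ReflectionUtils.newInstance(Class.forName(name), conf);
            // prints .deflate, .gz and .bz2 respectively
            System.out.println(name + " -> " + codec.getDefaultExtension());
        }
    }
}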
3. Characteristics and applicable scenarios of the different algorithms

(1) gzip

Advantages:
High compression ratio, with reasonably fast compression and decompression. Hadoop supports it natively, and processing gzip files in an application is the same as processing plain text. Most Linux distributions ship with the gzip command, so it is convenient to use.
Disadvantages: does not support split.
Applicable scenarios:
Consider gzip when each compressed file is around the size of one block (because gzip files cannot be split). For example, a day's or an hour's logs can be compressed into one gzip file, and MapReduce can then process multiple gzip files in parallel. Hive, streaming, and MapReduce programs handle such compressed files exactly like text files, with no program changes.
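Because no program changes are needed, pointing a job at gzip input is enough. A minimal fragment (the log path is hypothetical); TextInputFormat picks the codec from the .gz extension and feeds each file, undecomposed, to one map task:

// no code changes needed: just add the .gz files as input (hypothetical path);
// the input format decompresses each file transparently based on its extension
FileInputFormat.addInputPath(job, new Path("/logs/2020-01-01/*.gz"));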
(2) bzip2
Advantages:
Supports split; high compression ratio, higher than gzip's. Built into Hadoop, and Linux ships with the bzip2 command.
Disadvantages: compression and decompression are slow, and it does not support native libraries (the Java-C interop API).
Applicable scenarios:
Suitable as the output format of a MapReduce job when speed matters less than compression ratio; when the output data is large and needs to be compressed and archived to save disk space while being used rarely afterwards; or when a single large text file must be compressed to reduce storage yet still support split and remain compatible with existing applications (that is, the applications need no modification).
(3) lzo
Advantages:
Compression and decompression are fast and the compression ratio is reasonable (lower than gzip's and bzip2's). It supports split and is one of the most popular compression formats in Hadoop. Under Linux it can be used after installing the lzop command.
Disadvantages:
The compression ratio is lower than gzip's; Hadoop does not support it natively, so it must be installed; and lzo files need some special handling in applications (to support split, an index must be built, and the input format must be specified as the LZO format; see the sketch below).
Applicable scenarios:
Consider lzo for large text files that are still larger than 200 MB after compression; the larger the single file, the more pronounced lzo's advantage.
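As a sketch of the special handling mentioned above, the third-party hadoop-lzo library provides an indexer and an LZO-aware input format. The class names below come from that library, not Hadoop core, and are stated here as an assumption:

// build the index so the .lzo file becomes splittable (run once per file):
//   hadoop jar hadoop-lzo.jar com.hadoop.compression.lzo.LzoIndexer /data/big.lzo
// then tell the job to use the LZO-aware input format instead of TextInputFormat
job.setInputFormatClass(LzoTextInputFormat.class);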
(4) snappy
Advantages: fast compression and decompression, with a reasonable compression ratio.
Disadvantages: does not support split; compression ratio is lower than gzip's; Hadoop does not support it natively, so it must be installed.
Applicable scenarios:
Use it when the map output of a MapReduce job is large: as the compression format for the intermediate data between the map and reduce stages, or for the output of one MapReduce job that serves as the input of another.
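For the job-to-job case, a common pattern (a sketch, not from the original article; job and job2 stand for the two Job instances) is to write block-compressed snappy SequenceFiles from the first job and read them back in the second:

// job 1: write snappy-compressed SequenceFiles, compressing whole blocks of records
job.setOutputFormatClass(SequenceFileOutputFormat.class);
SequenceFileOutputFormat.setCompressOutput(job, true);
SequenceFileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);
// job 2: read them back; decompression is transparent
job2.setInputFormatClass(SequenceFileInputFormat.class);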
III. Configuring compression

1. Scope of application
Compression can be applied at any stage of a MapReduce job: the input (raw data), the map output, the reduce output, and so on.
2. Default values of the Hadoop compression configuration parameters, and recommendations

io.compression.codecs (configured in core-site.xml)
  Default: org.apache.hadoop.io.compress.DefaultCodec, org.apache.hadoop.io.compress.GzipCodec, org.apache.hadoop.io.compress.BZip2Codec
  Stage: input compression. Hadoop uses the file extension to determine whether a codec is supported.

mapreduce.map.output.compress (configured in mapred-site.xml)
  Default: false
  Stage: map output. Set to true to enable compression.

mapreduce.map.output.compress.codec (configured in mapred-site.xml)
  Default: org.apache.hadoop.io.compress.DefaultCodec
  Stage: map output. Use LZO or snappy at this stage.

mapreduce.output.fileoutputformat.compress (configured in mapred-site.xml)
  Default: false
  Stage: reduce output. Set to true to enable compression.

mapreduce.output.fileoutputformat.compress.codec (configured in mapred-site.xml)
  Default: org.apache.hadoop.io.compress.DefaultCodec
  Stage: reduce output. Use gzip or bzip2 at this stage.

mapreduce.output.fileoutputformat.compress.type (configured in mapred-site.xml)
  Default: RECORD
  Stage: reduce output. SequenceFile compression type: NONE, RECORD, or BLOCK.
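The same parameters can also be set per job through the Configuration API instead of the XML files. A minimal fragment (the codec choices are illustrative, following the recommendations above):

Configuration conf = new Configuration();
// map output (intermediate) compression
conf.setBoolean("mapreduce.map.output.compress", true);
conf.set("mapreduce.map.output.compress.codec", "org.apache.hadoop.io.compress.SnappyCodec");
// reduce (final) output compression
conf.setBoolean("mapreduce.output.fileoutputformat.compress", true);
conf.set("mapreduce.output.fileoutputformat.compress.codec", "org.apache.hadoop.io.compress.GzipCodec");
// compression granularity for SequenceFile output: NONE, RECORD or BLOCK
conf.set("mapreduce.output.fileoutputformat.compress.type", "BLOCK");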
IV. Practical compression examples

1. Compressing and decompressing a data stream

package JavaCompress;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.CompressionInputStream;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.util.ReflectionUtils;

import java.io.*;

public class TestCompress {

    public static void main(String[] args) throws IOException, ClassNotFoundException {
        // compress("G:\\Fly Away-Leong Jingru.mp3", "org.apache.hadoop.io.compress.GzipCodec");
        deCompress("G:\\Fly Away-Leong Jingru.mp3.gz", "mp3");
    }

    // compression
    public static void compress(String filename, String method) throws IOException, ClassNotFoundException {
        // create the input stream
        FileInputStream fis = new FileInputStream(new File(filename));
        // get the Class object of the codec class via reflection
        Class<?> codecClass = Class.forName(method);
        // instantiate the codec via reflection
        CompressionCodec codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, new Configuration());
        // create a plain output stream; the codec's default extension is appended to the file name
        FileOutputStream fos = new FileOutputStream(new File(filename + codec.getDefaultExtension()));
        // wrap the plain output stream in a compressed output stream
        CompressionOutputStream cos = codec.createOutputStream(fos);
        // stream copy
        IOUtils.copyBytes(fis, cos, 1024 * 1024 * 5, false);
        fis.close();
        cos.close();
        fos.close();
    }

    // decompression
    public static void deCompress(String filename, String decode) throws IOException {
        CompressionCodecFactory factory = new CompressionCodecFactory(new Configuration());
        // infer the codec from the file extension; a null return means the file cannot be decompressed here
        CompressionCodec codec = factory.getCodec(new Path(filename));
        if (codec == null) {
            System.out.println("cannot decompress: " + filename);
            return;
        }
        // create a compressed input stream matching the file's compression type
        CompressionInputStream cis = codec.createInputStream(new FileInputStream(new File(filename)));
        // create the output stream, appending the target extension
        FileOutputStream fos = new FileOutputStream(new File(filename + "." + decode));
        IOUtils.copyBytes(cis, fos, 1024 * 1024 * 5, false);
        cis.close();
        fos.close();
    }
}
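A quick round trip with the class above (the file path is a hypothetical local file):

// produces G:\data.log.bz2, then decompresses it to G:\data.log.bz2.log
compress("G:\\data.log", "org.apache.hadoop.io.compress.BZip2Codec");
deCompress("G:\\data.log.bz2", "log");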
2. Compressing map output

Usage is simple: just set the following parameters on the job in the driver.
Configuration configuration = new Configuration();
// enable map output compression
configuration.setBoolean("mapreduce.map.output.compress", true);
// set the map output compression codec
configuration.setClass("mapreduce.map.output.compress.codec", BZip2Codec.class, CompressionCodec.class);
3. Compressing reduce output

Again, set the following in the driver.
// enable reduce output compression
FileOutputFormat.setCompressOutput(job, true);
// set the compression codec
FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);
// FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
// FileOutputFormat.setOutputCompressorClass(job, DefaultCodec.class);
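Putting both together, here is a minimal self-contained driver sketch (not from the original article; it uses the identity mapper and reducer, and the input and output paths come from the command line):

package JavaCompress;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressedJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // intermediate (map output) compression: snappy for speed
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec", SnappyCodec.class, CompressionCodec.class);

        Job job = Job.getInstance(conf, "compressed job");
        job.setJarByClass(CompressedJobDriver.class);
        // the identity mapper and reducer keep the sketch minimal
        job.setMapperClass(Mapper.class);
        job.setReducerClass(Reducer.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // final (reduce output) compression: bzip2 for ratio and splittability
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}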