2025-01-18 Update From: SLTechnology News & Howtos > Servers
Shulou (Shulou.com) 05/31 Report --
This article analyzes the compression modes and compression libraries used by Hadoop SequenceFile.Writer. The material is kept simple and concrete; follow the steps below to see how the compression type, the codec, and the native-library choice interact.
First, the compression type (CompressionType) of a SequenceFile is one of NONE, RECORD, or BLOCK, specified by the configuration item io.seqfile.compression.type:
NONE: records are not compressed at all.
RECORD: only values are compressed, each one separately, i.e. one compression call per record.
BLOCK: sequences of records are compressed together in blocks. Block compression is triggered when the buffered key and value bytes reach a threshold specified by the configuration item io.seqfile.compress.blocksize, which defaults to 1000000 bytes.
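To see why BLOCK mode tends to produce smaller files than RECORD mode on the same data, here is a small standalone sketch. It uses java.util.zip directly rather than Hadoop's SequenceFile classes (so it only illustrates the principle, not Hadoop's actual code path): compressing each record separately pays the zlib stream overhead once per record, while compressing the records as one block pays it only once.

```java
import java.util.Random;
import java.util.zip.Deflater;

public class RecordVsBlockDemo {
    // Compress a byte array with zlib and return only the compressed size.
    static int compressedSize(byte[] data) {
        Deflater deflater = new Deflater();
        deflater.setInput(data);
        deflater.finish();
        byte[] buf = new byte[data.length + 1024];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buf); // content is discarded; we only count bytes
        }
        deflater.end();
        return total;
    }

    // RECORD-style: each record is compressed on its own; sum the sizes.
    static int recordStyle(byte[][] records) {
        int total = 0;
        for (byte[] r : records) total += compressedSize(r);
        return total;
    }

    // BLOCK-style: concatenate all records and compress them once.
    static int blockStyle(byte[][] records) {
        byte[] all = new byte[records.length * records[0].length];
        for (int i = 0; i < records.length; i++)
            System.arraycopy(records[i], 0, all, i * records[i].length, records[i].length);
        return compressedSize(all);
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        byte[][] records = new byte[1000][200];
        for (byte[] r : records) rnd.nextBytes(r); // random values, like the benchmark below
        System.out.println("RECORD-style total: " + recordStyle(records));
        System.out.println("BLOCK-style total:  " + blockStyle(records));
    }
}
```

With random (incompressible) values, RECORD-style output is larger than BLOCK-style purely because of the per-record zlib header and checksum overhead, which matches the 114.07 MB vs 101.17 MB gap seen in the benchmark later in this article.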
The compression algorithm used in RECORD and BLOCK mode is determined by the CompressionOption specified when the SequenceFile.Writer is created. The CompressionCodec attribute of the CompressionOption is the compression encoder; if it is not specified, org.apache.hadoop.io.compress.DefaultCodec is used, whose underlying compression library is zlib. Apart from DefaultCodec there are several other CompressionCodec implementations: GzipCodec, Lz4Codec, SnappyCodec, and BZip2Codec. They are not compared here.
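As a sketch of the API described above (assuming hadoop-common on the classpath; the output path and the key/value types are placeholders I chose for illustration, not taken from the original article), a SequenceFile.Writer with BLOCK compression and an explicit codec can be created like this:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.DefaultCodec;

public class SeqFileWriteDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // BLOCK compression with the zlib-backed DefaultCodec; the article's other
        // codecs (GzipCodec, Lz4Codec, SnappyCodec, BZip2Codec) could be substituted.
        DefaultCodec codec = new DefaultCodec();
        codec.setConf(conf);

        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(new Path("/tmp/demo.seq")), // placeholder path
                SequenceFile.Writer.keyClass(IntWritable.class),
                SequenceFile.Writer.valueClass(Text.class),
                SequenceFile.Writer.compression(
                        SequenceFile.CompressionType.BLOCK, codec));
        try {
            for (int i = 0; i < 1000; i++) {
                writer.append(new IntWritable(i), new Text("value-" + i));
            }
        } finally {
            writer.close();
        }
    }
}
```

Switching the first argument of Writer.compression() between RECORD and BLOCK is all that is needed to reproduce the two modes benchmarked below.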
When implementing zlib compression, DefaultCodec can use either libhadoop.so (the native library shipped with the Hadoop framework) or the java.util.zip library. The following explains how Hadoop decides between the native library and the Java zip library.
SequenceFile compresses with org.apache.hadoop.io.compress.DefaultCodec by default, which uses the Deflate compression algorithm.
When DefaultCodec creates its compressor, it calls ZlibFactory.getZlibCompressor(conf), whose implementation boils down to this code snippet:
return (isNativeZlibLoaded(conf)) ?
    new ZlibCompressor(conf) :
    new BuiltInZlibDeflater(ZlibFactory.getCompressionLevel(conf).compressionLevel());
If the native zlib library has been loaded, the ZlibCompressor compressor class is used; otherwise the BuiltInZlibDeflater class is used, which delegates to Java's own java.util.zip.Deflater class.
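The pure-Java fallback thus ultimately rests on the JDK's own zlib binding. A minimal compress/decompress round trip through java.util.zip.Deflater and java.util.zip.Inflater (plain JDK code, not Hadoop's wrapper classes) looks like this:

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class JdkZlibRoundTrip {
    // Compress input with the given zlib level and return the compressed bytes.
    static byte[] deflate(byte[] input, int level) {
        Deflater d = new Deflater(level);
        d.setInput(input);
        d.finish();
        byte[] buf = new byte[input.length + 64];
        int n = 0;
        while (!d.finished()) n += d.deflate(buf, n, buf.length - n);
        d.end();
        byte[] out = new byte[n];
        System.arraycopy(buf, 0, out, 0, n);
        return out;
    }

    // Decompress back into a buffer of the known original length.
    static byte[] inflate(byte[] compressed, int originalLength) {
        try {
            Inflater inf = new Inflater();
            inf.setInput(compressed);
            byte[] out = new byte[originalLength];
            int n = 0;
            while (!inf.finished()) n += inf.inflate(out, n, out.length - n);
            inf.end();
            return out;
        } catch (DataFormatException e) {
            throw new RuntimeException(e); // malformed zlib stream
        }
    }

    public static void main(String[] args) {
        byte[] data = "hello hello hello hello".getBytes(StandardCharsets.UTF_8);
        byte[] packed = deflate(data, Deflater.DEFAULT_COMPRESSION);
        byte[] back = inflate(packed, data.length);
        System.out.println(new String(back, StandardCharsets.UTF_8));
    }
}
```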
isNativeZlibLoaded decides based on whether the NativeCodeLoader class has loaded the Hadoop native library. That code is as follows:
// Try to load native hadoop library and set fallback flag appropriately
if (LOG.isDebugEnabled()) {
  LOG.debug("Trying to load the custom-built native-hadoop library...");
}
try {
  System.loadLibrary("hadoop");
  LOG.debug("Loaded the native-hadoop library");
  nativeCodeLoaded = true;
} catch (Throwable t) {
  // Ignore failure to load
  if (LOG.isDebugEnabled()) {
    LOG.debug("Failed to load native-hadoop with error: " + t);
    LOG.debug("java.library.path=" + System.getProperty("java.library.path"));
  }
}
if (!nativeCodeLoaded) {
  LOG.warn("Unable to load native-hadoop library for your platform... " +
      "using builtin-java classes where applicable");
}
Here System.loadLibrary("hadoop") looks for libhadoop.so on Linux.
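The exact file name that System.loadLibrary("hadoop") searches for on the current platform can be checked with the JDK itself; on Linux it resolves to libhadoop.so:

```java
public class NativeNameDemo {
    public static void main(String[] args) {
        // System.loadLibrary("hadoop") searches java.library.path for this file name:
        // "libhadoop.so" on Linux, "hadoop.dll" on Windows, "libhadoop.dylib" on macOS.
        System.out.println(System.mapLibraryName("hadoop"));
        System.out.println("search path: " + System.getProperty("java.library.path"));
    }
}
```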
Summary: when the native Hadoop library cannot be loaded, Hadoop falls back to the java.util.zip.Deflater class to compress the SequenceFile; when the native library can be loaded, it is used instead.
Next, let's compare the performance difference between using and not using the native Hadoop library.
Without the native Hadoop library, the native library path is not included in the JVM runtime parameter java.library.path:
java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
With it, the Hadoop native library path is appended:
java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib:$HADOOP_HOME/lib/native
Virtual machine cluster:
500,000 records, SequenceFile compression mode RECORD, keys of 10 random bytes, values of 200 random bytes:
Native lib disabled: 32689 ms, 114.07 MB after compression
Native lib enabled: 30625 ms, 114.07 MB after compression
500,000 records, SequenceFile compression mode BLOCK, keys of 10 random bytes, values of 200 random bytes:
Native lib disabled: 11354 ms, 101.17 MB after compression
Native lib enabled: 10699 ms, 101.17 MB after compression
Physical machine cluster:
500,000 records, SequenceFile compression mode RECORD, keys of 10 random bytes, values of 200 random bytes:
Native lib disabled: 21953 ms, 114.07 MB after compression
Native lib enabled: 24742 ms, 114.07 MB after compression
1,000,000 records, SequenceFile compression mode RECORD, keys of 10 random bytes, values of 200 random bytes:
Native lib disabled: 48555 ms, 228.14 MB after compression
Native lib enabled: 45770 ms, 228.14 MB after compression
1,000,000 records, SequenceFile compression mode RECORD, keys of 10 random bytes, values of 200 random bytes, zlib compression level BEST_SPEED:
Native lib disabled: 44872 ms, 228.14 MB after compression
Native lib enabled: 51582 ms, 228.14 MB after compression
1,000,000 records, SequenceFile compression mode BLOCK, keys of 10 random bytes, values of 200 random bytes, zlib compression level BEST_SPEED:
Native lib disabled: 14374 ms, 203.54 MB after compression
Native lib enabled: 14639 ms, 203.54 MB after compression
1,000,000 records, SequenceFile compression mode BLOCK, keys of 10 random bytes, values of 200 random bytes, zlib compression level DEFAULT_COMPRESSION:
Native lib disabled: 15397 ms, 203.54 MB after compression
Native lib enabled: 13669 ms, 203.54 MB after compression
Analysis of the test results: across the different compression modes, data volumes, and zlib compression levels, there is little performance difference between the Hadoop native library and the Java zip library.
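One plausible reason the zlib compression level barely changed the results is that the benchmark keys and values are random bytes, which are essentially incompressible, so every level degenerates to roughly the same output. This can be reproduced with the JDK's Deflater (a standalone sketch, not the benchmark code itself):

```java
import java.util.Random;
import java.util.zip.Deflater;

public class LevelOnRandomData {
    // Compress data at the given zlib level and return the output size.
    static int deflatedSize(byte[] data, int level) {
        Deflater d = new Deflater(level);
        d.setInput(data);
        d.finish();
        byte[] buf = new byte[data.length + 1024];
        int n = 0;
        while (!d.finished()) n += d.deflate(buf, n, buf.length - n);
        d.end();
        return n;
    }

    public static void main(String[] args) {
        byte[] data = new byte[200_000];
        new Random(7).nextBytes(data); // incompressible, like the benchmark's random values
        int fast = deflatedSize(data, Deflater.BEST_SPEED);
        int dflt = deflatedSize(data, Deflater.DEFAULT_COMPRESSION);
        // Random input cannot be compressed below its size: both levels emit roughly
        // the input size (zlib falls back to stored blocks), so the level choice
        // changes almost nothing here.
        System.out.println("BEST_SPEED: " + fast + " bytes, DEFAULT: " + dflt + " bytes");
    }
}
```

On compressible real-world data the level choice would matter more; the conclusion above should only be read in the context of this random-byte workload.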
A natural follow-up is to try the other native compression codecs: GzipCodec, Lz4Codec, SnappyCodec, and BZip2Codec.
Thank you for reading. This concludes the analysis of Hadoop SequenceFile.Writer compression modes and compression libraries; how they behave for your own workloads still needs to be verified in practice.