With the arrival of the big data era, data volumes keep growing, and processing that data is increasingly limited by network IO. To handle as much data as possible, we have to use compression. So is compression suitable for every file format in Hadoop, and what properties does it have?
Compression can be used in Sqoop as well as in Hive and Impala. Under what circumstances do we use it? Usually, when the amount of data is very large, we compress it to reduce its size and therefore the IO needed to transfer it later. Compression also helps improve performance and storage efficiency.
I. Data Compression
Every file format supports compression, which reduces the disk space footprint. But compression itself adds CPU overhead, so it requires a tradeoff between CPU time and bandwidth / storage space. For example:
(1) Some algorithms take longer but save more space.
(2) Some algorithms are faster, but the space saved is limited.
How should we understand this? Suppose compressing 1 TB of data down to 100 GB takes 10 minutes, while compressing it only down to 500 GB takes 1 minute. Which do you choose? We have to trade CPU time against bandwidth; neither approach is inherently better or worse, and we choose according to our own needs.
In addition, compression is good for performance: many Hadoop jobs are IO-bound, so compression lets each IO operation carry more data and also improves the performance of network transfers.
II. Compression Codecs
The implementation of a compression algorithm is called a codec, short for Compressor/Decompressor. Many codecs are commonly used in Hadoop, each with different performance characteristics, but not all Hadoop tools are compatible with all codecs. The compression algorithms most commonly used in Hadoop are bzip2, gzip, LZO and Snappy, of which LZO and Snappy require native libraries to be installed on the operating system.
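As a brief aside that is not part of the original article: one way to check which native codec libraries are actually available on a node is the hadoop checknative command.

# List native library support (zlib, snappy, lz4, bzip2, ...) on this node
hadoop checknative -a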
Here we take a look at the performance of different compression tools:
Bzip2 and gzip consume more CPU and give the highest compression ratios, and gzip output cannot be processed in parallel. Snappy is similar to LZO, slightly better, and consumes less CPU than gzip. In general, Snappy and LZO are the common choices when you want to strike a balance between CPU and IO. Here I mainly recommend Snappy, because it offers good compression performance and the compressed data can be split for parallel processing (typically when stored in a container format such as SequenceFile, Avro or Parquet), which is very useful for later processing jobs.
Also note: for hot data, speed matters more. Compressing data by 40% in 1 second is better than compressing it by 80% in 10 seconds.
III. Sqoop Uses Compression
Sqoop uses the --compression-codec flag.
Example:
--compression-codec org.apache.hadoop.io.compress.SnappyCodec
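For context, here is a minimal sketch of how this flag is typically combined with --compress in a full Sqoop import; the connection string, credentials, table name and target directory are placeholders, not taken from the original article:

# Import a table into HDFS, compressing the output files with Snappy
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username etl -P \
  --table orders \
  --target-dir /data/raw/orders \
  --compress \
  --compression-codec org.apache.hadoop.io.compress.SnappyCodec

The --compress flag turns on output compression (gzip by default), and --compression-codec switches the output to the named codec.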
IV. Impala and Hive Use Compression
In Impala and Hive, compression is specified in the table-creation syntax, and different compression codecs use different attributes and syntax.
Note: Impala processes query data in memory, so compression and decompression happen in memory as well.
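Hive example (a minimal sketch using a Parquet table; the table and column names are placeholders, and the exact compression property name can vary between Hive versions):

-- Create a Parquet table whose data files are written with Snappy compression
CREATE TABLE logs_parquet (
  event_time TIMESTAMP,
  message    STRING
)
STORED AS PARQUET
TBLPROPERTIES ('parquet.compression'='SNAPPY');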
Impala example:
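A minimal sketch of the usual approach (table and column names, including the source table orders_raw, are placeholders): the COMPRESSION_CODEC query option controls the codec Impala uses when writing Parquet data files.

-- Choose the codec for Parquet files written by this session
SET COMPRESSION_CODEC=snappy;

-- Data inserted into this table is written as Snappy-compressed Parquet
CREATE TABLE orders_parquet (
  order_id BIGINT,
  amount   DOUBLE
)
STORED AS PARQUET;

INSERT INTO orders_parquet SELECT order_id, amount FROM orders_raw;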