How to compress data in HDFS


This article mainly explains how to compress data in HDFS. The content is simple, clear, and easy to learn; follow the editor's walkthrough to study how to compress data in HDFS.

Efficient storage through data compression

Data compression is an important aspect of file processing, which becomes even more important when dealing with data sizes supported by Hadoop. When most enterprises use Hadoop, the goal is to process data as efficiently as possible, and choosing an appropriate compression codec will make jobs run faster and allow more data to be stored in the cluster.

Choose the correct compression codec for the data

Using compression on HDFS is not as transparent as it is on file systems such as ZFS, especially when working with splittable compressed files (more on that later in this chapter). The advantage of file formats such as Avro and SequenceFile is their built-in compression support, which makes compression almost completely transparent to users. That support is lost when using formats such as plain text.

Problem

Evaluate and determine the best codec for data compression.

Solution

Google's Snappy compression codec provides the best combination of compressed size and read/write performance. However, LZOP is the best codec when working with large compressed files that must be splittable.

Discussion

First, take a quick look at the compression codecs available for Hadoop, as shown in Table 4.1.

Table 4.1 Compression Codec

To evaluate the codecs properly, we first need to settle on evaluation criteria based on functional and performance characteristics. For compression, your criteria may include the following:

Space/time trade-off: in general, the more computation a compression codec spends, the better its compression ratio and the smaller its output.

Splittability: can a compressed file be split so that multiple mappers can work on it? If a compressed file cannot be split, only one mapper can process it, and if the file spans multiple blocks, data locality is lost because the map task may have to read blocks from remote DataNodes, incurring network I/O overhead.

Native compression support: is there a native library that performs compression and decompression? A codec backed by a native library usually outperforms one written purely in Java with no native support.

Table 4.2 Compression Codec comparison

Native vs Java bzip2

Hadoop added native support for bzip2 (starting with versions 2.0 and 1.1.0). Native bzip2 support is the default, but it does not support splittability. If you need splittability, you have to enable the Java bzip2 implementation, which you can do by setting io.compression.codec.bzip2.library to java-builtin.
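
As a minimal sketch, that setting would go into core-site.xml along the following lines (the property name is the one documented by Hadoop; verify it against your version):

<property>
  <name>io.compression.codec.bzip2.library</name>
  <value>java-builtin</value>
</property>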

Next, let's look at how the codecs trade off space against time. A 100 MB (10^8 bytes) XML file (enwik8.zip from http://mattmahoney.net/dc/textdata.html) is used here to compare each codec's running time and compressed size, as shown in Table 4.3.

Table 4.3 Performance comparison of compression codecs on a 100 MB text file

Run the test

When evaluating, I recommend testing with your own data, preferably on a host similar to a production node, so that you get a good sense of the compression ratio and runtime to expect from each codec.

To ensure that the cluster has native codecs enabled, you can check by running the following command:

$ hadoop checknative -a

What do the space and time results tell us? If packing as much data as possible into the cluster is the top priority and long compression times are acceptable, bzip2 may be a suitable codec. If you want to compress data but incur minimal CPU overhead when reading and writing compressed files, consider LZ4. Anyone looking for a balance between compression ratio and execution time will rule out the Java implementation of bzip2.

If splitting compressed files matters, you must choose between bzip2 and LZOP. The native bzip2 codec does not support splitting, and the Java bzip2 times will make most people give up on it. The only advantage bzip2 has over LZOP is that its Hadoop integration is easier to use.

Figure 4.4 compressed size of a single 100 MB text file (smaller values are better)

Figure 4.5 Compression and decompression time for a single 100 MB text file (smaller values are better)

Although LZOP seems to be the best choice, some improvements are needed, as described below.

Summary

The most suitable codec depends on your needs and criteria. If you don't care about splitting files, LZ4 is the most promising codec; if you need splittable files, LZOP is the one to focus on.

In addition, consider whether the data needs to be stored long-term. If you are keeping data for a long time, you may want to compress it as much as possible, and I recommend a zlib-based codec (such as gzip). However, because gzip files are not splittable, it is wise to use gzip with a block-based file format such as Avro or Parquet so that the data can still be split, or to size the output so that each file occupies roughly one HDFS block, making splittability a non-issue.

Keep in mind that the compression size will vary depending on whether the file is text or binary, depending on its contents. To get accurate numbers, you need to run similar tests against your own data.

There are many benefits to compressing data in HDFS, including smaller file sizes and faster MapReduce job runtimes. Many compression codecs are available for Hadoop, and I evaluated them based on functionality and performance. Next, let's look at how to compress files and use them with tools such as MapReduce, Pig, and Hive.

Using compression with HDFS, MapReduce, Pig, and Hive

Since HDFS does not provide built-in compression support, using compression in Hadoop can be a challenge. In addition, splittable compression is not an out-of-the-box feature of Hadoop, which makes it awkward for beginners. If you are working with medium-sized files that compress down to roughly the HDFS block size, the following method is the most obvious and easiest way to use compression in Hadoop.

Problem

You want to read and write compressed files in HDFS and use them with MapReduce, Pig, and Hive.

Solution

Using compressed files in MapReduce involves updating the MapReduce configuration file mapred-site.xml and registering the compression codecs you intend to use. After doing this, no additional steps are required to use compressed input files in MapReduce, and producing compressed MapReduce output is just a matter of setting the mapred.output.compress and mapred.output.compression.codec MapReduce properties.

Discussion

The first step is to figure out how to read and write files using the codecs evaluated earlier in this chapter. All codecs detailed in this chapter are bundled with Hadoop, with the exception of LZO / LZOP and Snappy, which you need to download and build if you want to use them.

To use compression codecs, you first need to know their class names, as shown in Table 4.4.

Table 4.4 Codec classes

Using compression in HDFS

How can I use any of the codecs mentioned in the above table to compress existing files in HDFS? The following code supports this:
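
The original listing is not reproduced here, so what follows is a minimal sketch of the idea (the class name and argument handling are illustrative): a small driver that streams an existing HDFS file through a codec chosen by its class name.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.util.ReflectionUtils;

import java.io.InputStream;
import java.io.OutputStream;

public class HdfsCompressWriter {
  public static void main(String[] args) throws Exception {
    String codecClass = args[0];   // e.g. org.apache.hadoop.io.compress.GzipCodec
    Path src = new Path(args[1]);  // uncompressed source file in HDFS
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Instantiate the codec via ReflectionUtils so it picks up the Hadoop configuration.
    CompressionCodec codec = (CompressionCodec)
        ReflectionUtils.newInstance(Class.forName(codecClass), conf);

    // Write the compressed copy next to the source, with the codec's extension (e.g. ".gz").
    Path dest = new Path(src + codec.getDefaultExtension());

    try (InputStream in = fs.open(src);
         OutputStream out = codec.createOutputStream(fs.create(dest))) {
      IOUtils.copyBytes(in, out, conf);  // stream the bytes through the compressor
    }
  }
}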

Codec caching: one of the overheads of using compression codecs is that instances are expensive to create. When you use the Hadoop ReflectionUtils class, some of the overhead associated with creating an instance is cached inside ReflectionUtils, which speeds up subsequent codec creation. An even better option is CompressionCodecFactory, which provides codec caching itself.

Reading the compressed file back is just as simple as writing it:
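
Again as a sketch rather than the original listing, reading can lean on CompressionCodecFactory to pick the codec from the file extension (class name is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

import java.io.InputStream;

public class HdfsCompressReader {
  public static void main(String[] args) throws Exception {
    Path compressed = new Path(args[0]);  // e.g. /data/file.txt.gz
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // The factory maps the file extension to a registered codec and caches codec instances.
    CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(compressed);

    try (InputStream in = codec.createInputStream(fs.open(compressed))) {
      IOUtils.copyBytes(in, System.out, conf, false);  // decompress to stdout
    }
  }
}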

It's super simple. Now that you can create a compressed file, let's see how to use it in MapReduce.

Using compression in MapReduce

To use compressed files in MapReduce, you need to set some configuration options for the job. For brevity, let's assume that identity mapper and reducer are used in this example:
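
The original listing is omitted, so here is a sketch of what such a job looks like, using Hadoop's base (identity) Mapper and Reducer classes and gzip output compression; the class name and argument handling are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressedIdentityJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "compressed identity job");
    job.setJarByClass(CompressedIdentityJob.class);

    // Identity mapper and reducer: Hadoop's base classes pass records through unchanged.
    job.setMapperClass(Mapper.class);
    job.setReducerClass(Reducer.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));   // compressed input is detected
                                                            // automatically by file extension
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Compression-related lines: equivalent to setting mapred.output.compress
    // and mapred.output.compression.codec mentioned earlier.
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}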

The only difference between a MapReduce job that uses uncompressed I/O and one that uses compressed I/O is the few compression-related configuration lines highlighted in the example above.

Not only can the job's input and output be compressed, so can the intermediate map output, which is first written to disk and then sent to the reducers over the network. How effectively the map output compresses ultimately depends on the type of data being emitted, but in general this change speeds up some jobs.

Why doesn't the previous code specify a compression codec for the input file? By default, the FileInputFormat class uses CompressionCodecFactory to determine whether the input file's extension matches a registered codec. If it finds a codec associated with the file extension, it automatically uses that codec to decompress the input file.

How does MapReduce know which codecs to use? The codecs need to be registered in mapred-site.xml. The following code shows how to register all of the codecs mentioned above. Keep in mind that all compression codecs except gzip, Deflate, and bzip2 must be built and available on the cluster before you can register them:
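
A sketch of that registration follows; trim the list to the codecs actually installed on your cluster (the LZO/LZOP and Snappy entries assume their native libraries are already in place):

<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>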

Now that you've mastered compression with MapReduce, it's time to move up the Hadoop stack. Compression can also be used with Pig and Hive, so let's see how to do with Pig and Hive what we just did with MapReduce.

Using compression in Pig

If you are using Pig, no extra work is needed to read compressed input files. All you need to do is make sure the file extension maps to the appropriate compression codec (see Table 4.4). The following example loads a gzip-compressed local file into Pig and dumps the usernames:
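
A sketch of such a session, with the path, delimiter, and field names being illustrative (the .gz extension is what tells Pig which codec to use):

users = LOAD '/tmp/passwd.gz' USING PigStorage(':') AS (username:chararray);
DUMP users;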

Writing gzip-compressed files is just as easy; be sure to specify the compression codec's extension. The following example stores the results of Pig relation B in an HDFS file and then copies it to the local file system to check the contents:
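
A sketch, assuming a relation B derived from the load above; PigStorage compresses the output because the target directory name ends in .gz:

B = FOREACH users GENERATE username;
STORE B INTO '/tmp/users-out.gz' USING PigStorage();
-- afterwards, from a shell: hadoop fs -copyToLocal /tmp/users-out.gz/part* . && gunzip -c part*.gz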

Using compression in Hive

Like Pig, all we need to do is specify the codec extension when defining the file name:
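
A sketch (the table name, column, and file path are illustrative):

CREATE TABLE apachelog (line STRING);
LOAD DATA INPATH '/tmp/access_log.gz' INTO TABLE apachelog;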

The previous example loads a gzip-compressed file into Hive. In this case, Hive moves the file being loaded into its warehouse directory and keeps the original gzip file as the table's storage.

What if you want to create another table and specify that it needs to be compressed? The following example does this with some Hive configuration to enable MapReduce compression (because the MapReduce job is executed to load the new table in the last statement):
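
A sketch, continuing from the apachelog table above; the SET statements enable compression for the MapReduce job launched by the INSERT:

SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

CREATE TABLE apachelog_backup (line STRING);
INSERT OVERWRITE TABLE apachelog_backup SELECT * FROM apachelog;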

We can verify that Hive does compress the storage of the new apachelog_backup table by looking at it in HDFS:
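
For example, assuming the default warehouse location:

$ hadoop fs -ls /user/hive/warehouse/apachelog_backup
# the part files should carry the compression codec's extension (.gz here)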

It should be noted that Hive recommends using SequenceFile as the output format of the table because SequenceFile blocks can be compressed separately.

Summary

This technique provides a quick and easy way to use compression in Hadoop, and it is suitable for smaller files because compression stays relatively transparent. If your compressed files are much larger than the HDFS block size, consider the following technique.

Splittable LZOP with MapReduce, Hive, and Pig

If you are working with large text files, even after compression they can be many times the HDFS block size. To avoid having a single map task process an entire large compressed file, you need a compression codec whose files can be split.

LZOP meets the requirement, but using it is more involved than the examples above because an LZOP file is not splittable by itself. Although LZOP is block-based, it is impossible to randomly seek into an LZOP file and determine the starting point of the next block, which is the challenge this technique has to solve.

Problem

You want to use a compressed codec to allow MapReduce to work in parallel on a single compressed file.

Solution

In MapReduce, splitting large LZOP-compressed input files requires an LZOP-specific input format class, such as LzoTextInputFormat. The same principle applies when using LZOP-compressed input files in Pig and Hive.

Discussion

The LZOP compression codec is one of only two codecs that allow a compressed file to be split so that multiple map tasks can process it in parallel. The other, bzip2, compresses so slowly that it is often impractical; LZOP offers a good trade-off between compression and speed.

What's the difference between LZO and LZOP? Both the LZO and LZOP codecs can be used with Hadoop. LZO is a stream-based compression format with no notion of blocks or headers. LZOP has a notion of blocks (with checksums), so it is the codec to use when you want the compressed output to be splittable. Confusingly, the Hadoop codecs by default treat files ending in .lzo as LZOP-encoded and files ending in .lzo_deflate as LZO-encoded. In addition, much of the documentation uses LZO and LZOP interchangeably.

Unfortunately, Hadoop does not bundle LZOP, for licensing reasons. Compiling and installing LZOP on a cluster is laborious; to compile the code in this article, install and configure LZOP first.

Reading and writing LZOP files in HDFS

If we want to use LZOP to read and write compressed files, we need to specify the LZOP codec in our code:

Code 4.3 Reading and writing LZOP files in HDFS
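
The listing itself is not reproduced here; the following sketch captures its intent, using the LzopCodec class from the hadoop-lzo project (the class layout and method names of the sketch are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import com.hadoop.compression.lzo.LzopCodec;

import java.io.InputStream;
import java.io.OutputStream;

public class LzopFileReadWrite {

  // Compress an existing HDFS file into a sibling file with the ".lzo" extension.
  public static void write(FileSystem fs, Configuration conf, Path src) throws Exception {
    LzopCodec codec = new LzopCodec();
    codec.setConf(conf);
    Path dest = new Path(src + codec.getDefaultExtension());  // ".lzo"
    try (InputStream in = fs.open(src);
         OutputStream out = codec.createOutputStream(fs.create(dest))) {
      IOUtils.copyBytes(in, out, conf);
    }
  }

  // Decompress an LZOP file from HDFS to stdout.
  public static void read(FileSystem fs, Configuration conf, Path lzoFile) throws Exception {
    LzopCodec codec = new LzopCodec();
    codec.setConf(conf);
    try (InputStream in = codec.createInputStream(fs.open(lzoFile))) {
      IOUtils.copyBytes(in, System.out, conf, false);
    }
  }
}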

Let's write and read the LZOP file to ensure that the generated file can be used by the LZOP utility (replace $HADOOP_CONF_HOME with the location of the Hadoop configuration directory):

The above code generates the core-site.xml.lzo file in HDFS.

Now make sure the file can be read by the standalone lzop utility. Install the lzop binary on the host, copy the LZOP file from HDFS to local disk, decompress it with the native lzop binary, and compare it with the original file:
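
For example, following the core-site.xml.lzo example above:

$ hadoop fs -get core-site.xml.lzo .
$ lzop -d core-site.xml.lzo
$ diff core-site.xml $HADOOP_CONF_HOME/core-site.xml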

The diff confirms that files compressed with the LZOP codec can be decompressed with the lzop binary.

Now that we have the LZOP file, we need to index it so that it can be split.

Create an index for the LZOP file

An LZOP file by itself is not splittable: although it has blocks, the lack of block-delimiting synchronization markers means you cannot seek to an arbitrary point in the file and start reading. But because blocks are used internally, a little preprocessing can generate an index file containing the block offsets.

Read the LZOP file in its entirety and write the block offsets to the index file as the read progresses. The index file format (shown in Figure 4.6) is a binary file containing a series of consecutive 64-bit numbers, each representing the byte offset of a block in the LZOP file.

You can create an index file in the following two ways. If you want to create an index file for a single LZOP file, you only need to make a simple library call, as follows:

$ hadoop com.hadoop.compression.lzo.LzoIndexer core-site.xml.lzo

If you have a large number of LZOP files and need a more efficient way to generate the index files, DistributedLzoIndexer runs a MapReduce job to create them; it accepts both files and directories (scanning for LZOP files recursively):
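
For example, using the same invocation style as above:

$ hadoop com.hadoop.compression.lzo.DistributedLzoIndexer core-site.xml.lzo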

Both of the approaches described above generate the index file in the same directory as the LZOP file. The index file name is the original LZOP file name with an .index suffix; running the previous commands produces the file core-site.xml.lzo.index.

Next, let's take a look at how to use LzoIndexer in Java code. The following code (from the main method of LzoIndexer) causes the index file to be created synchronously:
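
A sketch of that usage, assuming hadoop-lzo's LzoIndexer API (a constructor that takes a Configuration and an index(Path) method):

// create the index synchronously, in-process
LzoIndexer lzoIndexer = new LzoIndexer(new Configuration());
lzoIndexer.index(new Path("core-site.xml.lzo"));  // writes core-site.xml.lzo.index next to the file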

With DistributedLzoIndexer, a MapReduce job starts and runs N mappers, one for each .lzo file. No reducers run, so the (identity) mappers write the index files directly through the custom LzoSplitInputFormat and LzoIndexOutputFormat classes.

If you want to run MapReduce jobs from your own Java code, you can use DistributedLzoIndexer code.
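
For example, assuming DistributedLzoIndexer implements Hadoop's Tool interface (as its command-line usage suggests), it can be driven through ToolRunner:

// launch the indexing MapReduce job programmatically
int exitCode = ToolRunner.run(new DistributedLzoIndexer(),
    new String[] { "core-site.xml.lzo" });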

The LZOP index files are required so that LZOP files can be split in MapReduce, Pig, and Hive jobs. Now that you have the LZOP index files, let's see how to use them with MapReduce.

MapReduce and LZOP

After you create an index file for the LZOP file, you can start using the LZOP file with MapReduce. Unfortunately, this presents the next challenge: none of Hadoop's built-in file-based input formats work with splittable LZOP, because specialized logic is needed to compute input splits from the LZOP index files. Specific input format classes are needed to work with splittable LZOP.

The LZOP library provides an LzoTextInputFormat implementation for line-oriented, LZOP-compressed text files that have accompanying index files.

The following code shows the steps required to configure a MapReduce job to use LZOP. We will perform the following steps for a MapReduce job with text LZOP input and output:
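
A sketch of the LZOP-specific wiring follows; the rest of the job setup mirrors the earlier compressed-job example, and the class names come from the hadoop-lzo project:

// read line-oriented LZOP text, using the .index files to compute splits
job.setInputFormatClass(com.hadoop.mapreduce.LzoTextInputFormat.class);

// write LZOP-compressed job output
FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, com.hadoop.compression.lzo.LzopCodec.class);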

Compressing the intermediate map output also reduces the overall execution time of the MapReduce job:
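
For example, at the job level (the property names below follow the newer mapreduce.* naming; older releases use mapred.compress.map.output and mapred.map.output.compression.codec):

// compress the intermediate map output with LZO
conf.setBoolean("mapreduce.map.output.compress", true);
conf.setClass("mapreduce.map.output.compress.codec",
    com.hadoop.compression.lzo.LzoCodec.class,
    org.apache.hadoop.io.compress.CompressionCodec.class);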

You can also configure the cluster to always compress the map output by editing mapred-site.xml:
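
A sketch, using the same property names as above:

<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>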

The number of splits per LZOP file is a function of the number of LZOP blocks occupied by the file, not the number of HDFS blocks occupied by the file.

Pig and Hive

Elephant Bird, a Twitter project containing utilities for working with LZOP, provides many useful MapReduce and Pig classes. Among them is an LzoPigStorage class that lets Pig work with text-based, LZOP-compressed data.
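
A sketch of LzoPigStorage in use; the jar name and data path are illustrative, and the package name should be checked against the Elephant Bird version you install:

REGISTER elephant-bird-pig.jar;
logs = LOAD '/data/logs.lzo' USING com.twitter.elephantbird.pig.store.LzoPigStorage(',');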

Hive can read LZOP-compressed text files by using the com.hadoop.mapred.DeprecatedLzoTextInputFormat input format class from the LZO library.
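
A sketch of such a table definition; the table name and location are illustrative, and the output format shown is the plain-text output format Hive commonly pairs with this input format:

CREATE EXTERNAL TABLE lzo_logs (line STRING)
STORED AS
  INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/data/lzo-logs';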

Thank you for reading. That concludes "how to compress data in HDFS". After studying this article you should have a deeper understanding of how to compress data in HDFS, though the specifics still need to be verified in practice.
