
A gz Compression Method Suitable for MapReduce Processing

2025-01-28 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/02 Report--

Recently I have been setting up Hadoop. The test cluster has only six ordinary virtual machines, each with 1 GB of memory and 100 GB of disk, so YARN has very little room when scheduling resources and disk space is tight as well. When running a job, I therefore want the input data to be compressed as much as possible.

Hadoop can process gz-compressed files directly, but it does not generate splits for them: however large the file is, it is handed to a single Mapper, because the gz format does not support splitting algorithmically. bzip2 does support splits, but it compresses relatively slowly, and gz is arguably the most commonly used compression format.

At first I naively tried split-volume compression, which of course failed: no matter how many volumes the archive is cut into, the gz stream still has to be decompressed as a whole.
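The failure can be reproduced without Hadoop. The following is a minimal sketch, assuming an in-memory gzip stream cut in half as a stand-in for two volumes; the class and method names (SplitVolumeDemo, gzip, canDecompress) are hypothetical and not part of the original program.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class SplitVolumeDemo {

    /** Gzips a byte array entirely in memory. */
    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buffer)) {
            gz.write(data);
        }
        return buffer.toByteArray();
    }

    /** Returns true only if the bytes decompress from start to end without an error. */
    static boolean canDecompress(byte[] bytes) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(bytes))) {
            byte[] chunk = new byte[4096];
            while (in.read(chunk) != -1) {
                // The content is discarded; only success or failure matters here.
            }
            return true;
        } catch (IOException e) {
            // Missing gzip header, truncated stream, or corrupt data.
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < 10_000; i++) {
            text.append("line ").append(i).append('\n');
        }
        byte[] whole = gzip(text.toString().getBytes(StandardCharsets.UTF_8));

        // Cut the compressed bytes in the middle, as a volume splitter would.
        int middle = whole.length / 2;
        byte[] firstVolume = Arrays.copyOfRange(whole, 0, middle);
        byte[] secondVolume = Arrays.copyOfRange(whole, middle, whole.length);

        System.out.println("whole stream decompresses:  " + canDecompress(whole));
        System.out.println("first volume decompresses:  " + canDecompress(firstVolume));
        System.out.println("second volume decompresses: " + canDecompress(secondVolume));
    }
}
```

The whole stream is expected to decompress, while neither half should: the first is truncated, and the second begins in the middle of the compressed data without a gzip header.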

I only deal with text data that is organized line by line, and the lines have no nesting relationship of the kind found in XML. So I wrote a compression program that, while compressing a large file, starts a new output file whenever the compressed file being generated grows beyond a set size.

package util;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class CompressUtils {

    /**
     * Compress a text file into GZIP shards, cutting only at line boundaries.
     *
     * @param inputFile      input file
     * @param outputDir      output directory
     * @param outputFileName output file name (the shard index and ".gz" are appended)
     * @param splitSize      shard size in MB
     */
    public static void compressToSplitsUseGZIP(File inputFile, File outputDir,
            String outputFileName, int splitSize) throws IOException {
        byte[] separator = System.getProperty("line.separator").getBytes(StandardCharsets.UTF_8);
        int split = 0;
        long limit = splitSize * 1024L * 1024L;
        File outputSplit = new File(outputDir, outputFileName + split + ".gz");
        OutputStream out = new GZIPOutputStream(new FileOutputStream(outputSplit));
        long fileLength = outputSplit.length();
        long maxInc = 0L; // largest growth of a shard observed after writing one line

        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new FileInputStream(inputFile), StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                // Start a new shard once another increment of the largest size seen
                // so far would push the current shard past the limit.
                if (fileLength + maxInc > limit) {
                    out.close();
                    outputSplit = new File(outputDir, outputFileName + (++split) + ".gz");
                    out = new GZIPOutputStream(new FileOutputStream(outputSplit));
                    fileLength = outputSplit.length();
                }
                // Write the bytes straight to the gzip stream as UTF-8. Passing single
                // bytes to PrintWriter.write(int) would treat each one as a character
                // and corrupt non-ASCII text.
                out.write(line.getBytes(StandardCharsets.UTF_8));
                out.write(separator);
                out.flush();

                // The file only grows when the deflater emits output, so remember
                // the largest jump and use it as headroom in the check above.
                long currentLength = outputSplit.length();
                long inc = currentLength - fileLength;
                if (inc >= maxInc) {
                    maxInc = inc;
                }
                fileLength = currentLength;
            }
        } finally {
            out.close();
        }
    }

    public static void main(String[] args) throws Exception {
        File inputFile = new File(args[0]);
        File outputDir = new File(args[1]);
        String outputFileName = args[2];
        int splitSize = Integer.parseInt(args[3]);
        compressToSplitsUseGZIP(inputFile, outputDir, outputFileName, splitSize);
    }
}
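The program above asks the file system for the shard's length after every line. As a possible variation, and not the author's method, the same headroom check can be driven by a small counting stream placed between the gzip stream and the file. This is a sketch under assumptions: CountingGzipSplitter, CountingOutputStream and split are hypothetical names, the limit is given in bytes, and lines are terminated with "\n" rather than the platform separator.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

class CountingGzipSplitter {

    /** Counts the bytes that reach the wrapped stream, i.e. the compressed bytes written so far. */
    static final class CountingOutputStream extends FilterOutputStream {
        long count;

        CountingOutputStream(OutputStream out) {
            super(out);
        }

        @Override
        public void write(int b) throws IOException {
            out.write(b);
            count++;
        }

        @Override
        public void write(byte[] b, int off, int len) throws IOException {
            out.write(b, off, len);
            count += len;
        }
    }

    /** Writes prefix0.gz, prefix1.gz, ... into outputDir and returns the number of shards. */
    static int split(File inputFile, File outputDir, String prefix, long limitBytes)
            throws IOException {
        int shards = 0;
        long maxInc = 0; // largest growth of the compressed output seen for one line
        CountingOutputStream counter = null;
        OutputStream gz = null;
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(inputFile), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Same headroom check as the original, using the in-process counter.
                if (gz != null && counter.count + maxInc > limitBytes) {
                    gz.close();
                    gz = null;
                }
                if (gz == null) {
                    File shard = new File(outputDir, prefix + shards + ".gz");
                    counter = new CountingOutputStream(new FileOutputStream(shard));
                    gz = new GZIPOutputStream(counter);
                    shards++;
                }
                long before = counter.count;
                gz.write(line.getBytes(StandardCharsets.UTF_8));
                gz.write('\n');
                maxInc = Math.max(maxInc, counter.count - before);
            }
        } finally {
            if (gz != null) {
                gz.close(); // writes whatever the deflater still buffers, plus the trailer
            }
        }
        return shards;
    }
}
```

As with the original, the limit is approximate: data still buffered inside the deflater is only written when a shard is closed, so the final file size can differ somewhat from the counter value seen at the last check.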

Command-line arguments (input file, output directory, output file name, shard size in MB): d:\temp\test.txt D:\temp test 64

Each resulting compressed file comes out under 64 MB, falling short of that limit by less than 100 KB. Few factors are taken into account here; this is only a rough algorithm, but it meets my needs.
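To check a result like this, each shard can be decompressed on its own and its lines counted, which is also what lets each shard be processed independently of the others. The following is a minimal sketch, assuming the shards sit in one directory and end in ".gz"; ShardCheck and countLines are hypothetical names, not part of the original program.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

class ShardCheck {

    /** Decompresses one shard by itself and returns the number of text lines it holds. */
    static long countLines(File shard) throws IOException {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new FileInputStream(shard)), StandardCharsets.UTF_8))) {
            long lines = 0;
            while (reader.readLine() != null) {
                lines++;
            }
            return lines;
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0]: the directory that holds the shards (an assumption of this sketch).
        File[] shards = new File(args[0]).listFiles((dir, name) -> name.endsWith(".gz"));
        if (shards == null) {
            throw new IOException("not a readable directory: " + args[0]);
        }
        long total = 0;
        for (File shard : shards) {
            long lines = countLines(shard);
            total += lines;
            System.out.println(shard.getName() + ": " + shard.length() + " bytes, " + lines + " lines");
        }
        System.out.println("total lines: " + total);
    }
}
```

If the total matches the line count of the original text file, no line was lost or cut in two at a shard boundary.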
