How to achieve global sorting of text files by Hadoop

2026-02-18 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article shows how Hadoop achieves global sorting of text files.

I. Background

Hadoop provides the InputSampler and TotalOrderPartitioner classes for global sorting, and org.apache.hadoop.examples.Sort is an example of how to call them.
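The idea behind combining a sampler with a range partitioner can be sketched in plain Java: sort the sampled keys and pick evenly spaced ones as partition boundaries. This is only a rough illustration that does not use the Hadoop classes. The class name SplitPointSketch and its method are hypothetical, and details that InputSampler handles (writing the partition file, skipping duplicate keys) are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Hadoop.
class SplitPointSketch {
    // Returns numPartitions - 1 boundary keys taken at evenly spaced
    // positions of the sorted samples. Assumes samples is non-empty.
    static List<String> splitPoints(List<String> samples, int numPartitions) {
        List<String> sorted = new ArrayList<>(samples);
        Collections.sort(sorted);
        float stepSize = sorted.size() / (float) numPartitions;
        List<String> points = new ArrayList<>();
        for (int i = 1; i < numPartitions; i++) {
            // Clamp the index so very small sample sets stay in bounds.
            int k = Math.min(Math.round(stepSize * i), sorted.size() - 1);
            points.add(sorted.get(k));
        }
        return points;
    }

    public static void main(String[] args) {
        List<String> samples = Arrays.asList("d", "a", "c", "b", "f", "e");
        System.out.println(splitPoints(samples, 3));
    }
}
```

With boundaries chosen this way, every key sent to partition i is no larger than any key sent to partition i + 1, so concatenating the sorted reducer outputs yields a globally sorted result.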

However, when a Text file is used as input, the result is not sorted by the string column in the text, and the output is a SequenceFile.

Reason:

1) When Hadoop processes Text files, the key is the byte offset of each line within the file (a LongWritable), not the text of the line. InputSampler samples the key, and TotalOrderPartitioner also uses the key to find partitions. The partition file therefore contains sampled offsets, and the result is naturally ordered by offset instead of by the string column.

2) For large data volumes, InputSampler sampling is very slow. For example, RandomSampler needs to traverse all the data, and IntervalSampler needs to traverse the same number of files as there are splits. SplitSampler is more efficient, but it only takes records from the beginning of each split, so it is not suitable when the data within a file is already ordered.
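Both reasons can be illustrated with a small in-memory model. The sketch below is plain Java and does not call Hadoop. The class TextSamplingSketch and its methods are hypothetical, inner lists stand in for input splits, and everyNth is a simplified stand-in for interval sampling rather than the IntervalSampler algorithm itself.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical in-memory model, not Hadoop code.
class TextSamplingSketch {
    // Reason 1: each line is keyed by its starting byte offset,
    // which grows with file position regardless of the line content.
    static List<Long> lineOffsets(String text) {
        List<Long> offsets = new ArrayList<>();
        long pos = 0;
        for (String line : text.split("\n")) {
            offsets.add(pos);
            // +1 accounts for the '\n' removed by split
            pos += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return offsets;
    }

    // Reason 2, SplitSampler-like: cheap, but reads only the head of each split.
    static List<String> firstN(List<List<String>> splits, int perSplit) {
        List<String> out = new ArrayList<>();
        for (List<String> split : splits) {
            out.addAll(split.subList(0, Math.min(perSplit, split.size())));
        }
        return out;
    }

    // Reason 2, interval-like: visits every record, keeps one per `step`.
    static List<String> everyNth(List<List<String>> splits, int step) {
        if (step < 1) {
            throw new IllegalArgumentException("step must be >= 1");
        }
        List<String> out = new ArrayList<>();
        for (List<String> split : splits) {
            for (int i = 0; i < split.size(); i += step) {
                out.add(split.get(i));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(lineOffsets("pear\t3\napple\t1\nfig\t2\n"));
        List<List<String>> splits = Arrays.asList(
                Arrays.asList("a", "b", "c", "d"),
                Arrays.asList("a", "b", "c", "d"));
        System.out.println(firstN(splits, 1));   // only the smallest keys
        System.out.println(everyNth(splits, 2)); // spread across each split
    }
}
```

When each split is internally sorted, firstN sees only the smallest keys of every split, so the boundaries derived from it would be badly skewed. everyNth covers the key range but has to read every record.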

II. Function

1. Implement a PartialSampler, which is suitable when the input data files are independent and identically distributed.

2. Enable RandomSampler, IntervalSampler, and SplitSampler to sample text.

3. Implement a TotalOrderPartitioner for a string column of a Text file.
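A partitioner for a string column needs two steps: extract the column from the line, then locate it among the sorted boundary keys. The following is a minimal sketch in plain Java, not the article's implementation. The class RangePartitionSketch is hypothetical, and it assumes the column of interest is the first one and that columns are tab-separated, which the article does not specify.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Hadoop.
class RangePartitionSketch {
    // Assumption: columns are separated by a tab and the sort column is first.
    static String firstColumn(String line) {
        return line.split("\t", 2)[0];
    }

    // splitPoints must be sorted in ascending order. Keys below the first
    // boundary go to partition 0; a key equal to boundary i goes to i + 1.
    static int getPartition(String[] splitPoints, String key) {
        int pos = Arrays.binarySearch(splitPoints, key) + 1;
        return (pos < 0) ? -pos : pos;
    }

    public static void main(String[] args) {
        String[] splitPoints = {"g", "p"};
        for (String line : new String[] {"apple\t1", "kiwi\t9", "zebra\t4"}) {
            String key = firstColumn(line);
            System.out.println(key + " -> partition " + getPartition(splitPoints, key));
        }
    }
}
```

Binary search keeps the lookup at O(log n) per record in the number of boundaries, which matters because the partitioner runs once for every map output record.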

III. Realization

1. PartialSampler

PartialSampler randomly samples the first column of the text data from the first input split (splits[0]). It has two properties: freq (sampling frequency) and numSamples (total number of samples).

public K[] getSample(InputFormat inf, JobConf job) throws IOException {
    InputSplit[] splits = inf.getSplits(job, job.getNumMapTasks());
    ArrayList samples = new ArrayList(numSamples);
    Random r = new Random();
    long seed = r.nextLong();
    r.setSeed(seed);
    LOG.debug("seed: " + seed);
    //Sample splits [0]
    for (int i = 0; i < 1; i++) {
        System.out.println("PartialSampler will getSample splits["+i+"]");
        RecordReader reader = inf.getRecordReader(splits[i], job, Reporter.NULL);
        K key = reader.createKey();
        V value = reader.createValue();
        while (reader.next(key, value)) {
            if (r.nextDouble()
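The behavior described for PartialSampler (read one split, keep the first column of a line with probability freq, hold at most numSamples keys) can be illustrated without Hadoop. The sketch below is not the author's code. The class PartialSamplerSketch, the tab separator, the explicit seed parameter, and the policy of overwriting a random earlier sample once the buffer is full are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Hypothetical stand-alone sketch, not the article's PartialSampler.
class PartialSamplerSketch {
    // firstSplit: the lines of the first input split.
    // freq: probability of selecting a line.
    // numSamples: maximum number of keys to keep (must be > 0).
    static List<String> sample(List<String> firstSplit, double freq,
                               int numSamples, long seed) {
        Random r = new Random(seed);
        List<String> samples = new ArrayList<>(numSamples);
        for (String line : firstSplit) {
            if (r.nextDouble() <= freq) {
                // Assumption: the sort column is the first tab-separated field.
                String key = line.split("\t", 2)[0];
                if (samples.size() < numSamples) {
                    samples.add(key);
                } else {
                    // Assumed policy: overwrite a random earlier sample.
                    samples.set(r.nextInt(numSamples), key);
                }
            }
        }
        return samples;
    }

    public static void main(String[] args) {
        List<String> split = Arrays.asList("b\t1", "a\t2", "c\t3", "e\t4", "d\t5");
        System.out.println(sample(split, 0.5, 3, 42L));
    }
}
```

Reading only one split keeps sampling cost independent of the total input size, which is why the approach depends on the files being independent and identically distributed: one split must be representative of all the others.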
