

How to customize Partition by hadoop

2025-02-24 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/01 Report--

Today I would like to share with you some knowledge about how to customize the partition in hadoop. The content is detailed and the logic is clear. Since many people don't know much about this topic, I'm sharing this article for your reference. I hope you get something out of it. Let's take a look.

The partition concept

The word partition is not new to many students; it appears in much of the Java middleware ecosystem, for example kafka partitions and mysql partitioned tables. The point of partitioning is to divide data sensibly according to business rules, so that each partition's data can later be processed efficiently.

Hadoop partition

A partition in hadoop routes different data to different reduce tasks, and ultimately to different output files.

Hadoop's default partition rule

Hash partitioning

partition number = (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks

The default number of reduce tasks is 1; it can also be set on the driver side.

This rule is the whole logic of Hadoop's default partitioner class, and it is easy to understand.
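To make the rule concrete, here is a small self-contained sketch that mirrors the default hash-partition logic using a plain String key (this is a standalone illustration of the formula, not the actual Hadoop class, which operates on generic keys):

```java
public class HashPartitionDemo {
    // Mirrors the default rule: mask off the sign bit so the result
    // is non-negative, then take the remainder modulo the number of
    // reduce tasks. The same key always maps to the same partition.
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // With 2 reduce tasks every word lands in partition 0 or 1.
        for (String word : new String[] {"hadoop", "hello", "world"}) {
            System.out.println(word + " -> partition " + getPartition(word, 2));
        }
    }
}
```

Because the partition only depends on the key's hash, all occurrences of the same word go to the same reduce task, which is exactly what wordcount needs.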

Hash Partition Code Demo

Here is the driver code for the wordcount case. By default, without any extra settings, it outputs a single txt file containing the word counts. What happens if we add the line job.setNumReduceTasks(2), as below, and run the program again?

public class DemoJobDriver {
    public static void main(String[] args) throws Exception {
        // 1. Get the job
        Configuration configuration = new Configuration();
        Job job = Job.getInstance(configuration);
        // 2. Set the jar path
        job.setJarByClass(DemoJobDriver.class);
        // 3. Associate the Mapper and Reducer
        job.setMapperClass(DemoMapper.class);
        job.setReducerClass(DemoReducer.class);
        // 4. Set the key/value types of the map output
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        // 5. Set the key/value types of the final output
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // 6. Set the input and output paths
        String inputPath = "F:\\network disk\\csv\\hello.txt";
        String outPath = "F:\\network disk\\csv\\wordcount\\hello_result.txt";
        // Set the number of output files (reduce tasks) to 2
        job.setNumReduceTasks(2);
        FileInputFormat.setInputPaths(job, new Path(inputPath));
        FileOutputFormat.setOutputPath(job, new Path(outPath));
        // 7. Submit the job
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}

You can see that two result files are produced, each with different contents. In other words, when the number of reducers is set to more than one, by default the results are distributed by the hash partition algorithm and written to the output file of the corresponding partition.

Steps to customize a Partition

1. Create a class that extends Partitioner.

2. Override the getPartition method, and in it route different data to different partitions according to your business rules.

3. In the job's driver class, register the custom Partitioner class.

4. After customizing the Partitioner, set a number of reduce tasks that matches the custom partition logic.

Business requirements

Take the person names in the following file and route them by surname: names with the surname "Ma" go to the first partition, names with the surname "Li" go to the second partition, and all others go to the third partition.

Custom Partition

import org.apache.commons.lang3.StringUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class MyPartioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text text, IntWritable intWritable, int numPartitions) {
        String key = text.toString();
        int partition = 2; // default: everything else goes to the third partition
        if (StringUtils.isNotEmpty(key.trim())) {
            if (key.startsWith("马")) {        // surname "Ma"
                partition = 0;
            } else if (key.startsWith("李")) { // surname "Li"
                partition = 1;
            }
        }
        return partition;
    }
}

Register the custom partitioner in the Driver class, and note that the number of reduce tasks here must match the number of custom partitions:

job.setNumReduceTasks(3);
job.setPartitionerClass(MyPartioner.class);

Next, run the Driver class and observe the final output: as expected, data for different surnames is written to different files.

Summary of custom partitioning

If the number of reduce tasks is greater than the number of partitions produced by the custom partitioner, the extra reduce tasks produce empty output files.

If 1 < number of reduce tasks < number of custom partitions, some records cannot find a partition file to be written to, and an exception is thrown.

If the number of reduce tasks is 1, then no matter how many partitions the custom partitioner produces, everything is handed to that single reduce task and only one result file is produced.

Partition numbers must start at 0 and increase one by one, consecutively.
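The rules above can be sketched with a toy dispatcher. This is a simplified simulation of how a record is routed to a reduce task, not Hadoop's actual internals; the dispatch method and its error message are assumptions made for illustration:

```java
public class PartitionRulesDemo {
    // Toy model of record routing: with a single reduce task the
    // partitioner result is effectively ignored (everything goes to
    // partition 0); otherwise the partition number must fall inside
    // [0, numReduceTasks), or the record has no destination file.
    static int dispatch(int partition, int numReduceTasks) {
        if (numReduceTasks == 1) {
            return 0; // one reducer -> one output file, whatever the partitioner says
        }
        if (partition < 0 || partition >= numReduceTasks) {
            // analogous to the illegal-partition failure described above
            throw new IllegalArgumentException("Illegal partition: " + partition);
        }
        return partition;
    }

    public static void main(String[] args) {
        // 3 custom partitions (0..2) but 5 reduce tasks: partitions 3 and 4
        // simply receive no data, i.e. empty output files.
        System.out.println(dispatch(2, 5)); // 2
        // 3 custom partitions but 1 reduce task: everything lands in file 0.
        System.out.println(dispatch(2, 1)); // 0
        // 3 custom partitions but 2 reduce tasks: partition 2 has nowhere to go.
        try {
            dispatch(2, 2);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```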

That is all the content of the article "how to customize partitions in hadoop". Thank you for reading! I hope you have gained a lot from this article.
