This article introduces how the Hadoop slicing (input split) mechanism works and how to apply it, walking through the default behavior and the options available for tuning it.
Preface
The diagram above shows the logical sequence in which MapReduce processes a text file. Whether a program runs locally or in cluster mode, it is ultimately executed as a scheduled job, which is mainly divided into two phases:
Map phase: MapTasks are started to read and process the data
Reduce phase: ReduceTasks are started to aggregate the data
For example, in the wordcount case, a piece of text data is first parsed in the Map phase and split into words. For Hadoop, this work is carried out by a MapTask started behind the scenes. When the job completes, the word-count results appear under the target output folder.
What if there are multiple text files to process? Let's modify the wordcount job code, put several input files in one directory, and see what happens.
public static void main(String[] args) throws Exception {
    // 1. Get the job
    Configuration configuration = new Configuration();
    Job job = Job.getInstance(configuration);
    // 2. Set the jar path
    job.setJarByClass(DemoJobDriver.class);
    // 3. Associate the Mapper and Reducer
    job.setMapperClass(DemoMapper.class);
    job.setReducerClass(DemoReducer.class);
    // 4. Set the key/value types of the map output
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);
    // 5. Set the key/value types of the final output
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // 6. Set the input and output paths
    String inputPath = "F:\\network disk\\csv\\combines\\";
    String outPath = "F:\\network disk\\csv\\result";
    FileInputFormat.setInputPaths(job, new Path(inputPath));
    FileOutputFormat.setOutputPath(job, new Path(outPath));
    // 7. Submit the job
    boolean result = job.waitForCompletion(true);
    System.exit(result ? 0 : 1);
}
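The driver above references DemoMapper and DemoReducer, which the article does not show. A minimal wordcount-style sketch of what they might look like (the implementations below are assumptions for illustration, not the article's own code):

// Hypothetical DemoMapper.java (wordcount-style)
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class DemoMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final Text word = new Text();
    private final IntWritable one = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split each line into words and emit (word, 1)
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, one);
            }
        }
    }
}

// Hypothetical DemoReducer.java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class DemoReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum the counts emitted for each word
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        total.set(sum);
        context.write(key, total);
    }
}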
After running it, pick a few excerpts from the logs; by reading the key information, the following points can be summarized:
If a job finds multiple input files in the directory, multiple MapTasks are started.
By default, as many MapTasks are started as there are files.
After the MapTasks finish, the ReduceTask merges the results.
Parallelization of Hadoop tasks
One very important reason for using Hadoop (or other big data frameworks) is that their underlying design supports task parallelization well: a complex job, or a single task, is split into multiple parallel subtasks as needed, so that server resources are fully used and the work finishes in the shortest possible time.
Hadoop is no exception, and it provides a number of configuration parameters the client can use to tune task processing performance.
We know that a Hadoop job is processed in two stages, Map and Reduce. Combined with the case above, we know that by default one MapTask is started per input file. But those files were small, well below the default block size of 128 MB. What if a file exceeds the block size? What if it reaches 1 GB?
Therefore, the following conclusions are drawn:
The parallelism of MapTasks determines the concurrency of the Map phase, which in turn affects the processing speed of the whole Job.
So when a file to be processed is very large, the Map phase can be sped up by raising the MapTask parallelism; see the sketch below for one way the client can influence it.
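For reference, here is a hedged sketch (not from the original article) of how the client can influence the split size, and therefore MapTask parallelism, from the driver shown earlier. The 64 MB cap is purely illustrative:

// Inside the job driver, before submitting the job: shrinking the maximum
// split size below the block size yields more, smaller splits for the same
// large file, and therefore more MapTasks.
FileInputFormat.setMinInputSplitSize(job, 1L);                 // lower bound, in bytes
FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);  // illustrative 64 MB cap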
Think about it: for 1 GB of data, starting 8 MapTasks can improve the cluster's concurrent processing ability. Will starting 8 MapTasks for 1 KB of data also improve cluster performance? Is more MapTask parallelism always better? Which factors affect MapTask parallelism?
Determination Mechanism of MapTask parallelism
Data block: a Block is the physical unit of HDFS; data is physically divided into blocks, and blocks are the unit in which HDFS stores data.
Data slice: a slice (split) only divides the input logically; the data is not actually split into separate pieces on disk.
A data slice is the unit in which a MapReduce program computes over its input, and each slice launches one MapTask.
Imagine a 300 MB file divided into three slices of 100 MB each, so three MapTasks are started in the Map phase to handle the job. By default, however, the file block Hadoop works with, i.e. the block size, is 128 MB. The problem appears when Hadoop really runs distributed in a production environment, where the tasks often run on different machines, as shown in the figure below.
node1 ~ node3 can be thought of as three nodes in the cluster that process the MapTask data.
By default, the file block size is 128 MB, i.e. the amount a task handles at a time.
The 300 MB input file, sliced at 100 MB, is divided into three slices, and three MapTasks are started to process them.
From this intuitive understanding we can accept the points above, but a closer look reveals another problem: the slicing rule is specified by the client. It can be understood as an account book that records the workers' hours so that wages can be settled; it is bookkeeping, not the work itself.
The three nodes, however, do not see it that way, because they are the ones actually doing the work. Why should they care about the client's slicing rule? They cannot change their default 128 MB block size to 100 MB just because the client sliced at 100 MB. That is clearly impossible, so what happens?
Since the slicing rule says 100 MB, node1 follows it and processes 100 MB of the data it stores. What about the remaining 28 MB of its block? The file was divided into three slices, so three MapTasks are started, and node2 has to handle one of them. But it cannot process arbitrary data: it first has to copy the unprocessed 28 MB from node1 and combine it with 72 MB from its own block to make up a 100 MB slice before it can start.
Therefore, in a real distributed environment, data files have to be copied across nodes, which obviously introduces some network overhead; if the data files are large, this performance cost is worth taking into account.
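As a small illustration (standalone arithmetic only, not Hadoop API code), the sketch below uses the 300 MB / 100 MB / 128 MB numbers from this example to show which blocks each slice overlaps, which is exactly the mismatch that forces the cross-node copy:

// Illustrative arithmetic only: which 128 MB blocks does each 100 MB slice overlap?
public class SplitVsBlock {
    public static void main(String[] args) {
        long fileSize  = 300L * 1024 * 1024;  // 300 MB file
        long splitSize = 100L * 1024 * 1024;  // 100 MB logical slices
        long blockSize = 128L * 1024 * 1024;  // 128 MB HDFS blocks

        for (long start = 0; start < fileSize; start += splitSize) {
            long end = Math.min(start + splitSize, fileSize) - 1;
            long firstBlock = start / blockSize;
            long lastBlock  = end / blockSize;
            System.out.printf("slice [%d MB, %d MB) overlaps blocks %d..%d%n",
                    start >> 20, (end + 1) >> 20, firstBlock, lastBlock);
        }
        // The output shows slice 2 (100-200 MB) spans blocks 0 and 1, and slice 3
        // (200-300 MB) spans blocks 1 and 2, so data must be read across nodes.
    }
}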
According to the above understanding, we can sum up the following experience:
The parallelism of a Job's Map phase is determined by the number of slices the client computes when submitting the Job.
Each slice is assigned to one MapTask for processing.
By default, if nothing is specified, the slice size equals the block size (BlockSize), which is also the optimal choice.
Slicing does not consider the data set as a whole; each file is sliced separately.
Hadoop default slicing mechanism
By default, if nothing is configured, Hadoop uses the FileInputFormat slicing mechanism. Put simply, its rules are as follows:
Files are sliced simply by the length of their content.
The slice size equals the block size, 128 MB by default.
Slicing does not consider the data set as a whole; each file is sliced separately.
This mechanism is relatively simple, so it will not be covered at length here. If you debug through the source code, you can find the writeNewSplits method and step into it to see how it works.
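For orientation, the core of that logic boils down to clamping the split size between a configurable minimum and maximum, with the block size as the default. The snippet below is a paraphrased sketch rather than a verbatim copy of the Hadoop source:

// Paraphrased sketch of how FileInputFormat chooses the split size.
// With the default min (1) and max (Long.MAX_VALUE), the result is the block size.
long computeSplitSize(long blockSize, long minSize, long maxSize) {
    return Math.max(minSize, Math.min(maxSize, blockSize));
}
// e.g. blockSize = 128 MB, minSize = 1, maxSize = Long.MAX_VALUE -> split size 128 MB
// e.g. blockSize = 128 MB, maxSize = 64 MB                       -> split size  64 MB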
Optimized slicing Mechanism of Hadoop TextInputFormat
FileInputFormat implementation class
When writing the main program for the job, remember the two lines at the end that set the input path to read from and the output path to write to.
When running a MapReduce program, the input can come in many formats: line-based log files, binary files, database tables, and so on. How does MapReduce read these different data types?
Common FileInputFormat implementation classes include TextInputFormat, KeyValueTextInputFormat, NLineInputFormat, CombineTextInputFormat, and custom InputFormat implementations.
The most common one is TextInputFormat.
TextInputFormat is the default FileInputFormat implementation class.
It reads each record by line.
The key is the starting byte offset of the line within the whole file, of type LongWritable.
The value is the content of the line, excluding any line terminators (newline and carriage return), of type Text.
Here is an example. Suppose a split contains the following four text records:
Rich learning form
Intelligent learning engine
Learning more convenient
From the real demand for more close to the enterprise
Each record can be represented as the following key / value pairs:
(0, Rich learning form)
(20, Intelligent learning engine)
(49, Learning more convenient)
(74, From the real demand for more close to the enterprise)
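As a minimal sketch of a Mapper whose input types match what TextInputFormat delivers, i.e. the byte offset as a LongWritable and the line content as Text (the class name OffsetLoggingMapper is made up for illustration):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Each call receives one record produced by TextInputFormat:
//   key   = byte offset of the line within the file (LongWritable)
//   value = the line itself without its terminator (Text)
public class OffsetLoggingMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Simply pass the (offset, line) pair through unchanged
        context.write(key, value);
    }
}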
CombineTextInputFormat slicing mechanism
From the analysis above, the framework's default TextInputFormat slicing mechanism slices tasks by file: no matter how small a file is, it becomes a separate slice and is handed to its own MapTask. With a large number of small files, a large number of MapTasks are created, so processing efficiency is low.
So we can consider another slicing mechanism, namely CombineTextInputFormat.
CombineTextInputFormat application scenario
CombineTextInputFormat is intended for scenarios with many small files. It can logically plan multiple small files into a single slice, so that several small files are handed to one MapTask together.
For example, among the files above, the largest is under 7 MB and the smallest under 2 MB, so CombineTextInputFormat-based slicing is worth considering. The specific settings are made in the job code as follows.
CombineTextInputFormat.setMaxInputSplitSize(job, 4194304); // 4 MB, or another value
Note: the virtual-storage maximum slice size is best set according to the actual sizes of the small files; ideally pick a middle value based on the overall file sizes.
CombineTextInputFormat case code demo
Using the above four files as the input data source, we expect a single slice to handle all four files (by default, four files start four slices and four MapTasks, which can be seen in the console log).
To use CombineTextInputFormat slicing, follow the steps below.
Without any changes, run the WordCount program above and observe that the number of slices is 4 (in the console log).
Then add the following code to the job, run the program again, and observe that the number of slices is 1.
// If the InputFormat is not set, it defaults to TextInputFormat.class
job.setInputFormatClass(CombineTextInputFormat.class);
// Set the virtual-storage maximum slice size to 20 MB
CombineTextInputFormat.setMaxInputSplitSize(job, 20971520);
Set the maximum to 4 MB first, then try 20 MB, and check whether the run reports a single slice in the console: number of splits:1.
The console output confirms this expectation.
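To tie the demo together, here is a hedged end-to-end driver sketch. The class name CombineDemoJobDriver is made up, the input/output paths are taken from command-line arguments rather than the hard-coded paths used earlier, and DemoMapper/DemoReducer are assumed to be the same wordcount classes as before:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombineDemoJobDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration());
        job.setJarByClass(CombineDemoJobDriver.class);
        job.setMapperClass(DemoMapper.class);
        job.setReducerClass(DemoReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Switch from the default TextInputFormat to CombineTextInputFormat
        job.setInputFormatClass(CombineTextInputFormat.class);
        // 20 MB virtual-storage maximum, so the four small files fall into one slice
        CombineTextInputFormat.setMaxInputSplitSize(job, 20971520);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}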
This concludes the study of how the Hadoop slicing mechanism works and how to apply it. Combining theory with practice is the best way to learn, so give it a try.