
How to use Hadoop

2025-04-07 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/02 Report--

This article explains in detail how to use Hadoop effectively. The editor finds it very practical and shares it here as a reference; I hope you take something away from it.

Clause 6: for multiple Jobs with large inputs, serial execution is recommended; for multiple Jobs with small inputs, parallel execution is recommended.

Hadoop processes a task in two phases, map and reduce. When the cluster's task slots are sufficient to run several Jobs' tasks at the same time, parallel execution is recommended; otherwise, serial execution is recommended. Even in the serial case, once one Job starts executing its reduce tasks, the map tasks of the next Job can begin.

These conclusions were tested by running 100 GB, 200 GB, and 300 GB jobs both in parallel and serially on 50 machines.
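The overlap benefit described above can be sketched with rough arithmetic. Every phase duration below is a made-up assumption for illustration, not a figure from the tests mentioned in the article:

```python
# Rough two-phase pipeline model; all numbers are assumed for illustration.
map_time = 10    # minutes of map work per Job
reduce_time = 6  # minutes of reduce work per Job
jobs = 3

# Strictly serial: each Job's reduce phase must finish before the
# next Job's map phase starts.
serial_makespan = jobs * (map_time + reduce_time)

# Overlapped: while Job N runs its reduce tasks, Job N+1 runs its
# map tasks on the freed map slots (a classic two-stage pipeline).
overlapped_makespan = (map_time
                       + (jobs - 1) * max(map_time, reduce_time)
                       + reduce_time)

print(serial_makespan, overlapped_makespan)  # 48 36
```

Under these assumed numbers, overlapping map and reduce phases across Jobs saves 12 of 48 minutes; the real saving depends on how the cluster's map and reduce slots are configured.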

Clause 7: the number of reducers should be slightly less than the total number of reduce slots in the cluster.

The number of map tasks is determined by the input file size, so choosing an appropriate number of reducers is important for taking full advantage of the Hadoop cluster's performance.

Each task in Hadoop occupies one slot on a TaskTracker. The total numbers of map slots and reduce slots in the cluster are computed as follows:

Total map slots = number of cluster nodes × mapred.tasktracker.map.tasks.maximum

Total reduce slots = number of cluster nodes × mapred.tasktracker.reduce.tasks.maximum

Setting the number of reducers slightly below the total number of reduce slots in the cluster lets all reduce tasks start at the same time, while leaving spare slots to tolerate a few reduce task failures.
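As a concrete instance of the slot formulas, consider a hypothetical 50-node cluster with 4 map slots and 2 reduce slots per TaskTracker (the per-node maxima are assumed values, not figures from the article):

```python
# Hypothetical cluster configuration (assumed values).
nodes = 50
map_tasks_maximum = 4      # mapred.tasktracker.map.tasks.maximum
reduce_tasks_maximum = 2   # mapred.tasktracker.reduce.tasks.maximum

total_map_slots = nodes * map_tasks_maximum        # 200
total_reduce_slots = nodes * reduce_tasks_maximum  # 100

# "Slightly less" than the total, so every reduce task can start in a
# single wave and a few failed tasks can be re-run on the spare slots.
num_reducers = total_reduce_slots - 5

print(total_map_slots, total_reduce_slots, num_reducers)  # 200 100 95
```

In a real job, the chosen value would be passed to the job configuration (for example via Job.setNumReduceTasks in the Hadoop Java API).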

Clause 8: multiple simple serial Jobs are better than one complex Job. Splitting a complex task into several simple tasks is a classic application of divide and conquer. Not only does it make each program simpler, with a single responsibility, it also makes overlapping serial tasks possible: while the previous Job is executing its reduce tasks, the next Job can use the freed map slots to run its map tasks.

4. Key-Value tradeoff

The core process of Map-Reduce algorithm is as follows:

map(k1, v1) → list(k2, v2)

reduce(k2, list(v2)) → list(v2)

That is, the user-defined map function converts the input into a set of <k2, v2> pairs, and the user-defined reduce function then computes the final result from them.

How should we choose the map and reduce functions to make full use of the Hadoop platform's computing power? In other words, how should we choose k2 and v2 in the formulas above?

Clause 9: map and reduce tasks should be moderately sized. Each task should run for about 2-3 minutes, and no task should exceed the computing capacity of its node.

Although the Hadoop platform splits the data into small tasks for us, keep in mind that each task runs on a single compute node. If a task requires more machine resources (CPU, memory, disk space, and so on) than the node can provide, the task will fail. If tasks are too small, each node can complete them quickly, but the management overhead of so many tasks and the frequent network transfer of intermediate results will take up most of the execution time, which also seriously hurts performance. The recommended task size is one that runs for about 2-3 minutes.
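The 2-3 minute target can drive the input split size directly. The throughput figure below is an assumption for illustration, not a benchmark:

```python
import math

# Assumed figures for illustration only.
input_bytes = 300 * 1024**3        # a 300 GB input
bytes_per_second = 50 * 1024**2    # assume a map task streams ~50 MB/s
target_seconds = 150               # aim for ~2.5 minutes per task

# Split size that keeps one map task busy for about target_seconds,
# and the resulting number of map tasks for this input.
split_bytes = bytes_per_second * target_seconds
num_map_tasks = math.ceil(input_bytes / split_bytes)

print(split_bytes // 1024**2, num_map_tasks)  # 7500 41
```

In a real cluster the split size is steered through the HDFS block size or split-size settings rather than computed by hand; this sketch only shows the arithmetic behind the 2-3 minute rule.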

Clause 10: the intermediate results produced by map should not be too large.

The <key, value> pairs produced from the input data by the user-defined map function are the intermediate results of the Map-Reduce model. Each map task saves its intermediate results to local disk; each reduce task then pulls all of the intermediate results it needs and sorts them by key. Clearly, if map produces overly large intermediate results, the network transfer and sorting of those results will take up most of the Job's execution time.
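One standard Hadoop technique for shrinking map's intermediate result (not something this article prescribes, but a natural fit for clause 10) is to pre-aggregate on the map side, as a Combiner does. A minimal word-count sketch in plain Python:

```python
from collections import Counter

def map_with_local_combine(lines):
    """Emit (word, count) pairs pre-aggregated within one map task,
    mimicking what a Hadoop Combiner does before the shuffle."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return sorted(counts.items())

# Without combining, every word occurrence becomes one (word, 1) pair
# that must be written to disk and shuffled to a reducer.
lines = ["a b a", "b a"]
naive_pairs = [(w, 1) for line in lines for w in line.split()]
combined_pairs = map_with_local_combine(lines)

print(len(naive_pairs), len(combined_pairs))  # 5 2
```

Even in this toy input the intermediate result shrinks from 5 pairs to 2; on real data with skewed keys the reduction in shuffle traffic and sort time can be far larger.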

This concludes the article on how to use Hadoop. I hope the content above has been helpful and has taught you something new; if you found the article useful, please share it so more people can see it.
