
What are the important parameter setting skills of hadoop performance tuning

2025-04-05 Update From: SLTechnology News&Howtos

This article explains how to set the important parameters for Hadoop performance tuning. It is concise and easy to follow, and I hope you will get something out of the detailed introduction below.


Category: hadoop · 2012-12-16 19:53

This article focuses mainly on tuning MapReduce performance.

Over the past month or two I have been tuning MapReduce performance, and I am writing the results down for those who come after.

The main parameters involved here are:

HDFS:

dfs.block.size

MapReduce:

io.sort.mb

io.sort.spill.percent

mapred.local.dir

mapred.map.tasks and mapred.tasktracker.map.tasks.maximum

mapred.reduce.tasks and mapred.tasktracker.reduce.tasks.maximum

mapred.reduce.max.attempts

mapred.reduce.parallel.copies

mapreduce.reduce.shuffle.maxfetchfailures

mapred.child.java.opts

mapred.reduce.tasks.speculative.execution

mapred.compress.map.output and mapred.map.output.compression.codec

mapred.reduce.slowstart.completed.maps

Sixteen parameters are listed here. They are enough to cover performance tuning for most applications that are not tied to a specific scenario. Below I use Terasort as the example to describe in detail how each of these parameters affects performance.

As the distributed file system underlying MapReduce, Hadoop's HDFS also has a direct impact on how jobs run. The first parameter that affects performance is dfs.block.size. In clusters with a good network environment, it is recommended to increase this parameter to 128 MB, 256 MB, or more (the default is 64 MB).
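As a sketch, the block size goes in hdfs-site.xml (this uses the Hadoop 1.x property name discussed in this article; newer releases call it dfs.blocksize):

```xml
<!-- hdfs-site.xml -->
<property>
  <name>dfs.block.size</name>
  <value>134217728</value> <!-- 128 MB, up from the 64 MB default -->
</property>
```

The value is in bytes, so 128 MB is 128 * 1024 * 1024 = 134217728.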

But the impact of HDFS is mostly limited to this: the majority of its configuration items concern directories, fault tolerance, and the replication count, which matter little for performance tuning. Take the replication count as an example. This parameter sets how many copies of each block are kept in the cluster, distributed across machines according to placement rules; the default is 3. Because a map generally takes one block as input, a map whose input block is stored locally runs faster, since it does not have to wait for a network transfer. Take four nodes as an example: with 3 replicas, roughly 3/4 of the data is available locally on any given machine, no matter how it is stored. With 4 replicas, every block can be found on every node, so however the job is scheduled, every map reads local data. The drawback is disk overhead: large clusters generally cannot afford more than about 5 replicas, so tuning replication for locality is rarely worthwhile.
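For completeness, the replication count is also an hdfs-site.xml property (a sketch, using the Hadoop 1.x name; the value shown is the default mentioned above):

```xml
<!-- hdfs-site.xml: replication trades disk space for map locality -->
<property>
  <name>dfs.replication</name>
  <value>3</value> <!-- default; 4 on a 4-node cluster would make every map local -->
</property>
```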

Now for the big part: the important parameters in mapred. I assume you already understand the prerequisites (if you do not know how MapReduce works internally, the rest will not make much sense): data is first processed by map tasks, then merged, then copied to the reducers, and finally reduced; the merge and copy phases together are called shuffle. Before starting a job, Hadoop needs to know how many map and how many reduce tasks you want to launch. With the default parameters, only one map task is started (and possibly one reduce as well), which is very slow. The number of maps is set with mapred.map.tasks, and the number of reduces with mapred.reduce.tasks. These two parameters play a leading role in the performance of the whole cluster, and tuning basically revolves around them. So what interacts with these two parameters? Their ratio directly influences the settings of several others, first among them mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum, which cap the number of map and reduce tasks that can run simultaneously on one server. Now suppose a cluster has one namenode and eight datanodes, a fairly typical setup, and all data is stored with 3 replicas, so the local-data ratio is 3/8. If you set mapred.map.tasks=128 and mapred.reduce.tasks=64, the corresponding per-node maxima should be 16 and 8 or higher, because that guarantees all your map and reduce tasks can start at the same time. If you set the reduce maximum to 7 instead, you get a very bad result: the eight machines can run only 56 reduces at once, 8 fewer than the 64 you asked for, and those 8 tasks sit in the pending state until some running reduce finishes, which greatly increases the running time. Of course, bigger is not always better, because maps and reduces coexist for a long stretch, depending on the mapred.reduce.slowstart.completed.maps you set: with a value of 0.6, reduces enter the running state once 60% of the maps have completed. So if both the map and reduce counts are set very high, they inevitably compete for resources, some processes starve, and you see timeout errors, most likely socket timeouts from an overloaded network. These values therefore need to be nudged up and down against the cluster's actual behavior to find the best result. So what is a good ratio between maps and reduces? The Apache website offers suggestions, including a concrete formula relating the two, but the formula cannot cover every real situation (otherwise there would be no need for systems engineers).
In general, after setting the map and reduce counts, you can watch their progress through the JobTracker web page (http://namenode:50030/jobdetails.jsp). If map reaches 100% just slightly before reduce reaches 33%, you have found a good ratio: reduce completes its copy phase at 33%, which means the maps must finish all their output before the reduces reach that point, so the data is ready. Never let the reduces wait, but it is fine to let the maps finish a bit first.
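The slot arithmetic above can be sketched as follows (the 8-datanode cluster and the 128/64 task counts are the article's hypothetical numbers):

```python
import math

def slots_per_node(num_tasks, num_nodes):
    """Smallest per-tasktracker slot maximum that lets every task
    of a wave start simultaneously across the cluster."""
    return math.ceil(num_tasks / num_nodes)

nodes = 8                                # datanodes running tasktrackers
map_max = slots_per_node(128, nodes)     # mapred.tasktracker.map.tasks.maximum
reduce_max = slots_per_node(64, nodes)   # mapred.tasktracker.reduce.tasks.maximum
print(map_max, reduce_max)               # → 16 8

# Undersizing the maximum leaves tasks pending: with only 7 reduce
# slots per node, 8 of the 64 reduces wait for a slot to free up.
pending = 64 - 7 * nodes
print(pending)                           # → 8
```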

OK! With that settled, let's look at two closely related parameters: io.sort.mb and mapred.child.java.opts. Every map or reduce task runs in its own JVM, so java.opts interacts with the number of maps and reduces you start and with the other JVM-sensitive parameters. Since tasks run inside the JVM, the io.sort.mb buffer discussed here is also allocated from JVM memory. This value sets the size of the in-memory sort buffer available to a map; once the map's sorted output in memory reaches a certain threshold, it is spilled to disk. Specifically, that threshold equals io.sort.mb * io.sort.spill.percent. The usual rule of thumb, to get the most out of the JVM, is to set the JVM's maximum heap to twice io.sort.mb. How large should io.sort.mb itself be? That depends mainly on the size of one map's output. If a map produces 600 MB of data and io.sort.mb * io.sort.spill.percent is 200 MB, the map spills to disk three times, and after it finishes, the data is read back from disk for the copy phase. If io.sort.mb were set to 600 MB, none of that disk traffic would be needed, saving a lot of time. The catch is that memory is expensive: with io.sort.mb at 600 MB, the JVM heap in java.opts needs to be over 1 GB, so in the earlier example of 16 maps and 8 reduces running at once, you would need at least 24 GB of memory on the node. Set this cautiously, because after all, the server has plenty of other services to run.
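The spill and memory arithmetic can be sketched like this (the 600 MB map output, 0.8 spill percent, and slot counts are the article's example figures; Hadoop's real buffer accounting is somewhat more involved):

```python
import math

def spills_per_map(map_output_mb, io_sort_mb, spill_percent):
    """Rough number of spill files one map writes: the buffer flushes
    each time it fills to io.sort.mb * io.sort.spill.percent."""
    threshold_mb = io_sort_mb * spill_percent
    return math.ceil(map_output_mb / threshold_mb)

# 600 MB of map output with a 250 MB buffer that spills at 80% full:
print(spills_per_map(600, 250, 0.8))   # → 3

# Rule of thumb from the text: give the child JVM ~2x io.sort.mb, and
# budget heap for every slot that can run at once on the node.
def node_heap_gb(map_slots, reduce_slots, child_heap_mb):
    return (map_slots + reduce_slots) * child_heap_mb / 1024

print(node_heap_gb(16, 8, 1024))       # → 24.0
```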

Now for some other influential parameters, which can be set according to general rules. First, disks and disk I/O: mapred.local.dir is best set to one directory per physical disk. Configure each disk separately (as its own RAID0, i.e. no spanning) and list all of them under this configuration item; intermediate data is then written to the disks in round-robin order, which keeps the amount of data even across disks and improves overall disk I/O throughput. Next, the network needs careful thought while maps and reduces are running at the same time. Parameters such as mapred.reduce.parallel.copies and mapreduce.reduce.shuffle.maxfetchfailures both affect it. The first is the maximum number of parallel copy threads a reduce uses to fetch map output from different nodes simultaneously. The second bounds fetch retries, and excessive retries are a real performance problem for many applications: in general, if a fetch fails once, further retries rarely help, while each retry consumes a lot of system resources and can starve other threads
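A sketch of the disk and shuffle settings above in mapred-site.xml (the directory paths and the copies value are illustrative, not taken from the article):

```xml
<!-- mapred-site.xml: one local dir per physical disk; paths are hypothetical -->
<property>
  <name>mapred.local.dir</name>
  <value>/disk1/mapred/local,/disk2/mapred/local,/disk3/mapred/local</value>
</property>
<property>
  <name>mapred.reduce.parallel.copies</name>
  <value>10</value> <!-- copy threads per reduce; the Hadoop 1.x default is 5 -->
</property>
```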

They may then enter the retry state themselves, a vicious circle. If your network really is a bottleneck and cannot sustain gigabit speeds, it is recommended to turn on the mapred.compress.map.output compression option and configure the mapred.map.output.compression.codec compression format; Snappy is the usual choice because it compresses and decompresses quickly. In addition, if your cluster is heterogeneous, with some fast machines and some slow ones, it is recommended to enable mapred.reduce.tasks.speculative.execution, which helps optimize task placement and improves cluster performance.
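The compression and speculative-execution switches can be sketched as follows (Hadoop 1.x property names, as used throughout this article; Snappy additionally requires the native library to be installed on the nodes):

```xml
<!-- mapred-site.xml -->
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
<property>
  <name>mapred.reduce.tasks.speculative.execution</name>
  <value>true</value>
</property>
```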

Those are the important parameter-setting skills for Hadoop performance tuning. Did you pick up some knowledge or skills? If you want to learn more or enrich your knowledge, you are welcome to follow the industry information channel.
