This article is about how to optimize Hadoop. The editor thinks it is very practical, so it is shared here as a reference; follow along to have a look.
1. Network bandwidth
The servers of the Hadoop cluster were planned under a single switch, as recommended in the official documentation. However, the interconnect bandwidth between that switch and the other switches is limited, so clients experienced slow HDFS access. The problem was solved by connecting the client machines that operate on the cluster to the same switch as the DataNodes.
2. System parameters
The change to ulimit -c is also recommended in the official documentation, and no problems were encountered while the cluster had only 10 servers. As the number of machines and tasks grows, this value needs to be raised further.
3. Configuration file management
This cluster uses the Cloudera distribution, whose configuration files live in /etc/hadoop/conf by default, a location only root can change. For convenience, the configuration files are kept on a single machine and distributed by script after each modification, ensuring that all servers have a uniform configuration.
4. mapred.tasktracker.map.tasks.maximum
This parameter controls the number of map tasks each TaskTracker runs concurrently. It used to be set equal to the number of CPU cores, which occasionally let tasks crowd out the DataNode's resources. It has since been changed so that map slots + reduce slots + 1 == num_cpu_cores.
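As a minimal sketch, here is what this could look like in mapred-site.xml for a hypothetical 8-core DataNode; the exact slot counts are an assumption for illustration, chosen so that 5 + 2 + 1 == 8:

```xml
<!-- mapred-site.xml: illustrative slot counts for an assumed 8-core node (5 map + 2 reduce + 1 == 8 cores) -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>5</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
</property>
```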
5. Strict control of root permissions
The Cloudera distribution creates a hadoop user, under which the various daemons should run. There was once a misoperation (/usr/lib/hadoop/bin/hadoop datanode & run as root) that caused new files in the local data directory to be written as root, so the correctly started hadoop user process could no longer read or write them. As a result, the cluster servers no longer provide day-to-day root access.
6. Java GC mode
-XX:+UseConcMarkSweepGC has been added to both mapred.child.java.opts and HADOOP_OPTS. The JDK documentation recommends it for modern multi-core systems, and this GC mode makes good use of the CPU's concurrency. The change had a clearly positive impact on performance.
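A rough sketch of the mapred-site.xml side of this change; the heap size shown is only an illustrative assumption, and the same GC flag would also be appended to HADOOP_OPTS in hadoop-env.sh:

```xml
<!-- mapred-site.xml: -Xmx value is illustrative; -XX:+UseConcMarkSweepGC is the change described above -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m -XX:+UseConcMarkSweepGC</value>
</property>
```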
7. Choose the right JDK
Some servers in this cluster had a 32-bit JDK installed and could not create processes with more than -Xmx4g of heap. They have been unified on the x64 version of the JDK.
8. mapred.reduce.slowstart.completed.maps
This parameter controls when the reduce slow-start kicks in. By default, once 5% of the map tasks have completed, reduce tasks are scheduled and the copy phase begins. But we have only a small number of machines; once many jobs pile up in the JobTracker, every TaskTracker's map and reduce slots fill up. Because the maps then lack the resources to finish quickly, the reduces cannot finish either, and the cluster's resources deadlock. Changing this parameter to 0.75 brought the backlog of queued jobs down from an average of 10 to 3.
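A minimal mapred-site.xml sketch of the change described above:

```xml
<!-- mapred-site.xml: only schedule reduces after 75% of a job's maps have completed -->
<property>
  <name>mapred.reduce.slowstart.completed.maps</name>
  <value>0.75</value>
</property>
```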
9. mapred.fairscheduler.preemption
This parameter was set to true so that, when a user's minimum share is not met, the fair scheduler can kill other users' tasks to free up enough resources. The cluster runs many kinds of jobs, and some map tasks take hours to run; this setting caused such tasks to be killed repeatedly and made them almost impossible to complete. One job was killed 137 times in 7 hours. This can be solved by adjusting the fair scheduler's pool configuration and giving such jobs a dedicated pool with minMaps == maxMaps.
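As a sketch, a fair-scheduler allocation file entry of the kind described above; the pool name and slot counts are hypothetical, the point is only that minMaps equals maxMaps so preemption leaves the long-running job's slots alone:

```xml
<!-- fair-scheduler.xml (allocation file): hypothetical pool for long-running map jobs -->
<allocations>
  <pool name="long_running">
    <minMaps>10</minMaps>
    <maxMaps>10</maxMaps>
  </pool>
</allocations>
```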
10. mapred.jobtracker.completeuserjobs.maximum
This limits the number of completed jobs per user that the JobTracker keeps in memory. Because this parameter was too large, our JobTracker fell into frequent full GC in less than 24 hours. It has been changed to 5; the JobTracker now runs smoothly, handling 1,500 jobs a day while using only about 800 MB of memory. This parameter is no longer necessary in versions above 0.21.0, because 0.21 changes how completeuserjobs is handled: completed jobs are written to disk as soon as possible and no longer kept in memory for long.
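The corresponding mapred-site.xml entry, with the value described above:

```xml
<!-- mapred-site.xml: keep at most 5 completed jobs per user in JobTracker memory -->
<property>
  <name>mapred.jobtracker.completeuserjobs.maximum</name>
  <value>5</value>
</property>
```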
11. mapred.jobtracker.update.faulty.tracker.interval, mapred.jobtracker.max.blacklist.percent
A badly written job can get a large number of TaskTrackers blacklisted, and they take 24 hours to recover. This hurts the performance of small and medium-sized clusters badly, and previously it could only be fixed by manually restarting the TaskTrackers. So we modified part of the JobTracker code to expose two parameters: mapred.jobtracker.update.faulty.tracker.interval controls the blacklist reset interval, which by default is a fixed 24 hours; we have now changed it to 1 hour. mapred.jobtracker.max.blacklist.percent controls the proportion of TaskTrackers that may be blacklisted, which we changed to 0.2. I am adding test cases for these two parameters and preparing to submit them to trunk.
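A sketch of the corresponding mapred-site.xml entries; since these parameters come from the patch described above, the unit of the interval is an assumption here (taken as milliseconds, i.e. 1 hour):

```xml
<!-- mapred-site.xml: parameters exposed by the JobTracker patch described above;
     the interval value assumes milliseconds (1 hour) -->
<property>
  <name>mapred.jobtracker.update.faulty.tracker.interval</name>
  <value>3600000</value>
</property>
<property>
  <name>mapred.jobtracker.max.blacklist.percent</name>
  <value>0.2</value>
</property>
```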
12. Use more Hive and less Streaming
Because Streaming is convenient and quick to develop with, we built a lot of jobs on it. However, a Streaming task also runs an extra Java process that reads and writes stdin/stdout at runtime, which adds a certain performance overhead. It is better to use a custom Deserializer with Hive to meet similar requirements.
Thank you for reading! This concludes the article on "how to optimize Hadoop". I hope the content above has been of some help and lets you learn more. If you found the article good, feel free to share it so more people can see it!