2025-04-05 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/02 Report--
This article explains in detail how to optimize the performance of Hadoop clusters by configuring rack awareness. The editor finds it very practical and shares it as a reference; I hope you get something out of it after reading.
Background
Distributed clusters usually contain a large number of machines. Because of the limits on rack slots and switch network ports, a large distributed cluster usually spans several racks and is composed of machines on multiple racks. The network speed between machines within a rack is usually higher than between machines on different racks, and inter-rack communication is limited by the bandwidth of the uplink switches between the racks.
In a Hadoop cluster specifically, HDFS stores data files as blocks, and each block has multiple replicas (3 by default). For the sake of both data safety and efficiency, Hadoop's default 3-replica placement policy is:
Store the first replica on the local datanode (the machine writing the block)
Store the second replica on a datanode in a different rack
Store the third replica on another datanode in the same rack as the second replica
This policy ensures that reads of a block can usually be served from within a single rack, while a copy still survives on another rack if the entire first rack fails. It is efficient and achieves fault tolerance of the data at the same time.
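The placement policy above can be sketched in a few lines of Python. This is a simplified illustration only: the node names, rack labels, and helper function are invented for the example, and the real logic lives inside HDFS's block placement code.

```python
# Illustrative sketch of HDFS's default 3-replica placement policy.
# Names and rack labels are invented for the example; real placement
# is implemented inside HDFS itself.

def place_replicas(writer, nodes):
    """Pick 3 datanodes: the writer's own node, a node on a
    different rack, and another node on that second rack."""
    first = writer
    # Second replica: any node on a different rack than the first.
    second = next(n for n in nodes if n["rack"] != first["rack"])
    # Third replica: a different node on the same rack as the second.
    third = next(n for n in nodes
                 if n["rack"] == second["rack"] and n is not second)
    return [first, second, third]

nodes = [
    {"name": "dn1", "rack": "/rack1"},
    {"name": "dn2", "rack": "/rack1"},
    {"name": "dn3", "rack": "/rack2"},
    {"name": "dn4", "rack": "/rack2"},
]
replicas = place_replicas(nodes[0], nodes)
print([n["name"] for n in replicas])  # ['dn1', 'dn3', 'dn4']
```

Note that the three replicas end up on exactly two racks: one copy near the writer, two copies together on a remote rack.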
However, Hadoop's rack awareness is not automatic: the cluster cannot detect by itself which rack a slave machine belongs to; the Hadoop administrator must tell it explicitly which machine belongs to which rack. When the namenode starts, it loads this machine-to-rack mapping into memory and uses it to choose the datanode list for every subsequent HDFS block write (for example, three datanodes for the three replicas of a block), implementing Hadoop's block-allocation strategy: spread the replicas across different racks as far as possible.
The next question is: how do we tell the Hadoop namenode which slave machines belong to which rack? The configuration steps follow.
Configuration
Rack awareness is not enabled by default. Under normal circumstances, therefore, HDFS picks machines at random: when writing data, Hadoop may write the first replica block1 to rack1, then randomly write block2 to rack2, generating data traffic between the two racks; then, again at random, write block3 back to rack1, generating yet another flow between the racks. When a job processes a very large amount of data, or a large amount of data is pushed into Hadoop, this makes the network traffic between racks grow dramatically, becoming a performance bottleneck that degrades jobs and even the services of the whole cluster.
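The extra inter-rack traffic can be made concrete with a toy count of cross-rack hops in the write pipeline. A block travels client → dn1 → dn2 → dn3, and every hop that crosses a rack boundary consumes inter-rack switch bandwidth. The rack labels here are invented for the illustration:

```python
# Toy count of cross-rack transfers in an HDFS write pipeline.
# Every adjacent pair of pipeline nodes on different racks costs
# one transfer over the inter-rack (top-of-rack) switches.

def cross_rack_hops(racks):
    """Number of adjacent pipeline pairs that sit on different racks."""
    return sum(a != b for a, b in zip(racks, racks[1:]))

# Random placement can bounce between racks on every hop:
print(cross_rack_hops(["/rack1", "/rack2", "/rack1"]))  # 2
# Rack-aware placement keeps the last two replicas on one rack:
print(cross_rack_hops(["/rack1", "/rack2", "/rack2"]))  # 1
```

Per block this is only one hop saved, but multiplied across every block of a large job the difference in inter-rack traffic is substantial.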
Enabling Hadoop's rack awareness is very simple: configure one option in the hadoop-site.xml file (core-site.xml in newer Hadoop versions) on the machine where the namenode runs:
<property>
  <name>topology.script.file.name</name>
  <value>/path/to/script</value>
</property>
The value of this option points to an executable program, usually a script, that takes one parameter and prints one value. The input parameter is typically the ip address of a datanode, and the output is the rack in which that datanode sits, such as "/rack1". When the namenode starts, it checks whether this option is set; if it is not empty, rack awareness is in effect. The namenode then locates the script and, when it receives a heartbeat from each datanode, passes the datanode's ip address to the script as a parameter, runs it, and saves the output in an in-memory map as the rack to which that datanode belongs.
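The namenode-side behaviour just described can be sketched as a small cache. This is an illustration only, not the real NameNode code: the subprocess call is replaced by a stand-in octet-sum mapping (the same hypothetical scheme the sample script below uses).

```python
# Sketch of how a namenode might resolve and cache datanode racks:
# run the configured script once per new ip, keep the answer in an
# in-memory map. The octet-sum mapping is an illustrative stand-in
# for invoking the real topology script.

rack_cache = {}  # ip -> rack, filled as heartbeats arrive

def resolve_rack(ip):
    if ip not in rack_cache:
        # Stand-in for: subprocess.check_output([script_path, ip])
        octet_sum = sum(int(octet) for octet in ip.split("."))
        rack_cache[ip] = "/rack%d" % (octet_sum % 3)
    return rack_cache[ip]

print(resolve_rack("192.168.1.10"))  # /rack2  (192+168+1+10 = 371, 371 % 3 = 2)
print(resolve_rack("192.168.1.10"))  # cached: the script is not run again
```

The cache matters: heartbeats arrive constantly, so forking the script on every heartbeat would be wasteful; one lookup per datanode suffices.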
Writing the script requires a clear understanding of the real network topology and rack layout, so that each machine's ip address is correctly mapped to its rack. A simple implementation is as follows:
#!/usr/bin/perl -w
use strict;
my $ip = $ARGV[0];
my $rack_num = 3;
my @ip_items = split /\./, $ip;
my $ip_count = 0;
foreach my $i (@ip_items) {
    $ip_count += $i;
}
my $rack = "/rack" . ($ip_count % $rack_num);
print "$rack";

Note that this sample simply sums the ip octets modulo the rack count; a production script should map addresses to racks according to the actual topology. This is the end of the article on "how to optimize the performance of Hadoop clusters". I hope the above content can be of some help to you. If you think the article is good, please share it for more people to see.
© 2024 shulou.com SLNews company. All rights reserved.