Background
When rack information is not configured, Hadoop places all machines in a single default rack named "/default-rack". In this case, any two datanodes are considered to be on the same rack, regardless of whether they physically are, which easily produces the increased inter-rack network load described below. In the absence of rack information, the namenode maps all slave machines to /default-rack.
When rack-awareness information is configured in the Hadoop cluster, Hadoop makes the following decisions when selecting the three datanodes for a block's replicas:
1. If the uploading machine is not a datanode but a client, a datanode is chosen at random from all slave machines as the writer of the first block replica (datanode1).
Note: if the uploading machine is itself a datanode (for example, when a task writes data to HDFS through DFSClient in a mapreduce job), that datanode itself is chosen as the writer of the first block replica (datanode1).
2. A datanode on a rack other than the one datanode1 belongs to is then chosen at random as the writer of the second block replica (datanode2).
3. Before the third replica is written, Hadoop checks whether the first two datanodes are on the same rack. If they are, it tries to select the third writer (datanode3) on a different rack. If datanode1 and datanode2 are not on the same rack, it selects a datanode on the rack where datanode2 is located as datanode3.
4. Once the list of three datanodes is obtained, and before the namenode returns the list to DFSClient, the namenode sorts the list from near to far according to the "distance" between the writing client and each datanode in the list. If the writer is not a datanode, the first entry in the datanode list is placed first. The client then writes the data block following this near-to-far order. The algorithm for determining the "distance" between two datanodes is the critical part. Hadoop currently implements it as follows, taking two DatanodeInfo objects (node1, node2) representing the datanodes as an example:
A) First, the levels of the two datanodes within the whole HDFS cluster hierarchy are obtained from the node1 and node2 objects. The notion of level needs explaining: each datanode's position in the cluster is described by a hierarchy string. For example, if node1's location is /rack1/datanode1, its level is 2, and so on. After obtaining the two nodes' levels, the algorithm walks up the topology tree from each node's position (the parent of /rack1/datanode1 is /rack1), adding 1 to the distance for each step, until the common ancestor of the two nodes is found; the accumulated number of steps is the distance between them. For example, the distance between /rack1/datanode1 and /rack2/datanode2 is 4 (two steps up from each node to the common root).
5. When the list of datanodes sorted by distance is returned to DFSClient, DFSClient creates a BlockOutputStream and writes the block to the first (nearest) node in the pipeline.
6. After the first node finishes writing, the data is forwarded along the pipeline to the next node in the datanode list, from near to far, until the last datanode has written successfully; DFSClient then returns success and the block write operation ends.
Through the above strategy, when the namenode chooses the datanode list for a block write, it fully takes into account the distribution of block replicas across different racks while avoiding the excessive network overhead described below.
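To make the selection logic concrete, here is a minimal sketch of the three-replica placement described in steps 1-3. This is not Hadoop's actual BlockPlacementPolicy code; the rack map and node names are hypothetical, ties are broken randomly as the text describes, and the distance sort of step 4 is shown separately at the end of this article.

#!/usr/bin/python
# -*- coding: UTF-8 -*-
# Sketch of the three-replica placement described above (illustrative,
# not Hadoop's real code; the rack map below is a hypothetical example).
import random

# Hypothetical datanode -> rack mapping, as the namenode holds in memory.
RACK_OF = {
    "datanode1": "/rack1", "datanode2": "/rack1",
    "datanode3": "/rack2", "datanode4": "/rack2",
}

def choose_replica_nodes(writer):
    nodes = list(RACK_OF)
    # Step 1: the writer itself if it is a datanode, otherwise a random one.
    dn1 = writer if writer in RACK_OF else random.choice(nodes)
    # Step 2: a random datanode on a rack other than dn1's.
    dn2 = random.choice([n for n in nodes if RACK_OF[n] != RACK_OF[dn1]])
    # Step 3: if dn1 and dn2 shared a rack, try another rack; otherwise
    # the third replica goes on dn2's rack. (Step 2 already guarantees
    # different racks here; the check mirrors the text.)
    if RACK_OF[dn1] == RACK_OF[dn2]:
        candidates = [n for n in nodes if RACK_OF[n] != RACK_OF[dn1]]
    else:
        candidates = [n for n in nodes if RACK_OF[n] == RACK_OF[dn2]]
    dn3 = random.choice([n for n in candidates if n not in (dn1, dn2)])
    return [dn1, dn2, dn3]

if __name__ == "__main__":
    print choose_replica_nodes("datanode1")  # e.g. ['datanode1', 'datanode4', 'datanode3']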
Rack-aware strategy
Rack awareness is not enabled in Hadoop by default, so under normal circumstances HDFS chooses machines at random. That means it is quite possible that, when writing data, Hadoop writes the first block, block1, to rack1, then randomly writes block2 to rack2, generating data traffic between the two racks, and then, again at random, writes block3 back to rack1, generating yet another flow between the two racks. When a job processes a very large amount of data, or a large amount of data is pushed into Hadoop, this situation causes the network traffic between racks to multiply, becoming a performance bottleneck and in turn affecting the performance of jobs and even the services of the whole cluster.
Distributed clusters usually contain a large number of machines. Because of the limits on rack slots and switch network ports, a large distributed cluster usually spans several racks, with the machines on multiple racks together forming the cluster. The network speed between machines within a rack is usually higher than between machines on different racks, and network communication between machines on different racks is usually limited by the bandwidth between the upstream switches.
Specifically for a Hadoop cluster: HDFS stores data files as blocks, and each block has multiple replicas (3 by default). For the sake of data safety and efficiency, Hadoop's default 3-replica storage policy is:
The first replica of a block is placed on the same node as the client (if the client is not within the cluster, the first node is chosen at random).
The second replica is placed on a node on a rack different from the first node's (chosen at random).
The third replica is placed on another node on the same rack as the second replica (consistent with step 3 of the selection logic above).
If there are more replicas, they are placed on random nodes in the cluster.
Such a policy ensures that access to the file a block belongs to can preferentially be served from within the same rack, and that if the entire rack fails, a replica of the block can still be found on another rack. This is efficient while also providing fault tolerance for the data.
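For reference, the replica count itself comes from the dfs.replication setting in hdfs-site.xml; a minimal example with the default value of 3:

<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>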
However, Hadoop's rack awareness is not adaptive: the Hadoop cluster cannot sense on its own which rack a slave machine belongs to; the Hadoop administrator must explicitly tell Hadoop which machine belongs to which rack. With that done, when Hadoop's namenode initializes, it stores the machine-to-rack mapping in memory and uses it as the datanode-selection strategy whenever it assigns the datanode list for a subsequent HDFS block write (for example, 3 datanodes for 3 block replicas), thereby implementing Hadoop's block-allocation strategy: distribute the three replicas across different racks as far as possible.
The next question is: how do we tell the Hadoop namenode which slave machines belong to which racks? The configuration steps follow.
Configuration
Enabling Hadoop's rack-aware functionality is very simple: configure one option in the hdfs-site.xml file on the machine where the namenode runs:
<property>
  <name>topology.script.file.name</name>
  <value>/path/to/RackAware.py</value>
</property>
The value of this option is the path to an executable program, usually a script, that takes one parameter and outputs one value. The parameter is typically the IP address of a datanode, and the output is the rack of the datanode with that IP address, such as "/rack1". When the namenode starts, it checks whether this option is set. If it is, the rack-aware configuration is in effect: the namenode locates the script according to the configuration, and when it receives the heartbeat of each datanode, it passes the datanode's IP address to the script as a parameter, runs it, and saves the output in an in-memory map as the rack that datanode belongs to.
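A rough sketch of this resolve-and-cache behavior follows. The real logic lives in Hadoop's Java ScriptBasedMapping class; the script path and cache shape here are illustrative only:

#!/usr/bin/python
# -*- coding: UTF-8 -*-
# Illustrative sketch of how the namenode resolves and caches rack ids
# (the real implementation is Hadoop's Java ScriptBasedMapping).
import subprocess

SCRIPT = "/path/to/RackAware.py"   # value of topology.script.file.name
rack_cache = {}                    # in-memory ip -> rack map

def resolve_rack(ip):
    if ip not in rack_cache:
        # Run the topology script with the datanode's ip as its argument
        # and keep its one-line output, e.g. "/rack1".
        out = subprocess.check_output([SCRIPT, ip])
        rack_cache[ip] = out.decode().strip()
    return rack_cache[ip]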
Writing the script requires a clear understanding of the real network topology and rack layout, so that the script can correctly map machine IP addresses to the corresponding racks. A simple implementation follows:
#!/usr/bin/python
# -*- coding: UTF-8 -*-
import sys

rack = {
    "hadoopnode-176.tj": "rack1",
    "hadoopnode-178.tj": "rack1",
    "hadoopnode-179.tj": "rack1",
    "hadoopnode-180.tj": "rack1",
    "hadoopnode-186.tj": "rack2",
    "hadoopnode-187.tj": "rack2",
    "hadoopnode-188.tj": "rack2",
    "hadoopnode-190.tj": "rack2",
    "192.168.1.15": "rack1",
    "192.168.1.17": "rack1",
    "192.168.1.18": "rack1",
    "192.168.1.19": "rack1",
    "192.168.1.25": "rack2",
    "192.168.1.26": "rack2",
    "192.168.1.27": "rack2",
    "192.168.1.29": "rack2",
}

if __name__ == "__main__":
    print "/" + rack.get(sys.argv[1], "rack0")
Since there is no authoritative documentation stating whether the hostname or the IP address is passed to the script, it is best for the script to accept both. If the data-center architecture is complex, the script can also return a multi-level string such as /dc1/rack1.
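A variant of the script above along those lines; the hostname/IP pairs and the /dc1 level are hypothetical examples:

#!/usr/bin/python
# -*- coding: UTF-8 -*-
# Variant that accepts either a hostname or an ip and returns a two-level
# rack id such as /dc1/rack1 (host/ip pairs here are hypothetical).
import sys

rack = {
    "hadoopnode-176.tj": "/dc1/rack1",
    "192.168.1.15":      "/dc1/rack1",
    "hadoopnode-186.tj": "/dc1/rack2",
    "192.168.1.25":      "/dc1/rack2",
}

if __name__ == "__main__":
    # Unknown machines fall back to a default rack.
    print rack.get(sys.argv[1], "/dc1/rack0")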
Execute the command: chmod +x RackAware.py
Restart the namenode. If the configuration is successful, the namenode startup log will contain entries such as:
2011-12-21 14:28:44 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /rack1/192.168.1.15:50010
Distance between machines in the network topology
This section uses a network topology case to introduce the distances between the machines of a Hadoop cluster under a complex network topology.
With rack awareness, the namenode can build a network topology graph of the datanodes. In such a topology, D1 and R1 are switches and the leaf nodes are datanodes. The rackid of H1 is /D1/R1/H1: H1's parent is R1, and R1's parent is D1. This rackid information can be configured through topology.script.file.name. With these rackids, the distance between any two datanodes can be calculated.
distance(/D1/R1/H1, /D1/R1/H1) = 0  (the same datanode)
distance(/D1/R1/H1, /D1/R1/H2) = 2  (different datanodes on the same rack)
distance(/D1/R1/H1, /D1/R2/H4) = 4  (datanodes on different racks in the same data center)
distance(/D1/R1/H1, /D2/R3/H7) = 6  (datanodes in different data centers)
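A small sketch of the distance computation over rackid paths, reproducing the values above. The path parsing is simplified and this is not Hadoop's actual NetworkTopology code:

#!/usr/bin/python
# -*- coding: UTF-8 -*-
# Distance between two nodes = steps each climbs to their common ancestor
# (simplified from the idea behind org.apache.hadoop.net.NetworkTopology).

def distance(a, b):
    pa, pb = a.strip("/").split("/"), b.strip("/").split("/")
    # Length of the common prefix of the two paths (the common ancestor depth).
    common = 0
    while common < min(len(pa), len(pb)) and pa[common] == pb[common]:
        common += 1
    # Steps a has to climb plus steps b has to climb.
    return (len(pa) - common) + (len(pb) - common)

if __name__ == "__main__":
    print distance("/D1/R1/H1", "/D1/R1/H1")  # 0
    print distance("/D1/R1/H1", "/D1/R1/H2")  # 2
    print distance("/D1/R1/H1", "/D1/R2/H4")  # 4
    print distance("/D1/R1/H1", "/D2/R3/H7")  # 6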