

How to Test Hadoop's Performance Optimization Features




This article explains in detail how to test Hadoop's performance optimization features. The content is practical and is shared here for reference; I hope you gain something from reading it.

Hadoop performance optimization test

The following are the test results when an HDFS instance is started and data is uploaded, first without rack-awareness information configured and then with it.

Write data

When no rack information is configured, Hadoop places all machines in the same default rack, called "/default-rack". In this case, any two datanode machines, regardless of whether they physically belong to the same rack, are treated as being in the same rack, which easily adds the inter-rack network load mentioned earlier. For example, when an instance is started and a file is uploaded to HDFS without rack information, the block information is as follows:

As can be seen from the figure above, in the absence of rack information, the namenode places all slave machines in /default-rack by default. From an analysis of the Hadoop code, we also know that when a block is written, the choice of the three datanode machines is completely random.
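For reference, rack awareness is typically enabled in Hadoop either through an external topology script (the net.topology.script.file.name property) or through a Java mapping class configured via net.topology.node.switch.mapping.impl; the exact property names and interface methods vary between Hadoop versions. Below is a minimal sketch of the Java route. The class name, host names, and rack paths are hypothetical, not taken from the article's cluster.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.net.DNSToSwitchMapping;

// Hypothetical static rack mapper; a real deployment would usually read
// the host-to-rack table from a file or script rather than hard-code it.
public class StaticRackMapping implements DNSToSwitchMapping {

    private static final Map<String, String> RACKS = new HashMap<>();
    static {
        RACKS.put("datanode1", "/rack1");  // illustrative host names
        RACKS.put("datanode2", "/rack2");
    }

    @Override
    public List<String> resolve(List<String> names) {
        List<String> racks = new ArrayList<>(names.size());
        for (String name : names) {
            // Unknown hosts fall back to the default rack, matching the
            // behavior described above when no rack info is configured.
            racks.add(RACKS.getOrDefault(name, "/default-rack"));
        }
        return racks;
    }

    // Nothing is cached in this sketch; these methods exist on the
    // interface in recent Hadoop versions.
    @Override
    public void reloadCachedMappings() { }

    @Override
    public void reloadCachedMappings(List<String> names) { }
}
```

With a mapping like this in place, the namenode resolves each datanode's location string ("/rack1/datanode1" and so on) instead of lumping everything under /default-rack.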

When rack-awareness information is configured, Hadoop makes the following judgments when selecting the three datanodes:

1. If the uploading machine is not a datanode but a client, one datanode is randomly selected from all the slave machines as the machine that writes the first block (datanode1).

a) If the uploading machine is itself a datanode (for example, when a task in a mapreduce job writes data to HDFS through DFSClient), that datanode itself is used as the machine that writes the first block (datanode1).

2. Then, on a rack other than the one datanode1 belongs to, one datanode is randomly selected as the machine that writes the second block (datanode2).

3. Before the third replica is written, Hadoop checks whether the first two datanodes are on the same rack. If they are, it tries to select the third writing machine (datanode3) on another rack. If datanode1 and datanode2 are not on the same rack, a datanode on the rack where datanode2 is located is selected as datanode3.

4. After the list of three datanodes is obtained, and before the list is returned to DFSClient, the namenode first sorts it from near to far according to the "distance" between the writing client and each datanode; if the DFS writer is not itself a datanode, the node ranked first in the sorted list is the first to be written to. The client then writes the data blocks in this near-to-far order. The algorithm for judging the "distance" between two datanodes is the key here. Hadoop currently implements it as follows, taking two DatanodeInfo objects (node1, node2) representing datanodes as an example (a short sketch of the distance rule follows this list):

a) First, based on the node1 and node2 objects, the level of each datanode within the overall HDFS cluster hierarchy is obtained. The concept of a level needs some explanation: every datanode in the HDFS cluster is described by a hierarchical location string. Assume the HDFS topology is as follows: each datanode has its own location and level in the cluster. For example, if node1's location information is "/rack1/datanode1", then its level is 2, and so on.

b) After the levels of the two nodes are obtained, the algorithm walks upward from each node's position in the topology tree; for example, the level above "/rack1/datanode1" is "/rack1", and each such step adds 1 to the distance. The two nodes each search upward in this way until a common ancestor is found, and the accumulated count represents the distance between them. As shown in the figure above, the distance between node1 and node2 is therefore 4.

5. When the datanode list sorted by "distance" is returned to DFSClient, DFSClient creates a BlockOutputStream and writes the block data to the first node in the pipeline (the nearest node).

6. After the first write completes, the data is written to the remaining nodes in the datanode list in order of increasing distance, until the last replica is successfully written; DFSClient then reports success, and the block write operation ends.
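The distance rule from step 4 can be illustrated with a short standalone sketch. This is not Hadoop's actual NetworkTopology.getDistance implementation, only the counting rule described above applied to location strings; the paths are the hypothetical ones from the example.

```java
// Minimal sketch of the distance rule: each node walks up the topology
// tree toward the root, adding 1 per hop, until a common ancestor is
// reached; the total hop count is the distance.
public class TopologyDistance {

    static int distance(String path1, String path2) {
        String[] a = path1.substring(1).split("/");  // e.g. ["rack1", "datanode1"]
        String[] b = path2.substring(1).split("/");
        // Depth of the deepest common ancestor.
        int common = 0;
        while (common < a.length && common < b.length && a[common].equals(b[common])) {
            common++;
        }
        // Hops from each node up to the common ancestor.
        return (a.length - common) + (b.length - common);
    }

    public static void main(String[] args) {
        // Same rack: 1 hop up + 1 hop down = 2.
        System.out.println(distance("/rack1/datanode1", "/rack1/datanode2")); // 2
        // Different racks: 2 hops up + 2 hops down = 4, as in the article.
        System.out.println(distance("/rack1/datanode1", "/rack2/datanode3")); // 4
    }
}
```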

Through the above strategy, when the namenode selects the datanode list for a block write, it fully accounts for the distribution of block replicas across different racks while avoiding, as far as possible, the extra inter-rack network overhead described earlier.
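Putting rules 1 through 3 together, the selection logic can be sketched roughly as follows. This is a simplified illustration, not HDFS's real BlockPlacementPolicyDefault, which also weighs datanode load, free space, and other constraints; the Node type and the fallback behavior are invented for the sketch (Java 16+ for record and Stream.toList).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.Predicate;

// Simplified three-replica placement following the rules above.
public class ReplicaPlacementSketch {

    record Node(String host, String rack) {}  // hypothetical descriptor

    private final Random random = new Random();

    List<Node> chooseTargets(Node writer, List<Node> datanodes) {
        List<Node> targets = new ArrayList<>(3);

        // Rule 1 / 1a: a datanode writer takes the first replica itself;
        // otherwise pick any datanode at random.
        Node first = datanodes.contains(writer)
                ? writer
                : datanodes.get(random.nextInt(datanodes.size()));
        targets.add(first);

        // Rule 2: second replica on a different rack from the first.
        Node second = pickRandom(datanodes, n -> !n.rack().equals(first.rack()));
        targets.add(second);

        // Rule 3: first and second are on different racks here, so the
        // third replica goes on the same rack as the second.
        Node third = pickRandom(datanodes,
                n -> n.rack().equals(second.rack()) && !n.equals(second));
        targets.add(third);

        return targets;  // the namenode would then sort these by distance (rule 4)
    }

    private Node pickRandom(List<Node> nodes, Predicate<Node> ok) {
        List<Node> candidates = nodes.stream().filter(ok).toList();
        if (candidates.isEmpty()) {
            // Real HDFS retries with relaxed constraints; keep the sketch simple.
            return nodes.get(random.nextInt(nodes.size()));
        }
        return candidates.get(random.nextInt(candidates.size()));
    }
}
```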

When an instance is started and a file is uploaded to an HDFS cluster configured with rack information, the block information is as follows:

As can be seen from the figure above, when rack information is configured, the namenode writes two replicas on the same rack to reduce inter-rack network traffic and, for maximum fault tolerance, writes the third replica to a datanode on another rack. Now consider how data is read.

Read data

When reading a block of a file, Hadoop follows the same distance-based strategy:

1. First, the list of datanodes where the block is located is obtained; the list holds as many datanodes as the block has replicas.

2. The list is sorted by the distance from each datanode to the reader, from smallest to largest:

a) First, check whether a replica of the block exists locally; if so, the local datanode is used as the first datanode from which to read the block.

b) Next, check whether a datanode on the same local rack holds a replica of the block.

c) If neither is found, or if the node reading the data is not itself a datanode, the datanode list is returned in random order.
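Under the same assumptions as the placement sketch above, the read-side ordering might look like the following minimal sketch. Real HDFS performs this sorting on the namenode against its full network topology; this toy version only encodes the local / same-rack / elsewhere preference described in the list.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Order replicas for reading: local first, then same rack, then the rest.
public class ReadOrderSketch {

    record Node(String host, String rack) {}  // same hypothetical shape as before

    static List<Node> orderForRead(Node reader, List<Node> replicas) {
        List<Node> ordered = new ArrayList<>(replicas);
        // Shuffle first so nodes of equal rank stay in random order after
        // the stable sort -- this also covers rule c), where a non-datanode
        // reader matches nothing and every replica ranks the same.
        Collections.shuffle(ordered);
        ordered.sort(Comparator.comparingInt((Node n) -> rank(reader, n)));
        return ordered;
    }

    // 0 = replica on the reading node itself, 1 = same rack, 2 = elsewhere.
    private static int rank(Node reader, Node replica) {
        if (replica.host().equals(reader.host())) return 0;
        if (replica.rack().equals(reader.rack())) return 1;
        return 2;
    }
}
```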

On "how to test Hadoop performance optimization features" this article is shared here, I hope the above content can be of some help to you, so that you can learn more knowledge, if you think the article is good, please share it out for more people to see.
