
Hadoop rack awareness

2026-02-11 Update From: SLTechnology News&Howtos


Background

Recently I have been reviewing the big data material I learned earlier, starting with Hadoop rack awareness. Rack awareness in Hadoop can be implemented in two ways:

1) Implement the Java interface DNSToSwitchMapping, then set net.topology.node.switch.mapping.impl in the core-site.xml configuration file to the fully qualified name of the class that implements DNSToSwitchMapping, for example:

<property>
  <name>net.topology.node.switch.mapping.impl</name>
  <value>com.inspur.rackawar.test.MyDNSToSwitchMapping</value>
</property>

2) Most installations do not need a new implementation of the interface. They simply use the default ScriptBasedMapping implementation, which runs a user-defined script to describe the mapping. The path of the script is set by the configuration item topology.script.file.name in the core-site.xml file (in Hadoop 2 and later the key is named net.topology.script.file.name, and the older name is deprecated). Unless the mapping logic is very complex, I personally recommend the second way, which is flexible and simple.

A Hadoop distributed cluster usually contains a large number of servers. Because rack slots and switch ports are limited, a large cluster usually spans several racks. Network speed between servers in the same rack is usually higher than between servers in different racks, and traffic between racks is usually limited by the bandwidth between the upper-level switches.

In a Hadoop cluster specifically, HDFS stores files as blocks, and each block has multiple copies (3 by default). For the sake of data safety and efficiency, Hadoop's default storage policy for 3 replicas is:

The first copy of a block is placed on the same node as the client (if the client is outside the cluster, the first node is chosen at random).

The second copy is placed on a randomly chosen node in a rack different from that of the first node.

The third copy is placed on a different node in the same rack as the second copy.

Any further copies are placed on random nodes in the cluster.
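The placement rules above can be sketched in a few lines of Python. This is only an illustrative model, not Hadoop's actual implementation: the function name place_replicas, the helper _pick, and the rack and node names are all hypothetical. It places the third copy in the second copy's rack, as recent Hadoop releases do, and it assumes the replication factor does not exceed the number of nodes.

```python
import random


def _pick(rng, preferred, chosen, all_nodes):
    """Pick an unused node from `preferred`, else from any unused node."""
    pool = [n for n in preferred if n not in chosen]
    if not pool:
        pool = [n for n in all_nodes if n not in chosen]
    return rng.choice(pool)


def place_replicas(topology, writer=None, replication=3, rng=random):
    """Return the nodes chosen for one block.

    topology: dict of rack path -> list of node names (hypothetical layout).
    writer:   node name of the client if it runs on a cluster node, else None.
    """
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    all_nodes = sorted(rack_of)

    # 1st copy: the client's own node, or a random node for an outside client.
    first = writer if writer in rack_of else rng.choice(all_nodes)
    chosen = [first]

    # 2nd copy: a random node on a rack different from the first copy's rack.
    if replication >= 2:
        off_rack = [n for n in all_nodes if rack_of[n] != rack_of[first]]
        chosen.append(_pick(rng, off_rack, chosen, all_nodes))

    # 3rd copy: a different node on the same rack as the second copy.
    if replication >= 3:
        second_rack = topology[rack_of[chosen[1]]]
        chosen.append(_pick(rng, second_rack, chosen, all_nodes))

    # Any further copies: random nodes that are not used yet.
    while len(chosen) < replication:
        chosen.append(_pick(rng, all_nodes, chosen, all_nodes))
    return chosen


if __name__ == "__main__":
    topology = {
        "/rack1": ["dn1", "dn2", "dn3"],
        "/rack2": ["dn4", "dn5", "dn6"],
    }
    print(place_replicas(topology, writer="dn1"))
```

Running the sketch with writer="dn1" always yields dn1 first, followed by two distinct nodes from /rack2; which two depends on the random choice.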

With this policy, a read of the file a block belongs to can look for a copy in the local rack first, and if an entire rack fails, a copy of the block can still be found on another rack. This is efficient and provides fault tolerance for the data at the same time.

However, Hadoop's awareness of racks is not adaptive. The cluster cannot work out by itself which rack a slave machine belongs to; the Hadoop administrator has to tell it explicitly which machine belongs to which rack. When the NameNode initializes, it keeps this machine-to-rack mapping in memory. The mapping is then used to choose DataNodes whenever a DataNode list is assigned to an HDFS block write (for example, three replicas of a block map to three DataNodes), so that Hadoop's block allocation strategy is achieved: distribute the three copies across different racks as far as possible.

The next question is: how do we tell the Hadoop NameNode which slave machines belong to which rack? The configuration steps follow.

Configuration

Rack awareness is not enabled by default in Hadoop. Under normal circumstances, therefore, HDFS picks machines at random when placing data. It is quite possible that, when writing a block, Hadoop writes the first copy to rack1, then randomly writes the second copy to rack2, which generates data traffic between the two racks, and then, again at random, writes the third copy back to rack1, which generates another flow between the two racks. When a job processes a very large amount of data, or a very large amount of data is pushed into Hadoop, this sharply increases network traffic between racks, which becomes a performance bottleneck and in turn affects job performance and even the services of the whole cluster.

Enabling rack awareness is simple: configure one option in the core-site.xml configuration file on the machine where the NameNode runs:
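To make the traffic argument concrete, the short sketch below counts how many times a write crosses a rack boundary as the data is passed from one chosen node to the next. It assumes the copies are written in the listed order; the function name cross_rack_hops and the node and rack names are hypothetical.

```python
def cross_rack_hops(pipeline, rack_of):
    """Count transfers between consecutive nodes that sit on different racks."""
    racks = [rack_of[node] for node in pipeline]
    return sum(1 for a, b in zip(racks, racks[1:]) if a != b)


# Hypothetical cluster layout: two racks with two nodes each.
RACK_OF = {"dn1": "/rack1", "dn2": "/rack1", "dn3": "/rack2", "dn4": "/rack2"}

scattered = ["dn1", "dn3", "dn2"]  # rack1 -> rack2 -> rack1
grouped = ["dn1", "dn3", "dn4"]    # rack1 -> rack2 -> rack2

print(cross_rack_hops(scattered, RACK_OF))  # 2
print(cross_rack_hops(grouped, RACK_OF))    # 1
```

The scattered order crosses racks twice for a single block, while keeping the last two copies in one rack crosses only once and still leaves copies on two racks.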

<property>
  <name>topology.script.file.name</name>
  <value>/software/hadoop/etc/hadoop/topology.py</value>
</property>
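A minimal sketch of what such a topology.py could look like follows. Hadoop's script-based mapping passes one or more IP addresses or host names as command-line arguments and reads one rack path per argument from standard output. The addresses, host names and rack names in the table are hypothetical and must be replaced with the real cluster layout; /default-rack is the rack Hadoop itself uses for unmapped nodes.

```python
#!/usr/bin/env python3
import sys

# Rack returned for any address that is not listed in the table.
DEFAULT_RACK = "/default-rack"

# Hypothetical address-to-rack table. List both IP addresses and host names,
# because the script may be called with either form.
RACK_OF = {
    "192.168.1.11": "/rack1",
    "192.168.1.12": "/rack1",
    "192.168.1.21": "/rack2",
    "slave1": "/rack1",
    "slave2": "/rack1",
    "slave3": "/rack2",
}


def resolve(names):
    """Return one rack path per input name, in the same order."""
    return [RACK_OF.get(name, DEFAULT_RACK) for name in names]


def main(argv):
    # Print one rack path per argument so the output lines up with the input.
    racks = resolve(argv[1:])
    if racks:
        print("\n".join(racks))


if __name__ == "__main__":
    main(sys.argv)
```

The file has to be executable by the user that runs the NameNode. It can be checked by hand before restarting, for example by running it with 192.168.1.11 and an unknown host as arguments, which should print /rack1 and /default-rack on two lines.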

