How to configure Hadoop rack awareness


This article focuses on how to configure Hadoop rack awareness. The method introduced here is simple, fast, and practical; interested readers may wish to take a look and follow along.

1. Background

The design of Hadoop takes both data safety and efficiency into account. By default, HDFS stores three replicas of each block: one on the local node, one on another node in the same rack, and one on a node in a different rack. This way, if the local replica is damaged, the node can fetch the data from a neighbor in the same rack, which is faster than fetching it across racks; at the same time, if the network of an entire rack fails, the data can still be found on nodes in other racks. To reduce overall bandwidth consumption and read latency, HDFS tries to have the reader read the nearest replica: if there is a replica on the reader's own rack, that replica is read, and if the cluster spans multiple data centers, the client prefers a replica in the local data center. So how does Hadoop determine whether any two nodes are in the same rack or in different racks? The answer is rack awareness.

Rack awareness is not enabled in Hadoop by default, so under normal circumstances HDFS picks machines at random when choosing where to place replicas. That is, it may well write the first block, block1, to rack1, then randomly write block2 to rack2, generating data traffic between the two racks, and then, again at random, write block3 back to rack1, generating yet another cross-rack flow. When a job processes a very large amount of data, or a very large amount of data is pushed into Hadoop, this behavior makes inter-rack network traffic grow dramatically and become a performance bottleneck, which in turn hurts job performance and even the services of the whole cluster.

2. Configuration

There are two ways to configure rack awareness. One is to configure a script that performs the mapping; the other is to map network locations by implementing the resolve() method of the DNSToSwitchMapping interface.

Hadoop itself has no innate rack awareness; the topology must be supplied manually. In the FSNamesystem class, the resolveNetworkLocation() method is responsible for resolving network locations. Its dnsToSwitchMapping field refers to the class that does the actual conversion, and it is assigned as follows:

this.dnsToSwitchMapping = ReflectionUtils.newInstance(
    conf.getClass("topology.node.switch.mapping.impl",
        ScriptBasedMapping.class, DNSToSwitchMapping.class),
    conf);

That is, the implementation behind dnsToSwitchMapping is specified by the "topology.node.switch.mapping.impl" parameter in the "core-site.xml" configuration file. The default is ScriptBasedMapping, which performs the network-location mapping by reading a pre-written script file. If no script is configured, the default value "/default-rack" is used as the network location of every node.
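For illustration, making that default explicit in core-site.xml would look like the following property; normally you only set it when replacing the mapper, as in the second method below.

<property>
  <name>topology.node.switch.mapping.impl</name>
  <value>org.apache.hadoop.net.ScriptBasedMapping</value>
</property>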

Let's start with the first way to configure rack awareness: using a script to map network locations.

Enabling Hadoop's rack-awareness feature is very simple: configure one option in the /home/bigdata/apps/hadoop-talkyun/etc/hadoop/core-site.xml file on the node where the NameNode runs:

<property>
  <name>topology.script.file.name</name>
  <value>/home/bigdata/apps/hadoop-talkyun/etc/hadoop/topology.sh</value>
</property>

The value of this option points to an executable program, usually a script, that takes parameters and produces output. The parameters are typically the IP addresses of DataNode machines, and the output is typically the rack where the DataNode with that IP address sits, such as "/rack1". When the NameNode starts, it checks whether this option is set; if it is, rack awareness is enabled. The NameNode then locates the script and, upon receiving each DataNode's heartbeat, passes that DataNode's IP address to the script as a parameter and saves the output in an in-memory map as that DataNode's rack ID.
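To make that contract concrete, here is a minimal sketch; the subnet-to-rack mapping is made up for the example, and a table-driven version follows below:

#!/bin/bash
# Minimal illustration of the script contract: accept any number of IPs or
# hostnames as arguments and print a rack path for each one.
# (Hypothetical mapping: everything on 192.168.147.0/24 goes to /dc1/rack1.)
for node in "$@"; do
  case "$node" in
    192.168.147.*) echo -n "/dc1/rack1 " ;;
    *)             echo -n "/default-rack " ;;   # unknown nodes fall back to the default rack
  esac
done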

For a real cluster, writing the script requires a clear picture of the actual network topology and rack layout, so that machine IP addresses and hostnames are correctly mapped to their racks. An official reference script can be found on the Hadoop wiki. First, the shell script:

topology.sh:

#!/bin/bash
# Map each argument (IP address or hostname) to its rack by looking it up in topology.data.
HADOOP_CONF=/etc/hadoop/conf

while [ $# -gt 0 ]; do                    # $# is the number of arguments passed to the script
  nodeArg=$1
  exec < ${HADOOP_CONF}/topology.data     # read the mapping file on standard input
  result=""
  while read line; do                     # scan the file line by line
    ar=( $line )                          # split the line into fields: IP, hostname, rack
    if [ "${ar[0]}" = "$nodeArg" ] || [ "${ar[1]}" = "$nodeArg" ]; then
      result="${ar[2]}"
    fi
  done
  shift
  if [ -z "$result" ]; then
    echo -n "/default/rack "              # fallback for nodes missing from the file
  else
    echo -n "$result "                    # trailing space separates results for multiple arguments
  fi
done

topology.data has one node per line, in the format: IP address, hostname, then the network location /switchXX/rackXX (here /dc1 is the data-center/switch level):

192.168.147.91 tbe192168147091 /dc1/rack1
192.168.147.92 tbe192168147092 /dc1/rack1
192.168.147.93 tbe192168147093 /dc1/rack2
192.168.147.94 tbe192168147094 /dc1/rack3
192.168.147.95 tbe192168147095 /dc1/rack3
192.168.147.96 tbe192168147096 /dc1/rack3
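With topology.sh and topology.data both in place (and HADOOP_CONF in the script pointing at the directory that holds topology.data), a quick manual test might look like this; the expected output follows from the sample data above:

$ bash topology.sh 192.168.147.91 192.168.147.93 192.168.147.99
/dc1/rack1 /dc1/rack2 /default/rack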

Note that on the NameNode the lookups use IP addresses (a hostname alone will not match), while on the JobTracker they use hostnames (an IP alone will not match), so it is best for the file to carry both the IP and the hostname, as above.

The second way to configure rack awareness is to implement the DNSToSwitchMapping interface and override its resolve() method. This means writing a Java class to perform the mapping yourself, and then pointing "topology.node.switch.mapping.impl" in "core-site.xml" at your implementation class. When network locations are resolved, the resolve() method of your class is called to do the conversion. My implementation is fairly simple, just enough to do the job; the code is as follows (experts, feel free to skip ahead):

package org.apache.hadoop.net;

import java.util.ArrayList;
import java.util.List;

public class MyResolveNetworkTopology implements DNSToSwitchMapping {

    // Static tables: hostnameLists[i] and ipLists[i] both map to resolvedLists[i].
    private String[] hostnameLists = { "tt156", "tt163", "tt164", "tt165" };
    private String[] ipLists = { "10.32.11.156", "10.32.11.163", "10.32.11.164", "10.32.11.165" };
    private String[] resolvedLists = { "/dc1/rack1", "/dc1/rack1", "/dc1/rack2", "/dc1/rack2" };

    @Override
    public List<String> resolve(List<String> names) {
        names = NetUtils.normalizeHostNames(names);
        List<String> result = new ArrayList<String>(names.size());
        if (names.isEmpty()) {
            return result;
        }
        for (int i = 0; i < names.size(); i++) {
            String name = names.get(i);
            for (int j = 0; j < hostnameLists.length; j++) {
                // Match either the hostname or the IP, and return the rack for that slot.
                if (name.equals(hostnameLists[j]) || name.equals(ipLists[j])) {
                    result.add(resolvedLists[j]);
                }
            }
        }
        return result;
    }
}

I put this custom MyResolveNetworkTopology class under the org.apache.hadoop.net package of the core jar. The configuration in the "core-site.xml" file is then as follows:

<property>
  <name>topology.node.switch.mapping.impl</name>
  <value>org.apache.hadoop.net.MyResolveNetworkTopology</value>
  <description>The default implementation of the DNSToSwitchMapping. It
    invokes a script specified in topology.script.file.name to resolve
    node names. If the value for topology.script.file.name is not set, the
    default value of DEFAULT_RACK is returned for all node names.
  </description>
</property>
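The class must also be on the NameNode's classpath before it starts. A hypothetical build-and-deploy sequence might look like the following; the jar name and destination are illustrative, and it assumes the hadoop command is on PATH so that "hadoop classpath" prints the compile classpath:

# Compile against the Hadoop jars and package the class.
javac -cp "$(hadoop classpath)" org/apache/hadoop/net/MyResolveNetworkTopology.java
jar cf myrackmapping.jar org/apache/hadoop/net/MyResolveNetworkTopology.class
# Drop the jar where the NameNode loads libraries (e.g. $HADOOP_HOME/lib), then restart the NameNode.
cp myrackmapping.jar $HADOOP_HOME/lib/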

Once either of the two methods above is configured, messages like the following are printed in the NameNode and JobTracker logs:

INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /dc1/rack3/192.168.147.94:50010

This means that the rack-aware configuration is successful.

To summarize the two approaches: the script is more flexible but slower to execute, because the system has to leave the JVM and fork a shell; the Java class performs better but cannot be changed after compilation, so it is less flexible. Choose the strategy that fits your situation.

Addendum: to view the Hadoop rack topology, use the following command:

./hadoop dfsadmin -printTopology
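With the sample topology above, the output is shaped roughly like the following (the exact format varies by Hadoop version):

Rack: /dc1/rack1
   192.168.147.91:50010 (tbe192168147091)
   192.168.147.92:50010 (tbe192168147092)

Rack: /dc1/rack2
   192.168.147.93:50010 (tbe192168147093)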

At this point, I believe you have a deeper understanding of how to configure Hadoop rack awareness. Why not try it out in practice!
