Design and Optimization method of Network Architecture in Hadoop Cluster Environment

2025-02-27 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)05/31 Report--

This article explains the design and optimization of the network architecture in a Hadoop cluster environment. The method introduced here is simple, practical, and quick to apply, so interested readers may wish to read on and learn about the design and optimization of network architecture in a Hadoop cluster environment.

Network characteristics of a big data Hadoop environment

The nodes in a Hadoop cluster are connected through the network, and the following MapReduce procedures transmit data over that network.

(1) Writing data. The data-writing process occurs when initial data, or a large chunk of new data, is written to HDFS. Each block written must be replicated to other nodes, so the data is transferred over the network.
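As a rough illustration of the write path, the replication cost can be sketched as follows. This is a back-of-the-envelope estimate, not Hadoop's actual accounting; the function name and the assumption that the first replica is written locally (so each byte crosses the network replication_factor - 1 times) are illustrative.

```python
# Rough estimate of network traffic generated when writing a file to HDFS.
# Assumes the common case where the client runs on a DataNode, so the first
# replica is written locally and each byte is forwarded over the network
# (replication_factor - 1) times through the write pipeline.

def hdfs_write_network_bytes(file_size_bytes, replication_factor=3):
    """Bytes sent over the network to store one file with the given replication."""
    return file_size_bytes * (replication_factor - 1)

# Example: a 1 GiB file with 3-way replication sends 2 GiB across the network.
one_gib = 1024 ** 3
print(hdfs_write_network_bytes(one_gib))  # 2147483648
```

If the client is outside the cluster, all replicas cross the network, and the multiplier becomes replication_factor instead.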

(2) Executing the job.

① Map phase. In the Map phase, almost no data needs to be transmitted over the network. At the start of the Map phase, however, if the HDFS data is not local (the block is not stored on the node running the task and must be copied from another node), that data is transferred over the network.

② Shuffle phase. This is the stage of job execution in which data is transferred over the network, and how much is transferred depends on the job: the output of the Map phase is sent to the Reducers, where it is sorted.

③ Reduce phase. Because the data the Reducers need has already arrived during the Shuffle phase, no network transmission is required at this stage.

④ Output replication. The output of MapReduce is stored as files on HDFS. When the output is written to HDFS, the replica blocks are transmitted across the network.

(3) Reading data. The data-reading process occurs when an application, such as a Web site, an index, or a SQL database, reads data from HDFS. The network also matters to Hadoop's control plane: the signaling and operation of both HDFS and the MapReduce framework are affected by it.
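The per-phase traffic described above can be sketched as a back-of-the-envelope model. All parameter names and default values here are illustrative assumptions (e.g. 90% data locality, map output spread evenly across nodes), not Hadoop internals:

```python
# Back-of-the-envelope model of MapReduce network traffic per phase, for a
# cluster of `nodes` machines. Assumes a reducer's input is spread evenly
# across all mappers, so roughly (nodes - 1) / nodes of the shuffle data
# crosses the network. All sizes are in bytes.

def estimate_job_traffic(input_bytes, map_output_bytes, output_bytes,
                         nodes, replication=3, data_local_fraction=0.9):
    """Return a dict of estimated network bytes for each MapReduce stage."""
    return {
        # Map: only non-local input splits are pulled over the network.
        "map_input": input_bytes * (1 - data_local_fraction),
        # Shuffle: map output sent to reducers running on other nodes.
        "shuffle": map_output_bytes * (nodes - 1) / nodes,
        # Reduce: its input already arrived during the shuffle.
        "reduce": 0,
        # Output: each result block is replicated to (replication - 1) remote nodes.
        "output_replication": output_bytes * (replication - 1),
    }

traffic = estimate_job_traffic(
    input_bytes=10 * 1024**3, map_output_bytes=4 * 1024**3,
    output_bytes=1 * 1024**3, nodes=20)
```

Even this crude model makes the ranking visible: for shuffle-heavy jobs the Shuffle phase dominates, while for jobs with large outputs the replication traffic does.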

Five network characteristics

Cisco tested the network environment under a Hadoop cluster. The results show that a resilient network is very important to a Hadoop cluster, and that the network characteristics with the greatest impact, in order, are: network availability and resilience, burst-traffic handling and queue depth, network oversubscription ratio, DataNode network access, and network latency.

(1) Network availability and resilience. Deploy a highly redundant, scalable network that can support the growth of the Hadoop cluster. Providing multiple links between DataNodes is better than topologies with single (or dual) points of failure. Choose switches and routers that have been proven in the industry to deliver high network availability to servers.

(2) Burst-traffic handling and queue depth. Some HDFS and MapReduce operations generate burst traffic, such as loading files into HDFS or writing result files back to HDFS over the network. If the network cannot absorb a burst, packets are dropped, so adequate buffering can mitigate the impact of burst traffic. Make sure to select switches and routers with sufficient buffering and queuing capability to handle traffic bursts effectively.
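A toy model shows why queue depth matters: packets arriving faster than the link can drain are buffered up to the queue depth, and everything beyond that is dropped. The parameter names and figures below are illustrative, not vendor specifications:

```python
# Toy model of a switch port with a finite queue. During a burst, the link
# drains packets at line_rate_pps; arrivals beyond that must be buffered,
# and arrivals beyond the buffer (queue_depth packets) are dropped.

def drops_during_burst(burst_packets, line_rate_pps, burst_duration_s, queue_depth):
    """Packets dropped when a burst exceeds link capacity plus buffer space."""
    drained = line_rate_pps * burst_duration_s   # packets forwarded during the burst
    excess = burst_packets - drained             # packets that must be buffered
    return max(0, excess - queue_depth)          # overflow beyond the buffer is lost

# A 50,000-packet burst over 10 ms into a link draining 1,000,000 pps:
# 10,000 packets drain, 40,000 must queue; a 16,000-packet buffer drops 24,000.
print(drops_during_burst(50_000, 1_000_000, 0.01, 16_000))  # 24000.0
```

Doubling the queue depth in this model halves the drops for that burst, which is the intuition behind preferring deep-buffered switches for HDFS ingest traffic.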

(3) Network oversubscription ratio. A good network design must account for congestion at key points in the network. A ToR switch that receives 20 Gbps from its servers but has only two 1 Gbps uplink ports (a 10:1 oversubscription ratio) will drop packets and seriously degrade cluster performance. On the other hand, an over-provisioned network is very expensive. In general, an acceptable oversubscription ratio is about 4:1 at the server access layer and about 2:1 between the access layer and the aggregation or core layer.
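The arithmetic behind the ratio is simple: aggregate server-facing bandwidth divided by aggregate uplink bandwidth. A minimal sketch, using the figures from the example above:

```python
# Oversubscription (overload) ratio of a ToR switch: total downlink capacity
# from the servers divided by total uplink capacity toward the aggregation
# layer. Port counts and speeds follow the example in the text.

def oversubscription_ratio(server_ports, server_gbps, uplink_ports, uplink_gbps):
    """Ratio of aggregate server-facing bandwidth to aggregate uplink bandwidth."""
    return (server_ports * server_gbps) / (uplink_ports * uplink_gbps)

print(oversubscription_ratio(20, 1, 2, 1))  # 10.0 -- the problematic 10:1 case
print(oversubscription_ratio(8, 1, 2, 1))   # 4.0  -- acceptable access-layer ratio
```

Planning in the other direction works too: for 20 servers at 1 Gbps and a 4:1 target, the switch needs at least 5 Gbps of uplink capacity.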

(4) DataNode network access. Base the bandwidth configuration on the cluster workload. Typically each node in a cluster has one or two 1 Gbps uplink ports; whether to move to 10 Gbps server connectivity depends on price and performance.

(5) Network latency. Variations in switch and router latency have only a limited impact on cluster performance; compared with network latency, application-layer latency affects jobs far more. However, network latency can still have subtle effects on the application, such as triggering unnecessary application switchovers.

At this point, you should have a deeper understanding of the design and optimization of network architecture in a Hadoop cluster environment. You may want to try it out in practice. For more related content, browse the relevant channels on this site, follow us, and keep learning!
