Anatomy of a file read
To get an idea of the data flow between a client and HDFS, the namenode, and the datanodes it interacts with, consider the sequence of events that occurs when a file is read, shown in the following figure.
The client opens the file it wishes to read by calling the open() method on the FileSystem object, which for HDFS is an instance of DistributedFileSystem (step 1 in the figure). DistributedFileSystem calls the namenode via RPC to determine the locations of the blocks at the start of the file (step 2). For each block, the namenode returns the addresses of the datanodes that hold a copy of that block; furthermore, these datanodes are sorted by their proximity to the client. If the client is itself a datanode (in the case of a MapReduce task, for instance) and holds a copy of the block, it reads from the local datanode.
DistributedFileSystem returns an FSDataInputStream object (an input stream that supports file seeks) to the client for it to read data from. FSDataInputStream in turn wraps a DFSInputStream object, which manages the datanode and namenode I/O.
Next, the client calls read() on the stream (step 3). DFSInputStream, which has stored the datanode addresses for the first few blocks of the file, connects to the closest datanode for the first block. Data is streamed from that datanode back to the client, which calls read() on the stream repeatedly (step 4). When the end of the block is reached, DFSInputStream closes the connection to that datanode and finds the best datanode for the next block (step 5). To the client this is just one continuous stream; the switching between blocks is transparent.
As the client reads through the stream, blocks are read in order, with DFSInputStream opening new connections to datanodes as needed. It also asks the namenode, when required, for the datanode locations of the next batch of blocks. Once the client has finished reading, it calls close() on the FSDataInputStream (step 6).
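To make the sequence above concrete, here is a minimal client-side sketch that uses the FileSystem API described in this section to open a file, copy it to standard output, and close the stream. The HDFS URI, path, and class name are placeholders chosen for illustration, and error handling is kept to a minimum.

import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadFromHdfs {
    public static void main(String[] args) throws Exception {
        String uri = "hdfs://namenode:8020/user/demo/input.txt"; // placeholder location
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        InputStream in = null;
        try {
            // Step 1: open() returns an FSDataInputStream (a seekable input stream)
            in = fs.open(new Path(uri));
            // Steps 3-5: read() streams each block from the closest datanode in turn
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            // Step 6: close the stream when finished
            IOUtils.closeStream(in);
        }
    }
}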
During a read, if DFSInputStream encounters an error while communicating with a datanode, it tries the next closest datanode holding that block. It also remembers the failed datanode so that it does not needlessly retry it for later blocks. DFSInputStream additionally verifies checksums for the data transferred from the datanode; if a corrupted block is found, DFSInputStream reports it to the namenode and then attempts to read a replica of the block from another datanode.
A key point of this design is that the namenode merely tells the client the best datanode for each block, and the client then connects to those datanodes directly to retrieve the data. Because the data traffic is spread across all the datanodes in the cluster, this design allows HDFS to scale to a large number of concurrent clients. At the same time, the namenode only has to service block-location requests (which it answers from information held in memory, so this is very efficient) and does not serve data itself; otherwise it would quickly become a bottleneck as the number of clients grew.
Network Topology and Hadoop
What does it mean for two nodes in a local network to be "close" to each other? In massive data processing, the main limiting factor is bandwidth, the rate at which data can be transferred between nodes, because it is a scarce resource. The idea here is to use the bandwidth between two nodes as a measure of their distance.
Rather than measuring the bandwidth between nodes, which is hard to do in practice (it requires a quiet cluster, and the number of node pairs grows with the square of the number of nodes), Hadoop takes a simple approach: the network is represented as a tree, and the distance between two nodes is the sum of their distances to their closest common ancestor. Levels in the tree are not predefined, but they usually correspond to the data center, the rack, and the node that a process runs on. The underlying idea is that the available bandwidth decreases for each of the following scenarios in turn:
Processes on the same node
Different nodes on the same rack
Nodes on different racks in the same data center
Nodes in different data centers
For example, suppose there is a node n1 on rack r1 in data center d1. It can be represented as /d1/r1/n1. Using this notation, here are the distances for the four scenarios:
distance(/d1/r1/n1, /d1/r1/n1) = 0 (processes on the same node)
distance(/d1/r1/n1, /d1/r1/n2) = 2 (different nodes on the same rack)
distance(/d1/r1/n1, /d1/r2/n3) = 4 (nodes on different racks in the same data center)
distance(/d1/r1/n1, /d2/r3/n4) = 6 (nodes in different data centers)
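As a rough illustration of this distance calculation, the following standalone sketch (not Hadoop's own NetworkTopology implementation) splits the /data-center/rack/node paths and counts the hops from each node up to their closest common ancestor, reproducing the four values above.

public class TreeDistance {
    // Distance = hops from each node up to the closest common ancestor.
    static int distance(String a, String b) {
        String[] pa = a.substring(1).split("/"); // e.g. ["d1", "r1", "n1"]
        String[] pb = b.substring(1).split("/");
        int common = 0;
        while (common < pa.length && common < pb.length && pa[common].equals(pb[common])) {
            common++;
        }
        return (pa.length - common) + (pb.length - common);
    }

    public static void main(String[] args) {
        System.out.println(distance("/d1/r1/n1", "/d1/r1/n1")); // 0
        System.out.println(distance("/d1/r1/n1", "/d1/r1/n2")); // 2
        System.out.println(distance("/d1/r1/n1", "/d1/r2/n3")); // 4
        System.out.println(distance("/d1/r1/n1", "/d2/r3/n4")); // 6
    }
}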
Finally, it is important to realize that Hadoop cannot discover the network topology on its own; it needs us to understand our own topology and define it for the cluster.
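How the topology gets defined depends on the deployment; one common route is to point Hadoop at a rack-awareness script through configuration. Below is a minimal sketch that assumes the Hadoop 2+ property name net.topology.script.file.name and a hypothetical script path; in practice this is usually set in core-site.xml on the cluster rather than in client code.

import org.apache.hadoop.conf.Configuration;

public class TopologyConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // The script (path is a placeholder) maps host names or IPs to rack paths such as /d1/r1.
        // Hadoop 1.x used the older key topology.script.file.name instead.
        conf.set("net.topology.script.file.name", "/etc/hadoop/conf/topology.sh");
    }
}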
Anatomy of a file write
Next, let's look at how HDFS writes a file. Although the details are fairly involved, they are worth understanding because they clearly illustrate HDFS's consistency model.
The case to consider is how to create a new file, write data to it, and finally close the file. See the figure below.
The client creates a new file by calling create() on the DistributedFileSystem object (step 1). DistributedFileSystem makes an RPC call to the namenode to create a new file in the filesystem's namespace, with no blocks associated with it yet (step 2). The namenode performs various checks to make sure the file does not already exist and that the client has permission to create it. If these checks pass, the namenode makes a record of the new file; otherwise, file creation fails and an IOException is thrown to the client. DistributedFileSystem returns an FSDataOutputStream object for the client to start writing data to. Just as in the read case, FSDataOutputStream wraps a DFSOutputStream object, which handles communication with the datanodes and the namenode.
As the client writes data (step 3), DFSOutputStream splits it into packets and writes them to an internal queue called the data queue. The data queue is consumed by the DataStreamer, whose responsibility is to ask the namenode to allocate new blocks by picking a list of suitable datanodes to store the replicas. This list of datanodes forms a pipeline; assuming the replication level is 3, there are three nodes in the pipeline. DataStreamer streams the packets to the first datanode in the pipeline, which stores each packet and forwards it to the second datanode in the pipeline. Similarly, the second datanode stores the packet and forwards it to the third (and last) datanode in the pipeline (step 4).
DFSOutputStream also maintains an internal queue of packets that are waiting to be acknowledged by the datanodes, called the ack queue. A packet is removed from the ack queue only once it has been acknowledged by all the datanodes in the pipeline (step 5).
If a datanode fails while data is being written, the following actions are taken (transparently to the client writing the data). First, the pipeline is closed, and any packets in the ack queue are added back to the front of the data queue so that datanodes downstream of the failed node do not miss any packets. The current block on the healthy datanodes is given a new identity, which is passed to the namenode, so that the partial block on the failed datanode can be deleted if that datanode recovers later. The failed datanode is removed from the pipeline, and the rest of the block's data is written to the two healthy datanodes remaining in the pipeline. When the namenode notices that the block is under-replicated, it arranges for a further replica to be created on another node. Subsequent blocks are then handled as normal.
It is rare, but possible, for multiple datanodes to fail while a block is being written. As long as dfs.replication.min replicas (1 by default) are written, the write succeeds, and the block is then replicated asynchronously across the cluster until its target replication factor is reached (dfs.replication, 3 by default).
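These replication settings are ordinary configuration properties, and the replication factor can also be adjusted per file through the FileSystem API. A brief sketch with placeholder paths and values follows; note that newer releases refer to the minimum-replication property as dfs.namenode.replication.min.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSettings {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Target replica count for files written by this client (cluster default is 3).
        conf.setInt("dfs.replication", 3);
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf);
        // Change the target replica count of an existing file (path is a placeholder);
        // the extra copies are created or removed asynchronously in the background.
        fs.setReplication(new Path("/user/demo/output.txt"), (short) 2);
    }
}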
After the client has finished writing data, it calls close() on the stream (step 6). This flushes all remaining packets to the datanode pipeline and waits for acknowledgements before contacting the namenode to signal that the file is complete (step 7). The namenode already knows which blocks make up the file (because DataStreamer asked it to allocate them), so it only has to wait for the blocks to be minimally replicated before returning success.
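Putting the client-side steps together, here is a matching sketch of the write path just described: it creates a file, writes a few bytes through the pipeline, and closes the stream. As before, the URI, path, and class name are placeholders.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteToHdfs {
    public static void main(String[] args) throws Exception {
        String uri = "hdfs://namenode:8020/user/demo/output.txt"; // placeholder location
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        // Steps 1-2: create() asks the namenode to record the new file (no blocks yet)
        FSDataOutputStream out = fs.create(new Path(uri));
        try {
            // Steps 3-5: the data is split into packets and pushed through the datanode pipeline
            out.write("hello hdfs\n".getBytes("UTF-8"));
        } finally {
            // Steps 6-7: flush remaining packets, wait for acks, and signal completion to the namenode
            out.close();
        }
    }
}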
Replica Placement
How does the namenode choose which datanodes to store replicas on? There is a tradeoff between reliability, write bandwidth, and read bandwidth. For example, storing all replicas on a single node incurs the lowest write-bandwidth penalty, because the replication pipeline runs on a single node, but it offers no real redundancy (if that node fails, the data in the block is lost). Meanwhile, the read bandwidth between servers on the same rack is very high. At the other extreme, placing replicas in different data centers maximizes redundancy, but at a high cost in bandwidth. Even within a single data center (and so far Hadoop clusters have typically run within a single data center), there are many possible placement strategies. In fact, Hadoop's placement strategy was changed in release 0.17.0 to one that keeps blocks relatively evenly distributed across the cluster, and in releases from the 1.x series onward the block placement policy can be chosen.
Hadoop's default placement strategy is to put the first replica on the node where the client is running (if the client is running outside the cluster, a node is chosen at random, although the system avoids picking nodes that are too slow or too busy). The second replica is placed on a randomly chosen node on a different rack from the first (off-rack). The third replica is placed on the same rack as the second, on a different randomly chosen node. Further replicas are placed on randomly chosen nodes in the cluster, although the system tries to avoid placing too many replicas on the same rack.
Once the replica locations have been chosen, a pipeline is built, taking the network topology into account. For a replication factor of 3, the pipeline might look like the one shown in the following illustration.
Overall, this approach strikes a good balance among reliability (blocks are stored on two different racks), write bandwidth (writes only have to traverse a single switch), read performance (there is a choice of two racks to read from), and block distribution across the cluster (clients write only a single block on the local rack).