2025-01-18 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/03 Report--
1. Hadoop architecture
Hadoop consists of three modules: distributed storage (HDFS), distributed computing (MapReduce), and the resource scheduling engine (YARN).
2. HDFS architecture
2.1 NameNode
The NameNode manages file metadata and handles client requests.
The NameNode manages the namespace (NameSpace) of the HDFS file system.
The NameNode maintains the file system tree, the metadata of every file and directory in that tree, the mapping from files to blocks, and the mapping from blocks to DataNodes.
The NameNode keeps two files: the namespace image (fsimage) and the operation log (edit log). This information is cached in RAM, and both files are also persisted to the local disk.
The NameNode records which DataNodes hold each block of each file, but it does not permanently store these block locations, because they are rebuilt from the DataNodes when the system starts: at startup, each DataNode registers with the NameNode and reports the blocks it holds.
2.1.1 Metadata information
The metadata covers: file name, directory structure, file attributes (creation time, replica count, permissions), and the list of blocks for each file.
The mapping from each block in that list to the DataNodes that store it is loaded into memory, forming the file-to-block-to-DataNode reference chain of the file system. The metadata is periodically saved to local disk, but block locations are not persisted: they are reported by DataNodes at registration and maintained at run time.
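The mappings described above can be sketched as plain data structures. This is a simplified illustration, not Hadoop's actual classes; all file, block, and node names are made up:

```python
# Sketch of the NameNode's in-memory mappings (illustrative only).

# file -> ordered list of block IDs (persisted via fsimage/edit log)
file_to_blocks = {
    "/logs/app.log": ["blk_1001", "blk_1002"],
}

# block ID -> DataNodes holding a replica; NOT persisted, rebuilt from
# DataNode block reports at startup.
block_to_datanodes = {
    "blk_1001": ["dn1", "dn2", "dn3"],
    "blk_1002": ["dn2", "dn3", "dn4"],
}

def locate(path):
    """Answer a client open() the way the NameNode would: each block of
    the file together with the DataNodes that currently store it."""
    return [(b, block_to_datanodes[b]) for b in file_to_blocks[path]]
```

Losing `block_to_datanodes` is harmless (it is rebuilt from block reports), while losing `file_to_blocks` loses the file system, which is why only the latter is persisted.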
2.1.2 NameNode file operations
The NameNode handles operations on file metadata, while DataNodes handle read and write requests for file content. The data stream never passes through the NameNode; instead, the NameNode tells the client which DataNode to contact.
2.1.3 NameNode and replicas
The NameNode decides which DataNodes store each block of a file, placing replicas according to a global view of the cluster.
When a file is read, the NameNode tries to let the client read the replica on the nearest DataNode, to reduce bandwidth consumption and read latency.
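Nearest-replica selection can be sketched with a made-up three-level distance model; HDFS's real rack-aware topology distance calculation is more involved:

```python
# Illustrative network distances: same node < same rack < off-rack.
DISTANCE = {"same-node": 0, "same-rack": 2, "off-rack": 4}

def nearest_replica(replicas, location_of):
    """Pick the replica with the smallest network distance to the client."""
    return min(replicas, key=lambda dn: DISTANCE[location_of[dn]])

# dn2 sits on the client's rack, so it wins over the off-rack replicas.
locations = {"dn1": "off-rack", "dn2": "same-rack", "dn3": "off-rack"}
```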
2.1.4 NameNode heartbeat mechanism
The NameNode fully manages block replication and periodically receives heartbeats and block status reports (each listing all blocks on that DataNode).
If a heartbeat arrives, the NameNode considers the DataNode healthy; if no heartbeat is received for 10 minutes, the NameNode considers the DataNode down and prepares to re-replicate the blocks it held. Block status reports, listing all blocks on the DataNode, are sent every hour.
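The liveness rule above can be sketched as a simple timeout check. The 10-minute figure from the text is used here; the exact timeout formula varies across Hadoop versions:

```python
HEARTBEAT_INTERVAL = 3   # seconds between DataNode heartbeats
DEAD_TIMEOUT = 10 * 60   # seconds of silence before a DataNode is declared dead

def is_dead(last_heartbeat, now):
    """NameNode-side liveness check for one DataNode."""
    return (now - last_heartbeat) > DEAD_TIMEOUT

# dn1 sent a heartbeat 2 s ago -> alive; dn2 has been silent for 700 s
# -> dead, so its blocks become candidates for re-replication.
last_seen = {"dn1": 698, "dn2": 0}
now = 700
dead_nodes = [dn for dn, t in last_seen.items() if is_dead(t, now)]
```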
2.1.5 NameNode fault tolerance
HDFS cannot work without the NameNode. If the machine running the NameNode fails, the files in the system are effectively lost, because there is no other way to reconstruct files from the blocks scattered across the DataNodes. The NameNode's fault tolerance is therefore critical, and Hadoop provides two mechanisms.
The first is to back up the file system metadata that is persisted on the local disk. Hadoop can be configured to have the NameNode write its persistent state to multiple file systems. These writes are synchronous and atomic. A common configuration writes the persistent state both to the local disk and to a remotely mounted network file system (NFS).
The second is to run a secondary NameNode (SecondaryNameNode). The SecondaryNameNode cannot act as a NameNode; its main job is to periodically merge the namespace image with the edit log, preventing the edit log from growing too large. The SecondaryNameNode typically runs on a separate physical machine, because the merge takes considerable CPU time and as much memory as the NameNode itself. It keeps a copy of the merged namespace image, which can be used if the NameNode ever fails.
However, the SecondaryNameNode's state always lags behind the primary NameNode, so some data loss is inevitable if the NameNode fails. The usual recovery in that case is to take the NameNode metadata from the remotely mounted NFS described in the first mechanism, copy it to the SecondaryNameNode, and run the SecondaryNameNode as the new primary NameNode.
2.1.6 NameNode physical structure
2.1.7 NameNode file structure
The storage directory of the NameNode.
2.2 DataNode
A cluster may contain thousands of DataNodes. They communicate with the NameNode periodically and accept its instructions. To reduce the NameNode's burden, the NameNode does not permanently store which blocks live on which DataNodes; instead, its mapping table is rebuilt from reports sent when DataNodes start.
A DataNode stores and retrieves data as scheduled by clients or the NameNode, and periodically sends the NameNode the list of blocks it stores. Blocks are kept as files on the local disk of the DataNode's node: one file holds the data itself, and another holds metadata (block length, checksum of the block data, and timestamp). Together these maintain the mapping between block IDs and DataNodes.
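One way to picture how a DataNode assembles a block report is to scan a storage directory for block files and their `.meta` companions. The directory layout and file naming below are simplified illustrations, not HDFS's real on-disk format:

```python
import os
import tempfile

def build_block_report(data_dir):
    """Collect (block id, length) for every block file in data_dir,
    skipping the .meta companions that hold checksums and timestamps."""
    report = []
    for name in sorted(os.listdir(data_dir)):
        if name.startswith("blk_") and not name.endswith(".meta"):
            path = os.path.join(data_dir, name)
            report.append({"id": name, "length": os.path.getsize(path)})
    return report

# Fake a DataNode storage directory with one 10-byte block and its meta file.
storage = tempfile.mkdtemp()
with open(os.path.join(storage, "blk_1001"), "wb") as f:
    f.write(b"x" * 10)
with open(os.path.join(storage, "blk_1001.meta"), "wb") as f:
    f.write(b"m")
```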
2.2.1 DataNode working mechanism
When a DataNode starts, it scans its local disks and reports the blocks it holds to the NameNode, which rebuilds its block-to-DataNode mapping from these reports.
A DataNode registers with the NameNode after startup and then reports all of its block information periodically (every hour). It also:
(1) sends a heartbeat to the NameNode every 3 seconds to stay in touch; the heartbeat response carries commands from the NameNode, such as copying a block to another node or deleting a block;
(2) is considered lost if the NameNode receives no heartbeat from it for 10 minutes, after which its blocks are re-replicated to other DataNodes;
(3) periodically (every three weeks) verifies that the stored checksum of each block still matches the checksum computed when the block was created.
2.2.2 DataNode read and write operations
Every server in the cluster runs a DataNode daemon, which reads and writes HDFS blocks on the local file system. When a client needs to read or write data, the NameNode first tells it which DataNodes to use for the specific operation; the client then communicates directly with the daemons on those DataNode servers to read or write the relevant blocks.
2.3 SecondaryNameNode
The SecondaryNameNode is part of the HDFS architecture, but its name often leads to misunderstanding of its purpose. Its real job is to keep a backup of the NameNode's HDFS metadata and to reduce the time a NameNode restart takes. Some configuration work is needed to run the snn correctly: Hadoop's default configuration runs the snn process on the NameNode machine, but then a failure of that single machine makes recovering the HDFS file system a disaster. It is better to configure the snn process to run on another machine.
In Hadoop, the NameNode persists HDFS metadata and handles clients' interactive operations on HDFS. To keep interaction fast, the HDFS metadata is loaded into the NameNode machine's memory, and the in-memory data is saved to disk for persistence. So that this persistence does not become a bottleneck, Hadoop does not persist a snapshot of the current file system state on every change; instead, the recent operations on HDFS are appended to a file called the EditLog on the NameNode. When the NameNode restarts, it loads the fsimage and then replays the HDFS operations recorded in the EditLog to recover the state from before the restart.
The SecondaryNameNode periodically merges the operations recorded in the EditLog into a checkpoint and then empties the EditLog. On restart, the NameNode loads the latest checkpoint and replays only the operations recorded in the EditLog since then; because the EditLog only records operations since the last checkpoint, it stays relatively small. Without the SecondaryNameNode's periodic merging, every NameNode restart would take a long time; periodic merging shortens restart time and also helps ensure the integrity of the HDFS system.
That is all the SecondaryNameNode does; it does not share the NameNode's load of interactive HDFS operations. But when the NameNode machine fails or the NameNode process breaks, an operator can manually copy the metadata from the snn to restore the HDFS file system.
There are two main reasons to run the snn process on a machine other than the NameNode:
1. Scalability: creating a new HDFS snapshot requires loading all the metadata into memory, so the snn needs as much memory as the NameNode. Since the memory allocated to the NameNode process is effectively a limit on the size of the HDFS file system, a very large distributed file system may leave the NameNode machine's memory fully occupied by the NameNode process.
2. Fault tolerance: when the snn creates a checkpoint, it copies the checkpoint into several copies of the metadata; running this operation on another machine also adds fault tolerance to the distributed file system.
How the SecondaryNameNode works
2.3.1 Steps to merge SecondaryNameNode logs and images
The periodic merge of the edit log and the image is divided into five steps:
1. The SecondaryNameNode notifies the NameNode that the edits file is about to be checkpointed, and the NameNode starts a new edits.new file.
2. The SecondaryNameNode fetches the NameNode's fsimage and edits files over HTTP GET (in the SecondaryNameNode's current directory you can see temp.check-point or previous-checkpoint directories, which hold the image files copied from the NameNode).
3. The SecondaryNameNode merges the two files into a new image file, fsimage.ckpt.
4. The SecondaryNameNode sends fsimage.ckpt back to the NameNode via HTTP POST.
5. The NameNode renames fsimage.ckpt to fsimage and edits.new to edits, then updates fstime; the checkpoint process is complete.
In newer Hadoop versions (from Hadoop 0.21.0), the SecondaryNameNode's two roles are taken over by two new nodes, the checkpoint node and the backup node. SecondaryNameNode checkpointing is controlled by three parameters: fs.checkpoint.period sets the interval, fs.checkpoint.size triggers a merge when the edit log exceeds that size, and dfs.http.address gives the HTTP address, which must be set when the SecondaryNameNode runs on a separate node.
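The two checkpoint triggers named above can be sketched as a simple predicate, assuming the classic defaults of 1 hour for fs.checkpoint.period and 64 MB for fs.checkpoint.size:

```python
CHECKPOINT_PERIOD = 3600             # seconds (fs.checkpoint.period default)
CHECKPOINT_SIZE = 64 * 1024 * 1024   # bytes   (fs.checkpoint.size default)

def should_checkpoint(seconds_since_last, edits_bytes):
    """Checkpoint when either the time interval has elapsed or the
    edit log has grown past the size threshold."""
    return (seconds_since_last >= CHECKPOINT_PERIOD
            or edits_bytes >= CHECKPOINT_SIZE)
```

Either condition alone is enough: a busy cluster checkpoints early because the edit log fills up, while an idle one still checkpoints hourly.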
3. HDFS mechanisms
3.1 Heartbeat mechanism
How it works:
When the master starts, it opens an IPC server. Each slave, on startup, connects to the master and sends it a heartbeat every 3 seconds carrying status information. The master sends instructions to the slave nodes through the return value of this heartbeat.
Role: the NameNode oversees block replication, periodically receiving heartbeats and block status reports (Blockreport) from each DataNode in the cluster. Receiving a heartbeat signal means the DataNode is working properly; a block status report lists all the blocks on that DataNode. A DataNode registers with the NameNode after startup and then reports its full block list periodically (every hour). It sends a heartbeat to the NameNode every 3 seconds, and the reply carries the NameNode's commands to that DataNode, such as copying block data to another machine or deleting a data block. If the NameNode receives no heartbeat from a DataNode for more than 10 minutes, the node is considered unavailable. When a Hadoop cluster starts up, it enters safe mode until 99.9% of blocks are reported, relying on this heartbeat mechanism.
3.2 Load balancing
Hadoop HDFS clusters are prone to unbalanced disk utilization across machines, for example when nodes are added to or removed from the cluster, or when a node's disks approach saturation. When data is unbalanced, Map tasks may be scheduled on machines that do not store the data they need, which consumes network bandwidth and prevents data-local computation.
When HDFS becomes unbalanced, its data load must be rebalanced, that is, the distribution of data across the DataNodes must be adjusted so that data is spread evenly, I/O performance is balanced, and hot spots are avoided. Rebalancing must satisfy the following principles:
(1) balancing must not reduce the number of blocks or lose block replicas;
(2) an administrator can abort the balancing process;
(3) the amount of data moved at a time and the network resources consumed must be controllable;
(4) balancing must not interfere with the normal operation of the NameNode.
The load-balancing process works as follows:
(1) The Rebalancing Server first asks the NameNode for a DataNode data-distribution report, to learn the disk usage of each DataNode.
(2) The Rebalancing Server works out which data needs to move and computes a concrete block-migration plan, choosing the shortest paths within the network.
(3) The block migration starts: the Proxy Source DataNode copies a block that needs to move.
(4) The copied block is written to the target DataNode.
(5) The original data block is deleted.
(6) The target DataNode confirms to the Proxy Source DataNode that the block migration is complete.
(7) The Proxy Source DataNode confirms to the Rebalancing Server that the block migration is complete. The process then repeats until the cluster meets the data-balance criterion.
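Step (1)'s utilization analysis can be sketched as follows, assuming the HDFS balancer's classic 10% threshold (its -threshold option); the node names and utilizations are made up:

```python
def classify(utilization, threshold=0.10):
    """Split DataNodes into migration sources (more than `threshold`
    above the cluster-average disk utilization) and migration targets
    (more than `threshold` below it)."""
    avg = sum(utilization.values()) / len(utilization)
    sources = [n for n, u in utilization.items() if u > avg + threshold]
    targets = [n for n, u in utilization.items() if u < avg - threshold]
    return sources, targets

# Cluster average is 0.60: dn1 (90%) should shed blocks, dn3 (35%)
# should receive them, dn2 (55%) is within the band and left alone.
util = {"dn1": 0.90, "dn2": 0.55, "dn3": 0.35}
```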
4. HDFS read and write process
Before studying the read and write flows, three units used when DFSClient writes to HDFS need to be understood: block, packet, and chunk.
A block is the largest unit. It is the granularity of the data finally stored on a DataNode, set by the dfs.block.size parameter (default 64 MB). Note: this parameter is taken from the client's configuration.
A packet is the middle unit: the granularity at which data flows from the DFSClient to a DataNode, with the dfs.write.packet.size parameter as the reference value (default 64 KB). Note: this is only a reference value, meaning the real transfer size is adjusted around it. The reason is that a packet has a specific structure, and the adjustment makes the packet exactly hold all members of that structure, while also ensuring the current block does not exceed the configured size after being written to the DataNode.
A chunk is the smallest unit: the granularity at which data is checksummed on its way from the DFSClient to a DataNode, set by the io.bytes.per.checksum parameter (default 512 B).
Note: each chunk also carries a 4 B check value, so a chunk written into a packet occupies 516 B. The data-to-checksum ratio is 512:4, i.e. 128:1, so a 128 MB block has a corresponding checksum file of about 1 MB.
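Plugging in the defaults above gives these worked numbers:

```python
BLOCK = 64 * 1024 * 1024            # dfs.block.size default, 64 MB
PACKET = 64 * 1024                  # dfs.write.packet.size default, 64 KB
CHUNK_DATA = 512                    # io.bytes.per.checksum default
CHUNK_CRC = 4                       # check value carried by each chunk

# A 64 KB packet holds 127 whole 516 B chunks (64 * 1024 // 516 == 127).
chunks_per_packet = PACKET // (CHUNK_DATA + CHUNK_CRC)

# 512 B of data per 4 B of checksum is a 128:1 ratio, so a 128 MB block
# has a 1 MB checksum file alongside it.
checksum_bytes = (128 * 1024 * 1024) // CHUNK_DATA * CHUNK_CRC
```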
4.1 data reading process
1. The client calls the open method of the FileSystem instance to get the input stream InputStream corresponding to this file.
2. The input stream calls the NameNode remotely through RPC to obtain the storage locations of the file's data blocks, including all replicas (mainly the addresses of the DataNodes that hold them).
3. After obtaining the input stream, the client calls the read method to read the data. Select the nearest DataNode to establish the connection and read the data.
4. If the client and one of the DataNode are on the same machine (such as mapper and reducer in the MapReduce process), then the data will be read directly from the local.
5. Reach the end of the data block, close the connection to the DataNode, and then re-find the next data block.
6. Continue to perform steps 2-5 until all the data has been read.
7. The client calls close to close the input stream DFSInputStream.
How data integrity is ensured during reads:
Through checksums. Each chunk carries a check value; chunks form packets, and packets form blocks, so checksums can be verified over a block. The HDFS client implements checksum verification of HDFS file contents. When a client creates an HDFS file, it computes the checksum of every chunk of each block and saves the checksums as a hidden file in the same HDFS namespace. When the client later reads the file, it compares the checksums computed over the data it reads against those stored in the hidden file; if they do not match, the client can choose to fetch a replica of the block from another DataNode.
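A minimal sketch of the per-chunk verification described above, using plain zlib.crc32 as a stand-in for Hadoop's CRC32/CRC32C checksums:

```python
import zlib

CHUNK = 512  # io.bytes.per.checksum default

def chunk_checksums(data):
    """One CRC32 per 512-byte chunk, as computed at write time."""
    return [zlib.crc32(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]

def verify(data, stored_checksums):
    """Re-check on read; a mismatch would send the client to another replica."""
    return chunk_checksums(data) == stored_checksums
```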
4.2 data writing process
1. The client uses the Client library provided by HDFS to send an RPC request to the remote NameNode.
2. The NameNode checks whether the file to be created already exists and whether the creator has permission to operate. On success it creates a record for the file; otherwise the client throws an exception.
3. When the client starts writing, it splits the file into packets, manages them internally as a "data queue", and asks the NameNode for blocks, obtaining a list of DataNodes suitable for storing the replicas. The size of this list depends on the replication setting on the NameNode.
4. Packets are written to all replicas through a pipeline: the client streams each packet to the first DataNode, which stores it and forwards it to the next DataNode in the pipeline, and so on until the last DataNode. This way of writing data is pipelined.
5. After the last DataNode stores a packet successfully, an ack packet is returned and passed back through the pipeline to the client, whose library maintains an "ack queue"; when the ack returned by the DataNodes is received, the corresponding packet is removed from the data queue.
6. If a DataNode fails during transmission, the current pipeline is closed, the failed DataNode is removed from it, and the remaining data continues to be transmitted through the pipeline over the remaining DataNodes; meanwhile, the NameNode assigns a new DataNode to restore the configured replica count.
7. After the client finishes writing, it calls close() on the stream to close it.
8. The write succeeds as soon as dfs.replication.min replicas (the minimum number of successful replicas, default 1) have been written; the block is then replicated asynchronously within the cluster until it reaches its target replica count (dfs.replication, default 3). Because the NameNode already knows which blocks make up the file, it only needs to wait for the minimum number of replicas before reporting success.
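The success rule in step 8 boils down to a small predicate; this sketch assumes the stated defaults of 1 and 3:

```python
REPLICATION_MIN = 1   # dfs.replication.min default
REPLICATION = 3       # dfs.replication default

def write_succeeded(acked_replicas):
    """The client's write returns once the minimum replica count is met."""
    return acked_replicas >= REPLICATION_MIN

def pending_async_replicas(acked_replicas):
    """Replicas still to be filled in asynchronously by the cluster."""
    return max(0, REPLICATION - acked_replicas)
```

With the defaults, a single acked replica lets the client return, while the cluster still owes two background copies.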
© 2024 shulou.com SLNews company. All rights reserved.