An Analysis of HDFS in the Hadoop Architecture

2025-03-01 Update, from SLTechnology News & Howtos


Shulou (Shulou.com), 06/01 report

This article explains in detail an analysis of HDFS in the Hadoop architecture. The material is quite practical, so it is shared here as a reference; I hope you gain something from reading it.

HDFS adopts a master/slave structural model: an HDFS cluster consists of one NameNode and several DataNodes (Hadoop 2.2 added support for configurations with multiple NameNodes, a capability some large companies had previously implemented by modifying the Hadoop source code). The NameNode acts as the master server, managing the file system namespace and client access to files. The DataNodes manage the data stored on them. HDFS exposes user data in the form of files.

Internally, a file is divided into several data blocks, which are stored on a set of DataNodes. The NameNode executes file system namespace operations such as opening, closing, and renaming files or directories, and is also responsible for mapping data blocks to specific DataNodes. DataNodes handle read and write requests from file system clients, and create, delete, and replicate data blocks under the unified scheduling of the NameNode. The NameNode is the custodian of all HDFS metadata, and user data never passes through it.
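As a minimal sketch of this division of labor (the block size is the classic HDFS default, but the function is illustrative, not part of any Hadoop API), the NameNode's core bookkeeping starts with knowing how many blocks a file occupies:

```python
# Simplified sketch of one piece of NameNode metadata: how many blocks a
# file of a given size occupies. Illustrative only, not a Hadoop API.
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the classic HDFS default block size

def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of blocks a file of the given size occupies."""
    return max(1, -(-file_size // block_size))  # ceiling division

# A 200 MB file needs 4 blocks: three full 64 MB blocks and one 8 MB tail.
print(split_into_blocks(200 * 1024 * 1024))  # → 4
```

The NameNode then maps each of those blocks to the DataNodes that hold its replicas; the blocks themselves never touch the NameNode.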

Figure: HDFS architecture diagram

Description:

Three roles are involved: NameNode, DataNode, and Client. The NameNode is the administrator, DataNodes are the file stores, and the Client is the application that needs access to the distributed file system.

The NameNode is responsible for receiving user operation requests, maintaining the directory structure of the file system, and managing the mapping between files and blocks and between blocks and DataNodes.

DataNodes are responsible for storing files. Files are divided into blocks and stored on disk; to ensure data safety, each block has multiple replicas.

Data write:

1) The Client initiates a file-write request to the NameNode.

2) The NameNode, based on the file size and the block configuration, returns to the Client the information of the DataNodes it manages.

3) The Client divides the file into multiple blocks and writes them in order to the DataNodes, according to the DataNode addresses.
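The three steps above can be sketched as a toy simulation (node names, the tiny block size, and the round-robin allocation are all made up for readability; the real NameNode's placement policy is rack-aware):

```python
# Toy simulation of the three-step write path, not the real Hadoop client.
BLOCK_SIZE = 4  # tiny block size so the example stays readable

class DataNode:
    def __init__(self, name):
        self.name, self.blocks = name, []
    def write(self, block):
        self.blocks.append(block)

class NameNode:
    def __init__(self, datanodes):
        self.datanodes = datanodes
    def allocate(self, n_blocks):
        # Step 2: return one target DataNode per block (round-robin here).
        return [self.datanodes[i % len(self.datanodes)] for i in range(n_blocks)]

def client_write(data, namenode):
    # Step 1: ask the NameNode; step 3: split the file and write block by block.
    chunks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    targets = namenode.allocate(len(chunks))
    for block, dn in zip(chunks, targets):
        dn.write(block)
    return [(dn.name, block) for block, dn in zip(chunks, targets)]

dns = [DataNode("dn1"), DataNode("dn2")]
layout = client_write(b"hello world!", NameNode(dns))
# 12 bytes in 4-byte blocks -> 3 blocks spread over dn1, dn2, dn1
print(layout)
```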

Data read:

1) The Client initiates a request to the NameNode to read a file.

2) The NameNode returns the information of the DataNodes that store the file's blocks.

3) The Client reads the file data from those DataNodes.
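The read path is the mirror image of the write path. In this sketch the "NameNode" is just a dictionary of block locations and the "DataNodes" are dictionaries of stored blocks; all names are invented for illustration:

```python
# Toy read path: the client asks the "NameNode" (a dict of metadata here)
# where each block lives, then fetches the blocks and reassembles the file.
block_locations = {            # block id -> DataNode holding it (assumed)
    0: "dn1", 1: "dn2", 2: "dn1",
}
datanode_storage = {           # what each "DataNode" actually stores
    "dn1": {0: b"hell", 2: b"rld!"},
    "dn2": {1: b"o wo"},
}

def client_read(locations, storage):
    # Steps 1-2: obtain block locations; step 3: read blocks in order and join.
    return b"".join(storage[locations[b]][b] for b in sorted(locations))

print(client_read(block_locations, datanode_storage))  # → b'hello world!'
```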

Data read policy: following the storage policy described above, when data is read the client can also determine its own rack ID through an API. If a replica of the desired block resides on a DataNode with the same rack ID as the client, that DataNode is preferred and the client establishes a connection with it directly to read the data; otherwise, a DataNode is chosen at random.
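This rack-aware selection can be sketched in a few lines (rack and node names are made up; the real client uses network-topology distance, but the prefer-local-rack logic is the same idea):

```python
import random

# Sketch of rack-aware replica selection: prefer a replica on the client's
# own rack, otherwise pick one at random. Rack/node IDs are illustrative.
def pick_replica(replicas, client_rack, rack_of):
    local = [dn for dn in replicas if rack_of[dn] == client_rack]
    return local[0] if local else random.choice(replicas)

rack_of = {"dn1": "rack-A", "dn2": "rack-B", "dn3": "rack-B"}

# Client on rack-A: dn1 shares its rack, so dn1 is chosen first.
print(pick_replica(["dn1", "dn2", "dn3"], "rack-A", rack_of))  # → dn1
```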

Data replication:

This occurs mainly during data writes and data recovery, and HDFS uses a pipelined replication strategy. When a client writes a file to Hadoop, it first writes the file locally and divides it into blocks, 64 MB by default. For each block, the client makes a request to the NameNode (the directory server), which selects a list of DataNodes and returns it to the client. The client then writes the data to the first DataNode and passes the list along with it. As soon as a DataNode receives 4 KB of data, it writes the data locally and opens a connection to the next DataNode in the list, forwarding the same 4 KB, thereby forming a pipeline. When the file finishes writing, replication completes at almost the same time; this is the advantage of pipelined processing.
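A minimal sketch of the pipeline idea follows (the packet size matches the 4 KB figure above; the sequential inner loop stands in for the store-and-forward hops, and all names are invented):

```python
# Sketch of pipelined replication: each node stores a 4 KB packet locally,
# then immediately forwards it to the next node in the pipeline.
PACKET = 4 * 1024

def pipeline_write(data, pipeline, storage):
    for i in range(0, len(data), PACKET):
        packet = data[i:i + PACKET]
        for dn in pipeline:          # each hop stores, then forwards
            storage[dn] += packet
    return storage

storage = {"dn1": b"", "dn2": b"", "dn3": b""}
data = b"x" * (10 * 1024)            # 10 KB -> three packets (4 + 4 + 2 KB)
pipeline_write(data, ["dn1", "dn2", "dn3"], storage)

# All three replicas finish at (nearly) the same time with identical data.
assert storage["dn1"] == storage["dn2"] == storage["dn3"] == data
```

Because each packet is forwarded as soon as it lands, replica 3 finishes only a few packets behind replica 1, instead of waiting for the whole block to be copied twice in sequence.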

As a distributed file system, HDFS offers several lessons in data management:

Placement of file blocks: each block has three replicas, one on the DataNode specified by the NameNode, one on a DataNode in a different rack, and one on another DataNode in the same rack as the second replica. Simply put, one third of the redundant data sits in one rack and two thirds in another. The purpose of the replication is data safety; this placement balances tolerance of a whole-rack failure against the performance cost of keeping copies far apart.
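The placement rule above can be sketched as follows (node and rack names are made up, and this greedy selection ignores the load and capacity checks a real NameNode performs):

```python
# Sketch of the 3-replica placement described above: first replica on the
# node chosen for the writer, second on a node in a different rack, third
# on a different node in the second replica's rack.
def place_replicas(writer, nodes, rack_of):
    first = writer
    remote = [n for n in nodes if rack_of[n] != rack_of[writer]]
    second = remote[0]
    third = next(n for n in nodes
                 if rack_of[n] == rack_of[second] and n != second)
    return [first, second, third]

rack_of = {"n1": "rack-A", "n2": "rack-A", "n3": "rack-B", "n4": "rack-B"}
placement = place_replicas("n1", ["n1", "n2", "n3", "n4"], rack_of)
print(placement)  # → ['n1', 'n3', 'n4']: one replica in rack-A, two in rack-B
```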

HDFS design features:

1. Large data files: HDFS is well suited to storing terabyte-scale files or large collections of big data files; for files of only a few gigabytes or less, it offers little benefit.

2. Block storage: HDFS divides a complete large file evenly into blocks stored on different machines. The benefit is that when reading a file, different blocks can be fetched from multiple hosts simultaneously, which is far more efficient than reading from a single host.

3. Streaming data access: write once, read many times. This mode differs from traditional files in that it does not support modifying file content in place; once written, a file should not change, and changes can only append content at the end of the file.

4. Cheap hardware: HDFS runs on ordinary PCs, a design that lets companies support a large data cluster with dozens of inexpensive machines.

HDFS assumes that any computer may fail. To prevent a single host failure from making that host's blocks unreadable, it distributes replicas of each block across several other hosts; if one host fails, another replica can quickly be found to serve the file.
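The failover behavior just described amounts to trying replicas in order until a live one is found, as in this sketch (node names and the liveness set are invented; a real client gets liveness indirectly, via DataNode heartbeats to the NameNode):

```python
# Sketch of replica failover: if the preferred DataNode is down, read the
# same block from another host holding a copy.
def read_block(replicas, alive, fetch):
    for dn in replicas:
        if dn in alive:
            return fetch(dn)
    raise IOError("no live replica")

store = {"dn1": b"block-0", "dn2": b"block-0", "dn3": b"block-0"}
alive = {"dn2", "dn3"}               # dn1 has failed

print(read_block(["dn1", "dn2", "dn3"], alive, store.__getitem__))  # → b'block-0'
```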

Key elements of HDFS:

Block: a file is divided into blocks, typically 64 MB each.

NameNode: holds the directory information, file information, and block information of the entire file system, kept on a single dedicated host. If this host fails, the NameNode becomes unavailable; Hadoop 2.* supports an active-standby mode, in which a standby host takes over running the NameNode if the primary one fails.

DataNode: distributed across cheap machines to store the block files.

HDFS and MapReduce (MR) together form the core of the Hadoop distributed system architecture: HDFS implements the distributed file system on the cluster, while MR implements distributed computation and task processing. HDFS provides file storage and I/O support during MR task processing, and MR handles task distribution, tracking, and execution on top of HDFS and collects the results. Together they accomplish the main tasks of a distributed cluster.

That concludes this article on HDFS in the Hadoop architecture. I hope the content above is helpful and lets you learn more; if you found the article useful, please share it so more people can see it.
