Big data technical analysis: an introduction to the HDFS distributed file system

2025-01-18 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/03 Report--

HDFS is a distributed file system originally developed at Yahoo. Its main uses are as follows:

1. Store very large data sets

2. Provide fast read access to that data

The defining feature of the Hadoop framework is that it distributes both data and computation across the node servers of a cluster. By keeping the computing logic close to the data it needs, partitions can be computed in parallel and their results then aggregated.

Basic modules

HDFS: distributed file system (by Yahoo)
MapReduce: distributed computing framework (by Google)
HBase: distributed, non-relational database (by Powerset -> Microsoft)
Pig: Hadoop large-scale data analysis tool (by Yahoo)
Hive: data warehouse tool that maps structured data files to database tables (by Facebook)
ZooKeeper: distributed coordination service (by Yahoo)
Yarn: task scheduling and cluster resource management framework

HDFS stores metadata and user data separately. Metadata is kept on the NameNode, and user data is stored on the DataNodes. Communication between the servers uses the TCP protocol.

Like GFS (Google File System), HDFS replicates the contents of each file to multiple DataNodes for reliability; this also gives computation a better chance of running on a node that holds a local copy of the data.

HDFS architecture

1. NameNode

The NameNode is the centerpiece of HDFS. It holds the namespace tree of the HDFS file system; files and directories are represented in the NameNode as inodes. In HDFS, file contents are split into large blocks (for example, 128 MB, configurable to the user's needs), and each block is independently replicated to multiple DataNodes. The NameNode records, for each block of each file, the physical locations of its replicas on the DataNodes.
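The block-splitting and block-location bookkeeping described above can be sketched as follows. This is an illustrative model only, not Hadoop code; the names `split_into_blocks`, `record_replica`, and `block_map` are invented for the example.

```python
# Illustrative sketch (not Hadoop code): a NameNode-like service splits a
# file into fixed-size blocks and tracks which DataNodes hold each replica.
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the configurable block size

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the (offset, length) of each block of a file."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

# A toy block map: block id -> set of DataNodes holding a replica.
block_map = {}

def record_replica(block_id, datanode):
    block_map.setdefault(block_id, set()).add(datanode)

# A 300 MB file yields three blocks: 128 MB, 128 MB, and a 44 MB remainder.
blocks = split_into_blocks(300 * 1024 * 1024)
```

The last block of a file is usually shorter than the block size, exactly as in the 300 MB example above.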

How an HDFS client reads and writes HDFS:

Read: when reading a file stored in HDFS, the client first asks the NameNode. Once the NameNode returns the DataNode locations of the file's blocks, the client reads the data from the nearest DataNode.

Write: when a client writes a file, it sends a request to the NameNode, and the NameNode returns the DataNode locations to write to (multiple, for example 3 DataNodes). The client then writes the file blocks directly to those DataNodes. Each block is written to, for example, three DataNodes to keep the block safe.

Data is written in pipeline mode: put simply, the first DataNode copies the data it receives to the second DataNode, and the second DataNode copies it to the third.
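The pipeline forwarding just described can be sketched as below. This is a simplified model of the idea, not the real HDFS wire protocol; `pipeline_write` and the `storage` dict are invented for the example.

```python
# Illustrative sketch: pipeline replication. The client hands the bytes only
# to the first DataNode; each node stores them and forwards them to the next
# node in the pipeline (modeled here by the recursive call).

def pipeline_write(data, pipeline, storage):
    """Write `data` through a chain of DataNodes.

    `storage` maps a DataNode name to the bytes it has stored locally.
    """
    if not pipeline:
        return storage
    head, rest = pipeline[0], pipeline[1:]
    storage[head] = data                      # this node persists the bytes...
    return pipeline_write(data, rest, storage)  # ...and forwards downstream

storage = {}
pipeline_write(b"block-contents", ["dn1", "dn2", "dn3"], storage)
```

After the call, all three DataNodes hold identical copies of the block, which is the invariant the pipeline exists to establish.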

Here are a few concepts:

image: the inode data and the block lists of files, kept in RAM.
checkpoint: the image persisted on disk. Note that block replicas change constantly, so replica locations are not part of the checkpoint.
journal: the change log of the image, saved on disk.


2. DataNode

A block replica on a DataNode is represented by two files: the first holds the data itself, and the second holds the block's metadata (including checksums of the data) and the generation stamp.

When a DataNode starts, it connects to the NameNode and performs a handshake to verify the namespace ID and the DataNode's software version. If either does not match the NameNode's, the DataNode shuts itself down automatically. The namespace ID is assigned when the file system instance is initialized; a node with a different namespace ID cannot join the cluster.

After the handshake, the DataNode registers with the NameNode, and the storage ID assigned by the NameNode (used to identify the DataNode) is recorded on the DataNode.

At registration time, a DataNode sends the NameNode a block report listing the block replicas it holds. Block reports are then sent every hour to keep the replica information current. This is how the NameNode knows which DataNodes hold each replica.

DataNodes also send periodic heartbeat messages to the NameNode (for example, every 3 seconds). If the NameNode receives no heartbeat from a DataNode for 10 minutes, it considers that DataNode out of service, and the block replicas on it can no longer be used.

A heartbeat message carries: a. the DataNode's total storage capacity, b. the storage in use, and c. the number of data transfers currently in progress. The NameNode uses these for space allocation and load-balancing decisions.
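The heartbeat bookkeeping above can be sketched as a small monitor. This is an illustrative model, not the NameNode's actual implementation; the class name, the plain-number clock, and the 10-minute timeout constant are assumptions taken from the text.

```python
# Illustrative sketch: a NameNode-like monitor records each DataNode's last
# heartbeat time and statistics, and declares nodes dead after a timeout.
DEAD_TIMEOUT = 10 * 60  # seconds without a heartbeat before a node is dead

class HeartbeatMonitor:
    def __init__(self):
        self.last_seen = {}   # datanode -> time of last heartbeat (seconds)
        self.stats = {}       # datanode -> (capacity, used, transfers)

    def heartbeat(self, node, now, capacity, used, transfers):
        self.last_seen[node] = now
        self.stats[node] = (capacity, used, transfers)

    def dead_nodes(self, now):
        return {n for n, t in self.last_seen.items()
                if now - t > DEAD_TIMEOUT}
```

The `stats` tuple mirrors the three heartbeat fields listed above: total capacity, space in use, and in-flight transfers.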

The NameNode does not contact DataNodes directly; instead it piggybacks commands on its replies to heartbeats. These commands include:

copy a block to another node; delete a local block replica; re-register or shut down the node; send a block report immediately.

3. Image and Journal

For any transaction initiated by an HDFS client, the change is recorded in the journal. The checkpoint file is never modified in place; it is only replaced by a new checkpoint file. If the checkpoint or journal file is lost or corrupted, part or all of the namespace information is lost. To avoid this, HDFS can be configured to store the checkpoint and journal files in multiple different storage directories.

4. CheckpointNode and BackupNode

CheckpointNode periodically combines the current checkpoint and journal to produce a new checkpoint and an empty journal. CheckpointNode tends to run on a separate server from NameNode.

BackupNode is similar to CheckpointNode in that it can generate checkpoint periodically, but in addition, it can keep a copy of image synchronized with NameNode in memory. Active NameNode sends changes to journal to BackupNode.
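The checkpoint-plus-journal merge performed by the CheckpointNode can be sketched as replaying the journal over the on-disk image. This is an illustrative model, not the real fsimage/edits format; the namespace is simplified to a dict of path -> inode data, and the entry shapes are invented.

```python
# Illustrative sketch: a CheckpointNode-like process replays journal entries
# over the old checkpoint, producing a new checkpoint and an empty journal.

def apply_entry(image, entry):
    """Apply one journal entry (a tuple) to the in-memory image."""
    op, path = entry[0], entry[1]
    if op == "create":
        image[path] = entry[2]   # entry carries the inode data
    elif op == "delete":
        image.pop(path, None)
    return image

def make_checkpoint(old_checkpoint, journal):
    """Replay the journal over the old checkpoint."""
    image = dict(old_checkpoint)   # never modify the old checkpoint in place
    for entry in journal:
        apply_entry(image, entry)
    return image, []               # new checkpoint, empty journal
```

Note that the old checkpoint is copied rather than mutated, matching the rule above that a checkpoint file is only ever replaced, never modified.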

File operations and replica distribution

1. Read and write files

HDFS implements a single-writer, multiple-reader model.

An HDFS client that creates a file for writing is granted a lease on it; no other client may write to the file while the lease is held. The writing client renews the lease, and the lease ends when the client closes the file. If the lease's soft limit expires without the client closing the file or renewing the lease, another client can preempt the lease. If the hard limit (1 hour) expires, HDFS revokes the lease and closes the file on the writer's behalf.

Reads are not affected by the lease mechanism; multiple clients can read the same file in parallel.
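The single-writer lease can be sketched as below. This is an illustrative model under assumptions from the text (the 1-hour hard limit); the class name and method signatures are invented, and the soft-limit preemption path is omitted for brevity.

```python
# Illustrative sketch: one client at a time holds the write lease on a file.
# A second writer is refused until the holder releases the lease or the hard
# limit expires.
HARD_LIMIT = 60 * 60  # seconds (the 1-hour hard limit)

class LeaseManager:
    def __init__(self):
        self.leases = {}  # path -> (client, time lease was last renewed)

    def acquire(self, path, client, now):
        holder = self.leases.get(path)
        if holder and holder[0] != client and now - holder[1] <= HARD_LIMIT:
            return False          # another writer still holds the lease
        self.leases[path] = (client, now)
        return True

    def renew(self, path, client, now):
        if self.leases.get(path, (None,))[0] == client:
            self.leases[path] = (client, now)

    def release(self, path, client):
        if self.leases.get(path, (None,))[0] == client:
            del self.leases[path]
```

A stale lease whose holder stopped renewing is simply overwritten once the hard limit passes, which models HDFS revoking the lease of a failed writer.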

2. Block distribution

How the replicas of a block are distributed matters for the reliability and the read/write performance of HDFS. The default policy: when a new block is created, HDFS places the first replica on the node where the writer runs, the second and third replicas on two different nodes in a different rack, and any remaining replicas on other nodes, subject to these constraints: no node holds more than one replica of a block, and no rack holds more than two replicas of a block, as long as the number of replicas is less than twice the number of racks.

In a typical network topology, the nodes of the same rack are connected through one switch, so the network bandwidth between nodes in the same rack tends to be higher than between nodes in different racks.

On the whole:

No node holds more than one replica of any block, and no rack holds more than two replicas of the same block.
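The placement constraints above can be sketched as a simple selection function. This is a simplified model, not Hadoop's actual BlockPlacementPolicy; the function name, the `racks` mapping, and the seeded shuffle are invented for the example.

```python
# Illustrative sketch of the default placement constraints: first replica on
# the writer's node, the next replicas off the writer's rack, at most one
# replica per node and at most two per rack.
import random

def place_replicas(writer, nodes, racks, n_replicas, seed=0):
    """Pick DataNodes for a block's replicas.

    `racks` maps node -> rack name.
    """
    rng = random.Random(seed)
    chosen = [writer]                          # first replica: writer's node
    candidates = [n for n in nodes if n != writer]
    rng.shuffle(candidates)
    for node in candidates:
        if len(chosen) == n_replicas:
            break
        rack_count = sum(1 for c in chosen if racks[c] == racks[node])
        if len(chosen) < 3 and racks[node] == racks[writer]:
            continue                           # 2nd/3rd replica leaves the rack
        if rack_count >= 2:
            continue                           # at most two replicas per rack
        chosen.append(node)
    return chosen
```

With two racks of two nodes each, the writer's rack-mate is skipped and both remaining replicas land on the other rack, matching the summary above.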

3. Replica management

The NameNode ensures that every block has the specified number of replicas. When the NameNode receives a block report from a DataNode, it checks whether each block is under- or over-replicated.

If a block is over-replicated, the NameNode deletes a replica.

If a block is under-replicated, it is placed in a replication priority queue; a block with only one remaining replica has the highest priority. A background thread scans this queue and decides where the new replicas should be created.

The NameNode also ensures that not all replicas of a block sit on the same rack. If they do, it treats the block as under-replicated and replicates it to a different rack. Once the copy completes, the NameNode detects that the block now has more replicas than specified and deletes one, leaving the replicas spread across racks. The spread is thus achieved by replicating first and deleting afterwards.
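The priority ordering of under-replicated blocks can be sketched with a heap. This is an illustrative model, not the NameNode's actual queue; `plan_replication` and its argument shapes are invented for the example.

```python
# Illustrative sketch: under-replicated blocks are ordered by how few
# replicas they have left; a block down to one replica is handled first.
import heapq

def plan_replication(block_replicas, target):
    """Return the blocks needing replication, most urgent first.

    `block_replicas` maps block id -> current replica count.
    """
    heap = []
    for block, count in block_replicas.items():
        if count < target:
            heapq.heappush(heap, (count, block))  # fewest replicas first
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]
```

Blocks already at or above the target replica count are left out entirely; they are the over-replication case handled by deletion instead.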

4. Balancer

The balancer is used to even out disk utilization across the nodes of an HDFS cluster. When a node's disk utilization exceeds the cluster's average utilization by more than a certain threshold, the balancer moves data from that highly utilized DataNode to DataNodes with low utilization. The balancer also minimizes data copying across racks.
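The balancer's selection rule can be sketched as below. This is an illustrative model, not the real Balancer tool; the function name and the default 10% threshold are assumptions for the example.

```python
# Illustrative sketch: nodes whose disk utilization deviates from the
# cluster average by more than a threshold become sources (too full) or
# targets (too empty) for block moves.

def classify_nodes(utilization, threshold=0.10):
    """`utilization` maps node -> used/capacity ratio (0.0 to 1.0)."""
    avg = sum(utilization.values()) / len(utilization)
    sources = [n for n, u in utilization.items() if u > avg + threshold]
    targets = [n for n, u in utilization.items() if u < avg - threshold]
    return avg, sources, targets
```

Nodes within the threshold band around the average are left alone, which keeps the balancer from shuffling data over small imbalances.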

5. Block scanner

Every DataNode runs a block scanner that checks whether its block replicas are corrupt. When corruption is detected, the replica is marked corrupt and the NameNode creates a new replica; the corrupt replica is deleted only after the new one has been created successfully.
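The scanner's corruption check can be sketched as recomputing checksums. This is an illustrative model: `zlib.crc32` stands in for the per-chunk checksums HDFS keeps in a block's metadata file, and `scan_replicas` and its argument shape are invented.

```python
# Illustrative sketch: a block scanner recomputes each replica's checksum
# and compares it with the checksum stored in the block's metadata file.
import zlib

def scan_replicas(replicas):
    """`replicas` maps block id -> (data bytes, stored checksum).

    Returns the ids of replicas whose data no longer matches its checksum.
    """
    corrupt = []
    for block_id, (data, stored) in replicas.items():
        if zlib.crc32(data) != stored:
            corrupt.append(block_id)
    return corrupt
```

A mismatch means either the data or the checksum was damaged on disk; in both cases the replica is reported so a fresh copy can be made from a healthy replica.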

6. Node decommissioning

The cluster administrator controls the decommissioning of DataNodes. While a DataNode is being decommissioned, it is no longer selected as a destination for replication, but it continues to serve readers. Meanwhile, the NameNode re-replicates all of its block replicas to other DataNodes.
