
What are the basic concepts of HDFS

2025-04-05 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/01 Report--

This article shares the basic concepts of HDFS. Most people do not know much about them, so the article is shared here for your reference; I hope you learn a lot from reading it. Let's get started!

1.1 Introduction to HDFS

Google published its GFS paper in October 2003, and HDFS is an open-source clone of GFS. HDFS stands for Hadoop Distributed File System: an easily scalable distributed file system that runs on large numbers of ordinary, inexpensive machines, provides fault-tolerance mechanisms, and offers good-performance file access services to large numbers of users.

The overall Hadoop architecture relies on HDFS for its underlying distributed storage and on MapReduce (MR) for distributed parallel task processing.

HDFS adopts a master/slave structural model: an HDFS cluster consists of one NameNode and several DataNodes (support for multiple NameNodes was implemented in Hadoop 2.2; before that, some large companies achieved the same thing by modifying the Hadoop source code themselves). The NameNode acts as the master server, managing the file system namespace and client access to files. DataNodes manage the data stored on their nodes. HDFS exposes data to users in the form of files.

Internally, a file is split into several blocks, which are stored on a set of DataNodes. The NameNode executes namespace operations on the file system, such as opening, closing, and renaming files or directories, and is also responsible for mapping data blocks to specific DataNodes. DataNodes serve read and write requests from file system clients and, under the unified scheduling of the NameNode, create, delete, and replicate data blocks. The NameNode is the manager of all HDFS metadata; user data never flows through the NameNode.
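As a purely illustrative toy sketch (not Hadoop's actual implementation; every name here is invented for the example), the block-splitting and block-to-DataNode mapping described above might look like this:

```python
# Toy sketch: split a file's bytes into fixed-size blocks and record,
# NameNode-style, which (simulated) DataNode holds each block.
# Everything here is illustrative; real HDFS works differently.

BLOCK_SIZE = 4  # tiny block size for the demo (the HDFS default is 128 MB)
datanodes = ["dn1", "dn2", "dn3"]

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Return the list of fixed-size blocks for a file's contents."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

# "NameNode" metadata: path -> list of (block_id, datanode) entries.
namespace = {}

def put_file(path: str, data: bytes):
    blocks = split_into_blocks(data)
    # Round-robin placement of blocks onto DataNodes (purely illustrative).
    namespace[path] = [
        (f"{path}#blk{i}", datanodes[i % len(datanodes)])
        for i in range(len(blocks))
    ]
    return blocks

blocks = put_file("/dir-a/file.data", b"hello hdfs")
print(len(blocks))                       # 10 bytes in 4-byte blocks -> 3 blocks
print(namespace["/dir-a/file.data"][0])  # first block's id and its DataNode
```

Note how the "NameNode" table holds only metadata (block ids and locations), while the bytes themselves would live on the DataNodes.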

1.2 HDFS design goals

Automatically and quickly detect and handle hardware failures

Streaming access to data

Moving computation is cheaper than moving data

A simple consistency model

Portability across heterogeneous platforms

1.3 Moving computation vs. moving data

When learning big data, you encounter two closely related but very different concepts: moving data and moving computation.

Moving data means transferring the data to be processed to the nodes that hold the processing logic. This is how data processing used to be done, but it is very inefficient. In big data scenarios the volume of data is huge, at least gigabytes and often terabytes, petabytes, or more, while disk and network throughput are comparatively low, so the transfers take far too long to meet our requirements. This is why moving computation appeared.

Moving computation, also called local computation, keeps the data in place on its node and instead ships the processing logic to each data node. Since a program will never be particularly large, it can be transferred quickly to every node where the data is stored and then executed locally, which processes the data with high efficiency. Today's big data processing technology all works this way.
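The contrast above can be sketched in a few lines. This is a toy simulation, not real cluster code: each dictionary entry stands in for a node holding a partition of the data, and "shipping the computation" just means calling a small function against each partition and moving only the tiny per-node results back.

```python
# Toy contrast between "moving data" and "moving computation".
# Each node holds a partition of the data; instead of copying all
# partitions to one place, we ship a small function to every node
# and only move the small per-node results back.

partitions = {
    "node1": [1, 2, 3],
    "node2": [4, 5],
    "node3": [6, 7, 8, 9],
}

def run_locally(node_data, func):
    """Pretend to execute `func` on the node that stores `node_data`."""
    return func(node_data)

# Moving computation: send `sum` to each node, collect the small results.
partial_sums = [run_locally(data, sum) for data in partitions.values()]
total = sum(partial_sums)
print(total)  # 45
```

The function being shipped is a few bytes; the data staying put could be terabytes. That asymmetry is the whole argument for local computation.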

An example of the HDFS model:

The laptops in a classroom can form a cluster, right? Suppose the next class needs to store files on these laptops: each person walks in, finds a laptop, stores a file on it, and leaves. After a while they come back to pick up their files, but nobody remembers which machine they uploaded to, so every machine has to be searched. The complexity of that operation is high, so how can it be improved?

Make a sacrifice: my laptop stores no data at all and instead specializes in recording where everything is stored (the NameNode). The first person comes to me and says they want to store a file, and I tell them to go save it on the first computer; uploading the file takes a few minutes. Then the second person comes in, also needing to save a file, and I tell them to use the second computer. Is there a large amount of interaction between me and the people storing files? Not much, right? My main role is simply to tell each of them where to store. While the first computer is still receiving one upload, the second computer is receiving another at the same time, so the load is spread out: each upload uses independent resources, and they do not compete with one another.

Question: when should this storage record be written down?

Do I record it at the start of the conversation, or after the person finishes transferring the data?

It must be recorded after the data is transferred, for the sake of data consistency.

And after the upload succeeds, who reports success: the uploader, or the computer that received the file?

It must be the computer that sends the success message, because only the computer can confirm that the upload actually succeeded. Only when both the file and the record exist is everything OK.

Simple, isn't it? If you had heard about this a few years earlier, you might have invented HDFS yourself.
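The exchange in the analogy can be written down as a toy protocol. This is purely illustrative; the class and method names (`choose_datanode`, `commit`, and so on) are invented for the example and are not Hadoop's API. The key point it mirrors is the one above: the record is committed only after the DataNode confirms the upload.

```python
# Toy sketch of the write sequence from the analogy: the client asks
# the NameNode where to store, uploads directly to that DataNode, and
# the NameNode records the file only after the DataNode confirms
# success -- for consistency, as described above.

class DataNode:
    def __init__(self, name):
        self.name = name
        self.files = {}

    def upload(self, path, data):
        self.files[path] = data
        return True  # only the DataNode can confirm success

class NameNode:
    def __init__(self, datanodes):
        self.datanodes = datanodes
        self.records = {}  # committed only after a DataNode confirms
        self.next = 0

    def choose_datanode(self):
        dn = self.datanodes[self.next % len(self.datanodes)]
        self.next += 1
        return dn

    def commit(self, path, dn):
        self.records[path] = dn.name

def client_write(nn, path, data):
    dn = nn.choose_datanode()   # 1. ask the NameNode where to store
    ok = dn.upload(path, data)  # 2. upload directly to that DataNode
    if ok:
        nn.commit(path, dn)     # 3. record only after confirmed success

nn = NameNode([DataNode("dn1"), DataNode("dn2")])
client_write(nn, "/a.txt", b"x")
client_write(nn, "/b.txt", b"y")
print(nn.records)  # {'/a.txt': 'dn1', '/b.txt': 'dn2'}
```

Notice that the file bytes go straight from the client to a DataNode; the NameNode only ever handles the small metadata exchange.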

Characteristics of HDFS

Advantages:

High reliability: Hadoop's ability to store and process data can be trusted.

High scalability: Hadoop distributes data and computing tasks across clusters of machines, which can easily be scaled to thousands of nodes.

High efficiency: Hadoop can move data dynamically between nodes and keeps each node dynamically balanced, so processing is very fast.

High fault tolerance: Hadoop automatically keeps multiple copies of data and automatically reassigns failed tasks.

Disadvantages:

Not suitable for low-latency data access.

Unable to store a large number of small files efficiently.

Multi-user writing and arbitrary modification of files are not supported.

1.4 Core design ideas and functions of HDFS

Divide and conquer: large files and large numbers of files are distributed across many servers, which makes it convenient to operate on and analyze massive data in a divide-and-conquer fashion.

It provides data storage services for all kinds of distributed computing frameworks (such as MapReduce, Spark, Tez, ...).

More concretely, HDFS can be described as follows:

First, it is a file system: it stores files and locates them through a unified namespace, the directory tree.

Second, it is distributed: many servers cooperate to implement its functions, and each server in the cluster has its own role.

1.5 Important features

Files in HDFS are physically stored in blocks; the block size can be specified by the configuration parameter dfs.blocksize. The default is 128 MB in Hadoop 2.x and 64 MB in earlier versions.
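A quick worked example of the arithmetic, assuming the Hadoop 2.x default of 128 MB (just a calculation, not HDFS code):

```python
# How many 128 MB blocks does a file of a given size occupy?
import math

BLOCK_SIZE = 128 * 1024 * 1024  # dfs.blocksize default in Hadoop 2.x

def num_blocks(file_size_bytes: int) -> int:
    """A file occupies ceil(size / block_size) blocks."""
    return math.ceil(file_size_bytes / BLOCK_SIZE)

one_gb = 1024 * 1024 * 1024
print(num_blocks(one_gb))      # a 1 GB file -> 8 blocks
print(num_blocks(one_gb + 1))  # one byte more -> 9 blocks (the last one partial)
```

The last block of a file is usually smaller than the configured block size; it only occupies as much disk as it actually contains.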

The HDFS file system provides a unified abstract directory tree to clients; a client accesses a file through a path, such as hdfs://namenode:port/dir-a/dir-b/dir-c/file.data.

The management of the directory structure and file block information (metadata) is handled by the NameNode: the NameNode is the master node of the HDFS cluster, responsible for maintaining the directory tree of the entire HDFS file system and, for each path (file), its block information (the block ids and the DataNode servers they reside on).

The storage of each file block is managed by the DataNodes: a DataNode is a slave node of the HDFS cluster, and each block can have multiple replicas stored on multiple DataNodes (the number of replicas can be set by the parameter dfs.replication).
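A toy illustration of choosing replica targets. Real HDFS placement is rack-aware and far more sophisticated; this sketch, with invented names, only shows the basic idea of picking dfs.replication distinct DataNodes for one block:

```python
# Toy replica placement: pick `replication` distinct DataNodes for a
# block. Real HDFS placement is rack-aware; this just round-robins
# from a position derived from the block id.

def place_replicas(block_id, datanodes, replication=3):
    """Pick `replication` distinct DataNodes for one block."""
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    start = hash(block_id) % len(datanodes)
    return [datanodes[(start + i) % len(datanodes)] for i in range(replication)]

dns = ["dn1", "dn2", "dn3", "dn4"]
replicas = place_replicas("blk_0001", dns, replication=3)
print(len(replicas), len(set(replicas)))  # 3 distinct DataNodes
```

The point of the distinctness check is fault tolerance: if all three replicas landed on one machine, losing that machine would lose the block.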

Composition of HDFS architecture

NameNode (NN)

Memory-based storage: the metadata lives in memory only, with no swapping to disk.

The main purpose of this is speed, but the common problem with in-memory storage is volatility: once power is lost, nothing in memory survives. So the metadata also needs to be persisted, that is, stored on disk.

NameNode persistence

The NameNode's metadata is loaded into memory after startup.

There are two ways it is stored to disk:

The first is to write the metadata to disk as a point-in-time "snapshot": the metadata is stored in a disk file named "fsimage". Block location information is not saved in fsimage; on recovery, the NameNode has to wait for each DataNode to re-report the block locations of its replicas (they are reported by the DataNodes).

The second is to generate an edit log (edits): a file that records the operation log of every metadata change.
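The snapshot-plus-log idea can be sketched as follows. This is a toy model, not the real fsimage/edits on-disk format: the current metadata state is always recoverable as the snapshot plus a replay of the edit log.

```python
# Toy sketch of the two persistence mechanisms: a point-in-time
# "fsimage" snapshot plus an "edits" log, replayed on restart to
# reconstruct the latest metadata state.

fsimage = {"/a": "file", "/b": "file"}  # snapshot taken at some point
edits = [("create", "/c"), ("delete", "/a"), ("create", "/d")]

def recover(snapshot, log):
    """Rebuild the current metadata = snapshot + replayed edit log."""
    state = dict(snapshot)
    for op, path in log:
        if op == "create":
            state[path] = "file"
        elif op == "delete":
            state.pop(path, None)
    return state

state = recover(fsimage, edits)
print(sorted(state))  # ['/b', '/c', '/d']
```

Appending to a log is cheap on every operation, while writing a full snapshot is expensive; that trade-off is exactly why both mechanisms exist, and why the edit log must eventually be merged back into a new snapshot (see the SecondaryNameNode section below).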

Main functions of NameNode:

It accepts read and write requests from clients and collects the block list information reported by the DataNodes.

The metadata saved by the NameNode includes: file ownership and permissions, file size, timestamps, the block list (with each block's offset), and block location information.

DataNode (DN)

The DataNode is where blocks are actually stored. A DataNode's local disk stores each block as a file, and alongside it a metadata file for that block.

The block metadata mainly stores an MD5 value used for validation.

When HDFS starts, each DataNode reports its block information to the NameNode.

A DataNode keeps in touch with the NameNode by sending a heartbeat every 3 seconds. If the NameNode receives no heartbeat from a DataNode for 10 minutes, it considers that DataNode lost and copies the blocks it held to other DataNodes.
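The heartbeat timeout logic can be sketched with a simulated clock (illustrative only; in real HDFS both intervals are configurable, and no actual sleeping is needed for the demo):

```python
# Toy heartbeat monitor: DataNodes report every 3 seconds; a node
# silent for more than 10 minutes is declared dead. Time is simulated
# by passing `now` explicitly instead of sleeping.

HEARTBEAT_INTERVAL = 3  # seconds between reports
DEAD_TIMEOUT = 10 * 60  # 10 minutes of silence -> considered lost

last_heartbeat = {}

def heartbeat(node, now):
    """Record that `node` reported in at simulated time `now`."""
    last_heartbeat[node] = now

def dead_nodes(now):
    """Nodes whose last report is older than the timeout."""
    return sorted(n for n, t in last_heartbeat.items()
                  if now - t > DEAD_TIMEOUT)

heartbeat("dn1", now=0)
heartbeat("dn2", now=0)
heartbeat("dn1", now=598)   # dn1 keeps reporting; dn2 goes silent
print(dead_nodes(now=601))  # ['dn2'] -- silent for more than 600 s
```

Once a node lands in the dead list, the real NameNode would schedule re-replication of that node's blocks onto healthy DataNodes, as described above.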

SecondaryNameNode (SNN)

At first glance you might think the SecondaryNameNode is a backup of the NameNode. In fact that is not its main role, although it can serve as a backup too.

To understand the role of the SecondaryNameNode, we first have to look at the HDFS startup process.

We mentioned two files above: fsimage and edits. fsimage is a snapshot of the current HDFS metadata, while edits keeps a log of the operations performed on HDFS.

Suppose a cluster has been running for 10 years without any problem. The fsimage is a snapshot from ten years ago; to avoid hurting performance, it was written only that once, while edits has been recording every operation ever since.

When HDFS starts, it derives the latest state of the system from fsimage plus the edit log and generates a new fsimage file. This kind of startup takes a long time, and when the edit log file is very large, the merge consumes a great deal of extra time. This is where the SecondaryNameNode comes in: it periodically merges fsimage and edits into a new fsimage on behalf of the NameNode, so the edit log stays small and startup stays fast.

That covers all the basic concepts of HDFS in this article. Thank you for reading! I hope the content shared here has given you a solid understanding and helps you in your further study.

