
Example Analysis of HDFS in Hadoop


This article presents an example analysis of HDFS in Hadoop that is meant to be easy to understand and clear. I hope it can help resolve your doubts; let the editor lead you through this example analysis of HDFS in Hadoop.

Hadoop is a distributed storage and computing platform suitable for big data processing. I personally feel that calling it a platform is very apt, because products such as Hive and HBase all rely on the two cores of Hadoop: HDFS and MapReduce. HDFS and MapReduce are the fundamental cores of the Hadoop platform; HDFS is responsible for the distributed storage of big data, and MapReduce is the computing framework for processing it. Let's start by recording HDFS; MapReduce will be recorded later.

As for HDFS, I personally see it as a file management system built on top of the operating system. Unlike the operating system's own file system, it can manage files in a distributed way automatically, and all of this is transparent to the user. For example, when we put a file into HDFS, it creates replicas of that file according to our settings. That is only its distributed storage; to be better suited to big data processing, HDFS also has mechanisms of its own.
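A minimal sketch of what "our settings" means in practice (the path and the replication factors here are example values):

echo "hello hdfs" > demo.txt
hdfs dfs -D dfs.replication=2 -put demo.txt /user/hadoop/demo.txt   # write the file with 2 replicas
hdfs dfs -setrep 3 /user/hadoop/demo.txt   # raise the replication factor to 3 afterwards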

1. Safe mode

When the cluster starts, it enters safe mode by default, and by default it exits automatically about 30 seconds after enough blocks have been reported.

In safe mode, you can only read, not write.

hdfs dfsadmin -safemode get | enter | leave

If a write operation is attempted immediately after the cluster has just started, the system will report an error.

Safe mode gives the system time to initialize.
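For example, safe mode can be checked and toggled from the command line (standard HDFS CLI usage; the wait option blocks until safe mode ends, which is handy in startup scripts):

hdfs dfsadmin -safemode get     # prints "Safe mode is ON" or "Safe mode is OFF"
hdfs dfsadmin -safemode enter   # force the NameNode into safe mode
hdfs dfsadmin -safemode leave   # leave safe mode manually
hdfs dfsadmin -safemode wait    # block until safe mode is exited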

2. Quotas

Quotas are usually used when there are multiple users: a directory is set up for each user, and the metadata (size and number of entries) of that directory is limited. setQuota limits the number of files and folders in a directory.

clrQuota cancels the name quota.

setSpaceQuota limits the amount of space a directory may occupy.

clrSpaceQuota cancels the space quota.

hdfs dfs -count -q /lisi shows the values of the configured quotas.
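Putting the quota commands together (the /lisi directory from above and the limits are example values; note that the space quota is charged against all replicas):

hdfs dfsadmin -setQuota 10 /lisi        # at most 10 names (files + directories) under /lisi
hdfs dfsadmin -setSpaceQuota 1g /lisi   # at most 1 GB of raw (replicated) space
hdfs dfs -count -q /lisi                # show both quotas and the remaining amounts
hdfs dfsadmin -clrQuota /lisi           # remove the name quota
hdfs dfsadmin -clrSpaceQuota /lisi      # remove the space quota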

3. The architecture of HDFS

A distributed file system means that data is stored on many nodes.

The NameNode manages the metadata; the DataNodes manage the data itself.
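This division of labor is visible on a running cluster (standard command; it prints overall capacity followed by one section per live DataNode):

hdfs dfsadmin -report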

4. The architecture of NameNode

The NameNode is the node that manages the metadata: the metadata of files and folders, such as permissions, owner, group, size, last access time, last modification time, name, and quota.

When a client accesses data, it first deals with the NameNode. The metadata information lives in the fsimage file; fsimage is a snapshot of the state of HDFS. When the cluster starts, the information in fsimage is loaded into memory.

What edits saves is the operation log; edits is the record of HDFS transactions.
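Both files can be inspected offline with the standard viewer tools; the file names below are assumptions, taken from the current/ directory of a NameNode storage directory:

hdfs oiv -p XML -i fsimage_0000000000000000042 -o fsimage.xml                       # offline image viewer
hdfs oev -p XML -i edits_0000000000000000001-0000000000000000042 -o edits.xml       # offline edits viewer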

In hdfs-site.xml, the value of dfs.namenode.name.dir is a comma-separated list of directories. The fsimage content is stored in all of these directories at the same time, for data safety.

In hdfs-site.xml, the default value of dfs.namenode.edits.dir is ${dfs.namenode.name.dir}, so by default the edits log sits alongside fsimage.
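Both values can be read back from a running cluster; the example value in the comment is an assumption:

hdfs getconf -confKey dfs.namenode.name.dir
# e.g. /data1/hadoop/dfs/name,/data2/hadoop/dfs/name (a comma-separated list; fsimage is written to every directory in it)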

If the whole machine goes down, the scheme above no longer helps; that is the problem the HA (high availability) feature of the NameNode solves.

5. The architecture of DataNode

DataNodes implement the actual storage of the data. There are many such nodes, and each of them is a DataNode.

When a DataNode stores data, it stores it in the form of blocks.

A block is the basic unit in which a DataNode stores data, similar to the parcels of a courier company.

To save space, a courier usually specifies only the maximum size of a parcel, not a standard size, and HDFS blocks follow the same principle: the default maximum size of a block is 128 MB, and there is no minimum size.

When 12 KB of data is stored in a block, the block takes up only 12 KB of physical disk space.
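This is easy to verify; the path is an assumption, and %o / %r are the block-size and replication fields of the stat command:

dd if=/dev/zero of=small.bin bs=1k count=12   # a 12 KB local file
hdfs dfs -put small.bin /tmp/small.bin
hdfs dfs -stat "%o %r" /tmp/small.bin         # logical block size (134217728) and replication factor
hdfs dfs -du /tmp/small.bin                   # the stored length is 12288 bytes, not 128 MB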

When I looked at the contents of fsimage earlier, I found that every file has corresponding blocks. While the cluster is running, the contents of fsimage are held in memory, and every file generates at least one block. So what is the impact on the NameNode's memory when there are a great many small files? More memory is consumed, which creates a lot of memory pressure. This is why HDFS is not suitable for storing large numbers of small files.

When the namenode -format command is executed, a namespaceID value is generated in the VERSION file of the NameNode, and the DataNodes carry the same value in their own VERSION files; the two values must be the same. If formatting is performed again, the NameNode generates a new namespaceID, but the DataNodes do not produce a new value, so the two no longer match.
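The value can be compared directly on disk; the storage directories below are assumptions (use whatever dfs.namenode.name.dir and dfs.datanode.data.dir point to):

grep namespaceID /data/hadoop/dfs/name/current/VERSION   # on the NameNode
grep namespaceID /data/hadoop/dfs/data/current/VERSION   # on a DataNode; the two values must match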

6. Read and write data

An everyday analogy for reading and writing: when I go to a customer's site to solve a problem, I (the client) first ask the project manager (the NameNode) for the customer's address and other information; the project manager tells me (the block locations, etc.); only then can I go to the customer's site (the DataNodes). I give the customer the solution, the customer confirms that it really solves the problem, and I report back to the project manager.

Reading and writing in HDFS are basically the same as in an ordinary file system, except that there is a registration step in the middle. It is like the difference between being single and being married: when single we can spend however we want, but once married we first need to clear it with our wives.

The NameNode maintains two tables (this comes from the NameNode source code). One is the mapping from files to blocks (the block IDs, which you can see in the fsimage file on the local disk). The other is the mapping from blocks to DataNodes (the location of each block on the disks of the DataNodes); it is regenerated every time HDFS starts, rebuilt from the block reports that the DataNodes send to the NameNode while it sits in safe mode during startup. In other words, the former filename-to-block-sequence mapping is static (it only changes when the file itself changes), while the latter block-to-node mapping is dynamic (in a distributed system node failures are normal, so this mapping must be kept up to date).
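Both mappings can be observed with fsck (the path is an assumption):

hdfs fsck /tmp/small.bin -files -blocks -locations
# -files lists the file, -blocks its blocks (the file-to-block mapping from fsimage),
# and -locations the DataNodes holding each block (the dynamic block-to-node mapping)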

The slaves file only makes the DataNodes start together when the cluster is started. Even if the slaves file is wrong and a DataNode is not started along with the NameNode, that DataNode can still be started manually, because the namespaceID in its VERSION file is the same.
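For example, on the DataNode machine itself (hadoop-daemon.sh is the Hadoop 2.x convention; on Hadoop 3 the equivalent is "hdfs --daemon start datanode"):

$HADOOP_HOME/sbin/hadoop-daemon.sh start datanode   # start this DataNode by hand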

The above is all the content of the article "Example Analysis of HDFS in Hadoop". Thank you for reading! I believe you now have a certain understanding, and I hope the content shared here helps you.
