

The basic principles of HDFS and how to access data


In this article, the editor walks through the basic principles of HDFS and how to access data in it. The content is analyzed from a professional point of view; I hope you get something out of it.

I. A brief introduction to HDFS

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on general-purpose (commodity) hardware. There are two types of nodes in the HDFS architecture: the NameNode, also known as the "metadata node", and the DataNode, also known as the "data node". They play the Master and Worker roles respectively. The general design idea is divide and conquer: large files, and large numbers of files, are distributed across many independent servers so that massive data sets can be operated on and analyzed in a divide-and-conquer way.

HDFS is a master/slave (Master/Slave) architecture. From the end user's point of view, it behaves like a traditional file system: files can be created, read, updated, and deleted (CRUD) through directory paths. Because the storage is distributed, however, an HDFS cluster has one NameNode and a number of DataNodes. The NameNode manages the file system's metadata and the DataNodes store the actual data. A client accesses the file system by interacting with both: it contacts the NameNode for a file's metadata, while the real file I/O goes directly to the DataNodes.
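
To make the client / NameNode / DataNode roles concrete, here is a minimal Java sketch (not from the original text) that lists the HDFS root directory, assuming the hdfs://ubuntu2:9000 address and admin1 user that appear later in this article. Listing a directory only touches NameNode metadata; actual file reads and writes go to the DataNodes.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListRoot {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The client talks to the NameNode for metadata; block reads/writes go to DataNodes.
        FileSystem fs = FileSystem.get(new URI("hdfs://ubuntu2:9000"), conf, "admin1");
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            // Everything printed here comes from NameNode metadata only.
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }
        fs.close();
    }
}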

HDFS is designed for "write once, read many" workloads. It is not suitable for real-time interactive use, nor for storing large numbers of small files (if you do have to store many small files, see the solutions at the end of this article).

II. How HDFS works

2.1 Basic principles

1. HDFS is a distributed file system: the files it manages are split into blocks and stored on a number of datanode servers.

2. HDFS provides a unified directory tree for locating files; a client only needs to give a path in that tree, regardless of where the file is physically stored.

3. Each block of each file can keep several replicas in the cluster (3 by default); the value of dfs.replication in hdfs-site.xml sets the number of replicas.

4. HDFS has a key service process, the namenode, which maintains the HDFS directory tree and the mapping (metadata) between that directory structure and the real storage locations of file blocks. The datanode service process is responsible for receiving and managing the file blocks themselves. The default block size is 128 MB, configurable via dfs.blocksize (older versions of hadoop defaulted to 64 MB). A client-side sketch of these two settings follows this list.
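
As a small illustrative sketch (an assumption-laden example, not part of the original text), the same two settings can also be overridden per client through the Java Configuration object instead of cluster-wide in hdfs-site.xml; fs.defaultFS here uses the hdfs://ubuntu2:9000 address from later sections.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class ClientSideHdfsSettings {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://ubuntu2:9000");   // NameNode address (assumed)
        // Client-side overrides of the hdfs-site.xml defaults; they only affect files
        // written through this client.
        conf.set("dfs.replication", "2");        // keep 2 replicas instead of the default 3
        conf.set("dfs.blocksize", "134217728");  // 128 MB block size, in bytes
        FileSystem fs = FileSystem.get(conf);
        System.out.println("replication = " + conf.get("dfs.replication"));
        System.out.println("blocksize   = " + conf.get("dfs.blocksize"));
        fs.close();
    }
}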

2.2 Common operations

From the Hadoop installation directory, you can list the root directory of HDFS and perform other common operations with the following commands:

bin/hdfs dfs -ls /                 # list the root directory of hdfs
bin/hdfs dfs -chmod 600 /test.txt  # set file permissions
bin/hdfs dfs -df -h /              # view disk space
bin/hdfs dfs -mkdir /aaa           # create a new folder
bin/hdfs dfs -tail <file>          # view the tail of a file

On a DataNode, the blocks themselves are stored under the tmp folder of the Hadoop installation, for example: hadoop-2.5.2/tmp/hadoop/dfs/data/current/BP-33587274-127.0.0.1-1465370300743/current/finalized

In that directory you can see the blocks stored by HDFS. If a file was split into two blocks, the blocks can be concatenated back into the original file and then verified by unpacking it:

touch hadoop.tar.gz
cat blk_1073741825 >> hadoop.tar.gz
cat blk_1073741826 >> hadoop.tar.gz
tar -xvzf hadoop.tar.gz

Similarly, from the bin directory of the Hadoop installation you can access files directly with the following commands:

Save the file:

./hdfs dfs -put /home/admin1/Desktop/test.txt hdfs://localhost:9000/

Fetch the file:

./hdfs dfs -get hdfs://localhost:9000/test.txt

Permission control of hdfs:

Operation commands of the client

bin/hdfs dfs -chown aa:bgroup /test.txt

This changes the owner of /test.txt to user aa and group bgroup.

(Neither this user nor this group actually exists on the system, yet the command succeeds, so HDFS itself does not enforce ownership very strictly by default.)
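
The same ownership and permission changes can be made from Java. A minimal sketch, assuming the hdfs://ubuntu2:9000 address and admin1 user used later in this article; note that with permission checking enabled, setOwner normally requires superuser privileges.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class ChangeOwnership {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new URI("hdfs://ubuntu2:9000"), new Configuration(), "admin1");
        Path file = new Path("/test.txt");
        fs.setOwner(file, "aa", "bgroup");                       // same effect as: hdfs dfs -chown aa:bgroup /test.txt
        fs.setPermission(file, new FsPermission((short) 0600));  // same effect as: hdfs dfs -chmod 600 /test.txt
        fs.close();
    }
}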

2.3 shell operation of HDFS

-appendToFile  # append a file to the end of an existing file

--> hadoop fs -appendToFile ./hello.txt hdfs://hadoop-server01:9000/hello.txt

It can be abbreviated as:

--> hadoop fs -appendToFile ./hello.txt /hello.txt

-cat  # display file contents

--> hadoop fs -cat /hello.txt

-chgrp

-chmod

-chown

The three commands above behave the same as in Linux.

--> hadoop fs -chmod 600 /hello.txt

-copyFromLocal  # copy files from the local file system to an hdfs path

--> hadoop fs -copyFromLocal ./jdk.tar.gz /aaa/

-copyToLocal  # copy from hdfs to local

--> hadoop fs -copyToLocal /aaa/jdk.tar.gz

-count  # count the file nodes under a specified directory

--> hadoop fs -count /aaa/

-cp  # copy from one hdfs path to another hdfs path

--> hadoop fs -cp /aaa/jdk.tar.gz /bbb/jdk.tar.gz.2

-createSnapshot

-deleteSnapshot

-renameSnapshot

The three commands above manage snapshots of HDFS directories.

--> hadoop fs -createSnapshot /

-df  # show free space information of the file system

-du  # show the space used by files and directories

--> hadoop fs -df -h /

--> hadoop fs -du -s -h /aaa/*

-get  # equivalent to copyToLocal, i.e. download a file from hdfs to local

-getmerge  # merge and download multiple files

--> for example, if the hdfs directory /aaa/ contains multiple files: log.1, log.2, log.3, ...

--> hadoop fs -getmerge /aaa/log.* ./log.sum

-help  # print the help for a command

-ls  # display directory information

--> hadoop fs -ls hdfs://hadoop-server01:9000/

In all of these commands, the full hdfs:// URI can be abbreviated to just the path.

--> hadoop fs -ls /    # equivalent to the previous command

-mkdir # create a directory on hdfs

--> hadoop fs -mkdir -p /aaa/bbb/cc/dd

-moveFromLocal  # cut and paste from local to hdfs

-moveToLocal # cut and paste from hdfs to local

-mv # move files in the hdfs directory

-put # is equivalent to copyFromLocal

-rm # Delete files or folders

--> hadoop fs -rm -r /aaa/bbb/

-rmdir # Delete empty directory

-setrep # sets the number of copies of files in hdfs

--> hadoop fs -setrep 3 /aaa/jdk.tar.gz

-stat # displays meta information for a file or folder

-tail # shows the end of a file

-text # prints the contents of a file in character form

III. HDFS data writing process

3.1 Basic principles

Writing data to HDFS is a fairly involved process; here is a brief outline of the steps.

A common question is: what happens if one of the DataNodes in the pipeline suddenly fails during a write? Steps 1-5 below describe how such a failure is handled; steps 6-13 then walk through the normal write flow.

1. If a datanode fails during transmission, the current pipeline is closed, the failed datanode is removed from it, and the rest of the block continues to be transmitted through the pipeline to the remaining datanodes. The namenode then allocates a new datanode so that the number of replicas configured by dfs.replication is maintained.

2. The pipeline is closed and the packets waiting in the ack queue are put back at the head of the data queue.

3. The current block on the healthy datanodes is given a new identity by the metadata node, so that when the failed datanode comes back up it can detect that its copy of the block is outdated and delete it.

4. The failed data node is removed from the pipeline, and the remaining data of the block is written to the other data nodes in the pipeline.

5. The metadata node notices that the block is under-replicated and will create another replica later.

6. The client calls create() to create the file.

7. DistributedFileSystem makes an RPC call to the metadata node to create a new file in the file system's namespace.

8. The metadata node first checks that the file does not already exist and that the client has permission to create it, and then creates the new file.

9. DistributedFileSystem returns a DFSOutputStream, which the client uses to write data.

10. The client begins writing. DFSOutputStream splits the data into packets and writes them to the data queue.

11. The data queue is read by the Data Streamer, which asks the metadata node to allocate data nodes to store the block's replicas (3 by default). The allocated data nodes form a pipeline.

12. The Data Streamer writes the packets to the first data node in the pipeline; the first data node forwards them to the second, and the second forwards them to the third.

13. DFSOutputStream also keeps an ack queue of the packets it has sent, waiting for the data nodes in the pipeline to confirm that the data has been written successfully.

3.2 coding practice

The following code uploads /home/admin1/hadoop/eclipse.tar.gz to HDFS; the HDFS address is given in the URI.

@Test
public void testUpload() throws IOException, InterruptedException, URISyntaxException {
    Configuration conf = new Configuration();
    // get a client of the hdfs system
    FileSystem fs = FileSystem.get(new URI("hdfs://ubuntu2:9000"), conf, "admin1");
    fs.copyFromLocalFile(new Path("/home/admin1/hadoop/eclipse.tar.gz"), new Path("/"));
    fs.close();
}
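
copyFromLocalFile hides the pipeline described in section 3.1. The following is a minimal sketch (with a hypothetical /stream-demo.txt path) of writing through the stream API directly, assuming the same hdfs://ubuntu2:9000 address and admin1 user; the stream returned by create() is the DFSOutputStream that splits data into packets and sends them down the DataNode pipeline.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StreamWrite {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new URI("hdfs://ubuntu2:9000"), new Configuration(), "admin1");
        // create() goes to the NameNode first; the stream then writes to the DataNode pipeline.
        FSDataOutputStream out = fs.create(new Path("/stream-demo.txt"));
        out.writeBytes("hello hdfs\n");
        out.close();   // waits for the pipeline acks, then the NameNode finalizes the file
        fs.close();
    }
}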

IV. HDFS data reading process

4.1 Basic principles

The process of reading is much simpler than that of writing.

4.2 coding practice

Download the file:

public static void main(String[] args) throws IOException {
    // download from hdfs
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://ubuntu2:9000/");
    conf.set("dfs.blocksize", "64m");  // 64 MB; only relevant when writing files
    // get a client of the hdfs system
    FileSystem fs = FileSystem.get(conf);
    fs.copyToLocalFile(new Path("hdfs://ubuntu2:9000/test.txt"), new Path("/home/admin1/Downloads/test1.txt"));
    fs.close();
}
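
Correspondingly, a file can be read as a stream instead of being copied to a local path. A minimal sketch under the same assumptions (hdfs://ubuntu2:9000, user admin1): open() asks the NameNode for block locations, and the returned stream then reads the bytes directly from the DataNodes.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class StreamRead {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new URI("hdfs://ubuntu2:9000"), new Configuration(), "admin1");
        FSDataInputStream in = fs.open(new Path("/test.txt"));  // block locations come from the NameNode
        IOUtils.copyBytes(in, System.out, 4096, false);         // bytes are read from the DataNodes
        in.close();
        fs.close();
    }
}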

Delete the file:

@Test
public void remove() throws IOException, InterruptedException, URISyntaxException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(new URI("hdfs://ubuntu2:9000"), conf, "admin1");
    fs.delete(new Path("/test.txt"), false);  // false: do not delete recursively
    fs.close();
}

V. Working mechanism of NameNode

Metadata is kept in memory and must be backed up. Every modification to the metadata is recorded in an edit log on disk (edits_inprogress), and the in-memory state is dumped periodically. If only the in-memory copy existed, recovery after a power failure would be difficult.

So a SecondaryNameNode periodically downloads the edit logs and the image file, merges them (edits + fsimage), and generates a new namenode image file (fsimage.ckpt).

Finally, it uploads the new image back to the namenode, which renames it to fsimage.

Metadata management mechanism:

1. Metadata can be stored in three forms: memory, edits log, and fsimage.

2. The most complete and up-to-date copy of the metadata is the one in memory.

VI. RPC programming

Programming with RPC is one of the most powerful and efficient ways to communicate reliably between client and server entities. It provides the foundation for almost all applications running in a distributed computing environment. Important entities of any RPC client-server program include IDL files (interface definition files), client stub, server stub, and header files shared by client and server programs. The client and server stub communicate using the RPC runtime library. The RPC runtime library provides a set of standard runtime routines to support RPC applications. In a normal application, the called procedure runs in the same address space and returns the result to the calling procedure. In a distributed environment, the client and the server run on different machines, and the client invokes the process running on the server and sends the results back to the client. This is called remote procedure call (RPC) and is the foundation of RPC programming.

The following example is used to simulate the process of user login.

In Eclipse on Linux, create a new Java project and add the following Java files:

RPCService.java

package hdfsTest;

public interface RPCService {
    public static final long versionID = 100L;  // version number of the communication interface
    public String userLogin(String username, String password);  // user name and password
}

RPCServiceImpl.java

package hdfsTest;

public class RPCServiceImpl implements RPCService {
    @Override
    public String userLogin(String username, String password) {
        return username + " logged in successfully!";
    }
}

RPCController.java

package hdfsTest;

import java.io.IOException;
import java.net.InetSocketAddress;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.ipc.RPC;

public class RPCController {
    public static void main(String[] args) throws IOException {
        // 100L is the version number defined in RPCService; "ubuntu2" is the server hostname and 10000 the communication port
        RPCService serviceImpl = RPC.getProxy(RPCService.class, 100L,
                new InetSocketAddress("ubuntu2", 10000), new Configuration());
        String result = serviceImpl.userLogin("aaa", "aaa");  // user name and password
        System.out.println(result);
    }
}

ServerRunner.java

package hdfsTest;

import java.io.IOException;

import org.apache.hadoop.HadoopIllegalArgumentException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.ipc.RPC;
import org.apache.hadoop.ipc.RPC.Builder;
import org.apache.hadoop.ipc.Server;

public class ServerRunner {
    public static void main(String[] args) throws HadoopIllegalArgumentException, IOException {
        Builder builder = new RPC.Builder(new Configuration());
        builder.setBindAddress("ubuntu2");         // hostname the RPC server binds to
        builder.setPort(10000);                    // communication port number
        builder.setProtocol(RPCService.class);     // the protocol (interface) being served
        builder.setInstance(new RPCServiceImpl()); // the implementation instance
        Server server = builder.build();
        server.start();                            // start the service
    }
}

Running result: start ServerRunner first, then run RPCController; the client prints the string returned by the server, "aaa logged in successfully!".

VII. Frequently asked questions

7.1 Can HDFS be used as a network disk?

Answer: no

A network disk only stores data and does not need to analyze it. HDFS, on the other hand, is designed so that adding nodes increases not only storage capacity but also analysis capacity; it is mainly intended for data analysis on top of mass storage.

1. The capacity cost is too high.

2. File sizes vary widely; storing a large number of small files wastes a great deal of space.

3. Reading and writing files is less efficient than on a network disk.

4. It is only suitable for write-once, read-many workloads.

5. HDFS does not support modifying file contents in place, but it can append content to the end of a file (see the append sketch after this list).
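
A minimal sketch (not from the original text) of appending from Java, assuming the hdfs://ubuntu2:9000 address and admin1 user used earlier; the cluster must have append support enabled.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new URI("hdfs://ubuntu2:9000"), new Configuration(), "admin1");
        // Existing file contents cannot be modified in place, but new data can be appended.
        FSDataOutputStream out = fs.append(new Path("/test.txt"));
        out.writeBytes("appended line\n");
        out.close();
        fs.close();
    }
}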

7.2 Storing large numbers of small files in HDFS

1. Before uploading, merge small files into larger ones and archive them with SequenceFile, MapFile, Har, and similar formats. The idea is to archive and manage small files together (HBase is based on this idea). With this approach, retrieving the contents of an original small file requires knowing its mapping into the archive file (a minimal SequenceFile sketch follows this list).

2. If the files have already been uploaded, a MapReduce job can be used to merge and archive them.

3. The BlueSky solution uses a two-level prefetching mechanism to improve the efficiency of reading small files: index file prefetching and data file prefetching. Index file prefetching means that when a user accesses a file, the index file corresponding to that file's block is loaded into memory, so the user does not have to interact with the namenode again when accessing those files. Data file prefetching means that when a user accesses a file, all the files belonging to the same courseware as that file are loaded into memory, so that if the user goes on to access the other related files, the speed improves significantly.
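
As a minimal sketch of approach 1 (an illustration, not code from the original article), the following packs a hypothetical local directory of small files into a single SequenceFile on HDFS, keyed by file name so the original contents can later be looked up by that key; the paths and the hdfs://ubuntu2:9000 address are assumptions.

import java.io.File;
import java.nio.file.Files;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackSmallFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://ubuntu2:9000");
        FileSystem fs = FileSystem.get(conf);
        // One SequenceFile on HDFS holds many small local files: key = file name, value = file bytes.
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, new Path("/aaa/small-files.seq"), Text.class, BytesWritable.class);
        File[] smallFiles = new File("/home/admin1/smallfiles").listFiles();  // hypothetical local directory
        if (smallFiles != null) {
            for (File f : smallFiles) {
                byte[] bytes = Files.readAllBytes(f.toPath());
                writer.append(new Text(f.getName()), new BytesWritable(bytes));
            }
        }
        writer.close();
        fs.close();
    }
}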

The above covers the basic principles of HDFS and how to access data in it, as shared by the editor. If you have similar questions, the analysis above may help you understand them. To learn more, you are welcome to follow the industry information channel.
