1. What does HDFS do
Hadoop implements a distributed file system, the Hadoop Distributed File System (HDFS), which is the foundation of data storage and management in distributed computing. It was developed to meet the need to access and process very large files in a streaming fashion, and it runs on cheap commodity servers. It offers high fault tolerance, high reliability, high scalability, high availability, and high throughput. It provides fault-tolerant storage for massive amounts of data and greatly simplifies the handling and processing of very large data sets.
2. Why HDFS is selected to store data
HDFS is chosen to store data because it offers the following advantages:
1. High fault tolerance
Multiple copies of the data are kept automatically, and fault tolerance can be raised further by adding replicas (see the sketch after this list).
If a replica is lost, HDFS restores it automatically through its internal mechanisms; no manual intervention is needed.
2. Suitable for batch processing
It works by moving computation to the data rather than moving data to the computation.
It exposes data locations to the computing framework so that tasks can be scheduled close to their input.
3. Suitable for big data
It handles data at the GB, TB, and even PB scale.
It can manage well over a million files.
It scales to clusters on the order of 10K nodes.
4. Streaming file access
Files are written once and read many times. Once a file is written it cannot be modified; it can only be appended to.
This model makes it possible to guarantee data consistency.
5. Can be built on cheap machines
Reliability is improved through the multi-replica mechanism.
Fault tolerance and recovery mechanisms are built in: for example, a lost replica can be restored from the surviving replicas.
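The replica mechanism above can also be driven from client code. Below is a minimal Java sketch, assuming a reachable cluster; the address hdfs://localhost:9000 and the path /data/example.txt are hypothetical stand-ins, not values from this article:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://localhost:9000"); // hypothetical NameNode address
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/example.txt"); // hypothetical existing file
        // Ask HDFS to keep three replicas of this file's blocks;
        // re-replication happens in the background.
        fs.setReplication(file, (short) 3);
        System.out.println("replication = " + fs.getFileStatus(file).getReplication());
        fs.close();
    }
}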
Of course, HDFS also has its disadvantages and is not suitable for all situations:
1. Low-latency data access
Millisecond-level data access, for example, is simply not something HDFS can deliver.
HDFS is built for high-throughput scenarios in which large volumes of data are written at once; it performs poorly when data must be read back within milliseconds.
2. Small file storage
Storing a large number of small files (files smaller than the HDFS block size; the default was 64 MB in older Hadoop releases and is 128 MB in Hadoop 2.x) consumes a great deal of NameNode memory for file, directory, and block metadata. This is undesirable because NameNode memory is always limited.
For small files, the seek time also exceeds the read time, which defeats HDFS's design goals.
3. Concurrent writes and random file modification
A file can have only one writer at a time; multiple threads are not allowed to write to it concurrently.
Only data appends (append) are supported; random modification of file contents is not (see the sketch after this list).
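The append-only model can be seen directly in the client API. Below is a minimal sketch, assuming appends are enabled on the cluster and that /logs/app.log is an existing (hypothetical) file:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Appending to the end of an existing file is supported...
        FSDataOutputStream out = fs.append(new Path("/logs/app.log")); // hypothetical path
        out.writeBytes("one more line\n");
        out.close();
        // ...but there is no call for overwriting bytes in the middle of a
        // file: apart from appends, HDFS files are write-once.
        fs.close();
    }
}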
3. Internal structure
4. How HDFS uploads data
HDFS stores data using a Master/Slave architecture composed of four main parts: the HDFS Client, the NameNode, the DataNode, and the Secondary NameNode. Let's look at each of these four components in turn.
1. Client: the client. It slices files: when a file is uploaded to HDFS, the Client splits it into Blocks for storage. It interacts with the NameNode to obtain file location information, and with DataNodes to read or write data. The Client also provides commands to manage HDFS, such as starting or shutting it down, and can access HDFS through a set of commands.
2. NameNode: the master, a supervisor and manager. It manages the HDFS namespace and the block (Block) mapping information, configures the replica policy, and handles client read and write requests (see the sketch after this list).
3. DataNode: a slave. The NameNode issues the commands, and the DataNode performs the actual operations: it stores the actual data blocks and performs reads and writes on them.
4. Secondary NameNode: not a hot standby for the NameNode. When the NameNode dies, it does not immediately take over and serve requests. Instead it assists the NameNode and shares its workload: it periodically merges the fsimage and fsedits files and pushes the result back to the NameNode, and in an emergency it can help recover the NameNode.
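To make the division of labor concrete: the NameNode holds only the block mapping, and a client can ask it where a file's blocks physically live. A minimal sketch, using a hypothetical file path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/data/example.txt")); // hypothetical file
        // For each block, the NameNode reports the DataNodes holding a replica.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.println("offset " + b.getOffset() + " -> hosts " + String.join(",", b.getHosts()));
        }
        fs.close();
    }
}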
5. How does HDFS read files
Reading a file from HDFS involves the following steps:
First, the client calls the open method of the FileSystem object, which in practice returns an instance of DistributedFileSystem. DistributedFileSystem fetches the locations of the file's first batch of blocks from the NameNode via RPC (remote procedure call). For each block, multiple locations are returned according to the replication factor, sorted by the Hadoop network topology with the location closest to the client first.
These first two steps produce an FSDataInputStream object, which wraps a DFSInputStream; DFSInputStream manages the data flows to the DataNodes and the NameNode.
When the client calls the read method, DFSInputStream finds the DataNode closest to the client and connects to it, and data streams from that DataNode to the client.
When the first block has been fully read, the connection to its DataNode is closed and the next block is read. These operations are transparent to the client, which simply sees one continuous stream.
When the first batch of blocks is exhausted, DFSInputStream asks the NameNode for the locations of the next batch of blocks and continues reading. Once all blocks have been read, all streams are closed.
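The whole sequence is hidden behind a few calls. Below is a minimal sketch of the read path in Java, reusing the phone.txt path from the worked example later in this article (the path is otherwise arbitrary):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // open() performs the first two steps: it returns an FSDataInputStream
        // wrapping a DFSInputStream that holds the first batch of block locations.
        FSDataInputStream in = fs.open(new Path("/test/network/phone.txt"));
        try {
            // Streams data from the nearest DataNode, moving from block to
            // block (and batch to batch) transparently.
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
        fs.close();
    }
}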
6. How does HDFS write a file
Writing a file to HDFS involves the following steps:
The client calls the create method of DistributedFileSystem to create a new file. DistributedFileSystem then calls the NameNode via RPC (remote procedure call) to create a new file that has no blocks associated with it yet. Before creation, the NameNode performs various checks, such as whether the file already exists and whether the client has permission to create it. If the checks pass, the NameNode records the new file; otherwise an IO exception is thrown.
These first two steps return an FSDataOutputStream object which, as with reading, wraps a DFSOutputStream; DFSOutputStream coordinates the NameNode and the DataNodes.
As the client writes data to DFSOutputStream, the stream cuts the data into small packets and enqueues them in the data queue. The DataStreamer consumes this queue: it first asks the NameNode which DataNodes are best suited to store the new block (for a replication factor of 3, it finds the three most suitable DataNodes) and arranges them into a pipeline.
The DataStreamer sends each packet to the first DataNode in the pipeline, the first DataNode forwards the packet to the second DataNode, and so on.
DFSOutputStream also keeps a queue called the ack queue, likewise made up of packets, which waits for receipt responses from the DataNodes. A packet is removed from the ack queue only when every DataNode in the pipeline has acknowledged it.
When the client finishes writing, it calls the close method to close the write stream. The DataStreamer flushes the remaining packets into the pipeline, waits for the acks, and after receiving the last ack notifies the NameNode to mark the file as complete.
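Again, the client sees only a stream. A minimal sketch of the write path, with a hypothetical target path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // create() triggers the NameNode checks described above
        // (existence, permission) before any block is allocated.
        FSDataOutputStream out = fs.create(new Path("/test/hello.txt")); // hypothetical path
        // Writes are cut into packets and pushed down the DataNode pipeline.
        out.writeBytes("hello hdfs\n");
        // close() flushes the remaining packets and waits for the final ack.
        out.close();
        fs.close();
    }
}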
7. Command line interface
Two property items matter here: fs.default.name sets Hadoop's default file system, and giving it an hdfs:// URL configures HDFS as that default; dfs.replication sets the number of replicas kept for each file system block.
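In a deployment these two properties normally live in the XML configuration files, but they can equally be set on a Configuration object in client code. A minimal sketch; the address is a hypothetical stand-in:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class ConfDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // An hdfs:// URL makes HDFS the default file system.
        conf.set("fs.default.name", "hdfs://localhost:9000"); // hypothetical address
        // Number of replicas kept for each file system block.
        conf.set("dfs.replication", "3");
        FileSystem fs = FileSystem.get(conf);
        System.out.println("default FS: " + fs.getUri());
        fs.close();
    }
}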
Basic operation of the file system: hadoop fs -help lists all the commands with an explanation of each.
The commonly used ones are:
hadoop fs -ls /      lists the directories and files under the root of the HDFS file system
hadoop fs -put       uploads a file from the local file system to HDFS
hadoop fs -get       downloads a file from HDFS to the local file system
hadoop fs -rm -r     deletes files, or folders and the files under them
hadoop fs -mkdir     creates a new folder in HDFS
A worked example:
cd hadoop-2.5.2
cd sbin
./start-all.sh // start the HDFS and YARN services
cd ..
cd bin
./hadoop dfs -ls / // explanation: ./hadoop invokes the hadoop command; dfs indicates the command operates on HDFS; -ls / lists the HDFS root directory
./hadoop dfs -rm /test/count/SUCCESS // delete the SUCCESS file in the /test/count directory
./hadoop dfs -rmr /test/count/output // delete the /test/count/output directory
./hadoop dfs -mkdir /test/count/input // create the /test/count/input directory
Fetch the files to be analyzed from the Linux shared folder and upload them to HDFS:
./hadoop fs -put /mnt/hgfs/share/phone.txt /test/network
Run the analysis code:
./hadoop jar /mnt/hgfs/share/mobile.jar com.wanho.hadoopmobile.PhoneDriver
Copy the results back to the Linux shared folder:
./hadoop fs -get /test/network/output1 /mnt/hgfs/share