How to install and use Hadoop HDFS


This article mainly introduces how to install and use Hadoop HDFS. In daily operation, many people have doubts about this topic, so the editor has consulted various materials and put together a simple, easy-to-follow set of steps. I hope it helps answer your doubts about how to install and use Hadoop HDFS. Now, please follow the editor and study!

I. The basic principles of HDFS

HDFS (Hadoop Distributed File System) is a distributed file system and an important member of the Hadoop family.

1. File system problems

A file system is a disk space management service provided by the operating system. We only need to decide where to put a file and from which path to read it, without caring how the file is actually stored on disk.

What do we do when a file needs more space than the local disk can provide?

One option is to add disks, but there is a limit to how far that scales. The other is to add machines and provide networked storage through remote shared directories, which can be seen as the embryonic form of a distributed file system: different files can be placed on different machines, and more machines can be added when space runs out, breaking through the storage limit. But this approach has several problems:

The load on a single machine may be extremely high. For example, if a file is popular and many users frequently read it, the machine holding that file comes under heavy access pressure.

Data is not safe. If the machine holding a file fails, the file cannot be accessed, so reliability is poor.

Files are hard to organize. For example, to move some files to another machine you must check whether the target machine has enough space, and you must track file locations yourself; with many machines this becomes extremely complex.

2. The solution of HDFS

HDFS is an abstraction layer that relies on many independent servers underneath to provide a unified file management service. To users it feels like operating a single machine; the many servers behind HDFS are invisible.

For example, when a user accesses the /a/b/c.mpg file in HDFS, HDFS reads it from the corresponding underlying server and returns it to the user, so the user only deals with HDFS and does not care how the file is stored.

For example, the user needs to save a file /a/b/xxx.avi.

HDFS first splits the file, for example into four blocks, and then places them on different servers.

The advantage is that the file is no longer limited by the size of a single disk, and the read load is not concentrated on one server. But if any one server goes down, the complete file can no longer be read.

To ensure the reliability of files, HDFS makes multiple backups of each file block:

Block 1: A B C

Block 2: A B D

Block 3: B C D

Block 4: A C D

In this way, the reliability of the file is greatly enhanced, and even if a server is down, the file can be read completely.

At the same time, it brings another great benefit: it increases the capacity for concurrent access to the file. For example, when multiple users read this file, they all need block 1, and HDFS can decide which server to read block 1 from according to how busy each server is.

3. Metadata management

What files are stored in HDFS?

What blocks are the files divided into?

Which server is each block placed on?

……

These are called metadata and are abstracted into a directory tree that records all of these relationships. The metadata is managed by a separate module called the NameNode. The servers that actually store the file blocks are called DataNodes, so the process of a user accessing HDFS can be understood as:

User -> HDFS -> NameNode -> DataNode

4. Advantages of HDFS

Capacity can be expanded linearly

With the replica mechanism, storage reliability is high and throughput is increased

With NameNode, users only need to specify the path on the HDFS to access the file

II. HDFS practice

With the above introduction you should have a basic understanding of HDFS. Now let's move on to the actual operations; practice will give you a better understanding of HDFS.

1. Install the practice environment

You can choose to build your own environment or use a packaged Hadoop environment (version 2.7.3)

This Hadoop environment is actually a virtual machine image, so you need to install the VirtualBox virtual machine, the Vagrant image management tool, and my Hadoop image, and then start a virtual machine from that image. Here are the specific steps:

1) install virtualbox

Download address: https://www.virtualbox.org/wiki/Downloads

2) install vagrant

Because the official website is slow to download, I uploaded it to the cloud disk.

Windows version

Link: https://pan.baidu.com/s/1pKKQGHl

Password: eykr

Mac version

Link: https://pan.baidu.com/s/1slts9yt

Password: aig4

After the installation is complete, the vagrant command can be used under the command line terminal.

3) download Hadoop image

Link: https://pan.baidu.com/s/1bpaisnd

Password: pn6c

4) start

Load the Hadoop image

vagrant box add {custom image name} {path to the image file}

For example, if you want to name it hadoop and the image was downloaded to d:\hadoop.box, the load command looks like this:

vagrant box add hadoop d:\hadoop.box

Create a working directory, such as d:\hdfstest.

Enter this directory and initialize

cd d:\hdfstest
vagrant init hadoop

Start the virtual machine

vagrant up

After the boot is complete, you can log in to the virtual machine with an SSH client

IP 127.0.0.1

Port 2222

User name root

Password vagrant

After logging in, use the command ifconfig to view the IP of the virtual machine (such as 192.168.31.239). You can log in using this IP and port 22.

IP 192.168.31.239

Port 22

User name root

Password vagrant

The Hadoop server environment has been built.

2. Shell command line operation

After logging in to the Hadoop server, start HDFS first and execute the command:

start-dfs.sh

View help

hdfs dfs -help

Display directory information

-ls is followed by the directory path to view, for example: hdfs dfs -ls /

Create a directory

Create a directory /test

hdfs dfs -mkdir /test

Create a multi-level directory /aa/bb at once

hdfs dfs -mkdir -p /aa/bb

Upload files

Format

hdfs dfs -put {local path} {path in HDFS}

Example (create a test file mytest.txt with arbitrary content, then upload it to /test)

hadoop fs -put ~/mytest.txt /test

Show file contents

hdfs dfs -cat /test/mytest.txt

Download a file

hdfs dfs -get /test/mytest.txt ./mytest2.txt

Merge download

First create two test files (log.access, log.error) and upload them to the /test directory using -put.

hdfs dfs -put log.* /test

Then merge and download two log files into one file

hdfs dfs -getmerge /test/log.* ./log

Check the contents of the local log file, which should contain the contents of the log.access and log.error files.

Copy

Copy a file from one HDFS path to another HDFS path

hdfs dfs -cp /test/mytest.txt /aa/mytest.txt.2

Verification

hdfs dfs -ls /aa

Move files

hdfs dfs -mv /aa/mytest.txt.2 /aa/bb

Verification

hdfs dfs -ls /aa/bb

mytest.txt.2 should be listed.

Delete

hdfs dfs -rm -r /aa/bb/mytest.txt.2

The -r parameter allows deleting multi-level directories at once.

Verification

hdfs dfs -ls /aa/bb

Should be empty

Modify file permissions

File permissions and ownership are modified in the same way as in a Linux file system

-chgrp, -chmod, -chown

Example

hdfs dfs -chmod 666 /test/mytest.txt
hdfs dfs -chown someuser:somegrp /test/mytest.txt

Show the free space of the file system

hdfs dfs -df -h /

Count the size of the folder

hdfs dfs -du -s -h /test

3. Java API operation

(1) Environment configuration

Because you need to connect to the Hadoop virtual server from your local machine, you need to configure Hadoop so that it can be accessed externally.

Log in to the Hadoop virtual server first, and then:

1) check the server's IP

ip address

For example, IP is: 192.168.31.239

2) modify the file:

vi /usr/local/hadoop-2.7.3/etc/hadoop/core-site.xml

<property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
</property>

Change localhost:9000 to the machine's IP, i.e. hdfs://192.168.31.239:9000

3) restart HDFS

# stop

stop-dfs.sh

# start

start-dfs.sh

(2) build the development environment.

1) create a new project directory hdfstest

2) create a pom.xml under the project directory

Content:
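
The original pom.xml content is not reproduced here. Below is a minimal sketch of what it might contain, assuming only the hadoop-client dependency (matching the server's 2.7.3 version) is required; the groupId and artifactId are placeholder values, so adjust them to your own project:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <!-- placeholder coordinates for this test project -->
    <groupId>com.example</groupId>
    <artifactId>hdfstest</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <!-- Hadoop client libraries matching the server version used in this article -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.7.3</version>
        </dependency>
    </dependencies>
</project>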

3) create a source code directory src/main/java

The project directory structure now looks like this:

├── pom.xml
└── src
    └── main
        └── java

(3) sample code

View file list ls

1) create a new file src/main/java/Ls.java

List the files under /, and get all the files recursively
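
The source of Ls.java is not shown in this article, so here is a minimal sketch of what it might look like, assuming the hadoop-client 2.7.3 dependency from the pom and the virtual machine address 192.168.31.239:9000 and root user mentioned earlier; adjust them to your own environment:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class Ls {
    public static void main(String[] args) throws Exception {
        // Connect to the HDFS instance configured above, acting as the root user
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://192.168.31.239:9000"), conf, "root");
        // Recursively iterate over all files under /
        RemoteIterator<LocatedFileStatus> it = fs.listFiles(new Path("/"), true);
        while (it.hasNext()) {
            LocatedFileStatus status = it.next();
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }
        fs.close();
    }
}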

2) compile and execute

mvn compile
mvn exec:java -Dexec.mainClass="Ls" -Dexec.cleanupDaemonThreads=false

Create a directory mkdir

Create a directory /mkdir/a/b in HDFS

1) create a new file

src/main/java/Mkdir.java
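
As with Ls.java, the original source is not included, so here is a minimal sketch under the same assumptions (hadoop-client 2.7.3, server address 192.168.31.239:9000, user root):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class Mkdir {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://192.168.31.239:9000"), conf, "root");
        // mkdirs creates the whole multi-level path /mkdir/a/b in one call
        boolean ok = fs.mkdirs(new Path("/mkdir/a/b"));
        System.out.println("mkdir result: " + ok);
        fs.close();
    }
}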

2) compile and execute

mvn compile
mvn exec:java -Dexec.mainClass="Mkdir" -Dexec.cleanupDaemonThreads=false

3) use the HDFS command to verify in the server

hdfs dfs -ls /mkdir

Upload file put

Create a new test file under the current project directory and upload it to /mkdir in HDFS

1) create a test file testfile.txt under the project directory with arbitrary content

2) create a new file src/main/java/Put.java
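
A minimal sketch of what Put.java might contain, under the same assumptions as above:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class Put {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://192.168.31.239:9000"), conf, "root");
        // Upload testfile.txt from the project directory to /mkdir/testfile.txt in HDFS
        fs.copyFromLocalFile(new Path("testfile.txt"), new Path("/mkdir/testfile.txt"));
        fs.close();
    }
}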

3) compile and execute

mvn compile
mvn exec:java -Dexec.mainClass="Put" -Dexec.cleanupDaemonThreads=false

4) use the HDFS command to verify in the server

hdfs dfs -ls /mkdir
hdfs dfs -cat /mkdir/testfile.txt

Download the file get

1) create a new file src/main/java/Get.java

Download /mkdir/testfile.txt from HDFS to the current project directory, saving it as testfile2.txt
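
A minimal sketch of what Get.java might contain, under the same assumptions as above:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class Get {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://192.168.31.239:9000"), conf, "root");
        // Download /mkdir/testfile.txt into the current directory as testfile2.txt;
        // the final "true" uses the raw local file system so no .crc checksum file is written
        fs.copyToLocalFile(false, new Path("/mkdir/testfile.txt"), new Path("testfile2.txt"), true);
        fs.close();
    }
}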

2) compile and execute

mvn compile
mvn exec:java -Dexec.mainClass="Get" -Dexec.cleanupDaemonThreads=false

3) check that testfile2.txt exists in the project directory and that its contents are correct

Delete the file delete

Delete the previously uploaded /mkdir/testfile.txt from HDFS

1) create a new file src/main/java/Del.java
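
A minimal sketch of what Del.java might contain, under the same assumptions as above:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class Del {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://192.168.31.239:9000"), conf, "root");
        // Delete the uploaded file; the second argument enables recursive deletion (only needed for directories)
        boolean ok = fs.delete(new Path("/mkdir/testfile.txt"), false);
        System.out.println("delete result: " + ok);
        fs.close();
    }
}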

2) compile and execute

mvn compile
mvn exec:java -Dexec.mainClass="Del" -Dexec.cleanupDaemonThreads=false

3) use the HDFS command on the server to verify that testfile.txt has been deleted

hdfs dfs -ls /mkdir

Rename rename

Rename /mkdir/a in HDFS to /mkdir/a2

1) create a new file src/main/java/Rename.java
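
A minimal sketch of what Rename.java might contain, under the same assumptions as above:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class Rename {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://192.168.31.239:9000"), conf, "root");
        // rename moves /mkdir/a (and everything under it) to /mkdir/a2
        boolean ok = fs.rename(new Path("/mkdir/a"), new Path("/mkdir/a2"));
        System.out.println("rename result: " + ok);
        fs.close();
    }
}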

2) compile and execute

mvn compile
mvn exec:java -Dexec.mainClass="Rename" -Dexec.cleanupDaemonThreads=false

3) use the HDFS command to verify in the server

hdfs dfs -ls /mkdir

Read part of the file in stream mode

Upload a text file, and then use streaming to read part of the content and save it to the current project directory.

1) create a test file test.txt in the server with the following contents:

123456789abcdefghijklmn

Upload to HDFS

hdfs dfs -put test.txt /

2) create a new file src/main/java/StreamGet.java in the local project
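
A minimal sketch of what StreamGet.java might contain, under the same assumptions as above; it opens /test.txt as a stream, skips the first 5 bytes, and writes the rest to a local file:

import java.io.FileOutputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class StreamGet {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://192.168.31.239:9000"), conf, "root");
        // Open /test.txt as a stream and seek past the first 5 bytes ("12345")
        FSDataInputStream in = fs.open(new Path("/test.txt"));
        in.seek(5);
        // Copy the remainder of the stream into the local file test.txt.part2
        FileOutputStream out = new FileOutputStream("test.txt.part2");
        IOUtils.copyBytes(in, out, 4096, true); // true closes both streams when finished
        fs.close();
    }
}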

3) compile and execute

mvn compile
mvn exec:java -Dexec.mainClass="StreamGet" -Dexec.cleanupDaemonThreads=false

4) after execution, check test.txt.part2 in the project directory; it should contain:

6789abcdefghijklmn

The first five characters, 12345, have been skipped.

III. In-depth understanding

1. Writing mechanism

When writing a file to HDFS, the file is written in blocks. The client cuts the target file into multiple blocks according to the block size set in the configuration. For example, if the file is 300 MB and the configured block size is 128 MB, it will be divided into three blocks (two 128 MB blocks plus one 44 MB block).

Specific writing process:

The client sends a request to the namenode saying it wants to upload a file.

The namenode checks whether the target file and its parent directory exist, and returns a confirmation after finding no problem.

The client sends another request, asking which datanodes the first block should be transferred to.

The namenode weighs the options and returns three available datanodes (A, B, C).

The client establishes a connection with A, A with B, and B with C, forming a pipeline.

Once the transport pipeline is established, the client begins sending packets to A, which are passed along the pipeline to B and then C.

When the first block has been transmitted, the client asks the namenode which datanodes the second block should be uploaded to, then sets up another transmission pipeline and sends the data.

This repeats until the client has uploaded the entire file.

2. Reading mechanism

The client sends the path of the file to be read to the namenode, which queries the metadata and returns which blocks the file contains and the datanode servers holding each block.

The client picks a nearby datanode for each block (preferring one in the same machine room, and choosing randomly if several are equally close) and requests a socket stream to it.

Data is read from the datanode.

The client receives the packets, caches them locally, and then writes them to the target file.

This continues until the whole file has been read.

3. NameNode mechanism

From the HDFS read and write processes we can see that the namenode is a very important component: it records the metadata of the entire HDFS system, and that metadata must be persisted to a file.

The namenode also has to handle a huge volume of requests. Clients contact it for both reads and writes, modifying metadata when writing files and querying metadata when reading them.

To improve efficiency, the namenode loads the metadata into memory and modifies memory directly instead of rewriting the file on every change. At the same time, it records an operation log so the file can be updated later.

In this way, namenode's management of data involves three forms of storage:

Memory data

Metadata file

Operation log file

The namenode needs to merge the metadata file and the operation log regularly to keep the on-disk data up to date, but this merge is costly, and since the namenode must respond quickly to a large number of client requests it is hard for it to do the merging itself. So a little helper, the secondary namenode, is introduced.

The secondary namenode periodically downloads the metadata file and operation log from the namenode, merges them into a new metadata file, and sends it back to the namenode to replace the previous file.

The secondary namenode is a good helper that takes this heavy work off the namenode, and it can also serve as a disaster recovery backup: if the namenode's data is lost, the latest merged data file on the secondary namenode can be handed back to the namenode for loading, minimizing data loss.

At this point, the study of "how to install and use Hadoop HDFS" is over. I hope it has resolved your doubts. Combining theory with practice is the best way to learn, so go and try it! If you want to continue learning more related knowledge, please keep following the site; the editor will keep working hard to bring you more practical articles!
