This article introduces how to install and use Hadoop HDFS: first the basic principles, then hands-on practice with the shell and the Java API, and finally a closer look at how HDFS works internally.
I. The basic principles of HDFS
HDFS (Hadoop Distributed File System) is a distributed file system and an important member of the Hadoop family.
1. File system problems
A file system is the disk space management service provided by the operating system: we only decide where to put a file and which path to read it from, without caring how the file is laid out on disk.
What do we do when a file needs more space than the local disk can offer?
One option is to add disks, but that only goes so far; the other is to add machines and provide networked storage through remote shared directories, which can be seen as the embryonic form of a distributed file system. Different files can be placed on different machines, and more machines can be added when space runs out, breaking through the storage limit. But this approach has several problems:
The load on a single machine can become extremely high: if a file is popular and many users read it frequently, the machine holding that file comes under heavy access pressure.
Data is not safe: if the machine holding a file fails, the file cannot be accessed, so reliability is poor.
Files are hard to organize: to move some files you must check whether the target machine has enough space and keep track of file locations yourself; with many machines this becomes extremely complex.
2. The HDFS solution
HDFS is an abstraction layer that relies on many independent servers underneath to provide a unified file management service. To users it feels like operating a single machine; the servers behind HDFS are invisible.
For example, when a user accesses the file /a/b/c.mpg in HDFS, HDFS reads it from the corresponding underlying server and returns it to the user, so the user only deals with HDFS and does not care how the file is stored.
Suppose the user needs to save a file /a/b/xxx.avi.
HDFS first splits the file, say into four blocks, and then places them on different servers.
The advantage is that the file can be arbitrarily large and the read load is not concentrated on a single server. But if one server breaks down, the complete file can no longer be read.
To ensure the reliability of files, HDFS makes multiple backups of each file block:
Block 1: A B C
Block 2: A B D
Block 3: B C D
Block 4: A C D
In this way, the reliability of the file is greatly enhanced, and even if a server is down, the file can be read completely.
At the same time, it brings another big benefit: it increases the capacity for concurrent access. For example, when multiple users read this file, they all need block 1, and HDFS can choose which server to read block 1 from according to how busy each server is.
3. Metadata management
What files are stored in HDFS?
What blocks are the files divided into?
Which server is each block placed on?
……
These are called metadata, which are abstracted into a directory tree that records these complex relationships. This metadata is managed by a separate module called NameNode. The real server where the file blocks are stored is called DataNode, so the process for users to access HDFS can be understood as:
User -> HDFS -> NameNode -> DataNode
4. Advantages of HDFS
Capacity can be expanded linearly
The replica mechanism gives high storage reliability and increases read throughput
Thanks to the NameNode, users only need to specify an HDFS path to access a file
II. HDFS practice
With the introduction above you should have a basic understanding of HDFS; now let's get hands-on to understand it better in practice.
1. Set up the practice environment
You can build your own environment, or use the packaged Hadoop environment (version 2.7.3).
This Hadoop environment is actually a virtual machine image, so you need to install the VirtualBox virtual machine, the Vagrant image management tool, and my Hadoop image, and then start a virtual machine from that image. The specific steps follow:
1) install virtualbox
Download address: https://www.virtualbox.org/wiki/Downloads
2) install vagrant
Because the official website is slow to download, I uploaded it to the cloud disk.
Windows version
Link: https://pan.baidu.com/s/1pKKQGHl
Password: eykr
Mac version
Link: https://pan.baidu.com/s/1slts9yt
Password: aig4
After the installation is complete, the vagrant command can be used under the command line terminal.
3) download Hadoop image
Link: https://pan.baidu.com/s/1bpaisnd
Password: pn6c
4) start
Load the Hadoop image:
vagrant box add {custom image name} {path to the image file}
For example, if you want to name it hadoop and the image was downloaded to d:\hadoop.box, the load command is:
vagrant box add hadoop d:\hadoop.box
Create a working directory, such as d:\hdfstest.
Enter this directory and initialize:
cd d:\hdfstest
vagrant init hadoop
Start the virtual machine
vagrant up
After the boot is complete, you can log in to the virtual machine using the SSH client
IP 127.0.0.1
Port 2222
User name root
Password vagrant
After logging in, use the command ifconfig to view the IP of the virtual machine (such as 192.168.31.239). You can log in using this IP and port 22.
IP 192.168.31.239
Port 22
User name root
Password vagrant
The Hadoop server environment has been built.
2. Shell command line operation
After logging in to the Hadoop server, start HDFS first and execute the command:
start-dfs.sh
View help
hdfs dfs -help
Display directory information
-ls is followed by the path of the directory to view, for example:
hdfs dfs -ls /
Create a directory
Create a directory /test
hdfs dfs -mkdir /test
Create a multi-level directory /aa/bb in one step
hdfs dfs -mkdir -p /aa/bb
Upload files
Format
hdfs dfs -put {local path} {path in HDFS}
Example (create a test file mytest.txt with any content, then upload it to /test)
hadoop fs -put ~/mytest.txt /test
Show file contents
hdfs dfs -cat /test/mytest.txt
Download a file
hdfs dfs -get /test/mytest.txt ./mytest2.txt
Merge download
First create two test files (log.access and log.error), then upload them to the /test directory using -put.
hdfs dfs -put log.* /test
Then merge and download the two log files into one file
hdfs dfs -getmerge /test/log.* ./log
Check the contents of the local log file, which should contain the contents of the log.access and log.error files.
Copy
Copy from one HDFS path to another HDFS path
hdfs dfs -cp /test/mytest.txt /aa/mytest.txt.2
Verification
hdfs dfs -ls /aa
Move file
hdfs dfs -mv /aa/mytest.txt.2 /aa/bb
Verification
hdfs dfs -ls /aa/bb
mytest.txt.2 should be listed.
Delete
hdfs dfs -rm -r /aa/bb/mytest.txt.2
With the -r parameter you can delete a multi-level directory in one go.
Verification
hdfs dfs -ls /aa/bb
The output should be empty.
Modify file permissions
Modify file permissions and ownership, the same way as in a Linux file system
-chgrp -chmod -chown
Example
hdfs dfs -chmod 666 /test/mytest.txt
hdfs dfs -chown someuser:somegrp /test/mytest.txt
Show the free space of the file system
hdfs dfs -df -h /
Show the size of a folder
hdfs dfs -du -s -h /test
3. Java API operation
(1) Environment configuration
Because you need to connect to the Hadoop virtual server from your local machine, Hadoop must be configured to accept external access.
Log in to the Hadoop virtual server first, and then:
1) check the virtual machine's IP
ip address
For example, the IP is 192.168.31.239
2) modify the configuration file:
vi /usr/local/hadoop-2.7.3/etc/hadoop/core-site.xml
Find the fs.defaultFS property, whose value is hdfs://localhost:9000
Change localhost:9000 to the virtual machine's IP: 192.168.31.239:9000
3) restart HDFS
# stop
stop-dfs.sh
# start
start-dfs.sh
(2) build the development environment.
1) create a new project directory hdfstest
2) create a pom.xml under the project directory
Content:
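The article does not reproduce the pom.xml content at this point. A minimal sketch that supports the mvn compile and mvn exec:java commands used below needs little more than the hadoop-client dependency matching the server version (2.7.3); the groupId and artifactId here are placeholders:

<project xmlns="http://maven.apache.org/POM/4.0.0">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.example</groupId>          <!-- placeholder -->
    <artifactId>hdfstest</artifactId>       <!-- placeholder -->
    <version>1.0</version>
    <dependencies>
        <!-- Hadoop client libraries, same version as the server in this article -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.7.3</version>
        </dependency>
    </dependencies>
</project>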
3) create a source code directory src/main/java
The project directory structure is now:
├── pom.xml
└── src
    └── main
        └── java
(3) sample code
View file list ls
1) create a new file src/main/java/Ls.java
List the files under /, and get all the files recursively
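The article does not reproduce the contents of Ls.java here; a minimal sketch, assuming the NameNode address configured above (hdfs://192.168.31.239:9000) and the root user, might look like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;
import java.net.URI;

public class Ls {
    public static void main(String[] args) throws Exception {
        // Connect to the NameNode set in core-site.xml above (address and user are assumptions)
        FileSystem fs = FileSystem.get(new URI("hdfs://192.168.31.239:9000"), new Configuration(), "root");
        // Recursively iterate over every file under /
        RemoteIterator<LocatedFileStatus> it = fs.listFiles(new Path("/"), true);
        while (it.hasNext()) {
            LocatedFileStatus status = it.next();
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }
        fs.close();
    }
}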
2) compile and execute
mvn compile
mvn exec:java -Dexec.mainClass="Ls" -Dexec.cleanupDaemonThreads=false
Create a directory mkdir
Create the directory /mkdir/a/b in HDFS
1) create a new file src/main/java/Mkdir.java
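The file body is again not shown in the article; under the same assumed connection settings, a sketch could be:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.net.URI;

public class Mkdir {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new URI("hdfs://192.168.31.239:9000"), new Configuration(), "root");
        // mkdirs creates the whole /mkdir/a/b chain, like mkdir -p
        boolean created = fs.mkdirs(new Path("/mkdir/a/b"));
        System.out.println("created: " + created);
        fs.close();
    }
}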
2) compile and execute
mvn compile
mvn exec:java -Dexec.mainClass="Mkdir" -Dexec.cleanupDaemonThreads=false
3) use the HDFS command to verify in the server
hdfs dfs -ls /mkdir
Upload file put
Create a new test file under the current project directory and upload it to /mkdir in HDFS
1) create a test file testfile.txt under the project directory with arbitrary content
2) create a new file src/main/java/Put.java
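A possible sketch of Put.java, with the same assumed connection settings:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.net.URI;

public class Put {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new URI("hdfs://192.168.31.239:9000"), new Configuration(), "root");
        // Upload the local testfile.txt (in the project directory) to /mkdir in HDFS
        fs.copyFromLocalFile(new Path("testfile.txt"), new Path("/mkdir"));
        fs.close();
    }
}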
3) compile and execute
mvn compile
mvn exec:java -Dexec.mainClass="Put" -Dexec.cleanupDaemonThreads=false
4) use the HDFS command to verify in the server
hdfs dfs -ls /mkdir
hdfs dfs -cat /mkdir/testfile.txt
Download the file get
1) create a new file src/main/java/Get.java
Download /mkdir/testfile.txt from HDFS to the current project directory
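One way Get.java could look; the useRawLocalFileSystem flag is my own choice so the download does not require the native Hadoop libraries on the local machine:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.net.URI;

public class Get {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new URI("hdfs://192.168.31.239:9000"), new Configuration(), "root");
        // Download /mkdir/testfile.txt to ./testfile2.txt; keep the source (delSrc = false)
        // and write through the raw local file system (last argument = true)
        fs.copyToLocalFile(false, new Path("/mkdir/testfile.txt"), new Path("testfile2.txt"), true);
        fs.close();
    }
}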
2) compile and execute
mvn compile
mvn exec:java -Dexec.mainClass="Get" -Dexec.cleanupDaemonThreads=false
3) check that testfile2.txt exists in the project directory and that its contents are correct
Delete the file delete
Delete the previously uploaded /mkdir/testfile.txt from HDFS
1) create a new file src/main/java/Del.java
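A sketch of what Del.java might contain:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.net.URI;

public class Del {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new URI("hdfs://192.168.31.239:9000"), new Configuration(), "root");
        // Second argument is "recursive"; false is enough for a single file
        boolean deleted = fs.delete(new Path("/mkdir/testfile.txt"), false);
        System.out.println("deleted: " + deleted);
        fs.close();
    }
}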
2) compile and execute
mvn compile
mvn exec:java -Dexec.mainClass="Del" -Dexec.cleanupDaemonThreads=false
3) use the HDFS command on the server to verify that testfile.txt has been deleted
hdfs dfs -ls /mkdir
Rename rename
Rename /mkdir/a in HDFS to /mkdir/a2
1) create a new file src/main/java/Rename.java
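A possible Rename.java sketch under the same assumptions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.net.URI;

public class Rename {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new URI("hdfs://192.168.31.239:9000"), new Configuration(), "root");
        // Rename /mkdir/a to /mkdir/a2
        boolean renamed = fs.rename(new Path("/mkdir/a"), new Path("/mkdir/a2"));
        System.out.println("renamed: " + renamed);
        fs.close();
    }
}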
2) compile and execute
mvn compile
mvn exec:java -Dexec.mainClass="Rename" -Dexec.cleanupDaemonThreads=false
3) use the HDFS command to verify in the server
hdfs dfs -ls /mkdir
Read part of the file in stream mode
Upload a text file, and then use streaming to read part of the content and save it to the current project directory.
1) create a test file test.txt in the server with the following contents:
123456789abcdefghijklmn
Upload to HDFS
hdfs dfs -put test.txt /
2) create a new file src/main/java/StreamGet.java in the local project
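The original StreamGet.java is not shown either; a sketch that seeks past the first five bytes of /test.txt and streams the rest into a local file, under the same assumptions, could be:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import java.io.FileOutputStream;
import java.net.URI;

public class StreamGet {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new URI("hdfs://192.168.31.239:9000"), new Configuration(), "root");
        FSDataInputStream in = fs.open(new Path("/test.txt"));
        // Skip the first 5 bytes ("12345") and copy the rest to a local file
        in.seek(5);
        FileOutputStream out = new FileOutputStream("test.txt.part2");
        IOUtils.copyBytes(in, out, 4096, true); // the final "true" closes both streams
        fs.close();
    }
}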
3) compile and execute
mvn compile
mvn exec:java -Dexec.mainClass="StreamGet" -Dexec.cleanupDaemonThreads=false
4) after execution, check test.txt.part2 in the project directory; its content should be
6789abcdefghijklmn
The first five characters (12345) have been skipped.
III. In-depth understanding
1. Writing mechanism
Files are written to HDFS in blocks: the client cuts the target file into multiple blocks according to the block size set in the configuration. For example, if the file is 300 MB and the configured block size is 128 MB, it is divided into three blocks.
Specific writing process:
The client sends a request to the NameNode saying it wants to upload a file.
The NameNode checks whether the target file already exists and whether the parent directory exists, and returns a confirmation once everything is in order.
The client then asks which DataNodes the first block should be transferred to.
The NameNode weighs the options and returns three available DataNodes (A, B, C).
The client establishes a connection with A, A with B, and B with C, forming a pipeline.
Once the transport pipeline is established, the client starts sending packets to A, and each packet is passed along the pipeline to B and then to C.
When the first block has been transmitted, the client asks the NameNode which DataNodes the second block should go to, sets up a new pipeline, and sends the data.
This repeats until the client has uploaded the whole file.
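For reference, the block size and replica count that drive this splitting can also be set per file from the client API. The class name, target path, and connection settings in this small sketch are illustrative, not part of the article's examples:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.net.URI;

public class WriteWithBlockSize {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new URI("hdfs://192.168.31.239:9000"), new Configuration(), "root");
        // Create a file with a 128 MB block size and 3 replicas, matching the example above;
        // a 300 MB file written through this stream would be split into 3 blocks.
        FSDataOutputStream out = fs.create(new Path("/blocksize-demo.txt"),
                true,                 // overwrite if the file exists
                4096,                 // I/O buffer size
                (short) 3,            // replication factor
                128 * 1024 * 1024L);  // block size in bytes
        out.writeUTF("hello hdfs");
        out.close();
        fs.close();
    }
}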
2. Reading mechanism
The client sends the path of the file it wants to read to the NameNode, which queries the metadata and returns which blocks the file consists of and which DataNode servers hold each block.
The client picks the nearest DataNode for each block (preferring one in the same machine room; if several are equally near, one is chosen at random) and requests a socket stream.
It then reads the data from the DataNode.
The client receives packets, caches them locally, and writes them to the target file,
until the whole file has been read.
3. NameNode mechanism
From the read and write processes above we can see that the NameNode is a crucial component: it records the metadata of the entire HDFS system, and this metadata must be persisted to a file.
The NameNode also has to handle a huge volume of requests: clients contact it for both reads and writes, modifying metadata when writing files and querying metadata when reading them.
To improve efficiency, the NameNode loads the metadata into memory and applies each change in memory instead of rewriting the file on every modification; at the same time it records an operation log so the file can be updated later.
In this way, the NameNode manages its data in three forms of storage:
Memory data
Metadata file
Operation log file
The NameNode needs to merge the metadata file and the operation log regularly to keep the on-disk data up to date, but this merge is costly, and since the NameNode must also respond quickly to a large number of client requests, it is hard for it to do the merging itself. So a helper, the SecondaryNameNode, is introduced.
The SecondaryNameNode periodically downloads the metadata file and the operation log from the NameNode, merges them into a new metadata file, and sends it back to the NameNode to replace the previous one.
The SecondaryNameNode is thus a good helper that takes this heavy lifting off the NameNode, and it can also serve as a disaster recovery backup: if the NameNode's data is lost, the latest merged metadata file on the SecondaryNameNode can be handed back to the NameNode and loaded, minimizing data loss.
This concludes the study of how to install and use Hadoop HDFS. Hopefully it has cleared up your doubts; pairing the theory with the hands-on practice above is the best way to learn, so go and try it!