Overview and design goals of HDFS
What is HDFS:
HDFS (Hadoop Distributed File System) is the distributed file system implemented by Hadoop. It is derived from the GFS paper that Google published in 2003; HDFS is essentially an open-source clone of GFS.
The design goals of HDFS:
A very large distributed file system that runs on ordinary, cheap hardware, scales out easily, and tolerates hardware failure, so that it can provide a reliable file storage service to users.
The address of the official HDFS document is as follows:
https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
HDFS architecture
HDFS is a master/slave architecture. An HDFS cluster has a single NameNode (NN for short), the name node, which acts as the master server: it manages the file system namespace and regulates client access to files. In addition, there are multiple DataNodes (DN for short), the data nodes, which act as slave servers. The DataNodes in a cluster are managed by the NameNode, and it is the DataNodes that actually store the data.
HDFS exposes a file system namespace and lets users store data in files, just as we normally use a file system in an operating system; users do not have to care about how the data is stored underneath. At the bottom, a file is split into one or more data blocks, and these blocks are stored on a set of DataNodes. The default block size in CDH is 128 MB, which can be adjusted in the configuration file. The NameNode performs file system namespace operations such as opening, closing, and renaming files; it also determines the mapping of data blocks to DataNodes.
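Once a cluster is up and running (one is built later in this article), you can see how a file has actually been split into blocks. The sketch below is illustrative only; /hello.txt is a hypothetical file assumed to already exist in HDFS:
# print the configured block size, in bytes
hdfs getconf -confKey dfs.blocksize
# list the blocks that make up a file already stored in HDFS
hdfs fsck /hello.txt -files -blocks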
Let's take a look at the architecture diagram of HDFS:
HDFS is designed to run on ordinary, cheap machines, which typically run a Linux operating system. HDFS is written in Java, so any machine that supports Java can run it; being written in the highly portable Java language means it can be deployed on a wide range of machines. In a typical HDFS deployment, one dedicated machine runs only the NameNode, while each of the other machines in the cluster runs a DataNode instance. Although multiple nodes can be run on a single machine, this is not recommended outside of a learning environment.
Summary:
HDFS is a master/slave architecture: an HDFS cluster has one NameNode and multiple DataNodes. A file is split into multiple blocks for storage; the default block size is 128 MB, so even a 130 MB file is split into two blocks, one of 128 MB and one of 2 MB. HDFS is written in Java, so it can run on any operating system with a JDK installed.
NN:
Responds to client requests; manages the file system metadata (file names, replication factor, and which DataNodes hold each Block).
DN:
Stores the data blocks (Blocks) that make up user files; periodically sends heartbeats to the NN, reporting itself, all of its block information, and its health status.
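On a running cluster, the NameNode's view of its DataNodes can be inspected from the command line; a minimal, illustrative check:
# print overall capacity plus a per-DataNode summary, including the time of the last heartbeat
hdfs dfsadmin -report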
HDFS replication mechanism
In HDFS, a file is split into one or more blocks, and by default each data block has three replicas. Each replica is stored on a different machine, and each replica has its own unique identifier.
As shown below:
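The replication factor can also be checked, or changed per file, from the shell. This is only a sketch, again using the hypothetical /hello.txt; note that the single-node environment built later deliberately uses a replication factor of 1:
# show the current replication factor of a file
hdfs dfs -stat %r /hello.txt
# request a different replication factor (with only one DataNode the extra replica cannot be placed, so the block simply stays under-replicated)
hdfs dfs -setrep 2 /hello.txt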
HDFS replica placement policy
The process by which the NameNode chooses DataNodes to store the replicas of a block is called replica placement, and the strategy behind it is a trade-off between reliability and read/write bandwidth.
The default policy, as described in Hadoop: The Definitive Guide:
The first replica is placed on a randomly chosen node, avoiding nodes that are too full. The second replica is placed on a randomly chosen rack different from the first. The third replica is placed on the same rack as the second, but on a different node. Any further replicas are placed at random.
As shown below:
It is easy to see why this policy is reasonable:
Reliability: each block is stored on two racks.
Write bandwidth: a write only has to cross a single network switch.
Read performance: either of the two racks can be chosen for reading.
Blocks are distributed evenly across the cluster.
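On a real multi-rack cluster you can check where the replicas of each block actually landed; a small, illustrative check (on the single-node environment built below, everything sits on one default rack):
# show, for every file, its blocks, the DataNodes holding each replica, and their racks
hdfs fsck / -files -blocks -locations -racks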
Building an HDFS pseudo-distributed environment
The official installation documentation is available at:
http://archive.cloudera.com/cdh6/cdh/5/hadoop-2.6.0-cdh6.7.0/hadoop-project-dist/hadoop-common/SingleCluster.html
Environment:
CentOS 7.3
JDK 1.8
Hadoop 2.6.0-cdh6.7.0
Download the tar.gz package for Hadoop 2.6.0-cdh6.7.0 and extract it:
[root@localhost ~]# cd /usr/local/src/
[root@localhost /usr/local/src]# wget http://archive.cloudera.com/cdh6/cdh/5/hadoop-2.6.0-cdh6.7.0.tar.gz
[root@localhost /usr/local/src]# tar -zxvf hadoop-2.6.0-cdh6.7.0.tar.gz -C /usr/local/
Note: if the download is slow on Linux, you can download the package on Windows with a download manager such as Xunlei (Thunder) and then upload it to Linux, which is usually faster.
After extracting the archive, enter the extracted directory; the directory structure of Hadoop looks like this:
[root@localhost /usr/local/src]# cd /usr/local/hadoop-2.6.0-cdh6.7.0/
[root@localhost /usr/local/hadoop-2.6.0-cdh6.7.0]# ls
bin  cloudera  examples  include  libexec  NOTICE.txt  sbin  src
bin-mapreduce1  etc  examples-mapreduce1  lib  LICENSE.txt  README.txt  share
[root@localhost /usr/local/hadoop-2.6.0-cdh6.7.0]#
A brief description of what is stored in several of these directories:
bin: executable files
etc: configuration files
sbin: scripts for starting and stopping services
share: jar packages and documentation
With that, Hadoop is installed. The next step is to edit the configuration file and set JAVA_HOME:
[root@localhost /usr/local/hadoop-2.6.0-cdh6.7.0]# cd etc/
[root@localhost /usr/local/hadoop-2.6.0-cdh6.7.0/etc]# cd hadoop
[root@localhost /usr/local/hadoop-2.6.0-cdh6.7.0/etc/hadoop]# vim hadoop-env.sh
export JAVA_HOME=/usr/local/jdk1.8/  # adjust to your own environment
Since we are going to build a single-node pseudo-distributed environment, we also need to configure two configuration files, namely core-site.xml and hdfs-site.xml, as follows:
[root@localhost /usr/local/hadoop-2.6.0-cdh6.7.0/etc/hadoop]# vim core-site.xml  # add the following content:
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://192.168.77.130:8020</value>  <!-- default file system address and port -->
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/data/tmp/</value>  <!-- directory where temporary files are stored -->
    </property>
</configuration>
[root@localhost /usr/local/hadoop-2.6.0-cdh6.7.0/etc/hadoop]# mkdir /data/tmp/
[root@localhost /usr/local/hadoop-2.6.0-cdh6.7.0/etc/hadoop]# vim hdfs-site.xml  # add the following content:
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>  <!-- keep only one replica -->
    </property>
</configuration>
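As an optional sanity check (assuming the installation path used above), you can ask Hadoop which values it actually resolved from these files:
# print the values picked up from core-site.xml and hdfs-site.xml
/usr/local/hadoop-2.6.0-cdh6.7.0/bin/hdfs getconf -confKey fs.defaultFS
/usr/local/hadoop-2.6.0-cdh6.7.0/bin/hdfs getconf -confKey dfs.replication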
Next, generate a key pair and set up passwordless SSH login to localhost; this step is required for the pseudo-distributed setup:
[root@localhost ~]# ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
Generating public/private dsa key pair.
Your identification has been saved in /root/.ssh/id_dsa.
Your public key has been saved in /root/.ssh/id_dsa.pub.
The key fingerprint is:
c2:41:89:65:bd:04:9e:3e:3f:f9:a7:51:cd:e9:cf:1e root@localhost
[root@localhost ~]# cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
[root@localhost ~]# ssh localhost
ssh_exchange_identification: read: Connection reset by peer
[root@localhost ~]#
As shown above, testing the passwordless login produced an ssh_exchange_identification: read: Connection reset by peer error. Troubleshooting showed that the IP was being restricted by the /etc/hosts.allow file, so we modify that configuration file as follows:
[root@localhost ~]# vim /etc/hosts.allow  # change the sshd entry to: sshd: ALL
[root@localhost ~]# service sshd restart
[root@localhost ~]# ssh localhost  # the test login now succeeds
Last login: Sat Mar 24 21:56:38 2018 from localhost
[root@localhost ~]# logout
Connection to localhost closed.
[root@localhost ~]#
Then you can start HDFS, but you need to format the file system before starting it:
[root@localhost ~]# /usr/local/hadoop-2.6.0-cdh6.7.0/bin/hdfs namenode -format
Note: formatting is required only on the first startup.
Use the service startup script to start the service:
[root@localhost ~]# /usr/local/hadoop-2.6.0-cdh6.7.0/sbin/start-dfs.sh
18/03/24 21:59:03 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [192.168.77.130]
192.168.77.130: namenode running as process 8928. Stop it first.
localhost: starting datanode, logging to /usr/local/hadoop-2.6.0-cdh6.7.0/logs/hadoop-root-datanode-localhost.out
Starting secondary namenodes [0.0.0.0]
The authenticity of host '0.0.0.0 (0.0.0.0)' can't be established.
ECDSA key fingerprint is 63:74:14:e8:15:4c:45:13:9e:16:56:92:6a:8c:1a:84.
Are you sure you want to continue connecting (yes/no)? yes  # the first start asks whether to connect to the node
0.0.0.0: Warning: Permanently added '0.0.0.0' (ECDSA) to the list of known hosts.
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop-2.6.0-cdh6.7.0/logs/hadoop-root-secondarynamenode-localhost.out
18/03/24 21:59:22 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[root@localhost ~]# jps  # check that all of the following processes are present
8928 NameNode
9875 Jps
9578 DataNode
9757 SecondaryNameNode
[root@localhost ~]# netstat -lntp | grep java  # check the listening ports
tcp    0    0 0.0.0.0:50090          0.0.0.0:*    LISTEN    9757/java
tcp    0    0 192.168.77.130:8020    0.0.0.0:*    LISTEN    8928/java
tcp    0    0 0.0.0.0:50070          0.0.0.0:*    LISTEN    8928/java
tcp    0    0 0.0.0.0:50010          0.0.0.0:*    LISTEN    9578/java
tcp    0    0 0.0.0.0:50075          0.0.0.0:*    LISTEN    9578/java
tcp    0    0 0.0.0.0:50020          0.0.0.0:*    LISTEN    9578/java
tcp    0    0 127.0.0.1:53703        0.0.0.0:*    LISTEN    9578/java
[root@localhost ~]#
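Before using the cluster you may also want to confirm that the NameNode has left safe mode; a small optional check (the PATH is only configured in the next step, so the full path is used here):
# "Safe mode is OFF" means HDFS is ready to accept writes
/usr/local/hadoop-2.6.0-cdh6.7.0/bin/hdfs dfsadmin -safemode get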
Then add the Hadoop installation directory to the PATH environment variable so that its commands are easier to use later:
[root@localhost ~]# vim ~/.bash_profile  # add the following content
export HADOOP_HOME=/usr/local/hadoop-2.6.0-cdh6.7.0/
export PATH=$HADOOP_HOME/bin:$PATH
[root@localhost ~]# source !$
source ~/.bash_profile
[root@localhost ~]#
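A quick way to confirm that the new PATH took effect (illustrative; the exact version string depends on your build):
# both commands should now resolve without typing the full path
which hdfs
hadoop version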
After confirming that the services have started successfully, visit 192.168.77.130:50070 in a browser and you will see the following page:
Click Live Nodes to view the live nodes:
As shown above, you can see the node's information. At this point, our pseudo-distributed Hadoop cluster is complete.
HDFS shell operation
The previous sections showed how to build a pseudo-distributed Hadoop environment. Now that the environment is set up, how do we operate it? That is what this section covers:
HDFS comes with a set of shell commands through which we can manipulate the HDFS file system. These commands are similar to Linux commands, so if you are familiar with Linux you can pick them up easily. The official documentation for these commands is at:
http://archive.cloudera.com/cdh6/cdh/5/hadoop-2.6.0-cdh6.7.0/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapredCommands.html
Here are a few commonly used commands:
First, let's create a test file:
[root@localhost ~]# cd /data/
[root@localhost /data]# vim hello.txt  # write some content
hadoop welcome
hadoop hdfs mapreduce
hadoop hdfs
[root@localhost /data]#
1. View the root directory of the file system:
[root@localhost /data]# hdfs dfs -ls /
[root@localhost /data]#
2. Copy the file we just created to the root of the file system:
[root@localhost /data]# hdfs dfs -put ./hello.txt /
[root@localhost /data]# hdfs dfs -ls /
Found 1 items
-rw-r--r--   1 root supergroup         49 2018-03-24 22:37 /hello.txt
[root@localhost /data]#
3. View the contents of the file:
[root@localhost /data]# hdfs dfs -cat /hello.txt
hadoop welcome
hadoop hdfs mapreduce
hadoop hdfs
[root@localhost /data]#
4. Create a directory:
[root@localhost /data]# hdfs dfs -mkdir /test
[root@localhost /data]# hdfs dfs -ls /
Found 2 items
-rw-r--r--   1 root supergroup         49 2018-03-24 22:37 /hello.txt
drwxr-xr-x   - root supergroup          0 2018-03-24 22:40 /test
[root@localhost /data]#
5. Recursively create a directory:
[root@localhost /data]# hdfs dfs -mkdir -p /test/a/b/c
6. Recursively view the directory:
[root@localhost /data]# hdfs dfs -ls -R /
-rw-r--r--   1 root supergroup         49 2018-03-24 23:02 /hello.txt
drwxr-xr-x   - root supergroup          0 2018-03-24 23:04 /test
drwxr-xr-x   - root supergroup          0 2018-03-24 23:04 /test/a
drwxr-xr-x   - root supergroup          0 2018-03-24 23:04 /test/a/b
drwxr-xr-x   - root supergroup          0 2018-03-24 23:04 /test/a/b/c
[root@localhost /data]#
7. Copy a local file into HDFS (copyFromLocal):
[root@localhost /data]# hdfs dfs -copyFromLocal ./hello.txt /test/a/b
[root@localhost /data]# hdfs dfs -ls -R /
-rw-r--r--   1 root supergroup         49 2018-03-24 23:02 /hello.txt
drwxr-xr-x   - root supergroup          0 2018-03-24 23:04 /test
drwxr-xr-x   - root supergroup          0 2018-03-24 23:04 /test/a
drwxr-xr-x   - root supergroup          0 2018-03-24 23:06 /test/a/b
drwxr-xr-x   - root supergroup          0 2018-03-24 23:04 /test/a/b/c
-rw-r--r--   1 root supergroup         49 2018-03-24 23:06 /test/a/b/hello.txt
[root@localhost /data]#
8. Retrieve a file from HDFS to the local file system:
[root@localhost /data]# hdfs dfs -get /test/a/b/hello.txt
9. Delete the file:
[root@localhost /data]# hdfs dfs -rm /hello.txt
Deleted /hello.txt
[root@localhost /data]#
10. Delete the directory:
[root@localhost /data]# hdfs dfs -rm -R /test
Deleted /test
[root@localhost /data]#
These are the most commonly used commands. If you need others, you can see all the supported commands by running hdfs dfs with no arguments.
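For the details of any single command there is also a built-in help option; a small illustrative example:
# list every supported file system command
hdfs dfs
# show the detailed help for one command, for example put
hdfs dfs -help put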
Next, let's look at the file system in the browser. First, put back the file we just deleted:
[root@localhost /data]# hdfs dfs -put ./hello.txt /
View the file system on the browser:
View information about the file:
You can see the details about the file:
Because this file is small, it occupies only one data block.
Let's put a larger file, such as the Hadoop installation package we used earlier:
[root@localhost /data]# cd /usr/local/src/
[root@localhost /usr/local/src]# hdfs dfs -put ./hadoop-2.6.0-cdh6.7.0.tar.gz /
As you can see below, this file is divided into three data blocks:
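The same information can be confirmed from the command line; a small illustrative check (the exact block count depends on the file size and the configured block size):
# show how many blocks the uploaded archive was split into
hdfs fsck /hadoop-2.6.0-cdh6.7.0.tar.gz -files -blocks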