
Installation and configuration of Hadoop


This article introduces the installation and configuration of Hadoop. Many people run into trouble with these steps in practice, so this walkthrough shows how to handle them; read it carefully and you should come away with something useful.

Related software downloads (Weiyun network disk): https://share.weiyun.com/5uIOSHe (password: osmzbn)

JDK 8: https://jdk.java.net/java-se-ri/8-MR3

Hadoop 3.2.1: https://hadoop.apache.org/releases.html

If you have not yet installed and set up the virtual machine, refer to the previous article on Ubuntu installation and configuration. There, the default server user name is hadoop, the machine name is master, master's IP is written into the hosts file, and passwordless SSH login is configured. This article introduces two ways to install Hadoop so that MapReduce and HDFS are easy to use and operate.
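For reference, a minimal sketch of those assumed prerequisites (the IP and commands below are illustrative; adapt them to your own setup from the previous article):

# map the master's IP in /etc/hosts, for example:
#   192.168.128.10 master
# and confirm passwordless SSH works for the hadoop user:
ssh-copy-id hadoop@master
ssh hadoop@master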

Main contents:

Hadoop pseudo-distributed installation

Hadoop cluster installation

Dynamically add or delete nodes

1. Installation of necessary software

Hadoop 3 requires at least Java 8. Here we use the OpenJDK 8 reference implementation; download it, put it in a shared folder, and extract it.

# extract and create a symlink
sudo tar -xvf openjdk-XXX_XXX.tar.gz -C /user/local
sudo ln -s /user/local/openjdk-XXX_XXX /user/local/openjdk-1.8
# add JAVA_HOME to the environment variables
sudo vim /etc/profile
# append the following
export JAVA_HOME=/user/local/openjdk-1.8
export PATH=$JAVA_HOME/bin:$PATH
# test
java -version
# java version "1.8.0_XXX"
# install SSH and pdsh
sudo apt install ssh
sudo apt install pdsh

Hadoop supports the following three modes of installation:

Local standalone mode

Pseudo-distributed (single-node) mode

Cluster (fully distributed) mode

Supported platforms: GNU/Linux is recommended; the Windows platform is not covered here.

2. Pseudo-distributed installation

2.1 Download and extract Hadoop

tar -xvf ./hadoop-3.X.X.tar.gz
ln -s ./hadoop-3.X.X ./hadoop
# add the Hadoop installation directory to PATH
vim .bashrc
# append the following at the end of the file
export HADOOP_HOME=/home/hadoop/hadoop
export HADOOP_MAPRED_HOME=/home/hadoop/hadoop
# PDSH_RCMD_TYPE fixes "pdsh@master: master: ssh exited with exit code 1"
export PDSH_RCMD_TYPE=ssh
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

Set Hadoop parameters such as JAVA_HOME

vim $HADOOP_HOME/etc/hadoop/hadoop-env.sh
# find the export JAVA_HOME line, remove the comment, and set it to your JAVA_HOME path
# The java implementation to use. By default, this environment
# variable is REQUIRED on ALL platforms except OS X!
export JAVA_HOME=/user/local/openjdk-1.8
# optionally set HADOOP_HOME and HADOOP_CONF_DIR
export HADOOP_HOME=/home/hadoop/hadoop
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
# optional JVM heap settings, equivalent to -Xms512m -Xmx1024m
export HADOOP_HEAPSIZE_MAX=1024m
export HADOOP_HEAPSIZE_MIN=512m

Check the hadoop command

hadoop version
# you should see output like: Hadoop 3.X.X

Local standalone mode

By default, Hadoop is configured to run in non-distributed mode as a single Java process, which is useful for debugging.

cd $HADOOP_HOME
mkdir input
cp etc/hadoop/*.xml input
# the regular expression matches words in all the XML files
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar grep input output 'dfs[a-z.]+'
cat output/*
# you should see output like:
# 1 dfsadmin

2.2 Pseudo-distributed mode

First, configure passwordless SSH login to localhost; skip this step if it is already configured:

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
# test
ssh localhost

Unless noted otherwise, the following commands are run from the Hadoop installation directory: cd $HADOOP_HOME

If you are not familiar with vim, you can use Visual Studio Code Remote to connect to the master server and open the Hadoop configuration directory. Reference: https://code.visualstudio.com/docs/remote/wsl

Edit the file etc/hadoop/core-site.xml (vim etc/hadoop/core-site.xml) and change its contents as follows:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

Edit the file etc/hadoop/hdfs-site.xml (vim etc/hadoop/hdfs-site.xml) and change its contents as follows:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

Running

# format the filesystem
bin/hdfs namenode -format
# start the NameNode and DataNode daemons
sbin/start-dfs.sh
# you should see output like:
# Starting namenodes on [localhost]
# Starting datanodes
# Starting secondary namenodes [master]

Access http://master:9870/ in a browser; replace master with the Ubuntu VM's IP address, or add the mapping to the host system's hosts file.
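If the host operating system cannot resolve the name master, a hosts entry like the following makes the web UI reachable by name (the IP is an example; use your VM's actual address):

# append to the host's hosts file (/etc/hosts on Linux/macOS,
# C:\Windows\System32\drivers\etc\hosts on Windows)
192.168.128.10 master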

test

# create HDFS directories; hadoop is your user name, and /user/hadoop is the HDFS equivalent of a Linux home directory
bin/hdfs dfs -mkdir /user
bin/hdfs dfs -mkdir /user/hadoop
# create input under /user/hadoop in HDFS
bin/hdfs dfs -mkdir input
# upload files to HDFS
bin/hdfs dfs -put etc/hadoop/*.xml input
# run Hadoop's sample program
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar grep input output 'dfs[a-z.]+'
# fetch the results to the local filesystem
bin/hdfs dfs -get output output
cat output/*
# or view them directly in HDFS:
bin/hdfs dfs -cat output/*
# the output differs from local standalone mode because of hdfs-site.xml:
# 1 dfsadmin
# 1 dfs.replication

2.3 Single-node YARN

File configuration: etc/hadoop/mapred-site.xml

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.application.classpath</name>
    <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
  </property>
</configuration>

File etc/hadoop/yarn-site.xml

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.env-whitelist</name>
    <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
  </property>
</configuration>

Use the jps command to see which services are running. You can also make a copy of the Hadoop configuration directory as a backup of the local standalone or pseudo-distributed configuration.
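As a quick sketch, assuming YARN is started with the stock script, the check and the configuration backup might look like this (the backup directory name is just an example):

sbin/start-yarn.sh
jps
# typical daemons (PIDs will differ): NameNode, DataNode, SecondaryNameNode,
# ResourceManager, NodeManager, Jps
# keep a copy of the pseudo-distributed configuration
cp -r etc/hadoop ~/hadoop-conf-pseudo-backup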

Stop HDFS and YARN

sbin/stop-yarn.sh
sbin/stop-dfs.sh

3. Cluster setup

Here the cluster is built with two VM virtual machines (master and worker1). worker2 is used in the next section to show how to add a new node to the Hadoop cluster; in principle the same approach scales to clusters of thousands of nodes.

After stopping HDFS and YARN, shut down the current virtual machine and make two copies of it, named worker1 and worker2. Because the copies will conflict with master's IP address, refer to the Ubuntu Server installation and settings article to give each one its own local IP address (for example, 192.168.128.11 and 192.168.128.12), set the hostname to worker1 and worker2 respectively, and then shut them down.
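On a recent Ubuntu Server, the hostname and static IP of each clone can be set roughly as follows; the interface name and netplan file path are assumptions, so follow the previous Ubuntu article for the exact steps:

# on the worker1 clone (repeat with 192.168.128.12 / worker2 on the second clone)
sudo hostnamectl set-hostname worker1
# edit the netplan file, e.g. /etc/netplan/00-installer-config.yaml:
# network:
#   version: 2
#   ethernets:
#     ens33:                          # interface name is an assumption
#       addresses: [192.168.128.11/24]
sudo netplan apply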

Start the master and worker1 virtual machines, log in to master, and configure the hosts file on each node (sudo vim /etc/hosts):

# replace with the actual IPs of your virtual machines
192.168.128.10 master
192.168.128.11 worker1
192.168.128.12 worker2
# distribute the hosts file to worker1, worker2, etc. (root login must be enabled)
scp /etc/hosts root@worker1:/etc/hosts
# check that the hosts can reach each other without a password,
# e.g. try ssh username@master, ssh username@worker1
ssh hadoop@worker1

3.1 Hadoop configuration on master, worker1, etc.

Commonly used configuration files:

hadoop-env.sh (bash script): environment variables used to run Hadoop; overrides variables set by the system
mapred-env.sh (bash script): environment variables used to run MapReduce; overrides hadoop-env.sh
yarn-env.sh (bash script): environment variables used to run YARN; overrides hadoop-env.sh
core-site.xml (XML): Hadoop Core configuration, such as I/O settings common to HDFS, MapReduce and YARN
hdfs-site.xml (XML): HDFS configuration: namenode, datanode, secondary namenode, etc.
mapred-site.xml (XML): MapReduce daemon configuration, such as the job history server
yarn-site.xml (XML): YARN daemon configuration: ResourceManager, NodeManager, web application proxy server, etc.
workers (plain text): machines that run datanodes, one per line (formerly named slaves)
log4j.properties (Java properties): log configuration file

Environment variables

Hadoop's environment variables are set through the bash scripts above; for example, JAVA_HOME, the JVM heap size, and the log directory. A short hadoop-env.sh sketch follows the list below.

JAVA_HOME: must be specified. Setting it in hadoop-env.sh is recommended so that the whole cluster uses the same JDK version.

HADOOP_HEAPSIZE_MAX: maximum JVM memory heap

HADOOP_HEAPSIZE_MIN: minimum JVM memory heap

HADOOP_LOG_DIR: log storage directory

HADOOP_HOME: the root directory of Hadoop

HADOOP_MAPRED_HOME: home directory of MapReduce
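As an illustration only (the log directory value below is an assumption, not from this article), these variables are typically set in hadoop-env.sh:

export JAVA_HOME=/user/local/openjdk-1.8
export HADOOP_HEAPSIZE_MAX=1024m
export HADOOP_HEAPSIZE_MIN=512m
export HADOOP_LOG_DIR=/home/hadoop/cluster/logs   # example path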

3.1.2 Hadoop configuration

File etc/hadoop/core-site.xml

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hadoop/cluster</value>
  </property>
</configuration>

The user running the Hadoop cluster must have read and write permission on the temporary directory configured above. Here the cluster folder under the user's home directory is used; storing these directories inside the Hadoop installation directory is not recommended.
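A quick way to create the directory and confirm the permissions, assuming the hadoop user and path above:

mkdir -p /home/hadoop/cluster
ls -ld /home/hadoop/cluster
# expect something like: drwxrwxr-x ... hadoop hadoop ... /home/hadoop/cluster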

File etc/hadoop/hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/hadoop/cluster/dfs/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/hadoop/cluster/dfs/data</value>
  </property>
  <property>
    <name>dfs.namenode.checkpoint.dir</name>
    <value>/home/hadoop/cluster/dfs/namesecondary</value>
  </property>
</configuration>

File etc/hadoop/yarn-site.xml

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.env-whitelist</name>
    <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
  </property>
</configuration>

File etc/hadoop/mapred-site.xml

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.application.classpath</name>
    <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
  </property>
</configuration>

File etc/hadoop/workers, which lists the hosts that run datanodes, one host name per line:

worker1

3.2 Run Hadoop

Copy Hadoop to worker1.

# optionally copy the environment variables in .bashrc
scp ~/.bashrc hadoop@worker1:~/.bashrc
# copy the Hadoop installation directory to worker1
scp -r ~/hadoop hadoop@worker1:~/hadoop

Format HDFS

# run this once on the master host only; a second run will cause errors
hdfs namenode -format

Start the namenode and datanodes

# run on master
hdfs --daemon start namenode
# run on master and each worker
hdfs --daemon start datanode
# with passwordless ssh configured you can instead run only this on master,
# which is equivalent to the two commands above (choose one approach)
start-dfs.sh

Use the jps command to see if the operation was successful

Services run by master:

8436 ResourceManager
8200 SecondaryNameNode
11275 Jps
10684 NameNode

Services run by worker1:

5267 Jps
4966 DataNode
5143 NodeManager

Visit http://master:9870/dfshealth.html in a browser; you should see the datanode listed.

Start Yarn:

# run the following on the ResourceManager node, which here is master
yarn --daemon start resourcemanager
yarn --daemon start nodemanager
# or use the script, which is equivalent to the two commands above (choose one approach)
start-yarn.sh

Use a browser to access http://master:8088/cluster

Start JobHistory Server

The JobHistory Server records the MapReduce jobs that have been run and stores them in an HDFS directory. The default configuration is sufficient, which is why nothing was configured for it above.
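A quick way to confirm it is running (an illustrative check, not from the original article):

jps | grep JobHistoryServer
# the web UI listens on port 19888 by default
curl -s -o /dev/null -w "%{http_code}\n" http://master:19888/jobhistory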

# run on master
mapred --daemon start historyserver

Access http://master:19888/jobhistory in a browser.

When start-all.sh is run on master, all configured services on all nodes are started according to the configuration files.

test

# create HDFS directories; hadoop is your user name, and /user/hadoop is the HDFS equivalent of a Linux home directory
bin/hdfs dfs -mkdir /user
bin/hdfs dfs -mkdir /user/hadoop
# create input under /user/hadoop in HDFS
bin/hdfs dfs -mkdir input
# upload files to HDFS
bin/hdfs dfs -put etc/hadoop/*.xml input
# run Hadoop's sample program
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar grep input output 'dfs[a-z.]+'
# fetch the results to the local filesystem
bin/hdfs dfs -get output output
cat output/*
# or view them directly in HDFS:
bin/hdfs dfs -cat output/*
# 1 dfsadmin
# 1 dfs.replication

3.3 Stop Hadoop

Stop the namenode and datanodes

# run on master
hdfs --daemon stop namenode
# run on master and each worker
hdfs --daemon stop datanode
# with passwordless ssh configured you can instead run only this on master,
# which is equivalent to the two commands above (choose one approach)
stop-dfs.sh

Stop YARN:

# run the following on the ResourceManager node, which here is master
yarn --daemon stop resourcemanager
yarn --daemon stop nodemanager
# or use the script, which is equivalent to the two commands above (choose one approach)
stop-yarn.sh

Stop the JobHistory Server

# run on master
mapred --daemon stop historyserver

Executing stop-all.sh on master shuts down all services on all nodes in the cluster.

4. Dynamically add or delete nodes

Start the worker2 virtual machine (see the virtual machine setup in section 3, Cluster setup), and start the master and worker1 virtual machines.

4.1 Add worker2 to the cluster

Start the Hadoop cluster on master and worker1:

# execute on the master host
start-all.sh

In the Hadoop workers configuration file on master, add worker2 as a datanode, then use scp to synchronize the file to worker1 and worker2 (a sync sketch follows the file contents below). TODO: look into using ZooKeeper to keep configuration files synchronized.

worker1
worker2
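A sketch of the synchronization, assuming the paths used earlier in this article:

scp ~/hadoop/etc/hadoop/workers hadoop@worker1:~/hadoop/etc/hadoop/workers
scp ~/hadoop/etc/hadoop/workers hadoop@worker2:~/hadoop/etc/hadoop/workers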

Log in to worker2

# start the datanode and nodemanager on worker2 only; for a newly added node it is enough
# to start the daemons locally rather than restarting the cluster with the scripts
hdfs --daemon start datanode
yarn --daemon start nodemanager
# refresh the datanode and yarn node lists
hdfs dfsadmin -refreshNodes
yarn rmadmin -refreshNodes

Visit http://master:9870/dfshealth.html in a browser; you can see that worker2 has successfully joined the cluster.

4.2 Move worker1 out of the cluster

The Hadoop workers configuration file is what scripts such as start-all.sh and stop-all.sh use to send operations to the entire cluster, for example to start or stop every node. Hadoop's namenode daemon does not use the workers file.

Which nodes are allowed to connect to the namenode is actually configured through hdfs-site.xml. Without that configuration, any node may connect by default, so strictly speaking the configuration above is incomplete: all nodes in the cluster should be explicitly managed.

Use stop-all.sh to shut down the cluster. On the master machine, create a new hosts.includes file listing the nodes that are allowed to connect. Its contents are as follows:

worker1
worker2

The following configuration is added to the file hdfs-site.xml:

<property>
  <name>dfs.hosts</name>
  <value>/home/hadoop/hadoop-3.2.1/etc/hadoop/hosts.includes</value>
</property>

File yarn-site.xml

<property>
  <name>yarn.resourcemanager.nodes.include-path</name>
  <value>/home/hadoop/hadoop-3.2.1/etc/hadoop/hosts.includes</value>
</property>

Use scp to synchronize these files to worker1 and worker2, then start the Hadoop cluster. If a worker3 later joins the cluster in the same way worker2 was added and cannot connect to the namenode, add it to the hosts.includes file as in the first step above and start it again.

cd ~/hadoop-3.2.1/etc/hadoop
scp ./* xian@worker1:~/hadoop-3.2.1/etc/hadoop
scp ./* xian@worker2:~/hadoop-3.2.1/etc/hadoop
# start the cluster
start-all.sh

The easiest way to remove a node from the Hadoop cluster is to shut down the node directly:

# wrong approach
hdfs --daemon stop datanode

HDFS is fault-tolerant: in a cluster with multiple replicas, shutting down one or two nodes will not lose any data. Still, this is not recommended. In this article each file has only one replica, so simply shutting down a datanode makes that node's data disappear from the cluster. The correct approach is to tell the namenode which datanode should leave; the Hadoop daemons then copy that node's data to other nodes, and only after the node reaches the Decommissioned state can it be removed.

Correct example:

Step 1: tell Hadoop that a node is to be removed. Removal is controlled through an exclude file, configured via the dfs.hosts.exclude and yarn.resourcemanager.nodes.exclude-path properties. Create a new hosts.excludes file and add worker1 to it, meaning worker1 is to be excluded:

worker1

The following configuration is added to the file hdfs-site.xml:

<property>
  <name>dfs.hosts.exclude</name>
  <value>/home/hadoop/hadoop-3.2.1/etc/hadoop/hosts.excludes</value>
</property>

File yarn-site.xml

<property>
  <name>yarn.resourcemanager.nodes.exclude-path</name>
  <value>/home/hadoop/hadoop-3.2.1/etc/hadoop/hosts.excludes</value>
</property>

Use scp to synchronize to worker1, worker2.

Step 2: have the namenode and resourcemanager refresh their node lists; the data on the node being removed is copied to other nodes, the namenode metadata is updated, and so on.

hdfs dfsadmin -refreshNodes
# Refresh nodes successful
yarn rmadmin -refreshNodes
# INFO client.RMProxy: Connecting to ResourceManager at master/192.168.128.10:8033
# rebalance the data
start-balancer.sh

Check whether the datanode has been decommissioned: run hdfs dfsadmin -report, or look at the web page, and confirm that the worker is in the Decommissioned state.

Decommission Status: Decommissioned

In Service: operating normally

Decommissioning: replicating data

Decommissioned: data replication is complete. You can remove this node.

Web page: http://master:9870/dfshealth.html#tab-datanode; refresh and locate worker1.

Check whether the node has been removed from YARN's ResourceManager: run yarn node -list -all and check the node status:

worker1:44743   DECOMMISSIONED   worker1:8042   0

RUNNING: running

DECOMMISSIONED: can be removed

Web page refresh: http://master:8088/cluster/nodes/decommissioned also displays Decommissioned

A problem you may hit with a one-master, two-worker cluster: worker1's datanode never shows Decommissioned. After adding more workers, worker1 reached the Decommissioned state normally. Possible reasons for the failure with one master and two workers:

dfs.replication: I previously set this property to 2 and later changed it to 1, but the change may not have taken effect. After worker1 exits, the number of datanodes in the cluster drops below 2, and a node can only be decommissioned while the number of remaining datanodes is at least the dfs.replication value. TODO: supplement the command to refresh dfs.replication (a possible approach is sketched after the tip below).

The HDFS cluster configuration may not allow only the worker2 node to remain after worker1 is removed. TODO: check the source code.

Safe mode is not the cause; it can be entered or left with hdfs dfsadmin -safemode enter/leave.

Tip: for clusters with a lot of data, the Decommissioning state lasts longer while data is moved, so you can set dfs.replication to a relatively small value to reduce the amount of data copied (see the sketch below). TODO: to be tested.
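One untested possibility for the dfs.replication TODO above (my own suggestion, not from the article): files that already exist in HDFS keep their replication factor until it is changed explicitly, while new files follow the dfs.replication value in hdfs-site.xml, so the existing data can be lowered with setrep:

# set the replication of everything under / to 1 and wait for it to finish
hdfs dfs -setrep -w 1 /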

Step 3: delete the node records to be removed from hosts.includes, hosts.excludes, and workers. Synchronize the configuration file, and then run:

hdfs dfsadmin -refreshNodes
yarn rmadmin -refreshNodes
# rebalance the data
start-balancer.sh

Step 4: shut down worker1

hdfs --daemon stop datanode
yarn --daemon stop nodemanager
sudo poweroff

When hosts.includes and hosts.excludes are configured through hdfs-site.xml, whether a node may connect depends on whether it appears in each file (the first and second columns below), citing Hadoop: The Definitive Guide (Chinese edition), p. 333:

In includes | In excludes | Result
No          | No          | Node may not connect
No          | Yes         | Node may not connect
Yes         | No          | Node may connect
Yes         | Yes         | Node may connect and will be decommissioned

Error encountered:

ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: RECEIVED SIGNAL 15: SIGTERM

The cause was that both master and worker1 had been treated as workers; although startup appeared to succeed, in the end master was simply not added to the workers file as a datanode.

This is the end of "Installation and configuration of Hadoop". Thank you for reading, and I hope you were able to get something out of it!
