2025-01-20 Update From: SLTechnology News&Howtos (shulou), Internet Technology
Shulou(Shulou.com)06/03 Report--
I. Preparation before installation

1. Software packages

Package versions used: CentOS 7.2 x64 (similar to Red Hat), Hadoop 2.8.4, VMware Workstation Pro 12.0, JDK 1.8.

2. System environment preparation

(1) Turn off the firewall and SELinux
This step is not strictly necessary; it simply prevents unnecessary trouble caused by firewall rules and SELinux restrictions during the initial setup.
First turn off the firewall:
# turn off the firewall and disable it at boot
systemctl stop firewalld
systemctl disable firewalld
Then close selinux:
# check the current SELinux status; getenforce prints "Enforcing" if it is enabled
getenforce
# first change the SELinux configuration file so that SELinux stays off after a reboot
sed -i 's/SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config
# then change the SELinux status of the current system; 0 puts it in permissive mode
setenforce 0

(2) Configure IP and hostname
The virtual machine uses NAT network mode, with the network segment 192.168.50.0/24 and the gateway 192.168.50.2.
Configure the virtual machine with the static IP 192.168.50.121:
# first, stop NetworkManager; it often interferes with network configuration
[root@localhost software]# systemctl stop NetworkManager
[root@localhost software]# systemctl disable NetworkManager

# then modify the network card configuration file
[root@localhost software]# cat /etc/sysconfig/network-scripts/ifcfg-ens33
TYPE=Ethernet
PROXY_METHOD=none
BROWSER_ONLY=no
#BOOTPROTO=dhcp
BOOTPROTO=static
DEFROUTE=yes
IPV4_FAILURE_FATAL=no
IPV6INIT=yes
IPV6_AUTOCONF=yes
IPV6_DEFROUTE=yes
IPV6_FAILURE_FATAL=no
IPV6_ADDR_GEN_MODE=stable-privacy
NAME=ens33
UUID=83b16941-ca72-46ad-85d4-c32929147098
DEVICE=ens33
ONBOOT=yes
IPADDR=192.168.50.121
GATEWAY=192.168.50.2
NETMASK=255.255.255.0
DNS1=192.168.50.2

# finally, restart the network service
[root@localhost software]# systemctl restart network

# check whether the IP was set successfully
[root@localhost software]# ip addr show dev ens33
2: ens33: mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:0c:29:ea:b1:f7 brd ff:ff:ff:ff:ff:ff
    inet 192.168.50.121/24 brd 192.168.50.255 scope global ens33
       valid_lft forever preferred_lft forever
    inet6 fe80::20c:29ff:feea:b1f7/64 scope link
       valid_lft forever preferred_lft forever
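As a quick sanity check, the address reported by `ip addr` can be verified with a small helper. This is a sketch only: `has_ip` is our own hypothetical function, and it simply greps the command's text output, so it can also be tested against captured output.

```shell
# has_ip: check that the given "ip addr" output text contains the expected
# IPv4 address (with any prefix length). A hypothetical helper, not part of hadoop.
has_ip() {
    echo "$1" | grep -q "inet $2/"
}

# example against the output shown above
sample='inet 192.168.50.121/24 brd 192.168.50.255 scope global ens33'
has_ip "$sample" 192.168.50.121 && echo "static IP configured"
```

On a live machine you would call `has_ip "$(ip addr show dev ens33)" 192.168.50.121`.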
Modify the hostname to bigdata121:
[root@bigdata121 software]# hostnamectl set-hostname bigdata121
# this command changes both the hostname of the running system and the /etc/hostname configuration file

(3) Install the JDK
As we all know, hadoop is developed in Java, so it needs a JDK to run. Let's install the JDK first.
# create two folders: software stores the compressed packages, modules stores the extracted programs
[root@localhost software]# mkdir /opt/{software,modules}

# extract the JDK to the modules directory
[root@localhost software]# tar zxf jdk-8u144-linux-x64.tar.gz -C /opt/modules/

# check whether another JDK version is installed locally; CentOS usually ships
# with OpenJDK by default, which is best uninstalled first
[root@localhost software]# rpm -qa | grep java
[root@localhost software]# rpm -e xxxx

# configure the JDK PATH environment variable, for example:
[root@localhost software]# vim /etc/profile.d/java.sh
export JAVA_HOME=/opt/modules/jdk1.8.0_144
export PATH=$PATH:$JAVA_HOME/bin

# make the new environment variables take effect
[root@bigdata121 opt]# source /etc/profile.d/java.sh

# check whether the environment variables are configured successfully;
# output like the following means success
[root@bigdata121 opt]# java -version
java version "1.8.0_144"
Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
[root@bigdata121 opt]# javac -version
javac 1.8.0_144

II. Install and configure hadoop

1. Install the hadoop program

# upload the hadoop package to software, then unzip it to modules
[root@bigdata123 opt]# tar zxf software/hadoop-2.8.4.tar.gz -C /opt/modules/

# set the hadoop environment variables: add the bin and sbin directories to PATH,
# e.g. in a profile script such as /etc/profile.d/hadoop.sh, then source it as above
[root@bigdata121 hadoop-2.8.4]# vim /etc/profile.d/hadoop.sh
export HADOOP_HOME=/opt/modules/hadoop-2.8.4
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

# check the installation; output like the following means everything is OK
[root@bigdata121 hadoop-2.8.4]# hadoop version
Hadoop 2.8.4
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r 17e75c2a11685af3e043aa5e604dc831e5b14674
Compiled by jdu on 2018-05-08T02:50Z
Compiled with protoc 2.5.0
From source with checksum b02a59bb17646783210e979bea443b0
This command was run using /opt/modules/hadoop-2.8.4/share/hadoop/common/hadoop-common-2.8.4.jar

2. Hadoop components
The hadoop project mainly consists of three major components, yarn, MapReduce and HDFS, plus the Common auxiliary toolkit.
(1) yarn
A framework for job scheduling and cluster resource management, which contains the following members:
ResourceManager (RM): handles client requests, starts and monitors the ApplicationMasters, monitors the NodeManagers, and performs resource allocation and scheduling.
NodeManager (NM): manages the resources of a single node and processes commands from the ResourceManager and the ApplicationMaster.
ApplicationMaster (AM): splits the input data, requests resources for the application and assigns them to internal tasks, and handles task monitoring and fault tolerance. It is effectively where the driver program of a MapReduce job runs.
Container: an abstraction of the task runtime environment that encapsulates multi-dimensional resources such as CPU and memory, along with environment variables, startup commands, and other task-related information. It is responsible for running the map or reduce tasks.

(2) HDFS
A distributed file system suited to write-once, read-many scenarios. It contains the following members:
NameNode (NN): stores file metadata such as the file name, directory structure and file attributes, along with the block list of each file and the DataNodes where those blocks are located. It responds to client read and write requests to HDFS and keeps the read/write log.
DataNode (DN): stores block data, together with block checksums, in the local file system.
SecondaryNameNode (SNN): an auxiliary daemon that monitors the status of HDFS and takes snapshots of the HDFS metadata at regular intervals, which is roughly equivalent to backing up the NameNode.

3. The three installation modes of hadoop
There are three installation modes: local mode, pseudo-distributed mode, and fully distributed mode.
(1) Local mode
In this mode the local file system is used directly as the storage system; there is no HDFS and no yarn. It can only be used to test MR programs, which read their input directly from the local file system.
Steps:
# configure /opt/modules/hadoop-2.8.4/etc/hadoop/hadoop-env.sh
# This change mainly guards against JAVA_HOME not being set in the system
# environment: the script does not check whether JAVA_HOME is empty, so it is
# best to set it explicitly. If it is definitely configured system-wide,
# there is no need to change it.
[root@bigdata121 hadoop]# vim hadoop-env.sh

export JAVA_HOME=${JAVA_HOME}
changed to:
export JAVA_HOME=/opt/modules/jdk1.8.0_144
There is no need to start any hadoop services in this mode, because only the MR framework is used; neither HDFS nor yarn is required.
Test with a wordcount instance:
[root@bigdata121 opt]# mkdir testdir
# create a source data file for the word count
[root@bigdata121 opt]# echo "i am wang, so you are wang too. I just jin tao king." > testdir/name.txt

# run wordcount: hadoop jar <target jar package> <main class name> <input file> <output directory>
[root@bigdata121 opt]# cd modules/hadoop-2.8.4/share/hadoop/mapreduce/
[root@bigdata121 mapreduce]# hadoop jar hadoop-mapreduce-examples-2.8.4.jar wordcount /opt/testdir/name.txt /opt/testdir/name.txt.output

# view the word frequency statistics
[root@bigdata121 mapreduce]# cat /opt/testdir/name.txt.output/part-r-00000
. 2
am 1
are 1
i 2
jin 1
just 1
king 1
so 1
tao 1
too 1
wang 2
you 1
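A rough local cross-check of the word counts can be done with coreutils alone, which is handy for sanity-checking a wordcount run. This is a sketch: `wordcount_local` is our own hypothetical helper; it splits purely on whitespace, so punctuation may stick to words and the counts can differ slightly from the MR output above.

```shell
# wordcount_local: split a file on whitespace, count occurrences, and print
# "word count" pairs in sorted order, mimicking the layout of part-r-00000.
wordcount_local() {
    tr -s '[:space:]' '\n' < "$1" | grep -v '^$' | sort | uniq -c | awk '{print $2, $1}'
}

echo "i am wang, so you are wang too. I just jin tao king." > /tmp/name.txt
wordcount_local /tmp/name.txt
```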
Note: if pseudo-distributed or fully distributed mode is running at the same time, local mode cannot be used, because files are then looked up in the HDFS cluster by default rather than in the local file system.
(2) pseudo-distributed mode
This mode deploys a complete hadoop environment, with yarn, HDFS and MapReduce capabilities, on a single machine.
There are mainly the following components and roles:
Component: Roles
HDFS: NameNode, SecondaryNameNode, DataNode
YARN: ResourceManager, NodeManager
The configuration files that need to be modified are found in $HADOOP_HOME/etc/hadoop/:
1 > core-site.xml
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://bigdata121:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/modules/hadoop-2.8.4/data/tmp</value>
    </property>
</configuration>
2 > hdfs-site.xml
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>bigdata121:50090</value>
    </property>
    <property>
        <name>dfs.permissions</name>
        <value>false</value>
    </property>
</configuration>
3 > yarn-site.xml
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>bigdata121</value>
    </property>
    <property>
        <name>yarn.log-aggregation-enable</name>
        <value>true</value>
    </property>
    <property>
        <name>yarn.log-aggregation.retain-seconds</name>
        <value>604800</value>
    </property>
</configuration>
4 > mapred-site.xml
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>bigdata121:10020</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>bigdata121:19888</value>
    </property>
</configuration>
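All four files share the same <configuration>/<property> shape, so they can be generated from a script when setting up several nodes. A minimal sketch for core-site.xml, using the values from this article; the output directory /tmp/hadoop-conf is for illustration only (on a real node it would be $HADOOP_HOME/etc/hadoop):

```shell
# generate core-site.xml with the values used in this article
CONF_DIR=/tmp/hadoop-conf          # illustration only; really $HADOOP_HOME/etc/hadoop
mkdir -p "$CONF_DIR"
cat > "$CONF_DIR/core-site.xml" <<'EOF'
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://bigdata121:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/modules/hadoop-2.8.4/data/tmp</value>
    </property>
</configuration>
EOF
```

The same heredoc pattern extends to the other three files.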
After modifying the configuration file, format the NameNode to generate the environment required by hdfs:
[root@bigdata121 hadoop-2.8.4]# hdfs namenode -format

Note: do not format repeatedly, otherwise the HDFS cluster will fail to start later. If you do need to format again, first delete the original HDFS data directory manually.
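To avoid the repeated-format trap just described, a small guard can be wrapped around the command. This is a sketch: `safe_format` is our own hypothetical helper, and the final echo stands in for the real `hdfs namenode -format`.

```shell
# safe_format: refuse to format when the data directory already exists,
# so an existing cluster is never clobbered by accident.
safe_format() {
    datadir="$1"
    if [ -d "$datadir" ]; then
        echo "refusing to format: $datadir already exists" >&2
        return 1
    fi
    echo "would run: hdfs namenode -format"   # replace echo with the real command
}

# usage with the hadoop.tmp.dir configured above:
# safe_format /opt/modules/hadoop-2.8.4/data/tmp
```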
Then start the cluster:
# start the HDFS-related processes such as NameNode and DataNode;
# the corresponding stop script is stop-dfs.sh
[root@bigdata121 hadoop-2.8.4]# start-dfs.sh

# start yarn; the corresponding stop script is stop-yarn.sh
[root@bigdata121 hadoop-2.8.4]# start-yarn.sh

There is also a script that starts all the cluster services together, but it has been deprecated and is not recommended; use the separate scripts above instead.
Use jps to see whether the relevant processes have started:
[root@bigdata121 sbin]# jps
15649 SecondaryNameNode
15493 DataNode
18278 Jps
15368 NameNode
15804 ResourceManager
15901 NodeManager

As you can see, everything has started normally.
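Checking jps output by eye works on one node, but across several nodes a helper is convenient. This is a sketch: `check_daemons` is our own hypothetical function, and it works on jps output passed in as text, so it can be tested offline.

```shell
# check_daemons: given jps output and a list of expected daemon names,
# report any that are missing; returns non-zero if anything is absent.
check_daemons() {
    jps_out="$1"; shift
    missing=0
    for d in "$@"; do
        echo "$jps_out" | grep -qw "$d" || { echo "missing: $d"; missing=1; }
    done
    return $missing
}

# usage on a live node:
# check_daemons "$(jps)" NameNode DataNode SecondaryNameNode ResourceManager NodeManager
```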
You can also view the related administrative pages in a web browser:
http://bigdata121:8088   yarn management page
http://bigdata121:50070  HDFS management page

(3) Fully distributed mode
Deploy a complete hadoop environment, with yarn, HDFS and MapReduce capabilities, on at least 3 machines.
Before configuring fully distributed mode, note that starting each process (datanode, namenode, etc.) would otherwise require manually entering the password of the machine the process runs on, and again when shutting down, which is far too cumbersome. Therefore, passwordless key-based login is generally configured between the cluster machines, so no passwords need to be entered manually.
First, prepare 3 machines: because HDFS defaults to 3 replicas, at least 3 machines are usually used. You can clone two more virtual machines from the pseudo-distributed one above. The ip of each host and the hadoop processes it will run are planned as follows:
bigdata121: ip 192.168.50.121/24; HDFS: NameNode, SecondaryNameNode; YARN: ResourceManager
bigdata122: ip 192.168.50.122/24; HDFS: DataNode; YARN: NodeManager
bigdata123: ip 192.168.50.123/24; HDFS: DataNode; YARN: NodeManager
1 > Configure passwordless login between the three machines (including logging in to themselves). Execute the following commands on each of the three hosts:
# generate a key pair
[root@bigdata122 ~]# ssh-keygen
# copy the public key into the USER_HOME/.ssh directory of all three machines
[root@bigdata122 ~]# ssh-copy-id bigdata121
[root@bigdata122 ~]# ssh-copy-id bigdata122
[root@bigdata122 ~]# ssh-copy-id bigdata123
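Since the same command runs against all three hosts, a loop keeps it consistent. A sketch only: echo prints the commands instead of executing them, so the list can be reviewed (or piped to sh) before anything runs.

```shell
# print the key-distribution command for every host in the cluster
for host in bigdata121 bigdata122 bigdata123; do
    echo ssh-copy-id "$host"
done
```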
2 > configure three machines for time synchronization
Because the machines are in the same cluster, their clocks need to agree; otherwise problems arise easily and the cluster can misbehave. Use ntp to synchronize the time:
# if the machine can reach the Internet
[root@bigdata121 ~]# /usr/sbin/ntpdate cn.ntp.org.cn
# if it cannot, set the time manually
[root@bigdata121 ~]# date -s "xxxx"
3 > profile modification
The main configuration files are the same as in pseudo-distributed mode, except that the number of replicas (dfs.replication) is changed to 2. In addition, the following file needs to be modified.
etc/hadoop/slaves is mainly used to specify the DataNode hosts (either hostnames or ips can be used, but make sure the hostnames can be resolved), for example:
# add the datanode ips or hostnames
[root@bigdata121 hadoop-2.8.4]# cat etc/hadoop/slaves
bigdata122
bigdata123
Then format the NameNode just as in pseudo-distributed mode.
4 > start the cluster
The start scripts only need to be run on the machine hosting the NameNode and the ResourceManager; they will start the corresponding DataNode and NodeManager processes on the other machines (in fact, the script simply starts the services on the other hosts remotely over ssh, which is why passwordless login was configured).
NameNode:
[root@bigdata121 hadoop-2.8.4]# start-dfs.sh
Starting namenodes on [bigdata121]
bigdata121: starting namenode, logging to /opt/modules/hadoop-2.8.4/logs/hadoop-root-namenode-bigdata121.out
bigdata123: starting datanode, logging to /opt/modules/hadoop-2.8.4/logs/hadoop-root-datanode-bigdata123.out
bigdata122: starting datanode, logging to /opt/modules/hadoop-2.8.4/logs/hadoop-root-datanode-bigdata122.out
Starting secondary namenodes [bigdata121]
bigdata121: starting secondarynamenode, logging to /opt/modules/hadoop-2.8.4/logs/hadoop-root-secondarynamenode-bigdata121.out

As the log shows, the corresponding process services are started on all three hosts.
ResourceManager:
[root@bigdata121 hadoop-2.8.4]# start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /opt/modules/hadoop-2.8.4/logs/yarn-root-resourcemanager-bigdata121.out
bigdata123: starting nodemanager, logging to /opt/modules/hadoop-2.8.4/logs/yarn-root-nodemanager-bigdata123.out
bigdata122: starting nodemanager, logging to /opt/modules/hadoop-2.8.4/logs/yarn-root-nodemanager-bigdata122.out
Note: in this plan the NameNode and ResourceManager live on the same machine, so both scripts run there; if they are not on the same machine, the start scripts must be run separately on their respective machines.
5 > problems encountered in the configuration process
A common problem is that the datanode process fails to start on a DataNode machine. Looking at the startup log under HADOOP_HOME/logs/xxx.log shows that the clusterIDs of the namenode and the datanode differ, which prevents startup. This usually happens after a repeated format, which generates a new clusterID. The solution is to delete the namenode data directory and reformat.
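The clusterID mismatch can be confirmed without reading whole logs by comparing VERSION files directly. A sketch: `cluster_id` and `same_cluster` are our own hypothetical helpers, and the VERSION paths on a real node depend on the name/data directory layout under hadoop.tmp.dir.

```shell
# cluster_id: extract the clusterID value from a hadoop VERSION file
cluster_id() {
    awk -F= '/^clusterID=/ {print $2}' "$1"
}

# same_cluster: succeed only if two VERSION files record the same clusterID
same_cluster() {
    [ "$(cluster_id "$1")" = "$(cluster_id "$2")" ]
}

# usage on a live node (paths depend on your hadoop.tmp.dir layout):
# same_cluster <namenode .../current/VERSION> <datanode .../current/VERSION> || echo "clusterID mismatch"
```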