How to deploy Hadoop Cluster in Linux


This article explains how to deploy a Hadoop cluster on Linux. The editor finds it very practical and shares it here as a reference; follow along to have a look.

Hadoop is a distributed system infrastructure developed by the Apache Foundation. Users can develop distributed programs without knowing the low-level details of the distribution, making full use of the power of the cluster for high-speed computing and storage.

I. A brief introduction to the Hadoop framework

The core of Hadoop's design is HDFS and MapReduce: HDFS provides storage for massive amounts of data, while MapReduce provides computation over that data.

HDFS (Hadoop Distributed File System) has the following main features:

HDFS stores files in blocks of at least 64 MB, much larger than the 4 KB~32 KB blocks used by most other file systems.

HDFS trades latency for throughput: it handles streaming reads of large files efficiently, but it is not good at serving seek-heavy requests for many small files.

HDFS is optimized for the typical "write once, read many" workload.

Each storage node runs a process called the DataNode, which manages all of the data blocks on that host. The storage nodes are coordinated by a master process called the NameNode, which usually runs on a separate node.

Unlike disk arrays, which handle disk failures with physical redundancy or similar strategies, HDFS handles failures with replicas: each block of a file is stored on several nodes in the cluster, and the NameNode constantly monitors the reports sent by every DataNode.
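As a quick illustration (a sketch that assumes the cluster built later in this article is already running and that the hdfs command is available under /home/hadoop/hadoop/bin), the block and replica placement described above can be inspected from the command line; the /demo path and file name are only examples:

# upload a small file to HDFS (example paths, adjust as needed)
/home/hadoop/hadoop/bin/hdfs dfs -mkdir -p /demo
/home/hadoop/hadoop/bin/hdfs dfs -put /etc/hosts /demo/hosts
# list the blocks of the file and the DataNodes holding each replica
/home/hadoop/hadoop/bin/hdfs fsck /demo/hosts -files -blocks -locations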

1. The working principle of MapReduce

Client: submits the MapReduce job.

JobTracker: coordinates the running of the job; the JobTracker is a Java application whose main class is JobTracker.

TaskTracker: runs the tasks the job has been split into; the TaskTracker is a Java application whose main class is TaskTracker.
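For example, a client submits a job simply by running a jar against the cluster. The sketch below uses the wordcount example bundled with the Hadoop 2.7.3 binary distribution installed later in this article; the /demo input and output paths are illustrative only:

# submit the bundled wordcount example job from the client
/home/hadoop/hadoop/bin/hadoop jar \
    /home/hadoop/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar \
    wordcount /demo/hosts /demo/wordcount-out
# the output directory must not already exist; view the result afterwards
/home/hadoop/hadoop/bin/hdfs dfs -cat /demo/wordcount-out/part-r-00000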

2. Advantages of Hadoop

Hadoop is a distributed computing platform that is easy for users to set up and use. Users can easily develop and run applications that process huge amounts of data on Hadoop. Its main advantages are:

High reliability: Hadoop's bit-by-bit storage and processing of data can be trusted.

High scalability: Hadoop distributes data and computing tasks across clusters of available computers, which can easily be extended to thousands of nodes.

High efficiency: Hadoop can move data dynamically between nodes and keeps each node dynamically balanced, so processing is very fast.

High fault tolerance: Hadoop can automatically save multiple copies of data and automatically reassign failed tasks.

Low cost: compared with all-in-one machines, commercial data warehouses, and data marts such as QlikView and Yonghong Z-Suite, Hadoop is open source, so the software cost of a project is greatly reduced.

Hadoop comes with a framework written in the Java language, so it is ideal to run on the Linux production platform. Applications on Hadoop can also be written in other languages, such as C++.

Hadoop official website: http://hadoop.apache.org/

II. Prerequisites

To keep the configuration environment of each node in the Hadoop cluster consistent, install Java and configure SSH on every node.
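Once Java and SSH are in place, a quick consistency check across the nodes might look like the following sketch (the hostnames come from the table below; adjust them to your own environment):

# confirm that every node reports the same Java version
for host in linux-node1 linux-node2 linux-node3 linux-node4; do
    echo "== $host =="
    ssh "$host" 'java -version' 2>&1 | head -1
done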

Experimental environment:

Platform: Xen VM

OS: CentOS 6.8

Software: hadoop-2.7.3.tar.gz, jdk-8u101-linux-x64.rpm

Hostname      IP Address     OS version   Hadoop role   Node role
linux-node1   192.168.0.89   CentOS 6.8   Master        namenode
linux-node2   192.168.0.90   CentOS 6.8   Slave         datanode
linux-node3   192.168.0.91   CentOS 6.8   Slave         datanode
linux-node4   192.168.0.92   CentOS 6.8   Slave         datanode

# download the required software package and upload it to each node of the cluster

III. Structure and installation of the cluster

1. Hosts file settings

# the hosts file of each node in the Hadoop cluster needs to be modified

[root@linux-node1 ~]# cat /etc/hosts
127.0.0.1    localhost localhost.localdomain linux-node1
192.168.0.89 linux-node1
192.168.0.90 linux-node2
192.168.0.91 linux-node3
192.168.0.92 linux-node4
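If root SSH access between the nodes happens to be available, the same hosts file can be pushed out from linux-node1 instead of being edited by hand on every node; this is only a convenience sketch and assumes root logins are permitted:

# copy the hosts file from the master to each slave (assumes root SSH access)
for ip in 192.168.0.90 192.168.0.91 192.168.0.92; do
    scp /etc/hosts root@"$ip":/etc/hosts
done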

2. Install java

# upload the downloaded JDK (RPM package) to the server in advance, and then install it

rpm -ivh jdk-8u101-linux-x64.rpm
export JAVA_HOME=/usr/java/jdk1.8.0_101/
export PATH=$JAVA_HOME/bin:$PATH
# java -version
java version "1.8.0_101"
Java(TM) SE Runtime Environment (build 1.8.0_101-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.101-b13, mixed mode)

3. Install hadoop

# create a hadoop user and set it to use sudo

[root@linux-node1 ~]# useradd hadoop && echo hadoop | passwd --stdin hadoop
[root@linux-node1 ~]# echo "hadoop ALL=(ALL) NOPASSWD:ALL" >> /etc/sudoers
[root@linux-node1 ~]# su - hadoop
[hadoop@linux-node1 ~]$ cd /usr/local/src/
[hadoop@linux-node1 src]$ wget http://apache.fayea.com/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
[hadoop@linux-node1 src]$ sudo tar zxvf hadoop-2.7.3.tar.gz -C /home/hadoop/ && cd /home/hadoop
[hadoop@linux-node1 home/hadoop]$ sudo mv hadoop-2.7.3/ hadoop
[hadoop@linux-node1 home/hadoop]$ sudo chown -R hadoop:hadoop hadoop/

# add the binary directory of hadoop to the PATH variable and set the HADOOP_HOME environment variable

[hadoop@linux-node1 home/hadoop]$ export HADOOP_HOME=/home/hadoop/hadoop/
[hadoop@linux-node1 home/hadoop]$ export PATH=$HADOOP_HOME/bin:$PATH
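Note that the two export commands above only last for the current shell session. To make them survive a re-login, they can be appended to the hadoop user's ~/.bashrc, for example:

# persist HADOOP_HOME and PATH for the hadoop user
cat >> ~/.bashrc <<'EOF'
export HADOOP_HOME=/home/hadoop/hadoop/
export PATH=$HADOOP_HOME/bin:$PATH
EOF
source ~/.bashrc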

4. Create hadoop-related directories

[hadoop@linux-node1 ~]$ mkdir -p /home/hadoop/dfs/{name,data}
[hadoop@linux-node1 ~]$ mkdir -p /home/hadoop/tmp

# Node storage data backup directory

sudo mkdir -p /data/hdfs/{name,data}
sudo chown -R hadoop:hadoop /data/

# the above operations need to be performed on every node of the hadoop cluster

5. SSH configuration

# set the cluster master node to log in to other nodes without password

[hadoop@linux-node1 ~]$ ssh-keygen -t rsa
[hadoop@linux-node1 ~]$ ssh-copy-id hadoop@192.168.0.90
[hadoop@linux-node1 ~]$ ssh-copy-id hadoop@192.168.0.91
[hadoop@linux-node1 ~]$ ssh-copy-id hadoop@192.168.0.92

# Test ssh login
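For example, the following loop should print each slave's hostname without prompting for a password (a minimal check, assuming the keys were copied as above):

for host in linux-node2 linux-node3 linux-node4; do
    ssh hadoop@"$host" hostname
done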

6. Modify the configuration file of hadoop

File location: /home/hadoop/hadoop/etc/hadoop; file names: hadoop-env.sh, yarn-env.sh, slaves, core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml

(1) configure hadoop-env.sh file

# under the hadoop installation path, go to the hadoop/etc/hadoop/ directory, edit hadoop-env.sh, and set JAVA_HOME to the JDK installation path

[hadoop@linux-node1 home/hadoop]$ cd hadoop/etc/hadoop/
[hadoop@linux-node1 hadoop]$ egrep JAVA_HOME hadoop-env.sh
# The only required environment variable is JAVA_HOME.  All others are
# set JAVA_HOME in this file, so that it is correctly defined on
#export JAVA_HOME=${JAVA_HOME}
export JAVA_HOME=/usr/java/jdk1.8.0_101/

(2) configure the yarn-env.sh file

yarn-env.sh is the configuration file for the YARN framework's runtime environment; it specifies the Java runtime used by YARN, so JAVA_HOME needs to be modified here as well.

[hadoop@linux-node1 hadoop]$ grep JAVA_HOME yarn-env.sh
# export JAVA_HOME=/home/y/libexec/jdk1.6.0/
export JAVA_HOME=/usr/java/jdk1.8.0_101/

(3) configure slaves file

Specify the DataNode data storage server and write the hostnames of all DataNode machines to this file, as follows:

[hadoop@linux-node1 hadoop]$ cat slaves
linux-node2
linux-node3
linux-node4

Three operating modes of Hadoop

Local standalone mode: all Hadoop components, such as the NameNode, DataNode, JobTracker, and TaskTracker, run in a single Java process.

Pseudo-distributed mode: each Hadoop component runs in its own Java virtual machine on a single host, and the components communicate through network sockets.

Fully distributed mode: Hadoop is distributed across multiple hosts, with different components installed on different hosts according to the work they perform.

# configure fully distributed mode

(4) modify the core-site.xml file and add the configuration below; pay particular attention to the fs.default.name address and the hadoop.tmp.dir path.

<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://linux-node1:9000</value>
    </property>
    <property>
        <name>io.file.buffer.size</name>
        <value>131072</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/home/hadoop/tmp</value>
        <description>Abase for other temporary directories.</description>
    </property>
</configuration>

(5) modify hdfs-site.xml file

<configuration>
    <property>
        <!-- web interface address for viewing HDFS status -->
        <name>dfs.namenode.secondary.http-address</name>
        <value>linux-node1:9001</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/home/hadoop/dfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/home/hadoop/dfs/data</value>
    </property>
    <property>
        <!-- each block keeps 2 replicas -->
        <name>dfs.replication</name>
        <value>2</value>
    </property>
    <property>
        <name>dfs.webhdfs.enabled</name>
        <value>true</value>
    </property>
</configuration>

(6) modify mapred-site.xml

This is the configuration for MapReduce jobs. Because hadoop 2.x uses the YARN framework, mapreduce.framework.name must be set to yarn to achieve a distributed deployment. mapred.map.tasks and mapred.reduce.tasks set the number of map and reduce tasks, respectively.

[hadoop@linux-node1 hadoop]$ cp mapred-site.xml.template mapred-site.xml

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>linux-node1:10020</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>linux-node1:19888</value>
    </property>
</configuration>

(7) configure the yarn-site.xml file

# this file is related to the configuration of yarn architecture

<configuration>
    <property>
        <name>mapred.child.java.opts</name>
        <value>-Xmx400m</value>
        <!-- jobs can include JVM debugging options -->
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>linux-node1:8032</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>linux-node1:8030</value>
    </property>
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>linux-node1:8031</value>
    </property>
    <property>
        <name>yarn.resourcemanager.admin.address</name>
        <value>linux-node1:8033</value>
    </property>
    <property>
        <name>yarn.resourcemanager.webapp.address</name>
        <value>linux-node1:8088</value>
    </property>
    <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>8192</value>
    </property>
</configuration>

7. Copy hadoop to other nodes

scp -r /home/hadoop/hadoop/ 192.168.0.90:/home/hadoop/
scp -r /home/hadoop/hadoop/ 192.168.0.91:/home/hadoop/
scp -r /home/hadoop/hadoop/ 192.168.0.92:/home/hadoop/

8. Initialize NameNode with hadoop user in linux-node1

/home/hadoop/hadoop/bin/hdfs namenode -format
# echo $?
# sudo yum -y install tree
# tree /home/hadoop/dfs

9. Start hadoop

/home/hadoop/hadoop/sbin/start-dfs.sh
/home/hadoop/hadoop/sbin/stop-dfs.sh

# View the process on the namenode node

ps aux | grep --color namenode

# check the processes on the DataNode nodes

ps aux | grep --color datanode
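As an alternative to ps, the jps tool that ships with the JDK lists the running Hadoop JVMs by class name; after start-dfs.sh you would expect NameNode and SecondaryNameNode on the master and DataNode on each slave. A quick sanity check (the JDK path is the one installed earlier):

# on linux-node1 (master)
jps
# on a slave, e.g. linux-node2 (full path because non-interactive shells may not set PATH)
ssh hadoop@linux-node2 /usr/java/jdk1.8.0_101/bin/jps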

10. Launch the yarn distributed computing framework

[hadoop@linux-node1 .ssh]$ /home/hadoop/hadoop/sbin/start-yarn.sh
starting yarn daemons

# View processes on NameNode nodes

ps aux | grep --color resourcemanager

# View processes on DataNode nodes

ps aux | grep --color nodemanager

Note: the scripts start-dfs.sh and start-yarn.sh can be replaced by start-all.sh

/home/hadoop/hadoop/sbin/stop-all.sh
/home/hadoop/hadoop/sbin/start-all.sh

11. Start the jobhistory service and check the mapreduce status

# on the NameNode node

[hadoop@linux-node1 ~]$ /home/hadoop/hadoop/sbin/mr-jobhistory-daemon.sh start historyserver
starting historyserver, logging to /home/hadoop/hadoop/logs/mapred-hadoop-historyserver-linux-node1.out
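To confirm the history server is up, check for the JobHistoryServer process and then browse the web UI on the port configured in mapred-site.xml above; a minimal check:

# the JobHistoryServer JVM should now be listed
jps | grep JobHistoryServer
# job history web UI configured above:
# http://192.168.0.89:19888/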

12. View the status of HDFS distributed file system

/home/hadoop/hadoop/bin/hdfs dfsadmin -report

# check which blocks make up each file

/home/hadoop/hadoop/bin/hdfs fsck / -files -blocks

13. Check the hadoop cluster status on the web page

View HDFS status: http://192.168.0.89:50070/
View Hadoop cluster status: http://192.168.0.89:8088/

Thank you for reading! This concludes the article on "how to deploy Hadoop clusters in Linux". I hope the above content has been of some help and lets you learn something new. If you think the article is good, feel free to share it for more people to see!
