

How to Build a Fully Distributed Hadoop-3.1.4 Cluster on CentOS 7


This article focuses on how to build a fully distributed Hadoop-3.1.4 cluster on CentOS 7. The method introduced here is simple, fast, and practical; interested readers may wish to take a look and follow along.

Commonly used big data technical vocabulary

The competition of the future is competition for data, and in practice big data largely means the Hadoop ecosystem. Here are some common technical terms.

ETL: extract, transform, load

Hadoop: distributed system infrastructure

HDFS: distributed file system

HBase: NoSQL database for big data

Hive: data warehouse tool

DAG: second-generation computing engine (directed acyclic graph execution model)

Spark: third-generation data processing engine

Flink: fourth-generation data processing engine

MapReduce: the original parallel computing framework

Sqoop: a tool for transferring data between Hadoop and traditional relational databases

Storm: distributed real-time computing system

Flume: distributed mass log collection system

Kafka: distributed publish-subscribe messaging system

ElasticSearch: distributed search engine

Kibana: graphical visualization tool for Elasticsearch data

Logstash: the data pipeline ("conveyor belt") feeding Elasticsearch

Neo4j: NoSQL graph database

Oozie: workflow scheduling system

YARN: a framework for job scheduling and cluster resource management

Hadoop cluster

Big data systems are cluster-based distributed systems. A cluster is a multiprocessor system composed of a group of independent computers that communicate over a network, allowing several machines to work together to provide services, whether in parallel or as backups for one another.

Distributed: a distributed system decomposes a task and splits up its functions; multiple machines each do a different part of the same job together.

Cluster: a cluster deploys the same service on multiple servers; multiple machines do the same job together.

Introduction to Hadoop

Hadoop is an open source software framework under Apache, implemented in the Java language. It is a software platform for storing and processing large-scale data.

Hadoop was created by Doug Cutting, the founder of Apache Lucene, and originated in the Nutch project.

In 2003, Google published the GFS paper, which provided a feasible solution for large-scale data storage.

In 2004, Google published the MapReduce paper, which provided a feasible solution for large-scale data computing. Based on Google's papers, the Nutch developers completed the corresponding open source implementations, HDFS and MapReduce, which were spun off from Nutch as the standalone project Hadoop.

In January 2008, Hadoop became a top-level Apache project and entered a period of rapid development.

Nowadays, Internet giants at home and abroad basically use the Hadoop framework as their big data solution, and more and more enterprises regard Hadoop as a prerequisite skill for entering the big data field.

Currently, Hadoop distributions are divided into open source community versions and commercial versions.

Open source community version: the version maintained by the Apache Software Foundation. It is officially maintained and has many releases, but component compatibility across versions is poor.

Commercial version: a version released by a third-party commercial company that has modified, integrated, and compatibility-tested the various service components on top of the community Hadoop, such as Cloudera's CDH.

Of the open source community versions, the 2.x series is generally used in production; the 3.x series is the latest, but it is not yet as stable.

Without further ado, let's get to today's topic: using two CentOS 7 systems to build a fully distributed Hadoop 3.x cluster.

Last year I used CentOS 7 to build a Hadoop 3.x distributed cluster. Since then I have changed computers, and considering how much else was installed on the old machine, this time I use two CentOS 7 systems to build the fully distributed Hadoop cluster. Although CentOS has been updated to version 8, a lot of big data learning material is still based on CentOS 7. A pseudo-distributed setup is not covered here; we build with the current stable Hadoop 3.x release, Hadoop-3.1.4.

Last year's corresponding article tutorials:

https://blog.csdn.net/weixin_44510615/article/details/104625802

https://blog.csdn.net/weixin_44510615/article/details/106540129

Preparation before setting up the cluster

Download address for CentOS 7: http://mirrors.aliyun.com/centos/7.9.2009/isos/x86_64/CentOS-7-x86_64-DVD-2009.iso. The ISO is about 4.8 GB.

In preparation for building the cluster, you need to set up a CentOS 7 system in VMware Workstation. The setup process is simple and is omitted here.

When connecting to a virtual machine from the physical host, two virtual network adapters, VMnet1 and VMnet8, are required.

If VMware is installed but VMnet1 and VMnet8 are missing, then based on a pitfall I hit before: the fixes suggested online, using the CCleaner package to clean registry entries and repeatedly uninstalling and reinstalling VMware, did not solve the problem; in the end only reinstalling the operating system fixed it.

Therefore, the prerequisite for building the virtual machines is that the local host has a working virtual network environment; otherwise the rest of the work is in vain.


At this point, the local host should be able to ping the virtual machine's IP, establishing network connectivity between the local machine and the virtual machine.

You can then make a remote connection to CentOS 7 through Xshell.

When you use CentOS 7 for the first time, you need to grant administrator privileges to the user created during installation. Use the root account to edit the sudoers file; otherwise, commands run with sudo will fail with an error like "node01 is not in the sudoers file. This incident will be reported."

Since /etc/sudoers is read-only, save and exit with :wq! (the forced write).
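For reference, a minimal sketch of the sudoers change, assuming the user created during installation (here node01, later renamed to hadoop) is the one being granted privileges; run as root:

[root@node01 ~]# visudo        # or: vim /etc/sudoers
# find the line "root ALL=(ALL) ALL" and add the new user below it:
node01  ALL=(ALL)       ALL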

If the remote connection fails, the SSH port or listen address is probably not open. Adjust the configuration with sudo vim /etc/ssh/sshd_config.
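A sketch of the sshd settings to check, assuming the default SSH port 22; restart the service after editing:

# in /etc/ssh/sshd_config, make sure these lines are uncommented:
Port 22
ListenAddress 0.0.0.0
# then restart the SSH service:
[root@node01 ~]# systemctl restart sshd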

Set a static IP

Configure a static IP by editing the network interface script (ifconfig shows the current interface and address).

Restart the network card for the change to take effect. Both steps are sketched below.
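A minimal sketch, assuming the interface is named ens33 (the usual VMware default; check with ip addr) and a VMnet8 gateway of 192.168.147.2; the address matches node01 as used later:

# /etc/sysconfig/network-scripts/ifcfg-ens33
BOOTPROTO=static
ONBOOT=yes
IPADDR=192.168.147.128
NETMASK=255.255.255.0
GATEWAY=192.168.147.2
DNS1=114.114.114.114

# restart the network card:
[root@node01 ~]# systemctl restart network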

Configure the Aliyun yum source

Download speeds are very slow with the default source, so you need to configure the Aliyun yum source. The following commands are from the official documentation; run them as root.

# configure the Aliyun yum source
yum install -y wget
cd /etc/yum.repos.d/
mv CentOS-Base.repo CentOS-Base.repo.bak
wget http://mirrors.aliyun.com/repo/Centos-7.repo
mv Centos-7.repo CentOS-Base.repo
# configure the epel source
wget https://mirrors.aliyun.com/repo/epel-7.repo
# clear the cache and update
yum clean all
yum makecache
yum update

Install JDK

Because starting the Hadoop framework depends on a Java environment, you need to prepare a JDK. Currently, OpenJDK and Oracle Java are the two main Java implementations. Uninstall the OpenJDK that ships with the Linux system, then install Oracle Java.

Specific blog: https://blog.csdn.net/weixin_44510615/article/details/104425843
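In outline, the steps look like the following sketch, assuming the Oracle JDK 8u281 tarball has been downloaded (the install path matches the JAVA_HOME used in hadoop-env.sh below); run as root:

# remove the preinstalled OpenJDK
rpm -qa | grep -i java | xargs -n1 rpm -e --nodeps
# unpack the Oracle JDK
mkdir -p /usr/java
tar -zxvf jdk-8u281-linux-x64.tar.gz -C /usr/java/
# append to /etc/profile, then run: source /etc/profile
export JAVA_HOME=/usr/java/jdk1.8.0_281
export PATH=$PATH:$JAVA_HOME/bin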

Clone a virtual machine

Set the clone's static IP to 192.168.147.129, and set the two CentOS 7 hostnames to node01 and node02 respectively to distinguish the machines.
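One way to set the hostnames permanently on CentOS 7 is hostnamectl, the systemd tool for this:

[root@node01 ~]# hostnamectl set-hostname node01
[root@node02 ~]# hostnamectl set-hostname node02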

Previously I had created the user with the name node01, which turned out to be a mistake, so I set the user name on both hosts to hadoop.

Modify the user name on CentOS 7: [root@node01 ~]# usermod -l hadoop -d /home/hadoop -m node01

From this point on we have two CentOS machines, and we do not use the root account in the Hadoop cluster.

Both machines can be connected successfully through Xshell.

Configure SSH password-free login

[root@node01 ~]# vim /etc/sysconfig/network        # HOSTNAME=node01
[root@node01 ~]# vim /etc/hosts                    # 192.168.147.128 node01
                                                   # 192.168.147.129 node02
[root@node01 ~]# systemctl stop firewalld
[root@node01 ~]# systemctl disable firewalld.service
[root@node02 ~]# vim /etc/sysconfig/network        # HOSTNAME=node02
[root@node02 ~]# vim /etc/hosts                    # 192.168.147.128 node01
                                                   # 192.168.147.129 node02
[root@node02 ~]# systemctl stop firewalld
[root@node02 ~]# systemctl disable firewalld.service

Set up password-free switching between node01 and node02 for the hadoop account; a typical key exchange sequence is sketched below. For details, check out my blog: https://blog.csdn.net/weixin_44510615/article/details/104528001
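A minimal sketch of the key exchange, run as the hadoop user:

[hadoop@node01 ~]$ ssh-keygen -t rsa               # accept the defaults
[hadoop@node01 ~]$ ssh-copy-id hadoop@node01
[hadoop@node01 ~]$ ssh-copy-id hadoop@node02
# repeat the same three commands on node02, then verify:
[hadoop@node01 ~]$ ssh node02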

Download Hadoop

Download link: https://mirrors.tuna.tsinghua.edu.cn/apache/hadoop/common/hadoop-3.1.4/hadoop-3.1.4.tar.gz

[hadoop@node01 ~]$ ls
hadoop-3.1.4.tar.gz  module  wget-log  Public  Templates  Videos  Pictures  Documents  Downloads  Music  Desktop
[hadoop@node01 ~]$ mkdir -p module/hadoop
[hadoop@node01 ~]$ tar -zxvf hadoop-3.1.4.tar.gz -C module/hadoop/
[hadoop@node01 ~]$ cd module/hadoop/hadoop-3.1.4/
[hadoop@node01 hadoop-3.1.4]$ sudo mkdir -p data/tmp
[hadoop@node01 hadoop-3.1.4]$ ls
bin  data  etc  include  lib  libexec  LICENSE.txt  NOTICE.txt  README.txt  sbin  share

Modify the configuration files

When configuring cluster/distributed mode, you need to modify the configuration files under the hadoop/etc/hadoop directory. Only the settings necessary for a normal startup are covered here, namely workers, core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml. For more settings, see the official documentation.

Modify the file hadoop-env.sh

[hadoop@node01 hadoop]$ vim hadoop-env.sh    # export JAVA_HOME=/usr/java/jdk1.8.0_281/
[hadoop@node01 hadoop]$ vim yarn-env.sh      # export JAVA_HOME=/usr/java/jdk1.8.0_281

Modify the file workers

In the workers file on the Master node, specify the Slave node, which is node02:

[hadoop@node01 hadoop]$ vim workers
[hadoop@node01 hadoop]$ cat workers
node02

Modify the file core-site.xml

Modify the core-site.xml file to read as follows:
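The original XML listing did not survive here, so the following is a minimal sketch of what a core-site.xml for this two-node layout typically contains; the NameNode port and the hadoop.tmp.dir path (the data/tmp directory created above) are assumptions:

<configuration>
    <property>
        <!-- address of the NameNode; node01 is the master -->
        <name>fs.defaultFS</name>
        <value>hdfs://node01:9000</value>
    </property>
    <property>
        <!-- base directory for temporary files -->
        <name>hadoop.tmp.dir</name>
        <value>/home/hadoop/module/hadoop/hadoop-3.1.4/data/tmp</value>
    </property>
</configuration>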

Modify the file hdfs-site.xml

Modify the hdfs-site.xml file to read as follows:

Hadoop's distributed file system HDFS generally uses redundant storage with a replication factor of 3, that is, three copies of each block. However, this tutorial has only one Slave node acting as a DataNode; with a single DataNode in the cluster, only one copy of the data can be stored, so the value of dfs.replication is set to 1, as sketched below.
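Again the listing was lost; a minimal sketch consistent with the explanation above (the name and data directory paths under data/tmp are assumptions):

<configuration>
    <property>
        <!-- one DataNode, so keep a single replica -->
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/home/hadoop/module/hadoop/hadoop-3.1.4/data/tmp/dfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/home/hadoop/module/hadoop/hadoop-3.1.4/data/tmp/dfs/data</value>
    </property>
</configuration>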

Modify the file mapred-site.xml

Modify the mapred-site.xml file to read as follows:

[hadoop@node01 hadoop]$ cat mapred-site.xml
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

Modify the file yarn-site.xml

Modify the yarn-site.xml file to read as follows:
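The listing is missing here as well; a minimal sketch of a typical yarn-site.xml for this layout, assuming node01 hosts the ResourceManager:

<configuration>
    <property>
        <!-- host of the ResourceManager -->
        <name>yarn.resourcemanager.hostname</name>
        <value>node01</value>
    </property>
    <property>
        <!-- auxiliary shuffle service required by MapReduce -->
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>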

Configure Hadoop environment variables

Add the Hadoop paths to /etc/profile:
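A sketch of the additions, assuming Hadoop was unpacked to the module/hadoop directory created above; after saving, run source /etc/profile:

export HADOOP_HOME=/home/hadoop/module/hadoop/hadoop-3.1.4
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

The same Hadoop directory and profile changes must also be present on node02, for example by copying with scp -r ~/module hadoop@node02:~/ before formatting.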

Initialize HDFS

Initialize HDFS by executing the NameNode format command:

hdfs namenode -format

You may hit a "failed to create directory" error; this is a permissions problem. Fix it with sudo chmod -R a+rwx on the absolute path of the directory. Also, if initializing HDFS fails, delete the previously created data folders before formatting again.

Start the cluster

Execute start-all.sh directly to start Hadoop. The relevant services on node02 will be started as well:

Use the jps command on each server to view the service processes, as sketched below.
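A sketch of the check; the process lists in the comments are what is typically expected for this master/worker split:

[hadoop@node01 ~]$ start-all.sh
[hadoop@node01 ~]$ jps     # typically: NameNode, SecondaryNameNode, ResourceManager, Jps
[hadoop@node02 ~]$ jps     # typically: DataNode, NodeManager, Jps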

Alternatively, view it in the web UI on port 9870 of the NameNode. You can see that one DataNode is available:

Then you can check the state of YARN; the port number is 8088:

At this point, the fully distributed Hadoop cluster has been built successfully.

By now, I believe you have a deeper understanding of how to build a fully distributed Hadoop-3.1.4 cluster on CentOS 7. You might as well try it out in practice, and keep learning!
