
Hadoop + spark + hive cluster building (apache version)


0. Introduction

A hadoop cluster is not something a beginner stands up smoothly; you have to step through a lot of pitfalls. After a week of twists and turns, I finally got the cluster running normally, so I want to record the build process and share it as a reference for everyone.

Since the build process is quite long, this article is also very long. I hope you can read it patiently.

1. Cluster environment and version description

3 CentOS 7.4 servers, each with 4 CPUs and 8G of memory; jdk 1.8, hadoop 2.7.7, spark 2.3.0, hive 2.1.1

The mapping between nodes and hostnames:

Master node: 172.18.206.224 nn1 NameNode and YARN ResourceManager
Slave node 1: 172.18.206.228 dn1 DataNode and YARN NodeManager
Slave node 2: 172.18.206.229 dn2 DataNode and YARN NodeManager

For the hadoop cluster, create a non-root user; I use the user name hadoop. The installation directory is the hadoop user's home directory, /data/hadoop.

2. Hadoop cluster installation

2.1 install jdk 1.8

Since the hadoop cluster needs a java environment, make sure jdk is installed on your system before installing the cluster. Check as follows:

[root@ND-ES-3 ~]# java -version
openjdk version "1.8.0_161"
OpenJDK Runtime Environment (build 1.8.0_161-b14)
OpenJDK 64-Bit Server VM (build 25.161-b14, mixed mode)

If jdk 1.8 or above is not installed, uninstall the old version and reinstall. The jdk I chose is the version provided by Oracle; when I tested jdk builds from other vendors, they seemed to be incompatible with apache hadoop and always reported errors.

Download: jdk-8u181-linux-x64.rpm

Then upload it to the server and install:

rpm -ivh jdk-8u181-linux-x64.rpm

After the installation is complete, check that the java -version output is correct.

2.2 modify the /etc/hosts file and set up ssh password-free login

Modify the /etc/hosts file on the nn1, dn1 and dn2 servers so that the hosts can reach each other by hostname:

vi /etc/hosts:

172.18.206.224 nn1 nn1.cluster1.com
172.18.206.228 dn1 dn1.cluster1.com
172.18.206.229 dn2 dn2.cluster1.com

In fact, any hostnames will do; it depends on how you want to use them.

To create a hadoop user:

Create it on all servers, without setting a password:

useradd -d /data/hadoop hadoop

Then, on nn1, create a key pair for the hadoop user for ssh password-free login, so that you don't have to type a password every time the nodes in the hadoop cluster communicate, which would be troublesome.

How to create a key pair:

su - hadoop
ssh-keygen -t rsa
cd ~/.ssh
mv id_rsa.pub authorized_keys
chmod 0700 /data/hadoop/.ssh
chmod 0600 /data/hadoop/.ssh/authorized_keys

Then copy authorized_keys and id_rsa to the /data/hadoop/.ssh directory on the other two hosts.
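One simple way to push the key files from nn1 is scp. This is just a minimal sketch; it assumes the hadoop user already exists on dn1 and dn2 and that you can still authenticate with a password during the copy:

ssh dn1 "mkdir -p /data/hadoop/.ssh && chmod 0700 /data/hadoop/.ssh"
scp /data/hadoop/.ssh/{authorized_keys,id_rsa} hadoop@dn1:/data/hadoop/.ssh/
ssh dn1 "chmod 0600 /data/hadoop/.ssh/authorized_keys"

Repeat the same three commands for dn2.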

To test, ssh from the hadoop user to the other two servers:

[root@ND-ES-3 ~]# su - hadoop
Last login: Mon Sep 10 09:32:13 CST 2018 from 183.6.128.86 on pts/1
[hadoop@ND-ES-3 ~]$ ssh dn1
Last login: Thu Sep 6 15:49:20 2018
Welcome to Alibaba Cloud Elastic Compute Service!
[hadoop@ND-DB-SLAVE ~]$
[hadoop@ND-ES-3 ~]$ ssh dn2
Last login: Fri Sep 7 16:43:04 2018
Welcome to Alibaba Cloud Elastic Compute Service!
[hadoop@ND-BACKUP ~]$

By default, the first ssh login to a host asks you to confirm accepting its key; just confirm it.

Once ssh password-free login works, you can move on.

2.3 install hadoop-2.7.7

The installation process is relatively simple: download the compressed package of the corresponding version and decompress it. The version I chose is hadoop-2.7.7.tar.gz.

tar -xvzf /usr/local/src/hadoop-2.7.7.tar.gz
mv hadoop-2.7.7 /data/hadoop/

2.4 set the environment variables for hadoop

You can do this by modifying the .bash_profile file in the hadoop home directory, /data/hadoop.

vi /data/hadoop/.bash_profile, and add the following:

## JAVA env variables
export JAVA_HOME=$(dirname $(dirname $(readlink -f $(which javac))))
export PATH=$PATH:$JAVA_HOME/bin
export CLASSPATH=.:$JAVA_HOME/jre/lib:$JAVA_HOME/lib:$JAVA_HOME/lib/tools.jar

## HADOOP env variables
export HADOOP_HOME=/data/hadoop/hadoop-2.7.7
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

Pay attention to adjusting the paths to your actual situation.

Then run source ~/.bash_profile to make the variables take effect.

2.5 modify hadoop configuration file core-site.xml

vi /data/hadoop/hadoop-2.7.7/etc/hadoop/core-site.xml

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://nn1:9000</value>
  </property>
</configuration>

2.6 modify hdfs-site.xml

<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/data/hadoop/hadoop-2.7.7/hadoop_store/hdfs/namenode2</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/data/hadoop/hadoop-2.7.7/hadoop_store/hdfs/datanode2</value>
  </property>
</configuration>

2.7 modify mapred-site.xml

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

2.8 modify yarn-site.xml

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>172.18.206.224:8025</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>172.18.206.224:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>172.18.206.224:8050</value>
  </property>
</configuration>

2.9 modify the slaves file to add the ip addresses of two slave hosts

[hadoop@ND-BACKUP hadoop]$ cat slaves

172.18.206.228

172.18.206.229

2.10 modify hadoop-env.sh

Modify the JAVA_HOME in the hadoop-env.sh file to:

export JAVA_HOME=$(dirname $(dirname $(readlink -f $(which javac))))

2.11 copy the /data/hadoop/hadoop-2.7.7 installation directory to the dn1 and dn2 servers

scp -r /data/hadoop/hadoop-2.7.7 hadoop@dn1:/data/hadoop/
scp -r /data/hadoop/hadoop-2.7.7 hadoop@dn2:/data/hadoop/

2.12 create the NameNode directory on nn1

mkdir -p /data/hadoop/hadoop-2.7.7/hadoop_store/hdfs/namenode2

2.13 create datanode directories in dn1 and dn2

mkdir -p /data/hadoop/hadoop-2.7.7/hadoop_store/hdfs/datanode2
chmod 755 /data/hadoop/hadoop-2.7.7/hadoop_store/hdfs/datanode2

2.14 turn off selinux and iptables Firewall

iptables -F
setenforce 0
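Note that these two commands only last until the next reboot. If your security policy allows it, you can make the change permanent on CentOS 7 with something like the following (a sketch; on a default CentOS 7 install the firewall service is firewalld rather than iptables):

systemctl stop firewalld
systemctl disable firewalld
sed -i 's/^SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config    # fully takes effect after a reboot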

2.15 format namenode in nn1

hdfs namenode -format

2.16 start the hadoop cluster (only operate in nn1)

cd /data/hadoop/hadoop-2.7.7/
./sbin/start-all.sh

2.17 check

On nn1:

$ jps
3042 NameNode
3349 SecondaryNameNode
3574 ResourceManager
11246 Jps

On dn1 or dn2:

$ jps
26642 NodeManager
14569 Jps
26491 DataNode

Then check whether there are two live datanodes:

[hadoop@ND-ES-3 ~]$ hdfs dfsadmin -report
Configured Capacity: 3246492319744 (2.95 TB)
Present Capacity: 2910313086244 (2.65 TB)
DFS Remaining: 2907451403556 (2.64 TB)
DFS Used: 2861682688 (2.67 GB)
DFS Used%: 0.10%
Under replicated blocks: 34
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0
-------------------------------------------------
Live datanodes (2):

Name: 172.18.206.228:50010 (dn1)
Hostname: dn1
Decommission Status : Normal
Configured Capacity: 1082119344128 (1007.80 GB)
DFS Used: 1430839296 (1.33 GB)
Non DFS Used: 161390100480 (150.31 GB)
DFS Remaining: 864172031634 (804.82 GB)
DFS Used%: 0.13%
DFS Remaining%: 79.86%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 3
Last contact: Mon Sep 10 17:26:59 CST 2018

Name: 172.18.206.229:50010 (dn2)
Hostname: dn2
Decommission Status : Normal
Configured Capacity: 2164372975616 (1.97 TB)
DFS Used: 1430843392 (1.33 GB)
Non DFS Used: 9560809472 (8.90 GB)
DFS Remaining: 2043279371922 (1.86 TB)
DFS Used%: 0.07%
DFS Remaining%: 94.41%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 3
Last contact: Mon Sep 10 17:26:59 CST 2018
[hadoop@ND-ES-3 ~]$

If all the above checks pass, then the hadoop cluster has been successfully set up.

You can also check hadoop through the web UI, for example via port 50070:

http://host_ip:50070

Or check the resource manager resources via port 8088:

http://host_ip:8088
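Before moving on, a quick HDFS read/write smoke test from nn1 is also worthwhile (the file name /tmp/test.txt here is just a placeholder):

echo "hello hadoop" > /tmp/test.txt
hdfs dfs -mkdir -p /test
hdfs dfs -put /tmp/test.txt /test/
hdfs dfs -cat /test/test.txt      # should print "hello hadoop"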

OK, that is the process of building a hadoop cluster. Next, let's continue with the spark cluster.

3. Install spark on yarn

First of all, when spark is integrated with hadoop's yarn there are two operation modes: client mode and cluster mode. We usually choose cluster mode for jobs submitted to the cluster. As for how client mode and cluster mode work under the hood, if you are interested, you can search for more documentation to read.

3.1 download and install spark

The version of spark I use is:

spark-2.3.0-bin-hadoop2.7.tgz

Upload it to the server's /usr/local/src/ directory and extract:

tar -xvzf /usr/local/src/spark-2.3.0-bin-hadoop2.7.tgz
mv /usr/local/src/spark-2.3.0-bin-hadoop2.7 /data/hadoop/spark-2.3.0

Create a soft link:

cd /data/hadoop
ln -s spark-2.3.0 spark   # a soft link makes it easy to switch between multiple versions later, as the pros do

Spark is now installed in the / data/hadoop/spark directory.

3.2 add spark-related environment variables

Add to the /data/hadoop/.bash_profile file:

## spark
export SPARK_HOME=/data/hadoop/spark
export HADOOP_CONF_DIR=/data/hadoop/hadoop-2.7.7/etc/hadoop
export LD_LIBRARY_PATH=/data/hadoop/hadoop-2.7.7/lib/native:$LD_LIBRARY_PATH
export PATH=/data/hadoop/spark/bin:$PATH

Then, make the configuration effective:

source ~/.bash_profile

There is actually a small problem here: every time you source the file after adding an environment variable, the PATH variable gets longer and longer, and on a server that is not rebooted for a long time there is not much you can do about it.

3.3 modify spark-env.sh

Copy the official template file that comes with the installation package:

cd /data/hadoop/spark/conf
cp spark-env.sh.template spark-env.sh

Add the following:

export SPARK_HOME=/data/hadoop/spark
#export SCALA_HOME=/lib/scala
export JAVA_HOME=$(dirname $(dirname $(readlink -f $(which javac))))
export HADOOP_HOME=/data/hadoop/hadoop-2.7.7
#export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$SCALA_HOME/bin
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
#export YARN_CONF_DIR=$YARN_HOME/etc/hadoop
#export SPARK_LOCAL_DIRS=/data/hadoop/spark
export SPARK_LIBARY_PATH=.:$JAVA_HOME/lib:$JAVA_HOME/jre/lib:$HADOOP_HOME/lib/native
#export SPARK_MASTER_PORT=7077
export SPARK_MASTER_HOST=nn1

In the configuration above, the commented-out lines are not needed; I copied them over and did not bother to change them.

3.4 modify the slaves file

Make a copy of the official template:

mv slaves.template slaves

vi slaves, and add the following:

nn1
dn1
dn2

3.5 modify spark-defaults.conf

Again, copy the official template:

mv spark-defaults.conf.template spark-defaults.conf

Add the following configuration:

vi spark-defaults.conf

spark.eventLog.enabled            true
spark.eventLog.dir                hdfs://nn1:9000/spark-logs
spark.history.provider            org.apache.spark.deploy.history.FsHistoryProvider
spark.history.fs.logDirectory     hdfs://nn1:9000/spark-logs
spark.history.fs.update.interval  10s
spark.history.ui.port             18080

This file is also where spark's memory use and allocation can be configured. If you are not yet familiar with memory allocation, you can leave those settings out for now and let spark run with its default parameters.
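For reference, if you do come back later to pin down memory, the usual spark-defaults.conf keys look roughly like the sketch below; the values are only illustrative and would need to be tuned to the 4-CPU / 8G nodes used here:

spark.driver.memory      1g
spark.executor.memory    2g
spark.executor.cores     2
spark.yarn.am.memory     1g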

3.5.1 create a spark-logs directory

Since we set spark.eventLog.dir in the previous step, we need to create the directory on hdfs for the logs:

[hadoop@ndj-hd-1 spark]$ hadoop dfs -mkdir /spark-logs

Then add the following at the end of spark-env.sh:

export SPARK_HISTORY_OPTS="-Dspark.history.ui.port=18080 -Dspark.history.retainedApplications=3 -Dspark.history.fs.logDirectory=hdfs://nn1:9000/spark-logs"
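With the event log and history settings in place, the history server itself is started separately once the cluster is up (the script below ships in the Spark distribution's sbin directory), and its UI is then served on port 18080:

cd /data/hadoop/spark
./sbin/start-history-server.sh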

3.6 test spark

If there is nothing wrong with the environment variable settings above, you can do the following simple example test.

(the spark-defaults.conf file does not need to be configured for this test):

[hadoop@ND-ES-3 spark]$ ./bin/run-example SparkPi 10

2018-09-11 11:17:47 INFO SparkContext:54 - Running Spark version 2.3.0
2018-09-11 11:17:47 INFO SparkContext:54 - Submitted application: Spark Pi
2018-09-11 11:17:47 INFO SecurityManager:54 - Changing view acls to: hadoop
2018-09-11 11:17:47 INFO SecurityManager:54 - Changing modify acls to: hadoop
2018-09-11 11:17:47 INFO SecurityManager:54 - Changing view acls groups to:
2018-09-11 11:17:47 INFO SecurityManager:54 - Changing modify acls groups to:

Spark ships with some examples. The above is the pi example and part of its output. If the calculation succeeds, scroll up a little from the end of the output and you will see:

Pi is roughly 3.141415141415141

3.7 copy the installation directory of spark to another node

chown -R hadoop:hadoop /data/hadoop/spark
scp -r /data/hadoop/spark dn1:/data/hadoop/
scp -r /data/hadoop/spark dn2:/data/hadoop/

After the copy completes, remember to update the /data/hadoop/.bash_profile variables on the dn1 and dn2 nodes as well.
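One simple way to do that is to copy the profile over too (assuming the .bash_profile on dn1 and dn2 contains nothing node-specific that you want to keep), then log in to each node once so the new variables take effect:

scp /data/hadoop/.bash_profile dn1:/data/hadoop/
scp /data/hadoop/.bash_profile dn2:/data/hadoop/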

3.8 start spark

Before starting spark, make sure the hadoop cluster is running.

Start spark:

cd /data/hadoop/spark

To save trouble, start all master and worker at once:

./sbin/start-all.sh

3.9 check startup

On nn1:

[hadoop@ND-ES-3 conf]$ jps
3042 NameNode
8627 RunJar
3349 SecondaryNameNode
3574 ResourceManager
10184 Master
10332 Worker
4654 Jps

On dn1 and dn2:

[hadoop@ND-DB-SLAVE ~]$ jps
26642 NodeManager
10679 Jps
29960 Worker
26491 DataNode
[hadoop@ND-DB-SLAVE ~]$

As long as you can see the Master and Worker processes, spark has started.
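Now that spark is running on all nodes, the same example can also be submitted to yarn to exercise the client and cluster modes mentioned at the start of this chapter. A sketch, assuming the examples jar of spark-2.3.0-bin-hadoop2.7 is still at its default location under examples/jars/:

cd /data/hadoop/spark
./bin/spark-submit --master yarn --deploy-mode client --class org.apache.spark.examples.SparkPi examples/jars/spark-examples_2.11-2.3.0.jar 10
./bin/spark-submit --master yarn --deploy-mode cluster --class org.apache.spark.examples.SparkPi examples/jars/spark-examples_2.11-2.3.0.jar 10

In cluster mode the "Pi is roughly ..." line ends up in the application logs on yarn (yarn logs -applicationId <app id>) instead of on your terminal.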

4. Install scala

With spark installed, install scala next. The process is relatively simple.

Download and upload the installation package to the server and extract:

tar xvzf scala-2.12.6.tgz

Create a soft link:

ln -s scala-2.12.6 /data/hadoop/scala

Add scala to the PATH:

vi .bash_profile

# scala
export SCALA_HOME=/lib/scala
export PATH=${SCALA_HOME}/bin:$PATH

Make it take effect:

source ~/.bash_profile

OK, scala is configured and ready to use.
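If the SCALA_HOME above matches where you actually unpacked scala, a quick check is:

scala -version
scalac -version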

5. Install hive

When your hadoop cluster and spark cluster are installed, you can start installing hive.

Hive is the last part of this article and is a component with a lot of dependencies.

First of all, write down some of the pits I summarized after installing hive:

First, the hadoop and spark functions should be normal.

Second, hive needs to be combined with mysql database, so it is necessary to configure mysql database and authorize users.

Third, notice that the jdbc driver and spark engine are configured in the hive-site.xml configuration file.

The third point in particular is rarely mentioned in other materials. If you find that your configuration looks complete but the test verification keeps failing, the best approach is to work through the errors one by one according to the messages reported; a sketch of the relevant hive-site.xml entries follows.
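For reference only, the hive-site.xml entries behind the third point usually look something like this sketch. It is not the exact file from this build: the mysql host, database name, user and password are placeholders to be replaced with your own, and hive.execution.engine is only relevant if you run hive on the spark engine; the mysql jdbc driver jar (mysql-connector-java) also has to be placed under $HIVE_HOME/lib.

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://nn1:3306/hive?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>your_password</value>
</property>
<property>
  <name>hive.execution.engine</name>
  <value>spark</value>
</property>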

5.1 install mysql-5.7

Hive needs to be used together with a database so that you can work with data through SQL statements, much as you would in mysql, instead of writing repetitive code.

I installed mysql from the binary tarball because it lets me configure mysql's parameters myself.

The address to download the installation package:

https://dev.mysql.com/get/Downloads/MySQL-5.7/mysql-5.7.17-linux-glibc2.5-x86_64.tar.gz

Create users and user groups

groupadd mysql
useradd -g mysql -s /sbin/nologin mysql

Extract to the specified directory

tar -zxvf mysql-5.7.17-linux-glibc2.5-x86_64.tar.gz -C /usr/local
cd /usr/local/
ln -s mysql-5.7.17-linux-glibc2.5-x86_64 mysql

or

mv mysql-5.7.17-linux-glibc2.5-x86_64 mysql

Configure the PATH directory

vi /etc/profile.d/mysql.sh

Add:

export PATH=$PATH:/usr/local/mysql/bin

Then, source /etc/profile.d/mysql.sh

Mysql directory resource planning

File type         Parameter              Instance (3306) path                         Soft link target
data              datadir                /usr/local/mysql/data                        /data/mysql/data
parameter file    my.cnf                 /usr/local/mysql/etc/my.cnf
error log         log-error              /usr/local/mysql/log/mysql_error.log
binary log        log-bin                /usr/local/mysql/binlogs/mysql-bin           /data/mysql/binlogs/mysql-bin
slow query log    slow_query_log_file    /usr/local/mysql/log/mysql_slow_query.log
socket file       socket                 /usr/local/mysql/run/mysql.sock
pid file          pid-file               /usr/local/mysql/run/mysql.pid

Note: since the data and binary logs can become large, soft links are used so that /data/mysql actually lives on the server's data disk, which has enough space. If disk space is not a concern, you can keep the default path layout.

mkdir -p /data/mysql/{data,binlogs,log,etc,run}
ln -s /data/mysql/data /usr/local/mysql/data
ln -s /data/mysql/binlogs /usr/local/mysql/binlogs
ln -s /data/mysql/log /usr/local/mysql/log
ln -s /data/mysql/etc /usr/local/mysql/etc
ln -s /data/mysql/run /usr/local/mysql/run
chown -R mysql.mysql /data/mysql/
chown -R mysql.mysql /usr/local/mysql/{data,binlogs,log,etc,run}

Note: you can also soft-link only the datadir and binlog directories.

Configure the my.cnf file

Delete the my.cnf that comes with the system

rm -rf /etc/my.cnf

Edit my.cnf

vi /usr/local/mysql/etc/my.cnf

Add the following:

[client]
port = 3306
socket = /usr/local/mysql/run/mysql.sock

[mysqld]
port = 3306
socket = /usr/local/mysql/run/mysql.sock
pid_file = /usr/local/mysql/run/mysql.pid
datadir = /usr/local/mysql/data
default_storage_engine = InnoDB
max_allowed_packet = 512M
max_connections = 2048
open_files_limit = 65535
skip-name-resolve
lower_case_table_names = 1
character-set-server = utf8mb4
collation-server = utf8mb4_unicode_ci
init_connect = 'SET NAMES utf8mb4'
innodb_buffer_pool_size = 1024M
innodb_log_file_size = 2048M
innodb_file_per_table = 1
innodb_flush_log_at_trx_commit = 0
key_buffer_size = 64M
log-error = /usr/local/mysql/log/mysql_error.log
log-bin = /usr/local/mysql/binlogs/mysql-bin
slow_query_log = 1
slow_query_log_file = /usr/local/mysql/log/mysql_slow_query.log
long_query_time = 5
tmp_table_size = 32M
max_heap_table_size = 32M
query_cache_type = 0
query_cache_size = 0
server-id = 1

Initialize the database

mysqld --initialize --user=mysql --basedir=/usr/local/mysql --datadir=/usr/local/mysql/data

A temporary password will be generated for the database. Write it down; you will need it later:

[hadoop@ND-ES-3 mysql]$ sudo grep 'temporary password' /usr/local/mysql/log/mysql_error.log
2018-09-08T05:03:32.509910Z 1 [Note] A temporary password is generated for root@localhost: …

Later, once hive itself is installed and hive-site.xml is configured, the final verification is done from the hive shell:

hive> show databases;
OK
default
Time taken: 0.769 seconds, Fetched: 1 row(s)
hive>

Some warnings in the initial output do not affect use; as long as the show databases; command returns the correct information, hive has been configured successfully.
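Going back to the second pit listed at the start of section 5, the metastore database and account for hive are typically created in mysql roughly as follows. The database name, user and passwords are placeholders; mysql 5.7 also forces you to change the temporary root password before running anything else:

mysql -uroot -p      # log in with the temporary password from the error log

mysql> ALTER USER 'root'@'localhost' IDENTIFIED BY 'New_root_pass1!';
mysql> CREATE DATABASE hive DEFAULT CHARACTER SET utf8;
mysql> CREATE USER 'hive'@'%' IDENTIFIED BY 'Hive_pass1!';
mysql> GRANT ALL PRIVILEGES ON hive.* TO 'hive'@'%';
mysql> FLUSH PRIVILEGES;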

6. Conclusion

I have to say, setting up this hadoop + spark + hive environment is really troublesome for a beginner.

Although my installation process gets the cluster running smoothly, there are still many areas to optimize. For example, memory and resource allocation for spark, hadoop and hive is not discussed in depth in this article. I hope this article helps more people save time when building clusters, spend that time on more interesting things, and start experiencing the various functions of hadoop as soon as possible.

As for future optimization, I plan to learn more about hadoop cluster performance tuning, and read through the source code when I have time.
