Hive On Spark


Since the company started using commercial big data products, I rarely touch open-source tooling any more, and cluster problems go straight to the R&D team. A few days ago a friend asked me how to change Hive's underlying engine to Spark. My first thought was that you simply share Hive's databases with Spark and then use spark-shell, but after checking the documentation it turns out that is not the case at all; quite a few steps are involved. That is the trade-off of using someone else's product: development is convenient, but you understand less of what happens underneath. The editor has been enjoying a comfortable life with a platform that converts SQL into Spark jobs and runs tasks in parallel automatically. So, taking advantage of the weekend, with the company's WiFi and air conditioning all to myself, I replaced the open-source Hive engine with Spark and am sharing the process with you; most importantly, it was a step from plain homebody to tech homebody.

Due to limited funds, the editor can only demonstrate on virtual machines. First, a quick look at the Hadoop platform the editor built, and a review of which processes need to be running in Hadoop HA mode (Hadoop is version 2.7.x):

→ NameNode (active/standby): the master of HDFS, responsible for metadata management and for managing the slave nodes

→ DataNode: the slave of HDFS, used to store data

→ ResourceManager: the master of YARN, responsible for resource scheduling

→ NodeManager: the slave of YARN, which executes the actual tasks

→ Zookeeper: service orchestration (process name QuorumPeerMain)

→ JournalNode: shares metadata between the active and standby NameNodes

→ DFSZKFailoverController: monitors whether each NameNode is alive and stands ready to switch between active and standby

That's about all there is to it: a very ordinary Hadoop platform. The editor uses three virtual machines here.

Services on each node (hadoop01, hadoop02, hadoop03): the original screenshots are omitted here, but you can check what is running with jps, as sketched below.
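A quick way to see what is actually running on each machine is jps; a minimal sketch, assuming passwordless ssh between the nodes (host names as above):

# list the Java daemons on every node; together they should cover the processes described above
for host in hadoop01 hadoop02 hadoop03; do
  echo "==== $host ===="
  ssh $host 'jps | grep -v Jps'
done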

If you are about to complain about the uneven distribution of services, take your hands off the keyboard: the editor is only doing a demonstration and built this in a hurry.

1. Test whether hive is working properly:

Here I have distributed the hive installation package on all three machines:

Execute the command to start hive (to keep things quick here, use the hive CLI rather than beeline):

[hadoop@hadoop01 applications]$ hive

Try running a few commands:

hive> use test;                      # enter the database
hive> show tables;                   # see which tables it contains
hive> create external table `user`(id string, name string) row format delimited fields terminated by ',' location '/zy/test/user';    # create a table
# Import some data
[hadoop@hadoop01 ~]$ for i in `seq 100`; do echo "10$i,zy$i" >> user.txt; done
[hadoop@hadoop01 ~]$ hadoop fs -put user.txt /zy/test/user
hive> select * from `user`;

OK, hive is working fine!
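Before changing anything, it is also worth confirming which execution engine this Hive is using right now; on a stock install it is mr. A quick check from the shell:

# print the current engine; expect hive.execution.engine=mr before the switch
[hadoop@hadoop01 ~]$ hive -e "set hive.execution.engine;"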

2. Replace the hive engine with spark

(1) Versions

First check the compatibility of the hive and spark versions:

Here the editor's spark is 2.0.0 and Hive is 2.3.2.

Spark download address: https://archive.apache.org/dist/spark/spark-2.0.0/

Download address of Hive: http://hive.apache.org/downloads.html

Note that the Spark used here has to be a build compiled without the Hive module. The editor provides the compiled spark here:

Link: https://pan.baidu.com/s/1tPu2a34JZgcjKAtJcAh-pQ extraction code: kqvs

As for Hive, the release from the official website is fine.
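To double-check what is actually installed before wiring the two together, both tools can print their own version (assuming hive and spark-submit are on the PATH):

[hadoop@hadoop01 ~]$ hive --version            # should report 2.3.2
[hadoop@hadoop01 ~]$ spark-submit --version    # should report 2.0.0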

(2) Modify the configuration files

# Hive configuration (hive-site.xml):

<property><name>javax.jdo.option.ConnectionURL</name><value>jdbc:mysql://hadoop03:3306/hivedb?createDatabaseIfNotExist=true</value><description>JDBC connect string for a JDBC metastore</description></property>
<property><name>javax.jdo.option.ConnectionDriverName</name><value>com.mysql.jdbc.Driver</value><description>Driver class name for a JDBC metastore</description></property>
<property><name>javax.jdo.option.ConnectionUserName</name><value>root</value><description>username to use against metastore database</description></property>
<property><name>javax.jdo.option.ConnectionPassword</name><value>123456</value><description>password to use against metastore database</description></property>
<property><name>hive.metastore.warehouse.dir</name><value>/user/hive/warehouse</value></property>
<property><name>hive.execution.engine</name><value>spark</value></property>
<property><name>hive.enable.spark.execution.engine</name><value>true</value></property>
<property><name>spark.home</name><value>/applications/spark-2.0.0-bin-hadoop2-without-hive</value></property>
<property><name>spark.master</name><value>yarn</value></property>
<property><name>spark.eventLog.enabled</name><value>true</value></property>
<property><name>spark.eventLog.dir</name><value>hdfs://zy-hadoop:8020/spark-log</value><description>this directory must exist</description></property>
<property><name>spark.executor.memory</name><value>512m</value></property>
<property><name>spark.driver.memory</name><value>512m</value></property>
<property><name>spark.serializer</name><value>org.apache.spark.serializer.KryoSerializer</value></property>
<property><name>spark.yarn.jars</name><value>hdfs://zy-hadoop:8020/spark-jars/*</value></property>
<property><name>hive.spark.client.server.connect.timeout</name><value>300000</value></property>
<property><name>spark.yarn.queue</name><value>default</value></property>
<property><name>spark.app.name</name><value>zyInceptor</value></property>

One thing to note here: Hadoop runs in HA mode, so HDFS paths should be written as hdfs://cluster_name:8020/path (the nameservice here is zy-hadoop).

# Spark configuration (spark-env.sh):

#!/usr/bin/env bash
export JAVA_HOME=/applications/jdk1.8.0_73
export SCALA_HOME=/applications/scala-2.11.8
export HADOOP_HOME=/applications/hadoop-2.8.4
export HADOOP_CONF_DIR=/applications/hadoop-2.8.4/etc/hadoop
export HADOOP_YARN_CONF_DIR=/applications/hadoop-2.8.4/etc/hadoop
export SPARK_HOME=/applications/spark-2.0.0-bin-hadoop2-without-hive
export SPARK_WORKER_MEMORY=512m
export SPARK_EXECUTOR_MEMORY=512m
export SPARK_DRIVER_MEMORY=512m
export SPARK_DIST_CLASSPATH=$(/applications/hadoop-2.8.4/bin/hadoop classpath)
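After saving hive-site.xml, a quick sanity check from the shell confirms that Hive is really picking up the new values (property names as configured above):

# each "set <name>;" prints name=value as Hive sees it
[hadoop@hadoop01 ~]$ hive -e "set hive.execution.engine; set spark.master; set spark.yarn.jars;"
# expected:
# hive.execution.engine=spark
# spark.master=yarn
# spark.yarn.jars=hdfs://zy-hadoop:8020/spark-jars/*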

(3) Configure the jar packages

① Find the following jar packages in hive's lib directory and copy them to spark's jars directory:

hive-beeline-2.3.2.jar

hive-cli-2.3.2.jar

hive-exec-2.3.2.jar

hive-jdbc-2.3.2.jar

hive-metastore-2.3.2.jar

[hadoop@hadoop01 lib]$ cp hive-beeline-2.3.2.jar hive-cli-2.3.2.jar hive-exec-2.3.2.jar hive-jdbc-2.3.2.jar hive-metastore-2.3.2.jar /applications/spark-2.0.0-bin-hadoop2.7/jars/
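A quick listing confirms that the five jars landed where they should (path as in the command above):

[hadoop@hadoop01 lib]$ ls /applications/spark-2.0.0-bin-hadoop2.7/jars/ | grep '^hive-'
# expect at least the five hive-*-2.3.2.jar files copied above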

② Find the following jar packages in spark's jars directory and copy them to hive's lib directory:

spark-network-common_2.11-2.0.0.jar

spark-core_2.11-2.0.0.jar

scala-library-2.11.8.jar

chill-java

chill

jackson-module-paranamer

jackson-module-scala

jersey-container-servlet-core

jersey-server

json4s-ast

kryo-shaded

minlog

scala-xml

spark-launcher

spark-network-shuffle

spark-unsafe

xbean-asm5-shaded

[hadoop@hadoop01 jars]$ cp spark-network-common_2.11-2.0.0.jar spark-core_2.11-2.0.0.jar scala-library-2.11.8.jar chill-java-0.8.0.jar chill_2.11-0.8.0.jar jackson-module-paranamer-2.6.5.jar jackson-module-scala_2.11-2.6.5.jar jersey-container-servlet-core-2.22.2.jar jersey-server-2.22.2.jar json4s-ast_2.11-3.2.11.jar kryo-shaded-3.0.3.jar minlog-1.3.0.jar scala-xml_2.11-1.0.2.jar spark-launcher_2.11-2.0.0.jar spark-network-shuffle_2.11-2.0.0.jar spark-unsafe_2.11-2.0.0.jar xbean-asm5-shaded-4.4.jar /applications/hive-2.3.2-bin/lib/
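The same kind of check works in the other direction; Hive's lib directory should now contain the Spark-side jars as well:

# count the spark/scala helper jars that were just copied in (17 files, plus anything Hive already shipped)
[hadoop@hadoop01 jars]$ ls /applications/hive-2.3.2-bin/lib/ | grep -cE 'spark-|scala-|chill|kryo|minlog|jackson-module|jersey|json4s|xbean'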

③ Distribute the configuration files

Put hadoop's yarn-site.xml and hdfs-site.xml into spark's conf directory.

Also put hive-site.xml into spark's conf directory.
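Concretely, that is just three cp commands; a sketch using the paths from this article (the Hive conf location is assumed to be the standard hive home/conf):

# Hadoop's cluster configuration, so Spark can find HDFS and YARN:
cp /applications/hadoop-2.8.4/etc/hadoop/yarn-site.xml \
   /applications/hadoop-2.8.4/etc/hadoop/hdfs-site.xml \
   /applications/spark-2.0.0-bin-hadoop2-without-hive/conf/
# Hive's own configuration:
cp /applications/hive-2.3.2-bin/conf/hive-site.xml \
   /applications/spark-2.0.0-bin-hadoop2-without-hive/conf/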

④ Distribute the jar packages

This is the spark.yarn.jars property configured in hive-site.xml.

Here we first create this directory in hdfs:

[hadoop@hadoop01 conf]$ hadoop fs -mkdir /spark-jars

Put all the jar packages in spark's jars into this directory:

[hadoop@hadoop01 jars]$ hadoop fs -put ./*.jar /spark-jars
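A listing confirms the upload:

[hadoop@hadoop01 jars]$ hadoop fs -ls /spark-jars | head -5    # a few of the uploaded jars
[hadoop@hadoop01 jars]$ hadoop fs -ls /spark-jars | wc -l      # roughly one line per jar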

⑤ Start spark

[hadoop@hadoop01 jars]$ /applications/spark-2.0.0-bin-hadoop2-without-hive/sbin/start-all.sh

At this point, the following processes appear on this node:
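Since start-all.sh starts a standalone Master on this node and a Worker on each host listed in conf/slaves, jps should now additionally show something like:

[hadoop@hadoop01 jars]$ jps | grep -E 'Master|Worker'
# <pid> Master
# <pid> Worker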

(4) Test after completing the above steps

Run a SQL statement in hive:

A select count(1) from table; is generally what you run here to check!
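The reason for a count rather than a plain select * is that a simple select can be answered by just fetching the files, while count(1) forces a real job and therefore actually exercises the Spark engine. Against the user table created earlier, the check might look like:

hive> select count(1) from `user`;
# if the switch worked, the progress output shows Spark stages instead of MapReduce job IDs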

The Spark application UI appears, and the YARN interface shows the corresponding application (screenshots omitted).

If you see those interfaces, hive on spark has been installed successfully!

3. A problem encountered: version incompatibility

Reason: the Spark build must not contain the Hive dependencies; Spark has to be compiled with the -Phive profile removed.

Solution: compile Spark yourself.

Here are the build commands from the Hive website:

# Prior to Spark 2.0.0:
./make-distribution.sh --name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.4,parquet-provided"
# Since Spark 2.0.0:
./dev/make-distribution.sh --name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.7,parquet-provided"
# Since Spark 2.3.0:
./dev/make-distribution.sh --name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.7,parquet-provided,orc-provided"

(Although the first command is labelled "prior to Spark 2.0.0", in practice it is the one used to compile the Spark 1.6.x line.)

After the compilation succeeds, you can go through the steps described above.
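If you are ever unsure whether a given Spark package was built with Hive support, a rough check is to look for Hive jars in its jars directory; a without-hive build should show none (before the jar-copy step above, of course):

[hadoop@hadoop01 ~]$ ls /applications/spark-2.0.0-bin-hadoop2-without-hive/jars/ | grep -i hive
# no output here means the build does not bundle Hive; a -Phive build would list hive-exec, hive-metastore, etc.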

Here the editor also has the compiled spark:

Link: https://pan.baidu.com/s/1tPu2a34JZgcjKAtJcAh-pQ extraction code: kqvs
