
Big Data Learning Series Part 9: Hive Integrates Spark and HBase, with Related Tests


Preface

The earlier post in this series, Big Data Learning Series Part 7 - Hadoop+Spark+Zookeeper+HBase+Hive Cluster Building, covered how to set up the cluster environment, but data queries in Hive were very slow because Hive's default execution engine is MapReduce. So I switched Hive to use Spark as its engine for querying HBase. Having completed the integration successfully, I am writing this blog post on how the process works. The details are as follows!

Preparation

Before integrating, first make sure that the Hive, HBase, and Spark environments have been built successfully! If not, you can take a look at my earlier article, Big Data Learning Series Part 7 - Hadoop+Spark+Zookeeper+HBase+Hive Cluster Building.

With that done, let's integrate Hive, HBase, and Spark.

The current configuration of the cluster is as follows:

Hive integrates HBase

The integration of Hive and HBase is accomplished through their external API interfaces, with the actual communication handled by the hive-hbase-handler-*.jar utility class in Hive's lib directory. So you just need to copy Hive's hive-hbase-handler-*.jar into hbase/lib.

Change to the hive/lib directory

Enter:

cp hive-hbase-handler-*.jar /opt/hbase/hbase1.2/lib

Note: if version problems come up when integrating Hive with HBase, take HBase's version as the reference and overwrite the corresponding jar packages in Hive with the ones from HBase.

As for the tests of Hive with HBase, please see my earlier article, Big Data Learning Series Part 5 - Hive Integrates HBase, with Illustrations; they will not be described again in this article.

Hive integrates Spark

In fact, Hive on Spark simply means Hive uses a Spark build that was compiled for it. The integration is rather finicky because the versions cannot be arbitrary: you must use a Spark build compiled against the specified Hive version. This problem bothered us for a long time at first, but in the end we looked up which compiled versions of Spark and Hive match, and we only needed to use that pre-compiled package. The specific usage is as follows.

Configuration changes for hive

Change to the hive/conf directory

Edit the hive-env.sh file

Add the environment for spark:

export SPARK_HOME=/opt/spark/spark1.6-hadoop2.4-hive

Then edit the hive-site.xml file

Add the following configurations to hive-site.xml (the complete block appears under "Full configuration" below).

Descriptions of these configurations:

hive.execution.engine: the default execution engine for Hive; here we fill in spark. If you would rather not add this configuration and want to switch to Spark manually instead, enter the following after starting the hive shell:

set hive.execution.engine=spark;

spark.master: the master address of Spark. Here we fill in Spark's default master address.

spark.home: the installation path of Spark.

spark.submit.deployMode: the submit mode for Spark; client is used here.

spark.serializer: the serializer Spark uses.

spark.eventLog.enabled: whether to enable Spark event logging; it is set to true here.

spark.eventLog.dir: the storage path for Spark's event logs. Note that this path must be created on HDFS first (see the command after these descriptions)!

spark.executor.memory: the execution memory allocated to each Spark executor, configured according to your machine.

spark.driver.memory: the memory allocated to the Spark driver, configured according to your machine.
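
Since the event log directory must exist on HDFS before Spark can write to it, it can be created up front. A minimal sketch, assuming the NameNode address and directory used in the configuration below:

hadoop fs -mkdir -p hdfs://master:9000/directory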

Full configuration:

<property>
  <name>hive.execution.engine</name>
  <value>spark</value>
</property>
<property>
  <name>spark.master</name>
  <value>spark://master:7077</value>
</property>
<property>
  <name>spark.home</name>
  <value>/opt/spark/spark1.6-hadoop2.4-hive</value>
</property>
<property>
  <name>spark.submit.deployMode</name>
  <value>client</value>
</property>
<property>
  <name>spark.serializer</name>
  <value>org.apache.spark.serializer.KryoSerializer</value>
</property>
<property>
  <name>spark.eventLog.enabled</name>
  <value>true</value>
</property>
<property>
  <name>spark.eventLog.dir</name>
  <value>hdfs://master:9000/directory</value>
</property>
<property>
  <name>spark.executor.memory</name>
  <value>10G</value>
</property>
<property>
  <name>spark.driver.memory</name>
  <value>10G</value>
</property>

After successfully configuring these, enter the hive shell.

Run a simple join query between two tables (an example follows).
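
A minimal sketch of such a join, assuming the two HBase-backed tables created in the next section (t_student and t_student_info), joined on their shared id key:

select s.id, s.name, i.age, i.sex
from t_student s
join t_student_info i on s.id = i.id
limit 10;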

You can see that hive has successfully used spark as its engine.

Testing Hive on HBase with the Spark engine

After the environment was integrated successfully and two Hive external tables backed by HBase were established, I ran data query tests.

The creation scripts for the two tables:

create table t_student(id int, name string)
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
with serdeproperties ("hbase.columns.mapping" = ":key,st1:name")
tblproperties ("hbase.table.name" = "t_student", "hbase.mapred.output.outputtable" = "t_student");

create table t_student_info(id int, age int, sex string)
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
with serdeproperties ("hbase.columns.mapping" = ":key,st1:age,st1:sex")
tblproperties ("hbase.table.name" = "t_student_info", "hbase.mapred.output.outputtable" = "t_student_info");

Then insert 1,000,000 rows into each of the two tables for testing.

Note: here I inserted the 1,000,000 rows directly into HBase using HBase's Java API; for details, see the blog post Big Data Learning Series Part 3 - HBase Java API, with Illustrations. A sketch of the approach is below.
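
A minimal sketch of such a bulk insert with the HBase 1.x client API, populating t_student; the ZooKeeper quorum host and the generated row values are assumptions to adapt to your cluster:

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseBulkInsert {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Assumed ZooKeeper quorum; point this at your own cluster
        conf.set("hbase.zookeeper.quorum", "master");
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("t_student"))) {
            List<Put> batch = new ArrayList<>();
            for (int i = 1; i <= 1000000; i++) {
                // Rowkey maps to the Hive column "id"; st1:name maps to "name"
                Put put = new Put(Bytes.toBytes(String.valueOf(i)));
                put.addColumn(Bytes.toBytes("st1"), Bytes.toBytes("name"),
                        Bytes.toBytes("name" + i));
                batch.add(put);
                // Flush in chunks of 10,000 puts to keep memory usage bounded
                if (batch.size() == 10000) {
                    table.put(batch);
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                table.put(batch);
            }
        }
    }
}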

After successfully inserting, we test the query speed in hive shell.

Row count test:
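
For example, a count over the t_student table:

select count(*) from t_student;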

Primary key (rowkey) query test:
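
For example, filtering on id, which the table mapping ties to the HBase rowkey (the value 1 is just an illustration):

select * from t_student where id = 1;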

Non-primary key query test:
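
For example, filtering on name, an ordinary column rather than the rowkey (the value is just an illustration):

select * from t_student where name = 'name1';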

Note: in fact, you can also use Hive's API, that is, an ordinary JDBC connection, except that the connection driver should be replaced with:

Class.forName("org.apache.hive.jdbc.HiveDriver");

For specific implementation, you can see the code in my github: https://github.com/xuwujing/pancm_project/blob/master/src/main/java/com/pancm/test/hiveTest/hiveUtil.java
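
A minimal sketch of such a JDBC query, assuming HiveServer2 is running on master at the default port 10000 with no authentication (host, port, and credentials are assumptions):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcTest {
    public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver mentioned above
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://master:10000/default", "hive", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("select count(*) from t_student")) {
            while (rs.next()) {
                System.out.println("row count: " + rs.getLong(1));
            }
        }
    }
}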

Conclusion: with Hive on Spark, if the query condition is the primary key, that is, the rowkey in HBase, a query over the 1,000,000 rows returns in about 2.3s (my impression is that starting Spark accounts for about 2s of that, so for larger volumes the speed is not bad at all); but if you query on a non-primary-key condition, the speed slows down noticeably.

So when using Hive on HBase, try to query by rowkey.

Postscript

In fact, the cluster environment and the integration were already set up by the time I wrote the first post in this big data series. As for why this post was written so late: first, when we built the environment we did not really understand what those configurations did; second, building the environment was somewhat baffling and problems came up constantly, although most of the problems and their solutions have been recorded and written up as posts, so writing slowly is really a way of reorganizing my own knowledge; third, my energy is limited and I cannot turn all of it into blog posts, since writing a post also takes a certain amount of time and energy.

After finishing this post, I will stop writing about big data for the time being. I feel my current ability is not yet sufficient; if I forced myself to keep self-studying, I would probably struggle to learn the material, let alone explain it in a blog. So I will set it aside for now and continue writing when I am able!

Big data study series of articles: http://blog.csdn.net/column/details/18120.html
