After tossing about all day, I finally solved the result3 error from the previous section. As for why that error occurred, let's walk through how the problem was tracked down.
First of all, I found this thread: http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-select-syntax-td16299.html, which contains the following passage:
The issue is that you're using SQLContext instead of HiveContext. SQLContext implements a smaller subset of the SQL language and so you're getting a SQL parse error because it doesn't support the syntax you have. Look at how you'd write this in HiveQL, and then try doing that with HiveContext.
In fact, there are more problems than that. The sparkSQL will conserve (15520) columns in the final table, if I remember well. Therefore, when you are doing join on two tables which have the same columns will cause doublecolumn error.
Two points are mentioned here: (1) use HiveContext instead of SQLContext; (2) SQLContext only implements a subset of SQL, and that is the cause of this error.
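In concrete terms, the switch looks something like this in spark-shell (a minimal sketch assuming Spark 1.x, where the shell already provides sc; the table name some_table is hypothetical):

// sc is already provided by spark-shell
val sqlContext  = new org.apache.spark.sql.SQLContext(sc)       // parses only a limited SQL subset
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) // uses the HiveQL parser and reads hive-site.xml

// run the query through hiveContext instead of sqlContext
hiveContext.sql("SELECT * FROM some_table LIMIT 10").collect().foreach(println)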
Well, since HiveContext is what it takes, let's use HiveContext (this step alone held me up for a long time).
First, to see what is needed to use HiveContext, refer to this article: http://www.cnblogs.com/byrhuangqiang/p/4012087.html
There are three requirements in the article:
1. Check that the datanucleus-api-jdo-3.2.1.jar, datanucleus-rdbms-3.2.1.jar and datanucleus-core-3.2.2.jar packages are present in the $SPARK_HOME/lib directory.
2. Check that a hive-site.xml copied from the $HIVE_HOME/conf directory is present in the $SPARK_HOME/conf directory.
3. When submitting the program, pass the database driver jar via --driver-class-path, e.g. bin/spark-submit --driver-class-path *.jar, or set SPARK_CLASSPATH in spark-env.sh.
So I configured everything as required, but once the configuration was done an error was reported (in interactive mode):
Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
My preliminary judgment was that Hive could not connect to its metastore database, so I added the metastore connection parameter to the hive-site.xml file:
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://111.121.21.23:9083</value>
</property>
After specifying the parameter, I eagerly ran a query, and yet another error came up (this one kept me stuck for a long time):
ERROR ObjectStore: Version information not found in metastore.
This error means that when HiveContext starts up, it tries to read the schema version information from the Hive metastore, and if that information cannot be found this exception is thrown. There are plenty of write-ups online; the suggested fix is to add a parameter to the hive-site.xml file:
<property>
  <name>hive.metastore.schema.verification</name>
  <value>false</value>
</property>
After adding the parameter and restarting the Hive service, I ran Spark's HiveContext again and still got the same error. So I compiled and packaged the program in the IDE and executed it on the server:
#!/bin/bash
cd /opt/huawei/Bigdata/DataSight_FM_BasePlatform_V100R001C00_Spark/spark/
./bin/spark-submit \
--class HiveContextTest \
--master local \
--files /opt/huawei/Bigdata/hive-0.13.1/hive-0.13.1/conf/hive-site.xml \
/home/wlb/spark.jar \
--archives datanucleus-api-jdo-3.2.6.jar,datanucleus-core-3.2.10.jar,datanucleus-rdbms-3.2.9.jar \
--classpath /opt/huawei/Bigdata/hive-0.13.1/hive-0.13.1/lib/*.jar
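The source of HiveContextTest isn't shown in this post; as a rough reconstruction (the table name some_hive_table is hypothetical, and Spark 1.x APIs are assumed), it might look something like this:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// hypothetical reconstruction of the HiveContextTest class submitted above;
// the actual table and query are not shown in the post
object HiveContextTest {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HiveContextTest"))
    val hc = new HiveContext(sc)

    // hive-site.xml shipped via --files (or found on the classpath) tells
    // HiveContext where the metastore lives
    hc.sql("SELECT COUNT(*) FROM some_hive_table").collect().foreach(println)

    sc.stop()
  }
}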
Frustratingly, yet another error came up (maddening!): java.net.UnknownHostException: hacluster
hacluster is the value of Hadoop's dfs.nameservices (the HDFS nameservice ID).
So the exception means the hostname hacluster cannot be resolved. Searching further, the answer given online is:
copy hdfs-site.xml into Spark's conf directory. Sure enough, once the copy was done, the jar built from the program could finally run successfully on the server.
But looking back at that earlier error, ERROR ObjectStore: Version information not found in metastore:
what on earth was causing it? And what is the difference between running the packaged jar and running in shell mode?
Pressing on, I ran the HiveContext-based SQL in shell mode again and still got this error, so I turned on Spark's debug logging to look for anything useful, searched for a long time, and found nothing of value in the logs. Searching the Internet again, the reports of this message online are all at WARN level, while mine is at ERROR level.
At this point I was out of ideas. Well, since my jar runs successfully, let's figure out what differs between executing the jar and running in shell mode.
The first thing that came to mind: why doesn't the hive.metastore.schema.verification parameter in hive-site.xml take effect? I had restarted the service, so was the parameter simply not being read?
I added the HIVE_HOME environment variable and ran it again, but it still had no effect, which means the parameter was not being read. I was on the verge of giving up when, after a long while, it occurred to me: where does my spark-shell command actually come from? So I checked: which spark-shell
That turned something up: the spark-shell comes from the bin directory of the Spark client program (the environment variable had been set for convenience in this Huawei product), that is, my PATH points to the Spark client program directory by default!
The root of the problem was finally found: copy hive-site.xml and hdfs-site.xml into the client program's conf directory, restart the Hive service, and everything is OK!
After a while I still felt a little uneasy: was this really the cause? So I tested it on other nodes. At first, with the parameter absent from the client program's conf directory, execution failed; after adding it, hive.metastore.schema.verification took effect!
Mission accomplished! Throughout the whole process Spark's debug logging was turned on, but nothing of value ever appeared in the logs.
By the way, to debug Spark HiveContext programs in the IDE, you need to add a resources directory (of type Resources) under the main directory and put hive-site.xml and hdfs-site.xml into it.
You also need to add three driver packages:
datanucleus-api-jdo-3.2.6.jar, datanucleus-core-3.2.10.jar and datanucleus-rdbms-3.2.9.jar
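If you manage the project with a build tool such as sbt instead of configuring the jars by hand in Scala IDE (an assumption on my part; the post itself uses Scala IDE), the equivalent setup might look roughly like this. The DataNucleus coordinates are the standard Maven ones; the Spark version shown is an assumption:

// build.sbt -- a hypothetical sbt equivalent of the Scala IDE setup above;
// put hive-site.xml and hdfs-site.xml under src/main/resources so they land on the classpath
name := "spark-hivecontext-demo"

scalaVersion := "2.10.4"

// the Spark version is an assumption; the cluster in this post runs Spark 1.x with Hive 0.13
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"          % "1.3.1" % "provided",
  "org.apache.spark" %% "spark-hive"          % "1.3.1" % "provided",
  "org.datanucleus"  %  "datanucleus-api-jdo" % "3.2.6",
  "org.datanucleus"  %  "datanucleus-core"    % "3.2.10",
  "org.datanucleus"  %  "datanucleus-rdbms"   % "3.2.9"
)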
I almost forgot that the whole point was to solve the result3 problem from the previous section. Ha, that problem really does come down to Spark SQL's limited SQL syntax support. Consider expressing the query another way (without nesting a subquery inside IN), such as building separate RDDs or using left/right joins (still to be tested), as in the sketch below.
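Here is a sketch of the join-based rewrite (the table and column names orders, blacklist and user_id are hypothetical, and, as noted above, this still needs to be tested):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object InSubqueryRewrite {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("InSubqueryRewrite"))
    val hc = new HiveContext(sc)

    // Instead of: SELECT * FROM orders WHERE user_id IN (SELECT user_id FROM blacklist)
    // use HiveQL's LEFT SEMI JOIN, which expresses the same filter as a join:
    val result = hc.sql(
      """SELECT o.*
        |FROM orders o
        |LEFT SEMI JOIN blacklist b ON o.user_id = b.user_id""".stripMargin)

    result.collect().foreach(println)
    sc.stop()
  }
}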
In the next section, I'll briefly cover how to configure Scala IDE (that problem ate two days of the Qingming Festival holiday; I'll sum up two approaches).