
Practical Case of Spark on Yarn with Hive and Solutions to Common Problems


[TOC]

1 Scenario

In practice, you may encounter a scenario like this:

Log data is written to HDFS, an ETL job loads the HDFS data into Hive, and Spark is then used to analyze and process the logs. Spark is deployed in Spark on Yarn mode.

Given this scenario, the Spark program needs to load the Hive data through HiveContext.

If you want to test this yourself, you can refer to my previous articles on environment configuration; the main points are as follows:

1. Hadoop environment: the Hadoop environment configuration can be found in a previous article.
2. Spark environment: Spark only needs to be configured on the node that submits the job, since the Spark on Yarn mode is used.
3. Hive environment: a Hive environment is required, because the hive-site.xml file must be submitted together with the Spark task; only then can the metadata of the existing Hive environment be recognized. In other words, in the Spark on Yarn deployment mode, all that is really needed from Hive is its configuration file, so that HiveContext can read the metadata stored in MySQL and the Hive table data stored on HDFS (for example, by making hive-site.xml available to Spark, as sketched below). The Hive environment configuration can also be found in a previous article.
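
The submit script used later in this article references $SPARK_HOME/conf/hive-site.xml, so one simple way to make the Hive configuration available is to copy it into Spark's conf directory on the submitting node. A minimal sketch, assuming $HIVE_HOME and $SPARK_HOME are set there:

# Make the Hive metastore configuration visible to Spark on the submitting node
cp $HIVE_HOME/conf/hive-site.xml $SPARK_HOME/conf/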

An earlier article already covered Spark Standalone with Hive; see "Spark SQL Notes (3): load/save functions and Spark SQL functions."

2 Programming and Packaging

The test code used here is deliberately simple, as follows:

package cn.xpleaf.spark.scala.sql.p2

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkConf, SparkContext}

/**
  * @author xpleaf
  */
object _01HiveContextOps {

  def main(args: Array[String]): Unit = {
    Logger.getLogger("org.apache.spark").setLevel(Level.OFF)
    val conf = new SparkConf()
      // .setMaster("local[2]")
      .setAppName(s"${_01HiveContextOps.getClass.getSimpleName}")
    val sc = new SparkContext(conf)
    val hiveContext = new HiveContext(sc)

    hiveContext.sql("show databases").show()
    hiveContext.sql("use mydb1")

    // Create the teacher_info table
    val sql1 = "create table teacher_info(\n" +
      "name string,\n" +
      "height double)\n" +
      "row format delimited\n" +
      "fields terminated by ','"
    hiveContext.sql(sql1)

    // Create the teacher_basic table
    val sql2 = "create table teacher_basic(\n" +
      "name string,\n" +
      "age int,\n" +
      "married boolean,\n" +
      "children int)\n" +
      "row format delimited\n" +
      "fields terminated by ','"
    hiveContext.sql(sql2)

    // Load data into the two tables
    hiveContext.sql("load data inpath 'hdfs://ns1/data/hive/teacher_info.txt' into table teacher_info")
    hiveContext.sql("load data inpath 'hdfs://ns1/data/hive/teacher_basic.txt' into table teacher_basic")

    // Step 2: compute the join of the two tables
    val sql3 = "select\n" +
      "b.name,\n" +
      "b.age,\n" +
      "if(b.married, 'married', 'unmarried') as married,\n" +
      "b.children,\n" +
      "i.height\n" +
      "from teacher_info i\n" +
      "inner join teacher_basic b on i.name = b.name"
    val joinDF: DataFrame = hiveContext.sql(sql3)

    val joinRDD = joinDF.rdd
    joinRDD.collect().foreach(println)

    // Save the joined result back into a Hive table
    joinDF.write.saveAsTable("teacher")

    sc.stop()
  }

}

As you can see, the program simply creates tables in Hive, loads data into them, joins the two tables, and saves the result back to a Hive table.
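
For reference, the two files referenced by the load statements are plain comma-delimited text files on HDFS. Judging from the query results shown in section 4.3 below, their contents look roughly like this:

teacher_info.txt (name,height):
zhangsan,175
lisi,180
wangwu,175
zhaoliu,195
zhouqi,165
weiba,185

teacher_basic.txt (name,age,married,children):
zhangsan,23,false,0
lisi,24,false,0
wangwu,25,false,0
zhaoliu,26,true,1
zhouqi,27,true,2
weiba,28,true,3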

Once the code is written, package it; note that the dependencies do not need to be bundled into the jar. Then upload the jar to the environment from which jobs are submitted.
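
For example, assuming a Maven project (the original build tool is not shown), a plain package build already produces a thin jar without bundled dependencies, which can then be copied to the submitting node; the jar name and paths below are the ones used later in this article:

# Build a thin jar (the Spark/Hive dependencies are provided by the cluster, not bundled)
mvn clean package -DskipTests
# Copy the jar to the node from which jobs are submitted
scp target/spark-process-1.0-SNAPSHOT.jar hadoop@hadoop01:/home/hadoop/jars/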

3 Deployment

Write a submit script as follows:

[hadoop@hadoop01 jars]$ cat spark-submit-yarn.sh
/home/hadoop/app/spark/bin/spark-submit \
--class $2 \
--master yarn \
--deploy-mode cluster \
--executor-memory 1G \
--num-executors 1 \
--files $SPARK_HOME/conf/hive-site.xml \
--jars $SPARK_HOME/lib/mysql-connector-java-5.1.39.jar,$SPARK_HOME/lib/datanucleus-api-jdo-3.2.6.jar,$SPARK_HOME/lib/datanucleus-core-3.2.10.jar,$SPARK_HOME/lib/datanucleus-rdbms-3.2.9.jar \
$1 \

Note the key options --files and --jars, described as follows:

--files $HIVE_HOME/conf/hive-site.xml    // adds the Hive configuration file to the classpath of the Driver and Executors
--jars $HIVE_HOME/lib/mysql-connector-java-5.1.39.jar,…    // adds the Hive-related dependency jars to the classpath of the Driver and Executors

You can then execute the script to submit the task to Yarn:

[hadoop@hadoop01 jars]$ ./spark-submit-yarn.sh spark-process-1.0-SNAPSHOT.jar cn.xpleaf.spark.scala.sql.p2._01HiveContextOps

4 Viewing the Results

Note that if you want to review the execution process, you need to configure the history servers (the JobHistoryServer for MapReduce and the HistoryServer for Spark); you can refer to my previous article. You can also check the application from the command line, as sketched below.
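
Independently of the web UIs, the standard YARN CLI can be used to check the application while it runs or after it finishes. The application id below is the one that appears in the client output later in this article, and yarn logs requires log aggregation to be enabled:

# List recent applications on YARN
yarn application -list -appStates ALL
# Fetch the aggregated logs of a finished application
yarn logs -applicationId application_1538989570769_0023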

4.1 Yarn UI

4.2 Spark UI

4.3 Hive

You can start the Hive CLI and view the data loaded and written by our Spark program:

hive (mydb1)> show tables;
OK
t1
t2
t3_arr
t4_map
t5_struct
t6_emp
t7_external
t8_partition
t8_partition_1
t8_partition_copy
t9
t9_bucket
teacher
teacher_basic
teacher_info
test
tid
Time taken: 0.057 seconds, Fetched: 17 row(s)
hive (mydb1)> select *
            > from teacher_info;
OK
zhangsan    175.0
lisi        180.0
wangwu      175.0
zhaoliu     195.0
zhouqi      165.0
weiba       185.0
Time taken: 1.717 seconds, Fetched: 6 row(s)
hive (mydb1)> select *
            > from teacher_basic;
OK
zhangsan    23    false    0
lisi        24    false    0
wangwu      25    false    0
zhaoliu     26    true     1
zhouqi      27    true     2
weiba       28    true     3
Time taken: 0.115 seconds, Fetched: 6 row(s)
hive (mydb1)> select *
            > from teacher;
OK
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
zhangsan    23    unmarried    0    175.0
lisi        24    unmarried    0    180.0
wangwu      25    unmarried    0    175.0
zhaoliu     26    married      1    195.0
zhouqi      27    married      2    165.0
weiba       28    married      3    185.0
Time taken: 0.134 seconds, Fetched: 6 row(s)

5 Problems and Solutions

1. User class threw exception: java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

Note that our deployment mode is Spark on Yarn, and the YARN cluster itself does not carry the Spark and Hive dependencies, so when submitting a task you must explicitly specify the dependency jars to be uploaded:

--jars $SPARK_HOME/lib/mysql-connector-java-5.1.39.jar,$SPARK_HOME/lib/datanucleus-api-jdo-3.2.6.jar,$SPARK_HOME/lib/datanucleus-core-3.2.10.jar,$SPARK_HOME/lib/datanucleus-rdbms-3.2.9.jar \
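
The datanucleus jars ship in the Spark 1.6 distribution's lib directory, and the MySQL connector is the one placed there for the Hive metastore; before adding them to --jars, a quick way to confirm they are all present on the submitting node:

# Confirm the metastore-related jars exist under $SPARK_HOME/lib
ls $SPARK_HOME/lib/ | grep -E 'datanucleus|mysql-connector'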

In fact, you can see this by observing the console output when the task is submitted:

18/10/09 10:57:44 INFO yarn.Client: Uploading resource file:/home/hadoop/app/spark/lib/spark-assembly-1.6.2-hadoop2.6.0.jar -> hdfs://ns1/user/hadoop/.sparkStaging/application_1538989570769_0023/spark-assembly-1.6.2-hadoop2.6.0.jar
18/10/09 10:57:47 INFO yarn.Client: Uploading resource file:/home/hadoop/jars/spark-process-1.0-SNAPSHOT.jar -> hdfs://ns1/user/hadoop/.sparkStaging/application_1538989570769_0023/spark-process-1.0-SNAPSHOT.jar
18/10/09 10:57:47 INFO yarn.Client: Uploading resource file:/home/hadoop/app/spark/lib/mysql-connector-java-5.1.39.jar -> hdfs://ns1/user/hadoop/.sparkStaging/application_1538989570769_0023/mysql-connector-java-5.1.39.jar
18/10/09 10:57:47 INFO yarn.Client: Uploading resource file:/home/hadoop/app/spark/lib/datanucleus-api-jdo-3.2.6.jar -> hdfs://ns1/user/hadoop/.sparkStaging/application_1538989570769_0023/datanucleus-api-jdo-3.2.6.jar
18/10/09 10:57:47 INFO yarn.Client: Uploading resource file:/home/hadoop/app/spark/lib/datanucleus-core-3.2.10.jar -> hdfs://ns1/user/hadoop/.sparkStaging/application_1538989570769_0023/datanucleus-core-3.2.10.jar
18/10/09 10:57:47 INFO yarn.Client: Uploading resource file:/home/hadoop/app/spark/lib/datanucleus-rdbms-3.2.9.jar -> hdfs://ns1/user/hadoop/.sparkStaging/application_1538989570769_0023/datanucleus-rdbms-3.2.9.jar
18/10/09 10:57:47 INFO yarn.Client: Uploading resource file:/home/hadoop/app/spark/conf/hive-site.xml -> hdfs://ns1/user/hadoop/.sparkStaging/application_1538989570769_0023/hive-site.xml
18/10/09 10:57:47 INFO yarn.Client: Uploading resource file:/tmp/spark-6f582e5c-3eef-4646-b8c7-0719877434d8/__spark_conf__103916311924336720.zip -> hdfs://ns1/user/hadoop/.sparkStaging/application_1538989570769_0023/__spark_conf__103916311924336720.zip

As you can see, the relevant Spark jars are uploaded to the YARN environment (that is, to HDFS) before the task itself is executed.

2. User class threw exception: org.apache.spark.sql.execution.QueryExecutionException: FAILED: SemanticException [Error 10072]: Database does not exist: mydb1

The error says that mydb1 does not exist, which means the metadata of the existing Hive environment was not read. This happens when the hive-site.xml configuration file is not specified at submission time; the fix is to add it:

--files $SPARK_HOME/conf/hive-site.xml \
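
As a quick sanity check before resubmitting (a minimal sketch, assuming the metastore is backed by MySQL as described above), you can confirm that the hive-site.xml being shipped really points at the metastore:

# Verify the hive-site.xml passed via --files contains the metastore connection URL
grep -A 1 'javax.jdo.option.ConnectionURL' $SPARK_HOME/conf/hive-site.xml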
