This article focuses on how Spark SQL accesses Hive and MySQL. The method introduced here is simple, fast, and practical; if you are interested, read on to learn how Spark SQL accesses Hive and MySQL.
One: version
Set up working Hadoop, Hive, and Spark environments first. This article uses Hadoop 2.6.4, Hive 2.0.0, and Spark spark-1.6.1-bin-hadoop2.6.
Two: configure spark-env.sh
Configure the following in SPARK_HOME/conf/spark-env.sh:
export SCALA_HOME=/mysoftware/scala-2.11.8
export JAVA_HOME=/mysoftware/jdk1.7.0_80
export SPARK_MASTER_IP=master
export SPARK_WORKER_MEMORY=512m
export MASTER=spark://master:7077
In addition, many tutorials add the following two lines here, namely:
export CLASSPATH=$CLASSPATH:/mysoftware/spark-1.6.1/lib
export SPARK_CLASSPATH=$SPARK_CLASSPATH:/mysoftware/spark-1.6.1/lib/mysql-connector-java-5.1.5-bin.jar
These two lines are not added to spark-env.sh here, because SPARK_CLASSPATH has been deprecated since Spark 1.0+. If you do set it, the following message appears when starting spark-shell (an example of the recommended form follows the warning):
SPARK_CLASSPATH was detected (set to ':/mysoftware/spark-1.6.1/lib/mysql-connector-java-5.1.5-bin.jar').
This is deprecated in Spark 1.0+.
Please instead use:
 - ./spark-submit with --driver-class-path to augment the driver classpath
 - spark.executor.extraClassPath to augment the executor classpath
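For reference, a minimal sketch of the recommended form, using the jar path from this article (section six below uses the --driver-class-path variant on its own):

bin/spark-shell --driver-class-path /mysoftware/spark-1.6.1/lib/mysql-connector-java-5.1.5-bin.jar --conf spark.executor.extraClassPath=/mysoftware/spark-1.6.1/lib/mysql-connector-java-5.1.5-bin.jar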
Three: configure spark-defaults.conf
First copy SPARK_HOME/conf/spark-defaults.conf.template to spark-defaults.conf (with cp, as shown below). Most of the entries in this file are defaults, so this article leaves it unmodified; if you need changes, adjust them to match your own environment.
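For example (a sketch assuming the standard SPARK_HOME layout):

cp $SPARK_HOME/conf/spark-defaults.conf.template $SPARK_HOME/conf/spark-defaults.conf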
In addition, many online tutorials add the following two lines to this file, namely:
spark.executor.extraClassPath /mysoftware/spark-1.6.1/lib/mysql-connector-java-5.1.5-bin.jar
spark.driver.extraClassPath /mysoftware/spark-1.6.1/lib/mysql-connector-java-5.1.5-bin.jar
If you do, WARN messages about these two settings appear when spark-shell starts.
Four: add the MySQL driver jar package
Add mysql-connector-java-5.1.5-bin.jar to the SPARK_HOME/lib/ directory
Five: add files under the SPARK_HOME/conf directory
Copy hive-site.xml, core-site.xml (for security configuration) and hdfs-site.xml (for HDFS configuration) to the SPARK_HOME/conf directory, for example:
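A minimal sketch of the copy step (the source locations depend on your own Hive and Hadoop installations and are assumptions here):

cp $HIVE_HOME/conf/hive-site.xml $SPARK_HOME/conf/
cp $HADOOP_HOME/etc/hadoop/core-site.xml $SPARK_HOME/conf/
cp $HADOOP_HOME/etc/hadoop/hdfs-site.xml $SPARK_HOME/conf/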
According to the official website:
Configuration of Hive is done by placing your hive-site.xml, core-site.xml (for security configuration), hdfs-site.xml (for HDFS configuration) file in conf/. Please note when running the query on a YARN cluster (cluster mode), the datanucleus jars under the lib directory and hive-site.xml under conf/ directory need to be available on the driver and all executors launched by the YARN cluster. The convenient way to do this is adding them through the --jars option and --files option of the spark-submit command.
Six: Spark SQL accesses Hive
6.1 The first way to start spark-shell:
bin/spark-shell --driver-class-path /mysoftware/spark-1.6.1/lib/mysql-connector-java-5.1.5-bin.jar
hadoop@master:/mysoftware/spark-1.6.1$ bin/spark-shell --driver-class-path /mysoftware/spark-1.6.1/lib/mysql-connector-java-5.1.5-bin.jar
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's repl log4j profile: org/apache/spark/log4j-defaults-repl.properties
To adjust logging level use sc.setLogLevel("INFO")
Welcome to Spark version 1.6.1
Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_80)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
16/06/06 18:56:11 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/06/06 18:56:12 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/06/06 18:56:20 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
16/06/06 18:56:20 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
16/06/06 18:56:28 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/06/06 18:56:28 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/06/06 18:56:31 ERROR ObjectStore: Version information found in metastore differs 2.0.0 from expected schema version 1.2.0. Schema verification is disabled hive.metastore.schema.verification so setting version.
SQL context available as sqlContext.
scala>
Run the following commands, that is:
scala> sc
res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@631a8160

scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@3a957b9e

scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS sparkhive (key INT, value STRING)")
res1: org.apache.spark.sql.DataFrame = [result: string]
After running the third command above, the created table sparkhive can be queried in hive, that is:
hive> show tables;
OK
hbase_person
hivehbase
hivehbase_person
hivehbase_student
multiplehive
sparkhive
testhive
testsparkhive
Time taken: 1.154 seconds, Fetched: 8 row(s)
Now add data to the table and view the data, that is:
scala> sqlContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE sparkhive")
scala> sqlContext.sql("FROM sparkhive SELECT key, value").collect()
(the path 'examples/src/main/resources/kv1.txt' points to a sample file shipped with the Spark installation package)
scala> sqlContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE sparkhive")
res2: org.apache.spark.sql.DataFrame = [result: string]

scala> sqlContext.sql("FROM sparkhive SELECT key, value").collect()
res3: Array[org.apache.spark.sql.Row] = Array([238,val_238], [86,val_86], [311,val_311], [27,val_27], [165,val_165], [409,val_409], [255,val_255], [278,val_278], [98,val_98], [484,val_484], [265,val_265], [193,val_193], [401,val_401], [150,val_150], [273,val_273], [224,val_224], [369,val_369], [66,val_66], [128,val_128], [213,val_213], [146,val_146], [406,val_406], [429,val_429], [374,val_374], [152,val_152], [469,val_469], [145,val_145], [495,val_495], [37,val_37], [327,val_327], [281,val_281], [277,val_277], [209,val_209], [15,val_15], [82,val_82], [403,val_403], [166,val_166], [417,val_417], [430,val_430], [252,val_252], [292,val_292], [219,val_219], [287,val_287], [153,val_153], [193,val_193], [338,val_338], [446,val_446], [459,val_459], [394,val_394], [237,val_237], [482,val_482], ...

scala>
You can also view the data in hive with select * from sparkhive.
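As a further illustration (a minimal sketch that is not part of the original run, reusing the sqlContext and the sparkhive table created above), the same HiveContext can run ordinary HiveQL filters and aggregations and return DataFrames:

scala> sqlContext.sql("SELECT COUNT(*) AS cnt FROM sparkhive").show()
scala> sqlContext.sql("SELECT key, value FROM sparkhive WHERE key > 400").collect()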
6.2 The second way to start spark-shell:
SPARK_CLASSPATH=$SPARK_CLASSPATH:/mysoftware/spark-1.6.1/lib/mysql-connector-java-5.1.5-bin.jar bin/spark-shell
However, some WARN messages will appear, as follows (starting it the first way is recommended):
hadoop@master:/mysoftware/spark-1.6.1$ SPARK_CLASSPATH=$SPARK_CLASSPATH:/mysoftware/spark-1.6.1/lib/mysql-connector-java-5.1.5-bin.jar bin/spark-shell
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's repl log4j profile: org/apache/spark/log4j-defaults-repl.properties
To adjust logging level use sc.setLogLevel("INFO")
Welcome to Spark version 1.6.1
Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_80)
Type in expressions to have them evaluated.
Type :help for more information.
16/06/06 19:14:10 WARN SparkConf: SPARK_CLASSPATH was detected (set to ':/mysoftware/spark-1.6.1/lib/mysql-connector-java-5.1.5-bin.jar').
This is deprecated in Spark 1.0+.
Please instead use:
 - ./spark-submit with --driver-class-path to augment the driver classpath
 - spark.executor.extraClassPath to augment the executor classpath
16/06/06 19:14:10 WARN SparkConf: Setting 'spark.executor.extraClassPath' to ':/mysoftware/spark-1.6.1/lib/mysql-connector-java-5.1.5-bin.jar' as a work-around.
16/06/06 19:14:10 WARN SparkConf: Setting 'spark.driver.extraClassPath' to ':/mysoftware/spark-1.6.1/lib/mysql-connector-java-5.1.5-bin.jar' as a work-around.
Spark context available as sc.
16/06/06 19:14:26 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/06/06 19:14:27 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/06/06 19:14:35 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
16/06/06 19:14:35 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
16/06/06 19:14:39 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/06/06 19:14:39 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
SQL context available as sqlContext.
scala>
Seven: Spark SQL accesses MySQL
Also start spark-shell in the first way described above.
Note how the following parameters are written:
"url" -> "jdbc:mysql://192.168.226.129:3306/hive?user=hive&password=xujun"
  (the JDBC URL of the remote MySQL server, plus a user name and password, connecting to the hive database)
"dbtable" -> "hive.TBLS"
  (here we use the table TBLS that already exists in the hive database)
"driver" -> "com.mysql.jdbc.Driver"
  (the JDBC driver class)
7.1 The first way is to load data through sqlContext.read.format("jdbc").options(...).load() (this produces a DataFrameReader along the way; see the API for details):
val jdbcDF = sqlContext.read.format("jdbc").options(Map("url" -> "jdbc:mysql://192.168.226.129:3306/hive?user=hive&password=xujun", "dbtable" -> "hive.TBLS", "driver" -> "com.mysql.jdbc.Driver")).load()
The specific information is as follows:
scala> val jdbcDF = sqlContext.read.format("jdbc").options(Map("url" -> "jdbc:mysql://192.168.226.129:3306/hive?user=hive&password=xujun", "dbtable" -> "hive.TBLS", "driver" -> "com.mysql.jdbc.Driver")).load()
jdbcDF: org.apache.spark.sql.DataFrame = [TBL_ID: bigint, CREATE_TIME: int, DB_ID: bigint, LAST_ACCESS_TIME: int, OWNER: string, RETENTION: int, SD_ID: bigint, TBL_NAME: string, TBL_TYPE: string, VIEW_EXPANDED_TEXT: string, VIEW_ORIGINAL_TEXT: string]

scala> jdbcDF.show()
+------+-----------+-----+----------------+------+---------+-----+-----------------+--------------+------------------+------------------+
|TBL_ID|CREATE_TIME|DB_ID|LAST_ACCESS_TIME| OWNER|RETENTION|SD_ID|         TBL_NAME|      TBL_TYPE|VIEW_EXPANDED_TEXT|VIEW_ORIGINAL_TEXT|
+------+-----------+-----+----------------+------+---------+-----+-----------------+--------------+------------------+------------------+
|    11| 1464510462|    1|               0|  hive|        0|   11|         testhive| MANAGED_TABLE|              null|              null|
|    22| 1464513715|    1|               0|hadoop|        0|   22|        hivehbase| MANAGED_TABLE|              null|              null|
|    23| 1464517000|    1|               0|hadoop|        0|   23|     hbase_person|EXTERNAL_TABLE|              null|              null|
|    24| 1464517563|    1|               0|hadoop|        0|   24|hivehbase_student|EXTERNAL_TABLE|              null|              null|
|    29| 1464521014|    1|               0|hadoop|        0|   29|     multiplehive| MANAGED_TABLE|              null|              null|
|    36| 1464522011|    1|               0|hadoop|        0|   36| hivehbase_person| MANAGED_TABLE|              null|              null|
|    41| 1465227955|    1|               0|hadoop|        0|   41|    testsparkhive| MANAGED_TABLE|              null|              null|
|    46| 1465264720|    1|               0|hadoop|        0|   46|        sparkhive| MANAGED_TABLE|              null|              null|
+------+-----------+-----+----------------+------+---------+-----+-----------------+--------------+------------------+------------------+
7.2 The second way is to load data through sqlContext.load("jdbc", ...):
val jdbcDF = sqlContext.load("jdbc", Map("url" -> "jdbc:mysql://192.168.226.129:3306/hive?user=hive&password=xujun", "dbtable" -> "hive.TBLS", "driver" -> "com.mysql.jdbc.Driver"))
Display data:
jdbcDF.show()
The specific information is the same as in the first way above.
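As an optional follow-up (a minimal sketch that is not part of the original run, reusing the jdbcDF loaded above), the MySQL-backed DataFrame can be registered as a temporary table and queried with the same sqlContext, alongside the Hive tables:

scala> jdbcDF.registerTempTable("tbls")
scala> sqlContext.sql("SELECT TBL_NAME, TBL_TYPE FROM tbls WHERE TBL_TYPE = 'MANAGED_TABLE'").show()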
At this point, I believe you have a deeper understanding of how Spark SQL accesses Hive and MySQL. You might as well try it out in practice.