1. First, install Hive. See http://lqding.blog.51cto.com/9123978/1750967 for the installation steps.
2. Add a hive-site.xml file under Spark's conf directory so that Spark can reach Hive's metastore.
root@spark-master:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/conf# vi hive-site.xml
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://spark-master:9083</value>
    <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
  </property>
</configuration>
3. Copy the MySQL JDBC driver to Spark's lib directory.
root@spark-master:/usr/local/hive/apache-hive-1.2.1/lib# cp mysql-connector-java-5.1.36-bin.jar /usr/local/spark/spark-1.6.0-bin-hadoop2.6/lib/
4. Start the metastore service of Hive
root@spark-master:/usr/local/hive/apache-hive-1.2.1/bin# ./hive --service metastore &
[1] 20518
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hadoop/hadoop-2.6.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Starting Hive Metastore Server
5. Start spark-shell
root@spark-master:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/bin# ./spark-shell --master spark://spark-master:7077
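Once the shell is up, the pre-built spark-1.6.0-bin-hadoop2.6 package typically ships with Hive support, so the sqlContext the shell provides is usually already a HiveContext. A quick check, assuming such a build (the walkthrough below still creates its own HiveContext explicitly):
scala> sqlContext.isInstanceOf[org.apache.spark.sql.hive.HiveContext]   // expected to be true when the build includes Hive classes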
Create a HiveContext
scala> val hc = new org.apache.spark.sql.hive.HiveContext(sc)
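hc.sql returns a DataFrame, so besides printing rows with foreach(println) as in the examples below, results can be inspected with show() or loaded through the table API. A minimal sketch, assuming the sougou table used later in this walkthrough already exists in the metastore:
scala> val sougou = hc.table("sougou")   // load the Hive table as a DataFrame
scala> sougou.printSchema()              // column names and types from the Hive metastore
scala> sougou.show(5)                    // first 5 rows in tabular form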
Execute SQL
Scala > hc.sql ("show tables") .foreach (println) [sougou,false] [T1 from sougou false] scala > hc.sql ("select count (*) foreach") .foreach (println) 16-03-14 23:15:58 INFO parse.ParseDriver: Parsing command: select count (*) from sougou16/03/14 23:16:00 INFO parse.ParseDriver: Parse Completed16/03/14 23:16:01 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps16/03/14 23:16:02 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 474.9 KB, free 474.9 KB) 16-03-14 23:16:02 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 41.6 KB, free 516.4 KB) 16-03-14 23:16:02 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.199.100 estimated size 41635 (size: 41.6 KB Free: 517.4 MB) 16-03-14 23:16:02 INFO spark.SparkContext: Created broadcast 0 from collect at: 3016-03-14 23:16:03 INFO mapred.FileInputFormat: Total input paths to process: 116-03-14 23:16:03 INFO spark.SparkContext: Starting job: collect at: 3016-03-14 23:16:03 INFO scheduler.DAGScheduler: Registering RDD 5 (collect at: 30) 16-03-14 23:16:03 INFO scheduler.DAGScheduler: Got job 0 (collect at: 30) with 1 output partitions16/03/14 23:16:03 INFO scheduler.DAGScheduler: Final stage: ResultStage 1 (collect at: 30) 16-03-14 23:16:03 INFO scheduler.DAGScheduler: Parents of final stage: List (ShuffleMapStage 0) 16-03-14 23:16:04 INFO scheduler.DAGScheduler: Missing parents: List (ShuffleMapStage 0) 16-03-14 23:16:04 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD [5] at collect at: 30) Which has no missing parents16/03/14 23:16:04 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 13.8 KB, free 530.2 KB) 16-03-14 23:16:04 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 6.9 KB, free 537.1 KB) 16-03-14 23:16:04 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.199.100 free 41635 (size: 6.9 KB (free: 517.4 MB) 16-03-14 23:16:04 INFO spark.SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:100616/03/14 23:16:04 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD [5] at collect at: 30) 16-03-14 23:16:04 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 2 tasks16/03/14 23:16:04 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0 (TID 0, spark-worker2, partition 0 not designed local (2152 bytes) 23:16:04 INFO scheduler.TaskSetManager on 16-03-14 (TID 1, spark-worker1, partition 1 in stage 2152 bytes) 16-03-14 23:16:05 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on spark-worker2:55899 (size: 6.9 KB, free: 146.2 MB) 23:16:05 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on spark-worker1:38231 (size: 146.2 KB) Free: 146.2 MB) 23:16:09 on 16-03-14 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on spark-worker1:38231 (size: 41.6 KB, free: 146.2 MB) 16-03-14 23:16:10 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on spark-worker2:55899 (size: 41.6 KB) Free: 146.2 MB) 23:16:16 on 16-03-14 INFO scheduler.TaskSetManager: Finished task 1.0 in stage (TID 1) in 12015 ms on spark-worker1 (1) 16-03-14 23:16:16 INFO scheduler.DAGScheduler: ShuffleMapStage 0 (collect at: 30) finished in 12.351 s16 * / 14 23:16:16 INFO scheduler.DAGScheduler: looking for newly runnable stages16/03/14 23:16:16 INFO scheduler.DAGScheduler: running: 
Set () 16-03-14 23:16:16 INFO scheduler.DAGScheduler: waiting: Set (ResultStage 1) 16-03-14 23:16:16 INFO scheduler.DAGScheduler: failed: Set () 16-03-14 23:16:16 INFO scheduler.DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD [8] at collect at: 30) Which has no missing parents16/03/14 23:16:16 INFO scheduler.TaskSchedulerImpl: Removed TaskSet, whose tasks have all completed, from pool 16-03-14 23:16:16 INFO storage.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 12.9 KB, free 550.1 KB) 16-03-14 23:16:16 INFO storage.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 550.1 KB Free 556.5 KB) 23:16:16 on 16-03-14 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.199.100 INFO storage.BlockManagerInfo 41635 (size: 6.4 KB) (free: 517.4 MB) 16-03-14 23:16:16 INFO spark.SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:100616/03/14 23:16:16 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (MapPartitionsRDD [8] at collect at: 30) 16-03-14 23:16:16 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 1 tasks16/03/14 23:16:16 INFO scheduler.TaskSetManager: Starting task 517.4 in stage 23:16:16 (TID 2, spark-worker1, partition 0 focus local Bytes) 23:16:16 on 16-03-14 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on spark-worker1:38231 (size: 1999 KB) Free: 146.1 MB) 16-03-14 23:16:17 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to spark-worker1:4356816/03/14 23:16:17 INFO spark.MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 158bytes16/03/14 23:16:18 INFO scheduler.DAGScheduler: ResultStage 1 (collect at: 30) finished in 1.288 s16 Legend 03pm 14 23:16:18 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 1.288 (TID 2) in 1279 ms On spark-worker1 (1 move 1) 23:16:18 on 16-03-14 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 1.0 Whose tasks have all completed, from pool 16-03-14 23:16:18 INFO scheduler.DAGScheduler: Job 0 finished: collect at: 30, took 14.285673 s [1000000]
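The row count can also be brought back into a variable instead of only being printed, which makes it easier to reuse; a minimal sketch, again assuming the sougou table (the variable names are illustrative):
scala> val total = hc.table("sougou").count()                                         // Long: number of rows
scala> val totalSql = hc.sql("select count(*) from sougou").collect()(0).getLong(0)   // same value via HiveQL
scala> println(total)                                                                 // 1000000 in this run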
Compared with Hive, this query already runs faster, and the advantage grows for more complex statements.
Scala > hc.sql ("select word,count (*) cnt from sougou group by word order by cnt desc limit 5") .foreach (println). 23:19:16 on 16-03-14 INFO scheduler.DAGScheduler: ResultStage 3 (collect at: 30) finished in 11.900 s16 INFO scheduler.DAGScheduler 14 23:19:16 INFO scheduler.DAGScheduler: Job 1 finished: collect at: 30 Took 17.925094 s16 TID 03 INFO scheduler.TaskSetManager 14 23:19:16 INFO scheduler.TaskSetManager: Finished task 195.0 in stage 3.0 in 696 ms on spark-worker2 16-03-14 23:19:16 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool [Baidu, 7564] [baidu,3652] [body Art, 2786] [father of Yan Ning, head of Guantao County, 2388] [4399 Mini Game, 2119]
Previously, it took nearly 110s to run with Hive, but only 17s with Spark SQL.
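To make that kind of comparison repeatable from within spark-shell, the elapsed time can be measured around the collect call; a minimal sketch (the timing code is illustrative and was not part of the original run):
scala> val t0 = System.nanoTime
scala> val top5 = hc.sql("select word,count(*) cnt from sougou group by word order by cnt desc limit 5").collect()
scala> println(f"query took ${(System.nanoTime - t0) / 1e9}%.1f s")   // wall-clock seconds, including job scheduling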