

Code Sample Analysis of Spark SQL


This article walks through a Spark SQL code sample in detail. The example is easy to follow, and readers interested in Spark SQL can work through it step by step; I hope it proves helpful.

Referring to the Spark SQL example on the official website, I wrote the following script:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD

case class UserLog(userid: String, time1: String, platform: String, ip: String, openplatform: String, appid: String)

// Create an RDD of UserLog objects and register it as a table.
val user = sc.textFile("/user/hive/warehouse/api_db_user_log/dt=20150517/*")
  .map(_.split("\\^"))
  .map(u => UserLog(u(0), u(1), u(2), u(3), u(4), u(5)))
user.registerTempTable("user_log")

// SQL statements can be run by using the sql methods provided by sqlContext.
val allusers = sqlContext.sql("SELECT * FROM user_log")

// The results of SQL queries are SchemaRDDs and support all the normal RDD operations.
// The columns of a row in the result can be accessed by ordinal.
allusers.map(t => "UserId: " + t(0)).collect().foreach(println)

Execution failed with the following error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 50.0 failed 1 times, most recent failure: Lost task 1.0 in stage 50.0 (TID 73, localhost): java.lang.ArrayIndexOutOfBoundsException: 5
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:30)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:30)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1319)
    at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:910)
    at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:910)
    at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1319)
    at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1319)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:56)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

As the log shows, an array index is out of bounds.

Checking the field counts with the following command:

sc.textFile("/user/hive/warehouse/api_db_user_log/dt=20150517/*").map(_.split("\\^")).foreach(x => println(x.size))

reveals that at least one record splits into only 5 fields instead of the expected 6:

6
6
6
6
6
6
6
6
15/05/21 20:47:37 INFO Executor: Finished task 0.0 in stage 2.0 (TID 4). 1774 bytes result sent to driver
6
6
6
6
6
6
5
6
15/05/21 20:47:37 INFO Executor: Finished task 1.0 in stage 2.0 (TID 5). 1774 bytes result sent to driver
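A slightly more targeted check prints only the offending records together with their field counts. This is just a sketch, assuming the same input path and delimiter as above:

// Sketch: list records whose split does not yield the expected 6 fields.
sc.textFile("/user/hive/warehouse/api_db_user_log/dt=20150517/*")
  .map(line => (line, line.split("\\^").length))
  .filter { case (_, n) => n != 6 }
  .collect()
  .foreach { case (line, n) => println(s"$n fields: $line") }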

The reason is that this record ends with a blank field: "44671799^20:56:05^2^117.93.193.238^0^". Scala's String.split (like Java's) discards trailing empty strings by default, so this line splits into only 5 elements, and accessing u(5) throws ArrayIndexOutOfBoundsException.
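As a quick illustration of that default behaviour, the following minimal sketch (plain Scala, no Spark needed) uses the record quoted above:

// The last field (appid) is blank.
val record = "44671799^20:56:05^2^117.93.193.238^0^"
record.split("\\^").length      // 5 -- trailing empty strings are dropped by default
record.split("\\^", -1).length  // 6 -- a negative limit keeps trailing empty fields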

The solution found online is to use the split(regex, limit) overload: a negative limit keeps trailing empty fields. Modified code:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD

case class UserLog(userid: String, time1: String, platform: String, ip: String, openplatform: String, appid: String)

// Create an RDD of UserLog objects and register it as a table.
// The limit of -1 keeps trailing empty fields, so every record yields 6 elements.
val user = sc.textFile("/user/hive/warehouse/api_db_user_log/dt=20150517/*")
  .map(_.split("\\^", -1))
  .map(u => UserLog(u(0), u(1), u(2), u(3), u(4), u(5)))
user.registerTempTable("user_log")

// SQL statements can be run by using the sql methods provided by sqlContext.
val allusers = sqlContext.sql("SELECT * FROM user_log")

// The results of SQL queries are SchemaRDDs and support all the normal RDD operations.
// The columns of a row in the result can be accessed by ordinal.
allusers.map(t => "UserId: " + t(0)).collect().foreach(println)
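To confirm that the fix actually picks up the problematic record instead of crashing the job, a query on the blank appid column can be run against the registered table. This is a sketch, assuming the user_log table registered above:

// Sketch: with split("\\^", -1) the blank trailing field is preserved as an
// empty string, so the record can now be queried.
val blankAppid = sqlContext.sql("SELECT userid, ip FROM user_log WHERE appid = ''")
blankAppid.collect().foreach(println)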

That concludes this code sample analysis of Spark SQL. I hope the content above has been helpful; if you would like to learn more, keep an eye on future updates. Thank you for reading!
