This article mainly introduces the ways in which Spark SQL can be used in Scala. Many people have questions about this in day-to-day work, so the editor has consulted various materials and sorted them into simple, easy-to-follow examples. I hope it helps you resolve your doubts about how Spark SQL is used in Scala. Now, please follow the editor and study!
The first way:

```scala
package hgs.spark.sql

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

// The first way to create a DataFrame: map each line to a case class and call toDF
object SqlTest1 {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("sqltest1").setMaster("local")
    val context = new SparkContext(conf)
    val sqlContext = new SQLContext(context)

    val rdd = context.textFile("d:\\person", 1)
    val rdd2 = rdd.map(x => {
      val t = x.split(" ")
      Person(t(0).toInt, t(1), t(2).toInt)
    })

    // Here you need to import the implicit conversions so that toDF is available
    import sqlContext.implicits._
    val persondf = rdd2.toDF()

    // registerTempTable is deprecated in Spark 2.x:
    // persondf.registerTempTable("person")
    // use createOrReplaceTempView instead
    persondf.createOrReplaceTempView("person")

    val result = sqlContext.sql("select * from person order by age desc")
    // print the result of the query
    result.show()
    // or save the result to a file
    result.write.json("d://personselect")

    context.stop()
  }
}

case class Person(id: Int, name: String, age: Int)
```

The second way: directly specify the schema.

```scala
package hgs.spark.sql

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// The second way to create a DataFrame: specify the schema directly with StructType
object SqlTest2 {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("sqltest2").setMaster("local")
    val context = new SparkContext(conf)
    val sqlContext = new SQLContext(context)

    val rdd = context.textFile("d:\\person", 1)

    // Create the schema, i.e. the mapping of column name, data type and nullability
    val personSchema = StructType(List(
      StructField("id", IntegerType, true),
      StructField("name", StringType, true),
      StructField("age", IntegerType, true)
    ))

    val rdd2 = rdd.map(x => {
      val t = x.split(" ")
      Row(t(0).toInt, t(1), t(2).toInt)
    })

    // Create the DataFrame from the RDD[Row] and the schema
    val persondf = sqlContext.createDataFrame(rdd2, personSchema)
    // Map the DataFrame to a temporary view
    persondf.createOrReplaceTempView("person")
    // Query the data and display it
    sqlContext.sql("select * from person order by age desc").show()

    context.stop()
    /* Queried data (for the sample file shown later):
       +---+----+---+
       | id|name|age|
       +---+----+---+
       |  1| hgs| 26|
       |  3|  zz| 25|
       |  2|  wd| 24|
       |  4|  cm| 24|
       +---+----+---+
    */
  }
}
```

Some notes:

checkpoint: persists an intermediate RDD to HDFS (or another file system), so that if the RDD is lost it can be recovered from HDFS instead of being recomputed from scratch, which is cheaper. Set the directory with sc.setCheckpointDir("hdfs dir or other fs dir"). It is recommended to checkpoint after caching the RDD: this saves one recomputation, because the RDD is then checkpointed directly from memory, but the RDDs it previously depended on are also discarded. (A short sketch follows these notes.)

Job execution: RDD objects build the DAG ---> DAGScheduler (splits the DAG into stages and submits each stage) ---> TaskScheduler (submits the tasks to the executors) ---> Worker (executes the tasks).

stage: stages are divided by dependency. Whenever a wide dependency (data exchanged between nodes) is encountered, a new stage is created. In a narrow dependency, a partition of the parent RDD is passed to only one partition of the child RDD; in a wide dependency, a partition of the parent RDD is passed to multiple child RDDs or multiple partitions.
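To make the checkpoint note concrete, here is a minimal sketch (not part of the original article) of the "cache, then checkpoint" pattern. The object name, checkpoint directory and input path are placeholders chosen for illustration.

```scala
package hgs.spark.sql

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: illustrates the "cache before checkpoint" advice above.
// The checkpoint directory and input path below are placeholders.
object CheckpointSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("checkpointsketch").setMaster("local")
    val sc = new SparkContext(conf)

    // Directory where checkpointed RDD data is persisted (HDFS or another fs)
    sc.setCheckpointDir("hdfs://bigdata00:9000/checkpoint")

    val rdd = sc.textFile("d:\\person", 1).map(_.split(" "))

    // Cache first, so the checkpoint job reads from memory
    // instead of recomputing the lineage from the text file.
    rdd.cache()
    rdd.checkpoint()

    // The first action triggers the computation and the checkpoint job;
    // afterwards the lineage before the checkpoint is truncated.
    println(rdd.count())

    sc.stop()
  }
}
```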
Spark SQL: deals with structured data. DataFrames: like an RDD, a DataFrame is a distributed data container. However, a DataFrame is more like a two-dimensional table in a traditional database: in addition to the data, it also records the structural information of the data, namely the schema. At the same time, like Hive, DataFrame supports nested data types (struct, array and map). From the perspective of API ease of use, the DataFrame API provides a set of high-level relational operations that are friendlier and have a lower barrier to entry than the functional RDD API. Similar to the DataFrames of R and Pandas, Spark DataFrame inherits the development experience of traditional single-machine data analysis.

To create a DataFrame, either map the data to a case class and call rdd.toDF, or define the schema with StructType. To query it with SQL, register the DataFrame as a table:
1. df.registerTempTable("test") (createOrReplaceTempView on newer versions), then sqlContext.sql("select * from test").show
2. Define the schema with StructType: StructType(List(...)) and pass it to createDataFrame.

Hive 3.0.0 and Spark: copy hive-site.xml, hdfs-site.xml and core-site.xml to Spark's conf folder, and place the MySQL driver under Spark's jars folder. Statements in Hive are fully applicable in spark-sql (a minimal Scala sketch of issuing the same query from a Hive-enabled session is given at the end of this article):

```sql
create table person(id int, name string, age int) row format delimited fields terminated by ' ';
load data inpath 'hdfs://bigdata00:9000/person' into table person;
select * from person;
```

The data are as follows:

1 hgs 26
2 wd 24
3 zz 25
4 cm 24

The spark-sql console prints a lot of INFO-level messages during interaction, which is very annoying. The fix, in the conf folder: mv log4j.properties.template log4j.properties, then change log4j.rootCategory=INFO, console to log4j.rootCategory=WARN, console.

This concludes the study of "what are the ways spark sql is used in scala". I hope it resolves your doubts; combining theory with practice is the best way to learn, so go and try it! If you want to continue learning more related knowledge, please keep following the website; the editor will continue to work hard to bring you more practical articles!
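Finally, as referenced in the Hive section above, here is a minimal sketch (not part of the original code) of running the same query from Scala instead of the spark-sql console. It assumes a Spark 2.x build with Hive support and the configuration files copied as described; the object name is illustrative, and SparkSession with enableHiveSupport() is used here as the Spark 2.x replacement for the older HiveContext.

```scala
package hgs.spark.sql

import org.apache.spark.sql.SparkSession

// Sketch only: assumes Spark 2.x with Hive support on the classpath and
// hive-site.xml / hdfs-site.xml / core-site.xml in Spark's conf folder,
// so the "person" table created in spark-sql is visible here.
object HiveSqlTest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hivesqltest")
      .master("local")
      .enableHiveSupport() // make the Hive metastore available to Spark SQL
      .getOrCreate()

    // The same statement that works in the spark-sql console works here
    spark.sql("select * from person order by age desc").show()

    spark.stop()
  }
}
```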