An overview of spark-sql and an introduction to the programming model

1. Overview of Spark SQL

(1) Introduction to Spark SQL:

Spark SQL is the Spark module for processing structured data (the structured data can come from external structured data sources or be obtained from an RDD). It provides a programming abstraction called DataFrame and acts as a distributed SQL query engine.

External structured data sources include JSON, Parquet (the default), RDBMS, Hive, and so on. Spark SQL currently uses the Catalyst optimizer to optimize SQL, producing a more efficient execution plan, and the results can be stored to an external system.
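To make this concrete, here is a minimal sketch of that workflow; the file paths and column names are assumptions for illustration, not from the original article. It loads a JSON source, queries it through the SQL engine, and stores the result externally as Parquet:

import org.apache.spark.sql.SparkSession

object QuickLook {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("QuickLook")
      .master("local[2]")
      .getOrCreate()

    // hypothetical JSON-lines input with "name" and "age" columns
    val df = spark.read.json("file:///tmp/people.json")

    // register the DataFrame as a temporary view and query it with SQL
    df.createOrReplaceTempView("people")
    val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")

    // store the result externally; Parquet is the default format
    adults.write.mode("overwrite").parquet("file:///tmp/adults")

    spark.stop()
  }
}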

(2) Characteristics of Spark SQL:

  - Easy to integrate

  - Uniform data access

  - Compatible with Hive

  - Standard data connections (see the short sketch after this list)
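Here is the short sketch referenced above, assuming a SparkSession named spark already exists; the MySQL URL, table, and credentials are placeholders. The point is that different sources go through the same read API:

// uniform data access: the same read API for different sources
val jsonDF = spark.read.format("json").load("file:///tmp/people.json")

// standard data connection: JDBC (requires the MySQL JDBC driver on the classpath)
val jdbcDF = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/testdb")
  .option("dbtable", "people")
  .option("user", "root")
  .option("password", "secret")
  .load()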

(3) Version history of Spark SQL:

  - The predecessor of Spark SQL is Shark, but Spark SQL abandoned the original Shark code. It absorbed some of Shark's advantages, such as in-memory columnar storage (In-Memory Columnar Storage) and Hive compatibility, and was redeveloped as Spark SQL.

  - Spark 1.1: Spark 1.1.0 was released on September 11, 2014. Spark SQL has been included since Spark 1.0 (Shark was no longer upgraded or maintained). The big changes in Spark 1.1.0 were Spark SQL and MLlib.

  - Spark 1.3: added the new DataFrame API.

  - Spark 1.4: added window (analytic) functions.

  - Spark 1.5: the Tungsten project. Hive has both UDF and UDAF; earlier versions of Spark supported only UDF.

  - Spark 1.6: SQL statements can include "--" comments (a new feature of Spark 1.5/1.6); the concept of Dataset was introduced.

  - Spark 2.x: Spark SQL + DataFrame + Dataset (official release), Structured Streaming (based on Dataset), and the introduction of SparkSession, which unifies the programming entry point for RDD, DataFrame, and Dataset.

2. The programming model of Spark SQL

(1) Introduction to SparkSession:

SparkSession is a new concept introduced in Spark 2.0. It provides a unified entry point through which users can access Spark's functionality.

As the Dataset and DataFrame APIs gradually became the standard APIs, SparkSession became the entry point for them. SparkSession encapsulates SparkConf, SparkContext, and SQLContext; SQLContext and HiveContext are also kept for backward compatibility.

Features:

  - Provides users with a unified entry point to Spark's features

  - Allows users to write programs with the DataFrame and Dataset APIs

  - Reduces the number of concepts users need to know, making it easy to interact with Spark

  - Removes the need to explicitly create SparkConf, SparkContext, and SQLContext when interacting with Spark; they are encapsulated inside SparkSession

  - SparkSession provides built-in support for Hive features: writing SQL statements in HiveQL, accessing Hive UDFs, and reading data from Hive tables (a builder sketch that enables this appears after the creation example below)

Creating a SparkSession:

In spark-shell, a SparkSession is automatically initialized as an object called spark. For backward compatibility, spark-shell also provides an initialized SparkContext object for users to work with:
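As a rough sketch of what that looks like in practice (output omitted), both objects are immediately usable in the shell:

scala> spark                        // the pre-created SparkSession
scala> sc                           // the pre-created SparkContext, kept for compatibility
scala> spark.range(5).count()       // use the session directly; returns 5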

Creating one in application code:

val conf = new SparkConf()
val spark: SparkSession = SparkSession.builder()
  .appName("_01spark_sql")
  .config(conf)
  .getOrCreate()
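For the Hive support mentioned in the feature list above, the builder can enable it explicitly. A minimal sketch, assuming a Hive installation is available; the warehouse directory and table name are hypothetical:

val hiveSpark: SparkSession = SparkSession.builder()
  .appName("_01spark_sql_hive")
  .config("spark.sql.warehouse.dir", "/user/hive/warehouse") // assumed warehouse location
  .enableHiveSupport() // enables HiveQL, Hive UDFs, and reading Hive tables
  .getOrCreate()

hiveSpark.sql("SELECT * FROM some_hive_table LIMIT 10").show() // hypothetical table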

(2) RDD:

The main point here is the limitations of RDD:

  - RDD does not support Spark SQL

  - An RDD only represents a dataset; it carries no metadata, i.e., no field-level semantic definitions (see the short sketch after this list)

  - RDD requires users to optimize their programs themselves, placing higher demands on programmers

  - Reading from different data sources is relatively difficult, and combining data from multiple sources through user-defined conversions is also hard
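Here is the short sketch referenced in the list, assuming a SparkContext named sc and made-up sample data. With a plain RDD of tuples there are no field names or schema, so everything is positional and no SQL can be run against it:

val rdd = sc.parallelize(Seq(("xx", 18), ("Wu xx", 20)))

// no field names, no schema: only positional access (_._1, _._2), and no SQL support
val adultNames = rdd.filter(_._2 >= 18).map(_._1)
adultNames.foreach(println)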

(3) DataFrame:

DataFrame was formerly called SchemaRDD. It is a distributed collection of data organized by rows, with columns given names; it provides abstractions for operators such as select, filter, aggregation, and sort. The Schema is the metadata, the semantic description of the data. A DataFrame is a distributed collection of Row objects.

DataFrame = RDD + Schema = SchemaRDD
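A small sketch of that equation, assuming a SparkSession named spark and a SparkContext named sc; the sample data is made up. Attaching column names to an RDD of tuples yields a DataFrame whose Schema is queryable metadata:

import spark.implicits._

val df = sc.parallelize(Seq(("xx", 18), ("Wu xx", 20))).toDF("name", "age")
df.printSchema()
// roughly prints:
// root
//  |-- name: string (nullable = true)
//  |-- age: integer (nullable = false)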

Advantages:

  - DataFrame is a special type of Dataset: Dataset[Row] = DataFrame

  - DataFrame has its own optimizer, Catalyst, which can automatically optimize programs (see the explain() sketch after this list)

  - DataFrame provides a complete Data Source API
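Here is the explain() sketch referenced in the list, assuming a DataFrame df with name and age columns (for example the one built above). Catalyst's work can be inspected directly:

val result = df.filter("age >= 18").select("name")
// prints the parsed, analyzed, and optimized logical plans plus the physical plan
result.explain(true)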

Features:

  - Supports data processing from KB scale on a single machine to PB scale on a cluster

  - Supports many data formats and storage systems

  - Achieves efficient code generation and optimization through the Spark SQL Catalyst optimizer

  - Integrates seamlessly with other big data processing tools

  - Provides APIs for Python, Java, Scala, and R

(4) DataSet:

Because the row type of a DataFrame is Row, DataFrame also has shortcomings: a Row is only type-checked at runtime. For example, if salary is actually a string column, the following statement is checked only when it runs: dataFrame.filter("salary > 1000").show()

Dataset extends the DataFrame API to provide compile-time type checking and an object-oriented style of API.

A Dataset can be converted to a DataFrame or an RDD. Since DataFrame = Dataset[Row], a DataFrame can be seen as a special Dataset.
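A brief sketch of both points, reusing a Person case class like the one in the full example further below (defined at top level) and assuming the usual spark.implicits._ import:

// case class Person(name: String, age: Int) is assumed to be defined at top level
import spark.implicits._

val ds: Dataset[Person] = Seq(Person("xx", 18), Person("Wu xx", 20)).toDS()

// compile-time check: ds.filter(_.salary > 1000) would not even compile,
// because Person has no salary field
val adults: Dataset[Person] = ds.filter(_.age >= 18)

// conversions between the abstractions
val asDF: DataFrame = ds.toDF()                  // Dataset[Person] -> DataFrame (= Dataset[Row])
val backToDS: Dataset[Person] = asDF.as[Person]  // DataFrame -> Dataset[Person]
val asRDD = ds.rdd                               // Dataset[Person] -> RDD[Person]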

(5) The difference between Dataset and DataFrame:

The author wants to emphasize the difference between the two here: when learning spark-sql I was not very clear about the relationship between them, and I was also asked this question in an interview, which was a painful lesson.

After reading several predecessors' summaries of the two, I roughly summarize the differences as follows:

  - DataFrame can be regarded as a special case of Dataset (Dataset[Row]); the main difference is that each record in a Dataset stores a strongly typed value rather than a Row.

  - Dataset checks types at compile time, while DataFrame checks them only at runtime.

  - The type of every row in a DataFrame is Row; without parsing it we cannot know which fields it has or what type each field is, and we can only read field contents through getAs[Type] or row(i). In a Dataset the row type is not fixed: after defining a custom case class, we can freely access the information in each row.

All right, enough rambling; let's go straight to the code:

import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}
import org.apache.spark.sql.types.{DataTypes, StructField, StructType}

object SparkSqlTest {
  def main(args: Array[String]): Unit = {
    // suppress redundant logs
    Logger.getLogger("org.apache.hadoop").setLevel(Level.WARN)
    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
    Logger.getLogger("org.project-spark").setLevel(Level.WARN)

    val conf: SparkConf = new SparkConf()
    conf.setMaster("local[2]")
      .setAppName("SparkSqlTest")
      // set spark's serializer
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // register the custom classes with the serializer
      .registerKryoClasses(Array(classOf[Person]))

    // build the SparkSession object
    val spark: SparkSession = SparkSession.builder().config(conf).getOrCreate()
    // get the SparkContext object
    val sc: SparkContext = spark.sparkContext

    val list = List(Person("xx", 18), Person("Wu xx", 20),
      Person("Qi xx", 30), Person("Wang xx", 40), Person("Xue xx", 18))

    // create a DataFrame
    // build the metadata (schema)
    val schema = StructType(List(
      StructField("name", DataTypes.StringType),
      StructField("age", DataTypes.IntegerType)
    ))
    // build the RDD
    val listRDD: RDD[Person] = sc.makeRDD(list)
    val rowRDD: RDD[Row] = listRDD.map(field => Row(field.name, field.age))
    val perDF: DataFrame = spark.createDataFrame(rowRDD, schema)

    // create a Dataset
    import spark.implicits._ // this import is required
    val perDS: Dataset[Person] = perDF.as[Person]

    /* here is the main difference between DataFrame and Dataset */
    perDF.foreach(field => {
      val name = field.get(0)       // get a value by its index
      val age = field.getInt(1)     // get a value by its index and type
      field.getAs[Int]("age")       // get a value by its type and name
      println(s"${age}, ${name}")
    })
    perDS.foreach(field => {
      // get values directly by the field names defined in the case class
      val age = field.age
      val name = field.name
      println(s"${age}, ${name}")
    })
  }
}

case class Person(name: String, age: Int)

Personally, although DataFrame is well integrated and has many advantages, extracting a specific field of an object from a DataFrame is tedious and the type is uncertain; using Dataset effectively avoids these problems.
