This article analyzes the main knowledge points of the Structured API in Spark SQL: creating DataFrames and Datasets, working with columns, and running basic queries with both the Structured API and Spark SQL.
1. Create DataFrame and Dataset

1.1 Create a DataFrame
The entry point for all Spark SQL functionality is SparkSession, which can be created with SparkSession.builder(). Once created, the application can create a DataFrame from an existing RDD, a Hive table, or a Spark data source. An example is shown below:
val spark = SparkSession.builder().appName("Spark-SQL").master("local[2]").getOrCreate()
val df = spark.read.json("/usr/file/json/emp.json")
df.show()

// It is recommended to import the following implicit conversions before Spark SQL programming,
// because many operations on DataFrames and Datasets rely on them
import spark.implicits._
You can test with spark-shell, but note that when spark-shell starts, it automatically creates a SparkSession called spark, which can be referenced directly on the command line.
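For example, a minimal spark-shell session might look like the following; this is only a sketch, assuming the same emp.json path used above:

$ spark-shell --master local[2]
scala> val df = spark.read.json("/usr/file/json/emp.json")  // 'spark' is created automatically by the shell
scala> df.show()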
1.2 Create a Dataset
Spark supports creating a Dataset from both external and internal data sources, as shown below:
// 1. Create from an external data source
// 1. Import implicit conversions
import spark.implicits._
// 2. Create a case class, equivalent to a Java Bean
case class Emp(ename: String, comm: Double, deptno: Long, empno: Long,
               hiredate: String, job: String, mgr: Long, sal: Double)
// 3. Create the Dataset from the external data source
val ds = spark.read.json("/usr/file/emp.json").as[Emp]
ds.show()

// 2. Create from an internal collection
// 1. Import implicit conversions
import spark.implicits._
// 2. Create a case class, equivalent to a Java Bean
case class Emp(ename: String, comm: Double, deptno: Long, empno: Long,
               hiredate: String, job: String, mgr: Long, sal: Double)
// 3. Create the Dataset from the internal collection
val caseClassDS = Seq(
  Emp("ALLEN", 300.0, 30, 7499, "1981-02-20 00:00:00", "SALESMAN", 7698, 1600.0),
  Emp("JONES", 300.0, 30, 7499, "1981-02-20 00:00:00", "SALESMAN", 7698, 1600.0)
).toDS()
caseClassDS.show()
1.3 Create a DataFrame from an RDD
Spark supports two ways to convert an RDD to a DataFrame: inferring the schema by reflection, and specifying the schema programmatically:
// 1. Infer the schema using reflection
// 1. Import implicit conversions
import spark.implicits._
// 2. Create the department case class
case class Dept(deptno: Long, dname: String, loc: String)
// 3. Create the RDD and convert it to a Dataset
val rddToDS = spark.sparkContext
  .textFile("/usr/file/dept.txt")
  .map(_.split("\t"))
  .map(line => Dept(line(0).trim.toLong, line(1), line(2)))
  .toDS()  // calling toDF() here instead would produce a DataFrame

// 2. Specify the schema programmatically
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
// 1. Define the type of each column
val fields = Array(
  StructField("deptno", LongType, nullable = true),
  StructField("dname", StringType, nullable = true),
  StructField("loc", StringType, nullable = true))
// 2. Create the schema
val schema = StructType(fields)
// 3. Create the RDD
val deptRDD = spark.sparkContext.textFile("/usr/file/dept.txt")
val rowRDD = deptRDD.map(_.split("\t")).map(line => Row(line(0).toLong, line(1), line(2)))
// 4. Convert the RDD to a DataFrame
val deptDF = spark.createDataFrame(rowRDD, schema)
deptDF.show()
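Whichever approach you use, you can quickly verify the result; a minimal check, reusing the rddToDS and deptDF values from above:

// Both approaches should report deptno, dname and loc with the types declared above
rddToDS.printSchema()
deptDF.printSchema()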
1.4 Convert between DataFrames and Datasets
Spark makes it very simple to convert between a DataFrame and a Dataset, as shown in the following example:
# DataFrame to Dataset
scala> df.as[Emp]
res1: org.apache.spark.sql.Dataset[Emp] = [COMM: double, DEPTNO: bigint ... 6 more fields]

# Dataset to DataFrame
scala> ds.toDF()
res2: org.apache.spark.sql.DataFrame = [COMM: double, DEPTNO: bigint ... 6 more fields]
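The same conversions work outside the shell; a minimal sketch in a compiled program, assuming the Emp case class and the spark.implicits._ import from section 1.2:

import org.apache.spark.sql.{DataFrame, Dataset}
// DataFrame -> strongly typed Dataset[Emp]
val empDS: Dataset[Emp] = df.as[Emp]
// Dataset[Emp] -> untyped DataFrame; column names come from the case class fields
val empDF: DataFrame = empDS.toDF()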
2. Column operations

2.1 Referencing columns
Spark supports several ways to construct and reference columns; the simplest is to use the col() or column() functions.
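These helpers live in org.apache.spark.sql.functions, so the snippets in this section assume an import along the following lines (lit, desc, and asc are used in later examples as well):

import org.apache.spark.sql.functions.{col, column, lit, desc, asc}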
Col ("colName") column ("colName") / / for the Scala language, you can also use $"myColumn" and 'myColumn syntax sugars for reference. Df.select ($"ename", $"job"). Show () df.select ('ename,' job) .show () 2.2 add columns / / add columns based on existing column values df.withColumn ("upSal", $"sal" + 1000) / / add columns based on fixed values df.withColumn ("intCol", lit (1000)) 2.3.Delete columns / / support deleting multiple columns df.drop ("comm") "job") .show () 2.4renamed df.withColumnRenamed ("comm", "common") .show ()
It is important to note that adding, deleting, and renaming columns will generate a new DataFrame, and the original DataFrame will not be changed.
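A small sketch illustrating this immutability, reusing df and lit() from above (the bonus column name is just an example):

val withBonus = df.withColumn("bonus", lit(500))
df.columns.contains("bonus")        // false: the original DataFrame is unchanged
withBonus.columns.contains("bonus") // true: only the new DataFrame carries the extra column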
3. Basic queries with the Structured API

// 1. select: query employee names and jobs
df.select($"ename", $"job").show()
// 2. filter: query employees whose salary is greater than 2000
df.filter($"sal" > 2000).show()
// 3. orderBy: sort by department number descending, then by salary ascending
df.orderBy(desc("deptno"), asc("sal")).show()
// 4. limit: query the three highest-paid employees
df.orderBy(desc("sal")).limit(3).show()
// 5. distinct: query all distinct department numbers
df.select("deptno").distinct().show()
// 6. groupBy: count employees by department number
df.groupBy("deptno").count().show()
4. Basic queries with Spark SQL

4.1 Basic usage of Spark SQL

// 1. First register the DataFrame as a temporary view
df.createOrReplaceTempView("emp")
// 2. Query employee names and jobs
spark.sql("SELECT ename,job FROM emp").show()
// 3. Query employees whose salary is greater than 2000
spark.sql("SELECT * FROM emp WHERE sal > 2000").show()
// 4. ORDER BY: sort by department number descending, then by salary ascending
spark.sql("SELECT * FROM emp ORDER BY deptno DESC,sal ASC").show()
// 5. LIMIT: query the three highest-paid employees
spark.sql("SELECT * FROM emp ORDER BY sal DESC LIMIT 3").show()
// 6. DISTINCT: query all distinct department numbers
spark.sql("SELECT DISTINCT(deptno) FROM emp").show()
// 7. GROUP BY: count employees by department number
spark.sql("SELECT deptno,count(ename) FROM emp GROUP BY deptno").show()
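Because spark.sql() itself returns a DataFrame, SQL and the Structured API can be mixed freely. A small sketch reusing the emp view registered above:

val highPaid = spark.sql("SELECT * FROM emp WHERE sal > 2000")
highPaid.groupBy("deptno").count().show()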
4.2 Global temporary view

The view created above with createOrReplaceTempView is session-scoped: its life cycle is limited to the session that created it, and it disappears when that session ends.
You can also use createGlobalTempView to create global temporary views, which can be shared among all sessions and will not disappear until the entire Spark application is terminated. Global temporary views are defined under the built-in global_temp database and need to be referenced with qualified names, such as SELECT * FROM global_temp.view1.
// Register as a global temporary view
df.createGlobalTempView("gemp")
// Reference it with a qualified name
spark.sql("SELECT ename,job FROM global_temp.gemp").show()
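To see the difference in scope, here is a sketch assuming the emp and gemp views registered above: a new session of the same application can still read the global view, but not the session-scoped one.

val newSession = spark.newSession()
// The global temporary view is visible from any session of this application
newSession.sql("SELECT ename,job FROM global_temp.gemp").show()
// The session-scoped view 'emp' is not registered in the new session,
// so the following line would fail with an AnalysisException:
// newSession.sql("SELECT * FROM emp").show()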