
How to create DataFrames in Spark SQL


This article explains how to create DataFrames in Spark SQL. The method introduced here is simple, fast, and practical. Let's learn how to create DataFrames in Spark SQL!

Introduction to Spark SQL

Spark SQL is the Spark module for processing structured data. It provides a programming abstraction called DataFrame and acts as a distributed SQL query engine.

Why learn Spark SQL? We have already learned Hive, which converts HiveQL into MapReduce jobs and submits them to the cluster for execution, greatly simplifying the work of writing MapReduce programs. However, MapReduce is a slow computing model, so Spark SQL came into being. It converts SQL into RDD operations and submits them to the cluster for execution, which is very fast. Spark SQL also supports reading data directly from Hive.

Features of Spark SQL

Seamless integration: mix SQL queries with Spark programs. Spark SQL lets you query structured data inside Spark programs, using either SQL or the familiar DataFrame API, and is available in Java, Scala, Python, and R.
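As a minimal sketch of this mixing (assuming a spark-shell session, where spark and the implicits are predefined, and a hypothetical people.json file with name and age fields):

// Assumes spark-shell; people.json is a hypothetical sample file.
val people = spark.read.json("/path/to/people.json")
// DataFrame API style:
people.filter($"age" > 21).select("name").show()
// The same query expressed in SQL:
people.createOrReplaceTempView("people")
spark.sql("select name from people where age > 21").show()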

Uniform data access: connect to any data source in the same way. DataFrames and SQL provide a common way to access a variety of data sources, including Hive, Avro, Parquet, ORC, JSON, and JDBC. You can even join data across these sources.
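For example, the same read API covers files and databases alike (a sketch assuming spark-shell; the paths and connection settings below are illustrative):

// Illustrative paths and JDBC settings; the reader API is identical across sources.
val jsonDF = spark.read.json("/data/people.json")
val parquetDF = spark.read.parquet("/data/people.parquet")
val jdbcDF = spark.read.format("jdbc").
  option("url", "jdbc:mysql://dbhost:3306/test"). // hypothetical MySQL instance
  option("dbtable", "emp").
  option("user", "root").
  option("password", "123456").
  load()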

Hive integration: run SQL or HiveQL queries on existing warehouses. Spark SQL supports HiveQL syntax as well as Hive SerDes and UDFs, allowing you to access existing Hive warehouses.
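In a standalone application, Hive support is enabled when building the SparkSession (spark-shell already provides it when Spark is built with Hive support); the table name below is illustrative:

import org.apache.spark.sql.SparkSession

// Build a Hive-enabled session: gives access to the Hive metastore, HiveQL and Hive UDFs.
val spark = SparkSession.builder().
  appName("SparkSQLHive").
  enableHiveSupport().
  getOrCreate()

spark.sql("select * from default.emp").show() // illustrative Hive table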

Standard connectivity via JDBC or ODBC. Server mode provides industry-standard JDBC and ODBC connectivity for business intelligence tools.

Core concepts: DataFrames and Datasets

DataFrame

A DataFrame is a dataset organized into named columns. It is conceptually equivalent to a table in a relational database, but with richer optimizations under the hood. DataFrames can be constructed from a variety of sources, such as:

structured data files

tables in Hive

external databases or existing RDDs

The DataFrame API is available in Scala, Java, Python, and R.
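A sketch of the first two construction paths (assuming spark-shell, a Hive-enabled session, and illustrative paths and table names; the RDD route is demonstrated in detail below):

// 1. From a structured data file:
val fromFile = spark.read.json("/data/people.json")
// 2. From a Hive table:
val fromHive = spark.table("default.emp")
// 3. From an external database via JDBC (options as shown earlier), or from an existing RDD with toDF, shown in the walkthrough below.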

A DataFrame carries more structural information about the data than an RDD does, namely a schema. An RDD is a distributed collection of Java objects, while a DataFrame is a distributed collection of Row objects. A DataFrame not only provides richer operators than an RDD, but also improves execution efficiency, reduces data reads, and optimizes execution plans.
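You can inspect that schema directly (a sketch assuming spark-shell and a hypothetical JSON file with name and age fields):

val df = spark.read.json("/data/people.json")
df.printSchema()
// Expected output for this hypothetical file:
// root
//  |-- age: long (nullable = true)
//  |-- name: string (nullable = true)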

Datasets

Datasets are distributed collections of data. The Dataset is a new interface added in Spark 1.6 and a higher-level abstraction than DataFrame. It offers the advantages of RDDs (strong typing, the ability to use powerful lambda functions) together with the advantages of Spark SQL's optimized execution engine. A Dataset can be constructed from JVM objects and then manipulated with functional transformations (map, flatMap, filter, etc.). The Dataset API is available in Scala and Java; Python does not support it.
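A minimal Dataset sketch (assuming spark-shell, where the toDS implicits are already in scope; the sample records are illustrative):

case class Person(name: String, age: Int)

val ds = Seq(Person("Andy", 32), Person("Justin", 19)).toDS()
// Strongly typed: the lambdas below receive Person objects, not Rows.
ds.filter(p => p.age > 20).map(p => p.name).show()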

Creating DataFrames

The test data is an employee table (emp.csv, stored on HDFS).

Define a case class (equivalent to the table structure, i.e. the schema)

case class Emp(empno: Int, ename: String, job: String, mgr: Int, hiredate: String, sal: Int, comm: Int, deptno: Int)

Read the data from HDFS into an RDD and associate the RDD with the case class

val lines = sc.textFile("hdfs://bigdata111:9000/input/emp.csv").map(_.split(","))

Map each Array to an Emp object

val emp = lines.map(x => Emp(x(0).toInt,x(1),x(2),x(3).toInt,x(4),x(5).toInt,x(6).toInt,x(7).toInt))

Generate DataFrame

val allEmpDF = emp.toDF

Querying data through DataFrames

Register DataFrame as a table (view)

allEmpDF.createOrReplaceTempView("emp")

Execute SQL queries

spark.sql("select * from emp").show
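Any SQL over the registered view works the same way; for instance an aggregate (illustrative, assuming the emp data loaded above):

// Average salary per department, over the same temporary view:
spark.sql("select deptno, avg(sal) from emp group by deptno").show()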

At this point, you should have a deeper understanding of how to create DataFrames in Spark SQL, so go ahead and try it out yourself!
