How to use DataFrame


This article introduces how to use Spark SQL's DataFrame. The walkthrough is detailed and has reference value; interested readers are encouraged to follow along in a spark-shell.

I. Overview:

A DataFrame is a distributed dataset that can be understood as a table in a relational database: it is organized column-wise by field name, field type, and field value, and it is exposed in four language APIs (Scala, Java, Python, R). In the Scala API it can be understood as: DataFrame = Dataset[Row].

Note: DataFrame appeared in Spark 1.3 (before 1.3 it was called SchemaRDD), and Dataset was added in 1.6.
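Since this equivalence trips people up, here is a minimal spark-shell sketch (the Person class and its values are illustrative, not from the article) showing that a DataFrame really is a Dataset[Row]:

scala> case class Person(name: String, age: Long)
scala> import spark.implicits._
scala> val ds = Seq(Person("Andy", 30), Person("Justin", 19)).toDS() // typed Dataset[Person]
scala> val df = ds.toDF()      // DataFrame = Dataset[Row]: the static types collapse to a schema
scala> val ds2 = df.as[Person] // back to a typed Dataset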

II. DataFrame vs RDD differences:

- Concept: both are distributed containers. A DF is like a table: besides the data an RDD holds, it also carries a Schema, and it supports complex column types (map, array, ...).
- API: DataFrame provides a richer API than RDD; both support map, filter, flatMap, and so on.
- Data structure: an RDD knows its element type but has no column structure, while a DF carries Schema information that can be used for optimization, so performance is better.
- Under the hood: RDD code written against the Java/Scala API runs directly on the JVM, while DF queries are compiled by Spark SQL into a logical plan (Logical Plan) and then a physical plan (Physical Plan), so the performance gap can be large. A concrete comparison follows below.
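To make the API and optimization difference concrete, here is a hedged sketch (peopleRDD, peopleDF, and the age field are illustrative, not defined in the article) of the same filter written both ways; only the DataFrame version exposes a column the Catalyst optimizer can reason about:

// RDD version: the lambda is opaque bytecode to Spark, so no plan-level optimization
val adultsRDD = peopleRDD.filter(p => p.age > 21)

// DataFrame version: "age" is a schema column, so the filter becomes part of the
// logical plan and can be pushed down or reordered by the optimizer
// ($"age" needs import spark.implicits._)
val adultsDF = peopleDF.filter($"age" > 21)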

III. JSON file operations:

[hadoop@hadoop001 bin]$ ./spark-shell --master local[2] --jars ~/software/mysql-connector-java-5.1.34-bin.jar

-- read a JSON file

scala> val df = spark.read.json("file:///home/hadoop/data/people.json")

18/09/02 11:47:20 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException

df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
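For reference, people.json is the small sample file that ships with Spark (under examples/src/main/resources); its contents, one JSON object per line, match the schema and rows printed below:

{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}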

-- print schema information

scala> df.printSchema

root
 |-- age: long (nullable = true)     -- nullable = true: the field is allowed to be empty
 |-- name: string (nullable = true)

-- print the contents

scala> df.show

+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

-- query a specific field

scala> df.select("name").show

+-------+
|   name|
+-------+
|Michael|
|   Andy|
| Justin|
+-------+

-- a single-quoted Symbol works via implicit conversion

scala> df.select('name).show

+-------+
|   name|
+-------+
|Michael|
|   Andy|
| Justin|
+-------+
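For completeness, these column references are interchangeable (a small sketch; the last form needs one extra import):

scala> df.select('name).show        // Symbol syntax, via spark.implicits._
scala> df.select($"name").show      // $ string interpolator, via spark.implicits._
scala> df.select(df("name")).show   // apply on the DataFrame itself
scala> import org.apache.spark.sql.functions.col
scala> df.select(col("name")).show  // the col function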

-- an unbalanced double quote is not converted; it is just an unclosed string literal

scala> df.select("name).show

<console>:1: error: unclosed string literal
df.select("name).show
          ^

-- arithmetic involving a NULL age yields NULL

scala> df.select($"name", $"age" + 1).show

+-------+---------+
|   name|(age + 1)|
+-------+---------+
|Michael|     null|
|   Andy|       31|
| Justin|       20|
+-------+---------+
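If NULL propagation is not what you want, one option (a sketch, not from the original article) is to substitute a default value before the arithmetic:

scala> df.na.fill(0L, Seq("age")).select($"name", $"age" + 1).show
scala> // or, equivalently, keep the substitution inside the expression:
scala> import org.apache.spark.sql.functions.{coalesce, lit}
scala> df.select($"name", coalesce($"age", lit(0)) + 1).show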

-- age filtering

scala> df.filter($"age" > 21).show

+---+----+
|age|name|
+---+----+
| 30|Andy|
+---+----+

-- Age grouping summary

scala> df.groupBy("age").count.show

+----+-----+
| age|count|
+----+-----+
|  19|    1|
|null|    1|
|  30|    1|
+----+-----+

-- create a temporary view

scala> df.createOrReplaceTempView("people")

scala> spark.sql("select * from people").show

+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+
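Once the view exists, any SQL that Spark supports can run against it; a small additional sketch (not from the article):

scala> spark.sql("select name, age from people where age >= 19 order by age").show

+------+---+
|  name|age|
+------+---+
|Justin| 19|
|  Andy| 30|
+------+---+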

IV. Operating on DataFrame objects:

-- define a case class to serve as the Schema

case class Student(id: String, name: String, phone: String, Email: String)

-- RDD to DF via reflection

val students = sc.textFile("file:///home/hadoop/data/student.data").map(_.split("\\|")).map(x => Student(x(0), x(1), x(2), x(3))).toDF()
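Two notes on this line: toDF comes from import spark.implicits._ (pre-imported in spark-shell), and student.data is pipe-delimited, hence the escaped split("\\|"). A line of the file presumably looks like this (values taken from the show output further below, with the emails shown truncated):

1|Burke|1-300-746-8446|ullamcorper.velit...
2|Kamal|1-668-571-5046|pede.Suspendisse@...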

-- print DF information

scala> students.printSchema

-- show(numRows: Int, truncate: Boolean)

-- numRows: number of rows to print (20 by default); truncate: cut each string to 20 characters (true by default)

-- students.show(5, false) prints the first five rows with full-length strings

scala> students.show

+---+--------+--------------+--------------------+
| id|    name|         phone|               Email|
+---+--------+--------------+--------------------+
|  1|   Burke|1-300-746-8446|ullamcorper.velit...|
|  2|   Kamal|1-668-571-5046|pede.Suspendisse@...|
|  3|    Olga|1-956-311-1686|Aenean.eget.metus...|
|  4|   Belle|1-246-894-6340|vitae.aliquet.nec...|
|  5|  Trevor|1-300-527-4967|dapibus.id@acturp...|
|  6|  Laurel|1-691-379-9921|adipiscing@consec...|
|  7|    Sara|1-608-586-1995|Donec.nibh@enimEt...|
|  8|  Kaseem|1-881-586-2689|cursus.et.magna@e...|
|  9|     Lev|1-916-367-5608|Vivamus.nisi@ipsu...|
| 10|    Maya|1-271-683-2698|accumsan.convalli...|
| 11|     Emi|    1-467-1337|        est@nunc.com|
| 12|   Caleb|1-683-212-0896|Suspendisse@Quisq...|
| 13|Florence|1-603-575-2444|sit.amet.dapibus@...|
| 14|   Anika|1-856-828-7883|euismod@ligulaeli...|
| 15|   Tarik|    1-398-2268|turpis@felisorci.com|
| 16|   Amena|1-878-250-3129|lorem.luctus.ut@s...|
| 17| Blossom|1-154-406-9596|Nunc.commodo.auct...|
| 18|     Guy|1-869-521-3230|senectus.et.netus...|
| 19| Malachi|1-608-637-2772|Proin.mi.Aliquam@...|
| 20|  Edward|1-711-710-6552|lectus@aliquetlib...|
+---+--------+--------------+--------------------+
only showing top 20 rows

-- students.head(5) returns the first 5 rows as an Array[Row]

scala> students.head(5).foreach(println)
[1,Burke,1-300-746-8446,ullamcorper.velit...]
[2,Kamal,1-668-571-5046,pede.Suspendisse@...]
[3,Olga,1-956-311-1686,Aenean.eget.metus...]
[4,Belle,1-246-894-6340,vitae.aliquet.nec...]
[5,Trevor,1-300-527-4967,dapibus.id@acturp...]

-- query specific fields

scala> students.select("id", "name").show(5)

+---+------+
| id|  name|
+---+------+
|  1| Burke|
|  2| Kamal|
|  3|  Olga|
|  4| Belle|
|  5|Trevor|
+---+------+
only showing top 5 rows

-- rename a field with an alias

scala> students.select($"name".as("new_name")).show(5)

+--------+
|new_name|
+--------+
|   Burke|
|   Kamal|
|    Olga|
|   Belle|
|  Trevor|
+--------+
only showing top 5 rows

-- query id greater than five

scala> students.filter("id > 5").show(5)

+---+------+--------------+--------------------+
| id|  name|         phone|               Email|
+---+------+--------------+--------------------+
|  6|Laurel|1-691-379-9921|adipiscing@consec...|
|  7|  Sara|1-608-586-1995|Donec.nibh@enimEt...|
|  8|Kaseem|1-881-586-2689|cursus.et.magna@e...|
|  9|   Lev|1-916-367-5608|Vivamus.nisi@ipsu...|
| 10|  Maya|1-271-683-2698|accumsan.convalli...|
+---+------+--------------+--------------------+
only showing top 5 rows

-- query rows whose name is empty or is the string 'NULL' (filter = where)

scala> students.filter("name='' or name='NULL'").show(false)

+---+----+--------------+--------------------------+
|id |name|phone         |Email                     |
+---+----+--------------+--------------------------+
|21 |    |1-711-710-6552|lectus@aliquetlibero.co.uk|
|22 |    |1-711-710-6552|lectus@aliquetlibero.co.uk|
|23 |NULL|1-711-710-6552|lectus@aliquetlibero.co.uk|
+---+----+--------------+--------------------------+
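Note that this file encodes missing names as an empty string or the literal text 'NULL'; a true SQL NULL would need isNull instead. A small sketch of both (not from the article):

scala> students.filter($"name" === "" || $"name" === "NULL").show(false) // matches this file's encoding
scala> students.filter($"name".isNull).show(false)                       // matches real SQL NULLs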

-- query id greater than 5 with a fuzzy match on name

scala> students.filter("id > 5 and name like 'M%'").show(5)

+---+-------+--------------+--------------------+
| id|   name|         phone|               Email|
+---+-------+--------------+--------------------+
| 10|   Maya|1-271-683-2698|accumsan.convalli...|
| 19|Malachi|1-608-637-2772|Proin.mi.Aliquam@...|
+---+-------+--------------+--------------------+

-- sort by name in ascending order, excluding empty names

scala> students.sort($"name").select("id", "name").filter("name != ''").show(3)

+---+-----+
| id| name|
+---+-----+
| 16|Amena|
| 14|Anika|
|  4|Belle|
+---+-----+
only showing top 3 rows

-- sort by name in descending order (sort = orderBy)

scala> students.sort($"name".desc).select("name").show(5)

+------+
|  name|
+------+
|Trevor|
| Tarik|
|  Sara|
|  Olga|
|  NULL|
+------+
only showing top 5 rows


-- using aggregate functions; note that id is a string column, so max is lexicographic ("9") and sum is computed after a cast to double

scala> students.agg("id" -> "max", "id" -> "sum").show(false)

+-------+-------+
|max(id)|sum(id)|
+-------+-------+
|9      |276.0  |
+-------+-------+
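The same aggregation can be written with the typed column functions, which also lets you alias the results (a sketch, not from the article):

scala> import org.apache.spark.sql.functions.{max, sum}
scala> students.agg(max("id").as("max_id"), sum("id").as("sum_id")).show(false)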

-- join operation; passing a Seq of column names joins on multiple shared fields

students.join(students2, Seq("id", "name"), "inner")

-- DataFrame joins support the inner, outer, left_outer, right_outer, and leftsemi types

-- a join condition and join type can also be given explicitly

students.join(students2, students("id") === students2("t1_id"), "inner")
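Assuming a second DataFrame students2 built the same way (sharing id and name columns for the first form, or exposing a t1_id column for the second), both join styles look like this in full; this is a sketch, since students2 is never defined in the article:

// equi-join on columns present in both DataFrames; the join keys appear once in the result
val joined1 = students.join(students2, Seq("id", "name"), "inner")

// explicit join condition plus an explicit join type
val joined2 = students.join(students2, students("id") === students2("t1_id"), "inner")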

V. Implementing file operations with the DataFrame API:

1. Maven dependencies:

<properties>
    <spark.version>2.3.1</spark.version>
</properties>
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <version>${spark.version}</version>
    </dependency>
</dependencies>

2. Implementation in IDEA:

package com.zrc.ruozedata.sparkSQL

import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.sql.{Row, SparkSession}

object SparkSQL001 extends App {
  /*
   * RDD -> DataFrame via reflection (approach 1):
   * create an RDD, then use a case class as the Schema
   * to parse each line of the input text.
   */
  val spark = SparkSession.builder()
    .master("local[2]")
    .appName("SparkSQL001")
    .getOrCreate() // add enableHiveSupport() here to work against Hive

  val infos = spark.sparkContext.textFile("file:///F:/infos.txt")

  /*
  import spark.implicits._
  val infoDF = infos.map(_.split(","))   // delimiter garbled in the source; "," assumed
    .map(x => Info(x(0).toInt, x(1), x(2).toInt))
    .toDF()
  infoDF.show()
  */

  /*
   * RDD -> DataFrame via StructType (approach 2):
   * a StructType is built from StructFields, each taking (name, dataType, nullable);
   * the Schema and the RDD[Row] are then combined with createDataFrame.
   * Note: values obtained through Row must be converted to the declared types.
   */
  val infoRDD = infos.map(_.split(","))  // delimiter garbled in the source; "," assumed
    .map(x => Row(x(0).trim.toInt, x(1), x(2).trim.toInt))
  val fields = Array(
    StructField("id", IntegerType, true),
    StructField("name", StringType, true),
    StructField("age", IntegerType, true)
  )
  val schema = StructType(fields)
  val infoDF = spark.createDataFrame(infoRDD, schema)
  infoDF.show()

  spark.stop()
}

// case class Info(id: Int, name: String, age: Int)

That is all of the content of "How to use DataFrame". Thank you for reading! I hope the content shared helps you; for more related knowledge, welcome to follow the Internet Technology channel.


