This article introduces how to use Spark's DataFrame API. The walkthrough is detailed and has reference value; interested readers are encouraged to follow along.
I. Overview:
DataFrame is a distributed dataset that can be understood as a table in a relational database: the data is organized into named, typed columns. The API is available in four languages (Scala, Java, Python, and R). In the Scala API, a DataFrame is simply an alias: DataFrame = Dataset[Row].
Note: DataFrame was introduced in Spark 1.3 (before 1.3 it was called SchemaRDD), and Dataset was added in Spark 1.6.
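To make the DataFrame = Dataset[Row] relationship concrete, here is a minimal spark-shell sketch (the Person case class is an assumption for illustration; the file is the people.json used below):
scala> case class Person(name: String, age: Long)
scala> val ds = spark.read.json("file:///home/hadoop/data/people.json").as[Person]   // typed Dataset[Person]
scala> val df2 = ds.toDF()                                                           // back to untyped Dataset[Row], i.e. a DataFrame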
II. DataFrame vs RDD:
Concept: both are distributed data containers, but a DataFrame, like a database table, carries a Schema in addition to the data, and it also supports complex data types (map, array, ...).
API: DataFrame provides a richer API than RDD, while still supporting map, filter, flatMap, and so on.
Data structure: an RDD knows its element type but has no column structure; a DataFrame carries Schema information that SparkSQL uses for optimization, so performance is better.
Underlying execution: RDD code written against the Java/Scala API runs directly on the JVM, whereas a DataFrame query is translated by SparkSQL into a logical plan and then a physical plan, so the performance difference is large.
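The plan-based execution is easy to see in the shell: explain(true) prints the logical and physical plans SparkSQL builds for a DataFrame query (df here refers to the people DataFrame created in the next section):
scala> df.filter($"age" > 21).explain(true)   // prints the parsed, analyzed, and optimized logical plans, then the physical plan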
III. JSON file operations
[hadoop@hadoop001 bin]$ ./spark-shell --master local[2] --jars ~/software/mysql-connector-java-5.1.34-bin.jar
-- read a JSON file
scala> val df = spark.read.json("file:///home/hadoop/data/people.json")
18/09/02 11:47:20 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
-- print schema information (nullable = true means the field may be null)
scala> df.printSchema
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)
-- print the contents
scala> df.show
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+
-- select a column
scala> df.select("name").show
+-------+
|   name|
+-------+
|Michael|
|   Andy|
| Justin|
+-------+
-- a single quote also works, via implicit conversion
scala> df.select('name).show
+-------+
|   name|
+-------+
|Michael|
|   Andy|
| Justin|
+-------+
-- but an unclosed double quote is not recognized
scala> df.select("name).show
<console>:1: error: unclosed string literal
df.select("name).show
          ^
-- arithmetic on a null age yields null
scala> df.select($"name", $"age" + 1).show
+-------+---------+
|   name|(age + 1)|
+-------+---------+
|Michael|     null|
|   Andy|       31|
| Justin|       20|
+-------+---------+
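This follows SQL null semantics: any arithmetic involving null is null. If a default value is acceptable, one way around it is coalesce (a sketch; treating a missing age as 0 is an assumption):
scala> import org.apache.spark.sql.functions._
scala> df.select($"name", (coalesce($"age", lit(0)) + 1).as("age_plus_1")).show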
-- filter on age
scala> df.filter($"age" > 21).show
+---+----+
|age|name|
+---+----+
| 30|Andy|
+---+----+
-- group by age and count
scala> df.groupBy("age").count.show
+----+-----+
| age|count|
+----+-----+
|  19|    1|
|null|    1|
|  30|    1|
+----+-----+
-- create a temporary view
scala> df.createOrReplaceTempView("people")
scala> spark.sql("select * from people").show
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+
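Any SQL that works on a table works on the view; for example, the earlier groupBy can be expressed in SQL (a sketch using only the people view defined above):
scala> spark.sql("select age, count(*) as cnt from people group by age").show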
IV. Actions on a DataFrame object
-- define a case class used to create the Schema
case class Student(id: String, name: String, phone: String, Email: String)
-- RDD to DataFrame via reflection (spark-shell imports spark.implicits._ automatically, which toDF requires)
scala> val students = sc.textFile("file:///home/hadoop/data/student.data").map(_.split("\\|")).map(x => Student(x(0), x(1), x(2), x(3))).toDF()
-- print the DF schema
scala> students.printSchema
-- show(numRows: Int, truncate: Boolean)
-- numRows limits output to the first numRows rows (default 20); truncate cuts strings to the first 20 characters
-- students.show(5, false) prints the first five rows with full strings
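For reference, the common show variants (standard DataFrame API):
scala> students.show()          // default: 20 rows, strings truncated to 20 characters
scala> students.show(5)         // first 5 rows, truncated
scala> students.show(5, false)  // first 5 rows, full strings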
scala> students.show
+---+--------+--------------+--------------------+
| id|    name|         phone|               Email|
+---+--------+--------------+--------------------+
|  1|   Burke|1-300-746-8446|ullamcorper.velit...|
|  2|   Kamal|1-668-571-5046|pede.Suspendisse@...|
|  3|    Olga|1-956-311-1686|Aenean.eget.metus...|
|  4|   Belle|1-246-894-6340|vitae.aliquet.nec...|
|  5|  Trevor|1-300-527-4967|dapibus.id@acturp...|
|  6|  Laurel|1-691-379-9921|adipiscing@consec...|
|  7|    Sara|1-608-586-1995|Donec.nibh@enimEt...|
|  8|  Kaseem|1-881-586-2689|cursus.et.magna@e...|
|  9|     Lev|1-916-367-5608|Vivamus.nisi@ipsu...|
| 10|    Maya|1-271-683-2698|accumsan.convalli...|
| 11|     Emi|    1-467-1337|        est@nunc.com|
| 12|   Caleb|1-683-212-0896|Suspendisse@Quisq...|
| 13|Florence|1-603-575-2444|sit.amet.dapibus@...|
| 14|   Anika|1-856-828-7883|euismod@ligulaeli...|
| 15|   Tarik|    1-398-2268|turpis@felisorci.com|
| 16|   Amena|1-878-250-3129|lorem.luctus.ut@s...|
| 17| Blossom|1-154-406-9596|Nunc.commodo.auct...|
| 18|     Guy|1-869-521-3230|senectus.et.netus...|
| 19| Malachi|1-608-637-2772|Proin.mi.Aliquam@...|
| 20|  Edward|1-711-710-6552|lectus@aliquetlib...|
+---+--------+--------------+--------------------+
only showing top 20 rows
-- students.head(5) returns the first five rows as an array of Row
scala> students.head(5).foreach(println)
[1,Burke,1-300-746-8446,ullamcorper.velit...]
[2,Kamal,1-668-571-5046,pede.Suspendisse@...]
[3,Olga,1-956-311-1686,Aenean.eget.metus...]
[4,Belle,1-246-894-6340,vitae.aliquet.nec...]
[5,Trevor,1-300-527-4967,dapibus.id@acturp...]
-- select specific columns
scala> students.select("id", "name").show(5)
+---+------+
| id|  name|
+---+------+
|  1| Burke|
|  2| Kamal|
|  3|  Olga|
|  4| Belle|
|  5|Trevor|
+---+------+
-- rename a column with an alias
scala> students.select($"name".as("new_name")).show(5)
+--------+
|new_name|
+--------+
|   Burke|
|   Kamal|
|    Olga|
|   Belle|
|  Trevor|
+--------+
-- filter for id greater than 5
scala> students.filter("id > 5").show(5)
+---+------+--------------+--------------------+
| id|  name|         phone|               Email|
+---+------+--------------+--------------------+
|  6|Laurel|1-691-379-9921|adipiscing@consec...|
|  7|  Sara|1-608-140-1995|Donec.nibh@enimEt...|
|  8|Kaseem|1-881-586-2689|cursus.et.magna@e...|
|  9|   Lev|1-916-367-5608|Vivamus.nisi@ipsu...|
| 10|  Maya|1-271-683-2698|accumsan.convalli...|
+---+------+--------------+--------------------+
-- filter where name is the empty string or the literal 'NULL' (filter = where)
scala> students.filter("name='' or name='NULL'").show(false)
+---+----+--------------+--------------------------+
|id |name|phone         |Email                     |
+---+----+--------------+--------------------------+
|21 |    |1-711-710-6552|lectus@aliquetlibero.co.uk|
|22 |    |1-711-710-6552|lectus@aliquetlibero.co.uk|
|23 |NULL|1-711-710-6552|lectus@aliquetlibero.co.uk|
+---+----+--------------+--------------------------+
-- filter for id greater than 5 and name starting with 'M'
scala> students.filter("id > 5 and name like 'M%'").show(5)
+---+-------+--------------+--------------------+
| id|   name|         phone|               Email|
+---+-------+--------------+--------------------+
| 10|   Maya|1-271-683-2698|accumsan.convalli...|
| 19|Malachi|1-608-637-2772|Proin.mi.Aliquam@...|
+---+-------+--------------+--------------------+
-- sort by name in ascending order, excluding empty names
scala> students.sort($"name").select("id", "name").filter("name <> ''").show(3)
+---+-----+
| id| name|
+---+-----+
| 16|Amena|
| 14|Anika|
|  4|Belle|
+---+-----+
-- sort by name in descending order (sort = orderBy)
scala> students.sort($"name".desc).select("name").show(5)
+------+
|  name|
+------+
|Trevor|
| Tarik|
|  Sara|
|  Olga|
|  NULL|
+------+
-- group by age and count (using the people DataFrame df from section III; students has no age column)
scala> df.groupBy("age").count().show
+----+-----+
| age|count|
+----+-----+
|  19|    1|
|null|    1|
|  30|    1|
+----+-----+
-- aggregate functions (id is a string column, so max is lexicographic and sum casts to double)
scala> students.agg("id" -> "max", "id" -> "sum").show(false)
+-------+-------+
|max(id)|sum(id)|
+-------+-------+
|9      |276.0  |
+-------+-------+
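To aggregate id numerically instead, one option is to cast it first (a sketch; casting the string ids to int is an assumption about the data):
scala> import org.apache.spark.sql.functions._
scala> students.agg(max($"id".cast("int")), sum($"id".cast("int"))).show(false)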
-- join operation: passing a Seq of column names joins on multiple fields
-- (students2 stands for a second DataFrame with matching columns; it is not defined in this transcript)
scala> students.join(students2, Seq("id", "name"), "inner")
-- DataFrame joins support the inner, outer, left_outer, right_outer, and leftsemi types
-- the join condition and join type can also be specified explicitly:
scala> students.join(students2, students("id") === students2("t1_id"), "inner")
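As an illustration, keeping every row of students regardless of a match would use left_outer (a sketch under the same assumption about students2 and its t1_id column):
scala> students.join(students2, students("id") === students2("t1_id"), "left_outer").show(5)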
V. DataFrame API file operations
1. Maven dependencies
<properties>
  <spark.version>2.3.1</spark.version>
</properties>
<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>${spark.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>${spark.version}</version>
  </dependency>
</dependencies>
2. Implementation in IDEA:
package com.zrc.ruozedata.sparkSQL

import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.sql.{Row, SparkSession}

object SparkSQL001 extends App {
  /*
   * RDD to DataFrame via reflection (approach 1):
   * create an RDD --> DataFrame, using a case class as the Schema
   * to parse each line of the input text file.
   */
  val spark = SparkSession.builder()
    .master("local[2]")
    .appName("SparkSQL001")
    .getOrCreate()    // add enableHiveSupport() here to work with Hive

  // assuming comma-separated lines such as "1,zhangsan,20" (the delimiter is an assumption; adjust to the actual file)
  val infos = spark.sparkContext.textFile("file:///F:/infos.txt")

  /*
  import spark.implicits._
  val infoDF = infos.map(_.split(","))
    .map(x => Info(x(0).toInt, x(1), x(2).toInt)).toDF()
  infoDF.show()
  */

  /*
   * RDD to DataFrame via StructType (approach 2):
   * StructType is built from StructFields; each field gets one StructField with a name and a dataType,
   * and the Schema is combined with the RDD[Row] through createDataFrame.
   * Note: values read through Row must be converted to the matching types.
   */
  val infoss = infos.map(_.split(","))
    .map(x => Row(x(0).trim.toInt, x(1), x(2).trim.toInt))
  val fields = Array(
    StructField("id", IntegerType, true),
    StructField("name", StringType, true),
    StructField("age", IntegerType, true))
  val schema = StructType(fields)
  val infoDF = spark.createDataFrame(infoss, schema)
  infoDF.show()

  spark.stop()
}

// case class Info(id: Int, name: String, age: Int)

The above is the whole of "How to use DataFrame". Thank you for reading! I hope the content shared here is helpful to you.