Introduction and simple use of Parquet

==> What is Parquet

Parquet is a columnar storage file format.

=> Description on the official website:

Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language

=> Origin

Parquet was inspired by the Dremel paper published by Google in 2010, which introduced a storage format that supports nested structures and uses columnar storage to improve query performance. The Dremel paper also described how Google uses this storage format to run parallel queries. If you are interested, refer to the paper and to Apache Drill, an open-source implementation.

=> Features:

--> Data that does not meet the query conditions can be skipped, so only the required data is read, reducing the amount of IO.

--> Compression encoding can reduce disk storage space. Because all values in a column share the same data type, more efficient compression encodings (such as Run Length Encoding and Delta Encoding) can be used to further save storage space.

--> Only the required columns are read, and vectorized operations are supported, giving better scan performance.

--> Parquet is the default data source for Spark SQL; this can be configured through spark.sql.sources.default (see the sketch below).
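
A minimal sketch of changing that default, assuming a Spark 2.x shell where spark is the SparkSession and /test/people.json is the sample JSON file used further below:

// switch the default data source to JSON for this session
spark.conf.set("spark.sql.sources.default", "json")
// now load() reads JSON without an explicit .format("json")
val peopleDF = spark.read.load("/test/people.json")
// restore the Parquet default
spark.conf.set("spark.sql.sources.default", "parquet")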

==> Common Parquet operations

--> load and save functions

// read a Parquet file
val usersDF = spark.read.load("/test/users.parquet")

// inspect the Schema and the data with printSchema and show
usersDF.printSchema
usersDF.show

// query the users' names and favorite colors and save the result
usersDF.select($"name", $"favorite_color").write.save("/test/result/parquet")

// explicitly specify the file format: load JSON data
val usersDF = spark.read.format("json").load("/test/people.json")

// Save Modes
// SaveMode controls how existing data is handled when saving. Note that these
// save modes do not take any locks and are not atomic. In Overwrite mode the
// original data is deleted before the new data is written.
usersDF.select($"name").write.save("/test/parquet1")                    // fails if /test/parquet1 already exists
usersDF.select($"name").write.mode("overwrite").save("/test/parquet1")  // overwrites /test/parquet1

// the result can also be saved as a table; partitionBy and bucketBy provide
// partitioning and bucketing
usersDF.select($"name").write.saveAsTable("table1")
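
The last comment mentions partitionBy and bucketBy but the snippet does not show them. A minimal sketch, assuming the same usersDF with name and favorite_color columns (users_by_color is a hypothetical table name), could look like this:

// partitionBy writes one directory per favorite_color value; bucketBy hashes
// name into a fixed number of buckets and only works together with saveAsTable
usersDF.write
  .partitionBy("favorite_color")
  .bucketBy(4, "name")
  .sortBy("name")
  .saveAsTable("users_by_color")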

--> Parquet files

Parquet is a columnar format that is supported by many data processing systems.

Spark SQL supports both reading and writing Parquet files and automatically preserves the Schema of the original data. When writing Parquet files, all columns are automatically converted to nullable for compatibility reasons.
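
A minimal sketch of that nullable conversion, assuming a spark-shell session and a hypothetical path /test/nullable_demo:

// columns built from Scala Ints start out non-nullable...
val df = Seq((1, 2), (3, 4)).toDF("a", "b")
df.printSchema                                         // a: integer (nullable = false)
// ...but after a round trip through Parquet they come back as nullable
df.write.mode("overwrite").parquet("/test/nullable_demo")
spark.read.parquet("/test/nullable_demo").printSchema  // a: integer (nullable = true)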

- Read data in JSON format, convert it to Parquet format, create the corresponding table, and query it with SQL statements

// read the data from the JSON file
val empJson = spark.read.json("/test/emp.json")
// save the data as Parquet
empJson.write.mode("overwrite").parquet("/test/parquet")
// read the Parquet data back
val empParquet = spark.read.parquet("/test/parquet")
// create a temporary view emptable
empParquet.createOrReplaceTempView("emptable")
// run a query with an SQL statement
spark.sql("select * from emptable where deptno=10 and sal > 1500").show
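
Tying this back to the "skip data that does not meet the conditions" feature above: a minimal sketch, assuming the same empJson data and a hypothetical path /test/parquet_by_dept, writes the data partitioned by deptno so the deptno=10 predicate only touches one partition directory:

// write the employee data partitioned by deptno (hypothetical path)
empJson.write.mode("overwrite").partitionBy("deptno").parquet("/test/parquet_by_dept")
// the deptno = 10 filter is resolved against the partition directories,
// so only /test/parquet_by_dept/deptno=10 is read
spark.read.parquet("/test/parquet_by_dept")
  .where("deptno = 10 and sal > 1500")
  .show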

- Schema merging: first define a simple Schema, then gradually add more column descriptions. In this way a user can end up with multiple Parquet files whose Schemas are different but mutually compatible.

// create the first file
scala> val df1 = sc.makeRDD(1 to 5).map(x => (x, x * 2)).toDF("single", "double")
scala> df1.printSchema
root
 |-- single: integer (nullable = false)
 |-- double: integer (nullable = false)
scala> df1.write.parquet("/data/testtable/key=1")

// create the second file
scala> val df2 = sc.makeRDD(6 to 10).map(x => (x, x * 2)).toDF("single", "triple")
df2: org.apache.spark.sql.DataFrame = [single: int, triple: int]
scala> df2.printSchema
root
 |-- single: integer (nullable = false)
 |-- triple: integer (nullable = false)
scala> df2.write.parquet("/data/testtable/key=2")

// merge the two files
scala> val df3 = spark.read.option("mergeSchema", "true").parquet("/data/testtable")
df3: org.apache.spark.sql.DataFrame = [single: int, double: int ... 2 more fields]
scala> df3.printSchema
root
 |-- single: integer (nullable = true)
 |-- double: integer (nullable = true)
 |-- triple: integer (nullable = true)
 |-- key: integer (nullable = true)
scala> df3.show
+------+------+------+---+
|single|double|triple|key|
+------+------+------+---+
|     8|  null|    16|  2|
|     9|  null|    18|  2|
|    10|  null|    20|  2|
|     3|     6|  null|  1|
|     4|     8|  null|  1|
|     5|    10|  null|  1|
|     6|  null|    12|  2|
|     7|  null|    14|  2|
|     1|     2|  null|  1|
|     2|     4|  null|  1|
+------+------+------+---+
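
Schema merging is disabled by default because it is a relatively expensive operation. Besides the per-read option("mergeSchema", "true") used above, it can also be enabled globally; a minimal sketch, assuming the same /data/testtable directory:

// enable schema merging for all Parquet reads in this session
spark.conf.set("spark.sql.parquet.mergeSchema", "true")
// now the option is no longer needed on each read
val df3b = spark.read.parquet("/data/testtable")
df3b.printSchema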

--> JSON Datasets (two ways of loading)

// the first way
scala> val df4 = spark.read.json("/app/spark-2.2.1-bin-hadoop2.7/examples/src/main/resources/people.json")
df4: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
scala> df4.show
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

// the second way
scala> val df5 = spark.read.format("json").load("/app/spark-2.2.1-bin-hadoop2.7/examples/src/main/resources/people.json")
df5: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
scala> df5.show
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+
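
Besides reading JSON from files, a DataFrame can also be built from JSON strings held in a Dataset[String] (the spark.read.json(Dataset[String]) overload exists since Spark 2.2). A minimal sketch with a made-up record:

// build a JSON dataset from an in-memory string (made-up record)
import spark.implicits._
val jsonDS = Seq("""{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}""").toDS()
val df6 = spark.read.json(jsonDS)
df6.printSchema
df6.show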

--> Reading data from a relational database through JDBC (the JDBC driver needs to be added)

// add the JDBC driver when starting spark-shell
bin/spark-shell --master spark://bigdata11:7077 --jars /root/temp/ojdbc6.jar --driver-class-path /root/temp/ojdbc6.jar

// read from Oracle
val oracleEmp = spark.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:@192.168.10.100:1521/orcl.example.com")
  .option("dbtable", "scott.emp")
  .option("user", "scott")
  .option("password", "tiger")
  .load
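
The same read can also be written with the jdbc() shorthand and a java.util.Properties object, and a DataFrame can be written back over JDBC. A minimal sketch, assuming the same Oracle instance (scott.emp_copy is a hypothetical target table):

import java.util.Properties

val url = "jdbc:oracle:thin:@192.168.10.100:1521/orcl.example.com"
val props = new Properties()
props.put("user", "scott")
props.put("password", "tiger")

// read scott.emp with the jdbc() shorthand
val oracleEmp2 = spark.read.jdbc(url, "scott.emp", props)

// write a DataFrame back through JDBC (hypothetical table scott.emp_copy)
oracleEmp2.write.mode("append").jdbc(url, "scott.emp_copy", props)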

--> Operating on Hive tables

- Copy the Hive and Hadoop configuration files to Spark's conf directory: hive-site.xml, core-site.xml, hdfs-site.xml

- Specify the MySQL database driver when starting spark-shell

./bin/spark-shell --master spark://bigdata0:7077 --jars /data/tools/mysql-connector-java-5.1.43-bin.jar --driver-class-path /data/tools/mysql-connector-java-5.1.43-bin.jar

- Use spark-shell to operate on Hive

// create the table
spark.sql("create table ccc (key INT, value STRING) row format delimited fields terminated by ','")
// import data
spark.sql("load data local inpath '/test/data.txt' into table ccc")
// query the data
spark.sql("select * from ccc").show
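
The same Hive table can also be used through the DataFrame API instead of SQL strings; a minimal sketch, assuming the ccc table created above (ccc_backup is a hypothetical table name):

// load the Hive table as a DataFrame
val cccDF = spark.table("ccc")
cccDF.printSchema
cccDF.show
// save a copy as another Hive table (hypothetical name)
cccDF.write.mode("overwrite").saveAsTable("ccc_backup")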

- Use the Spark SQL command line (spark-sql) to operate on Hive

show tables;
select * from ccc;
