This article mainly shows how to create an external partition table with Spark SQL. The content is straightforward and clearly organized; I hope it resolves any doubts you may have as we walk through it step by step.
I. Creating an external partition table in spark-sql
1. Launch spark-sql:
spark-sql --queue spark --master yarn --deploy-mode client --num-executors 10 --executor-cores 2 --executor-memory 3G
2. Create a parquet partition table in spark-sql:
create external table pgls.convert_parq (
  bill_num string,
  logis_id string,
  store_id string,
  store_code string,
  creater_id string,
  order_status INT,
  pay_status INT,
  order_require_varieties INT,
  order_require_amount decimal(19,4),
  order_rec_amount decimal(19,4),
  order_rec_gpf decimal(19,4),
  deli_fee FLOAT,
  order_type INT,
  last_modify_time timestamp,
  order_submit_time timestamp
)
partitioned by (order_submit_date date)
row format serde 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
stored as parquetfile
location '/test/spark/convert/parquet/bill_parq/';
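The same DDL can also be issued from inside a Spark application instead of the spark-sql shell. Below is a minimal sketch, not part of the original article, assuming Spark 1.6 built with Hive support (a HiveContext) and Hive 0.13+; it uses the shorter "stored as parquet" form, which resolves to the same ParquetHiveSerDe used above.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object CreateBillParqTable {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CreateBillParqTable"))
    // HiveContext is required for Hive DDL such as CREATE EXTERNAL TABLE
    val hiveContext = new HiveContext(sc)
    hiveContext.sql(
      """create external table if not exists pgls.convert_parq (
        |  bill_num string, logis_id string, store_id string, store_code string,
        |  creater_id string, order_status int, pay_status int,
        |  order_require_varieties int, order_require_amount decimal(19,4),
        |  order_rec_amount decimal(19,4), order_rec_gpf decimal(19,4),
        |  deli_fee float, order_type int, last_modify_time timestamp,
        |  order_submit_time timestamp)
        |partitioned by (order_submit_date date)
        |stored as parquet
        |location '/test/spark/convert/parquet/bill_parq/'""".stripMargin)
    sc.stop()
  }
}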
II. CSV to Parquet
Code: org.apache.spark.ConvertToParquet.scala
package org.apache.spark

import com.ecfront.fs.operation.HDFSOperation
import org.apache.hadoop.conf.Configuration
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types._

/**
 * CSV to parquet
 * Parameters: input path, output path, number of partitions
 */
object ConvertToParquet {

  def main(args: Array[String]) {
    if (args.length != 3) {
      println("jar args: inputFiles outPath numpartitions")
      System.exit(0)
    }
    val inputPath = args(0)
    val outPath = args(1)
    val numPartitions = args(2).toInt
    println("=")
    println("=")
    println("= input: " + inputPath)
    println("= output: " + outPath)
    println("= numPartitions: " + numPartitions)
    println("=")

    // If the output directory already exists, delete it
    val fo = HDFSOperation(new Configuration())
    val existDir = fo.existDir(outPath)
    if (existDir) {
      println("HDFS exists outpath: " + outPath)
      println("start to delete ...")
      val isDelete = fo.deleteDir(outPath)
      if (isDelete) {
        println(outPath + " delete done.")
      }
    }

    val conf = new SparkConf()
    val sc = new SparkContext(conf)      // create SparkContext from SparkConf
    val sqlContext = new SQLContext(sc)  // create SQLContext from SparkContext
    val schema = StructType(Array(
      StructField("bill_num", DataTypes.StringType, false),
      StructField("logis_id", DataTypes.StringType, false),
      StructField("store_id", DataTypes.StringType, false),
      StructField("store_code", DataTypes.StringType, false),
      StructField("creater_id", DataTypes.StringType, false),
      StructField("order_status", DataTypes.IntegerType, false),
      StructField("pay_status", DataTypes.IntegerType, false),
      StructField("order_require_varieties", DataTypes.IntegerType, false),
      StructField("order_require_amount", DataTypes.createDecimalType(19, 4), false),
      StructField("order_rec_amount", DataTypes.createDecimalType(19, 4), false),
      StructField("order_rec_gpf", DataTypes.createDecimalType(19, 4), false),
      StructField("deli_fee", DataTypes.FloatType, false),
      StructField("order_type", DataTypes.IntegerType, false),
      StructField("last_modify_time", DataTypes.TimestampType, false),
      StructField("order_submit_time", DataTypes.TimestampType, false),
      StructField("order_submit_date", DataTypes.DateType, false)
    ))
    convert(sqlContext, inputPath, schema, outPath, numPartitions)
  }

  // CSV to parquet
  def convert(sqlContext: SQLContext, inputpath: String, schema: StructType, outpath: String, numPartitions: Int) {
    // load the delimited text into a DataFrame
    val df = sqlContext.read.format("com.databricks.spark.csv")
      .schema(schema)
      .option("delimiter", " ")
      .load(inputpath)
    // write as parquet, coalescing to the requested number of partitions
    // df.write.parquet(outpath)
    df.coalesce(numPartitions).write.parquet(outpath)
  }
}

After packaging, upload the jar to a local directory: /soft/sparkTest/convert/spark.mydemo-1.0-SNAPSHOT.jar. Generate the CSV files on HDFS beforehand, under the HDFS directory /test/spark/convert/data/order/2016-05-01/. Then execute:

spark-submit --queue spark --master yarn --num-executors 10 --executor-cores 2 --executor-memory 3G --class org.apache.spark.ConvertToParquet --name ConvertToParquet file:/soft/sparkTest/convert/spark.mydemo-1.0-SNAPSHOT.jar /test/spark/convert/data/order/2016-05-01/ /test/spark/convert/parquet/bill_parq/order_submit_date=2016-05-01
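After the job finishes, a quick sanity check is to read the generated parquet directory back and look at a few rows. The following is a minimal sketch, not part of the original article, assuming the same Spark 1.6 / SQLContext setup and the partition directory written above.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object VerifyParquet {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("VerifyParquet"))
    val sqlContext = new SQLContext(sc)
    // read back the partition directory produced by ConvertToParquet
    val df = sqlContext.read.parquet("/test/spark/convert/parquet/bill_parq/order_submit_date=2016-05-01")
    println("row count: " + df.count())
    df.show(5)  // print a small sample to confirm the schema and values look right
    sc.stop()
  }
}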
pom.xml related content:
1. Dependencies:
<dependencies>
  <dependency>
    <groupId>com.ecfront</groupId>
    <artifactId>ez-fs</artifactId>
    <version>0.9</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.6.1</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.10</artifactId>
    <version>1.6.1</version>
  </dependency>
  <dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-csv_2.11</artifactId>
    <version>1.4.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.6.0</version>
  </dependency>
</dependencies>
2. Plugins (including packaging the dependencies into the jar):
<build>
  <plugins>
    <plugin>
      <groupId>net.alchim31.maven</groupId>
      <artifactId>scala-maven-plugin</artifactId>
      <version>3.2.1</version>
      <executions>
        <execution>
          <id>scala-compile-first</id>
          <phase>process-resources</phase>
          <goals>
            <goal>add-source</goal>
            <goal>compile</goal>
          </goals>
        </execution>
        <execution>
          <id>scala-test-compile</id>
          <phase>process-test-resources</phase>
          <goals>
            <goal>testCompile</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-compiler-plugin</artifactId>
      <version>2.0.2</version>
      <executions>
        <execution>
          <phase>compile</phase>
          <goals>
            <goal>compile</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <version>1.4</version>
      <configuration>
        <filters>
          <filter>
            <artifact>*:*</artifact>
            <excludes>
              <exclude>META-INF/*.SF</exclude>
              <exclude>META-INF/*.DSA</exclude>
              <exclude>META-INF/*.RSA</exclude>
            </excludes>
          </filter>
        </filters>
      </configuration>
      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>shade</goal>
          </goals>
          <configuration>
            <transformers>
              <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                <mainClass>org.apache.spark.ConvertToParquet</mainClass>
              </transformer>
            </transformers>
          </configuration>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
III. Add partitions to the table
Execute under spark-sql:
alter table pgls.convert_parq add partition (order_submit_date='2016-05-01');
The corresponding data can then be queried with SQL:
select * from pgls.convert_parq where order_submit_date='2016-05-01' limit 5;
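The same two steps, registering the partition and checking the data, can also be scripted from a Spark application, which is convenient when new partition directories arrive every day. Below is a minimal sketch, an assumption rather than part of the original article, again requiring a HiveContext.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object RegisterPartition {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("RegisterPartition"))
    val hiveContext = new HiveContext(sc)
    // register the partition directory written by the conversion job
    hiveContext.sql("alter table pgls.convert_parq add if not exists partition (order_submit_date='2016-05-01')")
    // query a few rows to confirm the partition is visible
    hiveContext.sql("select * from pgls.convert_parq where order_submit_date='2016-05-01' limit 5").show()
    sc.stop()
  }
}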
That concludes "how Spark-sql creates an external partition table". Thank you for reading! I hope the content shared here has been helpful; if you would like to learn more, you are welcome to follow the industry information channel.