This article mainly shows how to create an external partition table with Spark SQL. The content is straightforward and clearly organized; I hope it resolves any doubts you may have as we walk through it step by step.
I. Creating an external partition table in spark-sql
1. Launch spark-sql:
spark-sql --queue spark --master yarn --deploy-mode client --num-executors 10 --executor-cores 2 --executor-memory 3G
2. Create a parquet partition table in spark-sql:
create external table pgls.convert_parq (
  bill_num string,
  logis_id string,
  store_id string,
  store_code string,
  creater_id string,
  order_status INT,
  pay_status INT,
  order_require_varieties INT,
  order_require_amount decimal(19,4),
  order_rec_amount decimal(19,4),
  order_rec_gpf decimal(19,4),
  deli_fee FLOAT,
  order_type INT,
  last_modify_time timestamp,
  order_submit_time timestamp
)
partitioned by (order_submit_date date)
row format serde 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
stored as parquetfile
location '/test/spark/convert/parquet/bill_parq/';
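The same DDL can also be issued from inside a Spark application instead of the spark-sql shell. Below is a minimal sketch, not part of the original article, assuming Spark 1.6 built with Hive support (a HiveContext) and Hive 0.13+; it uses the shorter "stored as parquet" form, which resolves to the same ParquetHiveSerDe used above.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object CreateBillParqTable {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CreateBillParqTable"))
    // HiveContext is required for Hive DDL such as CREATE EXTERNAL TABLE
    val hiveContext = new HiveContext(sc)
    hiveContext.sql(
      """create external table if not exists pgls.convert_parq (
        |  bill_num string, logis_id string, store_id string, store_code string,
        |  creater_id string, order_status int, pay_status int,
        |  order_require_varieties int, order_require_amount decimal(19,4),
        |  order_rec_amount decimal(19,4), order_rec_gpf decimal(19,4),
        |  deli_fee float, order_type int, last_modify_time timestamp,
        |  order_submit_time timestamp)
        |partitioned by (order_submit_date date)
        |stored as parquet
        |location '/test/spark/convert/parquet/bill_parq/'""".stripMargin)
    sc.stop()
  }
}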
II. CSV to Parquet
Code: org.apache.spark.ConvertToParquet.scala
package org.apache.spark

import com.ecfront.fs.operation.HDFSOperation
import org.apache.hadoop.conf.Configuration
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types._

/**
 * CSV to parquet
 * Parameters: input path, output path, number of partitions
 */
object ConvertToParquet {

  def main(args: Array[String]) {
    if (args.length != 3) {
      println("jar args: inputFiles outPath numpartitions")
      System.exit(0)
    }
    val inputPath = args(0)
    val outPath = args(1)
    val numPartitions = args(2).toInt
    println("=")
    println("=")
    println("= input: " + inputPath)
    println("= output: " + outPath)
    println("= numPartitions: " + numPartitions)
    println("=")

    // If the output directory already exists, delete it
    val fo = HDFSOperation(new Configuration())
    val existDir = fo.existDir(outPath)
    if (existDir) {
      println("HDFS exists outpath: " + outPath)
      println("start to delete ...")
      val isDelete = fo.deleteDir(outPath)
      if (isDelete) {
        println(outPath + " delete done.")
      }
    }

    val conf = new SparkConf()
    val sc = new SparkContext(conf)      // create SparkContext from SparkConf
    val sqlContext = new SQLContext(sc)  // create SQLContext from SparkContext
    val schema = StructType(Array(
      StructField("bill_num", DataTypes.StringType, false),
      StructField("logis_id", DataTypes.StringType, false),
      StructField("store_id", DataTypes.StringType, false),
      StructField("store_code", DataTypes.StringType, false),
      StructField("creater_id", DataTypes.StringType, false),
      StructField("order_status", DataTypes.IntegerType, false),
      StructField("pay_status", DataTypes.IntegerType, false),
      StructField("order_require_varieties", DataTypes.IntegerType, false),
      StructField("order_require_amount", DataTypes.createDecimalType(19, 4), false),
      StructField("order_rec_amount", DataTypes.createDecimalType(19, 4), false),
      StructField("order_rec_gpf", DataTypes.createDecimalType(19, 4), false),
      StructField("deli_fee", DataTypes.FloatType, false),
      StructField("order_type", DataTypes.IntegerType, false),
      StructField("last_modify_time", DataTypes.TimestampType, false),
      StructField("order_submit_time", DataTypes.TimestampType, false),
      StructField("order_submit_date", DataTypes.DateType, false)
    ))
    convert(sqlContext, inputPath, schema, outPath, numPartitions)
  }

  // CSV to parquet
  def convert(sqlContext: SQLContext, inputpath: String, schema: StructType, outpath: String, numPartitions: Int) {
    // load the delimited text into a DataFrame
    val df = sqlContext.read.format("com.databricks.spark.csv")
      .schema(schema)
      .option("delimiter", " ")
      .load(inputpath)
    // write as parquet, coalescing to the requested number of partitions
    // df.write.parquet(outpath)
    df.coalesce(numPartitions).write.parquet(outpath)
  }
}

After packaging, upload the jar to a local directory: /soft/sparkTest/convert/spark.mydemo-1.0-SNAPSHOT.jar. Generate the CSV files on HDFS beforehand, under the HDFS directory /test/spark/convert/data/order/2016-05-01/. Then execute:

spark-submit --queue spark --master yarn --num-executors 10 --executor-cores 2 --executor-memory 3G --class org.apache.spark.ConvertToParquet --name ConvertToParquet file:/soft/sparkTest/convert/spark.mydemo-1.0-SNAPSHOT.jar /test/spark/convert/data/order/2016-05-01/ /test/spark/convert/parquet/bill_parq/order_submit_date=2016-05-01
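After the job finishes, a quick sanity check is to read the generated parquet directory back and look at a few rows. The following is a minimal sketch, not part of the original article, assuming the same Spark 1.6 / SQLContext setup and the partition directory written above.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object VerifyParquet {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("VerifyParquet"))
    val sqlContext = new SQLContext(sc)
    // read back the partition directory produced by ConvertToParquet
    val df = sqlContext.read.parquet("/test/spark/convert/parquet/bill_parq/order_submit_date=2016-05-01")
    println("row count: " + df.count())
    df.show(5)  // print a small sample to confirm the schema and values look right
    sc.stop()
  }
}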
pom.xml related content:
1. Dependencies:
<dependencies>
  <dependency>
    <groupId>com.ecfront</groupId>
    <artifactId>ez-fs</artifactId>
    <version>0.9</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.6.1</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.10</artifactId>
    <version>1.6.1</version>
  </dependency>
  <dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-csv_2.11</artifactId>
    <version>1.4.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.6.0</version>
  </dependency>
</dependencies>
2. Plugins (including packaging the dependencies into the jar):
<build>
  <plugins>
    <plugin>
      <groupId>net.alchim31.maven</groupId>
      <artifactId>scala-maven-plugin</artifactId>
      <version>3.2.1</version>
      <executions>
        <execution>
          <id>scala-compile-first</id>
          <phase>process-resources</phase>
          <goals>
            <goal>add-source</goal>
            <goal>compile</goal>
          </goals>
        </execution>
        <execution>
          <id>scala-test-compile</id>
          <phase>process-test-resources</phase>
          <goals>
            <goal>testCompile</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-compiler-plugin</artifactId>
      <version>2.0.2</version>
      <executions>
        <execution>
          <phase>compile</phase>
          <goals>
            <goal>compile</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <version>1.4</version>
      <configuration>
        <filters>
          <filter>
            <artifact>*:*</artifact>
            <excludes>
              <exclude>META-INF/*.SF</exclude>
              <exclude>META-INF/*.DSA</exclude>
              <exclude>META-INF/*.RSA</exclude>
            </excludes>
          </filter>
        </filters>
      </configuration>
      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>shade</goal>
          </goals>
          <configuration>
            <transformers>
              <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                <mainClass>org.apache.spark.ConvertToParquet</mainClass>
              </transformer>
            </transformers>
          </configuration>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
III. Add partitions to the table
Execute under spark-sql:
alter table pgls.convert_parq add partition (order_submit_date='2016-05-01');
The corresponding data can then be queried with SQL:
select * from pgls.convert_parq where order_submit_date='2016-05-01' limit 5;
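The same two steps, registering the partition and checking the data, can also be scripted from a Spark application, which is convenient when new partition directories arrive every day. Below is a minimal sketch, an assumption rather than part of the original article, again requiring a HiveContext.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object RegisterPartition {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("RegisterPartition"))
    val hiveContext = new HiveContext(sc)
    // register the partition directory written by the conversion job
    hiveContext.sql("alter table pgls.convert_parq add if not exists partition (order_submit_date='2016-05-01')")
    // query a few rows to confirm the partition is visible
    hiveContext.sql("select * from pgls.convert_parq where order_submit_date='2016-05-01' limit 5").show()
    sc.stop()
  }
}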
That concludes "how Spark-sql creates an external partition table". Thank you for reading! I hope the content shared here has been helpful; if you would like to learn more, you are welcome to follow the industry information channel.