This article presents an example analysis of how Spark SQL, when parsing queries over parquet-format Hive tables, obtains the partition fields and query conditions. The content is quite detailed; interested readers are encouraged to use it as a reference. I hope you find it helpful.
First, the application scenario addressed here: when Spark SQL processes Hive table data, how do we determine whether the table being loaded is a partition table and, if so, what its partition fields are? And, going further, how do we enforce that queries against partition tables must specify a partition? Two situations are involved: a select SQL query, and loading the Hive table by its storage path. This article parses the partition-table fields only for the "loading the Hive table by its path" case, and explains in detail some of the problems encountered along the way together with their solutions. If you have a similar requirement, it is recommended to combine the solution described below with parsing the Spark SQL logical plan, and to encapsulate the whole thing into a general tool.
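To make the two situations concrete, here is a minimal sketch. It assumes an existing SparkSession named spark; the table name test.test_partition is an assumption inferred from the fixture paths used later in this article:

// Situation 1: a select SQL query against the metastore table
val byQuery = spark.sql("SELECT * FROM test.test_partition WHERE dt = '20200101'")
// Situation 2: loading the table's HDFS storage path directly, bypassing the metastore
val byPath = spark.read.format("parquet").load("/spark/dw/test.db/test_partition/dt=20200101")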
The problem
Spark SQL loads a specified Hive partition table path, but the resulting DataSet has no partition fields.
For example:
spark.read.format("parquet").load(s"${hive_path}")
Here spark is a SparkSession instance and hive_path is the storage path of the Hive partition table on HDFS.
Two ways of specifying hive_path can cause this to happen, as the sketch after this list reproduces (test_partition is a Hive external partition table, dt is its partition field, and there is partition data for dt=20200101 and dt=20200102):
1. hive_path = "/spark/dw/test.db/test_partition/dt=20200101"
2. hive_path = "/spark/dw/test.db/test_partition/*"
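A minimal sketch reproducing the phenomenon (again assuming an existing SparkSession named spark and the test_partition fixture above; the exact columns printed depend on your data):

// Case 1: load a single partition directory; the partition field dt is lost
val ds1 = spark.read.format("parquet").load("/spark/dw/test.db/test_partition/dt=20200101")
ds1.printSchema() // only the data columns appear; no dt field
// Case 2: a glob over all partitions behaves the same way
val ds2 = spark.read.format("parquet").load("/spark/dw/test.db/test_partition/*")
ds2.printSchema() // again no dt field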
Because a lot of source code is involved, only the classes, objects, and methods that the example program actually touches are drawn into the xmind diagram below. Readers who want to study this carefully can use the diagram as a map when analyzing the Spark source code.
Problem analysis
The analysis below gives the key source code fragments; read them together with the xmind diagram above.
When the parameter basePath is not specified:
1. hive_path = /spark/dw/test.db/test_partition/dt=20200101
basePaths obtained by Spark SQL's underlying processing: Set(new Path("/spark/dw/test.db/test_partition/dt=20200101")) [pseudo code]
leafDirs: Seq(new Path("/spark/dw/test.db/test_partition/dt=20200101")) [pseudo code]
2. hive_path = /spark/dw/test.db/test_partition/*
basePaths obtained by Spark SQL's underlying processing: Set(new Path("/spark/dw/test.db/test_partition/dt=20200101"), new Path("/spark/dw/test.db/test_partition/dt=20200102")) [pseudo code]
leafDirs: Seq(new Path("/spark/dw/test.db/test_partition/dt=20200101"), new Path("/spark/dw/test.db/test_partition/dt=20200102")) [pseudo code]
In both cases the check if (basePaths.contains(currentPath)) in the source code evaluates to true, so the variable finished is set to true and the loop exits before any partition is parsed. The resulting DataSet therefore has no partition field.
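To see why, here is a toy re-implementation of the directory walk performed by org.apache.spark.sql.execution.datasources.PartitioningUtils.parsePartition. This is a deliberately simplified sketch for illustration, not the actual Spark source: the real method also handles typed partition values, _temporary directories, and more, and details vary across Spark versions.

import org.apache.hadoop.fs.Path

// Walk from a leaf directory up towards the root, collecting "k=v" path
// segments as partition columns, stopping as soon as a base path is hit.
def parsePartitionColumns(leafDir: Path, basePaths: Set[Path]): Seq[(String, String)] = {
  var currentPath = leafDir
  var columns = List.empty[(String, String)]
  var finished = false
  while (!finished && currentPath != null) {
    if (basePaths.contains(currentPath)) {
      // Our two problem cases land here on the very first iteration:
      // the leaf dir itself is in basePaths, so "dt=20200101" is never split.
      finished = true
    } else {
      currentPath.getName.split("=", 2) match {
        case Array(k, v) => columns = (k, v) :: columns
        case _           => finished = true // not a "k=v" partition segment
      }
      currentPath = currentPath.getParent
    }
  }
  columns
}

// basePaths contains the leaf dir itself (hive_path = ".../dt=20200101"):
parsePartitionColumns(
  new Path("/spark/dw/test.db/test_partition/dt=20200101"),
  Set(new Path("/spark/dw/test.db/test_partition/dt=20200101"))) // => Nil: no dt column

// basePaths is the table root (the basePath option of the solution below):
parsePartitionColumns(
  new Path("/spark/dw/test.db/test_partition/dt=20200101"),
  Set(new Path("/spark/dw/test.db/test_partition"))) // => List(("dt", "20200101"))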
Solutions (personally tested and effective)
1. When Spark SQL loads the Hive data path, specify the parameter basePath, for example (a fuller sketch follows this list):
spark.read.option("basePath", "/spark/dw/test.db/test_partition")
2. Rewrite the processing logic in the basePaths and parsePartition methods, and modify the other code they touch as needed. Because quite a lot of code has to be rewritten, it is best encapsulated into a reusable tool.
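Putting solution 1 together (again assuming an existing SparkSession named spark; the basePath option is standard Spark file-source behavior for partition discovery):

// Point basePath at the table root so partition discovery walks down
// from there and parses the dt=... segments along the way.
val ds = spark.read
  .option("basePath", "/spark/dw/test.db/test_partition")
  .format("parquet")
  .load("/spark/dw/test.db/test_partition/dt=20200101")
ds.printSchema() // now includes the partition field dt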
This concludes the example analysis of how Spark SQL parses queries over parquet-format Hive tables to obtain partition fields and query conditions. I hope the above content is helpful to you. If you found the article useful, please share it so that more people can see it.