How to Understand Dynamic Partitioning in Hive Query Optimization

Today I will talk about how to understand dynamic partitioning in Hive query optimization. Many people may not know much about it, so I have summarized the following; I hope you get something out of this article.
Recently, a class of customer business data in our production environment, nearly 800 million rows totaling about 1 TB, needed to be queried and analyzed. A single Hive SQL filter over it took several thousand seconds, far too slow to be practical, so the Hive queries had to be optimized. Hive query performance can generally be improved in the following ways:
Use partition tables
Optimize joins (join)
Optimize sorting (order by)
Merge small files: too many small files put pressure on HDFS and reduce processing efficiency (a sketch of the relevant settings follows this list)
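As a minimal sketch of the last two items above: the settings below are standard Hive parameters for merging small output files, and sort by can replace order by when a global total order is not required. The values and the table name some_table are illustrative assumptions, not tuned for any particular cluster:

-- merge small output files at the end of a job (illustrative values)
set hive.merge.mapfiles=true;            -- merge outputs of map-only jobs
set hive.merge.mapredfiles=true;         -- merge outputs of map-reduce jobs
set hive.merge.size.per.task=256000000;  -- target merged file size, in bytes

-- sort by sorts within each reducer, avoiding the single-reducer
-- bottleneck of order by (some_table is a hypothetical table)
select id, name from some_table sort by id;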
Given the optimization methods above and the characteristics of our business data, we decided to use a partition table to improve query performance. The detailed steps follow.
Because our business data is generated by user transactions, it fits the profile of time-series data, and the daily volume is fairly even, roughly 3 to 5 million rows per day. Given these characteristics, we adopted a time-partitioning scheme: the table is partitioned by day.
Building a single-partition table
a. Create the data table
create database if not exists test;
use test;
create table if not exists application (
  id int,
  name string,
  hobby array<string>,
  address map<string, string>,
  create_time string
)
partitioned by (datetime string)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties (
  "separatorChar" = ",",
  "quoteChar" = "'",
  "escapeChar" = "\\"
)
stored as textfile;
The statement above creates a table with a single partition column. The table's source files are text files in CSV/TSV format, and the partition column is datetime. Note: a partition column must not duplicate any existing column of the table.
b. Import data
There are several ways to import data. The first is to create a Hive external table and point it at the source data directory with the location keyword.
create external table application (...)
partitioned by (datetime string)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties (...)
stored as textfile
location '/appdata/application';
The second way is to load the CSV source files directly with load data.
load data local inpath '/appdata/application/application.csv'
overwrite into table test.application
partition (datetime='20190418');
c. View data and partitions
select * from test.application limit 1;
show partitions test.application;
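For reference, and assuming the table built above, describe formatted can also report the HDFS location of a single partition (the exact fields in its output vary by Hive version):

describe formatted test.application partition (datetime='20190418');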
You can see that the data has landed in the partition datetime=20190418, and show create table test.application shows the table's data directory in HDFS. However, manually importing every day's data into its partition like this is tedious and hurts efficiency, so we turn to Hive's dynamic partitioning.
Our solution is to first create an external table over the source data, and then insert from that source table into the new partitioned table using dynamic partitioning. The key difference from the operation above is that the import no longer specifies a partition value; instead, each row is routed to the matching partition automatically, based on its create_time.
-- hive source data table
create external table application_source (...)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties (...)
stored as textfile
location '/appdata/application';

-- hive dynamic partition data table
create table application (...)
partitioned by (datetime string)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties (...)
stored as textfile;
Note: before using Hive dynamic partitioning, the relevant dynamic partition parameters must be set.
-- hive parameter configuration
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- hive data import
insert overwrite table application partition (datetime)
select id, name, hobby, address, create_time
from application_source;
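Depending on how many days of data one statement loads, the dynamic partition limits may also need raising. These are standard Hive parameters; the values shown are the usual defaults, given for illustration only:

set hive.exec.max.dynamic.partitions=1000;         -- max dynamic partitions per statement
set hive.exec.max.dynamic.partitions.pernode=100;  -- max dynamic partitions per mapper/reducer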
Hive takes the value of the last column in the select list (create_time here) and uses it as the value of the datetime partition variable in the insert statement; in other words, dynamic partition values are bound to partition columns by position, not by name.
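To make the positional rule concrete, here is a minimal sketch: the name or alias of the last select column is irrelevant, only its position binds it to the datetime partition column. The substr call is an assumption for illustration, supposing create_time begins with a yyyyMMdd day prefix:

-- illustration: the last select column feeds the datetime partition by position;
-- substr(...) assumes create_time starts with a yyyyMMdd prefix
insert overwrite table application partition (datetime)
select id, name, hobby, address, substr(create_time, 1, 8) as day_value
from application_source;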
Now we run our Hive SQL filters against the partitioned table, adding a time condition on the partition column to the where clause.
select id, name, hobby, address, create_time from application where ... and datetime='20190418';
This keeps the Hive query from scanning the whole table. After the optimization above, query time dropped from thousands of seconds to hundreds of seconds, a nearly 10x improvement in query performance.
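To verify that the filter really prunes partitions instead of scanning the whole table, inspect the query plan; with pruning in effect, the plan should reference only the matching partition (the exact plan format varies by Hive version):

explain select id, name, hobby, address, create_time
from application
where datetime='20190418';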
That is all on how to understand dynamic partitioning in Hive query optimization. I hope this article has given you a clearer picture of it; thank you for reading.