This article presents the main methods of Hive optimization. The content is straightforward and easy to follow, and should help resolve common doubts about tuning Hive.
I. Overall architecture optimization
The overall Hive framework is shown below. The computing engine is no longer limited to MapReduce; Tez, Spark, and others are also supported, and different resource schedulers and storage systems can be used depending on the engine.
Overall architectural optimization points:
1. Partition tables by date according to business requirements, and perform dynamic partitioning by type where appropriate.
Related parameter settings:
Since Hive 0.14, hive.exec.dynamic.partition=true by default.
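A minimal usage sketch, assuming hypothetical tables orders_staging (source) and orders (target, partitioned by dt):
-- enable dynamic partitioning for the session
SET hive.exec.dynamic.partition=true;
-- allow all partition columns to be determined dynamically
SET hive.exec.dynamic.partition.mode=nonstrict;
-- the dynamic partition column (dt) must come last in the SELECT list
INSERT OVERWRITE TABLE orders PARTITION (dt)
SELECT order_id, user_id, amount, dt
FROM orders_staging;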
2. To reduce disk storage space and the amount of I/O, compress the data.
Related parameter settings:
Compress job output files with Gzip at the BLOCK level:
mapreduce.output.fileoutputformat.compress=true
mapreduce.output.fileoutputformat.compress.type=BLOCK
mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec
Compress map output with Gzip as well:
mapreduce.map.output.compress=true
mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.GzipCodec
Compress Hive output and intermediate results:
hive.exec.compress.output=true
hive.exec.compress.intermediate=true
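The same settings can also be applied per session from the Hive CLI; a minimal sketch:
-- compress final job output with Gzip at BLOCK level
SET mapreduce.output.fileoutputformat.compress=true;
SET mapreduce.output.fileoutputformat.compress.type=BLOCK;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;
-- compress map output with Gzip
SET mapreduce.map.output.compress=true;
SET mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.GzipCodec;
-- compress Hive output and intermediate results
SET hive.exec.compress.output=true;
SET hive.exec.compress.intermediate=true;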
3. Store Hive intermediate tables as SequenceFile to save serialization and deserialization time.
Related parameter settings:
hive.query.result.fileformat=SequenceFile
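For an explicitly created intermediate table, the file format can also be declared directly; a minimal sketch with a hypothetical table tmp_user_stats:
-- intermediate table stored as SequenceFile
CREATE TABLE tmp_user_stats (
  user_id BIGINT,
  pv      BIGINT
)
STORED AS SEQUENCEFILE;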
4. YARN optimization is not covered here; it will be discussed separately.
II. MR stage optimization
The Hive operators are as follows:
The execution process is as follows:
How the number of reducers is determined:
Related parameter settings (defaults):
hive.exec.reducers.max=999
hive.exec.reducers.bytes.per.reducer=1G
reduce task num = min(hive.exec.reducers.max, input.size / hive.exec.reducers.bytes.per.reducer). The number of reducers can be adjusted according to actual needs.
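A minimal tuning sketch; the value 200 below is hypothetical and chosen only for illustration:
-- raise or lower the data volume handled by each reducer (bytes; 1 GB here)
SET hive.exec.reducers.bytes.per.reducer=1073741824;
-- or fix the number of reduce tasks explicitly
SET mapred.reduce.tasks=200;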
III. Job optimization
1. Local execution
Local execution mode is turned off by default; for small data sets it can speed up execution.
Related parameter settings:
hive.exec.mode.local.auto=true
By default, a query is executed locally only when the total input size is at most hive.exec.mode.local.auto.inputbytes.max=128MB, the number of map tasks is at most hive.exec.mode.local.auto.tasks.max=4, and there is at most 1 reduce task (a configuration sketch follows the table below). Performance testing:
Amount of data (x10,000 rows) | Operation | Normal execution time (s) | Local execution time (s)
170 | group by | 36 | 16
80 | count | 34 | 6
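A minimal sketch of enabling automatic local execution with the thresholds mentioned above:
-- let Hive decide when to run a query locally
SET hive.exec.mode.local.auto=true;
-- maximum total input size for local execution (128 MB)
SET hive.exec.mode.local.auto.inputbytes.max=134217728;
-- maximum number of map tasks for local execution
SET hive.exec.mode.local.auto.tasks.max=4;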
2. Map join
Map join is enabled by default, with hive.auto.convert.join.noconditionaltask.size=10MB.
The table loaded into memory must be read by a plain scan (no group by or other operations applied to it). If both sides of the join satisfy this condition, a /*+ MAPJOIN */ hint naming a specific table has no effect and only the smaller table is loaded into memory; otherwise, the scanned table that satisfies the condition is chosen.
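A minimal sketch, assuming a hypothetical large fact table fact_orders and a small dimension table dim_city:
-- let Hive convert the join to a map join automatically
SET hive.auto.convert.join=true;
-- small-table threshold in bytes (about 10 MB)
SET hive.auto.convert.join.noconditionaltask.size=10000000;
-- the optional hint names the table to load into memory
SELECT /*+ MAPJOIN(c) */ o.order_id, c.city_name
FROM fact_orders o
JOIN dim_city c ON o.city_id = c.city_id;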
IV. SQL optimization
The overall optimization strategy is as follows:
Remove unneeded columns from the query.
Apply WHERE predicates and similar filters as early as the TableScan stage.
Use partition information to read only the qualifying partitions.
Use map-side joins: the large table drives the join and the small table is loaded into every mapper's memory.
Adjust the join order so that the large table is the driving table.
For GROUP BY on tables with skewed data distributions, avoid concentrating data on a few reducers by splitting the work into two map-reduce phases: the first shuffles on the distinct columns and performs partial aggregation on the reduce side to shrink the data; the second map-reduce phase aggregates by the GROUP BY columns.
Perform hash-based partial aggregation on the map side to reduce the amount of data processed on the reduce side (see the sketch after this list).
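A minimal sketch of the last two points, assuming a hypothetical log table page_views with skewed user_id values:
-- partial (hash) aggregation on the map side
SET hive.map.aggr=true;
-- split the skewed GROUP BY into two map-reduce phases
SET hive.groupby.skewindata=true;
SELECT user_id, COUNT(*) AS pv
FROM page_views
GROUP BY user_id;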
V. Platform optimization
1. Hive on Tez
2. Spark SQL (the general trend)
That is the full content of "What are the methods of Hive optimization". Thank you for reading! I hope the content shared here has helped clear up your doubts; to learn more, follow the industry information channel.