2025-04-04 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/01 Report--
This article introduces several ways to optimize Hive: configuration parameters that change how queries are planned and executed, followed by guidelines at the SQL level.
Hive predicate pushdown
Parameter setting: set hive.optimize.ppd=true;
Implementation: filter conditions are applied as early as possible, provided this does not change the result. With predicate pushdown the filter runs on the map side, which reduces map output, cuts the amount of data transferred across the cluster, saves cluster resources, and improves task performance.
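As a minimal sketch of the effect, the query below joins two hypothetical tables, orders(order_id, customer_id, amount, dt) and customers(customer_id, region); these names are assumptions for illustration and do not come from the article. With pushdown enabled, the EXPLAIN output should show each filter applied right after the corresponding table scan rather than after the join.

```sql
-- Enable predicate pushdown for this session.
set hive.optimize.ppd=true;

-- orders and customers are hypothetical tables used only for illustration.
-- Each WHERE condition touches a single table, so it can be evaluated
-- on the map side before the join instead of on the joined rows.
EXPLAIN
SELECT o.order_id, c.region
FROM orders o
JOIN customers c
  ON o.customer_id = c.customer_id
WHERE o.dt = '2025-04-01'
  AND c.region = 'east';
```

Running the same EXPLAIN with hive.optimize.ppd=false and comparing the two plans is a simple way to confirm what the setting changes on your Hive version.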
Enabling map-side aggregation
Parameter setting: set hive.map.aggr=true;
Implementation: partial aggregation is performed in the map phase, which can greatly reduce the amount of data sent from map to reduce and so, to some extent, eases the data skew caused by GROUP BY.
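A small sketch of a query that benefits from this setting, again using the hypothetical customers(customer_id, region) table. Each mapper keeps partial counts per region and sends only those partial results to the reducers.

```sql
-- Enable partial aggregation in the map phase.
set hive.map.aggr=true;

-- customers is a hypothetical table used only for illustration.
-- Each mapper emits one partial count per region it has seen,
-- instead of one record per input row.
SELECT region, count(*) AS cnt
FROM customers
GROUP BY region;
```

The benefit is largest when the number of distinct group keys is small relative to the number of rows, as with a region column.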
Small file merge
Parameter settings:
set hive.merge.mapfiles=true;  -- merge small files at the end of a map-only job
set hive.merge.mapredfiles=false;  -- set to true to merge small files at the end of a MapReduce job
set hive.merge.size.per.task=256000000;  -- target size of the merged files (256*1000*1000 bytes)
set mapred.max.split.size=256000000;  -- maximum split size per map
set mapred.min.split.size.per.node=1;  -- minimum split size on a node
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;  -- combine small files before the map phase
Implementation: too many small files put pressure on HDFS and reduce processing efficiency. Merging the result files of map and reduce tasks mitigates this.
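One way these settings might be used is to rewrite an existing table that already consists of many small files. The sketch below assumes a hypothetical source table orders and a hypothetical target table orders_compact with the same schema; neither name comes from the article.

```sql
-- Ask Hive to merge small output files after the job finishes.
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.size.per.task=256000000;

-- orders and orders_compact are hypothetical tables with identical schemas.
-- The rewrite produces fewer, larger files in the target table.
INSERT OVERWRITE TABLE orders_compact
SELECT * FROM orders;
```

The merge step costs an additional pass over the output, so it is worth checking the resulting file sizes on HDFS before applying it to every job.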
Handling data skew caused by GROUP BY
Parameter setting: set hive.groupby.skewindata=true;
Implementation: when this option is set to true, the generated query plan contains two MR jobs. In the first job, the map output is distributed randomly across the reducers, and each reducer performs a partial aggregation and outputs the result. Because rows with the same GROUP BY key may go to different reducers, the load is balanced. The second job distributes the pre-aggregated results to the reducers by GROUP BY key (which guarantees that rows with the same key reach the same reducer) and completes the final aggregation.
When hive.groupby.skewindata=true, Hive does not support DISTINCT on more than one column and reports an error:
Error in semantic analysis: DISTINCT on different columns not supported with skew in data.
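The restriction can be illustrated with the hypothetical orders(order_id, customer_id, amount, dt) table; the table and column names are assumptions for illustration. A single DISTINCT column is accepted, while DISTINCT on two different columns in the same query triggers the error quoted above.

```sql
-- Enable the two-job plan for skewed GROUP BY keys.
set hive.groupby.skewindata=true;

-- Accepted: only one column is used with DISTINCT.
SELECT dt, count(DISTINCT customer_id) AS customers
FROM orders
GROUP BY dt;

-- Rejected while skewindata is true: DISTINCT on two different columns.
-- SELECT dt,
--        count(DISTINCT customer_id),
--        count(DISTINCT order_id)
-- FROM orders
-- GROUP BY dt;
```

If both distinct counts are needed, one option is to compute them in separate queries, or to switch the setting off for that particular statement.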
Parallel execution
Parameter settings:
set hive.exec.parallel=true;  -- enable concurrent execution of stages
set hive.exec.parallel.thread.number=16;  -- maximum number of stages run in parallel for one SQL statement; the default is 8
Implementation: Hive converts a query into one or more stages, such as MapReduce stages, sampling stages, merge stages, and limit stages. By default only one stage runs at a time. Stages that do not depend on each other, however, can run in parallel.
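A query with independent branches is the typical case where this helps. In the sketch below, the two branches of the UNION ALL read different hypothetical tables (orders and customers, assumed for illustration) and do not depend on each other, so their stages are candidates for parallel execution.

```sql
-- Allow independent stages of one query to run at the same time.
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=16;

-- orders and customers are hypothetical tables used only for illustration.
-- Each branch of the UNION ALL is an independent aggregation.
SELECT t.src, t.cnt
FROM (
  SELECT 'orders' AS src, count(*) AS cnt FROM orders
  UNION ALL
  SELECT 'customers' AS src, count(*) AS cnt FROM customers
) t;
```

Parallel stages compete for the same cluster resources, so the gain depends on how much spare capacity the cluster has when the query runs.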
Restricting Cartesian product queries
Every join between two tables must have an ON clause.
Parameter setting: set hive.mapred.mode=strict;
Implementation: even with small inputs, a Cartesian product generates a considerable amount of data, so restrict it at the configuration level to prevent mistakes.
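A short sketch of what strict mode rejects and accepts, using the hypothetical orders and customers tables (assumed for illustration). Note that strict mode also enforces other checks, such as requiring a partition filter on partitioned tables, so the accepted query includes a filter on dt on the assumption that orders is partitioned by that column.

```sql
-- Turn on strict mode for this session.
set hive.mapred.mode=strict;

-- Rejected in strict mode: no join condition, so this is a Cartesian product.
-- SELECT o.order_id, c.region
-- FROM orders o
-- JOIN customers c;

-- Accepted: the join has an ON clause.
SELECT o.order_id, c.region
FROM orders o
JOIN customers c
  ON o.customer_id = c.customer_id
WHERE o.dt = '2025-04-01';
```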
SQL layer optimization
For complex logical processing, there are several principles:
1. Filter large tables as early as possible, for example by date or region. The filtered rows can be inserted into a temporary table and then joined with other tables, which can greatly reduce processing time. This is the simplest and most effective method.
2. When combining several result sets with UNION ALL, you can instead INSERT INTO the result table one result set at a time, which can speed up processing.
3. In a join, put the small table on the left where possible. The MR join logic caches the small table in memory and then scans the large table, which is efficient.
4. Although Hive already supports IN/EXISTS syntax, and it is fine for small queries, rewriting in LEFT SEMI JOIN form is recommended. With large data volumes you can compare the efficiency of the two forms.
5. Some basic rules: prefer SORT BY to ORDER BY; prefer GROUP BY to DISTINCT; in multi-table joins, use the same join key where possible; NULL values in join keys can cause data skew, so if the NULL rows do not affect the result set, filter them out in advance.
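Two of the principles above can be sketched in HiveQL. The tables orders(order_id, customer_id, amount, dt) and customers(customer_id, region) are hypothetical and serve only as illustration. The first pair shows an IN subquery and a LEFT SEMI JOIN rewrite; the last query filters NULL join keys before an inner join, where they could never match anyway.

```sql
-- IN subquery form.
SELECT o.order_id, o.amount
FROM orders o
WHERE o.customer_id IN (
  SELECT c.customer_id FROM customers c WHERE c.region = 'east'
);

-- LEFT SEMI JOIN form of the same query.
-- The right-hand table may only be referenced in the ON clause.
SELECT o.order_id, o.amount
FROM orders o
LEFT SEMI JOIN customers c
  ON o.customer_id = c.customer_id
 AND c.region = 'east';

-- Filter NULL join keys in advance so they are not all shuffled
-- to the same reducer. For an inner join this does not change the result.
SELECT o.order_id, c.region
FROM (
  SELECT order_id, customer_id
  FROM orders
  WHERE customer_id IS NOT NULL
) o
JOIN customers c
  ON o.customer_id = c.customer_id;
```

For an outer join the NULL rows usually have to be kept, so the early filter applies only when, as the list says, those rows have no effect on the result set.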
Real production tuning must take the characteristics of the business and the company's data into account. No optimization is a cure-all; practicing and comparing will keep revealing further room for optimization.