This article shares an example-based analysis of Hive tuning. It walks through the main tuning techniques, pairing each with the relevant configuration settings and worked queries.
1. Fetch
Fetch means that for certain queries Hive does not need to run MapReduce at all. For example, for SELECT * FROM emp; Hive can simply read the files in the storage directory that backs emp and print them to the console.
In hive-default.xml.template, hive.fetch.task.conversion defaults to more (older Hive versions default to minimal). With the value more, global lookups, column lookups, LIMIT lookups, and so on all skip MapReduce.
The description in hive-default.xml.template reads:
hive.fetch.task.conversion
Default value: more
Expects one of [none, minimal, more]. Some select queries can be converted to a single FETCH task, minimizing latency. Currently the query should be single-sourced, not having any subquery, and should not have any aggregations or distincts (which incur RS), lateral views, or joins.
none: disable hive.fetch.task.conversion
minimal: SELECT STAR, FILTER on partition columns, LIMIT only
more: SELECT, FILTER, LIMIT only (supports TABLESAMPLE and virtual columns)
Case practice:
(1) Set hive.fetch.task.conversion to none, then run the queries; each one launches a MapReduce job:
hive (default) > set hive.fetch.task.conversion=none;
hive (default) > select * from emp;
hive (default) > select ename from emp;
hive (default) > select ename from emp limit 3;
(2) Set hive.fetch.task.conversion to more, then run the same queries; none of them launches a MapReduce job:
hive (default) > set hive.fetch.task.conversion=more;
hive (default) > select * from emp;
hive (default) > select ename from emp;
hive (default) > select ename from emp limit 3;
2. Local mode
Most Hadoop jobs need the full scalability Hadoop provides in order to handle large datasets. Sometimes, however, the input to a Hive query is very small; in those cases the overhead of launching the job can take far longer than the actual execution. Hive can instead run all the tasks for such a query on a single machine in local mode, which significantly reduces execution time for small datasets.
Users can have Hive start this optimization automatically when appropriate by setting hive.exec.mode.local.auto to true.
// enable local mr
set hive.exec.mode.local.auto=true;
// maximum input size for local mr; when the input is smaller than this, local mode is used (default 134217728, i.e. 128 MB)
set hive.exec.mode.local.auto.inputbytes.max=50000000;
// maximum number of input files for local mr; when there are fewer files than this, local mode is used (default 4)
set hive.exec.mode.local.auto.input.files.max=10;
Case practice:
(1) Enable local mode and run the query:
hive (default) > set hive.exec.mode.local.auto=true;
hive (default) > select * from emp cluster by deptno;
Time taken: 1.328 seconds, Fetched: 14 row(s)
(2) Disable local mode and run the same query:
hive (default) > set hive.exec.mode.local.auto=false;
hive (default) > select * from emp cluster by deptno;
Time taken: 20.09 seconds, Fetched: 14 row(s)
3. Table optimization
3.1. Small table join large table
Putting the table with relatively dispersed keys and the smaller amount of data on the left side of a join can reduce the chance of out-of-memory errors. Going further, a map join can load small dimension tables (fewer than about 1000 records) into memory in advance and complete the join on the map side, so no reducer is needed.
In practice, newer versions of Hive have optimized both small-table-join-large-table and large-table-join-small-table, so there is no longer a meaningful difference between putting the small table on the left or the right.
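Where automatic conversion is turned off, the small table can also be marked explicitly with a MAPJOIN hint. This is a minimal sketch rather than the article's own example, reusing the smalltable and bigtable tables from the cases below (the hint is honored only while hive.ignore.mapjoin.hint is false):
hive (default) > set hive.ignore.mapjoin.hint=false;
hive (default) > select /*+ MAPJOIN(s) */ b.id, b.click_num from smalltable s join bigtable b on s.id = b.id;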
3.2. Large table join large table
1) empty key filtering
Sometimes a join times out because too much data maps to a handful of keys; rows with the same key are all sent to the same reducer, which then runs out of memory. In many cases these hot keys carry abnormal data, so the fix is to analyze them carefully and filter them out in the SQL statement. For example, rows whose join key is NULL can be filtered as follows:
Case practice:
(1) Configure the history server. Add to mapred-site.xml:
<property>
  <name>mapreduce.jobhistory.address</name>
  <value>hadoop102:10020</value>
</property>
<property>
  <name>mapreduce.jobhistory.webapp.address</name>
  <value>hadoop102:19888</value>
</property>
Start the history server:
sbin/mr-jobhistory-daemon.sh start historyserver
View job history at http://hadoop102:19888/jobhistory
(2) Create a table for the null-id data:
create table nullidtable (id bigint, t bigint, uid string, keyword string, url_rank int, click_num int, click_url string) row format delimited fields terminated by '\t';
(3) Load the null-id data into it:
load data local inpath '/opt/module/hive/datas/nullid' into table nullidtable;
(4) Test without filtering null ids:
insert overwrite table jointable select n.* from nullidtable n left join bigtable o on n.id = o.id;
Time taken: 42.038 seconds
Time taken: 37.284 seconds
(5) Test with null ids filtered out:
insert overwrite table jointable select n.* from (select * from nullidtable where id is not null) n left join bigtable o on n.id = o.id;
Time taken: 31.725 seconds
Time taken: 28.876 seconds
2) empty key conversion
Sometimes a key is null for many rows, yet those rows are not abnormal and must be kept in the join result. In that case we can assign a random value to the null keys in table a so that the rows are distributed evenly and randomly across the reducers. For example:
Case practice:
Without randomly distributing null values:
(1) Set 5 reducers:
set mapreduce.job.reduces = 5;
(2) Join the two tables:
insert overwrite table jointable select n.* from nullidtable n left join bigtable b on n.id = b.id;
Result: data skew occurs, and some reducers consume far more resources than the others.
With null values randomly distributed:
(1) Set 5 reducers:
set mapreduce.job.reduces = 5;
(2) Join the two tables:
insert overwrite table jointable select n.* from nullidtable n full join bigtable o on nvl(n.id, rand()) = o.id;
Result: the data skew disappears and the load is balanced across the reducers.
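The same conversion is often written with a CASE expression that salts only the null keys. The variant below is a sketch rather than the source's own query, using the same tables; the explicit casts keep the CASE branches and the comparison the same type:
hive (default) > insert overwrite table jointable
select n.* from nullidtable n full join bigtable o
on case when n.id is null then concat('hive', rand()) else cast(n.id as string) end = cast(o.id as string);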
3.3. MapJoin
If MapJoin is not specified, or its conditions are not met, the Hive parser converts the join into a Common Join, i.e. the join is completed in the Reduce phase, where data skew easily occurs. MapJoin instead loads the small table entirely into memory on the map side and performs the join there, so no reducer has to process it.
1) parameters for enabling MapJoin
(1) Enable automatic MapJoin selection (default true):
set hive.auto.convert.join = true;
(2) Threshold between large and small tables (below 25 MB is treated as a small table by default):
set hive.mapjoin.smalltable.filesize=25000000;
2) the working mechanism of MapJoin (figure omitted)
3) case practice:
(1) Enable the MapJoin function (default true):
set hive.auto.convert.join = true;
(2) Run small table JOIN large table. Note: here the small table drives the left join and all of its rows must be written out, so the query still goes through reduce and MapJoin does not take effect:
insert overwrite table jointable select b.id, b.t, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url from smalltable s left join bigtable b on s.id = b.id;
Time taken: 24.594 seconds
(3) Run large table JOIN small table:
insert overwrite table jointable select b.id, b.t, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url from bigtable b left join smalltable s on s.id = b.id;
Time taken: 24.315 seconds
3.4. Group By
By default, rows with the same key produced in the Map phase are all sent to one reducer, so a single oversized key skews the job.
Not every aggregation has to happen on the reduce side. Many aggregations can be partially performed on the Map side, with the final result assembled on the Reduce side.
1) settings for enabling map-side aggregation
(1) Whether to aggregate on the map side (default true):
set hive.map.aggr = true;
(2) Number of entries to aggregate on the map side:
set hive.groupby.mapaggr.checkinterval = 100000;
(3) Load balancing when data is skewed (default false):
set hive.groupby.skewindata = true;
Note: when this option is set to true, the generated query plan contains two MR jobs. In the first job, the map output is distributed randomly across the reducers; each reducer performs a partial aggregation and emits its result, so rows with the same Group By key may end up on different reducers, which achieves load balancing. The second job then distributes the pre-aggregated results by Group By key (this guarantees that identical keys reach the same reducer) and completes the final aggregation.
Case practice:
hive (default) > select deptno from emp group by deptno;
Stage-Stage-1: Map: 1 Reduce: 5 Cumulative CPU: 23.68 sec HDFS Read: 19987 HDFS Write: 9 SUCCESS
Total MapReduce CPU Time Spent: 23 seconds 680 msec
OK
deptno
10
20
30
After optimization:
hive (default) > set hive.groupby.skewindata = true;
hive (default) > select deptno from emp group by deptno;
Stage-Stage-1: Map: 1 Reduce: 5 Cumulative CPU: 28.53 sec HDFS Read: 18209 HDFS Write: 534 SUCCESS
Stage-Stage-2: Map: 1 Reduce: 5 Cumulative CPU: 38.32 sec HDFS Read: 15014 HDFS Write: 9 SUCCESS
Total MapReduce CPU Time Spent: 1 minutes 6 seconds 850 msec
OK
deptno
10
20
30
3.5. Count(Distinct) deduplication statistics
With small data volumes this doesn't matter, but with large ones it does: COUNT DISTINCT runs as a single Reduce task, and when that one reducer has to process too much data the whole job struggles to finish. The usual replacement is GROUP BY followed by COUNT, though that in turn requires watching for data skew in the group by.
Case practice:
(1) Create the large table:
create table bigtable (id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string) row format delimited fields terminated by '\t';
(2) Load the data:
load data local inpath '/opt/module/datas/bigtable' into table bigtable;
(3) Set 5 reducers:
set mapreduce.job.reduces = 5;
(4) Run the deduplicated id query with distinct:
select count(distinct id) from bigtable;
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 7.12 sec HDFS Read: 120741990 HDFS Write: 7 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 120 msec
OK
_c0
99947
Time taken: 23.607 seconds, Fetched: 1 row(s)
(5) Deduplicate with GROUP BY instead:
hive (default) > select count(id) from (select id from bigtable group by id) a;
Stage-Stage-1: Map: 1 Reduce: 5 Cumulative CPU: 17.53 sec HDFS Read: 120752703 HDFS Write: 580 SUCCESS
Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 4.29 sec HDFS Read: 9409 HDFS Write: 7 SUCCESS
Total MapReduce CPU Time Spent: 21 seconds 820 msec
OK
_c0
99947
Time taken: 50.795 seconds, Fetched: 1 row(s)
It is worth noting that although this takes one extra job, with large data volumes it is usually well worth it.
3.6. Cartesian product
Avoid Cartesian products: do not write joins without an on condition, or with an invalid one. Hive can only use a single reducer to complete a Cartesian product.
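To make the pitfall concrete, here is an illustrative pair of queries (not from the source) contrasting the two forms, reusing the tables from earlier sections:
-- no on condition: every row pairs with every row, and a single reducer does all the work
select b.id, s.id from bigtable b join smalltable s;
-- equi-join: rows are matched on id and the work is spread across reducers
select b.id, s.id from bigtable b join smalltable s on b.id = s.id;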
3.7. Row and column filtering
Column pruning: in SELECT, take only the columns you need; use partition filtering wherever possible, and avoid SELECT *.
Row pruning: with partition pruning and outer joins, if the filter condition on the secondary table is written in the WHERE clause, the whole table is joined first and filtered only afterwards, as the case practice below shows.
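The fix is to push the filter into a subquery so that only the matching rows take part in the join. A sketch of the rewrite (the optimized form is illustrative, not from the source; the case below shows the unoptimized query):
hive (default) > select b.id from bigtable b join (select id from bigtable where id <= 10) o on b.id = o.id;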
Case practice:
1) Test: join the two tables first, then filter in the where clause:
hive (default) > select o.id from bigtable b join bigtable o on o.id = b.id where o.id <= 10;
4. Set the number of Maps and Reduces reasonably
4.1. Increase the number of maps for complex files
When input files are complex and each map handles too much data, lowering the maximum split size increases the number of maps:
hive (default) > set mapreduce.input.fileinputformat.split.maxsize=100;
hive (default) > select count(*) from emp;
Hadoop job information for Stage-1: number of mappers: 6; number of reducers: 1
4.2. Merge small files
1) Merge small files before map execution to reduce the number of maps. CombineHiveInputFormat (the system default) merges small files; HiveInputFormat does not:
set hive.input.format= org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
2) Merge small files at the end of a Map-Reduce task:
// merge small files at the end of a map-only task, default true
SET hive.merge.mapfiles = true;
// merge small files at the end of a map-reduce task, default false
SET hive.merge.mapredfiles = true;
// size of the merged files, default 256 MB
SET hive.merge.size.per.task = 268435456;
// when the average size of the output files is below this value, start a separate map-reduce task to merge them
SET hive.merge.smallfiles.avgsize = 16777216;
4.3. Set the number of reduces reasonably
1) Method one: let Hive derive the count
① amount of data processed by each reduce, default 256 MB: hive.exec.reducers.bytes.per.reducer=256000000
② maximum number of reduces per task, default 1009: hive.exec.reducers.max=1009
③ formula for the number of reducers: N = min(parameter ②, total input data / parameter ①)
2) Method two: set the count directly (the default is defined in Hadoop's mapred-default.xml)
// set the number of reduces for each job
set mapreduce.job.reduces = 15;
3) More reduces are not always better:
(1) starting and initializing too many reduces consumes time and resources;
(2) each reduce writes its own output file, and if many small files are produced they become a too-many-small-files problem for whatever task consumes them next.
When setting the number of reduces, also weigh two principles: use an appropriate number of reduces for the data volume, and keep the amount of data each single reduce task handles appropriate.
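As a worked example of formula ③ (the 1 GB input figure is illustrative): with the defaults above, a job reading about 1 GB of input gets
N = min(1009, ceil(1073741824 / 256000000)) = min(1009, 5) = 5 reducers.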
5. Parallel execution
Hive converts a query into one or more stages: MapReduce stages, sampling stages, merge stages, limit stages, or other stages Hive may need during execution. By default, Hive executes only one stage at a time. A particular job, however, may contain many stages that are not fully dependent on one another, so some can run in parallel, shortening the overall execution time. The more stages that can run in parallel, the faster the job may complete.
Parallel execution is turned on by setting the parameter hive.exec.parallel to true. In a shared cluster, note that the more stages a job runs in parallel, the higher its cluster utilization.
set hive.exec.parallel=true; // enable parallel task execution, default false
set hive.exec.parallel.thread.number=16; // maximum parallelism allowed for one sql, default 8
Of course, this only helps when the system has spare resources; without spare resources, the parallelism cannot materialize.
6. Strict mode
Hive can be configured to prevent certain dangerous operations:
1) Partition tables without partition filtering. With hive.strict.checks.no.partition.filter set to true, queries against a partitioned table are rejected unless the where clause contains a filter on a partition field that limits the scope; in other words, users are not allowed to scan all partitions. The rationale is that partitioned tables usually hold very large, fast-growing datasets, and a query without a partition restriction could consume unacceptably large resources.
2) order by without limit. With hive.strict.checks.orderby.no.limit set to true, any query that uses order by must also use a limit statement. Because order by sends all result rows to a single reducer to perform the sort, forcing the user to add a LIMIT prevents that reducer from running for a very long time.
3) Cartesian products. With hive.strict.checks.cartesian.product set to true, Cartesian product queries are rejected. Users familiar with relational databases might expect to use a where clause instead of an on clause in a JOIN, counting on the execution optimizer to convert the WHERE into an ON; unfortunately, Hive does not perform this optimization, so if the tables are large enough the query can get out of control.
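A short session sketch showing the three checks being enabled, each followed by an illustrative statement it would then reject (partitioned_table is a placeholder name; emp and sal come from the earlier examples):
hive (default) > set hive.strict.checks.no.partition.filter=true;
hive (default) > set hive.strict.checks.orderby.no.limit=true;
hive (default) > set hive.strict.checks.cartesian.product=true;
-- rejected: scans every partition of a partitioned table
hive (default) > select * from partitioned_table;
-- rejected: order by without limit
hive (default) > select * from emp order by sal;
-- rejected: join without an on condition (Cartesian product)
hive (default) > select * from emp a join emp b;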
7. JVM reuse
JVM reuse is a Hadoop tuning parameter. It matters a lot for Hive performance, especially in scenarios where small files are hard to avoid or where there are very many tasks, most of them short-running.
Hadoop's default configuration spawns a new JVM for each map or reduce task, and JVM startup can be a considerable overhead, especially when a job contains hundreds of tasks. JVM reuse lets a JVM instance be reused up to N times within the same job. N can be configured in Hadoop's mapred-site.xml; a value between 10 and 20 is typical, but it should be tested against the specific business scenario.
<property>
  <name>mapreduce.job.jvm.numtasks</name>
  <value>10</value>
  <description>How many tasks to run per jvm. If set to -1, there is no limit.</description>
</property>
The drawback is that JVM reuse holds on to the task slots it uses for the entire job, releasing them only when the job completes. If some reduce tasks in an unbalanced job take much longer than the rest, the reserved slots sit idle, unavailable to other jobs, until every task has finished.
8. Compression
For details, see 07Hive Learning path (Compression and Storage)
9. Execution plan (Explain)
1) basic syntax
EXPLAIN [EXTENDED | DEPENDENCY | AUTHORIZATION] query
2) case practice:
① View the execution plan of these statements:
explain select * from emp;
explain select deptno, avg(sal) avg_sal from emp group by deptno;
② View the detailed execution plan:
explain extended select * from emp;
explain extended select deptno, avg(sal) avg_sal from emp group by deptno;
That is all for this example analysis of Hive tuning. Thank you for reading!