
What are the tuning strategies in the use of hive

2025-04-02 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/01 Report--

This article introduces the tuning strategies available when using Hive. It is quite detailed and has reference value, so interested readers should find it worth reading!

Here are some tuning strategies for using Hive.

1. Fetch tasks

Fetch means that for certain queries Hive does not need to run MapReduce at all. For example, for SELECT * FROM employees; Hive can simply read the files in the storage directory of the employees table and print the results to the console.

In the hive-default.xml.template file, hive.fetch.task.conversion defaults to more (older versions of Hive defaulted to minimal). When this property is set to more, global lookups, field lookups, limit queries, and so on do not use MapReduce.

hive.fetch.task.conversion
Default: more. Expects one of [none, minimal, more].
Some select queries can be converted to a single FETCH task, minimizing latency. Currently the query should be single-sourced, without any subquery, and should not have any aggregations or distincts (which incur an RS), lateral views, or joins.
0. none: disable hive.fetch.task.conversion (fetch disabled entirely)
1. minimal: only SELECT *, FILTER on partition columns, and LIMIT use fetch (no MapReduce)
2. more: SELECT, FILTER, LIMIT (supporting TABLESAMPLE and virtual columns) use fetch; field selection and limit queries do not trigger MapReduce

You can also temporarily modify the value of this parameter under the hive command line:

hive (default)> set hive.fetch.task.conversion=more;

2. Local mode

Sometimes the input data of a Hive query is very small, and the overhead of triggering distributed execution may far exceed the actual job execution time. In most such cases, Hive can handle the whole job on a single machine in local mode, which can shorten the execution time significantly for small datasets. That is, a single map and reduce task is started and executed on one host. The relevant parameters are set as follows:

// enable local-mode mr and let Hive decide automatically whether to use it
set hive.exec.mode.local.auto=true;
// maximum input size for local mr; when the input is smaller than this, local mode is used
// (default 134217728 bytes, i.e. 128 MB)
set hive.exec.mode.local.auto.inputbytes.max=50000000;
// maximum number of input files for local mr; when there are fewer files than this, local mode is used (default 4)
set hive.exec.mode.local.auto.input.files.max=10;

3. Table optimization

3.1 Small table join large table

Placing the table whose keys are relatively scattered and whose data volume is small on the left side of the join can effectively reduce the chance of out-of-memory errors, because the left table is read first. Furthermore, a map join can be used to load small dimension tables (fewer than roughly 1000 records) into memory and complete the "reduce" work on the map side.

Actual testing shows that newer versions of Hive have optimized both small-table-JOIN-large-table and large-table-JOIN-small-table; there is no longer an obvious difference between putting the small table on the left or the right.

3.2 Large table join large table

During this experiment, you can open Hadoop's JobHistory server to inspect job execution, including execution times.

Configure mapred-site.xml:

mapreduce.jobhistory.address bigdata111:10020
mapreduce.jobhistory.webapp.address bigdata111:19888

Start the history server:

mr-jobhistory-daemon.sh start historyserver

Then open the history server's web page: http://192.168.1.102:19888

Figure 3.1 hive large table join result chart

You can see that the job result has many execution result status parameters, such as execution time, and so on.

3.2.1 empty key filtering

Sometimes a join times out because certain keys have too much data: all rows with the same key are sent to the same reducer, which then runs out of memory. We should analyze such abnormal keys carefully. In many cases their data is abnormal and should be filtered out in the SQL statement; for example, when the key is null. For example:

insert overwrite table jointable
select n.* from (select * from nullidtable where id is not null) n
left join ori o on n.id = o.id;

Here the rows in nullidtable whose id is null are filtered out before the join. Note that null-key rows should be filtered only when they are invalid data; if they are valid data, this approach cannot be used.

3.2.2 Empty key conversion

Sometimes a key that is null has a lot of data that is nevertheless not abnormal and must be included in the join result. In this case we can assign a random value to the null-key fields of table a, so that those rows are distributed randomly and evenly across different reducers. For example:

insert overwrite table jointable
select n.* from nullidtable n
full join ori o
on case when n.id is null then concat('hive', rand()) else n.id end = o.id;

The case when n.id is null then concat('hive', rand()) else n.id end expression checks whether the id is null and, if so, replaces it with a random value; otherwise the original id is used.
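To see why the salting trick spreads the load, here is a minimal Python simulation (the helper name is hypothetical; Hive's actual partitioner is hash-based in the same spirit):

```python
import random

NUM_REDUCERS = 8

def reducer_for(key):
    # hash partitioning: rows with equal join keys land on the same reducer
    return hash(key) % NUM_REDUCERS

# 10,000 rows whose join key is NULL: unsalted, they all hit one reducer.
null_keys = [None] * 10_000
print(len({reducer_for(k) for k in null_keys}))   # 1 reducer does all the work

# Salting: concat('hive', rand()) gives each null row a unique throwaway key,
# so the rows spread evenly while still matching nothing on the other side.
salted = ['hive' + str(random.random()) for _ in null_keys]
print(len({reducer_for(k) for k in salted}))      # now many reducers share the work
```

This is a toy model, not Hive code, but it shows the mechanism: the salt changes the hash, and the hash decides the reducer.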

3.3 MapJoin

If MapJoin is not specified, or the conditions for MapJoin are not met, the Hive parser converts the join into a Common Join, i.e. the join is completed in the Reduce phase, where data skew easily occurs. MapJoin instead loads the whole small table into memory on the map side and performs the join there, so no reducer has to handle it.

We can specify a threshold: when the "small" table exceeds this size, a reduce-side join is used; when it is below it, a map join is used:

// (1) enable automatic MapJoin selection (default true)
set hive.auto.convert.join=true;
// (2) size threshold between small and large tables (below 25 MB counts as a small table)
set hive.mapjoin.smalltable.filesize=25000000;

3.4 Group by load balancing

By default, aggregation happens on the reduce side: in the map phase, all rows with the same key are distributed to one reducer, so a key with too much data skews the job. However, not all aggregation has to be completed on the reduce side; many aggregations can be partially performed on the map side, with the final result assembled on the reduce side.
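One way Hive mitigates this skew (the hive.groupby.skewindata option configured below) is a two-job plan: partial aggregation on randomly chosen reducers, then a final merge by key. A toy Python sketch of that plan (not Hive code; names are hypothetical):

```python
import random
from collections import Counter

def two_stage_group_count(keys, num_reducers=4):
    """Sketch of the two-MR-job plan: job 1 scatters rows randomly and partially
    aggregates on each reducer; job 2 merges the partial counts by the real key."""
    # Job 1: random distribution + partial aggregation; no single reducer
    # receives the whole hot key, so the skewed work is spread out
    partials = [Counter() for _ in range(num_reducers)]
    for key in keys:
        partials[random.randrange(num_reducers)][key] += 1
    # Job 2: shuffle the partial results by key and finish the aggregation
    final = Counter()
    for partial in partials:
        final.update(partial)
    return final

skewed = ['hot'] * 10_000 + ['cold'] * 5   # one key dominates
counts = two_stage_group_count(skewed)
print(counts['hot'], counts['cold'])       # 10000 5
```

The random first stage balances the load; the second stage restores correctness because partial counts for the same key always merge.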

// (1) enable map-side aggregation (default true)
set hive.map.aggr=true;
// (2) number of rows examined between map-side aggregation checks
set hive.groupby.mapaggr.checkinterval=100000;
// (3) load balancing when data is skewed (default false)
set hive.groupby.skewindata=true;

When this option is set to true, the generated query plan contains two MR jobs. In the first job, the map output is distributed randomly among the reducers; each reducer performs a partial aggregation and emits its result, so rows with the same Group By key may land on different reducers, which balances the load. The second job distributes the partial results by the Group By key (guaranteeing that equal keys reach the same reducer) and completes the final aggregation.

3.5 Deduplicated counting: group by first, then count

Normally, when we count the number of distinct rows, the query is written as:

select count(distinct id) from bigtable;

This approach has a serious drawback: because the deduplication is global, the MapReduce job cannot use multiple reducer tasks; if it did, each reducer would deduplicate only locally and global uniqueness could not be guaranteed. The single reducer therefore carries a very heavy load. The query can be optimized as follows:

select count(id) from (select id from bigtable group by id) a;

The first MapReduce job groups by id, which already removes the duplicates, and GROUP BY can use multiple reducer tasks, so the pressure on any single reducer is reduced. A second MapReduce job then counts the rows left after the group by. The query thus becomes two MapReduce jobs, so use this method only when the data volume is large; otherwise the extra job scheduling consumes more resources than it saves.
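The two-job pattern above can be sketched in Python (a toy model; the function name is hypothetical):

```python
def count_distinct_two_stage(ids, num_reducers=4):
    """Sketch of `select count(id) from (select id from bigtable group by id) a`:
    the inner GROUP BY deduplicates in parallel (equal ids always hash to the
    same reducer), then a second job just counts the surviving rows."""
    # Job 1: hash-partition by id so each reducer deduplicates its own share
    buckets = [set() for _ in range(num_reducers)]
    for i in ids:
        buckets[hash(i) % num_reducers].add(i)
    # Job 2: count the rows emitted by job 1
    return sum(len(b) for b in buckets)

print(count_distinct_two_stage([3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]))  # 7
```

Because equal ids always land in the same bucket, no value is counted twice, yet the deduplication work is split across all reducers instead of one.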

3.6 perform row and column filtering before join

Column filtering: avoid select *; specify only the fields you actually need to query.

Row filtering: in an outer join, if a table should be restricted to certain rows, filter before the join rather than after it; the joined result is larger than either input, so filtering it afterwards takes longer.

Filter after join:

select o.id from bigtable b
join ori o on o.id = b.id
where o.id <= 10;

Filter before join:

select b.id from bigtable b
join (select id from ori where id <= 10) o
on b.id = o.id;

3.7 Dynamic partition adjustment

(2) load data into the source partition table:

load data local inpath '/opt/module/datas/ds2' into table ori_partitioned partition (p_time='20111230000011');

(3) create the target partition table:

create table ori_partitioned_target
(id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string)
PARTITIONED BY (p_time STRING)
row format delimited fields terminated by '\t';

(4) set the dynamic partition parameters and perform the insert:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions=1000;
set hive.exec.max.dynamic.partitions.pernode=100;
set hive.exec.max.created.files=100000;
set hive.error.on.empty.partition=false;

hive (default)> insert overwrite table ori_partitioned_target partition (p_time)
select id, time, uid, keyword, url_rank, click_num, click_url, p_time from ori_partitioned;

4. Data skew

4.1 Set the number of maps reasonably

4.1.1 A large number of small files leads to a large number of maps

This problem was already mentioned for MapReduce: by default each file is sliced independently and yields at least one split, so a large number of small files inevitably produces a large number of map tasks. The same problem exists in Hive.
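A toy Python model of the slicing rule (one split per file minimum, without CombineHiveInputFormat; the helper name is hypothetical) shows why small files inflate the map count:

```python
import math

def num_map_tasks(file_sizes, split_size=128 * 1024 * 1024):
    """Each file is sliced independently and yields at least one split,
    so map count = sum over files of ceil(size / split_size), min 1 per file."""
    return sum(max(1, math.ceil(size / split_size)) for size in file_sizes)

print(num_map_tasks([256 * 1024 * 1024]))  # one 256 MB file  -> 2 maps
print(num_map_tasks([1024] * 1000))        # 1000 tiny files  -> 1000 maps
```

A thousand 1 KB files cost as many map tasks as 128 GB of well-packed data would, which is why merging small files before the map phase matters.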

Solution:

Merge small files before the map phase to reduce the number of maps: CombineHiveInputFormat (the system default) can merge small files; HiveInputFormat cannot.

set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

4.1.2 excessive workload for a single map

When every map runs very slowly, perhaps because the processing logic is complex, consider reducing the split size to increase the number of maps and shrink the workload of each one.

The way to increase the number of maps is to lower maxSize in the formula computeSplitSize(Math.max(minSize, Math.min(maxSize, blockSize))), which by default equals blockSize = 128 MB. Making maxSize smaller than blockSize increases the number of maps.
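The formula can be checked directly in Python (a transcription of Hadoop's computeSplitSize, with the name adapted):

```python
def compute_split_size(min_size, max_size, block_size):
    # Mirrors FileInputFormat.computeSplitSize:
    # Math.max(minSize, Math.min(maxSize, blockSize))
    return max(min_size, min(max_size, block_size))

BLOCK = 128 * 1024 * 1024                       # 128 MB HDFS block
print(compute_split_size(1, 2**63 - 1, BLOCK))  # defaults -> 134217728, one map per block
print(compute_split_size(1, 100, BLOCK))        # maxSize=100 -> 100-byte splits, many more maps
```

With the default huge maxSize, the split equals the block size; once maxSize drops below the block size, it becomes the split size, so each block spawns multiple maps.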

Set the maximum split size, for example to 100 bytes:

hive (default)> set mapreduce.input.fileinputformat.split.maxsize=100;

This is only an example; the right value depends on the specific situation.

4.2 Set the number of reduces reasonably

Adjustment method:

(1) amount of data processed by each reduce (default 256 MB): hive.exec.reducers.bytes.per.reducer=256000000
(2) maximum number of reduces per task (default 1009): hive.exec.reducers.max=1009
(3) formula for the number of reducers: N = min(parameter 2, total input size / parameter 1)
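The formula in (3) can be sketched in Python (the helper name is hypothetical; the defaults are the two parameters above):

```python
import math

def estimated_reducers(total_input_bytes,
                       bytes_per_reducer=256_000_000,  # hive.exec.reducers.bytes.per.reducer
                       max_reducers=1009):             # hive.exec.reducers.max
    # N = min(parameter 2, total input size / parameter 1)
    return min(max_reducers, math.ceil(total_input_bytes / bytes_per_reducer))

print(estimated_reducers(1_000_000_000))  # 1 GB input -> 4 reducers
print(estimated_reducers(10**15))         # huge input -> capped at 1009
```

Lowering bytes-per-reducer raises N (more parallelism, more output files); the cap keeps pathological inputs from requesting thousands of reducers.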

Pay attention to:

1) Starting and initializing too many reduces also consumes time and resources.

2) There will be as many output files as there are reduces; if many small files are produced and then used as input for the next job, that job inherits the small-file problem.

These two principles should also be considered when setting the number of reduce: using the appropriate number of reduce to deal with large amounts of data, and making the amount of data processed by a single reduce task appropriate.

5. Enable concurrent execution

Hive converts a query into one or more phases: MapReduce phases, sampling phases, merge phases, limit phases, or other phases required during execution. By default, Hive executes only one phase at a time. A particular job, however, may contain many phases that are not fully dependent on one another; some can run in parallel, which may shorten the overall execution time. The more phases that can run in parallel, the faster the job may complete.

Concurrent execution is turned on by setting the parameter hive.exec.parallel to true. In a shared cluster, however, note that when a job has many parallel phases, cluster utilization will increase.

set hive.exec.parallel=true;              // enable parallel execution of job phases
set hive.exec.parallel.thread.number=16;  // maximum parallelism allowed for one sql (default 8)

6. Enable strict mode

Hive provides a strict mode that prevents users from executing queries with potentially harmful effects. The property hive.mapred.mode defaults to the non-strict mode nonstrict; to enable strict mode, change it to strict, which disables three types of queries.

hive.mapred.mode = strict
The mode in which Hive operations are performed. In strict mode, some risky queries are not allowed to run. They include: Cartesian products; no partition being picked up for a query; comparing bigints and strings; comparing bigints and doubles; ORDER BY without LIMIT.

Restrict the execution of sql statements in the following three situations:

1) for partitioned tables, execution is not allowed unless the partition field filter condition is included in the where statement to limit the scope. In other words, the user is not allowed to scan all partitions. The reason for this restriction is that usually partitioned tables have very large datasets and the data is growing rapidly. Queries without partitioning restrictions may consume unacceptably large resources to process the table.

2) Queries that use ORDER BY must include a LIMIT clause. Because ORDER BY sends all result rows to a single reducer to perform the sort, forcing the user to add a LIMIT prevents that reducer from running for a very long time.

3) Cartesian-product queries are restricted. Users familiar with relational databases may expect to put join conditions in a WHERE clause instead of an ON clause, since a relational optimizer can efficiently convert the WHERE predicate into an ON condition. Unfortunately, Hive does not perform this optimization, so if the tables are large enough the query can get out of control.

7. Enable JVM reuse

JVM reuse is a Hadoop tuning parameter that has a large impact on Hive performance, especially in scenarios with many small files or a very large number of tasks, most of which have short execution times.

Hadoop's default configuration usually spawns a new JVM to run each map and reduce task, and JVM startup can be considerable overhead, especially when the job contains hundreds of tasks. JVM reuse lets JVM instances be reused up to N times within the same job. N is configured in Hadoop's mapred-site.xml file; it is usually between 10 and 20 and should be tuned by testing against the specific workload.

mapreduce.job.jvm.numtasks = 10
How many tasks to run per JVM. If set to -1, there is no limit.

This feature also has a drawback: with JVM reuse enabled, each occupied task slot is held for reuse until the task, and ultimately the entire job, is finished; none of the occupied JVMs are released before then. If a few reduce tasks in an "unbalanced" job run much longer than the others, the slots reserved by that job sit idle, unusable by other jobs, until every task completes.

8. Speculative execution

For speculative execution in MapReduce itself, see the speculative-execution material in the MapReduce notes; it is not repeated here.

Hive itself provides a configuration item to control speculative execution of reduce-side tasks:

hive.mapred.reduce.tasks.speculative.execution = true
Whether speculative execution for reducers should be turned on.

It is difficult to give specific advice for tuning these speculative-execution settings. If users are very sensitive to runtime deviations, the features can be turned off. And if a map or reduce task already runs for a long time simply because its input is large, the waste caused by speculative execution can be enormous.

9. Enable compression

See the compression material in "hive -- fundamentals". The main gains come from reducing the amount of data transferred between map and reduce, as well as shrinking the reduce output files.

10. View the execution plan

When performing a sql task, you can use explain to view the expected process of execution to see if there are any points that can be optimized.

(1) view the execution plan of a statement:

hive (default)> explain select * from emp;

(2) view the detailed execution plan:

hive (default)> explain extended select * from emp;

That is all of the article "what are the tuning strategies in the use of hive". Thank you for reading! I hope the content was helpful; for more related knowledge, welcome to follow the industry information channel!
