
How to Optimize Query Efficiency in Hive


This article explains in detail how to optimize query efficiency in Hive. The editor finds it very practical and shares it here for your reference; I hope you gain something from reading it.

1. Enable FetchTask

A simple query statement is one that involves no functions, sorting, or similar operations. When the FetchTask feature is enabled, a simple query statement does not generate a MapReduce job; instead, FetchTask reads the output data directly from the HDFS file system, which improves efficiency.

How to set it:

hive.fetch.task.conversion defaults to minimal.

Modify the configuration file hive-site.xml:

<property>
  <name>hive.fetch.task.conversion</name>
  <value>more</value>
  <description>
    Some select queries can be converted to single FETCH task minimizing latency.
    Currently the query should be single sourced not having any subquery and
    should not have any aggregations or distincts (which incurs RS), lateral
    views and joins.
    1. minimal : SELECT STAR, FILTER on partition columns, LIMIT only
    2. more    : SELECT, FILTER, LIMIT only (+TABLESAMPLE, virtual columns)
  </description>
</property>

Or modify it for the current session:

hive> set hive.fetch.task.conversion=more;

Executing select id, money from m limit 10; then no longer launches a MapReduce job.
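To confirm the behavior, you can compare query plans; a minimal sketch, reusing the table m with columns id and money from the example above:

hive> set hive.fetch.task.conversion=more;

hive> explain select id, money from m limit 10;

With more, the plan contains only a Fetch stage; with minimal, the same statement still compiles to a MapReduce stage, because it projects specific columns rather than select *.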

2. Merge intermediate tables

In a log file, each line of record contains many fields; 40 or 50 fields is normal. In actual analysis only a few fields are used, so the fields to be analyzed are extracted according to business requirements and loaded into corresponding business tables (sub-tables); the actual business analysis then runs against those business tables.

In practice, we find that some business processing shares a common data set: a user table, an order table, and a commodity table that must be joined. The join produces a result set, and many business analyses run against this join result set.

Optimization: extract the intermediate result set shared by many businesses into a single Hive table, as sketched below.
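A minimal sketch of such an intermediate table, assuming hypothetical tables user_info, order_info, and product_info (none of these names come from the original):

create table tmp_user_order_product as
select u.id as user_id, u.name, o.id as order_id, o.amount, p.id as product_id, p.title
from user_info u
join order_info o on (o.user_id = u.id)
join product_info p on (o.product_id = p.id);

Downstream analyses then query tmp_user_order_product directly instead of repeating the three-way join.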

3. Rational use of partition tables

Combine external tables with partition tables, using multi-level partitions where appropriate, and choose a suitable storage format (textfile, orcfile, parquet) and data compression (snappy).

We generally partition detail data by day, and for very large tables we can add sub-partitions. Each partition corresponds to a directory on HDFS. For storage we can use parquet columnar format, which also compresses well and saves a lot of table-scan and deserialization time. In OLAP query scenarios, select only the columns you need instead of running select * over all fields.
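A minimal sketch of such a table, with a hypothetical name and columns (the parquet.compression table property is a standard way to request snappy compression for parquet):

create external table dw_access_log (
  user_id bigint,
  url string,
  status int
)
partitioned by (dt string)
stored as parquet
tblproperties ('parquet.compression'='SNAPPY');

-- select only the needed columns, with a partition filter:
select user_id, url from dw_access_log where dt = '20180808';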

4. JVM reuse

JVM reuse is a Hadoop tuning parameter that has a great impact on Hive performance, especially in scenarios where small files are hard to avoid or where there are very many tasks, most of them short-running. By default, Hadoop spawns a new JVM to execute each map or reduce task, and JVM startup causes considerable overhead, especially when a job contains thousands of tasks. JVM reuse lets a JVM instance be reused up to N times within the same job; the value of N can be set in Hadoop's mapred-site.xml file.

<property>
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <value>1</value>
</property>

You can also set it in a Hive session:

set mapred.job.reuse.jvm.num.tasks=10;

One disadvantage of JVM reuse is that it occupies the reused task slots until the job completes. If a few reduce tasks in an "unbalanced" job take much longer than the other reduce tasks, the slots they hold stay idle and cannot be used by other jobs until every task has finished.

5. Speculative execution

Speculative execution means that while tasks are running, the JobTracker tracks the average progress of all of them. If the node a task runs on has a relatively low-spec machine or a very high CPU load (there can be many reasons), so that the task runs noticeably slower than the overall average, the JobTracker launches a duplicate of that task; whichever of the original and the duplicate finishes first is kept, and the other is killed.

Speculative execution requires two job parameters to be set:

mapred.map.tasks.speculative.execution=true

mapred.reduce.tasks.speculative.execution=true

6. Set the number of reducers reasonably

The number of reducers is controlled by two parameters:

Parameter 1: hive.exec.reducers.bytes.per.reducer=256000000 (the amount of data processed by each reduce task)

Parameter 2: hive.exec.reducers.max=1009 (the maximum number of reducers per job)

Calculation formula: number of reducers = min(parameter 2, total input data size / parameter 1)
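As a worked example with the defaults above: a job reading roughly 10 GB of input gets min(1009, ceil(10,737,418,240 / 256,000,000)) = min(1009, 42) = 42 reducers.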

set mapred.reduce.tasks = N:

This is the default number of reduce tasks per job, typically set close to the number of available reduce slots (for example 0.99 times the cluster's reduce slots). The default is -1, meaning the number of reducers is determined automatically.

More reducers is not always better.

Like map tasks, starting and initializing reduce tasks costs time and resources. In addition, there are as many output files as there are reducers, and if those become the input of the next task, too many small files result. Too many small files greatly hurt query efficiency: more files mean more IO and more pressure on the NameNode. In production, avoid the small-file problem; if a check turns up small files, merge them promptly.
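One common way to let Hive merge small output files automatically is its merge settings; a hedged sketch (these are standard Hive parameters, not mentioned in the original text):

set hive.merge.mapfiles=true;            -- merge small files produced by map-only jobs
set hive.merge.mapredfiles=true;         -- merge small files produced by map-reduce jobs
set hive.merge.size.per.task=256000000;  -- target size, in bytes, of the merged files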

7. Enable parallel execution

Parallel execution means running multiple stages of a Hive query simultaneously. During execution Hive converts a query into one or more stages, and a particular job may contain many stages that are not all interdependent; independent stages can run in parallel, which may shorten the execution time of the whole job.

hive.exec.parallel.thread.number=8 (how many stages of a job may execute in parallel; one SQL statement may produce many MapReduce tasks, and this limits how many run at once)

hive.exec.parallel=false (parallel execution is off by default)

Enable it in a Hive session:

set hive.exec.parallel=true;
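A minimal sketch, assuming a hypothetical table logs with columns uid and dt; the two subqueries compile into independent stages that can run at the same time once parallel execution is on:

set hive.exec.parallel=true;

set hive.exec.parallel.thread.number=8;

select t1.dt, t1.uv, t2.pv
from (select dt, count(distinct uid) as uv from logs group by dt) t1
join (select dt, count(1) as pv from logs group by dt) t2
on (t1.dt = t2.dt);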

8. Optimize the SQL where condition

Before optimization (a relational database would optimize this automatically, so it needs no thought there):

select m.cid, u.id from order m join customer u on (m.cid = u.id) where m.dt = '20180808';

After optimization (the where condition is executed on the map side rather than on the reduce side):

select m.cid, u.id from (select * from order where dt = '20180808') m join customer u on (m.cid = u.id);

Union optimization

Try not to use union (union removes duplicate records); use union all instead and then remove duplicates with group by.
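A minimal sketch of the rewrite, with hypothetical tables t1 and t2:

-- instead of union, which deduplicates:
select id from t1 union select id from t2;

-- use union all and deduplicate with group by:
select id from (select id from t1 union all select id from t2) tmp group by id;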

Count distinct optimization

Do not use count(distinct column); use a subquery instead.

select count(1) from (select id from tablename group by id) tmp;
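For contrast, the form being replaced is select count(distinct id) from tablename; it funnels all values through a single reducer, while the group by rewrite spreads the deduplication across many reducers before the final count.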

Replace join with in

If you need to constrain one table based on the fields of another, use in instead of join wherever possible.

select id, name from tb1 a join tb2 b on (a.id = b.id);

select id, name from tb1 where id in (select id from tb2);

In is faster than join.

Eliminating group by, count(distinct), max, and min inside subqueries can reduce the number of jobs.

Join Optimization:

Common/shuffle/reduce join: the join happens in the reduce phase; suitable for joining a large table to another large table (the default).

Map join: the join happens in the map phase; suitable for joining a small table to a large table. The large table's data is read from files, while the small table's data is held in memory (Hive optimizes this automatically: it detects the small table and caches it).

set hive.auto.convert.join=true;
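You can also request a map join explicitly with a hint; a minimal sketch with hypothetical tables big_fact and small_dim:

select /*+ MAPJOIN(d) */ f.uid, d.name
from big_fact f join small_dim d on (f.dim_id = d.id);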

SMB join (Sort-Merge-Bucket join): optimizes joins between two large tables by using bucketed tables; matching buckets are joined directly against each other (both tables must be bucketed on the join key).

set hive.auto.convert.sortmerge.join=true;

set hive.optimize.bucketmapjoin=true;

set hive.optimize.bucketmapjoin.sortedmerge=true;

set hive.auto.convert.sortmerge.join.noconditionaltask=true;
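A minimal sketch of preparing two hypothetical tables for an SMB join; both are bucketed and sorted on the join key with the same bucket count (set hive.enforce.bucketing=true while loading so the data is actually bucketed):

create table users_bkt (id bigint, name string)
clustered by (id) sorted by (id) into 32 buckets;

create table orders_bkt (user_id bigint, amount double)
clustered by (user_id) sorted by (user_id) into 32 buckets;

-- with the settings above, this large-table join can proceed bucket by bucket:
select o.user_id, u.name, o.amount
from orders_bkt o join users_bkt u on (o.user_id = u.id);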

This is the end of the article on "how to optimize query efficiency in Hive". I hope the above content is of some help and lets you learn more. If you think the article is good, please share it so more people can see it.
