How to implement parallel execution / strict mode / JVM reuse / speculative execution in Hive performance tuning

2025-04-06 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article explains how to use parallel execution, strict mode, JVM reuse, and speculative execution when tuning Hive performance. It is intended as a practical reference for readers who work with Hive.

Parallel execution

set hive.exec.parallel=true; -- enable parallel execution of jobs

set hive.exec.parallel.thread.number=16; -- maximum number of jobs of a single SQL statement that may run in parallel; the default is 8

Parallel execution only helps when the cluster has spare resources. If resources are already exhausted, the jobs cannot actually run side by side and there is no gain.
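As a rough illustration, a query of the following shape typically compiles into jobs that do not depend on each other (the two aggregation branches of the UNION ALL), which is the situation where these settings can help. The table names orders_2023 and orders_2024 and the column user_id are hypothetical, chosen only for this sketch.

```sql
-- Assumption: orders_2023 and orders_2024 are hypothetical tables
-- that both have a user_id column.
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=16;

-- The two inner aggregations are independent of each other, so Hive
-- may run them at the same time instead of one after the other.
SELECT t.user_id, SUM(t.order_cnt) AS order_cnt
FROM (
  SELECT user_id, COUNT(*) AS order_cnt FROM orders_2023 GROUP BY user_id
  UNION ALL
  SELECT user_id, COUNT(*) AS order_cnt FROM orders_2024 GROUP BY user_id
) t
GROUP BY t.user_id;
```

Prefixing the query with EXPLAIN prints its stage dependencies, which is one way to check whether a given query has independent stages at all.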

Strict mode

Hive provides a strict mode that prevents users from performing "high-risk" queries.

The hive.mapred.mode property defaults to nonstrict (non-strict mode). To enable strict mode, set hive.mapred.mode to strict. Strict mode disables three types of queries.

<property>
  <name>hive.mapred.mode</name>
  <value>strict</value>
  <description>
    The mode in which the Hive operations are being performed.
    In strict mode, some risky queries are not allowed to run. They include:
      Cartesian Product.
      No partition being picked up for a query.
      Comparing bigints and strings.
      Comparing bigints and doubles.
      Orderby without limit.
  </description>
</property>

For partitioned tables, a query is not allowed to run unless its WHERE clause contains a filter on the partition column that limits the range scanned; in other words, users may not scan all partitions. The reason is that partitioned tables usually hold very large datasets that grow quickly, and a query without a partition filter may consume an unacceptable amount of resources.

For queries that use ORDER BY, a LIMIT clause is required. To perform the sort, ORDER BY sends all result data to a single reducer, so forcing the user to add LIMIT prevents that reducer from running for a very long time.

Queries that produce a Cartesian product are restricted. Users familiar with relational databases may expect to write the join condition in the WHERE clause instead of the ON clause of a JOIN, trusting the optimizer to convert the WHERE condition into an ON condition efficiently. Unfortunately, Hive does not perform this optimization, so if the tables are large enough the query can get out of control.
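The three restrictions can be sketched as follows. This is only an illustration of the rules as described above: page_views (partitioned by dt) and users are hypothetical tables, and the commented-out statements are the forms that strict mode is expected to reject.

```sql
-- Assumption: page_views is a hypothetical table partitioned by dt;
-- users is a hypothetical unpartitioned table.
set hive.mapred.mode=strict;

-- 1. Partitioned table: expected to be rejected without a partition filter
-- SELECT user_id FROM page_views;
-- Accepted once the WHERE clause restricts the partition column.
SELECT user_id FROM page_views WHERE dt = '2025-04-06';

-- 2. ORDER BY: expected to be rejected without LIMIT
-- SELECT user_id, url FROM page_views WHERE dt = '2025-04-06' ORDER BY user_id;
-- Accepted with LIMIT, which bounds the work of the single reducer.
SELECT user_id, url
FROM page_views
WHERE dt = '2025-04-06'
ORDER BY user_id
LIMIT 100;

-- 3. Cartesian product: expected to be rejected when the join
--    condition appears only in WHERE
-- SELECT p.url, u.name
-- FROM page_views p JOIN users u
-- WHERE p.user_id = u.user_id AND p.dt = '2025-04-06';
-- Accepted when the join condition is written in the ON clause.
SELECT p.url, u.name
FROM page_views p
JOIN users u ON p.user_id = u.user_id
WHERE p.dt = '2025-04-06';
```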

JVM reuse

JVM reuse is a Hadoop tuning parameter that has a large effect on Hive performance, especially in scenarios where small files are hard to avoid or where a job has a very large number of tasks, most of which run only briefly.

By default, Hadoop usually forks a new JVM for each map or reduce task. JVM startup can then add considerable overhead, especially when a job contains hundreds of tasks. JVM reuse lets a JVM instance be reused up to N times within the same job. The value of N is configured in Hadoop's mapred-site.xml file. It is usually between 10 and 20; the right value should be determined by testing against the specific workload.

<property>
  <name>mapreduce.job.jvm.numtasks</name>
  <value>10</value>
  <description>How many tasks to run per jvm. If set to -1, there is
  no limit.
  </description>
</property>

We can also set this from within Hive:

set mapred.job.reuse.jvm.num.tasks=10;

This enables JVM reuse for the session. The feature has a drawback, however: with JVM reuse turned on, a task slot stays occupied so that it can be reused, and it is not released until the job completes. If a few reduce tasks in an "unbalanced" job take much longer than the others, the reserved slots sit idle and cannot be used by other jobs until all tasks have finished.
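One way to limit the slot-holding drawback is to turn JVM reuse on only around the job that benefits from it and then switch it back. A minimal sketch, assuming a hypothetical job that rewrites many small files from raw_events into events_compact (both table names, their columns, and the dt partition are invented for illustration), and assuming the cluster's MapReduce version honors this property:

```sql
-- Assumption: raw_events and events_compact are hypothetical tables,
-- both partitioned by dt, with the columns listed in the SELECT.

-- Reuse each task JVM for up to 10 tasks while the small-file job runs.
set mapred.job.reuse.jvm.num.tasks=10;

INSERT OVERWRITE TABLE events_compact PARTITION (dt = '2025-04-06')
SELECT user_id, event_type, event_time
FROM raw_events
WHERE dt = '2025-04-06';

-- Back to one task per JVM so later jobs in this session do not
-- hold on to slots longer than necessary.
set mapred.job.reuse.jvm.num.tasks=1;
```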

Speculative execution

In a distributed cluster, program bugs (including bugs in Hadoop itself), uneven load, or uneven resource distribution can cause the tasks of one job to run at different speeds, and some tasks may run significantly slower than the rest (for example, one task is only 50% complete while all the others have finished). Such tasks slow down the job as a whole. To prevent this, Hadoop uses a speculative execution mechanism: it identifies the "straggler" tasks according to certain rules and launches a backup task for each one, which processes the same data as the original task at the same time. The result of whichever copy finishes successfully first is used as the final result.

Hive can also enable speculative execution.

The parameters that enable speculative execution are configured in Hadoop's mapred-site.xml file:

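<property>
  <name>mapreduce.map.speculative</name>
  <value>true</value>
  <description>If true, then multiple instances of some map tasks
  may be executed in parallel.</description>
</property>

<property>
  <name>mapreduce.reduce.speculative</name>
  <value>true</value>
  <description>If true, then multiple instances of some reduce tasks
  may be executed in parallel.</description>
</property>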

However, Hive itself also provides a configuration item that controls speculative execution on the reduce side:

<property>
  <name>hive.mapred.reduce.tasks.speculative.execution</name>
  <value>true</value>
  <description>Whether speculative execution for reducers should be turned on.</description>
</property>
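The same three properties can also be set per session instead of cluster-wide. The following is a sketch of that usage, assuming the session is allowed to override these values; it uses only the property names listed above.

```sql
-- Enable speculative execution for map and reduce tasks in this session.
set mapreduce.map.speculative=true;
set mapreduce.reduce.speculative=true;

-- Hive's own switch for reduce-side speculative execution.
set hive.mapred.reduce.tasks.speculative.execution=true;

-- Typing a property name without a value prints its current setting,
-- which is a quick way to confirm what the session is using.
set hive.mapred.reduce.tasks.speculative.execution;
```

Because a backup task repeats work that is already in progress, it may be worth setting these to false for jobs whose tasks are expected to run for a long time anyway; whether that pays off depends on the workload and should be tested.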

Thank you for reading this article carefully. I hope the article "how to achieve parallel execution / strict mode / JVM reuse / speculative execution in Hive performance tuning" shared by the editor will be helpful to everyone. At the same time, I also hope that you will support and pay attention to the industry information channel. More related knowledge is waiting for you to learn!
