What are the methods of Hive optimization

2025-04-26 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)05/31 Report--

This article outlines the main methods of Hive optimization: overall architecture, the MR stage, jobs, SQL, and the platform.

I. Overall architecture optimization

In Hive's current overall architecture, the computing engine is not limited to Map/Reduce; Tez, Spark, and others are also supported. Different resource schedulers and storage systems can be used depending on the computing engine.

Overall architectural optimization points:

1. Partition tables by date according to business requirements, and use dynamic partitioning by type.

Related parameter settings:

hive.exec.dynamic.partition=true (the default in 0.14)
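To make the idea concrete, the sketch below mimics how dynamic partitioning derives the target partition from each row's own column values and lays rows out under key=value directories. It is a plain-Python illustration, not Hive code; the column names dt and type and the sample rows are assumptions made for the example.

```python
from collections import defaultdict


def route_to_partitions(rows, partition_cols):
    """Group rows by a Hive-style partition path built from their own values.

    With dynamic partitioning the partition is not named in the statement;
    it is derived per row from the partition columns.
    """
    partitions = defaultdict(list)
    for row in rows:
        # Partitions are laid out as nested key=value directories.
        path = "/".join(f"{col}={row[col]}" for col in partition_cols)
        # Partition columns live in the path, not in the stored data.
        data = {k: v for k, v in row.items() if k not in partition_cols}
        partitions[path].append(data)
    return dict(partitions)


# Hypothetical sample rows; "dt" and "type" are assumed column names.
rows = [
    {"user": "a", "dt": "2025-04-25", "type": "click"},
    {"user": "b", "dt": "2025-04-25", "type": "view"},
    {"user": "c", "dt": "2025-04-26", "type": "click"},
    {"user": "d", "dt": "2025-04-25", "type": "click"},
]

layout = route_to_partitions(rows, ["dt", "type"])
for path in sorted(layout):
    print(path, len(layout[path]))
```

A query that filters on dt then only has to read the matching directories, which is what makes date partitioning pay off.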

2. To reduce disk storage space and the number of I/O operations, compress the data.

Related parameter settings:

The job output files are compressed with Gzip at the BLOCK level.

mapreduce.output.fileoutputformat.compress=true

mapreduce.output.fileoutputformat.compress.type=BLOCK

mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec

The map output is also compressed with Gzip.

mapreduce.map.output.compress=true

mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.GzipCodec

Compress both the Hive output and the intermediate results.

hive.exec.compress.output=true

hive.exec.compress.intermediate=true
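The effect of compression is easy to get a feel for outside Hive. The sketch below uses Python's standard gzip module on made-up tab-separated rows and compares the raw and compressed sizes. The rows are invented for illustration, and real ratios depend entirely on the data; the example also says nothing about the extra CPU time that compression costs.

```python
import gzip

# Hypothetical, repetitive log-like rows standing in for table data.
rows = [f"2025-04-26\tuser{i % 50}\tclick\t/home\n" for i in range(10_000)]
raw = "".join(rows).encode("utf-8")

compressed = gzip.compress(raw)
restored = gzip.decompress(compressed)  # the round trip must be lossless

ratio = len(compressed) / len(raw)
print(f"raw={len(raw)} bytes, gzip={len(compressed)} bytes, ratio={ratio:.3f}")
```

Fewer bytes on disk and on the wire is where the saving in storage and I/O comes from.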

3. Save Hive intermediate tables as SequenceFile, which saves serialization and deserialization time.

Related parameter settings:

hive.query.result.fileformat=SequenceFile

4. YARN optimization is not covered here and will be discussed separately.

II. MR stage optimization

(Figure not reproduced: the Hive operators.)

(Figure not reproduced: the execution process.)

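How the number of reduce tasks is determined:

Related parameter settings, shown with the defaults used before Hive 0.14 (0.14 and later default to 1009 and 256 MB):

hive.exec.reducers.max=999

hive.exec.reducers.bytes.per.reducer=1000000000 (about 1 GB)

Number of reduce tasks = min(hive.exec.reducers.max, input size / hive.exec.reducers.bytes.per.reducer). Adjust the number of reducers according to actual demand.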
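The formula is simple enough to check by hand, and a small sketch makes the effect of each parameter visible. The function below is an illustration, not Hive's code; rounding up and the floor of one reducer are assumptions added so that small inputs still get a reducer.

```python
import math


def estimate_reducers(input_bytes,
                      bytes_per_reducer=1_000_000_000,
                      max_reducers=999):
    """Estimate reducers as min(max_reducers, input / bytes_per_reducer)."""
    # Assumption: round up so a partial chunk still gets a reducer,
    # and never go below one reducer.
    wanted = max(1, math.ceil(input_bytes / bytes_per_reducer))
    return min(max_reducers, wanted)


print(estimate_reducers(5 * 10**9))    # 5 GB of input
print(estimate_reducers(2 * 10**12))   # 2 TB of input, limited by the cap
print(estimate_reducers(5 * 10**9, bytes_per_reducer=256_000_000))
```

Lowering bytes_per_reducer raises parallelism for the same input, while max_reducers bounds how far that can go.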

III. JOB optimization

1. Local execution

Local execution mode is off by default; for small data sets, enabling it speeds up execution.

Related parameter settings:

hive.exec.mode.local.auto=true

By default, a job runs locally only if its input is at most hive.exec.mode.local.auto.inputbytes.max=128MB, it has at most hive.exec.mode.local.auto.tasks.max=4 tasks, and it needs at most 1 reduce task. Performance testing:

Amount of data (10,000 rows)   Operation   Normal execution time (s)   Local execution time (s)
170                            group by    36                          16
80                             count       34                          6
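The three default limits can be read as a single yes/no check. The sketch below is a hypothetical helper, not part of Hive; it assumes the limits are inclusive, and the argument names are made up for the example.

```python
def can_run_locally(input_bytes, num_tasks, num_reducers,
                    max_input_bytes=128 * 1024 * 1024,
                    max_tasks=4):
    """Return True when a job fits the local-mode limits described above."""
    # Assumption: all three limits are treated as inclusive upper bounds.
    return (input_bytes <= max_input_bytes
            and num_tasks <= max_tasks
            and num_reducers <= 1)


MB = 1024 * 1024
print(can_run_locally(20 * MB, num_tasks=2, num_reducers=1))    # small job
print(can_run_locally(500 * MB, num_tasks=2, num_reducers=1))   # too much input
print(can_run_locally(20 * MB, num_tasks=2, num_reducers=3))    # too many reducers
```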

2. mapjoin

Mapjoin is on by default, with hive.auto.convert.join.noconditionaltask.size=10MB.

The table loaded into memory must come directly from a table scan (with no GROUP BY or other operations applied to it). If both tables in the join meet this condition, the table named in a /*+ MAPJOIN */ hint is ignored and only the smaller table is loaded into memory; otherwise, the scanned table that meets the condition is selected.
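The mechanics of a map join can be sketched in a few lines: the small table becomes an in-memory hash table, and the large table is streamed past it, so no shuffle by join key is needed. This is a simplified, single-process illustration; the table and column names (users, orders, uid) are assumptions for the example.

```python
def map_join(big_rows, small_rows, key):
    """Inner-join two lists of dicts by building a hash table on the small side."""
    # Build phase: the small table is loaded fully into memory.
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)

    # Probe phase: the big table is streamed row by row against the lookup.
    for big in big_rows:
        for small in lookup.get(big[key], []):
            yield {**small, **big}


# Hypothetical tables: "orders" is the big side, "users" the small side.
users = [{"uid": 1, "name": "ann"}, {"uid": 2, "name": "bob"}]
orders = [
    {"uid": 1, "amount": 10},
    {"uid": 2, "amount": 25},
    {"uid": 1, "amount": 7},
    {"uid": 3, "amount": 99},   # no matching user, dropped by the inner join
]

joined = list(map_join(orders, users, "uid"))
for row in joined:
    print(row)
```

The memory cost is that of the small table only, which is why the size threshold above matters.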

IV. SQL optimization

The overall optimization strategy is as follows:

Remove unneeded columns from the query.

Apply WHERE predicates and similar filters at the TableScan stage.

Use partition information to read only the qualifying partitions.

Use map-side joins: the large table drives the join, and the small tables are loaded into the memory of every mapper.

Adjust the join order so that the large table is the driving table.

For GROUP BY on data with an uneven key distribution, split the work into two map-reduce stages so that the data does not concentrate on a few reducers. In the first stage, rows are shuffled by the DISTINCT columns and partially aggregated on the reduce side to reduce the data size; in the second stage, the data is aggregated by the GROUP BY columns.

Perform hash-based partial aggregation on the map side to reduce the amount of data the reduce side has to process.
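The last two points can be illustrated together. In the sketch below, a hash-based partial count stands in for partial aggregation, and a two-stage count shows how spreading rows out before grouping keeps one hot key from landing on a single reducer. Round-robin distribution is an assumption used here as a stand-in for the first-stage shuffle, and the skewed key list is invented. In Hive, these behaviours are associated with the hive.map.aggr and hive.groupby.skewindata settings, which are mentioned only as a pointer.

```python
from collections import Counter


def partial_counts(rows):
    """Hash-based partial aggregation: count keys within one task's rows."""
    return Counter(rows)


def two_stage_count(keys, num_reducers):
    """Count keys in two stages so a hot key is not handled by one reducer."""
    # Stage 1: spread rows evenly, ignoring the group key (round-robin is an
    # assumption for this sketch), then pre-aggregate each bucket.
    buckets = [keys[i::num_reducers] for i in range(num_reducers)]
    stage1 = [partial_counts(bucket) for bucket in buckets]

    # Stage 2: merge the much smaller partial results by group key.
    final = Counter()
    for partial in stage1:
        final.update(partial)
    return final, [len(b) for b in buckets]


# Hypothetical skewed input: one hot key dominates.
keys = ["hot"] * 9_000 + ["a"] * 500 + ["b"] * 500
result, stage1_loads = two_stage_count(keys, num_reducers=4)

print(dict(result))
print("rows per stage-1 reducer:", stage1_loads)
```

Grouping directly by key would send all 9,000 "hot" rows to one reducer; here each first-stage reducer sees an equal share, and the second stage only merges a handful of partial counts.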

V. Platform optimization

1. Hive on Tez

2. Spark SQL, the general trend
