
Summary of Hive tuning


I. View the execution plan

explain extended <hql>; shows the HDFS paths of the data being scanned
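For example, a minimal sketch (the table name logs is hypothetical):

explain extended select dt, count(1) from logs group by dt;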

II. Hive table optimization

Partitioning (each partition is a separate HDFS folder):

Enable dynamic partitioning:

set hive.exec.dynamic.partition=true;

set hive.exec.dynamic.partition.mode=nonstrict;

Default value: strict

Description: strict prevents all partition fields from being dynamic; at least one partition field must be given a static value. Avoid producing a huge number of partitions.
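A minimal sketch of a dynamic-partition insert, assuming hypothetical tables logs_raw and logs_part:

set hive.exec.dynamic.partition=true;

set hive.exec.dynamic.partition.mode=nonstrict;

create table logs_part (uid int, action string) partitioned by (dt string);

insert overwrite table logs_part partition (dt) select uid, action, dt from logs_raw;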

Bucketing (each bucket is a separate file):

set hive.enforce.bucketing=true;

set hive.enforce.sorting=true; enforces sorting when data is inserted into the table. Default is false.
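A minimal sketch of a bucketed, sorted table (the table names are hypothetical):

set hive.enforce.bucketing=true;

set hive.enforce.sorting=true;

create table user_bucketed (id int, name string) clustered by (id) sorted by (id) into 32 buckets;

insert overwrite table user_bucketed select id, name from user_raw;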

III. Hive SQL optimization

Group-by data skew optimization:

set hive.groupby.skewindata=true; (splits the work into multiple jobs)

1. Join optimization

(1) Data skew

set hive.optimize.skewjoin=true;

If the join is skewed, set this to true.

set hive.skewjoin.key=100000;

If the number of records for a single join key exceeds this value, the skew optimization is triggered.

In short, one job becomes two jobs to execute the HQL.
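A minimal sketch combining both settings with an ordinary join (the tables clicks and users are hypothetical):

set hive.optimize.skewjoin=true;

set hive.skewjoin.key=100000;

select c.url, u.name from clicks c join users u on c.uid = u.id;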

(2) mapjoin (execute join on the map side)

Option 1: automatic (Hive decides)

set hive.auto.convert.join=true;

The default for hive.mapjoin.smalltable.filesize is 25 MB.

If the small table is under 25 MB, Hive automatically converts the join to a mapjoin.

Option 2: manual (hint)

select /*+ mapjoin(A) */ f.a, f.b from A t join B f on (f.a = t.a);

Mapjoin supports non-equi join conditions.

Reduce-side join does not support non-equality predicates in the ON clause.
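A hedged sketch of a non-equi join with the mapjoin hint (the tables visits and ip_ranges are hypothetical, and ip_ranges must be small enough to fit in memory):

select /*+ mapjoin(r) */ v.ip, r.region from visits v cross join ip_ranges r where v.ip_num between r.start_num and r.end_num;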

(3) Bucket join (data access can be narrowed to the bucket level)

Conditions of use: 1. The two tables are bucketed on the same column(s).

2. The bucket count of one table is a multiple of the other's.

Example:

create table order (cid int, price float) clustered by (cid) into 32 buckets;

create table customer (id int, first string) clustered by (id) into 64 buckets; (32 or 64 buckets both satisfy the multiple condition)

select price from order t join customer s on t.cid = s.id;

(4) WHERE condition optimization

Before optimization (a relational database would do this rewrite automatically; Hive may not):

select m.cid, u.id from order m join customer u on m.cid = u.id where m.dt = '2013-12-12';

After optimization (the where condition is applied on the map side, before the join, rather than on the reduce side):

select m.cid, u.id from (select * from order where dt = '2013-12-12') m join customer u on m.cid = u.id;

(5) group by optimization

set hive.groupby.skewindata=true;

If the group by is skewed, set this to true.

set hive.groupby.mapaggr.checkinterval=100000;

If the number of records for a group key exceeds this value, the optimization is applied.

Again, one job becomes two jobs: the first spreads keys randomly across reducers and pre-aggregates; the second completes the final aggregation by key.

(6) count distinct optimization

Before optimization (there is only one reducer, which carries both the deduplication and the count, so its load is heavy):

select count(distinct id) from tablename;

After optimization (two jobs are started: one runs the subquery, which can use multiple reducers, and the other does the count(1)):

select count(1) from (select distinct id from tablename) tmp;

select count(1) from (select id from tablename group by id) tmp;

set mapred.reduce.tasks=3;

(7) Multiple count(distinct) optimization

Before optimization:

select a, sum(b), count(distinct c), count(distinct d) from test group by a;

After optimization:

select a, sum(b) as b, count(c) as c, count(d) as d

from (

select a, 0 as b, c, null as d from test group by a, c

union all

select a, 0 as b, null as c, d from test group by a, d

union all

select a, b, null as c, null as d from test

) tmp group by a;

IV. Hive job optimization

1. Parallel execution

By default, Hive runs jobs sequentially. An HQL statement is split into multiple jobs; jobs with no dependency on each other can be executed in parallel.

set hive.exec.parallel=true;

set hive.exec.parallel.thread.number=8;

This controls the maximum number of jobs that can run simultaneously for the same SQL statement. It defaults to 8, meaning at most 8 jobs run at the same time.
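A minimal sketch (the table names are hypothetical); the two subqueries have no dependency on each other, so their jobs can run in parallel:

set hive.exec.parallel=true;

set hive.exec.parallel.thread.number=8;

select u.dt, u.cnt from (select dt, count(1) as cnt from page_views group by dt union all select dt, count(1) as cnt from ad_clicks group by dt) u;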

2. Local execution (on the node where the data is stored)

set hive.exec.mode.local.auto=true; (see the combined sketch after the conditions below)

Local execution requires all of the following conditions:

(1) The job's input size must be smaller than the parameter:

hive.exec.mode.local.auto.inputbytes.max (default 128MB)

(2) The job's number of maps must be smaller than the parameter:

hive.exec.mode.local.auto.tasks.max (default 4); with too many maps there would not be enough local slots

(3) The job's number of reduces must be 0 or 1
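A combined sketch, assuming a hypothetical table small_table that satisfies all three conditions:

set hive.exec.mode.local.auto=true;

select count(1) from small_table; runs locally when input < 128MB, maps <= 4, and reduces <= 1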

3. Merge small input files

set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

Multiple splits are combined into one; the number of merged splits is determined by the mapred.max.split.size limit.
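For example (a hedged sketch; 256000000 is an illustrative value of roughly 256 MB):

set mapred.max.split.size=256000000; caps each combined split at about 256 MB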

4. Merge small output files (prepares cleaner input for subsequent jobs)

set hive.merge.smallfiles.avgsize=256000000; when the average output file size is below this value, start an extra job to merge the files

set hive.merge.size.per.task=64000000; target size of each file after the merge

5. JVM reuse

set mapred.job.reuse.jvm.num.tasks=20;

the number of tasks each JVM runs

Note that with JVM reuse, a job holds its task slots until the whole job ends.

6. Data compression (across multiple jobs)

(1) Intermediate compression applies to the data passed between the jobs of a Hive query. For intermediate compression, prefer a codec that saves CPU time, such as Snappy.

set hive.exec.compress.intermediate=true;

set hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

set hive.intermediate.compression.type=BLOCK; compress by block rather than by record

(2) Final output compression (choose a codec with a high compression ratio to save storage space)

set hive.exec.compress.output=true;

set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

set mapred.output.compression.type=BLOCK; compress by block rather than by record

V. Hive Map optimization

1. set mapred.map.tasks=10; (setting this directly does not always take effect)

(1) Default number of maps

default_num = total_size / block_size

(2) Expected number (set manually)

goal_num = mapred.map.tasks

(3) Split size to be processed (the number of maps computed from the file split size)

split_size = max(block_size, mapred.min.split.size)

split_num = total_size / split_size

(4) Final (actual) number of maps

compute_map_num = min(split_num, max(default_num, goal_num))

Summary:

(1) To increase the number of maps, set mapred.map.tasks to a larger value.

(2) To reduce the number of maps, set mapred.min.split.size to a larger value. A worked example follows.
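A worked example under assumed numbers (total_size = 10240 MB, block_size = 128 MB, mapred.min.split.size = 256 MB, mapred.map.tasks = 100):

default_num = 10240 / 128 = 80

goal_num = 100

split_size = max(128 MB, 256 MB) = 256 MB

split_num = 10240 / 256 = 40

compute_map_num = min(40, max(80, 100)) = 40

Raising mapred.min.split.size cut the map count from 80 to 40, as rule (2) predicts.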

2. Map-side aggregation

set hive.map.aggr=true; equivalent to running a combiner on the map side
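A minimal sketch (the table name page_views is hypothetical); each mapper pre-aggregates partial counts before the shuffle, just as a combiner would:

set hive.map.aggr=true;

select uid, count(1) from page_views group by uid;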

3. Speculative execution (default is true)

mapred.map.tasks.speculative.execution

VI. Hive Shuffle optimization

Map end:

io.sort.mb (size of the map-side sort buffer)

io.sort.spill.percent (buffer usage threshold that triggers a spill to disk)

min.num.spill.for.combine (minimum number of spill files before the combiner runs during the merge)

io.sort.factor (number of spill files merged at once)

io.sort.record.percent (fraction of the buffer reserved for record metadata)

Reduce end:

mapred.reduce.parallel.copies (number of parallel threads fetching map output)

mapred.reduce.copy.backoff (maximum back-off before a failed fetch is retried)

io.sort.factor (merge factor on the reduce side)

mapred.job.shuffle.input.buffer.percent (fraction of the reduce heap used to buffer shuffle input)

VII. Hive Reduce optimization

1. Speculative execution (default is true)

mapred.reduce.tasks.speculative.execution (in Hadoop)

hive.mapred.reduce.tasks.speculative.execution (Hive's equivalent parameter; same effect as the Hadoop one)

Either of them will do.

2. Reduce optimization (setting the number of reducers)

set mapred.reduce.tasks=10; sets the number directly

Maximum number of reducers:

hive.exec.reducers.max, default: 999

Data size handled by each reducer:

hive.exec.reducers.bytes.per.reducer, default: 1G

Calculation formula (this is what Hive requests; the actual number that runs may be lower):

numRTasks = min(maxReducers, input.size / perReducer)

maxReducers = hive.exec.reducers.max

perReducer = hive.exec.reducers.bytes.per.reducer
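A worked example under assumed numbers (input.size = 10240 MB, perReducer = 1024 MB, maxReducers = 999):

numRTasks = min(999, 10240 / 1024) = min(999, 10) = 10

so Hive requests 10 reducers, unless mapred.reduce.tasks overrides it.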

VIII. Queue

set mapred.queue.name=queue3; selects queue queue3

set mapred.job.queue.name=queue3; submits the job to queue3

set mapred.job.priority=HIGH;

Queue reference article:

http://yaoyinjie.blog.51cto.com/3189782/872294
