I. View the execution plan
explain extended <hql>; shows the HDFS paths of the data being scanned.
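For example, a minimal sketch using the tablename table from the count distinct section below (any query works):
explain extended
select id, count(1) from tablename group by id;
The output includes the full operator plan plus the HDFS locations of the input data.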
II. Hive table optimization
Partitioning (different folders):
Enable dynamic partitioning (a sketch follows below):
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
The default mode is strict, which prevents all partition columns from being dynamic: at least one partition column must be given a static value. This avoids producing a large number of partitions.
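A minimal sketch of a dynamic-partition insert; the tables logs_src (with a dt column) and logs_part (partitioned by dt) are hypothetical:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- every distinct dt value in logs_src becomes its own partition folder
insert overwrite table logs_part partition (dt)
select uid, url, dt from logs_src;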
Bucketing (different files):
set hive.enforce.bucketing=true;
set hive.enforce.sorting=true; enforces sorting when data is inserted into the table. Default is false. See the sketch below.
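A minimal sketch of a bucketed, sorted table; user_src and user_bucketed are hypothetical names:
set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;
create table user_bucketed (id int, name string)
clustered by (id) sorted by (id) into 32 buckets;
-- rows are hashed on id into 32 files, each sorted by id
insert overwrite table user_bucketed
select id, name from user_src;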
III. Hive SQL optimization
Group by data-skew optimization:
hive.groupby.skewindata=true; (splits the work into multiple jobs)
1. Join optimization
(1) Data skew
set hive.optimize.skewjoin=true;
Set this to true if the join is skewed.
set hive.skewjoin.key=100000;
If the number of records for a join key exceeds this value, the optimization kicks in.
In short, one job is split into two jobs to execute the HQL, as sketched below.
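Putting the two settings together; the table names reuse the bucket-join example later in this section:
set hive.optimize.skewjoin=true;
set hive.skewjoin.key=100000;
-- keys with more than 100000 rows are set aside and joined in a follow-up map join job
select t.price from order t join customer s on t.cid = s.id;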
(2) Map join (perform the join on the map side)
Trigger mode 1 (automatic):
set hive.auto.convert.join=true;
hive.mapjoin.smalltable.filesize defaults to 25 MB.
A map join starts automatically when the small table is under 25 MB.
Trigger mode 2 (manual):
select /*+ mapjoin(A) */ f.a, f.b from A t join B f on (f.a = t.a);
A map join supports non-equi join conditions.
A reduce join does not support non-equi conditions in the ON clause.
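A sketch of the automatic path, assuming small_table's data files are under the threshold (table names are illustrative):
set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=25000000;
-- small_table is loaded into memory and the join runs entirely in the map phase
select f.a, f.b from big_table f join small_table t on f.a = t.a;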
(3) Bucket join (data access accurate to the bucket level)
Conditions of use: 1. The two tables are bucketed on the same column.
2. The bucket count of one table is a multiple of the other's.
Example:
create table order (cid int, price float) clustered by (cid) into 32 buckets;
create table customer (id int, first string) clustered by (id) into 64 buckets; a count of 32 would also satisfy the multiple rule
select price from order t join customer s on t.cid = s.id;
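For the example above, classic Hive additionally requires the bucket map join switch to be on for the join to run bucket against bucket; a sketch:
set hive.optimize.bucketmapjoin=true;
select price from order t join customer s on t.cid = s.id;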
(4) WHERE condition optimization
Before optimization (a relational database would push this down automatically; Hive needs it spelled out):
select m.cid, u.id from order m join customer u on m.cid = u.id where m.dt = '2013-12-12';
After optimization (the where condition is applied on the map side rather than the reduce side):
select m.cid, u.id from (select * from order where dt = '2013-12-12') m join customer u on m.cid = u.id;
(5) Group by optimization
hive.groupby.skewindata=true;
Set this to true if the group by is skewed.
set hive.groupby.mapaggr.checkinterval=100000;
If the number of records for a group key exceeds this value, the optimization kicks in.
Here too, one job is split into two jobs, as sketched below.
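Combined, a sketch against the tablename table used in the next subsection:
set hive.groupby.skewindata=true;
set hive.groupby.mapaggr.checkinterval=100000;
-- job 1 spreads skewed keys randomly and partially aggregates; job 2 finishes the aggregation
select id, count(1) from tablename group by id;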
(6) Count distinct optimization
Before optimization (only one reducer, which must deduplicate first and then count, a heavy load):
select count(distinct id) from tablename;
After optimization (two jobs: one handles the subquery, which can use multiple reducers, and the other handles the count(1)):
select count(1) from (select distinct id from tablename) tmp;
select count(1) from (select id from tablename group by id) tmp;
set mapred.reduce.tasks=3;
(7) Multiple count distinct optimization
Before optimization:
select a, sum(b), count(distinct c), count(distinct d) from test group by a;
After optimization:
select a, sum(b) as b, count(c) as c, count(d) as d
from (
select a, 0 as b, c, null as d from test group by a, c
union all
select a, 0 as b, null as c, d from test group by a, d
union all
select a, b, null as c, null as d from test
) tmp group by a;
IV. Hive job optimization
1. Parallel execution
By default Hive runs jobs sequentially. One HQL statement may be split into multiple jobs; jobs with no dependency on one another do not affect each other and can run in parallel.
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=8;
Controls the maximum number of jobs that can run simultaneously for a single SQL statement; the default is 8, so at most 8 jobs run at once.
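A sketch where parallelism helps; the two union all branches have no dependency, so their jobs can run concurrently (table names reuse earlier examples):
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=8;
select * from (
select id from tablename group by id
union all
select cid as id from order group by cid
) tmp;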
2. Local execution (on the node where the data is stored)
set hive.exec.mode.local.auto=true;
Local execution requires all of the following:
(1) the job's input size is below hive.exec.mode.local.auto.inputbytes.max (default 128MB);
(2) the job's map count is below hive.exec.mode.local.auto.tasks.max (default 4); more maps would exceed the local slots;
(3) the job's reduce count is 0 or 1.
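A sketch enabling local mode with the thresholds spelled out explicitly:
set hive.exec.mode.local.auto=true;
set hive.exec.mode.local.auto.inputbytes.max=134217728;
set hive.exec.mode.local.auto.tasks.max=4;
-- a small query like this can then run in a single local process, skipping cluster scheduling
select count(1) from tablename;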
3. Merging small input files per job
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
Multiple splits are combined into one; the number of merged splits is determined by the mapred.max.split.size limit.
4. Merging small output files per job (preparation for subsequent jobs)
set hive.merge.smallfiles.avgsize=256000000; when the average output file size is below this value, start an extra job to merge the files
set hive.merge.size.per.task=64000000; size of each file after merging
5. JVM reuse
set mapred.job.reuse.jvm.num.tasks=20;
How many tasks each JVM runs.
JVM reuse causes a job to hold its slots until the job finishes.
6. Compressing data (across multiple jobs)
(1) Intermediate compression applies to data passed between the jobs of a Hive query. For intermediate data, prefer a codec that saves CPU time:
set hive.exec.compress.intermediate=true;
set hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
set hive.intermediate.compression.type=BLOCK; compress by block rather than by record
(2) Final output compression (prefer a codec with a good compression ratio to reduce storage):
set hive.exec.compress.output=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
set mapred.output.compression.type=BLOCK; compress by block rather than by record
V. Hive Map optimization
1. set mapred.map.tasks=10; has no direct effect on its own
(1) Default number of maps:
default_num = total_size / block_size
(2) Expected number (set manually):
goal_num = mapred.map.tasks
(3) Number of maps implied by the split size:
split_size = max(block_size, mapred.min.split.size)
split_num = total_size / split_size
(4) Final (actual) number of maps:
compute_map_num = min(split_num, max(default_num, goal_num))
Summary (a worked example follows below):
(1) To increase the number of maps, set mapred.map.tasks to a larger value.
(2) To decrease the number of maps, set mapred.min.split.size to a larger value.
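A worked example with assumed sizes (illustrative numbers only):
total_size = 10 GB and block_size = 128 MB give default_num = 80
mapred.map.tasks = 100 gives goal_num = 100
mapred.min.split.size = 256 MB gives split_size = max(128 MB, 256 MB) = 256 MB and split_num = 40
compute_map_num = min(40, max(80, 100)) = 40
The larger split size caps the job at 40 maps despite the goal of 100.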
2. Map-side aggregation
set hive.map.aggr=true; equivalent to running a combiner on the map side
3. Speculative execution (default is true)
mapred.map.tasks.speculative.execution
VI. Hive Shuffle optimization
Map side:
io.sort.mb
io.sort.spill.percent
min.num.spill.for.combine
io.sort.factor
io.sort.record.percent
Reduce side:
mapred.reduce.parallel.copies
mapred.reduce.copy.backoff
io.sort.factor
mapred.job.shuffle.input.buffer.percent
VII. Hive Reduce optimization
1. Speculative execution (default is true)
mapred.reduce.tasks.speculative.execution (on the Hadoop side)
hive.mapred.reduce.tasks.speculative.execution (the Hive-side parameter; same effect as the Hadoop one)
Setting either one is sufficient.
2. Reduce optimization (setting the number of reducers)
set mapred.reduce.tasks=10; direct setting
Maximum value:
hive.exec.reducers.max, default 999
Data size handled by each reducer:
hive.exec.reducers.bytes.per.reducer, default 1G
Formula (an upper bound; the actual number may be lower):
numRTasks = min(maxReducers, input.size / perReducer)
maxReducers = hive.exec.reducers.max
perReducer = hive.exec.reducers.bytes.per.reducer
A worked example follows below.
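A worked example with assumed sizes (illustrative numbers only):
input.size = 10 GB, perReducer = 1 GB (default), maxReducers = 999 (default)
numRTasks = min(999, 10 GB / 1 GB) = min(999, 10) = 10 reducers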
VIII. Queue
set mapred.queue.name=queue3; assigns the queue queue3
set mapred.job.queue.name=queue3; sets the job to use queue3
set mapred.job.priority=HIGH;
Queue reference article:
http://yaoyinjie.blog.51cto.com/3189782/872294