
Example Analysis of Hive Optimization


This article walks through Hive optimization with concrete examples. The material is practical and is shared here for reference; I hope you take something away after reading it.

1. When can MapReduce be avoided? When a select * query's WHERE clause references only partition fields, Hive can answer it by reading the partition files directly instead of launching a job.
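A minimal sketch, assuming a table named logs partitioned by dt (both names are hypothetical):

/* dt is a partition column, so Hive prunes partitions and reads files directly; no MapReduce job is launched */
select * from logs where dt = '2015-06-11';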

2. Join optimization. Order the tables in a join so that size increases from left to right; the rightmost table is the driving (streamed) table. If that ordering is impractical, the hint mechanism /*+ STREAMTABLE(table_name) */ tells the query optimizer which table is the large one.
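A hedged example (the tables small_dim and big_fact are hypothetical names):

/* stream the large fact table; the small dimension table is buffered in memory */
select /*+ STREAMTABLE(f) */ f.id, d.name
from small_dim d join big_fact f on d.id = f.dim_id;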

3. Map-side aggregation: set hive.map.aggr=true; This setting performs top-level partial aggregation in the Map phase, which reduces the data transferred during the shuffle and shortens the Reduce phase, improving overall performance. The downside is that it consumes more memory. Example: execute select count(1) from wlan;
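Applied to a group by (the column acc_type is a hypothetical name), each mapper pre-aggregates its partial counts before the shuffle:

set hive.map.aggr=true;
/* partial counts are computed per mapper, so far less data crosses the network */
select acc_type, count(1) from wlan group by acc_type;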

4. Local Hadoop mode: set mapred.job.tracker=local; Test with: select 1 from wlan limit 5;

The following two parameters are commonly used to control local MR (a combined configuration sketch follows this list):

1. hive.exec.mode.local.auto.inputbytes.max sets the maximum input size for local MR; when the input data is smaller than this value, the query runs in local MR mode.

2. hive.exec.mode.local.auto.tasks.max sets the maximum number of input files for local MR; when the number of input files is below this value, the query runs in local MR mode.
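A minimal configuration sketch; hive.exec.mode.local.auto is the switch that lets Hive choose local mode automatically, and the threshold values below are illustrative:

set hive.exec.mode.local.auto=true; /* let Hive pick local mode when the thresholds allow */
set hive.exec.mode.local.auto.inputbytes.max=134217728; /* 128 MB input ceiling (illustrative) */
set hive.exec.mode.local.auto.tasks.max=4; /* at most 4 input files (illustrative) */

The two runs below execute the same count query, first without and then with local mode. First, the normal cluster run: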

hive (default)> select count(1) from t1;

Query ID = root_20150611185656_333185b7-e8b3-40b5-bc4c-2f11978f9822

Total jobs = 1

Launching Job 1 out of 1

Number of reduce tasks determined at compile time: 1

In order to change the average load for a reducer (in bytes):

set hive.exec.reducers.bytes.per.reducer=<number>

In order to limit the maximum number of reducers:

set hive.exec.reducers.max=<number>

In order to set a constant number of reducers:

set mapreduce.job.reduces=<number>

Starting Job = job_1433931422330_0001, Tracking URL = http://crxy176:8088/proxy/application_1433931422330_0001/

Kill Command = /usr/local/hadoop-2.6.0/bin/hadoop job -kill job_1433931422330_0001

Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1

2015-06-11 18:56:14,447 Stage-1 map = 0%, reduce = 0%

2015-06-11 18:56:57,029 Stage-1 map = 100%, reduce = 0%

2015-06-11 18:57:15 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.2 sec

MapReduce Total cumulative CPU time: 4 seconds 200 msec

Ended Job = job_1433931422330_0001

MapReduce Jobs Launched:

Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 4.2 sec HDFS Read: 312 HDFS Write: 2 SUCCESS

Total MapReduce CPU Time Spent: 4 seconds 200 msec

OK

100

Time taken: 46.573 seconds, Fetched: 1 row(s)

For comparison, the same query with local Hadoop mode enabled:

hive (default)> select count(1) from t1;

Automatically selecting local only mode for query

Query ID = root_20150611185555_97e1a1d0-1958-4f35-8ea7-8face4cda85f

Total jobs = 1

Launching Job 1 out of 1

Number of reduce tasks determined at compile time: 1

In order to change the average load for a reducer (in bytes):

set hive.exec.reducers.bytes.per.reducer=<number>

In order to limit the maximum number of reducers:

set hive.exec.reducers.max=<number>

In order to set a constant number of reducers:

set mapreduce.job.reduces=<number>

Job running in-process (local Hadoop)

Hadoop job information for Stage-1: number of mappers: 0; number of reducers: 0

2015-06-11 18:55:12 Stage-1 map = 100%, reduce = 100%

Ended Job = job_local1510342541_0004

MapReduce Jobs Launched:

Stage-Stage-1: HDFS Read: 12 HDFS Write: 22 SUCCESS

Total MapReduce CPU Time Spent: 0 msec

OK

100

Time taken: 1.721 seconds, Fetched: 1 row(s)

5. Indexes. Hive's index architecture exposes an interface against which you can implement your own index. Hive currently ships one reference implementation, CompactIndex, and bitmap indexes were added in version 0.8. Here we look at CompactIndex.

/* create an index on the id field of the index_test_table table */

create index idx on table index_test_table(id)

as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' with deferred rebuild;

alter index idx on index_test_table rebuild;

/* prune the index: take the index table built above and filter it with the query condition you will eventually use */

/* unlike an RDBMS, you cannot use the index transparently as soon as it is built; querying that way simply reports an error, which is the inconvenient part */

create table my_index

as select `_bucketname`, `_offsets`

from default__index_test_table_idx__ where id = 10;

/* now the index can be used; note that the final query condition must match the pruning condition above */

set hive.index.compact.file=/user/hive/warehouse/my_index;

set hive.input.format=org.apache.hadoop.hive.ql.index.compact.HiveCompactIndexInputFormat;

select count(*) from index_test_table where id = 10;
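A small follow-up sketch: in the Hive versions that support this index feature, an index can be dropped when no longer needed:

/* remove the index once it is no longer useful */
drop index idx on index_test_table;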

6. Data skew. So-called data skew means that because data is unevenly distributed, a few key values account for most of the rows; combined with Hadoop's computing model, the resulting uneven use of compute resources degrades performance.

Skew divides into skew caused by group by and skew caused by join:

The first relevant parameter is hive.map.aggr, whose default is true; it enables combiner-style aggregation on the map side. So if your group by query only computes count(*), you will not see the skew effect, but with count(distinct) you will still see some skew.

The other parameter is hive.groupby.skewindata. With this setting, the first Reduce pass does not send all rows with the same key to the same reducer; instead keys are distributed randomly and each reducer aggregates its share. A second MR round then combines these partial aggregates into the final result. In effect this is similar to what hive.map.aggr does, but it performs the work on the Reduce side and needs an extra round of Jobs, so it is not recommended; the benefit is also not obvious.
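A hedged configuration sketch (skewed_tbl and its uid column are hypothetical names):

set hive.map.aggr=true;
set hive.groupby.skewindata=true;
/* the skewed group by now runs as two jobs: a randomized partial aggregation, then a final merge */
select uid, count(1) from skewed_tbl group by uid;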

Rewriting the SQL is another way to optimize: the subquery below deduplicates the (a, b) pairs first, so the outer aggregation is a plain count rather than a per-key count(distinct).

/* before rewriting */

select a, count(distinct b) as c from tbl group by a;

/* after rewriting (note that Hive requires an alias on the subquery) */

select a, count(*) as c

from (select distinct a, b from tbl) t group by a;

7. Parallelism between Jobs

Among the multiple Jobs that Hive generates for one query, some can run in parallel; subqueries are the typical case. Parallelism between Jobs applies when several subqueries feed a union all or join. For example, the following multi-insert is a scenario whose stages can run in parallel (a configuration sketch follows the example):

hive> FROM t4

INSERT OVERWRITE TABLE t3 PARTITION (...) SELECT ... WHERE ...

INSERT OVERWRITE TABLE t3 PARTITION (...) SELECT ... WHERE ...

INSERT OVERWRITE TABLE t3 PARTITION (...) SELECT ... WHERE ...
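Parallel stage execution is switched on with the settings below; the thread count chosen here is an illustrative value:

/* allow independent stages of one query to run concurrently */
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=8; /* illustrative value */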

This concludes "Example Analysis of Hive Optimization". I hope the content above is helpful and leaves you with something new; if you found the article worthwhile, please share it so more people can see it.
