
Hive Data Skew and Common Optimization Methods


This article introduces data skew in Hive: what causes it, the scenarios that do and do not produce it, and the optimization methods commonly used to deal with it.

1. Data Skew in Hive

Introduction: wherever distributed processing involves a shuffle, data skew cannot be ruled out, because data may be distributed unevenly while it is being redistributed. For example, in a MapReduce program the amount of data assigned to each reduce task can be very uneven, with a large share of the data concentrated in a single reduce task. Data skew in Hive is simply data skew in the underlying MapReduce job, and it arises in the reduce phase.

Scenarios that do not produce data skew:

- Queries that do not execute a MapReduce job at all. The Hive parameter hive.fetch.task.conversion has three possible values:

- none: every statement executes MR; the fetch optimization is disabled.

- minimal: select *, filters on partition fields, and limit do not execute MR.

- more: plain select, filter/where, and limit do not execute MR.

- An aggregate function used together with group by: by default the underlying MR job runs a combiner on the map side, which pre-aggregates the data, so it does not skew (see the settings sketch below).
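
As a side note, Hive exposes two settings that strengthen this map-side pre-aggregation and spread out skewed group-by keys. A minimal sketch; both are standard Hive parameters, shown here as an illustration rather than something from the original text:

set hive.map.aggr = true;           # partial (combiner-style) aggregation on the map side
set hive.groupby.skewindata = true; # rewrites a skewed group by into two MR jobs: the first
                                    # spreads keys randomly across reducers, the second
                                    # produces the final aggregation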

Scenarios that produce data skew:

- An aggregate function used without group by: the global aggregation forces all data through a single reduce task.

- count(distinct): deduplication is likewise funneled through a single reduce task (a common rewrite is sketched after this list).

- join: mainly reduce-side joins, which can produce data skew.
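
For the count(distinct) case, the classic rewrite replaces the single-reducer distinct with a two-stage group by. A hedged sketch, assuming a log table with a userid column:

# slow: one reduce task performs the global deduplication
select count(distinct userid) from log;

# faster: group by spreads the deduplication across many reduce tasks,
# then a cheap count(*) runs over the deduplicated result
select count(*) from (select userid from log group by userid) t;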

Specific scenario analysis:

1) Too many null values in a join:

Take a log table as an example: one of its fields is userid, but many rows have userid = null. When the table is joined on userid, every row with userid = null is sent to the same reduce task. That reduce task then holds far more data than the others, producing data skew.

Solution 1: keep the null values out of the join.

# null values do not participate in the join
select field1, field2, field3
from log a
left join user b on a.userid is not null and a.userid = b.userid
union
select field1, field2, field3
from log
where userid is null;

Solution 2: turn the null values into random keys so they spread across reduce tasks.

# give each null value a random key so those rows scatter across reducers
select *
from log a
left join user b
  on case when a.userid is null then concat("null", rand()) else a.userid end = b.userid;

2) The join columns of the two tables have different types:

Table user: userid is of type string

Table log: userid is of type int

select * from log a left join user b on a.userid = b.userid;

By default the string values are converted to int for the comparison. Any string userid that cannot be converted to int becomes null, so a large number of null keys are produced and all of them land on the same reduce task, causing data skew. Casting explicitly so both sides compare as the same type avoids the problem, as sketched below.
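
A hedged sketch of the usual fix: cast the int side to string so the keys compare as strings and no nulls are produced by failed conversions (table and column names follow the example above):

# compare both join keys as strings
select *
from log a
left join user b
  on cast(a.userid as string) = b.userid;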

3) Data skew produced by joins:

Large table joined with a small table: when associating a large table with a small table, use a map-side join (map join). A map join has no shuffle, so it produces no data skew. Two parameters control it:

- hive.auto.convert.join # enables automatic map join; enabled by default

- hive.smalltable.filesize # size threshold under which a table counts as small for map join; the default is 25000000 bytes, about 25 MB

Large table joined with a small table whose data volume is fairly large:

When the small table is not huge but exceeds 25000000 bytes, Hive falls back to a reduce join by default, and a reduce join easily produces data skew. If the small table is still moderate in size (say, under 100 MB), you can force a map join:

# force a map join: the hinted table (here t2, the small one) is loaded into memory
select /*+ mapjoin(t2) */ *
from t1 join t2 on t1.field1 = t2.field1;

Large table joined with a large table: first filter one of the tables down to a relatively small table, then force a map-side join.

Take the following two tables as an example:

user: 30 GB (all users)

log: 5 GB (the logs recorded that day)

# 1. First deduplicate the userids that appear in the log table:
create table temp_log as select distinct userid from log;

# 2. Join that result with the user table to keep only the user rows that actually occur in the log:
create table temp_user as
select /*+ mapjoin(a) */ field1, field2, field3
from temp_log a join user b on a.userid = b.userid;

# 3. Finally, join the reduced user table (now small enough to map join) with log:
select /*+ mapjoin(a) */ field1, field2, field3
from temp_user a join log b on a.userid = b.userid;

2. Hive Optimization

(1) Commonly used optimization methods:

- A good design model: pay attention to potential data skew when designing tables

- Solve data skew problems

- Reduce the number of jobs

- Set a reasonable number of reduce tasks

- Understand the data distribution and resolve data skew manually where needed

- Minimize global aggregation operations when the data volume is large

- Merge small files to reduce the number of map tasks and improve performance

(2) Specific optimization schemes:

① How to choose the right sort:

- cluster by: distributes and sorts on the same field; cannot be combined with sort by.

- distribute by + sort by: distribute by guarantees that rows with the same field value go to the same result file (reduce task), and sort by makes each reduce task's output ordered (see the sketch after this list).

- sort by: sorts within a single reduce task; each reduce task's output is ordered.

- order by: global sort. The drawback is that only one reduce task can be used.
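
A minimal sketch of distribute by + sort by, assuming a log table with userid and ts columns (the table and column names are illustrative):

set mapreduce.job.reduces = 4;  # several reduce tasks, each producing one sorted file

select userid, ts
from log
distribute by userid  # all rows with the same userid go to the same reduce task
sort by userid, ts;   # each reduce task's output is sorted by userid, then ts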

② How to handle Cartesian products: when Hive runs in strict mode (hive.mapred.mode=strict), Cartesian products are not allowed in HQL statements.

Solving the Cartesian product problem: https://blog.51cto.com/14048416/2338651

Section 6 of that article, "use random prefixes and an expanded RDD for the join", explains the technique in detail.
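
The same idea translated to HQL, as a hedged sketch (not quoted from the linked article): give every big-table row a random bucket, replicate the small table once per bucket, and join on the bucket so the product is spread over several reduce tasks. Table names t_big, t_small and columns id, info are illustrative:

select b.id, s.info
from (
  select id, cast(floor(rand() * 10) as int) as bucket  -- random bucket in [0, 10)
  from t_big
) b
join (
  select info, n as bucket  -- one copy of each small-table row per bucket
  from t_small
  lateral view explode(array(0,1,2,3,4,5,6,7,8,9)) x as n
) s
on b.bucket = s.bucket;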

③ How to write in/exists:

# use left semi join to replace in/exists
select a.id, a.name from a where a.id in (select b.id from b);
# rewritten as:
select a.id, a.name from a left semi join b on a.id = b.id;

See this blog post for a summary of left semi join: https://blog.51cto.com/14048416/2342407

④ Set a reasonable number of map tasks:

Too many map tasks: each map task starts its own JVM process, so startup time dominates and efficiency is low. Too few map tasks: the load is unbalanced, and when there are many jobs it is easy to block the cluster. There are usually two remedies:

- Reduce the number of map tasks by merging small files, mainly at the data source (see the sketch below).

- Reduce the time MapReduce programs spend starting and stopping JVM processes by enabling JVM reuse: set mapred.job.reuse.jvm.num.tasks=5 lets up to 5 map tasks reuse the same JVM.
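
A hedged settings sketch for both remedies; note that mapred.job.reuse.jvm.num.tasks is the classic MRv1 parameter name (its newer equivalent, mapreduce.job.jvm.numtasks, is mentioned here as background, not from the original text):

# combine small input files into larger splits before the map phase
set hive.input.format = org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

# let up to 5 tasks of a job share one JVM instead of starting a new process each time
set mapred.job.reuse.jvm.num.tasks = 5;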

⑤ Set a reasonable number of reduce tasks:

The number of reducers strongly affects execution efficiency, which makes choosing it a key issue in Hive. If it is not set explicitly, Hive estimates it from the input size. The following parameters are the tuning levers (a sketch follows the list):

- hive.exec.reducers.bytes.per.reducer # amount of data handled per reduce task

- hive.exec.reducers.max # maximum number of reduce tasks started; rule of thumb: 0.95 * (number of datanodes in the cluster)

- mapreduce.job.reduces # sets the number of reduce tasks explicitly
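
A minimal sketch of the two common ways to control the count (the values are illustrative):

# either adjust the per-reducer data volume and let Hive derive the count...
set hive.exec.reducers.bytes.per.reducer = 256000000;  # ~256 MB per reduce task

# ...or pin the count explicitly for the session
set mapreduce.job.reduces = 8;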

⑥ Merge small files:

Too many small files put pressure on HDFS and hurt processing efficiency. You can eliminate this effect by merging the result files of the map and reduce stages. The following parameters are the tuning levers:

- set hive.merge.mapfiles = true # merge small files at the end of a map-only task

- set hive.merge.mapredfiles = false # set to true to merge small files at the end of a MapReduce task

- set hive.merge.size.per.task = 256000000 # target size of the merged files (256 * 1000 * 1000 bytes)

- set mapred.max.split.size = 256000000; # maximum split size per map

- set mapred.min.split.size.per.node = 1; # minimum split size on a node

- set hive.input.format = org.apache.hadoop.hive.ql.io.CombineHiveInputFormat; # merge small files before the map phase (enabled by default)

⑦ Set partitions reasonably:

A partition narrows the range of data scanned during a query and improves query performance. Partitioning is enabled with partitioned by when creating the table. When the table is large and queries routinely filter on a certain field, create the table partitioned on that filter field (a sketch follows).
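
A minimal DDL sketch, assuming a log table that is usually filtered by day; the table and column names are illustrative:

-- partition the table on the day column that queries filter on
create table log_p (userid string, url string)
partitioned by (dt string)
stored as orc;

-- this query reads only the dt='2024-06-01' partition, not the whole table
select count(*) from log_p where dt = '2024-06-01';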

⑧ Make reasonable use of storage formats:

When creating a table, prefer the ORC or Parquet columnar storage formats: in a columnar table the data of each column is physically stored together, so a query reads only the columns it references, greatly reducing the amount of data processed. A conversion sketch follows.
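
A one-line sketch of converting an existing table (user_orc is an illustrative name):

-- rewrite an existing text-format table into ORC
create table user_orc stored as orc as select * from user;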

⑨ Parallel execution:

A Hive SQL statement may be translated into several MapReduce jobs; each job is a stage, and these jobs execute sequentially by default (you can see this in the client's run log). Sometimes the stages do not depend on each other, and if cluster resources allow, independent stages can be executed concurrently. Two parameters can be tuned:

- set hive.exec.parallel=true; # enable parallel execution

- set hive.exec.parallel.thread.number=8; # maximum number of stages allowed to run in parallel within one SQL statement

⑩ Set compressed storage:

Hive ultimately executes as MapReduce programs, and the performance bottleneck of MapReduce lies in network and disk IO. The key to relieving the bottleneck is reducing the amount of data moved, and data compression is a good way to do that.

# compress job output files with gzip at block granularity:
set mapreduce.output.fileoutputformat.compress=true;  # default: false
set mapreduce.output.fileoutputformat.compress.type=BLOCK;  # default: RECORD
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;  # default: org.apache.hadoop.io.compress.DefaultCodec

# compress map output with gzip:
set mapred.map.output.compress=true;
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.GzipCodec;  # default: org.apache.hadoop.io.compress.DefaultCodec

# compress both Hive's final output and its intermediate results:
set hive.exec.compress.output=true;  # default: false (no compression)
set hive.exec.compress.intermediate=true;  # default: false; when true, the MR compression settings above take effect
