
What are the join strategies for two tables in Hadoop


This article introduces the join strategies for two tables in Hadoop (Hive). Many people have questions about how such joins actually work, so the common methods have been collected and organized here in a simple, practical form. Hopefully it answers the question; let's go through them together!

Common Join

This is the most common join strategy; it works regardless of data size and is also called a reduce side join. It is the least efficient way to join, and the whole join is done in a single MapReduce job.

In the map/shuffle stage, each map output key becomes table_name_tag_prefix + join_column_value, but partitioning still hashes only on join_column_value.

Each reducer receives all the records the maps emit for its partition; in the reduce-side shuffle, the table_name_tag_prefix in front of the map output key is discarded for the comparison. Because the number of reducers can be chosen based on the size of the small table, each reducer must be able to hold its split of the small table in memory as a hashtable, which is then probed with the big table's records one by one.
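
As a minimal sketch of this path (table and column names are illustrative, not from the original), automatic map join conversion can be disabled so the query runs as a common join:

set hive.auto.convert.join = false; -- force the reduce side (common) join
select a.id, a.name, b.order_id
from small_table a
join big_table b on a.id = b.id;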

Map Join

A map join runs in two steps: the small table's data is turned into a hashtable and broadcast to every map side, while the large table's data is split up sensibly; then, in the map phase, each row of the large table probes the small table's hashtable, and rows whose join keys match are written to HDFS.

A map join is called a map join because all of its work is computed on the map side.

Hive has made several optimizations to map join:

In Hive 0.6, the default was that the large table is written after the select and the small table in front, or you mark the small table explicitly with the /*+ mapjoin(map_table) */ hint. In Hive 0.7 this became automatic: Hive first determines which table is small and which is large, controlled by the parameter hive.auto.convert.join=true. The size limit for the small table is controlled by the parameter hive.smalltable.filesize=25000000 (default 25M); when the small table exceeds this size, Hive falls back to a common join by default. See HIVE-1642.
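
A sketch of both styles (table names map_table and big_table are illustrative):

-- Hive 0.6 style: mark the small table explicitly with a hint
select /*+ mapjoin(map_table) */ m.id, b.val
from map_table m join big_table b on m.id = b.id;

-- Hive 0.7+ style: let Hive convert automatically
set hive.auto.convert.join = true;
set hive.smalltable.filesize = 25000000;
select m.id, b.val from map_table m join big_table b on m.id = b.id;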

First, processing of the small table is turned into a MapReduce Local Task, which reads all of the small table's data from HDFS, converts it into a hashtable file, and compresses and uploads it into the DistributedCache.

Hive's map join currently has a few limitations. One is that it intends to implement the hashtable with a BloomFilter: a BloomFilter uses roughly 8-10 times less memory than a hashtable, but the size of a BloomFilter is harder to control.

Another is that the hashtable in the DistributedCache is replicated 3 times by default, which is far too few for a large table running 1000 maps; most map tasks end up waiting on DistributedCache replicas.

Bucket Map Join

Hive supports hash partitioning of a table through the clustered by (col_name, ...) into number_buckets buckets clause.

When the join key of the two tables being joined is the bucket column, you can set

hive.optimize.bucketmapjoin = true

to make Hive execute a bucket map join. Note that the number_buckets of your small table must be a multiple of that of the large table, and this condition must hold no matter how many tables are joined. (In fact, if the tables are bucketed in powers of 2, the large table's bucket count can also be a multiple of the small table's, but this requires one extra computation; it is valid for int and long, and unclear for string.)
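
A sketch of the setup (table and column names are illustrative; both tables are bucketed on the join key, and equal bucket counts trivially satisfy the multiple condition):

create table small_orders (order_id bigint, amount double)
clustered by (order_id) into 32 buckets;

create table big_events (event_id bigint, order_id bigint)
clustered by (order_id) into 32 buckets;

set hive.optimize.bucketmapjoin = true;
select /*+ mapjoin(small_orders) */ e.event_id, o.amount
from small_orders o join big_events e on o.order_id = e.order_id;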

The Bucket Map Join execution plan has two steps. First, the small table is turned into number_buckets hashtables by map tasks and broadcast to the map side of all the large table's maps. The large table's map side receives the number_buckets hashtables and does not need to merge them into one big hashtable; the map operation can run directly. The map operation produces number_buckets splits, each labeled like the corresponding hashtable of the small table. When the probe runs, only one of the small table's hashtables needs to be in memory at a time, with the corresponding split of the large table read and checked against it; so memory use is bounded by the largest of the small table's hashtables.

Bucket map join is also an implementation of map side join: all computation is done on the map side, and a join with no reduce is what is called a map side join. Bucketing is just Hive's implementation of hash partitioning; the other kind, of course, is value partitioning:

create table a (xxx) partitioned by (col_name)

However, two tables in Hive do not necessarily have the same partition key, and even when they do, it is not necessarily the join key. So Hive has no such value-based map side join; list partitioning in Hive is mainly used to filter data rather than to distribute it. The two main parameters are hive.optimize.cp = true and hive.optimize.pruner = true.

The Hadoop source code ships a default implementation of map side join; you can find the related classes under the src/contrib/data_join/src directory of the Hadoop source. TaggedMapOutput can be used to implement either hash or list partitioning, depending on how you decide to partition. Chapter 8 of Hadoop: The Definitive Guide, on map side joins and side data distribution, also has an example of implementing a map side join with value partitioning.

Sort Merge Bucket Map Join

Bucket map join does not remove the map join restriction that the small table must be loaded entirely into memory. If you want neither the large nor the small table on a node to have to fit in memory, you must make both tables ordered on the join key: specify sorted by on the join key, or use an index, when creating the tables. Then set:

set hive.optimize.bucketmapjoin = true;

set hive.optimize.bucketmapjoin.sortedmerge = true;

set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

Bucket columns == Join columns == sort columns

This way the small table's data can be read just a portion at a time, while the large table is still matched row by row; such a join places no limit on memory size, and it can even perform a full outer join.
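
A sketch of tables prepared for this join (names illustrative; bucket, join, and sort columns all match):

create table small_orders (order_id bigint, amount double)
clustered by (order_id) sorted by (order_id asc) into 32 buckets;

create table big_events (event_id bigint, order_id bigint)
clustered by (order_id) sorted by (order_id asc) into 32 buckets;

-- with the three settings above, the join can stream both tables in sorted order
select /*+ mapjoin(small_orders) */ e.event_id, o.amount
from small_orders o join big_events e on o.order_id = e.order_id;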

Skew Join

In real data, skew is a certainty. In Hadoop, the default is

hive.exec.reducers.bytes.per.reducer = 1000000000

that is, each reducer processes 1G of data by default. If your join operation also produces data skew, then you can set the following in Hive:

set hive.optimize.skewjoin = true;

set hive.skewjoin.key = skew_key_threshold (default = 100000)

Hive has no way of knowing at run time which keys will skew and by how much, so this parameter controls the skew threshold: once a key exceeds it, the extra rows are sent on to reducers that have not yet received data. Generally, setting it to 2-4 times (total number of records processed / number of reducers) is acceptable.

Skew is often present: generally, a select nested more than 2 layers deep, translating into a MapReduce job of more than 3 execution plans, easily produces skew, so it is recommended to set this parameter before running any fairly complex SQL. If you don't know what value to set, you can start from the official default of each reducer processing 1G: skew_key_threshold = 1G / average row length, or default it to 25000000 (an average row length of about 40 bytes).
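
A sketch of the resulting settings (the 40-byte average row length is an assumption used only to derive the threshold):

set hive.exec.reducers.bytes.per.reducer = 1000000000;
set hive.optimize.skewjoin = true;
-- skew_key_threshold = 1000000000 bytes / 40 bytes per row = 25000000 rows
set hive.skewjoin.key = 25000000;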

Left Semi Join

Hive has no in/exists clause, so this type of clause must be converted into a left semi join. A left semi join passes only the table's join key to the map phase; if the key set is small enough it becomes a map join, and if not, a common join.
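
For example (table names a and b are illustrative), an in subquery rewritten as a left semi join:

-- intended meaning: select a.key, a.value from a where a.key in (select b.key from b)
select a.key, a.value
from a
left semi join b on (a.key = b.key);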

Difficulties with join strategies

Most strategies are suitable only for equi-joins.

Range comparisons and full outer joins lack proper support.

It is difficult to judge which of several execution plans is optimal.

IO such as temporary tables, network consumption and latency, and CPU time are not taken into account.

The optimal plan does not necessarily mean the least resource consumption.

This concludes the study of join strategies for two tables in Hadoop; hopefully it has resolved your doubts. Theory works best when matched with practice, so go and try it out! If you want to keep learning more related knowledge, please continue to follow this site, which will keep bringing you more practical articles.
