What is Mapsidejoin? The most detailed application introduction is here. 07/09 Update SLTechnology News&Howtos


2025-07-09 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/03 Report--

The first step in data analysis is preparing the data, which is why the previous lesson introduced metadata. This article covers how Yonghong handles combined (joined) data sets at large data volumes: Mapsidejoin.

What is Mapsidejoin? Literally, it means a join performed on the Map (M) nodes. Before looking at Mapsidejoin, we first need to understand the MapReduce model and the roles of the product's four node types: C, N, M, and R. Comparing Mapsidejoin with Reducesidejoin in the MapReduce model then shows why Mapsidejoin is advantageous when joining large data sets.

Introduction of Cluster Nodes in Yonghong

Client Node (C node): the client access node. Users submit tasks by accessing the C node.

Naming Node (N node): the brain of the cluster. Besides monitoring the other nodes in the cluster, it collects the tasks that users submit through the C node and distributes them.

Map Node (M node): the node that stores the data files.

Reduce Node (R node): the node used for summary calculation.

Introduction of MapReduce Model

Baidu Encyclopedia gives a fairly comprehensive definition of MapReduce, which can be summarized as follows: MapReduce is a cluster-based computing platform, a computing framework that simplifies distributed programming, and a programming model that abstracts distributed computing into two stages, Map and Reduce. Yonghong uses the MapReduce model when computing combined data sets.
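To make the two stages concrete, here is a minimal single-process sketch in Python. It is not Yonghong code: the function names `map_phase` and `reduce_phase` and the sample rows are made up for illustration, and the two lists merely stand in for data stored on two Map nodes.

```python
from collections import defaultdict

def map_phase(partition):
    """Runs on one Map node: emit (key, value) pairs from the local rows."""
    return [(row["area"], 1) for row in partition]

def reduce_phase(pairs):
    """Runs on the Reduce node: summarize all emitted pairs by key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# Hypothetical rows, already split across two Map nodes.
map1_rows = [{"area": "north"}, {"area": "south"}]
map2_rows = [{"area": "north"}]

# Each Map node works on its own rows; the Reduce node sees only the emitted pairs.
emitted = map_phase(map1_rows) + map_phase(map2_rows)
counts = reduce_phase(emitted)
```

The Map stage touches only local data, and the Reduce stage combines the partial output. The join variants discussed in this article differ in which of these two stages does the matching.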

Applicable scenario: a distributed cluster with multiple M nodes. Joins over large amounts of data include large table join small table and large table join large table.

1. Why use Mapsidejoin

In the MapReduce model, there are two types of join computation: Map-side join and Reduce-side join. Here is a brief introduction with an example:

Suppose we have two tables: Table 1, the personnel table, is a large table; Table 2, the area table, is a small table, as shown in the following figure:

To connect the name column of Table 1 with the Address column of Table 2 through id, we use id as the join column and do an inner join. The ids that match are id=1, id=2, id=3, and id=4.

Suppose the cluster has two Map nodes, Map1 and Map2. After Tables 1 and 2 are imported into the data mart and the data is split and stored, one of the following situations may occur:

► case 1: Mapsidejoin can be carried out

As shown in the figure above, after Table 1 and Table 2 are split, the rows of both tables for each join-column value id=1 to 4 are stored on the same node. At join time, the Map node checks whether the join-column data on it corresponds completely. If it does, the join is performed on the Map node, which then sends the join result to the Reduce node, and the Reduce node only has to summarize the results.
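Case 1 can be sketched in a few lines of Python. This is only an illustration under assumptions, not the product's implementation: the names, the addresses, and the helper `local_inner_join` are hypothetical, and each dictionary stands in for the data held by one Map node.

```python
def local_inner_join(personnel, areas):
    """Inner join on id using only the rows stored on one Map node."""
    address_by_id = {row["id"]: row["address"] for row in areas}
    return [
        (p["id"], p["name"], address_by_id[p["id"]])
        for p in personnel
        if p["id"] in address_by_id
    ]

# Hypothetical layout for case 1: matching ids sit on the same node.
map1 = {
    "personnel": [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}],
    "areas": [{"id": 1, "address": "North St"}, {"id": 2, "address": "South St"}],
}
map2 = {
    "personnel": [{"id": 3, "name": "Cy"}, {"id": 4, "name": "Di"}],
    "areas": [{"id": 3, "address": "East St"}, {"id": 4, "address": "West St"}],
}

# Each Map node joins locally; only joined rows travel to the Reduce node.
partial_results = [
    local_inner_join(node["personnel"], node["areas"]) for node in (map1, map2)
]

# The Reduce node only has to collect the partial results.
result = [row for part in partial_results for row in part]
```

Because every id finds its partner on the same node, no raw table rows need to leave the Map nodes.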

► case 2: Mapsidejoin cannot be performed, so Reducesidejoin occurs

As shown in the figure above, after the split the rows of Table 1 with ids 1 and 2 are stored on the Map1 node, while the rows of Table 2 with ids 1 and 2 are stored on the Map2 node. When a Map node detects during the join that the corresponding data is not on the same node, all of the data is sent to the Reduce node, which performs the full join.
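For comparison, case 2 can be sketched as follows. Again this is a hedged illustration, not Yonghong's actual code: `map_tag`, `reduce_join`, and the sample rows are hypothetical. The Map nodes can only tag and forward their rows, and the Reduce node does the matching.

```python
from collections import defaultdict

def map_tag(node_rows):
    """Map side: tag each local row with its source table and emit it keyed by id."""
    tagged = []
    for table, rows in node_rows.items():
        for row in rows:
            tagged.append((row["id"], table, row))
    return tagged

def reduce_join(tagged_rows):
    """Reduce side: group every shipped row by id, then join the two tables."""
    by_id = defaultdict(lambda: {"personnel": [], "areas": []})
    for key, table, row in tagged_rows:
        by_id[key][table].append(row)
    joined = []
    for key in sorted(by_id):
        for p in by_id[key]["personnel"]:
            for a in by_id[key]["areas"]:
                joined.append((key, p["name"], a["address"]))
    return joined

# Hypothetical layout for case 2: matching ids sit on different nodes.
map1 = {
    "personnel": [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}],
    "areas": [{"id": 3, "address": "East St"}, {"id": 4, "address": "West St"}],
}
map2 = {
    "personnel": [{"id": 3, "name": "Cy"}, {"id": 4, "name": "Di"}],
    "areas": [{"id": 1, "address": "North St"}, {"id": 2, "address": "South St"}],
}

shipped = map_tag(map1) + map_tag(map2)
rows_shipped = len(shipped)  # every row of both tables travels to the Reduce node
result = reduce_join(shipped)
```

Here all eight input rows are shipped before any matching happens, which is the data transmission cost that Mapsidejoin avoids.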

The above two cases simply illustrate Mapsidejoin and Reducesidejoin.

2. Advantages of Mapsidejoin and Reducesidejoin

The advantage of joining on the Map side is that a large amount of data that does not take part in the join can be filtered out in advance, which reduces data transmission. Mapsidejoin is therefore suitable for join scenarios with large amounts of data.

Joining on the Reduce side has the advantage of flexibility. Its disadvantage is that it requires a lot of data transmission and the whole join process is time-consuming, so Reducesidejoin is suitable for scenarios with a small amount of data.

In addition, a join is very resource-consuming when the amount of data is huge. Any form other than Mapsidejoin, whether joining directly against the database or doing a Reducesidejoin on data mart data, puts great pressure on the node: the product easily becomes unresponsive, which can then lead to OOM errors and downtime. Mapsidejoin is how we avoid this. When the amount of data is large, we can deploy multiple M nodes, import the data into the data mart so that it is stored across the M nodes in the cluster, and then compute on the M nodes. In this way, the pressure on the C and R nodes is spread evenly across the M nodes, which relieves the load caused by joining large amounts of data and uses resources more efficiently.

So how do we implement Mapsidejoin? How do we ensure that, after the data is split, the rows that match on the join column are stored on the same Map node? Yonghong offers two ways to implement Mapsidejoin.

Two forms of Yonghong Mapsidejoin

Fact table-dimension table

Applicable scenario (large table join small table)

In a distributed system, when star-schema data (one large table and several small tables) needs to be joined, the data of the small tables can be copied to every Map node. Mapsidejoin can then be executed without moving the join to the Reduce node, which improves the efficiency of the table join.

In the MPP data mart, large tables are imported with a normal incremental import, while for every small table the dimension table option is checked during incremental import, as shown in the following figure:

A small table marked as a dimension table is then stored in full on every Map node.

Take Table 1 (personnel table) and Table 2 (area table) above as an example: Table 1 is imported incrementally and split as usual, while Table 2 is imported into the data mart incrementally as a dimension table.

As shown in the figure, because Table 2 is stored in full on every M node, the Table 2 row for any id in Table 1 can always be found on the same M node.
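The fact table-dimension table form corresponds to what is commonly called a broadcast join. The following Python sketch shows the idea under assumptions: the sample rows and the helper `join_on_node` are hypothetical, and copying the small table into each dictionary stands in for storing it in full on every M node.

```python
def join_on_node(node):
    """Join the node's slice of the fact table with its full dimension copy."""
    lookup = {row["id"]: row["address"] for row in node["areas"]}
    return [
        (p["id"], p["name"], lookup[p["id"]])
        for p in node["personnel"]
        if p["id"] in lookup
    ]

# Hypothetical data: Table 1 (large) is split, Table 2 (small) is copied in full.
personnel_partitions = [
    [{"id": 1, "name": "Ann"}, {"id": 3, "name": "Cy"}],  # slice on M node 1
    [{"id": 2, "name": "Bob"}, {"id": 4, "name": "Di"}],  # slice on M node 2
]
areas = [
    {"id": 1, "address": "North St"},
    {"id": 2, "address": "South St"},
    {"id": 3, "address": "East St"},
    {"id": 4, "address": "West St"},
]

# Every M node receives its own full copy of the dimension table.
nodes = [{"personnel": part, "areas": list(areas)} for part in personnel_partitions]

result = sorted(row for node in nodes for row in join_on_node(node))
```

However the large table happens to be split, each slice finds all of its matches locally, at the cost of storing the small table once per node.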

However, the fact table-dimension table form also has limitations. For example, when two or more large tables are joined, one or more of them would have to be stored in full on every M node. Storing large tables this way increases resource consumption, and a large table used as a dimension table cannot be loaded into memory for calculation, so Mapsidejoin cannot be used.
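A quick back-of-the-envelope calculation shows why full copies stop scaling. The row count and node count below are made-up figures, `rows_stored` is a hypothetical helper, and any backup copies the product may keep are ignored.

```python
def rows_stored(table_rows, num_nodes, replicated):
    """Total rows held across the cluster for one table."""
    return table_rows * num_nodes if replicated else table_rows

big_table_rows = 100_000_000  # hypothetical size of a large table
num_nodes = 4                 # hypothetical number of M nodes

# Stored in full on every M node (dimension-table style).
as_dimension_table = rows_stored(big_table_rows, num_nodes, replicated=True)

# Split across the M nodes, each row stored once.
as_split_table = rows_stored(big_table_rows, num_nodes, replicated=False)
```

Under these assumptions the replicated form holds four times as many rows as the split form, and the gap grows with every M node added.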

For this situation, we use sharding columns to support joining large tables with large tables.

Sharding column

Applicable scenarios (big table join big table)

Before version 8.5.1, Mapsidejoin was only possible in the form of dimension table join fact table. In some user scenarios, the tables cannot be joined in advance into a wide table before being imported into the data mart, and the requirements for that form of Mapsidejoin (or broadcast join) are not met, so distributed join computation has to be supported inside the data mart.

The specific scenarios are as follows:

1) Business needs, for example: aggregating part of the data before joining (such as the distribution of repair batches for products whose sales exceed a specific value in a certain period); joining on specific values (such as selecting the last record in a certain period and joining it to another table); and self-joins (such as calculating this month's data against last month's data). In these scenarios (usually a snowflake model or something more complex), joining in advance inflates the data and produces a lot of redundancy, whereas in actual use the filter conditions keep the joined data small.

2) A fact table with a large amount of data needs to be updated frequently, and importing a fully joined wide table into the data mart each time takes too long.

3) In self-service scenarios, where it is uncertain whether tables will be joined and on which fields, the original detail tables need to be kept for self-service queries.

The Mapsidejoin logic for sharding columns is similar to the picture in case 1 above.

Table 1 and Table 2 are both imported incrementally with the join column selected as the sharding column, and a hash algorithm is applied to that column. This ensures that, after splitting, the rows of the two tables that share an id are stored on the same Map node, so that each split piece of a large table can be loaded into memory for calculation.
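The effect of hashing the join column can be sketched as follows. The source does not say which hash function the product uses, so `zlib.crc32` is an assumption chosen only because it is deterministic. The helpers `node_for` and `shard` and the sample rows are hypothetical as well.

```python
import zlib

def node_for(key, num_nodes):
    """Pick a Map node from the sharding-column value (crc32 is an assumed hash)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_nodes

def shard(rows, column, num_nodes):
    """Split a table across Map nodes by hashing the sharding column."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        nodes[node_for(row[column], num_nodes)].append(row)
    return nodes

# Hypothetical sample rows for the two tables.
personnel = [{"id": i, "name": n} for i, n in [(1, "Ann"), (2, "Bob"), (3, "Cy"), (4, "Di")]]
areas = [{"id": i, "address": a} for i, a in
         [(1, "North St"), (2, "South St"), (3, "East St"), (4, "West St")]]

NUM_NODES = 2
personnel_shards = shard(personnel, "id", NUM_NODES)
area_shards = shard(areas, "id", NUM_NODES)

# Rows with the same id hash to the same node, so each node can join locally.
result = []
for p_rows, a_rows in zip(personnel_shards, area_shards):
    lookup = {row["id"]: row["address"] for row in a_rows}
    result.extend((p["id"], p["name"], lookup[p["id"]]) for p in p_rows if p["id"] in lookup)
```

Unlike the dimension table form, neither table is copied in full. Both are split, and the shared hash on id is what keeps matching rows together.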

Procedure:

1. Import the large tables that need to be joined into the data mart using incremental import. During the import, check the sharding column attribute and select the join column as the sharding column. For example, when Table 1 is imported incrementally into the data mart, check id as the sharding column; Table 2 requires the same operation.

2. Join the resulting data mart datasets. When the Map nodes detect that the data corresponds, they perform the join automatically as a Mapsidejoin.

Summary

Remember the precondition for using Mapsidejoin: a distributed cluster with multiple M nodes, joining data mart datasets that hold a large amount of data.

Finally, we use a picture to briefly review the two forms of Mapsidejoin.

Sharding column: large table join large table

Fact table-dimension table: large table join small table

The above is our application introduction to Mapsidejoin.

