Practical Analysis of how to rewrite left anti join with MaxCompute full outer join 07/09 Update SLTechnology News&Howtos

Practical Analysis of how to rewrite left anti join with MaxCompute full outer join

2025-07-09 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Servers >

Shulou(Shulou.com)05/31 Report--

How to analyze the practice of MaxCompute full outer join rewriting left anti join, in view of this problem, this article introduces the corresponding analysis and solution in detail, hoping to help more partners who want to solve this problem to find a more simple and feasible method.

Introduction: the model of incremental merging is often encountered in ods layer data synchronization, that is, Tmuri 1-day increment table + Tmur2 full scale = Tmai 1 full scale. Merging can be done through full outer join scripts, but it is very resource-consuming when there is a large amount of data. The following will show you the best practices of how to rewrite left anti join through full outer join when adding and updating incremental data.

Background

The model of incremental merging is often encountered in ods layer data synchronization, that is, Tmuri 1-day increment scale + Tmur2 full scale = Tmur1 full scale. Merging can be done through full outer join scripts, but it is very resource-consuming when there is a large amount of data.

Insert overwrite table tb_test partition (ds='$ {bizdate}') select case when a.id is not null then a.id esle b.id end as id, if (a.name is not null, a.name, b.name) as name, coalesce (a.age, b.age) as age All the fields of delta table from (select * from tb_test_delta where ds='$ {bizdate}') afull outer join (select * from tb_test where ds='$ {bizdate-1}') bon a.id = b.id are given priority.

This way of writing enables new and updated operations:

The addition refers to the new data in the increment table, but not in the full scale.

Update refers to the data in both the incremental table and the full table, but gives priority to the data of the incremental table to cover the data of the historical table.

As shown in the following figure, R2room1 is the incremental data after the weight of the increment table is removed on the same day, M3 is the data of the previous day of the full scale, and J40002room3 is the execution diagram of full outer join.

If you expand J4 / 2 / 3, you will find that there will be incremental and full merge join, which will incur a lot of shuffle overhead when the amount of data is very large (128.8 billion items). At this time, the optimization solution is to change full outer join to union all, so as to avoid join shuffle.

Optimization model

Conclusion: changing full outer join to hash cluster + left join + union all can effectively reduce the computing cost, and there are two application scenarios. First, the model is abstracted, assuming that there are two tables an and b, an is the incremental table, b is the full scale:

With an as (select * from values (1djiny111'), (2pjcmftwo'), (7pje 777') as (id,name)-increment, b as (select * from values (1pr)), (2pje 222'), (3) '333'), (4) as (id,name))-full scenario 1: only merge the new data into the full scale

Left anti join is equivalent to not in, and the incremental not in is full. Only the completely new id is left after filtering, and the existing id in the total amount is not modified:

-- query the newly added idselect * from a left anti join b on a.id=b.id -- the results are as follows: +-+-+ | id | name | +-+-+ | 7 | 777 | +-- newly added combined full scale select * from a-- increased table left anti join b on a.id=b.id union all select * from b -- full scale-- results are as follows: +-+-+ | id | name | +-+-+ | 1 | 2 | 222 | 3 | 333 | | 4 | 444 | | 7 | 777 | +-- -+ scenario 2: merge the new data into the full scale And update historical data

Full not in increment. After filtering, only historical id is left, and then union all increment is added and modified.

-- query full historical data select * from b left anti join an on a.id=b.id -- the results are as follows: +-+-+ | id | name | +-+-+ | 3 | 333 | | 4 | 444 | +-- merge the new data to the full table. And update the historical data select * from b-- full scale left anti join an on a.id=b.idunion all select * from a -- Enhancement table-- results are as follows: +-+-+ | id | name | +-+-+ | 1 | 111 | 2 | two | 7 | 777 | | 3 | 333 | | 4 | 444 | +-+-+ optimization practice

Step 1: modify table properties

Table, job attribute modification, the original table, job attribute optimization, can improve the optimization effect.

Set odps.sql.reducer.instances=3072;-optional. The default maximum is 1111 reducer,1111 hash buckets. Alter table table_name clustered by (contact_id) sorted by (contact_id) into 3072 buckets;-- required

Step 2: modify the code according to scenario 1 or scenario 2 of the above model.

Here is a comparison of resource consumption after code modification:

The original full outer jionleft anti join initializes the original full outer jionleft anti join the second day later

Time consumption 8h40min38s1h5min48s7h42min30s32min30scpu consumption 29666.02 Core * Min65705.30 Core * Min31126.86 Core * Min30589.29 Core * Minmem consumption 109640.80 GB * Min133922.25 GB * Min114764.80 GB * Min65509.28 GB * Min

You can find that the hash cluster bucket operation has extra overhead in initialization, mainly hashing and sorting by the primary key, but it's worth it, once and for all, and subsequent reads are very fast. It used to take eight hours to run every day, but now it takes an hour to initialize the bucket, and it actually takes only 30 minutes a day.

Initialization execution diagram

Figure 1:

M2 reads the full scale.

M4 is the read increment table, which is read twice in the model of scenario 2, where:

R5room4 is used for the subsequent union all after row_number of the primary key, which contains all the incremental data

R1room4 is used for left anti join after deweighting the primary key (row_number), which contains only the primary key.

J3 is left anti join. When you expand it, you will see that there is still mergJoin here, but this is only an initialization operation, and it will not happen every day. After expansion, it is shown in figure 2.

R6 incremental 5 is the incremental and full union all, which is expanded as shown in figure 3.

R7room6 writes index information to metadata. For example, the MetaCollector1 in figure 3 will sink in R7room6.

So: there is shuffle in figure 1, except for R5room4 and R1room4, which are necessary for deduplication. There are also shuffle in J3, 1, 2 and R6, 3, 5.

Figure 2:

The execution chart after the second day

R3room2 and R1room2 are necessary for incremental deduplication, there is shuffle, which is ignored here.

The initialization execution diagram J3 # 1 # 2 and R6 # 3 # 5 have been merged into M4 # 1 # 3, which is expanded as shown in figure 2. That is, the two steps of left anti join and union all are completed in one phase, and this stage is the Map task (M4 tasks 1 / 3), not the Join task or the Reduce task. And the full scale no longer occupies a single Map task, but also is merged into M4 task 1 # 3, so there is no shuffle operation in the whole process, and the speed increase is very obvious. In other words, you only need an M4 to complete all the operations, sink directly to the table.

R5room4 writes index information to metadata. For example, the MetaCollector1 in figure 2 will sink in R5room4.

Figure 2:

The answers to the practical analysis questions on how to rewrite MaxCompute full outer join left anti join are shared here. I hope the above content can be of some help to everyone. If you still have a lot of doubts to be solved, you can follow the industry information channel to learn more about it.

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.