
[Big Data] Predicate Push-Down Processing in SparkSQL Join Queries (2)

2025-02-22 Update From: SLTechnology News&Howtos

This article was first published on the WeChat official account "vivo Internet Technology".

Author: Li Yong

Table of contents:

1. Post-join condition on the left table: push down

2. Join condition on the left table: do not push down

3. Join condition on the right table: push down

4. Post-join condition on the right table: do not push down

5. Summary

In "Predicate Push-Down Processing in SparkSQL Join Queries (1)", we introduced some basic concepts and analyzed the basic push-down rules for inner join queries.

This article introduces the predicate push-down rules for outer join queries. They are more complicated than the rules for inner joins, but they can still be analyzed clearly with a pair of simple tables. Let's start with the tables:
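The table illustration did not survive extraction. Judging from the walkthroughs below (two rows per table, with ids 1 and 2 on both sides), the example tables can be assumed to look roughly like this; the value strings are placeholders, not from the original:

```python
# Assumed reconstruction of the two example tables (ids taken from the
# walkthroughs in this article; the value strings are placeholders).
lt = [{"id": 1, "value": "one"}, {"id": 2, "value": "two"}]  # left table LT
rt = [{"id": 1, "value": "one"}, {"id": 2, "value": "two"}]  # right table RT
```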

Taking the left outer join query as an example, the rules are summarized as follows:
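The summary table itself did not survive extraction; judging from the four sections that follow, the rules it encoded can be sketched as a simple lookup, keyed by which table the condition references and where the condition appears in the query:

```python
# Push-down rules for a LEFT OUTER JOIN, reconstructed from the four
# cases analyzed in this article (True = safe to push down).
pushdown_safe = {
    ("left table",  "post-join condition (WHERE)"): True,
    ("left table",  "join condition (ON)"):         False,
    ("right table", "join condition (ON)"):         True,
    ("right table", "post-join condition (WHERE)"): False,
}

assert sum(pushdown_safe.values()) == 2  # exactly two of the four cases are safe
```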

Next, each of these rules is analyzed in detail.

1. Post-join condition on the left table: push down

The query statement is as follows:

As mentioned earlier, executing the post-join condition after the join operation yields the result we can treat as the correct baseline. So first left-join the two tables; the result is as follows:

Then filter with the post-join condition LT.id > 1; the result is as follows:

Now analyze what happens when LT.id > 1 is pushed down to the left table. After filtering with LT.id > 1, the left table becomes:

Now left-join this with the right table. The row with id 2 in the left table finds a match in the right table, so the join result is as follows:

The two approaches produce the same result. The push-down filters out a full 50% of the left table's data (impressive, even though that is only one row here). The reason is that SparkSQL parses the query above into the following subquery:

This is an uncorrelated subquery: it can be evaluated first on its own, and then the parent query can be completed. The subquery never references the outer query during evaluation.
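The comparison above can be sketched in plain Python (a simulation of left-outer-join semantics, not Spark itself). The query image did not survive; from the narrative it is assumed to be a left join on lt.id = rt.id with WHERE lt.id > 1, and the table rows are assumed placeholders:

```python
# Assumed query (the original image is lost):
#   SELECT * FROM lt LEFT JOIN rt ON lt.id = rt.id WHERE lt.id > 1

def left_join(left, right, on):
    """Left outer join: keep every left row; pad with None when nothing matches."""
    out = []
    for l in left:
        matches = [(l, r) for r in right if on(l, r)]
        out.extend(matches if matches else [(l, None)])
    return out

# Placeholder rows for the two example tables (ids from the walkthrough).
lt = [{"id": 1, "value": "one"}, {"id": 2, "value": "two"}]
rt = [{"id": 1, "value": "one"}, {"id": 2, "value": "two"}]
on = lambda l, r: l["id"] == r["id"]

# Baseline plan: join first, then apply the post-join condition LT.id > 1.
baseline = [(l, r) for (l, r) in left_join(lt, rt, on) if l["id"] > 1]

# Push-down plan: filter the left table with LT.id > 1 first, then join.
pushed = left_join([l for l in lt if l["id"] > 1], rt, on)

assert baseline == pushed  # identical results, so the push-down is safe
```

Both plans keep only the id = 2 row, matching the table walkthrough above.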

2. Join condition on the left table: do not push down

The query statement is as follows:

Let's first look at the correct result, without push-down. The join proceeds as follows:

Step 1: the row with id 1 in the left table finds a matching id in the right table, but since its id is 1 it fails the second join condition (LT.id > 1). The left row therefore joins with nothing: its value is retained, while the right side is filled with null. (In a left outer join, a left row that fails the join condition still keeps its own values and gets nulls for the right side.)

Step 2: the row with id 2 in the left table finds a match in the right table, and its id is greater than 1, so both join conditions are satisfied. The values from both the left and the right table are retained. The final query result is as follows:

So what do we get if we push the condition LT.id > 1 down to the left table?

First, filter the left table with LT.id > 1:

Now join with the right table. The row with id 2 in the left table finds a match in the right table and satisfies the join condition LT.id = RT.id AND LT.id > 1, so the values of both tables are retained. There is no more data left to process in the left table, so the query ends with the following result:

This result differs from the correct one, so a condition on the left table inside the join clause cannot be pushed down for data filtering. The reason: conditions inside the join and conditions after the join treat non-matching rows differently. The former still retains part of the result (the left row with nulls on the right) when the join condition fails, while the latter retains nothing when the condition fails.
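The same comparison can be sketched in plain Python. The query image is lost; from the narrative it is assumed to be a left join with the compound ON clause lt.id = rt.id AND lt.id > 1, over placeholder rows:

```python
# Assumed query (the original image is lost):
#   SELECT * FROM lt LEFT JOIN rt ON lt.id = rt.id AND lt.id > 1

def left_join(left, right, on):
    """Left outer join: keep every left row; pad with None when nothing matches."""
    out = []
    for l in left:
        matches = [(l, r) for r in right if on(l, r)]
        out.extend(matches if matches else [(l, None)])
    return out

lt = [{"id": 1, "value": "one"}, {"id": 2, "value": "two"}]
rt = [{"id": 1, "value": "one"}, {"id": 2, "value": "two"}]
on = lambda l, r: l["id"] == r["id"] and l["id"] > 1  # both ON conditions

# Correct plan: evaluate the compound ON clause during the join.
# The id = 1 row fails the ON clause but survives with a null right side.
correct = left_join(lt, rt, on)

# Wrong plan: push LT.id > 1 down to the left source before joining.
pushed = left_join([l for l in lt if l["id"] > 1], rt, on)

assert len(correct) == 2 and correct[0][1] is None
assert pushed != correct  # the push-down silently dropped the id = 1 row
```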

3. Join condition on the right table: push down

The query statement is as follows:

Now push the right table's join condition RT.id > 1 down to filter the right table. After filtering, the right table looks like this:

Then left-join the left table with the filtered right table. The process is as follows:

Step 1: the row with id 1 in the left table finds no match in the filtered right table, so the left table's values are retained and the right side is null.

Step 2: the row with id 2 in the left table finds a match in the right table, and RT.id is greater than 1, so both join conditions are satisfied and the values of both tables are retained. The query result is as follows:

Now look at the result without push-down (to confirm correctness). The process is as follows:

Step 1: the row with id 1 in the left table has a matching id in the right table but fails the second join condition, so this row is not joined: the left table's data is retained and the right side is null.

Step 2: the row with id 2 in the left table matches in the right table and also satisfies the second join condition, so the data from both tables is retained.

As you can see, the result is the same whether or not the condition is pushed down — so why not push it down? It filters out half of the right table's data. The equivalent processing in SparkSQL is:

As you can see, it is likewise parsed into an uncorrelated subquery.
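A plain-Python sketch of this case. The query image is lost; from the narrative it is assumed to be a left join with the compound ON clause lt.id = rt.id AND rt.id > 1, over placeholder rows:

```python
# Assumed query (the original image is lost):
#   SELECT * FROM lt LEFT JOIN rt ON lt.id = rt.id AND rt.id > 1

def left_join(left, right, on):
    """Left outer join: keep every left row; pad with None when nothing matches."""
    out = []
    for l in left:
        matches = [(l, r) for r in right if on(l, r)]
        out.extend(matches if matches else [(l, None)])
    return out

lt = [{"id": 1, "value": "one"}, {"id": 2, "value": "two"}]
rt = [{"id": 1, "value": "one"}, {"id": 2, "value": "two"}]
on = lambda l, r: l["id"] == r["id"] and r["id"] > 1  # both ON conditions

# Without push-down: evaluate the compound ON clause during the join.
correct = left_join(lt, rt, on)

# With push-down: filter the right table with RT.id > 1 before joining.
pushed = left_join(lt, [r for r in rt if r["id"] > 1], on)

assert correct == pushed      # identical results, so the push-down is safe
assert correct[0][1] is None  # the id = 1 left row keeps a null right side
```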

4. Post-join condition on the right table: do not push down

This is probably the least intuitive case. The query statement is as follows:

First, the correct result, without pushing down the post-join condition. The process is as follows:

Step 1: the row with id 1 in the left table finds a match in the right table, satisfying the join condition. But when the post-join WHERE condition is applied to this joined row, the right table's id fails RT.id > 1, so the row is discarded. (Note: discarded entirely — neither the left nor the right values are kept. This differs from the earlier case, where a left row that failed to join kept its own values with nulls on the right. This distinction is the key point.)

Step 2: the row with id 2 in the left table joins with the row with id 2 in the right table, and the WHERE condition RT.id > 1 is also satisfied.

This is the correct result according to the semantics of the query.

Now let's see what happens when the post-join condition on the right table is pushed down:

Step 1: filter the right table with RT.id > 1. Afterwards, only the row with id 2 remains in the right table.

Step 2: the row with id 1 in the left table finds no match in the filtered right table, so the left table's values are retained and the right side is null.

Step 3: the row with id 2 in the left table matches in the right table, so the values from both tables are retained.

The results are as follows:

This is clearly a wrong result: the joined row with id 1 should have been discarded by the WHERE condition, but the push-down kept the left row with a null right side.
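A plain-Python sketch of this last case. The query image is lost; from the narrative it is assumed to be a left join on lt.id = rt.id with WHERE rt.id > 1, over placeholder rows:

```python
# Assumed query (the original image is lost):
#   SELECT * FROM lt LEFT JOIN rt ON lt.id = rt.id WHERE rt.id > 1

def left_join(left, right, on):
    """Left outer join: keep every left row; pad with None when nothing matches."""
    out = []
    for l in left:
        matches = [(l, r) for r in right if on(l, r)]
        out.extend(matches if matches else [(l, None)])
    return out

lt = [{"id": 1, "value": "one"}, {"id": 2, "value": "two"}]
rt = [{"id": 1, "value": "one"}, {"id": 2, "value": "two"}]
on = lambda l, r: l["id"] == r["id"]

# Correct plan: join first, then apply WHERE RT.id > 1 to the joined rows.
# A null right side fails the comparison, so such rows are discarded too.
correct = [(l, r) for (l, r) in left_join(lt, rt, on)
           if r is not None and r["id"] > 1]

# Wrong plan: push RT.id > 1 down to the right source before joining.
pushed = left_join(lt, [r for r in rt if r["id"] > 1], on)

assert len(correct) == 1                          # only the id = 2 row survives
assert len(pushed) == 2 and pushed[0][1] is None  # wrong: id = 1 row reappears
assert pushed != correct
```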

5. Summary

This completes the analysis of the four rules for left outer join queries. As we have seen, filter conditions in SparkSQL outer join queries cannot always be used to filter the data source. Used properly, push-down greatly improves query performance; used improperly, it produces wrong query results, and such errors are not easy to notice — so be very careful when using it.
