Why Not in Subquery is inefficient in Spark SQL and how to avoid it

2025-02-25 Update From: SLTechnology News&Howtos


This article analyzes, from a technical point of view, why Not in Subquery is inefficient in Spark SQL and how to avoid it.

First, take a look at a SQL statement containing a Not in Subquery:

```sql
-- test_partition1 and test_partition2 are Hive external partition tables
select *
from test_partition1 t1
where t1.id not in (select id from test_partition2);
```

The corresponding complete logical and physical plans are:

```
== Parsed Logical Plan ==
'Project [*]
+- 'Filter NOT 't1.id IN (list#3 [])
   :  +- 'Project ['id]
   :     +- 'UnresolvedRelation `test_partition2`
   +- 'SubqueryAlias `t1`
      +- 'UnresolvedRelation `test_partition1`
```

```
== Analyzed Logical Plan ==
id: string, name: string, dt: string
Project [id#4, name#5, dt#6]
+- Filter NOT id#4 IN (list#3 [])
   :  +- Project [id#7]
   :     +- SubqueryAlias `default`.`test_partition2`
   :        +- HiveTableRelation `default`.`test_partition2`, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, [id#7, name#8], [dt#9]
   +- SubqueryAlias `t1`
      +- SubqueryAlias `default`.`test_partition1`
         +- HiveTableRelation `default`.`test_partition1`, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, [id#4, name#5], [dt#6]
```

```
== Optimized Logical Plan ==
Join LeftAnti, ((id#4 = id#7) || isnull((id#4 = id#7)))
:- HiveTableRelation `default`.`test_partition1`, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, [id#4, name#5], [dt#6]
+- Project [id#7]
   +- HiveTableRelation `default`.`test_partition2`, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, [id#7, name#8], [dt#9]
```

```
== Physical Plan ==
BroadcastNestedLoopJoin BuildRight, LeftAnti, ((id#4 = id#7) || isnull((id#4 = id#7)))
:- Scan hive default.test_partition1 [id#4, name#5, dt#6], HiveTableRelation `default`.`test_partition1`, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, [id#4, name#5], [dt#6]
+- BroadcastExchange IdentityBroadcastMode
   +- Scan hive default.test_partition2 [id#7], HiveTableRelation `default`.`test_partition2`, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, [id#7, name#8], [dt#9]
```

As the logical and physical plans above show, when Spark SQL transforms the not in subquery from a logical plan into a physical plan, it ultimately chooses the BroadcastNestedLoopJoin strategy (corresponding to BroadcastNestedLoopJoinExec.scala in the Spark source code).

When it comes to BroadcastNestedLoopJoin, we first have to mention NestedLoopJoin, which is used by many RDBMSs, such as MySQL. It works by reading rows from one table (the outer table) in a loop and, for each outer row, scanning the other table (the inner table, usually indexed) to check which pairs of rows satisfy the join condition — essentially a nested loop.
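The nested loop just described can be sketched in a few lines of plain Python (an illustrative model of the algorithm, not Spark or MySQL code; all names are made up):

```python
def nested_loop_join(outer, inner, cond):
    """Pair every outer row with every inner row and keep the pairs
    that satisfy the join condition -- a plain nested loop."""
    result = []
    for l in outer:          # outer table: read row by row in a loop
        for r in inner:      # inner table: scanned once per outer row
            if cond(l, r):
                result.append((l, r))
    return result
```

The cost is proportional to `len(outer) * len(inner)`, which is why the strategy only works well when at least one side is small.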

Nested Loop Join is a reasonable choice when the joined datasets are small. When a dataset is very large, however, its very principle of operation makes it inefficient, and it may even affect the stability of the whole service.

BroadcastNestedLoopJoin in Spark SQL is similar to NestedLoopJoin, except that one side is broadcast as the build table.

BroadcastNestedLoopJoin is an inefficient physical plan: internally, it broadcasts the result of the subquery (select id from test_partition2), and then every record of test_partition1 loops over the broadcast data to check whether it satisfies the condition. The core logic lives in leftExistenceJoin:

```scala
private def leftExistenceJoin(
    // broadcast data
    relation: Broadcast[Array[InternalRow]],
    exists: Boolean): RDD[InternalRow] = {
  assert(buildSide == BuildRight)
  /* streamed corresponds to the physical plan node:
     Scan hive default.test_partition1 [id#4, name#5, dt#6],
     HiveTableRelation `default`.`test_partition1`,
     org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe,
     [id#4, name#5], [dt#6] */
  streamed.execute().mapPartitionsInternal { streamedIter =>
    val buildRows = relation.value
    val joinedRow = new JoinedRow

    // whether a condition is defined; here it is
    // Some(((id#4 = id#7) || isnull((id#4 = id#7))))
    if (condition.isDefined) {
      // exists determines, according to the joinType, whether a matching
      // row is kept or dropped; here joinType is LeftAnti, so exists is false
      streamedIter.filter { l =>
        buildRows.exists(r => boundCondition(joinedRow(l, r))) == exists
      }
    } else if (buildRows.nonEmpty == exists) {
      streamedIter
    } else {
      Iterator.empty
    }
  }
}
```
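The branch taken here — condition defined, joinType LeftAnti, exists = false — keeps a streamed row only when no build row satisfies `(id#4 = id#7) || isnull((id#4 = id#7))`. That behavior, including SQL's three-valued equality, can be modeled in plain Python (an illustrative sketch, not Spark code; all names are made up):

```python
def sql_eq(l, r):
    """SQL equality: comparing with NULL (None) yields NULL, not False."""
    if l is None or r is None:
        return None
    return l == r

def left_anti_nlj(streamed, build, exists=False):
    """Model of leftExistenceJoin for LeftAnti (exists=False): keep a
    streamed row only when no build row satisfies (l = r) OR isnull(l = r)."""
    kept = []
    for l in streamed:
        # True or NULL both count as a match for this condition
        matched = any(sql_eq(l, r) is not False for r in build)
        if matched == exists:
            kept.append(l)
    return kept
```

Note the NULL handling: a single NULL in the build (subquery) side matches every streamed row, so the anti-join returns nothing — exactly the surprise NOT IN is known for.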

Because BroadcastNestedLoopJoin executes so inefficiently, it may occupy executor resources for a long time and hurt the performance of the whole cluster. At the same time, because the result set of the subquery must be broadcast, a particularly large result set is also a severe test for the driver side and can easily bring a risk of OOM. Therefore, in actual production we should avoid Not in Subquery wherever possible and use other, relatively efficient SQL instead.

The problems mentioned above can be avoided by rewriting the Not in Subquery SQL into a more efficient form. But this usually happens only after we notice that a task runs slowly or even fails, and then dig through its SQL to find the "problem" statement. So how can we "screen" such SQL before the task runs, so as to give an early warning?
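A commonly used rewrite is to turn `not in (subquery)` into a correlated `not exists` (for example, `where not exists (select 1 from test_partition2 t2 where t2.id = t1.id)`) or an explicit `left anti join` on equality, which Spark can execute with hash- or sort-merge-based join strategies instead of BroadcastNestedLoopJoin. Beware that the two forms agree only when the id columns contain no NULLs: `not in` returns nothing at all once the subquery yields a NULL. The plain-Python sketch below (illustrative names, not Spark code) models both filters:

```python
def not_in_filter(stream, sub):
    """SQL `l NOT IN (sub)`: the row passes only when l is not NULL,
    sub contains no NULLs, and l matches no value in sub."""
    return [l for l in stream
            if l is not None and None not in sub and l not in sub]

def not_exists_filter(stream, sub):
    """SQL `NOT EXISTS (... where s = l)`: NULL never equals anything,
    so NULLs on either side simply fail the match."""
    return [l for l in stream
            if not any(l is not None and r is not None and l == r
                       for r in sub)]
```

If the business logic guarantees non-null ids (or filters NULLs out explicitly), the `not exists` / anti-join form returns the same rows while avoiding the nested-loop plan.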

Here the author offers one idea: analyze the Spark SQL plan and, based on Spark SQL's join-strategy selection rules, determine whether a task uses an inefficient Not in Subquery; raise an early warning and ask the business side to modify the query. More generally, when building ETL pipelines over data, similar low-performance SQL should be caught in advance.
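As a minimal sketch of that early-warning idea: the physical plan text (for example, what an `explain` of the query prints) can be scanned for the BroadcastNestedLoopJoin + LeftAnti combination before the job is submitted. This is a deliberately crude string check, assuming the plan is already available as a string; the function name is made up:

```python
def warn_inefficient_not_in(physical_plan: str) -> bool:
    """Flag plans where the optimizer fell back to a broadcast nested
    loop anti-join -- the pattern a `not in (subquery)` produces."""
    plan = physical_plan.lower()
    return "broadcastnestedloopjoin" in plan and "leftanti" in plan
```

A production version would walk the plan tree rather than match strings, but even this check is enough to route a warning back to the query's owner.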

That is why Not in Subquery is inefficient in Spark SQL and how to avoid it. If you have run into similar doubts, the analysis above may help you work through them.


© 2024 shulou.com SLNews company. All rights reserved.
