2025-01-17 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)05/31 Report--
This article introduces the concept of left join and how it is implemented. The content is fairly detailed, and I hope it will be a useful reference.
Join is a very common operation in SQL queries. It comes in several forms, such as join, left join, right join, and full join. The specific principle is shown in the following figure. Of these, left join is the one used most often.
The code in this article has been tested in both MySQL and Hive, and the code itself is neither difficult nor long. I have prepared the MySQL and Hive code for the test data. If you want it, you can reply "left" in the background to get it, so that you can modify it and practice on your own.
Left join in plain terms: the left table is the main table, and all of its rows are returned. If a left-table row has no match in the right table, the row still appears in the result, and the fields of the right table are filled with null. This seems easy to understand, but there are some points that are easy to overlook in practice.
I. How many records are there after a left join
To answer this, it is important to understand the condition under which left join matches rows. In A join B, we specify the key that associates the two tables in the on clause. Whenever the key values are equal, the pair of rows appears in the result. The relationship can be one-to-one, one-to-many, or many-to-many. We use examples to illustrate.
1. One-to-one
This case is the easiest to understand. The t_name table has fields such as id, name (user name), sex (gender), and dt (registration date). The t_age table has fields such as id, age (age), province (province), and dt (update date). The information in the tables is as follows:
Now let's left join t_name (left table, alias a) with t_age (right table, alias b), using id as the associated key. Table a has six records, table b has three records, and the key values are unique. The final result therefore follows table a and has six records. Three of them have no match in table b, and for those rows all the fields from table b are null.
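The one-to-one case can be reproduced with a small sketch. The original test data is not included here, so the rows below are hypothetical values chosen only to match the counts described above (six rows in t_name, three in t_age). The sketch also assumes SQLite, through Python's standard sqlite3 module, in place of MySQL or Hive.

```python
import sqlite3

# Hypothetical data: 6 rows in t_name, 3 rows in t_age, ids unique in both.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t_name (id INTEGER, name TEXT, sex TEXT, dt TEXT);
CREATE TABLE t_age  (id INTEGER, age INTEGER, province TEXT, dt TEXT);
INSERT INTO t_name VALUES
  (1, 'Tom',  'male',   '20190905'),
  (2, 'Amy',  'female', '20190905'),
  (3, 'Bob',  'male',   '20190905'),
  (4, 'Lily', 'female', '20190906'),
  (5, 'Jack', 'male',   '20190906'),
  (6, 'Rose', 'female', '20190907');
INSERT INTO t_age VALUES
  (1, 24, 'Beijing',   '20190905'),
  (2, 30, 'Shanghai',  '20190906'),
  (3, 28, 'Guangdong', '20190906');
""")

# One-to-one: every left row appears once; unmatched rows get NULL from b.
one_to_one = conn.execute("""
SELECT a.id, a.name, b.age, b.province
FROM t_name a
LEFT JOIN t_age b ON a.id = b.id
ORDER BY a.id
""").fetchall()

for row in one_to_one:
    print(row)
```

With this data the result has six rows, and the rows for ids 4, 5, and 6 carry None (NULL) in the columns that come from t_age.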
2. One-to-many
This time we use t_age as the left table, and the association condition is dt. Focus on the record whose dt is 20190905. The right table has three records with dt 20190905, and all three satisfy the association condition, so the final result has three records for 20190905.
This time t_age is the main table, but the result clearly does not have the original 3 records. It has 7: three for 20190905 and four for 20190906. If this is not clear yet, read on.
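A minimal sketch of this case follows, again with hypothetical rows (only id and dt are kept) chosen to match the counts in the text: t_name has three rows dated 20190905, two dated 20190906, and one dated 20190907, while t_age has one row dated 20190905 and two dated 20190906. SQLite via Python's sqlite3 module is an assumption standing in for MySQL or Hive.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t_name (id INTEGER, dt TEXT);
CREATE TABLE t_age  (id INTEGER, dt TEXT);
INSERT INTO t_name VALUES
  (1, '20190905'), (2, '20190905'), (3, '20190905'),
  (4, '20190906'), (5, '20190906'), (6, '20190907');
INSERT INTO t_age VALUES
  (1, '20190905'), (2, '20190906'), (3, '20190906');
""")

# t_age is the left table; each left row is repeated once per matching right row.
per_dt = conn.execute("""
SELECT a.dt, COUNT(*) AS joined_rows
FROM t_age a
LEFT JOIN t_name b ON a.dt = b.dt
GROUP BY a.dt
ORDER BY a.dt
""").fetchall()

total_rows = sum(n for _, n in per_dt)
print(per_dt, total_rows)
```

The three left rows expand to seven joined rows: 1 x 3 for 20190905 and 2 x 2 for 20190906.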
3. Many-to-many
In the example above, the 20190906 records end up with four entries, again because every pair satisfies the association condition. This is a 2-to-2 situation. Now we go back to using t_name as the main table and associate on dt. We can predict that, compared with example 2, the result will have one additional row for 20190907, and that the corresponding fields from table b in that row will be null.
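The prediction can be checked with a short sketch. The rows are the same hypothetical ones as before (chosen to match the counts in the text, not taken from the original data), and SQLite through Python's sqlite3 module is assumed in place of MySQL or Hive. COUNT(b.id) counts only non-NULL values, so it shows how many joined rows actually found a match.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t_name (id INTEGER, dt TEXT);
CREATE TABLE t_age  (id INTEGER, dt TEXT);
INSERT INTO t_name VALUES
  (1, '20190905'), (2, '20190905'), (3, '20190905'),
  (4, '20190906'), (5, '20190906'), (6, '20190907');
INSERT INTO t_age VALUES
  (1, '20190905'), (2, '20190906'), (3, '20190906');
""")

# t_name is the left table again, joined on dt instead of id.
many_to_many = conn.execute("""
SELECT a.dt,
       COUNT(*)    AS joined_rows,   -- all rows produced for this dt
       COUNT(b.id) AS matched_rows   -- rows where the right side is not NULL
FROM t_name a
LEFT JOIN t_age b ON a.dt = b.dt
GROUP BY a.dt
ORDER BY a.dt
""").fetchall()

print(many_to_many)
```

The 20190907 group has one joined row and zero matched rows, which is the NULL-padded row predicted above.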
Examples 2 and 3 show the one-to-many and many-to-many situations, and the former is just a special case of the latter. Here we simply listed all the fields after associating the two tables, but in practice we may need statistics, aggregation, and so on. Before writing the association condition, it is best to think about what the final result will look like: how many rows there may be, whether rows will be counted more than once, which rows may contain null values, which fields may be null, and so on. Don't make mistakes by taking the result for granted. This section is only a starting point. Interested readers can study further in this blog post:
https://www.cnblogs.com/qdhxhz/p/10897315.html
II. The implementation principle of left join
Next let's look more closely at the difference between conditions written in on and conditions written in where. Before we do that, we can look at the execution logic of left join. I referred to a blog post by an expert:
https://developer.aliyun.com/article/718897, summarized as follows.
MySQL processes a left join with a nested loop.
SELECT * FROM LT LEFT JOIN RT ON P1(LT, RT) WHERE P2(LT, RT)
P1 is the on filter condition and P2 is the where filter condition. If either one is missing, it is treated as TRUE. The execution logic of this statement can be described as:
FOR each row lt in LT {                          // traverse every row of the left table
  BOOL b = FALSE;
  FOR each row rt in RT such that P1(lt, rt) {   // traverse the right-table rows that satisfy the join condition
    IF P2(lt, rt) {                              // the where filter condition is satisfied
      t := lt || rt;                             // merge the rows and output the result
    }
    b = TRUE;                                    // lt has a matching row in RT
  }
  IF (!b) {                                      // RT has been traversed and lt has no matching row, so try padding with NULL
    IF P2(lt, NULL) {                            // the NULL-padded row satisfies the where filter condition
      t := lt || NULL;                           // output the row formed from lt and NULL
    }
  }
}
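The nested-loop logic above can be sketched as a small Python function. This only illustrates the pseudocode and is not how MySQL is actually implemented. The function name left_join, the dict-based rows, and the sample data are all hypothetical. Python has no three-valued logic, so the where predicate must itself return False when the right row is None, which mimics how a comparison with NULL fails in SQL.

```python
def left_join(lt_rows, rt_rows, p1=lambda lt, rt: True, p2=lambda lt, rt: True):
    """Nested-loop left join: p1 plays the role of ON, p2 the role of WHERE."""
    out = []
    for lt in lt_rows:                      # traverse every row of the left table
        matched = False                     # corresponds to b in the pseudocode
        for rt in rt_rows:
            if not p1(lt, rt):              # keep only right rows satisfying ON
                continue
            if p2(lt, rt):                  # WHERE is checked on the merged row
                out.append((lt, rt))
            matched = True                  # lt has a matching row in RT
        if not matched and p2(lt, None):    # pad with NULL, then apply WHERE
            out.append((lt, None))
    return out


left = [{"id": 1}, {"id": 2}, {"id": 3}]
right = [{"id": 1, "age": 24}, {"id": 2, "age": 30}]

def same_id(lt, rt):
    return lt["id"] == rt["id"]

# Plain left join on id.
plain = left_join(left, right, p1=same_id)

# Restriction on the right table written in ON: all left rows survive.
in_on = left_join(left, right, p1=lambda lt, rt: same_id(lt, rt) and rt["age"] == 24)

# The same restriction written in WHERE: NULL-padded rows are filtered out.
in_where = left_join(left, right, p1=same_id,
                     p2=lambda lt, rt: rt is not None and rt["age"] == 24)

print(len(plain), len(in_on), len(in_where))
```

With this data, the plain join and the ON version both return three rows, while the WHERE version returns only the row for id 1.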
If you don't understand the code, just look at the conclusions:
1. If you want to restrict the right table, it must be done in the on condition. If it is done in where, data may go missing: left-table rows that have no matching row in the right table will not appear in the final result, which contradicts our understanding of left join. When a left-table row has no matching row in the right table, b=FALSE after the right table has been traversed, so the row is padded with NULL for the right table. But P2 restricts the right-table fields, and if the NULL values do not satisfy P2 (NULL generally fails such restrictions, unless the restriction is IS NULL), the row is not added to the final result, so the result is incomplete.
2. If there is no where condition, then no matter how the on condition restricts the left table, every left-table row produces at least one row in the result. For a left-table row with no matching row in the right table, b=FALSE after the right table has been traversed, so a NULL-padded row is generated, and this row is unwanted. So the left table must be filtered with where.
Let's take a look at the example again, and it may be easier to understand when we come back to study this passage.
1. There is only one on condition
You can look directly at the examples in the first part. The output follows the left table, and where the right table has no match its fields are filled with null, but many-to-many situations may occur.
2. There are two on conditions
The figure above shows the output after adding b.age=24 to the association condition. Because of the restriction on table b, only one of its rows satisfies the condition. But since there is no where condition, the left table still prevails, and because the relationship is one-to-one, the number of output rows is still the number of records in the left table. To take it to the extreme, we can even "empty" table b.
In both cases, no row in table b qualifies, so the result follows the left table and all the fields on the right are null.
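Here is a minimal sketch of a second on condition, assuming SQLite through Python's sqlite3 module and hypothetical rows in which exactly one t_age row has age 24. The original conditions used to "empty" table b are not shown in the text, so b.age = 99 is a stand-in for any condition that no row in table b satisfies.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t_name (id INTEGER, name TEXT);
CREATE TABLE t_age  (id INTEGER, age INTEGER);
INSERT INTO t_name VALUES
  (1, 'Tom'), (2, 'Amy'), (3, 'Bob'), (4, 'Lily'), (5, 'Jack'), (6, 'Rose');
INSERT INTO t_age VALUES (1, 24), (2, 30), (3, 28);
""")

# Extra restriction on the right table inside ON: only id 1 still matches.
on_age_24 = conn.execute("""
SELECT a.id, b.age
FROM t_name a
LEFT JOIN t_age b ON a.id = b.id AND b.age = 24
ORDER BY a.id
""").fetchall()

# A condition no right row satisfies "empties" table b (99 is a stand-in value).
on_no_match = conn.execute("""
SELECT a.id, b.age
FROM t_name a
LEFT JOIN t_age b ON a.id = b.id AND b.age = 99
ORDER BY a.id
""").fetchall()

print(on_age_24)
print(on_no_match)
```

Both queries return all six left-table rows. The first keeps age 24 for id 1 only, and the second has NULL in every right-table column.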
3. The where case
If we write b.age=24 in where, we find that only one row remains in the result, which breaks the rule that a left join is dominated by the left table. Let's also look at what happens when the last two cases are written in where:
Yes, the results are empty. This is because where is executed after on, and none of the rows generated by on satisfies the conditions.
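The same restrictions moved into where can be sketched as follows, under the same assumptions as before: SQLite through Python's sqlite3 module, hypothetical rows, and b.age = 99 as a stand-in for a condition that no right-table row satisfies.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t_name (id INTEGER, name TEXT);
CREATE TABLE t_age  (id INTEGER, age INTEGER);
INSERT INTO t_name VALUES
  (1, 'Tom'), (2, 'Amy'), (3, 'Bob'), (4, 'Lily'), (5, 'Jack'), (6, 'Rose');
INSERT INTO t_age VALUES (1, 24), (2, 30), (3, 28);
""")

# WHERE runs on the joined rows, so NULL-padded rows fail b.age = 24 too.
where_age_24 = conn.execute("""
SELECT a.id, b.age
FROM t_name a
LEFT JOIN t_age b ON a.id = b.id
WHERE b.age = 24
ORDER BY a.id
""").fetchall()

# No joined row has age 99, so nothing survives the filter.
where_no_match = conn.execute("""
SELECT a.id, b.age
FROM t_name a
LEFT JOIN t_age b ON a.id = b.id
WHERE b.age = 99
""").fetchall()

print(where_age_24, where_no_match)
```

Only the row for id 1 survives the first query, and the second query returns no rows at all.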
Here are two conclusions:
1. The on condition is applied while the temporary table is being generated. The records of the left table are returned whether or not the on condition is true.
2. The where condition filters the temporary table after it has been generated. At that point the left join guarantee (that the records of the left table must be returned) no longer applies, and rows that do not satisfy the condition are filtered out.
4. When there is IS NULL or IS NOT NULL
When the condition is written in on:
When the condition is written in where:
Intuitively, the WHERE ... IS NULL clause works on the data produced by the matching phase and filters out the rows that do not satisfy it. When the condition is written in on, IS NULL can be regarded as a negative matching condition and IS NOT NULL as a positive matching condition. When the condition is written in where, it is actually easier to understand: the outcome depends on what the where condition keeps from the joined rows. Readers can work through it using the examples above.
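A short sketch can show both placements side by side. The original queries are not shown in the text, so applying the conditions to b.id is an assumption, as are the hypothetical rows and the use of SQLite through Python's sqlite3 module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t_name (id INTEGER, name TEXT);
CREATE TABLE t_age  (id INTEGER, age INTEGER);
INSERT INTO t_name VALUES
  (1, 'Tom'), (2, 'Amy'), (3, 'Bob'), (4, 'Lily'), (5, 'Jack'), (6, 'Rose');
INSERT INTO t_age VALUES (1, 24), (2, 30), (3, 28);
""")

def run(sql):
    return conn.execute(sql).fetchall()

# IS NULL in ON: a.id = b.id and b.id IS NULL can never both hold,
# so nothing matches and every left row is NULL-padded.
on_is_null = run("""
SELECT a.id, b.age FROM t_name a
LEFT JOIN t_age b ON a.id = b.id AND b.id IS NULL
ORDER BY a.id
""")

# IS NULL in WHERE: keeps only the left rows that found no match.
where_is_null = run("""
SELECT a.id FROM t_name a
LEFT JOIN t_age b ON a.id = b.id
WHERE b.id IS NULL
ORDER BY a.id
""")

# IS NOT NULL in WHERE: keeps only matched rows, like an inner join.
where_is_not_null = run("""
SELECT a.id FROM t_name a
LEFT JOIN t_age b ON a.id = b.id
WHERE b.id IS NOT NULL
ORDER BY a.id
""")

print(on_is_null, where_is_null, where_is_not_null)
```

The where version with IS NULL is the common way to find left-table rows that have no counterpart in the right table.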
III. Two practical cases
After the discussion above, let's look at two cases to further understand and think about the use of left join.
1. Case 1
This case comes from the online blog post mentioned earlier. Link:
https://developer.aliyun.com/article/718897
You can think about how to write the queries before going to the original post to see the answer. In fact, each requirement can easily be written in two ways, and the difference is whether the condition is written in where or in on. The guiding principle is that the result must be neither incomplete nor contain unwanted rows. The condition for requirement 1 needs to be written in on (so that no rows are missing), and the condition for requirement 2 needs to be written in where (so that no unwanted rows remain).
2. Case 2
Suppose you have a user activity table t_active that records each day's active uid and the corresponding active date. We now want to know, for the days that are 0, 1, 2, 3, ... days away from a given day, how many users were active and how many of them are still active on that given day (that is, a retention concept). The expected result is shown in the following table:
The data in the table can be read this way: the number of active users 0 days from 2019-09-29 (that is, on 2019-09-29 itself) is 100, the number 1 day from 2019-09-29 (that is, on 2019-09-28) is 60, and so on.
For this requirement, we can use left join to self-join the table, with the earlier active days as the left table and the final target day as the right table, then calculate the date difference and count the left and right tables separately. The preliminary SQL is as follows (the data is made up by ourselves):
Before you move on, make sure you understand the goal of the requirement and think about it. Is there anything wrong with the SQL above? Can it produce the expected result?
The raw data and the result of running this SQL are as follows:
The result contains a row in which dt and datediff are null. Can you see why? And where dt is not null, the last two columns hold the same values, which is obviously not what we expected. What is the reason for this? Let's look at it step by step.
First of all, there is nothing wrong with the way we use left join. Let's look at the result of select * without any calculation.
As you can see, this is equivalent to the one-to-one association without a where condition described earlier: the result follows the left table, and rows with no match are filled with null. Note that the date of an unmatched row is null, and when a null value takes part in the datediff calculation, the result is also null. This is probably where things start to become clear. Because null values take part in the calculation, the final datediff contains nulls, and when we group by the date difference, the unmatched rows all fall into the null group, so in every other group the counts for the left table and the right table are the same. As shown in the following code:
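The effect can be reproduced with a small sketch. The original query and data are not shown here, so everything below is a hypothetical reconstruction: the rows in t_active, the query shape, and the target day 2019-09-29 used as a literal. SQLite through Python's sqlite3 module is assumed, and since SQLite has no datediff function, the difference of two julianday values stands in for it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t_active (uid TEXT, dt TEXT);
INSERT INTO t_active VALUES
  ('u1', '2019-09-27'), ('u1', '2019-09-28'), ('u1', '2019-09-29'),
  ('u2', '2019-09-28'), ('u2', '2019-09-29'),
  ('u3', '2019-09-28'),
  ('u4', '2019-09-27');
""")

# Self-join: a is every active day, b is the same user's activity on the target day.
# Unmatched rows get b.dt = NULL, so their date difference is NULL as well.
naive = conn.execute("""
SELECT b.dt,
       CAST(julianday(b.dt) - julianday(a.dt) AS INTEGER) AS date_diff,
       COUNT(DISTINCT a.uid) AS left_users,
       COUNT(DISTINCT b.uid) AS right_users
FROM t_active a
LEFT JOIN t_active b
  ON a.uid = b.uid AND b.dt = '2019-09-29'
GROUP BY b.dt, CAST(julianday(b.dt) - julianday(a.dt) AS INTEGER)
ORDER BY date_diff
""").fetchall()

for row in naive:
    print(row)
```

All unmatched users collapse into a single group where both dt and the date difference are NULL, and in every other group the two counts are equal because only matched rows remain there.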
From the results above, we can deduce the output of the initial SQL. For example, for datediff=5 there are two records, and count(distinct uid) is 2 for both the left table and the right table. When datediff is null, the result for the left table is 7 and for the right table is 0. Working through the remaining groups in the same way gives the same output as before. We now know that the root cause of the unexpected result is the null dates. The solution is obviously to fill in the null dates: we can use case when to supply the corresponding date whenever the right table has no matching row. The code is as follows:
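A hedged sketch of the case when fix follows. As before, the rows, the query shape, and the literal target day 2019-09-29 are hypothetical, SQLite through Python's sqlite3 module stands in for MySQL or Hive, and the difference of two julianday values replaces datediff.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t_active (uid TEXT, dt TEXT);
INSERT INTO t_active VALUES
  ('u1', '2019-09-27'), ('u1', '2019-09-28'), ('u1', '2019-09-29'),
  ('u2', '2019-09-28'), ('u2', '2019-09-29'),
  ('u3', '2019-09-28'),
  ('u4', '2019-09-27');
""")

# CASE WHEN replaces the NULL date of unmatched rows with the target day,
# so every left row lands in the group for its real date difference.
fixed = conn.execute("""
SELECT CAST(julianday(CASE WHEN b.dt IS NULL THEN '2019-09-29' ELSE b.dt END)
            - julianday(a.dt) AS INTEGER) AS date_diff,
       COUNT(DISTINCT a.uid) AS active_users,
       COUNT(DISTINCT b.uid) AS still_active_users
FROM t_active a
LEFT JOIN t_active b
  ON a.uid = b.uid AND b.dt = '2019-09-29'
GROUP BY CAST(julianday(CASE WHEN b.dt IS NULL THEN '2019-09-29' ELSE b.dt END)
              - julianday(a.dt) AS INTEGER)
ORDER BY date_diff
""").fetchall()

for row in fixed:
    print(row)
```

COUNT(DISTINCT b.uid) still ignores the NULL uid of unmatched rows, so the third column counts only users who were also active on the target day, while the second column counts everyone active on the earlier day.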
You can see that we finally get the desired result. Taking the last row as an example, it says that, of the users who were active 5 days before 2019-09-29 (that is, on 2019-09-24), two were still active on 2019-09-29. You can check this against the detailed data, and the remaining rows can be read the same way. We used case when to hard-code the date because we know which day we are interested in. In practice the date may be a variable, but it must still resolve to a fixed value for the query, and this needs to be analyzed case by case.
That is all on the concept of left join and its implementation principle. I hope the content above is of some help and lets you learn something new. If you think the article is good, feel free to share it so that more people can see it.