In addition to Weibo, there is also WeChat
Please pay attention
WeChat public account
Shulou
2025-03-26 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Database >
Share
Shulou(Shulou.com)05/31 Report--
This article introduces the knowledge of "what is the difference between in, not in, exists and not exists". In the operation of actual cases, many people will encounter such a dilemma. Next, let the editor lead you to learn how to deal with these situations. I hope you can read it carefully and be able to achieve something!
Usage explanation
For convenience, we create two tables T1 and T2. And add some data respectively. (id is the primary key, name is the normal index)
-- T1 DROP TABLE IF EXISTS `t1`; CREATE TABLE `t1` (`id` int (11) NOT NULL AUTO_INCREMENT, `name` varchar (255) DEFAULT NULL, `address` varchar (255) DEFAULT NULL, PRIMARY KEY (`id`), KEY `idx_t1_ name` (`name`) USING BTREE) ENGINE=InnoDB AUTO_INCREMENT=1009 DEFAULT CHARSET=utf8mb4 INSERT INTO `t1` VALUES ('1001', 'Zhang San', 'Beijing'), ('1002','Li Si', 'Tianjin'), ('1003', 'Wang Wu', 'Beijing'), ('1004', 'Zhao Liu', 'Hebei'), ('1005', 'Jack', 'Henan'), ('1006', 'Tom', 'Henan'), ('1007', 'Bell') 'Shanghai'), ('1008', 'Sun Qi', 'Beijing') -- T2 DROP TABLE IF EXISTS `t2`; CREATE TABLE `t2` (`id` int (11) NOT NULL AUTO_INCREMENT, `name` varchar (255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL, `address` varchar (255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL, PRIMARY KEY (`id`) USING BTREE, INDEX `idx_t2_ name` (`name`) USING BTREE) ENGINE = InnoDB AUTO_INCREMENT = 1014 CHARACTER SET = utf8mb4 COLLATE = utf8mb4_general_ci ROW_FORMAT = Dynamic INSERT INTO `t2` VALUES (1001, 'Zhang San', 'Beijing'); INSERT INTO `t2` VALUES (1004, 'Zhao Liu', 'Hebei'); INSERT INTO `t2` VALUES (1005, 'Jack', 'Henan'); INSERT INTO `t2` VALUES (1007, 'Bell', 'Shanghai'); INSERT INTO `t2`VALUES (1008, 'Sun Qi', 'Beijing'); INSERT INTO `t2` VALUES (1009, 'Cao Cao', 'Wei Guo') INSERT INTO `t2` VALUES (1010, 'Liu Bei', 'Shu'); INSERT INTO `t2` VALUES (1011, 'Sun Quan','Wu'); INSERT INTO `t2` VALUES (1012, 'Zhuge Liang', 'Shu'); INSERT INTO `t2` VALUES (1013, 'Dian Wei', 'Wei')
So, for the current problem, it's easy to use not in or not exists to pick out the part of the data in the T1 table that is more than the T2 table. (of course, the part where T2 is more than T1 does not count)
It is assumed that name is used to match the data.
Select * from T1 where name not in (select name from T2); or use select * from T1 where not exists (select name from T2 where t1.name=t2.name)
The results are all the same.
However, it is important to note that there are differences between not in and not exists.
When using not in, you need to ensure that the matching fields of the subquery are not empty. For example, the name in this table T2 needs to have a non-null restriction. Otherwise, it will cause the entire result set returned by not in to be empty.
For example, I add a piece of data with empty name in the T2 table.
INSERT INTO `t2` VALUES (1014, NULL, 'State of Wei')
At this point, the not in result returns null.
It is also important to understand that the result returned by exists is a boolean value of true or false, not a result set. Because it doesn't care about the specific data returned, it's just that the outer query needs to judge by this Boolean value.
The difference is that when using exists, if the subquery finds the data, it returns true. When using not exists, if no data is found in the subquery, it returns true.
Because the exists subquery does not care about the specific data returned. Therefore, the above statement can be changed to the following
-- name can be changed to any other field in the subquery, such as 1 here. Select * from T1 where not exists (select 1 from T2 where t1.name=t2.name)
In terms of execution efficiency, 1 > column > *. Therefore, select 1 is recommended. (it should be a constant value, to be exact)
In, exists execution flow
1. For in queries, subqueries, such as the T2 table above, are executed first, and then the results obtained from the query and the appearance T1 are Cartesian product, and then filtered by conditions (here, that is, whether name is equal), adding each qualified data to the result set.
Sql is as follows
Select * from T1 where name in (select name from T2)
The pseudo code is as follows:
For (x in A) {for (y in B) {if (condition is true) {result.add ();}
The condition here is actually a comparison of whether the name in the two tables is the same.
2. For exists, it first queries the traversal table T1, and then traverses it each time, and then checks whether the inner table meets the matching condition, that is, to check whether there is data with equal name.
Sql is as follows
Select * from T1 where name exists (select 1 from T2)
The pseudo code is as follows:
For (x in A) {if (exists condition is true) {result.add ();}}
Corresponding to this example, the T1 table is traversed starting with an id of 1001, and then traversing to see if there is an equivalent name in T2.
For example, if Zhang San is stored in the T2 table in id=1001, return true, add this record of Zhang San in T1 to the result set, and continue the next loop. When id=1002, Li Si is not in the T2 table, then return false, do nothing, and continue the next loop. Until the entire T1 table is traversed.
Do you want to go to the index?
In view of the online saying that in and exists do not go to the index, is this the case?
Let's verify it in MySQL 5.7.18. (note the version number.)
Single table query
First, validate the simplest case of a single table. Let's take the T1 table as an example, with id as the primary key and name as the normal index.
Execute the following statements respectively
Explain select * from T1 where id in; explain select * from T1 where id in; explain select * from T1 where name in (Zhang San, Li Si); explain select * from T1 where name in (Zhang San, Li Si, Wang Wu)
Why should I check the number of different id separately? Look at the screenshot.
You will be surprised to find that when id is four values, it also takes the primary key index. When id is five values, the index is not taken. This is very intriguing.
Let's take a look at name's situation.
After the same duty is too much, the index will not be used.
So, I guess this has something to do with the length of the matching field. According to the Chinese character is three bytes to calculate, and programming like to use 2 to the n power of urine, here is about 16 bytes as the demarcation point.
However, with the same data, I went to my server to query (version 5.7.22), and when I found four id values, I didn't go to the index. Therefore, the threshold here is estimated to be 12 bytes.
In any case, this shows that there should be a limit on the byte length of in queries in MySQL. (there is no official exact statement, so it is for reference only)
Multiple tables involve subqueries
We are mainly looking at whether in and exists are indexed when querying the two tables in the current example.
First, execute the following statements, primary key index (id) and normal index (name), whether or not to walk the index under in and not in.
Explain select * from T1 where id in (select id from T2);-1 explain select * from T1 where name in (select name from T2);-2 explain select * from T1 where id not in (select id from T2);-3 explain select * from T1 where name not in (select name from T2);-4
The screenshot is as follows
1. T1 takes the index, T2 goes the index.
one
2. T1 does not walk the index, T2 does not walk the index. (in this case, if the name is changed to a unique index, T1 will also go to the index)
two
3. T1 does not go to the index, T2 goes to the index.
three
4. T1 does not walk the index, T2 does not walk the index.
four
Oh, my God, the result looks like a mess, as if whether to go to the index or not depends on the mood.
However, we found that only in the first case, where the index fields match with the primary key and in is used, both tables are indexed.
Is this a rule or not? It needs to be examined, and let's move on.
Second, the next test, primary key index and ordinary index under exists and not exists situation. Sql is as follows
Explain select * from T1 where exists (select 1 from T2 where t1.id=t2.id); explain select * from T1 where exists (select 1 from T2 where t1.name=t2.name); explain select * from T1 where not exists (select 1 from T2 where t1.id=t2.id); explain select * from T1 where not exists (select 1 from T2 where t1.name=t2.name)
The result is very regular. Let's see.
Have you found that the T1 table will not move the index in any case, while the T2 table will do so if it has an index? Why did this happen?
In fact, the previous section talked about the implementation process of exists, which has already explained the problem.
It is driven by the outer table and traverses in a loop anyway, so it scans the whole table. On the other hand, the inner table can quickly judge whether the current record matches by walking the index.
How efficient is it?
In view of the fact that exists must be more efficient than in on the Internet, let's do a test.
100W and 200W pieces of data were inserted into T1 and T2 respectively.
Here, I use a custom function to loop insert, the statement reference is as follows, (did not extract the table name into a variable, because I did not find a way, embarrassed)
-- input the starting value of id and the amount of data to be inserted. The function returns the number of entries to be inserted. This value should be equal to the size of the data. -- id increments itself and cycle to add data to T1 table. For convenience here, id and name take the same variable, and address is Beijing. Delimiter / / drop function if exists insert_datas1// create function insert_datas1 (in_start int (11), in_len int (11)) returns int (11) begin declare cur_len int (11) default 0; declare cur_id int (11); set cur_id = in_start; while cur_len < in_len do insert into T1 values (cur_id,cur_id,' Beijing'); set cur_len = cur_len + 1; set cur_id = cur_id + 1 End while; return cur_len; end / / delimiter;-- similarly, insert data delimiter / / drop function if exists insert_datas2// create function insert_datas2 (in_start int (11), in_len int (11)) returns int (11) begin declare cur_len int (11) default 0; declare cur_id int (11); set cur_id = in_start into the T2 table While cur_len < in_len do insert into T2 values (cur_id,cur_id,' Beijing'); set cur_len = cur_len + 1; set cur_id = cur_id + 1; end while; return cur_len; end / / delimiter
Before that, clear the data in the table and then execute the function
Select insert_datas1 (110000000)
Do the same for T2, but in order for the two tables to intersect, start with 70W and then insert 200W of data.
Select insert_datas2 (700000 and 2000000)
The actual execution time of the computer at home is 36s and 74s respectively.
For some reason, the computer at home is not as fast as the script that runs in the Docker virtual machine. Harm, just make do with it.
When I get the new money, I'll replace it, hem.
Similarly, execute all the above execution plans and compare them. I'm not going to stick it here.
Which is faster or slower, in or exists?
For convenience, the following two sql are mainly used for comparative analysis.
Select * from T1 where id in (select id from T2); select * from T1 where exists (select 1 from T2 where t1.id=t2.id)
The execution results show that the two sql execute 1.3s and 3.4s respectively.
Note that at this time, the data amount of T1 table is 100W and that of T2 table is 200W.
According to the popular online view of the difference between in and exists
If the two tables of the query are of the same size, there is little difference between in and exists; if one of the two tables is smaller and the other is larger, the larger subquery table uses exists and the smaller subquery table uses in
Corresponding to this is:
When T1 is a small table and T2 is a large table, you should use exists, which is efficient.
When T1 is a large table and T2 is a small table, you should use in, which is more efficient.
And I tested it with actual data, and I overturned the first statement. Because obviously, T1 is a small table, but in is faster than exists.
In order to continue to test this point of view, I swapped the relationship between the inner and outer tables of the two tables and used the T2 large table as the appearance to compare the query.
Select * from T2 where id in (select id from T1); select * from T2 where exists (select 1 from T1 where t1.id=t2.id)
The execution results show that the two sql execute 1.8s and 10.0s respectively.
Isn't it interesting. Can be found
For in, large tables and small tables change the relationship between the inner and outer layers, and the execution time is not much different. One is 1.3s and the other is 1.8s.
For exists, the size table changes the relationship between the inner and outer layers, and the execution time is very different, one is 3.4s and the other is 10.0s, which is twice as slow.
First, compare with the query optimizer dimension.
In order to explore the reasons for this result. I went to see the sql optimized by each of them in the query optimizer.
-- this is written in 5.7. if it is version 5.6, you need to use explain extended. Explain select * from T1 where id in (select id from T2);-- intended to display a warning message. However, when used with explain, the optimized sql is displayed. You need to pay attention to the order of use. Show warnings
-- this is written in 5.7. if it is version 5.6, you need to use explain extended. Explain select * from T1 where id in (select id from T2);-- originally intended to display a warning message. However, when used with explain, the optimized sql is displayed. You need to pay attention to the order of use. Show warnings
The statement we want will be displayed in the result Message.
-- sql select `test`.`t1`.`id`AS `id`, `test`.`t1`.`name`AS `name`, `test`.`t1`.`address`AS `address`test`.`t2`join `test`.`t1` where (`test`.`t2`.`id` = `test`.`t1`.`id`) optimized by message
As you can see, here it converts in to join to execute.
Instead of using on, we use where because when there is only join, the subsequent on can be replaced with where. That is, join on is equivalent to join where.
PS: here we can also find that select * will eventually be converted into specific fields, so you know why we don't recommend using select *.
Similarly, the query with the appearance of the T2 large table also looks at the optimized statement.
Explain select * from T2 where id in (select id from T1); show warnings
We will find that it will also be converted into join.
Select `test`.`t2`.`id`AS `id`, `test`.`t2`.`name`AS `name`, `test`.`t2`.`address`AS `address`test`.`t1`join `test`.`t2`where (`test`.`t2`.`id` = `test`.`t1`.`id`)
Exists's conversion sql is no longer posted here, but it hasn't changed much.
Second, compare with the dimension of execution plan.
Let's compare their differences with the implementation plan dimension.
Explain select * from T1 where id in (select id from T2); explain select * from T2 where id in (select id from T1); explain select * from T1 where exists (select 1 from T2 where t1.id=t2.id); explain select * from T2 where exists (select 1 from T1 where t1.id=t2.id)
The execution results are respectively
one
two
three
four
It can be found that for in, the large table T2 will go the index when it does the outer table or the inner table, and the small table T1 will also go the index when it does the inner table. If you look at their rows column, you can also see that the results of the first two pictures are the same.
For exists, when small table T1 does appearance, T1 full table scan, rows nearly 100W; when large table T2 does appearance, T2 full table scan, rows nearly 200W. This is why T2 is very inefficient when it comes to appearance.
Because for exists, the appearance always performs a full table scan, and of course the less table data, the better.
Final conclusion: the outer layer is large and the inner layer is small, using in. The outer small table and the inner large table are almost as efficient as in and exists (even in is faster than exists, not that exists is more efficient than in as it is said on the Internet).
Which is faster or slower, not in or not exists?
In addition, not in and not exists are measured and compared.
Explain select * from T1 where id not in (select id from T2); explain select * from T1 where not exists (select 1 from T2 where t1.id=t2.id); explain select * from T1 where name not in (select name from T2); explain select * from T1 where not exists (select 1 from T2 where t1.name=t2.name); explain select * from T2 where id not in (select id from T1); explain select * from T2 where not exists (select 1 from T1 where t1.id=t2.id); explain select * from T2 where name not in (select name from T1) Explain select * from T2 where not exists (select 1 from T1 where t1.name=t2.name)
In the case of a small watch making an appearance. For primary keys, not exists is faster than not in. For a normal index, not in and not exists are not much different, and even not in is slightly faster.
In the case of large tables for appearance, not in is faster than not exists for primary keys. For a normal index, not in and not exists are not much different, and even not in is slightly faster.
Students who are interested can try by themselves. Compare the above two dimensions (query optimizer and execution plan).
Nested Loop of join (Nested-Loop Join)
In order to understand why the in here is converted to join, I feel it is necessary to understand the three nested loop connections of join.
1. Simple nested loop join, Simple Nested-Loop Join, SNLJ for short
Join is inner join, inner join, which is a Cartesian product, that is, traversing two tables using double-layer loops.
We know that small tables are usually used as driving tables in sql. Therefore, for the two tables of An and B, if A has fewer result sets, it will be placed in the outer loop as the driver table. Naturally, B loops in the inner layer as a driven table.
Simple nested loops, in the simplest case, are not optimized.
Therefore, the complexity is also the highest, O (mn). The pseudo code is as follows
For (id1 in A) {for (id2 in B) {if (id1==id2) {result.add ();}
2. Index nested loop join, Index Nested-Loop Join, abbreviated as INLJ
You can also see the name, which is matched by the index. The outer table matches the index of the inner table directly, so that there is no need to traverse the entire inner table. By using the index, the matching times of the outer table and the inner table are reduced.
Therefore, this situation requires that the columns of the inner table should have an index.
The pseudo code is as follows
For (id1 in A) {if (id1 matched B.id) {result.add ();}}
3. Block index nested join, Block Nested-Loop Join, abbreviated as BNLJ
Block index nesting joins reduce the number of inner loops by caching the data from the outer table to join buffer, and then matching the data batches in buffer with the data from the inner table.
Take the outer layer loop 100 times as an example, normally you need to read the outer layer data 100 times in the inner layer loop. If every 10 pieces of data are stored in the cache buffer and passed to the inner loop, the inner loop only needs to be read 10 times (100amp 10). This reduces the number of reads of the inner loop.
So, this is converted to join, and index nested loop joins can be used to improve execution efficiency.
This is the end of the introduction of "what's the difference between in, not in, exists, not exists". Thank you for reading. If you want to know more about the industry, you can follow the website, the editor will output more high-quality practical articles for you!
Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.
Views: 0
*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.
Continue with the installation of the previous hadoop.First, install zookooper1. Decompress zookoope
"Every 5-10 years, there's a rare product, a really special, very unusual product that's the most un
© 2024 shulou.com SLNews company. All rights reserved.