Network Security Internet Technology Development Database Servers Mobile Phone Android Software Apple Software Computer Software News IT Information

In addition to Weibo, there is also WeChat

Please pay attention

WeChat public account

Shulou

How to find and delete duplicate rows by MySQL

2025-02-24 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Development >

Share

Shulou(Shulou.com)06/03 Report--

This article introduces MySQL how to find and delete duplicate lines, the content is very detailed, interested friends can refer to it, I hope it can help you.

How to find duplicate rows

The first step is to define what kind of rows are duplicates. In most cases it is simple: they have the same value in one column. This article uses this definition. Perhaps your definition of "duplicate" is more complex than this, and you need to make some changes to sql.

Data samples to be used in this article:

create table test(id int not null primary key, day date not null); insert into test(id, day) values(1, '2006-10-08'); insert into test(id, day) values(2, '2006-10-08'); insert into test(id, day) values(3, '2006-10-09'); select * from test; +----+------------+ | id | day | +----+------------+ | 1 | 2006-10-08 | | 2 | 2006-10-08 | | 3 | 2006-10-09 | +----+------------+

The first two lines have the same value in the day field, so how do I treat them as duplicate lines? Here's a query to look up. Query statements use the GROUP BY clause to group rows with the same field values and then calculate the size of the group.

select day, count(*) from test GROUP BY day; +------------+----------+ | day | count(*) | +------------+----------+ | 2006-10-08 | 2 | | 2006-10-09 | 1 | +------------+----------+

Duplicate rows have group size greater than 1. How do you want to display only duplicate rows, you must use the HAVING clause, such as

select day, count(*) from test group by day HAVING count(*) > 1; +------------+----------+ | day | count(*) | +------------+----------+ | 2006-10-08 | 2 | +------------+----------+

This is the basic trick: group fields based on the same value, and then show groups larger than 1.

Why not use the WHERE clause? Because the WHERE clause filters rows before grouping, the HAVING clause filters rows after grouping.

2. How to delete duplicate rows

A related problem is how to remove duplicate rows. A common task is to keep only one duplicate row and delete the others, and then you can create an appropriate index to prevent future duplicate rows from being written to the database.

Again, the first step is to clarify the definition of duplicate rows. What line are you keeping? The first row, or the row with the largest value for a field? In this article, assume that you want to keep the first row--the row with the smallest id field, meaning you want to delete the other rows.

Perhaps the easiest way is through temporary tables. Especially with MYSQL, there are limitations to not updating a table while selecting in a query statement. For simplicity, only temporary table methods are used here.

Our task is to remove all duplicate rows except those with the smallest id field in the grouping. Therefore, you need to find groups with sizes greater than 1 and rows you want to keep. You can use the MIN() function. The statement here is to create a temporary table and find rows that need to be deleted with Delete.

create temporary table to_delete (day date not null, min_id int not null); insert into to_delete(day, min_id) select day, MIN(id) from test group by day having count(*) > 1; select * from to_delete; +------------+--------+ | day | min_id | +------------+--------+ | 2006-10-08 | 1 | +------------+--------+

With this data, you can start deleting the "dirty data" rows. There are several methods available, each with its pros and cons, but I won't compare them here in detail, just to illustrate the standard methods used in relational databases that support query clauses.

delete from test where exists( select * from to_delete where to_delete.day = test.day and to_delete.min_id test.id How to find duplicates in multiple columns

Someone recently asked me this question: one of my tables has two fields b and c that are associated with fields b and c of two other tables. I want to find rows that have duplicate values in either the b or c fields.

It was hard to understand at first glance, but after conversation I understood that he wanted to create unique indexes for b and c. As mentioned above, finding rows with duplicate values in a field is as simple as grouping with groups and then calculating the size of the group. And finding rows with duplicate fields is easy, just put all fields in the group clause. However, if it is to determine whether the b field is duplicated or the c field is duplicated, the problem is much more difficult. The sample data that the questioner used here

create table a_b_c( a int not null primary key auto_increment, b int, c int ); insert into a_b_c(b,c) values (1, 1); insert into a_b_c(b,c) values (1, 2); insert into a_b_c(b,c) values (1, 3); insert into a_b_c(b,c) values (2, 1); insert into a_b_c(b,c) values (2, 2); insert into a_b_c(b,c) values (2, 3); insert into a_b_c(b,c) values (3, 1); insert into a_b_c(b,c) values (3, 2); insert into a_b_c(b,c) values (3, 3);

Now, you can easily see that there are some duplicate rows in the table, but you can't find two rows with the same binary {b, c}. This is why the problem becomes difficult.

4. Wrong query statements

If you group two columns together, you get different results depending on how you group and calculate the size. The questioner is stuck right here. Sometimes query statements find some duplicate rows and miss others. This is where he uses the query.

select b, c, count(*) from a_b_c group by b, c having count(distinct b > 1) or count(distinct c > 1);

The result returns all rows because CONT(*) is always 1. Why? Because>1 is written in COUNT(). This error is easily overlooked and is in fact equivalent to

select b, c, count(*) from a_b_c group by b, c having count(1) or count(1);

Why not? Because (b > 1) is a Boolean, it is not the result you want at all. what you want is

select b, c, count(*) from a_b_c group by b, c having count(distinct b) > 1 or count(distinct c) > 1;

Returns an empty result. Obviously, because there is no repetition of {b,c}. This guy tried a lot of other OR and AND combinations, one field for grouping, another field for sizing, like this.

select b, count(*) from a_b_c group by b having count(distinct c) > 1; +------+----------+ | b | count(*) | +------+----------+ | 1 | 3 | | 2 | 3 | | 3 | 3 | +------+----------+

No one can find all the duplicate rows. And what's most frustrating is that for some cases, this statement is valid if you mistakenly think it's written that way, whereas for others, you're likely to get the wrong result.

In fact, simply using GROUP BY is not feasible. Why not? Because when you use group by for one field, you scatter the values of another field into different groups. You can see these effects by sorting these fields, just as grouping does. First, sort the b field to see how it is grouped

When you sort (group) the b fields, c's of the same value are grouped into different groups, so you can't use COUNT(DISTINCT c) to calculate the size. Internal functions such as COUNT() only work on the same grouping and cannot do anything about rows in different groupings. Similarly, if the sorting is for the c field, b of the same value will also be divided into different groups, which will not achieve our purpose anyway.

5. several correct methods

Perhaps the easiest way to do this is to find duplicate rows for each field separately and then use UNION to put them together, like this:

select b as value, count(*) as cnt, 'b' as what_col from a_b_c group by b having count(*) > 1 union select c as value, count(*) as cnt, 'c' as what_col from a_b_c group by c having count(*) > 1; +-------+-----+----------+ | value | cnt | what_col | +-------+-----+----------+ | 1 | 3 | b | | 2 | 3 | b | | 3 | 3 | b | | 1 | 3 | c | | 2 | 3 | c | | 3 | 3 | c | +-------+-----+----------+

Output the what_col field to indicate which field is duplicated. Another approach is to use nested queries:

select a, b, c from a_b_c where b in (select b from a_b_c group by b having count(*) > 1) or c in (select c from a_b_c group by c having count(*) > 1); +----+------+------+ | a | b | c | +----+------+------+ | 7 | 1 | 1 | | 8 | 1 | 2 | | 9 | 1 | 3 | | 10 | 2 | 1 | | 11 | 2 | 2 | | 12 | 2 | 3 | | 13 | 3 | 1 | | 14 | 3 | 2 | | 15 | 3 | 3 | +----+------+------+

This method is much less efficient than using UNION, and displays each duplicate row rather than duplicate field values. There is also a way to join the results of nested queries between yourself and group. Writing is more complex, but it is necessary for complex data or for situations with high requirements for efficiency.

select a, a_b_c.b, a_b_c.c from a_b_c left outer join ( select b from a_b_c group by b having count(*) > 1 ) as b on a_b_c.b = b.b left outer join ( select c from a_b_c group by c having count(*) > 1 ) as c on a_b_c.c = c.c where b.b is not null or c.c is not null About MySQL how to find and delete duplicate rows to share here, I hope the above content can be of some help to everyone, you can learn more knowledge. If you think the article is good, you can share it so that more people can see it.

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

Views: 0

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.

Share To

Development

Wechat

© 2024 shulou.com SLNews company. All rights reserved.

12
Report