

How to remove duplicates in MySQL

2025-02-24 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)05/31 Report--

This article describes how to remove duplicate rows in MySQL. The approach is practical, so it is shared here for reference.

The details are as follows:

Deduplication is always a headache, especially when the production data set is large. After all, we are not professional DBAs, so we can only work out something that fits our own situation.

The usual first step is to use a HAVING clause to detect duplicates and then delete them one by one. Don't ask me how to write the HAVING duplicate-detection SQL, you know it. This is fine when there are only a few duplicates. But what if tens of thousands of distinct values are duplicated?

Before that, consider the original duplicate-detection query with HAVING:

SELECT `name`, COUNT(*) AS count FROM sentence GROUP BY `name` HAVING count > 1;

Run it yourself to get a feel: at around 5 million rows, the speed is painfully slow.

Before optimizing its speed, though, let's consider using this statement to drive the deduplication. What data do we need in order to delete duplicate rows? Certainly the id, and then a filter condition, right? So what is missing from the query above? The id. Let's add that field and try again:

SELECT id, `name`, COUNT(*) AS count FROM sentence GROUP BY `name` HAVING count > 1;

The result now shows three fields: id, name, and count (note that with GROUP BY, MySQL returns an arbitrary id from each group). You can run it yourself to see the exact output. Based on this data, we can remove the duplicates.
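As a concrete illustration, here is a minimal runnable sketch of the duplicate-detection query, using SQLite in place of MySQL. The sample rows are invented, and the alias is written `cnt` rather than `count` to avoid any clash with the function name:

```python
import sqlite3

# SQLite stand-in for the MySQL table; the table and column names follow
# the article, the sample rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sentence (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO sentence (name) VALUES (?)",
    [("apple",), ("apple",), ("pear",), ("apple",), ("pear",), ("plum",)],
)

# Same shape as the article's query: one representative id per duplicated
# name, plus the number of copies.
rows = conn.execute(
    "SELECT id, name, COUNT(*) AS cnt FROM sentence "
    "GROUP BY name HAVING cnt > 1"
).fetchall()
print(rows)
```

With the sample rows above, this reports `apple` (3 copies) and `pear` (2 copies), each with one representative id.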

The DELETE is designed like this: delete rows whose id is NOT among the ids we queried, but whose name IS among the names we queried. Since the values differ row by row, the program has to join all the ids with commas, and wrap the name values in quotes separated by commas, before we can use them. For example:

DELETE FROM sentence WHERE id NOT IN (1, 2, 3) AND name IN ('a', 'b', 'c');

If there is too much data, write a script and operate in batches. At this point we can deduplicate, but the speed is still a problem, so next we need to consider how to optimize this SQL and make it faster.
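The whole delete step can be sketched like this, again with SQLite standing in for MySQL and invented sample rows. The sketch keeps the smallest id per duplicated name, and uses parameter placeholders rather than the article's comma-joining, which sidesteps quoting mistakes:

```python
import sqlite3

# SQLite stand-in for the MySQL table; sample rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sentence (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO sentence (name) VALUES (?)",
    [("apple",), ("apple",), ("pear",), ("apple",), ("pear",), ("plum",)],
)

# Keep the smallest id of every duplicated name; everything else with
# that name gets deleted.
dups = conn.execute(
    "SELECT MIN(id), name FROM sentence GROUP BY name HAVING COUNT(*) > 1"
).fetchall()
keep_ids = [row[0] for row in dups]
dup_names = [row[1] for row in dups]

if dups:  # IN () is invalid SQL, so skip the delete when nothing is duplicated
    id_marks = ",".join("?" * len(keep_ids))
    name_marks = ",".join("?" * len(dup_names))
    conn.execute(
        f"DELETE FROM sentence "
        f"WHERE id NOT IN ({id_marks}) AND name IN ({name_marks})",
        keep_ids + dup_names,
    )

remaining = conn.execute(
    "SELECT name, COUNT(*) FROM sentence GROUP BY name ORDER BY name"
).fetchall()
print(remaining)
```

After the delete, each name survives exactly once.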

Since the goal is to speed up the SQL, the first thing that comes to mind is, of course, an index. So let's create one. But on which field? That is the next question.

In principle, pick a field that partitions your name values. For example, my name column stores brand names one by one, and I also have an industry column storing each brand's industry, so I put the index on the industry column. There may be better choices; it depends on your data. Without further ado, here is the optimized SQL:

SELECT id, `name`, COUNT(*) AS count FROM sentence WHERE industry = 'drink' GROUP BY `name` HAVING count > 1;

The running results are as follows (screenshot omitted from the original):

What does the result show? It shows that our index is being used. As for the speed, every reader can imagine the difference. Once done, use the program to join all the ids with commas and wrap the name values in quotes, comma-separated, and then carry out the great task of deduplication. Efficiency improves markedly.
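To check that the index is actually used, inspect the query plan. A minimal sketch with SQLite, whose EXPLAIN QUERY PLAN plays the role of MySQL's EXPLAIN here; the index name idx_industry is invented:

```python
import sqlite3

# SQLite stand-in; MySQL would use EXPLAIN and ALTER TABLE ... ADD INDEX.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sentence (id INTEGER PRIMARY KEY, name TEXT, industry TEXT)"
)
conn.execute("CREATE INDEX idx_industry ON sentence (industry)")

# EXPLAIN QUERY PLAN rows are (id, parent, notused, detail); the detail
# string names the index chosen for the equality filter.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT id, name, COUNT(*) AS cnt FROM sentence "
    "WHERE industry = 'drink' GROUP BY name HAVING cnt > 1"
).fetchall()
print(plan)
```

The plan detail should mention a search on `idx_industry`, confirming the equality filter is served by the index rather than a full table scan.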

However, some readers may have conditions, such as LIKE, that invalidate the index. In that case, simply classify the data and detect duplicates within each category separately. After all the queries finish, use the program to find the duplicates and collect the data needed for deletion.

Appendix: optimizing MySQL random (RAND()) queries

When it comes to random selection, everyone knows the RAND() function, but once the data volume reaches a certain size, you can imagine the query efficiency. So we might as well optimize the scheme.

My optimization is to generate the randomness in the program, and then use LIMIT to fetch the value. The general idea:

First query the number of rows that match the condition, then use PHP's rand() function to pick a random offset within that range, and then query directly.

Example SQL:

SELECT COUNT(*) FROM test WHERE $where; (count the rows that match the condition)

$offset = rand(0, $count - 1); (PHP's rand() is inclusive at both ends, so subtract 1 to avoid an empty result)

SELECT * FROM test WHERE $where LIMIT $offset, 1; (fetch the required row)

Give it a try: at around 5 million rows, this is at least ten times faster than MySQL's RAND() function.
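The three steps can be sketched end to end like this, with SQLite standing in for MySQL; the table, rows, and WHERE condition are all invented:

```python
import random
import sqlite3

# SQLite stand-in for MySQL; table, rows, and the condition are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (id INTEGER PRIMARY KEY, val TEXT)")
conn.executemany(
    "INSERT INTO test (val) VALUES (?)", [(f"row{i}",) for i in range(1000)]
)

where = "id % 2 = 0"  # stand-in for the article's $where

# Step 1: count matching rows.  Step 2: pick a random offset in the program.
# Step 3: LIMIT offset, 1 fetches exactly one row at that offset.
count = conn.execute(f"SELECT COUNT(*) FROM test WHERE {where}").fetchone()[0]
offset = random.randint(0, count - 1)  # randint is inclusive, hence count - 1
row = conn.execute(
    f"SELECT * FROM test WHERE {where} LIMIT ?, 1", (offset,)
).fetchone()
print(row)
```

Each run prints one uniformly chosen matching row, without ever sorting the table by a random value.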

Thank you for reading! This concludes the article on how to remove duplicates in MySQL. I hope the content above was helpful and that you learned something; if you think the article is good, feel free to share it so more people can see it!
