A case study of database deduplication according to specified fields 03/21 Update SLTechnology News&Howtos

A case study of database deduplication according to specified fields

2026-03-21 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Database >

Shulou(Shulou.com)06/01 Report--

This article mainly introduces the database according to the specified field to heavy case analysis, with a certain reference value, friends can refer to the need. I hope you all have a lot to gain after reading this article. Let's take a look at it together.

Requirements: remove duplicate data from a user table according to the name/email/card_num field;

Idea: Use the group by method to query the data after 'de-duplication', store these data in a temporary table, and then store the data of the temporary table in the specified table;

Mistakes and solutions: group by method can only get some fields (de-duplicate specified fields), can not get complete data at one time, but you can get the id of group by result set through max function, and then query all records according to id set.

Test ideas query deduplicated data

SELECT max(id) as id,name,email,card_num FROM users GROUP BY name,email,card_num;

Get id set from deduplicated data

SELECT ID from (SELECT max(id) as id,name,email,card_num FROM users GROUP BY name,email,card_num) as T;

Obtain id set from deduplicated data, and obtain record list from source data

SELECT * from users where id in (SELECT ID from (SELECT max(id) as id,name,email,card_num FROM users GROUP BY name,email,card_num) as T);

The actual method obtains id set from deduplicated data, obtains record list from source data, and stores these list data in a temporary table

create TEMP TABLE tmp_data as SELECT * from users where id in (SELECT ID from (SELECT max(id) as id,name,email,card_num FROM users GROUP BY name,email,card_num) as T);

Store the data in the temporary table into the specified data table, over

insert into users_copy1 select * from tmp_data;

Check whether the detection result is the same as the total number of data after demultiplexing in the first step.

select count(*) from users_copy1;

Test results: 2300 data duplicates in 1.4w data, and the actual running result is 0.7s, which basically meets the current requirements.

Thank you for reading this article carefully. I hope that Xiaobian will share the case analysis content of the database according to the specified fields. At the same time, I hope that everyone will support you a lot. Pay attention to the industry information channel. When you encounter problems, find detailed solutions waiting for you to learn!

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.