Yesterday I helped a friend look into a MySQL data cleanup problem, and it turned out to be quite interesting. He is still working on the actual implementation and it is nearly done, so I am writing it up here for reference.
To keep sensitive information out, the description of the problem may not match the real situation exactly, but the way it was handled is real.
First, the friend had reported yesterday afternoon that he had a table close to 600G in size, and he needed to clean it up so that only the data from the last few months was kept. At that order of magnitude the problem is not one to be solved casually, and it has to be handled very carefully. As a general idea and method I would suggest hot/cold data separation, and broadly speaking there are a couple of ways to play it:
Exchange partition, a highlight feature that swaps the data of a partition with the data of a standalone table, and does so quite efficiently.
Rename table, a powerful tool for archiving data in MySQL that is hard to replicate as cleanly in other commercial databases; a rough sketch of both approaches follows below.
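As a rough illustration only (not what was ultimately done in this case), the two approaches look something like this in MySQL; the table and partition names here are made up for the example:
-- Approach 1: exchange partition (MySQL 5.6+). Swap the data of one
-- partition with an empty archive table of identical structure.
alter table orders exchange partition p201709 with table orders_arch_201709;
-- Approach 2: rename table. Move the whole table aside and put a fresh,
-- empty copy in its place; the renames happen in one atomic statement.
create table test_data_new like test_data;
rename table test_data to test_data_arch, test_data_new to test_data;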
But just to be on the safe side, I said we had better look at the table structure first. When I saw it, the problem turned out to be completely different from what I had expected.
The table's ibd file is about 600G; it is not a partitioned table, and it uses the InnoDB storage engine. There are not many fields either. The data is supposed to be deleted based on the time field update_time.
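For context, here is a hypothetical table definition roughly consistent with that description; the real column list is not given in the post, only id, create_time and update_time are mentioned, and the secondary indexes are omitted:
create table test_data (
  id          bigint not null auto_increment,
  -- ... a handful of business columns, not listed in the post ...
  create_time datetime not null,
  update_time datetime not null,  -- note: no index on this column
  primary key (id)
) engine=InnoDB;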
Looking at the structure, apart from some redundancy in the index design I could not see anything obviously wrong. But set against the amount of storage the data occupies, something about this table felt odd. I wonder whether you have spotted the problem.
The primary key of the table is the auto-increment field id, so to occupy 600G the table ought to hold at least on the order of a hundred million rows. But if you look closely at the value of the auto-increment column, it is only about 1.5 million. The gap is far too large.
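A quick sanity check of this kind costs nothing; for example (a sketch, using the anonymized table name test_data from the rest of the post):
-- compare the auto-increment counter with the reported sizes in one shot
show table status like 'test_data'\G
-- the Auto_increment, Rows, Data_length and Data_free fields of the output
-- are enough to spot a mismatch like the one described here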
To verify this further, I asked my friend to query the amount of data in the table. In the morning he sent me the latest numbers, which confirmed my suspicion.
mysql> select max(Id) from test_data;
+---------+
| max(Id) |
+---------+
| 1603474 |
+---------+
1 row in set (0.00 sec)
The problem became clear. The table holds fewer than 2 million rows, yet it occupies nearly 600G of space. Either the per-row storage ratio is absurdly high, or there is a huge amount of fragmentation.
Following this line of thinking, I felt a certain sense of achievement: the crux of this seemingly big problem is that, unless the rows are stored in some unusual way, 2 million rows is not a lot of data, and cleaning it up should be easy.
My friend found this reasonable after hearing it; from a safety point of view he only needed to mind a few details. But before long he came back with feedback: the table's data had been fragmented, and it still amounted to more than 100G, possibly more. That conflicted with my earlier analysis, or at least the gap should not have been that big: 2 million rows should normally come to well under 1G, yet here there were more than 100G, far beyond what I expected.
mysql> select round(sum(data_length+index_length)/1024/1024) as total_mb,
    ->        round(sum(data_length)/1024/1024) as data_mb,
    ->        round(sum(index_length)/1024/1024) as index_mb
    -> from information_schema.tables where table_name='hl_base_data';
+----------+---------+----------+
| total_mb | data_mb | index_mb |
+----------+---------+----------+
|   139202 |  139156 |       47 |
+----------+---------+----------+
1 row in set (0.00 sec)
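Fragmentation itself can also be read from information_schema: the data_free column reports space that is allocated to the tablespace but currently unused. A sketch, using the same table name as in the query above:
select table_name,
       round(data_free/1024/1024) as free_mb,
       round((data_length+index_length)/1024/1024) as used_mb
from information_schema.tables
where table_name='hl_base_data';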
How to explain this next? I told the friend that, as DBAs, we should not only be proficient in the physical operations but also stay sensitive to what the data itself tells us.
What does that mean here? update_time has no index and id is the primary key, so we can still estimate how the data changes over time.
How to estimate it? If you look carefully, the two sets of numbers he provided were taken about half a day apart, and the auto-increment value differed by roughly 4,000, so the data grows by roughly 10,000 rows per day.
Since we are now 24 days past October 1st, the id around October 1st can be estimated directly at about 1363474.
mysql> select current_date-'20171001';
+--------------------------+
| current_date-'20171001'  |
+--------------------------+
|                       24 |
+--------------------------+
1 row in set (0.00 sec)
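Putting the numbers together gives the boundary directly: roughly 10,000 new rows per day, 24 days since October 1st, and a current maximum id of 1603474:
select 1603474 - 24*10000 as estimated_boundary_id;
-- returns 1363474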
Following this line of thinking, I gave my friend a query to run, and as soon as he checked, the result was not far off my preliminary estimate.
mysql> select id, create_time, update_time from test_data where id=1363474;
+---------+---------------------+---------------------+
| id      | create_time         | update_time         |
+---------+---------------------+---------------------+
| 1363474 | 2017-09-29 10:37:29 | 2017-09-29 10:37:29 |
+---------+---------------------+---------------------+
1 row in set (0.07 sec)
After a small adjustment of this boundary, the rows to delete can be filtered by id. It is still recommended to delete in batches, moving forward in small, quick steps.
The precondition is to take a good backup first, and then gradually automate the process.
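A minimal sketch of such a batched delete, assuming the boundary id 1363474 found above and the anonymized table name; the batch size, the pauses between batches and the surrounding backup and verification steps are left to the operator:
-- delete old rows in small chunks so each transaction stays short
delete from test_data
where id < 1363474
order by id
limit 10000;
-- repeat (e.g. in a loop with a short sleep) until 0 rows are affected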