Thoughts on deleting a large amount of data - Part 1


In a recent article about dropping columns from a table, I left an open question: what should (or might) you do after dropping the columns? Dropping columns is really just a special case of a "massive delete". In this article, I plan to lay a foundation for how we think about massive deletes and their consequences.

Overview

Before you can settle on a solution and a process for a massive delete, you must deal with some strategic (long-term) and tactical (short-term) questions.

At the strategic level, you will have questions like these: why are you deleting the data? What do you hope to gain from it? If you achieve the initial goal, what is the next step of the strategy (if any)? What evidence do you have that the exercise is worth the effort (both human and machine)? Have you thought carefully about whether fixing an old problem might introduce new ones?

At the "war." at the "skill" level, you may ask some details of the workflow you decide to adopt: what resources are there? Do you allow long-term or short-term outages? Interruption of any service is not allowed at all or is not allowed? If the application hierarchy must run during the delete task execution phase, can it reduce some of the functionality or slow down the execution performance? Do you know enough about your system? Have you checked what recent features or enhancements are available in Oracle to help you get your work done safely (and quickly)?

Let's take a look at some ideas from some of the online conversations I've been involved in recently:

Scenario A

A recent post on the OTN forum describes an extreme example of a "massive delete". The users have a simple 4TB heap table holding three years of data, and they now want to move to daily partitions and keep only 15 days of history. When the motivation for a massive delete is clearing out a large volume of historical data, the best strategy is, of course, to design the system with that goal in mind from the start, turning the delete into a simple "drop partition" with almost no overhead.

In this particular case the users are (in my opinion) very lucky because they want to get rid of most of the data and keep only a small fraction. It will take them some time to plan and test all the relevant details (referential integrity, indexes, and so on), but all they really need to do is create an appropriately range-partitioned table, exchange the existing table into it as a single partition, and then start adding a new partition each day. After about 16 days they can drop that exchanged partition and clear out the last three years of data in one step.
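As a rough sketch of that approach (the table and column names here are invented for illustration, and a real system would also have to deal with matching indexes, constraints, and the exact partition boundaries):

    -- big_table is the existing 4TB heap table; created_dt is its date column.
    -- Create an empty, range-partitioned table with the same structure.
    create table big_table_part
    partition by range (created_dt) (
        partition p_history  values less than (date '2016-06-01'),
        partition p_20160601 values less than (date '2016-06-02')
    )
    as
    select * from big_table where 1 = 0;

    -- Exchange the existing table (three years of data) into the history
    -- partition: a data-dictionary operation, not a copy of 4TB of data.
    alter table big_table_part
        exchange partition p_history with table big_table;

    -- Each day, add a partition for the next day's data, for example:
    alter table big_table_part
        add partition p_20160602 values less than (date '2016-06-03');

    -- After roughly 16 days, once 15 daily partitions exist, clear out the
    -- three years of history in one near-zero-cost step:
    alter table big_table_part drop partition p_history;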

Others may not be so lucky. I often see cases where a table holds several years of data, needs to be partitioned by week or month, and must keep two or three years of history. The "swap everything in one go" trick does not apply, yet the cost of deleting or copying several years of data is equally unappealing.

Scenario B

One of the questions I received not long ago was about strategies for deleting a large amount of data. The questioner's previous experience had been to drop all the indexes before running the big delete quickly and then rebuild them afterwards, but in a case they had recently tested the time difference between that method and "just delete it" was very small, so there seemed to be little benefit in adopting the slightly more complex approach (dropping and then rebuilding indexes adds complexity and therefore risk).

This raises an interesting question: how much data counts as "a lot of data"? This person was deleting 25 million rows, which sounds fairly large, but it was only 4% of the table, so relatively speaking it is not that huge. The table was already partitioned, which reduces the risk somewhat; on the other hand it had at least one global unique index, which is a bit of a nuisance, but the server could run the task with a degree of parallelism of 16, so in absolute terms each parallel slave handled only about 1.5 million rows, which is perhaps not really that big after all.

In fact, whichever method was used, the task completed in about 17 minutes 30 seconds. It is worth noting, though, that with the simple delete strategy other users can still use the table while the task runs, because the table remains available for concurrent access. The delete may then take longer because of contention and the read-consistency work generated by user activity (side note: would there be any benefit in deleting one partition at a time, in a particular order?), and there is always the threat of locking and deadlock disasters. How often does this 4% of the data need to be deleted? It corresponds roughly to one month of data out of two years, so it might be a regular monthly clean-up, and perhaps nobody would mind losing access for 15 minutes or so while a "drop / delete / rebuild" takes place. That approach does always bring one benefit: most indexes run more efficiently after being rebuilt following the delete.
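As a sketch of the two options being compared (the table, column, and index names here are hypothetical, not taken from the case above):

    -- Option 1: "just delete it", possibly in parallel.
    alter session enable parallel dml;
    delete /*+ parallel(s,16) */ from sales_history s
    where  batch_date < add_months(sysdate, -24);
    commit;

    -- Option 2: make a secondary index unusable, delete, then rebuild it.
    -- The index is unavailable while this runs, and a unique index that the
    -- application relies on cannot safely be treated this way.
    alter index sales_hist_ix1 unusable;
    alter session set skip_unusable_indexes = true;

    delete from sales_history
    where  batch_date < add_months(sysdate, -24);
    commit;

    alter index sales_hist_ix1 rebuild parallel 16;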

Things to consider

When "massive data deletion" comes to mind, I hope these two examples will let you know what to think about. So before we start "how", let's categorize possible situations and ideas related to them.

I think that in the past I have come across three basic deletion patterns and two reasons for deleting.

The reasons for deleting are simple:

1. Improve performance.

2. Reclaim space - you may want it back for the database as a whole or for a specific tablespace, or ultimately it may be disk space released outside the database.

The common deletion patterns are:

1. Delete data from the table based on the time it arrived.

2. Delete data from the table based on the time its processing was completed.

3. Remove a whole class of data from the table (which may mean ending up with two tables, possibly list-partitioned, possibly not partitioned at all).

Once we know the "why", we can ask some key questions: will deleting the data actually improve performance, or could we achieve the same effect some other way (by improving the indexing, for example)? Will the space released by the delete be immediately reusable, or is there further work to do before it can be reclaimed? What are the negative side effects of the delete, and of any further steps we might take? Do we have a realistic test platform on which we can verify the predicted downtime, rehearse the tasks, and catch any unpredicted side effects?

Understanding the pattern of the data is important, but it is often overlooked when working with a database. When you delete data you free up space in table blocks and index blocks, and that space may be reused as new data arrives. However, because the space freed in the table this way means the new data ends up physically distributed differently from the rest of the existing data, over time (a) some queries may become much less efficient, and (b) the optimizer may decide an index is no longer the best choice, because the change in data distribution changes the index's "clustering_factor".
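One simple way to keep an eye on this effect (a generic data-dictionary query, assuming a table called BIG_TABLE and reasonably fresh statistics, not something from the original article) is to compare each index's clustering_factor with the size of the table before and after a delete/re-insert cycle:

    select  i.index_name,
            i.clustering_factor,
            t.blocks       as table_blocks,
            t.num_rows
    from    user_indexes  i
    join    user_tables   t on t.table_name = i.table_name
    where   i.table_name = 'BIG_TABLE'
    order by i.index_name;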

I listed the three main deletion patterns in order of the threat they pose to performance. If you assume this is the first time you have done a big delete, these patterns are the easiest to think about; sometimes the threat only shows up after you have been through several deletion cycles. If you delete data according to its original arrival date, you will probably leave a lot of empty blocks at the start of the table segment (the first few extents), which means the newly arriving data is likely to be inserted into a group of blocks at the start of the segment rather than at the end. To be concrete: if you have a table of 100,000 blocks and you have just deleted the data from the first 5,000 blocks, the next few hundred thousand rows will be inserted into blocks 1 to 5,000 rather than blocks 100,001 to 105,000; the absolute position of the data in the table changes, but the pattern of the data does not.
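If you want to watch this happening, a crude check (again with hypothetical table and column names) is to map the rowids of newly inserted rows back to their block numbers:

    -- Where did "today's" rows land? If they map to low block numbers in a
    -- large table, they are reusing space freed by an earlier delete.
    select  min(dbms_rowid.rowid_block_number(rowid))            as lowest_block,
            max(dbms_rowid.rowid_block_number(rowid))            as highest_block,
            count(distinct dbms_rowid.rowid_block_number(rowid)) as blocks_used
    from    big_table
    where   created_dt >= trunc(sysdate);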

If the deletion is based on a "processing complete" date, the pattern after the first delete may look different: perhaps the first 1,000 blocks end up completely empty, the next 1,000 blocks drop to 20% usage, the next 2,000 blocks drop to 40%, and the usage of the next few thousand blocks only drops to about 70%. Over time, new data will be scattered across far more blocks than it used to be (and some of the blocks you have partly emptied may not become available for reuse until the next big delete). Without a real application in front of you it is hard to imagine why anyone's data would show this kind of "decay" pattern under a big delete - but think of an application handling loan agreements with 1, 2, 3, or 5 year terms.

In the last pattern - removing an entire category of data - "loans" may again be a good example. For some reason we decide to create a separate table for the 5-year loans, perhaps because they have become a significant part of the business, so we have to delete them from the current loans table. This, of course, is the pattern where you may have deleted something like 10% to 30% of the data from every block in the table. We may then find that none of those blocks reports any usable free space, or we may find that over the next nine months we insert a few rows into every block in the table, and people start complaining that "queries against 2016 data perform very badly".
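A minimal sketch of splitting off a whole class of data (names invented; the 5-year term is assumed here to be stored as 60 months) is simply a copy followed by a delete - and it is exactly that delete which leaves partially emptied blocks scattered throughout the original table:

    create table loans_5yr
    as
    select *
    from   loans
    where  loan_term = 60;

    delete from loans
    where  loan_term = 60;
    commit;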

Indexes

Of course, when we examine data patterns we should also consider the patterns (and side effects) in the indexes. Even in the scenario where we delete every row from a small number of adjacent table blocks - which means the table side of the delete is efficient - we still need to think about what happens to each index on the table. A very compact delete in the table can turn into a very scattered delete in an index; the random I/O (reads by the session, writes by the database writer) may take a very long time and may give us no subsequent space or performance benefit at all.

Consider deleting the data for 1st April 2001 from a stock prices table: all of those rows arrived together, so we may be emptying a few hundred consecutive blocks in the table. If we have an index on (quote_date, stock_code) we will also be emptying a few hundred consecutive blocks in that index, and if that is the index driving the delete it will not generate much I/O. If instead we have an index on (stock_symbol, quote_date), we will probably have to visit thousands of index leaf blocks to delete the individual index entries. The delete can then be very slow because of the large volume of random I/O it has to perform; one of the commonest complaints about slow inserts and deletes on OTN is the "db file sequential read" wait. The execution plan does not show the cost of index maintenance, so it is easy to forget that a big delete can generate a huge amount of slow, random I/O. Interestingly, SQL Server will tell you which indexes are maintained by a delete.
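To make the contrast concrete, here is a sketch with invented names. Both indexes have to be maintained by the same delete, but only one of them is maintained "locally":

    create index spx_date_code   on stock_prices (quote_date, stock_code);
    create index spx_symbol_date on stock_prices (stock_symbol, quote_date);

    -- The deleted rows, and their entries in spx_date_code, are packed into
    -- a few hundred adjacent blocks; but each deleted row also forces a
    -- visit to a (probably different) leaf block of spx_symbol_date,
    -- generating a large volume of random single-block I/O.
    delete from stock_prices
    where  quote_date = date '2001-04-01';
    commit;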

The fact that index maintenance can have such a large effect on a big delete - and leave lasting after-effects - is well worth thinking about, to the point where we might design a strategy that treats the indexes on a single table differently depending on each index's definition and how it is actually used. For a given table we might drop (or mark unusable) and rebuild some indexes, while leaving others in place and rebuilding or coalescing them after the delete.
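A sketch of what such a per-index strategy might look like around a single big delete (index names invented; the right mix depends entirely on how each index is defined and used):

    -- An index we can afford to take offline: mark it unusable beforehand
    -- and rebuild it afterwards, so the delete does not maintain it at all.
    alter index orders_ix_status unusable;
    alter session set skip_unusable_indexes = true;

    -- ... run the big delete here ...

    alter index orders_ix_status rebuild online;

    -- An index that must stay available throughout (it enforces uniqueness
    -- or supports critical queries): leave it in place and just tidy it up
    -- after the delete.
    alter index orders_ix_cust coalesce;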

Summary

Massive deletes are not an everyday operation and should not be carried out without careful thought. Why are we deleting a large amount of data? Do we know how we will verify that we have reached the intended goal once the delete is complete? What methods can we use to minimise the impact of the delete, and what else must we do after it finishes to achieve the final goal?

How efficient the delete can be, and how much effort it will take, is determined by the pattern of the data we want to delete - first in the table, and then (perhaps more importantly) in the indexes.

In the next instalment we will look at some of the technical details of "massive" deletes.

Translator: Yang Yuhang. Original author: Jonathan Lewis. Original article: https://www.red-gate.com/simple-talk/sql/oracle/massive-deletes-part-1/
