Thoughts on deleting a large amount of data - Part 2

2025-04-05 Update From: SLTechnology News&Howtos > Database


Shulou(Shulou.com)06/01 Report--

In part 1 of this short series I gave an informal description of two scenarios in which we might do a massive delete from a table. Without a concrete example it is hard to appreciate how much the nature of the data being deleted, and the access paths available, can affect system performance, so I am going to spend most of this article on the data sets generated for the two tests discussed here. The article may look long, but a fair amount of the space is taken up by tables of figures.

Simple data set

As hardware grows in power and size it becomes harder and harder to agree on what "large table" or "massive delete" actually means: one million rows seems large to one person and trivial to another.

I will compromise on 10 million rows, representing an investment system that has grown at a rate of 1 million rows per year for 10 years and has reached a segment size of 1.6GB.
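As a rough sanity check on those headline figures, the arithmetic below works back from the block count reported later in the article; the 8KB block size is my assumption (the common Oracle default), not something stated in the text:

```python
# Sanity-check the headline figures: 10M rows in ~204,078 blocks works
# out to ~49 rows per block and a segment of roughly 1.6GB.
rows = 10_000_000
blocks = 204_078          # total blocks reported by the summary query below
block_size = 8192         # bytes; assumed default Oracle block size

rows_per_block = rows / blocks
segment_gb = blocks * block_size / 1e9

print(f"{rows_per_block:.1f} rows/block")   # 49.0
print(f"{segment_gb:.2f} GB")               # 1.67
```

The figure is a little over 1.6GB because the blocks are only 90% packed (see the pctfree discussion later in the article).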

Of course, this table is only one of several that make up the whole system, and at some point we will have to worry about the rest of the data, but for the moment I am going to think about this one table only — just the table itself and the four indexes on it.

Here is the code to generate the dataset:

execute dbms_random.seed(0)

create table t1 (
        id              not null,
        date_open, date_closed,
        deal_type, client_ref,
        small_vc, padding
)
nologging
as
with generator as (
        select  /*+ materialize cardinality(1e4) */
                rownum id
        from    dual
        connect by
                rownum <= 1e4
)
...

And here is the query I use to summarise how the rows are distributed across the blocks of the table:

select
        rows_in_block,
        count(*)                                                        blocks,
        rows_in_block * count(*)                                        row_count,
        sum(count(*)) over (order by rows_in_block)                     running_blocks,
        sum(rows_in_block * count(*)) over (order by rows_in_block)     running_rows
from    (
        select
                dbms_rowid.rowid_relative_fno(rowid),
                dbms_rowid.rowid_block_number(rowid),
                count(*)        rows_in_block
        from
                t1
--      where   date_open   >= add_months(sysdate, -60)
--      where   date_open   <  add_months(sysdate, -60)
--      where   date_closed >= add_months(sysdate, -60)
--      where   date_closed <  add_months(sysdate, -60)
--      where   substr(client_ref,2,1) >= 'F'
--      where   substr(client_ref,2,1) <  'F'
        group by
                dbms_rowid.rowid_relative_fno(rowid),
                dbms_rowid.rowid_block_number(rowid)
        )
group by
        rows_in_block
order by
        rows_in_block
;

You will notice that I have six commented-out predicates in this query (three complementary pairs). The basic purpose of the query is to let me summarise how many blocks hold how many rows, but each pair of predicates lets me get some idea of the effect of each scenario — one predicate of each pair tells me something about the volume and pattern of the data that will be deleted. Here is the output of the query as run in SQL*Plus:

Rows per block     Blocks         Rows  Running blocks   Running rows
--------------  ---------  -----------  --------------  -------------
            27          1           27               1             27
            49    203,877    9,989,973         203,878      9,990,000
            50        200       10,000         204,078     10,000,000
                ---------
sum               204,078

The following output shows what the remaining data will look like if we delete the rows that were opened more than five years ago (that is, running the summary query with the predicate date_open >= add_months(sysdate, -60) to describe the rows that survive):

Rows per block     Blocks         Rows  Running blocks   Running rows
--------------  ---------  -----------  --------------  -------------
            27          1           27               1             27
            42          1           42               2             69
            49    102,014    4,998,686         102,016      4,998,755
                ---------
sum               102,016

That's pretty good — roughly speaking we have emptied half of the table and left the other half untouched. If we wanted to reclaim the space now we would only have to copy the surviving half of the table into the emptied half; we would generate a large volume of undo and redo in the process, but the clustering factor of each index would probably not change at all. Alternatively, if we decided to leave the empty space in place, any new data would start to fill it very efficiently (almost as if it were freshly allocated space), and again we would see no change in any clustering factor. Compare this with the result of deleting all the rows that were closed more than five years ago (that is, what we would see with the predicate date_closed >= add_months(sysdate, -60) describing the surviving rows). The picture is very different:
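The two reports above are internally consistent, which is worth checking when transcribing figures like these; a minimal sketch using the reported rows-per-block distributions:

```python
# Cross-check the block-population reports: each "Rows" figure should be
# rows_per_block * blocks, and the totals should match the 10M-row table
# and the roughly 50% of rows that survive the delete.
full_table = {27: 1, 49: 203_877, 50: 200}     # rows_per_block: blocks
after_delete = {27: 1, 42: 1, 49: 102_014}

total_rows = sum(rpb * b for rpb, b in full_table.items())
surviving_rows = sum(rpb * b for rpb, b in after_delete.items())

print(total_rows)                  # 10000000
print(surviving_rows)              # 4998755 -- just under half the rows
print(sum(after_delete.values()))  # 102016 blocks still holding data
```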

Rows per block     Blocks         Rows  Running blocks   Running rows
--------------  ---------  -----------  --------------  -------------
             1          5            5               5              5
             2         22           44              27             49
             3        113          339             140            388
             4        281        1,124             421          1,512
             5        680        3,400           1,101          4,912
             6      1,256        7,536           2,357         12,448
             7      1,856       12,992           4,213         25,440
             8      2,508       20,064           6,721         45,504
             9      2,875       25,875           9,596         71,379
            10      2,961       29,610          12,557        100,989
            11      2,621       28,831          15,178        129,820
            12      2,222       26,664          17,400        156,484
            13      1,812       23,556          19,212        180,040
            14      1,550       21,700          20,762        201,740
            15      1,543       23,145          22,305        224,885
            16      1,611       25,776          23,916        250,661
            17      1,976       33,592          25,892        284,253
            18      2,168       39,024          28,060        323,277
            19      2,416       45,904          30,476        369,181
            20      2,317       46,340          32,793        415,521
            21      2,310       48,510          35,103        464,031
            22      2,080       45,760          37,183        509,791
            23      1,833       42,159          39,016        551,950
            24      1,696       40,704          40,712        592,654
            25      1,769       44,225          42,481        636,879
            26      1,799       46,774          44,280        683,653
            27      2,138       57,726          46,418        741,379
            28      2,251       63,028          48,669        804,407
            29      2,448       70,992          51,117        875,399
            30      2,339       70,170          53,456        945,569
            31      2,286       70,866          55,742      1,016,435
            32      1,864       59,648          57,606      1,076,083
            33      1,704       56,232          59,310      1,132,315
            34      1,566       53,244          60,876      1,185,559
            35      1,556       54,460          62,432      1,240,019
            36      1,850       66,600          64,282      1,306,619
            37      2,131       78,847          66,413      1,385,466
            38      2,583       98,154          68,996      1,483,620
            39      2,966      115,674          71,962      1,599,294
            40      2,891      115,640          74,853      1,714,934
            41      2,441      100,081          77,294      1,815,015
            42      1,932       81,144          79,226      1,896,159
            43      1,300       55,900          80,526      1,952,059
            44        683       30,052          81,209      1,982,111
            45        291       13,095          81,500      1,995,206
            46        107        4,922          81,607      2,000,128
            47         32        1,504          81,639      2,001,632
            48          3          144          81,642      2,001,776
            49    122,412    5,998,188         204,054      7,999,964
                ---------
sum               204,054

In this case about 60% of the blocks still hold 49 rows each, while the rest of the blocks in the table have suffered anything from barely any deletions to being completely emptied. (If you compare the total number of blocks in this report with the total in the first output, you will notice that 24 blocks must now be completely empty.) So how many blocks are now available for inserts?

Here's a quick calculation: most of our blocks held 49 rows and were therefore 90% full (the default pctfree is 10), so a block drops to the 75% mark (the point at which ASSM flags it as having free space) when it falls below 41 rows (49 * 75 / 90). Out of our 204,000-odd blocks, about 75,000 meet this criterion (check the running total of blocks against the 40 rows-per-block line).
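The quick calculation above can be spelled out step by step, using figures taken from the two query outputs:

```python
# The arithmetic behind the "blocks available for insert" estimate,
# using figures from the two block-population reports.
total_blocks_before = 204_078   # blocks in the full table
total_blocks_after  = 204_054   # blocks still holding rows after the delete
untouched_blocks    = 122_412   # blocks still holding 49 rows each

# Completely emptied blocks no longer appear in the second report:
print(total_blocks_before - total_blocks_after)           # 24

# Share of surviving blocks still packed at 49 rows per block:
print(round(100 * untouched_blocks / total_blocks_after)) # 60

# With pctfree 10 a 49-row block is 90% full, so ASSM's 75% threshold
# corresponds to 49 * 75 / 90 rows:
print(49 * 75 / 90)             # 40.83... i.e. fewer than 41 rows

# Running total of blocks at the 40-rows-per-block line of the report:
print(74_853)                   # the ~75,000 blocks that qualify
```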

Index space

The previous section showed some simple SQL that reports how the free space in the table (or, equally, the retained data) will be distributed — can we do something similar for the indexes? The answer is yes, but the code that answers the question "what will the index look like after deleting the data matching predicate X" is much more expensive to run than the code for the table. First, here is a simple piece of code to check the current content of an index:

select
        rows_per_leaf, count(*) leaf_blocks
from    (
        select
                /*+ index_ffs(t1(client_ref)) */
                sys_op_lbid(94255, 'L', t1.rowid)       leaf_block,
                count(*)                                rows_per_leaf
        from
                t1
        where
                client_ref is not null
        group by
                sys_op_lbid(94255, 'L', t1.rowid)
        )
group by
        rows_per_leaf
order by
        rows_per_leaf
;

The call to sys_op_lbid() takes a table rowid as one of its inputs and returns something that looks like the rowid of the first row of a block, where the block address is the address of the index leaf block holding the index entry for the supplied table rowid. The other two parameters are the object_id of the index (or of the partition or sub-partition if the index is partitioned) and a flag indicating the particular use of the function — in this case 'L'. The hint forcing an index fast full scan of the target index is necessary — any other path could return incorrect results — as is the "client_ref is not null" predicate, which ensures the query is legally able to use the index_ffs path.

For my initial dataset the index held 448 index entries in every leaf block except one (presumably the last, with 192 rows). Even though this is a simple query, it has to be tailored to the requirements of each index — because the index fast full scan is needed to get correct results — which means we have to do something slightly unusual to see how our massive delete will affect each index. Here is an example showing how we can find out how deleting the rows opened more than five years ago would affect the client_ref index:

select
        rows_per_leaf,
        count(*)                        blocks,
        rows_per_leaf * count(*)        row_count,
        sum(count(*)) over (order by rows_per_leaf)                     running_blocks,
        sum(rows_per_leaf * count(*)) over (order by rows_per_leaf)     running_rows
from    (
        select
                /*+ leading(v1 t1) use_hash(t1) */
                leaf_block, count(*) rows_per_leaf
        from    (
                select
                        /*+ no_merge index_ffs(t1(client_ref)) */
                        sys_op_lbid(94255, 'L', t1.rowid)       leaf_block,
                        t1.rowid                                rid
                from
                        t1
                where
                        client_ref is not null
                ) v1,
                t1
        where
                t1.rowid = v1.rid
        and     date_open < add_months(sysdate, -60)
        group by
                leaf_block
        )
group by
        rows_per_leaf
order by
        rows_per_leaf
;

As you can see, we start with a (non-mergeable) inline view that attaches the index leaf-block id to the rowid of every row in the table, then join this set of rowids back to the table by rowid, forcing a hash join. I've hinted the hash join because it is (probably) the most efficient strategy, but although I've included a leading() hint I haven't included a hint about swapping (or not swapping) the join inputs — I'll let the optimizer decide which of the two data sets is smaller and therefore the more appropriate one to use as the build table.

In this particular case the optimizer could use an index-only access path to find all the rowids where date_open was more than five years old. Even so (partly because my pga_aggregate_target was relatively small and the hash join spilled to (solid state) disc), the query took 3 minutes and 15 seconds, whereas the earlier query ran in about 1.5 seconds with the whole index cached. Here is an excerpt of the output:

Rows per leaf      Blocks         Rows  Running blocks   Running rows
-------------  ----------  -----------  --------------  -------------
...
          181           2          362               3            458
          186           2          372               5            830
          187           2          374               7          1,204
          188           1          188               8          1,392
...
          210         346       72,660           2,312        474,882
          211         401       84,611           2,713        559,493
...
          221         808      178,568           8,989      1,921,410
          222         851      188,922           9,840      2,110,332
          223         832      185,536          10,672      2,295,868
...
          242         216       52,272          21,320      4,756,575
          243         173       42,039          21,493      4,798,614
          244         156       38,064          21,649      4,836,678
...
          265           1          265          22,321      5,003,718
          266           1          266          22,322      5,003,984

We would have to modify 22,322 leaf blocks — that is, every leaf block in the index — and the number of rows to be deleted from a single leaf block ranges from 1 to 266. I've picked out only a few lines from the 83 lines of output, but you can probably still see that the pattern appears to follow a normal distribution centred on 222 (roughly 50% of the entries in a full leaf block).
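Two quick checks on these figures, sketched in Python. First, 448 entries per leaf block plus one short block of 192 accounts for exactly one index entry per table row, and gives the 22,322 leaf blocks seen in the running total. Second, assuming (as is true for this randomly generated data set) that client_ref values are independent of date_open, the number of deletions hitting a given leaf block is roughly binomial, which is why the distribution looks normal:

```python
import math

# Check 1: 448 entries per leaf block, plus one short block of 192,
# should account for exactly one index entry per table row.
entries_per_leaf = 448
rows = 10_000_000
full_leaves = (rows - 192) // entries_per_leaf
assert full_leaves * entries_per_leaf + 192 == rows
print(full_leaves + 1)   # 22322 leaf blocks, matching the running total

# Check 2: with deletions spread randomly across client_ref values, the
# delete count per 448-entry leaf is roughly binomial(n=448, p=fraction
# of rows deleted), hence the near-normal shape of the distribution.
p = 5_003_984 / rows     # final running total of deleted rows / total rows
mean = entries_per_leaf * p
std = math.sqrt(entries_per_leaf * p * (1 - p))
print(round(mean))       # 224, close to the observed centre of 222
print(round(std))        # 11, consistent with the observed spread
```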

If we delete the data this way we know we will spend a lot of effort updating the index; but even then, the simple count of rows deleted per leaf block does not tell us everything about the work involved. We don't know whether we will delete all 266 index entries from a leaf block in one pass before moving on to the next block, or whether we will jump around the index in an apparently random fashion, revisiting blocks over and over again to delete one index entry at a time. So in the next instalment we will examine which aspects of the workload need to be considered, and how different deletion strategies can have a significant impact on that workload.

Translator: Tang Jianyuan. Author: Jonathan Lewis. Original article: https://www.red-gate.com/simple-talk/sql/oracle/massive-deletes-part-2/



