Fault diagnosis case of an Oracle IO performance problem

2026-02-13 Update. From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

A business system had serious IO performance problems during daytime business hours. The following is the AWR report for the afternoon business peak (3:00 to 5:00 pm).

Judging by the wait events, the problem is mainly IO related.

From the above, apart from the logical reads of a few SQL statements, physical reads are not high, and the redo generated per second and the physical reads per second are also low.

Check disk IO

rx6600-1:[/]# sar -d 1 10

HP-UX rx6600-1 B.11.23 U ia64    07/15/14

16:18:45   device   %busy   avque   r+w/s  blks/s  avwait  avserv
16:18:46  c39t0d3  100.00    0.50      18    1130    0.00  132.11
          c41t0d3   83.50               6     450    0.00  290.78
16:18:47   c3t0d0    0.99    0.50       2      63    0.00    7.12
          c39t0d3   91.09    0.50      10     982    0.00  115.53
          c41t0d3  100.00    0.50      12     586    0.00  291.67
16:18:48   c3t0d0    3.03    0.50       2      32    0.00   15.93
          c39t0d3  100.00    0.50       9    1034    0.00  139.76
          c41t0d3   92.93    0.50       7     388    0.00  310.07
16:18:49   c3t0d0    2.00    0.50       4      64    0.00   19.59
          c39t0d3  100.00    0.50      12    1088    0.00  127.33
          c41t0d3   86.00    0.50       8     416    0.00  251.32
16:18:50   c3t0d0    1.01    0.50       1       2    0.00    8.99
          c39t0d3  100.00    0.50      16     954    0.00  117.10
          c41t0d3  100.00    0.50       9     614    0.00  295.52
16:18:51   c3t0d0    0.99    0.50              18    0.00   10.60
          c39t0d3   93.07    0.50      17     913    0.00  110.59
          c41t0d3  100.00    0.50       9     350    0.00  326.92
16:18:52  c39t0d3  100.00    0.50      21    1168    0.00  127.22
          c41t0d3   88.00    0.50      11     544    0.00  252.08
16:18:53   c3t0d0    2.02    0.50       3      48    0.00   18.51
          c39t0d3   88.89    0.50      19    1164    0.00   98.25
          c41t0d3  100.00    0.50      11     630    0.00  324.39
16:18:54   c3t0d0    3.00    0.50       3      20    0.00   12.39
          c39t0d3   95.00    0.50      20     954    0.00  131.90
          c41t0d3   81.00    0.50       9     610    0.00  289.05
16:18:55   c3t0d0    9.00    0.50      11     134    0.00    8.62
          c39t0d3  100.00    0.50      19    1090    0.00  137.20
          c41t0d3  100.00    0.50      11     512    0.00  327.16

Average   c39t0d3   99.50    0.50      16    1048    0.00  123.38
Average   c41t0d3  100.00    0.50       9     510    0.00  296.44
Average    c3t0d0    2.20    0.50       3      37    0.00   12.28

rx6600-1:[/]# sar -d 1 10

HP-UX rx6600-1 B.11.23 U ia64    07/15/14

16:20:04   device   %busy   avque   r+w/s  blks/s  avwait  avserv
16:20:05   c3t0d0    1.00    0.50       1      16    0.00    8.33
          c39t0d3   98.00    0.50      16     928    0.00  114.86
          c41t0d3   98.00    0.50      10     684    0.00  266.43
16:20:06   c3t0d0    1.98    0.50       4      81    0.00    8.57
          c39t0d3   93.07    0.50      19    1251    0.00  128.81
          c41t0d3   91.09    0.50       6     475    0.00  365.83
16:20:07   c3t0d0    2.00    0.50       3      48    0.00    5.87
          c39t0d3   98.00    0.50      23    1216    0.00  113.66
          c41t0d3   98.00    0.50       8     576    0.00  307.92
16:20:08   c3t0d0    1.00    0.50       2      32    0.00    5.36
          c39t0d3  100.00    0.50      21    1132    0.00  118.47
          c41t0d3  100.00    0.50       7     592    0.00  300.71
16:20:09   c3t0d0    6.00    0.58      13     194    2.22   26.05
          c39t0d3   89.00    0.50      17    1152    0.00  123.54
          c41t0d3   87.00    0.50       8     512    0.00  298.26
16:20:10   c3t0d0    3.00    0.50       6      96    0.00   22.78
          c39t0d3   85.00    0.50      17    1136    0.00  114.79
          c41t0d3   98.00    0.50       9     592    0.00  252.52
16:20:11   c3t0d0    1.00    0.50       1       2    0.00    8.04
          c39t0d3  100.00    0.50      17    1216    0.00  138.04
          c41t0d3  100.00    0.50      12     672    0.00  291.69
16:20:12   c3t0d0    2.00    0.50       3      34    0.00    9.24
          c39t0d3   99.00    0.50      16    1024    0.00  122.11
          c41t0d3   88.00    0.50       9     476    0.00  299.79
16:20:13  c39t0d3   91.00    0.50      18    1024    0.00  111.77
          c41t0d3   92.00    0.50       3     384    0.00  396.25
16:20:14  c39t0d3   99.00    0.50      17     892    0.00  132.15
          c41t0d3  100.00    0.50      10     608    0.00  233.54

Average    c3t0d0    1.80    0.53       3      50    0.87   17.64
Average   c39t0d3   96.00    0.50      18    1097    0.00  121.54
Average   c41t0d3  100.00    0.50       8     557    0.00  290.35
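The pattern in the sar -d output is easier to see when the averages are reduced to two questions: which devices are nearly 100% busy with long service times, and how much data are they actually moving? The following is a minimal sketch in Python, not part of the original diagnosis. The two thresholds are hypothetical values chosen for illustration, and the 512-byte block size for blks/s is an assumption that should be checked against the sar manual page on the system in question.

```python
# "Average" rows copied from the two `sar -d 1 10` runs above:
# (device, %busy, avque, r+w/s, blks/s, avwait_ms, avserv_ms)
SAR_AVERAGES = [
    ("c39t0d3", 99.50, 0.50, 16, 1048, 0.00, 123.38),
    ("c41t0d3", 100.00, 0.50, 9, 510, 0.00, 296.44),
    ("c3t0d0", 2.20, 0.50, 3, 37, 0.00, 12.28),
    ("c3t0d0", 1.80, 0.53, 3, 50, 0.87, 17.64),
    ("c39t0d3", 96.00, 0.50, 18, 1097, 0.00, 121.54),
    ("c41t0d3", 100.00, 0.50, 8, 557, 0.00, 290.35),
]

BLOCK_BYTES = 512        # assumption: blks/s is counted in 512-byte blocks
BUSY_LIMIT = 80.0        # hypothetical threshold for "saturated" (%busy)
AVSERV_LIMIT_MS = 50.0   # hypothetical threshold for "slow" (avserv, ms)


def throughput_kb_per_s(blks_per_s):
    """Convert sar's blks/s column to KB/s under the block-size assumption."""
    return blks_per_s * BLOCK_BYTES / 1024


def saturated_devices(rows):
    """Return {device: [(avserv_ms, KB/s), ...]} for busy, slow devices."""
    flagged = {}
    for device, busy, _avque, _rw, blks, _avwait, avserv in rows:
        if busy >= BUSY_LIMIT and avserv >= AVSERV_LIMIT_MS:
            flagged.setdefault(device, []).append(
                (avserv, throughput_kb_per_s(blks))
            )
    return flagged


if __name__ == "__main__":
    for device, samples in sorted(saturated_devices(SAR_AVERAGES).items()):
        for avserv, kb in samples:
            print(f"{device}: avserv {avserv:.2f} ms at only {kb:.0f} KB/s")
```

Under these assumptions the two storage devices (c39t0d3 and c41t0d3) are flagged while the local disk c3t0d0 is not: the storage paths are saturated while transferring only a few hundred KB per second, which points at the storage rather than at the database workload.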

The dual-node cluster software was restarted after the business staff had finished work, but the database startup hung at "Completed redo application".

SQL> startup
ORACLE instance started.

Total System Global Area 1.0318E+10 bytes
Fixed Size                  2073176 bytes
Variable Size            3238006184 bytes
Database Buffers         7063207936 bytes
Redo Buffers               14700544 bytes
Database mounted.

You can see the following information from the alert.log file:

Tue Jul 15 22:23:29 2014
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Picked latch-free SCN scheme 3
Autotune of undo retention is turned on.
IMODE=BR
ILAT =61
LICENSE_MAX_USERS = 0
SYS auditing is disabled
ksdpec: called for event 13740 prior to event group initialization
Starting up ORACLE RDBMS Version: 10.2.0.4.0.
System parameters with non-default values:
  processes                = 500
  sessions                 = 555
  __shared_pool_size       = 3154116608
  __large_pool_size        = 16777216
  __java_pool_size         = 33554432
  __streams_pool_size      = 33554432
  sga_target               = 10317987840
  control_files            = /sx_data/ORCL/control01.ctl, /sx_data/ORCL/control02.ctl, /sx_data/ORCL/control03.ctl
  db_block_size            = 8192
  __db_cache_size          = 7063207936
  compatible               = 10.2.0.3.0
  log_archive_dest_1       = LOCATION=/sx_data/arch_ORCL/
  log_archive_format       = %t_%s_%r.dbf
  db_file_multiblock_read_count= 16
  db_recovery_file_dest    = /oracle/flash_recovery_area
  db_recovery_file_dest_size= 2147483648
  undo_management          = AUTO
  undo_tablespace          = UNDOTBS1
  undo_retention           = 39600
  fast_start_parallel_rollback= FALSE
  remote_login_passwordfile= EXCLUSIVE
  db_domain                =
  dispatchers              = (PROTOCOL=TCP) (SERVICE=ORCLXDB)
  local_listener           = ORCL
  job_queue_processes      = 10
  background_dump_dest     = /oracle/admin/ORCL/bdump
  user_dump_dest           = /oracle/admin/ORCL/udump
  core_dump_dest           = /oracle/admin/ORCL/cdump
  audit_file_dest          = /oracle/admin/ORCL/adump
  db_name                  = ORCL
  open_cursors             = 2000
  optimizer_index_cost_adj = 20
  optimizer_index_caching  = 90
  pga_aggregate_target     = 2576351232
PMON started with pid=2, OS id=13613
PSP0 started with pid=3, OS id=13615
MMAN started with pid=4, OS id=13617
DBW0 started with pid=5, OS id=13619
LGWR started with pid=6, OS id=13621
CKPT started with pid=7, OS id=13623
SMON started with pid=8, OS id=13625
RECO started with pid=9, OS id=13627
CJQ0 started with pid=10, OS id=13629
MMON started with pid=11, OS id=13631
Tue Jul 15 22:23:30 2014
starting up 1 dispatcher(s) for network address '(ADDRESS=(PARTIAL=YES)(PROTOCOL=TCP))'...
MMNL started with pid=12, OS id=13635
Tue Jul 15 22:23:30 2014
starting up 1 shared server(s) ...
Tue Jul 15 22:23:31 2014
ALTER DATABASE MOUNT
Tue Jul 15 22:23:39 2014
Setting recovery target incarnation to 2
Tue Jul 15 22:23:42 2014
Successful mount of redo thread 1, with mount id 1380841571
Tue Jul 15 22:23:42 2014
Database mounted in Exclusive Mode
Completed: ALTER DATABASE MOUNT
Tue Jul 15 22:23:42 2014
ALTER DATABASE OPEN
Tue Jul 15 22:23:47 2014
Beginning crash recovery of 1 threads
 parallel recovery started with 7 processes
Tue Jul 15 22:23:50 2014
Started redo scan
Tue Jul 15 22:23:52 2014
Completed redo scan
 336597 redo blocks read, 78835 data blocks need recovery
Tue Jul 15 22:23:52 2014
Started redo application at
 Thread 1: logseq 2270, block 29
Tue Jul 15 22:23:53 2014
Recovery of Online Redo Log: Thread 1 Group 4 Seq 2270 Reading mem 0
  Mem# 0: /sx_data/ORCL/redo04.log
Tue Jul 15 22:23:58 2014
Completed redo application

The startup stayed stuck at "Completed redo application", and the wait event was "checkpoint completed". At first I thought slow parallel transaction recovery was the cause, so I queried v$transaction and v$fast_start_transactions, but neither view showed any transaction being recovered. Later I consulted Lao Xiong, who suggested checking the IO to see whether there was a storage problem, so I checked the storage IO performance again:

rx6600-1:[/]# sar 1 10

HP-UX rx6600-1 B.11.23 U ia64    07/15/14

22:36:41    %usr    %sys    %wio   %idle
22:36:42       2       2      12      84
22:36:43       1       0      12      87
22:36:44       0       0      17      83
22:36:45       0       0      13      87
22:36:46       0       1      12      87
22:36:47       2       1      13      84
22:36:48       1       1      16      82
22:36:49       0       0      12      88
22:36:50       0       0      12      88
22:36:51       0       0      22      78

Average        1       0      14      85

From the above, we can see that no business workload is running at all (%usr and %sys are close to zero), yet %wio stays between 12 and 22, which is not normal.
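The same observation can be stated as a simple check: CPU work is close to zero while IO wait is consistently in double digits. A minimal sketch in Python using the ten samples from the sar output above (the variable and function names are made up for illustration):

```python
# (%usr, %sys, %wio, %idle) for the ten one-second samples shown above.
CPU_SAMPLES = [
    (2, 2, 12, 84), (1, 0, 12, 87), (0, 0, 17, 83), (0, 0, 13, 87),
    (0, 1, 12, 87), (2, 1, 13, 84), (1, 1, 16, 82), (0, 0, 12, 88),
    (0, 0, 12, 88), (0, 0, 22, 78),
]


def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    values = list(values)
    return sum(values) / len(values)


# CPU actually doing work (user + system) versus CPU stalled on IO.
mean_cpu_work = mean(usr + sys for usr, sys, _wio, _idle in CPU_SAMPLES)
mean_io_wait = mean(wio for _usr, _sys, wio, _idle in CPU_SAMPLES)

if __name__ == "__main__":
    print(f"mean %usr+%sys: {mean_cpu_work:.1f}")
    print(f"mean %wio:      {mean_io_wait:.1f}")
```

IO wait that is an order of magnitude larger than the CPU time spent on real work, on a machine with no business load, suggests that the few IOs being issued are completing very slowly.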

rx6600-2:[/]# bdf
Filesystem                  kbytes       used      avail  %used  Mounted on
/dev/vg00/lvol3             983040     422504     556176    43%  /
/dev/vg00/lvol1            1835008     135048     16867776%      /stand
/dev/vg00/lvol8            8912896  853535352     374824    96%  /var
/dev/vg00/lvol7            7962624    2762312    5159704    35%  /usr
/dev/vg00/lvol4             524288      83192     437784    16%  /tmp
/dev/vg00/tmplv            2064384      93512     184794245      /oratmp
/dev/vg00/orasoft         10256384    3144652    6668390    32%  /orasoft
/dev/vg00/oracle          20480000    5480497   14062042    28%  /oracle
/dev/vg00/lvol6            9076736    5206384    3840128    58%  /opt
/dev/vg00/lvol5             131072      25472     104824    20%  /home
/dev/cwjcvg/cwjc_datalv   41493952  134188114  263239842    34%  /cwjc_data
/dev/sxvg/sx_datalv      624689152  298665485    3056549    49%  /sx_data
rx6600-2:[/]# time dd if=/dev/zero of=/var/test bs=8k count=100000

The command above is an IO test on the server's own local disk, intended to write 800 MB of data. The write aborted when /var filled up, but dd still wrote 47184 records of 8 KB (about 369 MB) in under 12 seconds:

msgcnt 2 vxfs: mesg 001: vx_nospace - /dev/vg00/lvol8 file system full (1 block extent)
I/O error
47185+0 records in
47184+1 records out

real       11.7
user        0.0
sys         0.8

However, the same 800 MB write test against the EMC storage ran for more than 30 minutes:

rx6600-2:[/]# time dd if=/dev/zero of=/sx_data/test bs=8k count=100000
711856+0 records in
711855+0 records out

real    30:58.4
user        0.5
sys        13.0
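To compare the two dd runs on equal terms, the record counts and elapsed times can be converted into write throughput. The sketch below is an illustration in Python, not something the original author ran. One caveat is handled explicitly: the EMC run reports 711855 records out although the command as shown says count=100000, so the throughput is computed both from the reported record count and from the nominal count, and neither figure should be read as exact.

```python
BLOCK_SIZE = 8 * 1024  # bs=8k in both dd commands


def parse_elapsed(text):
    """Convert a `time` value such as '11.7' or '30:58.4' to seconds."""
    seconds = 0.0
    for part in text.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def mib_per_second(records, elapsed):
    """Write throughput in MiB/s for `records` blocks of BLOCK_SIZE bytes."""
    return records * BLOCK_SIZE / 2**20 / parse_elapsed(elapsed)


# Local disk (/var): 47184 full records written before the file system filled.
local = mib_per_second(47184, "11.7")

# EMC storage (/sx_data): as reported by dd, and as implied by count=100000.
emc_reported = mib_per_second(711855, "30:58.4")
emc_nominal = mib_per_second(100000, "30:58.4")

if __name__ == "__main__":
    print(f"local disk:            {local:6.2f} MiB/s")
    print(f"EMC (records reported): {emc_reported:5.2f} MiB/s")
    print(f"EMC (count=100000):     {emc_nominal:5.2f} MiB/s")
    print(f"local is at least {local / emc_reported:.0f}x faster")
```

Whichever record count is used, the SAN-attached file system writes at least an order of magnitude more slowly than a single internal disk, which is the opposite of what one would expect from healthy enterprise storage.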

This is obviously a storage problem. We later learned that the administrator had found a failed disk in the storage array at 10:00 that morning. The array is configured as RAID 5 with a hot spare, so several hundred gigabytes of data had to be resynchronized onto the spare at the storage level. The timing matches exactly when the performance problems appeared.
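Why a RAID 5 rebuild hurts so much can be shown with the parity arithmetic itself. The sketch below is a generic illustration of XOR parity in Python, not the EMC array's actual implementation; the stripe contents and function name are invented for the example. The point is that every lost block must be recomputed from all surviving blocks in its stripe, so a rebuild reads every remaining disk from end to end while the hosts are still issuing their own IO.

```python
def xor_blocks(blocks):
    """XOR equally sized byte blocks together, byte by byte."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)


# One stripe spread over three data disks plus one parity disk (toy data).
data = [b"redo", b"undo", b"data"]
parity = xor_blocks(data)

# Disk 1 fails: its block is rebuilt from ALL surviving blocks plus parity.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)

if __name__ == "__main__":
    print("rebuilt block matches the lost one:", rebuilt == data[1])
```

Repeating this for several hundred gigabytes of stripes keeps the surviving disks busy for hours, which is consistent with the near-100% device utilization and long service times seen in the sar output during the day.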

Once the cause was found, the problem was easy to solve. Fortunately it was fixed in time, because a senior leader came for an inspection the next day. Ha ha.
