
Oracle Study - DataGuard Maximum Protection Mode Failure (ORA-16198)

System environment:

Operating system: RedHat EL5

Oracle: Oracle 11gR2 (11.2.0.1.0)

Failure phenomenon:

The following failure occurred on the primary of a physical standby configuration when converting from Maximum Performance to Maximum Protection:

10:13:06 SYS@ prod1 > startup force mount;
ORACLE instance started.

Total System Global Area  418484224 bytes
Fixed Size                  1336932 bytes
Variable Size             281020828 bytes
Database Buffers          130023424 bytes
Redo Buffers                6103040 bytes
Database mounted.

10:13:30 SYS@ prod1 > select name, protection_mode from v$database;

NAME      PROTECTION_MODE
--------- --------------------
PROD1     MAXIMUM PROTECTION
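The transcript above only shows the instance being restarted into MOUNT with the protection mode already at MAXIMUM PROTECTION; the conversion statement itself is not part of the post. As a minimal sketch (my assumption of the usual sequence, not taken from the original transcript), the mode would typically be raised on the primary like this:

-- Hedged sketch: raising the protection mode on the primary (not from the original transcript).
-- Requires a SYNC/AFFIRM redo transport destination and standby redo logs on the standby side.
SHUTDOWN IMMEDIATE;
STARTUP MOUNT;
ALTER DATABASE SET STANDBY DATABASE TO MAXIMIZE PROTECTION;
ALTER DATABASE OPEN;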

Opening the database failed:

10:07:04 SYS@ prod1 > alter database open;

alter database open

*

ERROR at line 1:

ORA-03113: end-of-file on communication channel

Process ID: 4612

Session ID: 1 Serial number: 5

Check the alert log:

Alter database open

Thu Jun 11 10:07:10 2015

LGWR: STARTING ARCH PROCESSES

Thu Jun 11 10:07:10 2015

ARC0 started with pid=19, OS id=4614

ARC0: Archival started

LGWR: STARTING ARCH PROCESSES COMPLETE

ARC0: STARTING ARCH PROCESSES

Thu Jun 11 10:07:10 2015

ARC1 started with pid=20, OS id=4616

Thu Jun 11 10:07:10 2015

ARC2 started with pid=21, OS id=4618

ARC1: Archival started

ARC2: Archival started

ARC0: STARTING ARCH PROCESSES COMPLETE

ARC0: Becoming the 'no FAL' ARCH

ARC0: Becoming the 'no SRL' ARCH

ARC1: Becoming the heartbeat ARCH

LGWR: Primary database is in MAXIMUM PROTECTION mode

LGWR: Destination LOG_ARCHIVE_DEST_1 is not serviced by LGWR

Thu Jun 11 10:07:11 2015

NSS2 started with pid=18, OS id=4620

Thu Jun 11 10:07:40 2015

ORA-16198: LGWR received timedout error from KSR

Errors in file /u01/app/oracle/diag/rdbms/bjdb/prod1/trace/prod1_lgwr_4565.trc:

ORA-16198: Timeout incurred on internal channel during remote archival

LGWR: Error 16198 verifying archivelog destination LOG_ARCHIVE_DEST_2

Destination LOG_ARCHIVE_DEST_2 is UNSYNCHRONIZED

LGWR: Continuing...

LGWR: Minimum of 1 applicable standby database required

Errors in file /u01/app/oracle/diag/rdbms/bjdb/prod1/trace/prod1_lgwr_4565.trc:

ORA-16072: a minimum of one standby database destination is required

Errors in file /u01/app/oracle/diag/rdbms/bjdb/prod1/trace/prod1_lgwr_4565.trc:

ORA-16072: a minimum of one standby database destination is required

LGWR (ospid: 4565): terminating the instance due to error 16072

Instance terminated by LGWR, pid = 4565

- As the log shows, neither destination could satisfy the protection mode: LOG_ARCHIVE_DEST_1 is not serviced by LGWR, and remote archiving to LOG_ARCHIVE_DEST_2 timed out (ORA-16198) and went UNSYNCHRONIZED. With no synchronized standby destination available, LGWR terminated the instance with ORA-16072.
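Before changing any parameters, it helps to confirm which destination is actually failing and with what error. A hedged diagnostic sketch (standard archive destination views, run on the primary while it is mounted):

-- List non-inactive archive destinations and any recorded errors.
SELECT dest_id, dest_name, status, error
  FROM v$archive_dest_status
 WHERE status <> 'INACTIVE';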

Solution:

Following Oracle's official recommendation, modify the net_timeout value (on both the primary and the standby databases):

10:10:23 SYS@ prod1 > alter system set log_archive_dest_2='service=shdb lgwr sync affirm VALID_FOR=(online_logfiles,primary_role) net_timeout=30 DB_UNIQUE_NAME=shdb';

10:11:25 SYS@ shdb > alter system set log_archive_dest_2='service=bjdb lgwr sync affirm VALID_FOR=(online_logfiles,primary_role) net_timeout=30 DB_UNIQUE_NAME=bjdb';

Add standby redo logs:

Primary database (the standby redo logs are added while still in Maximum Performance mode):

10:20:35 SYS@ prod1 > select group#, status, bytes from v$log;

    GROUP# STATUS          BYTES
---------- ---------- ----------
         1 INACTIVE     52428800
         2 CURRENT      52428800
         3 INACTIVE     52428800

SYS@ prod1 > select member from v$logfile;

MEMBER
--------------------------------------------------
/u01/app/oracle/oradata/prod1/redo03.log
/u01/app/oracle/oradata/prod1/redo02.log
/u01/app/oracle/oradata/prod1/redo01.log
/u01/app/oracle/oradata/prod1/std_redo01.log
/u01/app/oracle/oradata/prod1/std_redo02.log

6 rows selected.

10:21:03 SYS@ prod1 > alter database add standby logfile
10:21:25   2  '/u01/app/oracle/oradata/prod1/std_redo03.log' size 50m;
Database altered.

10:21:46 SYS@ prod1 > alter database add standby logfile
10:21:51   2  '/u01/app/oracle/oradata/prod1/std_redo04.log' size 50m;
Database altered.

10:01:48 SYS@ shdb > select member from v$logfile;

MEMBER
--------------------------------------------------
/u01/app/oracle/oradata/shdb/redo03.log
/u01/app/oracle/oradata/shdb/redo02.log
/u01/app/oracle/oradata/shdb/redo01.log
/disk2/arch_prod11_0_881851982.dbf
/u01/app/oracle/oradata/shdb/std_redo01.log
/u01/app/oracle/oradata/shdb/std_redo02.log

6 rows selected.

Standby database:

10:18:17 SYS@ shdb > alter database open;
Database altered.

10:20:21 SYS@ shdb > select group#, status, bytes from v$log;

    GROUP# STATUS                BYTES
---------- ---------------- ----------
         1 CLEARING           52428800
         2 CLEARING           52428800
         3 CLEARING_CURRENT   52428800

10:20:45 SYS@ shdb > alter database add standby logfile
10:22:41   2  '/u01/app/oracle/oradata/shdb/std_redo03.log' size 50m;
Database altered.

10:22:57 SYS@ shdb > alter database add standby logfile
10:23:02   2  '/u01/app/oracle/oradata/shdb/std_redo04.log' size 50m;
Database altered.

10:23:14 SYS@ shdb > select group#, member from v$logfile;

    GROUP# MEMBER
---------- --------------------------------------------------
         3 /u01/app/oracle/oradata/shdb/redo03.log
         2 /u01/app/oracle/oradata/shdb/redo02.log
         1 /u01/app/oracle/oradata/shdb/redo01.log
         5 /u01/app/oracle/oradata/shdb/std_redo01.log
         6 /u01/app/oracle/oradata/shdb/std_redo02.log
         7 /u01/app/oracle/oradata/shdb/std_redo03.log
         8 /u01/app/oracle/oradata/shdb/std_redo04.log

8 rows selected.
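After adding the groups on both sides, it is worth confirming that the standby redo logs exist and match the online redo log size (52428800 bytes = 50 MB here), with at least one more standby group per thread than online groups. A hedged verification sketch:

-- Standby redo logs should match the online redo log size.
SELECT group#, thread#, bytes/1024/1024 AS size_mb, status
  FROM v$standby_log;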

- After the changes above, the problem remained: in Maximum Protection mode the primary still cannot be opened, while under Maximum Availability and Maximum Performance it opens normally. The root cause is still being investigated.
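One related check worth making here (my suggestion, not part of the original post) is the gap between the configured mode and the level actually in force: under Maximum Availability the primary is allowed to open with a downgraded protection level while the standby is unsynchronized, which is consistent with the behaviour described above.

-- Compare the configured protection mode with the level currently in effect.
SELECT protection_mode, protection_level FROM v$database;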

Reference documentation:

Analysis and resolution of databases reporting ORA-16198

-http://blog.itpub.net/28546804/viewspace-1260003/

1. First, let's look at the official documentation on the ORA-16198 error.

...

The error may be caused by a net_timeout value that is too low. In earlier versions the default was 10; changing it to 30 is recommended.

...

The net_timeout attribute in the log_archive_dest_2 on the primary is set too low, so that LNS couldn't finish sending the redo block in 10 seconds in this example.

... .

If setting it to 30 is not enough, check the disk I/O usage and the network transfer.

... ..

Note: If NET_TIMEOUT attribute has already been set to 30, and you still get ORA-16198, that means LNS couldn't finish sending redo block in 30 seconds.

The slowness may be caused by:

1. Operating System. Please keep track of OS usage (like iostat).

2. Network. Please keep track of network traffic (like tcpdump).

...

It may also be a bug. The affected versions are 11.2.0.1 and 10.2.0.4; upgrading to 11.2.0.2 or later is recommended.

... ..

Bug 9259587 Multiple LGWR reconnect attempts in Data Guard MAXIMUM_AVAILABILITY

This note gives a brief overview of bug 9259587.

Affects:

Product (Component): Oracle Server (RDBMS)

Range of versions believed to be affected: versions below 12.1

Versions confirmed as being affected: 11.2.0.1, 10.2.0.4

Platforms affected: Generic (all / most platforms affected)

Fixed:

This issue is fixed in 12.1 (Future Release) and 11.2.0.2 (Server Patch Set).

Symptoms:

Related To:

Hang (Process Spins)

Active Dataguard (ADG)

Physical Standby Database / Dataguard

Description

...

The error that occurred is probably similar to the following

...

Rediscovery Notes:

Alert log contains messages like:

ORA-16198: LGWR received timedout error from KSR

LGWR: Attempting destination LOG_ARCHIVE_DEST_2 network reconnect (16198)

LGWR: Destination LOG_ARCHIVE_DEST_2 network reconnect abandoned

Errors in file

/app/oracle/diag/rdbms/ora11g_dga/ora11g/trace/ora11g_lgwr_290838.trc:

ORA-16198: Timeout incurred on internal channel during remote archival

LGWR: Network asynch I/O wait error 16198 log 2 service 'ora11g_DGb'

LGWR: Error 16198 disconnecting from destination LOG_ARCHIVE_DEST_2 standby

Host 'ora11g_DGb'

Destination LOG_ARCHIVE_DEST_2 is UNSYNCHRONIZED

LGWR: Failed to archive log 2 thread 1 sequence 1422 (16198)

...

In a Data Guard configuration using LGWR SYNC transport on one or more LOG_ARCHIVE_DEST_n parameters, and using a protection mode of MAXIMUM_AVAILABILITY, then if the primary database becomes disconnected from the standby database, LGWR continues to attempt to reconnect to the standby database. It should instead avoid attempts to reconnect until an ARCH process has re-established communication with the standby database.

So what is certain is:

This error occurs mainly in Data Guard configurations, when redo from the primary either is not shipped to the standby within the specified time or cannot be shipped at all. The two main causes of this failure are explained below:

2. Failure caused by a parameter value that is too low

This may happen because the NET_TIMEOUT value in LOG_ARCHIVE_DEST_2 is set too low, so the redo cannot be transferred within the specified time. Setting it to 30 is recommended.

Query NET_TIMEOUT:

SQL> select DEST_NAME, NET_TIMEOUT from V$ARCHIVE_DEST;

DEST_NAME            NET_TIMEOUT
-------------------- -----------
LOG_ARCHIVE_DEST_1             0
LOG_ARCHIVE_DEST_2            30

... (remaining output omitted)

Check the LOG_ARCHIVE_DEST_2 parameter:

SQL> show parameter log_archive_dest_2

The value is 'service=orcl_std reopen=120 lgwr sync valid_for=(online_logfiles,primary_role) db_unique_name=orcl_std'.

I did not set the NET_TIMEOUT attribute explicitly, but the default is 30 because my version is 11.2.0.3.

If your parameter is not 30, please modify it as follows:

SQL> ALTER SYSTEM SET LOG_ARCHIVE_DEST_2='service=orcl_std reopen=120 lgwr sync net_timeout=30 valid_for=(online_logfiles,primary_role) db_unique_name=orcl_std';

Then observe whether the problem is still reported.
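One way to watch for recurrences without tailing the alert log (a sketch using the standard V$DATAGUARD_STATUS view) is:

-- Recent Data Guard messages, filtered to the timeout/termination errors seen earlier.
SELECT timestamp, severity, error_code, message
  FROM v$dataguard_status
 WHERE error_code IN (16198, 16072)
 ORDER BY timestamp;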

3. Failures caused by other reasons, such as a congested network or busy storage I/O

If the error is caused by network congestion or busy storage, use operating system tools such as tcpdump, iostat, or vmstat to check the usage of the relevant resources, or ask the network and storage administrators to assist with the analysis.

If none of the above reveals a problem, another possibility is that the SYS password was changed on the primary or the standby but not on the other side; in that case the primary can easily fail when authenticating against the standby.
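A mismatched SYS password (out-of-sync password files) usually surfaces as an authentication error on the redo transport destination, typically ORA-01017 or ORA-16191. A hedged check on the primary:

-- An out-of-sync password file normally leaves an authentication error against the remote destination.
SELECT dest_id, dest_name, status, error
  FROM v$archive_dest
 WHERE dest_id = 2;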

4. Database bug

If none of the above methods solves the problem, no specific cause can be identified, and your database version happens to be 11.2.0.1 or 10.2.0.4, then upgrade.
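To confirm whether the instance is on one of the affected releases (11.2.0.1 or 10.2.0.4) before planning an upgrade, a quick check:

-- Check the exact database version.
SELECT banner FROM v$version WHERE banner LIKE 'Oracle Database%';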

5. Summary

This kind of problem should be analyzed from several angles: a NET_TIMEOUT value that is too low, storage utilization, network transfer, a changed SYS password, database bugs, and so on.
