(transferred) Oracle instance recovery details MTTR 01/01 Update SLTechnology News&Howtos

(transferred) Oracle instance recovery details MTTR

2026-01-01 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Servers >

Shulou(Shulou.com)06/02 Report--

MTTR-Mean Time To Recover

MTBF-Mean Time Between Failures

First, you need to understand some concepts:

The information in the log file is to ensure that the transaction can be recovered when the failure occurs on the system. When a user transaction finishes issuing a commit, it always waits for the LGWR process to write the redo information required by the transaction to the log file (which may have been in redo buffer before) before receiving the commit complete message.

DBWR processes always write slower than LGWR processes (DBWR processes write randomly, LGWR processes write sequentially, and random writing is slower than sequential writing)

When the DBWR process wants to write the information in the cache to the data file, it first notifies the LGWR process to write transaction-related redo information to the log file.

SCN can be understood as a tag, and ORACLE puts a tag on every operation in the database. This label is added sequentially. Will never return to zero (unless the database is rebuilt)

CHECKPOINT is the ORACLE to record which data has been written to the data file.

The function of CHECKPOINT is to ensure that when checkpoint occurs, all the data before the checkpoint SCN is written to the data file by DBWR, and before DBWR is written, the LGWR process is triggered to write the relevant redo information to the log file. In this way, after the completion of the checkpoint, it is no longer necessary to restore the pre-checkpoint SCN information when the instance failure occurs.

Understand the information about instance recovery:

The information Instance Recovery needs is the redo information from the last checkpoint to the end of the log file.

Because the data before checkpoint has been consistently written to the data file, and some of the later data may have been written into the data file, while some of the data has not been written into the data file.

The time required by Instance Recovery is to restore the data file from the most recent checkpoint to the last SCN value of the data file recorded in the control file. The time to apply redo information between the two is the time it takes for instance recovery.

Adjustment of instance recovery:

From the above information, it can be concluded that the most critical issue of instance recovery is the time of the last CHECKPOINT and the frequency of CHECKPOINT. Only by confirming the point in time when the most recent CHECKPOIN occurred can you determine the redo information required for the recovery and how long it will take to recover.

For instance recovery time-consuming tuning, it is the adjustment of the parameter FAST_START_MTTR_TARGE, in seconds, with a maximum of 3600 seconds.

In other words, the setting of the parameter FAST_START_MTTR_TARGET will directly affect the frequency of checkpoint.

The time set by FAST_START_MTTR_TARGE is the time that the user wants the database to be used in instance recovery. This is the time it takes to apply the redo information between the last two points of the checkpoint and the last two points of the log message.

If the time set by MTTR is too small, the system checkpoint will be too frequent, and when checkpoint occurs, processes such as DBWR,LGWR will have to write data files and generate physical IO. Over time, the performance of the database will become slower and slower.

If the MTTR setting time is too long, when the instance fails, the instance recover will take too long.

Starting from 10g, the database can be adjusted automatically. If you FAST_START_MTTR_TARGET=0, you can see the following information in alert:

MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set

At this point, the database automatically adjusts the frequency of checkpoint according to the load.

If you want to strictly require instance recovery time, set the FAST_START_MTTR_TARGET parameter, if not so strict, it is recommended to use 10g automatic adjustment.

5.4.2.5 principles of instance recovery

As we mentioned earlier, when the database crashes suddenly and there is no time to flush the dirty data blocks in the buffer cache into the data file, and the running transaction is suddenly interrupted when the instance crashes, the transaction is in an intermediate state, that is, it is neither committed nor rolled back. At this point, the contents of the data file do not reflect the state of the instance when it crashes. The database closed in this way is inconsistent.

The next time the instance is started, Oracle will be automatically restored by the SMON process. When the instance starts, the SMON process checks the END SCN number of each online, readable and writable data file recorded in the control file. During the normal operation of the database, the END SCN number is always empty, and when the database is shut down normally, a full checkpoint is performed and the checkpoint SCN number updates the field. If the Oracle does not have time to update the field at the time of the crash, the field is still empty. When the SMON process finds that the field is empty, it knows that the instance did not shut down properly last time, so the SMON process starts the instance recovery.

When the SMON process performs instance recovery, it obtains the checkpoint location from the control file. Therefore, the SMON process goes to the online log file, finds the checkpoint location, and then applies all the redo entries from that checkpoint location, thus restoring the state of the instance crash point in buffer cache. This process is called rollforward. After rolling forward, there are dirty blocks in buffer cache that were committed but not written to the data file at the time of the crash, and transactions are abruptly terminated, resulting in dirty blocks in transactions that are neither committed nor rolled back.

As soon as the roll forward is complete, the SMON process immediately opens the database. However, at this time, the database also contains those intermediate dirty blocks that are neither committed nor rolled back, which cannot exist in the database because they are not committed and must be rolled back. After opening the database, the SMON process rolls back in the background.

Sometimes, after the database is opened, a user process sends a request to read these blocks before the SMON process has time to roll back these intermediate blocks. At this time, the server process is responsible for rolling back these blocks before returning them to the user. After the rollback is complete, the contents of the data blocks are returned to the user.

Oracle provides an initialization parameter, fast_start_mttr_target, which allows us to specify the time it takes to complete the instance recovery (this time only includes the time to roll forward and open the database, not the time to be rolled back), which is measured in seconds. For example, we set this parameter to 30, which means that if an instance crashes, it will take up to 30 seconds for the database to roll forward and open the database the next time it is restarted. During the operation of the database, the approximate amount of redo records corresponding to 30 seconds is estimated based on this time, which actually determines the location of the checkpoint, as shown in figure 5-8.

Figure 5-8 checkpoint queue 3

The red vertical bar in figure 5-8 is the checkpoint location. The time taken by Oracle to redo all records after the checkpoint location is applied is the time specified by fast_start_mttr_target. In other words, the dirty blocks corresponding to the redone records after the checkpoint location are left on the checkpoint queue instead of being written to the data file by DBWn. Therefore, the larger the parameter, the more redo records to be applied, and the more dirty blocks are left on the checkpoint queue, which means that the less frequently DBWn writes dirty blocks and the less iUniver, then the faster the response of the frontend user query statement is. However, the longer it takes for the instance to recover. On the other hand, the smaller the parameter, the less redo records to be applied, and the fewer dirty blocks left on the checkpoint queue, which means that DBWn writes dirty blocks more frequently, so it takes up more iUniteo, then the iUnio of the foreground user query statement can not be responded more quickly. However, the instance recovery time will be shorter.

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.