2025-01-19 Update From: SLTechnology News&Howtos
Preface
Before we knew it, the fourth installment of Xiao y's "Me and the Data Center" technology life series has arrived. Xiao y is back to see you again!
When you see a wavy TPS curve like the one below during a business system stress test, what do you do?
What Xiao y (of Zhongyi Technology) wants to share today is the process of analyzing and solving exactly this kind of stress-test performance problem. The issue had plagued the customer for quite a long time; fortunately, Xiao y located the cause remotely within 10 minutes and helped the customer finally resolve it. Note that in this case, adjusting database parameters alone was not enough: further analysis and adjustments were needed to solve the problem completely.
To keep the original flavor and make the article more interesting, Xiao y will not only stick to the analysis-report style of previous installments, but also try to describe some of the thought process behind handling the problem, in the hope that readers can see Xiao y's real working state.
What can we learn?
An important performance-related parameter that must be adjusted on Oracle Database 11.2.0.3 and above!
How to stay the course, or quickly change direction, after a diagnosis turns out to be wrong!
Some experience on how to take the initiative when handling problems that cross teams and departments!
Reminder
If you have a highly concurrent, transactional OLTP core business system that often shows performance jitter — for example, transaction response time suddenly spikes, accompanied by increases in application-server port usage, process activity, JDBC connections, and database DB Time per second — and your Oracle version is 11.2.0.3 or above, then it is very likely related to the important parameter discussed in this article! If adjusting the parameter does not solve your problem, you can contact Xiao y for a diagnosis (free of charge).
Your likes and retweets are the motivation for Xiao y to keep sharing!
More hands-on Oracle database experience is published first on the "Zhongyi Antu" official account. Welcome to follow us.
Part 1
Here comes the problem.
At 10:00 in the morning, QQ suddenly flashed: a message had come in!
"Xiao y, do you have a minute? Look at an AWR report for me."
"I'll call you later and tell you what's going on."
The person who sent the QQ message is the DBA of a large domestic airline. He can solve most common problems by himself, but this time he seemed to be in trouble. In a new key business system launched this year, the pre-launch stress test failed to meet the required concurrency and response time. The TPS curve during the stress test is as follows:
As you can see, TPS during the stress test is wavy and extremely unstable.
The customer had already done a lot of analysis. At the resource level, CPU, memory utilization, and IO were all normal. However, the customer also found a large number of abnormal waits in the back-end Oracle database during the stress test, mainly gc waits, and suspected a problem with the private network (interconnect) switch. Unfortunately, the network team could not find any abnormality.
The customer also asked their existing Oracle service provider to analyze the problem, but it remained unsolved for a long time. Meanwhile the go-live date was getting closer and closer, and the pressure on the customer kept increasing!
Xiao y had recently been in touch with this customer and sincerely hoped for a chance to serve them, so when such an opportunity came, Xiao y naturally braced up and got ready to fight.
Environment introduction:
Operating system: Red Hat Linux 64-bit; 64 cores, 128 GB RAM per server
Database: Oracle 11.2.0.3, two-node RAC
Part 2
Analysis process
2.1 Analyze the Oracle database's DB Time per second
Dividing DB Time by Elapsed shows that the database accumulates 75 seconds of DB Time per second of wall-clock time! This is extremely abnormal.
It indicates the database is experiencing serious waits, and we need to examine the database's top wait events to continue the analysis.
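The arithmetic behind this check can be sketched in a few lines of Python. The 75-per-second figure comes from the article; the one-hour snapshot interval is an assumption for illustration:

```python
# DB Time per second of wall-clock time = average active sessions (AAS).
# Assumed illustrative numbers: a 1-hour AWR snapshot interval.
elapsed_seconds = 3600.0          # snapshot interval (assumed)
db_time_seconds = 75.0 * 3600.0   # implied by "75 seconds of DB Time per second"

aas = db_time_seconds / elapsed_seconds
print(aas)  # 75.0 sessions active on average

# A healthy OLTP box keeps AAS well below its CPU count; 75 active
# sessions on 64 CPUs that are only ~10% busy means almost all of
# that time is spent waiting, not working.
```

With AAS far above CPU count while CPU utilization stays low, the excess must be wait time, which is exactly why the next step is the top wait events.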
2.2 Analyze where transaction time is spent (Top 5 wait events)
1) Node 1's wait events are as follows:
Event analysis
gc buffer busy acquire ranks first in the Oracle Top 5 waits, accounting for 51.2% with an astonishing average wait of 277 ms! gc buffer busy acquire means that before process A requested a data block from node 2, process B had already requested the same block, and that request has not yet completed, so process A has to wait.
In second place is log file sync, accounting for 18.58% and averaging 293 ms per wait. log file sync means that when a process issues a commit, it must wait for the lgwr background process to persist the change vectors in the log buffer to the redo log on disk.
In third place is DB CPU. Over the one-hour sampling window, total CPU time is 24,648 seconds, i.e., roughly 7 CPUs' worth of time. The server has 64 CPUs, so average CPU utilization is only about 10%. Incidentally, in general we want the DB CPU percentage to be as high as possible: it means there is almost no waiting during SQL execution and SQL response time is fast. That does not mean there is no problem, though. For example, with high-logical-read SQL, if all the data being operated on is in memory, DB CPU will be high, and those high-logical-read SQL statements should be optimized to bring DB CPU down.
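The DB CPU arithmetic above can be verified directly (figures taken from the AWR numbers quoted in the article):

```python
db_cpu_seconds = 24648        # total DB CPU over the snapshot (from the AWR report)
elapsed_seconds = 3600        # one-hour sampling window
cpu_count = 64                # server CPU count

avg_busy_cpus = db_cpu_seconds / elapsed_seconds   # ~6.85 "CPUs' worth" of work
utilization_pct = avg_busy_cpus / cpu_count * 100  # ~10.7%

print(round(avg_busy_cpus, 2), round(utilization_pct, 1))
```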
In fourth place is direct path read, with an average wait of 153 ms. direct path read means processes read data blocks directly into PGA private memory rather than into the shared buffer cache. In this mode the IO throughput is obviously larger, because each process reads its own copy, even of the same data. If, instead, processes reading the same data at the same time read it into shared memory, only one process does the physical read and the others can operate on the data directly in memory, so IO throughput is much smaller.
In fifth place is buffer busy wait, with an astonishing average wait of 499 ms. buffer busy wait means two or more processes conflict when they need write/write or write/read access to the same block at the same time, i.e., hot-block contention.
Seeing this, Xiao y already knows the answer!
To be rigorous, though, it is still necessary to quickly run through the wait events on RAC node 2.
2) Node 2's wait events are as follows:
Compared with node 1, there is no buffer busy wait, and more gc current block busy.
Overall, there is little difference between the two nodes' wait events!
2.3 Xiao y's rapid-fire thinking in the first 2 minutes
2.3.1 Is it a RAC private network problem?
Seeing this, someone might ask:
With gc waits this high, would temporarily shutting down the other RAC node solve the problem?
The first answer is NO! Second, this approach can be adopted temporarily for emergency gc performance problems in production, but the customer obviously would not accept such a workaround in this case.
Xiao y will answer the question from a technical angle.
First, as shown in the figure below,
private network traffic between the two nodes is only 3 MB per second, while the two RAC servers use gigabit switches for the private network.
Second, CPU and memory usage on both RAC nodes is low, so there is no case of one node running slowly due to resource shortage and failing to respond quickly to the other node's gc requests.
If it were that type of problem, we would usually also see gc*congested* (congestion) waits.
2.3.2 Is poor SQL efficiency causing the gc/bbw/direct path read problems?
Here bbw is buffer busy wait,
and gc refers to waits such as gc buffer busy acquire.
Some people may say:
gc waits are so high, along with buffer busy waits; if the SQL were efficient enough, fewer data blocks would be accessed, processes would issue far fewer gc requests, and the hot-block contention caused by reads and writes would naturally disappear.
The answer is NO!
As shown in the figure below, the application is actually quite well written: most SQL statements use fewer than 100 logical reads, and only 3 SQL statements range from thousands to tens of thousands of logical reads. This level of SQL efficiency and logical reads is not enough to cause such high gc/bbw waits! Moreover, poor SQL efficiency cannot explain why log file sync and direct path read also show long average single waits! Heading in the wrong direction will not solve the fundamental problem. In other words, even with more effort spent optimizing these slightly-high-logical-read SQL statements, the stress test problem would remain, because that is not the root cause; optimizing the SQL here is icing on the cake, not the timely help this case needs.
2.3.3 Is direct path read filling up the IO bandwidth?
Some people may suggest this possibility:
First, direct path read fills up the IO bandwidth.
Note: multiple processes read data blocks into PGA private memory instead of the shared buffer cache. With a multiblock read count of 16 and an 8K block size, each process can read about 30 MB/s, so 15 or more processes reading concurrently can fill the bandwidth of an HBA card. Setting event 10949 can disable this feature.
Because the IO bandwidth is full, the response time of lgwr's log writes suffers, resulting in long log file sync waits.
And since log file sync is a component of gc and buffer busy waits, those waits lengthen too — hence the waits in the AWR report?
First of all, friends who can form this hypothesis are welcome to send in a resume: it shows a very deep understanding of the database and very rich troubleshooting experience, beyond piecemeal analysis of problems!
Welcome to join Zhongyi Technology's DBA team! Once you join, we are brothers: we fight side by side, take on all kinds of hard problems, and share the rewards together.
Please send your resume to 51994106@qq.com
So what does log file sync have to do with gc?
Quoting a diagram from the RAC SG, the principle is shown in the figure below.
As the picture shows:
a gc-class request includes the lgwr process's log write as its second step.
That is, log file sync is a sub-step of a gc request, where it is strictly called gc log flush sync.
But the answer is still NO!
As the load profile below shows, physical reads per second are 498 blocks; at 8K per block that is only about 4 MB of IO per second. Both IOPS and IO bandwidth are very low, so this is obviously not the problem!
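A quick back-of-the-envelope check makes the refutation concrete. The 498 blocks/s and 8K block size are from the load profile; the 30 MB/s-per-process figure is the assumption quoted in 2.3.3:

```python
# Actual load observed during the stress test
blocks_per_sec = 498
block_kb = 8
actual_mb_per_sec = blocks_per_sec * block_kb / 1024   # ~3.9 MB/s of physical reads

# Hypothetical saturation scenario from section 2.3.3
per_process_mb = 30        # assumed direct path read rate per process
process_count = 15
hypothetical_mb = per_process_mb * process_count       # 450 MB/s would fill an HBA

print(round(actual_mb_per_sec, 1), hypothetical_mb)
```

Actual throughput is two orders of magnitude below the level at which direct path read could saturate the IO path, so this hypothesis is ruled out.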
2.3.4 Xiao y quickly locks down the direction of analysis!
In those first two minutes, as in the analysis above, Xiao y was rapidly making and eliminating hypotheses and connecting the problems together.
Soon Xiao y locked onto the analysis direction: focus on the log file sync wait!
The reason is simple. Analyzing the Top 5 waits, it is not hard to see the correlation between them:
log file sync is a component of gc and buffer busy waits! (see the figure in 2.3.3)
If the log file sync wait is solved, the gc* waits and buffer busy waits will naturally come down,
and the problem will be solved!
2.4 Focus on the "log file sync" wait
From the above, we already know what the "log file sync" wait event means:
when a process issues a commit, it must wait for the lgwr background process to persist the change vectors in the log buffer to the redo log on disk. So the most common causes are that lgwr writes slowly, or that commits are too frequent!
Next, Xiao y examines these two aspects in turn.
In Oracle, if lgwr writes slowly, it shows up as a slow log file parallel write response time.
Node 1
Node 2
As you can see, both log file parallel write and gc log flush sync are under 5 milliseconds, and log file parallel write is only 1 ms and 3 ms on the two nodes respectively. Ruled out!
Next, check the number of commits!
As the figure below shows, there are only 48 transactions (commits/rollbacks) per second!
In the highly concurrent core systems of some large banks served by Xiao y, including systems running more than 10,000 transactions per second, log file sync still stays under 10 milliseconds. So a mere 48 transactions per second is a very small number and cannot cause such serious waits!
2.5 The cause is basically located; the first adjustment begins
At this point in the analysis, Xiao y had found the root cause of the stress-test problem; it only remained to adjust and verify.
Readers are invited to pause here and see whether they have found the cause themselves.
That is, two minutes after the customer's AWR report was sent, Xiao y told him:
"I know the reason. Send me the trace of the lgwr process for final confirmation, and we will start adjusting."
In fact this is not surprising: Xiao y ran into such cases N times a few years ago while doing large numbers of upgrades to 11g! The customer was amazed; he had not even had time to call Xiao y. How could Xiao y do this?
Checking this should be a standard item before go-live; the symptoms differ, but it is essentially the same problem.
This is the important database parameter highlighted in Xiao y's title.
If log file sync waits are long, but lgwr writes logs quickly and the number of commits is small, then the problem lies in the communication between the committing process and lgwr.
Key knowledge points:
Starting with 11g, Oracle introduced a polling mechanism for lgwr log writes, whereas previously there was only the post/wait mechanism.
At the same time, a hidden parameter, "_use_adaptive_log_file_sync", was introduced to switch adaptively between the two mechanisms. Below 11.2.0.3, the parameter defaults to false, i.e., only the post/wait mechanism is used.
Starting from 11.2.0.3, the parameter defaults to true, meaning Oracle switches adaptively between the post/wait and polling mechanisms.
Under the post/wait mechanism, lgwr notifies the processes waiting on commit as soon as it has written the change vectors from the log buffer, so the log file sync wait is short, but lgwr's burden is relatively heavy. After all, below 12c there is only one lgwr process, and when many processes commit at the same time, notifying each of them is itself a burden.
Under the polling mechanism, after a committing process notifies lgwr to write, it goes to sleep and checks after a timeout whether the log buffer contents have reached disk; lgwr no longer notifies each waiting process individually that the write has completed. Polling relieves lgwr of part of its work, but the committing processes may spend a long time waiting on log file sync. For transactional systems, this mechanism is highly unsuitable!
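The difference between the two mechanisms can be illustrated with a toy Python sketch. This is only an analogy, not Oracle's implementation: a condition variable stands in for the post/wait notification, and a sleep-and-recheck loop stands in for polling.

```python
import threading
import time

class ToyLogWriter:
    """Toy stand-in for the lgwr process (an analogy, not Oracle code)."""
    def __init__(self):
        self.flushed = False
        self.cond = threading.Condition()

    def flush(self, write_delay):
        time.sleep(write_delay)        # simulate the redo write to disk
        with self.cond:
            self.flushed = True
            self.cond.notify_all()     # post/wait: wake every committer at once

def commit_post_wait(lgwr):
    """post/wait: the committer sleeps until lgwr posts it.
    Latency is roughly the write time itself, but lgwr must notify everyone."""
    with lgwr.cond:
        lgwr.cond.wait_for(lambda: lgwr.flushed)
    return True

def commit_polling(lgwr, interval=0.05):
    """polling: the committer wakes on a timer and rechecks.
    lgwr does less work, but each commit can lose up to `interval`
    of extra latency even after the write has already finished."""
    while True:
        with lgwr.cond:
            if lgwr.flushed:
                return True
        time.sleep(interval)
```

Under a heavy OLTP commit rate, that per-commit polling latency is exactly what shows up as longer log file sync waits, which is why the article recommends pinning lgwr to post/wait.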
Oracle records switches between the post/wait and polling mechanisms in the lgwr process's trace file, as shown below.
Switching to polling mode can easily cause log file sync waits and hurt transaction response time!
Log file sync switching to polling
……
Log file sync switching to post/wait
On Oracle 11.2.0.3 and above, it is recommended to turn off adaptive log file sync and make sure the lgwr process runs under the post/wait mechanism, so that database performance suffers no big jitters! The command to turn it off is as follows and can be executed online. So Xiao y reminds you here:
alter system set "_use_adaptive_log_file_sync"=false sid='*';
Yes, Xiao y's first adjustment was to set this parameter to false.
2.6 The result of the first adjustment: disappointment!
After adjusting the parameter online, the customer restarted both node databases for safety,
re-ran the stress test, and collected new AWR reports, as follows!
Looking at node 1's AWR report, the gc waits and log file sync waits were still there — and the single waits even looked longer!
Was Xiao y's analysis wrong? Or had Xiao y run into several intertwined problems this time? Calm down: with RAC problems, remember never to look at only a single node. So Xiao y asked the customer for node 2's AWR report; the post-adjustment node 2 report is shown in the figure below:
You can see:
although waits remain, node 2's log file sync wait is gone! This shows the adjustment worked!
Careful readers may also have noticed that node 1's top wait, gc buffer busy acquire, is completely gone (indicating that node 2's log file sync is now fast), shifting from gc buffer busy acquire to gc buffer busy release. Doesn't that also show the adjustment worked?
So don't worry here: because node 1 still has log file sync waits, node 2's gc buffer busy acquire still exists! Next, Xiao y should concentrate on solving node 1's log file sync.
2.7 The truth surfaces (doubt everything)
To sum up: the adaptive log file sync adjustment did not fully solve the problem, so we return to the traditional line of thinking — the most likely cause is that lgwr writes logs slowly! Although the log file parallel write metric in the AWR report is only a few milliseconds, the AWR report is, after all, just a tool providing reference values, so we must stay skeptical and check again!
This time, let's look at the lgwr process's log writes in real time. Issue the SQL query; the result is shown in the figure below:
You can see:
of the two nodes, only one has a log file parallel write wait, which matches all the previous analysis!
While state is WAITING, the seq# of the log file parallel write wait stays at 35693, yet seconds_in_wait reaches 21 seconds. Simply put, the lgwr process is taking 21 seconds to complete a single IO!
At this point we can be sure there is something wrong with the IO subsystem, and we need to focus on checking the fiber links, SAN switches, and storage errors and performance along the IO path.
2.8 How to further prove the IO path problem (cross-department cooperation)
Considering that the customer's storage team might not accept the database-side evidence of slow IO, and to strengthen the case, Xiao y asked the customer to issue the following command to check IO at the multipath software level. The result is as follows:
Obvious IO ERRORs appear on node 1 and keep increasing!
Checking node 2, there are no IO ERRORs at all!
Reviewing the earlier analysis: node 2, with adaptive log file sync set to false and no IO ERRORs, shows no log file sync wait.
At this point the analysis is complete! Every symptom is perfectly explained!
Having found the cause — and very convincing evidence — the customer finally breathed a sigh of relief, no longer afraid that the storage team would refuse to acknowledge the problem!
2.9 The problem is satisfactorily resolved
Faced with ironclad evidence, the customer's storage team stopped pushing back and began to investigate item by item. The problem was finally resolved after the fiber cable was replaced. Below are the wait events from the stress test re-run after replacing the fiber link!
The TPS curve of the stress test changed from the original wavy shape:
into a good curve like this.
Part 3
Problem cause, summary, and risk hint
3.1 Summary of the causes of the problem
The reason the airline customer's stress test could not meet the concurrency and response-time targets before go-live is that two intertwined problems occurred at once:
1) On Oracle 11.2.0.3, adaptive log file sync is on by default, which lengthens log file sync waits when it switches to polling mode; and since log file sync is a component of gc and buffer busy waits, it drags a lot of other waits up with it.
After Xiao y set "_use_adaptive_log_file_sync" to false, part of the log file sync waiting was resolved.
2) A quality problem in node 1's fiber cable caused IO errors and IO retransmissions, which hurt lgwr's log-write performance.
After changing the database parameter from its default and replacing the fiber cable, the problem was satisfactorily resolved.
The TPS curve of the stress test changed from the original wavy shape:
into a good curve like this.
3.2 Review of the key points of problem solving
1) Don't analyze Oracle wait events in isolation.
In this case, Xiao y found the breakthrough by identifying what the wait events had in common: log file sync.
2) Understand the characteristics and behavior of different database versions.
Xiao y keeps learning the new features of 11g and understands them deeply through handling many incidents, so when log file sync appeared, the new feature behind it could be located quickly.
3) Don't trust the AWR report completely; it is just a tool. Doubt everything and verify.
In this case, the metrics reported by AWR did not truly reflect lgwr's write performance, so you should doubt everything.
4) An Oracle engineer who knows only the database can suspect this and that, but other teams will not accept it. You must master more skills — operating systems, storage, networking, middleware. Of course, finding a service provider with strong comprehensive capabilities is also a good choice.
In this case, Xiao y found direct evidence through the multipath command, which finally prompted a serious investigation by the other teams — the key to the final resolution of this problem.
Risk hint
Starting with 11g, Oracle introduced a polling mechanism for lgwr log writes in addition to the original post/wait mechanism, controlled by the hidden parameter "_use_adaptive_log_file_sync", which defaults to true from 11.2.0.3 onward. The mechanism switches are recorded in the lgwr process's trace file ("Log file sync switching to polling" / "Log file sync switching to post/wait"), and switching to polling mode can easily cause long log file sync waits that hurt transaction response time — see the key knowledge points in section 2.5.
So Xiao y reminds you here:
on Oracle 11.2.0.3 and above, it is recommended to turn off adaptive log file sync and make sure the lgwr process runs under the post/wait mechanism, so that the database suffers no serious performance jitter! The parameter can be changed online with the alter system command shown earlier.
This article is reprinted from a Zhongyi Antu document.