
A system optimization! Technology Life Series: The Story of Me and the Data Center, Issue 17

Updated 2025-04-02. From: SLTechnology News&Howtos (Shulou.com), Database section.


Are there intermittent IO performance problems in your system, or do you have poor IO performance all the time?

At the end of the article we give a common risk warning and a way to check for it. Why hesitate? Go check your system ^_^

The theme this time: a priceless system optimization by China's most beautiful DBA!

This optimization completely resolved the frequent stutters and white screens in one customer's key business system.

First, let's compare the results before and after the Oracle database optimization. The details follow later.

Before optimization: (screenshot not preserved in this copy)

After optimization: (screenshot not preserved in this copy)

Yes, it seems some readers care less about the optimization result than about "China's most beautiful female DBA".

All right, let's get to the point. For this 17th issue we invited Xiaoyi, the little fairy of the Zhongyi Technology database team, to share the priceless optimization case above. It is priceless because it meant so much to the customer. The case is also packed with knowledge points, one highlight after another!

Xiaoyi, a post-90s DBA of the new generation in the Zhongyi Technology database team, has become a typical representative of the team's junior DBAs. This article is her showpiece of case handling!

Careful now: don't just look at the photos, the article is what matters! We also look forward to more work from Xiaoyi ^_^

Share it if you like it. Your shares are what motivate us to keep writing!

Entrusted with a mission at a critical moment

"Xiaoyi, this afternoon you'll go with the sales rep to solve a potential customer's performance problem."

It turned out to be an insurance customer. The database behind their core system is a single-instance Oracle database running on an AIX minicomputer.

The business system stutters and shows white screens many times every day, and the business department has complained repeatedly. The customer's operations department is under enormous pressure. If the problem continues, everyone will be driven crazy.

This time their operations manager came to Zhongyi Technology hoping we could solve the problem. As long as it was solved, everything else would be easy to discuss. They had approached several other companies before, but none of them solved it, and the problem had dragged on for a long time.

The customer is clearly in trouble. I'll do my best!

It's a serious problem.

After I arrived on site and the customer briefly introduced the system, I could not wait to pull an AWR report! One important reason Oracle has endured and is loved by customers is that it is measurable and tunable.

Through the Oracle Wait Interface (OWI) you can easily tell whether the system has a bottleneck, and there are countless interfaces for you to tune it.

Go straight to the wait events, as follows:

I was startled as soon as I saw this. With so many abnormal waits in the database, no wonder the business system stutters.

You can see:

log file sync averages 94 ms per wait and accounts for 42% of total DB time.

log buffer space averages 952 ms per wait and accounts for 20% of total DB time.

The cause is that the redo log cannot be written out fast enough, which slows down every commit in the database.

The log buffer space and buffer busy waits are clearly side effects of the slow log file sync.

Because LGWR writes slowly, the log buffer cannot be flushed to disk in time, so the log buffer appears "full" and other processes have to wait on "log buffer space". The root cause is that the log writer writes slowly!

How slow is LGWR?

Next, Xiaoyi checked the trace of the lgwr process:

You can see:

The longest online redo log write took more than 137 seconds, and the last record shows 65 seconds to write just 18K. What an astonishing result. At this point we have to suspect a problem in the storage IO subsystem! Could it really be this easy for Xiaoyi to solve? Heh heh...
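To put the last trace record in perspective, here is a minimal sketch of the arithmetic. It assumes "18K" means 18 KiB, and the function name is only illustrative.

```python
def write_rate_bytes_per_sec(size_kb, elapsed_sec):
    """Effective write rate, assuming size_kb is in KiB (1024 bytes)."""
    return size_kb * 1024 / elapsed_sec


# Figures from the lgwr trace described above: 18K written in 65 seconds.
rate = write_rate_bytes_per_sec(18, 65)
print(f"{rate:.0f} bytes/s")  # roughly 284 bytes/s
```

A few hundred bytes per second is far below what any healthy disk path delivers, which is why the write looks stalled rather than merely slow.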

Disappointing communication

Xiaoyi told the customer about these findings and the direction of the analysis, and found that the customer was not surprised at all.

According to the customer, the specialist companies they had called in before found the same thing, then handed the problem back to them, saying that IO had a performance problem. "But we have checked the storage array, the SAN switches and the links many times, and there were no errors or useful clues! There is nothing wrong with the operating system either! If you end up with the same conclusion, you can go back now!"

That stung a little. Zhongyi Technology is not just a database service company. We are a one-stop service provider. Xiaoyi had to prove to him that we are different!

Misjudging the customer

Is the customer simply not skilled enough to find the problem in the storage IO subsystem?

I hesitated over whether to ask the company's AIX and storage experts to come and check together. Why not check it myself first, and call them in once we have confirmed that the storage really has a performance problem?

Run iostat, and the result is as follows:

Seeing these numbers, it seems Xiaoyi really had misjudged the customer!

From the operating system's point of view, performance at the LUN level is very good!

The service queue is not full, there are no timeouts or failures, and both the average read/write service time (avgserv) and the maximum service time (maxserv) are very small. If iostat or sar -d shows no performance problem, then going to the storage array to investigate is the wrong direction!

At this point the reader can pause and think: if it were you, how would you continue the investigation, and what would you suspect?

Find the intersection to heaven

Since it is the lgwr process that writes slowly, use truss to capture the system calls of that process, as shown below:

You can see:

lgwr makes a large number of calls to two system calls, aio_nwait_timeout and listio64, and there is a pause after each listio64 call. These two are clearly asynchronous IO calls of the AIX system.

Checking the asynchronous IO configuration is then the natural next step:

# ioo -F -a | grep aio

aio_active = 1

aio_maxreqs = 4096 # maximum number of requests

aio_maxservers = 10 # maximum number of AIO servers per CPU

aio_minservers = 3 # minimum number of AIO servers per CPU

The system is configured with 22 CPUs, and each CPU supports up to 10 AIO servers, so the theoretical maximum for the whole system is 22 × 10 = 220 AIO servers.
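The limit can be derived mechanically from the captured settings. The sketch below parses the `name = value` lines shown above and multiplies the per-CPU maximum by the CPU count, giving 22 × 10 = 220. The parser and its names are illustrative assumptions, not part of any AIX tool, and it expects the command output to have been saved as text.

```python
def parse_aio_settings(text):
    """Parse 'name = value' lines, dropping trailing '#' comments."""
    settings = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        name, value = (part.strip() for part in line.split("=", 1))
        try:
            settings[name.lower()] = int(value)
        except ValueError:
            continue  # skip anything that is not a plain integer
    return settings


# Values copied from the article's output; the CPU count is also from the article.
CAPTURED = """\
aio_active = 1
aio_maxreqs = 4096 # maximum number of requests
aio_maxservers = 10 # maximum number of AIO servers per CPU
aio_minservers = 3 # minimum number of AIO servers per CPU
"""
NCPU = 22

settings = parse_aio_settings(CAPTURED)
max_aio_servers = NCPU * settings["aio_maxservers"]
print(max_aio_servers)  # 220
```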

Pressing on, let's see how many AIO servers the operating system has actually started.

# pstat -a | grep -c kproc

320

As you can see, there are 320 in total, well beyond the supposed maximum of 220! It seems that when the maximum number of servers is not enough, the system is allowed to exceed this limit.

After checking many times in a row, Xiaoyi found the count was always 320. Normally the number of AIO servers drops after they have been idle for a while, unless they are busy all the time!

That's it! Seeing this, Xiaoyi was bursting with joy, never mind a girl's reserve.

Because there are not enough AIO servers, LGWR cannot get an idle one, and its IO never even reaches the LUN level, so the IO cannot complete for a long time.
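The reasoning here is that a server count which never falls back, and sits at or above the configured limit on every check, means the AIO servers are never idle. A minimal sketch of that test follows. The function name is an assumption, and the sample list is hypothetical: the article only says the count was 320 on every check, not how many checks were made.

```python
def servers_never_idle(counts, limit):
    """True if every sampled server count is at or above the limit."""
    return len(counts) > 0 and all(count >= limit for count in counts)


LIMIT = 22 * 10              # 22 CPUs x aio_maxservers of 10, from the article
observed = [320, 320, 320]   # hypothetical number of samples, all reading 320

print(servers_never_idle(observed, LIMIT))  # True
```

A healthy system would show at least some samples below the limit once the servers have had time to go idle.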

Reason summary

In short, the application issues more IO requests than the operating system's AIO servers can handle, which makes LGWR writes extremely slow.

Think again

At this point, readers, you might as well stop and think: if the decision were yours, how would you adjust it? How far would you raise maxservers from 10? To 20? To 50? Or ...?

Your decision affects how well the optimization works. If the result is poor, it may shake the customer's confidence. After all, this is the customer's key business system. Pause here, then take a look at Xiaoyi's choice.

Confirmation before selection

To further substantiate the evidence, Xiaoyi also ran the following command to show the shortage of asynchronous IO:

You can see:

Within one second, the peak number of asynchronous IO requests exceeded 2000, far above the maximum configured for AIO, so slow IO writes were inevitable.

Solution

With the analysis above, the fix is easy!

For this performance problem we do not tune SQL and we do not change database parameters. We change operating system parameters!

After consulting the company's AIX experts and the team's third-line experts, Xiaoyi proposed the following optimization plan to the customer:

Modify the AIO parameters: increase maxservers to 800 and maxreqs to 16384
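As a rough sanity check of the plan, the sketch below compares the old and new settings against the observed peak of about 2000 asynchronous requests in one second. It assumes maxservers remains a per-CPU value after the change, as the comments in the configuration output describe. The class and names are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class AioConfig:
    maxservers_per_cpu: int
    maxreqs: int

    def server_limit(self, ncpu):
        """System-wide AIO server limit for the given CPU count."""
        return ncpu * self.maxservers_per_cpu


NCPU = 22             # CPU count from the article
PEAK_REQUESTS = 2000  # approximate one-second peak reported in the article

before = AioConfig(maxservers_per_cpu=10, maxreqs=4096)
after = AioConfig(maxservers_per_cpu=800, maxreqs=16384)

for label, cfg in (("before", before), ("after", after)):
    limit = cfg.server_limit(NCPU)
    print(label, limit, PEAK_REQUESTS <= limit)
# before 220 False
# after 17600 True
```

Under these assumptions the new limit leaves wide headroom above the observed peak, whereas the old one was exceeded roughly ninefold.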

Exciting optimization effect

After the operating system parameters were adjusted, Xiaoyi waited for the result as anxiously as if it were her own child.

She could not wait to call the customer the next day. "So far there have been no more stutters or white screens. Thank you so much! This is a priceless optimization! It means a great deal for the healthy growth of the business! Please carry on with further optimization, and leave the commercial side to me!"

Before optimization: (screenshot not preserved in this copy)

After optimization: (screenshot not preserved in this copy)

Experience hint

On AIX, if database files are stored on a file system, an incorrect asynchronous IO configuration leads to serious IO performance problems. Many customers overlook this and may be putting up with poor IO performance all the time.

Zhongyi Technology recommends checking or monitoring with the following commands and adjusting in time, so that the system runs at its best performance.

Step 1 - Get the number of CPUs

# vmstat

Step 2 - View the configuration of the asynchronous IO

# ioo -F -a | grep aio

aio_active = 1

aio_maxreqs = 4096 # maximum number of requests

aio_maxservers = 10 # maximum number of AIO servers per CPU

aio_minservers = 3 # minimum number of AIO servers per CPU

Step 3 - View the maxgc of the asynchronous IO:

If maxgc stays above (number of CPUs × aio_maxservers) for a long time, IO may have serious performance problems and the asynchronous IO configuration needs to be adjusted!
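The rule in Step 3 can be turned into a simple monitoring check. The sketch below takes a list of maxgc readings collected over time and flags the system when most of them exceed CPUs × aio_maxservers. The function name, the 80% cutoff used for "a long time", and the sample readings are all assumptions made for illustration.

```python
def aio_needs_tuning(maxgc_samples, ncpu, aio_maxservers, sustained_fraction=0.8):
    """Flag when maxgc exceeds ncpu * aio_maxservers in most samples.

    sustained_fraction is an assumed cutoff for 'a long time';
    choose a value that suits your own monitoring interval.
    """
    if not maxgc_samples:
        return False
    limit = ncpu * aio_maxservers
    over = sum(1 for value in maxgc_samples if value > limit)
    return over / len(maxgc_samples) >= sustained_fraction


# Hypothetical readings; only the 22 CPUs and aio_maxservers = 10 come from the article.
busy = [2100, 1800, 2050, 1500, 150]
quiet = [50, 120, 300, 90]

print(aio_needs_tuning(busy, ncpu=22, aio_maxservers=10))   # True
print(aio_needs_tuning(quiet, ncpu=22, aio_maxservers=10))  # False
```

Using a fraction of samples instead of a single reading avoids raising an alarm on one short burst.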
