2025-01-29 Update From: SLTechnology News&Howtos
In a live production system, unexpected performance problems crop up from time to time for all sorts of reasons. When an urgent problem cannot be solved immediately, adjusting the system's hardware resources is sometimes a fast and effective stopgap; once the system is running normally again, root cause analysis can fix the problem for good.
Recently, Lao K encountered a case in which a customer's emergency measure caused the database to restart. Searching the mail archive showed the same thing had already happened years ago. The problem itself is not complicated, but it can restart the database outright and disrupt production, so it is worth sharing here as a risk reminder.
Today, Lao K shares this case with you:
While a new business function was being added to a system, with the database running normally and the existing business still online, the release operations were performed on the database. The release turned out to run too slowly, and the application maintenance team worried that it could not be completed within the production window, so they asked the system maintenance team to help analyze and speed things up. After the emergency response to this problem, however, the database restarted, affecting production even more seriously.
So who caused the restart? Did we make a wrong judgment during the emergency? How do we prevent a similar tragedy from happening again? This post starts from the final failure, walks you through a hidden emergency pitfall, and closes with some lessons about emergency measures that introduce new problems.
Here comes the problem.
Saturday was a customer's production-release day; the window ran from 19:30 in the evening until 9 o'clock the next morning, and Lao K was in charge of on-site support.
Many systems were going into production that night. Just after 19:30, consultations of all kinds poured in and Lao K was very busy. A call came in: "Lao K, come over and take a look. The application maintenance team says the release actions are running slowly; is there any way to speed them up?" There were many similar calls, and problems have to be handled one at a time, but the system in that call was fairly important and its production window was limited, so it clearly deserved extra attention. So I promised to finish the problem at hand and then look at this system next.
A few minutes later, however, the phone rang again: "Lao K, it's bad, this system's database has restarted! Come and look!" How embarrassing. Did the system fall over and restart because I failed to respond to them in time? If so, that would be a serious sin. I hurried to the scene to deal with the problem.
Simple analysis, simple conclusion
one
Make clear the situation
At the scene, colleagues of every kind were discussing the incident. The system is a two-node RAC database on version 11g, and this restart was in fact a restart of both database instances.
First, let's see what the alert log says:
Clearly, the database was terminated by the ASMB process at 19:41, and some trace files were generated before the termination. What is the ASMB process? As the name implies, it is the process through which the database instance interacts with the ASM instance, and the error here shows that ASMB could no longer connect to the ASM instance.
Generally, when this happens, CRS restarts as well. But checking the key CRS processes with `ps -ef` showed they had not restarted. The problem therefore seemed to lie with ASM, so let's look at ASM's alert log. The key parts are as follows:
You can see that the ASM instance was restarted for the following reason:
The parameter parallel_max_servers was set to 4505, which is illegal; the valid range tops out at 3600. On finding this error, the CKPT process terminated the ASM instance.
The problem looks simple: parallel_max_servers was set to 4505, the ASM instance terminated, and that in turn terminated the DB instances. Still, a few questions need answering:
Is the 3600 upper bound on parallel_max_servers a hard rule, or is it derived from other settings?
A bad parameter setting usually just raises an error, so why did this one restart the ASM instance?
Where did the parameter change come from? Was it related to the emergency operation?
two
Tackling the questions one by one
1) The limit on parallel_max_servers
To confirm whether this is a hard limit, we can check the official manual for the parameter's upper bound:
As you can see, the upper limit of this value is 3600.
2) Why does the parameter change restart the ASM instance?
Normally, if we change a parameter in the database and the change fails, it simply raises an error; that obviously would not restart the instance.
As for the restart that happened here, the process that caused it was CKPT:
CKPT is a key process of the instance; when it hits an error it cannot handle, it terminates the entire instance.
3) Who changed the parameter on the ASM instance?
Generally, the point of raising parallel_max_servers is to let the system start more parallel processes. For an ASM instance, changing this parameter hardly makes sense: an ASM instance does not open, and it basically only serves the Database (plus perhaps some monitoring/management statements). So who would change the parameter on the ASM instance? Could someone have mistaken the ASM instance for the Database during the emergency?
Pause here and think about it for a moment. The reason is actually very simple...
three
The truth is actually very simple.
Let's look again at the ASM log; the relevant entries are worth pondering:
Before the parameter adjustment, the instance logged that the CPU count had changed to 228; the system then adjusted the value of parallel_max_servers, and then the instance reported the error.
First, confirm that a change in the CPU count does drive an adjustment of parallel_max_servers. Looking at the parameter's properties, its default value is indeed derived from CPU_COUNT, and the parameter can be modified dynamically.
Clearly, when Oracle's code dynamically adjusts parallel_max_servers in response to a CPU change, it forgets that the parameter has a hard upper limit of 3600. This should be classified as a bug, and we can quickly verify it on MOS:
The problem exists in versions 11.2.0.2 and 11.2.0.3 and was fixed in 11.2.0.4.
So the truth comes out.
During the release, the application maintenance team found the relevant tasks running slower than expected, so they talked to the person in charge of the system, hoping to speed up execution by adding resources. After consulting the leadership, that person dynamically added memory and CPU to the system (an IBM minicomputer, whose hardware resources can be adjusted on the fly), planning to reclaim the resources once the release was done. But while the CPUs were being added, the error above occurred, the database restarted, and the release itself was disrupted!
four
The truth beyond the truth
When I run into a problem, thinking about how to avoid it, how to distill a method for finding similar problems, and how to turn that into a habit helps me think more, and think more deeply.
-- Lao K
① Some remaining questions
Although the problem has been satisfactorily analyzed, several doubts still linger, and they are also what the customer cares about, so let's answer them first.
1. Why didn't the Database instance also restart because of the parallel_max_servers adjustment?
Look back at the Database instance's alert log:
We also see the CPU count change to 228 there, but no dynamic adjustment of parallel_max_servers. The main reason is that in the Database instance, parallel_max_servers had already been set explicitly in the parameter file; once the parameter is set explicitly, the system no longer adjusts it dynamically from the CPU count.
2. Why doesn't the ASM instance trip over this parameter when it starts up again?
How is parallel_max_servers computed when ASM starts again? Searching the ASM alert log, we found that the parameter is not set at instance startup at all, and trying to set it manually on ASM raises an error:
So this parameter simply cannot be set in an ASM instance, not even via spfile; it is reasonable, then, that it is neither computed nor checked at startup.
3. If the CPUs that were just added are removed again, will the database go down again?
In principle, the database does not care whether the CPU count grows or shrinks, or whether it changes by ten or by one: it recomputes parallel_max_servers on every change. So if you adjust the CPU count again, and the value computed from the new CPU count still exceeds 3600, the problem will recur!
4. How do we prevent this kind of problem?
We now know this is a bug, so the corresponding patch is the fundamental fix.
However, the data center still has many 11.2.0.3 systems, dynamic CPU adjustment is a common operation, and rolling out the patch everywhere takes time and maintenance windows. Is there a temporary preventive measure in the meantime?
The workaround in the bug document is to set parallel_max_servers manually to a value below 3600. But as we just saw, the ASM instance cannot set this parameter at all; the official document apparently did not consider the ASM case, so it too has a flaw. What should we do, then?
What would you do? If it were you, how would you avoid this kind of problem?
② Think more, discover more
To answer the questions above properly, we need to understand exactly how the database computes parallel_max_servers. Consider the following description:
From the description we can see the variables that influence the parameter:
PARALLEL_THREADS_PER_CPU: depends on the operating system, usually 2
CPU_COUNT: the actual CPU count seen by the database, but adjustable
concurrent_parallel_users: depends on the instance's memory management mode; the default coefficient is 1, 2, or 4 accordingly
PROCESSES: the value of the PROCESSES parameter when the instance started
So parallel_max_servers is the smaller of two computed values: min(PARALLEL_THREADS_PER_CPU * CPU_COUNT * concurrent_parallel_users * 5, PROCESSES - 15). The first term does depend on the CPU count, while the second seems to depend only on PROCESSES. Let's check the 11.2 reference:
The default value of PROCESSES is 100. If no one changed the PROCESSES parameter of the ASM instance, then the value computed here should be min(PARALLEL_THREADS_PER_CPU * CPU_COUNT * concurrent_parallel_users * 5, 100 - 15). No matter how large the first term gets, the result would be only 85, nowhere near 3600.
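To make that arithmetic concrete, here is a minimal sketch in Python of the documented default formula, min(PARALLEL_THREADS_PER_CPU * CPU_COUNT * concurrent_parallel_users * 5, PROCESSES - 15). The coefficient values (2 threads per CPU, concurrent_parallel_users = 4) are assumptions inferred from the numbers in this case, not read from a real instance:

```python
def default_parallel_max_servers(cpu_count, processes,
                                 threads_per_cpu=2,
                                 concurrent_parallel_users=4):
    """Sketch of the documented 11.2 default for parallel_max_servers.

    threads_per_cpu and concurrent_parallel_users are assumed values
    for illustration, not queried from an instance.
    """
    cpu_term = threads_per_cpu * cpu_count * concurrent_parallel_users * 5
    return min(cpu_term, processes - 15)

# With the documented PROCESSES default of 100, even 228 CPUs
# yield only 85 -- nowhere near the 3600 cap.
print(default_parallel_max_servers(cpu_count=228, processes=100))  # 85
```

This is exactly why the 4505 in the alert log cannot be explained by the default PROCESSES value of 100.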
So did someone change the PROCESSES value of the ASM instance, or is the official reference wrong? If it were you, how would you investigate this parameter further?
③ The last piece of the truth
We first checked the alert logs of both nodes and found no record of such a modification. Is the reference really wrong, then? In fact we were reading the 11.2 reference, while the version in use, 11.2.0.3, was released well after it. Let's check how the default value of PROCESSES is computed via MOS:
The MOS document "what you need to know to upgrade to 11.2.0.3 Grid/ASM" points out that the default PROCESSES for ASM is "available CPU cores * 80 + 40". Note that PROCESSES does not change dynamically with the CPU count; it is computed only when the instance starts.
From an AWR report covering the earlier period, we can read off the number of CPU cores at instance startup:
So the value of parallel_max_servers was computed as:
min(PARALLEL_THREADS_PER_CPU * CPU_COUNT * concurrent_parallel_users * 5, startup_CPU_cores * 80 + 40 - 15)
Substituting the actual values:
min(2 * 228 * 4 * 5, 56 * 80 + 40 - 15) = min(9120, 4505) = 4505
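The same arithmetic written out as a small sketch. The 56-core and 228-CPU figures come from this case; the coefficients (2 threads per CPU, concurrent_parallel_users = 4) and the ASM PROCESSES default of cores * 80 + 40 are as discussed above:

```python
# ASM's PROCESSES default on 11.2.0.3 is derived from the CPU cores
# available at startup (per the MOS note): cores * 80 + 40.
startup_cores = 56
processes = startup_cores * 80 + 40       # 4520, fixed at instance startup

# The dynamic hardware change later raised CPU_COUNT to 228.
cpu_count = 228
cpu_term = 2 * cpu_count * 4 * 5          # 9120
process_term = processes - 15             # 4505

target = min(cpu_term, process_term)
print(target)          # 4505
print(target > 3600)   # True: exceeds the hard cap, triggering the error
```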
④ Follow-up measures
For all 11.2.0.3 systems, apply the corresponding patch so we are not hit again.
Before patching, make sure parallel_max_servers is manually set to a value below 3600 in every database instance; since the ASM instance cannot set this parameter, avoid raising its CPU count to a level whose recomputed value would exceed 3600.
Do not adjust system CPU/memory resources lightly when there is no good reason to.
⑤ Did the CPU really need adjusting?
Later, once the whole situation was clear, we looked back at the supposed slowness during the release. It was simply that statistics had not been gathered in time on a large number of tables with newly inserted rows, producing poor execution plans and slow execution. Although CPU utilization was indeed higher than before, without gathering statistics and re-running the statements, the two common emergency measures of adding CPU/memory could not help; in this case, the act of adding CPU instead brought the database down!
What happened to us?
Let's review the whole story, together with the database's own actions:
The application team saw slow SQL execution and high system CPU utilization, and suggested expanding CPU/memory.
Without first analyzing the root cause, the system's CPU/memory was expanded, the CPU count reaching 228.
The database noticed the CPU count had grown to 228. The Database instance did not adjust parallel_max_servers, because it was set manually; the ASM instance had no manual setting, so the code adjusted the parameter.
The target value was min(PARALLEL_THREADS_PER_CPU * CPU_COUNT * concurrent_parallel_users * 5, startup_CPU_cores * 80 + 40 - 15), i.e. min(2 * 228 * 4 * 5, 56 * 80 + 40 - 15) = min(9120, 4505) = 4505.
Because of the bug, the CKPT process did not check the value 4505 against the 3600 hard limit of parallel_max_servers before setting it.
When setting parallel_max_servers to 4505, CKPT hit error ORA-00068; as a key process, hitting an error it cannot complete makes it terminate the ASM instance.
With the ASM instance terminated, the Database instance could not access ASM and terminated as well.
CRS found both the Database and ASM instances terminated, so it restarted them.
On the fresh start, the ASM instance does not need to consider parallel_max_servers; ASM started normally, then the Database instance started normally, completing the restart.
Thus a decision that seemed harmless, even "the more the better", restarted an entire system. Emergencies call for caution, and it is best to prescribe the right medicine, lest the cure bring unexpected results.
Risk hint
When the database version is 11.2.0.2 or 11.2.0.3, it is recommended to manually set parallel_max_servers below 3600 in database instances; since ASM instances cannot set the parameter, avoid dynamically raising the CPU count so high that the recomputed value exceeds 3600, which would restart the Database/ASM instances.
This article is reproduced from Zhongyi Antu.