2025-02-23 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Database >
Shulou(Shulou.com)05/31 Report--
This article presents a case analysis of inconsistent parameters across RAC nodes. The editor finds it very practical and shares it here; I hope you will get something out of it after reading.
In Oracle RAC, some parameters are database-level: all instances must use the same value. Others are instance-level and may be set differently on each instance. However, for certain instance-level parameters, inconsistent settings between nodes can cause failures.
On the Bethune intelligent diagnosis platform (https://bethune.enmotech.com), database-parameter checks are very detailed. According to their influence on the database, parameters can be divided into performance parameters, stability parameters, and standard-operation parameters.
During diagnosis, we found that most people are fairly casual about parameter configuration. The most common problems include the following:
10g DRM parameter configuration
Oracle 10g introduced the DRM (Dynamic Resource Mastering) feature. By default, when an object is accessed more than 50 times and its master is another node, Oracle triggers a DRM operation to change the object's master node. The advantage is a large reduction in wait events such as gc grant.
During a DRM operation, Oracle temporarily freezes the information about the resource, then unfreezes it on the other node, and then changes the resource's master node. Because the frozen resources are resources in the GRD (Global Resource Directory), processes accessing the resource are temporarily suspended for the duration of the DRM. For this reason, a DRM operation can easily cause system or process exceptions.
Oracle DRM also has many bugs, especially in the Oracle 10gR2 version, so in a 10g production environment it is generally recommended to turn the DRM feature off.
To disable DRM, the usual settings are as follows:
_gc_affinity_time=0
_gc_undo_affinity=FALSE
However, these two parameters are static, meaning the instances must be restarted for them to take effect. In fact, you can set two other dynamic hidden parameters to achieve the same result:
_gc_affinity_limit=250
_gc_affinity_minimum=10485760
You can even set these two parameters to larger values. They take effect immediately, and once they are set on all nodes, the system will no longer perform DRM.
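The settings above could be applied with ALTER SYSTEM commands like the following. This is only a sketch: hidden (underscore) parameters must be double-quoted, the scopes shown reflect the static/dynamic distinction described above, and you should confirm any hidden-parameter change with Oracle Support before using it in production.

```sql
-- Static parameters: disable DRM permanently; take effect after instance restart
ALTER SYSTEM SET "_gc_affinity_time" = 0     SCOPE = SPFILE SID = '*';
ALTER SYSTEM SET "_gc_undo_affinity" = FALSE SCOPE = SPFILE SID = '*';

-- Dynamic hidden parameters: effectively stop DRM immediately (set on every node)
ALTER SYSTEM SET "_gc_affinity_limit"   = 250      SCOPE = MEMORY;
ALTER SYSTEM SET "_gc_affinity_minimum" = 10485760 SCOPE = MEMORY;
```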
The following articles are recommended for your reference:
[New Book Serial] Failure analysis of RAC caused by DRM
[In-depth Analysis] DRM and read-mostly locking
[In Detail] A case of performance problems caused by Oracle RAC DRM
RAC global transactions
Cluster-wide global transactions (Clusterwide global transactions) are a new feature of 11g. In RAC, a distributed transaction may have branches on several nodes, each of which is a local transaction on its node. When _clusterwide_global_transactions=true (the default), Oracle treats these local transactions as a single transaction. When _clusterwide_global_transactions=false, Oracle coordinates the local transactions as separate transactions through multi-phase commit.
With the default setting of TRUE, you may hit the following bug: Bug 13605839 - ORA-600 [ktbsdp1], ORA-600 [kghfrempty:ds], ORA-600 [kdBlkCheckError]: corruption in rollback with Clusterwide Global Transactions in RAC; also ORA-00600: [kjuscl:!free].
Therefore, it is recommended to change this parameter to FALSE; doing so has no impact on performance.
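The change could be made as follows. A sketch only: the parameter is hidden (so it is double-quoted), and to my understanding it cannot be changed in memory, so the setting goes to the spfile and takes effect after a rolling restart of the instances; verify this with Oracle Support for your exact version.

```sql
-- Disable clusterwide global transactions on all nodes (restart required)
ALTER SYSTEM SET "_clusterwide_global_transactions" = FALSE SCOPE = SPFILE SID = '*';
```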
Failure caused by LMS inconsistency between nodes
The LMS process is mainly responsible for data exchange between nodes, and it is the busiest process in RAC. Its default count is derived from the number of CPUs on the system, and the calculation varies between versions. It can also be configured explicitly through the gcs_server_processes parameter. In general, the number of LMS processes should be the same on all nodes.
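A quick way to compare the LMS counts across instances is to count the running LMS background processes from any node; a minimal sketch:

```sql
-- Count running LMS processes on each RAC instance
SELECT inst_id, COUNT(*) AS lms_count
  FROM gv$process
 WHERE pname LIKE 'LMS%'
 GROUP BY inst_id
 ORDER BY inst_id;
```

If the counts differ between instances, the nodes are exchanging cache-fusion traffic asymmetrically, which is exactly the situation in the case below.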
Next, let's look at a fault related to LMS.
Scenario: a batch job was sometimes fast and sometimes slow. With a completely identical execution plan, the execution time varied from 2 hours to 10 hours.
Sampling the AWR report, the overall DB time was as follows:
This DB time was mainly consumed in the RAC Global Cache component.
Here is a brief description of the gc current grant 2-way wait event:
gc cr/current grant 2-way is the transmission of a grant message package. When a CR or current block is requested, x or s permission is requested from the block's master instance. If the requested block is not found in the buffer cache of any instance, the LMS process notifies the foreground (FG) process to read the block from disk into the local buffer cache.
Was the inter-node waiting so long because the traffic between nodes was too heavy?
However, this was not the case: the traffic between nodes was very small. So why was there so much waiting?
Let's analyze what the Global Cache part of RAC is doing.
Taking the access of CR blocks as an example:

Avg global cache cr block receive time =
    Avg global cache cr block build time
  + Avg global cache cr block send time
  + Avg global cache cr block flush time
  + Avg message sent queue time on ksxp
  + other
In the figure above, we find that the sum of these four component times is far smaller than the total time consumed. So where did all the time go?
We continued the analysis with the RAC global statistics in the AWR report.
On the last line, we found flow control as high as 16.28. That figure was captured while the system was running at its slowest; under normal operation, the flow-control value is 0.8.
So, 16.28 vs 0.8. This is the crux of the problem!
However, according to the previous analysis, the traffic between nodes was not heavy, so why was there flow control?
In general, flow control between nodes has the following causes:
1. The private-network link is congested.
2. The peer RAC node is heavily loaded.
3. The transmission configurations of the two nodes differ.
In this case, the first two were ruled out, which left the third: the transmission configurations of the two nodes differed. Since data transfer between nodes is performed by the LMS processes, this suggested that the LMS configuration differed between the nodes.
We queried the gcs_server_processes parameter and found that it was not configured. We then checked the number of CPUs, with the following result:
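Both checks can be run from any node against the GV$ views; a minimal sketch:

```sql
-- Compare the LMS parameter setting on each instance
SELECT inst_id, name, value
  FROM gv$parameter
 WHERE name = 'gcs_server_processes';

-- Compare the CPU count seen by each instance
SELECT inst_id, value AS cpu_count
  FROM gv$osstat
 WHERE stat_name = 'NUM_CPUS';
```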
Sure enough, the CPU counts were unequal. The node with more CPUs, and therefore more LMS processes (node 1 in this case), could fire cache-fusion requests at the node with fewer LMS processes (node 2) faster than node 2 could service them, so node 2 became overloaded.
To protect itself from this kind of overload, Oracle applies flow control, which in turn produces a large amount of waiting in the system.
Finally, we explicitly set the gcs_server_processes parameter to make the number of LMS processes consistent on both nodes, and the problem was solved.
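The fix could look like the following. A sketch under assumptions: the value 4 is a hypothetical example, not the value used in the case; gcs_server_processes is a static parameter, so the change goes to the spfile and requires an instance restart on each node.

```sql
-- Pin the LMS count to the same value on all nodes (example value; restart required)
ALTER SYSTEM SET gcs_server_processes = 4 SCOPE = SPFILE SID = '*';
```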
Bethune, from architecture to details, all-round diagnostic system security and health, knows your database better than you do.
The above is a case analysis of inconsistent parameters on RAC nodes. It covers knowledge points you may see or use in daily work; I hope you can learn more from this article.
© 2024 shulou.com SLNews company. All rights reserved.