In addition to Weibo, there is also WeChat
Please pay attention
WeChat public account
Shulou
2025-01-16 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Database >
Share
Shulou(Shulou.com)06/01 Report--
Preface
How time flies, the story of the technical life series of me and the data center has come to the sixth issue, and Xiao y has met with you again!
What Xiao y wants to share with you today is the process of analyzing and solving a comprehensive problem.
To solve this kind of problems, only understand the database is not enough, but also need to master relatively solid operating system skills.
It also leads to another less common form of optimization, memory optimization.
As the problem to be shared today is universal, it is suggested that you can check whether there are similar problems in your system according to the method in this article. The last part of the sharing will be a summary and hint of the common risks.
If you think the sharing case is good, please raise your hand and forward it, hoping to remind and help more customers.
More Oracle database actual combat experience sharing and risk reminders of the launch, all in the "Zhongyi Antu" official account! Welcome to follow us.
In addition, recently, many friends have asked whether Xiao y's team can do some offline sharing.
Indeed, if we can organize everyone, even in a conference room or a coffee shop, in addition to face-to-face technology sharing, we can also talk about life, have a barbecue beer and make more friends when we are in high spirits. It's also a blessing in life!
Since you are interested, let's start the first phase of offline sharing.
Friends who are interested in participating in offline sharing in Beijing, Shanghai, Guangzhou, Shenzhen and other places can email Xiao y at 51994106@qq.com or add y's Wechat (shadow-huang-bj) to provide city, name, telephone number, organization, position and other information. When the number of applicants exceeds 20, we start free offline sharing activities in Beijing, Shanghai, Guangzhou, Shenzhen, Hangzhou, Nanjing and other places. In addition, those who are interested can join QQ group 227189100, and we will do some online sharing regularly in the future.
Part 1
Here comes the problem.
Beijing, December 28, 2015.
At 8 p.m., just after dinner, the phone rang and it was a customer of an operator.
"Xiao y, something's wrong. From 17:00 to 18:00 this afternoon, xx database monitoring and instance crash, but now the library is up. This business system is very important, leaders attach great importance to it, and must ask to find out the cause. Other manufacturers have arrived, and no problems have been found for the time being. Can you send an engineer to the site to analyze it right away? try to have a conclusion tonight!"
After receiving the call, Xiao y immediately arranged for his brother to rush to the scene. Then I learned that there was a similar problem last week.
Environment introduction:
Operating system AIX 6.1TL8, 64-bit
Database ORACLE 11.2.0.3 2-node RAC
Configuration: 16CPU 50g memory
Part 2
Analysis process
After a while, received the log, suddenly refreshed, let's take a look at the log, confirm the database crash and monitoring crash problems mentioned by the customer.
Confirm the problem of monitoring and database instance downtime
2.1.1
Database alert.log
You can see:
> 17:42:39, the database alert log related to the key error message, and the call to LMS of the process responsible for GCS went unanswered for 89 seconds. It means that from 17:41: 10 seconds, the database began to have problems.
> > next, at 18:02:48, go directly to the log of the startup database. There is no log whose database is stopped or terminated abnormally. There is no output of the alert log in these 20 minutes. During this period, there was no restart of the operating system.
Preliminary analysis of the reasons:
The most common reason for unresponsive calls to LMS is the bottleneck of system resources such as memory / CPU in the Lpar where the database is located. Usually when the operating system resources are tight, it will be accompanied by the following performance:
> > as a process of cluster resource management, CRS may also time out when detecting resources, thus starting exception handling, such as automatically restarting resources.
> > if some heartbeats between RAC nodes cannot be detected, in order to restore the external service capacity of the entire RAC cluster, MemberKill will be started first after 11g, that is, the database instance will be terminated to attempt recovery. If memberKill cannot solve the problem, it will be upgraded to NodeKill, that is, restart OS to restore the external service capacity of the entire cluster.
Next, check the log of the CRSD process
It can be seen that resources, including VIP/ snooping, detect timeouts due to the shortage of OS resources, and then start exception handling.
2.1.2
Node 1 CRSD.LOG
You can see from the following figure:
17Ru 42JL 30D CRSD checks the health of VIP, 10 seconds later, the test times out, so the status changes to UNKNOWN (then CRSD attempts to restart)
You can see from the following figure:
17:51:18 seconds, due to VIP down, and monitoring depends on VIP, so monitoring is requested to stop.
You can see from the following figure:
17:54:34, an abnormal termination of the database resource ora.XXDB.db was detected
2.1.3
Ocssd.log
As can be seen from the above picture:
In fact, the OCSSD process received a request from node 2 Member kill request to kill the database instance in 2015-12-28. In fact, it wasn't until 18:02:48 that the database started.
During this period, as long as 15 minutes, it shows that the database server resources are already very tight, which can lead to such a slow performance, usually only a large number of page feeds occur in memory.
2.2 operating system performance-hidden dangers have long been buried
As can be seen from the above analysis, in fact, from 17:41:10 to 17:42:39, database node 1 system resources began to be tight and began to become slow!
2.2.1
View CPU usage
You can see:
CPU resources are not a problem, and CPU peak is not high.
2.2.2
View memory page fetch (pi/po)
You can see:
NMON samples every 5 minutes.
The next sample at 17:39 was 17:44, but this one was not taken at all!
This shows that the system resources are already very tight! This is the most important evidence!
The problem occurred at 17:42, because of the collection interval, it was not collected, and it was relatively normal.
But it does not affect the overall judgment of this analysis.
In addition, it is not difficult to find that before the problem, the utilization rate of pageSpace reached 12.44! It means that there has been a shortage of memory before! Little y suggests that if you find that pageSpace is being used on your own AIX platform, be sure to analyze the memory usage, otherwise it will be a big mine.
2.2.2
Why is there a memory page break?
Little y was startled when he saw the data!
Before the problem, such as between 16:24 and 17:29, the database server's computing memory (% comp) had been high for a long time! This is very dangerous for AIX! For computing memory, we should try to control below 80%, such a system is out of health of the system!
90% of the computing memory is close to the critical point of memory paging!
When there are some slightly larger memory requirements, it will cause the system to change pages.
So who triggered the page change?
In Xiao y's opinion, if a wall is about to fall down, it doesn't matter who knocked it down! So that's not the point.
The same 17:44 was supposed to show a record, but it wasn't typed!
It shows that the system is really running out of memory around 17:44.
At 17:49, the computing memory reached 97%! (44 points have been abnormal, which must lead to process accumulation and then increase the use of memory.)
2.3 memory Planning-customer's wishful thinking
The memory size of the database server is 50 GB, and the customer's initial memory planning is as follows
> > SGA configuration 25g
> > PGA is configured as 5G
> > the database parameter processes is 2000, which runs in 1000 processes every day.
The memory footprint of the database can be calculated using the following simple formula:
Consumption of SGA + PGA+ processes at OS level
Normally, in the 11G version of the database, the memory consumption of a single ORACLE service process is 3m.
Therefore, the usual use of memory is 25G+5G+1000*3M=33G, accounting for 66% of Lpar memory.
If 2000 processes are called up according to the exception wait, the memory usage is 25G+5G+2000*3M=36G, and the planning is too large.
2.4 analyze where memory usage is going-the reality is cruel
Because the memory footprint of the database can be calculated using the following simple formula:
Consumption of SGA + PGA+ processes at OS level
Here, SGA is a hard limit, and PGA is not a hard limit (neither workspace nor non-workspace is a hard limit)
Small y obtains the peak value of pga through dba_hist_pgastat and finds that it is only 5 GB, which does not break the parameter limit of PGA, so it is most likely that "the process consumes at the OS level" takes up more memory.
Therefore, Xiao y immediately checks the memory consumption of a single process through the procmap command:
Found that ORACLE occupies a single memory of 10m (add up the second column)
At this point, Xiao y already knows the answer!
Readers, you can also stop for a moment, summarize the above phenomena, and think about it for a few minutes. If you pick up this CASE, how will you continue to check?
↓
↓
↓
↓
↓
↓
↓
↓
↓
↓
↓
↓
Don't go away. There's more in the back.
So where is the memory used? Little y is easy to do, and you can see the details of the memory footprint through the svmon-P command.
You can see:
In the exclusive memory of the work type of each ORACLE service process, the USLA heap portion takes up 1642 memory pages, while each page is 4K, that is, 6-7m more.
In fact, this is a known BUG of the operating system and database.
Zhongyi Technology has encountered this problem several times in other data centers.
2.5 7 MB of memory not to be underestimated
The memory size of the database server is 50 GB, and the customer's initial memory planning is as follows
> > SGA configuration 25g
> > PGA is configured as 5G
> > the database parameter processes is 2000, which runs in 1000 processes every day.
Now let's take a look at:
Under normal circumstances:
11G version of the database, the memory footprint of a single ORACLE service process is 10m instead of 3m!
Therefore, the usual use of memory is 25G+5G+1000*10M=40G, and the database alone accounts for 80% of the memory! This is more dangerous! For the AIX platform, it is recommended that the database accounts for between 50% and 60% of memory, because operating system kernel and other operating systems can also use memory, up to 40% at most. The more common is the use of kernel inode cache.
If 2000 processes are called up according to the exception waiting, the memory usage is 25G+5G+2000*10M=50G.
2.6 confirm operating system and database BUG
When we get here, it's easy.
Xiao y does a keyword search with USLA on mos, and the corresponding BUG is found below.
The following note is about USLA heap on the official website of ORACLE, which causes a single process to occupy 7m more memory.
A description of BUG 13443029 that changed from 3m to 10m.
11gR2Aix-Dedicated Server Processes Have Large Usla Heap Segment Compared To Older Versions (Doc ID 1260095.1)
2.7 how can it be solved?
This problem is a flaw in the operating system that requires both the operating system and the database to install patches at the same time:
> > for AIX 6.1 TL07 or AIX 7.1 TL01, you need to install patches for both the operating system and the database.
> > for AIX 6.1 TL07 SP02/AIX 7.1 TL01 SP02 or later, since the operating system has been repaired, you only need to install patch 13443029 on the database side.
Database patch 13443029 is not included in any PSU under 11.2.0.3, and the fix for this problem is included in 11.2.0.4.
Part 3
Summary of reasons and suggestions
3.1 reason summary
The memory size of the database server is 50 GB, and the memory plan is as follows
SGA configuration 25g
PGA is configured as 5G
Database parameter processes is 2000
The average number of database service processes on the 28th is about 1000.
Due to a known defect in the operating system and database-11gR2Aix-Dedicated Server Processes Have Large Usla Heap Segment Compared To Older Versions (Doc ID 1260095.1), an idle database service process takes up about 7m of private memory in the USLA part.
Therefore, the database as a whole accounts for 25 G + 5G + 1000*10M=40G, that is, about 40 G of computing memory, and the database has accounted for more than 80% of the memory (usually controlled at 60%). Coupled with the use of memory such as kernel, the database usually runs in a state of nearly 90% of the computing memory. This makes the database server run under the high level of memory, when the number of processes increases, sorting hashes and other memory requirements, and then the memory page change occurs, which drags the overall system very slowly.
The most direct evidence of the cause of this failure when memory is tight is as follows:
NMON samples every 5 minutes.
The next sample at 17:39 is 17:44, but this time the sample was not taken!
This shows that the system resources are already very tight! This is the most important evidence!
When the database cluster itself detects that the VIP/ database resources are not responding, the exception handling is carried out, which leads to the failure of monitoring, VIP and database instances.
3.2 recommendations
By solving the defects of the operating system and database about USLA and planning the memory parameters of the database, the use of memory can be reduced and the system can run in a healthier memory condition, thus the problem can be solved fundamentally.
1) install database patch 13443029 to recompile the database with the option of shrsymtab, and reduce the 7m memory in USLA to tens of K, thus freeing up about 7G of memory (if calculated by 2000 processes, 14g of memory is freed)
2) adjust the database SGA memory sga_target parameter from 25g to 20g.
Adjustment description:
After two adjustments, without adding memory to the lpar
The memory planning of the database is 20G+5G+1000*3M=28G. If 2000 processes are full, the memory planning of the database is 20G+5G+2000*3M=31G.
As a result, the system memory resources are more abundant, and the page change will not occur because of some private memory requirements.
Part 4
Common risk reminder
Through this case, Xiao y made some common risk tips:
Don't underestimate the impact of the memory consumed by an idle Oracle service process!
When we do memory planning, we often ignore this point!
If your database version is 11GR2 and runs on the AIX6/7 version, there is likely to be excessive memory consumption in your system, that is, an Oracle service process has about 7m more memory than the 10G version, thus changing a single ORACLE process from 3m to 10m. This is fatal for databases with a large number of Oracle service processes.
For example, for a database running 2000 Oracle service processes, the memory footprint is not 2, 000, 3, 6 gigabytes, but 2, 000, 10, 20 gigabytes, 14 gigabytes more. The extra 14GB will exceed your memory plan, leaving the database running in a dangerous state. Whether to hit this question or not, you can refer to the case shared in this article!
The following is on ORACLE's official website that USLA heap causes a single process to take up 7m more memory
A description of BUG 13443029 that changed from 3m to 10m.
11gR2Aix-Dedicated Server Processes Have Large Usla Heap Segment Compared To Older Versions (Doc ID 1260095.1)
This problem is a flaw in the operating system that requires both the operating system and the database to install patches at the same time:
> > for AIX 6.1 TL07 or AIX 7.1 TL01, you need to install patches for both the operating system and the database.
> > for AIX 6.1 TL07 SP02/AIX 7.1 TL01 SP02 or later, since the operating system has been repaired, you only need to install patch 13443029 on the database side.
Database patch 13443029 is not included in any PSU under 11.2.0.3, and the fix for this problem is included in 11.2.0.4.
For AIX 6.1 TL07 SP02/AIX 7.1 TL01 SP02 or later, apply patch 13443029
For AIX 6.1 TL07 or AIX 7.1 TL01, install AIX 6.1 TL-07 APAR
IV09580, AIX 7.1 TL-01 APAR IV09541, and apply patch 13443029
For other AIX level, apply patch 10190759, this will disable Oracle's online patching mechanism
Note: as of 06/21/2012, fix for bug 13443029 or bug 10190759 are not included in any PSU and the interim patch is needed. Interim patch 10190759 exists on top of most PSU, and patch 13443029 on top of 11.2.0.3 does not conflict with 11.2.0.3.1 PSU and can be applied on top of both 11.2.0.3 base and 11.2.0.3.1 PSU.
Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.
Views: 0
*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.
Continue with the installation of the previous hadoop.First, install zookooper1. Decompress zookoope
"Every 5-10 years, there's a rare product, a really special, very unusual product that's the most un
© 2024 shulou.com SLNews company. All rights reserved.