2025-03-31 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/01 Report--
one
Preface
On the eve of Valentine's Day (February 14), an Oracle 11.2.0.4 RAC cluster in a data center crashed!
A few days later, another RAC cluster went down!
A few days after that, yet another RAC cluster went down.
As an operations engineer, when you hear that other customers are being hit by such a wave of outages, do you feel an inexplicable panic deep down?
So the question is: will a similar wave of outages hit your data center?
What is causing these failures?
Will this wave of outages keep spreading?
If the root cause cannot be found in time, Xiao y believes this wave of outages will continue!
The Oracle databases in your data center may be getting closer and closer to an outage! The scary thing is, you may not be aware of it...
This is definitely not alarmist talk!
This is a real failure in a very large data center: in less than two weeks, three different Oracle databases terminated abnormally!
Coincidentally, other customers that Xiao y serves have also been showing signs of the same failure, one after another! Fortunately, it was discovered and dealt with in time.
With this wave of failures bearing down, see how Xiao y cuts through the complexity and helps the customer get to the root of the problem.
Once the truth is revealed, you may well find that this is a common problem!
So Xiao y did not dare to delay: he is sharing it with you right away and sounding the red alert!
In this sixteenth issue, Xiao y takes you through the analysis of a series of Oracle database crashes in a data center.
Specific early-warning and verification methods are provided at the end of the article. Why hesitate? Go and check your own systems!
two
Here's the problem.
"Xiao y, we have had an incident. Today one node of a RAC system went down in the morning, and the other node went down in the evening. The operating system did not restart, but the database instances crashed! An SR has been opened, but the cause has not been determined yet. Management takes this issue very seriously. Can you come and take a look tomorrow? Management hopes to find the cause of the problem tomorrow.
By the way, this is an 11.2.0.4 RAC with the latest PSU applied!"
After taking the call, Xiao y perked up.
The caller is a very large state-owned bank in China, which has a number of senior Oracle DBAs of its own.
The problems they bring to Xiao y are usually strange, complex ones. Someone who understands only the database, without enough knowledge of the operating system, middleware, storage, and so on, often cannot solve them.
It seems that a tough battle is inevitable.
three
Start the analysis
First, take a look at the database alert log:
The next morning, when Xiao y arrived at the customer site, the customer first described the previous day's failure: at around 9:00 a.m. on February 12, node 1 of the 11.2.0.4 RAC went down, and node 2 went down at 22:00 that evening.
After the customer helped him log in to the system, Xiao y first checked the database alert log, as shown in the following figure:
It is not difficult to see:
In 2017, ASMB, a background process of the database instance, failed to communicate with the ASM instance, and the ASMB process terminated the database instance. Xiao y therefore needs to go on to check the ASM alert log, to see whether a problem in the ASM instance came first and caused the database crash.
Then check the ASM alert log:
It is not difficult to see:
Tens of seconds before the database crash, the rbal background process of the ASM instance hit an ORA-07445 error and dumped core, so the pmon process terminated the ASM instance.
In other words, an ORA-7445 error occurred in the rbal process of the ASM instance, which caused the ASM instance to terminate. Because the database instance depends on the ASM instance, the database instance was terminated as well. The specific ORA-7445 error in the ASM instance is:
ORA-07445: exception encountered: core dump [__lwp_kill()+48] [SIGIOT]
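The error line packs three pieces of information into its brackets: the function where the process died, the byte offset inside that function, and the signal it received. As a minimal sketch (the regular expression and the helper name `parse_ora_7445` are illustrative assumptions, not an Oracle tool, and real alert logs may append further bracketed fields), the line can be split like this:

```python
import re

# Assumed shape: "core dump [<function>()+<offset>] [<signal>]".
ORA_7445 = re.compile(
    r"ORA-0?7445: exception encountered: core dump "
    r"\[(?P<function>[^\]()]+)\(\)\+(?P<offset>\d+)\] "
    r"\[(?P<signal>[A-Z0-9]+)\]"
)


def parse_ora_7445(line):
    """Return function, offset and signal from an ORA-7445 line, or None."""
    match = ORA_7445.search(line)
    if match is None:
        return None
    return {
        "function": match["function"],
        "offset": int(match["offset"]),
        "signal": match["signal"],
    }


# The error from the ASM alert log, written without the stray spaces.
line = "ORA-07445: exception encountered: core dump [__lwp_kill()+48] [SIGIOT]"
parsed = parse_ora_7445(line)
print(parsed)
```

The function name in the first bracket is only the place where the process finally died; as the rest of the analysis shows, it is often too generic to identify the cause on its own.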
When Xiao y first saw this error, he shook his head helplessly: this was trouble!
Why does Xiao y feel this way? A senior DBA may well feel the same on seeing this error.
Because __lwp_kill is an extremely common call, there may be ten thousand or more reasons for a core dump in this function, and not every possibility will be recorded on Metalink...
However, Xiao y is still very confident that, once he switches into his focused "top student" mode, even an unknown problem can be pinned down quickly.
Yes, as long as we concentrate on analyzing the cause of this ORA-7445 [__lwp_kill()+48] [SIGIOT] error, we will get to the truth behind this series of failures.
four
A bad start
At this point, Xiao y had a brief conversation with the engineers who had already looked at the problem.
The result of the conversation, in two words: not good.
Several of the customer's more senior engineers had already looked at this problem and searched Metalink for similar issues. Matching on the call stack turned up some cases with the same stack, but none of those cases reached a direct conclusion. The customer had opened an SR with GCS, which was still being analyzed.
The customer wants at least a rough result today, so time is tight.
The customer knows Xiao y's habits: no matter how urgent things are, he takes the time to smoke a cigarette first.
After a word with the customer, Xiao y went downstairs for a smoke.
five
Some serious background knowledge
Taking advantage of the smoke break, Xiao y gathered his thoughts.
Some readers may be confused by a few of the terms above: what is a call stack, and what are ORA-600 and ORA-7445 errors? Xiao y has found that many DBAs simply repeat what others say about them. Here, Xiao y gives you a little background.
Key knowledge points
Knowledge point 1: what is an ORA-600 error?
Some readers will say that ORA-7445 errors, like ORA-600 errors, are Oracle internal errors. In Xiao y's view, this is not quite accurate! ORA-600 errors are internal errors, but ORA-7445 errors are not always.
ORA-600 is raised when the Oracle source code itself catches an exception. It usually occurs in a specific function, is relatively specific, and usually indicates an Oracle bug.
Knowledge point 2: what is an ORA-7445 error?
ORA-7445 errors are not the same as ORA-600 errors.
When an Oracle process receives a fatal signal from the operating system while running, it reports an ORA-7445 error. The operating system catches certain illegal operations by a process. For example, when a process tries to write to an invalid memory location, the operating system protects itself by sending the process a fatal signal such as SIGBUS or SIGSEGV, and you then see the process dump core.
ORA-7445 errors can occur anywhere in the code, and the exact location of the error has to be determined from the core file.
From this it is not difficult to see that an ORA-7445 error has many possible causes. In essence, the operating system sends a fatal signal to the process, so the cause may be a database bug or some exception on the operating system side.
This is why it is more difficult to analyze ORA-7445 errors than ORA-600 errors.
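The mechanism behind knowledge point 2 can be observed without Oracle at all. The following is a minimal sketch, assuming a POSIX system: a child Python process deliberately reads from address 0 through `ctypes`, the operating system kills it with a fatal signal, and the parent reports which signal it was. The helper name `run_and_report` is made up for this illustration.

```python
import signal
import subprocess
import sys


def run_and_report(snippet):
    """Run a Python snippet in a child process and describe how it ended."""
    proc = subprocess.run([sys.executable, "-c", snippet], capture_output=True)
    if proc.returncode < 0:
        # On POSIX, a negative return code means "killed by signal -returncode".
        return "killed by " + signal.Signals(-proc.returncode).name
    return "exited with status " + str(proc.returncode)


# Reading from address 0 is an invalid memory access, so the operating
# system sends the child a fatal signal instead of letting it continue.
crash_report = run_and_report("import ctypes; ctypes.string_at(0)")

# A well-behaved child for comparison.
clean_report = run_and_report("pass")

print(crash_report)
print(clean_report)
```

The parent learns only which signal ended the child, not why the child touched that address. That is the same position a DBA is in when an ORA-7445 error shows nothing more than a function name and a signal.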
Knowledge point 3: what is a call stack?
Whenever we talk about a bug or defect, one question always comes up: what condition triggers this bug?
A bug is usually triggered in a particular scenario. The call stack is the trail of function calls, and it represents the specific scenario in which the bug was triggered.
This is what Xiao y mentioned earlier: before he arrived, the engineers had already used the call stack to look for a matching bug.
Unfortunately, the cases on MOS with the same call stack never reached a final conclusion, so they cannot be used as a reference.
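Matching a call stack against known cases amounts to comparing function names frame by frame. A rough sketch of that idea follows. The `<-` separator, every offset other than `+48`, and the two `hypothetical_caller_*` frames are invented for illustration and do not come from the actual trace file.

```python
def frame_names(stack_text):
    """Extract bare function names from frames such as 'abort()+64'."""
    names = []
    for token in stack_text.replace("<-", " ").split():
        name = token.split("(")[0].strip()
        if name:
            names.append(name)
    return names


def common_top_frames(stack_a, stack_b):
    """Count how many frames match, starting from the top of both stacks."""
    count = 0
    for a, b in zip(stack_a, stack_b):
        if a != b:
            break
        count += 1
    return count


# Hypothetical stacks: the innermost (most recent) frame comes first.
observed = frame_names(
    "__lwp_kill()+48 <- pthread_kill()+16 <- raise()+8 "
    "<- abort()+64 <- hypothetical_caller_a()+100"
)
known_case = frame_names(
    "__lwp_kill()+48 <- pthread_kill()+16 <- raise()+8 "
    "<- abort()+64 <- hypothetical_caller_b()+20"
)

shared = common_top_frames(observed, known_case)
print(shared, "matching frames from the top")
```

A match limited to the topmost frames is weak evidence, because those frames are the generic termination path that many unrelated failures share. The frames further down are what distinguish one trigger scenario from another.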
six
Start with the call stack to find the truth
Xiao y then opened the trace file of the rbal process that hit the ORA-7445 error and found the call stack section, as shown below
First find the function that appears in the first square bracket of the ORA-7445 error, that is, __lwp_kill
This means that the rbal process dumped core in the system call __lwp_kill.
LWP stands for lightweight process, and kill means terminate.
Careful readers will notice that __lwp_kill has two leading underscores, indicating that it is not a function that Oracle code calls directly but an internal function called from within another function.
So how does Xiao y know what these calls mean?
In fact, it is very simple: these are standard operating system calls, so a search on Baidu or Google will tell you.
In the trace file, the call stack is read from the bottom up: the functions lower down were called first, and the functions above them were called later. This error is all too common, and very broad! Many kinds of exceptions can end with a call to __lwp_kill to terminate the process. So analyzing this function on its own is not meaningful, and we need to move on, as shown in the following figure.
1) The function that calls __lwp_kill is pthread_kill, which sends a signal to a thread and is likewise an internal call rather than one made directly by Oracle code. Move on.
2) raise(), which sends a signal to the executing program. raise() calls pthread_kill.
3) abort(). As the name suggests, it means termination: abort() causes abnormal termination of the process, unless the SIGABRT signal is caught and the signal handler does not return.
4) __assert(). Its job is to terminate program execution if its condition evaluates to false. Put simply, the program checks something and has to terminate if the check fails.
At this point, it is not difficult to see that the call trajectory of the function is
__lwp_kill
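The termination path described in steps 1) to 4) is ordinary C library behavior and can be reproduced outside Oracle. Here is a minimal sketch, assuming a POSIX system: Python's `os.abort()` calls the C `abort()` function, and the parent then checks which signal ended the child. On Linux and most Unix platforms SIGIOT is a historical alias for SIGABRT, which would explain why an abort shows up as `[SIGIOT]` in the error text. The helper name `terminating_signal` is made up for this illustration.

```python
import signal
import subprocess
import sys


def terminating_signal(snippet):
    """Return the number of the signal that killed the child, or None."""
    proc = subprocess.run([sys.executable, "-c", snippet], capture_output=True)
    return -proc.returncode if proc.returncode < 0 else None


# os.abort() calls the C library abort(), which raises SIGABRT in the caller.
abort_signal = terminating_signal("import os; os.abort()")

# SIGIOT may be missing on some platforms, so fall back to SIGABRT.
sigiot = getattr(signal, "SIGIOT", signal.SIGABRT)

print("child was killed by signal", abort_signal)
print("SIGIOT equals SIGABRT here:", int(sigiot) == int(signal.SIGABRT))
```

This shows that the process ended itself deliberately after an internal check failed, not that the operating system caught an illegal memory access. That is why the frames below abort() and __assert() in the stack are the ones worth analyzing.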