2025-03-29 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/01 Report--
Objective: to share how the company handled a database (DB) failure, focusing mainly on the troubleshooting approach.
Event description and impact:
At 04:43 on September 30, 2018, zabbix raised an alert that the odsdb2 database was suspected to be down. The staff on duty could not log in to the database server, either through the bastion host or via ssh from other machines. At the same time, the odsdb1 database was also hung, and it was not possible to log in to the database from the command line. (The impact on the company's business was negligible.)
Incident troubleshooting:
4:46, the on-duty staff in the computer room notified the DBA and the on-duty staff at Yizhuang to analyze the situation.
5:23, the on-duty staff reported that the database server had restarted automatically but was stuck at the startup screen.
5:30, the DBA arrived on site to assist with troubleshooting.
The DBA found that the ogg process could not start because the database had reached its process limit (3000) and no new connections could be made.
6:03, staff from the data analysis room joined the analysis of the ODS problem and confirmed that the database on ODS node 1 was hung.
6:56, the machine-room staff tried to restart the odsdb2 server manually, but it was still stuck at the startup screen.
7:40, tried to reduce the number of database connections by blocking the port that applications use to connect to the database.
8:30, contacted the HP vendor to report the fault.
9:20, killed all external connections to the odsdb1 database (protecting the main business first).
9:30, ran hang analyze on the odsdb1 database to find out why it was hung.
10:11, restarted the odsdb1 database instance.
10:28, odsdb1 returned to normal.
10:30, the ogg process returned to normal.
10:40, unblocked the application port that had been blocked.
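The article does not say how the external connections were killed at 9:20. One common approach when the process limit prevents any login is to work at the operating-system level, since dedicated server processes for remote clients usually appear in the process list as oracle<instance> (LOCAL=NO). The following is a minimal Python sketch under that assumption. The function name, the instance name odsdb1, and the sample ps -ef text are illustrative only and are not taken from the incident.

```python
def external_server_pids(ps_output, instance_name):
    """Return PIDs of dedicated server processes that serve remote clients.

    Assumes `ps -ef` style columns: UID PID PPID C STIME TTY TIME CMD,
    and that remote sessions show up as 'oracle<instance> (LOCAL=NO)'.
    """
    marker = "oracle{} (LOCAL=NO)".format(instance_name)
    pids = []
    for line in ps_output.splitlines():
        if marker not in line:
            continue  # background processes and local sessions are left alone
        fields = line.split()
        pids.append(int(fields[1]))  # second column is the PID
    return pids


# Hypothetical sample output, for illustration only.
sample_ps = """\
oracle    4101     1  0 04:10 ?        00:00:01 oracleodsdb1 (LOCAL=NO)
oracle    4102     1  0 04:11 ?        00:00:00 oracleodsdb1 (LOCAL=NO)
oracle    2210     1  0 Sep29 ?        00:03:12 ora_smon_odsdb1
oracle    4200  4199  0 04:15 ?        00:00:00 oracleodsdb1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
"""

pids = external_server_pids(sample_ps, "odsdb1")
print(pids)
```

The sketch only selects candidate PIDs; whether to terminate them, and in what order, remains an operator decision. Background processes such as SMON are deliberately excluded, because killing them would bring the instance down.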
Event analysis:
1. The odsdb2 node went down and could not start; it stayed stuck at the startup screen. The crash and restart are suspected to have been caused by a hardware problem on the database server, so the fault was reported to the server vendor.
2. The odsdb1 database was hung and could not provide service, which affected every application that depends on the ods database, as well as ogg.
3. odsdb1 reached its configured maximum number of processes (3000), so it was impossible to log in to the database and analyze the situation.
4. We identified which application servers connect to the ods database and blocked the port they use, to reduce external connections to the database.
Even after the external connections were killed, the number of connections quickly climbed back to the maximum, so hang analyze was used to produce a trace for analysis.
The hang analyze trace shows that the database was hung on two wait events: gc domain validation and parallel recovery coord wait for reply.
Both wait events occur when, after node 2 goes down, node 1 takes over node 2's services, rolls back node 2's uncommitted transactions, and recovers node 2's data.
As the information in the figure above shows, the SMON process was performing data recovery for node 2 but had already been waiting for 289 min 41 sec.
A search of My Oracle Support (MOS) turned up several bugs related to gc domain validation.
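As a rough cross-check of the timeline, the SMON wait can be projected backwards to estimate when recovery began. The Python sketch below assumes the trace was captured at 9:30, the time hang analyze is listed in the timeline; the actual capture time is not given in the article, so the result is only approximate.

```python
from datetime import datetime, timedelta

# Times taken from the incident timeline (September 30, 2018).
alert_time = datetime(2018, 9, 30, 4, 43)       # zabbix alert for odsdb2
trace_time = datetime(2018, 9, 30, 9, 30)       # assumed capture time of the trace
recovered_time = datetime(2018, 9, 30, 10, 28)  # odsdb1 back to normal

# SMON wait reported in the hang analyze output.
smon_wait = timedelta(minutes=289, seconds=41)

# Estimated moment SMON began waiting on node 2 recovery.
wait_started = trace_time - smon_wait

# Gap between that moment and the first alert, and total time to recovery.
gap_to_alert = alert_time - wait_started
outage = recovered_time - alert_time

print(wait_started.strftime("%H:%M:%S"))  # 04:40:19
print(gap_to_alert)                       # 0:02:41
print(outage)                             # 5:45:00
```

Under that assumption, SMON began waiting at about 04:40, a few minutes before the 04:43 alert. This is consistent with the explanation that node 1 started recovering node 2 as soon as node 2 went down and then stayed stuck until the 10:11 restart.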
Follow-up improvement plan:
1. Perform regular hardware checks on the database servers to prevent a recurrence (to be discussed with the data center after the holiday, aiming for one check per month).
2. Add emergency drills for ODS database switchover.