A tragedy caused by DNS caching

At 5:00 a.m. on a Saturday in 2015, a user in the company's official QQ group reported that the official website would not open, while other users said it opened fine. The customer service agent got up, tried it on a computer without any problem, and told the user that it was probably a problem with his own network and to try again later. By 8:00 in the morning, more and more users were reporting that the official website would not open, and some users and developers reported that the app would not open either, so customer service called me while I was still asleep.

Analyzing and locating the problem

Woken by the call, I was confused and had no idea what was going on. I told customer service that I understood, would check immediately, and would report back as soon as there was news. After washing my face with cold water to wake up, I went over the production releases of the past two days from memory: the XX module had gone live, which should have no impact; bug XX had been fixed, which should have no impact either; the server had just been configured for HTTPS, which looked like it might be related, but the app was not yet using HTTPS in production, so that was ruled out too. I turned on my computer and checked the recent release records, and none of them should have caused a problem this serious. I began to suspect the network, so I immediately called the operations manager and the people involved to check together.

While the network and operations teams were troubleshooting, we went through the web servers, database servers, business logs, database logs, and other monitoring data again, and everything was normal. I tried pinging the domain name from my local machine and it did not go through, which made a network problem seem even more likely. Accessing the site directly through its public IP address worked, which basically confirmed that the service itself was fine. But operations reported that everything on their side was normal and insisted that something must be wrong with our production code, so all parties kept investigating.

At 9:00, users in the groups began reporting en masse that the official website and the app would not open, and some stirred things up by claiming that XXX company had run off. (In 2015 many P2P companies absconded, which left users as jumpy as startled birds: any small problem made them fear the company had fled. Everyone became a monitoring expert, watching every day and refreshing in real time; some even checked the day's earnings in the app when they got up to use the bathroom before dawn.) The customer service hotline was swamped. While continuing to troubleshoot, we reported the problem to the director and the company's senior executives, and advised customer service to explain to users that network jitter in the IDC machine room was being resolved urgently and that funds and data were not affected.

At 10:00, after repeated checks, the developers and operations staff began to suspect DNS resolution, but the exact problem was still unclear. The CTO decided: 1. everyone takes a taxi to the office to solve the problem together; 2. mass messages go out in the QQ and WeChat groups to explain the xxx problem and reassure customers. In the car, I went back over the user's entire access path from end to end.

After arriving at the office, we followed this line of thinking and verified that every company service was reachable through both the public IP and the private IP, but not through the domain name. The logs on the monitoring server, firewall, and network devices were also all normal, so we concluded that the problem was in DNS resolution.
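The check described above (the service answers on its IP address but not on its domain name) can be sketched in a few lines of Python. This is only an illustrative sketch, not the tooling used at the time: the function names, the result labels, and the idea of passing in a known-good IP are assumptions made for the example.

```python
import socket


def resolve(hostname):
    """Return the sorted IPv4 addresses for hostname, or [] if resolution fails."""
    try:
        infos = socket.getaddrinfo(
            hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
        )
    except socket.gaierror:
        return []
    return sorted({info[4][0] for info in infos})


def can_connect(ip, port=80, timeout=3.0):
    """Return True if a TCP connection to ip:port succeeds within the timeout."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


def diagnose(hostname, known_ip, port=80, resolver=resolve, connector=can_connect):
    """Classify an outage by comparing name resolution with direct IP access.

    resolver and connector are injectable so the logic can be exercised
    without touching the network.
    """
    ip_ok = connector(known_ip, port)
    resolved = resolver(hostname)
    if not resolved:
        return "dns-failure" if ip_ok else "dns-and-service-failure"
    if known_ip not in resolved:
        return "dns-mismatch"
    return "ok" if ip_ok else "service-failure"
```

A result of "dns-failure" corresponds to the situation in this incident: the servers were healthy, but the name could not be resolved from the affected networks.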

Tackling the problem

Now that it was clearly a DNS resolution problem, new questions arose: why was DNS resolution failing, and how could we fix it? While submitting a ticket to Wanwang, we also tested access over the China Telecom, China Mobile, and China Unicom networks, and found that the domain failed to resolve only on the Unicom network. Customer service feedback confirmed this: very few complaints came from Telecom and Mobile users, and most came from Unicom users. So we started calling Unicom. At first Unicom would not accept our request, so we began calling as ordinary subscribers, demanding that they immediately fix our inability to reach the site.

Thus began a tug-of-war between Wanwang and Unicom. Wanwang said that DNS resolution on their side was normal and that all indicators were normal. When we called Unicom, they said they were already aware of the issue and that a specialist would get back to us later. After a while, a Unicom network engineer replied that problems like this are generally caused by domain name resolution. In the roughly 6 hours after we reached the office at about 10:30 in the morning, we took turns making 50 or 60 calls to Unicom, filed N tickets with Wanwang, and received N phone calls.

During this time the leaders also pulled every string they had: friends inside Unicom and experts from the network operations community helped locate and solve the problem. We tried many things ourselves, for example clearing the local DNS cache with the ipconfig /flushdns command, and updating, deleting, and re-adding the DNS records on Wanwang's website, but none of it solved the problem completely. We kept looking for a way to test from networks in different regions and on different carriers. Finally, through recommendations and searching, we found 17ce and QYT, which are very practical. They have since become essential tools for me when diagnosing network problems: they make it easy to check whether a site is reachable from each carrier and region, and how fast it loads.

We also found that the company's other domain names were all reachable; only the official website's domain and its subdomains were not. During this time many people asked the same question: did you forget to renew the domain? From the start, everyone had put this to operations, and the answer was always no. It was not until 12:30, after repeated questioning, that the operations person admitted that when he logged in to Wanwang a little after 8 o'clock he had found the domain in arrears, but had paid the fee immediately. We were nearly furious. Is there no reminder when a domain is about to expire? Only then did we learn that after the previous operations manager left, nobody had updated the phone number and email address registered with Wanwang, so the reminder emails and text messages were never received.

From talking with contacts at Wanwang and Unicom and with the leaders' friends, together with our own tests, we reached a preliminary understanding of the cause: Wanwang stopped resolving the domain because the renewal fee had not been paid. Because users' local machines and their DNS servers still held cached records, some users could reach the site and others could not. After the fee was paid, Wanwang restored and pushed out the DNS records, but DNS resolution involves many levels of servers, each of which has to refresh in turn. Some had not refreshed yet, so users behind those DNS providers still could not reach the official website.
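Why some users kept working while others stayed broken even after the fee was paid can be illustrated with a toy model of TTL-based caching. This is a simplified sketch under stated assumptions: the class names, the TTL values, and the choice to model the suspension as an empty answer cached for a "negative TTL" are all illustrative, and real registrars and carrier resolvers may behave differently.

```python
class Authority:
    """Toy authoritative server whose zone can be suspended and restored."""

    def __init__(self, records, ttl=600, negative_ttl=900):
        self.records = records            # name -> IP address
        self.ttl = ttl                    # seconds a good answer may be cached
        self.negative_ttl = negative_ttl  # seconds a failed answer may be cached
        self.suspended = False

    def answer(self, name):
        """Return (ip_or_None, seconds_the_answer_may_be_cached)."""
        if self.suspended or name not in self.records:
            return None, self.negative_ttl
        return self.records[name], self.ttl


class CachingResolver:
    """Toy carrier resolver that reuses an answer until its TTL expires."""

    def __init__(self, authority):
        self.authority = authority
        self.cache = {}  # name -> (ip_or_None, expires_at)

    def lookup(self, name, now):
        entry = self.cache.get(name)
        if entry is not None and now < entry[1]:
            return entry[0]  # still fresh: served from cache, good or bad
        ip, ttl = self.authority.answer(name)
        self.cache[name] = (ip, now + ttl)
        return ip
```

In this model, a resolver that cached the good record before the suspension keeps answering correctly, while a resolver that first asked during the suspension keeps returning the failure until its own cached entry expires, even though the authority has already been fixed.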

We asked Wanwang what the worst-case delay was before every DNS server would be up to date. The answer was that it would certainly be fine within 48 hours. But we could not afford to wait: as time passed, more and more users noticed the problem, the QQ and WeChat groups were boiling over, and the chairman had started paying attention to this problem too. Some customers said directly in the group, "your technology is too weak" (and that was putting it mildly; some simply called in and swore at people).

Temporary solution

Continuous testing with 17ce showed that most regions had recovered, except for Beijing Unicom and the Unicom networks in a few other regions, which meant the DNS records there had not been refreshed yet. Now that we had located the problem and understood the cause, we wondered whether switching to a different DNS resolver would help. We changed the local DNS address to 8.8.8.8 (Google's public DNS service), and it worked! We quickly wrote a how-to manual and sent it to the anxious customers.
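Comparing resolvers does not require changing the machine's DNS settings: a query can be sent straight to a chosen server. The sketch below hand-builds a minimal DNS query for an A record using only the standard library. It is an illustration, not the method used during the incident; the function names are made up for the example, and apart from 8.8.8.8 no resolver address comes from the original account.

```python
import socket
import struct


def build_query(name, query_id=0x1234):
    """Build a minimal DNS query packet asking for the A record of name."""
    # Header: ID, flags (recursion desired), 1 question, 0 other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    question = qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN
    return header + question


def response_code(packet):
    """Return the RCODE of a DNS response (0 = no error, 3 = NXDOMAIN)."""
    flags = struct.unpack("!H", packet[2:4])[0]
    return flags & 0x000F


def answer_count(packet):
    """Return the number of answer records announced in the header."""
    return struct.unpack("!H", packet[6:8])[0]


def ask(server, name, timeout=3.0):
    """Query server over UDP port 53 and return (rcode, answer_count)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name), (server, 53))
        packet, _ = sock.recvfrom(512)
    return response_code(packet), answer_count(packet)
```

Calling ask("8.8.8.8", domain) and then ask(carrier_resolver_ip, domain) for the same domain shows whether one resolver returns answers while another returns an error or an empty response, which is the difference we observed between Google's DNS and the affected Unicom networks.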

Website users could fix access by changing their DNS. What about the app? There was no good option and we could not wait, so we had the developers change the client's server address from the domain name to the public IP address as a stopgap. Android was relatively easy: users could simply download and install the new build. But at that time iOS review took at least a week, by which point it would be far too late. It turned out that the iPhone also lets you set DNS manually; we tested it and it worked, so we immediately added it to the manual and sent it to customer service and to the groups for users.

Some people suggested simply letting users reach the site through the public IP. Opening the home page that way is no problem, but the domain name is written into the configuration files that the systems use to talk to each other, so forcing the change could cause other problems. From 10:00 that first day we ate only one meal, at around 4 o'clock, and everyone was exhausted after N phone calls, so we left it at that for the day and went to the office early the next morning to follow up.

The next day, 17ce testing showed that all nodes were reachable except two Beijing Unicom nodes, but Beijing is our home base and most of our users are there. We kept talking to Wanwang and Unicom about how to resolve this for good, while also preparing for the worst: what to do if those nodes never came back. We went through every configuration file in production that used the domain name and made sure we could switch to the public IP address at any time without affecting service. The app team built a complete new version, ready to release at any time, that would force users to upgrade to a build connecting directly to the public IP.
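Finding every configuration file that mentions the domain is the kind of inventory that is easy to script. The following is a minimal sketch, assuming text-based configuration files under one directory; the function names and the list of file extensions are assumptions for the example, not a description of our actual production layout.

```python
import os


def scan_text(path, text, domain):
    """Return (path, line_number, line) for every line of text mentioning domain."""
    return [
        (path, number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if domain in line
    ]


def scan_tree(root, domain, extensions=(".conf", ".properties", ".xml", ".yml")):
    """Walk root and collect every reference to domain in matching files."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in sorted(filenames):
            if filename.endswith(extensions):
                path = os.path.join(dirpath, filename)
                with open(path, encoding="utf-8", errors="replace") as handle:
                    hits.extend(scan_text(path, handle.read(), domain))
    return hits
```

The resulting list gives the exact files and lines that would have to change if the services had to be repointed at the public IP address.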

By 10:00 that evening, the two Beijing Unicom nodes were still unreachable. We agreed with the leaders that if they were still unreachable by 8:00 a.m. on Monday, the modified system would go live and the app would force an upgrade (no loan listings were open at the time, but a new issuance was planned for that week). The first thing I did when I got up on the third morning was pick up my phone and check whether the official website would open over my Unicom connection. It did, and everybody was happy.

As the saying goes, the more a truth is debated, the clearer it becomes. This incident made me thoroughly understand the whole DNS resolution process.

DNS resolution process

DNS stands for Domain Name System. It is a hierarchical naming system for computers and network services, organized into a tree of domains and used on TCP/IP networks. Its main service is translating host names and domain names into IP addresses. Put simply, DNS turns the name in a URL into an IP address.

The whole DNS process, from user request to response

Step 1: the browser checks whether its cache holds a resolved IP address for the domain name; if so, resolution ends. The browser's DNS cache is limited in both entry lifetime and size; the lifetime is typically bounded by the record's TTL or by a browser-specific limit.

Step 2: if the browser cache has no entry, the operating system first checks whether the local hosts file contains a mapping for the domain name; if so, it uses that IP address and resolution is complete.

Step 3: if hosts has no mapping for the domain name, the local DNS resolver cache is checked; if it holds a mapping, that is returned directly and resolution is complete.

Step 4: if neither hosts nor the local DNS resolver cache has a mapping, the query goes to the preferred DNS server set in the TCP/IP settings, which we call the local DNS server here. When this server receives the query, if the name falls within a zone that it hosts locally, it returns the result to the client and resolution is complete. This answer is authoritative.

Step 5: if the name is not in a zone hosted by the local DNS server but the server has cached the mapping, it returns the cached IP address and resolution is complete. This answer is not authoritative.

Step 6: if neither the local DNS server's zone files nor its cache can answer, what happens next depends on whether the server is configured with a forwarder. Without forwarding, the local DNS server sends the request to one of the 13 root DNS servers. The root server determines which servers are responsible for the top-level domain (.com) and returns the IP address of a server for that top-level domain. The local DNS server then contacts that .com server. If the .com server cannot resolve the name itself, it returns to the local DNS server the address of the DNS server that manages the next level down (the domain itself). The local DNS server then queries that server, and repeats the process until it finds the host record for the name.

Step 7: if forwarding is used, the local DNS server forwards the request to an upstream DNS server, which tries to resolve it. If the upstream server cannot resolve it, it either queries the root servers or forwards the request to its own upstream, and so on. Whether the local DNS server uses forwarding or root hints, the result always comes back to the local DNS server, which then returns it to the client.
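The lookup order in steps 1 to 7 can be sketched as a toy model in Python. Nothing here talks to a real DNS server: the caches are plain dictionaries, the delegation tree is an in-memory stand-in for the root, .com, and domain servers, and all names (resolve_name, LocalDNS, iterate_from_root) are invented for the illustration.

```python
def find_delegation(server, name):
    """Return the server this one delegates name to, longest suffix first."""
    labels = name.split(".")
    for i in range(len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in server["delegations"]:
            return server["delegations"][suffix]
    return None


def iterate_from_root(name, root):
    """Step 6: follow referrals from the root until a server holds the record."""
    server = root
    while server is not None:
        if name in server["records"]:
            return server["records"][name]
        server = find_delegation(server, name)
    return None


class LocalDNS:
    """Toy local DNS server covering steps 4 to 7."""

    def __init__(self, zones, root, forwarder=None):
        self.zones = zones          # names this server is authoritative for
        self.cache = {}             # answers learned from earlier queries
        self.root = root            # toy root server for iterative queries
        self.forwarder = forwarder  # optional callable: name -> ip or None

    def lookup(self, name):
        if name in self.zones:
            return self.zones[name], "local zone (authoritative)"
        if name in self.cache:
            return self.cache[name], "local DNS cache (non-authoritative)"
        if self.forwarder is not None:
            ip, source = self.forwarder(name), "forwarder"
        else:
            ip, source = iterate_from_root(name, self.root), "iterative query from root"
        if ip is not None:
            self.cache[name] = ip
        return ip, source


def resolve_name(name, browser_cache, hosts, os_cache, local_dns):
    """Steps 1 to 3 check local tables; anything else goes to the local DNS server."""
    for source, table in (
        ("browser cache", browser_cache),
        ("hosts file", hosts),
        ("OS resolver cache", os_cache),
    ):
        if name in table:
            return table[name], source
    return local_dns.lookup(name)
```

Each server in the toy tree is a dictionary with a "records" table and a "delegations" table, so a root that delegates "com" to a server that delegates "example.com" to a third server mirrors the referral chain of step 6. The second lookup of the same name is answered from the local DNS server's cache, which is exactly the layer that stayed stale for some carriers during our outage.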

This incident taught us a hard lesson:

First, there were gaps in process management: the handover when staff left was not done properly.

Second, our crisis management was immature, and that hurt the company's reputation.

Third, our monitoring was incomplete: nothing was monitoring access from the external network. Such monitoring should be set up in advance.
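One small safeguard that follows from these lessons is a scheduled check of the domain's expiry date that does not depend on any one person's mailbox or phone. Below is a minimal sketch, assuming the expiry date is recorded by hand from the registrar's console; the function names, message format, and 30-day threshold are assumptions for the example.

```python
from datetime import date


def days_until_expiry(expiry, today):
    """Return the number of days from today until expiry (negative if past)."""
    return (expiry - today).days


def expiry_alert(domain, expiry, today, warn_days=30):
    """Return an alert message, or None if the domain is not close to expiring."""
    remaining = days_until_expiry(expiry, today)
    if remaining < 0:
        return f"CRITICAL: {domain} expired {-remaining} day(s) ago"
    if remaining <= warn_days:
        return f"WARNING: {domain} expires in {remaining} day(s)"
    return None
```

Run daily and routed to a shared team channel rather than an individual, a check like this would have flagged the unpaid renewal well before resolution stopped.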

Sometimes the most serious problem comes from the small thing you routinely overlook.

Author: a pure smile

Source: http://www.ityouknow.com/

The copyright belongs to the author, please indicate the source of the reprint
