Reflections on an Operations and Maintenance Failure of Exchange Server

2026-02-07 Update From: SLTechnology News&Howtos shulou

This article records and reflects on how a mail flow failure in the company's Exchange mail system was discovered, handled, and repaired on August 9, 2018. It helps me sum up the experience and learn from it, and it can also serve as a cautionary example for other operators and administrators.

Fault discovery

After an internal team training session ended at around 18:50 yesterday, colleagues reported that none of them could receive external mail (mail from the Internet). The symptom was that the Exchange server sent and received mail normally on the internal network and could send mail to external addresses, but could not receive mail from external addresses.

Because the company's mail system is a self-hosted Exchange Server 2010, the operations staff have to manage it themselves. Tests with several external mailboxes, including NetEase, Alibaba enterprise mail, and Microsoft Outlook, confirmed that mail from none of them was being received.

Because mail is one of the company's core services and colleagues had already reported problems, this failure counted as an important and urgent incident that had to be resolved as quickly as possible to restore service.

Note 1: if the problem is serious or an emergency handling procedure exists, report to your superior and issue a notice in accordance with that procedure.

Note 2: the following summarizes my personal views and experience; please point out any mistakes.

Fault handling

The most important thing when facing a failure is to troubleshoot quickly so that the service recovers as fast as possible. So the first thing to do is troubleshooting. Because it was already after working hours, the incident, although serious, had not yet had a significant impact.

Because I lack experience operating Windows, and Exchange in particular, I could not pinpoint the problem from experience alone. I could only check the possible causes one by one, relying on past experience combined with Google searches.

Preliminary testing showed that internal mail was sent and received normally and that mail sent from inside to outside was delivered normally, but receiving external mail failed. So the following investigation began.

Before troubleshooting, find out whether anything changed recently, such as software configuration, especially when two or more administrators manage the system together. This server is managed by one person and no changes had been made recently, so the problem appeared suddenly and troubleshooting started directly:

Check domain name resolution and rule out problems with the MX records. Use the nslookup command from several external servers to test the MX records as well as the related A and CNAME records.

Note 1: on Windows you can query directly with nslookup -q=mx xxx.com. On Linux, nslookup -type=mx xxx.com works the same way; you can also query interactively by running nslookup, entering set q=mx or set type=mx, and then entering the domain name.

Note 2: when querying MX records, query the mail domain, which is usually the parent domain of the mail server's FQDN. For example, for mail.qq.com you only need to query the MX record of qq.com.

This check ruled out domain name resolution as the cause.
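When the same MX check has to be repeated from several machines, it can help to script it. The following is a minimal Python sketch, not part of the original troubleshooting: it assumes nslookup is on the PATH and that its output uses one of the two line layouts shown in the comments. The function names parse_mx and query_mx are made up for this illustration.

```python
import re
import subprocess

# Matches the "mail exchanger" part of both common nslookup layouts:
#   Linux:   example.com    mail exchanger = 10 mx1.example.com.
#   Windows: example.com    MX preference = 10, mail exchanger = mx1.example.com
_MX_LINE = re.compile(r"mail exchanger = (?:(?P<pref>\d+)\s+)?(?P<host>\S+)")
_WINDOWS_PREF = re.compile(r"MX preference = (\d+)")


def parse_mx(output):
    """Return a sorted list of (preference, host) pairs found in nslookup output."""
    records = []
    for line in output.splitlines():
        match = _MX_LINE.search(line)
        if match is None:
            continue
        pref = match.group("pref")
        if pref is None:
            # Windows prints the preference before the host name.
            win = _WINDOWS_PREF.search(line)
            pref = win.group(1) if win else "0"
        records.append((int(pref), match.group("host").rstrip(".")))
    return sorted(records)


def query_mx(domain, server=None):
    """Run nslookup for MX records; `server` is an optional resolver to ask."""
    cmd = ["nslookup", "-type=mx", domain]
    if server:
        cmd.append(server)
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=15)
    return parse_mx(result.stdout)
```

Calling query_mx("xxx.com") against two or three different public resolvers and comparing the returned lists gives the same information as the manual check, with the lowest preference value first.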

Check for communication problems between the external and internal networks, such as firewall blocking or network link problems between the firewall and the server. Use the telnet mail.xxx.com 25 command to check whether port 25 is open and to rule out the firewall.

Note 1: port 25 is the standard SMTP port on which a mail server receives external mail.

Note 2: if port 25 is working and the target is an Exchange mail server, you should see a greeting like "220 mail.xxx.com Microsoft ESMTP MAIL Service ready at Fri, 10 Aug 2018 10:43:58 +0800".
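The telnet check can also be done from a script, which is handy on machines where telnet is not installed. The sketch below is an illustration under simple assumptions: it reads only the first chunk the server sends and treats a 220 reply code as "port 25 is reachable and the SMTP service answered". The function names are hypothetical.

```python
import socket


def parse_banner(line):
    """Split an SMTP greeting into (code, text); return None if it is malformed."""
    line = line.strip()
    if len(line) < 3 or not line[:3].isdigit():
        return None
    return int(line[:3]), line[4:]


def read_smtp_banner(host, port=25, timeout=10.0):
    """Connect and return the parsed greeting; raises OSError if unreachable."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        data = sock.recv(1024)
    return parse_banner(data.decode("ascii", errors="replace"))


def port_25_looks_healthy(banner):
    """A 220 greeting means the server accepted the connection and is ready."""
    return banner is not None and banner[0] == 220
```

A call such as port_25_looks_healthy(read_smtp_banner("mail.xxx.com")) run from an external network exercises the same path as incoming Internet mail. An OSError (refused or timed out) points at the firewall or the network link rather than at Exchange itself.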

To confirm that the problem is not a bug in a firewall or network device, restart the firewall or network device. Firewalls that cannot be shut down and restarted from software usually need to be powered off for more than 10 seconds or switched over to a standby unit. The check showed that the network devices were not the problem.

With the three checks above ruled out, the problem must lie with the mail server itself. Start troubleshooting the mail server:

Because internal sending and receiving on the mail server was normal, log in to the mail server directly and check the other factors that could affect it.

First check the server load, including CPU, memory, disk space, I/O, and network load. Generally, CPU and memory are the main factors that affect Exchange, followed by disk space and I/O. The check showed that disk space was low (less than 5% free, although about 3 GB was still available). Because of my lack of experience I could not judge the possible impact, and since internal mail was working normally I did not give it priority. In the end this turned out to be the cause.
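A free-space percentage is easy to misjudge by eye, as happened here. The following Python sketch shows one way to make the check explicit. The 10% default threshold is an assumption chosen for illustration, not a value from Exchange or from this incident.

```python
import shutil


def free_percent(total_bytes, free_bytes):
    """Free space as a percentage of the volume size."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * free_bytes / total_bytes


def check_volume(path, min_free_percent=10.0):
    """Return (free_percent, ok) for the volume that contains `path`."""
    usage = shutil.disk_usage(path)
    pct = free_percent(usage.total, usage.free)
    return pct, pct >= min_free_percent
```

On the mail server this would be called as check_volume("C:\\"). For a hypothetical 64 GiB system drive with 3 GiB free, free_percent gives about 4.7%, which is below the assumed threshold even though 3 GiB sounds like plenty.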

Second, check the server's system logs. Whether to check the logs first or the load first is just a matter of habit. System logs generally give administrators enough information. Although the Windows Event Viewer is not particularly convenient to use, Exchange is conscientious about logging and generally records events large and small.

In addition to the system logs, Exchange provides other diagnostic tools, such as Queue Viewer. Queue Viewer is meant for troubleshooting mail flow problems, and it also shows hints about why messages cannot be delivered.

After looking at the system logs and Queue Viewer, I found that the problem was caused by insufficient resources. The system gave two obvious hints:

1. Queue Viewer showed the last error as "452 4.3.1 Insufficient system resources". A Google search showed that this usually means either insufficient disk space or insufficient memory.

2. Event Viewer showed the following events from the source "MSExchangeTransport":

(1) Warning: resource pressure has increased from normal to medium.

(2) Error: the Microsoft Exchange Transport service is rejecting message submissions because the available disk space has dropped below the configured threshold.
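The error string in Queue Viewer is a standard SMTP reply, so it can be read without a search engine. The 4yz reply class means a temporary failure, and in the enhanced status code scheme of RFC 3463 the code X.3.1 stands for "mail system full". The sketch below parses such a line; the class and function names are made up for this illustration.

```python
import re
from typing import NamedTuple, Optional


class SmtpReply(NamedTuple):
    code: int                 # three-digit SMTP reply code, e.g. 452
    enhanced: Optional[str]   # enhanced status code, e.g. "4.3.1", if present
    text: str                 # human-readable remainder of the line

    @property
    def is_transient(self):
        """4yz replies are temporary failures; the sender should retry later."""
        return 400 <= self.code < 500


_REPLY = re.compile(r"^(\d{3})[ -](?:(\d\.\d{1,3}\.\d{1,3})\s+)?(.*)$")


def parse_reply(line):
    """Parse one SMTP reply line into an SmtpReply."""
    match = _REPLY.match(line.strip())
    if match is None:
        raise ValueError(f"not an SMTP reply: {line!r}")
    return SmtpReply(int(match.group(1)), match.group(2), match.group(3))
```

Parsing "452 4.3.1 Insufficient system resources" yields code 452, enhanced code "4.3.1", and is_transient set to True. That matches what was observed: external senders were told to retry later, so their mail was delayed rather than bounced immediately.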

Fault identification and repair

It was confirmed that a disk space problem had triggered Exchange's "back pressure" protection mechanism. The problem was solved by freeing up disk space. After it was resolved, my superior and the relevant staff were notified.
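How much space has to be freed can be estimated with simple arithmetic. The sketch below assumes that the goal is to bring used space down to a target percentage of the volume, for example the 93% "normal" value that the event log excerpt in this article lists for the queue database logging path. Whether Exchange measures the percentage in exactly this way is an assumption here, and the drive size used in the note below is hypothetical.

```python
def bytes_to_free(total_bytes, free_bytes, target_used_percent):
    """Bytes that must be freed so that used space is at most the target percentage."""
    used = total_bytes - free_bytes
    # Integer math; rounding the allowance down keeps the estimate conservative.
    allowed = total_bytes * target_used_percent // 100
    return max(0, used - allowed)
```

For a hypothetical 64 GiB drive with 3 GiB free and a 93% target, the function returns roughly 1.5 GiB. In practice it is worth freeing noticeably more than the minimum so that the server does not fall back into back pressure the next day.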

Background knowledge

About "back pressure". The following is an excerpt from the Microsoft documentation library article "Understanding Back Pressure".

Back pressure is a system resource monitoring feature of the Microsoft Exchange Transport service that exists on Microsoft Exchange Server 2010 Hub Transport and Edge Transport servers. Exchange transport can detect when vital resources, such as available hard disk space and memory, are under pressure, and take action to try to prevent the service from becoming unavailable.

Back pressure prevents system resources from being completely overwhelmed, and Exchange tries to deliver the existing messages. When utilization of the system resource returns to normal levels, the Exchange server gradually resumes normal operation.

In Exchange Server 2007, when a Hub Transport or Edge Transport server is under resource pressure, it rejects incoming connections. In Exchange 2010, incoming connections are accepted, but incoming messages over those connections are either accepted at a slower rate or rejected. When an SMTP host tries to connect to a Hub Transport or Edge Transport server that is under back pressure, the connection succeeds, but when the host issues the MAIL FROM command to submit a message, Exchange either delays acknowledging the command or rejects it, depending on the resource under pressure.

The following is an excerpt from Event Viewer:

The Microsoft Exchange Transport service is rejecting message submissions because the available disk space has dropped below the configured threshold.

The following resources are under pressure:

Queue database logging path ("C:\Program Files\Microsoft\Exchange Server\V14\TransportRoles\data\Queue\") = 95% [Medium] [Normal=93% Medium=95% High=97%]

Back pressure caused the following components to be disabled:

Inbound mail submission from Hub Transport servers

Inbound mail submission from the Internet

Mail submission from the Pickup directory

Mail submission from the Replay directory

Mail submission from the Mailbox server

Mail delivery to remote domains

Loading of e-mail from the queue database (if available)

The following resources are in a normal state:

Queue database path ("C:\Program Files\Microsoft\Exchange Server\V14\TransportRoles\data\Queue\mail.que") = 95% [Normal] [Normal=95% Medium=97% High=99%]

Version buckets = 0 [Normal] [Normal=80 Medium=120 High=200]

Private bytes = 0% [Normal] [Normal=71% Medium=73% High=75%]

Physical memory load = 11% [limit is 94% to start dehydrating messages.]

Batch Point = 0 [Normal] [Normal=1000 Medium=2000 High=4000]

Submission queue = 0 [Normal] [Normal=1000 Medium=2000 High=4000]
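Each resource line in the excerpt carries its current value, its current level, and three thresholds. The sketch below is a deliberately simplified reading of those triples: it reports the highest threshold the value has reached. It reproduces the levels shown in the excerpt, but it ignores any hysteresis Exchange may apply when pressure falls again, and the names are made up for this illustration.

```python
from typing import NamedTuple


class Thresholds(NamedTuple):
    normal: float   # listed in the log; not used by this simplified classifier
    medium: float
    high: float


def pressure_level(value, thresholds):
    """Simplified reading: the level is the highest threshold the value has reached."""
    if value >= thresholds.high:
        return "High"
    if value >= thresholds.medium:
        return "Medium"
    return "Normal"
```

With the logging path values, pressure_level(95, Thresholds(93, 95, 97)) gives "Medium", while the same 95% measured against the queue database thresholds (95, 97, 99) gives "Normal". The two paths sit on the same drive but are judged against different limits, which is why only one of them appears in the list of resources under pressure.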

Note: Linux has similar protection mechanisms, such as the OOM killer and the 5% of disk space that ext file systems reserve by default. When you learn a mechanism like this one, draw analogies to related cases so that the knowledge carries over.

Fault reflection and summary

When you run into faults or problems, keep a cool head; don't panic and don't throw yourself into confusion. When they hit a problem, many operators or administrators first think about how to fix it and then, after trying various fixes, roll back just to save time. That is the wrong approach. A qualified operations engineer should find out the full context and the root cause of the problem. When troubleshooting, the first thought should be to check the logs. The investigation should be as comprehensive as possible and should not overlook any detail that might cause problems.

Deployments must follow standards and be standardized. In this incident, the Exchange server holds three databases, one of which was stored on drive C rather than on another drive. Over time this database grew to occupy a lot of disk space, which left too little free space and triggered the back pressure mechanism. According to standard practice, this database should be moved from drive C to another drive with more capacity, and the required capacity should be calculated at the start of the deployment.

Pay attention to alerts. This server has Zabbix monitoring and alerting configured, and Zabbix had detected the problem and sent an alert. The failure happened because the alert was not handled in time.

Even if you inherited the system, you still have to fix it. Because this mail server was deployed by a former operations colleague, some problems had been shelved and left unsolved (partly for technical reasons). In the long run they need to be fixed, even at some cost.

Keep learning. Even if a system sometimes lies outside your own specialty, the company's core IT systems, such as the mail server, deserve in-depth study. Only by truly understanding them can we solve problems quickly when they arise.

Summarize the experience and learn the lessons after each failure. Record the knowledge and experience so that it accumulates. For example, after writing this summary, the next time you encounter this failure you may think immediately that insufficient disk space can trigger back pressure in Exchange and prevent external mail from being received.
