Hundreds of thousands of connections but only a few megabytes of traffic, frightening the "baby" to death


2026-02-07 Update From: SLTechnology News&Howtos


One site was upgraded: nginx was replaced with ATS, and the front nginx load-balancing layer was removed at the same time. After the upgrade the service did not behave normally. We were staring at hundreds of thousands of connections with almost no traffic, tried all kinds of troubleshooting, and spent a thrilling half hour with our hearts in our throats. The business has a good fallback mechanism, so when the service is abnormal users can go straight back to the origin, but for our traffic graph it definitely meant a sawtooth. This is a review of the investigation.

I will skip the upgrade process itself. After the upgrade, I briefly checked the service configuration, health heartbeat, disk settings, and local back-to-origin DNS, and found no problems. The next step was to cut traffic over. The front-end DNS distributes requests by domain name hash. Traffic quickly reached 100M and kept rising, and the number of connections reached tens of thousands (the domains are of poor quality and many are dynamic, so this was considered normal). After a few minutes, however, traffic dropped sharply to a few M. The number of connections did not drop but kept rising, and memory was almost full.

(Number of current connections)

(Monitoring chart refreshed every second: incoming and outgoing traffic, CPU, memory, and TCP retransmissions. Memory keeps filling up and TCP retransmissions keep climbing.)
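A rough way to cross-check the connection count shown in the charts is to tally sockets by TCP state from /proc/net/tcp. The sketch below is not part of the original investigation: the function name count_tcp_states and the sample text are made up for illustration, and it only covers IPv4 sockets on Linux.

```python
from collections import Counter

# Hex state codes used in the "st" column of /proc/net/tcp on Linux.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}

def count_tcp_states(proc_net_tcp_text):
    """Count sockets per TCP state from the text of /proc/net/tcp."""
    counts = Counter()
    # The first line is the column header, so skip it.
    for line in proc_net_tcp_text.splitlines()[1:]:
        fields = line.split()
        if len(fields) < 4:
            continue
        state = TCP_STATES.get(fields[3].upper(), fields[3])
        counts[state] += 1
    return counts

# Made-up sample in the /proc/net/tcp layout (illustration only).
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 00000000:0050 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 1001 1
   1: 0100007F:0050 0100007F:C350 01 00000000:00000000 00:00000000 00000000     0        0 1002 1
   2: 0100007F:0050 0100007F:C351 01 00000000:00000000 00:00000000 00000000     0        0 1003 1
   3: 0100007F:0050 0100007F:C352 08 00000000:00000000 00:00000000 00000000     0        0 1004 1
"""

if __name__ == "__main__":
    # On a real host, read open("/proc/net/tcp").read() instead of SAMPLE.
    for state, n in count_tcp_states(SAMPLE).most_common():
        print(state, n)
```

A connection count that keeps growing while traffic falls, as described above, would show up here as a rising total across repeated runs.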

My nerves tensed up immediately. I first checked whether DNS was working, because if the local back-to-origin DNS is broken, a large number of connections cannot be served. The test showed that the local back-to-origin DNS was working, so this was not a simple problem. I opened several crt windows and began monitoring:

tailf /var/log/messages | grep kernel: no errors were reported, so there should be no problem at the system level.

tailf /opt/ats/var/log/trafficserver/diags.log: there was no obvious error at first, but after a while it reported that there were too many connections and that connections were being dropped. The service was definitely abnormal, but this did not show where the fault was.

tstop: I opened it to view the overall picture. It refreshed normally, but on every refresh some data was missing: neither the memory cache capacity nor the disk cache capacity was shown. Why not? Suspecting a wrong setting, I rechecked the disk settings. In records.config the memory cache was set to half of the memory, 12G, and the storage.config settings looked fine, so I kept looking.

tsar -l 1: monitoring showed disk I/O at 0; nothing was being written to any disk. So which was it: no traffic and therefore no disk writes, or no disk writes and therefore no traffic? I first assumed that the lack of disk writes was causing the lack of traffic. There are two reasons a disk would not be written: the disk is bad, or the disk permissions are wrong. I checked immediately: the owner of every data disk was tserver, and all the raw disks looked fine.

(No problem found after viewing permissions)
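The ownership check described above can be scripted so that every disk is compared against the expected owner in one pass. This is a minimal sketch, not the commands actually used: the helper names and the device paths /dev/sdb and /dev/sdc are hypothetical, and only the owner name tserver comes from the text.

```python
import os
import pwd

def uid_of(user_name):
    """Resolve a user name to its numeric uid (raises KeyError if unknown)."""
    return pwd.getpwnam(user_name).pw_uid

def find_wrong_owner(paths, expected_uid):
    """Return the paths whose owner uid differs from expected_uid."""
    return [p for p in paths if os.stat(p).st_uid != expected_uid]

if __name__ == "__main__":
    # Hypothetical raw-disk paths; replace with the disks listed in storage.config.
    disks = ["/dev/sdb", "/dev/sdc"]
    try:
        bad = find_wrong_owner(disks, uid_of("tserver"))
        print("disks with the wrong owner:", bad)
    except (KeyError, OSError) as exc:
        # The user or a device may not exist on the machine running this sketch.
        print("cannot check:", exc)
```

An empty list corresponds to the result reported here: ownership was not the cause.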

tsar -n 1: I went on to check the historical data and found that there was traffic the moment ATS started, followed by a sudden drop, and that there was disk I/O at the beginning. I suspected the hard disks more and more but had no evidence. Then I thought of a test: do not use the hard disks at all and serve purely from memory. Unexpectedly, traffic came back and was fairly stable. The problem was finally narrowed down to the disks.

(All disks commented out)
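The "no disk writes" observation can also be cross-checked directly against the kernel counters in /proc/diskstats, independently of tsar. The sketch below assumes the standard Linux layout, in which the tenth whitespace-separated field is the number of sectors written; the function names and the sample snapshots are invented for illustration.

```python
def sectors_written(diskstats_text):
    """Map device name -> cumulative sectors written, from /proc/diskstats text."""
    result = {}
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) < 10:
            continue
        # fields[2] is the device name, fields[9] is sectors written.
        result[fields[2]] = int(fields[9])
    return result

def write_delta(before_text, after_text):
    """Sectors written per device between two snapshots of /proc/diskstats."""
    before = sectors_written(before_text)
    after = sectors_written(after_text)
    return {name: after[name] - before[name] for name in after if name in before}

# Two made-up snapshots taken some seconds apart (illustration only).
BEFORE = """\
   8       0 sda 500 0 4000 120 900 0 4000 300 0 200 420
   8      16 sdb 100 0 800 50 200 0 1600 90 0 100 140
"""
AFTER = """\
   8       0 sda 510 0 4080 125 1000 0 4800 340 0 230 465
   8      16 sdb 100 0 800 50 200 0 1600 90 0 100 140
"""

if __name__ == "__main__":
    # On a real host, read /proc/diskstats twice with a sleep in between.
    for name, delta in write_delta(BEFORE, AFTER).items():
        print(name, "sectors written:", delta)
```

A cache disk whose delta stays at zero while requests are arriving matches the symptom described in this investigation.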

I kept thinking: were all the hard disks bad? I tried adding back a single hard disk, and it still did not work. I kept thinking: why could tstop not calculate the cache size? So I listed the sizes of all disks and found that each disk at this site was nearly 2T, as shown below:

(Only one disk is 186.5G; the rest are around 2T)

I then guessed that the disks might be too large for ATS to add them, so I changed the size used on each disk to 300G and restarted ATS. Problem solved. What a relief after a thrilling half hour.
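Listing disk sizes, the step that exposed the difference here, can be done by parsing /proc/partitions, where the #blocks column is in 1 KiB units. This is an illustrative sketch only: the function name and the sample device names and block counts are assumptions chosen to resemble the sizes mentioned in the text.

```python
def disk_sizes_gib(partitions_text):
    """Map device name -> size in GiB, from the text of /proc/partitions."""
    sizes = {}
    for line in partitions_text.splitlines():
        fields = line.split()
        # Data rows have four columns: major, minor, #blocks (KiB), name.
        if len(fields) != 4 or not fields[2].isdigit():
            continue
        sizes[fields[3]] = int(fields[2]) / (1024 * 1024)
    return sizes

# Made-up sample in the /proc/partitions layout (illustration only).
SAMPLE = """\
major minor  #blocks  name

   8        0  195559424 sda
   8       16 1953514584 sdb
   8       32 1953514584 sdc
"""

if __name__ == "__main__":
    # On a real host, read open("/proc/partitions").read() instead of SAMPLE.
    for name, gib in sorted(disk_sizes_gib(SAMPLE).items()):
        print(f"{name}: {gib:.1f} GiB")
```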

(Disk size configuration changed to specify the size directly)

(Business returns to normal after restart)
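The fix amounts to writing an explicit size after each raw device in storage.config. The sketch below generates such lines under stated assumptions: the device paths are hypothetical, the one-entry-per-line "path size" layout is inferred from the fix described above, and whether a suffix such as G is accepted (rather than a size in bytes) may depend on the ATS version, so check the documentation for the version in use.

```python
def storage_config_lines(devices, size):
    """Build storage.config text with one 'device size' entry per line."""
    return "\n".join(f"{dev} {size}" for dev in devices) + "\n"

if __name__ == "__main__":
    # Hypothetical raw-disk paths; 300G is the size used in the fix above.
    print(storage_config_lines(["/dev/sdb", "/dev/sdc"], "300G"), end="")
```

Generating the entries from a single list keeps every disk on the same size, which makes it easier to rule size out if the symptom ever returns.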

I have set up a personal site for original operations and maintenance articles, Internet Cafe Society (www.net-add.com). New blog posts will be published there; you are welcome to visit.
