This article walks through a real server fault case. The editor uses an actual incident to demonstrate the troubleshooting process, which is simple, fast, and practical. Hopefully this "server fault case analysis" helps you solve similar problems.
1. Something broke
There is no way around it: in the IT business you face failures every day. Everyone is one of the legendary firefighters, putting out fires everywhere. This time, however, the fault was so severe that the hosts became completely unresponsive.
Fortunately, the monitoring system left some evidence.
The evidence showed that the machines' CPU usage, memory usage, and file handle counts kept climbing as business traffic grew, rising until the monitoring system could no longer collect any data at all.
What made it fatal was that many Java processes were deployed on these hosts. The reason was simple: to save costs, applications were mixed together on the same machines. When a host shows an overall anomaly, it is hard to identify the culprit.
Since remote login was also dead, the exasperated operations staff could only reboot the machines and then restart the applications. After a long wait, all the processes came back alive, but only a moment later the hosts died again.
It was maddening: the business stayed down, and everyone grew impatient. After several failed attempts, the operations staff gave up on restarts and launched the emergency plan: roll back!
There had been many recent releases, and some developers had even deployed to production privately, so the operations team was stuck: roll back to which version? Fortunately, someone remembered the find command: locate all the recently updated jar packages and roll them back.
find /apps/deploy -mtime -3 | grep jar$
If nobody had known the find command, it would have been a real disaster. Luckily someone did.
More than a dozen jar packages were rolled back. Luckily there were no database schema changes, and the system finally returned to normal.
2. Finding the cause
There was no other way: check the logs and conduct a code review.
The code review covered changes from the last one to two weeks, because some features take a while to reach production.
Looking at a screen full of commit messages that just said "OK", the technical manager's face turned green.
Xjjdog once said, "80% of programmers cannot write a proper commit message." Judging by this screen, it looked more like 100%.
Everyone quietly endured the pain of combing through the historical changes. After much effort, some suspicious code was finally dug out of the mountain of legacy. A CxO personally set up a chat group, and everyone threw the code that might be at fault into it.
"the system service was interrupted for nearly an hour, and the impact was very bad," CxO said. "be sure to solve the problem completely. Investors are very concerned about this problem."
OK, OK, OK. With the help of DingTalk, everyone's replies fell into line.
3. Parameters of the thread pool
There was a lot of code, and the suspicious pieces were debated for a long time. They included code using parallel streams, business logic embedded in lambda expressions, and, above all, code that used thread pools.
In the end, the team decided to go through the thread pool code again. One snippet looked like this:
RejectedExecutionHandler handler = new ThreadPoolExecutor.DiscardOldestPolicy();
ThreadPoolExecutor executor = new ThreadPoolExecutor(100, 200, 60000,
        TimeUnit.MILLISECONDS, new LinkedBlockingDeque<>(10), handler);
To be fair, the parameters look decent; it even specifies a rejection policy.
Java's thread pool makes concurrent programming much easier, but it has many parameters, as shown in the code above. Let's introduce them one by one; otherwise the code cannot be reviewed properly.
corePoolSize: the number of core threads, which stay alive once created
maximumPoolSize: the maximum number of threads
keepAliveTime: how long an idle thread is kept alive
workQueue: the blocking queue
threadFactory: the factory used to create threads
handler: the rejection policy
Now let's look at how they relate to each other.
When the number of threads is below corePoolSize and a new task arrives, a new thread is created to serve it. Once the number of threads has reached corePoolSize and the blocking queue is not full, the task is placed in the queue. When the queue is full, new threads are created until the count reaches maximumPoolSize. At that point, if yet another task arrives, the rejection policy is triggered.
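To make these thresholds concrete, here is a small standalone sketch. It is not from the incident; the pool sizes, queue capacity, and task count are invented for illustration. With 2 core threads, a queue of 3, and a maximum of 4 threads, the first 7 submissions are accepted and the rest hit the rejection policy.

import java.util.concurrent.*;

public class PoolGrowthDemo {
    public static void main(String[] args) {
        // 2 core threads, max 4 threads, queue capacity 3:
        // tasks 1-2 start core threads, 3-5 wait in the queue,
        // 6-7 grow the pool to 4, and 8-9 are rejected (AbortPolicy throws).
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                2, 4, 60, TimeUnit.SECONDS,
                new LinkedBlockingDeque<>(3),
                new ThreadPoolExecutor.AbortPolicy());

        for (int i = 1; i <= 9; i++) {
            final int id = i;
            try {
                pool.execute(() -> {
                    try { Thread.sleep(2000); } catch (InterruptedException ignored) {}
                });
                System.out.println("task " + id + " accepted, pool size = "
                        + pool.getPoolSize() + ", queued = " + pool.getQueue().size());
            } catch (RejectedExecutionException e) {
                System.out.println("task " + id + " rejected");
            }
        }
        pool.shutdown();
    }
}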
Now, back to the rejection policy. The JDK ships with four built-in policies, and the default is AbortPolicy, which simply throws an exception. Here are the others.
DiscardPolicy is more radical than AbortPolicy: it silently drops the task without throwing any exception.
With CallerRunsPolicy, the task is executed by the calling thread. In a web application, for example, when the pool is saturated, new tasks run in the Tomcat thread instead. This can relieve some execution pressure, but more often it simply blocks the calling thread.
DiscardOldestPolicy discards the oldest task at the head of the queue and then retries submitting the new task.
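As a quick illustration of the risk discussed next, here is a made-up sketch (not the production code; the pool sizes and task counts are invented) showing how DiscardOldestPolicy silently loses queued work under load.

import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class DiscardOldestDemo {
    public static void main(String[] args) throws InterruptedException {
        AtomicInteger completed = new AtomicInteger();
        // Single worker, tiny queue, DiscardOldestPolicy: older queued tasks
        // are silently replaced by newer ones, so their callers never get a result.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0, TimeUnit.SECONDS,
                new LinkedBlockingDeque<>(2),
                new ThreadPoolExecutor.DiscardOldestPolicy());

        for (int i = 1; i <= 10; i++) {
            pool.execute(() -> {
                try { Thread.sleep(100); } catch (InterruptedException ignored) {}
                completed.incrementAndGet();
            });
        }
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        // With AbortPolicy the extra submissions would have failed loudly;
        // here most of them vanish without a trace.
        System.out.println("submitted 10, completed " + completed.get());
    }
}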
The thread pool code itself was recent, the parameter settings looked normal, and nothing seemed seriously wrong. The only apparent risk was the DiscardOldestPolicy rejection policy: when too many tasks arrive, it silently drops queued tasks, and the corresponding requests time out.
Of course, that risk could not be ignored, and to be honest, it was the most plausible culprit found so far.
"Change DiscardOldestPolicy to the default AbortPolicy, repackage it, and try it online," the technical guru said in the group.
4. What's the problem?
But after the service went out through a gray release, the host died again shortly afterward. So this code really was the cause; the question was why.
A pool size of 100 to 200 threads is nothing outrageous, and a blocking queue capacity of 10 should not cause trouble either. If you told me this thread pool alone brought down the host, I would not believe it.
But according to the business team's feedback, with this code the host dies, and without it everything is fine. The technical gurus scratched their heads and still could not figure it out.
Finally, someone could not help checking out the business code to debug it locally.
The moment he opened it in IDEA, the fog lifted and everything became clear. He finally understood why this code was a problem.
The thread pool was being created inside the method!
Every incoming request created a brand-new thread pool, until the system could no longer allocate resources.
That takes some nerve.
Everyone was focused on how the thread pool parameters were set, but nobody ever questioned where the code was placed.
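To illustrate, here is a reconstruction rather than the actual business code; the class and method names are invented. It contrasts the anti-pattern with a fix that creates the pool once and reuses it.

import java.util.concurrent.*;

// Hypothetical handler class, reconstructed for illustration only.
public class OrderHandler {

    // Anti-pattern: a new pool (and up to 200 threads) per request.
    // Nothing ever shuts it down, so threads and memory leak until the host dies.
    public void handleBroken(Runnable task) {
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
                100, 200, 60000, TimeUnit.MILLISECONDS,
                new LinkedBlockingDeque<>(10),
                new ThreadPoolExecutor.AbortPolicy());
        executor.execute(task);
    }

    // Fix: create the pool once, as a shared field, and reuse it for every request.
    private static final ThreadPoolExecutor SHARED_EXECUTOR = new ThreadPoolExecutor(
            100, 200, 60000, TimeUnit.MILLISECONDS,
            new LinkedBlockingDeque<>(10),
            new ThreadPoolExecutor.AbortPolicy());

    public void handleFixed(Runnable task) {
        SHARED_EXECUTOR.execute(task);
    }
}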
This concludes "Server Fault Case Analysis". Thank you for reading. For more industry knowledge, follow the industry information channel, where the editor shares new topics every day.