This article introduces Java online troubleshooting techniques. Many people run into exactly these situations in real incidents, so the following walkthrough shows how to handle them; read it carefully and hopefully you will get something out of it.
Online failures mainly involve CPU, disk, memory, and network problems, and most incidents touch more than one of these areas, so it is best to work through the four in turn.
At the same time, tools such as jstack and jmap are not tied to a single class of problem: the basic routine is df, free, and top first, then jstack and jmap as needed, analyzing each specific problem on its own terms.
CPU
Generally we troubleshoot CPU problems first, since CPU anomalies are often the easiest to locate. Common causes include business-logic problems (infinite loops), frequent GC, and excessive context switching.
The most common case is caused by business logic (or framework logic), and jstack can be used to analyze the corresponding thread stacks.
① Use jstack to analyze CPU problems
First use the ps command to find the pid of the target process (if there are several candidate processes, run top first to see which one uses the most CPU).
Then use top -H -p pid to find the threads with the highest CPU usage.
Next, convert the pid of the busiest thread to hexadecimal with printf '%x\n' pid to get the nid.
Then find the corresponding stack trace directly in the jstack output: jstack pid | grep 'nid' -C5 --color.
Here we have found the stack trace whose nid is 0x42; after that, we just need to analyze it carefully.
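Putting the steps together, the workflow looks roughly like this (a minimal sketch; the process name my-app, pid 23344, and thread id 12345 are placeholders):
# find the Java process
ps -ef | grep my-app
# list its threads sorted by CPU usage
top -H -p 23344
# convert the hottest thread id to hex, e.g. 12345 -> 3039
printf '%x\n' 12345
# pull the matching stack trace out of the jstack output
jstack 23344 | grep '0x3039' -C5 --color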
Of course, more often we analyze the whole jstack file, and we usually pay particular attention to the WAITING and TIMED_WAITING sections, not to mention BLOCKED.
We can use the command cat jstack.log | grep "java.lang.Thread.State" | sort -nr | uniq -c to get an overall picture of the thread states in the jstack output; if there are a lot of WAITING threads and the like, there is probably a problem.
② Frequent GC
Of course, we still use jstack to analyze such problems, but sometimes we can first confirm whether GC is too frequent.
Use the jstat -gc pid 1000 command to observe the GC generation statistics, where 1000 is the sampling interval (ms); S0C/S1C, S0U/S1U, EC/EU, OC/OU, and MC/MU are the capacity and usage of the two Survivor spaces, the Eden space, the old generation, and the metaspace, respectively.
YGC/YGCT, FGC/FGCT, and GCT are the counts and cumulative times of young GC and full GC, plus the total GC time.
If you see that GC is fairly frequent, do further analysis of the GC itself; you can refer to the GC section below.
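A minimal usage sketch (the pid 23344, interval, and sample count are placeholders):
# sample the GC generation sizes and counters every 1000 ms, 10 times
jstat -gc 23344 1000 10
# or watch utilization percentages plus GC counts and times
jstat -gcutil 23344 1000 10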
③ Context switching
For frequent context switching, we can use the vmstat command to take a look.
The cs (context switch) column is the number of context switches. If we want to monitor a specific pid, we can use the pidstat -w pid command, where cswch and nvcswch are voluntary and involuntary context switches.
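A quick sketch of both commands (the interval/count and the pid 23344 are placeholders):
# system-wide statistics every second, 5 samples; watch the cs column
vmstat 1 5
# per-process voluntary (cswch/s) and involuntary (nvcswch/s) switches
pidstat -w -p 23344 1 5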
Disk
Disk problems are as fundamental as CPU problems. First, for disk space, we can simply use df -hl to check the file system status.
More often, though, disk problems are performance problems, which we can analyze with iostat -d -k -x.
The last column, %util, shows how heavily each disk is utilized, while rrqm/s and wrqm/s are the merged read and write requests per second; together these generally help locate which disk has the problem.
In addition, we also need to know which process is doing the reading and writing. Generally developers have a rough idea; otherwise, use the iotop command to locate the source of the file I/O.
What iotop gives us is a tid, which we want to convert to a pid; we can find the pid via readlink: readlink -f /proc/*/task/tid/../..
Once you have the pid, you can see the process's read/write statistics with cat /proc/pid/io.
We can also use the lsof command to see exactly which files the process is reading and writing: lsof -p pid. The whole sequence is sketched below.
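A minimal sketch of the disk checks above chained together (the pid 23344 and tid 25000 are placeholders):
# free space per file system
df -hl
# extended per-device I/O statistics; watch the %util column
iostat -d -k -x
# which threads are generating I/O (usually needs root)
iotop
# map a thread id back to its owning process id
readlink -f /proc/*/task/25000/../..
# per-process read/write counters and the open file handles
cat /proc/23344/io
lsof -p 23344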
Memory
Troubleshooting memory problems is more troublesome than CPU, and there are more scenarios, mainly OOM, GC problems, and off-heap memory.
Generally speaking, we will first check the memory with the free command:
In-heap memory
Most memory problems are in-heap memory problems, which on the surface mainly show up as OOM and StackOverflow.
① OOM
OOM, caused by insufficient memory in the JVM, can be roughly divided into the following categories:
Exception in thread "main" java.lang.OutOfMemoryError: unable to create new native thread
This means there is not enough memory to allocate a Java stack for a new thread. It is basically caused by a problem in thread-pool code, such as forgetting to call shutdown, so first look for the problem at the code level, using jstack or jmap.
If the code looks fine, at the JVM level you can reduce the size of a single thread stack by specifying -Xss.
Alternatively, at the system level, you can raise the OS thread limits by modifying nofile and nproc in /etc/security/limits.conf, as sketched below.
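A sketch of what checking and raising these limits might look like (the user name appuser and the values are placeholders, not recommendations):
# current limits for the user running the JVM
ulimit -u   # max user processes (nproc)
ulimit -n   # max open files (nofile)
# example entries in /etc/security/limits.conf
appuser soft nproc  8192
appuser hard nproc  8192
appuser soft nofile 65536
appuser hard nofile 65536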
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
This means that heap usage has reached the maximum set by -Xmx, and it is probably the most common OOM error.
The solution is still to look in the code first: suspect a memory leak and locate it with jstack and jmap. If everything looks fine, expand the heap by adjusting the value of -Xmx.
Caused by: java.lang.OutOfMemoryError: Metaspace
This means that metaspace usage has reached the maximum set by -XX:MaxMetaspaceSize. The troubleshooting approach is the same as above, and the limit can be adjusted via -XX:MaxMetaspaceSize (for the permanent generation before Java 8, the corresponding flag is -XX:MaxPermSize).
② Stack Overflow
Stack memory overflow; this one is also seen fairly often.
Exception in thread "main" java.lang.StackOverflowError
This indicates that the memory required by the thread stack exceeds the -Xss value. Again, check the code first; the limit can be adjusted via -Xss, but setting it too large may in turn cause OOM.
③ Use jmap to locate code-level memory leaks
For the code-level troubleshooting of OOM and StackOverflow above, we generally use jmap to export a heap dump: jmap -dump:format=b,file=filename pid.
Import the dump file into mat (Eclipse Memory Analyzer Tool) for analysis. For memory-leak problems you can usually just choose Leak Suspects, and mat presents its leak suspicions.
You can also select Top Consumers to view the largest objects, and analyze thread-related problems through the thread overview.
Or open the Histogram class overview and dig in yourself; there are plenty of mat tutorials to search for.
In daily development, memory leaks in code are quite common and fairly hidden, so developers need to pay attention to the details.
For example: creating new objects on every request, leading to massive repeated object creation; opening file streams without closing them properly; improperly triggering GC by hand; unreasonable ByteBuffer cache allocation. All of these can lead to code-level OOM.
On the other hand, we can specify -XX:+HeapDumpOnOutOfMemoryError in the startup parameters so that a dump file is saved automatically when OOM occurs.
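A sketch of the dump commands and flag mentioned above (the pid 23344, file paths, and app.jar are placeholders):
# take a heap dump of a running process for mat
jmap -dump:format=b,file=/tmp/heap.hprof 23344
# or let the JVM dump automatically when OOM occurs
java -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/heap.hprof -jar app.jar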
④ GC issues and threads
GC problems affect memory as well as CPU, and the troubleshooting approach is the same: generally use jstat to check the generation statistics, for example whether the youngGC or fullGC count is too high, or whether metrics such as EU and OU keep growing abnormally.
Too many threads that are not released in time can also cause OOM, mostly the 'unable to create new native thread' case described earlier.
Besides analyzing the dump file in detail with jstack, we usually first look at the overall thread count, for example via pstree -p pid | wc -l.
Or look at /proc/pid/task directly; the number of entries there is the number of threads.
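A quick sketch of both ways to count threads (the pid 23344 is a placeholder):
# count threads via the process tree
pstree -p 23344 | wc -l
# or count the entries under /proc/<pid>/task
ls /proc/23344/task | wc -l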
Off-heap memory
It is unfortunate if you run into an off-heap memory overflow. First of all, an off-heap overflow shows up as rapid growth of the physical resident memory, and the error reported, if any, is not fixed.
If it is caused by the use of Netty, an OutOfDirectMemoryError may appear in the error log; if DirectByteBuffer is used directly, it reports OutOfMemoryError: Direct buffer memory.
Off-heap memory overflow is often related to the use of NIO. We generally first use pmap to check the memory segments occupied by the process: pmap -x pid | sort -rn -k3 | head -30, i.e. the top 30 segments sorted by RSS in descending order.
You can run the command again after a while and compare to see the memory growth, or compare the suspicious memory segments against a normal machine.
If we find a suspicious memory segment, we analyze it with gdb: gdb --batch --pid {pid} -ex "dump memory filename.dump {memory start address} {memory start address + block size}".
After obtaining the dump file, you can view it with hexdump: hexdump -C filename | less, although most of what you see will be binary gibberish.
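A sketch of the pmap / gdb / hexdump sequence above (the pid 23344 and the addresses are placeholders standing in for values read from the pmap output):
# top 30 memory segments by RSS
pmap -x 23344 | sort -rn -k3 | head -30
# dump a suspicious segment (start and end addresses taken from pmap)
gdb --batch --pid 23344 -ex "dump memory /tmp/suspect.dump 0x7f2c00000000 0x7f2c10000000"
# inspect the raw bytes
hexdump -C /tmp/suspect.dump | less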
NMT (Native Memory Tracking) is a HotSpot feature introduced in Java 7u40. Together with the jcmd command, it lets us see the detailed memory composition.
You need to add -XX:NativeMemoryTracking=summary or -XX:NativeMemoryTracking=detail to the startup parameters, which causes a slight performance loss.
Generally, for cases where off-heap memory grows slowly until it blows up, you can first establish a baseline: jcmd pid VM.native_memory baseline.
Then wait a while for the memory to grow, and run jcmd pid VM.native_memory detail.diff (or summary.diff) to get a detail- or summary-level diff against the baseline.
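A sketch of the NMT workflow described above (the pid 23344 and app.jar are placeholders; the JVM must have been started with NMT enabled):
# start the application with NMT enabled
java -XX:NativeMemoryTracking=detail -jar app.jar
# record a baseline
jcmd 23344 VM.native_memory baseline
# ...wait for the suspicious growth, then diff against the baseline
jcmd 23344 VM.native_memory detail.diff
jcmd 23344 VM.native_memory summary.diff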
As you can see, the memory breakdown reported by jcmd is very detailed, covering heap, threads, and GC (so the other memory issues mentioned above can also be analyzed with NMT). For off-heap memory we focus mainly on the growth of the Internal section; if it grows markedly, something is wrong.
At the detail level you can also see the growth of individual memory segments.
In addition, at the system level, we can use the strace command to monitor memory allocation: strace -f -e "brk,mmap,munmap" -p pid.
The memory allocation information here mainly includes the pid and the memory address.
In practice, however, the operations above still rarely pinpoint the exact problem. The key is to look at the error stack in the logs, find the suspicious objects, understand their reclamation mechanism, and then analyze those objects.
For example, memory allocated through DirectByteBuffer is only reclaimed on a full GC or an explicit System.gc() (so it is best not to use -XX:+DisableExplicitGC).
So we can track the memory held by DirectByteBuffer objects and trigger a full GC manually via jmap -histo:live pid to see whether the off-heap memory gets reclaimed.
If it is reclaimed, there is a good chance the off-heap area itself is simply too small, and it can be enlarged via -XX:MaxDirectMemorySize.
If nothing changes, use jmap to analyze the objects that cannot be GC'd and their reference relationships with DirectByteBuffer.
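A sketch of the DirectByteBuffer check described above (the pid 23344, class filter, and sizes are placeholders; note that jmap -histo:live itself forces a full GC):
# force a full GC and look at the live DirectByteBuffer instances
jmap -histo:live 23344 | grep -i directbytebuffer
# if the off-heap area really is just too small, raise the limit at startup
java -XX:MaxDirectMemorySize=512m -jar app.jar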
GC problems
In-heap memory leaks are always accompanied by GC anomalies. However, GC problems are not only memory-related; they can also cause complications such as CPU load and network problems. They are just most closely tied to memory, so GC-related issues are summarized separately here.
In the CPU chapter, we introduced the use of jstat to obtain current GC generational change information.
More often we use GC logs to troubleshoot, adding -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps to the startup parameters to enable GC logging.
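A sketch of a startup line carrying the flags above, plus -Xloggc to send the log to a file (app.jar and the log path are placeholders; these are the pre-Java 9 logging flags the article refers to):
java -verbose:gc \
     -XX:+PrintGCDetails \
     -XX:+PrintGCDateStamps \
     -XX:+PrintGCTimeStamps \
     -Xloggc:/var/log/app/gc.log \
     -jar app.jar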
The meaning of the common young GC and full GC log lines will not be repeated here. From the GC log we can roughly infer whether youngGC or fullGC is too frequent or too time-consuming, and then prescribe the right remedy.
Below we analyze against the G1 garbage collector, which we also recommend using (-XX:+UseG1GC).
① youngGC is too frequent
Frequent youngGC usually means there are many short-lived small objects. First consider whether the Eden space / young generation is too small, and see whether adjusting parameters such as -Xmn and -XX:SurvivorRatio solves it.
If the parameters look normal but the youngGC frequency is still too high, use jmap and MAT to examine a dump file further.
② youngGC takes too long
For the taking-too-long problem, look at where the time goes in the GC log. Taking the G1 log as an example, you can focus on phases such as Root Scanning, Object Copy, and Ref Proc.
If Ref Proc takes long, pay attention to reference-related objects; if Root Scanning takes long, pay attention to the number of threads and cross-generation references.
If Object Copy takes long, focus on object life cycles. Time-cost analysis also needs horizontal comparison, that is, against other projects or against normal time periods.
For example, if Root Scanning grows noticeably compared with normal periods, there are probably too many threads.
③ Full GC triggered
In G1 it is more often mixedGC that runs, but mixedGC can be investigated with the same approach as youngGC.
When a real full GC is triggered there is usually a problem: G1 degenerates to the Serial collector to finish the cleanup, and the pause time reaches the level of seconds, which more or less brings the application to its knees.
The causes of fullGC may include the following, along with some ideas for parameter tuning:
Concurrent mode failure: during the concurrent marking phase, the old generation fills up before the mixedGC can run, at which point G1 abandons the marking cycle.
In this case, you may need to increase the heap size or adjust the number of concurrent marking threads with -XX:ConcGCThreads.
Promotion failure: during GC there is not enough memory for surviving/promoted objects, so a full GC is triggered.
At this point you can increase the reserved memory percentage with -XX:G1ReservePercent, lower -XX:InitiatingHeapOccupancyPercent so marking starts earlier, and raise -XX:ConcGCThreads to increase the number of marking threads.
Large-object allocation failure: a large object cannot find suitable region space and a full GC is triggered; in this case you can increase the heap or increase -XX:G1HeapRegionSize.
The program explicitly calls System.gc(): do not call it casually.
In addition, we can configure -XX:HeapDumpPath=/xxx/dump.hprof in the startup parameters to dump fullGC-related files, and use jinfo to trigger dumps before and after the GC:
jinfo -flag +HeapDumpBeforeFullGC pid
jinfo -flag +HeapDumpAfterFullGC pid
This gives us two dump files; after comparing them, we focus mainly on the problem objects dropped by the GC to locate the issue.
Network
Network-level problems are generally the most complex: there are many scenarios, they are hard to locate, and they have become a nightmare for most developers.
Here are some examples covering the TCP layer, the application layer, and the use of tools.
① Timeouts
Most timeout errors are at the application level, so this part focuses on concepts. Timeouts can be roughly divided into connection timeouts and read/write timeouts; client frameworks that use connection pools also have connection-acquisition timeouts and idle-connection cleanup timeouts.
Read/write timeout: readTimeout/writeTimeout, called so_timeout or socketTimeout in some frameworks, refers to the timeout for reading and writing data.
Note that most timeouts here are logical timeouts; SOA timeouts also refer to read timeouts. Read/write timeouts are generally set only on the client.
Connection timeout: connectionTimeout, which usually refers to the maximum time it takes to establish a connection with the server.
On the server side, connectionTimeout means different things: in Jetty it is the idle-connection cleanup time, while in Tomcat it is the maximum lifetime of a connection.
Others: including the connection-acquisition timeout connectionAcquireTimeout and the idle-connection cleanup timeout idleConnectionTimeout, mostly used in client or server frameworks that use connection pools or queues.
When setting the various timeouts, we need to make sure that the client's timeout is, as far as possible, smaller than the server's, so that connections end normally.
In actual development, what concerns us most is the read/write timeout of an interface, and setting a reasonable interface timeout is a real problem.
If the interface timeout is set too long, it may hold on to the server's TCP connections for too long; if it is set too short, timeouts will be very frequent.
Another problem is when the server has clearly reduced the interface RT but the client still times out. The explanation is simple: the client-to-server link includes network transmission, queuing, and service processing, and each stage can take time.
② TCP queue overflow
TCP queue overflow is a relatively low-level error that can lead to more superficial errors such as timeouts and RSTs. The error is therefore also better hidden, so we discuss it separately.
There are two queues involved:
syns queue (half-connection queue)
accept queue (full-connection queue)
During the three-way handshake, after the server receives the client's SYN, it puts the entry in the syns queue and replies with SYN+ACK; the server then waits for the client's ACK.
When the ACK arrives, if the accept queue is not full, the server moves the entry from the syns queue to the accept queue; otherwise it acts according to tcp_abort_on_overflow.
tcp_abort_on_overflow = 0 means that if the accept queue is full at step three of the handshake, the server simply drops the ACK sent by the client.
tcp_abort_on_overflow = 1 means that if the full-connection queue is full at step three, the server sends an RST packet to the client, aborting both the handshake and the connection; this is why you may see a lot of connection reset / connection reset by peer in the logs.
So how can we quickly locate TCP queue overflows in actual development?
With the netstat command, run netstat -s | egrep "listen|LISTEN".
In the output, 'overflowed' is the number of full-connection queue overflows, and 'sockets dropped' is the number of half-connection queue overflows.
With the ss command, run ss -lnt.
In the ss output, for a port in the LISTEN state, Send-Q is the maximum size of that port's full-connection queue, and Recv-Q is how much of the full-connection queue is currently in use.
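A sketch of the two checks above (the output lines in comments are illustrative, not taken from a real host):
# cumulative overflow counters since boot
netstat -s | egrep "listen|LISTEN"
#   e.g. "667399 times the listen queue of a socket overflowed"
#        "667399 SYNs to LISTEN sockets dropped"
# per-listener queue sizes: Recv-Q = currently used, Send-Q = maximum
ss -lnt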
Let's also look at how the full-connection and half-connection queue sizes are set: the size of the full-connection queue is min(backlog, somaxconn).
backlog is passed in when the socket is created, and somaxconn is an OS-level kernel parameter. The size of the half-connection queue is max(64, /proc/sys/net/ipv4/tcp_max_syn_backlog).
In daily development we often use a servlet container as the server, so we sometimes also need to pay attention to the container's connection queue size.
The backlog is called acceptCount in Tomcat and acceptQueueSize in Jetty.
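A sketch of where these knobs live (the values and the Tomcat connector line are illustrative placeholders, not recommendations):
# OS-level cap on the full-connection queue
cat /proc/sys/net/core/somaxconn
# input to the half-connection (SYN) queue size
cat /proc/sys/net/ipv4/tcp_max_syn_backlog
# example Tomcat connector in server.xml; acceptCount is the backlog
#   <Connector port="8080" protocol="HTTP/1.1" acceptCount="100" connectionTimeout="20000" ... />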
③ RST exception
An RST packet means reset and is used to close abnormal connections; unlike the normal four-way close, it indicates an abnormal shutdown.
In actual development, we often see connection reset / connection reset by peer errors, which are caused by RST packets.
Port does not exist: if a SYN connection request is sent to a port that does not exist, the server, finding it has no such port, directly returns an RST message to abort the connection.
Actively terminating a connection with RST instead of FIN: normally a connection is closed via FIN messages, but an RST can be used instead of FIN to terminate the connection directly.
In actual development this can be controlled via the SO_LINGER value; it is usually deliberate, skipping TIME_WAIT to improve interaction efficiency, but it should be used with caution.
One side of the client or server encounters an exception and sends an RST to the peer to close the connection: the TCP queue overflow case above, where an RST packet is sent, also falls into this category.
This is usually because one side can no longer handle the connection properly for some reason (for example, the program crashed or a queue is full) and so tells the other side to close the connection.
The received TCP segment does not belong to any known TCP connection: for example, one machine loses a TCP segment on a bad network, the other side closes the connection, and then the missing segment arrives much later; since the corresponding TCP connection no longer exists, an RST packet is sent directly so that a new connection can be opened.
One side never receives the other side's acknowledgement and sends an RST message after a certain timeout or number of retransmissions.
Most of these cases are related to the network environment; a poor network may lead to more RST messages.
As mentioned earlier, many RST messages will cause program errors: a read on a closed connection reports connection reset, and a write on a closed connection reports connection reset by peer.
You may also see broken pipe errors, a pipe-level error meaning a read/write on a closed pipe; it often happens when the program keeps reading or writing a datagram after an RST has been received and a connection reset error already reported, as described in the comments in the glibc source.
How do we confirm the presence of RST packets when troubleshooting? Naturally, by capturing packets with the tcpdump command and doing a simple analysis with wireshark.
tcpdump -i en0 tcp -w xxx.cap, where en0 is the network interface being monitored.
Then open the capture with wireshark; the packets shown in red are the RST packets.
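A sketch of the capture commands above (en0, xxx.cap, and the host filter are placeholders; on Linux the interface is more likely something like eth0):
# capture TCP traffic on one interface into a file for wireshark
tcpdump -i en0 tcp -w xxx.cap
# or filter for RST packets from a specific peer directly on the command line
tcpdump -i en0 -nn 'tcp[tcpflags] & tcp-rst != 0 and host 10.0.0.1'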
④ TIME_WAIT and CLOSE_WAIT
I'm sure we all know what TIME_WAIT and CLOSE_WAIT mean.
When online, we can directly check the number of connections in time_wait and close_wait with netstat -n | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'.
It is faster with the ss command: ss -ant | awk '{++S[$1]} END {for(a in S) print a, S[a]}'.
TIME_WAIT: time_wait exists to prevent lost or delayed packets from being picked up by a later connection, and to ensure the connection is closed normally within the 2MSL window.
Its existence actually greatly reduces the number of RST packets. Excessive time_wait tends to appear in scenarios with frequent short-lived connections.
In this case, some kernel parameters can be tuned on the server side:
# Enable reuse: allow TIME-WAIT sockets to be reused for new TCP connections. Default is 0 (off).
net.ipv4.tcp_tw_reuse = 1
# Enable fast recycling of TIME-WAIT sockets in TCP connections. Default is 0 (off).
net.ipv4.tcp_tw_recycle = 1
Of course, do not forget that in a NAT environment tcp_tw_recycle causes packets to be rejected because of timestamp problems. Another option is to lower tcp_max_tw_buckets: any time_wait sockets beyond this number are killed, but this also produces 'time wait bucket table overflow' errors in the log.
CLOSE_WAIT: close_wait is usually due to a problem in how the application is written: after ACKing the peer's FIN, it never sends its own FIN.
The probability of close_wait problems is even higher than time_wait, and the consequences are more serious. It is usually because something is blocked and the connection is never closed properly, gradually exhausting all the threads.
To locate this kind of problem, it is best to analyze the thread stacks with jstack, as detailed in the section above; here is just one example.
A developer said that CLOSE_WAIT kept increasing after the application went live, until the service died. A jstack dump showed the suspicious finding that most threads were stuck in the countDownLatch.await method.
After talking with the developer, we learned that multithreading was used but exceptions were not caught; once that was fixed, the exception turned out to be the humble ClassNotFoundException that often appears after an SDK upgrade.
This concludes the overview of Java online troubleshooting techniques. Thanks for reading.