Background
"offline no problem", "there can be no problem with the code is the reason for the system", "can you debug remotely online"
Online problems differ from bugs found during development: they are tied to the runtime environment, load, concurrency, and specific business scenarios. For online problems, using the tools available in the production environment to collect the necessary information is critical to locating the cause.
The bugs and resource bottlenecks behind a problem are rarely visible directly, so the root cause has to be inferred from resource usage data, logs, and other information. Locating difficult problems usually requires combining several approaches to trace back to the source.
In this article I have sorted out the tools I use and share some examples.
1. Common problems
1.1 Availability
Here are several common situations that lead to service unavailability:
A) 502 Bad Gateway
For applications, especially HTTP-based ones, nothing is more serious than "502 Bad Gateway": it means the back-end service is completely unavailable.
Insufficient resources 1: garbage collection. With CMS, a memory leak or insufficient memory can cause long application pauses.
Insufficient resources 2: not enough server threads. Common web servers such as Tomcat and Jetty are configured with a maximum number of worker threads.
Insufficient resources 3: insufficient database resources. Databases are usually accessed through a connection pool; a low maxConnection setting and too many slow queries will block the worker threads in the web server.
Insufficient resources 4: IO bottlenecks. In the online environment IO is shared, especially in mixed-deployment environments (CRM fortunately does not have this problem, but many agents do). Our commonly used log4j also treats each log file as an exclusive resource: a thread has to acquire a lock before it can write to the log.
......
All kinds of OOM
B) Socket exception
Common: Connection reset by peer, Broken pipe, EOFException
Network problems: may occur with cross-ISP or cross-datacenter access
Program bugs: the socket is closed abnormally
1.2 Average response time
The most intuitive sign that the system has a problem. This metric can give an early warning before the deterioration spreads to other services and makes the whole system unavailable. Possible reasons:
Resource competition 1: CPU
Resource competition 2: IO
Resource competition 3: network IO
Resource competition 4: database
Resource competition 5: solr, redis
Downstream interfaces: exceptions cause response delays
1.3 Machine alarms
Unlike service unavailability, these errors do not directly make the service unavailable, but in a mixed deployment the services on the same machine may interfere with each other:
CPU
Disk
File descriptors (fd)
IO (network, disk)
1.4 Summary
Having written this much, many cases have been mentioned repeatedly: the cause of an online problem is usually nothing more than system resources or the application itself. Once you master the tools for monitoring and inspecting these resources and their data, it becomes much easier to locate online problems.
2. Commonly used tools
2.1 Linux tools
A) sysstat:
iostat: view read and write pressure

[sankuai@cos-mop01 logs]$ iostat
Linux 2.6.32-20131120.mt (cos-mop01.lf.sankuai.com)  October 21, 2015  _x86_64_  (4 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.88    0.00    0.87    0.12    0.05   97.07

Device:    tps   Blk_read/s   Blk_wrtn/s    Blk_read    Blk_wrtn
vda       1.88        57.90        12.11  2451731906   512911328
vdb       0.01         0.40         1.41    17023940    59522616
vdc       1.14        28.88        36.63  1223046988  1551394969
sar: view CPU, memory, network and disk IO. With the history configuration enabled, historical data can also be viewed.

/etc/sysconfig/sysstat
HISTORY=7

/etc/cron.d/sysstat
*/10 * * * * root /usr/lib/sa/sa1 1 1

sar -u/-r/-B/-b/-q/-P/-n -f /var/log/sa/sa09
B) top
Pay attention to load, cpu, mem, and swap
Resource information can also be viewed per thread (top version greater than 3.2.7); see the example after the output below.
top - 19:33:00 up 490 days, 4:33, 2 users, load average: 0.13, 0.39, 0.42
Tasks: 157 total, 1 running, 156 sleeping, 0 stopped, 0 zombie
Cpu(s): 4.9%us, 2.7%sy, 0.0%ni, 92.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.3%st
Mem:  5991140k total, 57884k used, 202256k free, 4040k buffers
Swap: 2096440k total, 447332k used, 1649108k free, 232884k cached

  PID USER     PR NI  VIRT  RES  SHR S %CPU %MEM     TIME+  SWAP CODE DATA COMMAND
18720 sankuai  20  0 8955m 4.3g 6744 S 22.6 74.5 174:30.73   ...  ... 8.6g java
27794 sankuai  20  0 5715m 489m 2116 S 11.6  ...       ...   ...  ... 3.9g java
13233 root     20  0  420m 205m 2528 S  0.0  3.5       ...   ...  91m 304m puppetd
21526 sankuai  20  0 2513m  69m 4484 S  0.0  1.2       ...   ...  ... 2.4g java
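As a concrete illustration of the per-thread view mentioned above, a minimal sketch (the pid is a placeholder, not a value from the output above):

# show per-thread resource usage for a single process; in interactive top, the H key toggles threads
top -H -p $pid
# one-shot, non-interactive snapshot suitable for saving to a file
top -H -b -n 1 -p $pid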
C) vmstat
[sankuai@cos-mop01 logs]$ vmstat
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0 447332 200456   4160 234512    0    0    11     6    0    0  2  1 97  0  0
D) tcpdump
A powerful tool for locating network problems. It shows the details of TCP/IP packets, so you need to be familiar with the TCP/IP protocol; it can be used together with wireshark.
Common scenarios: analyzing network latency, packet loss, and network problems in complex environments.

#!/bin/bash
tcpdump -i eth0 -s 0 -l -w - dst port 3306 | strings | perl -e '
while(<>) {
    chomp;
    next if /^[^ ]+[ ]*$/;
    if (/^(SELECT|UPDATE|INSERT|COMMIT|ROLLBACK|CREATE|DROP|ALTER|CALL)/i) {
        if (defined $q) { print "$q\n"; }
        $q = $_;
    } else {
        $_ =~ s/^[ \t]+//;
        $q .= " $_";
    }
}'
2.2 Java tools
A) jstat
[sankuai@cos-mop01 logs]$ jstat -gc 18704
 S0C    S1C    S0U    S1U     EC       EU       OC      OU     MC     MU    CCSC   CCSU   YGC   YGCT   FGC   FGCT    GCT
3584.0 3584.0  0.0  24064.0 13779.7 62976.0  4480.0  677.9  384.0  66.60  0.000    0    0.000  0.000
B) jmap
jmap -dump:format=b,file=heap.bin $pid
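Besides a full dump, a quick class histogram is often enough to spot a leak suspect; a minimal sketch (the pid is a placeholder):

# print the live-object histogram and keep the 20 largest classes
# note: the :live option triggers a full GC before counting
jmap -histo:live $pid | head -20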
C) jstack or kill -3
View deadlocks, thread waits.
Thread status:
RUNNABLE
TIMED_WAITING (on object monitor)
TIMED_WAITING (sleeping)
TIMED_WAITING (parking)
WAITING (on object monitor)
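To get a quick overview of how the threads of a process are distributed across these states, a common trick is to count the state lines in a dump; a minimal sketch (the pid is a placeholder):

# count threads per state in a live thread dump
jstack $pid | grep "java.lang.Thread.State" | sort | uniq -c | sort -rn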
D) jhat, jconsole
jhat is difficult to use; jconsole fetches information through JMX, which has an impact on performance.
E) gc log
-XX:+UseParallelOldGC
-XX:+UseConcMarkSweepGC
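The GC log itself has to be enabled with JVM flags; a minimal sketch using JDK 8 style logging flags (the log path and app.jar are placeholders):

# CMS collector with a GC log written to a file
java -XX:+UseConcMarkSweepGC \
     -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
     -Xloggc:/path/to/gc.log \
     -jar app.jar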
2.3 Third-party tools
A) MAT
Object details
Inbound/outbound references
Thread overview
Configuration options
./MemoryAnalyzer -keep_unreachable_objects heap_file
4. Case analysis
4.1 High CPU
Symptom: CPU alarm
Locating the problem:
View threads with high CPU usage
sankuai@sin2:~$ ps H -eo user,pid,ppid,tid,time,%cpu | sort -rnk6 | head -10
sankuai  13808  13807  13808  00:00:00  8.4
sankuai  29153      1  29211  00:21:13  0.9
sankuai  29153      1  29213  00:20:01  0.8
sankuai  29153      1  29205  00:17:35  0.7
sankuai  29153      1  29210  00:11:50  0.5
sankuai  29153      1   1323  00:08:37  0.5
sankuai  29153      1  29207  00:10:02  0.4
sankuai  29153      1  29206       ...  0.3
sankuai  29153      1  29208       ...  0.4
sankuai     ...                    ...  0.2
Thread dump
jstack $pid > a.txt
printf "%x\n" $tid   # the hex value corresponds to nid=0x... in the dump
Find the code executed by the thread
"main-SendThread (cos-zk13.lf.sankuai.com:9331)" # 25 daemon prio=5 os_prio=0 tid=0x00007f78fc350000 nid=$TIDx runnable [0x00007f79c4d09000] java.lang.Thread.State: RUNNABLE at org.apache.zookeeper.ClientCnxn$SendThread.run (ClientCnxn.java:1035) at java.util.concurrent.FutureTask.run (FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:617) at java .lang.Thread.run (Thread.java:745)
4.2 High IO
Symptom: disk IO alarm
Environment: the sysstat tools need to be installed
Locating the problem:
A) View threads with high IO usage
pidstat -d -t -p $pid
B) The remaining steps are the same as in 4.1
4.3 Resources
A) Database
"DB-Processor-13" daemon prio=5 tid=0x003edf98 nid=0xca waiting for monitor entry [0x000000000825f000] java.lang.Thread.State: BLOCKED (on object monitor) at ConnectionPool.getConnection (ConnectionPool.java:102)-waiting to lock (a beans.ConnectionPool) at Service.getCount (ServiceCnt.java:111) at Service.insert (ServiceCnt.java:43) "DB-Processor-14" daemon prio=5 tid=0x003edf98 nid=0xca waiting For monitor entry [0x000000000825f020] java.lang.Thread.State: BLOCKED (on object monitor) at ConnectionPool.getConnection (ConnectionPool.java:102)-waiting to lock (a beans.ConnectionPool) at Service.getCount (ServiceCnt.java:111) at Service.insertCount (ServiceCnt.java:43)
B) log
"RMI TCP Connection(267865)-172.16.5.25" daemon prio=10 tid=0x00007fd508371000 nid=0x55ae waiting for monitor entry [0x00007fd4f8684000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.log4j.Category.callAppenders(Category.java:201)
        - waiting to lock (an org.apache.log4j.Logger)
        at org.apache.log4j.Category.forcedLog(Category.java:388)
        at org.apache.log4j.Category.log(Category.java:853)
        at org.apache.commons.logging.impl.Log4JLogger.warn(Log4JLogger.java:234)
        at com.xxx.core.common.lang.cache.remote.MemcachedClient.get(MemcachedClient.java:110)
C) web server
There are two very important system parameters:
maxThreads: the number of worker threads
backlog: the size of the TCP accept queue; Jetty: ServerConnector.acceptQueueSize, Tomcat: Connector.acceptCount. If it is set too small, 502 errors will appear under high concurrency (a way to check this is sketched below).
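One way to check whether the accept queue is actually overflowing is to look at the listening socket and the kernel counters; a minimal sketch (port 8080 is a placeholder):

# for a listening socket, Send-Q is the configured backlog and Recv-Q is the current queue length
ss -lnt | grep 8080
# cumulative counters such as "times the listen queue of a socket overflowed"
netstat -s | grep -i listen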
4.4 GC
A) CMS failures
Promotion failed
2015-09-18T03:47:33.108+0800: 627188.183: [GC 627188.183: [ParNew (promotion failed)
Desired survivor size 17432576 bytes, new threshold 1 (max 6)
- age 1: 34865032 bytes, 34865032 total
: 306688K->306688K(306688K), 161.1284530 secs] 627349.311: [CMS CMS: abort preclean due to time 2015-09-18T03:50:14.743+0800: 627349.818: [CMS-concurrent-abortable-preclean: 1.597/162.729 secs] [Times: user=174.58 sys=84.57, real=162.71 secs]
(concurrent mode failure): 1550703K->592286K(1756416K), 2.9879760 secs] 1755158K->592286K(2063104K), [CMS Perm : 67701K->67695K(112900K)], 164.1167250 secs] [Times: user=175.61 sys=84.57, real=164.09 secs]
Concurrent mode failure
[CMS 2015-09-18T07:07:27.132+0800: 639182.207: [CMS-concurrent-sweep: 1.704/13.116 secs] [Times: user=17.16 sys=5.20, real=13.12 secs]
(concurrent mode failure): 1546078K->682301K(1756416K), 4.0745320 secs] 1630977K->682301K(2063104K), [CMS Perm : 67700K->67693K(112900K)], 15.4860730 secs] [Times: user=19.40 sys=5.20, real=15.48 secs]
B) Continuous Full GC
The application has a memory leak, and garbage collection takes up a large share of CPU time; in extreme cases more than 90% of the time is spent doing GC.
For applications whose liveness is checked over HTTP, or which keep themselves registered in Zookeeper via heartbeats, this shows up as failed availability checks or as the node being removed by zk.
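A simple way to confirm this situation on a running process is to watch the Full GC counters; a minimal sketch (the pid and the 1000 ms interval are placeholders):

# print GC utilization once per second; FGC/FGCT growing every sample means back-to-back Full GC
jstat -gcutil $pid 1000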
5. Precautions
Preserve the scene: thread dump, top output, heap dump (see the sketch below)
Keep records: log files, the database
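A minimal sketch of capturing the scene before restarting a sick instance (the pid argument and output paths are placeholders):

#!/bin/bash
pid=$1                                             # pid of the problematic JVM
ts=$(date +%Y%m%d%H%M%S)
jstack $pid            > /tmp/jstack-$ts.txt       # thread dump
top -H -b -n 1 -p $pid > /tmp/top-$ts.txt          # per-thread CPU snapshot
jmap -dump:format=b,file=/tmp/heap-$ts.bin $pid    # heap dump (may pause the JVM noticeably)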