Background
"offline no problem", "there can be no problem with the code is the reason for the system", "can you debug remotely online"
Online problems differ from bugs found during development: they are tied to the runtime environment, load, concurrency, and specific business scenarios. For online problems, using the tools available in the production environment to collect the necessary information is critical to locating the cause.
The bugs and resource bottlenecks behind a problem are rarely visible directly, so the root cause has to be inferred from resource usage data, logs, and other information. Locating difficult problems usually requires combining several approaches to trace back to the source.
In this article I have sorted out the tools I use and share some examples.
1. Common problems
1.1 Availability
Here are several common situations that lead to service unavailability:
A) 502 Bad Gateway
For applications, especially HTTP-based ones, nothing is more serious than "502 Bad Gateway": it means the back-end service is completely unavailable.
Insufficient resources 1: garbage collection. With CMS, a memory leak or insufficient memory can cause long application pauses.
Insufficient resources 2: not enough server threads. Common web servers such as Tomcat and Jetty are configured with a maximum number of worker threads.
Insufficient resources 3: insufficient database resources. Databases are usually accessed through a connection pool; a low maxConnection setting and too many slow queries will block the worker threads in the web server.
Insufficient resources 4: IO bottlenecks. In the online environment IO is shared, especially in mixed-deployment environments (CRM fortunately does not have this problem, but many agents do). Our commonly used log4j also treats each log file as an exclusive resource: a thread has to acquire a lock before it can write to the log.
......
All kinds of OOM
B) Socket exception
Common: Connection reset by peer, Broken pipe, EOFException
Network problems: may occur with cross-ISP or cross-datacenter access
Program bugs: the socket is closed abnormally
1.2 Average response time
The most intuitive sign that the system has a problem. This metric can give an early warning before the deterioration spreads to other services and makes the whole system unavailable. Possible reasons:
Resource competition 1: CPU
Resource competition 2: IO
Resource competition 3: network IO
Resource competition 4: database
Resource competition 5: solr, redis
Downstream interfaces: exceptions cause response delays
1.3 Machine alarms
Unlike service unavailability, these errors do not directly make the service unavailable, but in a mixed deployment the services on the same machine may interfere with each other:
CPU
Disk
File descriptors (fd)
IO (network, disk)
1.4 Summary
Having written this much, many cases have been mentioned repeatedly: the cause of an online problem is usually nothing more than system resources or the application itself. Once you master the tools for monitoring and inspecting these resources and their data, it becomes much easier to locate online problems.
2. Commonly used tools
2.1 Linux tools
A) sysstat:
iostat: view read and write pressure

[sankuai@cos-mop01 logs]$ iostat
Linux 2.6.32-20131120.mt (cos-mop01.lf.sankuai.com)  October 21, 2015  _x86_64_  (4 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.88    0.00    0.87    0.12    0.05   97.07

Device:    tps   Blk_read/s   Blk_wrtn/s    Blk_read    Blk_wrtn
vda       1.88        57.90        12.11  2451731906   512911328
vdb       0.01         0.40         1.41    17023940    59522616
vdc       1.14        28.88        36.63  1223046988  1551394969
sar: view CPU, memory, network and disk IO. With the history configuration enabled, historical data can also be viewed.

/etc/sysconfig/sysstat
HISTORY=7

/etc/cron.d/sysstat
*/10 * * * * root /usr/lib/sa/sa1 1 1

sar -u/-r/-B/-b/-q/-P/-n -f /var/log/sa/sa09
B) top
Pay attention to load, cpu, mem, and swap
Resource information can also be viewed per thread (top version greater than 3.2.7); see the example after the output below.
top - 19:33:00 up 490 days, 4:33, 2 users, load average: 0.13, 0.39, 0.42
Tasks: 157 total, 1 running, 156 sleeping, 0 stopped, 0 zombie
Cpu(s): 4.9%us, 2.7%sy, 0.0%ni, 92.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.3%st
Mem:  5991140k total, 57884k used, 202256k free, 4040k buffers
Swap: 2096440k total, 447332k used, 1649108k free, 232884k cached

  PID USER     PR NI  VIRT  RES  SHR S %CPU %MEM     TIME+  SWAP CODE DATA COMMAND
18720 sankuai  20  0 8955m 4.3g 6744 S 22.6 74.5 174:30.73   ...  ... 8.6g java
27794 sankuai  20  0 5715m 489m 2116 S 11.6  ...       ...   ...  ... 3.9g java
13233 root     20  0  420m 205m 2528 S  0.0  3.5       ...   ...  91m 304m puppetd
21526 sankuai  20  0 2513m  69m 4484 S  0.0  1.2       ...   ...  ... 2.4g java
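As a concrete illustration of the per-thread view mentioned above, a minimal sketch (the pid is a placeholder, not a value from the output above):

# show per-thread resource usage for a single process; in interactive top, the H key toggles threads
top -H -p $pid
# one-shot, non-interactive snapshot suitable for saving to a file
top -H -b -n 1 -p $pid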
C) vmstat
[sankuai@cos-mop01 logs]$ vmstat
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0 447332 200456   4160 234512    0    0    11     6    0    0  2  1 97  0  0
D) tcpdump
A powerful tool for locating network problems. It shows the details of TCP/IP packets, so you need to be familiar with the TCP/IP protocol; it can be used together with wireshark.
Common scenarios: analyzing network latency, packet loss, and network problems in complex environments.

#!/bin/bash
tcpdump -i eth0 -s 0 -l -w - dst port 3306 | strings | perl -e '
while(<>) {
    chomp;
    next if /^[^ ]+[ ]*$/;
    if (/^(SELECT|UPDATE|INSERT|COMMIT|ROLLBACK|CREATE|DROP|ALTER|CALL)/i) {
        if (defined $q) { print "$q\n"; }
        $q = $_;
    } else {
        $_ =~ s/^[ \t]+//;
        $q .= " $_";
    }
}'
2.2 Java tools
A) jstat
[sankuai@cos-mop01 logs]$ jstat -gc 18704
 S0C    S1C    S0U    S1U     EC       EU       OC      OU     MC     MU    CCSC   CCSU   YGC   YGCT   FGC   FGCT    GCT
3584.0 3584.0  0.0  24064.0 13779.7 62976.0  4480.0  677.9  384.0  66.60  0.000    0    0.000  0.000
B) jmap
jmap -dump:format=b,file=heap.bin $pid
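Besides a full dump, a quick class histogram is often enough to spot a leak suspect; a minimal sketch (the pid is a placeholder):

# print the live-object histogram and keep the 20 largest classes
# note: the :live option triggers a full GC before counting
jmap -histo:live $pid | head -20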
C) jstack or kill -3
View deadlocks, thread waits.
Thread status:
RUNNABLE
TIMED_WAITING (on object monitor)
TIMED_WAITING (sleeping)
TIMED_WAITING (parking)
WAITING (on object monitor)
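To get a quick overview of how the threads of a process are distributed across these states, a common trick is to count the state lines in a dump; a minimal sketch (the pid is a placeholder):

# count threads per state in a live thread dump
jstack $pid | grep "java.lang.Thread.State" | sort | uniq -c | sort -rn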
D) jhat, jconsole
jhat is difficult to use; jconsole fetches information through JMX, which has an impact on performance.
E) gc log
-XX:+UseParallelOldGC
-XX:+UseConcMarkSweepGC
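The GC log itself has to be enabled with JVM flags; a minimal sketch using JDK 8 style logging flags (the log path and app.jar are placeholders):

# CMS collector with a GC log written to a file
java -XX:+UseConcMarkSweepGC \
     -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
     -Xloggc:/path/to/gc.log \
     -jar app.jar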
2.3 Third-party tools
A) MAT
Object details
Inbound/outbound references
Thread overview
Configuration options
./MemoryAnalyzer -keep_unreachable_objects heap_file
4. Case analysis
4.1 High CPU
Symptom: CPU alarm
Locating the problem:
View threads with high CPU usage
sankuai@sin2:~$ ps H -eo user,pid,ppid,tid,time,%cpu | sort -rnk6 | head -10
sankuai  13808  13807  13808  00:00:00  8.4
sankuai  29153      1  29211  00:21:13  0.9
sankuai  29153      1  29213  00:20:01  0.8
sankuai  29153      1  29205  00:17:35  0.7
sankuai  29153      1  29210  00:11:50  0.5
sankuai  29153      1   1323  00:08:37  0.5
sankuai  29153      1  29207  00:10:02  0.4
sankuai  29153      1  29206       ...  0.3
sankuai  29153      1  29208       ...  0.4
sankuai     ...                    ...  0.2
Thread dump
jstack $pid > a.txt
printf "%x\n" $tid   # the hex value corresponds to nid=0x... in the dump
Find the code executed by the thread
"main-SendThread (cos-zk13.lf.sankuai.com:9331)" # 25 daemon prio=5 os_prio=0 tid=0x00007f78fc350000 nid=$TIDx runnable [0x00007f79c4d09000] java.lang.Thread.State: RUNNABLE at org.apache.zookeeper.ClientCnxn$SendThread.run (ClientCnxn.java:1035) at java.util.concurrent.FutureTask.run (FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:617) at java .lang.Thread.run (Thread.java:745)
4.2 High IO
Symptom: disk IO alarm
Environment: the sysstat tools need to be installed
Locating the problem:
A) View threads with high IO usage
pidstat -d -t -p $pid
B) The remaining steps are the same as in 4.1
4.3 Resources
A) Database
"DB-Processor-13" daemon prio=5 tid=0x003edf98 nid=0xca waiting for monitor entry [0x000000000825f000] java.lang.Thread.State: BLOCKED (on object monitor) at ConnectionPool.getConnection (ConnectionPool.java:102)-waiting to lock (a beans.ConnectionPool) at Service.getCount (ServiceCnt.java:111) at Service.insert (ServiceCnt.java:43) "DB-Processor-14" daemon prio=5 tid=0x003edf98 nid=0xca waiting For monitor entry [0x000000000825f020] java.lang.Thread.State: BLOCKED (on object monitor) at ConnectionPool.getConnection (ConnectionPool.java:102)-waiting to lock (a beans.ConnectionPool) at Service.getCount (ServiceCnt.java:111) at Service.insertCount (ServiceCnt.java:43)
B) log
"RMI TCP Connection(267865)-172.16.5.25" daemon prio=10 tid=0x00007fd508371000 nid=0x55ae waiting for monitor entry [0x00007fd4f8684000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.log4j.Category.callAppenders(Category.java:201)
        - waiting to lock (an org.apache.log4j.Logger)
        at org.apache.log4j.Category.forcedLog(Category.java:388)
        at org.apache.log4j.Category.log(Category.java:853)
        at org.apache.commons.logging.impl.Log4JLogger.warn(Log4JLogger.java:234)
        at com.xxx.core.common.lang.cache.remote.MemcachedClient.get(MemcachedClient.java:110)
C) web server
There are two very important system parameters:
maxThreads: the number of worker threads
backlog: the size of the TCP accept queue; Jetty: ServerConnector.acceptQueueSize, Tomcat: Connector.acceptCount. If it is set too small, 502 errors will appear under high concurrency (a way to check this is sketched below).
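One way to check whether the accept queue is actually overflowing is to look at the listening socket and the kernel counters; a minimal sketch (port 8080 is a placeholder):

# for a listening socket, Send-Q is the configured backlog and Recv-Q is the current queue length
ss -lnt | grep 8080
# cumulative counters such as "times the listen queue of a socket overflowed"
netstat -s | grep -i listen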
4.4 GC
A) CMS failures
Promotion failed
2015-09-18T03:47:33.108+0800: 627188.183: [GC 627188.183: [ParNew (promotion failed)
Desired survivor size 17432576 bytes, new threshold 1 (max 6)
- age 1: 34865032 bytes, 34865032 total
: 306688K->306688K(306688K), 161.1284530 secs] 627349.311: [CMS CMS: abort preclean due to time 2015-09-18T03:50:14.743+0800: 627349.818: [CMS-concurrent-abortable-preclean: 1.597/162.729 secs] [Times: user=174.58 sys=84.57, real=162.71 secs]
(concurrent mode failure): 1550703K->592286K(1756416K), 2.9879760 secs] 1755158K->592286K(2063104K), [CMS Perm : 67701K->67695K(112900K)], 164.1167250 secs] [Times: user=175.61 sys=84.57, real=164.09 secs]
Concurrent mode failure
[CMS 2015-09-18T07:07:27.132+0800: 639182.207: [CMS-concurrent-sweep: 1.704/13.116 secs] [Times: user=17.16 sys=5.20, real=13.12 secs]
(concurrent mode failure): 1546078K->682301K(1756416K), 4.0745320 secs] 1630977K->682301K(2063104K), [CMS Perm : 67700K->67693K(112900K)], 15.4860730 secs] [Times: user=19.40 sys=5.20, real=15.48 secs]
B) Continuous Full GC
The application has a memory leak, and garbage collection takes up a large share of CPU time; in extreme cases more than 90% of the time is spent doing GC.
For applications whose liveness is checked over HTTP, or which keep themselves registered in Zookeeper via heartbeats, this shows up as failed availability checks or as the node being removed by zk.
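A simple way to confirm this situation on a running process is to watch the Full GC counters; a minimal sketch (the pid and the 1000 ms interval are placeholders):

# print GC utilization once per second; FGC/FGCT growing every sample means back-to-back Full GC
jstat -gcutil $pid 1000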
5. Precautions
Preserve the scene: thread dump, top output, heap dump (see the sketch below)
Keep records: log files, the database
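A minimal sketch of capturing the scene before restarting a sick instance (the pid argument and output paths are placeholders):

#!/bin/bash
pid=$1                                             # pid of the problematic JVM
ts=$(date +%Y%m%d%H%M%S)
jstack $pid            > /tmp/jstack-$ts.txt       # thread dump
top -H -b -n 1 -p $pid > /tmp/top-$ts.txt          # per-thread CPU snapshot
jmap -dump:format=b,file=/tmp/heap-$ts.bin $pid    # heap dump (may pause the JVM noticeably)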