This article explains in detail how to use fault-analysis tools under Linux. The editor finds it very practical, so it is shared here for your reference; I hope you gain something after reading it.
1. Background
Sometimes you encounter difficult and complicated problems whose root cause the monitoring plug-ins cannot identify immediately. At that point you need to log on to the server and analyze further. Analyzing such problems requires a certain accumulation of technical experience, and some problems span a wide range of areas before they can be located, so analyzing problems and stepping on pitfalls is a great exercise for one's growth and self-improvement. If we have a good set of analysis tools, we get twice the result with half the effort: they help you identify the problem quickly and save you a lot of time for more in-depth work.
2. Description
This article introduces a variety of problem-locating tools, combined with case studies.
3. Methodology for analyzing problems
Applying the 5W2H method, we can raise several questions for performance analysis:
What: what does the phenomenon look like?
When: when does it happen?
Why: why does it happen?
Where: where does the problem occur?
How much: how many resources are consumed?
How to do: how can the problem be solved?
4. CPU
4.1 description
For applications, we usually focus on kernel CPU scheduler functionality and performance.
Thread-state analysis mainly looks at where thread time is spent. Thread states generally fall into two categories:
A. on-CPU: executing; the time spent executing is usually divided into user time (user) and system time (sys).
B. off-CPU: waiting for the next turn on the CPU, or waiting for I/O, locks, paging, and so on; this state can be subdivided into runnable, anonymous paging, sleeping, lock, idle, and other states.
If a lot of time is spent on-CPU, CPU profiling can quickly explain why; if the time is spent off-CPU, locating the problem takes much longer. Either way, there are still some concepts that need to be clear:
Processor
Core
Hardware thread
CPU memory cache
Clock frequency
CPI (cycles per instruction) and IPC (instructions per cycle)
CPU instruction
Utilization rate
User time / kernel time
Scheduler
Run queue
Preemption
Multiple processes
Multithreading
Word length
4.2 Analytical tools
Description:
uptime, vmstat, mpstat, top, and pidstat can only report CPU usage and load.
perf can trace how much time specific functions inside a process consume, and can also profile specified kernel functions to show where the hot spots are.
4.3 mode of use
// View system cpu usage
top
// View all cpu core information
mpstat -P ALL 1
// View cpu usage and average load
vmstat 1
// Per-process cpu statistics
pidstat -u 1 -p pid
// Trace function-level cpu usage inside a process
perf top -p pid -e cpu-clock
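If you want a concrete target to practice these commands on, the following is a minimal sketch of a deliberately CPU-bound C program (not from the original article; the file name busy.c and the function name spin are made up for illustration). Compile it with gcc -o busy busy.c, run it, and point pidstat -u and perf top -p at the printed pid to see the user-state time concentrated in spin().

// busy.c - a deliberately CPU-bound test target, only for exercising the tools above
#include <stdio.h>
#include <unistd.h>

static volatile unsigned long counter;   // volatile keeps the loop from being optimized away

static void spin(void) {
    for (int i = 0; i < 1000000; i++)
        counter++;
}

int main(void) {
    printf("pid: %d\n", (int)getpid());  // print the pid so tools can be attached to it
    for (;;)                             // run until interrupted with Ctrl-C
        spin();
    return 0;
}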
5. Memory
5.1 description
Memory exists to improve efficiency. When analyzing problems, memory issues may affect not only performance but also services, or cause other problems. There are also some memory concepts that need to be clear:
Main memory
Virtual memory
Resident memory
Address space
OOM
Page cache
Page fault
Paging
Swap space
Swapping
User allocators libc, glibc, libmalloc and mtmalloc
Linux kernel-level SLUB allocator
5.2 Analytical tools
Description:
free, vmstat, top, pidstat, and pmap can only report memory information and the memory usage of processes.
valgrind can analyze memory leaks.
dtrace performs dynamic tracing; it requires a deep understanding of kernel functions and scripts written in the D language to complete the tracing.
5.3 mode of use
// View system memory usage
free -m
// Virtual memory statistics
vmstat 1
// View system memory status
top
// 1s collection interval, per-process memory statistics
pidstat -p pid -r 1
// View the memory map of a process
pmap -d pid
// Detect memory problems in a program
valgrind --tool=memcheck --leak-check=full --log-file=./log.txt ./program_name
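To see what the valgrind command above reports, here is a minimal sketch of a deliberately leaking program (not from the original article; the file name leak.c, the block count, and the block size are made up for illustration). Running it under valgrind --tool=memcheck --leak-check=full should show the allocations in the "definitely lost" summary.

// leak.c - allocates memory that is never freed, to exercise valgrind memcheck
#include <stdlib.h>
#include <string.h>

int main(void) {
    for (int i = 0; i < 100; i++) {
        char *buf = malloc(1024);   // allocate a block...
        if (buf)
            memset(buf, 0, 1024);   // ...touch it once...
    }                               // ...and never free it: one leak per iteration
    return 0;
}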
6. Disk IO
6.1 description
The disk is usually the slowest subsystem of a computer and the place where performance bottlenecks most often appear, because the disk is farthest from the CPU and accessing it involves mechanical operations such as spindle rotation and track seeking. The speed difference between accessing the disk and accessing memory is measured in orders of magnitude, like the difference between a day and a minute. To monitor IO performance, you need to understand the fundamentals and how Linux handles IO between the disk and memory.
Before looking at disk IO, we also need to understand some concepts, such as:
File system
VFS
File system cache
Page cache
Buffer cache
Directory cache
Inode
Inode cache
noop I/O scheduling policy
6.2 Analytical tools
6.3 Mode of use
// View system io information
iotop
// Detailed io statistics
iostat -d -x -k 1 10
// View process-level io information
pidstat -d 1 -p pid
// View system IO requests; for example, when an IO anomaly is found,
// this can be used to investigate what is causing it
perf record -e block:block_rq_issue -ag
^C
perf report
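If you need something that generates steady write traffic to observe with iotop, iostat, and pidstat -d, the following is a minimal sketch (not from the original article; the file name iowrite.c, the block size, and the iteration count are made up for illustration). It writes about 400 MB in 4 KB blocks and calls fsync periodically so the writes actually reach the device.

// iowrite.c - generates steady write I/O so it shows up in iostat/iotop/pidstat -d
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>

int main(void) {
    char buf[4096];
    memset(buf, 'x', sizeof(buf));
    int fd = open("io_test.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }
    for (int i = 0; i < 100000; i++) {
        if (write(fd, buf, sizeof(buf)) < 0) { perror("write"); break; }
        if (i % 100 == 0)
            fsync(fd);          // flush dirty pages so the writes hit the disk, not just the page cache
    }
    close(fd);
    unlink("io_test.dat");      // remove the test file afterwards
    return 0;
}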
7. Network
7.1 description
Network monitoring is the most complex of all the Linux subsystems; there are too many factors involved, such as latency, blocking, collisions, and packet loss. Worse, the routers, switches, and wireless links connected to a Linux host affect the whole network, and it is hard to judge whether a problem comes from the Linux network subsystem or from other devices, which increases the complexity of monitoring and diagnosis. All the network cards we use now are adaptive network cards, meaning they can adjust automatically to the different speeds and working modes of the devices on the network.
7.2 Analytical tools
7.3 mode of use
// Display network statistics
netstat -s
// Display current UDP connections
netstat -nu
// Display UDP port usage
netstat -apu
// Count the number of network connections in each state on this machine
netstat -a | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'
// Display TCP connections
ss -t -a
// Display sockets summary information
ss -s
// Display all udp sockets
ss -u -a
// tcp and etcp statistics
sar -n TCP,ETCP 1
// View network IO
sar -n DEV 1
// Capture and display packets
tcpdump -i eth2 host 192.168.1.1 and port 80
// Capture packets and display data content by flow
tcpflow -cp host 192.168.1.1
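As a simple target for ss, netstat, and tcpdump, the following is a minimal sketch of a program that opens one TCP connection and holds it for a minute (not from the original article; the file name tcpconn.c is made up, and the address 192.168.1.1:80 is only reused from the tcpdump example above, so substitute a host that actually accepts connections). While it sleeps, the connection should be visible in ss -t -a and in a tcpdump capture.

// tcpconn.c - opens one TCP connection and holds it so it can be observed with ss/netstat/tcpdump
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port   = htons(80);                        // port 80, matching the tcpdump filter above
    inet_pton(AF_INET, "192.168.1.1", &addr.sin_addr);  // illustrative address; replace with a reachable host

    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("connect");
        close(fd);
        return 1;
    }
    sleep(60);        // keep the connection open long enough to inspect it with ss -t -a
    close(fd);
    return 0;
}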
8. System load
8.1 description
Load is a measure of how much work a computer is doing (Wikipedia: "the system load is a measure of the amount of work that a computer system is doing"); put simply, it is the length of the process run queue. Load average is the average load over a period of time (1 minute, 5 minutes, 15 minutes). As a rough rule of thumb, a load average close to the number of CPU cores means the cores are roughly fully occupied, while values well above the core count indicate work queuing up.
8.2 Analytical tools
8.3 how to use
// View load
uptime
top
vmstat
// Count time spent in system calls
strace -c -p pid
// Trace a specified system call, for example epoll_wait
strace -T -e epoll_wait -p pid
// View kernel log messages
dmesg
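To see how the strace commands above behave, the following is a minimal sketch of a program that simply calls epoll_wait in a loop (not from the original article; the file name epoll_loop.c, the 100 ms timeout, and the iteration count are made up for illustration). Attaching strace -c -p pid to it shows the epoll_wait count climbing, and strace -T -e epoll_wait -p pid shows each call taking roughly the timeout.

// epoll_loop.c - calls epoll_wait in a loop so it can be traced with strace
#include <stdio.h>
#include <unistd.h>
#include <sys/epoll.h>

int main(void) {
    int epfd = epoll_create1(0);
    if (epfd < 0) { perror("epoll_create1"); return 1; }

    struct epoll_event events[8];
    for (int i = 0; i < 1000; i++) {
        // no file descriptors are registered, so every call just times out after 100 ms;
        // each call still appears as one epoll_wait system call in strace
        epoll_wait(epfd, events, 8, 100);
    }
    close(epfd);
    return 0;
}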
9. Flame graph
9.1 description
Flame Graph is a performance analysis chart invented by Brendan Gregg, so named because of its resemblance to flames.
Flame graphs are mainly used to show the CPU call stack.
The y-axis represents the call stack, and each layer is a function. The deeper the call stack, the higher the flame, with the executing function at the top and its parent function at the bottom.
The x-axis represents the number of samples. The wider a function is on the x-axis, the more often it was sampled, that is, the longer it executed. Note that the x-axis does not represent the passage of time; the call stacks are merged and sorted alphabetically.
When reading a flame graph, look at which function at the top level occupies the greatest width. Any "plateau" means that function may have a performance problem. Color has no special meaning; since the flame graph shows how busy the CPU is, warm tones are generally chosen.
Common types of flame graphs include on-CPU, off-CPU, Memory, Hot/Cold, Differential, and so on.
9.2 install dependent libraries
// Install systemtap
yum install systemtap systemtap-runtime
// The kernel debug packages must match the installed kernel version, for example:
// uname -r
// 2.6.18-308.el5
kernel-debuginfo-2.6.18-308.el5.x86_64.rpm
kernel-devel-2.6.18-308.el5.x86_64.rpm
kernel-debuginfo-common-2.6.18-308.el5.x86_64.rpm
// Install the kernel debug packages
debuginfo-install --enablerepo=debuginfo search kernel
debuginfo-install --enablerepo=debuginfo search glibc
9.3 installation
git clone https://github.com/lidaohang/quick_location.git
cd quick_location
9.4 CPU-level flame graph
When CPU usage is too high, or utilization cannot be raised, can you quickly locate which part of the code is causing the problem?
The usual approach might be to pin the problem down through logs and similar means. With a flame graph, we can clearly see which function is using too much (or too little) CPU and causing the problem.
9.4.1 on-CPU
CPU usage is too high; the time spent executing is usually divided into user time (user) and system time (sys).
Mode of use:
// on-CPU user
sh ngx_on_cpu_u.sh pid
// Enter the result directory
cd ngx_on_cpu_u
// on-CPU kernel
sh ngx_on_cpu_k.sh pid
// Enter the result directory
cd ngx_on_cpu_k
// Open a temporary port 8088
python -m SimpleHTTPServer 8088
// Open a browser and enter the address
127.0.0.1:8088/pid.svg
DEMO:
// test.c
#include <stdio.h>
#include <stdlib.h>

void foo3() { }

void foo2() {
    int i;
    for (i = 0; i < 10; i++)
        foo3();
}

void foo1() {
    int i;
    for (i = 0; i < 1000; i++)
        foo3();
}

int main(void) {
    int i;
    for (i = 0; i < 1000000000; i++) {
        foo1();
        foo2();
    }
}

DEMO flame graph:
9.4.2 off-CPU
CPU usage is too low and utilization cannot be raised: the thread is waiting for the next turn on the CPU, or waiting for I/O, locks, paging, and so on. This state can be subdivided into runnable, anonymous paging, sleeping, lock, idle, and other states.
Mode of use:
// off-CPU user
sh ngx_off_cpu_u.sh pid
// Enter the result directory
cd ngx_off_cpu_u
// off-CPU kernel
sh ngx_off_cpu_k.sh pid
// Enter the result directory
cd ngx_off_cpu_k
// Open a temporary port 8088
python -m SimpleHTTPServer 8088
// Open a browser and enter the address
127.0.0.1:8088/pid.svg
Official DEMO:
9.5 Memory-level flame graph
What if a program in production has a memory leak that only appears in specific scenarios? What do we do then? Is there a good way or tool to quickly find the problem in the code? Likewise, a memory-level flame graph helps you quickly analyze the root cause of the problem.
Mode of use:
sh ngx_on_memory.sh pid
// Enter the result directory
cd ngx_on_memory
// Open a temporary port 8088
python -m SimpleHTTPServer 8088
// Open a browser and enter the address
127.0.0.1:8088/pid.svg
Official DEMO:
9.6 Performance regression: red/blue differential flame graph
Can you quickly locate a CPU performance regression? If your working environment is very complex and changes rapidly, locating this kind of problem with existing tools is quite challenging. By the time you have spent weeks finding the root cause, the code has already gone through several more rounds of changes and new performance problems have appeared. This is mainly useful for comparing builds or deployments against each other; if the regression is severe it can be fixed immediately.
Two ordinary flame graphs are captured and compared, and the differing parts are colored: red means an increase, blue means a decrease. The differential flame graph uses the current ("after the change") profile as the baseline, keeping its shape and size unchanged, so the color differences let you find the differing parts intuitively and see why the differences exist.
Mode of use:
cd quick_location
// Capture the profile 1 file before the code change
perf record -F 99 -p pid -g -- sleep 30
perf script > out.stacks1
// Capture the profile 2 file after the change
perf record -F 99 -p pid -g -- sleep 30
perf script > out.stacks2
// Generate the differential flame graph:
./FlameGraph/stackcollapse-perf.pl ../out.stacks1 > out.folded1
./FlameGraph/stackcollapse-perf.pl ../out.stacks2 > out.folded2
./FlameGraph/difffolded.pl out.folded1 out.folded2 | ./FlameGraph/flamegraph.pl > diff2.svg
DEMO:
// test.c
#include <stdio.h>
#include <stdlib.h>

void foo3() { }

void foo2() {
    int i;
    for (i = 0; i < 10; i++)
        foo3();
}

void foo1() {
    int i;
    for (i = 0; i < 1000; i++)
        foo3();
}

int main(void) {
    int i;
    for (i = 0; i < 100000000; i++) {
        foo1();
        foo2();
    }
}

// test1.c
#include <stdio.h>
#include <stdlib.h>

void foo3() { }

void foo2() {
    int i;
    for (i = 0; i < 10; i++)
        foo3();
}

void foo1() {
    int i;
    for (i = 0; i < 1000; i++)
        foo3();
}

void add() {
    int i;
    for (i = 0; i < 10000; i++)
        foo3();
}

int main(void) {
    long i;                               // long so the larger loop bound does not overflow int
    for (i = 0; i < 10000000000; i++) {
        foo1();
        foo2();
        add();
    }
}
DEMO red/blue differential flame graph:
10. Case study
10.1 Anomaly in the access-layer nginx cluster
Through the monitoring plug-in, a large number of 499 and 5xx status codes were found in the nginx cluster's request traffic at 19:00 on September 25, 2017. It was also found that the machines' CPU utilization kept rising, and the situation persisted.
10.2 Analysis of nginx-related indicators
A) Analyze nginx request traffic:
Conclusion:
The figure above shows that the traffic did not increase suddenly but actually decreased, so the problem has nothing to do with a sudden rise in request traffic.
B) Analyze nginx response time
Conclusion:
The figure above shows that the increase in nginx response time may be related to nginx itself or to the response time of the backend upstream.
C) Analyze nginx upstream response time
Conclusion:
The figure above shows that the nginx upstream response time has increased. It is speculated that the backend upstream response time is dragging down nginx, resulting in abnormal request traffic.
10.3 Analyzing system cpu
A) Observe system metrics through top
top
Conclusion:
It is found that the nginx worker cpu usage is relatively high.
B) Analyze cpu usage within the nginx process
perf top -p pid
Conclusion:
It is found that the main overhead is in free, malloc, and json parsing.
10.4 Flame graph analysis of cpu
A) Generate a user-mode cpu flame graph
// on-CPU user
sh ngx_on_cpu_u.sh pid
// Enter the result directory
cd ngx_on_cpu_u
// Open a temporary port 8088
python -m SimpleHTTPServer 8088
// Open a browser and enter the address
127.0.0.1:8088/pid.svg
Conclusion:
It is found that the code performs frequent json parsing, that the json library used does not perform well, and that it consumes a lot of cpu.
10.5 case summary
A) Analyzing the request traffic anomaly shows that the response time of the nginx upstream backend machines has become longer.
B) Analyzing the high cpu usage of the nginx process shows that an internal nginx module performs time-consuming json parsing and memory allocation/free operations.
10.5.1 in-depth analysis
Based on the conclusions of the two problems above, we carry out a further in-depth analysis.
A longer backend upstream response would, at most, reduce nginx's processing capacity; it cannot by itself cause the excessive cpu use inside nginx modules. Moreover, the module with high cpu usage at the time runs its logic only when a request arrives. It is therefore unlikely that the upstream backend dragging down nginx is what triggered the cpu-heavy operations.
10.5.2 solution
When faced with this kind of problem, we give priority to solving the known and clear-cut problem first, that is, the high cpu usage. The solution was to degrade and disable the module that consumed too much cpu, then observe. After the module was disabled, cpu usage came down and nginx request traffic returned to normal. As for why the upstream time became longer: an interface called by the upstream backend service may loop back to nginx again.
This is the end of the article on "how to use fault analysis under Linux". I hope the above content has been of some help and that you have learned something from it. If you think the article is good, please share it for more people to see.