Linux fault location is a must for operation and maintenance.


2026-01-05 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/02 Report--


1. Background

Sometimes you run into difficult, complicated problems whose root cause the monitoring plug-ins cannot identify right away. You then need to log in to the server and analyze further. This kind of analysis takes accumulated technical experience, and some problems span many areas before they can be located. Working through problems and pitfalls is therefore good exercise for one's growth. A good set of analysis tools gets twice the result with half the effort: it helps you locate problems quickly and leaves more time for deeper work.

2. Description

This article introduces a variety of troubleshooting tools and illustrates them with case studies.

3. Methodology for analyzing problems

Applying the 5W2H method, a performance analysis can start from the following questions:

What: what does the symptom look like?

When: when does it happen?

Why: why does it happen?

Where: where does the problem occur?

How much: how many resources are consumed?

How to do: how can the problem be solved?

4. CPU

4.1 Description

For applications, we usually focus on the functionality and performance of the kernel's CPU scheduler.

Thread state analysis aims to find out where thread time is spent. Thread states are generally classified as:

A. on-CPU: executing. Execution time is usually divided into user-mode time (user) and kernel-mode time (sys).

B. off-CPU: waiting for the next turn on the CPU, or waiting for I/O, locks, paging, and so on. This state can be subdivided into runnable, anonymous paging, sleeping, lock, idle, and so on.

If most of the time is spent on-CPU, CPU profiling can quickly explain why. If most of the time is spent off-CPU, locating the problem takes much longer. There are still some concepts that need to be clear:

Processor

Core

Hardware thread

CPU memory cache

Clock frequency

Cycles per instruction (CPI) and instructions per cycle (IPC)

CPU instructions

Utilization

User time / kernel time

Scheduler

Run queue

Preemption

Multiprocess

Multithreading

Word length
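To make the user time / kernel time split concrete, here is a minimal Python sketch. It assumes a Linux host where `/proc/<pid>/stat` follows the layout documented in proc(5), with `utime` and `stime` as fields 14 and 15, counted in clock ticks. The function names are illustrative and are not part of any tool mentioned in this article.

```python
import os


def parse_proc_stat_times(stat_line, ticks_per_second):
    """Return (user_seconds, system_seconds) from one /proc/<pid>/stat line."""
    # The command name (field 2) is wrapped in parentheses and may itself
    # contain spaces or parentheses, so split on the last ')' first.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so field 14 (utime) is rest[11]
    # and field 15 (stime) is rest[12].
    utime_ticks = int(rest[11])
    stime_ticks = int(rest[12])
    return utime_ticks / ticks_per_second, stime_ticks / ticks_per_second


def process_cpu_times(pid):
    """Read the live counters for a process (Linux only)."""
    with open("/proc/{}/stat".format(pid)) as f:
        line = f.read()
    return parse_proc_stat_times(line, os.sysconf("SC_CLK_TCK"))
```

Calling `process_cpu_times(pid)` twice, a few seconds apart, and subtracting the results shows whether a process is accumulating mostly user time or mostly kernel time.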

4.2 Analytical tools


Description: uptime, vmstat, mpstat, top, and pidstat can only report CPU usage and load. perf can trace the time spent in specific functions inside a process, and can count specified kernel functions and show where the hits land.
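The usage figures these tools report come from kernel tick counters. As a rough illustration, the sketch below computes CPU percentages from two samples of the aggregate `cpu` line in `/proc/stat`. It assumes the field order documented in proc(5) (user, nice, system, idle, iowait, irq, softirq, steal); the function names are made up for this example, and this is not how any particular tool is implemented.

```python
CPU_FIELDS = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a dict of tick counters."""
    fields = line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in fields[1:1 + len(CPU_FIELDS)]]
    return dict(zip(CPU_FIELDS, values))


def cpu_percentages(before, after):
    """Return the share of elapsed ticks spent in each state between two samples."""
    deltas = {name: after[name] - before[name] for name in before}
    total = sum(deltas.values())
    if total == 0:
        return {name: 0.0 for name in deltas}
    return {name: 100.0 * ticks / total for name, ticks in deltas.items()}
```

In practice the two samples would be the first line of `/proc/stat` read about one second apart, which roughly mirrors the interval argument in `vmstat 1`.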

4.3 Usage

// View system CPU usage

top

// View information for all CPU cores

mpstat -P ALL 1

// View CPU usage and load average

vmstat 1

// Per-process CPU statistics

pidstat -u 1 -p pid

// Trace function-level CPU usage inside a process

perf top -p pid -e cpu-clock

5. Memory

5.1 Description

Memory exists to improve efficiency. When analyzing problems, keep in mind that memory issues may not only degrade performance but also disrupt services or trigger other failures. There are also some memory concepts to be clear about:

Main memory

Virtual memory

Resident memory

Address space

OOM

Page cache

Page fault

Paging

Swap space

Swapping

User-level allocators: libc, glibc, libmalloc, and mtmalloc

Linux kernel-level SLUB allocator

5.2 Analytical tools


Description:

free, vmstat, top, pidstat, and pmap can only report memory statistics and per-process memory usage.

valgrind can analyze memory leaks.

dtrace provides dynamic tracing. It requires a deep understanding of kernel functions, and the tracing scripts are written in the D language.

5.3 Usage

// View system memory usage

free -m

// Virtual memory statistics

vmstat 1

// Check the system memory status

top

// Collect memory statistics for a process at 1 s intervals

pidstat -p pid -r 1

// View the memory map of a process

pmap -d pid

// Detect memory problems in a program

valgrind --tool=memcheck --leak-check=full --log-file=./log.txt ./program_name
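For scripted checks, the same system-wide numbers can be read directly from `/proc/meminfo`. The following is a minimal sketch, assuming a kernel that exposes the `MemAvailable` field (it is absent on old kernels such as the 2.6.18 example used later in this article). The helper names and the summary layout are illustrative only.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integers.

    Most values are in kB; a few entries (for example HugePages_Total)
    are plain counts.
    """
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts:
            info[key.strip()] = int(parts[0])
    return info


def memory_summary(info):
    """Summarize total and available memory in MiB plus a used percentage."""
    total = info["MemTotal"]
    available = info["MemAvailable"]  # assumption: field is present
    return {
        "total_mb": total // 1024,
        "available_mb": available // 1024,
        "used_percent": round(100.0 * (total - available) / total, 1),
    }
```

A typical call would be `memory_summary(parse_meminfo(open("/proc/meminfo").read()))`.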

6. Disk IO

6.1 Description

The disk is usually the slowest subsystem of a computer and the most likely place for a performance bottleneck, because it is the farthest from the CPU and accessing it involves mechanical operations such as platter rotation and seeking. The speed gap between disk access and memory access is measured in orders of magnitude, like the difference between a day and a minute. To monitor I/O performance, you need to understand the fundamentals and how Linux handles I/O between disk and memory.

Before we understand disk IO, we also need to understand some concepts, such as:

File system

VFS

File system cache

Page cache

Buffer cache

Directory cache

Inode

Inode cache

Noop I/O scheduling policy

6.2 Analysis tools

6.3 Usage

// View system I/O information

iotop

// Detailed I/O statistics

iostat -d -x -k 1 10

// View process-level I/O information

pidstat -d 1 -p pid

// Inspect the block I/O requests issued by the system. For example, when abnormal system I/O is observed, use this command to investigate what is causing it.

perf record -e block:block_rq_issue -ag

^C

perf report
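Per-device throughput and utilization of the kind iostat prints can be approximated from two samples of `/proc/diskstats`. The sketch below assumes the field layout from the kernel's iostats documentation (sector counts in 512-byte units, "time spent doing I/Os" in milliseconds). The function names and the output keys are illustrative, and this is not iostat's actual implementation.

```python
SECTOR_BYTES = 512  # /proc/diskstats counts sectors in 512-byte units


def parse_diskstats_line(line):
    """Extract the counters used below from one /proc/diskstats line."""
    f = line.split()
    return {
        "name": f[2],
        "reads": int(f[3]),            # reads completed
        "sectors_read": int(f[5]),
        "writes": int(f[7]),           # writes completed
        "sectors_written": int(f[9]),
        "io_ms": int(f[12]),           # time spent doing I/O, in ms
    }


def io_rates(before, after, interval_s):
    """Compute per-second rates and utilization between two samples."""
    kb_per_sector = SECTOR_BYTES / 1024.0
    busy_ms = after["io_ms"] - before["io_ms"]
    return {
        "r/s": (after["reads"] - before["reads"]) / interval_s,
        "w/s": (after["writes"] - before["writes"]) / interval_s,
        "rkB/s": (after["sectors_read"] - before["sectors_read"]) * kb_per_sector / interval_s,
        "wkB/s": (after["sectors_written"] - before["sectors_written"]) * kb_per_sector / interval_s,
        "%util": 100.0 * busy_ms / (interval_s * 1000.0),
    }
```

A utilization value that stays close to 100 suggests the device was busy for nearly the whole interval, which is a reasonable trigger for the perf block-request trace shown above.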

7. Network

7.1 Description

Network monitoring is the most complex of all the Linux subsystems. Many factors are involved, such as latency, blocking, collisions, and packet loss. Worse, the routers, switches, and wireless signals connected to a Linux host all affect the network as a whole, so it is hard to tell whether a problem lies in the Linux network subsystem or in another device, which makes monitoring and diagnosis more complex. The network cards in use today are all adaptive (auto-negotiating): they adjust automatically to the network speeds and working modes of the other devices on the network.

7.2 Analytical tools


7.3 Usage

// Display network statistics

netstat -s

// Display the current UDP connection status

netstat -nu

// Display UDP port usage

netstat -apu

// Count the network connections in each state on the machine

netstat -a | awk '/^tcp/ {++S[$NF]} END {for (a in S) print a, S[a]}'

// Display TCP connections

ss -t -a

// Display socket summary information

ss -s

// Show all UDP sockets

ss -u -a

// TCP and ETCP statistics

sar -n TCP,ETCP 1

// View network I/O

sar -n DEV 1

// Capture packets and print them packet by packet

tcpdump -i eth2 host 192.168.1.1 and port 80

// Capture packets and display the data content stream by stream

tcpflow -cp host 192.168.1.1

8. System load

8.1 Description

Load is a measure of how much work the computer is doing (Wikipedia: the system load is a measure of the amount of work that a computer system is doing). Put simply, it is the length of the process queue. Load Average is the average load over a period of time (1 minute, 5 minutes, 15 minutes).

8.2 Analysis tools

8.3 Usage

// Check the load

uptime

top

vmstat

// Summarize the time spent in system calls

strace -c -p pid

// Trace a specific system call, such as epoll_wait

strace -T -e epoll_wait -p pid

// View kernel log messages

dmesg
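The three load averages that uptime prints are also available in `/proc/loadavg`. The sketch below parses that file's format as documented in proc(5) and divides the load by the CPU count. Comparing load against the number of CPUs is a common rule of thumb rather than something stated in this article, and the function names are illustrative.

```python
def parse_loadavg(text):
    """Parse /proc/loadavg content such as '3.00 1.50 0.75 2/345 12345'."""
    fields = text.split()
    load1, load5, load15 = (float(x) for x in fields[:3])
    # Fourth field: currently runnable scheduling entities / total entities.
    runnable, total = (int(x) for x in fields[3].split("/"))
    return {
        "load1": load1,
        "load5": load5,
        "load15": load15,
        "runnable": runnable,
        "total": total,
    }


def load_per_cpu(load, cpu_count):
    """Normalize a load average by the number of CPUs."""
    return load / cpu_count
```

On a live host this could be called as `load_per_cpu(parse_loadavg(open("/proc/loadavg").read())["load1"], os.cpu_count())`; a value that stays well above 1 hints that tasks are queueing.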

9. Flame graph

9.1 Description

The Flame Graph is a performance analysis chart created by Brendan Gregg. It gets its name from its flame-like appearance.

The flame graph is mainly used to show the CPU call stack.

The y-axis represents the call stack, and each layer is a function. The deeper the call stack, the higher the flame, with the executing function at the top and its parent functions below it.

The x-axis represents the number of samples. The wider a function is on the x-axis, the more often it appeared in the samples, that is, the longer it ran. Note that the x-axis does not represent the passage of time: after all call stacks are merged, they are arranged alphabetically.

To read a flame graph, look for the top-level functions that occupy the most width. A "plateau" indicates that the function may have a performance problem. Color has no special meaning; warm tones are generally chosen because the flame graph shows how busy the CPU is.

The common types of flame graphs are On-CPU, Off-CPU, Memory, Hot/Cold, Differential, and so on.
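The merging of call stacks described above can be sketched in a few lines of Python. The example assumes each sample is a list of function names ordered from root to the executing function, and uses the semicolon-separated "folded" form that the FlameGraph stackcollapse scripts emit. The function names and the sample data (borrowed from the foo1/foo2/foo3 demo later in this article) are illustrative.

```python
from collections import Counter


def fold_stacks(samples):
    """Merge identical call stacks and count how often each was sampled."""
    folded = Counter()
    for stack in samples:
        folded[";".join(stack)] += 1
    return folded


def top_frame_widths(folded):
    """Return, per function, the fraction of samples in which it was executing.

    This corresponds to the width of the top-of-stack boxes in the graph.
    """
    total = sum(folded.values())
    widths = Counter()
    for stack, count in folded.items():
        widths[stack.split(";")[-1]] += count
    return {fn: n / total for fn, n in widths.items()}
```

A function with a large value in `top_frame_widths` is the kind of "plateau" to investigate first.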

9.2 Installing dependencies

// Install systemtap (already installed on the system by default)

yum install systemtap systemtap-runtime

// The kernel debug packages must match the kernel version. For example, if uname -r prints 2.6.18-308.el5:

kernel-debuginfo-2.6.18-308.el5.x86_64.rpm

kernel-devel-2.6.18-308.el5.x86_64.rpm

kernel-debuginfo-common-2.6.18-308.el5.x86_64.rpm

// Install the kernel debug packages

debuginfo-install --enablerepo=debuginfo search kernel

debuginfo-install --enablerepo=debuginfo search glibc

9.3 Installation

git clone https://github.com/lidaohang/quick_location.git

cd quick_location

9.4 CPU-level flame graphs

CPU usage is too high, or utilization cannot be pushed up: can you quickly locate which part of the code is responsible?

The usual approach is to narrow the problem down through logs and similar means. With flame graphs, we can see clearly which function uses too much or too little CPU.

9.4.1 on-CPU

CPU usage is too high. Execution time is usually divided into user-mode time (user) and kernel-mode time (sys).

Usage:

// on-CPU user

sh ngx_on_cpu_u.sh pid

// Enter the result directory

cd ngx_on_cpu_u

// on-CPU kernel

sh ngx_on_cpu_k.sh pid

// Enter the result directory

cd ngx_on_cpu_k

// Open a temporary port 8088

python -m SimpleHTTPServer 8088

// Open a browser and enter the address

127.0.0.1:8088/pid.svg
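The `SimpleHTTPServer` module used above exists only in Python 2; in Python 3 the equivalent command is `python3 -m http.server 8088`. For cases where the result directory should be served from a script, here is a minimal sketch using the Python 3 standard library (3.7 or later). The function name and the default host are assumptions made for this example.

```python
import functools
import http.server


def make_result_server(directory, port=8088, host="127.0.0.1"):
    """Build an HTTP server that serves files (such as pid.svg) from `directory`."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    return http.server.ThreadingHTTPServer((host, port), handler)
```

For example, `make_result_server("ngx_on_cpu_u").serve_forever()` serves the on-CPU user results until interrupted with Ctrl-C.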

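DEMO:

    void foo3()
    {
    }

    void foo2()
    {
        int i;
        for (i = 0; i < 10; i++)
            foo3();
    }

    void foo1()
    {
        int i;
        for (i = 0; i < 1000; i++)
            foo3();
    }

    int main(void)
    {
        int i;
        for (i = 0; i < 1000000000; i++) {
            foo1();
            foo2();
        }
    }

[Figure: DEMO flame graph]

9.4.2 off-CPU

CPU usage is too low and utilization is poor. The thread is waiting for its next turn on the CPU, or waiting for I/O, locks, paging, and so on. This state can be subdivided into runnable, anonymous paging, sleeping, lock, idle, and so on.

Usage:

// off-CPU user

sh ngx_off_cpu_u.sh pid

// Enter the result directory

cd ngx_off_cpu_u

// off-CPU kernel

sh ngx_off_cpu_k.sh pid

// Enter the result directory

cd ngx_off_cpu_k

// Open a temporary port 8088

python -m SimpleHTTPServer 8088

// Open a browser and enter the address

127.0.0.1:8088/pid.svg

[Figure: DEMO from the official website]

9.5 Memory-level flame graphs

Suppose a production program leaks memory, but only in specific scenarios. What do we do then? Is there a good method or tool that quickly finds the problem in the code? Memory-level flame graphs likewise help you quickly analyze the root cause.

Usage:

sh ngx_on_memory.sh pid

// Enter the result directory

cd ngx_on_memory

// Open a temporary port 8088

python -m SimpleHTTPServer 8088

// Open a browser and enter the address

127.0.0.1:8088/pid.svg

9.6 Performance regression: red/blue differential flame graphs

Can you quickly locate a CPU performance regression? If your environment is complex and changes quickly, locating this kind of problem with existing tools is challenging. By the time you have spent weeks finding the root cause, the code has changed several more times and new performance problems have appeared. Differential flame graphs can be applied to every build and every release for comparison, so a serious regression can be fixed immediately.

Capture two ordinary flame graphs, compare them, and color the differences: red indicates an increase and blue a decrease. The differential flame graph uses the current ("after") profile as the baseline, so its shape and size stay the same. You can therefore spot the differences intuitively by color and see why they exist.

Usage:

cd quick_location

// Capture profile 1 before the code change

perf record -F 99 -p pid -g -- sleep 30

perf script > out.stacks1

// Capture profile 2 after the code change

perf record -F 99 -p pid -g -- sleep 30

perf script > out.stacks2

// Generate the differential flame graph:

./FlameGraph/stackcollapse-perf.pl ../out.stacks1 > out.folded1

./FlameGraph/stackcollapse-perf.pl ../out.stacks2 > out.folded2

./FlameGraph/difffolded.pl out.folded1 out.folded2 | ./FlameGraph/flamegraph.pl > diff2.svg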

DEMO:

    // test.c
    void foo3()
    {
    }

    void foo2()
    {
        int i;
        for (i = 0; i < 10; i++)
            foo3();
    }

    void foo1()
    {
        int i;
        for (i = 0; i < 1000; i++)
            foo3();
    }

    int main(void)
    {
        int i;
        for (i = 0; i < 1000000000; i++) {
            foo1();
            foo2();
        }
    }

    // test1.c
    void foo3()
    {
    }

    void foo2()
    {
        int i;
        for (i = 0; i < 10; i++)
            foo3();
    }

    void foo1()
    {
        int i;
        for (i = 0; i < 1000; i++)
            foo3();
    }

    void add()
    {
        int i;
        for (i = 0; i < 10000; i++)
            foo3();
    }

    int main(void)
    {
        int i;
        for (i = 0; i < 100000000; i++) {
            foo1();
            foo2();
            add();
        }
    }
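The comparison step behind a differential flame graph can be sketched as follows. The example assumes two profiles already in folded form (one `stack count` pair per line) and reports the per-stack change, where a positive delta would be colored red and a negative one blue. It illustrates the idea only and is not the actual `difffolded.pl` implementation; the function names and sample counts are made up.

```python
def parse_folded(text):
    """Parse folded stacks ('a;b;c 42' per line) into a dict of counts."""
    counts = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        stack, _, count = line.rpartition(" ")
        counts[stack] = counts.get(stack, 0) + int(count)
    return counts


def diff_folded(before, after):
    """Return (stack, before_count, after_count, delta) rows, sorted by stack."""
    rows = []
    for stack in sorted(set(before) | set(after)):
        b = before.get(stack, 0)
        a = after.get(stack, 0)
        rows.append((stack, b, a, a - b))
    return rows
```

With profiles of test.c and test1.c, the stack that passes through `add` would appear only in the second profile and so show up with a positive delta.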
