Scenario-Based Analysis Ideas and Toolbox for Linux Server Problems on a Cloud Platform

2025-07-01 Update. From: SLTechnology News&Howtos


Preface

Drawing on the author's experience supporting the SUNING cloud platform, this article summarizes the common problem scenarios, the analysis toolbox, and the diagnostic approach for operating and maintaining Linux servers on a cloud platform (physical machines and virtual machines are distinguished in the text below). It covers the following three parts:

1. Common tools, criteria, and analysis ideas for abnormal CPU, I/O, and memory performance on Linux servers.

2. Possible causes of abnormal Linux server downtime, how to locate them, and the usual analysis ideas.

3. Possible causes of packet loss on Linux servers, how to locate them, and the usual analysis ideas.

Audience: intermediate and senior Linux server operations and maintenance staff

Note: following the problem-breakdown figures, this article lists only the parameters and usages of each tool that matter most for the analysis, as examples. For the detailed usage of each tool, read its man page.

Linux server CPU, I/O, and memory performance anomalies

CPU anomalies

Fig. 1 Breakdown of CPU anomalies

top

top -H -d 1 -c

z, x, y: toggle color, highlight the sort column, and highlight running processes

Shift + < / >: move the sort column left or right

pidstat

-d report I/O statistics (disk reads and writes)

-r page faults and memory usage

-u CPU usage

-l display the command line and its arguments

-w context switches

-t display statistics for threads

-T TASK, CHILD, or ALL: report on tasks, on their children, or on both

Show the CPU usage of active processes every second:

pidstat -u 1

Show CPU time aggregated by thread group, which helps find busy threads:

pidstat -t 1 -T ALL

sar

-b block I/O statistics

-B paging statistics

-r memory utilization statistics

-R page reclaim statistics

-d disk (block device) usage statistics

-q run queue length and load average statistics

-S swap space utilization statistics

-m CPU operating frequency

-v file, inode, and dentry activity statistics

-w context switch statistics

-W swap-in and swap-out statistics

-n network statistics; keywords: DEV, EDEV, NFS, NFSD, SOCK, IP, EIP, ICMP, EICMP, TCP, ETCP, UDP, SOCK6, IP6, EIP6, ICMP6, EICMP6 and UDP6

-s 00:00:00 -e 00:21:00 the start and end times of the period to view

iotop

iostat

nmon

Analyzing nmon data with NMONVisualizer

NMONVisualizer is a visual analysis tool for nmon data from IBM.

sysrq

Turn on the switch:

echo 1 > /proc/sys/kernel/sysrq

Print the process stacks:

echo t > /proc/sysrq-trigger

E.g. if a soft lockup has already occurred and the business impact is obvious, stop the business workload and then generate a vmcore with the following command:

echo c > /proc/sysrq-trigger

strace

-c count the number of calls and the time spent in each system call

-f also trace child processes

-e the system calls you are interested in, e.g. -e open,write

E.g.

When a command hangs during execution, find out which syscall it is stuck in:

strace cmd arg

When a running process hangs, get its syscall statistics:

strace -p PID -c

gdb

bt view the execution stack

frame switch the working frame

When the CPU consumption of a user process affects the system as a whole, debuginfo and the source code make it possible to roughly work out where the time is spent. Note that the process is stopped once gdb attaches to it.

gdb -p PID

perf

perf top: online sampling and display

-e the event to sample; the default is cycles. Available events can be listed with perf list.

-G call graph

-F sampling frequency

-d refresh interval

-p a specific process

-C a specific CPU core

perf top -d 1 -G -F 99 -z

Shift + e expands the stack view; Shift + c collapses it.

perf record/report

record writes the samples to a perf.data file

report parses that file

perf record -F 99 -a -g -p PID -C 6 sleep 5

perf report

Memory anomalies

Fig. 2 Breakdown of memory anomalies

General checks

free

cat /proc/self/status

cat /proc/self/smaps

numastat -m

numactl --hardware

cat /proc/meminfo

Three watermarks

sysctl -a | grep extra_free_kbytes

The values of min_free_kbytes and extra_free_kbytes together determine the three watermarks:

Direct reclaim line (MIN): min_free_kbytes

Background reclaim starts (LOW): 5/4 * min_free_kbytes + extra_free_kbytes

Background reclaim stops (HIGH): 3/2 * min_free_kbytes + extra_free_kbytes
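As a rough illustration, the three formulas above can be evaluated in a small shell function. This is only a sketch: the function name watermarks is made up for this example, the arithmetic is system-wide and in KB (the kernel itself tracks watermarks per zone and in pages), and extra_free_kbytes is assumed to be 0 when the kernel does not provide that tunable.

```shell
#!/bin/sh
# Sketch: evaluate the MIN/LOW/HIGH formulas from the text above.
# Arguments: min_free_kbytes [extra_free_kbytes], both in KB.
# Integer arithmetic rounds down.
watermarks() {
    min_free_kbytes=$1
    extra_free_kbytes=${2:-0}   # assumption: treat a missing tunable as 0
    echo "MIN=$min_free_kbytes"
    echo "LOW=$(( min_free_kbytes * 5 / 4 + extra_free_kbytes ))"
    echo "HIGH=$(( min_free_kbytes * 3 / 2 + extra_free_kbytes ))"
}

# Possible usage on a live system:
#   watermarks "$(sysctl -n vm.min_free_kbytes)"
```

Comparing the printed LOW and HIGH values with the free memory reported by free gives a first impression of how close the system is to background or direct reclaim.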

Physical page status

cat /proc/buddyinfo

Node 0, zone DMA 2 21 1 10 10 1 3
Node 0, zone DMA32 730 596 414 339 277 214 159 127 85 68 557
Node 0, zone Normal 447 558 348 166 72 45 1021 888,607 252 2661
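Each column after the zone name in /proc/buddyinfo counts the free blocks of one order, that is, blocks of 1, 2, 4, ... contiguous pages. The following sketch adds them up per zone. The function name buddy_free_kb and the PAGE_KB variable are made up for this example, and a 4 KB page size is an assumption that can be checked with getconf PAGESIZE.

```shell
#!/bin/sh
# Sketch: total free memory per zone from /proc/buddyinfo-style lines.
# Column k after the zone name = number of free blocks of 2^k pages.
# PAGE_KB defaults to 4 (assumed page size in KB).
buddy_free_kb() {
    awk -v page_kb="${PAGE_KB:-4}" '
    $1 == "Node" && $3 == "zone" {
        node = $2
        sub(/,/, "", node)             # "0," -> "0"
        pages = 0
        block = 1                      # pages per block at order 0
        for (i = 5; i <= NF; i++) {
            pages += $i * block
            block *= 2                 # each order doubles the block size
        }
        printf "node=%s zone=%s free_kb=%d\n", node, $4, pages * page_kb
    }'
}

# Possible usage on a live system:
#   buddy_free_kb < /proc/buddyinfo
```

Plenty of free memory concentrated in the low orders, with few blocks in the high orders, suggests fragmentation: large contiguous allocations may fail even though the total looks healthy.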

Kernel structure caches: use slabtop to see how much memory the kernel data structures currently consume.

I/O anomalies

Fig. 3 Breakdown of I/O anomalies

I/O scheduler

cfq, deadline, noop

blktrace & blkparse

When unexpected I/O latency occurs, you need a detailed picture of how the latency is distributed; use blktrace and blkparse for that analysis.

Learn to use the oflag flags of dd properly: sync flushes data out synchronously, and direct bypasses the page cache.

fio

A convenient tool for benchmarking the I/O capability of a system

fio -filename=/dev/mapper/vg_os-testlv -direct=1 -iodepth 1 -thread -rw=randwrite -ioengine=psync -bs=8k -size=100G -numjobs=96 -runtime=60 -group_reporting -name=mytest

du & df

Query and analyze block usage and file system usage

strace shows how the two commands differ in principle: df reads file system information, whereas du calls stat on each file and adds up the results.
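One common reason for du and df to disagree follows directly from that difference: a file that is unlinked while a process still holds it open has no name left for du to stat, yet the file system keeps its blocks until the last descriptor is closed. The sketch below reproduces the situation in a harmless way; the function name and the use of descriptor 9 are arbitrary choices for this example.

```shell
#!/bin/sh
# Sketch: a file deleted while still open. The name disappears from the
# directory tree (du cannot find it), but the inode and its blocks stay
# allocated (df still counts them) until the descriptor is closed.
deleted_but_open_demo() {
    tmp=$(mktemp) || return 1
    exec 9>"$tmp"            # hold the file open on descriptor 9
    rm -f "$tmp"             # unlink the name; the inode survives
    if [ -e "$tmp" ]; then
        echo "name still visible"
    else
        echo "name gone"
    fi
    if echo "more data" >&9; then
        echo "descriptor still usable"
    fi
    exec 9>&-                # closing the descriptor releases the blocks
}
```

On a real system, lsof +L1 lists such open files whose link count has dropped to zero, together with the processes that hold them.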

If the two differ greatly, investigate further:

Are there holes (sparse files)?

Has a file been deleted so that users no longer see it, while the file system has not actually released it (that is, a file deleted while still open; check with lsof +L1)?

Are files in the original directory hidden by a mount point?

If df reports an error, could the file system be damaged?

Network anomaly scenarios

Fig. 4 Network anomaly analysis

ethtool

ethtool -S: pay attention to the drop and error counters
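Driver statistics can run to hundreds of lines, so it helps to keep only the counters that are both related to drops or errors and non-zero. The filter below is a sketch under two assumptions: the input uses the "name: value" layout that ethtool -S prints, and the function name nonzero_drop_err is made up for this example. Lines that do not contain exactly one colon are skipped.

```shell
#!/bin/sh
# Sketch: read "name: value" lines on stdin and print only the counters
# whose name mentions "drop" or "err" and whose value is greater than 0.
nonzero_drop_err() {
    awk -F: '
    NF == 2 {
        name = $1
        value = $2
        gsub(/^[ \t]+|[ \t]+$/, "", name)    # trim padding around the name
        gsub(/^[ \t]+|[ \t]+$/, "", value)   # trim padding around the value
        if (tolower(name) ~ /drop|err/ && value + 0 > 0)
            print name "=" value
    }'
}

# Possible usage (eth0 is a placeholder interface name):
#   ethtool -S eth0 | nonzero_drop_err
```

Running the filter twice a few seconds apart and comparing the values shows whether a counter is still growing or only reflects an old incident.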

tc

Check the statistics, paying attention to dropped packets:

tc -s -d qd

ss, netstat, iftop

Commonly used commands for viewing connections:

netstat -ntp

netstat -ntpl

ss -ie

tcpdump

-i the name of the network interface to capture on

-w the capture file to write; the name may be a time format string

-G rotation interval of the capture file, in seconds

-c exit after capturing this many packets

-s capture only this many bytes of each packet

-r read a capture file for offline analysis

-z call gzip or another tool to compress the rotated files

-Z the user to switch to while running; the default is tcpdump

-B set the capture buffer size in KB (e.g. 10240); if it is too small, packets will be missed

E.g.

tcpdump tcp port 80 and host

tcpdump -s 0 -w %m_%d_%H_%M_%S.pcap -G 5 -z gzip -Z root -c 100000 -i any

Downtime (crash) scenario analysis

Fig. 5 Downtime scenario analysis

dropwatch

crash tool

log view the logs associated with the crash

bt view where the crash occurred

sys view basic system information

crash vmcore vmlinux

vmlinux comes from the kernel debuginfo package; it is a kernel binary image with debugging information. If the system does not generate a vmcore correctly, check the /etc/kdump.conf configuration and the vmcore path set in it. That is as far as this article goes on kernel-mode issues; the fields commonly examined in anomaly analysis are not summarized further.

This article has summarized several common analysis ideas and tool sets for Linux server anomalies on a cloud platform. As noted at the beginning, however, identifying and locating problems quickly and effectively depends on familiarity with the system and careful judgment, and on applying the toolbox flexibly, scenario by scenario. That is how you come to understand the system from the surface inward, from shallow to deep, and solve production problems quickly and efficiently. Have fun :)

About the author

Xie Yinghao is a senior engineer at the Cloud Platform Center of SUNING Technology Group. He has long worked on Linux kernel and operating system support, ensuring the stable and efficient operation of the large fleet of KVM servers in SUNING's production cloud environment.
