What are the common monitoring indicators of Linux?

2025-04-09 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article introduces the monitoring indicators commonly collected on Linux hosts. Operations staff often run into problems without the data needed to diagnose them, so the sections below go through which metrics to collect and where each one comes from.

1. Basic collection items for Linux operation and maintenance

Operations engineers are not afraid of problems. What they fear is failing to capture the scene when a problem occurs and being left in the dark. It is therefore worthwhile to collect as many indicators as possible with a capable monitoring system. But which indicators are meaningful? The most valuable answer comes from practice: the experience engineers have accumulated over a long period.

From long-term operation and maintenance practice, we have summarized the indicators that are consulted most often. They fall into the following categories:

CPU

Load

Memory

Disk

Network related

Kernel parameters

ss statistics output

Port collection

Liveness of core service processes

Resource consumption of critical business processes

NTP offset collection

DNS resolution collection

The detailed indicators for each category are listed below. All of them are directly supported by falcon-agent, the agent component of open-falcon, which collects the metrics at a fixed interval (currently 60 seconds) and reports them to the server.

2. CPU-related collection items

Calculation method: the values are collected from /proc/stat. Refer to the statistical output of the sar command to understand them.

cpu.idle: Percentage of time that the CPU or CPUs were idle and the system did not have an outstanding disk I/O request.

cpu.busy: the opposite of cpu.idle; its value equals 100 minus cpu.idle.

cpu.guest: Percentage of time spent by the CPU or CPUs to run a virtual processor.

cpu.iowait: Percentage of time that the CPU or CPUs were idle during which the system had an outstanding disk I/O request.

cpu.irq: Percentage of time spent by the CPU or CPUs to service hardware interrupts.

cpu.softirq: Percentage of time spent by the CPU or CPUs to service software interrupts.

cpu.nice: Percentage of CPU utilization that occurred while executing at the user level with nice priority.

cpu.steal: Percentage of time spent in involuntary wait by the virtual CPU or CPUs while the hypervisor was servicing another virtual processor.

cpu.system: Percentage of CPU utilization that occurred while executing at the system level (kernel).

cpu.user: Percentage of CPU utilization that occurred while executing at the user level (application).

cpu.cnt: number of CPU cores.

cpu.switches: number of CPU context switches, counter type.
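
The percentages above are derived from cumulative counters, so they need two samples of /proc/stat and the difference between them. The following Python sketch illustrates that idea; it is not falcon-agent's actual code. The function names are hypothetical, and leaving guest time out of the total is an assumption made here because the kernel already counts guest time inside the user column.

```python
# Columns of the aggregate "cpu" line in /proc/stat, in kernel order.
FIELDS = ["user", "nice", "system", "idle", "iowait",
          "irq", "softirq", "steal", "guest"]


def parse_cpu_line(stat_text):
    """Return the aggregate CPU counters from /proc/stat text as a dict."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            values = [int(v) for v in parts[1:1 + len(FIELDS)]]
            # Older kernels expose fewer columns; pad the missing ones with 0.
            values += [0] * (len(FIELDS) - len(values))
            return dict(zip(FIELDS, values))
    raise ValueError("no aggregate cpu line found")


def cpu_percentages(before, after):
    """Turn two counter snapshots into cpu.* percentages for the interval."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    # Assumption: guest is excluded from the total because the kernel
    # already accounts for it inside the user column.
    total = sum(value for name, value in deltas.items() if name != "guest")
    if total <= 0:
        return {}
    metrics = {"cpu." + name: 100.0 * value / total
               for name, value in deltas.items()}
    metrics["cpu.busy"] = 100.0 - metrics["cpu.idle"]
    return metrics
```

In use, one would read /proc/stat, wait for the collection interval, read it again, and pass both parsed snapshots to cpu_percentages.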

3. Disk-related collection items

Calculation method: first read /proc/mounts to get all the mount points, then get the block and inode usage through syscall.Statfs_t. Each metric is appended with a set of tags of the form mount=$mount,fstype=$fstype, where $mount is the mount point, such as /home, and $fstype is the file system type, such as ext4.

df.bytes.free: available disk space, int64

df.bytes.free.percent: available disk space as a percentage of the total, float64, e.g. 32.1

df.bytes.total: total disk size, int64

df.bytes.used: used disk space, int64

df.bytes.used.percent: used disk space as a percentage of the total, float64

df.inodes.total: total number of inodes, int64

df.inodes.free: number of available inodes, int64

df.inodes.free.percent: percentage of available inodes, float64

df.inodes.used: number of used inodes, int64

df.inodes.used.percent: percentage of used inodes, float64
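
As a rough illustration of the method above, the Python sketch below parses /proc/mounts and derives the df.* values from os.statvfs, Python's counterpart to the statfs call. The function names are hypothetical, and the choice of f_bavail for the free figure and f_bfree for the used figure is an assumption; the agent may compute them differently.

```python
import os


def parse_mounts(mounts_text):
    """Return (mount_point, fstype) pairs from the text of /proc/mounts."""
    pairs = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            pairs.append((fields[1], fields[2]))
    return pairs


def _percent(part, whole):
    return 100.0 * part / whole if whole else 0.0


def df_metrics(st):
    """Derive df.* metrics from an os.statvfs() result (or a look-alike)."""
    total = st.f_blocks * st.f_frsize
    free = st.f_bavail * st.f_frsize             # space usable by non-root
    used = (st.f_blocks - st.f_bfree) * st.f_frsize
    inodes_used = st.f_files - st.f_ffree
    return {
        "df.bytes.total": total,
        "df.bytes.free": free,
        "df.bytes.used": used,
        "df.bytes.free.percent": _percent(free, total),
        "df.bytes.used.percent": _percent(used, total),
        "df.inodes.total": st.f_files,
        "df.inodes.free": st.f_ffree,
        "df.inodes.used": inodes_used,
        "df.inodes.free.percent": _percent(st.f_ffree, st.f_files),
        "df.inodes.used.percent": _percent(inodes_used, st.f_files),
    }


def collect_df(mounts_text):
    """Compute df metrics for every mount point listed in mounts_text."""
    results = {}
    for mount, fstype in parse_mounts(mounts_text):
        try:
            results[(mount, fstype)] = df_metrics(os.statvfs(mount))
        except OSError:
            continue  # mount point vanished or is not accessible
    return results
```

The (mount, fstype) key of each result corresponds to the mount=$mount,fstype=$fstype tags.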

4. Megacli tool output

Use the megacli tool to read RAID-related information. Each metric carries a set of tags identifying the PD (physical disk) or VD (logical disk). The PD format is PD=Enclosure_ID:SLOT_ID. For example, PD=32:0 indicates the first disk and VD=0 indicates the first logical disk.

sys.disk.lsiraid.pd.Media_Error_Count: this and the following three metrics are currently used only for data collection. A nonzero value does not necessarily mean the disk is damaged; it only indicates an increased probability of damage.

sys.disk.lsiraid.pd.Other_Error_Count

sys.disk.lsiraid.pd.Predictive_Failure_Count

sys.disk.lsiraid.pd.Drive_Temperature

sys.disk.lsiraid.pd.Firmware_state: if the value is not 0, there is a problem with this physical disk

sys.disk.lsiraid.vd.cache_policy: if the value is not 0, the cache policy of this logical disk does not match the setting

sys.disk.lsiraid.vd.state: if the value is not 0, there is a problem with this logical disk

5. SMART tool output

Use the smartctl tool to read disk SMART information. Currently, all of these metrics are used only for data collection. A nonzero value does not necessarily mean the disk is damaged; it only means the probability of damage is higher. Each metric carries a tag identifying the device, such as device=/dev/sda.

sys.disk.smart.Reallocated_Sector_Ct

sys.disk.smart.Spin_Retry_Count

sys.disk.smart.Reallocated_Event_Count

sys.disk.smart.Current_Pending_Sector

sys.disk.smart.Offline_Uncorrectable

sys.disk.smart.Temperature_Celsius

6. Partition read and write monitoring

Test whether every mounted partition is readable and writable. Each metric carries a tag identifying the mount point, such as mount=/home.

sys.disk.rw: if the value is not 0, there is a read or write problem on this partition.

7. IO-related collection items

Calculation method: collect /proc/diskstats once per second and calculate the difference; all metrics are of counter type. Each metric has a set of tags of the form device=$device, identifying the specific device, such as sda1 or sdb. Refer to the iostat help documentation to understand the specific meaning of each metric.

disk.io.ios_in_progress: Number of actual I/O requests currently in flight.

disk.io.msec_read: Total number of ms spent by all reads.

disk.io.msec_total: Amount of time during which ios_in_progress >= 1.

disk.io.msec_weighted_total: Measure of recent I/O completion time and backlog.

disk.io.msec_write: Total number of ms spent by all writes.

disk.io.read_merged: Adjacent read requests merged in a single req.

disk.io.read_requests: Total number of reads completed successfully.

disk.io.read_sectors: Total number of sectors read successfully.

disk.io.write_merged: Adjacent write requests merged in a single req.

disk.io.write_requests: Total number of writes completed successfully.

disk.io.write_sectors: Total number of sectors written successfully.

disk.io.read_bytes: in bytes

disk.io.write_bytes: in bytes

disk.io.avgrq_sz: this and the following metrics are the values shown by iostat -x 1

disk.io.avgqu-sz

disk.io.await

disk.io.svctm

disk.io.util: a percentage; for example, 56.43 means 56.43%.
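
To make the "sample twice and take the difference" method concrete, here is a small Python sketch that parses /proc/diskstats and derives two of the iostat-style values from a pair of snapshots. The function names are hypothetical. The await and util formulas follow the usual iostat definitions and are assumed here rather than taken from falcon-agent.

```python
# Names for the 11 counters that follow "major minor device" on each line.
NAMES = ["read_requests", "read_merged", "read_sectors", "msec_read",
         "write_requests", "write_merged", "write_sectors", "msec_write",
         "ios_in_progress", "msec_total", "msec_weighted_total"]


def parse_diskstats(text):
    """Return {device: {counter_name: value}} from /proc/diskstats text."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 14:
            continue
        # Newer kernels append extra columns; only the first 11 are used.
        stats[fields[2]] = dict(zip(NAMES, (int(v) for v in fields[3:14])))
    return stats


def io_rates(before, after, interval_ms):
    """Derive await and util for one device from two counter snapshots."""
    def delta(name):
        return after[name] - before[name]

    ios = delta("read_requests") + delta("write_requests")
    busy_ms = delta("msec_read") + delta("msec_write")
    return {
        # Average time per completed request, in milliseconds.
        "disk.io.await": busy_ms / ios if ios else 0.0,
        # Share of the interval during which the device had I/O in flight.
        "disk.io.util": 100.0 * delta("msec_total") / interval_ms,
    }
```

With a one-second sampling interval, interval_ms would be 1000.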

8. Machine load collection items

Calculation method: read /proc/loadavg. All values are of raw-value type:

load.1min

load.5min

load.15min
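
Reading the load averages is the simplest case, since the first three fields of /proc/loadavg are already the final values. A minimal Python sketch (the function name is hypothetical):

```python
def parse_loadavg(text):
    """Map the first three fields of /proc/loadavg to the load.* metrics."""
    one, five, fifteen = (float(v) for v in text.split()[:3])
    return {"load.1min": one, "load.5min": five, "load.15min": fifteen}
```

It can be fed the contents of the file directly, for example parse_loadavg(open("/proc/loadavg").read()).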

9. Memory-related collection items

Calculation method: read the contents of /proc/meminfo, where mem.memfree is free + buffers + cached and mem.memused = mem.memtotal - mem.memfree. Refer to the output and help documentation of the free command to understand the meaning of each metric.

mem.memtotal: total memory size

mem.memused: how much memory is used

mem.memused.percent: percentage of memory used

mem.memfree

mem.memfree.percent

mem.swaptotal: total swap size

mem.swapused: how much swap is used

mem.swapused.percent: percentage of swap used

mem.swapfree

mem.swapfree.percent
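
The memory formulas above can be sketched in Python as follows. The function names are hypothetical, and reporting the values in bytes (by multiplying the kB figures from /proc/meminfo by 1024) is an assumption about the units.

```python
def parse_meminfo(text):
    """Return /proc/meminfo values keyed by field name, in bytes."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if not fields or not fields[0].isdigit():
            continue
        value = int(fields[0])
        # Sizes carry a "kB" unit (1024 bytes); plain counts have none.
        if len(fields) > 1 and fields[1] == "kB":
            value *= 1024
        values[key.strip()] = value
    return values


def _percent(part, whole):
    return 100.0 * part / whole if whole else 0.0


def mem_metrics(info):
    """Apply memfree = free + buffers + cached and memused = total - memfree."""
    total = info["MemTotal"]
    free = info["MemFree"] + info.get("Buffers", 0) + info.get("Cached", 0)
    used = total - free
    swap_total = info.get("SwapTotal", 0)
    swap_free = info.get("SwapFree", 0)
    swap_used = swap_total - swap_free
    return {
        "mem.memtotal": total,
        "mem.memfree": free,
        "mem.memused": used,
        "mem.memfree.percent": _percent(free, total),
        "mem.memused.percent": _percent(used, total),
        "mem.swaptotal": swap_total,
        "mem.swapfree": swap_free,
        "mem.swapused": swap_used,
        "mem.swapfree.percent": _percent(swap_free, swap_total),
        "mem.swapused.percent": _percent(swap_used, swap_total),
    }
```

The percentage helper returns 0 when the total is 0, which covers hosts with no swap configured.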

10. Network-related collection items

Calculation method: read the contents of /proc/net/dev. Each metric is appended with a set of tags of the form iface=$iface, identifying the specific interface, such as eth0. Metrics with in measure inbound traffic, metrics with out measure outbound traffic, and total is in + out. The supported metrics are as follows:

net.if.in.bytes

net.if.in.compressed

net.if.in.dropped

net.if.in.errors

net.if.in.fifo.errs

net.if.in.frame.errs

net.if.in.multicast

net.if.in.packets

net.if.out.bytes

net.if.out.carrier.errs

net.if.out.collisions

net.if.out.compressed

net.if.out.dropped

net.if.out.errors

net.if.out.fifo.errs

net.if.out.packets

net.if.total.bytes

net.if.total.dropped

net.if.total.errors

net.if.total.packets
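
The metric names above map one-to-one onto the columns of /proc/net/dev, which lists eight receive counters followed by eight transmit counters per interface. A Python sketch of that mapping, with a hypothetical function name:

```python
# Column order of /proc/net/dev after the "iface:" prefix.
IN_NAMES = ["bytes", "packets", "errors", "dropped",
            "fifo.errs", "frame.errs", "compressed", "multicast"]
OUT_NAMES = ["bytes", "packets", "errors", "dropped",
             "fifo.errs", "collisions", "carrier.errs", "compressed"]
TOTAL_NAMES = ["bytes", "packets", "errors", "dropped"]


def parse_net_dev(text):
    """Return {iface: {metric: value}} from the text of /proc/net/dev."""
    result = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # the two header lines contain no colon
        iface, _, rest = line.partition(":")
        values = [int(v) for v in rest.split()]
        if len(values) < 16:
            continue
        metrics = {}
        for name, value in zip(IN_NAMES, values[:8]):
            metrics["net.if.in." + name] = value
        for name, value in zip(OUT_NAMES, values[8:16]):
            metrics["net.if.out." + name] = value
        for name in TOTAL_NAMES:
            metrics["net.if.total." + name] = (
                metrics["net.if.in." + name] + metrics["net.if.out." + name])
        result[iface.strip()] = metrics
    return result
```

The outer dictionary key plays the role of the iface=$iface tag.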

11. Port collection items

Calculation method: use ss -ln to determine whether the specified port is in the listen state. The metric is of raw-value type: 1 means the port is listening and 0 means it is not. Each metric carries a tag such as port=$port, where $port is the specific port.

net.port.listen
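
One possible way to turn ss -ln output into the 0 or 1 value is sketched below in Python. The function names are hypothetical. The sketch assumes the column layout in which the state is followed by Recv-Q, Send-Q, and the local address, and it only considers sockets whose state is LISTEN, so UDP sockets (shown as UNCONN) are not covered.

```python
def listening_ports(ss_output):
    """Return the set of ports in LISTEN state found in `ss -ln` output."""
    ports = set()
    for line in ss_output.splitlines():
        fields = line.split()
        if "LISTEN" not in fields:
            continue
        idx = fields.index("LISTEN")
        if len(fields) <= idx + 3:
            continue
        # After the state come Recv-Q and Send-Q, then the local address.
        local = fields[idx + 3]
        _, sep, port = local.rpartition(":")
        # Unix sockets have a path here instead of address:port; skip them.
        if sep and port.isdigit():
            ports.add(int(port))
    return ports


def net_port_listen(ss_output, port):
    """Return 1 if the port is listening, otherwise 0."""
    return 1 if port in listening_ports(ss_output) else 0
```

Locating the local address relative to the LISTEN token lets the same function handle output with or without the leading Netid column.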

12. Machine kernel configuration

kernel.maxfiles: read from /proc/sys/fs/file-max

kernel.files.allocated: the first field of /proc/sys/fs/file-nr

kernel.files.left: value = kernel.maxfiles - kernel.files.allocated

kernel.maxproc: read from /proc/sys/kernel/pid_max
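
These four values come straight from three small files, so the computation fits in a few lines. A Python sketch, with a hypothetical function name, that takes the file contents as strings:

```python
def kernel_metrics(file_max_text, file_nr_text, pid_max_text):
    """Build the kernel.* metrics from the contents of three /proc files.

    file_max_text: /proc/sys/fs/file-max
    file_nr_text:  /proc/sys/fs/file-nr (only the first field is used)
    pid_max_text:  /proc/sys/kernel/pid_max
    """
    maxfiles = int(file_max_text.split()[0])
    allocated = int(file_nr_text.split()[0])
    return {
        "kernel.maxfiles": maxfiles,
        "kernel.files.allocated": allocated,
        "kernel.files.left": maxfiles - allocated,
        "kernel.maxproc": int(pid_max_text.split()[0]),
    }
```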

13. NTP collection items

Use ntpq -pn to get the offset of the local time relative to the NTP server.

sys.ntp.offset: local time offset (in ms). If the value is too high or is 0, something is abnormal and an alarm should be raised.
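
The following Python sketch shows one way to pull the offset out of ntpq -pn output. It assumes the standard peer table in which offset is the ninth column, and it assumes the peer of interest is the one ntpq marks with a leading asterisk (the currently selected peer); whether the agent picks the same row is not stated in the source. The function name is hypothetical.

```python
def ntp_offset(ntpq_output):
    """Return the offset (ms) of the selected peer, or None if there is none."""
    for line in ntpq_output.splitlines():
        # ntpq prefixes the currently selected peer with "*".
        if not line.startswith("*"):
            continue
        fields = line.split()
        # Columns: remote refid st t when poll reach delay offset jitter
        if len(fields) >= 10:
            return float(fields[8])
    return None
```

Returning None when no peer is selected keeps "not synchronized" distinct from a genuine offset value.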

14. Process monitoring

proc.num: the number of matching processes. There are two ways to match. One is by process name, such as name=sshd. The other is by cmdline: Java application processes, for example, may all be named java and cannot be told apart by name, so you can configure a cmdline instead, such as cmdline=./falcon_agent -c ./cfg.ini.
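
The two matching modes can be sketched in Python as below. The function names are hypothetical. Treating the name as an exact match and the cmdline as a substring match is an assumption; the source does not say which comparison the agent uses.

```python
import os


def list_processes(proc_root="/proc"):
    """Return (name, cmdline) for every numeric directory under proc_root."""
    processes = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            # comm holds the process name (the kernel truncates it to 15 chars).
            with open(os.path.join(proc_root, entry, "comm")) as f:
                name = f.read().strip()
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as f:
                # Arguments are NUL-separated; join them with spaces.
                raw = f.read().rstrip(b"\0")
            cmdline = raw.replace(b"\0", b" ").decode(errors="replace")
        except OSError:
            continue  # the process exited while it was being read
        processes.append((name, cmdline))
    return processes


def proc_num(processes, name=None, cmdline=None):
    """Count processes by exact name, or by substring of the command line."""
    count = 0
    for proc_name, proc_cmdline in processes:
        if name is not None:
            if proc_name == name:
                count += 1
        elif cmdline is not None and cmdline in proc_cmdline:
            count += 1
    return count
```

For example, proc_num(list_processes(), name="sshd") counts sshd processes, while passing cmdline="./falcon_agent -c ./cfg.ini" distinguishes one particular program among processes that share a name.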

15. Process resource monitoring

process.cpu.all: the sys + user CPU time used by the process and its child processes, in jiffies

process.cpu.sys: the sys CPU time used by the process and its child processes, in jiffies

process.cpu.user: the user CPU time used by the process and its child processes, in jiffies

process.swap: the swap used by the process and its child processes, in pages

process.fd: the number of file descriptors used by the process

process.mem: memory consumed by the process, in bytes
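
As an illustration of where the CPU and file descriptor figures can come from, the Python sketch below reads them from /proc/<pid>/stat and /proc/<pid>/fd. The function names are hypothetical, and the source does not say which files the agent actually reads. Note that the child counters in the stat file only cover children that have exited and been waited for.

```python
import os


def process_cpu(stat_text):
    """Extract CPU time (in clock ticks) from the text of /proc/<pid>/stat."""
    # The command name sits in parentheses and may contain spaces, so split
    # on the last ")" before tokenizing the remaining fields.
    rest = stat_text.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state); utime, stime, cutime, cstime are 14 to 17.
    utime, stime, cutime, cstime = (int(v) for v in rest[11:15])
    user = utime + cutime
    system = stime + cstime
    return {
        "process.cpu.user": user,
        "process.cpu.sys": system,
        "process.cpu.all": user + system,
    }


def process_fd(pid, proc_root="/proc"):
    """Count the open file descriptors of a process."""
    return len(os.listdir(os.path.join(proc_root, str(pid), "fd")))
```

Because these are cumulative counters, a rate is obtained by sampling twice and dividing the difference by the interval.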

16. ss command output

ss.orphaned

ss.closed

ss.timewait

ss.slabinfo.timewait

ss.synrecv

ss.estab
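
These names match the counters on the TCP line printed by ss -s. The Python sketch below parses that line; it assumes a layout such as "TCP: 12 (estab 3, closed 2, orphaned 0, synrecv 0, timewait 2/0), ports 0", where the number after the slash is taken to be the slabinfo timewait count. Some ss versions print timewait without the slash, in which case ss.slabinfo.timewait is simply omitted. The function name is hypothetical.

```python
import re

# Counter names expected inside the parentheses of the "TCP:" line.
KNOWN = ("estab", "closed", "orphaned", "synrecv", "timewait")


def parse_ss_summary(text):
    """Extract the ss.* metrics from the output of `ss -s`."""
    metrics = {}
    for line in text.splitlines():
        if not line.startswith("TCP:"):
            continue
        # Each counter looks like "name N" or, for timewait, "name N/M".
        for key, value, slab in re.findall(r"(\w+) (\d+)(?:/(\d+))?", line):
            if key not in KNOWN:
                continue
            metrics["ss." + key] = int(value)
            if key == "timewait" and slab:
                metrics["ss.slabinfo.timewait"] = int(slab)
    return metrics
```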
