What are the common monitoring indicators of Linux?

2025-04-21 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/02 Report--

This article shares the monitoring indicators that are commonly used on Linux. Many readers may not be familiar with them, so the list is offered here for reference; hopefully you will find it useful.

1. Basic collection items of Linux operation and maintenance

Operations engineers are not afraid of problems; what they fear is being unable to capture the scene when a problem occurs and being left in the dark. It is therefore worthwhile to rely on a capable monitoring system and collect as many indicators as possible. But which indicators are meaningful? Following the principle that knowledge comes from practice, the experience engineers have accumulated over a long period is the most valuable guide.

From the long-term practice of operations engineers, we have summarized the indicators that are consulted most often during system operation and maintenance. They fall into the following categories:

CPU
Load
Memory
Disk
IO
Network
Kernel parameters
ss statistics output
Port collection
Liveness of core service processes
Resource consumption of key business processes
NTP offset collection
DNS resolution collection

2. CPU-related collection items

Calculation method: the values are derived from /proc/stat. Refer to the statistical output of the sar command to understand each field.
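The calculation from /proc/stat can be sketched as follows. This is a minimal illustration rather than the agent's actual code: the function names `parse_cpu_line` and `cpu_percentages` are hypothetical, and the choice of which columns make up the total is an assumption. The idea is to take two samples of the aggregate `cpu` line and express each counter's increase as a share of the total increase.

```python
CPU_FIELDS = ["user", "nice", "system", "idle", "iowait",
              "irq", "softirq", "steal", "guest", "guest_nice"]


def parse_cpu_line(stat_text):
    """Return the counters of the aggregate "cpu" line of /proc/stat as a dict."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            values = [int(v) for v in fields[1:1 + len(CPU_FIELDS)]]
            # Older kernels expose fewer columns; zip() simply stops early.
            return dict(zip(CPU_FIELDS, values))
    raise ValueError("no aggregate cpu line found")


def cpu_percentages(prev, curr):
    """Convert two counter samples into percentages of the elapsed CPU time."""
    # Assumption: the total is the sum of the first eight columns. Guest time
    # is left out of the total because the kernel also counts it under user.
    in_total = CPU_FIELDS[:8]
    delta = {k: curr.get(k, 0) - prev.get(k, 0) for k in CPU_FIELDS}
    total = sum(delta[k] for k in in_total)
    if total <= 0:
        raise ValueError("samples must be taken at different times")
    metrics = {"cpu." + k: 100.0 * delta[k] / total
               for k in in_total + ["guest"]}
    metrics["cpu.busy"] = 100.0 - metrics["cpu.idle"]
    return metrics
```

On a live system the two samples would come from reading /proc/stat twice, a fixed interval apart.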

cpu.idle: Percentage of time that the CPU or CPUs were idle and the system did not have an outstanding disk I/O request.
cpu.busy: The opposite of cpu.idle; its value equals 100 minus cpu.idle.
cpu.guest: Percentage of time spent by the CPU or CPUs to run a virtual processor.
cpu.iowait: Percentage of time that the CPU or CPUs were idle during which the system had an outstanding disk I/O request.
cpu.irq: Percentage of time spent by the CPU or CPUs to service hardware interrupts.
cpu.softirq: Percentage of time spent by the CPU or CPUs to service software interrupts.
cpu.nice: Percentage of CPU utilization that occurred while executing at the user level with nice priority.
cpu.steal: Percentage of time spent in involuntary wait by the virtual CPU or CPUs while the hypervisor was servicing another virtual processor.
cpu.system: Percentage of CPU utilization that occurred while executing at the system level (kernel).
cpu.user: Percentage of CPU utilization that occurred while executing at the user level (application).
cpu.cnt: Number of CPU cores.
cpu.switches: Number of CPU context switches, counter type.

3. Disk-related collection items

Calculation method: first read /proc/mounts to get all the mount points, then obtain the block and inode usage of each one through syscall.Statfs_t. Each metric carries a set of tags of the form mount=$mount,fstype=$fstype, where $mount is the mount point, such as /home, and $fstype is the file system type, such as ext4.
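The two steps above can be sketched in Python, using `os.statvfs` as a stand-in for the statfs call. This is an illustration under stated assumptions, not the agent's implementation: the helper names are hypothetical, and which free-block counter (`f_bavail` or `f_bfree`) feeds each metric is a guess.

```python
import os


def parse_mounts(mounts_text):
    """Return (mount point, filesystem type) pairs from /proc/mounts content."""
    pairs = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            pairs.append((fields[1], fields[2]))
    return pairs


def percent(part, whole):
    """Percentage of whole, or 0.0 for filesystems that report no capacity."""
    return 100.0 * part / whole if whole else 0.0


def df_metrics(frsize, blocks, bfree, bavail, files, ffree):
    """Build the df.* metrics from statvfs-style numbers."""
    total = blocks * frsize
    free = bavail * frsize            # assumption: space usable by non-root users
    used = (blocks - bfree) * frsize  # assumption: everything not free counts as used
    return {
        "df.bytes.total": total,
        "df.bytes.free": free,
        "df.bytes.used": used,
        "df.bytes.free.percent": percent(free, total),
        "df.bytes.used.percent": percent(used, total),
        "df.inodes.total": files,
        "df.inodes.free": ffree,
        "df.inodes.used": files - ffree,
        "df.inodes.free.percent": percent(ffree, files),
        "df.inodes.used.percent": percent(files - ffree, files),
    }


def collect(mounts_path="/proc/mounts"):
    """Yield (tags, metrics) for every mount point on a live Linux system."""
    with open(mounts_path) as f:
        mounts = parse_mounts(f.read())
    for mount, fstype in mounts:
        st = os.statvfs(mount)
        tags = "mount=%s,fstype=%s" % (mount, fstype)
        yield tags, df_metrics(st.f_frsize, st.f_blocks, st.f_bfree,
                               st.f_bavail, st.f_files, st.f_ffree)
```

Keeping `df_metrics` free of system calls makes the arithmetic easy to check with fixed numbers, while `collect` is the only part that touches the live system.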

df.bytes.free: available disk space, int64
df.bytes.free.percent: available disk space as a percentage of the total, float64, such as 32.1
df.bytes.total: total disk size, int64
df.bytes.used: used disk space, int64
df.bytes.used.percent: used disk space as a percentage of the total, float64
df.inodes.total: total number of inodes, int64
df.inodes.free: number of available inodes, int64
df.inodes.free.percent: percentage of inodes available, float64
df.inodes.used: number of used inodes, int64
df.inodes.used.percent: percentage of inodes used, float64

4. megacli tool output

Use the megacli tool to read RAID-related information. Each metric carries a tag that identifies the PD (physical disk) or VD (logical disk); the PD format is PD=Enclosure_ID:SLOT_ID. For example, PD=32:0 indicates the first physical disk and VD=0 indicates the first logical disk.

sys.disk.lsiraid.pd.Media_Error_Count: this metric and the three that follow are currently collected for reference only; they do not necessarily mean the disk is damaged (they only indicate a higher probability of damage)
sys.disk.lsiraid.pd.Other_Error_Count
sys.disk.lsiraid.pd.Predictive_Failure_Count
sys.disk.lsiraid.pd.Drive_Temperature
sys.disk.lsiraid.pd.Firmware_state: if the value is not 0, there is a problem with this physical disk
sys.disk.lsiraid.vd.cache_policy: if the value is not 0, the cache policy of the logical disk does not match its settings
sys.disk.lsiraid.vd.state: if the value is not 0, there is a problem with this logical disk

5. SMART tool output

Use the smartctl tool to read disk SMART information. Currently, all of these metrics are collected for reference only; they do not necessarily mean that the disk is damaged (only that damage becomes more likely). Each metric carries a tag that identifies the device, such as device=/dev/sda.
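As a rough illustration of how such values could be extracted, the sketch below parses the attribute table printed by `smartctl -A`. It assumes the usual ATA table layout, in which the raw value is the tenth column; the function name `parse_smart_attributes` and the choice to report raw values are assumptions, not something the article specifies.

```python
WANTED = {
    "Reallocated_Sector_Ct",
    "Spin_Retry_Count",
    "Reallocated_Event_Count",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "Temperature_Celsius",
}


def parse_smart_attributes(output):
    """Map sys.disk.smart.<attribute> to the raw value found in the table."""
    metrics = {}
    for line in output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least ten columns:
        # ID, name, flag, value, worst, thresh, type, updated, when_failed, raw.
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WANTED:
            raw = fields[9]
            if raw.isdigit():
                metrics["sys.disk.smart." + fields[1]] = int(raw)
    return metrics
```

The text fed to the parser would be the output of running smartctl against one device, and the resulting metrics would be tagged with that device.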

sys.disk.smart.Reallocated_Sector_Ct
sys.disk.smart.Spin_Retry_Count
sys.disk.smart.Reallocated_Event_Count
sys.disk.smart.Current_Pending_Sector
sys.disk.smart.Offline_Uncorrectable
sys.disk.smart.Temperature_Celsius

6. Partition read/write monitoring

Test whether all mounted partitions are readable and writable. Each metric will have a set of tag descriptions indicating the mount point, such as mount=/home.
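One plausible way to run such a test is to write a small probe file under the mount point and read it back. The sketch below does exactly that; how the real collector performs the check is not described in the article, so the function name `check_rw`, the probe file prefix, and the payload are all hypothetical.

```python
import os
import tempfile


def check_rw(mount):
    """Return 0 if a probe file can be written and read back under mount, else 1."""
    payload = b"rw-probe"
    try:
        fd, path = tempfile.mkstemp(prefix=".rw_probe_", dir=mount)
        try:
            with os.fdopen(fd, "wb") as f:
                f.write(payload)
                f.flush()
                os.fsync(f.fileno())  # force the data to the device
            with open(path, "rb") as f:
                ok = f.read() == payload
        finally:
            os.remove(path)  # never leave the probe file behind
    except OSError:
        return 1
    return 0 if ok else 1
```

The return convention follows the metric: 0 means the partition is healthy, anything else means reading or writing failed.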

sys.disk.rw: if the value is not 0, there is a problem reading or writing this partition.

7. IO-related collection items

Calculation method: sample /proc/diskstats once per second and compute the difference between samples; all metrics are of counter type. Each metric carries a tag of the form device=$device that identifies the specific device, such as sda1 or sdb. Refer to the iostat documentation for the precise meaning of each metric.
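The sample-and-difference step can be sketched as below. The column order follows the first eleven statistics fields of /proc/diskstats; the function names are hypothetical, and the formulas for the derived values (await, util, and bytes computed as sectors times 512) follow the conventional iostat definitions rather than anything the article spells out.

```python
FIELDS = ["read_requests", "read_merged", "read_sectors", "msec_read",
          "write_requests", "write_merged", "write_sectors", "msec_write",
          "ios_in_progress", "msec_total", "msec_weighted_total"]

SECTOR_BYTES = 512  # /proc/diskstats counts sectors in 512-byte units


def parse_diskstats(text):
    """Return {device: {field: value}} from /proc/diskstats content."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 3 + len(FIELDS):
            continue
        values = [int(v) for v in parts[3:3 + len(FIELDS)]]
        stats[parts[2]] = dict(zip(FIELDS, values))
    return stats


def io_deltas(prev, curr, interval_ms=1000):
    """Derive per-interval values for one device from two samples."""
    d = {k: curr[k] - prev[k] for k in FIELDS if k != "ios_in_progress"}
    ios = d["read_requests"] + d["write_requests"]
    return {
        "disk.io.read_bytes": d["read_sectors"] * SECTOR_BYTES,
        "disk.io.write_bytes": d["write_sectors"] * SECTOR_BYTES,
        # Average time per completed request, in ms.
        "disk.io.await": (d["msec_read"] + d["msec_write"]) / ios if ios else 0.0,
        # Share of the interval during which the device had I/O in flight.
        "disk.io.util": 100.0 * d["msec_total"] / interval_ms,
    }
```

A collector would call `parse_diskstats` once per second and pass consecutive samples of the same device to `io_deltas`.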

disk.io.ios_in_progress: Number of actual I/O requests currently in flight.
disk.io.msec_read: Total number of ms spent by all reads.
disk.io.msec_total: Amount of time during which ios_in_progress >= 1.
disk.io.msec_weighted_total: Measure of recent I/O completion time and backlog.
disk.io.msec_write: Total number of ms spent by all writes.
disk.io.read_merged: Adjacent read requests merged in a single req.
disk.io.read_requests: Total number of reads completed successfully.
disk.io.read_sectors: Total number of sectors read successfully.
disk.io.write_merged: Adjacent write requests merged in a single req.
disk.io.write_requests: Total number of writes completed successfully.
disk.io.write_sectors: Total number of sectors written successfully.
disk.io.read_bytes: a number, in bytes
disk.io.write_bytes: a number, in bytes
disk.io.avgrq_sz: this and the following values are the ones shown by iostat -x 1
disk.io.avgqu-sz
disk.io.await
disk.io.svctm
disk.io.util: a percentage; for example, 56.43 means 56.43%

8. Load-related collection items

Calculation method: read /proc/loadavg; all values are of the raw value type:

load.1min
load.5min
load.15min

9. Memory-related collection items

Calculation method: read the contents of /proc/meminfo, where mem.memfree is free + buffers + cached and mem.memused = mem.memtotal - mem.memfree. Refer to the output and documentation of the free command to understand the meaning of each metric.
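The formulas above translate directly into code. The following sketch keeps the values in the kB unit that /proc/meminfo reports; the function names are hypothetical, and whether the real collector converts to bytes is not stated in the article.

```python
def parse_meminfo(text):
    """Return {key: integer value} from /proc/meminfo content (values in kB)."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            info[key.strip()] = int(parts[0])
    return info


def percent(part, whole):
    """Percentage of whole, or 0.0 when whole is zero (for example, no swap)."""
    return 100.0 * part / whole if whole else 0.0


def mem_metrics(info):
    """Apply memfree = free + buffers + cached and memused = total - memfree."""
    total = info["MemTotal"]
    free = info["MemFree"] + info["Buffers"] + info["Cached"]
    used = total - free
    swap_total = info["SwapTotal"]
    swap_free = info["SwapFree"]
    swap_used = swap_total - swap_free
    return {
        "mem.memtotal": total,
        "mem.memfree": free,
        "mem.memused": used,
        "mem.memfree.percent": percent(free, total),
        "mem.memused.percent": percent(used, total),
        "mem.swaptotal": swap_total,
        "mem.swapfree": swap_free,
        "mem.swapused": swap_used,
        "mem.swapfree.percent": percent(swap_free, swap_total),
        "mem.swapused.percent": percent(swap_used, swap_total),
    }
```

The `percent` helper guards against division by zero on machines that have no swap configured.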

mem.memtotal: total memory size
mem.memused: amount of memory used
mem.memused.percent: percentage of memory used
mem.memfree
mem.memfree.percent
mem.swaptotal: total swap size
mem.swapused: amount of swap used
mem.swapused.percent: percentage of swap used
mem.swapfree
mem.swapfree.percent

10. Network-related collection items

Calculation method: read the contents of /proc/net/dev. Each metric carries a tag such as iface=$iface that identifies the specific interface, such as eth0. Metrics containing in describe inbound traffic, metrics containing out describe outbound traffic, and total is the sum in + out. The supported metrics are as follows:

net.if.in.bytes
net.if.in.compressed
net.if.in.dropped
net.if.in.errors
net.if.in.fifo.errs
net.if.in.frame.errs
net.if.in.multicast
net.if.in.packets
net.if.out.bytes // amount of data the network card transmits outward per second
net.if.out.carrier.errs
net.if.out.collisions
net.if.out.compressed
net.if.out.dropped
net.if.out.errors
net.if.out.fifo.errs
net.if.out.packets
net.if.total.bytes // amount of data the network card sends and receives per second
net.if.total.dropped
net.if.total.errors
net.if.total.packets

11. Port collection items

Calculation method: use ss -ln to determine whether the specified port is in the listen state. The metric is of the raw value type: 1 means the port is being listened on and 0 means it is not. Each metric carries a tag such as port=$port, where $port is the specific port.
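A minimal sketch of the check, assuming the usual `ss -ln` column layout in which the local address is the third column after the state: the functions `listening_ports` and `port_listen` are hypothetical names, and the parser only considers rows whose state is LISTEN.

```python
def listening_ports(ss_output):
    """Return the set of numeric ports that appear in LISTEN rows of ss -ln."""
    ports = set()
    for line in ss_output.splitlines():
        fields = line.split()
        if "LISTEN" not in fields:
            continue
        idx = fields.index("LISTEN")
        # State is followed by Recv-Q, Send-Q, then the local address:port.
        if len(fields) <= idx + 3:
            continue
        _, sep, port = fields[idx + 3].rpartition(":")
        if sep and port.isdigit():  # skips unix socket paths
            ports.add(int(port))
    return ports


def port_listen(ss_output, port):
    """Return 1 if the port is listening, 0 otherwise (the net.port.listen value)."""
    return 1 if port in listening_ports(ss_output) else 0
```

Locating the state token first, instead of relying on a fixed column index, keeps the parser working whether or not the Netid column is present.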

net.port.listen

12. Machine kernel configuration

kernel.maxfiles: read /proc/sys/fs/file-max
kernel.files.allocated: the first field of /proc/sys/fs/file-nr
kernel.files.left: kernel.maxfiles - kernel.files.allocated
kernel.maxproc: read /proc/sys/kernel/pid_max

13. NTP collection items

Use ntpq -pn to get the offset of the local time relative to the NTP server.
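The sketch below shows one way to pull the offset out of the peer table printed by ntpq -pn. It assumes the standard column order (remote, refid, st, t, when, poll, reach, delay, offset, jitter) and takes the peer marked with `*`, which ntpq uses for the currently selected time source; which peer the real collector reads is not stated, and the function name `parse_ntp_offset` is hypothetical.

```python
def parse_ntp_offset(ntpq_output):
    """Return the offset (ms) of the selected peer, or None if no peer is selected."""
    for line in ntpq_output.splitlines():
        if not line.startswith("*"):
            continue
        fields = line.split()
        if len(fields) >= 10:
            try:
                return float(fields[8])  # ninth column is the offset
            except ValueError:
                return None
    return None
```

Returning None when no peer is selected lets the caller tell "not synchronized" apart from a genuine numeric offset.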

sys.ntp.offset: offset of the local time (in ms). A value that is too large, or a value of 0, indicates an abnormality and should trigger an alert.

14. Process monitoring

proc.num: the number of matching processes. Processes can be matched in two ways. One is by process name, such as name=sshd. The other is by cmdline: the process name of every Java application may simply be java, so the name alone cannot tell them apart; in that case configure cmdline, such as cmdline=./falcon_agent -c ./cfg.ini

15. Process resource monitoring

process.cpu.all: sys + user CPU used by the process and its child processes, in jiffies
process.cpu.sys: sys CPU used by the process and its child processes, in jiffies
process.cpu.user: user CPU used by the process and its child processes, in jiffies
process.swap: swap used by the process and its child processes, in pages
process.fd: number of file descriptors used by the process
process.mem: memory occupied by the process, in bytes

16. ss command output

ss.orphaned
ss.closed
ss.timewait
ss.slabinfo.timewait
ss.synrecv
ss.estab

That is all for "What are the common monitoring indicators of Linux?". Thank you for reading; hopefully the content is helpful.
