2025-04-09 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/01 Report--
This article covers the question "what are the common monitoring indicators of Linux". It lists the metrics that operations engineers most often rely on when running Linux systems and explains how each one is collected.
1. Basic collection items for Linux operations
Operations engineers are not afraid of problems; they are afraid of being unable to capture the state of the system when a problem occurs and being left in the dark. It is therefore important to collect as many indicators as possible with a capable monitoring system. But which indicators are meaningful? Following the principle that knowledge comes from practice, the experience engineers have accumulated over a long period is the most valuable guide.
From long-term operations practice, we have summarized the indicators that are referred to most often when operating a system. They fall into the following categories:
CPU
Load
Memory
Disk
IO
Network related
Kernel parameters
ss statistics output
Port collection
Process liveness collection for core services
Resource consumption of critical business processes
NTP offset collection
DNS resolution collection
The detailed indicators for each category are listed below. All of them are supported directly by the agent component of open-falcon: falcon-agent collects the metrics at regular intervals (currently every 60 seconds) and reports them to the server.
2. CPU related collection items
Calculation method: the values are derived from /proc/stat. The statistical output of the sar command is a useful reference for understanding them.
cpu.idle: Percentage of time that the CPU or CPUs were idle and the system did not have an outstanding disk I/O request.
cpu.busy: the opposite of cpu.idle; its value equals 100 minus cpu.idle.
cpu.guest: Percentage of time spent by the CPU or CPUs to run a virtual processor.
cpu.iowait: Percentage of time that the CPU or CPUs were idle during which the system had an outstanding disk I/O request.
cpu.irq: Percentage of time spent by the CPU or CPUs to service hardware interrupts.
cpu.softirq: Percentage of time spent by the CPU or CPUs to service software interrupts.
cpu.nice: Percentage of CPU utilization that occurred while executing at the user level with nice priority.
cpu.steal: Percentage of time spent in involuntary wait by the virtual CPU or CPUs while the hypervisor was servicing another virtual processor.
cpu.system: Percentage of CPU utilization that occurred while executing at the system level (kernel).
cpu.user: Percentage of CPU utilization that occurred while executing at the user level (application).
cpu.cnt: number of CPU cores.
cpu.switches: number of CPU context switches, counter type.
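The percentages above can be derived from two snapshots of the aggregate cpu line in /proc/stat. The following is a minimal Python sketch of that idea, not falcon-agent's actual code; the function names are hypothetical, and leaving guest time out of the total (because the kernel already counts it inside user time) is an assumption of this sketch.

```python
# Field order of the aggregate "cpu" line in /proc/stat (see proc(5)).
FIELDS = ("user", "nice", "system", "idle", "iowait",
          "irq", "softirq", "steal", "guest")


def parse_cpu_line(stat_text):
    """Return the jiffy counters from the aggregate 'cpu' line as a dict."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            values = [int(v) for v in parts[1:1 + len(FIELDS)]]
            # Older kernels expose fewer columns; pad the rest with zeros.
            values += [0] * (len(FIELDS) - len(values))
            return dict(zip(FIELDS, values))
    raise ValueError("no aggregate cpu line found")


def cpu_percentages(prev, curr):
    """Convert two counter snapshots into percentages over the interval."""
    # Assumption: guest time is already part of user time, so it is
    # excluded from the total to avoid counting it twice.
    keys = [k for k in FIELDS if k != "guest"]
    total = sum(curr[k] - prev[k] for k in keys)
    if total <= 0:
        return {}
    result = {"cpu." + k: 100.0 * (curr[k] - prev[k]) / total for k in FIELDS}
    result["cpu.busy"] = 100.0 - result["cpu.idle"]
    return result
```

In practice the two snapshots would come from reading /proc/stat twice, one collection interval apart.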
3. Disk-related collection items
Calculation method: first read /proc/mounts to get all mount points, then get the block and inode usage through syscall.Statfs_t. Each metric carries a set of tags of the form mount=$mount,fstype=$fstype, where $mount is the mount point (such as /home) and $fstype is the file system type (such as ext4).
df.bytes.free: available disk space, int64
df.bytes.free.percent: available disk space as a percentage of the total, float64, e.g. 32.1
df.bytes.total: total disk size, int64
df.bytes.used: used disk space, int64
df.bytes.used.percent: used disk space as a percentage of the total, float64
df.inodes.total: total number of inodes, int64
df.inodes.free: number of available inodes, int64
df.inodes.free.percent: percentage of available inodes, float64
df.inodes.used: number of used inodes, int64
df.inodes.used.percent: percentage of used inodes, float64
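As a rough illustration of the statfs-based calculation, the Python sketch below uses os.statvfs, the standard-library counterpart of the statfs call. The helper names usage and df_metrics are hypothetical, and treating f_bavail (space available to unprivileged users) as "free" is an assumption; the agent may account for reserved blocks differently.

```python
import os


def usage(total, free):
    """Derive used and percentage values from a total and a free count."""
    used = total - free

    def pct(part):
        return 100.0 * part / total if total else 0.0

    return {"total": total, "free": free, "used": used,
            "free.percent": pct(free), "used.percent": pct(used)}


def df_metrics(mount):
    """Collect df.bytes.* and df.inodes.* style values for one mount point."""
    st = os.statvfs(mount)
    # Assumption: "free" means space available to unprivileged users.
    blocks = usage(st.f_blocks * st.f_frsize, st.f_bavail * st.f_frsize)
    inodes = usage(st.f_files, st.f_ffree)
    metrics = {"df.bytes." + k: v for k, v in blocks.items()}
    metrics.update({"df.inodes." + k: v for k, v in inodes.items()})
    return metrics
```

Calling df_metrics("/home") on a Linux host would return a dictionary keyed by the metric names listed above.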
4. Megacli tool output
The megacli tool is used to read RAID-related information. Each metric carries a set of tags that identify the physical disk (PD) or virtual disk (VD). The PD format is PD=Enclosure_ID:SLOT_ID. For example, PD=32:0 indicates the first disk and VD=0 indicates the first logical disk.
sys.disk.lsiraid.pd.Media_Error_Count: this and the following three metrics are currently collected for reference only. They do not necessarily indicate disk damage, only an increased probability of it.
sys.disk.lsiraid.pd.Other_Error_Count
sys.disk.lsiraid.pd.Predictive_Failure_Count
sys.disk.lsiraid.pd.Drive_Temperature
sys.disk.lsiraid.pd.Firmware_state: if the value is not 0, there is a problem with this physical disk
sys.disk.lsiraid.vd.cache_policy: if the value is not 0, the cache policy of this logical disk does not match the configured setting
sys.disk.lsiraid.vd.state: if the value is not 0, there is a problem with this logical disk
5. SMART tool output
The smartctl tool is used to read disk SMART information. Currently, all of these metrics are collected for reference only. They do not necessarily mean that the disk is damaged, only that the probability of damage has increased. Each metric carries a tag that identifies the disk device, such as device=/dev/sda.
sys.disk.smart.Reallocated_Sector_Ct
sys.disk.smart.Spin_Retry_Count
sys.disk.smart.Reallocated_Event_Count
sys.disk.smart.Current_Pending_Sector
sys.disk.smart.Offline_Uncorrectable
sys.disk.smart.Temperature_Celsius
6. Partition read and write monitoring
Tests whether every mounted partition is readable and writable. Each metric carries a tag that identifies the mount point, such as mount=/home.
sys.disk.rw: if the value is not 0, reading or writing on this partition is failing.
7. IO related collection items
Calculation method: /proc/diskstats is sampled once per second and the difference between samples is calculated; all values are of counter type. Each metric carries a tag of the form device=$device that identifies the specific device, such as sda1 or sdb. Refer to the iostat help documentation for the precise meaning of each metric.
disk.io.ios_in_progress: Number of actual I/O requests currently in flight.
disk.io.msec_read: Total number of ms spent by all reads.
disk.io.msec_total: Amount of time during which ios_in_progress >= 1.
disk.io.msec_weighted_total: Measure of recent I/O completion time and backlog.
disk.io.msec_write: Total number of ms spent by all writes.
disk.io.read_merged: Adjacent read requests merged in a single req.
disk.io.read_requests: Total number of reads completed successfully.
disk.io.read_sectors: Total number of sectors read successfully.
disk.io.write_merged: Adjacent write requests merged in a single req.
disk.io.write_requests: Total number of writes completed successfully.
disk.io.write_sectors: Total number of sectors written successfully.
disk.io.read_bytes: in bytes
disk.io.write_bytes: in bytes
disk.io.avgrq_sz: this and the following metrics match the values shown by iostat -x 1
disk.io.avgqu-sz
disk.io.await
disk.io.svctm
disk.io.util: a percentage; for example, 56.43 means 56.43%.
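The difference-based calculation can be sketched in Python as follows. This is an illustrative outline under stated assumptions, not the agent's implementation: the function names are hypothetical, the field order follows the kernel's documented /proc/diskstats layout, and the formulas for util and await are the usual iostat-style definitions.

```python
# Counter columns that follow "major minor device" in /proc/diskstats.
NAMES = ("read_requests", "read_merged", "read_sectors", "msec_read",
         "write_requests", "write_merged", "write_sectors", "msec_write",
         "ios_in_progress", "msec_total", "msec_weighted_total")


def parse_diskstats(text):
    """Map each device name to a dict of its counters."""
    devices = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 3 + len(NAMES):
            continue
        values = (int(v) for v in parts[3:3 + len(NAMES)])
        devices[parts[2]] = dict(zip(NAMES, values))
    return devices


def io_rates(prev, curr, interval_ms):
    """Derive iostat-style values from two samples of one device."""
    d = {k: curr[k] - prev[k] for k in NAMES}
    ios = d["read_requests"] + d["write_requests"]
    return {
        # Share of the interval during which the device had I/O in flight.
        "disk.io.util": 100.0 * d["msec_total"] / interval_ms,
        # Average time per completed request, in ms.
        "disk.io.await": (d["msec_read"] + d["msec_write"]) / ios if ios else 0.0,
        # Sectors in /proc/diskstats are always 512 bytes.
        "disk.io.read_bytes": d["read_sectors"] * 512,
        "disk.io.write_bytes": d["write_sectors"] * 512,
    }
```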
8. Machine load related acquisition items
Calculation method: read /proc/loadavg. All values are of raw value type:
load.1min
load.5min
load.15min
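Reading the three load averages takes only a few lines. The Python sketch below shows one way to do it; the function name parse_loadavg is hypothetical.

```python
def parse_loadavg(text):
    """Extract the 1, 5 and 15 minute load averages from /proc/loadavg text."""
    one, five, fifteen = text.split()[:3]
    return {
        "load.1min": float(one),
        "load.5min": float(five),
        "load.15min": float(fifteen),
    }
```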
9. Memory-related acquisition item
Calculation method: read the contents of /proc/meminfo, where mem.memfree = free + buffers + cached and mem.memused = mem.memtotal - mem.memfree. Refer to the output and help documentation of the free command to understand the meaning of each metric.
mem.memtotal: total memory size
mem.memused: amount of memory used
mem.memused.percent: percentage of memory used
mem.memfree
mem.memfree.percent
mem.swaptotal: total swap size
mem.swapused: amount of swap used
mem.swapused.percent: percentage of swap used
mem.swapfree
mem.swapfree.percent
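The two formulas above can be illustrated with a short Python sketch. The function names are hypothetical, and converting the kB values of /proc/meminfo to bytes is an assumption of this example; the source does not state which unit the agent reports.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo text into a dict of integer values (bytes where a kB unit is given)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if not sep or not fields:
            continue
        value = int(fields[0])
        if len(fields) > 1 and fields[1] == "kB":
            value *= 1024  # assumption: report bytes rather than kB
        info[key.strip()] = value
    return info


def mem_metrics(info):
    """Apply mem.memfree = free + buffers + cached and derive the rest."""
    total = info["MemTotal"]
    free = info["MemFree"] + info["Buffers"] + info["Cached"]
    used = total - free
    swap_total = info.get("SwapTotal", 0)
    swap_free = info.get("SwapFree", 0)
    swap_used = swap_total - swap_free

    def pct(part, whole):
        return 100.0 * part / whole if whole else 0.0

    return {
        "mem.memtotal": total, "mem.memused": used, "mem.memfree": free,
        "mem.memused.percent": pct(used, total),
        "mem.memfree.percent": pct(free, total),
        "mem.swaptotal": swap_total, "mem.swapused": swap_used,
        "mem.swapfree": swap_free,
        "mem.swapused.percent": pct(swap_used, swap_total),
        "mem.swapfree.percent": pct(swap_free, swap_total),
    }
```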
10. Network-related collection items
Calculation method: read the contents of /proc/net/dev. Each metric carries a tag such as iface=$iface that identifies the specific interface, such as eth0. Metrics containing "in" measure inbound traffic, those containing "out" measure outbound traffic, and "total" is the sum of in and out. The supported metrics are as follows:
net.if.in.bytes
net.if.in.compressed
net.if.in.dropped
net.if.in.errors
net.if.in.fifo.errs
net.if.in.frame.errs
net.if.in.multicast
net.if.in.packets
net.if.out.bytes
net.if.out.carrier.errs
net.if.out.collisions
net.if.out.compressed
net.if.out.dropped
net.if.out.errors
net.if.out.fifo.errs
net.if.out.packets
net.if.total.bytes
net.if.total.dropped
net.if.total.errors
net.if.total.packets
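The mapping from /proc/net/dev columns to these metric names can be sketched in Python as below. The function name parse_net_dev is hypothetical, and the column order is taken from the header that the kernel prints in the file itself (eight receive columns followed by eight transmit columns).

```python
# Column order of the receive and transmit halves of /proc/net/dev.
RX = ("bytes", "packets", "errors", "dropped",
      "fifo.errs", "frame.errs", "compressed", "multicast")
TX = ("bytes", "packets", "errors", "dropped",
      "fifo.errs", "collisions", "carrier.errs", "compressed")


def parse_net_dev(text):
    """Return {iface: {metric: value}} from /proc/net/dev text."""
    result = {}
    for line in text.splitlines()[2:]:  # skip the two header lines
        iface, sep, data = line.partition(":")
        values = [int(v) for v in data.split()]
        if not sep or len(values) < 16:
            continue
        metrics = {}
        for name, value in zip(RX, values[:8]):
            metrics["net.if.in." + name] = value
        for name, value in zip(TX, values[8:16]):
            metrics["net.if.out." + name] = value
        # total = in + out for the four metrics that have a total variant.
        for name in ("bytes", "packets", "errors", "dropped"):
            metrics["net.if.total." + name] = (
                metrics["net.if.in." + name] + metrics["net.if.out." + name])
        result[iface.strip()] = metrics
    return result
```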
11. Port acquisition item
Calculation method: use ss -ln to determine whether the specified port is in the listen state. The value is of raw value type: 1 means the port is listening and 0 means it is not. Each metric carries a tag such as port=$port, where $port is the specific port.
net.port.listen
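One way to turn ss -ln output into the 1 or 0 value is sketched below in Python. The function names are hypothetical, and the sketch assumes the common column layout in which the local address follows the LISTEN state and the two queue columns; other ss versions or options may print a different layout.

```python
def listening_ports(ss_output):
    """Collect the port numbers of sockets shown in LISTEN state."""
    ports = set()
    for line in ss_output.splitlines():
        tokens = line.split()
        if "LISTEN" not in tokens:
            continue
        idx = tokens.index("LISTEN")
        # Assumption: Recv-Q and Send-Q follow the state, then the local address.
        if len(tokens) <= idx + 3:
            continue
        _, sep, port = tokens[idx + 3].rpartition(":")
        if sep and port.isdigit():
            ports.add(int(port))
    return ports


def port_listen(ss_output, port):
    """Return 1 if the port is listening, otherwise 0."""
    return 1 if port in listening_ports(ss_output) else 0
```

The text passed in could be obtained with subprocess.run(["ss", "-ln"], capture_output=True, text=True).stdout.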
12. Machine kernel configuration
kernel.maxfiles: read from /proc/sys/fs/file-max
kernel.files.allocated: the first field of /proc/sys/fs/file-nr
kernel.files.left: value = kernel.maxfiles - kernel.files.allocated
kernel.maxproc: read from /proc/sys/kernel/pid_max
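The four kernel values reduce to three file reads and one subtraction. The Python sketch below works on the file contents as strings; the function name kernel_metrics is hypothetical.

```python
def kernel_metrics(file_max_text, file_nr_text, pid_max_text):
    """Build the kernel.* values from the contents of the three proc files."""
    maxfiles = int(file_max_text.strip())
    # file-nr holds three numbers; the first is the count of allocated handles.
    allocated = int(file_nr_text.split()[0])
    return {
        "kernel.maxfiles": maxfiles,
        "kernel.files.allocated": allocated,
        "kernel.files.left": maxfiles - allocated,
        "kernel.maxproc": int(pid_max_text.strip()),
    }
```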
13. NTP collection items
Use ntpq -pn to get the offset of the local time relative to the NTP server.
sys.ntp.offset: local time offset (in ms). If the value is too high or is 0, it indicates an anomaly and an alarm should be raised.
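Extracting the offset from ntpq -pn output can be sketched as follows in Python. The function name ntp_offset is hypothetical, and picking the peer marked with a leading asterisk (the currently selected system peer) is an assumption; the source does not say which peer the agent uses.

```python
def ntp_offset(ntpq_output):
    """Return the offset in ms of the selected peer, or None if there is none."""
    for line in ntpq_output.splitlines():
        if not line.startswith("*"):
            continue  # assumption: only the selected system peer is used
        tokens = line.split()
        # Columns: remote refid st t when poll reach delay offset jitter
        if len(tokens) >= 10:
            return float(tokens[8])
    return None
```

A result of None means that no peer is currently selected, which a caller may want to treat as an anomaly as well.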
14. Process monitoring
proc.num: the number of matching processes. There are two ways to match. The first is by process name, such as name=sshd. The second is by cmdline: for example, Java application processes may all be named java, so they cannot be told apart by name. In that case you can configure a cmdline, such as cmdline=./falcon_agent -c ./cfg.ini.
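The two matching modes can be illustrated with the Python sketch below. The function names are hypothetical, and matching cmdline as a substring is an assumption of this example; the source does not specify the exact matching rule.

```python
import os


def list_processes(proc_root="/proc"):
    """Return (name, cmdline) pairs for every process visible under /proc."""
    procs = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                name = f.read().strip()
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as f:
                # Arguments are NUL-separated; join them with spaces.
                raw = f.read().replace(b"\0", b" ")
                cmdline = raw.decode(errors="replace").strip()
        except OSError:
            continue  # the process exited while we were scanning
        procs.append((name, cmdline))
    return procs


def proc_num(procs, name=None, cmdline=None):
    """Count processes whose name matches exactly or whose cmdline contains the text."""
    count = 0
    for pname, pcmd in procs:
        if name is not None and pname != name:
            continue
        if cmdline is not None and cmdline not in pcmd:  # assumption: substring match
            continue
        count += 1
    return count
```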
15. Process resource monitoring
process.cpu.all: sys + user CPU time used by the process and its child processes, in jiffies
process.cpu.sys: sys CPU time used by the process and its child processes, in jiffies
process.cpu.user: user CPU time used by the process and its child processes, in jiffies
process.swap: swap used by the process and its child processes, in pages
process.fd: number of file descriptors used by the process
process.mem: memory consumed by the process, in bytes
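The CPU values could, for instance, be read from /proc/<pid>/stat. The Python sketch below is one hedged way to do this, not the agent's confirmed method: the function name process_cpu is hypothetical, and using the cutime and cstime fields (time of waited-for children) for the "child processes" part is an assumption.

```python
def process_cpu(stat_text):
    """Sum user and system jiffies of a process and its waited-for children."""
    # The command name (field 2) may contain spaces or parentheses,
    # so split only after the last closing parenthesis.
    rest = stat_text.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so field N of proc(5) is rest[N - 3]:
    # utime=14, stime=15, cutime=16, cstime=17.
    utime, stime, cutime, cstime = (int(v) for v in rest[11:15])
    user = utime + cutime
    system = stime + cstime
    return {
        "process.cpu.user": user,
        "process.cpu.sys": system,
        "process.cpu.all": user + system,
    }
```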
16. ss command output
ss.orphaned
ss.closed
ss.timewait
ss.slabinfo.timewait
ss.synrecv
ss.estab