
How to obtain Node performance monitoring metrics

2025-01-17 Update From: SLTechnology News&Howtos


Shulou (Shulou.com) 05/31 Report --

This article explains how to obtain Node performance monitoring metrics. Many people have questions about this in daily operations, so I have collected and organized the material into simple, practical methods. I hope it helps answer your doubts; please read on.

The performance bottlenecks of servers are usually the following:

CPU utilization

CPU load (load average)

Memory

Disk

I/O

Throughput

Queries per second (QPS)

Log monitoring / real QPS

Response time

Process monitoring

Get CPU metrics

CPU usage and CPU load both reflect, to some extent, how busy a machine is.

CPU utilization

CPU utilization is the share of CPU resources consumed by running programs; it indicates how busy the machine is at a given point in time. The higher the utilization, the more work the machine is doing at that moment, and vice versa. For the same workload, utilization also depends directly on how powerful the CPU is. Let's first look at the relevant API and some terminology to help us understand the code that obtains CPU utilization.

os.cpus()

Returns an array of objects containing information about each logical CPU core.

model: a string specifying the model of the CPU core.

speed: a number specifying the speed of the CPU core (in MHz).

times: an object containing the following properties:

user: the number of milliseconds the CPU spent in user mode.

nice: the number of milliseconds the CPU spent in nice (low-priority user) mode.

sys: the number of milliseconds the CPU spent in system (kernel) mode.

idle: the number of milliseconds the CPU spent idle.

irq: the number of milliseconds the CPU spent servicing interrupt requests.

Note: the nice value is POSIX-only. On Windows, the nice value is always 0 for all processors.

If the user and nice fields look confusing, you are not alone; I checked their exact meanings carefully, so let's go through them.

User

User indicates the percentage of time the CPU spends running in user mode.

Application execution is divided into user mode and kernel mode: in user mode the CPU executes the application's own code logic, typically business logic or numerical computation; in kernel mode the CPU executes system calls initiated by the process, usually in response to the process's requests for resources.

A user-space program is any process that does not belong to the kernel. Shell, compilers, databases, Web servers, and desktop-related programs are user-space processes. If the processor is not idle, it is normal that most of the CPU time should be spent running user-space processes.

Nice

Nice indicates the percentage of time the CPU spends running in low-priority user mode, meaning the process's nice value is greater than 0.

System

System indicates the percentage of time the CPU spends running in kernel mode.

In general, kernel-mode CPU usage should not be too high unless the application makes a large number of system calls. If it is too high, the process is spending a long time in system calls, for example doing frequent I/O.

Idle

Idle represents the percentage of time that CPU is idle, where CPU has no tasks to perform.

Irq

Irq indicates the percentage of time that CPU takes to handle hardware interrupts.

The network card interrupt is a typical example: after the network card receives the data packet, it notifies the CPU through the hardware interrupt to deal with it. If the system network traffic is very high, a significant increase in irq usage can be observed.

Conclusion:

As a rule of thumb, the CPU can be considered healthy when user-mode usage is below 70%, kernel-mode usage is below 35%, and overall usage is below 70%.
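The rule of thumb above can be sketched as a small check; the function name isCpuHealthy and the exact thresholds are illustrative assumptions, not a standard API:

```javascript
// Minimal sketch of the health thresholds described above.
// All three values are fractions between 0 and 1.
function isCpuHealthy({ user, sys, total }) {
  return user < 0.7 && sys < 0.35 && total < 0.7;
}

console.log(isCpuHealthy({ user: 0.45, sys: 0.10, total: 0.55 })); // healthy
console.log(isCpuHealthy({ user: 0.80, sys: 0.10, total: 0.90 })); // not healthy
```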

The following example illustrates the use of the os.cpus() method in Node.js:

Example 1:

// Node.js program to demonstrate the os.cpus() method

// Allocating os module
const os = require('os');

// Printing os.cpus() values
console.log(os.cpus());

Output:

[
  {
    model: 'Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz',
    speed: 2712,
    times: { user: 900000, nice: 0, sys: 940265, idle: 11928546, irq: 147046 }
  },
  {
    model: 'Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz',
    speed: 2712,
    times: { user: 860875, nice: 0, sys: 507093, idle: 12400500, irq: 27062 }
  },
  {
    model: 'Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz',
    speed: 2712,
    times: { user: 1273421, nice: 0, sys: 618765, idle: 11876281, irq: 13125 }
  },
  {
    model: 'Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz',
    speed: 2712,
    times: { user: 943921, nice: 0, sys: 460109, idle: 12364453, irq: 12437 }
  }
]

Here is the code to obtain CPU utilization:

const os = require('os');
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

class OSUtils {
  constructor() {
    this.cpuUsageMSDefault = 1000; // default sampling period for CPU utilization
  }

  /**
   * Get CPU utilization over a time period
   * @param { Number } options.cpuUsageMS [sampling period, default 1000ms, i.e. 1 second]
   * @param { Boolean } options.percentage [true (return the result as a percentage string) | false]
   * @returns { Promise }
   */
  async getCPUUsage(options = {}) {
    const that = this;
    let { cpuUsageMS, percentage } = options;
    cpuUsageMS = cpuUsageMS || that.cpuUsageMSDefault;
    const t1 = that._getCPUInfo(); // CPU info at time point t1
    await sleep(cpuUsageMS);
    const t2 = that._getCPUInfo(); // CPU info at time point t2
    const idle = t2.idle - t1.idle;
    const total = t2.total - t1.total;
    let usage = 1 - idle / total;
    if (percentage) usage = (usage * 100.0).toFixed(2) + '%';
    return usage;
  }

  /**
   * Get instantaneous CPU time info
   * @returns { Object } CPU info
   */
  _getCPUInfo() {
    const cpus = os.cpus();
    let user = 0, nice = 0, sys = 0, idle = 0, irq = 0, total = 0;
    for (const cpu of cpus) {
      const times = cpu.times;
      user += times.user;
      nice += times.nice;
      sys += times.sys;
      idle += times.idle;
      irq += times.irq;
    }
    total = user + nice + sys + idle + irq;
    return { user, sys, idle, total };
  }
}

const cpuUsage = new OSUtils().getCPUUsage({ percentage: true });
cpuUsage.then(data => console.log('cpuUsage:', data)); // 6.15% on my machine

CPU load

The CPU load average (loadavg) is easy to understand: it is the average, over a period of time, of the number of processes occupying CPU time plus the number of processes waiting for CPU time. Here, processes waiting for CPU time means processes waiting to be woken up, excluding processes in the wait state.

Before that, we need to learn a Node API.

os.loadavg()

Returns an array of 1, 5, and 15 minute average loads.

The average load is a measure of system activity calculated by the operating system and expressed as a decimal.

Load average is a Unix-specific concept. On Windows, the return value is always [0, 0, 0].

It describes the current busyness of the operating system, and can be roughly understood as the average number of tasks using or waiting to use the CPU per unit time. A CPU load that is too high indicates too many processes, which in Node may show up as repeatedly starting new processes with the child_process module.

const os = require('os');
// number of logical CPU cores
const length = os.cpus().length;
// single-core load averages: returns an array of the 1-, 5-, and 15-minute load averages
os.loadavg().map(load => load / length);

Memory metrics

Let's explain an API first; otherwise the code we use to get memory metrics will be hard to follow.

process.memoryUsage():

This function returns an object with four properties, whose meanings and differences are as follows:

rss (Resident Set Size): the total amount of memory the operating system has allocated to the process, including all C++ and JavaScript objects and code (for example, the stack and code segments).

heapTotal: the total size of the heap, made up of three parts:

Memory already allocated for creating and storing objects, corresponding to heapUsed

Unallocated memory that is available for allocation

Memory that cannot be allocated, such as fragments between objects before garbage collection (GC)

heapUsed: the allocated memory, that is, the total size of all objects in the heap; it is a subset of heapTotal.

external: memory used by system link libraries the process depends on; a Buffer, for example, counts as external data. Unlike other objects, Buffer data does not go through V8's memory allocation mechanism, so it is not subject to the heap size limit.

Using the following code to print the memory usage of a child process, you can see that rss is roughly equal to the RES column of the top command. In addition, the main process occupies only 33 MB, less than the child process, which shows that their memory footprints are counted independently.

var showMem = function () {
  var mem = process.memoryUsage();
  var format = function (bytes) {
    return (bytes / 1024 / 1024).toFixed(2) + ' MB';
  };
  console.log('Process: heapTotal ' + format(mem.heapTotal) +
    ' heapUsed ' + format(mem.heapUsed) +
    ' rss ' + format(mem.rss) +
    ' external: ' + format(mem.external));
  console.log('--------------------------------------------------------------');
};

For Node, once a memory leak occurs it is not easy to troubleshoot. If monitoring shows that memory only rises and never falls, there is most likely a memory leak. Healthy memory usage should go up and down: up when traffic is high, back down when traffic drops.
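The "only rising, never falling" heuristic above can be sketched as a simple trend check; the function name looksLikeLeak and the minimum sample count are illustrative assumptions:

```javascript
// Minimal sketch: flag a possible leak when every heapUsed sample is
// larger than the previous one (monotonic growth).
function looksLikeLeak(samples) {
  if (samples.length < 3) return false; // too few samples to judge
  return samples.every((v, i) => i === 0 || v > samples[i - 1]);
}

// In a real monitor the samples would come from process.memoryUsage().heapUsed
// collected on an interval, e.g.:
//   setInterval(() => samples.push(process.memoryUsage().heapUsed), 60000);
console.log(looksLikeLeak([100, 120, 150, 180])); // only rising: suspicious
console.log(looksLikeLeak([100, 150, 90, 140]));  // goes up and down: healthy
```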

The code for getting memory metrics:

const os = require('os');
// view current Node process memory usage
const { rss, heapUsed, heapTotal } = process.memoryUsage();
// get free system memory
const systemFree = os.freemem();
// get total system memory
const systemTotal = os.totalmem();

module.exports = {
  memory: () => {
    return {
      system: 1 - systemFree / systemTotal, // system memory usage ratio
      heap: heapUsed / heapTotal,           // current Node process heap usage ratio
      node: rss / systemTotal,              // current Node process share of system memory
    };
  },
};

Disk space metrics

Disk monitoring mainly tracks disk consumption. Because of frequent log writes, disk space is gradually used up, and once the disk is full all kinds of system problems follow. Set an upper limit on disk usage; once usage exceeds the warning value, the server administrator should archive the logs or clean up the disk.

The following code is adapted from easy-monitor 3.0.

First use df -P to get the status of all disks; the -P flag prevents line wrapping.

startsWith('/') ensures it is a real disk, not a virtual one.

line.match(/(\d+)%\s+(\/.*$)/) matches the disk usage and the mount point, such as '1% /System/Volumes/Preboot'.

match[1] is the usage percentage string, and match[2] is the mount point.

const { execSync } = require('child_process');
const result = execSync('df -P', { encoding: 'utf8' });
const lines = result.split('\n');
const metric = {};
lines.forEach(line => {
  if (line.startsWith('/')) {
    const match = line.match(/(\d+)%\s+(\/.*$)/);
    if (match) {
      const rate = parseInt(match[1] || 0);
      const mounted = match[2];
      if (!mounted.startsWith('/Volumes/') && !mounted.startsWith('/private/')) {
        metric[mounted] = rate;
      }
    }
  }
});
console.log(metric);

I/O metrics

The I/O load mainly refers to disk I/O. It reflects how much reading and writing is happening on disk. Applications written in Node are mostly network services, where disk I/O load is unlikely to be very high; most read pressure comes from the database.

To get Linux I/O metrics, we use a command called iostat; install it if it is not present. Let's see why this command reflects I/O metrics.

iostat -dx

Attribute description

rrqm/s: the number of merged read requests per second (rmerge/s); read requests to the same block are merged by the file system.
wrqm/s: the number of merged write requests per second (wmerge/s).
r/s: the number of read I/O operations completed per second (rio/s).
w/s: the number of write I/O operations completed per second (wio/s).
rsec/s: the number of sectors read per second (rsect/s).
wsec/s: the number of sectors written per second (wsect/s).
rkB/s: kilobytes read per second; half of rsect/s, because each sector is 512 bytes.
wkB/s: kilobytes written per second; half of wsect/s.
avgrq-sz: the average size (in sectors) of each device I/O operation.
avgqu-sz: the average I/O queue length.
await: the average wait time (in milliseconds) for each device I/O operation.
svctm: the average service time (in milliseconds) for each device I/O operation.
%util: the percentage of each second the device spends on I/O operations, i.e. how saturated the device is.

We mainly need to monitor %util.

If %util is close to 100%, too many I/O requests are being generated, the I/O system is saturated, and the disk may be a bottleneck.

If await is much larger than svctm, the I/O queue is too long and the application's response time is slowing down. If response times exceed the acceptable range, consider replacing the disk with a faster one, adjusting the kernel's elevator (I/O scheduling) algorithm, optimizing the application, or upgrading the CPU.
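A monitoring script could shell out to iostat and extract %util per device. The sketch below parses iostat -dx-style output; the sample text and the function name parseUtil are assumptions for illustration:

```javascript
// Parse %util (the last column) per device from `iostat -dx`-style output.
// In a real monitor the text would come from:
//   require('child_process').execSync('iostat -dx', { encoding: 'utf8' })
function parseUtil(iostatOutput) {
  const metric = {};
  for (const line of iostatOutput.split('\n')) {
    const cols = line.trim().split(/\s+/);
    // Data rows start with a lowercase device name like sda and end with %util.
    if (cols.length > 2 && /^[a-z]+\d*$/.test(cols[0])) {
      metric[cols[0]] = parseFloat(cols[cols.length - 1]);
    }
  }
  return metric;
}

// Abridged sample output for illustration:
const sample = [
  'Device  rrqm/s wrqm/s r/s  w/s  await svctm %util',
  'sda     0.02   1.10   0.5  2.3  4.30  1.20  12.50',
].join('\n');
console.log(parseUtil(sample)); // { sda: 12.5 }
```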

Response time RT monitoring

To monitor the page response time of Node.js, the solution here comes from a blog post by Liao Xuefeng.

Recently, I wanted to monitor Node.js performance. Recording and analyzing logs is too troublesome; the easiest way is to record the processing time of each HTTP request and return it directly in an HTTP response header.

Recording the time of an HTTP request is simple: record a timestamp when the request is received and another when the response is sent; the difference between the two is the processing time.

However, calls to res.send() are spread all over the js files, so you can't go and change every URL handler.

The correct approach is to implement it in middleware. But Node.js has no built-in way to intercept res.send(); how do we get around that?

In fact, we just need a slight change of thinking: abandon the traditional OOP view and treat res.send() as a function object. We first save the original handler in _send, then replace res.send with our own function:

app.use(function (req, res, next) {
  // record start time:
  var exec_start_at = Date.now();
  // save the original handler:
  var _send = res.send;
  // bind our own handler:
  res.send = function () {
    // send the header:
    res.set('X-Execution-Time', String(Date.now() - exec_start_at));
    // call the original handler:
    return _send.apply(res, arguments);
  };
  next();
});

It took only a few lines of code to get the timestamp done.

The res.render () method does not need to be handled because res.send () is called internally by res.render ().

When calling the apply() function, it is important to pass in the res object; otherwise this inside the original handler points to undefined, which directly causes an error.

The measured home page response time is 9 milliseconds.

Monitoring Throughput / query rate per second QPS

Terminology:

1. QPS, query per second

QPS (Queries Per Second) is the number of queries a server can respond to per second; it measures how much traffic a particular query server handles within a specified period of time.

In the Internet, the performance of machines that serve as domain name system servers is often measured by the query rate per second.

2. TPS, transactions per second

TPS (Transactions Per Second) is the number of transactions per second, a unit of measurement in software testing. A transaction is the process of a client sending a request to the server and the server responding: the client starts timing when it sends the request and stops after receiving the server's response, from which the elapsed time and the number of completed transactions are computed.

QPS vs TPS: QPS is broadly similar to TPS, with one difference: a single visit to a page forms one TPS, but that one page visit may generate multiple requests to the server, each of which counts toward QPS. For example, if visiting a page issues two requests to the server, one visit produces one "T" and two "Q"s.

3. RT, response time

Response time: the total time from initiating a request to receiving the last of the response data, that is, from the client sending the request to the client receiving the server's response.

Response time RT (Response-time) is one of the most important indicators of a system, and its value directly reflects the speed of the system.

4. Concurrency

Concurrency is the number of requests the system can handle at the same time, and it also reflects the system's load capacity.

5. Throughput

The system's throughput (capacity under load) is closely related to each request's CPU consumption, external interfaces, I/O, and so on. The higher a single request's CPU consumption and the slower the external interfaces and I/O, the lower the system throughput, and vice versa.

Several important parameters of system throughput: QPS (TPS), number of concurrency, response time.

QPS (TPS): the number of requests/transactions per second

Concurrency: the number of requests/transactions the system processes simultaneously

Response time: generally take the average response time

After understanding the meaning of the above three elements, we can infer the relationship between them:

QPS (TPS) = number of concurrency / average response time

Concurrency = QPS * average response time

6. A practical example

Let's use an example to string the above concepts together. According to the 80/20 rule, if 80% of daily visits are concentrated in 20% of the day, that 20% of the time is called the peak time.

Formula: (total daily PV * 80%) / (seconds per day * 20%) = peak-time requests per second (QPS)

Machines: peak-time QPS / single-machine QPS = number of machines required

1. With 3,000,000 PV per day on a single machine, what QPS does this machine need?

(3000000 * 0.8) / (86400 * 0.2) = 139 (QPS)

2. If one machine supports a QPS of 58, how many machines are needed?

139 / 58 ≈ 2.4, so 3 machines are needed (rounding up)
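The two formulas above can be sketched directly in code; the function names peakQPS and machinesNeeded are illustrative:

```javascript
// Peak-time QPS from daily PV, using the 80/20 rule of thumb above.
function peakQPS(dailyPV) {
  return (dailyPV * 0.8) / (86400 * 0.2);
}

// Machines needed: round up, since a fraction of a machine can't be deployed.
function machinesNeeded(qps, singleMachineQPS) {
  return Math.ceil(qps / singleMachineQPS);
}

console.log(Math.round(peakQPS(3000000))); // 139
console.log(machinesNeeded(139, 58)); // 3
```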

If you do front-end architecture for a typical small or medium project and deploy your own Node service, you now know roughly how many machines the cluster needs when you write up the report: with the PV figure, you can calculate an initial estimate.

Next, let's look at stress testing (we obtain the QPS from the stress test). Take the ab command as an example.

Command format:

ab [options] [http://]hostname[:port]/path

The common parameters are as follows:

-n requests: total number of requests
-c concurrency: number of concurrent requests
-t timelimit: maximum number of seconds for the test, which can be used as the request timeout
-p postfile: file containing the data to POST
-T content-type: Content-Type header to use for the POST data

Please see the official documentation for more parameters.

http://httpd.apache.org/docs/2.2/programs/ab.html

For example, test a GET request interface:

ab -n 10000 -c 10000 -t 10 "http://127.0.0.1:8080/api/v1/posts?size=10"

The run produces a results table (shown as a screenshot in the original article), from which we read several key metrics:

1. Throughput (Requests per second), shown in the ab output

A quantitative measure of the server's concurrency handling capacity, in reqs/s: the number of requests processed per unit time under a certain number of concurrent users. The maximum number of requests that can be processed per unit time under a given concurrency is called the maximum throughput.

Remember: throughput is based on the number of concurrent users. This sentence represents two meanings:

A, throughput is related to the number of concurrent users

B. Under different concurrent users, the throughput is generally different.

Calculation formula:

Total number of requests / time taken to complete them

Note that this value represents the machine's overall performance; the higher, the better.

2. QPS query rate per second (Query Per Second)

Query rate per second (QPS) is a measure of how much traffic is processed by a specific query server within a specified period of time. On the Internet, the performance of the machine as a domain name system server is often measured by the query rate per second, that is, the number of response requests per second, that is, the maximum throughput capacity.

Calculation formula

QPS (TPS) = concurrency / average response time (Time per request)

The ab output above includes the Time per request value, and we also have the concurrency, so we can calculate the QPS.

This QPS is stress-test data; the real QPS can be obtained from log monitoring.

Log monitoring

Usually, as the system runs, our backend service will generate all kinds of logs, and the application will generate application access logs, error logs, running logs, and web logs. We need a display platform to display these logs.

The back end generally displays logs with ELK; on the front end, we can build a customized UI. The key is that the logs themselves should be printed according to a consistent format, so the structured data is easier to analyze and display.

Business logic monitoring is mainly reflected in the logs. By monitoring changes in the exception log files, new exceptions are surfaced by type and count. Some exceptions are tied to a specific subsystem, so a monitored exception can also reflect that subsystem's state.

The QPS of the actual business can also be seen in system monitoring; observing QPS trends reveals the busy periods of the business.

In addition, PV and UV monitoring can also be derived from the access logs, and we can analyze user habits and predict access peaks.
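As a sketch of deriving PV and UV from an access log: PV counts every request line, UV counts distinct client IPs. The log format assumed here (client IP as the first space-separated field, as in nginx's combined format) and the function name countPvUv are illustrative:

```javascript
// Count PV (page views = lines) and UV (unique visitors = distinct IPs)
// from access-log text where the client IP is the first field.
function countPvUv(logText) {
  const lines = logText.split('\n').filter(Boolean);
  const ips = new Set(lines.map(line => line.split(' ')[0]));
  return { pv: lines.length, uv: ips.size };
}

const sampleLog = [
  '1.2.3.4 - - [01/Jan/2024:00:00:01] "GET / HTTP/1.1" 200 512',
  '5.6.7.8 - - [01/Jan/2024:00:00:02] "GET /about HTTP/1.1" 200 256',
  '1.2.3.4 - - [01/Jan/2024:00:00:03] "GET /list HTTP/1.1" 200 128',
].join('\n');

console.log(countPvUv(sampleLog)); // { pv: 3, uv: 2 }
```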

Response time

This can also be obtained from the access log, but the real response time requires logging in the controller.

Process monitoring

Process monitoring generally checks the number of application processes running in the operating system. For example, for a Node application with a multi-process architecture, you need to check the number of worker processes; if it is lower than expected, raise an alarm.

Checking the number of processes is easy on Linux.

Node provides the child_process module to make use of multi-core CPUs; the child_process.fork() function is used to spawn (copy) processes.

The worker.js code is as follows:

var http = require('http')
http.createServer(function (req, res) {
  res.writeHead(200, { 'Content-Type': 'text/plain' })
  res.end('Hello World\n')
}).listen(Math.round((1 + Math.random()) * 1000), '127.0.0.1')

Starting it through node worker.js will listen on a random port between 1000 and 2000.

The master.js code is as follows:

var fork = require('child_process').fork
var cpus = require('os').cpus()
for (var i = 0; i < cpus.length; i++) {
  fork('./worker.js')
}

The command to view the number of processes is as follows:

$ ps aux | grep worker.js
lizhen 1475 0.0 0.0 2432768   600 s003 S+ 3:27AM 0:00.00 grep worker.js
lizhen 1440 0.0 0.2 3022452 12680 s003 S  3:25AM 0:00.14 /usr/local/bin/node ./worker.js
lizhen 1439 0.0 0.2 3023476 12716 s003 S  3:25AM 0:00.14 /usr/local/bin/node ./worker.js
lizhen 1438 0.0 0.2 3022452 12704 s003 S  3:25AM 0:00.14 /usr/local/bin/node ./worker.js
lizhen 1437 0.0 0.2 3031668 12696 s003 S  3:25AM 0:00.15 /usr/local/bin/node ./worker.js

At this point, the study of "how to obtain Node performance monitoring metrics" is over. I hope it has resolved your doubts; combining theory with practice is the best way to learn, so go and try it! If you want to keep learning more, please continue to follow the site; the editor will keep working to bring you more practical articles!
