
What are the performance analysis tools commonly used by FISCO BCOS engineers


In this issue, the editor introduces the performance analysis tools commonly used by FISCO BCOS engineers. The article is rich in content and examines the topic from a professional point of view; I hope you find it useful.

FISCO BCOS is a fully open-source underlying technology platform for consortium blockchains, created by the open source working group set up by the Financial Blockchain Shenzhen Consortium (known as the Golden Chain Alliance). Members of the open source working group include Beyondsoft, Huawei, Shenzhen Securities Communications, Digital China, Forms Syntron, Tencent, WeBank, YIBI Technology, Yuexiu FinTech and other consortium member organizations.

Foreword

"premature optimization is the root of all evil."

Donald Knuth, the computer science pioneer who said this, was not opposed to optimization; he stressed that optimization should target the key locations in a system. Suppose a for loop takes 0.01s: even if you use tricks such as loop unrolling to improve its performance by 100x and cut the time to 0.0001 seconds, the difference is almost imperceptible to users. Without quantitative measurement of performance issues, flashy code-level optimizations may not improve performance at all, while making the code harder to maintain or introducing new bugs.

"Optimization without any evidence is the root of all evil."

Before applying any optimization to a system, we must first profile it in detail to find the real performance bottlenecks. Working on the front line of FISCO BCOS performance optimization, we have accumulated some experience in using performance analysis tools to accurately locate hotspots. This article collects and summarizes the tools we used during optimization for readers.

1. Poor Man's Profiler

Poor Man's Profiler, PMP for short, is literally "the poor man's profiler". Although the name sounds a bit tongue-in-cheek, it is a serious performance analysis technique and even has its own official website, https://poormansprofiler.org/. PMP is based on stack sampling: by invoking a third-party debugger (such as gdb) to repeatedly capture the stack of every thread in a process, PMP obtains the hotspot distribution of the target process.

The first step is to take a certain number of thread stack snapshots:

pid=$(pidof fisco-bcos)
num=10
for x in $(seq 1 $num)
do
  gdb -ex "set pagination 0" -ex "thread apply all bt" -batch -p $pid
  sleep 0.5
done

The second step is to extract the function call stack information from the snapshots and sort it by how often each stack was sampled:

awk '
  BEGIN { s = "" }
  /^Thread/ { print s; s = "" }
  /^\#/ { if (s != "") { s = s "," $4 } else { s = $4 } }
  END { print s }' | \
sort | uniq -c | sort -r -n -k 1,1
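As written, the first step does not save its output anywhere, so in practice the two steps are chained into a single pipeline. The following is a sketch along the lines of the script published on the PMP website; the sample count and interval are arbitrary:

pid=$(pidof fisco-bcos)
nsamples=10
sleeptime=0.5
for x in $(seq 1 $nsamples)
do
  gdb -ex "set pagination 0" -ex "thread apply all bt" -batch -p $pid
  sleep $sleeptime
done | \
awk '
  BEGIN { s = "" }
  /^Thread/ { print s; s = "" }
  /^\#/ { if (s != "") { s = s "," $4 } else { s = $4 } }
  END { print s }' | \
sort | uniq -c | sort -r -n -k 1,1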

Finally, you get the output, as shown in the following figure:

From the output we can see which threads and which functions are sampled most often, and follow those clues to find possible bottlenecks. The few lines of shell script above are the whole essence of PMP. Extreme ease of use is PMP's biggest selling point: apart from relying on a ubiquitous debugger, it requires no extra components to be installed. As the author of PMP puts it in the introduction: "although there are more advanced profiling techniques, without exception they are too troublesome to install. Poor man doesn't have time. Poor man needs food."

The disadvantage of PMP is also obvious: starting gdb is time-consuming, which keeps PMP's sampling frequency low, so important function call events may be missed and the final profile may be inaccurate.

However, PMP can still shine on some special occasions. For example, developers on several Chinese technology blogs mention using PMP to successfully locate deadlocks in online production environments. The author of PMP also says the technique has been used at Facebook, Intel and other large companies. In any case, a technique with a bit of programmer ingenuity and a bit of humor is worth a look.

2. perf

perf, short for Performance Events, has been integrated into the Linux kernel since version 2.6.31 and is a powerful performance analysis tool that ships with Linux. It uses the PMU (Performance Monitoring Unit) hardware found in modern processors together with kernel performance counters to collect performance data.

perf works by interrupting the running process at a certain frequency and sampling the name of the function currently being executed along with its call stack. If most of the samples fall in the same function, that function either takes a long time to execute or is called very frequently, and may therefore be a performance problem.

To use perf, you need to sample the target process first:

$ sudo perf record -F 1000 -p `pidof fisco-bcos` -g -- sleep 60

In the above command, perf record tells perf to record performance statistics; -F sets the sampling frequency to 1000 Hz, i.e. 1000 samples per second; -p specifies the process ID to sample (the process ID of fisco-bcos), which we obtain directly with the pidof command; -g records call stack information; and sleep 60 sets the sampling duration to 60 seconds.

After the sampling is completed, perf will write the collected performance data to the perf.data file in the current directory.

$ perf report -n

The above command reads perf.data and computes the percentage of samples for each call stack, sorted from highest to lowest, as shown in the following figure:

The information is rich, but not very readable. Although the example here uses perf in a fairly simple way, perf can do much more. Combined with other tools, the data sampled by perf can be presented in a far more intuitive way, which brings us to the performance analysis artifact introduced next: the flame graph.

3. Flame Graph

The flame graph (Flame Graph) was popularized by system performance expert Brendan Gregg as part of his dynamic tracing work. It is mainly used to visualize the data produced by performance analysis tools so that developers can spot performance problems at a glance. Using flame graphs is straightforward: we only need to download a set of scripts from GitHub and place them in any local directory:

wget https://github.com/brendangregg/FlameGraph/archive/master.zip
unzip master.zip

3.1 CPU Flame Graph

When FISCO BCOS performance is low, our first instinct is to figure out which part of the code is slowing everything down, and the CPU is the first place to look.

First, use perf to sample the performance of the FISCO BCOS process:

sudo perf record -F 10000 -p `pidof fisco-bcos` -g -- sleep 6
# parse the sampled data file to generate stack information
sudo perf script > cpu.unfold

After the sampled data file is generated, call the flame graph tools to produce the flame graph:

# fold the cpu.unfold symbols
sudo ./stackcollapse-perf.pl cpu.unfold > cpu.folded
# generate the flame graph in SVG format
sudo ./flamegraph.pl cpu.folded > cpu.svg

The result is an SVG image showing the CPU call stacks, as shown in the following figure:

The vertical axis represents the call stack. Each layer is a function and is the caller of the function in the layer above it; at the very top is the function being executed at the moment of sampling. The deeper the call stack, the taller the flame.

The horizontal axis represents the number of samples; note that it does not indicate elapsed time. The wider a function is, the more times it was sampled. After aggregation, all call stacks are arranged alphabetically along the horizontal axis.

Because the flame graph is an SVG file, it is highly interactive. When opened in a browser, each layer of the flame is labeled with a function name, and hovering the mouse over it shows the full function name, the number of times it was sampled, and its percentage of the total samples, as follows:

Clicking on a layer zooms the flame graph horizontally so that the layer occupies the full width and its details are shown; clicking "Reset Zoom" in the upper left corner restores the view. The following figure shows the sample percentage of each function while the PBFT module is executing a block:

As the figure shows, when executing a block the main overhead lies in decoding the transactions. This is because in traditional RLP encoding the length of each object is not known up front when decoding: the encoding records the number of objects but not each object's byte length, so to access any one encoded object you must first decode all the objects that precede it.

RLP decoding is therefore a serial process, and when a block contains many transactions this overhead becomes very large. In view of this, we proposed an optimization scheme for decoding RLP in parallel; implementation details can be found in the earlier article "Parallelization Practice in FISCO BCOS".

With the flame graph, you can easily see where most of the CPU time goes, and then optimize those spots in a targeted way.

3.2 Off-CPU Flame Graph

While implementing parallel transaction execution in FISCO BCOS, we noticed a puzzling phenomenon: sometimes, even with a very high transaction volume and fully loaded blocks, the top command showed relatively low CPU utilization, usually under 200% on a 4-core machine. After ruling out dependencies between transactions, we speculated that the CPU might be stuck waiting on I/O or locks, so we needed to determine exactly where it was waiting.

With perf we can easily observe how any process in the system goes to sleep. The idea is to use perf's static tracepoints to capture the process's scheduling events, merge those events with perf inject, and finally obtain the call paths that put the process to sleep and the corresponding sleep times.

We record three kinds of events through perf: sched:sched_stat_sleep, sched:sched_switch and sched:sched_process_exit. They correspond, respectively, to the process voluntarily giving up the CPU and going to sleep, the process being switched out by the scheduler while waiting on I/O or locks, and the process exiting.

perf record -e sched:sched_stat_sleep -e sched:sched_switch \
  -e sched:sched_process_exit -p `pidof fisco-bcos` -g \
  -o perf.data.raw sleep 60
perf inject -v -s -i perf.data.raw -o perf.data
# generate the Off-CPU flame graph
perf script -f comm,pid,tid,cpu,time,period,event,ip,sym,dso,trace | \
  awk 'NF > 4 { exec = $1; period_ms = int($5 / 1000000) }
       NF > 1 && NF <= 4 && period_ms > 0 { print $2 }
       NF < 2 && period_ms > 0 { printf "%s\n%d\n\n", exec, period_ms }' | \
  ./stackcollapse.pl | \
  ./flamegraph.pl --countname=ms --title="Off-CPU Time Flame Graph" --colors=io > offcpu.svg

The above commands may fail on newer Ubuntu or CentOS systems, whose kernels disable recording of these scheduling events for performance reasons. Fortunately, we can switch to another profiling tool, SystemTap (together with OpenResty's openresty-systemtap-toolkit scripts), instead of perf to collect scheduler data for the process. To use SystemTap on CentOS, we only need to install the kernel debuginfo packages it depends on.
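On CentOS, installing those dependencies typically looks something like the following; package names and debuginfo repositories vary between releases, so treat this as a sketch rather than an exact recipe:

# SystemTap itself plus the debuginfo matching the running kernel (illustrative)
sudo yum install -y systemtap systemtap-runtime yum-utils kernel-devel-$(uname -r)
sudo debuginfo-install -y kernel-$(uname -r)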

wget https://raw.githubusercontent.com/openresty/openresty-systemtap-toolkit/master/sample-bt-off-cpu
chmod +x sample-bt-off-cpu
./sample-bt-off-cpu -t 60 -p `pidof fisco-bcos` -u > out.stap
./stackcollapse-stap.pl out.stap > out.folded
./flamegraph.pl --colors=io out.folded > offcpu.svg

The resulting Off-CPU flame graph is shown in the following figure:

After expanding the core transaction-execution function, the pile of lock_wait entries on the right side of the flame graph quickly caught our attention. Analyzing their call stacks, we found that these lock_waits were rooted in the program printing large numbers of debug logs.

In the early development phase we added a lot of logging code for debugging and never removed it. Even though the log level was set high during the tests, the logging code still incurs runtime overhead, for example reading the log-level state to decide whether to print. Because that state is accessed under mutual exclusion across threads, threads end up starving while competing for it.
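The pattern is easy to reproduce in miniature. The sketch below is purely illustrative and is not FISCO BCOS's actual logging code: if every log statement reads the level under a mutex, hot threads serialize on that lock even when nothing gets printed, whereas an atomic level check keeps the fast path lock-free.

#include <atomic>
#include <iostream>
#include <mutex>
#include <string>

// Contended version: every call takes a mutex just to read the level.
class MutexLogger {
public:
    void log(int level, const std::string& msg) {
        std::lock_guard<std::mutex> guard(m_mutex);  // all threads serialize here
        if (level >= m_level)
            std::clog << msg << '\n';
    }
private:
    std::mutex m_mutex;
    int m_level = 3;  // high level: almost nothing is printed, but the lock is still taken
};

// Low-overhead version: the level is read with a relaxed atomic load,
// and no lock is touched unless the message will actually be emitted.
class AtomicLogger {
public:
    void log(int level, const std::string& msg) {
        if (level < m_level.load(std::memory_order_relaxed))
            return;                                   // fast path: no synchronization cost
        std::lock_guard<std::mutex> guard(m_sinkMutex);
        std::clog << msg << '\n';
    }
private:
    std::atomic<int> m_level{3};
    std::mutex m_sinkMutex;                           // only protects the actual output
};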

After we removed this logging code, 4-core CPU utilization during transaction execution immediately rose to about 300%, which is within the normal range once scheduling and synchronization overhead between threads is taken into account. This debugging experience also reminds us to log carefully when pursuing high-performance parallel code, so that unnecessary logging does not cause needless performance loss.

3.3 Memory Flame Graph

In the early testing phase of FISCO BCOS we measured performance by repeatedly executing the same block and calculating the average execution time per block. We found that the first execution of a block took much longer than subsequent executions. On the face of it, the program seemed to be allocating some cache the first time the block was executed, but we did not know exactly where, so we turned to the memory flame graph.

The memory flame graph is a non-intrusive, bypass analysis method. Compared with Valgrind, which analyzes memory by simulation, and TCMalloc, which instruments heap usage, the memory flame graph can observe the target process's memory allocation without interfering with the running program.

To create a memory flame graph, first dynamically add a probe to perf to monitor the standard library's malloc, and sample the call stacks of the functions that request/release memory:

perf record -e probe_libc:malloc -F 1000 -p `pidof fisco-bcos` -g -- sleep 60
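Note that the probe_libc:malloc event has to be registered with perf before the record command above will accept it. A possible registration, assuming glibc lives at /lib/x86_64-linux-gnu/libc.so.6 (the path differs between distributions), looks like this; the probe can be removed again once the analysis is done:

# register a dynamic probe on libc's malloc (library path is distribution-dependent)
sudo perf probe -x /lib/x86_64-linux-gnu/libc.so.6 malloc
# ... run the perf record / perf script steps ...
# remove the probe again when finished
sudo perf probe --del probe_libc:malloc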

Then draw the memory flame graph:

perf script > memory.perf
./stackcollapse-perf.pl memory.perf > memory.folded
./flamegraph.pl --colors=mem memory.folded > memory.svg

The resulting memory flame graph is shown below:

At first we guessed that this unknown cache might be in the LevelDB database connection module or in the JSON decoding module, but by comparing the memory flame graphs of the first block execution and subsequent executions we found that the proportion of malloc samples in each module was roughly the same, so we quickly ruled out these conjectures. Only when we looked at the data together with the Off-CPU flame graph did we notice the unusually high number of calls to sysmalloc during the first block execution. Given that malloc pre-allocates memory the first time it is called, we suspected that this pre-allocation was what made the first block execution so slow.

To test this conjecture, we lowered the limit on malloc's pre-allocated arena space:

export MALLOC_ARENA_MAX=1

We then reran the test and drew the Off-CPU flame graph again. Although overall performance degraded, the execution time and sysmalloc call counts of the first block execution were now essentially the same as those of subsequent executions. From this we can basically conclude that the interesting phenomenon was caused by malloc's memory pre-allocation behavior.
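For readers who want to observe this arena behavior directly, glibc exposes malloc_stats(), which prints per-arena statistics to stderr. The following is a minimal, glibc-specific sketch unrelated to FISCO BCOS itself:

#include <cstdlib>
#include <malloc.h>   // glibc-specific header providing malloc_stats()

int main() {
    void* p = std::malloc(1 << 20);   // first allocation triggers arena setup / pre-allocation
    malloc_stats();                   // prints system bytes and in-use bytes for each arena
    std::free(p);
    return 0;
}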

Of course, this behavior exists to improve the program's overall performance and we do not need to interfere with it; a slow first block execution has hardly any negative impact on user experience. But even the smallest performance problem is still a problem, and as developers we should get to the bottom of it and understand why it happens.

Although the memory flame graph did not directly pinpoint the essential cause of the problem, the intuitive data comparison let us quickly eliminate wrong conjectures and save a lot of trial-and-error cost. Facing complex memory problems, you need not only a keen nose but also good helpers like the memory flame graph.

4. DIY Tools

Although many excellent analysis tools already help us through the difficulties of performance optimization, even powerful features sometimes cannot keep up with the variability of performance problems, and we have to build our own analysis tools to fit our specific needs.

While testing the stability of FISCO BCOS, we found that node performance tended to decline as the test ran longer. We needed the performance trend of every module over time to identify the culprit of the degradation, but existing performance analysis tools could not provide this quickly and easily, so we looked for another way.

First, we instrumented the code with a large number of probe points, which measure the execution time of the code segments we are interested in and record it in the log with special identifiers:

auto startTime = utcTime();
/* ... code to be measured ... */
auto endTime = utcTime();
auto elapsedTime = endTime - startTime;
LOG(DEBUG) << "<special identifier>" << elapsedTime;  // the identifier marking this probe point is illustrative
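A convenient way to avoid repeating this boilerplate at every probe point is an RAII scope timer that logs when it goes out of scope. The sketch below is illustrative only and is not FISCO BCOS's actual instrumentation; in real code the std::clog line would go through the project's logging macro with the special identifier mentioned above:

#include <chrono>
#include <iostream>
#include <string>

// Minimal RAII timer: measures the lifetime of a scope and logs it with an identifier.
class ScopeTimer {
public:
    explicit ScopeTimer(std::string tag)
        : m_tag(std::move(tag)), m_start(std::chrono::steady_clock::now()) {}

    ~ScopeTimer() {
        auto elapsedMs = std::chrono::duration_cast<std::chrono::milliseconds>(
                             std::chrono::steady_clock::now() - m_start)
                             .count();
        // In real code this would go through the project's logging macro with a special identifier.
        std::clog << "[PERF] " << m_tag << " took " << elapsedMs << " ms\n";
    }

private:
    std::string m_tag;
    std::chrono::steady_clock::time_point m_start;
};

void executeBlock() {
    ScopeTimer timer("executeBlock");  // timing starts here
    /* ... code to be measured ... */
}                                      // timing ends and is logged when the scope exits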
