This article explains how to optimize the performance of C++ programs. The method introduced here is simple, fast and practical; interested readers may want to follow along and learn how to optimize C++ performance.
1. Examples
Let's start with a piece of code:
#include <stdlib.h>
#define CACHE_LINE __attribute__((aligned(64)))

struct S1 {
    int R1;
    int R2;
    int R3;
    S1() : R1(1), R2(2), R3(3) {}
} CACHE_LINE;

void add(const S1 smember[], int members, long &total) {
    int idx = members;
    do {
        --idx;                      // walk from the last element down to 0
        total += smember[idx].R1;
        total += smember[idx].R2;
        total += smember[idx].R3;
    } while (idx);
}

int main(int argc, char *argv[]) {
    const int SIZE = 204800;
    S1 *smember = (S1 *)malloc(sizeof(S1) * SIZE);
    long total = 0L;
    int loop = 10000;
    while (--loop) {                // repeat the work to make comparison easier
        add(smember, SIZE, total);
    }
    return 0;
}
Note: the logic of the code is deliberately simple; it just performs an accumulation, purely for demonstration.
Compile + run:
g++ cache_line.cpp -o cache_line; taskset -c 1 ./cache_line
The following figure shows cache_line running pinned to CPU core 1, with CPU utilization reaching 99.7%. When the CPU is essentially fully loaded, how do we know whether it is doing useful work while running the cache_line service, and whether there is room for optimization?
Some readers may say that you can use perf to find hot functions. It is true that perf can be used, but perf can only tell you that a function (or some assembly instructions) is a hot spot; it cannot identify which CPU operations are the bottleneck, such as instruction fetch, decode, and so on.
If you are still at a loss about which CPU operations are causing a service performance bottleneck, this article is for you. It introduces the Top-down Micro-architecture Analysis Methodology (TMAM) for quickly and accurately locating CPU performance bottlenecks, together with optimization suggestions to help you improve service performance. To better understand the method, some background knowledge is needed first.
II. Introduction to the CPU pipeline
Modern computers generally follow the von Neumann model, with five core components: arithmetic, storage, control, input and output. The method introduced here concerns the CPU. CPU execution involves the basic stages of instruction fetch, decode, execute and write-back. In the earliest CPUs, one instruction went through all of these steps before the next instruction started, so the utilization of each hardware unit was very low. To improve performance, Intel introduced multi-stage pipelining, out-of-order execution and other techniques. Generally speaking, an Intel CPU has at least a 5-stage pipeline, meaning that the same cycle can handle five different operations; some newer CPUs have as many as 15 pipeline stages. The figure below shows the state of a 5-stage pipeline: within seven CPU cycles, instructions 1-3 have completed while instructions 4-5 are still executing. This is also why the CPU decodes instructions: an instruction that operates on different resources is decomposed into micro-operations (uops). For example, ADD eax, [mem1] can be decoded into two uops, one that loads the data at [mem1] into a temporary register and one that performs the addition, so the execution unit can run a uop of another instruction while data is being loaded; multiple different resource units can then work in parallel.
There are many kinds of resources inside a CPU, such as the TLB, ALU, L1 cache, registers, ports, the BTB and so on. Each resource runs at a different speed, some fast and some slow, and they depend on each other, so a running program is constrained by these CPU resources in various ways. This article uses TMAM to analyze which internal CPU resources become bottlenecks while a program runs.
3. Top-down analysis (TMAM)
TMAM stands for Top-down Micro-architecture Analysis Methodology. It is the methodology Intel CPU engineers have summed up for optimizing CPU performance. The idea behind TMAM is to classify all CPU micro-operations into broad categories, identify the likely bottleneck at a coarse level first, and then drill down to locate the specific bottleneck. This matches the way we naturally think, going from the macro level to the details; paying attention to details too early often costs more time. The advantages of this methodology are:
Even if you have no hardware related knowledge, you can optimize programs based on the characteristics of CPU.
It systematically eliminates guesswork about program performance bottlenecks: is the branch prediction success rate low? Is the CPU cache hit rate low? Is memory the bottleneck?
It quickly identifies bottleneck points in multi-core, out-of-order CPUs.
TMAM uses two measures when evaluating each metric: one is the CPU clock cycle (cycle [6]), the other is the CPU pipeline slot [4]. The method assumes that each CPU core has 4 pipeline slots per cycle, i.e. the pipeline width is 4. The figure below shows the states of the four slots in each clock cycle; note that only Clockticks 4 reaches 100% cycle utilization, while the other cycles contain stalls (bubbles).
3.1. Basic classification
TMAM classifies the CPU's resources and identifies bottlenecks in how these resources are used through the different categories: first identify the broad bottleneck from the general direction, then drill down to find the specific bottleneck points and break them one by one. The top level of TMAM divides the CPU's operations into four categories; their meanings are introduced below.
3.1.1 Retiring
Retiring refers to pipeline slots running valid uOps, i.e. uOps [3] that will eventually retire (note that the final result of a micro-operation is either discarded or retired, with its result written back to a register). It can be used to evaluate how much genuinely useful work the program gets out of the CPU. Ideally, all pipeline slots should be in the Retiring category: 100% Retiring means the number of uOps retired per cycle is maximized, and maximizing Retiring raises the instruction throughput per cycle (IPC). Note that a high Retiring percentage does not mean there is no room for optimization; for example, uOps retired through microcode assists are actually a performance loss, and we should avoid that kind of operation.
3.1.2 Bad Speculation
Bad Speculation indicates that mispredictions waste pipeline resources, including slots spent on uOps that will never retire and slots lost while recovering from earlier mispredictions. Work wasted by predicting the wrong branch is classified here. Constructs such as if, switch, while and for may all produce bad speculation.
3.1.3 Front-End Bound
Front-End responsibilities:
Fetch instruction
Decode an instruction into a microinstruction
Distribute instructions to Back-End with a maximum of 4 microinstructions per cycle
Front-End Bound indicates that in some slots the Front-End part of the processor cannot deliver enough instructions to the Back-End. As the first part of the processor, the Front-End's core responsibility is to obtain the instructions the Back-End needs. In the Front-End, a predictor predicts the next address to fetch, the corresponding cache line is obtained from the memory subsystem and converted into the corresponding instructions, and finally these are decoded into uOps (micro-operations). Front-End Bound means that some slots stay idle even though the Back-End is not blocked; for example, stalls caused by instruction cache misses are classified as Front-End Bound.
3.1.4 Back-End Bound
Responsibilities of Back-End:
Receive microinstructions submitted by Front-End
Rearrange microinstructions submitted by Front-End if necessary
Get the corresponding instruction Operand from memory
Execute microinstructions and submit results to memory
Back-End Bound indicates that some pipeline slots cannot accept uOps because the Back-End lacks some of the necessary resources.
The core of the Back-End is that the scheduler dispatches ready uOps out of order to the corresponding execution units; once execution completes, the uOps' results are committed in program order. For example, stalls caused by cache misses, or stalls caused by an overloaded divide unit, fall into this category. It can be subdivided into two subcategories: Memory Bound and Core Bound.
To sum up:
Front-End Bound = bound in Instruction Fetch -> Decode (instruction cache, ITLB)
Back-End Bound = bound in Execute -> Commit (for example execution units, load latency)
Bad Speculation = the pipeline speculates down the wrong path (for example branch mispredictions, memory ordering nukes)
Retiring = the pipeline is retiring uOps
A microinstruction state can be classified according to the decision tree in the following figure:
After running a program for a while, each leaf category in the figure above accounts for some proportion of pipeline slots, and only Retiring is the desired result. So what proportion for each category is reasonable, or good enough that no further optimization is needed? Intel provides reference standards for different types of programs measured in its labs:
Only the Retiring category should be as high as possible; the other three should be as low as possible. If the proportion of a certain category stands out, it is the one to focus on when optimizing.
At present, two mainstream performance analysis tools are based on this methodology: Intel VTune (commercial, and expensive), and pmu-tools from the open-source community.
With some of the above knowledge, let's take a look at the categories of the initial examples:
Although all the indicators are within the reference ranges of the table above, there is room for optimization as long as Retiring has not reached 100%. The obvious bottleneck in the figure above is Back-End Bound.
3.3. How to optimize for different categories?
When using VTune or pmu-tools, we should pay attention to the categories other than Retiring that show a high proportion, and analyze and optimize the most prominent ones. When analyzing a project with these tools, also watch the MUX Reliability metric (multiplexing reliability): the closer it is to 1, the more reliable the current result. If it is below 0.7, it is recommended to run the program longer so that enough data can be collected. Let's now go through optimization for the three categories.
3.3.1 Front-End Bound
The figure above shows that the Front-End is responsible for fetching instructions (possibly prefetching based on predictions), decoding them, and distributing them to the back-end pipeline. Its performance is limited in two ways: latency and bandwidth. Latency problems generally come from instruction fetch (for example L1 ICache or iTLB misses, or interpreted languages such as Python or Java) and from decode (certain special instructions, or queuing issues). When the Front-End is the limit, pipeline utilization drops; in the figure below the non-green parts indicate slots that are not being used, and Clockticks 1 has only 50% slot utilization. Bandwidth problems are divided into three subcategories, MITE, DSB and LSD; interested readers can look these up separately.
3.3.1.1 Optimization recommendations for Front-End: minimize the code footprint [7]:
In C++ you can use the compiler's optimization options to help: for example, GCC's -O* options reduce the footprint, or you can also specify -fomit-frame-pointer.
Make full use of CPU hardware features: macro Fusion (macro-fusion)
The macro-fusion feature can combine two instructions into a single micro-operation, which improves Front-End throughput. Take, for example, a loop we commonly write:
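A minimal sketch of such a counted loop, assuming the usual shape (the function and variable names are illustrative, not from the original example):

#include <cstdint>

// Sketch: with an unsigned loop counter, the compare and the conditional jump
// generated for the loop condition are the pattern the macro-fusion
// recommendation below targets.
long sum(const int* data, uint32_t n) {
    long total = 0;
    for (uint32_t i = 0; i < n; ++i) {   // unsigned counter in the loop condition
        total += data[i];
    }
    return total;
}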
Therefore, it is recommended that the loop condition use an unsigned data type, so that the macro-fusion feature can be used to improve Front-End throughput.
Adjust the code layout (co-locating-hot-code):
① Make full use of the compiler's PGO feature: -fprofile-generate and -fprofile-use.
② Use __attribute__((hot)) and __attribute__((cold)) to adjust the layout of code in memory: grouping the hot code together is advantageous for CPU prefetching during the decode stage.
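A minimal sketch of ②, assuming GCC/Clang attribute syntax (the function names are made up for illustration):

// Hint which functions are hot or cold so the compiler can place hot code
// close together in memory, which helps instruction prefetch.
__attribute__((hot))  void process_request() { /* frequently executed work */ }
__attribute__((cold)) void handle_rare_error() { /* rarely executed work */ }

void serve(bool ok) {
    if (ok) {
        process_request();
    } else {
        handle_rare_error();
    }
}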
For other optimization options, please refer to the GCC optimization options and GCC function attribute documentation.
Branch prediction optimization
① Eliminating branches reduces the chance of mispredictions: for example, small loops can be unrolled, say loops with fewer than 64 iterations (you can use the GCC option -funroll-loops).
② Try to use if instead of the ternary operator ?:; expressions like a > b ? x : y are not recommended because no branch prediction can be applied to them.
③ Reduce combined conditions as much as possible and use single conditions; for code like if (a || b) {...} else {...}, the CPU cannot do branch prediction.
④ for multi-case switch, put the most likely case first as far as possible
⑤ We can adjust the code layout to suit the CPU's static prediction algorithm, following the conditions below:
Precondition: make the first block of code after the conditional branch the one most likely to be executed, for example:
bool is_expect = true;
if (is_expect) {
    // put the code with a high probability of execution here, as far as possible
} else {
    // put the code with a low probability of execution here, as far as possible
}

Postcondition: make the jump target of the conditional branch the path that is unlikely to be executed, for example:

do {
    // the code here runs in the common case
} while (condition);

3.3.2 Back-End Bound
Optimization in this category mainly involves making good use of the CPU cache. The CPU cache exists to bridge the speed gap between the ultra-fast CPU and DRAM. The CPU contains multiple cache levels (register, L1, L2, L3), and the TLB was introduced to speed up translation between virtual and physical memory addresses.
Without caches, the latency would be unacceptable if instructions and data had to be loaded from DRAM every time.
3.3.2.1 Optimization recommendations:
Adjust the algorithm to reduce the data stored and reduce dependencies between adjacent instructions' data, improving instruction-level concurrency.
Resize the data structure according to cache line
Avoid false sharing in the L2 and L3 caches
(1) rational use of cache line alignment
The CPU cache is precious, so its utilization should be as high as possible. There are some common misunderstandings in everyday use that can lead to low effective utilization of the CPU cache. Let's look at an example where cache line alignment is not appropriate:
#include <stdlib.h>
#define CACHE_LINE

struct S1 {
    int R1;
    int R2;
    int R3;
    S1() : R1(1), R2(2), R3(3) {}
} CACHE_LINE;

int main(int argc, char *argv[]) {
    // same as before
}
The following is the test result:
With cache line alignment:
#include <stdlib.h>
#define CACHE_LINE __attribute__((aligned(64)))

struct S1 {
    int R1;
    int R2;
    int R3;
    S1() : R1(1), R2(2), R3(3) {}
} CACHE_LINE;

int main(int argc, char *argv[]) {
    // same as before
}
Test results:
Comparing the two Retiring values, we can see that in this scenario the version without cache line alignment has higher cache utilization: because the data is only used by a single thread, aligning each object to a cache line leads to low CPU cache utilization. In the example above, the cache line utilization is only 3 × 4 / 64 ≈ 18%. Principles for cache line alignment:
Use it when multiple threads write the same object or structure at the same time (i.e. when there is a false sharing scenario).
When the object or structure is too large
Put the most frequently accessed fields at the head of the object or structure whenever possible.
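A hedged sketch of these principles combined (the struct and field names are hypothetical, not from the original example):

#include <cstdint>

// Hot, frequently read fields sit at the start of the struct so they land in
// the first cache line loaded for the object; cold fields come later.
struct alignas(64) Session {
    // hot fields, touched on every request
    uint64_t request_count;
    uint32_t state;
    uint32_t flags;
    // cold fields, touched only occasionally
    char     description[128];
    uint64_t created_at;
};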
(2) False sharing
The previous scenario was mainly about misuse of cache line alignment. Here we introduce how to use cache line alignment to solve false sharing in SMP systems: multiple CPUs modify data in the same cache line at the same time, which makes the data in the CPU caches inconsistent, i.e. the cache line is invalidated. Why does false sharing occur in multi-threaded scenarios but not multi-process ones? Because of the way Linux virtual memory works, each process's virtual address space is isolated from the others: even if the data is not cache line aligned, a cache line loaded while executing process 1 will contain only process 1's data; it will never contain part of process 1's data and part of process 2's.
(The L2 cache organization in the figure above may differ between models; on some CPUs, such as Skylake, it is exclusive to each core.)
False sharing has a large impact on performance because it forces operations that could have run in parallel to be serialized by cache invalidations. This is not acceptable for high-performance services, so we need to align the data. The method is CPU cache line alignment (cache line align), which is essentially a space-for-time trade-off. For example, the code snippet from earlier:
#define CACHE_LINE __attribute__((aligned(64)))

struct S1 {
    int R1;
    int R2;
    int R3;
    S1() : R1(1), R2(2), R3(3) {}
} CACHE_LINE;
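As a self-contained illustration of the trade-off (the counter names, thread count and iteration count are made up, not taken from the original benchmark):

#include <atomic>
#include <thread>

// Two counters written by two different threads. Without alignment they could
// share one 64-byte cache line and the writes would keep invalidating each
// other's copy (false sharing); aligning each counter to its own line avoids
// that, trading memory for speed.
struct alignas(64) PaddedCounter {
    std::atomic<long> value{0};
};

PaddedCounter counters[2];

void worker(int idx, long iterations) {
    for (long i = 0; i < iterations; ++i) {
        counters[idx].value.fetch_add(1, std::memory_order_relaxed);
    }
}

int main() {
    std::thread t0(worker, 0, 10000000L);
    std::thread t1(worker, 1, 10000000L);
    t0.join();
    t1.join();
    return 0;
}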
Therefore, cache line alignment needs to be applied according to your own actual code, rather than blindly copying others.
3.3.3 Bad Speculation (branch prediction)
Bad Speculation appears when the Back-End discards micro-operations, which means the Front-End's work fetching and decoding those instructions was wasted. That is why, during development, branches should be avoided where possible, or branch prediction accuracy improved, to raise service performance. Although the CPU has a BTB that records prediction history, this cache is very scarce and can only hold a limited amount of data.
Branch prediction is used in the Front-End to speed up the process of the CPU fetching instructions, rather than waiting until the instruction has to be read from main memory. With branch prediction, the Front-End can load the predicted instructions into the L2 cache in advance, greatly reducing the CPU's instruction fetch latency. The downside is that instructions may be loaded in advance based on a misprediction, so we should avoid that situation. Common methods in C++ are:
Use gcc's built-in branch prediction feature whenever possible where if is used (for other cases, please refer to the Front-End section)
#define likely(x)   __builtin_expect(!!(x), 1)  // gcc built-in function, helps the compiler optimize branches
#define unlikely(x) __builtin_expect(!!(x), 0)

if (likely(condition)) {
    // the code here has a high probability of executing
}
if (unlikely(condition)) {
    // the code here has a low probability of executing
}

Avoid remote calls as far as possible.
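If C++20 is available, the standard [[likely]] and [[unlikely]] attributes express the same hint without compiler-specific macros; a small sketch:

// C++20 standard attributes carrying the same branch-probability hint.
int parse(int input) {
    if (input >= 0) [[likely]] {
        return input * 2;        // common path
    } else [[unlikely]] {
        return -1;               // rare error path
    }
}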
Avoid indirect jumps or calls
In C++, constructs such as switch, function pointers and virtual functions may generate assembly with multiple jump targets, which also hurts branch prediction. The BTB can mitigate this, but its resources are very limited (the Intel P3's BTB has 512 entries; I could not find figures for some newer CPUs).
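As a hedged sketch of reducing indirect branches (the class and function names are invented for illustration): marking the common implementation final lets the compiler devirtualize the call when the static type is known, turning the indirect jump into a direct, easily predicted call.

struct Handler {
    virtual ~Handler() = default;
    virtual int handle(int x) = 0;
};

// The most common implementation is marked final so the compiler can
// devirtualize calls whenever it can prove the static type.
struct FastHandler final : Handler {
    int handle(int x) override { return x + 1; }
};

int run(FastHandler& h, int x) {
    // Static type is FastHandler (final): the compiler can call
    // FastHandler::handle directly instead of going through the vtable.
    return h.handle(x);
}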
4. In closing
Here, let's take a look at the initial example. After using the optimization method mentioned above, the evaluation results are as follows:
g++ cache_line.cpp -o cache_line -fomit-frame-pointer; taskset -c 1 ./cache_line
The run time dropped from 15s to 9.8s, a 34% performance improvement; Retiring increased from 66.9% to 78.2%, and Back-End Bound decreased from 31.4% to 21.1%.
5. CPU knowledge refresher
[1] CPI (cycles per instruction): the average number of clock cycles per instruction.
[2] IPC (instructions per cycle): the number of instructions the CPU retires per clock cycle.
[3] uOps: modern processors can decode at least four instructions per clock cycle. The decoding process produces many small operations, called micro-ops (uOps).
[4] pipeline slot: a pipeline slot represents the hardware resources needed to process one uOp. TMAM assumes that each CPU core has several pipeline slots available in each clock cycle; this number is called the pipeline width.
[5] MIPS (million instructions per second): MIPS = 1 / (CPI × clock cycle time) = clock frequency / CPI.
[6] cycle (clock cycle): cycle = 1 / clock frequency.
[7] memory footprint: the amount of memory a program needs while running, including code segments, data segments, the heap, call stacks, and hidden data such as symbol tables, debug data structures, open files and shared libraries mapped into the process address space.
At this point, I believe you have a deeper understanding of how to optimize the performance of C++ programs. You might as well try it out in practice.