2025-01-19 Update From: SLTechnology News&Howtos
This article analyzes a CPU branch prediction trick found in the Dubbo source code. The technique is simple and practical; let's walk through it.
Coincidentally, I was reading the Dubbo source code recently and ran into a rather strange piece of code, which led to this article. The code lives in ChannelEventRunnable: this Runnable is created by Dubbo's IO thread, and the task is handed off to the business thread pool for processing.
Notice that state == ChannelState.RECEIVED is pulled out into a standalone if, while the other states are still handled in the switch.
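For readers who haven't opened the file, the shape of that code is roughly the following sketch. The enum values mirror Dubbo's ChannelState, but the return strings and class name are simplified stand-ins, not Dubbo's real code.

```java
// Rough sketch of the dispatch shape in ChannelEventRunnable.run():
// the overwhelmingly common RECEIVED state is hoisted into a plain if,
// and only the rare states fall through to the switch.
public class ChannelEventSketch {
    enum ChannelState { CONNECTED, DISCONNECTED, SENT, RECEIVED, CAUGHT }

    static String dispatch(ChannelState state) {
        if (state == ChannelState.RECEIVED) {
            return "received";          // hot path: > 99.9% of events
        }
        switch (state) {
            case CONNECTED:    return "connected";
            case DISCONNECTED: return "disconnected";
            case SENT:         return "sent";
            case CAUGHT:       return "caught";
            default:           return "unknown";
        }
    }
}
```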
At the time I scanned it back and forth, wondering what trick was hiding there, but with my shallow knowledge I was simply baffled.
And so an adventure began!
CPU branch prediction
When I run into a problem, the first stop is of course a search engine; I usually query several at once. No need to argue about which is better: this time Baidu's hit rate happened to be high and Google's relatively low.
Generally I like to search the official site first and broaden the search only if that fails, so I started with site:xxx.com plus the keywords.
There you go, perfect!
Let's first look at what the official blog says, then dig into the details.
Dubbo's official blog
Modern CPUs support both branch prediction and instruction pipeline, a combination of which can greatly improve CPU efficiency. For simple if jumps, CPUs are better at branch prediction. But for switch jumps, the CPU doesn't have much to do. Switch essentially takes an address from an address array and jumps to it according to the index.
In other words, an if compiles to a simple conditional jump that the CPU can predict and speculatively execute, whereas a switch first looks up a target address by index in an array-like structure and then jumps there, so branch prediction gains little.
And because more than 99.9% of a channel's events after it is established are in state ChannelState.RECEIVED, that state is singled out, so the CPU's branch prediction mechanism can make the common path faster.
The blog also provides benchmark code: randomly generate 1,000,000 states, 99.99% of them ChannelState.RECEIVED, then compare the two approaches below. (The benchSwitch example name on the official site is wrong; I didn't spot it at first and only caught it while proofreading this article.)
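The setup and the two compared shapes look roughly like this; I've stripped the JMH harness (@Benchmark, @State and so on) so the sketch stays self-contained, and all the names here are illustrative rather than the blog's exact code.

```java
import java.util.Random;

// Sketch of the benchmark's idea: 1,000,000 states, ~99.99% RECEIVED,
// dispatched two different ways.
public class BranchBench {
    enum ChannelState { CONNECTED, DISCONNECTED, SENT, RECEIVED, CAUGHT }

    static ChannelState[] makeStates(int n, long seed) {
        Random r = new Random(seed);
        ChannelState[] all = ChannelState.values();
        ChannelState[] states = new ChannelState[n];
        for (int i = 0; i < n; i++) {
            // roughly 1 in 10,000 entries is drawn at random; the rest are RECEIVED
            states[i] = r.nextInt(10_000) == 0 ? all[r.nextInt(all.length)]
                                               : ChannelState.RECEIVED;
        }
        return states;
    }

    // Variant 1: everything goes through the switch.
    static int benchSwitch(ChannelState[] states) {
        int received = 0;
        for (ChannelState s : states) {
            switch (s) {
                case RECEIVED: received++; break;
                default: break; // rare states: no-op in this sketch
            }
        }
        return received;
    }

    // Variant 2: hot state hoisted into an if, rare states left in the switch.
    static int benchIfSwitch(ChannelState[] states) {
        int received = 0;
        for (ChannelState s : states) {
            if (s == ChannelState.RECEIVED) { received++; continue; }
            switch (s) {
                default: break; // rare states: no-op in this sketch
            }
        }
        return received;
    }
}
```

Both variants compute the same thing; only the dispatch shape differs, which is exactly what the benchmark isolates.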
The blog lists its own results, but I still ran the benchmark locally to see for myself. JMH actually discourages running inside an IDE, but I was lazy and ran it straight from IDEA.
The results confirm that hoisting the hot case into an if is more efficient (note that throughput is what's measured here). The blog also suggests reserving this trick for code with strict performance requirements; in general there's no need for it.
So we know the conclusion is correct, but it deserves deeper analysis. First, how do if and switch actually differ in execution? Then, what exactly are CPU branch prediction and the instruction pipeline, and why do they exist?
if vs switch
Let's start with a simple demo of if versus switch: add a version controlled entirely by if/else, leave the switch and if + switch versions unchanged, and compare their throughput (again with RECEIVED above 99.9%).
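The added all-if variant, written against a stand-in enum with illustrative names, looks something like this:

```java
// The third variant: a pure if/else chain with the hot state first.
public class AllIfBench {
    enum ChannelState { CONNECTED, DISCONNECTED, SENT, RECEIVED, CAUGHT }

    static int benchAllIf(ChannelState[] states) {
        int received = 0;
        for (ChannelState s : states) {
            if (s == ChannelState.RECEIVED)          received++;   // hot path first
            else if (s == ChannelState.CONNECTED)    { /* handle connected */ }
            else if (s == ChannelState.DISCONNECTED) { /* handle disconnected */ }
            else if (s == ChannelState.SENT)         { /* handle sent */ }
            else if (s == ChannelState.CAUGHT)       { /* handle caught */ }
        }
        return received;
    }
}
```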
Here are the results:
Well now: across several runs, the pure-if version beat if + switch by a wide margin. By that logic the source code should be rewritten as pure if/else; the throughput is higher, and it avoids the current neither-fish-nor-fowl mix of if and switch.
Then I changed the state generation to be fully random and ran it again to see what would happen:
Across many runs, pure if still had the highest throughput; however I tuned it, all-if came out best.
Decompile if and switch
My impression was that, branch prediction aside, switch ought to beat if, and the bytecode seems to say so; let's look at what each compiles to.
First, the decompiled switch, with the key parts excerpted.
The switch compiles to a tableswitch: after the getstatic above pushes the value, the JVM indexes directly into the table and jumps to the corresponding line, i.e. O(1) dispatch.
For example, if the value is 1, jump directly to line 64, if it is 4, jump directly to line 100.
A few details: when the case values in a switch are discontinuous with large gaps, a lookupswitch is generated instead. It is commonly described as a binary search, O(log n) rather than direct indexing (I haven't verified this myself, but since the entries are sorted by value, binary search seems plausible).
Also, when the values are discontinuous but the gaps are small, a tableswitch is still generated with the holes padded. For example, with case values 1, 3, 5, 7 and 9, the compiler automatically fills in 2, 4, 6 and 8.
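You can reproduce both forms by compiling something like the following and inspecting it with javap -c. On the javac versions I've checked, the dense method compiles to a tableswitch (gaps padded) and the sparse one to a lookupswitch; the exact density heuristic is a compiler implementation detail, so treat this as an experiment to run rather than a guarantee.

```java
// Compile, then run `javap -c SwitchKinds` to see the two bytecode forms.
public class SwitchKinds {
    // case values 1,3,5,7,9: small gaps, typically padded into a tableswitch
    static String dense(int v) {
        switch (v) {
            case 1: return "one";
            case 3: return "three";
            case 5: return "five";
            case 7: return "seven";
            case 9: return "nine";
            default: return "?";
        }
    }

    // widely scattered case values: typically compiled to a lookupswitch
    static String sparse(int v) {
        switch (v) {
            case 1:         return "one";
            case 100:       return "hundred";
            case 10_000:    return "ten thousand";
            case 1_000_000: return "million";
            default:        return "?";
        }
    }
}
```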
Let's look at the decompilation result of if:
You can see that if compares the variable against each condition in turn, while switch loads the variable once and jumps straight to the right row via the table. On this evidence switch should be more efficient than if; of course, if the first comparison matches, the if version gotos out and skips the remaining comparisons.
So from the bytecode's point of view switch should beat if, yet the test results say the opposite: if wins whether the states are random or 99.99% identical.
The branch-prediction explanation accounts for the skewed case, but I'm not sure why if also wins with random states. Perhaps the JIT performs some optimization, or even with random input the successful predictions outweigh the cost of the failed ones?
Or is my enum too small to show the effect of switch? Still, with random input switch shouldn't lose to if. I added 7 more enum values, 12 in total, and tested again; the results:
The gap narrowed, so I sensed progress and recited my way through the 26 letters to add still more values (honestly, I still have to sing the alphabet song to get through them).
After expanding the number of branches I ran another round of tests; this time switch beat if.
An aside: I've seen other if-vs-switch comparisons online concluding that switch wins, but the JMH code wasn't written correctly: they tested against a constant, and the benchmark method's result was never consumed. Who knows what the JIT did to that code; dozens of lines may simply have been folded into returning a fixed value.
Summarize the test results
So much for the comparisons; let's summarize.
First, for a hot branch, extracting it from the switch into a standalone if does exploit CPU branch prediction and beats a pure switch: in my tests the throughput was roughly twice as high.
Still with a hot branch, replacing if + switch with a pure if chain improves throughput even more: about 3.3x the pure switch and 1.6x the if + switch.
With random branches the three are close, but pure if is still slightly best.
Yet from the bytecode's perspective, switch dispatch, whether O(1) or O(log n), ought to be more efficient; the test results say otherwise.
With few cases, if beats switch and I don't know why; perhaps with so few values the table lookup costs more than it saves? If you know, please leave a comment at the end of the article.
With many cases, switch beats if. I didn't test exactly where the crossover lies; interested readers can try it themselves, but that's the trend.
CPU branch prediction and instruction pipelining
Next, let's look at how branch prediction works and why it exists at all. But before branch prediction, we need to introduce the instruction pipeline of modern microprocessors.
At its core a CPU fetches and executes instructions. A classic breakdown has five stages: instruction fetch (IF), instruction decode (ID), execute (EX), memory access (MEM) and write back (WB). Here is a diagram from Wikipedia.
Real pipelines may have more stages, but the point is the same: since one instruction's execution divides into several steps, those steps can overlap across instructions to raise throughput.
So instruction pipelining is an attempt to keep every part of the processor busy with instructions by dividing incoming instructions into a series of sequential steps that are executed by different processor units, with different instruction parts being processed in parallel.
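A back-of-the-envelope model makes the gain concrete: with S single-cycle stages, N instructions take N x S cycles executed one at a time, but only S + (N - 1) cycles fully pipelined. This is an idealized model that ignores hazards and stalls; the class and method names are mine.

```java
// Idealized cycle counts: S single-cycle stages, no hazards or stalls.
// Sequential execution finishes one instruction before starting the next;
// a full pipeline retires one instruction per cycle once it has filled up.
public class PipelineModel {
    static long sequentialCycles(long n, long s) { return n * s; }
    static long pipelinedCycles(long n, long s)  { return s + (n - 1); }
}
```

For 1,000 instructions on a 5-stage pipeline this gives 5,000 versus 1,004 cycles, a speedup approaching the stage count as N grows.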
It's like a factory assembly line: the moment I finish attaching one Ultraman figure's feet, the next figure's feet go on; I don't wait for the previous figure to be fully assembled before starting the next one.
Of course it isn't that rigid; execution needn't be strictly sequential. When an instruction is waiting and later instructions don't depend on its result, they can run ahead of it; that is out-of-order execution.
Let's go back to our branch prediction.
Code, like life, constantly faces choices, and only after choosing do we know which way things go. But in practice a given branch usually takes the same direction, so branch predictors were invented to guess the direction and execute instructions ahead of time.
And if the prediction is wrong? Here the CPU has an advantage over life: it can discard all the speculative results and start over. But it isn't free: the deeper the pipeline, the more work is wasted, and a misprediction typically costs 10 to 20 clock cycles.
In short, a branch predictor guesses which instructions will be needed and pre-executes them, so that when they really are needed the results are already at hand, improving efficiency.
There are many kinds of branch prediction: static, dynamic, random and so on; Wikipedia lists around 16.
Briefly, on the three I mentioned: static prediction is the bullheaded approach, like blind-guessing a multiple-choice exam by always answering A. Whatever the question, it predicts one fixed direction: unwavering, simple and crude.
Dynamic prediction decides the direction from history: if recent outcomes were all true, it predicts true and executes those instructions; if recent outcomes turn false, it switches to false. It essentially exploits the principle of locality.
Random prediction is just what it sounds like: another blind-guess strategy that picks a direction at random and executes it.
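For the curious, a classic dynamic scheme is the 2-bit saturating counter: from a "strongly" held direction it takes two consecutive mispredictions to flip the prediction, so one stray outcome in an otherwise stable branch does no harm. A minimal sketch:

```java
// Minimal 2-bit saturating counter, a classic dynamic branch predictor.
// States 0..1 predict not-taken, 2..3 predict taken; the counter
// saturates at both ends rather than wrapping around.
public class TwoBitPredictor {
    private int state = 2; // start "weakly taken"

    public boolean predict() { return state >= 2; }

    // Feed the branch's actual outcome back into the counter.
    public void update(boolean actuallyTaken) {
        if (actuallyTaken) state = Math.min(3, state + 1);
        else               state = Math.max(0, state - 1);
    }
}
```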
Many more exist that I won't list one by one; dig in if you're interested. Incidentally, in 2018 researchers including Google Project Zero disclosed a catastrophic security vulnerability called Spectre, which exploits speculative execution driven by branch prediction to leak sensitive information. I won't expand on it here; there's a link at the end of this article.
Then came BranchScope, which also abuses speculative execution; every new mechanism, it seems, arrives with both pros and cons.
At this point we know what the instruction pipeline and branch prediction are, and why Dubbo optimizes this way. But the article isn't over: I also want to mention a very famous Stack Overflow question; just look at that vote count.
Why is processing a sorted array faster than processing an unsorted array?
This question is cited at the start of that Dubbo blog post, and it's clearly also about branch prediction. Since we've come this far, let's analyze it too; try answering it in your head first, now that you know the answer, and check whether your reasoning is clear.
Here's the code; the loop runs faster when the array is sorted.
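For readers without the page open, the core of that snippet is roughly the following, reconstructed from memory of the question (the original benchmark times this loop on sorted versus unsorted data):

```java
import java.util.Random;

// Sum the elements that are >= 128. When `data` is sorted, the branch in
// the loop becomes perfectly predictable past the crossover point.
public class SortedSum {
    static long sum(int[] data) {
        long total = 0;
        for (int v : data) {
            if (v >= 128) { // the branch whose predictability is at stake
                total += v;
            }
        }
        return total;
    }

    static int[] randomBytes(int n, long seed) {
        Random r = new Random(seed);
        int[] data = new int[n];
        for (int i = 0; i < n; i++) data[i] = r.nextInt(256); // values 0..255
        return data;
    }
}
```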
Then the gurus all chimed in; let's see what the top-voted answer says.
His very first sentence hit the nail on the head:
You are a victim of branch prediction fail.
Right after that came a picture of a railway junction; clearly an old hand.
He said: "Let's go back to the 19th century, a time before long-distance communication and before radio was widespread. If you were the operator of a railroad junction, how would you know which way to switch the tracks when a train was coming?"
Stopping and restarting a train is very expensive. If it halted at every junction while you asked the driver which way he was headed, then switched the track and restarted, enormous time would be lost. So what do you do? Guess!
If you guess right, the train never has to stop and just keeps going. If you guess wrong, it stops, backs up, the track is switched, and it sets off again.
So everything hinges on how accurately you guess: guess well and you're suddenly riding a motorcycle instead of a bicycle.
The answer then points out the assembly for the key code: a jump instruction. That jump is the train's junction, the moment a route must be chosen.
I won't belabor the rest; you can see it now: in the sorted array, once execution reaches values greater than 128, every later value is also greater than 128, so every branch prediction is correct and execution is highly efficient.
The unsorted array is in random order, so predictions frequently fail, and every misprediction flushes the instruction pipeline so the work must be redone; of course that is slower.
Hence the answer's opening line: you are a victim of branch prediction failure.
Finally, the fix the answer proposes is to drop the if altogether: if you can't beat the branch, avoid it, and implement the condition with bit operations instead. I won't analyze it in detail; here is the suggested branchless rewrite.
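The branchless idea, as I recall it from the answer, builds a mask with an arithmetic right shift instead of branching; sketched here alongside the branching version for comparison:

```java
// Branchless rewrite: (v - 128) >> 31 is an all-ones mask when v < 128
// (the subtraction goes negative) and all-zeros when v >= 128, so
// ~mask & v keeps exactly the values the if would have kept.
// Valid for the question's 0..255 data range.
public class BranchlessSum {
    static long withBranch(int[] data) {
        long total = 0;
        for (int v : data) if (v >= 128) total += v;
        return total;
    }

    static long branchless(int[] data) {
        long total = 0;
        for (int v : data) {
            int t = (v - 128) >> 31; // -1 if v < 128, else 0
            total += ~t & v;         // adds v when v >= 128, else adds 0
        }
        return total;
    }
}
```

With no branch to predict, the loop's speed no longer depends on whether the data is sorted.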
That concludes this dive into the Dubbo source code and CPU branch prediction; I encourage you to try the experiments yourself.