2025-02-24 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/02 Report--
This article walks through a practical example of C algorithm performance analysis: benchmarking a simple algorithm across several language runtimes and digging into the generated machine code to explain the results.
Recently I have been developing Dynvm, a general-purpose dynamic language runtime. Like any good language runtime project, development is driven by benchmarks, so I have been using benchmark programs to test a variety of algorithms written in different languages to get a sense of their typical speed. A classic test is iterating the Fibonacci sequence. For simplicity, I compute it modulo 2^64 and implement the algorithm in two languages.
The Python implementation is as follows:

```python
def fib(n):
    SZ = 2 ** 64
    i = 0
    a, b = 1, 0
    while i < n:
        t = b
        b = (b + a) % SZ
        a = t
        i = i + 1
    return b
```

The C implementation is as follows:

```c
#include <stdio.h>
#include <stdlib.h>

typedef unsigned long ulong;

int main(int argc, char *argv[])
{
    ulong n = atoi(argv[1]);
    ulong a = 1;
    ulong b = 0;
    ulong t;

    for (ulong i = 0; i < n; i++) {
        t = b;
        b = a + b;
        a = t;
    }

    printf("%lu\n", b);
    return 0;
}
```

Code samples in other languages are available on GitHub. Dynvm includes a benchmarking framework that allows running-speed comparisons across languages. On an Intel i7-3840QM (clocked down to 1.2 GHz), with n = 1,000,000, the results are:

```text
=======================================================
Language           Time (seconds)
=======================================================
Java               0.133
C Language         0.006
CPython            0.534
Javascript V8      0.284
```

C is clearly the winner here, but the Java result is somewhat misleading, because most of its time is taken by JIT compiler startup (~120 ms). With n = 100,000,000 the picture becomes clearer:

```text
=======================================================
Language           Time (seconds)
=======================================================
Java               0.300
C Language         0.172
CPython            47.909
Javascript V8      24.179
```

Here we explore why C still matters in 2013, and why the programming world will not completely "defect" to Python or V8/Node. Sometimes you need raw performance, and dynamic languages still struggle there, even on an example as simple as this one. I personally believe this will be overcome, and several projects give great hope in this direction: the JVM, V8, PyPy, LuaJIT, and so on. But in 2013 we have not yet arrived.

Still, we cannot avoid the question: why is the gap so large? There is a 278.5x performance difference between the C and Python versions! The most incredible part is that, syntactically, the inner loops of the C and Python versions above are essentially identical.

To find the answer, I pulled out the disassembler and found the following:

```text
0000000000400480 :
247 400480: 48 83 ec 08             sub    $0x8,%rsp
248 400484: 48 8b 7e 08             mov    0x8(%rsi),%rdi
249 400488: ba 0a 00 00 00          mov    $0xa,%edx
250 40048d: 31 f6                   xor    %esi,%esi
251 40048f: e8 cc ff ff ff          callq  400460
252 400494: 48 98                   cltq
253 400496: 31 d2                   xor    %edx,%edx
254 400498: 48 85 c0                test   %rax,%rax
255 40049b: 74 26                   je     4004c3
256 40049d: 31 c9                   xor    %ecx,%ecx
257 40049f: 31 f6                   xor    %esi,%esi
258 4004a1: bf 01 00 00 00          mov    $0x1,%edi
259 4004a6: eb 0e                   jmp    4004b6
260 4004a8: 0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
261 4004af: 00
262 4004b0: 48 89 f7                mov    %rsi,%rdi
263 4004b3: 48 89 d6                mov    %rdx,%rsi
264 4004b6: 48 83 c1 01             add    $0x1,%rcx
265 4004ba: 48 8d 14 3e             lea    (%rsi,%rdi,1),%rdx
266 4004be: 48 39 c8                cmp    %rcx,%rax
267 4004c1: 77 ed                   ja     4004b0
268 4004c3: be ac 06 40 00          mov    $0x4006ac,%esi
269 4004c8: bf 01 00 00 00          mov    $0x1,%edi
270 4004cd: 31 c0                   xor    %eax,%eax
271 4004cf: e8 9c ff ff ff          callq  400470
272 4004d4: 31 c0                   xor    %eax,%eax
273 4004d6: 48 83 c4 08             add    $0x8,%rsp
274 4004da: c3                      retq
275 4004db: 90                      nop
```

(Translator's note: 1. Different compiler versions and different optimization options (-Ox) will produce different assembly code. 2. Disassembly method: gcc -g -O2 test.c -o test; objdump -d test > test.txt. Readers can try changing the optimization options and compare the disassembly results.)
The most important part is the inner loop that computes the next Fibonacci value:
```text
262 4004b0: 48 89 f7       mov    %rsi,%rdi
263 4004b3: 48 89 d6       mov    %rdx,%rsi
264 4004b6: 48 83 c1 01    add    $0x1,%rcx
265 4004ba: 48 8d 14 3e    lea    (%rsi,%rdi,1),%rdx
266 4004be: 48 39 c8       cmp    %rcx,%rax
267 4004c1: 77 ed          ja     4004b0
```
The variables are allocated to registers as follows:

```text
a: %rsi    b: %rdx    t: %rdi    i: %rcx    n: %rax
```
Lines 262 and 263 implement the variable swap, and line 264 increments the loop counter. Line 265 may look strange, but it implements b = a + t: the lea instruction is used here as a cheap three-operand add. Then a simple comparison follows, and a jump instruction branches back to the top of the loop to continue execution.
Manually decompiling the code above yields something like this:
```text
loop:
    t = a
    a = b
    i = i + 1
    b = a + t
    if (n > i) goto loop
```
The entire inner loop is implemented in only six x86-64 instructions (and most likely about the same number of internal micro-operations. Translator's note: Intel x86-64 processors translate instructions into micro-operations internally, so the CPU executes more micro-operations than there are assembly instructions). By contrast, the CPython interpreter needs at least six instructions just to dispatch each high-level bytecode instruction. And beyond that, there are other, more subtle factors.
One of the main things that slows down execution on modern processors is access to main memory. Its impact is so severe that countless engineering hours in microprocessor design are spent finding effective techniques to "hide" memory access latency. General strategies include caching, speculative prefetching, load-store dependence prediction, out-of-order execution, and so on. These methods do a great deal to make the machine faster, but they cannot eliminate memory accesses altogether.
In the assembly code above, memory is never touched: the variables live entirely in CPU registers. Now consider CPython: everything is an object on the heap, and all methods are dynamic. Dynamic behavior is so pervasive that there is no way to know in advance whether a + b executes integer_add(a, b), string_concat(a, b), or a user-defined function. This means a lot of time is spent figuring out, at run time, which function to call. Dynamic JIT runtimes try to discover this information at run time and generate x86 code on the fly, but that is not always straightforward (I expected the V8 runtime to do better here; strangely, it is only about twice as fast as CPython). Because CPython is a pure interpreter, a great deal of time is spent resolving these dynamic features on every iteration of the loop, which requires many otherwise unnecessary memory accesses.
Beyond that, there are other factors. Modern superscalar out-of-order processor cores fetch several instructions into the processor at once and execute them "whenever convenient": as long as the results remain correct, the instructions may execute in any order. These processors can also execute multiple instructions in parallel in the same clock cycle, as long as the instructions are independent. An Intel Sandy Bridge CPU can have up to 168 instructions in flight for reordering, can issue (that is, begin executing) up to 6 instructions per cycle, and can retire (that is, complete) up to 4 instructions per cycle! Applied roughly to the Fibonacci example above, the processor can hold about 28 inner-loop iterations in flight and complete nearly one iteration per clock cycle! That sounds impressive, but as with everything else, the details are complicated.
All of this assumes the in-flight instructions are independent, so the processor can find enough work to exploit reordering. Reordering instruction streams that contain branch instructions, i.e. the assembly generated for if-else and loops, is very tricky (an if-else must jump after its test, so it necessarily contains a branch). The typical approach is branch prediction: the CPU uses the outcomes of previously executed branches to guess the outcomes of branches it is about to execute, and fetches the instructions it "thinks" will run. The guess may be wrong, however, and when it is, the CPU enters a reset mode: it discards the fetched instructions and starts fetching again. This reset can have a large impact on performance, so correctly predicting branches is another area where enormous engineering time is spent.
Now, not all branches are equal: some are highly predictable, while others are nearly impossible to predict. The branch in the loop of our earlier example, line 267 of the disassembly, is among the most predictable of all: it jumps backwards 100,000,000 times in a row.
Here is an example of a branch instruction that is very difficult to predict:
```c
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    for (int i = 0; i < 1000000; i++) {
        int n = random();
        if (n >= 0) {
            printf("positive!\n");
        } else {
            printf("negative!\n");
        }
    }
    return 0;
}
```
If random() were truly random (in C it is actually far from it), then predicting this if-else would be no better than guessing. Fortunately, most branches are not this badly behaved, but few are as well behaved as the loop branch in our earlier example.
Back to our example: the C implementation of the Fibonacci sequence produces only one, very predictable, branch. The CPython version, by contrast, fares much worse. To begin with, every pure interpreter has a "dispatch" loop, something like this:
```c
void exec_instruction(instruction_t *inst)
{
    switch (inst->opcode) {
    case ADD:   /* do add */
    case LOAD:  /* do load */
    case STORE: /* do store */
    case GOTO:  /* do goto */
    }
}
```
No matter how the compiler handles this code, at least one indirect jump instruction is required, and indirect jumps are much harder to predict.