Several difficult problems of multi-core programming on network shunts, and their countermeasures
Rong Teng Network: with the arrival of multi-core CPUs, multi-core programming has landed on programmers' agendas. Many veteran programmers believe that multi-CPU machines have existed for a long time and that the industry has accumulated plenty of experience programming them, so multi-core programming should be similar: just draw on earlier experience with multitasking programming, parallel programming, and parallel algorithms.
What I want to say is this. In our industry, the multi-core processing boards used in network shunt collectors are collectively called business processing boards, but multi-core machines differ greatly from earlier multi-CPU machines. Multi-CPU machines were used in specific domains, such as servers, or in fields suited to large-scale parallel computing, where the advantages of multiple CPUs are easy to exploit. Multi-core machines, by contrast, now reach ordinary users at every level; client machines in particular use multi-core CPUs, and making client software exploit multi-core parallelism may not be as straightforward as it is for servers and domains built around large-scale parallel computing.
The problem of serialization
1) The speedup factor
When measuring the performance of a multiprocessor system, a commonly used metric is the speedup factor, defined as follows:
S(p) = (execution time using a single processor, with the best sequential algorithm) / (execution time using p processors)
2) Amdahl's law
In parallel processing there is Amdahl's law, expressed by the following equation:
S(p) = p / (1 + (p - 1) * f)
where S(p) is the speedup factor,
p is the number of processors, and
f is the proportion of the whole program's execution time taken by the serial part.
When f = 5% and p = 20, S(p) ≈ 10.256.
When f = 5% and p = 100, S(p) ≈ 16.8.
In other words, with just a 5% serial part, raising the processor count from 20 to 100 lifts the speedup factor only from about 10.256 to about 16.8: the number of processors grows fivefold, while speed improves by little more than 60%. Even with infinitely many processors, the limit of the speedup factor is only 1/f = 20.
If Amdahl's law held strictly, multi-core development would have almost no prospects: even if only 1% of the software cannot be parallelized, the maximum speedup factor is capped at 100, and no amount of additional CPUs can improve performance further. By this law, multi-core CPU development would drive Moore's law to its limit within a few years.
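To make the arithmetic above easy to reproduce, here is a minimal C++ sketch (my illustration, not from the original article) that evaluates Amdahl's formula for the figures just quoted:

    // Minimal sketch: evaluating Amdahl's law, S(p) = p / (1 + (p-1)*f),
    // for the serial fraction f and processor count p defined in the text.
    #include <cstdio>

    double amdahl(double f, double p) {
        return p / (1.0 + (p - 1.0) * f);
    }

    int main() {
        std::printf("f=5%%, p=20  -> S = %.3f\n", amdahl(0.05, 20));   // ~10.256
        std::printf("f=5%%, p=100 -> S = %.3f\n", amdahl(0.05, 100));  // ~16.807
        return 0;
    }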
3) Gustafson's law
Gustafson put forward an assumption different from Amdahl's to show that the speedup factor can exceed Amdahl's limit. Gustafson holds that the serial part of the software is fixed and does not grow with problem size, and assumes that the execution time of the parallel part is fixed (which may be the case for server software). Gustafson's law is described by the following formula:
S(p) = p + (1 - p) * fts
where fts is the proportion of serial execution time.
If the serial ratio is 5% and there are 20 processors, the speedup factor is 20 + (1 - 20) * 5% = 19.05.
If the serial ratio is 5% and there are 100 processors, the speedup factor is 100 + (1 - 100) * 5% = 95.05.
Under Gustafson's law, the speedup factor is almost proportional to the number of processors. If reality matched Gustafson's assumptions, software performance would keep rising with the number of processors.
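For comparison with the Amdahl sketch above, a matching sketch (again my illustration) evaluating Gustafson's formula reproduces the 19.05 and 95.05 figures:

    // Minimal sketch: Gustafson's law, S(p) = p + (1 - p) * fts,
    // where fts is the serial proportion as defined in the text.
    #include <cstdio>

    double gustafson(double fts, double p) {
        return p + (1.0 - p) * fts;
    }

    int main() {
        std::printf("fts=5%%, p=20  -> S = %.2f\n", gustafson(0.05, 20));   // 19.05
        std::printf("fts=5%%, p=100 -> S = %.2f\n", gustafson(0.05, 100));  // 95.05
        return 0;
    }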
4) Serialization analysis in practice
Given such a big gap between Amdahl's law and Gustafson's law, which one matches reality? Personally, I don't think reality is as pessimistic as Amdahl's law suggests, but nor is it as optimistic as Gustafson's. Why? Let's do a simple analysis.
First, we need to determine what content in the software cannot be parallelized in order to estimate the proportion of the serial part. In the 1960s, Bernstein gave three conditions under which computations cannot be carried out in parallel:
Condition 1: after C1 writes a memory cell, C2 reads that cell's data. This is called a "read after write" conflict.
Condition 2: after C1 reads a memory cell's data, C2 writes that cell. This is called a "write after read" conflict.
Condition 3: after C1 writes a memory cell, C2 writes that cell. This is called a "write after write" conflict.
Code meeting any of the three conditions above cannot be executed in parallel. Unfortunately, actual software is full of such cases; this is the familiar problem of protecting shared data with locks.
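As a concrete illustration (an assumed example, not from the original article), the classic shared counter runs into exactly these conditions: an increment is a read followed by a write, so two concurrent increments form the read-after-write and write-after-write cases above and must be serialized, here with a mutex:

    // Two threads incrementing a shared counter. ++counter is a read followed
    // by a write, so concurrent increments conflict and the lock serializes them.
    #include <mutex>
    #include <thread>
    #include <cstdio>

    long counter = 0;
    std::mutex counter_lock;

    void worker(int iterations) {
        for (int i = 0; i < iterations; ++i) {
            std::lock_guard<std::mutex> guard(counter_lock);  // serializes the update
            ++counter;
        }
    }

    int main() {
        std::thread t1(worker, 1000000);
        std::thread t2(worker, 1000000);
        t1.join();
        t2.join();
        std::printf("counter = %ld\n", counter);  // always 2000000 with the lock
        return 0;
    }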
Under a fixed number of tasks, the proportion of serialization caused by lock protection decreases as software size grows; unfortunately, it increases with the number of tasks. That is, the more processors there are, the more serious the serialization caused by lock contention becomes, so the serial proportion rises sharply as the processor count grows. The serialization caused by lock contention is analyzed in depth in the lock contention section later in this article. Serialization is therefore a hard problem in multi-core programming.
5) Possible solutions
For the serialization problem, the first solution is to use fewer locks, or even lock-free programming. For ordinary programmers, though, this is nearly an impossible task: lock-free algorithms are extremely complex and easy to misuse, and many lock-free algorithms published in professional journals were later proved wrong, which gives a sense of how difficult it is.
The second solution is to replace locks with atomic operations. In essence this does not remove serialization, but it greatly speeds up the serialized section, and so greatly reduces serialization's share of execution time. At present, however, chip manufacturers provide only very limited atomic operations that are usable in few places; they may need to keep working here, providing more and slightly more powerful atomic operations so that locks can be avoided in more places.
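As a sketch of this second solution (an assumed example), the counter from the previous sketch can be maintained with a hardware atomic increment; the update is still serialized across cores, but the serialized section shrinks from a whole lock/unlock round trip to a single instruction:

    // The same counter, updated with an atomic read-modify-write instead of a mutex.
    #include <atomic>
    #include <thread>
    #include <cstdio>

    std::atomic<long> counter{0};

    void worker(int iterations) {
        for (int i = 0; i < iterations; ++i)
            counter.fetch_add(1, std::memory_order_relaxed);  // single atomic instruction
    }

    int main() {
        std::thread t1(worker, 1000000);
        std::thread t2(worker, 1000000);
        t1.join();
        t2.join();
        std::printf("counter = %ld\n", counter.load());  // 2000000
        return 0;
    }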
The third solution is to reduce the serial proportion at the design and algorithm level, which may require finding practical parallel design patterns that reduce the use of locks. The industry has accumulated some experience here, such as the task decomposition pattern, the data decomposition pattern, and the data sharing pattern. As multi-core CPUs see large-scale use, more new and effective parallel design patterns and algorithms should emerge.
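For instance, here is a minimal sketch of the data decomposition pattern (an assumed example, not from the original article): each thread sums its own disjoint slice of an array into a private partial result, so no lock is needed until a tiny final combining step:

    // Data decomposition: disjoint slices, private partial sums, no locks.
    #include <numeric>
    #include <thread>
    #include <vector>
    #include <cstdio>

    int main() {
        std::vector<int> data(1 << 20, 1);
        const int nthreads = 4;
        std::vector<long> partial(nthreads, 0);
        std::vector<std::thread> pool;

        const size_t chunk = data.size() / nthreads;
        for (int t = 0; t < nthreads; ++t) {
            const size_t begin = t * chunk;
            const size_t end = (t == nthreads - 1) ? data.size() : begin + chunk;
            pool.emplace_back([&, t, begin, end] {
                partial[t] = std::accumulate(data.begin() + begin, data.begin() + end, 0L);
            });
        }
        for (auto& th : pool) th.join();

        long total = std::accumulate(partial.begin(), partial.end(), 0L);
        std::printf("total = %ld\n", total);  // 1048576
        return 0;
    }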
The fourth solution comes from chip design; since I know nothing about chip design, this may be just wishful guessing on my part. The main idea is to design new instructions at the chip level that, unlike previous single-core instructions, are parallel instructions completed by multiple CPU cores working together. Programmers could then call these parallel instructions and program as if writing serial code, while still making full use of multiple CPUs.
The load balancing problem
As we all know, load balancing is also a common function of network shunts. The section above discussed the serialization problem in multi-core programming; another problem in multi-core programming is load balancing.
On a multi-core CPU, to get the full performance of all the CPUs, the tasks assigned to each CPU must be well load-balanced. Otherwise some CPUs run while others sit idle, and the advantage of the multi-core CPU cannot be exploited.
There are usually two schemes for achieving good load balancing: static load balancing and dynamic load balancing.
1. Static load balancing
In static load balancing, the program must be divided by hand into multiple parallel-executable parts, and the divided parts must distribute evenly across the CPUs; in other words, the workload must be spread evenly over multiple tasks to achieve a high speedup factor. A minimal sketch follows.
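In this sketch of static load balancing (an assumed example, not from the original article), iteration costs grow with the index, so a cyclic, fixed-at-design-time assignment of iterations to threads spreads the expensive iterations evenly, where contiguous blocks would not:

    // Static load balancing: cyclic assignment of uneven iterations to threads.
    #include <thread>
    #include <vector>
    #include <cstdio>

    int main() {
        const int n = 1000, nthreads = 4;
        std::vector<long> work(nthreads, 0);       // per-thread total cost, no sharing
        std::vector<std::thread> pool;

        for (int t = 0; t < nthreads; ++t) {
            pool.emplace_back([&, t] {
                for (int i = t; i < n; i += nthreads)  // cyclic: i = t, t+4, t+8, ...
                    work[t] += i;                      // stand-in for cost-proportional work
            });
        }
        for (auto& th : pool) th.join();

        for (int t = 0; t < nthreads; ++t)
            std::printf("thread %d cost %ld\n", t, work[t]);  // four nearly equal totals
        return 0;
    }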
Mathematically, static load balancing is an NP-complete problem. From 1972 to 1983, Richard M. Karp, Jeffrey D. Ullman, Christos H. Papadimitriou, M. Garey, D. Johnson and others proved the NP-completeness of static load balancing under several different constraints.
Although NP-complete problems are hard in the mathematical sense, this is not the difficulty referred to in the title, because quite effective approximation algorithms can generally be found for NP-complete problems.
2. Dynamic load balancing
Dynamic load balancing distributes tasks during the running of the program to achieve balance. In practice, many problems cannot be handled by static load balancing: for example, in a large loop whose iteration count is read from external input and not known in advance, a static partitioning strategy can hardly achieve load balance. A minimal sketch follows.
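In this sketch of dynamic load balancing (an assumed example), threads pull the next iteration index from a shared atomic counter at run time, so the division of work adapts to an iteration count known only at run time:

    // Dynamic load balancing: threads claim the next iteration as they go.
    #include <atomic>
    #include <thread>
    #include <vector>
    #include <cstdio>

    int main() {
        const int n = 1000;                // imagine this arrives from external input
        std::atomic<int> next{0};
        std::atomic<long> total{0};
        std::vector<std::thread> pool;

        for (int t = 0; t < 4; ++t) {
            pool.emplace_back([&] {
                for (int i; (i = next.fetch_add(1)) < n; )
                    total.fetch_add(i);    // stand-in for one iteration's work
            });
        }
        for (auto& th : pool) th.join();
        std::printf("total = %ld\n", total.load());  // 499500
        return 0;
    }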
Task scheduling in dynamic load balancing is generally done by the system; programmers can only choose a dynamic balancing scheduling policy, not modify it. Because real tasks involve many uncertain factors, no scheduling algorithm handles everything well, so dynamic load balancing may sometimes fail to meet the intended balancing requirements.
3. Where does the difficulty of load balancing lie?
The difficulty of load balancing does not lie in the degree of balance itself: even if the execution times of the tasks assigned to the various CPUs differ somewhat, the total execution time still falls as the number of CPU cores grows, and so the speedup factor still rises with the core count.
The difficulty is that the many parallel-executable blocks in a program must be divided up by programmers. When the core count is small, say dual-core or quad-core, this division is not very hard. But as the core count increases, the granularity of the division becomes finer and finer; beyond 16 cores or so, programmers will likely be driven mad working out how to divide the tasks. Imagine manually dividing a piece of sequentially executed code into 128 tasks to run on a 128-core CPU; the difficulty speaks for itself.
Errors in load division are also magnified as the core count grows. For example, take a program that needs 16 time units and divide it into 4 tasks, averaging 4 time units of load per task. If the division is off by 1 time unit, the slowest task takes 5 units, and the speedup factor becomes 16 / (4 + 1) = 3.2, which is 80% of the ideal speedup of 4. Run instead on a 16-core CPU, where each task averages 1 time unit: if one task's division is off by just 0.5 time units, the slowest task takes 1.5 units, and the speedup factor becomes 16 / (1 + 0.5) = 10.67, only 66.7% of the ideal speedup of 16. As the core count grows further, this error amplification drives the ratio of actual to ideal speedup lower still.
The load division problem also surfaces when CPUs and software are upgraded. A load that is balanced on a 4-core CPU may become unbalanced again on 8 or 16 cores. The same goes for software upgrades: when features are added, the balance is destroyed and the load must be re-divided, which greatly increases the difficulty and hassle of software design.
Moreover, if locks are used, loads that appear balanced can become unbalanced through lock contention, as the lock contention analysis below shows.
4. Coping strategies for load balancing.
For software with a small amount of computation, the program runs fast even on a single-core CPU, and imperfect load balancing has little impact. In practice, it is large-scale, computation-heavy software that must consider load balancing; such software needs its load balanced across the cores to improve performance.
For large-scale software, one coping strategy is to develop macroscopic methods of dividing parallel blocks, partitioning at the level of the whole software system rather than, as traditionally done, parallel-decomposing a few local routines and algorithms, since local code can rarely be decomposed into more than a few dozen tasks.
Another coping strategy works at the tool level: compilation tools that help with manual decomposition of parallel blocks and with finding good decompositions. Intel has made some efforts in this direction, but the tools will need to become much more capable to cope with higher core counts.
Lock contention in multi-core programming
The section above on the difficult problems of multi-core programming mentioned that lock contention aggravates serialization as the CPU core count grows. This section analyzes lock contention in multi-core programming in depth.
For simplicity, consider a simple case: four equivalent tasks start at the same time, and each begins with an operation that requires lock protection, taking 1 time unit; the rest of each task takes 25 time units. (The original article reserved a figure here showing the tasks' run timeline, but it was never published.)
From that timeline, the first task runs straight through to the end without waiting, the second task waits 1 time unit, the third waits 2 time units, and the fourth waits 3 time units.
Three CPUs thus wait for a total of 6 time units. If, as at an OpenMP barrier, all tasks wait at the same point until every task has finished, the total running time equals the fourth task's 29 time units, giving a speedup factor of (1 + 4 * 25) / 29 = 101 / 29 = 3.48; even computed against the average completion time of the 4 tasks (27.5 time units), the speedup factor is 101 / 27.5 = 3.67. Calculated by Amdahl's law instead, the serial time in this application is 1 and the parallel-processed part totals 100 time units after it, so on a 4-core CPU the speedup factor is S(p) = p / (1 + (p - 1) * f) = 4 / (1 + 3 * 1/101) = 404 / 104 = 3.88.
This produces a strange result: once locks are used, the speedup factor is worse even than the one Amdahl's law predicts, let alone the one from Gustafson's law.
In fact, the lock contention among the four tasks above extends to a more general case. Suppose the lock-protected serialized time is 1, the parallelizable part takes time t on a single-core CPU, and there are p CPU cores. When p peer tasks run simultaneously, the total waiting time caused by lock contention is 1 + 2 + ... + (p - 1) = p * (p - 1) / 2.
The most time-consuming task takes p + t/p time units (it waits p - 1 units, holds the lock for 1 unit, then does t/p units of work).
Taking the most time-consuming task's time as the parallel running time, the speedup factor is:
S(p) = (t + 1) / (p + t/p) = p * (t + 1) / (p * p + t) (the speedup formula under lock contention)
This formula shows that under lock contention, with the core count fixed, the larger the parallelizable part, the greater the speedup factor; and with the parallelizable time fixed, the more CPU cores there are, the smaller the speedup factor.
Let's compute a few concrete examples to illustrate the effect of the formula:
With t = 100 and p = 4, the speedup factor = 4 * (100 + 1) / (4 * 4 + 100) = 3.48
With t = 100 and p = 16, the speedup factor = 16 * (100 + 1) / (16 * 16 + 100) = 4.54
With t = 100 and p = 64, the speedup factor = 64 * (100 + 1) / (64 * 64 + 100) = 1.54
With t = 100 and p = 128, the speedup factor = 128 * (100 + 1) / (128 * 128 + 100) = 0.78
These calculations show that once the core count passes a certain point, the speedup factor falls instead of rising: by 128 cores it is only 0.78, slower than running on a single-core CPU. The sketch below reproduces these numbers.
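A minimal sketch (my illustration) evaluating the lock-contention speedup formula derived above:

    // S(p) = p*(t+1) / (p*p + t), the speedup formula under lock contention.
    #include <cstdio>

    double contended_speedup(double t, double p) {
        return p * (t + 1.0) / (p * p + t);
    }

    int main() {
        const int ps[] = {4, 16, 64, 128};
        for (int p : ps)
            std::printf("t=100, p=%3d -> S = %.2f\n", p, contended_speedup(100, p));
        // prints 3.48, 4.54, 1.54, 0.78
        return 0;
    }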
In the example above, the lock-protected serial code runs when each task starts, but the conclusion is the same for lock-protected serial code called anywhere else within peer tasks.
Lock contention among peer tasks is very common in practice; in server software, for example, client requests are usually handled as peer tasks. If locks are used, the speedup factor can easily decline as the CPU core count rises, as described above.
In the past, server software usually ran on dual-CPU or quad-CPU machines, so the drop in speedup caused by lock contention was not obvious. In the multi-core era, as CPU core counts increase, this problem becomes very serious; the multi-core era thus poses a new challenge to programming, and earlier ideas about multitasking programming may not work for multi-core programming.
It is therefore unrealistic to simply regard multi-core programming as equivalent to earlier multitasking programming or parallel computing. The section on serialization put forward some countermeasures, but those countermeasures can only be realized if the industry keeps working at them.
Of course, since the multi-core CPUs currently on the market are dual-core and quad-core, it may be several years before CPUs with more than 16 cores enter the market at scale. The industry should be able to find better solutions to the lock contention problem among peer tasks in the next few years.
Multi-core service processing board for network shunt
Rong Teng's multi-core NPU processing board (Ezchip NPS400) uses an NPS400 with a main frequency of 800 MHz, 4K processing threads, and 48 GB of memory. It supports regular expression matching, a 5-level QoS scheduling policy, a large-capacity packet output buffer, 2M masked group rules, and 100 million exact five-tuple rules, with 400 Gbps of processing capacity and 200 Gbps of regular expression matching capacity. It can use the EC-X16 10-Gigabit daughter card or the EC-Z2X4 100G daughter card, and is intended for network traffic analysis, deep packet inspection, load balancing, online flow control, and similar applications.