Propositional composition: Dimension Tree interval search and IP packet Classification 12/05 Update SLTechnology News&Howtos

Propositional composition: Dimension Tree interval search and IP packet Classification

2025-12-05 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Network Security >

Shulou(Shulou.com)06/01 Report--

This topic is a bit big, and I have to strictly control the number of words, which is not as open as "propositional composition: thoroughly understanding the various lookup processes of the IP routing table in an IPv4 address tree". In fact, this composition is an extension of the section on interval search in the previous composition.

1.IP packet classification

According to several fields of the IP packet protocol header, also known as matching domain, packets are divided into a certain category, which is the core of IP packet classification.

In fact, the process of IP route lookup is a special case of IP packet classification, an extremely simple special case, where the matching domain is the destination IP address, and the category is the routing item or, more simply, the next hop. At this point, consider the source address Policy routing, and then add a matching domain, that is, the source IP address. What should the whole process do? This paper focuses on a multi-dimensional interval matching scheme, not talking about others such as HASH matching, nor about hardware algorithms.

two。 Interval search

I hope my "search process of HiPAC high-performance rule matching algorithm" and "play to high-performance super firewall nf-HiPAC" can help. In fact, the process of IP packet classification can be seen as a combination of interval search results of multiple matching domains.

In the general HiPAC algorithm, although the execution efficiency is very good, it may waste a lot of memory. It may be necessary to use the optimization of Grid of Tries data structure to optimize its storage space to remove redundant data.

Although the core algorithm I'm going to talk about IP packet classification should be over at this point, the theory behind packet classification is just beginning. Later, I will use an example to illustrate as clearly as possible, rather than using a unified mathematical derivation to illustrate the problem, in order to provide a perceptual picture.

3. The raising of a question

It is often not enough to know how to get an IP packet in a multidimensional continuous interval set, but it doesn't matter if you don't understand it. If you do know this, then you must know what it is and why.

Now, I assume that through the two articles on nf-HiPAC mentioned above, you already know the process of multi-dimensional interval matching, so we can break away from the specific scenario and abstract it into a general problem. First, take a look at the abstract diagram. In this diagram, I ignored the size of the interval and ignored the problem of Rule arrangement (which I will return to eventually):

In this picture, we see a lot of "?" Number, which means that we don't know what the Ruleset associated with this interval is at this time. Why is that?

Let's take a look at the second layer, that is, the layer of Dimension 2. There are four nodes in this layer, corresponding to the four intervals split by Dimension 1. If you want the matching to continue, that is, to continue to match the nodes on the layer of Dimension 3, there must be intersection in the corresponding Ruleset, so we get a clearer picture below, although it is a bit messy. While writing this article, I somehow thought of Harbin and wanted to cry.

According to this diagram, we have found the following three important key points:

1)。 Key operation: Ruleset needs to take intersection

When the Ruleset of the previous interval intersects the Ruleset of each interval of the next dimension, the branch will be expanded into a child node when the result set is not empty. I have added some constraints related to the scene, such as there must be at least one Default Rule in each interval, so all the results of taking the intersection will not be empty, so the above sentence should be that if the number of elements in the result set is 1 and the element is Default Rule, the next matching dimension corresponding to the current interval will be extended to a leaf node. This means that the subtree will not be extended in the next dimension, that is, there is no need for matching to continue, just take Default Rule.

2)。 Key trade-off: rapid success and rapid failure

As the key operation shows, as long as the matching continues, that is, the number of Rule in the remaining Ruleset in the relevant current interval is not 1 or 1 but not Default Rule, then we cannot determine the final result, which means that rapid success is impossible. But we can quickly determine when we will fail, that is, there is only one Default Rule left in the Ruleset of the current range. This is quick failure.

3)。 Key question: how to lay out the tree

As can be seen from the diagram above, because Ruleset is an intersection, who is in front of and who is behind in Dimension on 1-2-3 does not affect the result in the end, but the difference in this sort can affect the occupation of space. As can be seen from the above figure, as long as there is intersection in Ruleset, it is necessary to create a child node at the lower level, which is actually creating an interval search tree, and we know in rapid failure. All subtrees may have different heights due to different Ruleset layouts. The question is, how do we arrange the search order of the dimensions?

4. The difficulty lies

Don't expect to balance the tree above, because the factors that affect the height and width of its subtree are interrelated: the number of Rule, the number of intervals in each dimension, the internal layout of the Ruleset, and so on. Any "rotation" of the tree will cause it to be called a mess, it is not a tripod.

Since there is no good way to calculate how to optimize the matching order of these dimensions, it is better to look for a more balanced way to "make the tree" in a statistical sense.

5. From M-tree to binary tree

So far, we have always thought that the so-called multi-dimensional interval matching is an M-cross tree, where M is the number of intervals divided by that layer, that is, the dimension. Through the above analysis, we find that it is difficult to find the best construction method in such a fat tree with uncertain height of each branch, because there are too many variables, the number of Rule in Ruleset is a variable parameter, the number of intervals is also a variable parameter, and the Ruleset of each interval of each dimension has to be union, which involves the problem of Cartesian product. There are too many mathematical calculations involved, and there is no simple formula.

Does the multi-dimensional interval matching tree have to be constructed into an M-ary tree in the first place?

Before I go on, I'll show the above multi-dimensional matching problem in a more intuitive way, and then start from 0 until we encounter the KD tree, and the rest of the content, data structures, and algorithm-related books will be available.

5.1. Multidimensional matching cube

Fortunately, I take 3D domain matching as an example, otherwise I really don't know how to draw a four-dimensional cube.

I describe the above 3D domain matching problem as a cube:

Note that each black point is connected along the axis into line segments and then the surface formed by these segments divides the cube into 4-3-2 sub-cubes. Our problem is that after the final matching is completed, we will fall into a sub-cube, and the current Ruleset in this sub-cube is the result set, and we take the highest priority, that is, the final result. The Ruleset in these subcubes is put in when constructing this cube, and like the propositional composition on routing lookup, I do not consider the time complexity of relatively rare data structures to construct events. Therefore, I put the Ruleset schematic into the subcube, as shown in the following figure:

So here's the problem. How do I cut this cube and finally get the desired sub-cube? How to cut the first knife? Along the D1 axis? D2 axis? D3 axis? If the D1 axis is determined, what about the second knife? Continue the D1 axis? Or do you want to cut in another direction? ...

5.2. Cutting mode

It makes people think of cutting watermelons, but in fact it's completely different, similar to the small piece you take from the center with a large piece of wood. You have two ways: first cut it into pieces, then cut it into sticks, and finally chop the sticks into small pieces; there is another way that is similar to that of machine tools, constantly changing direction and cutting. For the multi-dimensional matching problem, we are faced with the same choice.

5.2.1. Dimension by dimension depth first

This approach is similar to the slice-stick-block approach, as shown in the following figure:

5.2.2. Dimension by dimension breadth priority

This is similar to the way it is machined, as shown in the following figure:

So, it's time to give the data structure. In the interval search section of my composition on routing lookup, I give the structure and construction of the interval binary search tree. This paper makes a basic assumption on this premise.

5.3. Depth first search tree

The structure of the depth-first search tree is as follows:

It can be seen that this is a direct way, especially suitable for accurate matching. When I mention Dimension Tree or Dimtree, I'm referring to this tree.

5.4. Breadth first search tree

The breadth-first search tree structure is as follows:

As you can see, this is a way of taking a chance, gradually (with the finest possible granularity, gingerly) approaching the subcube space that will eventually fall. Of course, this way is actually the result of layer-by-layer rotation of the Dimension Tree, and you can finally get the answer. After touching the leaf node of a certain dimension, press the Ruleset into a stack. Finally, at the end of the tree tour, take out the Ruleset on the inner side of the stack for intersection, and then take the best one. Of course, you can also prune the branches with quick failures, such as reaching a leaf node that contains only Default Rule. But how long will it take to find the leaf node for the first time? If the search tree of the first dimension has three layers, the second dimension has two layers, and the third dimension has two layers, then the breadth-first tree has the following layers: first dimension-second dimension-third dimension-first dimension-second dimension (first encounter leaves)-third dimension (leaf node)-first dimension. In any case, even if you start cutting from the second dimension, it is impossible to meet the leaves on the second floor. This means that even if the test fails quickly, it should be delayed over depth.

5.5. Compare

For packet exact matching and classification, depth-first is better than breadth-first, which not only means that no close match can be counted without exact interval matching (for example, 192.168.1.18, then 192.168.1.19 obviously does not meet the requirements, although it is very close), but also that this way can maximize the use of system Cache.

Note, do not think that the packet mask matching is not an exact match, this paper is strictly based on the interval matching, the mask divides a domain into several exact intervals, and the ultimate goal is to locate the exact interval in which a certain domain of a packet falls, note that this is a kind of exact matching.

The KD tree, that is, the k-dimension tree, adopts the above matching method of dimension-by-dimension breadth priority. However, its dimension ranking idea can be used for depth-first matching, and the ultimate goal is to build a tree of "quick failure if mismatch"!

Note: k-dimension tree

It looks a lot like a B-tree, but it's not!

There is too much information about KD trees on the Internet. What I want to say is how variance is used in its recursive construction. Now let's answer the question of where the first knife will be cut.

I take the interval length as the benchmark, then each dimension can calculate a variance value of the interval length, that is:

Take the dimension with the largest value as the first dimension, and then use the simple binary construction, take the point closest to the midpoint in the complete set of the dimension interval as the root, and begin to recursively build the binary tree. The variance of the subdomain is calculated in each layer, and the dichotomous root of the subtree is selected according to the principle of maximum variance. From the breadth-first matching above, we can see that the graph should be a simple KD tree.

The purpose of adopting the idea of minimum variance is to make the tree more balanced. A large variance means that the interval distribution in this dimension is more uniform, and the marking points are not concentrated near the average, so that a large piece will not be cut off at once, thus missing those areas that may be closer to the final answer (the approximation principle of the final result: the cutting granularity is as small as possible, you must be careful. Like pear sharpening or pencil sharpening. Note that the goal of the KD tree is not an exact match, but a fuzzy best match, such as the classic search for K nearest points.

Dimension ordering of 6.Dimension Tree

Since depth-first matching is not constructed recursively, it is not necessary to calculate the variance in every step to determine the cutting dimension. for depth-first matching, as long as a certain dimension is determined, it will be cut directly on this dimension. for three-dimensional cases, it is directly cut into a piece, for two-dimensional cases, it is cut into a stick. There is no need to consider this dimension for the next cut, because in this dimension, it has precisely reached its interval. Therefore, we only need to find a sequence of dimensions.

Is it still based on variance like a KD tree? Not necessarily. What we want at this time is to let Rule distribute the most unevenly in each interval as the first dimension, and so on. This is because when the Rule distribution is uneven and intersects with the union set of the next dimensional Ruleset, the probability of rapid failure caused by a unique Default Rule in the corresponding interval after the intersection calculation is greater, which will reduce the number of child nodes. For an M-tree, the more you cut off the children of the nodes close to the root, the more you can restrain the excessive growth of the lower nodes. We decide whether to extend the child's formula as follows:

Current dimension interval Ruleset & (union of each Ruleset of next dimension)

It is difficult for us to control the union of each interval Ruleset of the next dimension, we don't even know which next dimension is, but we can find the Ruleset distribution of the current dimension. To take the simplest example, a dimension is divided into four intervals. There are 10 Ruleset in the first interval and 2 in the last three intervals. Then only 2 Rule intervals are likely to be empty when they intersect with others, which gives us a basis for calculation. Obviously, we do not use the variance of the KD tree this time, but express the weights as the number of intervals. The function of the distribution of the interval Rule (this variable can be used with or without variance), so as to calculate the final ranking. The simplest calculation is to use variance to calculate the variance of the number of Rule in each dimension interval, taking the minimum value as the highest weight, which shows that the distribution of Rule in one interval is more concentrated than other intervals, so it is more likely to cut off the child's umbilical cord and lead to "rapid failure"!

7. Interval covering problem

Is that a question? It's not a problem.

In this article, we take the interval as the benchmark, so we can split the Ruleset. To take the simplest example, the interval 192.168.1.0 with Rule1,2 is covered by the interval with Rule1,3 from 192.168.0 to 16, but this is not a problem, as shown in the figure below:

8.Dimension Tree parallel matching

Don't hang yourself from a tree!

Do we have to cram the entire matching data structure into a single tree? If we have to do this, we lose the possibility of parallel operation, because there is only one path from the root to the leaf of a tree, and given the large number of branches in the lower layer, it is almost impossible to guess the branches correctly. So what should I do? Since the final result is the intersection of the matching interval Ruleset of N dimensions (in our example, N is 3), then it is easy to build N trees for N dimensions, N processing processes operate at the same time, and finally take the intersection. Let's go back to the original picture and click on the title:

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.