How to understand hash and bucket in Linux kernel

2025-04-17 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article explains in detail how to understand hash and bucket in the Linux kernel. I hope it helps resolve your doubts.

A hash table (Hashtable), also known simply as a "hash", is a collection of key (Key) and value (Value) pairs that are organized according to the hash code of the key. A Hashtable object consists of buckets (Bucket) that contain the elements of the collection. A bucket is a virtual subgroup of elements within the Hashtable, which makes searching and retrieval easier and faster than in most other collections.

The hash function (Hash Function) is an algorithm that returns a numeric hash code based on a key. The key (Key) is the value (Value) of some attribute of the stored object. When an object is added to the Hashtable, it is stored in the bucket associated with the hash code that matches the object's hash code. When a value is searched for in the Hashtable, the hash code is generated for that value and the bucket associated with that hash code is searched. For example, student and teacher are placed in different buckets, while dog and god are placed in the same bucket (as happens with a hash that, for instance, adds up the character codes, since both words contain the same letters). So a Hashtable performs best when each key has a unique hash code, so that each bucket holds only one element. The four advantages of hashing are as follows.

1. No sorting is required in advance.

2. On average, the search speed is largely independent of the amount of data.

3. Hash functions underpin the cryptography of digital signatures, which offers high security.

4. Data compression (Data Compression) can be done to save space.
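The dog/god and student/teacher example above can be reproduced with a toy hash that adds up the character codes of the key. This is only a sketch for illustration: the names char_sum_hash and bucket_index and the bucket count passed in are assumptions, not anything taken from the kernel or from a particular library.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy hash: adds up the character codes of the key.
 * Anagrams such as "dog" and "god" always get the same hash code. */
static unsigned int char_sum_hash(const char *key)
{
    unsigned int sum = 0;

    assert(key != NULL);
    while (*key)
        sum += (unsigned char)*key++;
    return sum;
}

/* Map the hash code of a key to one of nbuckets buckets. */
static unsigned int bucket_index(const char *key, unsigned int nbuckets)
{
    assert(nbuckets > 0);
    return char_sum_hash(key) % nbuckets;
}
```

With 16 buckets, "student" lands in bucket 7 and "teacher" in bucket 12, while "dog" and "god" both land in bucket 10. A hash this simple collides on every anagram, which is why real hash functions mix the input far more thoroughly.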

Hash tables are widely used in the Linux kernel, and most of the language features in the PHP kernel are also built on hash tables. Why is the hash table so powerful? It stores and looks up data efficiently, and storing and looking up are the two most common operations in programming.

Hash table in Linux kernel

Anyone who has read the Linux kernel source code may notice that it contains few complex data structures: the doubly linked list (list), as the basic data structure, and the hash table built on top of list account for most of them. Why does the kernel rely so heavily on these two? First of all, both are very simple, in two senses: easy to understand and easy to use. This makes the code more readable and maintainable than it would be with more complex data structures, and lowers the risk of bugs. Philosophically, this is also in line with the K.I.S.S. principle.

Secondly, the kernel is software that cares a great deal about performance, and giving up performance for the sake of simpler programming and maintenance could cost more than it gains. Should the balance tip further toward performance? I can't remember where I heard that many commercial routing products store routing entries in a binary tree so that a route lookup takes O(log n) time; the speaker criticized Linux for organizing its routing entries as hash tables, which supposedly performs poorly and is unsuitable for commercial use. There is some truth in this, but on careful analysis, is a hash table really slower than a binary tree? Inserting or deleting an item in a balanced binary tree takes O(log n) time; inserting or deleting in a hash table takes O(1) at best and O(n) at worst. If enough buckets (m) are chosen and the hash function is good enough, the time complexity is O(n/m): when m > n/log(n), the average performance of the hash table is better than that of the binary tree, and when m >= n the time complexity approaches O(1). The value of m can be made adjustable, which shows the customizability of the kernel. However, don't be blindly optimistic: it all depends on a good enough hash function.
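The comparison between n/m and log(n) can be checked with a few lines of C. This is a rough sketch under stated assumptions: it assumes a hash function that spreads n items perfectly evenly over m buckets, it uses base-2 logarithms rounded down, and the function names are hypothetical.

```c
#include <assert.h>

/* floor(log2(n)) for n >= 1: roughly the steps of a balanced tree lookup. */
static unsigned int floor_log2(unsigned long n)
{
    unsigned int bits = 0;

    assert(n >= 1);
    while (n >>= 1)
        bits++;
    return bits;
}

/* Average chain length n/m, rounded up, assuming a perfectly even spread. */
static unsigned long avg_chain_len(unsigned long n, unsigned long m)
{
    assert(m > 0);
    return (n + m - 1) / m;
}

/* Nonzero when walking an average chain is cheaper than a tree lookup. */
static int hash_beats_tree(unsigned long n, unsigned long m)
{
    return avg_chain_len(n, m) < floor_log2(n);
}
```

For n = 1048576 items, log2(n) is 20. With m = 65536 buckets the average chain holds 16 items, so the hash table comes out ahead; with m = 1024 buckets the chains average 1024 items and the tree wins.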

What makes a hash function good or bad

How do we tell whether a hash function is good or bad? The Chinese term for hash can be read as "scattered arrangement". A good hash function should distribute all elements evenly, avoiding or minimizing collisions (Collision) between them. It bears repeating that the hash function must be chosen carefully: if, unluckily, all elements collide, the hash table degenerates into a linked list and lookup time degrades to O(n). Do not count on luck here, because that is quite dangerous. There was once a vulnerability in which an attacker, knowing the hash function the Linux kernel used, constructed a large number of elements that collide in a hash table and caused a denial of service (DoS). So most hash functions in the kernel now take a random number as an additional parameter, so that the resulting value cannot be predicted, or at least not easily. This gives a second, security-related requirement for the hash function: it should preferably be one-way and should be mixed with a random number.

Speaking of one-way functions, you may think of the one-way hash functions md4 and md5. Unfortunately they are not suitable, because the hash function also needs to be fast. The jhash used in the Linux kernel is a tried and tested hash function that you can simply CPMS (Copy, Paste, Modify, Save). Bob Jenkins, the author of jhash, has also published a series of other hash functions on his website, such as a perfect hash function for data that is known in advance.

An English explanation of buckets: Hash table lookup operations are often O(n/m) (where n is the number of objects in the table and m is the number of buckets), which is close to O(1), especially when the hash function has spread the hashed objects evenly through the hash table, and there are more hash buckets than objects to be stored.
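The idea of mixing a random number into the hash can be sketched as follows. Note the assumptions: this is not the kernel's jhash. It swaps in the much simpler FNV-1a byte hash and XORs a caller-supplied seed into its starting value. The names seeded_hash and seeded_bucket are hypothetical, and FNV-1a by itself should not be expected to resist a determined attacker.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* FNV-1a over the key bytes, with a seed mixed into the initial state.
 * Illustration only: stands in for a seeded hash such as jhash. */
static uint32_t seeded_hash(const void *key, size_t len, uint32_t seed)
{
    const unsigned char *p = key;
    uint32_t h = 2166136261u ^ seed;   /* FNV offset basis XOR seed */
    size_t i;

    assert(key != NULL || len == 0);
    for (i = 0; i < len; i++) {
        h ^= p[i];
        h *= 16777619u;                /* FNV prime */
    }
    return h;
}

/* Bucket index for a key in a table of nbuckets buckets. */
static uint32_t seeded_bucket(const void *key, size_t len,
                              uint32_t seed, uint32_t nbuckets)
{
    assert(nbuckets > 0);
    return seeded_hash(key, len, seed) % nbuckets;
}
```

If the seed is picked at random when the table is created and kept secret, an outsider who knows the algorithm but not the seed cannot easily precompute which keys will share a bucket.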

It can be understood as follows:

The address that one hash result maps to can hold two buckets, which resolves hash conflicts.

The first time a hash lands on this address, the data is stored in the first bucket.

When a second hash lands on the same address for some reason, the data is stored in the second bucket.

Figure: a hash table of five buckets with seven elements.
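A table of five buckets holding seven elements can be sketched in C with one chain per bucket. This is a minimal illustration, not the kernel's list-based implementation: the struct and function names are hypothetical, and the hash is simply the key modulo the number of buckets.

```c
#include <assert.h>
#include <stddef.h>

#define NBUCKETS 5

struct node {
    unsigned int key;
    struct node *next;               /* next element in the same bucket */
};

struct table {
    struct node *buckets[NBUCKETS];  /* each bucket is the head of a chain */
};

static unsigned int bucket_of(unsigned int key)
{
    return key % NBUCKETS;
}

/* Push a caller-owned node onto the front of its bucket's chain. */
static void table_insert(struct table *t, struct node *n)
{
    unsigned int b = bucket_of(n->key);

    n->next = t->buckets[b];
    t->buckets[b] = n;
}

/* Walk only the chain of the bucket the key hashes to. */
static struct node *table_find(const struct table *t, unsigned int key)
{
    struct node *n;

    for (n = t->buckets[bucket_of(key)]; n != NULL; n = n->next)
        if (n->key == key)
            return n;
    return NULL;
}

/* Number of elements currently chained in bucket b. */
static size_t chain_length(const struct table *t, unsigned int b)
{
    const struct node *n;
    size_t len = 0;

    assert(b < NBUCKETS);
    for (n = t->buckets[b]; n != NULL; n = n->next)
        len++;
    return len;
}
```

Inserting the seven keys 10 through 16 puts two elements in each of buckets 0 and 1 and one element in each of the other three, so a lookup never walks more than two nodes.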

Linux's hash functions such as hash_long are computed by multiplying with a constant derived from the golden ratio. Because the caller must decide the number of bits, and therefore the number of buckets, from the hash function and the expected collision rate, how do we determine the number of buckets for a hash function like hash_long? In general, the hash algorithm should be chosen according to the characteristics of the data; there is no one-size-fits-all answer. For example, for a hash table that stores IP addresses, it works well to use 65536 buckets with the low 16 bits of the IP address as the key. For such data, the collision rate of this method should be lower than that of hash_long, jhash and similar functions. In practice, it is a trade-off between space and performance. I could size the table to the maximum of my problem space, which is sure to keep the key values dispersed, but it would waste a lot of space. Make the table too small, however, and search efficiency suffers. The right size still has to be found by experiment. And where a hash table is more flexible than other search data structures is its customizability: it can be tuned to the specific situation to achieve the best effect.
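Both approaches can be sketched in a few lines of C. The first function is modeled on the multiplicative scheme of the kernel's 32-bit hash: multiply by a golden-ratio constant and keep the top bits. The constant 0x61C88647 is the value I believe recent kernels define in include/linux/hash.h, so treat it as an assumption, and the names golden_hash32 and ip_low16_bucket are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed golden-ratio constant, as in recent include/linux/hash.h. */
#define GOLDEN_RATIO_32 0x61C88647u

/* Multiply by the constant and keep the top `bits` bits,
 * giving an index into a table of 2^bits buckets. */
static uint32_t golden_hash32(uint32_t val, unsigned int bits)
{
    assert(bits >= 1 && bits <= 32);
    return (uint32_t)(val * GOLDEN_RATIO_32) >> (32 - bits);
}

/* Data-specific alternative: use the low 16 bits of an IPv4 address
 * (in host byte order) directly as the index into 65536 buckets. */
static uint32_t ip_low16_bucket(uint32_t ip_host_order)
{
    return ip_host_order & 0xFFFFu;
}
```

For example, golden_hash32(1, 16) is 0x61C8, the top 16 bits of the constant itself. With ip_low16_bucket, 192.168.1.1 (0xC0A80101) and 10.0.1.1 (0x0A000101) both map to bucket 0x0101, so this scheme only pays off when the addresses stored really do differ in their low 16 bits, which is the kind of data characteristic worth testing by experiment.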

