How to implement Java hash table

2025-01-21 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/01 Report--

This article explains how to implement a hash table in Java. The approach described here is simple, fast, and practical, so let's walk through it.

Introduction

Arrays offer fast addressing but slow insertion and deletion, while linked lists are the opposite: slow addressing but fast insertion and deletion. Tree structures search from the root node downward, comparing the search value against the keys stored along the way; although their overall efficiency across searching, insertion, and deletion is good, a search still requires multiple comparisons. The hash table is introduced to push search efficiency, and the combined efficiency of insertion and deletion, further still.

1 Hash table overview

1.1 Overview

Among the search algorithms covered earlier, the sequential-table searches include simple sequential search, binary search, interpolation search, and Fibonacci search; the tree-structure searches include the binary sort tree, balanced binary tree, multi-way search tree, red-black tree, and so on. They share one functional trait: the element being sought must be compared against existing elements, usually several times, before we can finally conclude whether it exists.

These comparisons gradually narrow down to an exact location, and most of the algorithms above require the data to be stored in order. By comparing two values, the algorithm shrinks the search range until it either finds an equal value, showing the element exists, or exhausts the range without finding one, showing it does not.

But why must we "compare" at all? Can we obtain the memory storage location of the desired record directly from its keyword key? We can, and that is exactly what a hash table does.

A definite functional relationship f is established in advance between each data item (whether a key-value pair or a single key) and its storage location, so that every keyword key corresponds to a storage location f(key). That is, storage location = f(key). The mapping f is called the hash function, and a data structure that stores data via a hash function is called a hash table. The process of computing a storage location with f(key) is called hashing, and the resulting location is called the hash address.

Hash tables are usually implemented on top of arrays. When storing data, the hash function f(key) computes from the key the array subscript where the data should live, scattering different data across different storage locations, which is where the name "hash" comes from. When searching, f(key) again yields the array subscript of the search value directly, with no element-by-element comparison: we "know" the location of the key in advance and fetch the data directly, improving efficiency. An array location that stores an element in a hash table is also known as a "slot".

What distinguishes a hash table from linear tables, trees, graphs, and similar structures is that those structures maintain logical relationships between their data elements, whereas in a hash table there is no logical relationship among elements: an element's position depends only on its keyword key and the hash function f(key).

For searching, hashing removes most of the comparison process, so efficiency improves greatly. But everything has trade-offs: because there is no ordering relationship among the elements, hashing lacks many of the capabilities of conventional data structures. Compared with other search structures, it supports only some operations (search, insert, delete), while others cannot be implemented efficiently (sorting, indexed access, range search, ordered output, finding the maximum or minimum, and so on). Hash tables are therefore primarily lookup-oriented storage structures.

The English term for this structure is hash table; in Chinese literature "hash" is translated two ways, so the hash table is also called a scatter table (散列表), the hash function a scatter function, and the steps that implement the function the hash (or scatter) algorithm.

1.2 Hash conflicts (hash collision)

As we can see, a hashing algorithm transforms a key of arbitrary length into a value within a fixed range. This transformation is a compression mapping: simply put, a function that compresses a message of any length into a fixed-length digest. In other words, the range of hash values is usually much smaller than the range of input keys.

For example, suppose the input keys are 5, 28, 19, 15, 20, 33, 12, 17, 10, nine in total. We would not initialize the hash table's internal array to length 34 (the largest key, 33, plus one) just so each key could index directly; that would waste space. The ideal outcome is to place these nine keys into a hash table (array) of length 10 via a hash function f(key). But because a hashing algorithm is a compression mapping, the number of slots is finite while the space of possible keys is effectively unbounded. For any hash algorithm, given a large enough sample of keys, it is almost inevitable that two different keys map to the same slot, that is, that f(key) takes equal values for different keys. This phenomenon is called a hash collision (or hash conflict).

For example, apply the hash function f(key) = key mod 9 to the keys above: f(5) = 5, so key 5 goes in slot 5 of the table; f(28) = 1, so key 28 goes in slot 1; but f(19) = 1 as well, so key 19 should also go in slot 1, causing a collision.
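The collision can be verified directly. A minimal sketch using the same f(key) = key mod 9 from the example (the class and method names are illustrative, not from the article):

```java
public class ModHashDemo {

    // The hash function from the example: f(key) = key mod 9
    static int f(int key) {
        return key % 9;
    }

    public static void main(String[] args) {
        int[] keys = {5, 28, 19, 15, 20, 33, 12, 17, 10};
        for (int k : keys) {
            System.out.println(k + " -> slot " + f(k));
        }
        // 28 and 19 both map to slot 1: a hash collision
        System.out.println(f(28) == f(19));
    }
}
```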

Although a well-designed hash function can minimize collisions, they cannot be avoided entirely. A hash table must therefore be able to handle hash collisions.

2 Choosing a hash function

2.1 Requirements for a hash function

As the overview above shows, the two essentials of a hash table are the choice of the hash function f(key) and the ability to handle hash collisions. Let's look at the choice of hash function first.

Good hash functions should meet the following two principles:

Simple computation: if the hash algorithm requires very complex computation, it consumes time on every operation and greatly reduces efficiency under frequent lookups. The time to compute the hash function should not exceed the time other lookup techniques spend comparing keywords.

Uniform distribution of hash addresses: we have just seen the problems collisions cause. Although they cannot be avoided entirely, the hash function can be designed so that hash addresses are distributed as evenly as possible across the storage space. This uses storage effectively and reduces both the frequency of collisions and the time spent resolving them. Below are several commonly used construction methods.

2.2 Construction method of hash function

Direct addressing method

Take the keyword itself, or a linear function of the keyword, as the hash address (such a hash function is also called a self function): f(key) = a×key + b (a, b constants).

For example, in population statistics for ages 0-100, age can be used directly as the keyword.

Or, to count the number of people born in each year since 1980, we can subtract 1980 from the birth-year keyword and use the result as the address.

The advantages of such a hash function are that it is simple and uniform and produces no collisions; the drawback is that the distribution of keywords must be known in advance, so it suits small, continuous lookup tables. Because of this limitation, the method, though simple, is rarely used in practice.
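A tiny sketch of the birth-year example above, assuming a table covering years from 1980 onward (the array size and names are illustrative):

```java
public class DirectAddressing {

    // f(key) = a*key + b with a = 1, b = -1980: a birth year maps straight to an index
    static int f(int year) {
        return year - 1980;
    }

    public static void main(String[] args) {
        int[] countByYear = new int[100]; // counts for birth years 1980..2079
        countByYear[f(1995)]++;           // record one person born in 1995
        System.out.println(f(1995));      // slot computed directly, no collision possible
    }
}
```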

Digit analysis method

Assume the keyword is a number in base r, and the keywords that may appear in the table are known in advance; then several digit positions of the keyword can be selected to form the hash address.

For example, when the key is a mobile phone number, only the last four digits of the number might participate in the calculation.

Digit analysis usually suits keys with a relatively large number of digits. If the distribution of keywords is known in advance and certain digit positions are evenly distributed, this method is worth considering.
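A minimal sketch of the phone-number example, assuming the key arrives as a digit string and the last four digits form the address (the names are illustrative):

```java
public class DigitAnalysis {

    // Take the last four digits of the phone number as the hash address
    static int f(String phone) {
        return Integer.parseInt(phone.substring(phone.length() - 4));
    }

    public static void main(String[] args) {
        System.out.println(f("13912345678")); // address derived from the last four digits
    }
}
```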

Folding method

Split the keyword from left to right into parts of equal length (the last part may be shorter when the digits run out), add the parts together, and take the low-order digits of the sum, as many as the table length requires, as the hash address. For example, with keyword 9876543210 and a three-digit table length, we split it into four groups, 987 | 654 | 321 | 0, and add them: 987 + 654 + 321 + 0 = 1962. The last three digits, 962, become the hash address.

Sometimes this does not guarantee a uniform distribution, in which case the parts can be folded back and forth at the boundaries, like folding paper, before adding. For example, reversing the first and third groups gives 789 + 654 + 123 + 0 = 1566, and the hash address taken from the last three digits is 566.

The folding method does not require knowing the keyword distribution in advance, making it suitable when keywords have many digits.
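The folding arithmetic above can be sketched as follows, assuming the key arrives as a digit string and using the simple left-to-right (non-boundary) variant (the names are illustrative):

```java
public class FoldingHash {

    // Split the key into groups of `width` digits, sum the groups,
    // and keep the last `width` digits of the sum as the hash address
    static int fold(String key, int width) {
        int sum = 0;
        for (int i = 0; i < key.length(); i += width) {
            String part = key.substring(i, Math.min(i + width, key.length()));
            sum += Integer.parseInt(part);
        }
        return sum % (int) Math.pow(10, width);
    }

    public static void main(String[] args) {
        // 987 + 654 + 321 + 0 = 1962; the last three digits give 962
        System.out.println(fold("9876543210", 3));
    }
}
```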

Mid-square method

Take the middle digits of the squared keyword as the hash address. Squaring amplifies differences between keys, and the middle digits depend on every digit of the original number, so the resulting hash addresses are fairly uniform.

Suppose the keyword is 1234; its square is 1522756, and the middle three digits, 227, become the hash address. If the keyword is 4321, its square is 18671041, and the three middle digits extracted can be 671 or 710, either of which can serve as the hash address.

The mid-square method suits cases where the keyword distribution is unknown and the number of digits is not very large.
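A small sketch of the mid-square computation, assuming we extract the digits centered in the decimal representation of the square (the names are illustrative):

```java
public class MidSquare {

    // Square the key and take `width` digits from the middle of the result
    static int midSquare(long key, int width) {
        String sq = Long.toString(key * key);
        int start = (sq.length() - width) / 2;
        return Integer.parseInt(sq.substring(start, start + width));
    }

    public static void main(String[] args) {
        // 1234^2 = 1522756; the middle three digits are 227
        System.out.println(midSquare(1234, 3));
        // 4321^2 = 18671041; this extraction picks 671
        System.out.println(midSquare(4321, 3));
    }
}
```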

Division and remainder method

This is the most commonly used method of constructing hash functions. For a hash table of length m, the formula is: f(key) = key mod p (p ≤ m)

Here mod means taking the modulus (the remainder after division). The method can be applied not only to the keyword directly but also after folding or squaring. Clearly the crux is choosing a suitable p; chosen poorly, it easily leads to collisions. HashMap adopts this method, replacing the modulo operation with a bit operation to improve efficiency. Note that the bit operation is equivalent to the modulo only in a particular case: a % b == a & (b - 1) when b = 2^n.

Experience suggests that for a table of length m, p is best taken as a prime number no greater than m (ideally close to m), or as a composite number with no prime factor smaller than 20.
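The equivalence HashMap relies on, a % b == a & (b - 1) for b = 2^n, can be checked directly (a minimal sketch; the names are illustrative):

```java
public class ModVsMask {

    public static void main(String[] args) {
        int m = 16; // a power of two: m = 2^4
        for (int key = 0; key < 100; key++) {
            // When m = 2^n, key % m equals key & (m - 1)
            if (key % m != (key & (m - 1))) {
                throw new AssertionError("mismatch at " + key);
            }
        }
        System.out.println("key % 16 == key & 15 for all tested keys");
        // The identity fails for non-powers of two, e.g. m = 10:
        System.out.println(7 % 10 == (7 & 9)); // 7 % 10 = 7, but 7 & 9 = 1
    }
}
```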

Random number method

Take a random function of the keyword as its hash address, that is: f(key) = random(key)

where random is a random function. This method is appropriate when keyword lengths vary.

In short, different situations call for different hash functions. Factors worth weighing include:

1. The time required to calculate the hash address.

2. The length of the keyword.

3. The size of the hash table.

4. The distribution of keywords.

5. The frequency with which records are searched.

Only by combining these factors can we decide which hash function is more appropriate.

3 Resolving hash collisions

No matter how well designed the hash function is, collisions cannot be avoided completely. Any hash table, whatever its hash function, must therefore consider how to handle them. The common methods are as follows:

Using an auxiliary data structure: separate chaining (also called the chain-address method or zipper method)

Without an auxiliary data structure: open addressing (linear probing, quadratic probing, double hashing)

Rehashing

3.1 Separate chaining

When a collision occurs during storage, the new data can be inserted into a singly linked list after the existing data, with the array slot acting as the head of the list. This way of resolving collisions is called separate chaining, also known as the chain-address method or zipper method. Other auxiliary structures could resolve collisions too, such as a binary tree or a second hash table; but if the hash function is well designed, the linked lists stay short, and more complex auxiliary structures are not worth the trouble.

(The original article includes a figure here showing a hash table that resolves collisions with separate chaining.)

Before JDK 1.8, HashMap used linked lists to handle hash collisions. To reduce the traversal cost of long lists, JDK 1.8 handles collisions with a linked list plus red-black tree: when a list grows longer than 8, it is converted to a red-black tree, whose search efficiency is significantly better than a singly linked list's. At 8 elements or fewer the list is perfectly acceptable, avoiding the red-black tree's structural complexity.
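Separate chaining itself can be sketched in a few lines. This is an illustrative toy with long keys and values, not HashMap's actual code:

```java
import java.util.LinkedList;

// A minimal separate-chaining table: each slot heads a list of {key, value} pairs
public class ChainedTable {

    private final LinkedList<long[]>[] slots;

    @SuppressWarnings("unchecked")
    ChainedTable(int capacity) {
        slots = new LinkedList[capacity];
        for (int i = 0; i < capacity; i++) {
            slots[i] = new LinkedList<>();
        }
    }

    void put(long key, long value) {
        LinkedList<long[]> bucket = slots[(int) (key % slots.length)];
        for (long[] e : bucket) {
            if (e[0] == key) { e[1] = value; return; } // overwrite an existing key
        }
        bucket.add(new long[]{key, value});            // colliding keys share the bucket
    }

    Long get(long key) {
        for (long[] e : slots[(int) (key % slots.length)]) {
            if (e[0] == key) return e[1];
        }
        return null;
    }
}
```

With a 9-slot table, keys 28 and 19 both land in bucket 1 yet remain individually retrievable, which is exactly how chaining absorbs the earlier collision example.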

3.2 Open addressing method

The basic idea of open addressing is that when a collision occurs, some rule is used to find another free slot, so no auxiliary data structure is needed: as long as the hash table is large enough, an empty hash address can always be found and the record stored there.

Open addressing divides into linear probing, quadratic probing, and double hashing.

3.2.1 Linear probing

The linear probing formula is (H(key) + di) % m, where H(key) is the hash function, m is the table length, and di = 0, 1, 2, ..., m-1.

The main idea of linear probing is that when a collision occurs (a key hashes to an array position that already holds a key-value pair), we examine the next position in the array; this process is linear probing. A probe has three possible outcomes:

Hit: the key at the position equals the key being sought

Miss: the position is empty

Collision: the key at the position differs from the key being sought, so probing continues

To look up a key, we first obtain an array index from the hash function, then check whether the key at that position equals the given key. If not, we continue probing (wrapping back to the start of the array if we reach the end) until we either find the key or hit an empty position. From this process it is clear that if new keys were inserted into a completely full array, the probe would loop forever, so the table must never be allowed to fill up.

3.2.2 Quadratic probing

With a less ideal hash function, collisions pile up: key1 is stored in a free slot, key2 collides with key1 and is placed in the next slot, then key3 collides with both of them, and so on. Keywords aggregate in one region of the array, a phenomenon known as data clustering.

One remedy is to grow the increment quadratically: (H(key) + di²) % m, with di = 0, 1, 2, ..., m/2.

Squaring the increment keeps keywords from clustering in one region of the array. We call this the quadratic probing method.

Note that if the table length m is not a prime number, the number of distinct slots the probe sequence can reach shrinks, and the chance of unresolvable collisions rises greatly.
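The quadratic probe sequence (H(key) + di²) % m can be sketched as follows, using an illustrative prime table length of 11 (the names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class QuadraticProbe {

    // The probe sequence (h + i*i) % m for i = 0, 1, 2, ...
    static List<Integer> probes(int h, int m, int count) {
        List<Integer> seq = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            seq.add((h + i * i) % m);
        }
        return seq;
    }

    public static void main(String[] args) {
        // From home slot 1 in a table of length 11: 1, 2, 5, 10, 6, ...
        System.out.println(probes(1, 11, 5));
    }
}
```

Unlike linear probing, consecutive probes jump apart quickly, which is what breaks up the clustering described above.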

3.2.3 Double hashing

Prepare two hash functions. The general double hashing increment is F(i) = i * hash2(key): the second hash function computes hash2(key), and probing then proceeds at distances hash2(key), 2·hash2(key), and so on from the home slot.
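A minimal sketch of a double-hashing probe sequence. The second hash h2(key) = 7 - (key % 7) is an illustrative choice, not taken from the article; the point is that it never returns 0, so the probe always advances:

```java
public class DoubleHashProbe {

    static int h1(int key, int m) { return key % m; }

    // Second hash; must never return 0, or the probe would stand still
    static int h2(int key) { return 7 - (key % 7); }

    // The i-th probe position: (h1 + i * h2) % m
    static int probe(int key, int i, int m) {
        return (h1(key, m) + i * h2(key)) % m;
    }

    public static void main(String[] args) {
        int m = 11, key = 23;
        // key 23: h1 = 1, h2 = 7 - 2 = 5, so the probes run 1, 6, 0, 5, ...
        for (int i = 0; i < 4; i++) {
            System.out.print(probe(key, i, m) + " ");
        }
    }
}
```

Because the step size depends on the key, two keys with the same home slot follow different probe sequences, which avoids the clustering of linear probing.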

3.3 rehashing method

Chaining and open addressing both aim to distribute elements more reasonably within the table, but table space eventually runs short, and as a table fills with elements, the probability of collisions rises. Rehashing addresses when to expand the hash table and how to redistribute the existing elements reasonably in the new space.

The usual approach: when the table holds enough elements, create a new table about twice the size, switch to a hash function for the new size, scan the entire old table, and insert every element into the new table under the new mapping. This is rehashing, and it is expensive because it moves every element.

Common trigger conditions for rehashing: as soon as the table is half full; only when an insert fails (an extreme policy); or when the table reaches a certain load factor. The third is the best of these and is what HashMap uses.

Load factor of a hash table: α = number of records in the table / table length, indicating how full the table is. When the element count reaches the threshold set by the load factor, the table can be expanded; α is also called the fill factor. Choosing it well matters: the larger α is, the likelier collisions become; the smaller it is, the rarer collisions are, but the more space is wasted. HashMap in JDK 1.8 defaults to a load factor of 0.75.

4 Simple implementation of a hash table

The JDK already provides ready-made hash table implementations, most famously HashMap. In JDK 1.8, HashMap is implemented with an array plus linked lists plus red-black trees; its hash function is based on hashCode() combined with the division and remainder method, and it resolves collisions with separate chaining plus rehashing.

Below is a simple Java hash table based on linear probing. As the implementation shows, a linear-probing table is not especially efficient and suffers from data clustering. Even so, the JDK contains classes implemented with linear probing, such as ThreadLocalMap inside ThreadLocal, precisely because linear probing is simple to implement.

```java
import java.util.Map;

/**
 * Simple hash table based on linear probing.
 *
 * @param <K> key type
 * @param <V> value type
 */
public class LinearProbingHashMap<K, V> {

    /** Array of entry nodes. */
    transient Entry<K, V>[] table;

    /** Capacity of the hash table, initially 16 (kept a power of two). */
    private int capacity = 16;

    /** Number of entries currently stored. */
    private int size;

    /** The stored node object. */
    private static class Entry<K, V> implements Map.Entry<K, V> {
        /** Cached hash value, to avoid recomputing it. */
        final int hash;
        /** The key. */
        final K key;
        /** The value. */
        V value;

        private Entry(int hash, K key, V value) {
            this.hash = hash;
            this.key = key;
            this.value = value;
        }

        @Override
        public final K getKey() { return key; }

        @Override
        public final V getValue() { return value; }

        @Override
        public final String toString() { return key + "=" + value; }

        @Override
        public final V setValue(V newValue) {
            V oldValue = value;
            value = newValue;
            return oldValue;
        }

        @Override
        public int hashCode() { return hash; }
    }

    /** Empty constructor; the internal array is created lazily. */
    public LinearProbingHashMap() {
    }

    /** Insert a key-value pair; returns the previous value for a duplicate key. */
    @SuppressWarnings("unchecked")
    public V put(K key, V value) {
        // Lazy initialization
        if (table == null) {
            table = new Entry[capacity];
        }
        // Expand when the element count reaches 0.75 * capacity
        if (size >= capacity * 0.75) {
            resize(2 * capacity);
        }
        int hash = hash(key);
        // Array index where the entry should go
        int position = hash & (capacity - 1);
        while (true) {
            if (table[position] == null) {
                table[position] = new Entry<>(hash, key, value);
                size++;
                break;
            } else if (hash == table[position].hash && table[position].getKey() == key) {
                // Duplicate key, judged here by hash value and ==
                return table[position].setValue(value);
            }
            position = nextIndex(position, capacity);
        }
        return null;
    }

    /** Look up the value stored for a key; returns null if absent. */
    public V get(K key) {
        if (table == null) {
            return null;
        }
        // Array position corresponding to the key
        int position = hash(key) & (capacity - 1);
        // If the slot is occupied, scan the run of consecutive elements
        if (table[position] != null) {
            do {
                if (table[position].getKey() == key) {
                    return table[position].getValue();
                }
                position = nextIndex(position, capacity);
            } while (table[position] != null);
        }
        return null;
    }

    /** Delete an element; returns the removed value, or null if nothing was deleted. */
    public V delete(K key) {
        if (table == null) {
            return null;
        }
        // Array position corresponding to the key
        int position = hash(key) & (capacity - 1);
        V value;
        if (table[position] != null) {
            do {
                if (table[position].getKey() == key) {
                    // Remove the element
                    value = table[position].getValue();
                    table[position] = null;
                    size--;
                    // Re-insert every following element of the probe run,
                    // so later lookups do not stop early at the freed slot
                    position = nextIndex(position, capacity);
                    Entry<K, V> positionEntry;
                    while ((positionEntry = table[position]) != null) {
                        table[position] = null;
                        size--;
                        put(positionEntry.getKey(), positionEntry.getValue());
                        position = nextIndex(position, capacity);
                    }
                    return value;
                }
                position = nextIndex(position, capacity);
            } while (table[position] != null);
        }
        return null;
    }

    public int size() {
        return size;
    }

    /**
     * Hash function; rather than using hashCode() directly, it borrows HashMap's
     * perturbation (XOR with the high 16 bits) for a more uniform distribution.
     */
    private int hash(K key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }

    /**
     * Expansion. The original text is truncated at this method; the body below is a
     * straightforward reconstruction that re-inserts every entry into a larger array.
     */
    @SuppressWarnings("unchecked")
    private void resize(int newCapacity) {
        Entry<K, V>[] oldTable = table;
        table = new Entry[newCapacity];
        capacity = newCapacity;
        size = 0;
        for (Entry<K, V> entry : oldTable) {
            if (entry != null) {
                put(entry.getKey(), entry.getValue());
            }
        }
    }

    /** Next probe index, wrapping around to the start of the array. */
    private int nextIndex(int position, int capacity) {
        int next = position + 1;
        return next >= capacity ? 0 : next;
    }
}
```


© 2024 shulou.com SLNews company. All rights reserved.
