How to deeply discuss the working principle of Java HashMap 07/01 Update SLTechnology News&Howtos

How to deeply discuss the working principle of Java HashMap

2025-07-01 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Development >

Shulou(Shulou.com)06/02 Report--

This article shows you how to explore the working principle of Java HashMap in depth, the content is concise and easy to understand, it will definitely brighten your eyes. I hope you can get something through the detailed introduction of this article.

Most Java developers are using Map, especially HashMap. HashMap is a simple but powerful way to store and retrieve data. But how many developers know how HashMap works internally? A few days ago, I read a lot of java.util.HashMap source code (including Java 7 and Java 8) to gain an in-depth understanding of this basic data structure. I'll explain the implementation of java.util.HashMap, describe the new features added to the Java 8 implementation, and discuss performance, memory, and some known issues when using HashMap.

Internal storage

The Java HashMap class implements the Map interface. The main methods in this interface include: v put (K key, V value) V get (Object key) V remove (Object key) Boolean containsKey (Object key)

HashMap uses an inner class Entry to store data. This inner class is a simple key-value pair with two additional data:

A reference to another entry so that HashMap can store objects such as linked lists.

A hash value used to represent the key, which is stored to prevent HashMap from regenerating the hash value corresponding to the key every time it is needed.

Here is a portion of Entry's code under Java 7:

Static class Entry implements Map.Entry {final K key; V value; Entry next; int hash; … }

HashMap stores data in multiple unidirectional Entry linked lists (sometimes referred to as bucket bucket or container orbins). All lists are registered with an Entry array (Entry [] array), and the default length of this internal array is 16.

The following figure depicts the internal storage of an HashMap instance, which contains an array of nullable objects. Each object is connected to another object, which forms a linked list.

All keys with the same hash value are placed in the same linked list (bucket). Keys with different hashes may end up in the same bucket.

When the user calls put (K key, V value) or get (Object key), the program calculates the index of the bucket where the object should be. The program then iterates through the corresponding list to find an Entry object with the same key (using the equals () method of the key).

In the case of a call to get (), the program returns the Entry object corresponding to the value (if the Entry object exists).

In the case of calling put (K key, V value), if the Entry object already exists, the program will replace the value with the new value, otherwise, the program will create a new Entry (from the key and value in the parameter) in the header of the one-way linked list.

The index of the bucket (linked list) is generated through three steps of map:

First get the hash code of the key.

The program repeats the hash code to prevent bad hash functions for keys, as it is possible to put all the data on the same index (bucket) of the internal array.

The program takes the repeated hash code and uses a bitmask (bit-mask) of the array length (at least 1) for it. This operation ensures that the index will not be larger than the size of the array. You can think of it as a calculated optimized modular function.

The following is the source code for generating the index:

/ / the "rehash" function in JAVA 7 that takes the hashcode of the key static int hash (int h) {h ^ = (h > 20) ^ (h > 12); return h ^ (h > 7) ^ (h > 4);} / / the "rehash" function in JAVA 8 that directly takes the key static final int hash (Object key) {int h; return (key = = null)? 0: (h = key.hashCode ()) ^ (h > 16) } / / the function that returns the index from the rehashed hash static int indexFor (int h, int length) {return h & (length-1);}

In order to work more efficiently, the size of the internal array must be a power of 2. Let's see why:

Assuming that the length of the array is 17, then the value of the mask is 16 (array length-1). The binary representation of 16 is 0. 010000, so for any value H, the result of "H & 16" is 16 or 0. This means that an array of length 17 can only be applied to two buckets: one is 0 and the other is 16, which is not very efficient. But if you set the length of the array to a power of 2, for example, 16, the work of bitwise indexing becomes "H & 15". The binary representation of 15 is 0. 001111, the output value of the index formula can be from 0 to 15, so that an array of length 16 can be fully utilized. For example:

If H = 952, its binary representation is 0. 01110111000, and the corresponding index is 0. 01000 = 8

If H = 1576, its binary representation is 0. 011000101000, and the corresponding index is 0. 01000 = 8

If H = 12356146, its binary representation is 0. 0101111001000101000110010, and the corresponding index is 0. 00010 = 2

If H = 59843, its binary representation is 0. 01110100111000011, and its corresponding index is 0. 00011 = 3

This mechanism is transparent to the developer: if he chooses a HashMap,Map of length 37, he will automatically select the next power value of 2 greater than 37 (64) as the length of the internal array.

Automatically resize

After getting the index, the get (), put (), or remove () method accesses the corresponding linked list to see if the Entry object for the specified key already exists. Without modification, this mechanism can cause performance problems because this method needs to iterate through the entire list to see if the Entry object exists. Suppose the length of the internal array takes the default value of 16, and you need to store 2000000 records. In the case of * *, there will be 125000 Entry objects per linked list (2000000Uniq16). The get (), remove (), and put () methods require 125000 iterations each time they are executed. To avoid this, HashMap can increase the length of the internal array to ensure that only a small number of Entry objects remain in the linked list.

When you create a HashMap, you can specify an initial length and a loadFactor with the following constructor:

Public HashMap (int initialCapacity, float loadFactor)

If you do not specify parameters, the default value for initialCapacity is 16 and the default value for loadFactor is 0.75. InitialCapacity represents the length of the linked list of the internal array.

Every time you use put (…) Method to add a new key-value pair to the Map, the method checks whether the length of the internal array needs to be increased. To achieve this, Map stores two pieces of data:

Size of the Map: it represents the number of entries recorded in the HashMap. We update it when we insert or delete a value into the HashMap.

Threshold: it is equal to the length of the internal array * loadFactor, which is updated each time the length of the internal array is adjusted.

Before adding a new Entry object, put (...) Method checks whether the current Map size is greater than the threshold. If it is greater than the threshold, it creates a new array that is twice the length of the current internal array. Because the size of the new array has changed, so does the index function (that is, the result of the bit operation that returns the hash value of the key & (array length-1)). Resizing the array creates two new buckets (linked lists) and reassigns all existing Entry objects to the buckets. The goal of resizing the array is to reduce the size of the linked list, thereby reducing the execution time of the put (), remove (), and get () methods. For all Entry objects corresponding to keys with the same hash value, they are assigned to the same bucket after resizing. However, if the hashes of the keys of two Entry objects are not the same, but they were on the same bucket before, there is no guarantee that they will still be on the same bucket after adjustment.

This picture describes the pre-and post-adjusted internal arrays. Before adjusting the length of the array, you need to iterate through a linked list of five elements in order to get the Entry object E _ Magazine Map. After adjusting the length of the integer group, the same get () method only needs to traverse a linked list of two elements, so that the get () method runs twice as fast after adjusting the array length.

Thread safety

If you are already familiar with HashMap, you must know that it is not thread-safe, but why? For example, suppose you have a Writer thread that only inserts existing data into the Map, a Reader thread that reads data from the Map, so why doesn't it work?

Because under the automatic resizing mechanism, if a thread tries to add or get an object, Map may use the old index value so that it will not find the new bucket where the Entry object is located.

In the worst case, when two threads insert data at the same time, two put () calls automatically resize the array at the same time. Since two threads are modifying the linked list at the same time, it is possible for Map to exit in the inner loop of a linked list. If you try to get data from a list with an internal loop, the get () method will never end.

HashTable provides a thread-safe implementation that prevents this from happening. However, since all synchronous CRUD operations are very slow. For example, if thread 1 calls get (key1), then thread 2 calls get (key2), and thread 2 calls get (key3), then only one thread can get its value at a specified time, but all three threads can access the data at the same time.

Since Java 5, we have had a better thread-safe HashMap implementation: ConcurrentHashMap. For ConcurrentMap, only buckets are synchronized so that if multiple threads do not use the same bucket or resize the internal array, they can call the get (), remove (), or put () methods at the same time. In a multithreaded application program, this approach is a better choice.

Invariance of bond

Why is it a good implementation to use strings and integers as keys for HashMap? Mainly because they are immutable! If you choose to create your own class as a key, but there is no guarantee that the class is immutable, then you may lose data within HashMap.

Let's look at the following use cases:

You have a key whose internal value is "1".

If you insert an object into HashMap, its key is "1".

HashMap generates a hash from the hash code of the key (that is, "1").

Map stores this hash value in the newly created record.

You change the internal value of the key to "2".

The hash value of the key has changed, but HashMap does not know this (because the old hash value is stored).

You try to get the corresponding object through the modified key.

Map calculates the hash value of the new key (that is, "2") to find the linked list (bucket) where the Entry object is located.

Case 1: now that you have modified the key, Map will try to find the Entry object in the wrong bucket and cannot find it.

Case 2: you are lucky that the bucket generated by the modified key is the same as the bucket generated by the old key. Map then traverses through the linked list and finds an Entry object with the same key. But to find the key, Map first compares the hash value of the key by calling the equals () method. Because the modified key generates a different hash value (the old hash value is stored in the record), Map has no way to find the corresponding Entry object in the linked list.

Here is an example of Java. We insert two key-value pairs into the Map, then I modify the * keys and try to get the two objects. You will find that only the second object is returned from Map, and * objects have been "lost" in HashMap:

Public class MutableKeyTest {public static void main (String [] args) {class MyKey {Integer i; public void setI (Integer I) {this.i = I;} public MyKey (Integer I) {this.i = I;} @ Override public int hashCode () {return I } @ Override public boolean equals (Object obj) {if (obj instanceof MyKey) {return i.equals (MyKey) obj) .i);} else return false;}} Map myMap = new HashMap (); MyKey key1 = new MyKey (1); MyKey key2 = new MyKey (2); myMap.put (key1, "test" + 1); myMap.put (key2, "test" + 2) / / modifying key1 key1.setI (3); String test1= myMap.get (key1); String test2= myMap.get (key2); System.out.println ("test1=" + test1 + "test2=" + test2);}}

The output of the above code is "test1=null test2=test 2". As we expected, Map does not have the ability to get the string 1 corresponding to the modified key 1.

Improvements in Java 8

In Java 8, many changes have been made to the internal implementation in HashMap. Indeed, Java 7 uses 1000 lines of code, while Java 8 uses 2000 lines of code. Most of what I described earlier is still true in Java 8, except for using linked lists to hold Entry objects. In Java 8, we still use an array, but it is saved in Node, where Node contains the same information as the previous Entry object, and a linked list is also used:

The following is part of the code implemented by Node in Java 8:

Static class Node implements Map.Entry {final int hash; final K key; V value; Node next

So what's the big difference from Java 7? Well, Node can be extended to TreeNode. TreeNode is a red-black tree data structure, it can store more information, so we can add, delete or get an element under the complexity of O (log (n)). The following example describes all the information saved by TreeNode:

Static final class TreeNode extends LinkedHashMap.Entry {final int hash; / / inherited from Node final K key; / / inherited from Node V value; / / inherited from Node Node next; / / inherited from Node Entry before, after;// inherited from LinkedHashMap.Entry TreeNode parent; TreeNode left; TreeNode right; TreeNode prev; boolean red

The red-black tree is a self-balanced binary search tree. Its internal mechanism ensures that its length is always log (n), whether we add or delete nodes. The main benefit of using this type of tree is that when many data in the internal table have the same index (bucket), the complexity of searching the tree is O (log (n)), while for linked lists, the complexity of performing the same operation is O (n).

As you can see, we do store more data in the tree than the linked list. According to the inheritance principle, the internal list can contain Node (linked list) or TreeNode (red-black tree). Oracle decided to use these two data structures according to the following rules:

-for the specified index (bucket) in the internal table, if the number of node is more than 8, the linked list will be converted to a red-black tree.

-for the specified index (bucket) in the internal table, if the number of node is less than 6, then the red-black tree will be converted to a linked list.

This picture describes the internal array in Java 8 HashMap, which contains both a tree (bucket 0) and a linked list (bucket 1, 2 and 3). Bucket 0 is a tree structure because it contains more than 8 nodes.

Memory overhead

JAVA 7

Using HashMap consumes some memory. In Java 7, HashMap encapsulates key-value pairs into Entry objects, and an Entry object contains the following information:

A reference to the next record

A pre-calculated hash (integer)

A reference to the key

A reference to a value

In addition, HashMap in Java 7 uses an internal array of Entry objects. Assuming that a Java 7 HashMap contains N elements, and the capacity of its internal array is CAPACITY, the additional memory consumption is approximately:

SizeOf (integer) * N + sizeOf (reference) * (3*N+C)

Where:

The size of an integer is 4 bytes

The size of the reference depends on the JVM, operating system, and processor, but it is usually 4 bytes.

This means that the total memory overhead is usually 16 * N + 4 * CAPACITY bytes.

Note: after Map automatically resizes, the value of CAPACITY is the next power of the smallest 2 greater than N.

Note: since Java 7, HashMap has adopted a deferred loading mechanism. This means that even if you specify a size for HashMap, the internal array used by the record (which consumes 4*CAPACITY bytes) will not allocate space in memory until we use the put () method * *.

JAVA 8

In the Java 8 implementation, computing memory usage becomes a bit more complicated because Node may store the same data as Entry, or add six references and a Boolean attribute (specify whether it is TreeNode) on top of it.

If all nodes are just Node, then Java 8 HashMap consumes the same memory as Java 7 HashMap.

If all nodes are TreeNode, the memory consumed by Java 8 HashMap becomes:

N * sizeOf (integer) + N * sizeOf (boolean) + sizeOf (reference) * (9*N+CAPACITY)

In most standard JVM, the result of the above formula is 44 * N + 4 * CAPACITY bytes.

Performance problem

Asymmetric HashMap vs equilibrium HashMap

In the case of * *, both the get () and put () methods have only O (1) complexity. However, if you don't care about the hash function of the key, your put () and get () methods may execute very slowly. The efficient execution of the put () and get () methods depends on how the data is allocated to different indexes of the internal array (bucket). If the hash function of the key is not properly designed, you will get an asymmetrical partition (no matter how large the internal data is). All put () and get () methods use a * * linked list, which is slow to execute because it needs to iterate over all the records in the linked list. In a worst-case scenario (if most of the data is in the same bucket), your time complexity will become O (n).

The following is a visual example. The * figure depicts an asymmetric HashMap, and the second diagram depicts a balanced HashMap.

In this asymmetric HashMap, running the get () and put () methods on bucket 0 can be time-consuming. It takes 6 iterations to get record K.

In this balanced HashMap, it takes only three iterations to get record K. The two HashMap store the same amount of data, and the internal array is the same size. The only difference is the hash function of the key, which is used to distribute records over different buckets.

The following is an extreme example written in Java, in which I use a hash function to put all the data into the same linked list (bucket), and then I add 2000000 pieces of data.

Public class Test {public static void main (String [] args) {class MyKey {Integer i; public MyKey (Integer I) {this.i = I;} @ Override public int hashCode () {return 1;} @ Override public boolean equals (Object obj) {… }} Date begin = new Date (); Map myMap= new HashMap (2, 500, 000, 10, 000, 1); for (int iTuno

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.