How to go deep into the internal implementation of Python dictionary 03/26 Update SLTechnology News&Howtos

2025-03-26 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/02 Report--

This article takes a detailed look at how the Python dictionary is implemented internally.

Dictionaries are indexed by keys, so they can be seen as associative arrays. Let's add three key/value pairs to a dictionary:

>>> d = {'a': 1, 'b': 2}
>>> d['c'] = 3
>>> d
{'a': 1, 'b': 2, 'c': 3}

These values can be accessed by:

>>> d['a']
1
>>> d['b']
2
>>> d['c']
3
>>> d['d']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'd'

Because the key 'd' does not exist, a KeyError exception is raised.

Hash tables

In Python, dictionaries are implemented as hash tables. In other words, a dictionary is an array whose indexes are obtained by applying a hash function to the keys. The goal of a hash function is to distribute the keys evenly across the array. Different keys may produce the same hash value, which is called a collision, and a good hash function minimizes the number of collisions. Python does not use that kind of hash function: its most important hash functions (for strings and integers) are very regular in common cases, as this Python 2 session shows:

>>> map(hash, (0, 1, 2, 3))
[0, 1, 2, 3]
>>> map(hash, ("namea", "nameb", "namec", "named"))
[-1658398457, -1658398460, -1658398459, -1658398462]

In the rest of this article, we only consider strings as keys. The hash function for strings in Python is defined as follows:

arguments: string object
returns: hash
function string_hash:
    if hash cached:
        return it
    set len to string's length
    initialize var p pointing to 1st char of string object
    set x to value pointed by p left shifted by 7 bits
    while len > 0:
        set var x to (1000003 * x) xor value pointed by p
        increment pointer p
        decrement len
    set x to x xor length of string object
    cache x as the hash so we don't need to calculate it again
    return x as the hash

If you run hash('a') in Python, string_hash() is executed behind the scenes and returns 12416037344 (assuming a 64-bit platform).
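As a cross-check, the pseudocode can be transcribed into Python. This is a minimal sketch, assuming a 64-bit platform, byte strings, and no hash randomization. The name string_hash follows the pseudocode; the bits parameter and the final adjustment of -1 to -2 (CPython reserves -1 as an error value) are additions for illustration. Python 3 uses a different, randomized string hash, so the built-in hash() will not return these values there.

```python
def string_hash(s, bits=64):
    """Sketch of the CPython 2.x string hash described in the pseudocode."""
    if not s:
        return 0  # empty string: nothing to mix in, and the length is 0
    mask = (1 << bits) - 1
    x = ord(s[0]) << 7
    for ch in s:
        # multiply, xor in the next character, and wrap like a C long
        x = ((1000003 * x) ^ ord(ch)) & mask
    x ^= len(s)
    if x >= 1 << (bits - 1):
        x -= 1 << bits  # reinterpret the wrapped value as signed
    return -2 if x == -1 else x


print(string_hash('a'))  # 12416037344, the value quoted in the text
```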

If we use an array of size x to store the key/value pairs, we use a mask equal to x-1 to calculate the index of the slot (the unit that stores one key/value pair) in the array. This makes computing the index very fast. The resizing mechanism described below keeps the probability of finding an empty slot high, which means that in most cases this simple computation is all that is needed. If the size of the array is 8, the index for 'a' is hash('a') & 7 = 0. Similarly, the index for 'b' is 3, the index for 'c' is 2, and the index for 'z' is 3, the same as for 'b', which is a collision.

As we can see, Python's hash function does a good job when the keys are consecutive, which matters because this kind of data is common. However, once we add the key 'z', there is a collision because 'z' is not close enough to the other keys.
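The index computation can be checked with a few lines of Python. This is a sketch: the helper name slot_index is hypothetical, and the hash values for 'b', 'c' and 'z' are not quoted in the article but were derived by applying the string_hash pseudocode on a 64-bit platform.

```python
# 64-bit hash values obtained from the string_hash pseudocode
HASHES = {'a': 12416037344, 'b': 12544037731,
          'c': 12672038114, 'z': 15616046971}


def slot_index(hash_value, table_size):
    """Initial slot for a hash; table_size must be a power of two."""
    assert table_size & (table_size - 1) == 0, "size must be a power of two"
    return hash_value & (table_size - 1)


for key, h in sorted(HASHES.items()):
    idx = slot_index(h, 8)
    assert idx == h % 8  # masking equals modulo for power-of-two sizes
    print(key, idx)
```

The mask only works as a replacement for the modulo operation because the table size is always a power of two.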

We could use a linked list to store the key/value pairs that share the same index, but that would increase lookup time, which would no longer be O(1) on average. The next section describes the collision resolution method used by Python dictionaries.

Open addressing

Open addressing is a method of collision resolution based on probing. In the case of the key 'z' above, index 3 is already used in the array, so we need to probe for a different index that is not yet in use. Adding and looking up a key/value pair both take O(1) time on average.

A quadratic probing sequence is used to find a free slot. The code is as follows:

j = (5*j) + 1 + perturb;
perturb >>= PERTURB_SHIFT;
use j % 2**i as the next table index;

Recurring on 5*j+1 quickly magnifies small differences in the bits of the hash value that did not affect the initial index. The variable perturb brings the other bits of the hash value into play.

Out of curiosity, let's look at the probing sequence when the table size is 32 and j = 3: 3 -> 11 -> 19 -> 29 -> 5 -> 6 -> 16 -> 31 -> 28 -> 13 -> 2.
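The recurrence can be tried out with a small generator. This is a sketch that assumes a non-negative hash and uses PERTURB_SHIFT = 5, the value defined in dictobject.c. The name probe_sequence is hypothetical. The example uses the 64-bit hash of 'z' derived from the string_hash pseudocode and a table of 8 slots, so it shows the collision case discussed earlier rather than the size-32 sequence.

```python
from itertools import islice

PERTURB_SHIFT = 5  # value used in CPython's dictobject.c


def probe_sequence(hash_value, mask):
    """Yield slot indexes in the order they would be probed."""
    j = hash_value & mask
    perturb = hash_value
    yield j
    while True:
        # keep j within 64 bits, as an unsigned C integer would
        j = (5 * j + 1 + perturb) & 0xFFFFFFFFFFFFFFFF
        perturb >>= PERTURB_SHIFT
        yield j & mask


# hash of 'z' on a 64-bit platform, table of 8 slots (mask 7)
print(list(islice(probe_sequence(15616046971, 7), 4)))  # [3, 3, 3, 5]
```

With this ordering of the two statements, the first probes for 'z' land on slot 3 again before the shifted bits of perturb move the sequence to slot 5. Once perturb reaches 0, the recurrence 5*j+1 alone visits every slot of the table.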

For more information on the probing sequence, refer to the source code of dictobject.c. The beginning of the file contains a detailed description of the probing mechanism.

Let's take a look at the internal code of Python with an example.

Dictionary C structures

The following C structure is used to store a dictionary entry (a key/value pair). It holds the hash, the key, and the value. PyObject is the base class of Python objects.

typedef struct {
    Py_ssize_t me_hash;
    PyObject *me_key;
    PyObject *me_value;
} PyDictEntry;

The following structure represents a dictionary. ma_fill is the total number of active slots and dummy slots. A slot is marked as dummy when the key/value pair in an active slot is deleted. ma_used is the number of active slots. ma_mask is equal to the array size minus 1 and is used to calculate the slot index. ma_table is the array itself, and ma_smalltable is the initial array of size 8.

typedef struct _dictobject PyDictObject;
struct _dictobject {
    PyObject_HEAD
    Py_ssize_t ma_fill;
    Py_ssize_t ma_used;
    Py_ssize_t ma_mask;
    PyDictEntry *ma_table;
    PyDictEntry *(*ma_lookup)(PyDictObject *mp, PyObject *key, long hash);
    PyDictEntry ma_smalltable[PyDict_MINSIZE];
};

Dictionary initialization

When a dictionary is first created, the PyDict_New() function is called. Some lines of the source code have been removed here, and the C code has been converted to pseudocode to highlight the key concepts.

returns new dictionary object
function PyDict_New:
    allocate new dictionary object
    clear dictionary's table
    set dictionary's number of used slots + dummy slots (ma_fill) to 0
    set dictionary's number of active slots (ma_used) to 0
    set dictionary's mask (ma_mask) to dictionary size - 1 = 7
    set dictionary's lookup function to lookdict_string
    return allocated dictionary object
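The initial state can be modeled in Python as well. This is only an illustrative sketch: ToyDict, new_dict and MINSIZE are hypothetical names that mirror the C fields ma_fill, ma_used, ma_mask and ma_table, and the lookup function pointer is left out.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional, Tuple

MINSIZE = 8  # mirrors PyDict_MINSIZE

# One slot: None when empty, otherwise (hash, key, value)
Entry = Optional[Tuple[int, Any, Any]]


@dataclass
class ToyDict:
    fill: int = 0            # active + dummy slots (ma_fill)
    used: int = 0            # active slots (ma_used)
    mask: int = MINSIZE - 1  # table size - 1 (ma_mask)
    table: List[Entry] = field(default_factory=lambda: [None] * MINSIZE)


def new_dict() -> ToyDict:
    """Rough equivalent of the PyDict_New pseudocode."""
    return ToyDict()


d = new_dict()
print(d.fill, d.used, d.mask, len(d.table))  # 0 0 7 8
```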

Add item

When a new key/value pair is added, the PyDict_SetItem() function is called. It takes a pointer to the dictionary object and the key/value pair. It checks whether the key is a string, then calculates the hash value, or reuses the cached value if the hash has already been calculated. It then calls insertdict() to add the new key/value pair. If the total number of active and dummy slots exceeds 2/3 of the array size, the array is resized. Why 2/3? This ensures that the probing sequence can find a free slot fast enough. The resizing function is described later.

arguments: dictionary, key, value
returns: 0 if OK or -1
function PyDict_SetItem:
    if key's hash cached:
        use hash
    else:
        calculate hash
    call insertdict with dictionary object, key, hash and value
    if key/value pair added successfully and capacity over 2/3:
        call dictresize to resize dictionary's table
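The "capacity over 2/3" test can be written with integer arithmetic only, in the style of dictobject.c. A minimal sketch follows; the function name needs_resize is hypothetical.

```python
def needs_resize(fill, table_size):
    """True when active + dummy slots reach 2/3 of the table."""
    # integer form of fill / table_size >= 2 / 3, with no floating point
    return fill * 3 >= table_size * 2


print(needs_resize(5, 8))  # False
print(needs_resize(6, 8))  # True
```

Because the table size is a power of two, fill * 3 can never equal table_size * 2, so "reaches" and "exceeds" 2/3 describe the same condition here.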

insertdict() uses the lookup function lookdict_string() to find a free slot. This is the same function that is used to find a key. lookdict_string() calculates the slot index using the hash and the mask. If it cannot find the key at index = hash & mask, it starts probing with the loop described earlier until it finds a free slot. If no matching key is found and a dummy slot was encountered during probing, the dummy slot is returned. This gives priority to reusing previously deleted slots.

Now we want to add the following key/value pairs: {'a': 1, 'b': 2, 'z': 26, 'y': 25, 'c': 5, 'x': 24}. The following happens:

A dictionary structure is allocated with an internal table of size 8.

Here's what we've got so far:
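The insertion steps can be replayed with a short simulation. This is a sketch under stated assumptions: find_slot is a hypothetical, simplified stand-in for lookdict_string() that ignores dummy slots, the slots hold (key, value) tuples instead of PyDictEntry structures, and the 64-bit hash values were derived from the string_hash pseudocode.

```python
PERTURB_SHIFT = 5  # value used in CPython's dictobject.c

# 64-bit hash values obtained from the string_hash pseudocode
HASHES = {'a': 12416037344, 'b': 12544037731, 'c': 12672038114,
          'x': 15360046201, 'y': 15488046584, 'z': 15616046971}


def find_slot(table, key, h):
    """Return the index holding `key`, or the first empty slot on its probe path."""
    mask = len(table) - 1
    j = h & mask
    perturb = h
    while table[j & mask] is not None and table[j & mask][0] != key:
        j = 5 * j + 1 + perturb
        perturb >>= PERTURB_SHIFT
    return j & mask


table = [None] * 8
for key, value in [('a', 1), ('b', 2), ('z', 26), ('y', 25), ('c', 5), ('x', 24)]:
    table[find_slot(table, key, HASHES[key])] = (key, value)

for i, entry in enumerate(table):
    print(i, entry)
```

Under these assumptions, 'a', 'b' and 'c' land on their initial indexes 0, 3 and 2, while 'z', 'y' and 'x' collide and are moved by probing to slots 5, 1 and 7. Slots 4 and 6 stay empty.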

Six of the eight slots are now used, which is more than 2/3 of the array's capacity, so dictresize() is called to allocate a larger array and copy the entries from the old table to the new one.

In our example, dictresize() is called with minused = 24, that is, 4 * ma_used, so the array is resized to hold at least four times the number of active slots. When the number of active slots is very large (greater than 50000), 2 * ma_used is used instead. Why four times? It reduces the number of resize operations and significantly increases sparseness.

The size of the new table must be greater than 24. It is computed by starting from the minimum table size, 8, and doubling (shifting left by 1 bit) until the result is greater than 24, which gives 32: 8 -> 16 -> 32.
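The size computation is short enough to sketch directly. The function name new_table_size is hypothetical, and the sketch assumes that the doubling starts from the minimum table size of 8 (PyDict_MINSIZE).

```python
MINSIZE = 8  # mirrors PyDict_MINSIZE


def new_table_size(used):
    """Table size chosen when resizing a dictionary with `used` active slots."""
    minused = (2 if used > 50000 else 4) * used
    size = MINSIZE
    while size <= minused:
        size <<= 1  # double until strictly greater than minused
    return size


print(new_table_size(6))  # 32
```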

This is what happens during the resize: a new table of size 32 is allocated, and the entries from the old table are inserted into the new table using the new mask, 31. The final result is as follows:

Delete item

The PyDict_DelItem() function is called to delete an entry. The hash of the key is calculated first, then the lookup function is called to return the entry, and the slot is marked as a dummy slot.

Suppose we want to remove the key 'c' from the dictionary. We end up with the following result:

Note that deleting an item does not trigger a resize, even if the number of active slots ends up far smaller than the total number of slots. However, when a key/value pair is added after deletions, the array may shrink, because the resize condition is based on the total number of active and dummy slots, while the new size is computed from the number of active slots alone.
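The role of dummy slots can be illustrated with a toy table. This is a sketch, not the CPython code: DUMMY, lookup and delete are hypothetical names, the slots hold (key, value) tuples, and the hash values are the 64-bit values derived from the string_hash pseudocode. The example deletes 'b' rather than 'c' because 'b' and 'z' share the initial index 3 in a table of size 8, which makes the effect visible.

```python
PERTURB_SHIFT = 5  # value used in CPython's dictobject.c
DUMMY = ('<dummy>', None)  # hypothetical marker for a deleted slot


def lookup(table, key, h):
    """Return the slot for `key` if present, else the first dummy slot
    seen on the probe path, else the empty slot that ended the probe."""
    mask = len(table) - 1
    j, perturb, free = h & mask, h, None
    while True:
        i = j & mask
        entry = table[i]
        if entry is None:
            return i if free is None else free
        if entry is DUMMY:
            if free is None:
                free = i  # remember the first reusable slot
        elif entry[0] == key:
            return i
        j = 5 * j + 1 + perturb
        perturb >>= PERTURB_SHIFT


def delete(table, key, h):
    i = lookup(table, key, h)
    if table[i] is None or table[i] is DUMMY:
        raise KeyError(key)
    table[i] = DUMMY  # keep the probe chain intact for other keys


B, Z = 12544037731, 15616046971  # hashes of 'b' and 'z' (64-bit)
table = [None] * 8
table[lookup(table, 'b', B)] = ('b', 2)
table[lookup(table, 'z', Z)] = ('z', 26)
delete(table, 'b', B)
print(lookup(table, 'z', Z))  # 5: probing continues past the dummy in slot 3
print(lookup(table, 'b', B))  # 3: the dummy slot is offered for reuse
```

If the deleted slot were simply emptied, the lookup for 'z' would stop at slot 3 and report the key as missing, which is why a dummy marker is used instead.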

