MySQL hash index

2025-07-04 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

When looking at indexes in MySQL, the first thing to know is that storage engines work differently, and not every storage engine supports every index type. Even when several engines support the same index type, their implementations differ. The default index type for InnoDB and MyISAM is the B-Tree index, while the default for MEMORY is the hash index. MySQL's main index types are the hash index, the B-Tree index, the full-text index, and the spatial data index (R-Tree).

This article covers the hash index, which is built on a hash table and is only effective for queries that exactly match every column of the index. For each row, the storage engine calculates a hash code over all indexed columns. The hash code is a small value, and in most cases rows with different key values produce different hash codes. There are exceptions: different column values can produce the same hash code, which is called a hash conflict. The hash index stores all the hash codes in a hash table, together with a pointer to each data row. Hashing lends itself well to indexing: to build a hash index on one or more columns, the engine applies a hash algorithm to the values of those columns, and each resulting hash value corresponds to one or more rows. The implementation principle of a hash index is as follows:

In the figure above: keys are the column values the index is built on; buckets are the hash table, made up of each calculated hash value and the physical location of the corresponding data; entries are the actual data rows.

After the hash index is created, a hash code is calculated for each key value through a specific algorithm. Note that different key values may produce the same hash value. In the figure, John Smith and Sandra Dee both hash to 152. A lookup then finds, in the hash table, the physical location stored under hash value 152. That location corresponds to two rows (John Smith 521-1234 and Sandra Dee 521-9655), so both rows have to be traversed to find the one you need. This is why the efficiency of a hash index drops when hash conflicts are severe.

The process of retrieving data through a hash index (excerpted from the web):

When we build a hash index on one or more columns (currently only the MEMORY engine supports this explicitly), a file similar to the following is generated on the hard disk, with a hash value column and a storage address column: 1db54bc745a17745bca452157d476455677cc. The hash value is calculated from the specified column data through a specific algorithm, and the storage address is the address of the data row on the hard disk (it may also be some other kind of address; in fact, MEMORY keeps the hash table in memory). So for WHERE age = 18, the engine calculates the hash value of 18 with the same algorithm => finds the corresponding storage address in the hash table => fetches the data at that address => and finally checks whether that row really is the row being queried. Each query therefore has to search the hash table until the corresponding hash value is found.
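The lookup steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `hash_code`, `ToyHashIndex`, the bucket count of 4, and the third row are hypothetical and are not how the MEMORY engine is implemented. The modulo hash is chosen so that ages 18 and 22 land in the same bucket, which mirrors the John Smith and Sandra Dee conflict.

```python
from collections import defaultdict

NUM_BUCKETS = 4  # assumption: tiny on purpose so that conflicts are easy to see


def hash_code(key):
    """Toy hash function (an assumption, not MySQL's algorithm)."""
    return key % NUM_BUCKETS


class ToyHashIndex:
    """Maps hash code -> list of row positions (standing in for storage addresses)."""

    def __init__(self, rows, column):
        self.rows = rows
        self.column = column
        self.buckets = defaultdict(list)
        for position, row in enumerate(rows):
            self.buckets[hash_code(row[column])].append(position)

    def lookup(self, value):
        # 1. Hash the search value with the same algorithm used at build time.
        code = hash_code(value)
        # 2. Find the row positions stored under that hash code.
        candidates = self.buckets.get(code, [])
        # 3. Fetch each candidate row and check that it really matches,
        #    because different keys can share one hash code.
        return [self.rows[p] for p in candidates if self.rows[p][self.column] == value]


# Hypothetical table; only the first two names come from the article's figure.
rows = [
    {"name": "John Smith", "age": 18},
    {"name": "Sandra Dee", "age": 22},
    {"name": "Row Three", "age": 31},
]
index = ToyHashIndex(rows, "age")

matches = index.lookup(18)  # plays the role of WHERE age = 18
```

Here the bucket for hash code 2 holds two row positions, so the lookup for 18 has to inspect both rows before it can discard the one with age 22. The more keys share a bucket, the more of this verification work each lookup does.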
When the amount of data is large, the hash table grows, traversal takes longer, and performance declines.

Application of the MySQL hash index:

A hash index does not need to search step by step from the root node to a leaf node as a B+ tree does. A single hash calculation locates the target position immediately, which is very fast. Hash indexes suit only certain scenarios, but where they do fit, the performance gain is significant. Besides the MEMORY engine, the NDB engine also supports unique hash indexes.

The InnoDB engine has a special feature called the adaptive hash index. When InnoDB notices that some index values are used very frequently, it builds a hash index in memory on top of the B-Tree index, which gives the B-Tree index some of the advantages of a hash index, such as fast hash lookups. This is a fully automatic, internal behavior that users cannot control or tune, but the feature can be turned off if necessary (innodb_adaptive_hash_index=OFF; the default is ON in MySQL 8.0 and earlier).

Because hash tables have a clear advantage when dealing with small amounts of data, hash indexes are well suited to caching (in-memory databases). For example, MemSQL (an in-memory database compatible with the MySQL protocol), the widely used caching tool Memcached, and the NoSQL database Redis all use some form of hash indexing. MySQL's MEMORY engine can also meet this need if you would rather not learn these tools.

The limitations of the MySQL hash index:

(1) A hash index can only satisfy equality queries using "=", "IN" and "<=>"; it cannot be used for range queries. The index compares hash values, and the ordering of the hash values produced by the hash algorithm is not guaranteed to match the ordering of the original keys, so the index can be used for equality filtering only, not for range-based filtering.

(2) Because a hash index is not stored in index-value order, it cannot be used to avoid sorting. The index stores the hash values produced by the hash calculation, and their ordering is not necessarily the same as that of the original keys, so the database cannot use the index data to avoid any sort operation.

(3) A hash index cannot be queried with a partial index key; that is, composite indexes do not support the leftmost-prefix matching rule. For a composite index, the hash value is calculated over all index keys combined rather than for each key separately, so a query that supplies only the first key or the first few keys of the composite index cannot use it. For example, a hash index created on columns (A, B) cannot be used by a query that filters only on column A.

(4) A hash index can never avoid reading the table. As described above, the index stores the hash value of the index key together with the corresponding row pointer in a hash table. Because different key values may produce the same hash value, a query cannot be completed from the hash index alone, even just to count the records that match a given key. The actual rows in the table still have to be read and compared to obtain the correct result.

(5) When a large number of hash values are equal, a hash index does not necessarily perform better than a B-Tree index.
For index keys with low selectivity (many duplicate values and therefore many hash conflicts; selectivity is the ratio of distinct values to total rows, so higher selectivity means fewer duplicates), a hash index ends up with a large number of record pointers attached to the same hash value. Locating a single record then becomes expensive and wastes many accesses to table data, resulting in poor overall performance.

(6) A hash index contains only hash values and row pointers, not field values, so the values in the index cannot be used to avoid reading rows (that is, a hash index cannot be used for a covering index scan). However, accessing rows in memory is fast (the MEMORY engine keeps its data in memory), so in most cases the performance impact is not noticeable.

(7) If there are many hash conflicts, some index maintenance operations also become expensive. For example, if a hash index is built on a column with low selectivity (and therefore many hash conflicts), then when a row is deleted from the table, the storage engine has to traverse every entry in the linked list for that hash value to find and delete the corresponding reference. The more conflicts there are, the greater the cost.

Conclusion: if the hash key has no duplicate values and the amount of data is small, retrieving data through a hash index is faster than through a B-Tree index. When hash values repeat and the amount of data is very large, retrieval through a hash index is less efficient than through a B-Tree index. Hash indexes suit only certain scenarios, but where they do fit, the performance gain is significant, because they do not traverse and retrieve data level by level as B-Tree indexes do.
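Limitations (1) to (3) can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the rows, the column names a and b, the bucket count, and the `hash_code` function are hypothetical and unrelated to MySQL's real hashing. The point is only that the hash is computed over the whole key, so anything other than full-key equality falls back to scanning the rows.

```python
NUM_BUCKETS = 8  # assumed toy size


def hash_code(*key_parts):
    """Toy hash (an assumption) computed over ALL key parts combined."""
    code = 0
    for part in key_parts:
        code = code * 31 + part
    return code % NUM_BUCKETS


# Hypothetical rows: (a, b, payload)
rows = [(1, 2, "row-0"), (1, 5, "row-1"), (2, 2, "row-2"), (3, 7, "row-3")]

# Composite hash index on (a, b): hash code -> row positions
index_ab = {}
for position, (a, b, _payload) in enumerate(rows):
    index_ab.setdefault(hash_code(a, b), []).append(position)


def find_a_and_b(a, b):
    """Equality on every indexed column: hash once, then verify candidates."""
    candidates = index_ab.get(hash_code(a, b), [])
    return [rows[p] for p in candidates if rows[p][:2] == (a, b)]


def find_a_only(a):
    """Only the leading column is known, so hash_code(a, b) cannot be
    computed; the index is unusable and every row is scanned."""
    return [row for row in rows if row[0] == a]


def find_a_between(low, high):
    """Range predicate: hash codes do not preserve key order, so the
    index cannot narrow the range and every row is scanned."""
    return [row for row in rows if low <= row[0] <= high]


# Walking the index in hash-code order does not yield keys in sorted order,
# which is why the index cannot replace a sort.
keys_in_hash_order = [rows[p][:2] for code in sorted(index_ab) for p in index_ab[code]]
```

In this toy data, (1, 5) and (3, 7) share a bucket, so even the full-key lookup has to compare the candidate rows against the search values before returning a result.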

The particular structure of a hash index makes its retrieval very efficient: a lookup can be located in one step, whereas a B-Tree index has to go from the root node through the branch nodes before reaching a leaf node, which takes several accesses. In the scenarios that suit it, the query efficiency of a hash index is therefore much higher than that of a B-Tree index.
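The trade-off in the conclusion can be made concrete with a rough count of how many rows a lookup has to check. This is a sketch under assumptions: the 1000-row table, the two columns, the bucket count, and `hash_code` are hypothetical, and the binary-search bound is only a stand-in for an ordered index, not a model of a real B-Tree.

```python
import math
from collections import defaultdict

NUM_BUCKETS = 1024  # assumed to be larger than the number of distinct keys
ROW_COUNT = 1000    # hypothetical table size


def hash_code(key):
    """Toy hash function (an assumption, not MySQL's algorithm)."""
    return key % NUM_BUCKETS


def build_buckets(values):
    """Hash code -> positions of the rows that carry that code."""
    buckets = defaultdict(list)
    for position, value in enumerate(values):
        buckets[hash_code(value)].append(position)
    return buckets


def selectivity(values):
    """Distinct values divided by total rows."""
    return len(set(values)) / len(values)


def rows_checked(buckets, value):
    """Candidate rows a lookup must verify after hashing the value."""
    return len(buckets.get(hash_code(value), []))


unique_ids = list(range(ROW_COUNT))               # every value distinct
status_flags = [i % 2 for i in range(ROW_COUNT)]  # only two distinct values

id_buckets = build_buckets(unique_ids)
flag_buckets = build_buckets(status_flags)

checked_for_id = rows_checked(id_buckets, 42)
checked_for_flag = rows_checked(flag_buckets, 1)

# Rough upper bound on comparisons for a binary search over sorted keys,
# used here only as a stand-in for an ordered index.
ordered_lookup_steps = math.ceil(math.log2(ROW_COUNT))
```

With unique keys the hash lookup checks a single row, fewer than the roughly ten comparisons of the ordered search. With the two-valued column it checks 500 rows, which is the situation in which the article says a hash index loses to a B-Tree index.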

