This article explains cache breakdown in big data systems and how to solve it, along with the closely related problem of cache penetration. The content is concise and easy to follow; I hope you get something out of it.
Cache breakdown
Cache breakdown happens when a hot key expires at a certain point in time while there are a large number of concurrent requests for that key. Each request sees the cache miss, loads the data from the back-end DB, and writes it back to the cache; under high concurrency these simultaneous loads can overwhelm the back-end DB.
A closely related problem is cache penetration: the requested key exists in neither Redis nor the database, so every such request falls through to the DB. Solution one: when the database returns null, write a null marker into Redis as well; when the same key is requested again, Redis returns the null directly instead of querying the database again (a minimal sketch follows).
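Here is a minimal sketch of that null-caching idea, assuming the Jedis client; the sentinel value, the TTLs, and the queryDatabase helper are illustrative, not from the original article:

import redis.clients.jedis.Jedis;

public class NullCachingDao {
    private static final String NULL_MARKER = "__NULL__"; // sentinel meaning "absent in DB"
    private final Jedis jedis = new Jedis("localhost", 6379);

    public String get(String key) {
        String cached = jedis.get(key);
        if (cached != null) {
            // Hit: either real data or a previously cached miss
            return NULL_MARKER.equals(cached) ? null : cached;
        }
        String fromDb = queryDatabase(key);
        if (fromDb == null) {
            // Cache the miss with a short TTL so later inserts become visible
            jedis.setex(key, 60, NULL_MARKER);
            return null;
        }
        jedis.setex(key, 3600, fromDb);
        return fromDb;
    }

    private String queryDatabase(String key) {
        return null; // placeholder for the real database query
    }
}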
Solution two: put a filtering layer in front of the cache, such as a Bloom filter:
Put all the keys that exist in the database into the Bloom filter.
When a query request arrives, run it through the Bloom filter first: if the filter judges that the requested value may exist, continue with the query; if it judges that the value does not exist, discard the request directly.
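A minimal sketch of this gate using Guava's BloomFilter (the class and its create/put/mightContain methods are Guava's; the sizes, key list, and lookupCacheThenDb helper are illustrative assumptions):

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class BloomGate {
    // Sized for an expected 1,000,000 keys at a 1% false positive rate
    private final BloomFilter<String> filter =
            BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 1_000_000, 0.01);

    // Step 1: load every key that exists in the database into the filter
    public void warmUp(List<String> allDbKeys) {
        allDbKeys.forEach(filter::put);
    }

    // Step 2: check each incoming query against the filter first
    public String get(String key) {
        if (!filter.mightContain(key)) {
            return null; // definitely not in the DB: discard, touch neither cache nor DB
        }
        return lookupCacheThenDb(key); // may exist (small chance of a false positive)
    }

    private String lookupCacheThenDb(String key) {
        return null; // placeholder for the real Redis/DB lookup
    }
}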
Using a Bloom filter to solve cache penetration (key point)
Brief introduction
Bloom Filter (hereinafter BF), proposed by Burton Howard Bloom in 1970, is a space-efficient probabilistic data structure used specifically to test whether an element is a member of a set. That sounds like a common requirement, so why do we need BF as a dedicated data structure?
Advantages and disadvantages
Advantages:
There is no need to store the data itself; elements are represented purely as bits, so the space usage is far smaller than that of traditional structures, and the stored data stays confidential.
Time efficiency is also high: insertion and query both take O(k), where k is the number of hash functions.
The hash functions are independent of each other and can be computed in parallel at the hardware instruction level.
Disadvantages:
There is a false positive probability, so it is unsuitable for any scenario that demands 100% accuracy.
Elements can only be inserted and queried, never deleted, and for the same underlying reason as the false positives: several elements can share a bit. One might think of counting (widening each bit to a counter) to record how many elements map to a position, but even then there is no guarantee that a deleted element was ever in the set.
Therefore, BF suits scenarios where perfect accuracy is not required but time and space efficiency matter. Moreover, because it produces no false negatives, it works remarkably well for "definitely does not exist" logic; for example, it can sit in front of a cache system (such as Redis) to prevent cache penetration.
Design idea
BF is a data structure composed of an m-bit bit array and k hash functions. The bit array is initialized to all zeros, and each hash function should spread its input over the array as uniformly as possible.
The structure itself is just a very long binary vector, so each position obviously stores either 0 or 1.
Now let's create a new Bloom filter of length 16, with every bit defaulting to 0:
0000000000000000 (indices 0 through 15)
Now you need to add a piece of data:
Using some hash function, say Hash1, we compute Hash1(data) = 5 and set the bit at index 5 to 1:
0000010000000000
Using another, say Hash2, we compute Hash2(data) = 9 and set the bit at index 9 to 1:
0000010001000000
And with a third, say Hash3, we compute Hash3(data) = 2 and set the bit at index 2 to 1:
0010010001000000
In this way, the newly added data occupies slots 5, 9, and 2 of the Bloom filter.
Notice that the Bloom filter itself holds no complete data at all; it merely runs the input through a set of mapping functions to compute positions and sets those bits in the binary vector.
What does this buy us? Suppose I hand you another piece of data and ask whether it is a duplicate. How do you check?
You simply run the data through the same three fixed calculations to find which slots it occupies, then check whether all of those slots are 1. If any slot is not 1, the data is definitely not in the filter. This is easy to see: hand me back the exact data we just added, and the three calculations must produce exactly the same results, landing again on slots 5, 9, and 2.
There is one caveat. Even if all of those slots are 1, the given data is not necessarily a duplicate: other data may produce the same results under the three fixed calculations. This is the same reason we cannot declare two objects equal just because their hash codes are equal.
In other words, a Bloom filter can tell you that data definitely does not exist, but it cannot tell you that data definitely exists.
Having introduced adding and querying, the natural next step would be deletion, but unfortunately a Bloom filter cannot easily delete data. Why? Think it through: to delete the data above, you would reset slots 5, 9, and 2 to 0; but other data may also map to slots 5, 9, and 2, and clearing them would corrupt those entries as well. Wouldn't that be a mess?
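The whole walk-through can be condensed into a few lines of code. Below is a minimal toy Bloom filter; the 16-bit size, the seed values, and the hand-rolled hash routine are illustrative assumptions, not taken from any library:

import java.util.BitSet;

public class ToyBloomFilter {
    private static final int SIZE = 16;
    private static final int[] SEEDS = {7, 11, 13}; // one seed per "hash function"
    private final BitSet bits = new BitSet(SIZE);

    // k simple hash functions derived from one routine plus different seeds
    private int hash(String data, int seed) {
        int h = 0;
        for (char c : data.toCharArray()) {
            h = h * seed + c;
        }
        return Math.abs(h % SIZE);
    }

    public void add(String data) {
        for (int seed : SEEDS) {
            bits.set(hash(data, seed)); // flip each mapped slot to 1
        }
    }

    public boolean mightContain(String data) {
        for (int seed : SEEDS) {
            if (!bits.get(hash(data, seed))) {
                return false; // any 0 slot means "definitely not present"
            }
        }
        return true; // all slots are 1: present, or a false positive
    }

    // Deliberately no remove(): clearing a slot could also "delete"
    // every other element that happens to map to that slot.
}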
Bloom Filter implementation
There are many implementations and optimizations of Bloom filters, and Guava provides one.
When using a Bloom filter, the two parameters you cannot avoid are the estimated data volume n and the desired false positive rate fpp.
When implementing one, the two decisions you cannot avoid are the choice of hash functions and the size of the bit array.
For a given scenario, we estimate the volume of data to store as n and the acceptable false positive rate as fpp; from these we must derive the required bit array size m and the number of hash functions k, and then choose the hash functions.
Bit array size selection
From the estimated data volume n and the false positive rate fpp, the size m of the bit array is:
m = -n * ln(fpp) / (ln 2)^2
Hash function selection
From the estimated data volume n and the bit array length m, the number of hash functions k is:
k = (m / n) * ln 2
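As a quick worked example of both formulas (the numbers are illustrative assumptions): for n = 1,000,000 elements and fpp = 0.01, m = -1,000,000 * ln(0.01) / (ln 2)^2 ≈ 9,585,059 bits, which is roughly 1.1 MB; and k = (9,585,059 / 1,000,000) * ln 2 ≈ 6.64, which rounds to 7 hash functions.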
The choice of hash function has a large impact on performance; a good hash function maps inputs to every bit with approximately equal probability.
Selecting k genuinely different hash functions is troublesome. A simpler approach is to select one hash function and feed it k different parameters (seeds).
Take a look at how m and k are computed in Guava's BloomFilter, in the com.google.common.hash.BloomFilter class:
/**
 * Computes m, the total number of bits of the Bloom filter.
 *
 * See http://en.wikipedia.org/wiki/Bloom_filter#Probability_of_false_positives for the formula.
 *
 * @param n expected data volume
 * @param p false positive rate (must be 0 < p < 1)
 */
@VisibleForTesting
static long optimalNumOfBits(long n, double p) {
    if (p == 0) {
        p = Double.MIN_VALUE;
    }
    return (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
}

/**
 * Computes the optimal k, the number of hashes per element inserted into the Bloom filter.
 *
 * See http://en.wikipedia.org/wiki/File:Bloom_filter_fp_probability.svg for the formula.
 *
 * @param n expected data volume
 * @param m total number of bits of the Bloom filter (must be positive)
 */
@VisibleForTesting
static int optimalNumOfHashFunctions(long n, long m) {
    // (m / n) * log(2), but avoid truncation due to division!
    return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
}
Another key point of the BloomFilter implementation is how the hash functions map data into the bit array. Guava computes a 128-bit MurmurHash3 hash of the element and then derives the element's multiple positions in the bit array from the lower 8 bytes and the upper 8 bytes of that hash. For details of the algorithm see MurmurHash, invented in 2008; it is also used by HBase, Redis, Kafka, and others.
This process is implemented in two places:
Putting data into the Bloom filter
Checking whether data is already in the Bloom filter
The two are implemented almost identically, except that the former writes bits and the latter reads them.
Let's look at the put process, taking the MURMUR128_MITZ_64 hash strategy as the example; its put method is sketched below.
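This sketch follows the shape of Guava's source; the bit-array type and the lowerEight/upperEight byte-extraction helpers vary across Guava versions, so treat it as illustrative rather than the exact library code:

public <T> boolean put(T object, Funnel<? super T> funnel, int numHashFunctions, BitArray bits) {
    long bitSize = bits.bitSize();
    // One 128-bit MurmurHash3 pass over the element (Guava's own code reads the
    // raw bytes through an internal accessor; asBytes() is the public equivalent)
    byte[] bytes = Hashing.murmur3_128().hashObject(object, funnel).asBytes();
    long hash1 = lowerEight(bytes); // lower 8 bytes of the 128-bit hash
    long hash2 = upperEight(bytes); // upper 8 bytes of the 128-bit hash

    boolean bitsChanged = false;
    long combinedHash = hash1;
    for (int i = 0; i < numHashFunctions; i++) {
        // hash1 + i * hash2 simulates k hash functions from a single hash;
        // masking with Long.MAX_VALUE keeps the index non-negative
        bitsChanged |= bits.set((combinedHash & Long.MAX_VALUE) % bitSize);
        combinedHash += hash2;
    }
    return bitsChanged; // true if any bit flipped, i.e. the element was new
}

This is exactly the "one hash function plus different parameters" trick mentioned above: a single Murmur hash split into two longs yields all k positions.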