2025-04-01 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/03 Report--
Unit conversions:
1 Byte = 8 Bits
1 KB = 1024 Bytes
1 MB = 1024 KB
1 GB = 1024 MB
2 ^ 2 = 4
2 ^ 4 = 16
2 ^ 8 = 256
2 ^ 10 = 1024
2 ^ 16 = 65 536
2 ^ 20 = 1 048 576
2 ^ 32 = 4 294 967 296
Basic methods
1. Hash method
Hashing is based on a mapping: given a data element whose keyword is key, compute hash(key) with a fixed hash function and use hash(key) as the storage address (hash address) of that element. Insertion and lookup both go through this address. In short, a hash function compresses a message of any length into a message digest of fixed length.
A hash table is an array of fixed size, and its length should preferably be a prime number. A hash function maps keywords to storage addresses, but the mapping is not guaranteed to be one-to-one: different elements may well produce the same function value. This is a hash conflict (collision).
1.1 Common ways to construct a hash function:
1.1.1 Direct addressing method
H(key) = key or H(key) = a * key + b
1.1.2 Modulo method
H(key) = key mod p
1.1.3 Digit analysis method
1.1.4 Folding method
1.1.5 Mid-square method
1.1.6 Division-remainder method
H(key) = key % p
1.1.7 Random number method
H(key) = random(key)
1.2 Common conflict resolution methods
1.2.1 Open addressing method
When an address conflict occurs, keep probing other storage addresses in the hash table according to a fixed rule until a free address is found.
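A minimal sketch of open addressing, assuming integer keys, the modulo hash H(key) = key mod p, and linear probing (step to the next slot) as the probing rule. The class name `LinearProbingTable` and the table length 11 are illustrative choices, not taken from the article, and deletion is left out.

```python
class LinearProbingTable:
    """Fixed-size hash table using open addressing with linear probing."""

    _EMPTY = object()  # sentinel that marks a free slot

    def __init__(self, size=11):  # 11 is an arbitrary small prime table length
        self.size = size
        self.keys = [self._EMPTY] * size
        self.values = [None] * size

    def _probe(self, key):
        """Yield slot indexes starting at key mod size, then one step at a time."""
        start = key % self.size
        for step in range(self.size):
            yield (start + step) % self.size

    def put(self, key, value):
        """Store the pair in the first free (or matching) slot and return its index."""
        for slot in self._probe(key):
            if self.keys[slot] is self._EMPTY or self.keys[slot] == key:
                self.keys[slot] = key
                self.values[slot] = value
                return slot
        raise RuntimeError("hash table is full")

    def get(self, key):
        """Return the stored value, or None if the key is absent."""
        for slot in self._probe(key):
            if self.keys[slot] is self._EMPTY:
                return None  # reached a free slot, so the key was never stored
            if self.keys[slot] == key:
                return self.values[slot]
        return None


table = LinearProbingTable(size=11)
slot_a = table.put(3, "a")   # 3 mod 11 = 3
slot_b = table.put(14, "b")  # 14 mod 11 = 3 too: conflict, so it moves on to slot 4
```

The second insertion shows the conflict case: both keys hash to address 3, and the later one is placed at the next free address.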
1.2.2 Chain address method (separate chaining)
1.2.3 Rehashing method
When a conflict occurs, compute the address with a second, third, ... hash function until there is no conflict. This increases the time cost.
1.2.4 Public overflow area
Advantages and disadvantages:
Hashing is mainly used for fast access: the target element can be found, or its existence checked, in O(1) time on average. The data in a hash structure is unordered, so the structure says nothing about how the storage positions of different elements relate to one another, but the position and existence of any single element can still be determined in constant time. When processing massive data, hashing can be used to access and count data quickly and to classify large volumes of data, for example to extract the IP address that visited a website the most times on a given day.
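The "classify, then count" idea for the most frequent IP can be sketched roughly as follows. This is an illustration under assumptions: the function name `most_frequent_ip`, the use of CRC32 as the hash function, and the in-memory lists standing in for small files on disk are all choices made for the example.

```python
import zlib
from collections import Counter


def most_frequent_ip(ips, num_buckets=4):
    """Return (ip, count) for the IP address that occurs most often."""
    # Step 1: classify. Equal IPs always hash to the same bucket, so each
    # bucket can be counted on its own. In a real job each bucket would be
    # a small file on disk rather than a list in memory.
    buckets = [[] for _ in range(num_buckets)]
    for ip in ips:
        buckets[zlib.crc32(ip.encode()) % num_buckets].append(ip)

    # Step 2: count inside each bucket with a hash map and keep the best.
    best = (None, 0)
    for bucket in buckets:
        if not bucket:
            continue
        ip, count = Counter(bucket).most_common(1)[0]
        if count > best[1]:
            best = (ip, count)
    return best


log = ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3", "10.0.0.1", "10.0.0.2"]
winner = most_frequent_ip(log)
```

Because identical addresses never land in different buckets, the per-bucket maximum with the highest count is also the global maximum.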
2. Bit-map method
The basic principle of the bit-map (bitmap) method is to use an array of bits to indicate whether certain elements exist.
The result of the bit-map method is an N-bit string in which each bit is "1" or "0", indicating whether the corresponding number is in the set.
Example: sort 1 billion IPv4 addresses, each of which appears only once.
Idea: convert the 1 billion IP addresses to 32-bit unsigned integers with a simple rule, sort the 1 billion integers, and convert them back to IP addresses. A better idea is to allocate a bit array 2^32 bits long and set the bit that corresponds to each address, which saves more space than the first approach.
Integer array: 4 Bytes * 1 billion = 4 000 000 000 Bytes = 4 000 000 000 / 1024 / 1024 / 1024 GB ≈ 3.725 GB
(The size calculated in the original picture is not quite the same as mine.)
Bit array: 2 ^ 32 bits = 4 294 967 296 bits = 4 294 967 296 / 8 / 1024 / 1024 MB = 512 MB
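A small sketch of the bitmap sort, assuming distinct non-negative integers below a known bound. The helper names `bitmap_sort`, `ip_to_int` and `int_to_ip` are made up for this example; the conversion uses Python's standard `ipaddress` module. The demo uses a 16-value universe so it runs instantly, whereas the real case would pass `universe_size=2**32` and need the 512 MB computed above.

```python
import ipaddress


def ip_to_int(ip):
    """Convert a dotted IPv4 string to a 32-bit unsigned integer."""
    return int(ipaddress.IPv4Address(ip))


def int_to_ip(value):
    """Convert a 32-bit unsigned integer back to a dotted IPv4 string."""
    return str(ipaddress.IPv4Address(value))


def bitmap_sort(values, universe_size):
    """Sort distinct integers in [0, universe_size) using one bit per value."""
    bits = bytearray((universe_size + 7) // 8)
    for v in values:
        bits[v >> 3] |= 1 << (v & 7)  # mark v as present
    # Scanning the bits from low to high yields the values in ascending order.
    return [v for v in range(universe_size) if bits[v >> 3] & (1 << (v & 7))]


sorted_values = bitmap_sort([9, 3, 14, 0, 7], universe_size=16)
as_int = ip_to_int("192.168.0.1")
```

Sorting falls out of the scan order, so no comparison sort is needed; the cost is proportional to the size of the universe rather than to the number of elements.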
3. Bloom Filter method
An example to introduce the Bloom filter (Bloom Filter):
In the worst case, storing the data in a hash table or a database table requires a lot of space:
64 Bytes * 10 billion = 640 000 000 000 Bytes = 640 000 000 000 / 1024 / 1024 / 1024 GB ≈ 596.046 GB
You can use Bloom filters when you encounter the following situations:
Specific to the Bloom filter:
The solution to the above problem:
Create a bit array, bitarray, of length m.
Take k hash functions whose output range is >= m. Each hash function should distribute values well, and the functions should be independent of one another.
For each URL, compute k hash values with the k hash functions, map each hash value to a position in bitarray, and blacken that bit (set it to 1). If the bit is already black, leave it unchanged.
To determine whether a URL is on the blacklist:
Compute k hash values for the URL with the k hash functions.
Map the k hash values to positions in bitarray. If every one of those bits is black, the URL is reported as being on the blacklist. If even one bit is not black, the URL is definitely not on the blacklist.
When too many URLs are inserted and bitarray is too small, most of bitarray ends up blackened, which is likely to cause misjudgments: even for a URL that is not on the blacklist, every bit it maps to may already be black.
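The steps above can be sketched as a small class. This is only an illustration: the class name `BloomFilter`, the method names, and the way the k hash functions are derived (SHA-256 salted with the function index) are assumptions made for the example, not the article's implementation.

```python
import hashlib


class BloomFilter:
    """Bit array of length m with k hash functions, as described above."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)  # all bits start "white" (0)

    def _positions(self, item):
        # Assumption: simulate k independent hash functions by salting
        # SHA-256 with the index i, then reduce each value modulo m.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        """Blacken the k bits that the item maps to."""
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def might_contain(self, item):
        """False means definitely absent; True means probably present."""
        return all(
            self.bits[pos >> 3] & (1 << (pos & 7))
            for pos in self._positions(item)
        )


blacklist = BloomFilter(m=1024, k=3)
urls = ["http://bad.example/a", "http://bad.example/b", "http://bad.example/c"]
for url in urls:
    blacklist.add(url)
```

A URL that was added is always reported as present, so the filter never misses a blacklisted URL; the only possible error is reporting a clean URL as blacklisted.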
How to determine the bitarray size:
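The sizing rule itself is not reproduced in this copy of the article. As a hedged stand-in, the sketch below uses the standard textbook formulas for a Bloom filter holding n items at a target false-positive rate p: m = -n * ln(p) / (ln 2)^2 bits and k = (m / n) * ln 2 hash functions. The function name `bloom_parameters` and the sample values are illustrative.

```python
import math


def bloom_parameters(n, p):
    """Return (m, k) for n items and a target false-positive rate p.

    m is the number of bits in the array, k the number of hash functions.
    Uses the standard formulas m = -n ln p / (ln 2)^2 and k = (m / n) ln 2.
    """
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round(m / n * math.log(2)))
    return m, k


# Illustrative values only: 1000 items at a 1% false-positive rate.
m_small, k_small = bloom_parameters(1000, 0.01)
```

Under these formulas, 10 billion URLs at an assumed 0.01% false-positive rate need a bit array in the low tens of gigabytes, far below the 596 GB of the hash-table approach, because m depends only on the number of items and not on the 64-byte URL length.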
Summary:
4. Database optimization method
4.1 Excellent database management tools
4.2 Data partitioning
4.3 Indexes
4.4 Caching mechanism
4.5 Increase virtual storage
4.6 Batch processing
4.7 Use temporary and intermediate tables
4.8 Optimize query statements
4.9 Use views
4.10 Use stored procedures
4.11 Use sorting instead of non-sequential access
4.12 Data mining using sampled data
5. Inverted index
6. External sorting method
7. Trie tree
8. Heap
9. Double bucket method
10. MapReduce method
MapReduce can be divided into two phases:
Map phase: split a large task into several subtasks (using a hash function) and assign each subtask to a node for processing.
Reduce phase: the subtasks are processed concurrently, and their results are then merged.
The principle of MapReduce is simple, but it is hard to implement well in engineering practice.
Example: count the occurrences of each word in an article.
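A single-process sketch of the word-count example, meant only to show the two phases. The function names `map_phase`, `reduce_phase` and `word_count`, the CRC32 hash, and the whitespace tokenizer are assumptions for illustration; a real MapReduce framework would run the partitions on separate nodes.

```python
import zlib
from collections import Counter, defaultdict


def map_phase(text, num_reducers):
    """Emit (word, 1) pairs and route each word to a partition by hash."""
    partitions = defaultdict(list)
    for word in text.lower().split():
        partitions[zlib.crc32(word.encode()) % num_reducers].append((word, 1))
    return partitions


def reduce_phase(pairs):
    """Sum the counts of one partition."""
    counts = Counter()
    for word, one in pairs:
        counts[word] += one
    return counts


def word_count(text, num_reducers=3):
    """Run the map phase, reduce every partition, and merge the results."""
    partitions = map_phase(text, num_reducers)
    total = Counter()
    for pairs in partitions.values():  # each partition could run on its own node
        total.update(reduce_phase(pairs))
    return dict(total)


counts = word_count("the cat saw the other cat")
```

Since a given word always hashes to the same partition, the partial results never overlap and merging is a simple union.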
Case analysis
The keys to solving common massive-data processing problems:
1. Divide and conquer. Use a hash function to distribute a large task across machines, or to split it into small files.
2. Make use of a hash map or a bitmap.
Difficulties: communication, and estimating time and space.
1. Top K questions (an example from the "Java programmer interview Treasure Book", often encountered in interviews)
For example, find the 10 most popular search terms in a search engine, or the 10 most downloaded songs in a song library.
Hint: this can be solved with MapReduce, for example on Hadoop.
2. Find a number that is missing from a given range (an interview question from a certain company)
The data range is -2 ^ 32 ~ 2 ^ 32. Given an unordered array of hundreds of millions of distinct numbers within that range, return any one number that is missing from the array (that is, a number that is not in the array but lies within -2 ^ 32 ~ 2 ^ 32). The time and space requirements are both strict, so a compromise is needed.
Hint: bisection; maximum and minimum.
2.1 I came across this question in a Niu Ke video; it is essentially the same as the question above.
The range of 32-bit unsigned integers is 0 to 4294967295. There is a file containing exactly 4 billion unsigned integers, so some numbers in that range must never appear. Using at most 10 MB of memory, find any one number that does not appear. How?
Analysis: if all the numbers are recorded in a hash table, in the worst case there are 4 billion distinct numbers. Each record takes 4 bytes, so about 16 GB of memory is needed.
Solution 1: use a bitarray
Solution 2: partition the data into ranges
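One way to combine the two solutions within the memory limit, sketched under assumptions: split the value range into equal sub-ranges, count how many numbers fall into each, then build a bitmap only for a sub-range that is not full. The function name `find_missing` and the `read_numbers` callable (a stand-in for re-reading the file) are hypothetical. With 64 sub-ranges of the 32-bit universe, each bitmap covers 2^26 values, which is 8 MB.

```python
def find_missing(read_numbers, universe_bits=32, bucket_bits=6):
    """Return one value in [0, 2**universe_bits) that never appears.

    read_numbers is a hypothetical callable returning a fresh iterator over
    the numbers, so the data can be scanned twice. Returns None if the
    counting pass cannot locate an under-full sub-range.
    """
    num_buckets = 1 << bucket_bits
    shift = universe_bits - bucket_bits
    bucket_size = 1 << shift

    # Pass 1: count how many numbers fall into each sub-range.
    counts = [0] * num_buckets
    for x in read_numbers():
        counts[x >> shift] += 1

    # A sub-range holding fewer numbers than its capacity must miss a value.
    target = next((b for b, c in enumerate(counts) if c < bucket_size), None)
    if target is None:
        return None

    # Pass 2: bitmap for the chosen sub-range only.
    bits = bytearray((bucket_size + 7) // 8)
    for x in read_numbers():
        if x >> shift == target:
            off = x & (bucket_size - 1)
            bits[off >> 3] |= 1 << (off & 7)

    for off in range(bucket_size):
        if not bits[off >> 3] & (1 << (off & 7)):
            return (target << shift) | off
    return None


# Tiny demo: a 4-bit universe (0..15) split into 4 sub-ranges, 9 is missing.
data = [12, 3, 0, 15, 7, 1, 8, 14, 2, 5, 11, 4, 6, 13, 10]
missing = find_missing(lambda: iter(data), universe_bits=4, bucket_bits=2)
```

The trade-off is one extra pass over the file in exchange for a bitmap 64 times smaller than the full 512 MB one.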
Summary:
3. Sort the ages of 1 billion people
4. Sorting questions (an example from the "Java programmer interview Book")
5. Find the number that appears most often
There is a large file containing 2 billion 32-bit integers. Find the number that appears most often in it, given a memory limit of only 2 GB.
Solution 1:
Solution 2:
6. Find the 100 hottest words of the day
A search company records a huge number of user search terms every day, say tens of billions of entries. Design a feasible method to find the 100 hottest words of each day.
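One possible approach, sketched here as an assumption rather than the article's own answer: count the words with a hash map, then keep a min-heap of size K so that only the K largest counts are retained. The function name `top_k_words` is illustrative. At the scale described, the words would first be split by hash across machines or files, and the per-partition top-K lists merged the same way.

```python
import heapq
from collections import Counter


def top_k_words(words, k):
    """Return the k most frequent words as (count, word), highest first."""
    counts = Counter(words)  # hash-map counting
    heap = []                # min-heap holding at most k (count, word) pairs
    for word, count in counts.items():
        if len(heap) < k:
            heapq.heappush(heap, (count, word))
        elif count > heap[0][0]:
            # The new word beats the smallest count kept so far: swap it in.
            heapq.heapreplace(heap, (count, word))
    return sorted(heap, reverse=True)


searches = ["a"] * 5 + ["b"] * 3 + ["c"] * 4 + ["d"]
hottest = top_k_words(searches, 2)
```

The heap never grows beyond K entries, so the selection step costs O(n log K) for n distinct words instead of a full sort.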
7. Implement caching in a cluster
Engineers often use server clusters to design and implement data caches. A common strategy is as follows.
01. Whether adding, deleting, or querying data, first convert the data id to a hash value with a hash function; call this value key.
02. If there are currently N machines, compute key % N, which is the number of the machine the data belongs to. Adding, deleting, and querying all take place only on that machine. Analyze the problems this caching strategy may cause and propose an improved scheme.
Possible problems:
Solution: use a consistent hashing algorithm
Take a fixed range of hash values and join its two ends to form a ring. The position of each machine on the ring is computed from the machine id with the hash function.
How to determine which machine a piece of data belongs to: compute the hash of the data id, map it to its position on the ring, and walk clockwise to the nearest machine; that machine stores the data.
What to do when adding a node: for example, if machine m3 is added, the data in data1 is copied to m3, which is relatively cheap.
What to do when deleting a machine: just copy all the data from the deleted machine to the next machine clockwise.
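A minimal sketch of the ring lookup described above. The class name `ConsistentHashRing`, the helper `ring_position`, and the choice of the first 4 bytes of MD5 as the ring hash are assumptions for illustration. Virtual nodes and the actual copying of data are omitted, and two machines hashing to the same position is not handled.

```python
import bisect
import hashlib


def ring_position(name):
    """Hash a machine id or data id to a position on the ring [0, 2**32)."""
    return int.from_bytes(hashlib.md5(name.encode()).digest()[:4], "big")


class ConsistentHashRing:
    def __init__(self, machines=()):
        self._positions = []  # sorted ring positions of the machines
        self._owner = {}      # ring position -> machine id
        for machine in machines:
            self.add(machine)

    def add(self, machine):
        pos = ring_position(machine)
        bisect.insort(self._positions, pos)
        self._owner[pos] = machine

    def remove(self, machine):
        pos = ring_position(machine)
        self._positions.remove(pos)
        del self._owner[pos]

    def lookup(self, data_id):
        """Walk clockwise from the data's position to the first machine."""
        i = bisect.bisect_left(self._positions, ring_position(data_id))
        return self._owner[self._positions[i % len(self._positions)]]


ring = ConsistentHashRing(["m1", "m2"])
ids = [f"data{i}" for i in range(50)]
before = {d: ring.lookup(d) for d in ids}
ring.add("m3")
after = {d: ring.lookup(d) for d in ids}
```

After m3 joins, every data id either stays on its old machine or moves to m3; nothing is reshuffled between m1 and m2, which is what makes adding a node cheap compared with key % N.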
Some of the theory is taken from the Java programmer interview Book, and most of the examples are my own summaries!
Book sharing: https://pan.baidu.com/s/1lh7xnfQvm9KaRC4P_3wyHg