This article mainly explains the methods for processing large amounts of data and massive data. The explanations are simple, clear, and easy to learn and understand. Let's study these processing methods together.
1. Bloom filter
Scope of application: implementing a data dictionary, checking data for duplicates, or computing set intersections.
Basic principles and main points:
The principle is simple: a bit array plus k independent hash functions. To insert an element, set to 1 the bits at the positions given by the hash functions; to query, if all of the positions given by the hash functions are 1, the element is reported as present. Obviously this process does not guarantee that the result is 100% correct (false positives are possible). It also does not support deleting an inserted keyword, because the bits set for that keyword may be shared by other keywords. A simple improvement is the counting Bloom filter, which replaces the bit array with an array of counters and thus supports deletion.
Another important problem is how to determine the size of the bit array m and the number of hash functions k from the number of input elements n. The error rate is lowest when k = (ln 2) * (m / n). For the error rate to be no greater than E, m must be at least n * lg(1/E) to represent any set of n elements. But m should be larger still, because at least half of the bit array should remain 0; in that case m should be >= n * lg(1/E) * lg(e), which is about 1.44 * n * lg(1/E) (lg denotes the base-2 logarithm).
For example, assuming an error rate of 0.01, m should be about 13 times n, and k about 8.
Note that m and n have different units: m is measured in bits, while n is measured in number of elements (more precisely, the number of distinct elements). Since a single element is usually many bits long, using a Bloom filter usually saves memory.
Extend:
A Bloom filter maps the elements of a set into a bit array and uses whether the k mapped bits (where k is the number of hash functions) are all 1 to indicate whether an element is in the set. A counting Bloom filter (CBF) extends each bit of the array into a counter, thereby supporting deletion of elements. A spectral Bloom filter (SBF) associates the counters with the occurrence counts of set elements; SBF uses the minimum counter value to approximate an element's frequency.
Problem example: you are given two files, A and B, each storing 5 billion URLs; each URL occupies 64 bytes, and the memory limit is 4G. Find the URLs common to both files. What if there are three or even n files?
For this problem, let us estimate the memory requirement. 4G = 2^32 bytes, which is about 4 billion bytes, or about 34 billion bits; n = 5 billion, so with an error rate of 0.01 about 65 billion bits are needed. The 34 billion bits available are not far off, though this may increase the error rate somewhat. In addition, if these URLs correspond one-to-one with IPs, they can be converted to IPs, which is much simpler.
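As an illustration of the idea (not of the sizing analysis above), here is a minimal Bloom filter sketch in Python; the MD5-based hashing, the array size, and the example URLs are assumptions made only for this sketch:
import hashlib
class BloomFilter:
    def __init__(self, m_bits, k):
        self.m = m_bits                          # size of the bit array, in bits
        self.k = k                               # number of hash functions
        self.bits = bytearray((m_bits + 7) // 8)
    def _positions(self, item):
        # derive k positions from salted MD5 digests (an arbitrary choice for the sketch)
        for i in range(self.k):
            digest = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m
    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)
    def might_contain(self, item):
        # True may be a false positive; False is always correct
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))
# with m about 13 * n bits and k = 8, the false positive rate is about 0.01
bf = BloomFilter(m_bits=13 * 1000, k=8)
bf.add("http://example.com/a")
print(bf.might_contain("http://example.com/a"))    # True
print(bf.might_contain("http://example.com/zzz"))  # almost certainly False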
2. Hashing
Scope of application: a basic data structure for fast lookup and deletion; it usually requires that the total amount of data fit into memory.
Basic principles and main points:
Hash function selection: strings, integers, and permutations each have their own suitable hash methods.
Collision handling: one approach is open hashing, also known as the chaining (zipper) method; the other is closed hashing, also known as open addressing.
Extend:
The "d" in d-left hashing means "multiple". Let us simplify the problem and look at 2-left hashing. 2-left hashing divides a hash table into two halves of equal length, called T1 and T2, and assigns T1 and T2 each their own hash function, h1 and h2. When storing a new key, both hash functions are used to compute two addresses, h1[key] and h2[key]. We then check position h1[key] in T1 and position h2[key] in T2 to see which one already holds more (collided) keys, and store the new key in the less loaded position. If both sides are equally loaded, for example both positions are empty or both already hold a key, the new key is stored in the left subtable T1, which is where the name "2-left" comes from. When looking up a key, you must hash twice and probe both positions.
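A minimal sketch of 2-left hashing, assuming each table slot is a small bucket list; the bucket representation and the MD5-based hash functions are illustrative choices, not part of the original description:
import hashlib
class TwoLeftHash:
    def __init__(self, half_size):
        self.t1 = [[] for _ in range(half_size)]   # left subtable T1
        self.t2 = [[] for _ in range(half_size)]   # right subtable T2
        self.n = half_size
    def _h1(self, key):
        return int(hashlib.md5(b"1" + key.encode()).hexdigest(), 16) % self.n
    def _h2(self, key):
        return int(hashlib.md5(b"2" + key.encode()).hexdigest(), 16) % self.n
    def insert(self, key):
        b1, b2 = self.t1[self._h1(key)], self.t2[self._h2(key)]
        # store the key in the less loaded bucket; ties go to the left subtable T1
        if len(b1) <= len(b2):
            b1.append(key)
        else:
            b2.append(key)
    def contains(self, key):
        # lookups must probe both candidate positions
        return key in self.t1[self._h1(key)] or key in self.t2[self._h2(key)]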
Example of the problem:
1) Massive log data: extract the IP that visited Baidu the most times on a given day.
The number of distinct IPs is limited, at most 2^32, so you can consider using a hash table to store the IPs directly in memory and then count them.
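A minimal sketch of this hash-counting idea, assuming a log file named access.log with one IP per line (the file name and format are assumptions made for the example):
from collections import Counter
counts = Counter()
with open("access.log") as f:          # assumed format: one IP per line
    for line in f:
        counts[line.strip()] += 1
ip, hits = counts.most_common(1)[0]    # the IP with the most visits
print(ip, hits)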
3. Bit-map
Scope of application: fast lookup, de-duplication, and deletion of data; generally, the data range is within about 10 times the range of an int.
Basic principles and main points: use a bit array to indicate whether certain elements exist, for example 8-digit phone numbers.
Extension: a Bloom filter can be seen as an extension of the bit-map.
Example of the problem:
1) A file is known to contain some phone numbers, each an 8-digit number; count the number of distinct numbers.
8 digits go up to 99,999,999, requiring about 99M bits, or a little over 10M bytes of memory.
2) Find the number of non-repeating integers among 250 million integers, given that there is not enough memory to hold all 250 million integers.
Extend the bit-map by using 2 bits per number: 00 means the number has not appeared, 01 means it has appeared once, and 10 means it has appeared twice or more. Alternatively, instead of 2 bits per number, we can simulate this 2-bit map with two separate bit-maps.
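A sketch of the 2-bit map idea; for the full 2^32 range of 32-bit unsigned integers the array would occupy 1 GB, so the example below uses a small value_range only to make it runnable:
def count_unique_once(numbers, value_range=2 ** 32):
    # 2 bits per value: 00 = unseen, 01 = seen once, 10 = seen twice or more
    bitmap = bytearray((value_range + 3) // 4)
    for x in numbers:
        byte, shift = x // 4, (x % 4) * 2
        state = (bitmap[byte] >> shift) & 0b11
        if state == 0b00:
            bitmap[byte] |= 0b01 << shift     # first occurrence: 00 -> 01
        elif state == 0b01:
            bitmap[byte] ^= 0b11 << shift     # second occurrence: 01 -> 10
    # values whose state is exactly 01 appeared exactly once
    return sum(1 for b in bitmap for s in (0, 2, 4, 6) if (b >> s) & 0b11 == 0b01)
print(count_unique_once([3, 5, 3, 7, 9, 9, 9], value_range=16))   # 2 (only 5 and 7)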
4. Heap
Scope of application: finding the top n of massive data, where n is relatively small, so that the heap fits in memory.
Basic principles and main points: use a max-heap for the smallest n and a min-heap for the largest n. For example, to find the smallest n, compare the current element with the largest element in the max-heap; if it is smaller, replace that largest element. The n elements finally left in the heap are the smallest n. This suits large data volumes with a relatively small n, because all of the top n elements can be obtained in a single scan, which is very efficient.
Extension: a double heap, combining a max-heap and a min-heap, can be used to maintain the median.
Example of the problem:
1) Find the largest 100 numbers among 1,000,000 numbers.
Use a min-heap of 100 elements.
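A sketch of the min-heap approach using Python's heapq; numbers.txt (one integer per line) is an assumed input for the example:
import heapq
def top_100(stream):
    heap = []                      # min-heap holding the 100 largest values seen so far
    for x in stream:
        if len(heap) < 100:
            heapq.heappush(heap, x)
        elif x > heap[0]:          # larger than the smallest of the current top 100
            heapq.heapreplace(heap, x)
    return sorted(heap, reverse=True)
with open("numbers.txt") as f:     # assumed input: one integer per line
    print(top_100(int(line) for line in f))
The built-in heapq.nlargest(100, stream) does the same job in one call.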
5. Double-layer bucket partitioning
Scope of application: finding the k-th largest element, the median, or non-repeating / repeating numbers.
Basic principles and main points: because the range of the elements is too large for a direct addressing table, the range is narrowed step by step through multiple partitions until it finally falls within an acceptable range. It can be scaled down multiple times; the double layer is just one example.
Expand:
Example of the problem:
1) Find the number of non-repeating integers among 250 million integers, given that there is not enough memory to hold all of them.
This is somewhat like the pigeonhole principle. There are 2^32 possible integers, so we can divide the range of 2^32 into 2^8 regions (for example, using one file per region), scatter the data into the different regions, and then solve each region directly with a bitmap. In other words, as long as there is enough disk space, the problem is easily solved.
2) Find the median of 500 million ints.
This example is more obvious than the one above. First divide the int range into 2^16 regions, then read the data and count how many numbers fall into each region. From these counts we can determine which region the median falls into, and what rank within that region the median has. In a second scan, we count only the numbers that fall into that region.
In fact, if the numbers are int64 rather than int, we can reduce the range to an acceptable size after performing such a division three times. That is, first divide the int64 range into 2^24 regions and determine which region contains the target value; then divide that region into 2^20 sub-regions and determine which sub-region contains the target; the number of values in that sub-region is then only about 2^20, so a direct addressing table can be used for the count.
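A sketch of the two-pass bucket idea for 32-bit unsigned ints; read_ints is an assumed callable that re-reads the whole data set from the beginning each time it is called, and total is the number of values:
def bucket_median(read_ints, total):
    # pass 1: count how many values fall into each of the 2^16 regions (by high 16 bits)
    counts = [0] * (1 << 16)
    for x in read_ints():
        counts[x >> 16] += 1
    target = (total - 1) // 2                  # 0-based rank of the (lower) median
    seen = 0
    for region, c in enumerate(counts):        # find the region containing the median
        if seen + c > target:
            break
        seen += c
    rank_in_region = target - seen
    # pass 2: count only values inside that region, by their low 16 bits
    low_counts = [0] * (1 << 16)
    for x in read_ints():
        if x >> 16 == region:
            low_counts[x & 0xFFFF] += 1
    for low, c in enumerate(low_counts):       # walk up to the median's rank
        if rank_in_region < c:
            return (region << 16) | low
        rank_in_region -= c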
6. Database index
Scope of application: insertion, deletion, modification, and query of large amounts of data.
Basic principles and key points: use database design and implementation techniques (indexes) to handle insertion, deletion, modification, and query of massive data.
Extend:
Example of the problem:
7. Inverted index
Scope of application: search engine, keyword query
Basic principles and main points: why is it called an inverted index? It is an indexing method used in full-text search to store, for each word, a mapping to its locations in a document or a set of documents.
In English, for example, here is the text to be indexed:
T0 = "it is what it is"
T1 = "what is it"
T2 = "it is a banana"
We get the following inverted file index:
"a": {2}
"banana": {2}
"is": {0,1,2}
"it": {0,1,2}
"what": {0,1}
A query for the terms "what", "is" and "it" corresponds to the intersection of these sets: {0, 1} ∩ {0, 1, 2} ∩ {0, 1, 2} = {0, 1}.
A forward index, by contrast, stores a list of words for each document. Forward-index queries suit ordered, frequent access to whole documents and verification of the words within a document. In a forward index the document is central: each document points to the sequence of index terms it contains, that is, the document points to the words it contains. An inverted index instead has each word point to the documents that contain it, and this reversed relationship is where the name comes from.
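A minimal sketch of building and querying the inverted index above; T0, T1, T2 are the example documents, and simple whitespace tokenization is assumed:
from collections import defaultdict
docs = {0: "it is what it is", 1: "what is it", 2: "it is a banana"}
index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.split():
        index[word].add(doc_id)            # word -> set of documents containing it
# conjunctive query: intersect the posting sets of the query terms
query = ["what", "is", "it"]
result = set.intersection(*(index[w] for w in query))
print(sorted(result))                      # [0, 1]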
Extend:
Problem example: a document retrieval system that answers which files contain a given word, such as keyword search over academic papers.
8. External sorting
Scope of application: sorting and de-duplicating big data.
Basic principles and key points: the merge method of external sorting, replacement selection with a loser tree, and the optimal merge tree.
Extend:
Example of the problem:
1) There is a 1G file in which each line is a word; each word is at most 16 bytes, and the memory limit is 1M. Return the 100 most frequent words.
This data has an obvious characteristic: words are at most 16 bytes, but 1M of memory is not enough for a hash table, so external sorting can be used instead. Memory can serve as the input buffer for the sort.
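A sketch of the external sort, assuming an input file words.txt with one newline-terminated word per line: it sorts chunks that fit in memory, writes each chunk as a temporary run file, merges the runs with heapq.merge, and counts frequencies on the sorted stream:
import heapq, itertools, tempfile
def sorted_runs(path, chunk_lines=50000):
    # sort chunks that fit in memory and write each as a temporary run file
    runs = []
    with open(path) as f:
        while True:
            chunk = sorted(itertools.islice(f, chunk_lines))
            if not chunk:
                break
            run = tempfile.TemporaryFile(mode="w+")
            run.writelines(chunk)          # assumes every line ends with a newline
            run.seek(0)
            runs.append(run)
    return runs
def top_words(path, n=100):
    merged = heapq.merge(*sorted_runs(path))               # k-way merge of the sorted runs
    counts = ((sum(1 for _ in group), word.strip())
              for word, group in itertools.groupby(merged))
    return heapq.nlargest(n, counts)                       # (count, word) pairs
print(top_words("words.txt"))
Only one line per run is held in memory during the merge, which is what makes the memory limit workable.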
9. Trie tree
Applicable range: the amount of data is large and there are many repetitions, but the number of distinct items is small enough to fit into memory.
Basic principles and main points: the implementation approach and the representation of a node's children.
Extension: compressed implementation.
Example of the problem:
1) There are 10 files of 1G each; every line of each file stores a user query, and the queries of each file may be duplicated. Sort the queries by frequency.
2) There are 10 million strings, some of which are duplicates. Remove all duplicates and keep only the distinct strings. How would you design and implement this?
3) Finding hot queries: the query strings have a relatively high repetition rate; although the total is 10 million, there are no more than 3 million after removing duplicates, and each is no more than 255 bytes.
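A minimal trie sketch with per-word counts, which fits these deduplication and frequency problems when the number of distinct strings fits in memory; the dict-of-dicts node representation is one possible choice:
class Trie:
    _END = object()                    # marker key holding a word's count
    def __init__(self):
        self.root = {}                 # each node is a dict: char -> child node
    def add(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[Trie._END] = node.get(Trie._END, 0) + 1
    def items(self):
        # yield (word, count) pairs; each distinct word appears exactly once
        stack = [("", self.root)]
        while stack:
            prefix, node = stack.pop()
            for key, child in node.items():
                if key is Trie._END:
                    yield prefix, child
                else:
                    stack.append((prefix + key, child))
trie = Trie()
for q in ["abc", "abd", "abc"]:
    trie.add(q)
print(sorted(trie.items()))            # [('abc', 2), ('abd', 1)]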
10. Distributed processing (MapReduce)
Applicable range: the amount of data is large, but the number of distinct items is small enough to fit into memory.
Basic principles and main points: hand the data to different machines for processing, partition the data, and reduce the results.
Expand:
Example of the problem:
1) The canonical example application of MapReduce is a process to count the appearances of each different word in a set of documents:
void map(String name, String document):
  // name: document name
  // document: document contents
  for each word w in document:
    EmitIntermediate(w, 1)
void reduce(String word, Iterator partialCounts):
  // key: a word
  // values: a list of aggregated partial counts
  int result = 0
  for each v in partialCounts:
    result += ParseInt(v)
  Emit(result)
Here, each document is split into words, and each word is counted initially with a "1" value by the map function, using the word as the result key. The framework puts together all the pairs with the same key and feeds them to the same call to reduce, so this function just needs to sum all of its input values to find the total appearances of that word.
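A minimal local simulation of this word-count flow in Python, grouping the intermediate pairs by key to mimic the framework's shuffle step; docs is illustrative input only:
from collections import defaultdict
def map_fn(name, document):
    # emit (word, 1) for every word in the document
    return [(w, 1) for w in document.split()]
def reduce_fn(word, partial_counts):
    # sum all partial counts for one word
    return word, sum(partial_counts)
def map_reduce(docs):
    intermediate = defaultdict(list)
    for name, document in docs.items():
        for key, value in map_fn(name, document):     # map phase
            intermediate[key].append(value)           # shuffle: group by key
    return dict(reduce_fn(k, vs) for k, vs in intermediate.items())   # reduce phase
docs = {"T0": "it is what it is", "T1": "what is it", "T2": "it is a banana"}
print(map_reduce(docs))   # {'it': 4, 'is': 4, 'what': 2, 'a': 1, 'banana': 1}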
2) Massive data is distributed across 100 computers; find a way to efficiently compute the TOP 10 of this batch of data.
3) There are N machines in total, with N numbers on each machine. Each machine can store at most O(N) numbers and operate on them. How do you find the median of the N^2 numbers?
Classical problem analysis
Tens of billions of data items (with duplicates): find the top N items that appear most frequently. There are two cases: the data can be read into memory at once, or it cannot.
Available ideas: trie tree + heap, database index, partitioned subset statistics, hashing, distributed computing, approximate statistics, external sorting.
"Can be read into memory at once" should really refer to the amount of data after duplicates are removed. If the deduplicated data fits in memory, we can build a dictionary over the data, for example with a map, hashmap, or trie, and then count directly. While updating each item's occurrence count we can also use a heap to maintain the N items with the highest counts, although this increases the maintenance cost and is less efficient than computing the complete counts first and then finding the top N.
If the data cannot fit in memory, on the one hand we can consider whether the dictionary method above can be adapted to this situation; the change is to store the dictionary on disk instead of in memory, following the storage methods used by databases.
Of course, there is a better way: distributed computing, which is basically the map-reduce process. First, partition the data into ranges, by the data values themselves or by a hash of the values (e.g. MD5), and assign the ranges to different machines; ideally each partition can be processed in memory at once, so that each machine is responsible for one range of values. This is essentially the map step. After obtaining the results, each machine only needs to extract its top N items by count, and then the top N of all the data is selected from these per-machine results, which is essentially the reduce step.
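A single-process sketch of this hash-partition-then-merge idea; the partition function and the per-bucket counting stand in for what separate machines would do, and the number of partitions and top-N size are arbitrary choices for the example:
import hashlib
from collections import Counter
from heapq import nlargest
def partition(items, parts=16):
    # map each item to one partition by its hash, so equal items always land together
    buckets = [[] for _ in range(parts)]
    for item in items:
        idx = int(hashlib.md5(item.encode()).hexdigest(), 16) % parts
        buckets[idx].append(item)
    return buckets
def top_n(items, n=10, parts=16):
    candidates = []
    for bucket in partition(items, parts):                  # each bucket plays one machine
        candidates.extend(Counter(bucket).most_common(n))   # local top N ("map" side)
    return nlargest(n, candidates, key=lambda kv: kv[1])    # global top N ("reduce" side)
data = ["a", "b", "a", "c", "a", "b", "d"] * 3
print(top_n(data, n=2))    # [('a', 9), ('b', 6)]
Because all copies of an item hash to the same bucket, each local count is exact, which is exactly why merging the local top-N lists is safe here and not in the random-scattering scheme discussed next.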
You might want to distribute the data to different machines at random, but then you cannot obtain the correct result, because one item may be spread evenly over different machines while another may be concentrated entirely on one machine, and there may also be items with equal counts. For example, suppose we want the top 100 by count and we scatter 10 million data items over 10 machines, finding the top 100 on each machine; merging these does not guarantee we find the true 100th item. The item that is actually 100th overall might occur 10,000 times but be split across the 10 machines, leaving only 1,000 occurrences on each, while items that each machine ranks ahead of it happen to be concentrated on a single machine, say with 1,001 occurrences each; the item with 10,000 occurrences would then be eliminated. Even if we have each machine select its 1,000 most frequent items and then merge, it can still go wrong, because there may be a large number of items with 1,001 occurrences clustered together. Therefore the data must not be scattered over the machines at random; instead it should be mapped to machines according to the hashed values, so that each machine processes one range of values.
The external sorting method consumes a lot of IO and will not be very efficient. The distributed method above can also be used in a single-machine version: divide the full data set into several sub-files by value range, process them one by one, and finally merge the words and their frequencies; an external merge process can in fact be used for this.
In addition, we can also consider approximate computation: by exploiting the properties of natural language, we keep only the words that genuinely appear most frequently as the dictionary, so that this scale fits into memory.
Thank you for reading. The above covers the methods for processing large amounts of data and massive data. After studying this article, you should have a deeper understanding of these processing methods; their concrete use still needs to be verified in practice.