

What are the interview questions on massive data processing in big data?

2025-02-24 Update From: SLTechnology News&Howtos

Shulou (Shulou.com) 06/01 Report --

This article explains in detail the interview questions on massive data processing in big data. The editor finds it very practical and shares it here as a reference; I hope you gain something from reading it.

1. Massive log data: extract the IP that visited Baidu the most times in one day.

This problem was mentioned in a previous article on algorithms. The solution given there was: the number of possible IPs is still limited, at most 2^32, so you can consider using a hash to store the IPs directly in memory and then do the statistics. In more detail: take the IPs in that day's log of visits to Baidu and write them, one by one, into a large file. Note that an IP is 32 bits, so there are at most 2^32 distinct IPs. We can also use a mapping method, for example mod 1000, to split the large file into 1000 small files, then find the IP with the highest frequency in each small file (a hash_map can be used for the frequency statistics) together with that frequency. Finally, among these 1000 candidate IPs, the one with the highest frequency is the answer.

2. A search engine records, in a log file, every retrieval string the user submits; each query string is 1-255 bytes long. Suppose there are currently 10 million records (these query strings repeat heavily: although the total is 10 million, there are no more than 3 million after removing duplicates). The more a query string repeats, the more users issued it, i.e., the more popular it is. Count the 10 most popular query strings, using no more than 1 GB of memory.

This is the classic Top K problem. The algorithm: first, preprocess the massive data and complete the frequency statistics with a hash table in O(N) time; second, find the Top K with the help of a heap data structure, in N' * log K time. That is, with a heap we can look up and adjust/move an element in logarithmic time. So maintain a min-heap of size K (10 in this problem) and iterate over the 3 million distinct queries, comparing each with the root element; the final time complexity is O(N) + N' * O(log K), where N = 10 million and N' = 3 million. For more details, please refer to the original article. Alternatively: use a trie, with each key node storing the number of times that query string appears (0 if it never appears), and finally sort the frequencies with a min-heap of 10 elements.
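As an illustration of the counting-plus-heap idea in question 2, here is a minimal Python sketch. The function name top_k_queries and the toy data are made up for the example, and it assumes the deduplicated counts fit in memory, as the question allows:

import heapq
from collections import Counter

def top_k_queries(queries, k=10):
    # Step 1: frequency statistics with a hash table, O(N) over all records.
    counts = Counter(queries)
    # Step 2: keep a min-heap of size k over the N' distinct queries, O(N' log k).
    heap = []
    for query, cnt in counts.items():
        if len(heap) < k:
            heapq.heappush(heap, (cnt, query))
        elif cnt > heap[0][0]:          # beats the smallest count in the current top k
            heapq.heapreplace(heap, (cnt, query))
    return sorted(heap, reverse=True)   # most frequent first

# Toy usage; a real log would be streamed line by line instead of held in a list.
print(top_k_queries(["mp3", "news", "mp3", "map", "mp3", "news"], k=2))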
3. There is a file of size 1 GB in which each line is a word; no word is larger than 16 bytes, and the memory limit is 1 MB. Return the 100 most frequent words.

Solution: read the file sequentially; for each word x compute hash(x) % 5000 and, according to that value, save it into one of 5000 small files (denoted x0, x1, ..., x4999). Each file is then about 200 KB. If some of the files are larger than 1 MB, continue to split them in a similar way until no small file exceeds 1 MB. For each small file, count the words that appear in it and their frequencies (a trie / hash_map can be used), take out the 100 words with the largest frequencies (a min-heap of 100 nodes works), and save these 100 words and their frequencies into a file; this yields another 5000 files. The final step is to merge these 5000 files (similar to merge sort).

4. There are 10 files of 1 GB each; every line of every file stores a user query, and the queries may be duplicated across files. You are required to sort the queries by frequency.

This is also a typical Top K problem, and the solutions are as follows. Option 1: read the 10 files sequentially and, according to hash(query) % 10, write each query to one of 10 new files. Each newly generated file is about 1 GB (assuming the hash function is random). Find a machine with about 2 GB of memory and use a hash_map(query, query_count) on each file in turn to count the number of times each query appears; sort by the number of occurrences using quick/heap/merge sort, and output the sorted queries and their query_count to a file. This yields 10 sorted files; finally merge-sort these 10 files (a combination of internal and external sorting). Option 2: generally the total number of distinct queries is limited but the repetition is relatively heavy, so it may be possible to fit all of them into memory at once. In that case we can count the number of occurrences of each query directly with a trie / hash_map, etc., and then do a quick/heap/merge sort by the number of occurrences. Option 3: similar to option 1, but after hashing into multiple files, those files can be handled by a distributed architecture (such as MapReduce) and the results merged at the end.

5. Given two files a and b, each storing 5 billion URLs where each URL occupies 64 bytes, and a memory limit of 4 GB, find the URLs common to files a and b.

Option 1: it can be estimated that each file is about 5 billion × 64 bytes ≈ 320 GB, far larger than the 4 GB memory limit, so it is not possible to load it fully into memory; consider a divide-and-conquer approach. Traverse file a, compute hash(url) % 1000 for each URL, and store the URL into one of 1000 small files (denoted a0, a1, ..., a999) according to the value obtained; each small file is then about 300 MB. Traverse file b and store its URLs into 1000 small files in the same way (denoted b0, b1, ..., b999). After this processing, any identical URLs can only be in the corresponding pair of small files (a0 vs b0, a1 vs b1, ..., a999 vs b999); non-corresponding small files cannot contain the same URL. So we only need to find the common URLs within each of the 1000 pairs of small files. To find the common URLs in a pair, load the URLs of one of the two small files into a hash_set, then iterate over each URL of the other small file and check whether it is in the hash_set just built; if so, it is a common URL and can be saved to an output file. Option 2: if a certain error rate is allowed, a Bloom filter can be used; 4 GB of memory represents about 34 billion bits. Map the URLs of one of the files onto these 34 billion bits with the Bloom filter, then read the URLs of the other file one by one and check whether each is in the Bloom filter; if so, the URL should be a common one (note that there will be a certain false-positive rate).
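A rough Python sketch of option 1 for question 5. The file-name prefixes a_part and b_part are invented for the example; note that Python's built-in hash for strings is only stable within a single process, and opening 1000 files at once may require raising the OS file-descriptor limit:

def partition(path, prefix, buckets=1000):
    # Scatter each url of a huge file into `buckets` small files by hash(url) % buckets.
    outs = [open(f"{prefix}{i}", "w") for i in range(buckets)]
    try:
        with open(path) as f:
            for url in f:
                outs[hash(url) % buckets].write(url)
    finally:
        for out in outs:
            out.close()

def common_urls(file_a, file_b, buckets=1000):
    partition(file_a, "a_part", buckets)
    partition(file_b, "b_part", buckets)
    common = []
    # Identical urls can only land in the pair with the same index (a_i vs b_i).
    for i in range(buckets):
        with open(f"a_part{i}") as fa:
            seen = set(fa)                       # hash_set of one small file's urls
        with open(f"b_part{i}") as fb:
            common.extend(url for url in fb if url in seen)
    return common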
6. Find the non-repeating integers among 250 million integers. Note that memory is not large enough to hold all 250 million integers.

Option 1: use a 2-bitmap, allocating 2 bits per number: 00 means absent, 01 means it appears once, 10 means it appears more than once, and 11 is unused. This requires about 2^32 × 2 bits = 1 GB of memory, which is acceptable. Scan the 250 million integers and update the corresponding bits in the bitmap: 00 becomes 01, 01 becomes 10, and 10 stays the same. After the scan, go through the bitmap and output the integers whose corresponding bits are 01. Option 2: a method similar to question 1 can also be used, dividing the data into small files; then find the non-repeating integers in each small file and sort them, and finally merge, taking care to remove duplicate elements.

7. Tencent interview question: given 4 billion distinct, unsorted unsigned int integers, and then another number, how do you quickly determine whether that number is among the 4 billion?

Similar to question 6 above, the first reaction is quicksort + binary search. Here are other, better ways. Option 1: allocate 512 MB of memory and let one bit represent each possible unsigned int value. Read the 4 billion numbers and set the corresponding bits; then read the number to be queried and check whether its bit is 1: 1 means it exists, 0 means it does not. Option 2: this problem is well described in Programming Pearls; the idea, roughly, is as follows. Since 2^32 is more than 4 billion, the given number may or may not be present. Represent each of the 4 billion numbers as a 32-bit binary value, and assume the 4 billion numbers start out in a file. Split them into two classes, those whose highest bit is 0 and those whose highest bit is 1, and write the two classes to two files; the number of integers in one of the files is at most 2 billion (which halves the problem). Compare the highest bit of the number you are looking for, then go into the corresponding file and search again: split that file into two classes by the second-highest bit and write them to two files, so that one file contains at most 1 billion numbers (halving again); compare the second-highest bit of the number you are looking for and continue into the corresponding file. And so on until the number is found or ruled out; the time complexity is O(log n). Attachment: a brief introduction to the bitmap method. Using a bitmap to judge whether an integer array contains duplicates is a common programming task. When the amount of data in the set is relatively large, we usually want to scan it as few times as possible, so a double loop is not desirable; the bitmap method suits this situation. The approach is to create a new array of length max+1 based on the largest element max in the set, then scan the original array once more: whenever you encounter some value, set that position of the new array to 1. For example, on encountering 5, set the sixth element of the new array to 1; the next time you encounter 5, you will find that this element is already 1, which shows the value must duplicate earlier data. This method of initializing the new array to zero and then marking positions with one resembles the way a bitmap is handled, hence the name bitmap method. Its worst case is 2N operations. If the maximum value of the array is known and the length of the new array can be set in advance, the efficiency can be doubled.
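A small Python sketch of option 1 for question 7: a 512 MB bit array with one bit per possible unsigned int. The class name BitSet and the stream_of_numbers generator are hypothetical placeholders:

class BitSet:
    # One bit for every 32-bit unsigned int: 2^32 bits = 2^29 bytes = 512 MB.
    def __init__(self):
        self.bits = bytearray(1 << 29)

    def add(self, n):
        self.bits[n >> 3] |= 1 << (n & 7)        # set bit n

    def __contains__(self, n):
        return bool(self.bits[n >> 3] & (1 << (n & 7)))

# Usage sketch: load the 4 billion numbers once, then every lookup is O(1).
# bs = BitSet()
# for n in stream_of_numbers():    # hypothetical generator over the input file
#     bs.add(n)
# print(1234567890 in bs)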
8. How do you find the single most repeated item among a huge amount of data?

Option 1: first hash each item and, taking the hash modulo some number, map the data into small files; find the most repeated item in each small file and record its count. Then, among the candidates from the previous step, the one with the largest count is the answer (refer to the previous question).

9. Among tens of millions or hundreds of millions of data items (with duplicates), count the N items that appear most often.

Option 1: tens of millions or hundreds of millions of items should fit in the memory of a current machine, so consider using a hash_map / binary search tree / red-black tree to count the number of occurrences. Then take out the N items with the most occurrences, which can be done with the heap mechanism mentioned in question 2.

10. A text file of about 10,000 lines, one word per line: count the 10 most frequently occurring words, give your ideas, and give a time complexity analysis.
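The article leaves question 10 as an exercise. One common answer, sketched here in Python under the assumption that 10,000 words easily fit in memory, is to count with a hash table in O(n) over the n lines and then select the top 10 with a size-10 min-heap in O(m log 10) over the m distinct words, i.e., effectively O(n) overall for fixed k (the function name top10_words is made up for the example):

import heapq
from collections import Counter

def top10_words(path, k=10):
    # O(n) counting over the n lines of the file (one word per line).
    with open(path) as f:
        counts = Counter(line.strip() for line in f)
    # heapq.nlargest keeps a size-k min-heap internally: O(m log k) selection.
    return heapq.nlargest(k, counts.items(), key=lambda item: item[1])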

This is the end of the article "What are the interview questions on massive data processing in big data?". I hope the content above has been of some help and that you have learned something from it; if you think the article is good, please share it so more people can see it.

