Network Security Internet Technology Development Database Servers Mobile Phone Android Software Apple Software Computer Software News IT Information

In addition to Weibo, there is also WeChat

Please pay attention

WeChat public account

Shulou

What is the idea of using TOP K?

2025-01-17 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Internet Technology >

Share

Shulou(Shulou.com)06/02 Report--

What is the idea of using TOP K? I believe many inexperienced people don't know what to do about it. Therefore, this paper summarizes the causes and solutions of the problem. Through this article, I hope you can solve this problem.

TOP K

General ideas:

1. Divide the large file into several small files by using the method of Hash modeling.

2. Use HashMap or dictionary tree (trie tree) to count the word frequency of small files

3. Sort the small files according to the word frequency (heap sort, etc.) and take the first N of each small file

4. Merge and sort the results of the small files, and then take the first N of the merged files.

There are 100 million floating point numbers, if you find the largest 10000 in the period?

For the third part, first read in the first 10000 numbers to create the smallest heap with a size of 10000, and the time complexity of building the heap is O (mlogm) (m is the size of the array is 10000), then traverse the subsequent numbers and compare them with the (minimum) numbers at the top of the heap. If it is less than the minimum number, continue to read subsequent numbers; if it is larger than the top number of the heap, replace the top element of the heap and realign the heap to the smallest heap. The whole process until 100 million has been traversed. Then output all 10000 digits in the current heap as traversed in the middle order.

There is a 1G file in which each line is a word, the size of the word is no more than 16 bytes, and the memory limit is 1m. Returns the 100 most frequently used words.

Read the file sequentially, take hash (x)% 5000 for each word x, and then store it in 5000 small files according to this value (marked as x0memx1mem.x4999). So each file is about 200k.

If some of the files are more than 1m in size, you can continue to divide them down in a similar way until the size of the decomposed small files is no more than 1m. For each small file, count the words that appear in each file and the corresponding frequency (you can use trie tree / hash_map, etc.), and take out the 100th words with the largest frequency (you can use the minimum heap with 100nodes), and save the 100words and the corresponding frequency into the file, so you get another 5000 files. The next step is the process of merging (similar to merging and sorting) the 5000 files.

Give 4 billion unrepeated unsigned int integers, unsorted integers, and then give a number, how to quickly determine whether this number is among the 4 billion numbers?

Apply for 512MB memory, and a bit bit represents a unsigned int value. Read the number of 4 billion, set the corresponding bit bit, read the number to be queried, and check whether the corresponding bit is 1. 1 means it exists, and 0 means it does not exist.

It is known that a file contains some phone numbers, each with 8 digits, and count the number of different numbers.

The maximum decimal value that an 8-bit integer can represent is 99999999. If each number corresponds to a bit bit in the bitmap, it takes about 99MB to store 8-bit integers. Because of 1B=8bit, 99Mbit is converted into 99/8=12.375MB memory, that is, only 12.375MB memory can be used to represent the contents of all 8-digit phone numbers.

Given two files an and b, each store 5 billion url, each url occupies 64 bytes, the memory limit is 4G, let you find out the common url of an and b files?

If you use a Bloom filter, then the problem is very easy. 4G of memory is enough to hold more than 30 billion bit, so it is enough to deal with it. First, put all the url in the a file into the Bloom filter, then traverse the b file, and ask the Bloom filter for each url to see if it already exists. If so, this URL input result file.

After reading the above, have you mastered the idea of using TOP K? If you want to learn more skills or want to know more about it, you are welcome to follow the industry information channel, thank you for reading!

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

Views: 0

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.

Share To

Internet Technology

Wechat

© 2024 shulou.com SLNews company. All rights reserved.

12
Report