How to find the common URLs in 10 billion URLs


2026-02-11 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article explains how to find the URLs that two very large files have in common, 10 billion URLs in total, when the data is far too large to fit in memory. The method is simple and relies on a divide-and-conquer strategy, which is walked through step by step below.

Problem description

Given two files a and b, each storing 5 billion URLs, where each URL occupies 64 B and the memory limit is 4 GB, find the URLs common to files a and b.

Solution ideas

Each URL occupies 64 B, so the space occupied by 5 billion URLs is about 320 GB:

5,000,000,000 * 64 B = 320,000,000,000 B ≈ 320 GB
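The estimate can be checked with a few lines of Python. This is only a sketch of the arithmetic; the constant names are illustrative, and decimal units (1 GB = 10^9 B) are assumed for simplicity.

```python
# Figures taken from the problem statement; decimal units are an assumption.
URL_COUNT = 5_000_000_000        # URLs in one file
URL_SIZE = 64                    # bytes per URL
MEMORY_LIMIT = 4 * 10**9         # 4 GB of memory

total_bytes = URL_COUNT * URL_SIZE
ratio = total_bytes / MEMORY_LIMIT   # how many times larger than memory

print(f"one file: {total_bytes / 10**9:.0f} GB")
print(f"times over the memory limit: {ratio:.0f}")
```

A single file is therefore about 80 times larger than the available memory, which is why it has to be split before any in-memory processing.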

Since only 4 GB of memory is available, it is not possible to load all the URLs into memory at once. Problems of this type are generally solved with a divide-and-conquer strategy: the URLs in a file are split into many small files according to some property of each URL, so that each small file is well under 4 GB and can be read into memory for processing.

The approach is as follows:

First, traverse file a, compute hash(URL) % 1000 for each URL, and write the URL to one of the small files a0, a1, a2, ..., a999 according to the result, so that each small file is about 320 MB (320 GB / 1000), assuming the hash spreads the URLs evenly. Then traverse file b in the same way, using the same hash function, and write its URLs to the files b0, b1, b2, ..., b999.
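The partitioning step might look like the following Python sketch. It assumes one URL per line in the input file, which the problem statement does not specify, and it uses `zlib.crc32` as the hash because Python's built-in `hash()` for strings can change between runs. The names `bucket_of` and `partition` are hypothetical. Keeping 1000 output files open at once may exceed the per-process file-descriptor limit on some systems, in which case that limit has to be raised or the writes batched.

```python
import os
import zlib


def bucket_of(url: str, num_buckets: int) -> int:
    """Map a URL to a bucket index using a hash that is stable across runs."""
    return zlib.crc32(url.encode("utf-8")) % num_buckets


def partition(src_path: str, out_dir: str, prefix: str,
              num_buckets: int = 1000) -> list[str]:
    """Stream src_path line by line and write each URL to its bucket file.

    Assumes one URL per line (an assumption of this sketch).
    Returns the bucket file paths in index order.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = [os.path.join(out_dir, f"{prefix}{i}") for i in range(num_buckets)]
    outs = [open(p, "w", encoding="utf-8") for p in paths]
    try:
        with open(src_path, encoding="utf-8") as src:
            for line in src:              # only one line is held in memory
                url = line.rstrip("\n")
                if url:
                    outs[bucket_of(url, num_buckets)].write(url + "\n")
    finally:
        for f in outs:
            f.close()
    return paths
```

Calling `partition("a", "parts", "a")` and `partition("b", "parts", "b")` would produce parts/a0 ... parts/a999 and parts/b0 ... parts/b999. Both calls must use the same hash function and the same number of buckets.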

After this step, any URL that appears in both files must fall into a pair of small files with the same index, because identical URLs have identical hash values: a0 corresponds to b0, ..., a999 corresponds to b999, and small files with different indices cannot share a URL. So all that remains is to find the common URLs within each of these 1000 pairs of small files.

Then, for each i ∈ [0, 999], traverse ai and store its URLs in a HashSet. Next, traverse each URL in bi and check whether it exists in the HashSet; if it does, it is a common URL and can be written to a separate output file.
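The per-pair lookup can be sketched in Python, where the built-in `set` plays the role of the HashSet. The function names `common_urls` and `intersect_all` are hypothetical, one URL per line is assumed, and reporting each common URL only once is a choice made for this sketch rather than something the problem statement requires.

```python
def common_urls(a_path: str, b_path: str) -> list[str]:
    """Return the URLs in b_path that also appear in a_path, each once."""
    # Load one small file entirely into a hash set.
    with open(a_path, encoding="utf-8") as fa:
        seen = {line.rstrip("\n") for line in fa if line.strip()}
    common = []
    # Stream the matching small file and probe the set.
    with open(b_path, encoding="utf-8") as fb:
        for line in fb:
            url = line.rstrip("\n")
            if url in seen:
                seen.remove(url)          # avoid reporting duplicates twice
                common.append(url)
    return common


def intersect_all(a_paths, b_paths, out_path: str) -> int:
    """Process the bucket pairs one at a time and write all common URLs."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for a_path, b_path in zip(a_paths, b_paths):
            for url in common_urls(a_path, b_path):
                out.write(url + "\n")
                count += 1
    return count
```

Only one small file is held in memory at a time. The in-memory set is larger than the file on disk because of per-object overhead, which is one reason to keep the small files well below the 4 GB limit.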

