
The principle of multithreading and distributed crawler architecture in Java

2025-02-27 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/03 Report--

This article introduces the principles of multithreaded and distributed crawler architecture in Java. Many people run into these problems in real projects, so let's walk through how to handle them. I hope you read carefully and take something away from it!

In the previous chapters, our crawlers were all single-threaded. A single-threaded crawler is fine while we are debugging, but when we use it to collect web pages in a production environment, single threading exposes two fatal problems:

Collection is very slow, because everything in a single thread runs serially: the next action has to wait for the previous one to finish.

The server's CPU utilization is low. Think about it: our servers have 8 cores and 16 GB or 32 GB of memory, yet run only one thread. Isn't that a waste?

The production environment is not like local testing, where we do not care about collection efficiency as long as the results are extracted correctly. In an age where time is money, nobody will let you collect pages at a leisurely pace, so a single-threaded crawler does not work. We need to switch from single-threaded to multithreaded mode to improve both collection efficiency and machine utilization.

Multithreaded crawler programming is much more complex than single-threaded programming, but unlike other business scenarios that must guarantee data safety under high concurrency, a multithreaded crawler is less demanding in that respect, because each page can be treated as an independent unit. To build a good multithreaded crawler we must get two things right: first, unified maintenance of the URLs to be collected; second, URL deduplication. Let's briefly go over these two points.

Maintaining the URLs to be collected

A multithreaded crawler cannot work like a single-threaded one, where each thread maintains its own list of URLs to collect. If it did, every thread would collect the same pages; that is not multithreaded collection, it is collecting the same page many times. For this reason we maintain the URLs to be collected in one place: each thread takes a URL from the unified store, completes its collection task, and adds any new URL links it finds on the page back into the unified container. Several containers are suitable for unified URL maintenance:

A thread-safe queue from the JDK, such as LinkedBlockingQueue

A high-performance NoSQL store, such as Redis or MongoDB (a small sketch follows this list)

MQ message middleware
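To make the NoSQL option concrete, here is a minimal sketch (not from the original article) of a shared URL queue backed by a Redis list, assuming the Jedis client library is available; the class name RedisUrlQueue and the key crawler:todo are made up for this illustration. Because every worker thread, or even every worker machine, talks to the same Redis list, this is also the natural starting point for a distributed crawler.

import redis.clients.jedis.Jedis;

// A hypothetical shared URL queue backed by a Redis list: all crawler threads
// (or processes on different machines) push newly found URLs to, and pop URLs
// to collect from, the same "crawler:todo" list, so the work is kept in one place.
public class RedisUrlQueue {
    private static final String TODO_KEY = "crawler:todo";
    private final Jedis jedis;

    public RedisUrlQueue(String host, int port) {
        this.jedis = new Jedis(host, port);
    }

    // add a URL that still needs to be collected
    public void push(String url) {
        jedis.lpush(TODO_KEY, url);
    }

    // take the next URL to collect, or null if the queue is empty
    public String poll() {
        return jedis.rpop(TODO_KEY);
    }
}

Note that a single Jedis connection is not thread-safe; a real multithreaded or multi-machine crawler would hand each worker a connection from a JedisPool, but the idea of one shared to-do list stays the same.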

URL deduplication

URL deduplication is also a key step in multithreaded collection, because if we skip it, we collect a large number of duplicate URLs, which does nothing for efficiency. Take a paginated news list: when we collect page 1 we get links to pages 2, 3, 4 and 5, and when we collect page 2 we again get links to pages 1, 3, 4 and 5. The queue of URLs to be collected fills up with list-page links, pages get collected repeatedly, and the crawler can even fall into an endless loop, so URLs must be deduplicated. There are many ways to do this; here are a few common ones:

Save the URLs to a database for deduplication, for example Redis or MongoDB

Put the URLs into a hash table for deduplication, for example a HashSet

Hash each URL with MD5 and store the digest in the hash table. Compared with the previous option, this saves space.

Use a Bloom filter for deduplication. This saves a great deal of space, but is not completely accurate (it can report false positives). A small sketch of the MD5 and Bloom-filter approaches follows this list.
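To make the last two options concrete, here is a minimal sketch (not from the original article) of a deduplicator that stores MD5 digests of URLs in a HashSet, plus a Bloom-filter variant. It assumes Guava is on the classpath for the Bloom filter; the class name UrlDeduplicator is made up for this illustration.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

// A hypothetical deduplicator: stores MD5 digests of URLs instead of the raw strings,
// which saves memory when URLs are long; the Bloom-filter variant saves even more
// space but may occasionally report a URL as seen when it is not.
public class UrlDeduplicator {
    private final Set<String> seenDigests = new HashSet<>();
    private final BloomFilter<String> bloom =
            BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 1_000_000, 0.01);

    // returns true if the URL has not been seen before (exact, MD5-based)
    public synchronized boolean addIfNew(String url) {
        return seenDigests.add(md5(url));
    }

    // returns true if the URL has not been seen before (approximate, Bloom-filter-based)
    public synchronized boolean addIfNewApprox(String url) {
        if (bloom.mightContain(url)) {
            return false; // possible false positive: the URL may be skipped by mistake
        }
        bloom.put(url);
        return true;
    }

    private static String md5(String input) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}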

Those are the two core points of a multithreaded crawler. Here is a simple architecture diagram of a multithreaded crawler, as shown in the following figure:

Above we covered the architecture design of a multithreaded crawler; next let's put it into practice with a Java multithreaded crawler, taking the collection of Hupu news as an example. The design needs a container for the URLs to be collected and a container for URL deduplication. Since this is only a demonstration, we use the JDK's built-in containers: a LinkedBlockingQueue as the container for URLs to be collected, and a HashSet as the deduplication container. Below is the core code of the Java multithreaded crawler; the detailed code is uploaded to GitHub at the end of the article:

import java.io.IOException;
import java.util.Date;
import java.util.HashSet;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// CountableThreadPool is the counting thread pool from the WebMagic project
import us.codecraft.webmagic.thread.CountableThreadPool;

/**
 * Multithreaded crawler
 */
public class ThreadCrawler implements Runnable {
    // number of pages collected
    private final AtomicLong pageCount = new AtomicLong(0);
    // list page link pattern
    public static final String URL_LIST = "https://voice.hupu.com/nba";
    protected Logger logger = LoggerFactory.getLogger(getClass());
    // queue of URLs to be collected
    LinkedBlockingQueue<String> taskQueue;
    // set of collected links
    HashSet<String> visited;
    // thread pool
    CountableThreadPool threadPool;

    /**
     * @param url       start page
     * @param threadNum number of threads
     * @throws InterruptedException
     */
    public ThreadCrawler(String url, int threadNum) throws InterruptedException {
        this.taskQueue = new LinkedBlockingQueue<>();
        this.threadPool = new CountableThreadPool(threadNum);
        this.visited = new HashSet<>();
        // add the start page to the queue of URLs to be collected
        this.taskQueue.put(url);
    }

    @Override
    public void run() {
        logger.info("Spider started!");
        while (!Thread.currentThread().isInterrupted()) {
            // take a URL to be collected from the queue
            final String request = taskQueue.poll();
            // if the request is empty and no worker thread is still running, stop
            if (request == null) {
                if (threadPool.getThreadAlive() == 0) {
                    break;
                }
            } else {
                // execute the collection task
                threadPool.execute(new Runnable() {
                    @Override
                    public void run() {
                        try {
                            processRequest(request);
                        } catch (Exception e) {
                            logger.error("process request " + request + " error", e);
                        } finally {
                            // collected pages + 1
                            pageCount.incrementAndGet();
                        }
                    }
                });
            }
        }
        threadPool.shutdown();
        logger.info("Spider closed! {} pages downloaded.", pageCount.get());
    }

    /**
     * process a collection request
     *
     * @param url
     */
    protected void processRequest(String url) {
        // determine whether it is a list page
        if (url.matches(URL_LIST)) {
            // list page: parse detail page links and add them to the queue of URLs to be collected
            processTaskQueue(url);
        } else {
            // detail page: parse the page
            processPage(url);
        }
    }

    /**
     * process a list page and add its URLs to the queue
     *
     * @param url
     */
    protected void processTaskQueue(String url) {
        try {
            Document doc = Jsoup.connect(url).get();
            // detail page links
            Elements elements = doc.select("div.news-list > ul > li > div.list-hd > h5 > a");
            elements.stream().forEach(element -> {
                String request = element.attr("href");
                // if the link is neither in the queue nor in the collected set, add it to the queue
                if (!visited.contains(request) && !taskQueue.contains(request)) {
                    try {
                        taskQueue.put(request);
                    } catch (InterruptedException e) {
                        e.printStackTrace();
                    }
                }
            });
            // list page links
            Elements listUrls = doc.select("div.voice-paging > a");
            listUrls.stream().forEach(element -> {
                String request = element.absUrl("href");
                // check whether the extracted list link matches the list page pattern
                if (request.matches(URL_LIST)) {
                    // if the link is neither in the queue nor in the collected set, add it to the queue
                    if (!visited.contains(request) && !taskQueue.contains(request)) {
                        try {
                            taskQueue.put(request);
                        } catch (InterruptedException e) {
                            e.printStackTrace();
                        }
                    }
                }
            });
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    /**
     * parse a detail page
     *
     * @param url
     */
    protected void processPage(String url) {
        try {
            Document doc = Jsoup.connect(url).get();
            String title = doc.select("body > div.hp-wrap > div.voice-main > div.artical-title > h2").first().ownText();
            System.out.println(Thread.currentThread().getName() + " at " + new Date() + " collected Hupu news: " + title);
            // save the collected url into the collected set
            visited.add(url);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        try {
            new ThreadCrawler("https://voice.hupu.com/nba", 5).run();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
}

Let's use 5 threads to collect the Hupu news list pages and see the effect. Run the program and we get the following results:

Multithreaded run result

As the results show, with 5 threads we collected 61 pages in a total of 2 seconds, which is a pretty good result. Let's compare it with a single thread to see how big the gap is: set the number of threads to 1, start the program again, and we get the following result:

Single-thread run result

A single thread took 7 seconds to collect the same 61 news pages, almost 4 times as long as the multithreaded run. And this is only 61 pages; the more pages there are, the wider the gap becomes, so the efficiency gain of a multithreaded crawler is substantial.

This concludes the introduction to the principles of multithreaded and distributed crawler architecture in Java. Thank you for reading. If you want to learn more about the industry, follow this site; the editor will keep producing high-quality practical articles for you!
