
Distributed crawler based on java

2025-03-26 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/02 Report--

[This article is reposted from the Blog Garden (cnblogs) author Zhang Feng. Original link: https://www.cnblogs.com/skyme/p/4440831.html]

Classification

A distributed web crawler comprises multiple crawler nodes. Each node performs the same work as a standalone crawler: it downloads pages from the Internet, saves them to local disk, extracts URLs from them, and continues crawling along those URLs. Because a parallel crawler must split the download task among its nodes, a node may send the URLs it extracts to other nodes. These crawlers may sit in the same local area network or be scattered across different geographical locations.
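The download/extract/continue loop described above can be sketched as a single node. This is a minimal illustration, not code from the project; the class name `CrawlerNode`, the regex-based link extraction, and the placeholder `download` method are all assumptions for the sketch.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Minimal single-node sketch of the crawl loop: fetch, save, extract, enqueue. */
public class CrawlerNode {
    private static final Pattern HREF = Pattern.compile("href=\"(http[^\"]+)\"");

    private final Queue<String> frontier = new ArrayDeque<>();
    private final Set<String> seen = new HashSet<>();

    /** Enqueue a URL unless this node has already seen it. */
    public void submit(String url) {
        if (seen.add(url)) frontier.add(url);
    }

    /** Extract outgoing links from a downloaded page (naive regex scan). */
    public static Set<String> extractUrls(String html) {
        Set<String> urls = new HashSet<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) urls.add(m.group(1));
        return urls;
    }

    /** One iteration of the loop: take a URL, fetch it, enqueue its links. */
    public void crawlOnce() {
        String url = frontier.poll();
        if (url == null) return;
        String html = download(url);   // a real node would also save the page to disk
        for (String u : extractUrls(html)) {
            submit(u);                 // or forward u to the node responsible for it
        }
    }

    private String download(String url) {
        // Placeholder: a real node would fetch the page over HTTP here.
        return "";
    }
}
```

In a distributed setting, the `submit` call is where a node would instead forward a URL to whichever peer owns it.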

According to how widely the crawler nodes are dispersed, distributed crawlers fall into the following two categories:

1. LAN-based distributed web crawlers: all crawler nodes run in the same local area network and communicate over high-speed links. They access the external Internet through the same exit, so all network load concentrates at that LAN exit. The LAN's high internal bandwidth keeps inter-crawler communication efficient, but the total exit bandwidth is fixed, so it caps the number of crawlers that can run effectively.

2. WAN-based distributed web crawlers: when the crawler nodes run in different geographical locations (or network locations), we call the parallel crawler a distributed crawler. For example, the nodes might be located in China, Japan, and the United States, each responsible for downloading pages from its own region; or sit on CHINANET, CERNET, and CEINET, each downloading pages from its own network. The advantage of this arrangement is that it disperses network traffic and reduces the load on any single network exit. When the nodes are spread across locations, how often they need to communicate becomes a question worth careful thought, since bandwidth between nodes may be limited and traffic usually has to cross the public Internet.
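One common way to split the download task (a standard scheme, not something the article specifies) is to hash each URL's host to a crawler index, so every site has exactly one owner and extracted URLs can be forwarded to that owner. The class name `UrlPartitioner` is illustrative.

```java
import java.net.URI;

/** Sketch of host-hash partitioning: each host maps to exactly one crawler node. */
public class UrlPartitioner {
    private final int numCrawlers;

    public UrlPartitioner(int numCrawlers) {
        this.numCrawlers = numCrawlers;
    }

    /** Returns the index of the crawler responsible for this URL's host. */
    public int ownerOf(String url) {
        String host = URI.create(url).getHost();
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(host.hashCode(), numCrawlers);
    }
}
```

Because all URLs on one host map to the same node, per-site politeness (crawl delay, robots.txt) can be enforced locally without coordination.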

Architecture diagram of large-scale distributed web crawler

A distributed web crawler is a very complex system with many factors to consider; performance is a key metric, and adequate hardware-level resources are of course also necessary.

Architecture

The following is the overall architecture of the project, and the first version is based on this scenario.

The web layer at the top includes the console, basic permissions, monitoring display, and so on, and can be extended further as needed.

The core layer is scheduled centrally by the controller, which dispatches tasks to the workers in the worker queue for crawling. Each node dynamically sends its module status and other information to the monitoring module, and the display layer presents everything in one place.
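The controller-to-worker-queue flow above can be sketched with a `BlockingQueue`. This is a single-process illustration of the dispatch pattern only; the class name `ControllerSketch` and its methods are assumptions, and a real deployment would use a distributed queue and report status to the monitoring module over the network.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch of the core layer: a controller-fed task queue consumed by workers. */
public class ControllerSketch {
    private final BlockingQueue<String> tasks = new LinkedBlockingQueue<>();
    private final AtomicInteger completed = new AtomicInteger();

    /** Controller side: enqueue a crawl task. */
    public void dispatch(String url) {
        tasks.add(url);
    }

    /** Worker side: take one task, process it, and record its completion. */
    public void workOnce() {
        String url = tasks.poll();
        if (url == null) return;
        // A real worker would download the page here, then push its status
        // (url, duration, result) to the monitoring module for display.
        completed.incrementAndGet();
    }

    /** Count of finished tasks, standing in for the monitoring feed. */
    public int completedCount() {
        return completed.get();
    }
}
```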

Project goal

Public content push: an open-source version of Jinri Toutiao (Toutiao)!

A distributed web crawler built on Hadoop-style thinking.

At present, fourinone, jeesite, and webmagic have been integrated, with further improvements in progress. The goal of the first phase is to build a dynamically configurable distributed crawler system driven by the designer.

Current status of the project

Current progress of the project:

1. Sourceer, which can access a variety of data sources; its API has been defined (builder encapsulation added; simple crawlers are already usable).

2. Web architecture project (the web project has been uploaded and tested successfully; permissions, infrastructure changes, and imports are recorded on video; activiti and the cms module have been removed).

3. Distributed framework research (the distributed project has been split into packages, some comments added, and single-machine / single-worker crawling tested).

4. Plug-in integration.

5. Various de-duplication methods and algorithms for articles (Bloom filter, fingerprint algorithm, simhash, word segmentation with ansj).

6. Classifier test (Bayes; stand-alone text classification tested successfully).
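Of the de-duplication methods listed in item 5, the Bloom filter is the one most easily sketched in a few lines. The following is an illustrative implementation over `java.util.BitSet`, not the project's code; the double-hashing scheme and class name `UrlBloomFilter` are assumptions.

```java
import java.util.BitSet;

/** Minimal Bloom filter for URL de-duplication (illustrative double hashing). */
public class UrlBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int numHashes;

    public UrlBloomFilter(int size, int numHashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.numHashes = numHashes;
    }

    /** Derive the i-th bit index via double hashing: h1 + i * h2 (h2 forced odd). */
    private int index(String url, int i) {
        int h1 = url.hashCode();
        int h2 = Integer.reverse(h1) | 1;
        return Math.floorMod(h1 + i * h2, size);
    }

    /** Mark a URL as seen by setting all of its hash positions. */
    public void add(String url) {
        for (int i = 0; i < numHashes; i++) bits.set(index(url, i));
    }

    /** False means definitely unseen; true may be a false positive. */
    public boolean mightContain(String url) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(url, i))) return false;
        }
        return true;
    }
}
```

The filter never reports a false negative, so it is safe to skip any URL it says was seen; the trade-off is a small false-positive rate that grows as bits fill up.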

Project address:

(distributed crawler) http://git.oschina.net/zongtui/zongtui-webcrawler

(deduplicator) https://git.oschina.net/zongtui/zongtui-filter

(text classifier) https://git.oschina.net/zongtui/zongtui-classifier

(document directory) https://git.oschina.net/zongtui/zongtui-doc

Project interface:

Start jetty; the skin (UI theme) has not been changed yet.

Summary

At present, the project is being further improved. I hope to get more comments from you!
