

Example Analysis of Java Crawler Framework


This article introduces the relevant knowledge of "Example Analysis of Java Crawler Framework". Many people run into difficulties with cases like this in practice, so let the editor walk you through how to handle these situations. I hope you read carefully and come away with something useful!

I. Architecture Diagram

This search-oriented web crawler framework mainly targets e-commerce sites and covers data crawling, analysis, storage, and indexing.

Crawler: responsible for crawling, parsing, and processing the content of e-commerce site pages.

Database: stores product information.

Index: full-text search index of products.

Task queue: the list of web pages that still need to be crawled.

Visited table: the list of pages that have already been crawled.

Crawler monitoring platform: a web platform for starting and stopping crawlers and for managing crawlers, task queues, and visited tables.

II. Crawler

1. Process flow

1) The Scheduler starts the crawler, and the TaskMaster initializes the TaskQueue.

2) Workers take tasks from the TaskQueue.

3) A Worker thread calls the Fetcher to crawl the web page described in the Task.

4) The Worker thread sends the fetched page to the Parser for parsing.

5) The data parsed by the Parser is sent to the Handler, which extracts the page's links and processes its content.

6) The VisitedTableManager checks whether each link extracted by the URLExtractor has already been crawled; if not, the link is submitted to the TaskQueue.
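As a rough illustration, one iteration of this flow might look like the sketch below. All of the interfaces and names here are hypothetical stand-ins for the components described above, not the framework's actual API:

```java
import java.util.List;

// Hypothetical stand-ins for the components in the process flow above.
interface TaskQueue { String take() throws InterruptedException; void submit(String url); boolean isEmpty(); }
interface Fetcher { String fetch(String url) throws Exception; }             // returns raw HTML
interface Parser { List<String> extractLinks(String html) throws Exception; }
interface VisitedTableManager { boolean isVisited(String url); void markVisited(String url); }

class Worker implements Runnable {
    private final TaskQueue queue;
    private final Fetcher fetcher;
    private final Parser parser;
    private final VisitedTableManager visited;

    Worker(TaskQueue q, Fetcher f, Parser p, VisitedTableManager v) {
        this.queue = q; this.fetcher = f; this.parser = p; this.visited = v;
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            String url;
            try { url = queue.take(); } catch (InterruptedException e) { return; } // step 2
            try {
                String html = fetcher.fetch(url);                     // step 3
                visited.markVisited(url);
                for (String link : parser.extractLinks(html)) {       // steps 4-5
                    if (!visited.isVisited(link)) queue.submit(link); // step 6
                }
            } catch (Exception e) {
                // in a real crawler: log the failure and continue with the next task
            }
        }
    }
}
```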

2. Scheduler

The Scheduler is responsible for starting the crawler, calling the TaskMaster to initialize the TaskQueue, and creating a monitor thread that controls when the program exits.

When does it exit?

When the TaskQueue is empty, all threads in Workers are idle, and this state has not changed within the specified 10 minutes, we assume that all web pages have been crawled and the program exits.
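A minimal sketch of that monitor loop, assuming hypothetical isEmpty(), allIdle(), and shutdown() accessors on the queue and worker pool:

```java
// Hypothetical monitor thread: exit once the task queue has stayed empty and
// all workers have stayed idle for 10 consecutive minutes.
void monitorLoop(TaskQueue taskQueue, WorkerPool workers) throws InterruptedException {
    final long limitMs = 10 * 60 * 1000;    // the 10-minute window from the text
    long quietSince = -1;                   // -1 means "not currently quiet"
    while (true) {
        boolean quiet = taskQueue.isEmpty() && workers.allIdle(); // assumed accessors
        long now = System.currentTimeMillis();
        if (!quiet) quietSince = -1;                  // any activity resets the window
        else if (quietSince < 0) quietSince = now;    // a quiet period begins
        else if (now - quietSince >= limitMs) break;  // assume the crawl is finished
        Thread.sleep(5_000);                          // poll every 5 seconds
    }
    workers.shutdown();                               // assumed shutdown hook
}
```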

3. Task Master

The task manager is responsible for managing the task queue and abstracts away the queue's underlying implementation.

In simple applications, we can use an in-memory task queue.

In a distributed platform with multiple crawler machines, we need a centralized task queue.

At this stage, we use SQLite as the task queue implementation; Redis is also available as an alternative.

The processing flow of the Task Manager:

The task manager initializes the task queue, and how it does so depends on the configuration. For incremental crawling, it initializes from a specified URL list. For full crawling, only the home pages of one or more e-commerce sites are seeded in advance.

The task manager creates monitor threads to control the exit of the entire program.

The task manager schedules tasks; if the task queue is persistent, it is responsible for loading tasks from the task queue server, and prefetching needs to be considered.

The task manager is also responsible for verifying the validity of tasks; the crawler monitoring platform can mark some tasks in the task queue as failed.
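For the in-memory case, a queue implementing the hypothetical TaskQueue interface from the earlier sketch could be as simple as the following; the SQLite- and Redis-backed variants would sit behind the same interface:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// In-memory task queue for single-machine crawls; a stand-in for the
// persistent SQLite/Redis implementations described above.
class InMemoryTaskQueue implements TaskQueue {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

    public void submit(String url) { queue.offer(url); }                       // enqueue
    public String take() throws InterruptedException { return queue.take(); }  // block until available
    public boolean isEmpty() { return queue.isEmpty(); }
}
```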

4. Workers

Workers is a thread pool in which each thread executes the entire crawling process. Consider using multiple thread pools to split the process into stages and make it asynchronous, improving thread utilization.
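In the simplest single-pool form, wiring the workers might look like this (pool size and shutdown policy are illustrative choices, not the framework's defaults):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Start n workers running the crawl loop sketched earlier.
ExecutorService startWorkers(int n, TaskQueue q, Fetcher f, Parser p, VisitedTableManager v) {
    ExecutorService pool = Executors.newFixedThreadPool(n);
    for (int i = 0; i < n; i++) pool.submit(new Worker(q, f, p, v));
    return pool;
}

// Later, when the monitor decides the crawl is finished:
void stopWorkers(ExecutorService pool) throws InterruptedException {
    pool.shutdownNow();                           // interrupts workers blocked in take()
    pool.awaitTermination(30, TimeUnit.SECONDS);
}
```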

5. Fetcher

The Fetcher is responsible for directly fetching the pages of e-commerce sites. It is implemented with HTTP Client; Apache HttpCore 4 and above already provide NIO support, so the fetcher is implemented on NIO.

The Fetcher can be configured to control whether the fetched HTML file is saved.
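A minimal fetch along those lines, using Apache HttpAsyncClient (which builds on HttpCore's NIO layer); timeouts, retries, and the optional HTML-saving step are omitted:

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.Future;
import org.apache.http.HttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.nio.client.CloseableHttpAsyncClient;
import org.apache.http.impl.nio.client.HttpAsyncClients;
import org.apache.http.util.EntityUtils;

// Fetch one page over the NIO-based async client. Blocking on the Future keeps
// the sketch simple; a real fetcher would use the callback form and share one
// client across all requests instead of creating it per fetch.
String fetch(String url) throws Exception {
    try (CloseableHttpAsyncClient client = HttpAsyncClients.createDefault()) {
        client.start();
        Future<HttpResponse> future = client.execute(new HttpGet(url), null);
        HttpResponse response = future.get();
        return EntityUtils.toString(response.getEntity(), StandardCharsets.UTF_8);
    }
}
```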

6. Parser

The Parser parses the pages fetched by the Fetcher. Ordinary web pages may not be well-formed (XHTML is well-formed), so they cannot be processed with XML class libraries. We need a good HTML parser that can repair these malformed pages.

Familiar third-party tools include TagSoup, NekoHTML, and HtmlParser. TagSoup and NekoHTML can process HTML as a SAX event stream, which saves memory.
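For example, TagSoup exposes a SAX XMLReader that repairs malformed HTML on the fly (a minimal sketch; the handler is supplied by the caller):

```java
import java.io.StringReader;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Stream malformed HTML through TagSoup as SAX events; memory stays flat
// because no document tree is materialized.
void parseWithTagSoup(String html, DefaultHandler handler) throws Exception {
    XMLReader reader = new org.ccil.cowan.tagsoup.Parser(); // TagSoup's SAX parser
    reader.setContentHandler(handler);
    reader.parse(new InputSource(new StringReader(html)));
}
```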

Which parsers do known third-party frameworks use?

- Nutch: officially supports both TagSoup and NekoHTML, selectable via configuration

- Droids: uses NekoHTML and Tika

- Tika: uses TagSoup

It is said that TagSoup is more reliable than NekoHTML, while NekoHTML performs better than TagSoup; NekoHTML beats HtmlParser in both reliability and performance. Concrete conclusions will require further testing on our side.

We also support regex- and DOM-based HTML parsers, and they can be used in combination.

Going further, we need to study document comparison and save the HTML of crawled pages. This can be done with semantic fingerprints or simhash, and is only needed when handling huge amounts of data: if two HTML documents are judged to be the same, the second is not parsed and processed again.
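As an illustration of the simhash idea (not the framework's implementation), the following sketch tokenizes on non-word characters, weights every token equally, and compares pages by Hamming distance:

```java
// Minimal 64-bit simhash: hash each token, vote per bit, keep the sign bit.
static long simhash(String text) {
    int[] votes = new int[64];
    for (String token : text.toLowerCase().split("\\W+")) {
        if (token.isEmpty()) continue;
        long h = fnv1a64(token);                       // feature hash (stand-in choice)
        for (int i = 0; i < 64; i++) {
            votes[i] += ((h >>> i) & 1) == 1 ? 1 : -1;
        }
    }
    long fingerprint = 0L;
    for (int i = 0; i < 64; i++) {
        if (votes[i] > 0) fingerprint |= 1L << i;
    }
    return fingerprint;
}

static long fnv1a64(String s) {                        // FNV-1a, a simple 64-bit hash
    long h = 0xcbf29ce484222325L;
    for (int i = 0; i < s.length(); i++) {
        h = (h ^ s.charAt(i)) * 0x100000001b3L;
    }
    return h;
}

// Two pages count as near-duplicates when the distance is small (e.g. <= 3).
static int hammingDistance(long a, long b) { return Long.bitCount(a ^ b); }
```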

7. Handler

The Handler processes the content parsed by the Parser.

Callback approach (visitor): for SAX event processing, we adapt the Handler into a SAX ContentHandler that serves as the parser's callback. Content parsed from the different events can be stored in a HandlingContext, which the Parser finally returns.

Active approach: parse the entire HTML into a DOM structure, then select the content you need from what the Parser extracted. This is easy to use (XPath, node filters, and so on) but consumes memory.

ContentHandler: also contains a ContentFilter component for filtering content.

The URLExtractor is responsible for extracting normalized URLs from the page, building each URL into a Task, and submitting it to the TaskQueue (a sketch of such a callback handler follows).
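A toy version of the callback approach, combining the Handler and URLExtractor roles; the list of links plays the part of the HandlingContext, and all names and scope here are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// SAX callback handler that collects href links as the parser streams events;
// an instance can be passed to parseWithTagSoup() from the Parser sketch above.
class LinkExtractingHandler extends DefaultHandler {
    private final List<String> links = new ArrayList<>();  // a tiny HandlingContext

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes attributes) {
        if ("a".equalsIgnoreCase(localName)) {
            String href = attributes.getValue("href");
            if (href != null) links.add(href);  // resolving relative URLs is omitted
        }
    }

    List<String> getLinks() { return links; }
}
```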

8. VisitedTableManager

The visited table manager manages the URLs that have already been visited. It exposes a unified interface and abstracts the underlying implementation. If a URL has already been crawled, it will not be added to the TaskQueue.
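Behind the hypothetical interface sketched earlier, an in-memory stand-in is trivial; the storage options in section IV would implement the same two methods:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// In-memory visited table; SQLite-, MySQL-, Redis-, or Memcached-backed
// implementations (section IV) would sit behind the same interface.
class InMemoryVisitedTable implements VisitedTableManager {
    private final Set<String> visited = ConcurrentHashMap.newKeySet();

    public boolean isVisited(String url) { return visited.contains(url); }
    public void markVisited(String url) { visited.add(url); }
}
```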

III. Task queue

The task queue stores the tasks that need to be crawled. Tasks are related to one another, and we can save and manage this relationship, which is also the relationship between URLs. Saving it helps build the web graph in the background and analyze the data.

In a distributed crawler cluster, the task queue needs to be stored on a centralized server. Some lightweight databases, or NoSQL stores that support lists, can be used. Options:

- SQLite: requires constant inserting and deleting; the performance is uncertain.

- Redis (see the sketch after this list)
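A Redis-backed queue could use a list with blocking pops, for example via the Jedis client (key name and host are illustrative):

```java
import java.util.List;
import redis.clients.jedis.Jedis;

// Redis list as a centralized task queue. Note that a Jedis connection is not
// thread-safe, so real code would use a connection pool (JedisPool).
class RedisTaskQueue {
    private final Jedis jedis = new Jedis("localhost", 6379);

    void submit(String url) {
        jedis.lpush("crawler:tasks", url);            // producers push on the left
    }

    String take() {
        // BRPOP blocks until a task is available and returns [key, value]
        List<String> popped = jedis.brpop(0, "crawler:tasks");
        return popped.get(1);
    }
}
```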

IV. Visited table

The visited table stores the sites that have already been crawled, and it needs to be rebuilt for each crawl. Options:

- SQLite: requires dynamically creating tables, constant querying and inserting, and periodic background cleanup; the performance is uncertain.

- MySQL memory table with a hash index

- Redis: key-value with an expiration time (see the sketch after this list)

- Memcached: key-value, where the value is a Bloom filter

For the current data volume, SQLite can be used.
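The Redis option might look like the following, marking a URL with SET NX plus a TTL so entries expire on their own (key prefix and TTL are illustrative; SetParams assumes Jedis 3.x):

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

// Redis-backed visited table: SET NX marks a URL exactly once, and the TTL
// lets old entries expire instead of needing a background cleanup job.
class RedisVisitedTable {
    private static final int TTL_SECONDS = 7 * 24 * 3600;   // keep entries one week
    private final Jedis jedis = new Jedis("localhost", 6379);

    /** Returns true if the URL was newly marked, false if already visited. */
    boolean markIfNotVisited(String url) {
        String reply = jedis.set("visited:" + url, "1",
                                 SetParams.setParams().nx().ex(TTL_SECONDS));
        return "OK".equals(reply);                           // null means already present
    }
}
```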

V. Crawler monitoring and management platform

- Start and stop crawlers, and monitor the status of each crawler

- Monitor and manage the task queues and visited tables

- Configure the crawlers

Manage the data crawled by the crawlers. Under concurrency it is difficult to guarantee that the same product is never crawled twice, so after crawling, de-duplication can be arranged manually through the crawler monitoring and management platform.

This is the end of "Example Analysis of Java Crawler Framework". Thank you for reading. If you want to learn more about the industry, you can follow this site, where the editor will keep publishing high-quality practical articles for you!
