This article explains the implementation principles of web crawlers. Most people are not very familiar with how crawlers work, so it is shared here as a reference; I hope you learn something useful from reading it.
Preface
Web crawlers are also known as web spiders, web ants, or web robots.
A web crawler system is usually composed of control nodes, crawler nodes, and a resource library.
The control node, also known as the central controller of the crawler, is mainly responsible for allocating threads according to URL addresses (URL: Uniform Resource Locator, the string used to identify information resources on the Internet, mainly used by various WWW client and server programs) and for calling crawler nodes to carry out specific crawling tasks.
A crawler node crawls web pages according to the relevant algorithms (including downloading pages, processing page text, and so on). After crawling, the results are stored in the corresponding resource library.
A web crawler can contain multiple control nodes, and each control node can manage multiple crawler nodes.
The control nodes can communicate with each other, a control node can communicate with the crawler nodes under it, and crawler nodes under the same control node can also communicate with each other.
Now that we understand the composition of a web crawler, what specific types of web crawlers are there?
According to the implemented technology and structure, web crawlers can be divided into general web crawlers, focused web crawlers, incremental web crawlers and deep web crawlers.
In practice, a crawler is usually a combination of several of these types.
Different types of web crawlers have different implementation principles, but those principles also share many commonalities.
The following sections introduce the implementation principles of web crawlers, taking the general web crawler and the focused web crawler as examples.
General web crawler
The implementation principle and process of a general web crawler can be briefly summarized as follows:
Figure: the implementation principle and process of a general web crawler
1. Get the initial URL
The initial URLs can be specified directly by the user, or determined from one or more seed web pages specified by the user.
2. Crawl the pages and obtain new URLs according to the initial URLs
After the initial URLs are obtained, the corresponding web pages are crawled and stored in the original page database. At the same time, each crawled URL is recorded in a list of crawled URLs, which is used for de-duplication and for judging the progress of the crawl, and new URLs are extracted from the downloaded pages.
3. Put the new URL in the URL queue
The new URLs obtained in step 2 are placed into the URL queue.
4. Repeat the crawling process
Read a new URL from the URL queue, crawl the corresponding web page, obtain further new URLs from that page, and repeat the crawling process.
* When writing a crawler, a stop condition is generally set: crawling stops once the condition configured for the crawler system is met. If no stop condition is set, the crawler keeps crawling until it can no longer obtain new URLs. A minimal sketch of this crawl loop is shown below.
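To make the four steps concrete, here is a minimal Python sketch of a general crawl loop. The article itself contains no code, so the function name general_crawl, the max_pages stop condition, and the seed URL https://example.com/ are illustrative assumptions, not a definitive implementation. The sketch keeps a URL queue, a de-duplication set, and a page store, and stops when the page limit is reached or the queue is empty.

```python
from collections import deque
from urllib.parse import urljoin
from html.parser import HTMLParser

import requests  # third-party HTTP library


class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags on a downloaded page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def general_crawl(seed_urls, max_pages=50):
    """Breadth-first crawl: URL queue, de-duplication list, stop condition."""
    queue = deque(seed_urls)      # URL queue (step 3)
    visited = set(seed_urls)      # crawled-URL list used for de-duplication
    pages = {}                    # "original page database" of downloaded pages

    while queue and len(pages) < max_pages:   # stop condition (step 4)
        url = queue.popleft()                 # read a new URL from the queue
        try:
            resp = requests.get(url, timeout=5)
        except requests.RequestException:
            continue                          # skip pages that fail to download
        pages[url] = resp.text                # store the page in the resource library

        extractor = LinkExtractor()
        extractor.feed(resp.text)
        for link in extractor.links:          # obtain new URLs from the new page
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in visited:
                visited.add(absolute)
                queue.append(absolute)        # put the new URL into the URL queue
    return pages


if __name__ == "__main__":
    crawled = general_crawl(["https://example.com/"], max_pages=5)
    print(f"Downloaded {len(crawled)} pages")
```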
Focused web crawler
Because a focused web crawler crawls purposefully, compared with a general web crawler it must also add a target definition and a filtering mechanism; that is, it needs to define the crawling target, filter out irrelevant links, select the URLs to be crawled next, and so on.
The whole process is shown in the following figure:
Figure: the basic principle and implementation process of a focused web crawler
1. Definition and description of crawling target
First of all, the crawling target of the focused web crawler is defined and described according to the crawling requirements.
2. Get the initial URLs.
3. Crawl the pages according to the initial URLs and obtain new URLs.
4. Filter out links unrelated to the crawling target from the new URLs
Because a focused web crawler crawls web pages with a purpose, pages that have nothing to do with the target are filtered out. At the same time, the crawled URLs need to be stored in a URL list, which is used for de-duplication and for judging the progress of the crawl.
5. Put the filtered link into the URL queue.
6. Determine the priority of the URLs in the URL queue according to the search algorithm, and select the URLs to be crawled next
For a focused web crawler, different crawling orders may lead to different execution efficiency, so the URLs to crawl next must be determined according to the search strategy.
7. Read new URLs from the set of URLs to be crawled next, crawl the corresponding web pages according to those URLs, and repeat the crawling process. Crawling stops when the stop condition set in the system is met, or when no new URLs can be obtained. A sketch of the filtering and prioritization steps is shown below.
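As an illustration of steps 4 to 6, here is a small Python sketch of link filtering and prioritization based on a keyword target definition. The keyword-scoring function, the example URLs, and the use of a heap-based priority queue are assumptions made for illustration; real focused crawlers typically score page content and anchor text rather than just the URL string.

```python
import heapq
import itertools


def relevance(url, keywords):
    """Toy target definition: score a URL by how many target keywords it contains."""
    return sum(1 for kw in keywords if kw in url.lower())


def focused_order(candidate_urls, keywords, visited):
    """Filter links unrelated to the target (steps 4-5) and return the rest
    in priority order (step 6): higher relevance first."""
    heap = []
    counter = itertools.count()           # tie-breaker keeps heap ordering stable
    for url in candidate_urls:
        if url in visited:
            continue                      # de-duplication against the URL list
        score = relevance(url, keywords)
        if score == 0:
            continue                      # drop links with no relation to the target
        heapq.heappush(heap, (-score, next(counter), url))  # max-priority via negation
    return [url for _, _, url in (heapq.heappop(heap) for _ in range(len(heap)))]


if __name__ == "__main__":
    keywords = ["python", "crawler"]      # crawling-target definition (step 1)
    links = [
        "https://example.com/python-crawler-tutorial",
        "https://example.com/sports-news",
        "https://example.com/crawler-faq",
    ]
    print(focused_order(links, keywords, visited=set()))
```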
Incremental web crawler
An incremental web crawler incrementally updates the pages it has already downloaded and crawls only newly generated or changed pages, which to a certain extent ensures that the crawled pages are as fresh as possible. One way to detect changed pages is sketched below.
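A common way to realize incremental crawling is to remember a fingerprint of each downloaded page and re-process a page only when its content has changed. The sketch below, a minimal assumption-laden example rather than a standard implementation, uses a SHA-256 hash of the response body; in practice HTTP conditional requests (If-Modified-Since, ETag) can avoid re-downloading unchanged pages altogether. The function name and the URL are placeholders.

```python
import hashlib

import requests  # third-party HTTP library


def incremental_fetch(url, seen_hashes):
    """Download a page and report whether its content changed since the last visit.

    seen_hashes maps url -> hash of the previously downloaded version.
    Returns (changed, text); changed is False when the stored copy is still current.
    """
    resp = requests.get(url, timeout=5)
    digest = hashlib.sha256(resp.content).hexdigest()
    if seen_hashes.get(url) == digest:
        return False, None            # unchanged page: skip re-processing
    seen_hashes[url] = digest         # new or changed page: update the record
    return True, resp.text


if __name__ == "__main__":
    history = {}
    changed, _ = incremental_fetch("https://example.com/", history)
    print("changed" if changed else "unchanged")
    changed, _ = incremental_fetch("https://example.com/", history)
    print("changed" if changed else "unchanged")   # usually prints "unchanged"
```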
Deep web crawler
According to the mode of existence, web pages can be divided into surface web pages and deep web pages.
Surface web pages refer to the pages that can be indexed by traditional search engines, mainly static pages that can be reached by hyperlinks.
Deep web pages are pages whose content largely cannot be reached through static links; they are hidden behind search forms and can only be obtained when users submit certain keywords.
A deep web crawler is mainly used for the deep web. Its main architecture includes six basic functional modules:
a crawl controller, a parser, a form parser, a form processor, a response analyzer, and an LVS controller, plus two internal crawler data structures (a URL list and an LVS table). Here LVS (Label Value Set) denotes a set of labels and values that represents the data source used to fill in a form.
In the crawling process of a deep web crawler, the most important part is form filling, which includes form filling based on domain knowledge and form filling based on analysis of the web page structure. A minimal sketch of form filling driven by a label value set is given below.
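The sketch below illustrates the idea of form filling driven by a label value set: each form field label is mapped to candidate values, and the form is submitted once per value to retrieve pages hidden behind it. The field names, the LVS contents, and the endpoint https://example.com/search are placeholders invented for illustration, not part of any real site or of the architecture described above.

```python
import requests  # third-party HTTP library

# Hypothetical Label Value Set (LVS): form field labels mapped to
# candidate values used to fill in a hidden search form.
LVS_TABLE = {
    "keyword": ["web crawler", "deep web"],
    "category": ["technology"],
}


def fill_and_submit(form_url, lvs):
    """Fill the form with one candidate value per submission and collect the
    response pages that would otherwise stay hidden behind the form."""
    results = []
    for keyword in lvs["keyword"]:
        payload = {
            "keyword": keyword,            # field names are assumptions about the form
            "category": lvs["category"][0],
        }
        resp = requests.post(form_url, data=payload, timeout=5)
        results.append(resp.text)          # a response analyzer would parse these pages
    return results


if __name__ == "__main__":
    pages = fill_and_submit("https://example.com/search", LVS_TABLE)
    print(f"Retrieved {len(pages)} result pages")
```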