This article explains the main types of crawler technology. The explanations are kept simple and clear so that they are easy to learn and understand; let's follow the editor's train of thought and study the types of crawler technology.
A focused web crawler is a crawler program "oriented to a specific topic requirement", while a general-purpose web crawler is a key component of the crawling system of a search engine (Baidu, Google, Yahoo, etc.); its main purpose is to download web pages from the Internet to local storage, forming a mirror backup of Internet content.
Incremental crawling means crawling a site's data and, whenever the site publishes new data or its existing data changes, automatically grabbing only the new or changed data.
According to their mode of existence, web pages can be divided into surface pages (surface Web) and deep pages (deep Web, also known as the invisible or hidden Web).
A surface web page is a page that can be indexed by a traditional search engine; the surface Web consists mainly of static pages that can be reached through hyperlinks.
Deep web pages are pages whose content largely cannot be reached through static links; they are hidden behind search forms and can only be obtained by submitting certain keywords.
01 Focused crawler technology
A focused web crawler (focused crawler) is a topic-specific web crawler. Focused crawler technology adds link-evaluation and content-evaluation modules; the key point of its crawling strategy is to evaluate both the content of a page and the importance of its links.
The crawling strategy based on link evaluation treats a web page as a semi-structured document that contains a great deal of structural information usable for judging link importance. Another approach evaluates link value from the Web's link structure, namely the HITS method, which determines the link visiting order by computing an Authority weight and a Hub weight for each visited page.
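As an illustration of the HITS idea (a sketch under assumptions, not code from this article), the following toy function iterates Authority and Hub weights over a small, hypothetical link graph; a focused crawler could then visit links in descending order of authority.

def hits(graph, iterations=20):
    """graph: dict mapping a page to the list of pages it links to."""
    pages = set(graph) | {p for targets in graph.values() for p in targets}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # authority of a page = sum of hub scores of the pages linking to it
        auth = {p: sum(hub[q] for q in graph if p in graph.get(q, [])) for p in pages}
        # hub of a page = sum of authority scores of the pages it links to
        hub = {p: sum(auth[q] for q in graph.get(p, [])) for p in pages}
        # normalize so the scores stay comparable between iterations
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return auth, hub

# hypothetical toy graph: visit links in descending order of authority
toy_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
authority, hub = hits(toy_graph)
print(sorted(authority, key=authority.get, reverse=True))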
The crawling strategy based on content evaluation applies text-similarity computation. The Fish-Search algorithm takes the user's query words as the topic; its refinement, the Shark-Search algorithm, uses the vector space model to calculate the relevance between a page and the topic.
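As a sketch of the vector-space relevance test behind Shark-Search-style content evaluation (an illustrative assumption, not the algorithm's exact weighting), the query and a page can be represented as term-frequency vectors and scored by cosine similarity; only links from sufficiently relevant pages would then be enqueued.

from collections import Counter
import math

def cosine_relevance(query, page_text):
    q = Counter(query.lower().split())
    d = Counter(page_text.lower().split())
    common = set(q) & set(d)
    dot = sum(q[t] * d[t] for t in common)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(sum(v * v for v in d.values()))
    return dot / norm if norm else 0.0

# hypothetical usage: score a page against the crawl topic
print(cosine_relevance("python crawler", "a tutorial on writing a python web crawler"))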
A focused crawler is topic-oriented and demand-oriented: it crawls information about specific content and keeps the collected information as relevant to the requirement as possible. A simple example of a focused crawler is shown below.
[Example 1] A simple focused crawler that crawls images from Taobao
import urllib.request   # urllib is used to fetch pages; different Python versions package it differently
import re               # regular expressions are used to extract the image URLs

keyname = ""                          # the content (product keyword) you want to crawl
key = urllib.request.quote(keyname)   # URL-encode the keyword so it can be placed in the request URL
for i in range(0, 5):                 # (0, 5) can be set by yourself; it is the number of result pages to crawl
    # the URL is the Taobao search page; open a few similar pages to find the pattern in the query string
    url = ("https://s.taobao.com/search?q=" + key
           + "&imgfile=&js=1&stats_click=search_radio_all%3A1&initiative_id=staobaoz_20180815"
           + "&ie=utf8&bcoffset=0&ntoffset=6&p4ppushleft=1%2C48&s=" + str(i * 44))
    # data is all the content of the page you crawled; decode it so it can be read as text
    data = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
    pat = '"pic_url":"//(.*?)"'                    # regular expression that extracts image URLs from the page
    picturelist = re.compile(pat).findall(data)    # put the extracted image URLs into a list
    print(picturelist)                             # optional: print the list to check the result
    for j in range(0, len(picturelist)):
        picture = picturelist[j]
        pictureurl = "http://" + picture           # traverse the list and add http:// to get the full image URL
        # number the images one by one, otherwise duplicate names would overwrite each other
        file = "E:/pycharm/vscode file/picture/" + str(i) + str(j) + ".jpg"
        urllib.request.urlretrieve(pictureurl, filename=file)   # finally save the image to the folder

02 General crawler technology
General-purpose crawler technology (general purpose Web crawler) crawls the whole Web. Its implementation process is as follows.
First, obtain the initial URLs. The initial URL addresses can be specified by the user directly, or taken from one or more seed web pages specified by the user.
Second, crawl the pages at the initial URLs and obtain new URLs. After obtaining the initial URL addresses, the crawler fetches the web pages at those addresses, stores the pages in the original database, and extracts new URL addresses while crawling; the URL addresses already crawled are kept in a URL list used for de-duplication and for judging the progress of the crawl.
Third, put the new URLs into the URL queue. Each new URL address obtained in the second step is placed in the URL queue.
Fourth, read a new URL from the URL queue and crawl the corresponding web page; at the same time, obtain new URLs from that page and repeat the crawling process.
Fifth, stop crawling when the stop condition set by the crawler system is met. When writing a crawler, you usually set a corresponding stop condition; if no stop condition is set, the crawler keeps crawling until no new URL addresses can be obtained, otherwise it stops as soon as the condition is satisfied. A minimal sketch of this loop is shown below.
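The following sketch strings the five steps together, assuming the requests and BeautifulSoup libraries are available; the seed URL and the page limit used as the stop condition are placeholders for illustration, not part of the article.

import requests
from bs4 import BeautifulSoup
from collections import deque
from urllib.parse import urljoin

def general_crawl(seed_url, max_pages=50):
    url_queue = deque([seed_url])         # step 1: initial URL
    seen = {seed_url}                     # URL list used for de-duplication
    while url_queue and max_pages > 0:    # step 5: stop condition
        url = url_queue.popleft()         # step 4: read the next URL from the queue
        try:
            html = requests.get(url, timeout=5).text   # step 2: crawl the page
        except requests.RequestException:
            continue
        # ... store html locally here to build the mirror backup ...
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            new_url = urljoin(url, a["href"])          # step 2: find new URLs on the page
            if new_url not in seen:
                seen.add(new_url)
                url_queue.append(new_url)              # step 3: put new URLs into the queue
        max_pages -= 1

general_crawl("https://www.example.com")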
General crawler technology can be applied with different crawling strategies, among which the breadth-first strategy and the depth-first strategy are the most important: the depth-first strategy keeps following links into deeper levels before backtracking, while the breadth-first strategy finishes all links at the current level before descending, as illustrated below.
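As a small illustration (an assumption, not from the article), the choice of frontier data structure is what selects the strategy: taking URLs from the front of the queue gives breadth-first crawling, while taking them from the back gives depth-first crawling.

from collections import deque

frontier = deque(["https://www.example.com/a", "https://www.example.com/b"])
bfs_next = frontier.popleft()   # breadth-first: oldest discovered URL first
dfs_next = frontier.pop()       # depth-first: most recently discovered URL first
print(bfs_next, dfs_next)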
An example of how to use a general-purpose crawler is shown below.
[Example 2] Crawling JD.com product information
'''
Crawl JD.com product information:
    request url: https://www.jd.com/
    extract for each product:
        1. product detail page URL
        2. product name
        3. product price
        4. number of reviews
        5. shop name
'''
from selenium import webdriver              # webdriver from selenium drives a real browser
from selenium.webdriver.common.keys import Keys
import time

def get_good(driver):
    try:
        # scroll the page with JS so that all product entries are loaded
        js_code = '''
            window.scrollTo(0, 5000)
        '''
        driver.execute_script(js_code)      # execute the js code
        # wait for the data to load
        time.sleep(2)
        # find all product divs
        # good_div = driver.find_element_by_id('J_goodsList')
        good_list = driver.find_elements_by_class_name('gl-item')
        for good in good_list:
            # product link
            good_url = good.find_element_by_css_selector('.p-img a').get_attribute('href')
            # product name
            good_name = good.find_element_by_css_selector('.p-name em').text.replace("\n", "--")
            # product price
            good_price = good.find_element_by_class_name('p-price').text.replace("\n", ":")
            # number of reviews
            good_commit = good.find_element_by_class_name('p-commit').text.replace("\n", " ")
            good_content = f'''
            product link: {good_url}
            product name: {good_name}
            product price: {good_price}
            number of reviews: {good_commit}
            '''
            print(good_content)
            with open('jd.txt', 'a', encoding='utf-8') as f:
                f.write(good_content)
        # go to the next result page and crawl it recursively
        next_tag = driver.find_element_by_class_name('pn-next')
        next_tag.click()
        time.sleep(2)
        get_good(driver)
        time.sleep(10)
    finally:
        driver.close()

if __name__ == '__main__':
    good_name = input('Please enter the product to crawl: ').strip()
    driver = webdriver.Chrome()
    driver.implicitly_wait(10)
    # send a request to JD.com's home page
    driver.get('https://www.jd.com/')
    # enter the product name and press Enter to search
    input_tag = driver.find_element_by_id('key')
    input_tag.send_keys(good_name)
    input_tag.send_keys(Keys.ENTER)
    time.sleep(2)
    get_good(driver)

03 Incremental crawler technology
Some websites regularly add new batches of data on top of their existing pages. For example, a movie website updates its recently popular movies in real time, and a novel website publishes the latest chapters as the author writes them. When we encounter such scenarios, we can use an incremental crawler.
Incremental crawler technology (incremental Web crawler) monitors a website's data updates through the crawler program, so that only the website's updated data is crawled.
As for how to perform incremental crawling, here are three ways to detect duplicate data:
Before sending a request, determine whether the URL has been crawled before.
After parsing the content, determine whether this part of the content has been crawled before.
When writing to the storage medium, determine whether the content already exists in the medium.
The first idea is suitable for websites that constantly have new pages, such as new chapters of novels, daily real-time news, etc.
The second idea is suitable for websites where the content of the page will be updated regularly.
The third idea acts as the last line of defense; it achieves de-duplication to the greatest possible extent.
It is not difficult to see that the core of incremental crawling is de-duplication. At present there are two common de-duplication methods.
First, store the URLs generated during crawling in a Redis set. Before the next data crawl, check the URL of the request to be issued against the set of stored URLs: if it already exists, skip the request; otherwise, issue it.
Second, compute a unique identifier (a data fingerprint) for the crawled web content and store it in a Redis set. The next time web page data is crawled, check whether the data's unique identifier already exists in the Redis set before persisting it, and then decide whether to store it. A small sketch of both methods follows.
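A minimal sketch of both de-duplication methods, assuming a Redis server is running locally and the redis-py client is installed; the key names and the MD5 fingerprint are illustrative choices, not prescribed by the article.

import hashlib
from redis import Redis

conn = Redis(host='127.0.0.1', port=6379)

def should_request(url):
    # sadd returns 1 if the URL was not in the set yet, 0 if it was already there
    return conn.sadd('crawled_urls', url) == 1

def should_store(record_text):
    # data fingerprint: hash of the crawled content
    fingerprint = hashlib.md5(record_text.encode('utf-8')).hexdigest()
    return conn.sadd('data_fingerprints', fingerprint) == 1

if should_request('http://www.example.com/chapter/1'):
    print('new URL, send the request')
if should_store('chapter 1 body text'):
    print('new content, persist it')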
An example of how to use an incremental crawler is shown below.
[Example 3] Crawl all the movie detail data on the 4567tv website
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from redis import Redis
from incrementPro.items import IncrementproItem

class MovieSpider(CrawlSpider):
    name = 'movie'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.4567tv.tv/frim/index7-11.html']
    rules = (
        Rule(LinkExtractor(allow=r'/frim/index7-\d+\.html'), callback='parse_item', follow=True),
    )
    # create the Redis connection object
    conn = Redis(host='127.0.0.1', port=6379)

    def parse_item(self, response):
        li_list = response.xpath('//li[@class="p1 m1"]')
        for li in li_list:
            # get the url of the detail page
            detail_url = 'http://www.4567tv.tv' + li.xpath('./a/@href').extract_first()
            # store the url of the detail page in the Redis set
            ex = self.conn.sadd('urls', detail_url)
            if ex == 1:
                print('This url has not been crawled yet; its data can be crawled.')
                yield scrapy.Request(url=detail_url, callback=self.parst_detail)
            else:
                print('The data has not been updated yet; there is no new data to crawl!')

    # parse the movie name and genre on the detail page for persistent storage
    def parst_detail(self, response):
        item = IncrementproItem()
        item['name'] = response.xpath('//dt[@class="name"]/text()').extract_first()
        item['kind'] = response.xpath('//div[@class="ct-c"]/dl/dt[4]//text()').extract()
        item['kind'] = ''.join(item['kind'])
        yield item
Pipeline file:
from redis import Redis

class IncrementproPipeline(object):
    conn = None

    def open_spider(self, spider):
        self.conn = Redis(host='127.0.0.1', port=6379)

    def process_item(self, item, spider):
        dic = {
            'name': item['name'],
            'kind': item['kind']
        }
        print(dic)
        # if the push does not go through, convert dic to str(dic) or change
        # the redis version: pip install -U redis==2.10.6
        self.conn.lpush('movieData', dic)
        return item

04 Deep web crawler technology
On the Internet, web pages can be divided into two categories according to their mode of existence: surface web pages and deep web pages.
A so-called surface web page is a static page that can be reached through static links without submitting a form, while a deep web page is hidden behind a form and cannot be reached directly through static links; certain keywords must be submitted to obtain it. The most important part of a deep web crawler (deep Web crawler) is therefore the form-filling part.
On the Internet, the number of deep pages is often far greater than the number of surface pages, so we need ways to crawl deep pages.
The basic components of a deep web crawler are: the URL list, the LVS list (LVS refers to the label/value set, that is, the data source used to fill in forms), the crawl controller, the parser, the LVS controller, the form parser, the form processor, and the response analyzer.
There are two types of form filling for deep web crawlers (a small sketch follows the list below):
Form filling based on domain knowledge (build a keyword bank for filling in forms and, when needed, select the appropriate keywords according to semantic analysis).
Form filling based on web page structure analysis (generally used when domain knowledge is limited; this method analyzes the structure of the web page and fills in the form automatically).
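A minimal sketch of domain-knowledge-based form filling; the search URL, the form field name and the keyword bank below are hypothetical, and the response analysis step is only indicated in a comment.

import requests

keyword_bank = ["machine learning", "web crawler", "data mining"]  # keywords drawn from domain knowledge

for keyword in keyword_bank:
    # the form processor fills in the form fields and submits them
    form_data = {"q": keyword, "page": 1}
    response = requests.post("https://www.example.com/search", data=form_data, timeout=5)
    # the response analyzer would parse response.text here and extract the deep pages behind the form
    print(keyword, response.status_code)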
Thank you for reading. This concludes the discussion of the types of crawler technology. After studying this article, you should have a deeper understanding of the types of crawler technology; how each one works in practice still needs to be verified through hands-on use. More related articles will follow; you are welcome to keep reading.