This article explains how to implement a web crawler in Python. It goes into considerable detail and should be a useful reference; interested readers are encouraged to read on.
I. Overview
A web crawler (also known as a web spider or web robot) is a program or script used to automatically fetch content from target websites.
Functionally, a web crawler can be divided into three parts: data acquisition, data processing, and data storage.
The basic working flow of these three parts is shown in the figure below:
II. Principle
Function: download web page data and provide a data source for the search engine system. Components: controller, parser, and resource repository.
The crawler system first puts the seed URLs into the download queue, then takes a URL from the head of the queue and downloads the corresponding web page. After the page content is saved, new URLs are obtained by parsing the link information in the page, and these URLs are added to the download queue. The system then takes out another URL, downloads its page, and parses it, over and over, until the entire web has been traversed or some stopping condition is met.
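As a minimal sketch of this loop (the seed URL, the link-extraction regex, and the page limit below are illustrative assumptions, not part of the original article):

import re
import urllib.request
from collections import deque

def basic_crawl(seed_url, max_pages=10):
    # Download queue seeded with the start URL; a visited set avoids repeats
    queue = deque([seed_url])
    visited = set()
    while queue and len(visited) < max_pages:
        url = queue.popleft()  # take a URL from the head of the queue
        if url in visited:
            continue
        try:
            html = urllib.request.urlopen(url, timeout=10).read().decode('utf-8', 'ignore')
        except Exception:
            continue  # skip pages that fail to download
        visited.add(url)  # the page content would be saved here
        # Parse the link information and append newly found URLs to the queue
        for link in re.findall(r'href="(https?://[^"]+)"', html):
            if link not in visited:
                queue.append(link)
    return visited

# Example usage (illustrative seed URL)
# print(basic_crawl("https://www.douban.com/", max_pages=5))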
III. Classification of crawlers
1. Traditional crawlers
A traditional crawler starts from the URLs of one or more initial web pages and obtains the URLs on those pages. As it crawls, it continually extracts new URLs from the current page and puts them into the queue until the system's stopping conditions are met.
2. Focused crawlers
The workflow of a focused crawler is more complex: it must filter out links unrelated to the topic according to some web page analysis algorithm, keep the useful links, and put them into the queue of URLs waiting to be fetched. It then selects the next URL from the queue according to a certain search strategy and repeats the process until some system condition is reached. In addition, all pages fetched by the crawler are stored, analyzed, filtered, and indexed by the system for later query and retrieval. For a focused crawler, the analysis results obtained in this process may also feed back into and guide subsequent crawling.
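A minimal sketch of the link-filtering step, where a simple keyword count stands in for a real web page analysis algorithm (the keywords, links, and threshold below are assumptions for illustration):

def relevance(url, anchor_text, topic_keywords):
    # Crude topic score: count how many topic keywords appear in the URL or anchor text
    text = (url + " " + anchor_text).lower()
    return sum(1 for kw in topic_keywords if kw in text)

def filter_links(candidate_links, topic_keywords, threshold=1):
    # Keep only links judged relevant to the topic; irrelevant links are discarded
    kept = []
    for url, anchor_text in candidate_links:
        if relevance(url, anchor_text, topic_keywords) >= threshold:
            kept.append(url)
    return kept

links = [("https://example.com/python-crawler", "python crawler tutorial"),
         ("https://example.com/cooking", "recipes")]
print(filter_links(links, ["python", "crawler"]))  # only the first link survives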
3. General web crawlers (whole-web crawlers)
A general web crawler, also known as a whole-web crawler, starts from a set of seed URLs and expands to the entire Web, mainly collecting data for portal search engines and large Web service providers. This kind of crawler covers a huge range and number of pages, so it has high requirements for crawling speed and storage space and relatively low requirements for the order in which pages are crawled. Because there are so many pages to refresh, it usually works in parallel, yet it still takes a long time to refresh a page. Despite these drawbacks, general web crawlers are suitable for searching a wide range of topics for search engines and have strong practical value.
In practice, a web crawler system is usually implemented by combining several of these crawling techniques.
IV. Strategies for crawling web pages
In a crawler system, the queue of URLs to be fetched is a very important component. The order in which URLs are arranged in the queue also matters, because it determines which pages are fetched first and which later.
The method that decides this order is called the fetching (crawling) strategy.
1. Breadth-first search:
During crawling, the next level is searched only after the search of the current level is completed.
Advantages: the algorithm is relatively simple to design and implement. Disadvantages: as the number of crawled pages grows, a large number of irrelevant pages are downloaded and filtered, and the efficiency of the algorithm drops.
2. Depth-first search:
Starting from the start page, the crawler selects one URL to enter, analyzes the URLs in that page, and follows one link at a time, moving on to the next route only after the current route has been fully processed.
For example, for the graph in the figure below, the depth-first traversal order is A → B → D → E → C → F (ABDECF), while the breadth-first traversal order is A → B → C → D → E → F (ABCDEF).
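A small sketch that reproduces the two traversal orders on a graph reconstructed from the example above (the adjacency list is an assumption consistent with the stated orders):

from collections import deque

# Adjacency list reconstructed from the example graph
graph = {'A': ['B', 'C'], 'B': ['D', 'E'], 'C': ['F'], 'D': [], 'E': [], 'F': []}

def bfs(start):
    # Breadth-first: visit the whole current level before going deeper
    order, queue, seen = [], deque([start]), {start}
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order

def dfs(start, seen=None):
    # Depth-first: follow one route to the end before processing the next
    if seen is None:
        seen = set()
    seen.add(start)
    order = [start]
    for nxt in graph[start]:
        if nxt not in seen:
            order.extend(dfs(nxt, seen))
    return order

print(''.join(bfs('A')))  # ABCDEF (breadth-first)
print(''.join(dfs('A')))  # ABDECF (depth-first)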
3. Best-first search:
According to a certain web page analysis method, predict the similarity between each candidate URL and the target page, or its relevance to the topic, and select the one or more URLs with the best evaluation to crawl.
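A minimal sketch of best-first selection using a priority queue, where a simple keyword score stands in for a real similarity model (the URLs and keywords below are illustrative assumptions):

import heapq

def score(url, topic_keywords):
    # Crude stand-in for a similarity/relevance prediction
    return sum(1 for kw in topic_keywords if kw in url.lower())

def best_first_order(candidate_urls, topic_keywords):
    # heapq is a min-heap, so push negative scores to pop the best-evaluated URL first
    heap = [(-score(u, topic_keywords), u) for u in candidate_urls]
    heapq.heapify(heap)
    while heap:
        _, url = heapq.heappop(heap)
        yield url

urls = ["https://example.com/python-spider",
        "https://example.com/news",
        "https://example.com/crawler-tips"]
print(list(best_first_order(urls, ["python", "crawler", "spider"])))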
4. Backlink count strategy:
The backlink count is the number of links pointing to a page from other pages; it indicates the degree to which the content of a page is recommended by others.
5. Partial PageRank strategy:
The Partial PageRank algorithm borrows the idea of PageRank: the pages already downloaded, together with the URLs in the queue to be crawled, form a page collection, and the PageRank value of each page is computed over this collection. After the computation, the URLs in the queue are ordered by their PageRank values, and pages are fetched in that order.
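A rough, illustrative sketch of computing PageRank over a small page collection and ordering queued URLs by the resulting values (the link graph, damping factor, and iteration count are assumptions for demonstration):

def pagerank(graph, damping=0.85, iterations=20):
    # graph maps each page to the list of pages it links to
    n = len(graph)
    rank = {page: 1.0 / n for page in graph}
    for _ in range(iterations):
        new_rank = {}
        for page in graph:
            # Sum the rank contributed by every page that links to this one
            incoming = sum(rank[src] / len(links)
                           for src, links in graph.items() if page in links)
            new_rank[page] = (1 - damping) / n + damping * incoming
        rank = new_rank
    return rank

# Downloaded pages A, B, C plus a queued URL D form the page collection
graph = {'A': ['B', 'C'], 'B': ['C'], 'C': ['A'], 'D': ['C']}
ranks = pagerank(graph)
# Fetch pages in descending order of PageRank value
for url in sorted(ranks, key=ranks.get, reverse=True):
    print(url, round(ranks[url], 3))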
V. Methods of web crawling
1. Distributed crawlers
To manage the massive number of URLs on today's Internet, a distributed crawler contains multiple crawler processes, each of which performs tasks similar to a single crawler: downloading pages from the Internet, saving them to local disk, extracting URLs from them, and continuing to crawl along those URLs. Because a parallel crawler needs to split the download task, a crawler may send the URLs it extracts to other crawlers.
These crawlers may be located in the same local area network or scattered across different geographical locations.
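A minimal sketch of one common way to split the download task, assigning each URL to a crawler node by hashing its host name (the node count and URLs are illustrative assumptions):

import hashlib
from urllib.parse import urlparse

NUM_CRAWLERS = 4  # assumed number of crawler nodes

def assign_crawler(url):
    # Hash the host name so all URLs of one site go to the same node;
    # URLs extracted for another node would be forwarded to it
    host = urlparse(url).netloc
    digest = hashlib.md5(host.encode('utf-8')).hexdigest()
    return int(digest, 16) % NUM_CRAWLERS

for u in ["https://www.douban.com/", "https://tieba.baidu.com/p/3205263090"]:
    print(u, "-> crawler", assign_crawler(u))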
Currently popular distributed crawlers include:
Apache Nutch: relies on Hadoop to run, and Hadoop itself consumes a lot of time. Nutch is a crawler designed for search engines; if you are not building a search engine, try not to choose Nutch.
2. Java crawlers
Small programs developed in Java to fetch network resources; commonly used tools include Crawler4j, WebMagic, WebCollector, and so on.
3. Non-Java crawlers
Scrapy: a lightweight, high-level screen-scraping framework written in Python. Its most attractive feature is that it is a framework any user can modify to their own needs, and it offers advanced functions that simplify the scraping process.
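As a rough illustration of what a Scrapy spider looks like, here is a minimal sketch that scrapes quotes from a public practice site (the site and CSS selectors are illustrative, and Scrapy must be installed separately):

import scrapy

class QuotesSpider(scrapy.Spider):
    # Run with: scrapy runspider quotes_spider.py -o quotes.json
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Extract each quote's text and author with CSS selectors
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow the pagination link, if present
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)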
VI. Project practice
1. Crawl a specified web page (the home page of a certain site)
We use the urllib module, which provides an interface for reading web page data and can read data over www and ftp just like a local file. urllib is a URL-handling package that collects several modules for working with URLs:
urllib.request: used to open and read URLs.
urllib.error: contains the exceptions raised by urllib.request, which can be caught and handled with try.
urllib.parse: contains methods for parsing URLs.
urllib.robotparser: used to parse robots.txt files. It provides a RobotFileParser class whose can_fetch() method tests whether a crawler is allowed to download a given page.
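For example, a minimal sketch of checking robots.txt with RobotFileParser before crawling (the target site and user agent are illustrative):

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.douban.com/robots.txt")
rp.read()

# can_fetch() tests whether the given user agent may download the page
if rp.can_fetch("*", "https://www.douban.com/"):
    print("Crawling this page is allowed by robots.txt")
else:
    print("robots.txt disallows crawling this page")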
The following code crawls the specified web page:
import urllib.request

url = "https://www.douban.com/"
# A browser User-Agent must be simulated to crawl the page
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.106 Safari/537.36'}
request = urllib.request.Request(url, headers=headers)
response = urllib.request.urlopen(request)
data = response.read()
# Transcoding is needed so that the content prints properly
print(str(data, 'utf-8'))
# The following lines print various information about the crawled page
print(type(response))
print(response.geturl())
print(response.info())
print(response.getcode())

2. Crawl web pages containing a keyword
The code is as follows:
import urllib.parse
import urllib.request

data = {'word': 'Sea Thief King'}
url_values = urllib.parse.urlencode(data)
url = "http://www.baidu.com/s?"
full_url = url + url_values
data = urllib.request.urlopen(full_url).read()
print(str(data, 'utf-8'))

3. Download the images in a Tieba post
The code is as follows:
import re
import urllib.request

# Get the web page source code
def getHtml(url):
    page = urllib.request.urlopen(url)
    html = page.read()
    return html

# Get all the images on the page
def getImg(html):
    reg = r'src="([.*\S]*\.jpg)" pic_ext="jpeg"'
    imgre = re.compile(reg)
    imglist = re.findall(imgre, html)
    return imglist

html = getHtml('https://tieba.baidu.com/p/3205263090')
html = html.decode('utf-8')
imgList = getImg(html)
imgName = 0
# Save the images in a loop
for imgPath in imgList:
    f = open(str(imgName) + ".jpg", 'wb')
    f.write(urllib.request.urlopen(imgPath).read())
    f.close()
    imgName += 1
    print('Downloading image %s' % imgName)
print('All images on this page have been downloaded')

4. Crawl stock data
The code is as follows:
import random
import re
import time
import urllib.request

# Crawl the content
user_agent = ["Mozilla/5.0 (Windows NT 10.0; WOW64)",
              'Mozilla/5.0 (Windows NT 6.3; WOW64)',
              'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
              'Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko',
              'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.95 Safari/537.36',
              'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; rv:11.0) like Gecko',
              'Mozilla/5.0 (Windows; U; Windows NT 5.2) Gecko/2008070208 Firefox/3.0.1',
              'Mozilla/5.0 (Windows; U; Windows NT 5.1) Gecko/20070309 Firefox/2.0.0.3',
              'Mozilla/5.0 (Windows; U; Windows NT 5.1) Gecko/20070803 Firefox/1.5.0.12',
              'Mozilla/5.0 (Macintosh; PPC Mac OS X; U; en) Opera 8.0',
              'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.12) Gecko/20080219 Firefox/2.0.0.12 Navigator/9.0.0.6',
              'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Win64; x64; Trident/4.0)',
              'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0)',
              'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.2; .NET 4.0C; .NET 4.0E)',
              'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Maxthon/4.0.6.2000 Chrome/26.0.1410.43 Safari/537.1',
              'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.2; .NET 4.0C; .NET 4.0E; QQBrowser/7.3.9825.400)',
              'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0',
              'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.92 Safari/537.1 LBBROWSER',
              'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0; BIDUBrowser 2.x)',
              'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/3.0 Safari/536.11']
stock_total = []
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.106 Safari/537.36'}
for page in range(1, 8):
    url = 'http://quote.stockstar.com/stock/ranklist_a_3_1_' + str(page) + '.html'
    request = urllib.request.Request(url=url, headers={"User-Agent": random.choice(user_agent)})
    response = urllib.request.urlopen(request)
    content = str(response.read(), 'gbk')
    pattern = re.compile('