Developing simple crawlers with Python


Source code URL: http://download.csdn.NET/detail/hanchaobiao/9860671

1. A brief introduction to crawlers and their value. What is a crawler:

A crawler is a program that automatically grabs information from the Internet: starting from one URL, it visits the URLs linked from it and extracts the data we need. In other words, a crawler is a program that automatically accesses the Internet and extracts data.

The value of a crawler: it puts the data on the Internet to your own use, for example to build your own website or app.

2. A simple web crawler architecture

Crawler scheduler: starts, runs, and stops the crawler, and monitors its operation.

The crawler itself consists of three modules. URL manager: manages the URLs waiting to be crawled and the URLs that have already been crawled.

Web page downloader: downloads the page for a URL provided by the URL manager and stores it as a string, which is then passed to the web page parser.

Web page parser: on the one hand it extracts the valuable data; on the other hand, since every page links to many other pages, those URLs can be parsed out and added to the URL manager.

Together, these modules form a simple crawler architecture that can, in principle, crawl any reachable page on the Internet.

Dynamic execution process
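The loop below is a condensed sketch of that execution process. It mirrors the craw() method of spider_main.py in section 6 and assumes the four components defined there are importable; it is an illustration, not additional code from the original project.

from imooc.baike_spider import url_manager, html_downloader, html_output, html_parser

urls = url_manager.UrlManager()
downloader = html_downloader.HtmlDownLoader()
parser = html_parser.HtmlParser()
output = html_output.HtmlOutPut()

urls.add_new_url("http://baike.baidu.com/item/Python")    # the scheduler seeds the URL manager
while urls.has_new_url():                                  # 1. any URL left to crawl?
    new_url = urls.get_new_url()                           # 2. take one URL
    html = downloader.downloade(new_url)                   # 3. download its content
    new_urls, data = parser.parse(new_url, html)           # 4. parse out data and follow-up URLs
    urls.add_new_urls(new_urls)                            # 5. feed the new URLs back to the manager
    output.collect_data(data)                              # 6. collect the valuable data
output.output_html()                                       # export everything that was collected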

3. The URL manager and its three implementations

The URL manager prevents repeated and circular fetching. In the worst case, two URLs pointing at each other would otherwise trap the crawler in an endless loop.

Three ways of implementation

An in-memory Python set: the set type deduplicates entries automatically.

MySQL: a table with two columns, url (the access path) and is_crawled (whether it has been crawled yet).

Redis: Redis has the best performance, and it also provides a set type that supports deduplication (a minimal sketch follows below). If you are not familiar with it, have a look at an introduction to Redis.
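As an illustration of the Redis approach, here is a minimal sketch assuming a local Redis server and the redis-py client; the key names new_urls and old_urls are illustrative, not part of the original project.

import redis

r = redis.Redis(host='localhost', port=6379, db=0)

def add_new_url(url):
    # only queue the url if it has not been crawled yet
    if not r.sismember('old_urls', url):
        r.sadd('new_urls', url)   # SADD ignores members that already exist, so duplicates are dropped

def get_new_url():
    url = r.spop('new_urls')      # take one url to crawl
    if url is not None:
        r.sadd('old_urls', url)   # mark it as crawled
    return url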

4. The web page downloader and the urllib module

This article uses urllib for the implementation.

urllib2 is a module that ships with Python and does not need to be installed separately.

In Python 3.x, urllib2 became urllib.request.
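If the same script has to run on both versions, a small compatibility shim such as the following sketch (an assumption, not part of the original code) keeps the rest of the examples unchanged:

try:
    from urllib import request   # Python 3.x
except ImportError:
    import urllib2 as request    # Python 2.x
# either way, request.urlopen(...) and request.Request(...) are available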

Three ways of implementation

Method 1:


# import the module
from urllib import request

url = "http://www.baidu.com"

# the first way to download a web page
print("the first method:")
# response1 = urllib.urlopen(url)   # Python 2.x
response1 = request.urlopen(url)
print("status code:", response1.getcode())
# get the web page content
html = response1.read()
# decode with the page's encoding
print(html.decode("utf8"))
# close response1
response1.close()

Method 2:


Print ("second:")

Request2 = request.Request (url)

Request2.add_header ('user-agent','Mozilla/5.0')

Response2 = request.urlopen (request2)

Print ("status code:" response2.getcode ())

# get web content

Htm2 = response2.read ()

# adjust the format

Print (htm2.decode ("utf8"))

# turn off response1

Response2.close ()

Method 3: use cookies


# the third method: use a cookie jar
import http.cookiejar

cookie = http.cookiejar.LWPCookieJar()
opener = request.build_opener(request.HTTPCookieProcessor(cookie))
request.install_opener(opener)
response3 = request.urlopen(url)
print(cookie)
html3 = response3.read()
# decode the content
print(html3.decode("utf8"))
response3.close()

5. The web page parser and the BeautifulSoup third-party module

Test if bs4 is installed


import bs4
print(bs4)

If the module's path is printed, the installation is successful.
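If the import raises an error instead, bs4 can be installed with pip: pip install beautifulsoup4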

Compared with other HTML parsing approaches, Beautiful Soup has an important advantage: it parses the HTML into objects that can be navigated like dictionaries and lists.

Compared with parsing by regular expressions, this saves the high cost of learning and maintaining matching rules. This article uses Python 3.x; on the author's system no additional installation was needed.

Use case: http://blog.csdn.net/watsy/article/details/14161201

Method introduction

Case test

The HTML is the official example from the Beautiful Soup documentation.


# import the module
from bs4 import BeautifulSoup

# the official example document from the Beautiful Soup documentation
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

Get all the links


Print ("get all links")

Links = soup.find_all ('a') # a label

For link in links:

Print (link.name,link ['href'], link.get_text ())


# get the link whose href is http://example.com/lacie
print("get the lacie link")
link1 = soup.find('a', href="http://example.com/lacie")
print(link1.name, link1['href'], link1.get_text())


Print ("regular match href with" ill ")

Import re # Import re package

Link2 = soup.find ('axiomagery hrefring.compile (r "ill"))

Print (link2.name,link2 ['href'], link2.get_text ())


Print ("get p paragraph text")

Link3 = soup.find ("title") # class is the keyword that needs to be added _

Print (link3.name,link3.get_text ())

6. A crawler development example (target: Baidu Encyclopedia)

Entry point: http://baike.baidu.com/item/Python

URL format analysis: to avoid crawling useless paths, only URLs of the form http://baike.baidu.com/item/{title} are followed.

Data: grab the title and introduction from the Python-related entry pages of Baidu Encyclopedia.

Inspecting the elements shows that the title element has class="lemmaWgt-lemmaTitle-title"

and the introduction element has class="lemma-summary".

Page encoding: UTF-8

As a targeted crawler, the code has to be adjusted to the content it crawls. If it stops working, Baidu Encyclopedia may have changed its page structure, and the target pages then need to be re-analyzed.

The code, with comments:

Create spider_main.py


# main crawler class
from imooc.baike_spider import url_manager, html_downloader, html_output, html_parser

class SpiderMain:
    # constructor: initialize the components
    def __init__(self):
        # instantiate the objects to be used
        self.urls = url_manager.UrlManager()
        self.downloader = html_downloader.HtmlDownLoader()
        self.output = html_output.HtmlOutPut()
        self.parser = html_parser.HtmlParser()

    def craw(self, root_url):
        # add the root url to the manager
        self.urls.add_new_url(root_url)
        count = 1
        while self.urls.has_new_url():
            try:
                new_url = self.urls.get_new_url()
                print('craw %d : %s' % (count, new_url))
                # download the page
                html_context = self.downloader.downloade(new_url)
                new_urls, new_data = self.parser.parse(new_url, html_context)
                print(new_urls)
                self.urls.add_new_urls(new_urls)
                self.output.collect_data(new_data)
                # crawl at most one thousand pages
                if count == 1000:
                    break
                count += 1
            except Exception:
                print("craw failed")
        self.output.output_html()

# main entry point
if __name__ == "__main__":
    root_url = "http://baike.baidu.com/item/Python"
    obj_spider = SpiderMain()
    obj_spider.craw(root_url)

Create url_manager.py


class UrlManager:
    'URL management class'

    # constructor: initialize the two sets
    def __init__(self):
        self.new_urls = set()   # urls to be crawled
        self.old_urls = set()   # urls already crawled

    # add a single new url to the manager
    def add_new_url(self, root_url):
        if root_url is None:
            return
        if root_url not in self.new_urls and root_url not in self.old_urls:
            # neither waiting to be crawled nor already crawled, so it is a brand-new url: add it to new_urls
            self.new_urls.add(root_url)

    # add a batch of new urls to the manager
    def add_new_urls(self, urls):
        if urls is None or len(urls) == 0:
            return
        for url in urls:
            self.add_new_url(url)   # reuse add_new_url()

    # check whether there is still a url to crawl
    def has_new_url(self):
        return len(self.new_urls) != 0

    # get one url to be crawled
    def get_new_url(self):
        new_url = self.new_urls.pop()
        self.old_urls.add(new_url)
        return new_url

Create html_downloader.py

from urllib import request
from urllib.parse import quote
import string

class HtmlDownLoader:
    'download page content'

    def downloade(self, new_url):
        if new_url is None:
            return None
        # quote Chinese or other special characters in the request path
        url_ = quote(new_url, safe=string.printable)
        response = request.urlopen(url_)
        if response.getcode() != 200:
            return None   # request failed
        html = response.read()
        return html.decode("utf8")

Create html_parser.py


from bs4 import BeautifulSoup
import re
from urllib import parse

class HtmlParser:
    # page_url is the base url used to join relative links
    def _get_new_urls(self, page_url, soup):
        new_urls = set()
        # match entry links such as /item/%E8%87%AA%E7%94%B1%E8%BD%AF%E4%BB%B6
        links = soup.find_all('a', href=re.compile(r'/item/\w+'))
        for link in links:
            new_url = link["href"]
            # e.g. page_url = http://baike.baidu.com/item/Python and new_url = /item/xxx?fr=navbar
            # parse.urljoin(page_url, new_url) then yields http://baike.baidu.com/item/xxx?fr=navbar
            new_full_url = parse.urljoin(page_url, new_url)
            new_urls.add(new_full_url)
        return new_urls

    def _get_new_data(self, page_url, soup):
        red_data = {}
        red_data['url'] = page_url
        # get the entry title
        title_node = soup.find('dd', class_="lemmaWgt-lemmaTitle-title").find('h2')
        red_data['title'] = title_node.get_text()
        # get the entry summary
        summary_node = soup.find('div', class_="lemma-summary")
        red_data['summary'] = summary_node.get_text()
        return red_data

    # page_url: the page's url; html_context: the page content
    def parse(self, page_url, html_context):
        if page_url is None or html_context is None:
            return
        # Python 3 strings are already unicode, so from_encoding="utf-8" would be ignored and is omitted here
        soup = BeautifulSoup(html_context, "html.parser")
        new_urls = self._get_new_urls(page_url, soup)
        new_data = self._get_new_data(page_url, soup)
        return new_urls, new_data

Create html_output.py


class HtmlOutPut:
    def __init__(self):
        self.datas = []   # store the collected data

    def collect_data(self, new_data):
        if new_data is None:
            return
        self.datas.append(new_data)

    def output_html(self):
        # specify the encoding when writing the file to avoid garbled Chinese characters
        fout = open('output.html', 'w', encoding='utf8')
        fout.write('<html>\n')
        fout.write('<body>\n')
        fout.write('<table>\n')
        for data in self.datas:
            fout.write('<tr>\n')
            fout.write('<td>%s</td>\n' % data['url'])
            fout.write('<td>%s</td>\n' % data['title'])
            fout.write('<td>%s</td>\n' % data['summary'])
            fout.write('</tr>\n')
        fout.write('</table>\n')
        fout.write('</body>\n')
        fout.write('</html>\n')
        fout.close()

Video website: http://www.imooc.com/learn/563

Source code URL: http://download.csdn.Net/detail/hanchaobiao/9860671
