Developing simple crawlers with Python
Source code URL: http://download.csdn.NET/detail/hanchaobiao/9860671
First, a brief introduction to crawlers and their technical value
1. What is a crawler:
A program that automatically grabs information from the Internet. Starting from one URL, it visits the URLs associated with that page and extracts the data we need. In other words, a crawler is a program that automatically accesses the Internet and extracts data.
2. The value of a crawler: it puts the data on the Internet to your own use, for example to build your own website or app.
Second, the architecture of a simple web crawler
Crawler scheduler: starts, runs, and stops the crawler, and monitors its operation.
The crawler itself has three modules. URL manager: manages the URLs still to be crawled and the URLs that have already been crawled.
Web page downloader: downloads the page corresponding to a URL provided by the URL manager, stores it as a string, and passes that string to the web page parser.
Web page parser: on the one hand it extracts the valuable data; on the other hand, since every page contains links to other pages, those URLs can be parsed out and added back to the URL manager.
Together these three parts form a simple crawler architecture that can, in principle, crawl any page reachable on the Internet.
Dynamic execution process: the scheduler asks the URL manager for a new URL, hands it to the downloader, passes the downloaded page to the parser, feeds the parsed links back into the URL manager, collects the parsed data, and repeats until no new URLs remain (or a limit is reached).
Third, the URL manager and its three implementations
The URL manager prevents repeated and circular fetching; in the worst case, two URLs pointing to each other would otherwise form an endless loop.
Three implementation options:
In-memory Python set: a set deduplicates automatically.
MySQL: a table with columns url (the path) and is_crawled (whether it has been visited).
Redis: the best performance; Redis also has a set type, which deduplicates. See an introduction to Redis if you are not familiar with it (a sketch follows below).
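For the Redis option, here is a minimal sketch of a set-backed URL manager; it assumes a local Redis server and the third-party redis-py package (pip install redis), and the key names new_urls and crawled_urls are made up for illustration.
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

def add_new_url(url):
    # a Redis set deduplicates automatically; skip urls that were already crawled
    if not r.sismember("crawled_urls", url):
        r.sadd("new_urls", url)

def get_new_url():
    url = r.spop("new_urls")          # take one url to crawl
    if url is not None:
        r.sadd("crawled_urls", url)   # mark it as crawled
    return url

add_new_url("http://baike.baidu.com/item/Python")
print(get_new_url())
Because both keys are Redis sets, deduplication and membership checks happen on the Redis side, so several crawler processes could share one manager.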
4. The web page downloader and the urllib module
This article uses urllib for the implementation.
urllib2 is a module that comes with Python and does not need to be installed separately.
In Python 3.x, urllib2 has become urllib.request.
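If the same script must also run under Python 2.x (an assumption, not something this article requires), a tiny compatibility import looks like this:
try:
    from urllib import request      # Python 3.x
except ImportError:
    import urllib2 as request       # Python 2.x fallback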
Three ways to download a page:
Method 1:
# import the module
from urllib import request

url = "http://www.baidu.com"

# the first way to download a web page
print("the first method:")
# request = urllib.urlopen(url)   # Python 2.x equivalent
response1 = request.urlopen(url)
print("status code:", response1.getcode())
# get the web page content
html = response1.read()
# decode using the page's encoding format
print(html.decode("utf8"))
# close response1
response1.close()
Method 2:
print("the second method:")
request2 = request.Request(url)
request2.add_header('user-agent', 'Mozilla/5.0')
response2 = request.urlopen(request2)
print("status code:", response2.getcode())
# get the web page content
html2 = response2.read()
# decode using the page's encoding format
print(html2.decode("utf8"))
# close response2
response2.close()
Method 3: use a cookie
# the third method uses a cookie jar to keep cookies across requests
import http.cookiejar

cookie = http.cookiejar.LWPCookieJar()
opener = request.build_opener(request.HTTPCookieProcessor(cookie))
request.install_opener(opener)
response3 = request.urlopen(url)
print(cookie)
html3 = response3.read()
# decode the content
print(html3.decode("utf8"))
response3.close()
5. The web page parser and the BeautifulSoup third-party module
Test whether bs4 is installed:
import bs4
print(bs4)
If the printed result shows the module and its path, the installation is successful.
Compared with other ways of parsing HTML, Beautiful Soup has an important advantage: the HTML is decomposed into objects, and the whole document can be navigated like nested dictionaries and lists.
Compared with parsing pages with regular expressions, it saves the high cost of writing and maintaining matching rules. This article uses Python 3.x; the parser used here (html.parser) comes with Python and needs no separate installation. A small illustration follows.
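A small illustration of that object model, assuming bs4 is installed (the markup here is made up for the example):
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p class='title'><b>Hello</b></p>", "html.parser")
print(soup.p["class"])      # attributes are accessed like a dictionary: ['title']
print(soup.p.b.get_text())  # nested tags are nested objects: Hello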
Use case: http://blog.csdn.net/watsy/article/details/14161201
Method introduction
Case test
The HTML uses the official Beautiful Soup example:
# import the module
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

# build the parse tree
soup = BeautifulSoup(html_doc, "html.parser")
Get all the links:
print("get all links")
links = soup.find_all('a')  # all <a> tags
for link in links:
    print(link.name, link['href'], link.get_text())
# get the link whose href is http://example.com/lacie
print("get the lacie link")
link1 = soup.find('a', href="http://example.com/lacie")
print(link1.name, link1['href'], link1.get_text())
print('regex match: href containing "ill"')
import re  # import the re package
link2 = soup.find('a', href=re.compile(r"ill"))
print(link2.name, link2['href'], link2.get_text())
# get the text of the p paragraph whose class is "title"
print("get the p paragraph text")
link3 = soup.find('p', class_="title")  # class is a Python keyword, so BeautifulSoup uses class_ instead
print(link3.name, link3.get_text())
6. A crawler development example (a directional crawler for Baidu Baike, the Baidu encyclopedia)
Entry point: http://baike.baidu.com/item/Python
URL format analysis: entry pages follow the pattern http://baike.baidu.com/item/{title}; restricting the crawl to this pattern avoids useless paths.
Data to grab: the title and the summary of the Python-related Baidu Baike entry pages.
By inspecting the page elements, the title element has class="lemmaWgt-lemmaTitle-title"
and the summary element has class="lemma-summary".
Page encoding: UTF-8
A directional crawler has to be maintained along with its target: if the crawl starts failing, Baidu Baike has probably changed its page structure, and the target has to be re-analyzed. The hedged check below helps verify the selectors.
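As a quick way to re-check the target, here is a hedged sketch that verifies the two selectors above against the live entry page; it assumes the classes listed in this article are still in use.
from urllib import request
from bs4 import BeautifulSoup

html = request.urlopen("http://baike.baidu.com/item/Python").read().decode("utf8")
soup = BeautifulSoup(html, "html.parser")

title_node = soup.find("dd", class_="lemmaWgt-lemmaTitle-title")
summary_node = soup.find("div", class_="lemma-summary")
print(title_node.get_text() if title_node else "title node not found: re-analyze the page")
print(summary_node.get_text() if summary_node else "summary node not found: re-analyze the page")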
The code, with comments:
Create spider_main.py
# the main crawler class
from imooc.baike_spider import url_manager, html_downloader, html_output, html_parser

class spiderMain:
    # constructor: initialize the four modules
    def __init__(self):
        # instantiate the objects that will be used
        self.urls = url_manager.UrlManager()
        self.downloader = html_downloader.HtmlDownLoader()
        self.output = html_output.HtmlOutPut()
        self.parser = html_parser.HtmlParser()

    def craw(self, root_url):
        # add the entry url
        self.urls.add_new_url(root_url)
        count = 1
        while self.urls.has_new_url():
            try:
                new_url = self.urls.get_new_url()
                print('craw %d : %s' % (count, new_url))
                # download the page
                html_context = self.downloader.downloade(new_url)
                new_urls, new_data = self.parser.parse(new_url, html_context)
                print(new_urls)
                self.urls.add_new_urls(new_urls)
                self.output.collect_data(new_data)
                # crawl at most 1000 pages
                if count == 1000:
                    break
                count += 1
            except:
                print("craw failed")
        self.output.output_html()

# entry point
if __name__ == "__main__":
    root_url = "http://baike.baidu.com/item/Python"
    obj_spider = spiderMain()
    obj_spider.craw(root_url)
Create url_manager.py
class UrlManager:
    'url management class'
    # constructor: initialize the two sets
    def __init__(self):
        self.new_urls = set()  # urls still to be crawled
        self.old_urls = set()  # urls already crawled

    # add a single new url to the manager
    def add_new_url(self, root_url):
        if root_url is None:
            return
        if root_url not in self.new_urls and root_url not in self.old_urls:
            # neither waiting to be crawled nor already crawled: a brand-new url, so add it to new_urls
            self.new_urls.add(root_url)

    # add a batch of new urls to the manager
    def add_new_urls(self, urls):
        if urls is None or len(urls) == 0:
            return
        for url in urls:
            self.add_new_url(url)  # reuse add_new_url()

    # is there still a url to crawl?
    def has_new_url(self):
        return len(self.new_urls) != 0

    # take one url to crawl and mark it as crawled
    def get_new_url(self):
        new_url = self.new_urls.pop()
        self.old_urls.add(new_url)
        return new_url
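A short usage sketch of UrlManager; it assumes url_manager.py can be imported directly (in this article it sits in the imooc.baike_spider package).
from url_manager import UrlManager

manager = UrlManager()
manager.add_new_url("http://baike.baidu.com/item/Python")
manager.add_new_url("http://baike.baidu.com/item/Python")  # duplicate, silently ignored
print(manager.has_new_url())   # True
print(manager.get_new_url())   # http://baike.baidu.com/item/Python, now moved to old_urls
print(manager.has_new_url())   # False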
Create html_downloader.py
from urllib import request
from urllib.parse import quote
import string

class HtmlDownLoader:
    'download page content'
    def downloade(self, new_url):
        if new_url is None:
            return None
        # quote the request path in case it contains Chinese or other special characters
        url_ = quote(new_url, safe=string.printable)
        response = request.urlopen(url_)
        if response.getcode() != 200:
            return None  # request failed
        html = response.read()
        return html.decode("utf8")
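A short usage sketch of HtmlDownLoader, showing why quote() matters when the path contains Chinese characters; the import path and the entry name are assumptions for illustration.
from html_downloader import HtmlDownLoader

downloader = HtmlDownLoader()
html = downloader.downloade("http://baike.baidu.com/item/自由软件")  # path contains Chinese characters
print(html[:200] if html else "download failed")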
Create html_parser.py
from bs4 import BeautifulSoup
import re
from urllib import parse

class HtmlParser:
    # page_url is the base url used for joining relative links
    def _get_new_urls(self, page_url, soup):
        new_urls = set()
        # match links such as /item/%E8%87%AA%E7%94%B1%E8%BD%AF%E4%BB%B6
        links = soup.find_all('a', href=re.compile(r'/item/\w+'))
        for link in links:
            new_url = link["href"]
            # for example, if page_url = http://baike.baidu.com/item/Python and new_url = /item/<some entry>?fr=navbar,
            # then parse.urljoin(page_url, new_url) gives new_full_url = http://baike.baidu.com/item/<some entry>?fr=navbar
            new_full_url = parse.urljoin(page_url, new_url)
            new_urls.add(new_full_url)
        return new_urls

    def _get_new_data(self, page_url, soup):
        # collect the data of one entry page (e.g. the Python entry)
        red_data = {}
        red_data['url'] = page_url
        # title: <dd class="lemmaWgt-lemmaTitle-title"><h1>...</h1></dd>
        title_node = soup.find('dd', class_="lemmaWgt-lemmaTitle-title").find('h1')  # get the title text
        red_data['title'] = title_node.get_text()
        # summary: <div class="lemma-summary">...</div>
        summary_node = soup.find('div', class_="lemma-summary")
        red_data['summary'] = summary_node.get_text()
        return red_data

    # page_url is the page path, html_context is the page content
    def parse(self, page_url, html_context):
        if page_url is None or html_context is None:
            return
        # Python 3 strings are already unicode, so from_encoding="utf-8" would be ignored; it has been removed here.
        soup = BeautifulSoup(html_context, "html.parser")
        new_urls = self._get_new_urls(page_url, soup)
        new_data = self._get_new_data(page_url, soup)
        return new_urls, new_data
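A hedged end-to-end check of the downloader plus parser; the import paths are assumptions, and the output depends on the current structure of the entry page.
from html_downloader import HtmlDownLoader
from html_parser import HtmlParser

url = "http://baike.baidu.com/item/Python"
html = HtmlDownLoader().downloade(url)
new_urls, data = HtmlParser().parse(url, html)
print(len(new_urls), "links found")
print(data['title'])
print(data['summary'])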
Create html_output.py
class HtmlOutPut:
    def __init__(self):
        self.datas = []  # the collected data

    def collect_data(self, new_data):
        if new_data is None:
            return
        self.datas.append(new_data)

    def output_html(self):
        fout = open('output.html', 'w', encoding='utf8')  # set the encoding to avoid garbled Chinese in the file
        fout.write('<html>\n')
        fout.write('<body>\n')
        fout.write('<table>\n')
        for data in self.datas:
            fout.write('<tr>\n')
            fout.write('<td>%s</td>\n' % data['url'])
            fout.write('<td>%s</td>\n' % data['title'])
            fout.write('<td>%s</td>\n' % data['summary'])
            fout.write('</tr>\n')
        fout.write('</table>\n')
        fout.write('</body>\n')
        fout.write('</html>\n')
        fout.close()
Video website: http://www.imooc.com/learn/563
Source code URL: http://download.csdn.Net/detail/hanchaobiao/9860671