How can Python be used to write the simplest web crawler? This article offers a detailed analysis and solution, in the hope of helping readers who want to solve this problem find a simple and easy approach.
Recently I have developed a strong interest in Python crawlers, and I share my learning path here.
1. Development tools
The tool I use is Sublime Text 3, and I am fascinated by how small yet powerful it is. I recommend it to everyone; of course, if your computer's configuration is good, PyCharm may suit you better.
To set up a Python development environment in Sublime Text 3, this blog post is recommended: http://www.cnblogs.com/codefish/p/4806849.html
2. Introduction to crawlers
As the name implies, a crawler is like a worm crawling across the vast Internet, and along the way we can pick up the content we want.
Since we want to crawl the Internet, we need to know about the URL, formally the "Uniform Resource Locator" and informally the "link". Its structure consists mainly of three parts:
(1) Protocol: for example, the HTTP protocol that is common in our web addresses.
(2) Domain name or IP address: a domain name such as www.baidu.com; the IP address is what the domain name resolves to.
(3) Path: that is, a directory or file, etc.
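To make these three parts concrete, here is a minimal sketch that uses urllib.parse (one of the modules introduced below) to split an example URL into its components:

from urllib import parse

# split an example URL into its components
result = parse.urlparse("http://www.baidu.com/index.html")
print(result.scheme)   # protocol: http
print(result.netloc)   # domain name: www.baidu.com
print(result.path)     # path: /index.html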
3. Developing the simplest crawler with urllib
(1) Introduction to urllib
Module              Introduction
urllib.error        Exception classes raised by urllib.request.
urllib.parse        Parse URLs into or assemble them from components.
urllib.request      Extensible library for opening URLs.
urllib.response     Response classes used by urllib.
urllib.robotparser  Load a robots.txt file and answer questions about fetchability of other URLs.
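As an aside, the urllib.robotparser module listed above lets a crawler check whether a site permits it to fetch a given page. A minimal sketch, assuming Baidu's robots.txt is reachable:

from urllib import robotparser

# load Baidu's robots.txt and ask whether a generic crawler may fetch the home page
rp = robotparser.RobotFileParser()
rp.set_url("http://www.baidu.com/robots.txt")
rp.read()
print(rp.can_fetch("*", "http://www.baidu.com/"))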
(2) Developing the simplest crawler
Baidu's home page is simple and clean, which makes it very suitable for our crawler.
The crawler code is as follows:
from urllib import request

def visit_baidu():
    URL = "http://www.baidu.com"
    # open the URL
    req = request.urlopen(URL)
    # read the response
    html = req.read()
    # decode the response as utf-8
    html = html.decode("utf_8")
    print(html)

if __name__ == '__main__':
    visit_baidu()
Running it prints the HTML of Baidu's home page. You can compare this output with the page source, which you can view by right-clicking a blank area of Baidu's home page and choosing to inspect the elements.
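To make that comparison easier, here is a minimal sketch that saves the fetched HTML to a local file (the file name baidu.html is just an example):

from urllib import request

# fetch the page and save it so it can be compared with the browser's page source
html = request.urlopen("http://www.baidu.com").read().decode("utf-8")
with open("baidu.html", "w", encoding="utf-8") as f:
    f.write(html)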
Of course, the request module can also create a Request object, which can then be opened with the urlopen method.
The code is as follows:
from urllib import request

def visits_baidu():
    # create a Request object
    req = request.Request('http://www.baidu.com')
    # open the Request object
    response = request.urlopen(req)
    # read the response
    html = response.read()
    html = html.decode('utf-8')
    print(html)

if __name__ == '__main__':
    visits_baidu()
The running result is the same as before.
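A Request object also accepts extra arguments that plain urlopen does not, such as request headers. A minimal sketch, where the User-Agent string is just an example:

from urllib import request

# assumption: a browser-like User-Agent string; any valid one works
headers = {"User-Agent": "Mozilla/5.0"}
req = request.Request("http://www.baidu.com", headers=headers)
html = request.urlopen(req).read().decode("utf-8")
print(html[:200])  # print only the first 200 characters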
(3) Error handling
Error handling relies on the urllib.error module, which mainly provides the URLError and HTTPError exceptions. HTTPError is a subclass of URLError, which means an HTTPError can also be caught as a URLError.
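You can verify this subclass relationship directly:

from urllib import error

# HTTPError is a subclass of URLError
print(issubclass(error.HTTPError, error.URLError))  # prints True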
When an HTTPError is caught, its code property gives the HTTP status code.
The code to handle HTTPError is as follows:
from urllib import request
from urllib import error

def Err():
    url = "https://segmentfault.com/zzz"
    req = request.Request(url)
    try:
        response = request.urlopen(req)
        html = response.read().decode("utf-8")
        print(html)
    except error.HTTPError as e:
        print(e.code)

if __name__ == '__main__':
    Err()
The running result prints 404, the HTTP error code; you can look up the details of this status code yourself.
When a URLError is caught, its reason property explains the failure.
The code to handle URLError is as follows:
from urllib import request
from urllib import error

def Err():
    url = "https://segmentf.com/"
    req = request.Request(url)
    try:
        response = request.urlopen(req)
        html = response.read().decode("utf-8")
        print(html)
    except error.URLError as e:
        print(e.reason)

if __name__ == '__main__':
    Err()
The running result prints the reason for the failure.
Since we are handling errors anyway, it is best to write both errors into the code; after all, the more detailed, the clearer. Note that HTTPError is a subclass of URLError, so be sure to put HTTPError before URLError; otherwise the URLError branch will catch the HTTPError and print its reason, such as Not Found.
The code is as follows:
from urllib import request
from urllib import error

# handle both URLError and HTTPError
def Err():
    url = "https://segmentfault.com/zzz"
    req = request.Request(url)
    try:
        response = request.urlopen(req)
        html = response.read().decode("utf-8")
        print(html)
    except error.HTTPError as e:
        print(e.code)
    except error.URLError as e:
        print(e.reason)

if __name__ == '__main__':
    Err()
You can change the url to see how the various errors are reported.
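For example, here is a minimal sketch that runs both of the URLs used above through the combined handler (the URLs are just the examples from this article):

from urllib import request
from urllib import error

# try both example URLs: the first returns 404, the second has an unreachable domain
for url in ("https://segmentfault.com/zzz", "https://segmentf.com/"):
    try:
        request.urlopen(url)
    except error.HTTPError as e:
        print(url, "-> HTTP error code:", e.code)
    except error.URLError as e:
        print(url, "-> URL error reason:", e.reason)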
This concludes the answer to the question of how to write the simplest web crawler in Python. I hope the content above has been of some help to you.