This article introduces Python web crawlers and information extraction through example analysis. It is meant as a practical reference; interested readers can follow along, and I hope you learn a lot from it.
1. To learn Python web crawling and information extraction, you need to know a programming language (preferably Python) and some basics of how the web works.
2. This explanation of Python web crawlers is built on:
A. The Requests library: automatically fetches HTML pages and submits requests to the web
B. robots.txt: the exclusion standard for web crawlers
C. The BeautifulSoup library: parses HTML pages
D. The re module: regular expressions for extracting key information from pages
E. The Scrapy framework: an introduction to web crawler principles and a professional crawler framework
3. Basic use of Requests
Note: The Requests library is widely regarded as the best Python third-party library for fetching web pages; it is simple and easy to use.
Official website: http://www.python-requests.org
Find "cmd.exe" in "C:\ Windows\ System32", run it as an administrator, and type: "pip install requests" on the command line.
Use IDLE to test the Requests library:
>>> import requests
>>> r = requests.get("http://www.baidu.com")  # for example, fetch the Baidu home page
>>> r.status_code
>>> r.encoding = 'utf-8'
>>> r.text
The seven main methods of the Requests library:
requests.request(): constructs a request; the base method that underlies all of the methods below
requests.get(): gets an HTML page; corresponds to HTTP GET
requests.head(): gets the header information of an HTML page; corresponds to HTTP HEAD
requests.post(): submits a POST request to an HTML page; corresponds to HTTP POST
requests.put(): submits a PUT request to an HTML page; corresponds to HTTP PUT
requests.patch(): submits a partial modification request to an HTML page; corresponds to HTTP PATCH
requests.delete(): submits a delete request to an HTML page; corresponds to HTTP DELETE
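As a quick illustration of two of these methods (this sketch is mine, not from the original article; the httpbin.org URLs and the payload are only example values), the following exercises head() and post():

import requests

# HEAD: fetch only the response headers, without downloading the body
r = requests.head("https://httpbin.org/get")
print(r.headers.get("Content-Type"))

# POST: submit form data; httpbin.org simply echoes it back
payload = {"key": "value"}
r = requests.post("https://httpbin.org/post", data=payload)
print(r.status_code)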
The get() method in detail:
r = requests.get(url)
The get() method constructs a Request object that asks the server for a resource and returns a Response object containing the server's resource.
requests.get(url, params=None, **kwargs)
url: the URL of the page to fetch
params: extra parameters appended to the url, in dictionary or byte-stream format; optional
**kwargs: 12 optional keyword arguments that control access
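For instance, here is a minimal sketch (the test URL, the query parameter, and the timeout value are illustrative choices of mine, not prescribed by the article) showing params together with one of the keyword control arguments:

import requests

# params is encoded into the query string; timeout is one of the 12 keyword control arguments
r = requests.get("https://httpbin.org/get",
                 params={"q": "python"},
                 timeout=10)
print(r.url)          # the final URL, including ?q=python
print(r.status_code)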
Two important objects of the Requests library
Request
Response: the Response object contains the content returned by the crawl
Properties of the Response object
1. r.status_code: the return status of the HTTP request; 200 means the connection succeeded, while codes such as 404 indicate failure.
2. r.text: the string form of the HTTP response content, i.e. the page content corresponding to the url.
3. r.encoding: the content encoding guessed from the HTTP header.
4. r.apparent_encoding: the content encoding inferred from the content itself (an alternative encoding).
5. r.content: the binary form of the HTTP response content.
Encoding of the Response:
6. r.encoding: the encoding of the response content guessed from the HTTP header; if the header contains no charset, the encoding is assumed to be ISO-8859-1. r.text displays the page content according to r.encoding.
7. r.apparent_encoding: the encoding inferred from the page content; it can be regarded as a fallback for r.encoding.
8. r.raise_for_status(): raises a requests.HTTPError exception if the status code is not 200.
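Putting these pieces together, here is a small sketch of a general page-fetching routine that uses raise_for_status() and apparent_encoding; the function name getHTMLText, the timeout, and the test URL are my own choices rather than something the article specifies:

import requests

def getHTMLText(url):
    """Fetch a page and return its text, or an error message on failure."""
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()              # raises requests.HTTPError if the status is not 200
        r.encoding = r.apparent_encoding  # fall back to the encoding inferred from the content
        return r.text
    except requests.RequestException:
        return "an exception occurred"

print(getHTMLText("http://www.baidu.com")[:500])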
4. The Robots protocol
Robots Exclusion Standard: the standard for excluding web crawlers
Function: the website tells web crawlers which pages may be crawled and which may not.
Format: a robots.txt file in the root directory of the website.
For example:
# Note: * means all crawlers, / means the root directory
User-agent: *
Disallow: /?*
Disallow: /pop/*.html
Disallow: /pinpai/*.html?*
User-agent: EtaoSpider
Disallow: /
User-agent: HuihuiSpider
Disallow: /
User-agent: GwdangSpider
Disallow: /
User-agent: WochachaSpider
Disallow: /
Note: the Robots protocol is advisory rather than legally binding; a web crawler may choose not to comply with it, but doing so carries legal risk.
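As an illustration of how a crawler can check these rules programmatically, the sketch below uses RobotFileParser from Python's standard urllib.robotparser module; the robots.txt URL and the test paths are example values of mine, and the commented results assume the live file still contains rules like those quoted above:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://www.jd.com/robots.txt")  # example robots.txt location
rp.read()

# Ask whether a given crawler may fetch a given path
print(rp.can_fetch("EtaoSpider", "https://www.jd.com/anything.html"))  # False if EtaoSpider is fully disallowed
print(rp.can_fetch("*", "https://www.jd.com/index.html"))              # True if no rule blocks this path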
Example:
import requests

url = "https://item.jd.com/5145492.html"
try:
    r = requests.get(url)
    r.raise_for_status()               # raise requests.HTTPError if the status is not 200
    r.encoding = r.apparent_encoding   # use the encoding inferred from the page content
    print(r.text[:1000])
except:
    print("crawl failure")

Thank you for reading this article carefully. I hope "Example Analysis of Python Web Crawlers and Information Extraction" proves helpful to you.