This article introduces Python web crawlers and information extraction through example analysis. It is meant as a practical reference; interested readers can follow along, and I hope you learn a lot from it.
1. To learn Python web crawling and information extraction, you need to know a programming language (preferably Python) and some basics of how the web works.
2. This explanation of Python web crawlers is built on:
A. The Requests library: automatically fetches HTML pages and submits requests to the web
B. robots.txt: the exclusion standard for web crawlers
C. The BeautifulSoup library: parses HTML pages
D. The re module: regular expressions for extracting key information from pages
E. The Scrapy framework: an introduction to web crawler principles and a professional crawler framework
3. Basic use of Requests
Note: The Requests library is widely regarded as the best Python third-party library for fetching web pages; it is simple and easy to use.
Official website: http://www.python-requests.org
Find "cmd.exe" in "C:\ Windows\ System32", run it as an administrator, and type: "pip install requests" on the command line.
Use IDLE to test the Requests library:
>>> import requests
>>> r = requests.get("http://www.baidu.com")  # for example, fetch the Baidu home page
>>> r.status_code
>>> r.encoding = 'utf-8'
>>> r.text
The seven main methods of the Requests library:
requests.request(): constructs a request; the base method that underlies all of the methods below
requests.get(): gets an HTML page; corresponds to HTTP GET
requests.head(): gets the header information of an HTML page; corresponds to HTTP HEAD
requests.post(): submits a POST request to an HTML page; corresponds to HTTP POST
requests.put(): submits a PUT request to an HTML page; corresponds to HTTP PUT
requests.patch(): submits a partial modification request to an HTML page; corresponds to HTTP PATCH
requests.delete(): submits a delete request to an HTML page; corresponds to HTTP DELETE
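As a quick illustration of two of these methods (this sketch is mine, not from the original article; the httpbin.org URLs and the payload are only example values), the following exercises head() and post():

import requests

# HEAD: fetch only the response headers, without downloading the body
r = requests.head("https://httpbin.org/get")
print(r.headers.get("Content-Type"))

# POST: submit form data; httpbin.org simply echoes it back
payload = {"key": "value"}
r = requests.post("https://httpbin.org/post", data=payload)
print(r.status_code)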
The get() method in detail:
r = requests.get(url)
The get() method constructs a Request object that asks the server for a resource and returns a Response object containing the server's resource.
requests.get(url, params=None, **kwargs)
url: the URL of the page to fetch
params: extra parameters appended to the url, in dictionary or byte-stream format; optional
**kwargs: 12 optional keyword arguments that control access
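For instance, here is a minimal sketch (the test URL, the query parameter, and the timeout value are illustrative choices of mine, not prescribed by the article) showing params together with one of the keyword control arguments:

import requests

# params is encoded into the query string; timeout is one of the 12 keyword control arguments
r = requests.get("https://httpbin.org/get",
                 params={"q": "python"},
                 timeout=10)
print(r.url)          # the final URL, including ?q=python
print(r.status_code)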
Two important objects of the Requests library
Request
Response: the Response object contains the content returned by the crawl
Properties of the Response object
1. r.status_code: the return status of the HTTP request; 200 means the connection succeeded, while codes such as 404 indicate failure.
2. r.text: the string form of the HTTP response content, i.e. the page content corresponding to the url.
3. r.encoding: the content encoding guessed from the HTTP header.
4. r.apparent_encoding: the content encoding inferred from the content itself (an alternative encoding).
5. r.content: the binary form of the HTTP response content.
Encoding of the Response:
6. r.encoding: the encoding of the response content guessed from the HTTP header; if the header contains no charset, the encoding is assumed to be ISO-8859-1. r.text displays the page content according to r.encoding.
7. r.apparent_encoding: the encoding inferred from the page content; it can be regarded as a fallback for r.encoding.
8. r.raise_for_status(): raises a requests.HTTPError exception if the status code is not 200.
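Putting these pieces together, here is a small sketch of a general page-fetching routine that uses raise_for_status() and apparent_encoding; the function name getHTMLText, the timeout, and the test URL are my own choices rather than something the article specifies:

import requests

def getHTMLText(url):
    """Fetch a page and return its text, or an error message on failure."""
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()              # raises requests.HTTPError if the status is not 200
        r.encoding = r.apparent_encoding  # fall back to the encoding inferred from the content
        return r.text
    except requests.RequestException:
        return "an exception occurred"

print(getHTMLText("http://www.baidu.com")[:500])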
4. The Robots protocol
Robots Exclusion Standard: the standard for excluding web crawlers
Function: the website tells web crawlers which pages may be crawled and which may not.
Format: a robots.txt file in the root directory of the website.
For example:
# Note: * means all crawlers, / means the root directory
User-agent: *
Disallow: /?*
Disallow: /pop/*.html
Disallow: /pinpai/*.html?*
User-agent: EtaoSpider
Disallow: /
User-agent: HuihuiSpider
Disallow: /
User-agent: GwdangSpider
Disallow: /
User-agent: WochachaSpider
Disallow: /
Note: the Robots protocol is advisory rather than legally binding; a web crawler may choose not to comply with it, but doing so carries legal risk.
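As an illustration of how a crawler can check these rules programmatically, the sketch below uses RobotFileParser from Python's standard urllib.robotparser module; the robots.txt URL and the test paths are example values of mine, and the commented results assume the live file still contains rules like those quoted above:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://www.jd.com/robots.txt")  # example robots.txt location
rp.read()

# Ask whether a given crawler may fetch a given path
print(rp.can_fetch("EtaoSpider", "https://www.jd.com/anything.html"))  # False if EtaoSpider is fully disallowed
print(rp.can_fetch("*", "https://www.jd.com/index.html"))              # True if no rule blocks this path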
Example:
import requests

url = "https://item.jd.com/5145492.html"
try:
    r = requests.get(url)
    r.raise_for_status()               # raise requests.HTTPError if the status is not 200
    r.encoding = r.apparent_encoding   # use the encoding inferred from the page content
    print(r.text[:1000])
except:
    print("crawl failure")

Thank you for reading this article carefully. I hope "Example Analysis of Python Web Crawlers and Information Extraction" proves helpful to you.