
What is the starting point of a Python crawler?


This article shows you the starting point of a Python crawler. The content is concise and easy to understand, and I hope you take something useful away from the detailed introduction below.

I. What does a crawler have to do with HTTP?

What we usually call crawlers (also known as web crawlers or spiders) are essentially network requests made over some network protocol, and the most commonly used protocol is the HTTP/S family.

What network libraries does Python have?

When browsing the web normally, we click with the mouse and the browser makes the network requests for us. So how do we make a network request in Python? The answer, of course, is libraries. Which ones? Let Brother Pig give you a list:

Python 2: httplib, httplib2, urllib, urllib2, urllib3, requests

Python 3: httplib2, urllib, urllib3, requests

There are quite a few Python network request libraries, and all of them can still be found in use on the Internet. So how are they related, and how should we choose?

httplib/httplib2:

httplib is Python's built-in HTTP library, but it is fairly low-level and is generally not used directly.

httplib2 is a third-party library built on top of httplib; its implementation is more complete and it supports features such as caching and compression.

Generally speaking, you will not need either of these two libraries; they only become relevant if you want to wrap network requests yourself.
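To see why these libraries feel low-level, here is a minimal sketch using Python 3's http.client (the renamed httplib), with httpbin.org as a neutral test endpoint; note how you manage the connection and decode the raw bytes yourself:

import http.client

# Open the connection and issue a bare GET request by hand
conn = http.client.HTTPSConnection("httpbin.org", timeout=10)
conn.request("GET", "/get", headers={"User-Agent": "demo"})
resp = conn.getresponse()

print(resp.status, resp.reason)       # e.g. 200 OK
body = resp.read().decode("utf-8")    # you decode the bytes yourself
print(body[:200])
conn.close()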

urllib/urllib2/urllib3:

urllib is a higher-level library built on top of httplib. urllib2 adds some advanced features on top of urllib, such as HTTP authentication and cookie handling; in Python 3, urllib2 was merged into urllib.

urllib3 is a third-party library that provides thread-safe connection pooling and file upload support; despite the name, it has little to do with urllib and urllib2.
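Two quick sketches of these, assuming Python 3 and urllib3 installed (it ships as a dependency of requests): the merged urllib.request and urllib3's thread-safe PoolManager, both fetching the same test URL:

from urllib.request import Request, urlopen
import urllib3

# Python 3's merged urllib: urllib.request
req = Request("https://httpbin.org/get", headers={"User-Agent": "demo"})
with urlopen(req, timeout=10) as resp:
    print(resp.status, resp.read().decode("utf-8")[:80])

# urllib3: requests go through a thread-safe connection pool
http = urllib3.PoolManager()
r = http.request("GET", "https://httpbin.org/get", timeout=10)
print(r.status, r.data.decode("utf-8")[:80])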

requests:

The requests library is a third-party network library built on top of urllib3; it is characterized by powerful features and an elegant API.

Even the official Python documentation for the built-in HTTP client recommends requests for a higher-level HTTP client interface. In practice, requests is also the library used most often.

To sum up, we choose the requests library as the starting point for our crawler. Also note that all of the libraries above are synchronous; if you need highly concurrent requests, you can use an asynchronous network library such as aiohttp, which Brother Pig will cover later.
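As a preview only, here is a minimal aiohttp sketch (assuming it is installed with pip install aiohttp) that fires several requests concurrently; we will not need it for the examples in this article:

import asyncio
import aiohttp

async def fetch(session, url):
    # One GET request inside a shared session
    async with session.get(url) as resp:
        return resp.status, await resp.text()

async def main():
    urls = ["https://httpbin.org/get"] * 3
    async with aiohttp.ClientSession() as session:
        # Run the three requests concurrently
        results = await asyncio.gather(*(fetch(session, u) for u in urls))
        for status, body in results:
            print(status, len(body))

asyncio.run(main())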

II. Introduction to requests

I hope you will always remember: whatever language you learn, do not forget to read the official documentation. It may not be the best introductory tutorial, but it is definitely the most up-to-date and complete reference!

1. Home page

The official requests documentation (a Chinese translation is available): http://cn.python-requests.org

Source code address: https://github.com/kennethreitz/requests

From the words "let HTTP serve humans" on the home page, we can see that the core goal of requests is to be easy for users to work with, which also reflects its philosophy of elegant design.

Note: PEP 20 is the famous Zen of Python.

Warning: non-professional use of other HTTP libraries can lead to dangerous side effects, including security flaws, redundant code, reinventing the wheel, endless documentation-chewing, depression, headaches, and even death.

2. Features

requests is said to be powerful, so let's take a look at what features it actually has:

Keep-Alive & connection pooling

International domains and URLs

Sessions with cookie persistence

Browser-style SSL verification

Automatic content decoding

Basic / Digest authentication

Elegant key/value cookies

Automatic decompression

Unicode response bodies

HTTP(S) proxy support

Multipart file uploads

Streaming downloads

Connection timeouts

Chunked requests

.netrc support

requests fully meets the needs of today's web. It supports Python 2.6-2.7 and 3.3-3.7, and runs beautifully on PyPy.
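A minimal sketch of a few of these features in action, again using httpbin.org as a test endpoint (the proxy address at the end is only a placeholder):

import requests

# Sessions persist cookies across requests and reuse the connection (Keep-Alive)
s = requests.Session()
s.get("https://httpbin.org/cookies/set/token/abc123", timeout=10)
print(s.get("https://httpbin.org/cookies", timeout=10).json())  # {'cookies': {'token': 'abc123'}}

# Automatic content decoding and Unicode response body
r = requests.get("https://httpbin.org/get", timeout=10)
print(r.encoding, r.text[:80])

# HTTP(S) proxy support -- placeholder address, uncomment only if you have a proxy running
# proxies = {"https": "http://127.0.0.1:8888"}
# requests.get("https://httpbin.org/get", proxies=proxies, timeout=10)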

III. Install requests

pip install requests

If you use pip3:

pip3 install requests

If you use Anaconda:

conda install requests

If you prefer not to use the command line, you can also install the library from PyCharm's package manager.
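Whichever way you install it, you can verify the installation from the command line (the version number printed will vary):

python -c "import requests; print(requests.__version__)"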

IV. Crawler process

The following picture shows a project development process that Brother Pig summarized from previous work. It is quite detailed; on a large project you really do need to be this thorough, otherwise when something goes wrong or the requirements change you will have no way to review the project, and the programmer may end up taking all the blame.

Back to the point: the reason for showing the project development process is to lead into the process a crawler follows when crawling data:

Identify the web pages to be crawled

Inspect the data source in the browser (static page or dynamically loaded)

Find the URL parameter rules for loading the data (such as pagination)

Write code to simulate the request and crawl the data

V. Crawl a JD.com product page

Brother Pig uses a JD.com product page as an example to walk you through the basic crawler process. Why start with JD.com rather than Taobao? Because JD.com product pages can be browsed without logging in, which makes it easy for everyone to get started quickly!

1. Step 1: find the product you want to crawl in the browser

PS: Brother Pig did not choose this product at random. Why this one? Because later we will crawl the reviews of this product for data analysis. Exciting, isn't it?

2. Step 2: inspect the data source in the browser

Open the browser's debug window to view the web requests and see how the data is loaded: is the static page returned directly, or is the data loaded dynamically with JavaScript?

Right-click and choose Inspect, or simply press F12, to open the debug window. Brother Pig recommends using the Chrome browser: programmers use it because it is easy to work with! For the specifics of debugging with Chrome, look up a tutorial online.

After opening the debug window, we can reload the page and then examine the returned data to determine where the data comes from.

3. Step 3: find the URL parameter rules for loading the data

We can see that the first request, https://item.jd.com/1263013576.html, returns exactly the web page data we want. Since we are crawling a single product page, there is no pagination to deal with.

Of course, core information such as prices and coupons is loaded through other requests; let's set that aside for now and finish our first small example first!

4. Step 4: write code to simulate the request and crawl the data

After we get the url link, let's start writing the code.

import requests


def spider_jd():
    """Crawl a JD.com product page."""
    url = 'https://item.jd.com/1263013576.html'
    try:
        r = requests.get(url)  # sometimes a failed request still returns data
        # raise_for_status checks the status code and raises an exception on 4XX or 5XX
        r.raise_for_status()
        print(r.text[:500])
    except requests.RequestException:
        print('crawl failed')


if __name__ == '__main__':
    spider_jd()

Check the returned result

At this point, we have completed crawling a JD.com product page. Although the example is simple and the code is short, the overall crawler process is basically the same. I hope students who want to learn crawling will practice by themselves and pick a product they like; only by doing it yourself can you really absorb the knowledge!
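In practice, many sites (JD.com included) treat clients without a browser-like User-Agent differently, and a request without a timeout can hang indefinitely. Here is a hedged refinement of the example above with an explicit User-Agent and timeout; the header value is just an illustrative browser string:

import requests


def spider_jd_with_headers():
    """Crawl a JD.com product page with a browser-like User-Agent and a timeout."""
    url = 'https://item.jd.com/1263013576.html'
    headers = {
        # Illustrative desktop-browser User-Agent string; any recent one will do
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    }
    try:
        r = requests.get(url, headers=headers, timeout=10)
        r.raise_for_status()
        r.encoding = r.apparent_encoding  # let requests guess the page encoding
        print(r.text[:500])
    except requests.RequestException as e:
        print('crawl failed:', e)


if __name__ == '__main__':
    spider_jd_with_headers()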

VI. Introduction to the requests library

Above we used the get method of requests. Looking at the source code, we find several other methods: post, put, patch, delete, options, and head, which correspond to the HTTP request methods.

Here is a simple list of them. Later we will learn them through plenty of real cases; after all, nobody wants to read dry explanations.

requests.post('http://httpbin.org/post', data={'key': 'value'})

requests.patch('http://httpbin.org/patch', data={'key': 'value'})

requests.put('http://httpbin.org/put', data={'key': 'value'})

requests.delete('http://httpbin.org/delete')

requests.head('http://httpbin.org/get')

requests.options('http://httpbin.org/get')

Note: httpbin.org is a website for testing HTTP requests, and it responds to all of these calls normally.
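A quick sketch of what one of these calls returns: httpbin echoes the request back as JSON, so you can inspect it directly with r.json():

import requests

# POST some form data and inspect the echoed response
r = requests.post('http://httpbin.org/post', data={'key': 'value'}, timeout=10)
print(r.status_code)                  # 200
print(r.headers['Content-Type'])      # application/json
print(r.json()['form'])               # {'key': 'value'}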

After reading the above, have you grasped the starting point of a Python crawler? If you want to learn more skills or find out more, you are welcome to follow the industry information channel. Thank you for reading!

