
What are the skills of Python crawlers?



This article introduces the topic "what are the skills of Python crawlers?" Many people run into questions about Python crawler techniques in daily work, so the editor has consulted a range of materials and sorted out some simple, easy-to-use methods, in the hope of answering those doubts. Now, please follow along and study!

1. Basic web page crawling

GET method

POST method
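A minimal sketch of both methods, using the Python 2 urllib2 module that this article is built around (the URL and form fields are placeholders):

import urllib
import urllib2

# GET: fetch a page directly
get_response = urllib2.urlopen('http://www.example.com')
html = get_response.read()

# POST: urlencode the form data and pass it as the request body
post_data = urllib.urlencode({'key': 'value'})
request = urllib2.Request('http://www.example.com/login', post_data)
html = urllib2.urlopen(request).read()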

2. Use proxy IP

In the course of developing a crawler, we often run into our IP being blocked, so we need to use proxy IPs.

There is a ProxyHandler class in the urllib2 package, which can be used to set up a proxy for accessing web pages, as shown in the following code snippet:
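A minimal sketch, assuming a local proxy listening at 127.0.0.1:8087 (a placeholder address):

import urllib2

# Route HTTP traffic through the proxy
proxy = urllib2.ProxyHandler({'http': 'http://127.0.0.1:8087'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)  # install globally so plain urlopen uses it
response = urllib2.urlopen('http://www.example.com')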

3. Cookie processing

Cookies are data (usually encrypted) that some websites store on a user's local machine in order to identify the user and track the session. Python provides the cookielib module for handling cookies. Its main job is to provide objects that can store cookies, so that it can be used together with the urllib2 module to access Internet resources.

Code snippet:
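A minimal sketch of the cookielib-plus-urllib2 combination described above (the URL is a placeholder):

import cookielib
import urllib2

# CookieJar keeps cookies in memory; HTTPCookieProcessor attaches it to requests
cookie_jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_jar))
response = opener.open('http://www.example.com')
for cookie in cookie_jar:
    print cookie.name, cookie.value  # cookies the server set for this session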

The key is CookieJar(), which manages HTTP cookie values: it stores the cookies generated by HTTP requests and adds them to outgoing HTTP requests. The entire cookie store lives in memory, so the cookies are lost once the CookieJar instance is garbage-collected; none of this needs to be handled manually.

Add cookie manually:
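A minimal sketch; the cookie name and value are placeholders you would copy from a real browser session:

import urllib2

opener = urllib2.build_opener()
# Append a fixed Cookie header to every request made through this opener
opener.addheaders.append(('Cookie', 'sessionid=placeholder-value'))
response = opener.open('http://www.example.com')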

4. Disguise as a browser

Some websites dislike crawler visits, so they reject all such requests. As a result, HTTP Error 403: Forbidden often occurs when you access a website directly with urllib2.

Pay special attention to certain headers; the server side checks for these:

User-Agent: some servers or proxies check this value to determine whether the request was initiated by a real browser.

Content-Type: when calling a REST interface, the server checks this value to determine how the content of the HTTP body should be parsed.

Both can be handled by modifying the headers of the HTTP request. The code snippet is as follows:
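A minimal sketch that sets the headers discussed above (the User-Agent string and URLs are placeholders):

import urllib2

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',  # pose as a real browser
    'Referer': 'http://www.example.com/'
}
request = urllib2.Request(url='http://www.example.com', headers=headers)
html = urllib2.urlopen(request).read()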

5. CAPTCHA processing

Some simple CAPTCHAs can be recognized programmatically, and we have only ever done that kind of simple recognition. As for the truly anti-human CAPTCHAs, such as 12306's, they can be typed in manually through a captcha-solving platform, for a fee of course.
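As an illustration only, a minimal sketch of simple recognition using the third-party pytesseract and Pillow packages (an assumption; the article does not name a specific library):

import pytesseract
from PIL import Image

# 'captcha.png' is a placeholder for a CAPTCHA image saved locally
image = Image.open('captcha.png')
text = pytesseract.image_to_string(image)  # OCR copes only with simple CAPTCHAs
print text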

6. Gzip compression

Have you ever run into web pages that come out garbled no matter how you transcode them? Ha, that means you don't yet know that many web services can send compressed data, which can cut the amount of data transmitted over the network by more than 60%. This is especially true for XML web services, because XML data can achieve a very high compression ratio.

But in general a server will not send you compressed data unless you tell it that you can handle compressed data.

So you need to modify the code like this:
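A minimal sketch (the URL is a placeholder):

import urllib2

# Create a Request object and declare that we can accept gzip-compressed data
request = urllib2.Request('http://www.example.com')
request.add_header('Accept-Encoding', 'gzip')
response = urllib2.urlopen(request)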

This is the key: create a Request object and add an Accept-Encoding header to tell the server that you can accept gzip-compressed data.

And then decompress the data:
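Continuing the sketch above, decompress the body only when the server actually gzipped it:

import gzip
from StringIO import StringIO

# 'response' comes from the previous snippet
if response.info().get('Content-Encoding') == 'gzip':
    buf = StringIO(response.read())
    html = gzip.GzipFile(fileobj=buf).read()
else:
    html = response.read()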

7. Multithreaded concurrent fetching

If a single thread is too slow, you need multithreading. Here is a simple thread-pool template; the program simply prints the numbers 1-10, but you can see that they are handled concurrently.
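A minimal sketch of such a thread pool, built on the Python 2 Queue and threading modules (the pool size of 2 is arbitrary):

from Queue import Queue
from threading import Thread

queue = Queue()

def working():
    # Each worker pulls items off the queue until the program exits
    while True:
        item = queue.get()
        print item  # stand-in for the real page-fetching work
        queue.task_done()

NUM_THREADS = 2  # size of the thread pool
for _ in range(NUM_THREADS):
    worker = Thread(target=working)
    worker.setDaemon(True)  # daemon threads die with the main thread
    worker.start()

for i in range(1, 11):  # queue up the numbers 1-10
    queue.put(i)

queue.join()  # block until every queued item has been processed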

Although Python's multithreading is of limited use (the GIL prevents true parallelism), it can still improve efficiency to a degree for crawlers, which spend most of their time waiting on the network.

At this point, the study of "what are the skills of Python crawlers?" is over. I hope it has resolved your doubts. Pairing theory with practice is the best way to learn, so go and try it out! If you want to keep learning more related knowledge, please continue to follow the site; the editor will keep working hard to bring you more practical articles!
