This article introduces the essential skills that Python crawler masters must have. Many people run into difficulties when working through real cases, so let's walk through how to handle these situations. I hope you read carefully and learn something!
Python has many application scenarios, such as rapid web development, crawlers, and automated operations. It can be used for simple websites, automatic posting scripts, email sending and receiving scripts, and simple captcha recognition scripts.
1. Basic crawling of web pages
GET method
import urllib2

url = "http://www.baidu.com"
response = urllib2.urlopen(url)
print response.read()
POST method
import urllib
import urllib2

url = "http://abcde.com"
form = {'name': 'abc', 'password': '1234'}
form_data = urllib.urlencode(form)
request = urllib2.Request(url, form_data)
response = urllib2.urlopen(request)
print response.read()
2. Use proxy IP
When developing crawlers, you will often find your IP blocked, in which case you need to use proxy IPs. The urllib2 package provides the ProxyHandler class, which lets you route page requests through a proxy. Code snippet:
import urllib2

proxy = urllib2.ProxyHandler({'http': '127.0.0.1:8087'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
response = urllib2.urlopen('http://www.baidu.com')
print response.read()
3. Cookies processing
Cookies are data (usually encrypted) that some websites store on the user's local machine in order to identify the user and track the session. Python provides the cookielib module for processing cookies. The main role of cookielib is to provide objects that can store cookies, so that it can be used together with the urllib2 module to access Internet resources.
Code snippet:
import urllib2, cookielib

cookie_support = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
opener = urllib2.build_opener(cookie_support)
urllib2.install_opener(opener)
content = urllib2.urlopen('http://XXXX').read()
The key is CookieJar(), which manages HTTP cookie values, stores the cookies generated by HTTP requests, and adds cookies to outgoing HTTP requests. The cookies are kept entirely in memory and are lost once the CookieJar instance is garbage-collected; none of this needs to be handled manually.
Add cookies manually:
cookie = "PHPSESSID=91rurfqm2329bopnosfu4fvmu7; kmsign=55d2c12c9b1e3; KMUID=b6Ejc1XSwPq9o756AxnBAg="
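As a minimal sketch of attaching such a cookie string by hand (using the placeholder URL http://XXXX from the snippet above and the example cookie value just shown), you can set the Cookie request header directly:

import urllib2

cookie = "PHPSESSID=91rurfqm2329bopnosfu4fvmu7; kmsign=55d2c12c9b1e3; KMUID=b6Ejc1XSwPq9o756AxnBAg="
request = urllib2.Request('http://XXXX')
# Attach the raw cookie string as a request header
request.add_header('Cookie', cookie)
print urllib2.urlopen(request).read()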
4. Disguising as a browser
Some websites dislike crawler visits and refuse their requests, so accessing such sites directly with urllib2 often results in HTTP Error 403: Forbidden.
Pay special attention to certain headers that the server will check:
User-Agent: some servers or proxies check this value to determine whether the request was initiated by a browser.
Content-Type: when calling a REST interface, the server checks this value to decide how the content in the HTTP body should be parsed.
Both can be handled by setting headers on the request. The code snippet is as follows:
import urllib2

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
}
request = urllib2.Request(
    url='http://my.oschina.net/jhao104/blog?catalog=3463517',
    headers=headers
)
print urllib2.urlopen(request).read()
5. Page parsing
For page parsing, the most powerful tool is of course regular expressions; these vary from site to site and user to user, so they need no further explanation here.
The second option is a parsing library; the two most commonly used are lxml and BeautifulSoup.
My evaluation of these two HTML/XML processing libraries: BeautifulSoup is a pure Python implementation, so it is slower, but it is very practical, for example, you can get the source code of an HTML node through a search; lxml is written in C, is efficient, and supports XPath. A small sketch of both follows below.
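As a minimal, hedged sketch of the two approaches, in the same Python 2 style as the rest of the article (the HTML string, tag names, and XPath expression are made-up examples, not taken from the original):

from bs4 import BeautifulSoup
from lxml import etree

html = "<html><body><div class='title'><a href='/post/1'>Hello</a></div></body></html>"

# BeautifulSoup: pure Python, convenient searching
soup = BeautifulSoup(html, 'html.parser')
link = soup.find('a')
print link['href'], link.get_text()

# lxml: C implementation, supports XPath
tree = etree.HTML(html)
print tree.xpath("//div[@class='title']/a/@href")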
6. Processing of Captcha
For some simple captchas, basic recognition is possible; I have only ever done some simple captcha recognition myself. Anti-human captchas such as 12306's, however, can be handled through a manual coding platform where real people type them in, which of course costs a fee. A sketch of simple OCR-based recognition follows below.
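As a minimal sketch of simple captcha recognition, assuming the pytesseract and Pillow libraries (neither is named in the original article) and a clean, low-noise image saved as captcha.png:

import pytesseract
from PIL import Image

# Load the captcha image and convert it to grayscale to help OCR
image = Image.open('captcha.png').convert('L')

# pytesseract wraps the Tesseract OCR engine; this only works well
# for simple, undistorted captchas
text = pytesseract.image_to_string(image)
print text.strip()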
7. Gzip compression
Have you ever come across web pages that remain garbled no matter how you transcode them? Ha, that means you didn't know that many web services can send compressed data, which can reduce the amount of data transmitted over the network by more than 60%. This is especially true for XML web services, because XML data compresses at a very high ratio.
But the server won't send you compressed data unless you tell it that you can handle compressed data.
So we need to modify the code like this:
import urllib2, httplib

request = urllib2.Request('http://xxxx.com')
request.add_header('Accept-encoding', 'gzip')
opener = urllib2.build_opener()
f = opener.open(request)
Here's the key: Create the Request object and add an Accept-encoding header to tell the server that you can accept gzip compressed data.
Then, unpack the data:
import StringIO
import gzip

compresseddata = f.read()
compressedstream = StringIO.StringIO(compresseddata)
gzipper = gzip.GzipFile(fileobj=compressedstream)
print gzipper.read()
8. Multi-threaded concurrent crawling
If a single thread is too slow, you need multiple threads. Here is a simple thread-pool template. The program simply prints 1-10, but you can see that it runs concurrently.
Although Python's multithreading is notoriously limited, it can still improve efficiency to some extent for crawlers that make frequent network requests.
from threading import Thread
from Queue import Queue
from time import sleep

# q is the task queue
# NUM is the total number of concurrent threads
# JOBS is how many tasks there are
q = Queue()
NUM = 2
JOBS = 10

# Concrete processing function, responsible for handling a single task
def do_something_using(arguments):
    print arguments

# Worker loop, responsible for constantly fetching tasks from the queue and processing them
def working():
    while True:
        arguments = q.get()
        do_something_using(arguments)
        sleep(1)
        q.task_done()

# Fork NUM threads to wait on the queue
for i in range(NUM):
    t = Thread(target=working)
    t.setDaemon(True)
    t.start()

# Queue up JOBS tasks
for i in range(JOBS):
    q.put(i)

# Wait for all JOBS to complete
q.join()

That covers the essential skills for Python crawler masters. Thank you for reading. If you want to learn more about the industry, you can follow this website, where more high-quality practical articles will be published.