This article introduces the commonly used Python crawler techniques. Many people run into exactly these situations in real projects, so let the editor walk you through how to handle them. I hope you read it carefully and come away with something useful!
1. Basic crawl of web pages
Get method
import urllib2

url = "http://www.baidu.com"
response = urllib2.urlopen(url)
print response.read()
Post method
import urllib
import urllib2

url = "http://abcde.com"
form = {'name': 'abc', 'password': '1234'}
form_data = urllib.urlencode(form)
request = urllib2.Request(url, form_data)
response = urllib2.urlopen(request)
print response.read()
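For readers on Python 3, where the urllib2 module no longer exists, here is a minimal sketch of the same GET and POST requests using the standard urllib.request and urllib.parse modules (the URLs are the same placeholders as above):

import urllib.request
import urllib.parse

# GET
response = urllib.request.urlopen("http://www.baidu.com")
print(response.read())

# POST: the form data must be URL-encoded and then encoded to bytes
form = {'name': 'abc', 'password': '1234'}
form_data = urllib.parse.urlencode(form).encode('utf-8')
request = urllib.request.Request("http://abcde.com", data=form_data)
response = urllib.request.urlopen(request)
print(response.read())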
2. Use proxy IP
In the process of developing a crawler, you will often find that your IP gets blocked, in which case you need to use a proxy IP.
There is a ProxyHandler class in the urllib2 package, which allows you to set up a proxy to access the web page, as shown in the following code snippet:
import urllib2

proxy = urllib2.ProxyHandler({'http': '127.0.0.1:8087'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
response = urllib2.urlopen('http://www.baidu.com')
print response.read()
3. Cookies processing
Cookies are data (usually encrypted) that some websites store on the user's local machine in order to identify the user and track the session. Python provides the cookielib module for handling cookies; its main job is to provide objects that can store cookies, so that it can be used together with the urllib2 module to access Internet resources.
Code snippet:
import urllib2, cookielib

cookie_support = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
opener = urllib2.build_opener(cookie_support)
urllib2.install_opener(opener)
content = urllib2.urlopen('http://XXXX').read()
The key is CookieJar(), which manages HTTP cookie values, stores the cookies generated by HTTP requests, and adds cookies to outgoing HTTP requests. The cookies are kept entirely in memory and are lost once the CookieJar instance is garbage-collected; none of these steps needs to be handled manually.
Add cookie manually:
Cookie = "PHPSESSID=91rurfqm2329bopnosfu4fvmu7; kmsign=55d2c12c9b1e3; KMUID=b6Ejc1XSwPq9o756AxnBAg=" request.add_header ("Cookie", cookie)
4. Disguise as a browser
Some websites dislike visits from crawlers and reject every such request, so HTTP Error 403: Forbidden often occurs when you access a website directly with urllib2.
Pay special attention to certain headers that the server side checks:
1. User-Agent: some servers or proxies check this value to determine whether the request was initiated by a real browser.
2. Content-Type: when a REST interface is used, the server checks this value to decide how the contents of the HTTP body should be parsed.
This can be dealt with by modifying the headers of the HTTP request. The code snippet is as follows:
import urllib2

headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
request = urllib2.Request(url='http://my.oschina.net/jhao104/blog?catalog=3463517', headers=headers)
print urllib2.urlopen(request).read()
5. Page parsing
Of course, the most powerful tool for page parsing is the regular expression. Regexes differ from site to site and from user to user, so not much explanation is needed here; just two good URLs:
Getting started with regular expressions: http://www.cnblogs.com/huxi/archive/2010/07/04/1771073.html
Regular expression online testing: http://tool.oschina.net/regex/
Next come the parsing libraries. Two are commonly used, lxml and BeautifulSoup; here are two good sites that introduce how to use them:
Lxml: http://my.oschina.net/jhao104/blog/639448
BeautifulSoup: http://cuiqingcai.com/1319.html
My evaluation of the two: both are HTML/XML processing libraries. BeautifulSoup is a pure-Python implementation, so it is slow, but its features are practical, for example retrieving the source of an HTML node by searching the parsed result; lxml is written in C, is efficient, and supports XPath.
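As a quick illustration, here is a minimal sketch of extracting the same data with both libraries. The HTML snippet is invented for the example, and it assumes BeautifulSoup 4 (the bs4 package) and lxml are installed:

from bs4 import BeautifulSoup  # pip install beautifulsoup4
from lxml import etree         # pip install lxml

html = "<html><body><h1 id='title'>Hello</h1><a href='/a'>link A</a></body></html>"

# BeautifulSoup: pure Python, convenient searching of the parsed result
soup = BeautifulSoup(html, "html.parser")
print(soup.find("h1", id="title").text)           # Hello
print([a["href"] for a in soup.find_all("a")])    # ['/a']

# lxml: C implementation, supports XPath
tree = etree.HTML(html)
print(tree.xpath("//h1[@id='title']/text()"))     # ['Hello']
print(tree.xpath("//a/@href"))                    # ['/a']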
6. Processing of CAPTCHA
Some simple CAPTCHAs can be recognized programmatically, and I have only ever done that kind of basic recognition. For the truly anti-human CAPTCHAs, such as 12306's, you can have them typed manually through a coding platform, which of course costs a fee.
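For the simple cases, one possible approach (a sketch only, not the author's specific method) is to clean up the image and run OCR on it, for example with Pillow and pytesseract; 'captcha.png' is a placeholder file name, and both packages plus the Tesseract engine must be installed:

from PIL import Image    # pip install pillow
import pytesseract       # pip install pytesseract (requires the Tesseract OCR engine)

img = Image.open('captcha.png')                    # placeholder file name
img = img.convert('L')                             # convert to grayscale
img = img.point(lambda p: 255 if p > 140 else 0)   # crude binarization; threshold is a guess
text = pytesseract.image_to_string(img)
print(text.strip())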
7. Gzip compression
Have you ever come across a web page that stays garbled no matter how you transcode it? Haha, that means you don't yet know that many web services can send compressed data, which can cut the amount of data transmitted over the network by more than 60%. This is especially true for XML web services, because XML data can achieve a very high compression ratio.
In general, though, the server will not send you compressed data unless you tell it that you can handle compressed data.
So you need to modify the code like this:
import urllib2, httplib

request = urllib2.Request('http://xxxx.com')
request.add_header('Accept-encoding', 'gzip')
opener = urllib2.build_opener()
f = opener.open(request)
This is the key: create a Request object and add an Accept-encoding header to tell the server that you can accept gzip compressed data.
And then decompress the data:
import StringIO
import gzip

compresseddata = f.read()
compressedstream = StringIO.StringIO(compresseddata)
gzipper = gzip.GzipFile(fileobj=compressedstream)
print gzipper.read()
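For reference, the StringIO module is gone on Python 3; a minimal sketch of the same decompression step there uses io.BytesIO, or simply gzip.decompress (f is assumed to be the response object from the previous snippet):

import gzip
import io

compressed_data = f.read()  # bytes returned by the opener

# wrap the bytes in a BytesIO and use GzipFile...
gzipper = gzip.GzipFile(fileobj=io.BytesIO(compressed_data))
print(gzipper.read())

# ...or decompress in a single call
print(gzip.decompress(compressed_data))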
8. Multithreaded concurrent fetching
If a single thread is too slow, you need multithreading. Below is a simple thread pool template. The program merely prints the numbers 0 through 9, but you can see that it runs concurrently.
Although Python's multithreading is rather underwhelming, for network-heavy workloads such as crawlers it can still improve efficiency to some extent.
from threading import Thread
from Queue import Queue
from time import sleep

# q is the task queue
# NUM is the total number of concurrent threads
# JOBS is the number of tasks
q = Queue()
NUM = 2
JOBS = 10

# the processing function, responsible for handling a single task
def do_something_using(arguments):
    print arguments

# this is the worker, which keeps fetching data from the queue and processing it
def working():
    while True:
        arguments = q.get()
        do_something_using(arguments)
        sleep(1)
        q.task_done()

# fork NUM threads waiting on the queue
for i in range(NUM):
    t = Thread(target=working)
    t.setDaemon(True)
    t.start()

# queue up JOBS tasks
for i in range(JOBS):
    q.put(i)

# wait for all JOBS to finish
q.join()
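As an aside beyond the original template, on Python 3 the same pattern is usually written with concurrent.futures.ThreadPoolExecutor, which manages the worker threads and the queue for you; a minimal sketch:

from concurrent.futures import ThreadPoolExecutor
from time import sleep

NUM = 2    # number of worker threads
JOBS = 10  # number of tasks

def do_something_using(argument):
    # handle a single task; here we just sleep and print it, as in the template above
    sleep(1)
    print(argument)

with ThreadPoolExecutor(max_workers=NUM) as pool:
    pool.map(do_something_using, range(JOBS))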
That's it for "what are the commonly used Python crawler skills". Thank you for reading. If you want to know more about the industry, you can follow the website, and the editor will keep publishing more high-quality practical articles for you!