Getting you started with Python crawlers: 8 common crawler techniques

2025-01-23 Update From: SLTechnology News&Howtos


Python is a high-level programming language whose stated positioning is "elegant, clear, simple". I have been learning and using Python for almost a year now, and what I have written most often are all kinds of crawler scripts: scripts that verify proxies locally, that log in and post to forums automatically, that receive email automatically, and that do simple CAPTCHA recognition.

These scripts have one thing in common: they are all web-related and always use some method to fetch pages, so I have accumulated a fair amount of page-fetching experience. Collecting it here saves repeating the work in the future.

1. Basic fetching of web pages

GET method

import urllib2

url = "http://www.baidu.com"
response = urllib2.urlopen(url)
print response.read()

POST method

import urllib
import urllib2

url = "http://abcde.com"
form = {'name': 'abc', 'password': '1234'}
form_data = urllib.urlencode(form)
request = urllib2.Request(url, form_data)
response = urllib2.urlopen(request)
print response.read()
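These snippets are Python 2; in Python 3, urllib2 was merged into urllib.request and urllib.error. For reference, a rough Python 3 equivalent of both requests, keeping the same placeholder URLs:

import urllib.parse
import urllib.request

# GET
response = urllib.request.urlopen("http://www.baidu.com")
print(response.read())

# POST: the form must be URL-encoded and turned into bytes
form = {'name': 'abc', 'password': '1234'}
form_data = urllib.parse.urlencode(form).encode('utf-8')
request = urllib.request.Request("http://abcde.com", data=form_data)
print(urllib.request.urlopen(request).read())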

2. Using a proxy server

This is useful in some scenarios, for example when an IP has been blocked, or when the number of requests allowed per IP is limited.

import urllib2

proxy_support = urllib2.ProxyHandler({'http': 'http://XX.XX.XX.XX:XXXX'})
opener = urllib2.build_opener(proxy_support, urllib2.HTTPHandler)
urllib2.install_opener(opener)
content = urllib2.urlopen('http://XXXX').read()
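The Python 3 version is almost identical, just under urllib.request; a minimal sketch with the same placeholder addresses:

import urllib.request

proxy_support = urllib.request.ProxyHandler({'http': 'http://XX.XX.XX.XX:XXXX'})
opener = urllib.request.build_opener(proxy_support)
urllib.request.install_opener(opener)
content = urllib.request.urlopen('http://XXXX').read()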

3. Cookie handling

import urllib2, cookielib

cookie_support = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
opener = urllib2.build_opener(cookie_support, urllib2.HTTPHandler)
urllib2.install_opener(opener)
content = urllib2.urlopen('http://XXXX').read()

If you want to use a proxy and cookies at the same time, just add proxy_support and change the opener to:

opener = urllib2.build_opener(proxy_support, cookie_support, urllib2.HTTPHandler)
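In Python 3, cookielib was renamed http.cookiejar; a minimal equivalent sketch:

import http.cookiejar
import urllib.request

cookie_support = urllib.request.HTTPCookieProcessor(http.cookiejar.CookieJar())
opener = urllib.request.build_opener(cookie_support)  # add proxy_support here to combine both
urllib.request.install_opener(opener)
content = urllib.request.urlopen('http://XXXX').read()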

4. Masquerading as a browser

Some websites resent crawler visits and reject every such request. At that point we need to pretend to be a browser, which can be done by modifying a header in the HTTP request:

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
}
req = urllib2.Request(
    url='http://secure.verycd.com/signin/*/http://www.verycd.com/',
    data=postdata,  # form data prepared as in section 1
    headers=headers
)
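An alternative is to attach the header to the opener, so every request it makes carries the fake User-Agent rather than setting it per Request; a sketch in Python 3, using a placeholder URL:

import urllib.request

opener = urllib.request.build_opener()
# addheaders replaces the default ('User-agent', 'Python-urllib/x.y') pair
opener.addheaders = [('User-Agent',
                      'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6')]
content = opener.open('http://www.baidu.com').read()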

5. Page parsing

The most powerful tool for page parsing is of course the regular expression, which differs from site to site and user to user, so it needs no further explanation here.

Next come the parsing libraries; the two commonly used ones are lxml and BeautifulSoup.

My evaluation of the two: both are HTML/XML processing libraries. BeautifulSoup is implemented in pure Python, so it is slower, but its features are practical, for example fetching the source of an HTML node through a result search; lxml is a C implementation, efficient, and supports XPath.
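To make the comparison concrete, a minimal sketch of pulling all link targets out of a fetched page with each library (assuming bs4 and lxml are installed and html holds the page source):

# BeautifulSoup: pure Python, convenient searching
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
links = [a.get('href') for a in soup.find_all('a')]

# lxml: C implementation, fast, XPath support
from lxml import etree
tree = etree.HTML(html)
links = tree.xpath('//a/@href')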

6. Handling CAPTCHAs

What do you do when you run into a CAPTCHA? There are two situations:

CAPTCHAs of the Google kind: nothing can be done.

Simple CAPTCHAs: a limited set of characters, only simple translation or rotation plus noise, no distortion. These can still be handled. The general idea is to rotate the characters back, remove the noise, segment the individual characters, then reduce dimensionality with a feature-extraction method such as PCA and build a feature library, and finally compare the CAPTCHA against that library. This is fairly involved and will not be expanded here; find a relevant textbook and study it carefully. A small taste of the preprocessing step is sketched below.
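A minimal sketch of the denoising stage using Pillow: grayscale conversion plus a fixed threshold (the filename and the threshold value 128 are illustrative assumptions):

from PIL import Image

# load the CAPTCHA and convert to grayscale
img = Image.open('captcha.png').convert('L')
# binarize: pixels brighter than the threshold become white, the rest black,
# which strips much of the light background noise
binary = img.point(lambda p: 255 if p > 128 else 0)
binary.save('captcha_clean.png')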

7. Gzip/deflate support

Most web pages today support gzip compression, which often cuts transmission time dramatically. Take the VeryCD home page as an example: 247K uncompressed, 45K compressed, about 1/5 of the original, which means fetching is roughly five times faster.

However, Python's urllib/urllib2 does not support compression by default. To get the compressed format you must write 'accept-encoding' into the request header, and then, after reading the response, check its header for a 'content-encoding' entry to decide whether it needs decoding, which is tedious and trivial. How can urllib2 be made to support gzip and deflate automatically?

In fact, you can inherit from the BaseHandler class and then handle it via build_opener:

import urllib2
from gzip import GzipFile
from StringIO import StringIO

class ContentEncodingProcessor(urllib2.BaseHandler):
    """A handler to add gzip capabilities to urllib2 requests"""

    # add the header to outgoing requests
    def http_request(self, req):
        req.add_header("Accept-Encoding", "gzip, deflate")
        return req

    # decode incoming responses
    def http_response(self, req, resp):
        old_resp = resp
        # gzip
        if resp.headers.get("content-encoding") == "gzip":
            gz = GzipFile(fileobj=StringIO(resp.read()), mode="r")
            resp = urllib2.addinfourl(gz, old_resp.headers, old_resp.url, old_resp.code)
            resp.msg = old_resp.msg
        # deflate
        if resp.headers.get("content-encoding") == "deflate":
            gz = StringIO(deflate(resp.read()))
            resp = urllib2.addinfourl(gz, old_resp.headers, old_resp.url, old_resp.code)
            resp.msg = old_resp.msg
        return resp

# deflate support
import zlib
def deflate(data):
    # zlib only provides the zlib compress format, not the raw deflate format,
    # so this workaround is needed:
    try:
        return zlib.decompress(data, -zlib.MAX_WBITS)
    except zlib.error:
        return zlib.decompress(data)

And then it is easy:

encoding_support = ContentEncodingProcessor
opener = urllib2.build_opener(encoding_support, urllib2.HTTPHandler)

# open the page directly with the opener; if the server supports gzip/deflate,
# it is decompressed automatically
content = opener.open(url).read()
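Python 3's urllib.request still does not decompress automatically, but the manual route is short there; a minimal sketch, with url assumed to hold the page address:

import gzip
import zlib
import urllib.request

req = urllib.request.Request(url, headers={'Accept-Encoding': 'gzip, deflate'})
resp = urllib.request.urlopen(req)
data = resp.read()
encoding = resp.headers.get('Content-Encoding')
if encoding == 'gzip':
    data = gzip.decompress(data)
elif encoding == 'deflate':
    try:
        data = zlib.decompress(data, -zlib.MAX_WBITS)  # raw deflate
    except zlib.error:
        data = zlib.decompress(data)  # zlib-wrapped deflate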

8. Multithreaded concurrent fetching

If a single thread is too slow, you need multithreading. Here is a simple thread-pool template. The program simply prints ten numbers, but you can see that they come out concurrently.

Although Python's multithreading has a dubious reputation, for network-bound work like crawling it can still improve efficiency to some extent.

from threading import Thread
from Queue import Queue
from time import sleep

# q is the task queue
# NUM is the total number of concurrent threads
# JOBS is the number of tasks
q = Queue()
NUM = 2
JOBS = 10

# handler function: processes a single task
def do_something_using(arguments):
    print arguments

# worker: keeps fetching tasks from the queue and processing them
def working():
    while True:
        arguments = q.get()
        do_something_using(arguments)
        sleep(1)
        q.task_done()

# fork NUM threads waiting on the queue
for i in range(NUM):
    t = Thread(target=working)
    t.setDaemon(True)
    t.start()

# queue up the JOBS tasks
for i in range(JOBS):
    q.put(i)

# wait for all tasks to complete
q.join()
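In Python 3 the Queue module was renamed queue and print became a function, but the simpler modern route is the standard library's concurrent.futures thread pool; a minimal equivalent sketch:

from concurrent.futures import ThreadPoolExecutor
from time import sleep

def do_something_using(arguments):
    print(arguments)
    sleep(1)

# two worker threads, ten jobs, matching the template above;
# leaving the with-block waits for all tasks to finish
with ThreadPoolExecutor(max_workers=2) as pool:
    pool.map(do_something_using, range(10))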

9. Summary

Reading Python code feels like reading English, which lets you focus on solving the problem rather than deciphering the language itself. Although the standard Python interpreter is written in C, Python abandons C's complex pointers, making it simple and easy to learn. And as open-source software, Python allows its code to be read, copied, and even improved. These features contribute to Python's high productivity. "Life is too short, I use Python": it is a wonderful and powerful language.

All in all, when you start learning Python, pay attention to these four points:

1. Code style: keeping your code clean is in itself a very good habit; if you do not maintain good code organization from the start, it will be very painful later.

2. More hands-on practice, less passive reading. Many people learn Python by blindly reading books, but this is not mathematics or physics, where you might understand something just by reading the worked examples; learning Python is mainly about learning programming thinking.

3. Practice often. After learning a new concept, make a point of applying it, otherwise you will forget it; our trade is learned mainly by doing.

4. Study efficiently. If you feel your efficiency is very low, stop, find the reason, and ask people who have been through it why.
