Python is a high-level programming language that aims to be elegant, clear, and simple. I have been using Python for almost a year now, and what I have written most often are crawler scripts of all kinds: scripts for locally verifying proxies, scripts for automatically logging in to and posting on forums, scripts for automatically receiving email, and scripts for simple CAPTCHA recognition.
These scripts have one thing in common: they are all web-related and always need some way to fetch URLs. Along the way I have accumulated quite a bit of crawling experience, which I summarize here so the same work does not have to be repeated in the future. (All of the examples below use Python 2's urllib/urllib2; in Python 3 these modules were merged into urllib.request and urllib.parse.)
1. Basic fetching of web pages
GET method:
import urllib2
url = "http://www.baidu.com"
response = urllib2.urlopen(url)
print response.read()
POST method:
import urllib
import urllib2
url = "http://abcde.com"
form = {'name': 'abc', 'password': '1234'}
form_data = urllib.urlencode(form)
request = urllib2.Request(url, form_data)
response = urllib2.urlopen(request)
print response.read()
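If a GET request needs query parameters, urllib.urlencode can build the query string as well. A minimal sketch, assuming an illustrative search parameter (the parameter name here is made up for the example):
import urllib
import urllib2
params = urllib.urlencode({'wd': 'python'})  # illustrative query parameters
url = "http://www.baidu.com/s?" + params
response = urllib2.urlopen(url)
print response.read()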
2. Using a proxy server
This is useful in some situations, for example when your IP has been blocked or the number of requests allowed per IP is limited.
import urllib2
proxy_support = urllib2.ProxyHandler({'http': 'http://XX.XX.XX.XX:XXXX'})
opener = urllib2.build_opener(proxy_support, urllib2.HTTPHandler)
urllib2.install_opener(opener)
content = urllib2.urlopen('http://XXXX').read()
3. Cookie handling
import urllib2, cookielib
cookie_support = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
opener = urllib2.build_opener(cookie_support, urllib2.HTTPHandler)
urllib2.install_opener(opener)
content = urllib2.urlopen('http://XXXX').read()
That's all there is to it. If you want to use a proxy and cookies at the same time, just add proxy_support as well and change the opener to:
opener = urllib2.build_opener(proxy_support, cookie_support, urllib2.HTTPHandler)
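Put together, a minimal sketch that installs an opener using both handlers (the proxy address is a placeholder, as above):
import urllib2, cookielib
proxy_support = urllib2.ProxyHandler({'http': 'http://XX.XX.XX.XX:XXXX'})  # placeholder proxy
cookie_support = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
opener = urllib2.build_opener(proxy_support, cookie_support, urllib2.HTTPHandler)
urllib2.install_opener(opener)
content = urllib2.urlopen('http://XXXX').read()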
4. Masquerading as a browser
Some sites dislike visits from crawlers and simply reject every such request. In that case we need to pretend to be a browser, which can be done by modifying the headers in the HTTP request:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
}
req = urllib2.Request(
    url='http://secure.verycd.com/signin/*/http://www.verycd.com/',
    data=postdata,
    headers=headers
)
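To actually send the request, pass it to urlopen as usual. A minimal follow-up, assuming postdata has already been built with urllib.urlencode:
response = urllib2.urlopen(req)
print response.read()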
5. Page parsing
For page parsing, the most powerful tool is of course the regular expression. Regexes differ from site to site and from user to user, so they need no further explanation here.
Next come the parsing libraries; the two commonly used ones are lxml and BeautifulSoup.
My take on these two: both are HTML/XML processing libraries. BeautifulSoup is pure Python, so it is slower, but its features are practical, for example it can fetch the source of an HTML node by searching the parsed result; lxml is written in C, is fast, and supports XPath.
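As a quick illustration, here is a minimal sketch of pulling the same node out of a page with each library (the HTML snippet is made up for the example, and the bs4 and lxml packages are assumed to be installed):
from bs4 import BeautifulSoup
from lxml import etree

html = "<html><body><h1>Hello</h1><p class='intro'>An example page</p></body></html>"

# BeautifulSoup: search the parsed tree by tag name
soup = BeautifulSoup(html, "html.parser")
print soup.find("h1").get_text()

# lxml: select the same node with an XPath expression
tree = etree.HTML(html)
print tree.xpath("//h1/text()")[0]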
6. Handling CAPTCHAs
What if you run into a CAPTCHA? There are two cases:
For CAPTCHAs like Google's, there is basically nothing to be done.
Simple CAPTCHAs: a limited set of characters, only simple translation or rotation plus some noise, and no distortion. These can still be handled. The general idea is to rotate the characters back, remove the noise, segment out the individual characters, reduce the dimensionality with a feature-extraction method such as PCA to build a feature library, and then match the CAPTCHA against that library. This gets fairly involved, so it won't be covered in detail here; pick up a relevant textbook and study it carefully.
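Just to give a feel for the preprocessing, a minimal sketch of binarizing and roughly segmenting a simple CAPTCHA with PIL (the file name and threshold are made up for the example; the feature extraction and matching steps are omitted):
from PIL import Image

img = Image.open("captcha.png").convert("L")        # hypothetical file; convert to grayscale
img = img.point(lambda p: 255 if p > 128 else 0)    # binarize with an assumed threshold

# crude segmentation: find the columns that contain any dark pixel;
# consecutive runs of such columns roughly correspond to individual characters
width, height = img.size
pixels = img.load()
columns = [x for x in range(width) if any(pixels[x, y] == 0 for y in range(height))]
print columns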
7. Gzip/deflate support
Web pages today generally support gzip compression, which often cuts transfer time dramatically. Take the VeryCD home page as an example: the uncompressed version is 247K and the compressed version is 45K, about 1/5 of the original, which means fetching it can be roughly five times faster.
However, Python's urllib/urllib2 do not support compression by default. To get a compressed response you have to set 'accept-encoding' in the request headers yourself, and when reading the response check whether there is a 'content-encoding' header to decide whether it needs to be decoded; this is tedious and fiddly. So how can urllib2 be made to support gzip and deflate automatically?
In fact, you can subclass BaseHandler and then install it with build_opener:
import urllib2
from gzip import GzipFile
from StringIO import StringIO

class ContentEncodingProcessor(urllib2.BaseHandler):
    """A handler to add gzip capabilities to urllib2 requests"""

    # add headers to requests
    def http_request(self, req):
        req.add_header("Accept-Encoding", "gzip, deflate")
        return req

    # decode
    def http_response(self, req, resp):
        old_resp = resp
        # gzip
        if resp.headers.get("content-encoding") == "gzip":
            gz = GzipFile(
                fileobj=StringIO(resp.read()),
                mode="r"
            )
            resp = urllib2.addinfourl(gz, old_resp.headers, old_resp.url, old_resp.code)
            resp.msg = old_resp.msg
        # deflate
        if resp.headers.get("content-encoding") == "deflate":
            gz = StringIO(deflate(resp.read()))
            resp = urllib2.addinfourl(gz, old_resp.headers, old_resp.url, old_resp.code)
            resp.msg = old_resp.msg
        return resp

# deflate support
import zlib
def deflate(data):   # zlib only provides the zlib compress format, not the deflate format,
    try:             # so on top of all there's this workaround:
        return zlib.decompress(data, -zlib.MAX_WBITS)
    except zlib.error:
        return zlib.decompress(data)
Then it is easy to use:
encoding_support = ContentEncodingProcessor
opener = urllib2.build_opener(encoding_support, urllib2.HTTPHandler)
# open the page directly with the opener; if the server supports gzip/deflate it is decompressed automatically
content = opener.open(url).read()
8. Multithreaded concurrent fetching
If a single thread is too slow, you need multiple threads. Here is a simple thread-pool template; this program does nothing but print ten numbers, but you can see that it does so concurrently.
Although Python's multithreading is famously limited (by the GIL), for network-bound crawlers that make frequent requests it can still improve efficiency to some extent.
from threading import Thread
from Queue import Queue
from time import sleep

# q is the task queue
# NUM is the total number of concurrent threads
# JOBS is the number of tasks
q = Queue()
NUM = 2
JOBS = 10

# the concrete handler function, responsible for processing a single task
def do_something_using(arguments):
    print arguments

# the worker: keeps fetching tasks from the queue and processing them
def working():
    while True:
        arguments = q.get()
        do_something_using(arguments)
        sleep(1)
        q.task_done()

# fork NUM threads to wait on the queue
for i in range(NUM):
    t = Thread(target=working)
    t.setDaemon(True)
    t.start()

# put the JOBS into the queue
for i in range(JOBS):
    q.put(i)

# wait for all JOBS to complete
q.join()
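To turn the template into a crawler, the worker just needs to fetch a URL instead of printing a number. A minimal self-contained sketch, with an illustrative URL list:
from threading import Thread
from Queue import Queue
import urllib2

q = Queue()

def working():
    while True:
        url = q.get()
        try:
            content = urllib2.urlopen(url).read()
            print url, len(content)
        finally:
            q.task_done()

for i in range(4):                       # 4 worker threads, an arbitrary choice
    t = Thread(target=working)
    t.setDaemon(True)
    t.start()

for url in ["http://www.baidu.com", "http://www.verycd.com"]:  # illustrative URLs
    q.put(url)
q.join()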
9. Summary
Reading code written in Python feels like reading English: it lets you focus on solving the problem rather than wrestling with the language itself. Although Python's standard implementation is written in C, it does away with C's complicated pointers, which keeps the language simple and easy to learn. And as open-source software, Python allows its code to be read, copied, and even improved. All of this contributes to Python's high productivity. "Life is short, I use Python": it really is a wonderful and powerful language.
All in all, when you start learning Python, keep these four points in mind:
1. Code style. Good style is a valuable habit in itself; if you don't keep your code well organized from the start, it will be very painful later.
2. Write more, read less. Many people learn Python by just reading books, but this is not like studying math or physics, where you might understand something simply by reading worked examples; learning Python is mainly about learning how to think as a programmer.
3. Practice often. After learning something new, make sure you practice applying it, otherwise you will forget it. In our line of work, learning is mostly hands-on.
4. Learn efficiently. If you feel your progress is very slow, stop, find out why, and ask people who have been through it before.