What are the anti-crawling strategies of Python crawlers?


This article explains the main anti-blocking strategies used by Python crawlers. The techniques introduced here are simple, fast, and practical, so let's walk through them.

Data collection with crawlers has become a common need for many companies, and precisely because of this, anti-crawler techniques keep emerging: rate limits, IP bans, CAPTCHAs, and so on, any of which can stop a crawler from working. Below are the main strategies for keeping a crawler from being blocked.

Dynamically set the User-Agent (randomly switch the User-Agent to simulate the browsers of different users; the scrapy-random-useragent component can be used for this).

Disable cookies (for simple websites you can leave the cookies middleware disabled and send no cookies to the server, since some websites detect crawler behavior through cookie usage). CookiesMiddleware can be turned on or off with the COOKIES_ENABLED setting.

Enable cookies (for complex websites whose cookies are generated by JavaScript, use the scrapy-splash plugin with the Splash rendering service to obtain those cookies); a minimal request sketch follows below.
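As a hedged illustration only (a minimal sketch based on the scrapy-splash documentation; the spider name and URL are placeholders, and it assumes a Splash instance plus the SPLASH_URL and scrapy_splash middleware settings from the scrapy-splash README are configured in settings.py):

import scrapy
from scrapy_splash import SplashRequest  # pip install scrapy-splash

class JsCookieSpider(scrapy.Spider):
    # hypothetical spider used only to illustrate Splash rendering
    name = 'js_cookie_example'
    start_urls = ['https://example.com/']

    def start_requests(self):
        for url in self.start_urls:
            # 'wait' gives the page time to run its JavaScript (and set cookies)
            yield SplashRequest(url, self.parse, args={'wait': 2})

    def parse(self, response):
        self.logger.info('rendered page length: %d', len(response.text))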

Set a download delay (to prevent overly frequent access; set it to 2 seconds or higher).

Google Cache and Baidu Cache: if possible, use the page cache of search engines such as Google / Baidu to get page data.

Use a fake Referer, such as a Baidu search-result link containing the keyword.

Use an IP address pool: most websites currently ban crawlers by IP, and a large, customized pool of proxy IPs can get around this.

Example code using the Yiniuyun (16yun) crawler proxy middleware:

#! -*- encoding:utf-8 -*-
import base64
import random
import sys

PY3 = sys.version_info[0] >= 3

def base64ify(bytes_or_str):
    if PY3 and isinstance(bytes_or_str, str):
        input_bytes = bytes_or_str.encode('utf8')
    else:
        input_bytes = bytes_or_str
    output_bytes = base64.urlsafe_b64encode(input_bytes)
    if PY3:
        return output_bytes.decode('ascii')
    else:
        return output_bytes

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # proxy server (product website www.16yun.cn)
        proxyHost = "t.16yun.cn"
        proxyPort = "31111"
        # proxy tunnel authentication information
        proxyUser = "username"
        proxyPass = "password"
        request.meta['proxy'] = "http://{0}:{1}".format(proxyHost, proxyPort)
        # add the authentication header
        encoded_user_pass = base64ify(proxyUser + ":" + proxyPass)
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
        # set the IP-switching header (per request, as needed)
        tunnel = random.randint(1, 10000)
        request.headers['Proxy-Tunnel'] = str(tunnel)

Modify the project configuration file (./&lt;project name&gt;/settings.py):

DOWNLOADER_MIDDLEWARES = {

'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,

'&lt;project name&gt;.middlewares.ProxyMiddleware': 100

}

Set up download middleware (Downloader Middlewares)

Downloader middleware is a layer of components between the engine (crawler.engine) and the downloader (crawler.engine.download()); multiple downloader middlewares can be loaded and run.

When the engine passes a request to the downloader, the downloader middleware can process the request (for example, adding HTTP headers or proxy information).

When the downloader completes the HTTP request and passes the response back to the engine, the downloader middleware can process the response (for example, decompressing gzip content).

To activate the downloader middleware component, add it to the DOWNLOADER_MIDDLEWARES settings. This setting is a dictionary (dict), the key is the path of the middleware class, and the value is the order of the middleware.

Here is an example:

DOWNLOADER_MIDDLEWARES = {'mySpider.middlewares.MyDownloaderMiddleware': 543,}

Writing downloader middleware is very simple. Each middleware component is a Python class that defines one or more of the following methods:

class scrapy.contrib.downloadermiddleware.DownloaderMiddleware

process_request(self, request, spider)

This method is called when each request passes through the download middleware.

process_request() must return one of the following: None, a Response object, or a Request object, or raise IgnoreRequest:

If it returns None, Scrapy will continue to process the request and execute the appropriate methods of other middleware until the appropriate downloader handler function (download handler) is called and the request is executed (its response is downloaded).

If it returns a Response object, Scrapy will not call any other process_request() or process_exception() method, nor will it call the download function; it returns that response directly. The process_response() methods of the installed middleware are still called for every returned response.

If it returns a Request object, Scrapy stops calling process_request methods and reschedules the returned request. Once the newly returned request has been performed, the corresponding middleware chain is called on the downloaded response.

If it raises an IgnoreRequest exception, the process_exception() methods of the installed downloader middleware are called. If no method handles the exception, the request's errback (Request.errback) is called. If no code handles the raised exception, it is ignored and not logged (unlike other exceptions).

Parameters:

request (Request object): the request being processed

spider (Spider object): the spider this request corresponds to
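To make the return-value rules above concrete, here is a minimal, purely illustrative middleware (the class name and meta keys are invented for this example and are not part of Scrapy):

from scrapy.exceptions import IgnoreRequest

class ExampleRequestMiddleware(object):
    # Illustrative only: demonstrates the possible outcomes of process_request.

    def process_request(self, request, spider):
        # Raise IgnoreRequest: process_exception() / the request's errback take over.
        if request.meta.get('example_blocked'):
            raise IgnoreRequest('dropped by ExampleRequestMiddleware')
        # Return a Request: Scrapy reschedules the new request instead of this one.
        if request.meta.get('example_force_https') and request.url.startswith('http://'):
            return request.replace(url='https://' + request.url[len('http://'):])
        # Return None: Scrapy keeps running the remaining middleware and downloads the page.
        return None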

process_response(self, request, response, spider)

Called when the downloader completes the http request and passes the response to the engine

process_response() must return one of the following: a Response object, a Request object, or raise an IgnoreRequest exception.

If it returns a Response (which can be the same as the incoming response or an entirely new object), the response is processed by the process_response () method of other middleware in the chain.

If it returns a Request object, the middleware chain stops and the returned request is rescheduled to be downloaded; this is handled the same way as when process_request() returns a Request.

If it throws an IgnoreRequest exception, request's errback (Request.errback) is called. If there is no code to handle the thrown exception, the exception is ignored and not logged (unlike other exceptions).

Parameters:

request (Request object): the request corresponding to the response

response (Response object): the response being processed

spider (Spider object): the spider corresponding to the response
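Again as a hedged illustration (the class name, meta key, and the simple 403-retry logic are just an example, not Scrapy's standard RetryMiddleware):

class ExampleResponseMiddleware(object):
    # Illustrative only: demonstrates the possible outcomes of process_response.

    def process_response(self, request, response, spider):
        # Return a Request: the middleware chain stops and the request is rescheduled,
        # e.g. retry once when the site answers 403 (a common sign of being blocked).
        if response.status == 403 and not request.meta.get('example_retried'):
            request.meta['example_retried'] = True
            return request.replace(dont_filter=True)
        # Return a Response (the same one or a new object): the remaining
        # process_response() methods and then the spider get to handle it.
        return response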

Use case:

1. Create a middlewares.py file.

Switching proxy IPs and User-Agents in Scrapy is controlled through DOWNLOADER_MIDDLEWARES. Create a middlewares.py file in the same directory as settings.py to wrap all requests.

# middlewares.py
import base64
import random

# USER_AGENTS and PROXIES are defined in settings.py (step 2 below), e.g.
# from mySpider.settings import USER_AGENTS, PROXIES

class RandomUserAgent(object):
    # random User-Agent
    def process_request(self, request, spider):
        useragent = random.choice(USER_AGENTS)
        request.headers.setdefault("User-Agent", useragent)

class RandomProxy(object):
    def process_request(self, request, spider):
        proxy = random.choice(PROXIES)
        if proxy['user_passwd'] is None:
            # proxy without authentication
            request.meta['proxy'] = "http://" + proxy['ip_port']
        else:
            # base64-encode the account and password
            base64_userpasswd = base64.b64encode(proxy['user_passwd'].encode('utf-8')).decode('ascii')
            # matches the signaling format expected by the proxy server
            request.headers['Proxy-Authorization'] = 'Basic ' + base64_userpasswd
            request.meta['proxy'] = "http://" + proxy['ip_port']

Why the HTTP proxy uses base64 encoding:

The principle of an HTTP proxy is very simple: the client establishes a connection to the proxy server over the HTTP protocol, and the protocol signaling contains the IP and port of the remote host to connect to, plus authorization information if authentication is required. After receiving the signaling, the proxy server first authenticates the client; once that passes, it establishes a connection to the remote host. When the connection succeeds, it returns 200 to the client, indicating that authentication passed. It is as simple as that. Here is the specific signaling format:

CONNECT 59.64.128.198:21 HTTP/1.1
Host: 59.64.128.198:21
Proxy-Authorization: Basic bGV2I1TU5OTIz
User-Agent: OpenFetion

Here Proxy-Authorization is the authentication information; the string after Basic is the base64 encoding of the username and password joined together, that is, the base64 encoding of username:password.
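For example, the Basic value can be produced with Python's standard base64 module (placeholder credentials, not a real account):

import base64

user, password = 'username', 'password'   # placeholders
token = base64.b64encode('{}:{}'.format(user, password).encode('utf-8')).decode('ascii')
print('Proxy-Authorization: Basic ' + token)
# prints: Proxy-Authorization: Basic dXNlcm5hbWU6cGFzc3dvcmQ=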

HTTP/1.0 200 Connection established

After receiving this reply from the proxy, the client has successfully established the connection, and any data destined for the remote host can from then on simply be sent to the proxy server. Once the connection is established, the proxy server caches it keyed by IP address and port; when it later receives data, it looks up the corresponding connection in the cache by IP address and port and forwards the data over that connection.
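The whole exchange can be reproduced with a plain socket, which is a handy way to observe the signaling. This is only a sketch, with a placeholder proxy endpoint and placeholder credentials:

import base64
import socket

proxy_host, proxy_port = 't.16yun.cn', 31111           # placeholder proxy endpoint
target = 'example.com:443'                              # remote host to tunnel to
auth = base64.b64encode(b'username:password').decode('ascii')  # placeholder credentials

connect_cmd = ('CONNECT {0} HTTP/1.1\r\n'
               'Host: {0}\r\n'
               'Proxy-Authorization: Basic {1}\r\n'
               '\r\n').format(target, auth).encode('ascii')

with socket.create_connection((proxy_host, proxy_port), timeout=10) as sock:
    sock.sendall(connect_cmd)
    status_line = sock.recv(4096).decode('ascii', errors='replace').splitlines()[0]
    # 'HTTP/1.0 200 Connection established' means the tunnel is ready and any
    # bytes sent from now on are forwarded to the remote host.
    print(status_line)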

2. Modify settings.py to configure USER_AGENTS and PROXIES.

Add USER_AGENTS:

USER_AGENTS = [
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5"
]

Add proxy IP settings PROXIES:

Proxy IPs can be purchased from Yiniuyun's (16yun) crawler proxy service:

PROXIES = [
    {'ip_port': 't.16yun.cn:31111', 'user_passwd': None},
    {'ip_port': 't.16yun.cn:31111', 'user_passwd': None},
    {'ip_port': 't.16yun.cn:31112', 'user_passwd': '16yun:16yun'}
]

Unless cookies are specifically required, disable them to prevent websites from blocking the crawler based on cookie behavior:

COOKIES_ENABLED = False

Set a download delay:

DOWNLOAD_DELAY = 3

Finally, configure DOWNLOADER_MIDDLEWARES in settings.py and register the downloader middleware classes you wrote:

DOWNLOADER_MIDDLEWARES = {
    # 'mySpider.middlewares.MyCustomDownloaderMiddleware': 543,
    'mySpider.middlewares.RandomUserAgent': 1,
    'mySpider.middlewares.ProxyMiddleware': 100
}

At this point, you should have a deeper understanding of "what are the anti-crawling strategies of Python crawlers". Why not put them into practice?
