How to Deal with Common Anti-Crawling Mechanisms When Using the Scrapy Framework

This article explains how to deal with the anti-crawling mechanisms commonly encountered when using the Scrapy framework. The material is straightforward and easy to follow.
Header checks
The simplest anti-crawling mechanism is to check the headers of the HTTP request, including User-Agent, Referer, Cookies, and so on.
User-Agent
The User-Agent header reveals the type and version of the client the visitor is using, and in Scrapy it is usually handled in a downloader middleware. For example, define a list of browser User-Agent strings (USER_AGENT_LIST) in settings.py, and then create a new random_user_agent middleware:
import random

class RandomUserAgentMiddleware(object):
    @classmethod
    def process_request(cls, request, spider):
        ua = random.choice(spider.settings['USER_AGENT_LIST'])
        if ua:
            request.headers.setdefault('User-Agent', ua)
This way, a real browser's User-Agent is chosen at random for each request.
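For completeness, here is a minimal sketch of the settings.py side, assuming the middleware above is saved as myproject/middlewares/random_user_agent.py; the project name, module path, example User-Agent strings, and the priority value 400 are illustrative assumptions rather than values from the original article.

# settings.py (sketch; names and values below are assumptions)
USER_AGENT_LIST = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15',
]

DOWNLOADER_MIDDLEWARES = {
    # Register the middleware so Scrapy calls it for every outgoing request
    'myproject.middlewares.random_user_agent.RandomUserAgentMiddleware': 400,
}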
Referer
Referer indicates where a request came from, and it is commonly used to detect image hotlinking. In Scrapy, if a page URL was extracted from a previously crawled page, Scrapy automatically uses that earlier page's URL as the Referer. You can also set the Referer field yourself, in the same way as the User-Agent above.
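As a small illustration of setting the field yourself, a Referer can be passed explicitly through a request's headers; the spider below is a hedged sketch with placeholder URLs and selectors, not code from the original article.

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://example.com/list']

    def parse(self, response):
        for href in response.css('a::attr(href)').getall():
            # Explicitly send the current page as the Referer of the follow-up request
            yield response.follow(href, callback=self.parse_detail,
                                  headers={'Referer': response.url})

    def parse_detail(self, response):
        pass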
Cookies
A website may track how many times the session_id carried in the Cookie is used, and trigger its anti-crawling policy once a limit is exceeded. You can therefore set COOKIES_ENABLED = False in Scrapy so that requests are sent without Cookies.
Some websites, however, force Cookies to be enabled, which is a bit more troublesome. In that case you can write another small crawler that periodically sends Cookie-less requests to the target website, extracts the Set-Cookie field from the response, and saves it. When crawling pages, attach the stored Cookies to the request headers, as sketched below.
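A minimal sketch of that idea as a downloader middleware, assuming the harvested Set-Cookie values are stored one cookie string per line in a file called cookies.txt; the file name and loading logic are illustrative assumptions.

import random

class StoredCookiesMiddleware(object):
    """Attach a previously harvested Cookie header to each outgoing request."""

    def __init__(self, cookies_file='cookies.txt'):
        # One "k1=v1; k2=v2" cookie string per line, refreshed by the helper crawler
        with open(cookies_file) as f:
            self.cookie_pool = [line.strip() for line in f if line.strip()]

    def process_request(self, request, spider):
        if self.cookie_pool:
            request.headers['Cookie'] = random.choice(self.cookie_pool)

With COOKIES_ENABLED = False, Scrapy's own cookie handling stays out of the way and the header set here is sent as-is.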
X-Forwarded-For
Adding an X-Forwarded-For field to the request header declares the client to be a transparent proxy server, and some websites are more lenient toward proxy servers.
The general format of the X-Forwarded-For header is as follows:
X-Forwarded-For: client1, proxy1, proxy2
Here, set client1 and proxy1 to random IP addresses, so that your request looks as if it were forwarded by a proxy on behalf of those random IPs. However, because X-Forwarded-For can be tampered with freely, many websites do not trust this value.
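A hedged sketch of this trick as a downloader middleware; the random-IP generation below is deliberately naive (it may produce reserved addresses) and is only meant to illustrate the idea.

import random

class XForwardedForMiddleware(object):
    def process_request(self, request, spider):
        # Fake a "client, proxy" chain with two random IPv4 addresses
        fake_ips = ['.'.join(str(random.randint(1, 254)) for _ in range(4))
                    for _ in range(2)]
        request.headers['X-Forwarded-For'] = ', '.join(fake_ips)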
Limits on the number of requests per IP
If requests from a single IP arrive too fast, the anti-crawling mechanism is triggered. This can of course be bypassed by slowing down the crawl, at the expense of a much longer crawl time. Another way is to add proxies.
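For the slow-down route, the throttling can be expressed in settings.py; the values below are arbitrary examples, not recommendations from the original article.

# settings.py (example values only)
DOWNLOAD_DELAY = 2               # wait about 2 seconds between requests
RANDOMIZE_DOWNLOAD_DELAY = True  # add jitter so the timing looks less mechanical
AUTOTHROTTLE_ENABLED = True      # let Scrapy adapt the delay to server responsiveness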
Adding a proxy in the downloader middleware is simple:

request.meta['proxy'] = 'http://' + proxy_host + ':' + proxy_port
Then a different proxy IP can be used for each request. The question, however, is how to get a large number of proxy IPs.
You can build a system that harvests and maintains proxy IPs: regularly crawl free proxies from the various websites that publish them, periodically check whether each IP and port is still usable, and promptly remove the dead ones. This gives you a dynamic proxy pool from which a proxy is picked at random for every request, as sketched below. The drawbacks are obvious, though: building and maintaining the harvesting system is time-consuming and laborious, and free proxies are few in number and fairly unstable. If you must use proxies, you can also buy a stable proxy service; most such services use authenticated proxies.
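A minimal sketch of the rotation side, assuming the pool is a plain file of host:port entries maintained by the harvesting system; the file name and loading logic are assumptions for illustration.

import random

class RandomProxyMiddleware(object):
    """Pick a random proxy from a locally maintained pool for every request."""

    def __init__(self, proxies_file='proxies.txt'):
        # One "host:port" entry per line, refreshed periodically by the pool maintainer
        with open(proxies_file) as f:
            self.proxies = [line.strip() for line in f if line.strip()]

    def process_request(self, request, spider):
        if self.proxies:
            request.meta['proxy'] = 'http://' + random.choice(self.proxies)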
It is easy to add an authenticated proxy with the requests library:
proxies = {"http": "http://user:pass@10.10.1.10:3128/"}
However, Scrapy does not support this authentication form directly. You need to base64-encode the credentials and add them to the Proxy-Authorization header:
import base64
from random import choice

# Pick a proxy of the form user:pass@ip:port
# (_get_proxies_from_file is a helper defined elsewhere in the middleware)
proxy_string = choice(self._get_proxies_from_file('proxies.txt'))
proxy_items = proxy_string.split('@')
request.meta['proxy'] = "http://%s" % proxy_items[1]

# Set up basic authentication for the proxy
user_pass = base64.b64encode(proxy_items[0].encode()).decode()
request.headers['Proxy-Authorization'] = 'Basic ' + user_pass
Dynamic loading
More and more websites now use AJAX to load content dynamically. In that case, start by intercepting the AJAX requests and analysing them. It is often possible to construct the URL of the corresponding API request from the AJAX call and fetch the desired content directly, usually in JSON format, without parsing any HTML.
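A hedged sketch of that approach; the API URL and JSON field names below are placeholders standing in for an endpoint discovered by inspecting the site's AJAX traffic, not a real interface.

import json
import scrapy

class ApiSpider(scrapy.Spider):
    name = 'api_example'
    # Hypothetical JSON endpoint found in the browser's network panel
    start_urls = ['https://example.com/api/items?page=1']

    def parse(self, response):
        data = json.loads(response.text)
        for item in data.get('items', []):
            # Yield fields straight from the JSON, no HTML parsing needed
            yield {'title': item.get('title'), 'url': item.get('url')}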
In many cases, however, the AJAX requests are protected by back-end authentication and cannot be replayed simply by constructing the URL. At that point you can simulate browser behaviour with PhantomJS + Selenium and scrape the page after the JavaScript has rendered it.
Note that once Selenium is used, requests are no longer executed by Scrapy's Downloader, so the request headers and other information added earlier become invalid and must be set again in Selenium.
from selenium import webdriver

headers = {...}  # the headers previously added to the Scrapy request
for key, value in headers.items():
    webdriver.DesiredCapabilities.PHANTOMJS[
        'phantomjs.page.customHeaders.{}'.format(key)] = value
In addition, to call PhantomJS you need to specify the path to its executable. This is usually added to the system PATH so the program can find it automatically at run time. But crawlers are often run from crontab, and crontab's environment variables differ from the system shell's, so the PATH entry for PhantomJS may not be loaded. It is therefore best to specify the path explicitly when creating the driver:
driver = webdriver.PhantomJS(executable_path='/usr/local/bin/phantomjs')

That covers how to deal with common anti-crawling mechanisms when using the Scrapy framework. Hopefully, after reading this article, you have a deeper understanding of the topic. Thank you for reading.