This article introduces the common anti-crawler measures you will run into when writing Python crawlers, and how to break through each of them. The content is detailed and easy to understand, the techniques are simple and quick to apply, and they have some real reference value. I believe you will gain something from reading it. Let's take a look.
The most common: Headers-based anti-crawling. I'm sure everyone is familiar with this one. We set headers almost every time we write a crawler, because most websites check the User-Agent and Referer fields of the request headers. This should be relatively easy to break through: just copy the request headers your browser sends during normal browsing into the crawler's request, so that the crawler's headers match the browser's, like so:
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'}
rs = requests.get('http://jianshu.com', headers=headers)
Anti-crawling based on user behavior: this measure is a real headache. What is it exactly? For example, the same IP visits the same page many times in a short period, or your operations on the site are too programmatic (visiting pages at fixed, regular intervals); in other words, the traffic doesn't look like a normal human being. How do we solve this? ① Since one IP can't be used for frequent requests, we just need a lot of IPs, so we can get around it with a pool of proxy IPs, as in the snippet below. ② When visiting, make the interval between requests a random number, imitating a real person's behavior as closely as possible (see the sketch after the proxy example).
import requests

proxies = {
    "http": "http://127.0.0.1:8888",
    "https": "http://127.0.0.11:1080",
}
rs = requests.get(url, proxies=proxies)  # url is the page you want to fetch
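For point ②, here is a minimal sketch of randomizing the request interval. The list of page URLs and the 1-5 second range are illustrative, not from any particular site:

import random
import time

import requests

# hypothetical list of pages to crawl
urls = ['http://jianshu.com/p/page1', 'http://jianshu.com/p/page2']
for url in urls:
    rs = requests.get(url)
    # wait a random 1-5 seconds so the interval doesn't look programmed
    time.sleep(random.uniform(1, 5))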
Anti-crawling based on CAPTCHA: CAPTCHA itself is also an anti-crawler measure, and I'm sure everyone has seen plenty of them by now: arithmetic CAPTCHAs, slider CAPTCHAs, click-the-characters-in-order CAPTCHAs, and so on. Measures like these are genuinely tricky. In a few words: breaking them involves machine learning, or you can hand them off to a paid captcha-solving platform. Get a general feel for this measure first; I will write a dedicated article on breaking CAPTCHAs later, but a small taste is sketched below.
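As a taste of the machine-learning route, here is a minimal sketch that feeds a CAPTCHA image to the pytesseract OCR library. This assumes a plain distorted-text CAPTCHA already saved as captcha.png (an illustrative filename); sliders and click puzzles need the heavier approaches mentioned above:

from PIL import Image
import pytesseract  # pip install pytesseract, plus the Tesseract binary itself

# only works for simple distorted-text CAPTCHAs, not sliders or click puzzles
img = Image.open('captcha.png').convert('L')  # convert to grayscale to reduce noise
text = pytesseract.image_to_string(img)
print(text.strip())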
Anti-crawling with dynamic pages: this technique is quite common. What is a dynamic page? When we crawl data straight out of the HTML, that is a static page, which is very simple. On a dynamic page, though, the data can't be read directly from the page source; it is loaded via Ajax. So we need to analyze the Ajax requests and then simulate them to get the data. However, many websites now make it hard to simulate those requests directly, so instead we break through with selenium plus PhantomJS, driving a real browser. We'll cover that in detail later; a minimal sketch follows.
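A minimal sketch of the selenium route, assuming Chrome and its driver are installed. Note that PhantomJS has since been deprecated, so headless Chrome is used here as a stand-in; the target URL is illustrative:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')  # run the real browser without a window

driver = webdriver.Chrome(options=options)
driver.get('http://jianshu.com')   # illustrative URL
html = driver.page_source          # the page source after Ajax has executed
driver.quit()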
Anti-crawling based on login: some websites are rather stingy and make you register and log in before you can see any content. This is also a nuisance, but not especially difficult: as long as you are patient enough, register a few accounts, log in with them to obtain their cookies, and then attach those cookies to your requests to access the logged-in pages. A sketch follows.
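A minimal sketch of the cookie approach using requests.Session, which stores the cookies set by the login response and sends them automatically on later requests. The login URL and form field names here are hypothetical; inspect the real site's login request to find the actual ones:

import requests

session = requests.Session()

# hypothetical login endpoint and form fields
login_data = {'username': 'my_account', 'password': 'my_password'}
session.post('http://example.com/login', data=login_data)

# the session now carries the login cookies automatically
rs = session.get('http://example.com/members-only-page')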
This is the end of this article on Python's anti-crawler measures. Thank you for reading! I believe you now have a general understanding of the common anti-crawler measures and how to get around them. If you want to learn more, keep following along.