What are the camouflage techniques of Python crawlers?

2025-02-24 Update From: SLTechnology News&Howtos


This article introduces the camouflage techniques used by Python crawlers. Many people run into these situations in real projects, so let the editor walk you through how to handle them. I hope you read it carefully and get something out of it!

1. Browser camouflage

A website server can easily identify the browser a request comes from. Taking the requests library as an example, its default request headers carry no browser information, so the crawler is effectively "streaking" when it talks to the server. We can add a "User-Agent" header to pretend to be a real browser, as shown below:
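A related refinement: a site can also notice a crawler that reuses one User-Agent string for every request, so some crawlers rotate through a small pool of realistic UA strings. A minimal sketch (the UA strings and the helper name `random_headers` are illustrative, not from the original article):

```python
import random

# A small pool of realistic User-Agent strings (illustrative values)
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15',
]

def random_headers():
    """Build a headers dict with a randomly chosen User-Agent."""
    return {'User-Agent': random.choice(USER_AGENTS)}

headers = random_headers()
print(headers['User-Agent'])
```

Each call picks a different UA at random, so repeated requests no longer share one browser fingerprint.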

import requests

# Pretend to be Firefox
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0'}
response = requests.get("http://www.baidu.com", headers=headers)  # send the request with browser headers

2. Access address camouflage

The access address refers to the Referer field in the request headers (HTTP historically misspells "referrer" as "Referer"). What is it for? An example:

Suppose the page https://bj.meituan.com/ contains a link to https://waimai.meituan.com/. When you click that link, the resulting request carries the header Referer: https://bj.meituan.com/.

Servers can use this for anti-hotlinking: for example, an image server can be configured to serve images only when the Referer points at its own website.
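To make the anti-hotlinking idea concrete, here is a minimal sketch of the check such a server might run (the function name and the allow-list are hypothetical, not part of any real Meituan setup):

```python
from urllib.parse import urlparse

# Hypothetical allow-list of hosts permitted to embed our images
ALLOWED_REFERER_HOSTS = {'bj.meituan.com', 'waimai.meituan.com'}

def referer_allowed(referer: str) -> bool:
    """Return True if the Referer header points at an allowed host."""
    if not referer:
        return False  # no Referer at all: reject (or allow, depending on policy)
    return urlparse(referer).hostname in ALLOWED_REFERER_HOSTS

print(referer_allowed('https://bj.meituan.com/'))    # permitted source page -> True
print(referer_allowed('https://evil.example.com/'))  # hotlinking page -> False
```

This is exactly the check that a spoofed Referer header, shown next, slips past.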

We can add a "Referer" header to disguise the access address, as follows:

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0',
    'Referer': 'https://bj.meituan.com/'
}
response = requests.get("https://waimai.meituan.com/", headers=headers)  # request with a spoofed Referer

3. IP address camouflage

Most anti-crawler policies decide whether a visitor is a web crawler based on the behavior of a single IP. For example, an anti-crawler system will block an IP when it detects a large number of visits from it, or when its access frequency is very high. In that case we have to use proxy IPs to break through the anti-crawler mechanism and crawl data more stably. The Python code for adding a proxy IP is as follows:

import requests

proxies = {'https': '101.236.54.97:8866'}  # example proxy address
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0',
    'Referer': 'https://bj.meituan.com/'
}
response = requests.get("https://waimai.meituan.com/", headers=headers, proxies=proxies)  # route the request through the proxy

Free proxy IPs can be found on the Internet, but they are not very stable; you can also pay for more stable ones.
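A simple way to reduce dependence on any one unstable proxy is to rotate through a pool. A minimal sketch (the addresses below are placeholders, not working proxies, and `random_proxies` is a made-up helper name):

```python
import random

# Placeholder proxy addresses; replace with proxies you actually have access to
PROXY_POOL = [
    '101.236.54.97:8866',
    '101.236.54.98:8866',
    '101.236.54.99:8866',
]

def random_proxies():
    """Pick one proxy at random, in the dict form that requests expects."""
    addr = random.choice(PROXY_POOL)
    return {'https': addr}

proxies = random_proxies()
print(proxies)
```

Calling `random_proxies()` before each request spreads the traffic across the pool, so no single IP accumulates enough visits to get banned.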

4. Camouflage access rate

Real users browse at a steady, moderate rate and do not fire off many requests in a short time, so we should disguise the crawler as a real user so that the anti-crawler mechanism does not notice it. The main technique is to control the access frequency by inserting a random pause between requests. The code is as follows:
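The randomness comes from random.uniform(a, b), which returns a float drawn uniformly from [a, b]; that is what makes the pause lengths unpredictable. A quick standalone check (the bounds 1.1 and 5.4 are just example values):

```python
import random

def random_delay(low=1.1, high=5.4):
    """Draw a random pause length in seconds to sleep between requests."""
    return random.uniform(low, high)

delays = [random_delay() for _ in range(5)]
print(delays)  # five different pause lengths, each between 1.1 and 5.4
```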

import requests
import time, random

proxies = {'https': '101.236.54.97:8866'}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0',
    'Referer': 'https://bj.meituan.com/'
}
for i in range(10):
    response = requests.get("https://waimai.meituan.com/", headers=headers, proxies=proxies)
    time.sleep(random.uniform(1.1, 5.4))  # pause a random 1.1-5.4 seconds between requests

5. Disguise the user's real information

Some web pages only show data after you log in, and the cookie value carries your personal login information. Adding the cookie value to the crawler avoids the trouble of logging in programmatically, for example on sites such as Zhihu and JD.com. Add it as follows:
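A convenient way to obtain the cookie value is to copy it from the browser's developer tools as a single "k1=v1; k2=v2" header string. Here is a small helper that turns such a string into the dict that requests expects (a sketch; the cookie names and values below are made up):

```python
def cookie_string_to_dict(cookie_str: str) -> dict:
    """Parse a 'k1=v1; k2=v2' cookie header string into a dict."""
    cookies = {}
    for pair in cookie_str.split(';'):
        pair = pair.strip()
        if '=' in pair:
            key, _, value = pair.partition('=')
            cookies[key.strip()] = value.strip()
    return cookies

# Example string as copied from browser devtools (values are made up)
raw = 'sessionid=abc123; csrftoken=xyz789'
print(cookie_string_to_dict(raw))  # {'sessionid': 'abc123', 'csrftoken': 'xyz789'}
```

The resulting dict can be passed straight to the cookies= parameter of requests.get.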

import requests

proxies = {'https': '101.236.54.97:8866'}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0',
    'Referer': 'https://bj.meituan.com/'
}
cookies = {}  # fill in with the cookie values copied from your logged-in browser session
response = requests.get("https://waimai.meituan.com/", headers=headers, proxies=proxies, cookies=cookies)

This is the end of "What are the camouflage techniques of Python crawlers?". Thank you for reading. If you want to learn more about the industry, you can follow the site, where the editor will keep publishing high-quality practical articles for you!
