2025-02-24 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/03 Report--
This article introduces common anti-crawling measures and how to handle them in Python. Many people run into these obstacles in real projects, so let the editor walk you through how to deal with each situation. I hope you read it carefully and get something out of it!
1. Anti-crawling technique: Headers
Checking a request's Headers is the most common anti-crawler strategy. Headers (mentioned in the previous lecture) are the easiest way to distinguish browser behavior from machine behavior, and some websites also check the Referer header (the referring page) to detect crawlers, since a scripted request usually does not arrive via a link jump.
Solution: copy the appropriate Headers from the browser's inspect-element / developer tools and pass them to Python's requests, so the check can be bypassed cleanly.
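As a minimal sketch of this approach (the URL, User-Agent, and Referer values below are illustrative placeholders; in practice you would copy the real values from the developer tools):

```python
import requests

# Hypothetical target page -- replace with the site you are scraping.
url = "https://example.com/page"

# Header values copied from the browser's developer tools (Network tab).
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Referer": "https://example.com/",
}

# Passing the headers makes the request look like a normal browser visit:
# resp = requests.get(url, headers=headers)
```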
2. Anti-crawling technique: IP restrictions
Some websites block crawlers based on how frequently, and how many times, a given IP address visits. In other words, if you access the site too often from a single IP, the server will disable access from that IP for a period of time.
Solution: build your own pool of proxy IPs and randomly select a proxy on each visit (note that some proxy IPs are unstable, so the pool needs to be checked and refreshed frequently).
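A sketch of such a proxy pool (the addresses below are hypothetical placeholders; in practice the pool would be filled from a proxy provider or a validated free-proxy list):

```python
import random
import requests

# Hypothetical pool of proxy addresses -- refresh and re-validate it often,
# since free proxies in particular go stale quickly.
PROXY_POOL = [
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
    "http://10.0.0.3:8080",
]

def fetch_with_random_proxy(url):
    """Send a GET request through a randomly chosen proxy from the pool."""
    proxy = random.choice(PROXY_POOL)
    proxies = {"http": proxy, "https": proxy}
    return requests.get(url, proxies=proxies, timeout=10)
```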
3. Anti-crawling technique: UA restrictions
The UA (User-Agent) identifies the browser a visitor uses when accessing a website; the anti-crawling mechanism here is similar to IP restrictions.
Solution: build your own UA pool and attach a randomly chosen User-Agent to each requests call, to better simulate browser behavior. If the site also limits request frequency over time, set a timeout in requests and sleep for a random interval with time.sleep(); this is safer and more stable.
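A sketch combining both ideas, assuming a small hand-made UA pool (the UA strings are illustrative; use real, current values in practice):

```python
import random
import time
import requests

# A small, hypothetical pool of User-Agent strings; extend with real values.
UA_POOL = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
]

def polite_get(url):
    """GET with a random User-Agent, then pause a random interval."""
    headers = {"User-Agent": random.choice(UA_POOL)}
    resp = requests.get(url, headers=headers, timeout=10)
    time.sleep(random.uniform(1, 3))  # random pause to mimic a human visitor
    return resp
```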
4. Anti-crawling technique: CAPTCHAs and simulated login
CAPTCHA: this method is old but still effective. If a crawler wants to parse the contents of a CAPTCHA, simple image recognition used to be enough, but nowadays CAPTCHAs carry many interference lines, and some are difficult even for humans to recognize.
For simulated login, post the form data through a requests session so the login cookie is kept for later requests:
import requests

s = requests.session()
login_data = {"account": "", "password": ""}
res = s.post("http://mail.163.com/", data=login_data)
5. Anti-crawling technique: Ajax dynamic loading
Data that a page does not want crawlers to obtain can be loaded dynamically with Ajax, which causes crawlers a great deal of trouble. A crawler may have no JS engine; or have a JS engine but no way to handle what the JS returns; or have a JS engine but no way to convince the site that scripts are enabled. Under any of these conditions, Ajax dynamic loading is quite effective against crawlers.
Ajax dynamic loading works as follows: after the page source is loaded from the page's URL, JavaScript programs run in the browser, load more content, and inject it into the page. This is why some pages contain no data when you crawl their URL directly.
Method: use the developer tools to find the underlying request (right-click → Inspect → Network → Clear, click "load more", locate the GET request whose Type is text/html, click it, and view the GET parameters or copy the Request URL), then repeat that process in a loop. If the request carries a page parameter, infer what page 1 looks like from the URL found in the previous step, and so on, and grab the data from the Ajax address. Parse the returned JSON with requests' .json() method (or the json module) and work with the resulting dictionary; avoid eval(), which executes arbitrary code and cannot handle JSON literals such as true and null. (The Fiddler tool from the previous lecture can format JSON output.)
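A sketch of the fetch-and-parse loop described above (the endpoint URL and the "page" query parameter are assumptions about a hypothetical API; substitute the Request URL you found in the Network tab):

```python
import json
import requests

# Hypothetical Ajax endpoint discovered via the developer tools' Network tab.
AJAX_URL = "https://example.com/api/items"

def fetch_page(page):
    """Fetch one page of Ajax-loaded data and parse it as JSON."""
    resp = requests.get(AJAX_URL, params={"page": page}, timeout=10)
    # resp.json() (json.loads under the hood) is the safe way to parse;
    # eval() would execute arbitrary code and fails on true/false/null.
    return resp.json()

# json.loads turns raw JSON text into a Python dict:
sample = json.loads('{"items": [1, 2], "has_more": true}')
```

Looping `fetch_page` over the page numbers then collects the same data the browser would have loaded incrementally.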
6. Anti-crawling technique: Cookie restrictions
Solution: attach the corresponding cookie to the Headers, or construct one the way the site does (for example, by selecting a few characters from an existing value). If the cookie logic is too complex, consider the selenium module, which drives a real browser and therefore simulates browser behavior completely.
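A sketch of sending a copied cookie with requests (the cookie names and values below are placeholders; copy the real ones from the browser's developer tools for the target site):

```python
import requests

# Hypothetical cookie names/values copied from the browser's developer tools.
cookies = {"sessionid": "abc123", "csrftoken": "xyz789"}
headers = {"User-Agent": "Mozilla/5.0"}

# requests sends them along with the request:
# resp = requests.get("https://example.com/protected",
#                     headers=headers, cookies=cookies)

# Alternatively, a Session keeps cookies across requests automatically:
session = requests.Session()
session.cookies.update(cookies)
```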
That is the end of "sharing common anti-crawling measures in Python". Thank you for reading. If you want to learn more about the industry, you can follow the site, where the editor will keep publishing high-quality practical articles for you!