What are the Python anti-crawler knowledge points?
This article introduces the main knowledge points behind Python anti-crawler techniques. The content is detailed, easy to understand, and quick to put into practice, so it has some reference value; I believe you will gain something from reading it. Let's take a look.
First, why do we need anti-crawler measures?
Before designing an anti-crawler system, let's take a look at what problems crawlers will bring to the site.
In essence, a website and the data on it that people can browse, view, and use on the Internet are open and allowed to be obtained, so there is no so-called "unauthorized access" problem.
There is no essential difference between a crawler visiting a web page and a human visiting one: the client initiates an HTTP request to the web server, and the server receives the request and returns the response content to the client.
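To make this concrete, here is a minimal sketch using the third-party requests library (the URL is a placeholder); at the protocol level this is all a crawler does, and it looks just like a browser's request:

```python
# A minimal sketch: a crawler's HTTP request is, at the protocol level,
# the same request a browser makes. Requires the third-party "requests"
# library; the URL is a placeholder.
import requests

response = requests.get("https://example.com/some-page")
print(response.status_code)   # e.g. 200 if the server responded normally
print(response.text[:200])    # first 200 characters of the returned HTML
```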
As long as a request is initiated, the website server is bound to respond, and responding inevitably consumes the server's resources.
There is a mutually beneficial relationship between a website and its visitors: the website provides visitors with the information and services they need, and visitors in turn bring the website traffic and activity. So the owner of the site is willing to spend the server's bandwidth, disk, and memory to serve visitors.
What about crawlers? They are freeloaders. They multiply the consumption of website server resources and occupy server bandwidth, yet bring no benefit to the website; in the end, the result may even be harmful to the site itself.
Crawlers might be regarded as the hyenas of the Internet; no wonder site owners hate them.
Second, identifying crawlers
Since crawlers are so disliked, we want to keep them out of the website. To deny a crawler access, we must of course first identify it among the site's visitors. How?
1. HTTP request header
This is the most basic way to recognize a web crawler. Normal visitors browse a site through a browser, and browsers attach request headers that describe their basic information. It is also the identification method most easily defeated by crawlers, because anyone can modify and forge HTTP request headers.
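As an illustration, here is a minimal sketch assuming Flask as the web framework; the blocklist of User-Agent substrings is a made-up example, not an exhaustive list:

```python
# A minimal sketch of header-based identification using Flask.
# The blocklist below is illustrative: since any client can forge
# its User-Agent, this check is easily bypassed.
from flask import Flask, request, abort

app = Flask(__name__)

SUSPICIOUS_AGENTS = ("python-requests", "scrapy", "curl")

@app.before_request
def check_user_agent():
    ua = request.headers.get("User-Agent", "").lower()
    if not ua or any(bot in ua for bot in SUSPICIOUS_AGENTS):
        abort(403)  # refuse clients with missing or crawler-like headers

@app.route("/")
def index():
    return "Hello, human visitor!"
```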
2. Cookie value
Cookies are usually used to identify a website's visitors, like a temporary credential in hand that the website server checks to verify identity. Unfortunately, cookies are stored on the client side, so they too can be modified and forged.
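One common defense, sketched minimally below with Python's standard hmac module (the secret key and visitor id are placeholders), is to sign cookie values so that forgeries can at least be detected; note a crawler can still replay a cookie it was legitimately issued:

```python
# A minimal sketch of a tamper-evident cookie value using HMAC.
# SECRET_KEY is a placeholder; in practice it must be kept private.
import hmac, hashlib

SECRET_KEY = b"replace-with-a-real-secret"

def sign_visitor_id(visitor_id: str) -> str:
    sig = hmac.new(SECRET_KEY, visitor_id.encode(), hashlib.sha256).hexdigest()
    return f"{visitor_id}.{sig}"

def verify_cookie(cookie_value: str) -> bool:
    try:
        visitor_id, sig = cookie_value.rsplit(".", 1)
    except ValueError:
        return False  # malformed cookie
    expected = hmac.new(SECRET_KEY, visitor_id.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected)

cookie = sign_visitor_id("visitor-42")
print(verify_cookie(cookie))             # True
print(verify_cookie("visitor-42.fake"))  # False: forged signature
```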
3. Access frequency
If a visitor requests a page of the site every second, or even hundreds of times a second, that visitor is either a crawler or a ghost. What human being could click the mouse that quickly and frequently to visit a page? Someone with Parkinson's, or a reincarnated octopus?
Identifying crawlers by access frequency is feasible, but a crawler can use a large pool of proxy IPs so that each IP address appears only once, or evade detection with randomized request intervals.
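Here is a minimal sliding-window sketch of frequency-based identification; the window size, threshold, and in-memory storage are illustrative assumptions (production systems typically persist counts in something like Redis):

```python
# A minimal sliding-window frequency check, keyed by IP.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 10   # illustrative window
MAX_REQUESTS = 20     # illustrative threshold

_hits = defaultdict(deque)  # ip -> timestamps of recent requests

def looks_like_crawler(ip: str, now=None) -> bool:
    now = time.time() if now is None else now
    hits = _hits[ip]
    hits.append(now)
    while hits and hits[0] < now - WINDOW_SECONDS:
        hits.popleft()  # drop requests that fell out of the window
    return len(hits) > MAX_REQUESTS
```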
4. Mouse behavior track
Normal human visitors do not move and click the mouse mechanically, like a machine, while browsing the web. Mouse movements and clicks can be captured by JS scripts, so you can judge whether a visitor is a crawler from their mouse behavior trajectory.
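The capture itself happens in front-end JS; as a toy illustration of the server-side judgment (the rule and threshold are my own assumptions, not a standard detector), one could flag trajectories whose timing is suspiciously regular:

```python
# A toy heuristic over mouse points captured by JS as (x, y, timestamp)
# triples. Humans move the mouse with irregular timing; near-zero
# variance in the intervals suggests a scripted trajectory.
import statistics

def looks_robotic(points) -> bool:
    if len(points) < 5:
        return True  # too little movement to look human
    intervals = [b[2] - a[2] for a, b in zip(points, points[1:])]
    return statistics.pstdev(intervals) < 1e-3

human = [(i, i * 1.1, t) for i, t in enumerate([0.0, 0.13, 0.21, 0.38, 0.44, 0.61])]
bot = [(i, i, i * 0.05) for i in range(6)]  # perfectly even 50 ms steps
print(looks_robotic(human))  # False
print(looks_robotic(bot))    # True
```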
5. Token value
At present, many websites are developed with the front end and back end separated: the back-end interface returns data to the front end, which combines it with the page and renders it. So many crawlers request the data interfaces directly instead of foolishly requesting pages. Tokens are used to validate these back-end data interfaces; a token is usually produced by encrypting a combination of a secret key, a timestamp, and some data from the web page.
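A minimal sketch of such a token scheme, assuming an HMAC over a secret key, a timestamp, and some page data (all names and the 60-second lifetime are illustrative):

```python
# A minimal sketch: the interface rejects stale or mismatched tokens.
import hmac, hashlib, time

SECRET_KEY = b"interface-secret"  # placeholder
TOKEN_TTL = 60                    # seconds a token stays valid

def make_token(page_data: str, ts=None):
    ts = int(time.time()) if ts is None else ts
    msg = f"{ts}:{page_data}".encode()
    return hmac.new(SECRET_KEY, msg, hashlib.sha256).hexdigest(), ts

def check_token(token: str, ts: int, page_data: str) -> bool:
    if time.time() - ts > TOKEN_TTL:
        return False  # expired
    expected, _ = make_token(page_data, ts)
    return hmac.compare_digest(token, expected)

token, ts = make_token("article-id=123")
print(check_token(token, ts, "article-id=123"))  # True
print(check_token(token, ts, "article-id=999"))  # False: data mismatch
```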
There are more ways to identify crawlers, which I will not introduce one by one. Unfortunately, any of the identification methods above can be bypassed and broken by a crawler.
Third, refusing crawlers
Just as there is no once-and-for-all website security: a decade ago, closing port 3389 was enough to keep a server from becoming a zombie host; now, with firewalls and all kinds of security measures added, it is still possible to be held to ransom through a 0-day vulnerability.
Between crawlers and anti-crawlers there is a constant struggle and escalation. The difference is that network attack and defense is no-holds-barred fighting, while anti-crawling is Olympic boxing with gloves and headgear.
In order to operate, a website must open its content to the public, and open content is like the smell of carrion and blood drifting over the African savanna, attracting the hyenas.
It is difficult to strike a balance between opening up content and avoiding becoming a data pool for crawlers to mine.
1. On content: limit how open content is
Open content is the basis for attracting users and traffic, so content must be open, but that openness need not be unlimited. Unregistered users can see one or two articles, but not unlimited content. The restriction can require logging in, scanning a QR code, or passing a click-to-verify mechanism such as Google's reCAPTCHA.
More and more websites have adopted this limited-openness mechanism, such as Weibo, Zhihu, and Taobao: you can see one or two pages of content, but if you want to continue, please log in.
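A minimal sketch of this "a couple of free articles, then log in" quota, with the quota size and in-memory storage as illustrative assumptions:

```python
# A minimal sketch: unregistered visitors get a small free quota,
# keyed by a per-visitor cookie id.
FREE_ARTICLES = 2
_reads = {}  # visitor_id -> articles read so far

def may_read(visitor_id: str, logged_in: bool) -> bool:
    if logged_in:
        return True
    _reads[visitor_id] = _reads.get(visitor_id, 0) + 1
    return _reads[visitor_id] <= FREE_ARTICLES

print(may_read("v1", False))  # True  (1st free article)
print(may_read("v1", False))  # True  (2nd)
print(may_read("v1", False))  # False (quota used up; ask to log in)
```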
2. On behavior: record user actions
Requiring visitors to log in does not solve the problem, because simulated login has always been a hot branch of web crawler development, whether the obstacle is an image CAPTCHA, a jigsaw puzzle, a slider, or clicking Chinese characters. Even SMS verification codes can be relayed between the crawler and the website by writing an app.
Therefore, recording user behavior is essential. All user operations and access behaviors need to be recorded; this is the basis for analyzing and dealing with crawlers.
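As a minimal sketch of such recording, using Python's standard logging module (the field names and log destination are assumptions; real systems would write to a database or log pipeline):

```python
# A minimal sketch of recording access behavior for later analysis.
import logging, time

logging.basicConfig(filename="access_behavior.log", level=logging.INFO,
                    format="%(message)s")

def record_access(user: str, ip: str, path: str, action: str) -> None:
    # One line per event: timestamp plus who did what, where, from which IP.
    logging.info("%s user=%s ip=%s path=%s action=%s",
                 int(time.time()), user, ip, path, action)

record_access("anonymous", "203.0.113.7", "/article/42", "view")
```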
3. On control: crack down hard on high-frequency behavior
From a practical point of view, many crawlers are run not to strip a website of its data and content, but merely to make manual collection and organization more convenient. Such crawlers generally act at a higher frequency than manual browsing but markedly lower than hyena-like high-frequency crawlers, and their behavior can be ignored. Leave them a way out, and you may meet again on good terms.
However, for high-frequency crawler behavior that affects the operation of the website server, measures must be taken: combine user and IP information and deal with the offending users or IPs.
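A minimal sketch of acting on offenders by both user and IP (ban duration, key format, and in-memory storage are illustrative assumptions):

```python
# A minimal sketch: offenders found by the frequency check are banned
# by user *and* IP, so switching one identifier is not enough.
import time

BAN_SECONDS = 3600       # illustrative one-hour ban
_banned = {}             # key -> ban expiry timestamp

def ban(user: str, ip: str) -> None:
    expiry = time.time() + BAN_SECONDS
    _banned[f"user:{user}"] = expiry
    _banned[f"ip:{ip}"] = expiry

def is_banned(user: str, ip: str) -> bool:
    now = time.time()
    return any(_banned.get(k, 0) > now for k in (f"user:{user}", f"ip:{ip}"))

ban("scraper01", "198.51.100.9")
print(is_banned("scraper01", "203.0.113.5"))  # True: user still banned
print(is_banned("someone", "198.51.100.9"))   # True: IP banned
print(is_banned("someone", "203.0.113.5"))    # False
```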
4. On agreements: state your rights
The site owner should declare in the site agreement or user agreement that normal browsing, access, and data acquisition are allowed, and that any abnormal, high-frequency behavior that threatens the stability of the website server will be dealt with further.
This concludes the article on Python anti-crawler knowledge points. Thank you for reading! I hope it has given you a general understanding of the topic.