2025-02-25 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/01 Report--
Today I will talk about common website anti-crawling strategies and their solutions in big data collection. Many people may not know much about this, so I have summarized the following; I hope you gain something from this article.
In the process of collecting data, we often run into website anti-crawling measures, and different websites use different anti-crawling strategies. Today I have summarized several anti-crawling strategies we often encounter, along with their solutions.
The principle of website anti-crawling is that the server identifies visitors through information carried in their requests and restricts them accordingly. For example, the server identifies the visitor's IP from the request and limits the access frequency of that IP; when the same IP exceeds the limit, requests start to fail. The ForeSpider data collection engine provides settings that counter several common anti-crawling strategies, so users can configure it for whichever measures a site uses. The common strategies and their solutions are as follows:
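To make the frequency-limit principle concrete, here is a minimal client-side throttle sketch (not part of ForeSpider; the class name and interval are my own illustration) that keeps a crawler under a per-IP request rate:

```python
import time

class Throttle:
    """Client-side throttle: enforce a minimum interval between requests
    so the crawler stays under a site's per-IP frequency limit."""

    def __init__(self, min_interval: float):
        self.min_interval = min_interval  # seconds between requests
        self.last_request = 0.0

    def wait(self) -> float:
        """Sleep just long enough to respect the interval; return the delay used."""
        now = time.monotonic()
        delay = max(0.0, self.min_interval - (now - self.last_request))
        if delay > 0:
            time.sleep(delay)
        self.last_request = time.monotonic()
        return delay

throttle = Throttle(min_interval=0.05)  # at most ~20 requests/second
delays = [throttle.wait() for _ in range(3)]
```

Call `throttle.wait()` before each request; the first call returns immediately and later calls sleep off the remainder of the interval.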
-01-Limit user IP access frequency
It usually appears as:
When the local IP collects faster than a certain frequency, collection errors, page redirections, and so on occur. When visitor IP information is also stored in cookies, crawling becomes more difficult.
Solution:
1. When the IP is not recorded in cookies: use dynamic short-term proxy IPs or a tunnel proxy, adjust the collection speed to the site's IP-frequency limit, purchase an appropriate number of proxy IPs, and configure them under ForeSpider's IP proxy settings.
2. When the IP is recorded in cookies: use static long-term proxy IPs, adjust the collection speed to the site's IP-frequency limit, purchase an appropriate number of proxy IPs, and configure them under the ForeSpider data collection system's IP proxy settings.
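The core of the dynamic-proxy solution is rotating outbound IPs. As a rough sketch of the idea (the proxy addresses are hypothetical placeholders, and ForeSpider does this internally through its IP proxy settings):

```python
from itertools import cycle

# Hypothetical pool of short-term proxy IPs purchased from a provider.
PROXY_POOL = [
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
    "http://10.0.0.3:8080",
]

proxy_iter = cycle(PROXY_POOL)

def next_proxy() -> dict:
    """Return a proxies mapping, rotating through the pool so that
    consecutive requests leave from different IPs."""
    proxy = next(proxy_iter)
    return {"http": proxy, "https": proxy}

first = next_proxy()
second = next_proxy()
```

Each request is sent through `next_proxy()`, so no single IP ever exceeds the site's frequency limit on its own.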
-02-Limit user ID access frequency
It usually appears as:
After collecting for a while, collection stops or errors occur, and the page cannot be displayed in a browser (page redirection, CAPTCHA, error page, etc.). After clearing the browser's history, the page displays normally again.
In this case, you can confirm whether the server restricts user IDs by inspecting the page's cookies. When the cookie of the visited page contains a UID or another ID string, the server is identifying the user ID. The UID may also be encrypted, in which case the cookie contains an encrypted string instead.
Solution:
Use the multi-channel collection function in ForeSpider's advanced settings: set the maximum number of login users and configure proxy IPs (static long-term proxy IPs). Simulating multiple users browsing the site works around the ID restriction.
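The essence of multi-channel collection is that each simulated user keeps its own cookies, so the server hands each channel its own UID instead of throttling one. A minimal stdlib sketch of this idea (channel count is arbitrary; ForeSpider manages this for you):

```python
from http.cookiejar import CookieJar
from urllib.request import build_opener, HTTPCookieProcessor

def make_channel_openers(n: int) -> list:
    """One URL opener per simulated user: each gets its own CookieJar,
    so UID cookies set by the server never mix between channels."""
    openers = []
    for _ in range(n):
        jar = CookieJar()  # independent cookie store for this channel
        openers.append(build_opener(HTTPCookieProcessor(jar)))
    return openers

openers = make_channel_openers(3)
```

Requests are then spread across `openers`, so no single UID exceeds the per-ID frequency limit.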
-03- IP&ID double qualification
It usually appears as:
After collecting for a while, collection stops or errors occur, and the page cannot be displayed in a browser (page redirection, CAPTCHA, error page, etc.). After clearing the browser's history, the page displays normally again. In addition, a crawler set to multi-channel collection finds its IP blocked after a period of time. You can also confirm this case by checking whether the page's cookies contain both IP information and a UID (plain or encrypted).
Solution:
Use the multi-channel collection function in ForeSpider's advanced settings, turn on the dynamic IP lock, configure proxy IPs (static long-term proxy IPs), and set the maximum number of login users to work around the site's combined restriction.
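When IP and ID are checked together, each simulated user must always appear from the same IP, which is what the "dynamic IP lock" amounts to. A sketch of the pairing logic (channel names and proxy addresses are hypothetical):

```python
# Each simulated user (channel) is pinned to one static long-term proxy,
# so the server always sees a consistent (IP, UID) pair per channel.
STATIC_PROXIES = ["http://198.51.100.1:3128", "http://198.51.100.2:3128"]

def bind_channels(channels: list, proxies: list) -> dict:
    """Map each channel id to a fixed proxy; fail if there are not
    enough static proxies to give every channel its own IP."""
    if len(proxies) < len(channels):
        raise ValueError("need one static proxy per channel")
    return {ch: proxies[i] for i, ch in enumerate(channels)}

binding = bind_channels(["user_a", "user_b"], STATIC_PROXIES)
```

All requests for `user_a` then go through `binding["user_a"]`, never through a rotated IP, which would trip the IP/ID consistency check.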
-04-Limit user account access frequency
It usually appears as:
The website requires login, and the account used for collection gets blocked after logging in. This is generally because the server identifies the user account and limits its access frequency.
Solution:
Register multiple accounts and switch to a new one after an account is blocked.
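The rotate-on-block idea above can be sketched as a small account pool (class and account names are my own illustration, not a ForeSpider API):

```python
class AccountPool:
    """Hand out registered accounts; when one is blocked, mark it
    and move on to the next live account."""

    def __init__(self, accounts: list):
        self.accounts = list(accounts)
        self.blocked = set()
        self.index = 0

    def current(self) -> str:
        """Return the first unblocked account, scanning the pool at most once."""
        for _ in range(len(self.accounts)):
            acct = self.accounts[self.index % len(self.accounts)]
            if acct not in self.blocked:
                return acct
            self.index += 1
        raise RuntimeError("all accounts blocked - register more")

    def mark_blocked(self, acct: str) -> None:
        self.blocked.add(acct)
        self.index += 1

pool = AccountPool(["acct1", "acct2", "acct3"])
active = pool.current()
pool.mark_blocked(active)
replacement = pool.current()
```

The crawler logs in with `pool.current()` and calls `mark_blocked` whenever the site bans that account.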
-05-User account & user IP double restriction
It usually appears as:
The website requires login; after logging in, the collection account is blocked and the IP is blocked as well. Neither multi-channel collection nor proxy IPs alone help. This is caused by the server restricting both user accounts and access IPs.
Solution:
Register multiple accounts and switch after one is blocked; reduce the collection speed and collect through static long-term proxy IPs.
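Under the dual restriction, an account and its static proxy must be treated as one identity and retired together when blocked. A sketch of that pairing (account names and proxy addresses are hypothetical):

```python
# Hypothetical identities: each account is permanently paired with one
# static long-term proxy; a block retires the whole pair at once.
IDENTITIES = [
    {"account": "acct1", "proxy": "http://203.0.113.1:3128"},
    {"account": "acct2", "proxy": "http://203.0.113.2:3128"},
]

def pick_identity(identities: list, blocked_accounts: set) -> dict:
    """Return the first identity whose account is still usable;
    blocked accounts take their paired proxy IP out of use with them."""
    for ident in identities:
        if ident["account"] not in blocked_accounts:
            return ident
    raise RuntimeError("no usable account/proxy pairs left")

ident = pick_identity(IDENTITIES, blocked_accounts={"acct1"})
```

Combined with a reduced collection speed, this keeps each surviving (account, IP) pair under both frequency limits.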
After reading the above, do you have a better understanding of common website anti-crawling strategies and their solutions in big data collection? If you want to learn more, please follow the industry information channel. Thank you for your support.
© 2024 shulou.com SLNews company. All rights reserved.