2025-01-18 Update From: SLTechnology News&Howtos > Database
Shulou(Shulou.com)05/31 Report--
This article covers what to do when a crawler's IP gets blocked. Many people run into this problem in real projects, so let's walk through how to handle these situations. I hope you find it useful!
Anti-crawling measures are an unavoidable part of crawler work, so crawling becomes an ongoing contest between attack and defense. Below are six ways to deal with IP restrictions when running a crawler.
Method 1.
1. You will need IP addresses; if possible, use proxy IPs.
2. Deploy a crawler proxy server on a machine with a public (external) IP.
3. Have your program rotate (round-robin) through the proxy servers when visiting the site you want to collect.
Benefits:
1. Minimal changes to program logic: you only need to add proxy support.
2. If the target site tightens its blocking rules, you just add more proxies.
3. Even if a specific IP gets blocked, you can simply take that proxy server offline; the program logic does not need to change.
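The rotation in step 3 can be sketched as a simple round-robin over a proxy pool. This is a minimal sketch; the proxy addresses and the `fetch` helper are hypothetical placeholders for your own pool, using only the standard library.

```python
import itertools
import urllib.request

# Hypothetical proxy endpoints -- replace with your own proxy pool.
PROXIES = [
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
    "http://10.0.0.3:8080",
]

# Round-robin iterator: each call to next() yields the next proxy in the pool.
_proxy_cycle = itertools.cycle(PROXIES)

def next_proxy():
    """Return the next proxy address in rotation."""
    return next(_proxy_cycle)

def fetch(url, timeout=10):
    """Fetch a URL through the next proxy in the rotation."""
    proxy = next_proxy()
    # Route both http and https traffic through the chosen proxy.
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    opener = urllib.request.build_opener(handler)
    return opener.open(url, timeout=timeout)
```

If a request fails because a proxy was blocked, the caller can simply retry, which naturally moves on to the next proxy in the cycle.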
Method 2.
1. ADSL redialing plus a script that monitors whether you have been blocked and keeps switching the IP.
2. The orthodox approach: respect the site's query frequency limits, and call the official service API if the site provides one.
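The frequency limit in step 2 amounts to enforcing a minimum interval between requests. A minimal sketch (the `RateLimiter` class and its interval are my own illustration, not from the article):

```python
import time

class RateLimiter:
    """Enforce a minimum interval between successive requests to one site."""

    def __init__(self, min_interval):
        self.min_interval = min_interval  # seconds between requests
        self._last = 0.0

    def wait(self):
        """Sleep just long enough to respect the interval, then record the call."""
        now = time.monotonic()
        remaining = self.min_interval - (now - self._last)
        if remaining > 0:
            time.sleep(remaining)
        self._last = time.monotonic()

# Usage: call limiter.wait() before each request.
# limiter = RateLimiter(2.0)   # at most one request every 2 seconds
```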
Method 3.
1. Disguise and rotate the User-Agent header.
2. Use rotating proxy IPs from a cloud proxy provider.
3. Handle cookies carefully: some sites apply looser limits to logged-in users.
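Points 1 and 3 can be combined: rotate the User-Agent on each request while a cookie jar preserves the login session. A standard-library sketch; the User-Agent strings are example values, not an authoritative list:

```python
import http.cookiejar
import random
import urllib.request

# A small pool of browser-like User-Agent strings (example values).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

# The CookieJar keeps session cookies across requests, so a logged-in
# session survives even while the User-Agent rotates.
cookie_jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie_jar))

def build_request(url):
    """Build a request carrying a randomly chosen User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": random.choice(USER_AGENTS)})

# Usage: opener.open(build_request("https://example.com/page"))
```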
Method 4.
Simulate real user behavior as much as possible:
1. Change the User-Agent frequently.
2. Use longer intervals between requests, and randomize the access times.
3. Visit pages in a random order as well.
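Points 2 and 3 above can be sketched with two small helpers: a randomized delay and a shuffled visit order. The function names and default timings are my own illustration:

```python
import random
import time

def polite_delay(base=5.0, jitter=5.0):
    """Sleep between base and base+jitter seconds, so intervals are not uniform."""
    delay = base + random.uniform(0, jitter)
    time.sleep(delay)
    return delay

def shuffled(urls):
    """Return the pages in a random order instead of a predictable sequence."""
    order = list(urls)
    random.shuffle(order)
    return order

# Usage sketch:
# for url in shuffled(page_urls):
#     polite_delay()
#     ... fetch url ...
```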
Method 5.
Sites generally block based on the number of visits from a specific IP per unit of time. Group your crawl tasks by the target site, and control how many requests go to each target per unit of time, and you can avoid being blocked. This assumes you are collecting from many sites; if you are only crawling a single site, you can only achieve the same effect with multiple external IPs.
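The grouping step can be sketched by bucketing URLs by target hostname, so each bucket's request rate can then be throttled independently. The `group_by_host` helper is a hypothetical illustration:

```python
from collections import defaultdict
from urllib.parse import urlparse

def group_by_host(urls):
    """Bucket URLs by target hostname; throttle each bucket independently."""
    groups = defaultdict(list)
    for url in urls:
        groups[urlparse(url).hostname].append(url)
    return dict(groups)
```

With the tasks grouped, a scheduler can interleave requests across buckets so that no single target sees a burst of traffic from one IP.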
Method 6.
Control the pressure your crawler puts on the target site, and consider using proxies to access it:
1. Reduce the crawl frequency: use longer, randomized intervals between requests.
2. Switch the User-Agent frequently (to simulate browser access).
3. For multi-page data, fetch the pages in random order.
4. Change the client IP: the most direct and effective approach!
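The four points above can be tied together in one per-request plan: a random proxy, a random User-Agent, and a randomized delay. The pools and the `request_plan` helper are hypothetical placeholders for your own values:

```python
import random

# Hypothetical pools -- substitute your own proxies and User-Agent strings.
PROXIES = ["http://10.0.0.1:8080", "http://10.0.0.2:8080"]
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

def request_plan(url):
    """Combine the tips above: random proxy, random User-Agent,
    and a randomized delay to apply before the fetch."""
    return {
        "url": url,
        "proxy": random.choice(PROXIES),
        "headers": {"User-Agent": random.choice(USER_AGENTS)},
        "delay": random.uniform(3.0, 10.0),  # seconds to wait before fetching
    }
```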
That concludes "what to do if the crawler IP is blocked". Thank you for reading.
© 2024 shulou.com SLNews company. All rights reserved.