2025-01-14 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/03 Report--
This article explains how a web crawler uses proxy IPs. The editor finds it very practical and shares it here as a reference; I hope you get something out of it.
If you want your crawler to run smoothly, you had better learn to use proxy IPs. Here are two schemes for using them:
1. Each process takes a random batch of IPs from the API and reuses that list. When the IPs stop working, it calls the API to get more.
The general logic is as follows:
(1) each process randomly retrieves a batch of IPs from the API and cycles through that list to fetch data.
(2) if a request succeeds, move on to the next page.
(3) if a request fails, get new IPs from the API and try again.
Disadvantages of the scheme: every IP has an expiration time. If 100 IPs are extracted at once, the rest may already be unavailable by the time the 20th is in use. If the HTTP request is set up with a 3-second connection timeout and a 5-second read timeout, each dead IP can cost 3 to 8 seconds, and in those 3 to 8 seconds the crawler could otherwise have made hundreds of requests.
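The first scheme can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: `crawl`, `get_ips`, and `fetch_via_proxy` are hypothetical names, `get_ips` stands in for whatever call your proxy provider's API offers, and proxies are assumed to be `host:port` strings. The sketch drops an IP from the local list after a failure and asks the API for a new batch only when the list is empty, which is one reading of step (3).

```python
import random
import urllib.request


def fetch_via_proxy(url, proxy, timeout=5):
    """Fetch url through an HTTP proxy given as 'host:port' (standard library only)."""
    handler = urllib.request.ProxyHandler(
        {"http": f"http://{proxy}", "https": f"http://{proxy}"}
    )
    opener = urllib.request.build_opener(handler)
    with opener.open(url, timeout=timeout) as resp:
        return resp.read()


def crawl(urls, get_ips, fetch=fetch_via_proxy, max_attempts=5):
    """Reuse a local batch of proxy IPs; refill it from the API when it runs dry.

    get_ips is a caller-supplied function (an assumption of this sketch)
    that returns a list of 'host:port' strings from the provider's API.
    """
    pool = list(get_ips())              # step (1): take a batch of IPs
    results = {}
    for url in urls:
        for _ in range(max_attempts):
            if not pool:
                pool = list(get_ips())  # step (3): get new IPs from the API
                if not pool:
                    break               # the API returned nothing usable
            proxy = random.choice(pool)
            try:
                results[url] = fetch(url, proxy)
                break                   # step (2): success, go to the next URL
            except Exception:
                pool.remove(proxy)      # treat any failure as a dead proxy
    return results
```

Note that `urllib` takes a single timeout value for each blocking socket operation. If you want separate connection and read timeouts like the 3-second and 5-second values mentioned above, an HTTP client such as `requests` accepts a `(connect, read)` tuple for its `timeout` argument.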
2. Extract a large number of IPs first, import them into a local database, and then take IPs from the database.
The general logic is as follows:
(1) create a table in the database and write an import script that requests the API a set number of times per minute (ask your proxy IP provider for a recommended rate) and imports the IP list into the database.
(2) record the import time, IP, port, expiration time, IP availability, and other fields in the database.
(3) write a crawl script that reads available IPs from the database; each process takes one IP from the database to use.
(4) while crawling, check the results, handle cookies, and so on. Whenever a CAPTCHA or an error appears, discard that IP and switch to another.
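The database-backed scheme can be sketched as below, using SQLite from the Python standard library. Everything here is an assumption made for illustration: the table name `proxies`, the column names, the helper names `init_db`, `import_ips`, `take_ip`, and `abandon_ip`, and the idea that the provider API yields `(ip, port, ttl_seconds)` tuples. The fields mirror those listed in step (2).

```python
import sqlite3
import time

# Columns follow step (2): import time, IP, port, expiration time, availability.
SCHEMA = """
CREATE TABLE IF NOT EXISTS proxies (
    ip          TEXT    NOT NULL,
    port        INTEGER NOT NULL,
    imported_at REAL    NOT NULL,
    expires_at  REAL    NOT NULL,
    available   INTEGER NOT NULL DEFAULT 1,
    PRIMARY KEY (ip, port)
)
"""


def init_db(conn):
    """Step (1): create the table if it does not exist yet."""
    conn.execute(SCHEMA)
    conn.commit()


def import_ips(conn, rows, now=None):
    """Step (1): store (ip, port, ttl_seconds) tuples fetched from the API."""
    now = time.time() if now is None else now
    conn.executemany(
        "INSERT OR REPLACE INTO proxies "
        "(ip, port, imported_at, expires_at, available) VALUES (?, ?, ?, ?, 1)",
        [(ip, port, now, now + ttl) for ip, port, ttl in rows],
    )
    conn.commit()


def take_ip(conn, now=None):
    """Step (3): return one random usable (ip, port), or None if there is none."""
    now = time.time() if now is None else now
    return conn.execute(
        "SELECT ip, port FROM proxies "
        "WHERE available = 1 AND expires_at > ? ORDER BY RANDOM() LIMIT 1",
        (now,),
    ).fetchone()


def abandon_ip(conn, ip, port):
    """Step (4): mark an IP unusable after a CAPTCHA or an error."""
    conn.execute(
        "UPDATE proxies SET available = 0 WHERE ip = ? AND port = ?", (ip, port)
    )
    conn.commit()
```

In a crawl script, each process would open its own connection, call `take_ip`, make its requests through that proxy, and call `abandon_ip` followed by `take_ip` again whenever a CAPTCHA or an error shows up. The import script would call `import_ips` on whatever schedule the provider recommends.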
That concludes this article on how web crawlers use proxy IPs. I hope the content above is of some help to you. If you think the article is good, please share it so more people can see it.
© 2024 shulou.com SLNews company. All rights reserved.