2025-01-18 Update From: SLTechnology News&Howtos shulou
Shulou(Shulou.com)06/03 Report--
This article explains in detail several ways for crawlers to use proxy IPs over long-running jobs. The editor finds them practical and shares them here as a reference. I hope you get something out of reading it.
First, each process randomly takes a list of IPs from the interface and reuses them.
If an IP fails, call the API to get more. The general logic is as follows:
1. In each process, randomly retrieve a batch of IPs from the interface and cycle through that IP list repeatedly to fetch data.
2. If the request succeeds, continue fetching the next item.
3. If it fails, take another batch of IPs from the interface and keep trying.
Disadvantage of this scheme: every IP has an expiration time. If you extract 100 IPs, by the time you reach the 20th most of the rest may no longer be usable. If the HTTP request is configured with a 3-second connection timeout and a 5-second read timeout, each dead IP may cost 3-8 seconds, time in which hundreds of fetches could otherwise have been made.
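The batch scheme above can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: `BatchProxyPool`, `crawl`, `fetch_batch`, and `fetch_page` are hypothetical names, and the provider API call and the HTTP request are passed in as plain functions so that no particular provider or HTTP library is assumed.

```python
import random
from typing import Callable, Dict, Iterable, List


class BatchProxyPool:
    """Holds one batch of proxies for a single worker process (hypothetical helper)."""

    def __init__(self, fetch_batch: Callable[[], List[str]]):
        # fetch_batch is an assumed wrapper around the provider's extraction API.
        self._fetch_batch = fetch_batch
        self._proxies: List[str] = []

    def refill(self) -> None:
        """Replace the current batch with a fresh one from the API."""
        self._proxies = list(self._fetch_batch())
        if not self._proxies:
            raise RuntimeError("provider returned no proxies")

    def pick(self) -> str:
        """Return a random proxy from the current batch, loading one if needed."""
        if not self._proxies:
            self.refill()
        return random.choice(self._proxies)


def crawl(
    urls: Iterable[str],
    pool: BatchProxyPool,
    fetch_page: Callable[[str, str], str],
    max_attempts: int = 5,
) -> Dict[str, str]:
    """Fetch each URL through a proxy; on failure, pull a new batch and retry."""
    results: Dict[str, str] = {}
    for url in urls:
        for _ in range(max_attempts):
            proxy = pool.pick()
            try:
                results[url] = fetch_page(url, proxy)
                break  # success: move on to the next URL
            except Exception:
                # Step 3 of the scheme: the request failed, so take another batch.
                pool.refill()
    return results
```

If `fetch_page` were built on the `requests` library, it could pass `timeout=(3, 5)` to set the connection and read timeouts mentioned above. Each dead proxy can then block a worker for up to 3 + 5 = 8 seconds before the retry happens.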
Second, each process randomly takes a single IP from the interface and uses it.
If it fails, call the API to get another IP. The general logic is as follows:
1. In each process, randomly retrieve one IP from the interface and use it to access resources.
2. If the request succeeds, continue fetching the next item.
3. If it fails, randomly take another IP from the interface and keep trying.
Disadvantage of this scheme: the API is called very frequently to obtain IPs, which puts heavy pressure on the proxy server, affects the stability of the API interface, and may cause the provider to restrict extraction. This scheme is not suitable for long-term stable operation.
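A rough sketch of the single-IP scheme is shown below, with a counter that makes the API pressure visible. The names `crawl_single_ip`, `fetch_one`, and `fetch_page` are hypothetical; the provider call and the HTTP request are again injected as functions rather than tied to a real service.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple


def crawl_single_ip(
    urls: Iterable[str],
    fetch_one: Callable[[], str],
    fetch_page: Callable[[str, str], str],
    max_attempts: int = 5,
) -> Tuple[Dict[str, str], int]:
    """Fetch URLs with one proxy at a time; return the results and the API call count."""
    results: Dict[str, str] = {}
    api_calls = 0
    proxy: Optional[str] = None
    for url in urls:
        for _ in range(max_attempts):
            if proxy is None:
                # Every replacement costs one call to the provider API (assumed wrapper).
                proxy = fetch_one()
                api_calls += 1
            try:
                results[url] = fetch_page(url, proxy)
                break  # success: keep the same proxy for the next URL
            except Exception:
                proxy = None  # failed: drop it and ask the API for another
    return results, api_calls
```

With many processes and short-lived IPs, `api_calls` grows with every failure in every process, which is the source of the pressure on the provider's interface described above.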
Third, extract a large number of IPs into a local database first, then have the crawler draw IPs from that database. The general logic is as follows:
1. Create a table in the database and write an import script that calls the API a set number of times per minute (consult the IP service provider for a suitable rate) and imports the IP list into the database.
2. Record fields such as import time, IP, port, expiration time, and IP availability status.
3. Write a crawl script that reads available IPs from the database; each process takes one IP from the database to use.
4. Perform the fetch, check the result, handle cookies, and so on. Whenever a CAPTCHA appears or the request fails, abandon that IP and switch to another one.
This scheme avoids putting excessive load on the proxy server, allocates proxy IPs effectively, and is more efficient and stable, which helps the crawler run persistently and reliably.
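The database scheme could look something like the following sketch, which uses Python's built-in `sqlite3` module. The table name `proxies`, the column names, the status values `'available'` and `'bad'`, and the helper functions are all assumptions made for illustration; the article does not prescribe a schema or a database engine.

```python
import sqlite3
import time
from typing import Iterable, Optional, Tuple

# Assumed schema covering the fields listed above:
# import time, IP, port, expiration time, availability status.
SCHEMA = """
CREATE TABLE IF NOT EXISTS proxies (
    ip          TEXT    NOT NULL,
    port        INTEGER NOT NULL,
    imported_at REAL    NOT NULL,
    expires_at  REAL    NOT NULL,
    status      TEXT    NOT NULL DEFAULT 'available',
    PRIMARY KEY (ip, port)
)
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the table if it does not exist."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def import_proxies(
    conn: sqlite3.Connection,
    rows: Iterable[Tuple[str, int, float]],
    now: Optional[float] = None,
) -> None:
    """Store (ip, port, expires_at) rows; the import script would call this periodically."""
    now = time.time() if now is None else now
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO proxies "
            "(ip, port, imported_at, expires_at, status) "
            "VALUES (?, ?, ?, ?, 'available')",
            [(ip, port, now, expires_at) for ip, port, expires_at in rows],
        )


def acquire_proxy(
    conn: sqlite3.Connection, now: Optional[float] = None
) -> Optional[Tuple[str, int]]:
    """Return a random proxy that is available and not yet expired, or None."""
    now = time.time() if now is None else now
    return conn.execute(
        "SELECT ip, port FROM proxies "
        "WHERE status = 'available' AND expires_at > ? "
        "ORDER BY RANDOM() LIMIT 1",
        (now,),
    ).fetchone()


def mark_bad(conn: sqlite3.Connection, proxy: Tuple[str, int]) -> None:
    """Flag a proxy as unusable after a failure or a CAPTCHA."""
    with conn:
        conn.execute(
            "UPDATE proxies SET status = 'bad' WHERE ip = ? AND port = ?", proxy
        )
```

A crawl process would call `acquire_proxy` before fetching and `mark_bad` whenever a request fails or a CAPTCHA appears, then acquire a replacement. Filtering on `expires_at` in the query keeps expired IPs from being handed out, which addresses the expiry problem of the first scheme without extra API calls.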
That concludes this article on the ways crawlers can use proxy IPs over the long term. I hope the content above is of some help to you.