
How do distributed crawlers use proxy IP

This article explains in detail how distributed crawlers use proxy IPs. The editor finds it very practical and shares it here as a reference; I hope you gain something from reading it.

First, each process randomly selects a batch of IPs from the API endpoint (for example, extracting 100 IPs at a time) and loops through them, calling the API again for a new batch when needed.

The logic is as follows:

1. Each process (or thread) randomly fetches a batch of IPs from the API and cycles through the list to request data.

2. If a request succeeds, move on and fetch the next entry.

3. If it fails (for example, a timeout occurs or a CAPTCHA appears), extract a new batch of IPs from the API and keep trying.

Defect of this scheme: every IP has a validity period. If you extract 100 IPs and have used only 10 of them before the rest expire, most of the remaining IPs become useless. If the HTTP request is configured with a 3-second connect timeout and a 5-second read timeout, each dead IP can waste 3 to 8 seconds, time in which dozens of successful requests could have been made.
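A minimal Python sketch of this batch-based scheme follows. The API URL, its response format (one ip:port per line), and the 100-IP batch size are assumptions standing in for whatever your proxy provider actually offers; only the 3-second connect and 5-second read timeouts come from the text above.

```python
import requests

# Hypothetical extraction endpoint and response format (one "ip:port" per
# line); adjust both to your proxy provider's actual API.
API_URL = "http://proxy-provider.example.com/api?num=100"

def fetch_batch():
    """Pull a batch of 'ip:port' strings from the proxy API."""
    resp = requests.get(API_URL, timeout=5)
    resp.raise_for_status()
    return resp.text.split()

def crawl(urls):
    batch, i = fetch_batch(), 0
    for url in urls:
        while True:
            if i >= len(batch):                  # list exhausted: call the API again
                batch, i = fetch_batch(), 0
            proxy = batch[i]
            try:
                resp = requests.get(
                    url,
                    proxies={"http": f"http://{proxy}",
                             "https": f"http://{proxy}"},
                    timeout=(3, 5),              # 3 s connect / 5 s read, as above
                )
                resp.raise_for_status()
                i += 1                           # success: move to the next entry
                yield url, resp.text
                break
            except requests.RequestException:
                batch, i = fetch_batch(), 0      # failure: extract a fresh batch
```

Note how a single bad IP discards the whole remaining batch, which is exactly the waste described above.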

Second, each process randomly fetches a single IP from the API, uses it, and calls the API for a new IP only after a failure.

The general logic is as follows:

1. Each process (or thread) randomly extracts one IP from the API and uses it to access the target resource.

2. If the visit is successful, proceed to the next task.

3. If it fails (for example, a timeout occurs or a CAPTCHA appears), take another random IP from the API and keep trying.

Defect of this scheme: the API is called for a new IP very frequently, which puts heavy pressure on the proxy server, hurts the stability of the API endpoint, and may limit extraction, so this scheme is likewise unsuitable for long-term stable operation.
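Under the same assumptions about the provider's API, a minimal sketch of this one-IP-at-a-time scheme might look like this; note that every single failure triggers another API call, which is the pressure problem just described.

```python
import requests

# Hypothetical single-IP endpoint; adjust to your proxy provider's actual API.
API_URL = "http://proxy-provider.example.com/api?num=1"

def fetch_one_proxy():
    """Pull a single 'ip:port' string from the proxy API."""
    resp = requests.get(API_URL, timeout=5)
    resp.raise_for_status()
    return resp.text.strip()

def fetch_page(url, max_retries=10):
    """Try one proxy per attempt; on any failure, hit the API for a new one."""
    proxy = fetch_one_proxy()
    for _ in range(max_retries):
        try:
            resp = requests.get(
                url,
                proxies={"http": f"http://{proxy}",
                         "https": f"http://{proxy}"},
                timeout=(3, 5),
            )
            resp.raise_for_status()
            return resp.text              # success: proceed to the next task
        except requests.RequestException:
            proxy = fetch_one_proxy()     # failure: the API is called yet again
    raise RuntimeError(f"gave up on {url} after {max_retries} proxies")
```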

Third, import a large number of IPs into a local database and extract IPs from the database.

The general logic is as follows:

1. Create a table in the database and write an import script that queries the API an appropriate number of times per minute (follow your proxy IP provider's recommendation) and imports the returned IP list into the database.

2. Record fields such as import time, IP, port, expiration time, and IP availability status in the database.

3. Write a crawl script that reads available IPs from the database; each process takes one IP from the database for its own use.

4. Perform the crawl, check the result, handle cookies, and so on; as soon as a CAPTCHA appears or a request fails, abandon that IP and replace it with a new one.

This scheme avoids wasting proxy server resources and allocates proxy IPs efficiently, making the crawler more efficient and stable and ensuring that crawling can run persistently and reliably.
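As a rough illustration of this third scheme, here is a sketch using Python's built-in sqlite3. The table layout, field names, polling interval, and assumed IP lifetime (`ttl`) are all illustrative; a real import script would follow the provider's recommended request rate and whatever expiry data it returns, and each crawler process would open its own database connection.

```python
import sqlite3
import time

import requests

# Hypothetical extraction endpoint; adjust to your proxy provider's API.
API_URL = "http://proxy-provider.example.com/api?num=100"
DB_PATH = "proxies.db"

def init_db():
    con = sqlite3.connect(DB_PATH)
    con.execute("""CREATE TABLE IF NOT EXISTS proxy (
        ip          TEXT,
        port        INTEGER,
        imported_at REAL,                -- when the IP was imported
        expires_at  REAL,                -- end of the IP's validity period
        usable      INTEGER DEFAULT 1,   -- availability status flag
        PRIMARY KEY (ip, port))""")
    con.commit()
    return con

def import_loop(con, ttl=300, interval=60):
    """Import script: poll the API every `interval` seconds, store the IPs."""
    while True:
        now = time.time()
        for entry in requests.get(API_URL, timeout=5).text.split():
            ip, port = entry.split(":")
            con.execute("INSERT OR REPLACE INTO proxy VALUES (?, ?, ?, ?, 1)",
                        (ip, int(port), now, now + ttl))
        con.commit()
        time.sleep(interval)

def take_ip(con):
    """Crawl script: hand out one random live, usable IP from the database."""
    row = con.execute("""SELECT ip, port FROM proxy
                         WHERE usable = 1 AND expires_at > ?
                         ORDER BY RANDOM() LIMIT 1""",
                      (time.time(),)).fetchone()
    return f"{row[0]}:{row[1]}" if row else None

def mark_bad(con, proxy):
    """On a CAPTCHA or failure, retire the IP so no process picks it again."""
    ip, port = proxy.split(":")
    con.execute("UPDATE proxy SET usable = 0 WHERE ip = ? AND port = ?",
                (ip, int(port)))
    con.commit()
```

Because expired and failed IPs are filtered out in the database query itself, crawler processes never burn their timeouts on dead proxies, which is where the efficiency of this scheme comes from.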

This concludes the article on "how distributed crawlers use proxy IP". I hope the content above has been helpful and that you have learned something from it. If you think the article is good, please share it so more people can see it.
