
How to Build a Hundred-Million-Scale Crawler IP Proxy Pool with a Squid Proxy Server


This article mainly explains how to build a hundred-million-scale crawler IP proxy pool with a Squid proxy server. Many people run into questions about this in daily work, so the editor has consulted a variety of materials and compiled a simple, easy-to-follow method. I hope it helps answer those questions; follow along to study it.

Design approach

The proxy platform provides a large number of proxy server resources; the main question is how to distribute them to the crawler servers. The original idea was to use Redis as a queue of proxy resources: one program would fetch proxies from the provider's API, verify their availability, and push the good ones to Redis, and each crawler program would pop a proxy from Redis before crawling. This has two drawbacks. First, it is hard to control the quality of the proxy each crawler server receives: some proxies are fast and some are slow, which hurts crawling efficiency. Second, it requires maintaining a separate set of programs for proxy verification and distribution, which adds code and makes later maintenance inconvenient.

To solve these problems, I realized I could use the parent-proxy (cache_peer) feature provided by Squid to automatically forward requests from the crawler servers to the upstream proxy servers. Squid provides weighted round-robin polling of its parents and automatically detects and skips unavailable ones, which removes our redundant verification steps.
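For reference, here is a minimal sketch of the squid.conf directives involved. The peer addresses and ports are made up for illustration; the cache_peer options mirror the PEER_CONF template used in the script below, and never_direct forces Squid to always go through a parent rather than fetching directly:

# illustrative parent proxies; one cache_peer line per upstream proxy
cache_peer 60.10.10.1 parent 8888 0 no-query weighted-round-robin weight=1 connect-fail-limit=2 allow-miss max-conn=5
cache_peer 60.10.10.2 parent 8888 0 no-query weighted-round-robin weight=1 connect-fail-limit=2 allow-miss max-conn=5
# never fetch directly; always forward through a parent proxy
never_direct allow all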

Each crawler only needs to point its proxy setting at the Squid server once; it never has to be reconfigured to a different proxy server.
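In practice this means a crawler written with requests, for example, only ever names the Squid host. A minimal sketch, where the host squid-server and port 3128 (Squid's default http_port) are illustrative assumptions:

import requests

# every request goes to Squid, which forwards it to one of its
# parent proxies via weighted round-robin
SQUID = {"http": "http://squid-server:3128"}

res = requests.get("http://example.com", proxies=SQUID, timeout=10)
print(res.status_code)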

This scheme clearly reduces the workload and improves ease of use and maintainability.

Implementation

1. First, get the proxy server resources provided by the proxy platform.

Short-lived proxies are recommended. After purchasing, obtain the API address from the provider's dashboard and configure parameters such as the IP whitelist.

2. Write the obtained proxies to the Squid configuration file

Parse the proxies returned by the provider and write them into /etc/squid/squid.conf as cache_peer lines (see the sketch above).

3. Reconfigure squid

After writing the configuration file, reload it without interrupting service, as shown below.
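For example, the following checks the new file for syntax errors and then reloads it in place (squid -k parse and squid -k reconfigure are standard Squid management commands; running them may require root):

squid -k parse && squid -k reconfigure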

4. Automatically update: repeat steps 1-3

Since the proxies the provider supplies only survive for about 2 minutes, a new batch of IPs must be fetched at regular intervals (a scheduling sketch follows the script below).

from gevent import monkey  # isort:skip
monkey.patch_all()  # isort:skip

import logging
import os
import time

import requests
from gevent.pool import Pool

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
formatter = logging.Formatter(
    "%(asctime)s - %(name)s - %(levelname)s: - %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)
# StreamHandler prints the log output to the screen
ch = logging.StreamHandler()
ch.setLevel(logging.INFO)
ch.setFormatter(formatter)
logger.addHandler(ch)

# cache_peer line telling Squid to forward requests to a parent proxy
PEER_CONF = "cache_peer %s parent %s 0 no-query weighted-round-robin weight=1 connect-fail-limit=2 allow-miss max-conn=5\n"

# verified, usable proxies
GOOD_PROXIES = []

pool = Pool(50)


def check_proxy(proxy):
    """Verify that a proxy is usable.

    :param proxy: list of [ip, port]
    """
    global GOOD_PROXIES
    ip, port = proxy
    _proxies = {"http": "http://{}:{}".format(ip, port)}
    try:
        ip_url = "http://2019.ip138.com/ic.asp"
        res = requests.get(ip_url, proxies=_proxies, timeout=10)
        assert ip in res.text
        logger.info("[GOOD] - {}:{}".format(ip, port))
        GOOD_PROXIES.append(proxy)
    except Exception as e:
        logger.error("[BAD] - {}:{}, {}".format(ip, port, e))


def update_conf():
    with open("/etc/squid/squid.conf.original", "r") as f:
        squid_conf = f.readlines()
    squid_conf.append("\n# Cache peer config\n")
    for proxy in GOOD_PROXIES:
        squid_conf.append(PEER_CONF % (proxy[0], proxy[1]))
    with open("/etc/squid/squid.conf", "w") as f:
        f.writelines(squid_conf)


def get_proxy():
    global GOOD_PROXIES
    GOOD_PROXIES = []
    # 1. Fetch proxy IP resources from the provider's API
    api_url = "http://s.zdaye.com/?api=YOUR_API&count=100&fitter=1&px=2"
    res = requests.get(api_url).text
    if len(res) == 0:
        logger.error("no data")
    elif "bad" in res:
        logger.error("bad request")
    else:
        logger.info("got all proxies")
        proxies = []
        for line in res.split():
            proxies.append(line.strip().split(":"))
        pool.map(check_proxy, proxies)
        pool.join()
        # 2. Write the verified proxies to the Squid configuration file
        update_conf()
        # 3. Reload the configuration file
        os.system("squid -k reconfigure")
        logger.info("> DONE!")
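The script above performs a single refresh cycle. For step 4 it can be wrapped in a simple loop; the sketch below is one way to do it, where the 60-second interval is an assumption chosen to stay comfortably inside the roughly 2-minute proxy lifetime:

def main():
    while True:
        get_proxy()
        # refresh well before the ~2-minute proxy lifetime runs out
        # (the 60-second interval is an assumption, not from the provider)
        time.sleep(60)


if __name__ == "__main__":
    main()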
