Method 1.
Use multiple proxy IPs:
1. Multiple IPs are required, for example via ADSL; if conditions allow, you can also apply for additional external IPs from the server room.
2. Deploy proxy servers on the machines with external IPs.
3. Have your program rotate through these proxy servers when accessing the sites you want to collect.
Benefits:
1. Program logic changes are small; you only need to add proxy support.
2. If the target website tightens its blocking rules, you just add more proxies.
3. Even if a specific IP is blocked, you can simply take that proxy server offline; the program logic does not need to change.
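As an illustration, here is a minimal sketch of client-side proxy rotation using the Python `requests` library. The proxy addresses are placeholders for your own proxy servers on machines with external IPs.

```python
import itertools
import requests

# Placeholder proxy servers deployed on machines with external IPs.
PROXIES = [
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
    "http://10.0.0.3:8080",
]

# Cycle through the proxy pool, one proxy per request.
proxy_pool = itertools.cycle(PROXIES)

def fetch(url):
    proxy = next(proxy_pool)
    # Route both http and https traffic through the chosen proxy.
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)
```

If one proxy's IP gets blocked, you remove it from `PROXIES` (or take that server offline) and the rest of the program is untouched.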
Method 2.
A small number of websites have weak defenses; you can disguise the source IP by modifying the X-Forwarded-For header and bypass them.
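For sites that naively trust this header, the spoofing can be as simple as the sketch below; the fabricated address is arbitrary, and this only works against servers that blindly read X-Forwarded-For.

```python
import random
import requests

def fetch_with_fake_xff(url):
    # Fabricate a random "client" IP; only effective against sites that
    # blindly trust the X-Forwarded-For header.
    fake_ip = ".".join(str(random.randint(1, 254)) for _ in range(4))
    headers = {"X-Forwarded-For": fake_ip}
    return requests.get(url, headers=headers, timeout=10)
```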
For most websites, though, frequent crawling usually requires more IPs.
My preferred solution is to configure a foreign VPS with multiple IPs and switch IPs by switching the default gateway; this is much more efficient than an HTTP proxy and, in most cases, probably more efficient than ADSL redialing.
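On a Linux VPS with multiple uplinks, switching the default gateway might look like the following sketch; the gateway addresses are assumptions, and the `ip route` command needs root privileges.

```python
import subprocess

# Hypothetical gateways, one per external IP configured on the VPS.
GATEWAYS = ["192.0.2.1", "192.0.2.129"]

def switch_gateway(gw):
    # Replace the default route so new connections leave via a different IP.
    subprocess.run(["ip", "route", "replace", "default", "via", gw], check=True)
```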
Method 3.
ADSL + a script: monitor whether you have been blocked, then keep switching the IP; also set a query-frequency limit (a sketch follows).
The orthodox approach, of course, is to invoke the service interface (API) provided by the site.
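A minimal monitor-and-switch loop for the ADSL approach might look like this. `adsl-stop`/`adsl-start` are the classic rp-pppoe dial scripts on Linux, and the block test (an HTTP 403 or a captcha marker in the page) is an assumption about how the target signals a ban.

```python
import subprocess
import time
import requests

def is_blocked(url):
    # Assumption: the site signals a ban with HTTP 403 or a captcha page.
    try:
        r = requests.get(url, timeout=10)
        return r.status_code == 403 or "captcha" in r.text.lower()
    except requests.RequestException:
        return True

def redial():
    # rp-pppoe's stop/start scripts; redialing usually yields a new ADSL IP.
    subprocess.run(["adsl-stop"], check=False)
    time.sleep(5)
    subprocess.run(["adsl-start"], check=True)

while True:
    if is_blocked("http://example.com/probe"):
        redial()
    time.sleep(3)  # query-frequency limit between probes
```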
Method 4.
In China, ADSL is king. Apply for multiple lines, distributed across different telecom offices, ideally spanning provinces and cities. Write your own disconnect-and-redial component and your own dynamic-IP tracking service, and set up remote hardware resets (mainly for the ADSL modems, to guard against hangs); the remaining task allocation and data recovery are not big problems.
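The dynamic-IP tracking service can be as simple as each ADSL line reporting its current public IP to a central registry after every redial. The sketch below uses the public echo service api.ipify.org to discover the line's IP; the registry endpoint and line identifier are hypothetical.

```python
import requests

LINE_ID = "adsl-line-07"  # hypothetical identifier for this ADSL line

def report_current_ip():
    # Ask an external echo service for our current public IP...
    ip = requests.get("https://api.ipify.org", timeout=10).text.strip()
    # ...then register it with the (hypothetical) central tracking service,
    # so the task scheduler always knows which IP each line currently holds.
    requests.post("http://tracker.internal/report",
                  json={"line": LINE_ID, "ip": ip}, timeout=10)
```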
Method 5.
1. User-Agent camouflage and rotation
2. Proxy IP use and rotation
3. Cookie handling; some websites have a looser policy toward logged-in users (see the sketch below)
Be a responsible crawler :)
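A sketch combining points 1 and 3: rotate the User-Agent per request while a `requests.Session` keeps the login cookies, since some sites are more lenient with logged-in users. The UA strings are examples only.

```python
import random
import requests

USER_AGENTS = [  # example strings; rotate real, current ones in practice
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]

session = requests.Session()  # persists cookies across requests (e.g. after login)

def fetch(url):
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return session.get(url, headers=headers, timeout=10)
```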
Method 6.
Simulate user behavior as much as possible:
1. Change the User-Agent frequently;
2. Set the access interval to be longer, and randomize the access times;
3. Visit the pages in a random order as well.
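Points 2 and 3 together, as a sketch: sleep a random interval between requests and visit the pages in shuffled order. The URL list and the 5-15 second range are placeholders.

```python
import random
import time
import requests

urls = [f"https://example.com/page/{i}" for i in range(1, 11)]
random.shuffle(urls)  # visit pages in a random order

for url in urls:
    requests.get(url, timeout=10)
    time.sleep(random.uniform(5, 15))  # long, randomized access interval
```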
Method 7.
Websites usually block based on the number of visits from a particular IP per unit time.
I group the collection tasks by the IP of the target site
and avoid blocking by controlling the number of tasks per IP per unit time.
Of course, the premise is that you are collecting many websites; if you only collect one website, you can only manage this with multiple external IPs.
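A sketch of this grouping: resolve each task URL to the target host's IP, bucket the tasks by IP, and enforce a minimum interval per bucket before dispatching. The 10-second interval is an assumed safe rate, not a universal value.

```python
import socket
import time
from collections import defaultdict
from urllib.parse import urlparse

MIN_INTERVAL = 10.0        # assumed safe gap between hits on one target IP
last_hit = defaultdict(float)

def group_by_ip(task_urls):
    # Bucket crawl tasks by the IP their target host resolves to.
    groups = defaultdict(list)
    for url in task_urls:
        ip = socket.gethostbyname(urlparse(url).hostname)
        groups[ip].append(url)
    return groups

def allowed(ip):
    # Dispatch a task only if this target IP has rested long enough.
    now = time.monotonic()
    if now - last_hit[ip] >= MIN_INTERVAL:
        last_hit[ip] = now
        return True
    return False
```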
Method 8.
Apply pressure control to the crawler; consider using proxies to access the target sites.
Reduce the crawl frequency, use longer intervals, and randomize the access times.
Switch the User-Agent frequently (to simulate browser access).
For multi-page data, access the pages in random order before grabbing the data, and change the user's IP.
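Method 8 is essentially the earlier techniques combined. One possible "pressure-controlled" fetch that ties them together is sketched below; the proxy and User-Agent lists are placeholders.

```python
import random
import time
import requests

PROXIES = ["http://10.0.0.1:8080", "http://10.0.0.2:8080"]  # placeholder proxies
USER_AGENTS = ["Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
               "Mozilla/5.0 (X11; Linux x86_64)"]           # placeholder UAs

def polite_fetch(url):
    time.sleep(random.uniform(3, 12))          # reduced, randomized frequency
    proxy = random.choice(PROXIES)             # change the outgoing IP
    headers = {"User-Agent": random.choice(USER_AGENTS)}  # simulate a browser
    return requests.get(url, headers=headers,
                        proxies={"http": proxy, "https": proxy}, timeout=10)
```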