How to Use Free Proxy IPs to Crawl Data

2025-01-17 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/02 Report--

Many people who are new to crawling do not know how to use free proxy IPs to crawl data. This article summarizes the cause of the problem and a way to solve it, and I hope it helps you solve the problem too.

I. Preface

Crawlers inevitably run into the anti-crawling measures of major websites. A common measure is to decide whether a visitor is a "web robot" (that is, a crawler) by counting how many requests an IP address makes within a fixed period. Once a crawler is identified, its IP risks being blocked, and it can no longer access the site.

The usual solution is to crawl through proxy IPs. Paid proxy IPs are generally expensive. Many websites list free proxy IPs, but because free proxies are short-lived, most of the listed addresses no longer work. Many tutorials describe maintaining a proxy IP pool: crawl the addresses, test them, put the usable ones into the pool, and draw from the pool later. In my opinion this is relatively inefficient, because these IP addresses expire quickly. What we need to do instead is test an address at the moment we use it, so that the free IP is still fresh.

II. Crawl IP addresses

Let's get hands-on.

1. First, find a website that lists free proxy IPs, as shown in the following figure.

2. Open the browser's element inspector and analyze the structure of the page elements, as shown in the following figure.

3. It is a simple static page, so we use requests and bs4 to scrape the IP addresses and their corresponding ports, as shown in the following figure.
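As a rough illustration of step 3, the sketch below downloads one listing page with requests and collects the text of its table cells with bs4. The article does not name the site it used, so `LIST_URL` is a hypothetical placeholder, and the assumption that the listing is an HTML table made of `<td>` cells is mine. The helper names `fetch_listing` and `extract_cells` are also illustrative.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical listing URL: the article does not name the site it used.
LIST_URL = "https://free-proxy.example/page/{page}"
HEADERS = {"User-Agent": "Mozilla/5.0"}


def fetch_listing(page: int, timeout: float = 5.0) -> str:
    """Download one page of the proxy listing and return its HTML."""
    response = requests.get(
        LIST_URL.format(page=page), headers=HEADERS, timeout=timeout
    )
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text


def extract_cells(html: str) -> list[str]:
    """Return the text of every <td> cell in document order.

    Assumes the listing is a plain HTML table (an assumption, not
    something the article states).
    """
    soup = BeautifulSoup(html, "html.parser")
    return [td.get_text(strip=True) for td in soup.find_all("td")]
```

Keeping the download and the parsing in separate functions makes it possible to try the parsing on a saved copy of the page without sending any requests.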

4. Each row of the listing consists of five tags, of which we need the first (the IP address) and the second (the corresponding port). So, starting from the first tag, we take every fifth one to get the IP addresses (item[::5]), and starting from the second tag, we take every fifth one to get the ports (item[1::5]). The parameter n is the page number, and only one usable IP address is taken from a page at a time. The final result is shown in the following figure:
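The slicing in step 4 can be tried on a small flat list. This is only a sketch: the cell values are made-up sample data (the addresses come from the reserved documentation ranges), and `pair_ip_ports` is an illustrative name, not the author's function.

```python
def pair_ip_ports(cells, columns=5):
    """Pair each IP with its port from a flat, row-major list of cell texts."""
    ips = cells[::columns]       # 1st cell of every row: the IP address
    ports = cells[1::columns]    # 2nd cell of every row: the port
    return [f"{ip}:{port}" for ip, port in zip(ips, ports)]


# Made-up sample data: two rows of five cells each.
cells = [
    "192.0.2.10", "8080", "HTTP", "anonymous", "1 minute ago",
    "198.51.100.7", "3128", "HTTPS", "anonymous", "2 minutes ago",
]
print(pair_ip_ports(cells))  # ['192.0.2.10:8080', '198.51.100.7:3128']
```

The step of 5 only works while every row really has five cells. If the site changes its layout, the `columns` value has to change with it.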

III. Verify that the IP works

Here we take Baidu Encyclopedia (Baidu Baike) as the target website. It looks like an ordinary site, but its anti-crawling measures are strict enough that requests start failing after only a few pages have been crawled. Below, I use the task of looking up on Baidu Encyclopedia where each railway station in the country is located as an example of how to use free proxy IPs.

1. First, I scraped the names of all railway stations from 12306, but that data does not say where each station is located.

2. Then construct the Baidu Encyclopedia URL from each station name, analyze the page elements, and crawl the station's address information, as shown in the following figure:

3. We therefore only need to look for the characters for "province" or "city" in the content of the tags matched by class_='basicInfo-item' and output the match. Finally, we wrap the request in a while True loop: when the IP can crawl data normally, break out of the loop; if the IP has been banned, immediately request a new IP and try again. The code is shown in the following figure:

4. A for loop traverses all the railway stations, and try is used to check whether the IP still works. If it does not, a new IP is requested in the except branch. The crawling result is as follows:
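One plausible way to build the URL in step 2 is sketched below. It assumes that entries live under `https://baike.baidu.com/item/` followed by the percent-encoded entry name, and that the station name by itself is a valid entry name. Both are assumptions rather than something the article states, and `build_baike_url` is an illustrative helper.

```python
from urllib.parse import quote

# Assumed URL pattern for Baidu Encyclopedia entries.
BAIKE_ITEM_URL = "https://baike.baidu.com/item/"


def build_baike_url(station_name: str) -> str:
    """Percent-encode the station name (as UTF-8) and append it to the base URL."""
    return BAIKE_ITEM_URL + quote(station_name)


print(build_baike_url("北京站"))
# https://baike.baidu.com/item/%E5%8C%97%E4%BA%AC%E7%AB%99
```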
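The lookup part of step 3 might look like the following sketch. It assumes the characters being searched for are 省 (province) and 市 (city), and the sample HTML is made up to resemble an info box, so the real page structure may differ. `find_location` is an illustrative name.

```python
from bs4 import BeautifulSoup


def find_location(html):
    """Return the first basicInfo-item text mentioning a province or city.

    Returns None when no such tag is found, for example when the
    response was an error page instead of the encyclopedia entry.
    """
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(class_="basicInfo-item"):
        text = tag.get_text(strip=True)
        if "省" in text or "市" in text:  # assumed target characters
            return text
    return None


# Made-up sample resembling an encyclopedia info box.
sample = (
    '<dl><dt class="basicInfo-item name">所属地区</dt>'
    '<dd class="basicInfo-item value">湖北省武汉市</dd></dl>'
)
print(find_location(sample))  # 湖北省武汉市
```

This is a crude match: any info-box field containing either character is accepted, so the result may need a manual check.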
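The loop described in steps 3 and 4 can be sketched as follows. The `fetch` and `next_proxy` callables are hypothetical stand-ins so that the control flow can be shown without network access. In a real crawler, `fetch` would presumably call `requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=...)` and `next_proxy` would scrape a fresh address from the listing site. The `max_attempts` cap is my addition, since a bare while True loop never ends if no proxy works.

```python
def crawl_stations(stations, fetch, next_proxy, max_attempts=5):
    """Crawl every station, switching proxy whenever a request fails."""
    results = {}
    proxy = next_proxy()
    for station in stations:
        attempts = 0
        while True:
            try:
                results[station] = fetch(station, proxy)
                break  # the proxy still works: keep it for the next station
            except Exception:  # in practice, catch requests.RequestException
                attempts += 1
                if attempts >= max_attempts:
                    results[station] = None  # give up on this station
                    break
                proxy = next_proxy()  # banned or dead: request a new proxy
    return results


# Usage with stand-in callables: the first proxy is "banned".
proxies = iter(["bad-proxy", "good-proxy"])


def fake_fetch(station, proxy):
    if proxy == "bad-proxy":
        raise ConnectionError("banned")
    return f"{station} via {proxy}"


print(crawl_stations(["A", "B"], fake_fetch, lambda: next(proxies)))
# {'A': 'A via good-proxy', 'B': 'B via good-proxy'}
```

Because a proxy is replaced only after it fails, each address is tested by the very request that uses it, which is the test-while-using idea from the preface.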

The next time your crawler is banned, you can solve it this way.

This article showed how to crawl usable IPs from a free proxy listing site, along with a Python script that checks whether an IP address still works. If your crawler gets banned, this method can get it running again.
