2025-01-22 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/02 Report--
This article shows how to use a Python crawler to collect proxy IPs. The walkthrough is short and easy to follow, and I hope you get something out of it.
You may have run into this when visiting a website: the site reports that your "visit frequency is too high", and you either have to wait a while or solve a CAPTCHA before access is restored. The message appears because the site we want to visit or crawl has an anti-crawler mechanism. For example, when one IP sends too many requests in a short time, the server refuses service. Waiting to be unblocked each time is impractical, so one solution is to visit or crawl pages through a disguised IP address. That is the proxy IP we are talking about today.
There are many proxy IP providers online, some free and some paid. Free proxies that actually work are rare and unstable; paid ones may be better. Below we try out free proxy IPs and save the usable ones into MongoDB so they are easy to reuse next time.
Running platform: Windows
Python version: Python3.6
IDE: Sublime Text
Other: Chrome browser
The brief process is as follows:
Step 1: learn how to set a proxy in requests
Step 2: crawl the ip and port from the proxy listing page
Step 3: check whether each crawled ip is usable
Step 4: store the usable proxies in MongoDB
Step 5: randomly select an ip from the database of usable proxies, and return it after it passes a fresh test
For requests, setting up the proxy is relatively simple, as long as you pass in the proxies parameter.
Note that here I installed the packet capture tool Fiddler, listening on local port 8888, and used it to create an HTTP proxy service (with the Chrome plug-in SwitchyOmega), so the proxy address is 127.0.0.1:8888. Once this proxy is set, the local ip is replaced by the ip of the server that the proxy software connects to.
Here I use http://httpbin.org/get as the test website. When we request the page, it returns information about the request, and the origin field is the client ip, so we can judge from the response whether the proxy worked.
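The proxy setup described above might be sketched as follows. This is a minimal illustration, not the original listing: the helper names build_proxies and fetch_origin are hypothetical, and it assumes a proxy such as Fiddler is actually listening on 127.0.0.1:8888.

```python
def build_proxies(host, port):
    """Build the mapping that requests accepts as its `proxies` argument."""
    address = f"http://{host}:{port}"
    # The same HTTP proxy is used for both plain and TLS traffic.
    return {"http": address, "https": address}


def fetch_origin(proxies, timeout=5):
    """Return the `origin` field httpbin reports when we go through `proxies`."""
    # Third-party import kept local so build_proxies works without requests installed.
    import requests

    response = requests.get("http://httpbin.org/get", proxies=proxies, timeout=timeout)
    response.raise_for_status()
    return response.json()["origin"]


if __name__ == "__main__":
    # Assumption: a local proxy (for example Fiddler) listens on 127.0.0.1:8888.
    print(fetch_origin(build_proxies("127.0.0.1", 8888)))
```

If the printed origin differs from your own public ip, the request went out through the proxy.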
Then we start crawling proxy IPs. First, open the proxy listing page in Chrome and inspect it to find the elements that hold the ip and port.
As you can see, the proxy page stores the ip addresses and related information in a table, so they are easy to extract with BeautifulSoup. Note that crawled ips are likely to repeat, especially when we crawl several proxy pages and store them in the same array, so we can use a set to remove duplicates.
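A rough sketch of the extraction and de-duplication step is shown here. The article does not name the proxy page, so the table layout is an assumption: each data row is taken to hold the ip in its first td cell and the port in its second. The function names are hypothetical.

```python
def rows_to_proxies(rows):
    """Turn [ip, port, ...] cell lists into a de-duplicated set of 'ip:port' strings."""
    proxies = set()
    for cells in rows:
        # Skip header or malformed rows: we need at least an ip and a numeric port.
        if len(cells) >= 2 and cells[1].isdigit():
            proxies.add(f"{cells[0]}:{cells[1]}")
    return proxies


def parse_proxy_table(html):
    """Extract 'ip:port' strings from every table row found in `html`."""
    # Third-party import kept local so rows_to_proxies works without bs4 installed.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    rows = [
        [td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in soup.find_all("tr")
    ]
    return rows_to_proxies(rows)
```

Because the result is a set, the same proxy crawled from several pages is stored only once.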
The ips from the chosen number of pages are crawled and saved into the array, and then each one is tested in turn.
Here we set the proxy with the requests method described above. We use http://httpbin.org/ip as the test website, which simply returns the ip our request came from. A proxy that passes the test is stored in the MongoDB database.
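One way to write such a check is sketched below. The name check_proxy and the optional `get` parameter are hypothetical; the parameter exists only so the function can be exercised without network access. Treating any exception as "unusable" is a simplification chosen for this sketch.

```python
def check_proxy(proxy, get=None, timeout=5):
    """Return True if a request to httpbin.org/ip succeeds through `proxy` ('ip:port')."""
    if get is None:
        # Third-party import kept local; by default we use requests.get.
        import requests

        get = requests.get

    proxies = {"http": f"http://{proxy}", "https": f"http://{proxy}"}
    try:
        response = get("http://httpbin.org/ip", proxies=proxies, timeout=timeout)
        # A usable proxy returns HTTP 200 with a JSON body containing "origin".
        return response.status_code == 200 and "origin" in response.json()
    except Exception:
        # Timeouts, refused connections and non-JSON bodies all count as failure.
        return False
```

A short timeout matters here, since most free proxies are dead and would otherwise stall the loop.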
Connect to the database, specify the database and collection, and then insert the data. That is all there is to it.
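The storage step could look roughly like this with pymongo. The database name "proxy", the collection name "good_proxy", and the default local connection are assumptions made for illustration, as are the function names. An upsert is used here instead of a plain insert so that re-running the crawler does not store the same proxy twice.

```python
def get_collection(host="localhost", port=27017):
    """Connect to MongoDB and return the collection used to hold proxies."""
    # Third-party import kept local so save_proxy can be used with any collection-like object.
    from pymongo import MongoClient

    client = MongoClient(host, port)
    # Assumed names: database "proxy", collection "good_proxy".
    return client["proxy"]["good_proxy"]


def save_proxy(collection, proxy):
    """Store `proxy` ('ip:port') once, even if it is saved repeatedly."""
    collection.update_one(
        {"proxy": proxy},            # match an existing document for this proxy
        {"$set": {"proxy": proxy}},  # create or refresh it
        upsert=True,
    )
```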
Finally, run it to check the results.
After it had run for a while, I saw three tests pass in a row, which is rare, so I quickly took a screenshot to save it. These are free proxy IPs after all: few of them are valid, and the ones that work do not survive long. Still, with a large enough crawl we can find some that are usable, and for practice that is just about enough. Now let's look at what is stored in the database.
Because I did not crawl many pages and few of the ips were valid, there are not many ips in the database yet, but these ones are saved. Now let's see how to take one out at random.
I was worried that an ip might stop working after sitting in the database for a while, so I test it again before handing it out. If the test succeeds, the ip is returned; if not, it is removed from the database.
In this way, whenever we need a proxy, we can retrieve one from the database.
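The retest-before-use logic can be sketched as follows. The function name pick_working_proxy is hypothetical, and the sketch assumes each stored document has a "proxy" field holding an 'ip:port' string; `is_alive` is whatever check function you use to test a proxy.

```python
import random


def pick_working_proxy(collection, is_alive):
    """Return a random stored proxy that still works, deleting dead ones on the way."""
    candidates = [doc["proxy"] for doc in collection.find()]
    random.shuffle(candidates)  # try the stored proxies in random order
    for proxy in candidates:
        if is_alive(proxy):
            return proxy
        # The proxy has expired since it was stored, so drop it from the database.
        collection.delete_one({"proxy": proxy})
    return None  # nothing usable is left
```

Returning None when every stored proxy has expired lets the caller decide whether to crawl again.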
The overall code is as follows:
Zhihu.com/people/hdmi-blog