Today I'd like to talk about how to use R to build a simple proxy pool. Many people may not know much about this, so I have summarized the following for you; I hope you get something out of this article.
Recently I have been hard at work studying crawlers. One after another I have picked up regular expressions, XPath, and CSS selectors, so I can now handle most R crawling needs with the RCurl+XML and httr+rvest combinations. I have also done some sorting out of constructing GET and POST requests, form submission, packet capture in the browser, simulated login, and asynchronous loading. Because crawler knowledge carries over between languages, I went straight to practicing urllib+lxml and requests+BeautifulSoup at the entry stage of Python.
As a crawler beginner I have also gained a little experience. The next step, while continuing to practice and consolidate what I already know, is to explore how servers resist crawlers: how to rotate a random User-Agent, how to build a pool of anonymous proxy IPs, and how to use multiprocessing. There is still a long way to go. A minimal request sketch using the first two tricks follows.
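Here is a small sketch, mine rather than anything from the original post, of a single RCurl request sent through a proxy with a randomly chosen User-Agent. The proxy address and the UA strings are placeholders, not working values:

library(RCurl)

# a tiny pool of User-Agent strings to rotate through (placeholders)
ua_pool <- c("Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
             "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6)")

page <- getURL("https://www.baidu.com/",
               useragent = sample(ua_pool, 1),  # pick a random User-Agent per request
               proxy     = "1.2.3.4:8080",      # anonymous proxy as ip:port (placeholder)
               timeout   = 5)                   # give up on slow proxies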
I have long planned to scrape the short reviews of popular films and TV series, and have tried several times. Douban's short-review pages require login to view, and a popular title's short reviews usually run to 10,000+ pages, so frequent requests get your IP banned. I have therefore been studying how to solve this problem in a friendly, gentle way.
A few days ago I saw that someone in the Python enthusiast community had written proxy-pool code in Python, and I wanted to build one in R as well. That code tests proxy IPs for validity with multiprocessing, but I don't know R's multiprocessing facilities well enough, so I could only check the proxies one by one with a brute-force method. It is very time-consuming, and a bit clumsy, but it does run successfully.
The post in question: "Scraping proxy IPs to quietly inflate article read counts."
The target to scrape is the domestic Xici high-anonymity proxy site. Veterans have told me that free proxies rarely contain good ones: many anonymous proxies have a time limit, and the ones on the front pages are probably already in use by many developers, so no matter how many you scrape, few remain usable. I scraped the first 6 pages in total with RCurl+XML, used the Baidu search home page as the test target, and did a simple screening: out of 600 IPs only 13 were usable ~
Then again, Xici has 2000+ pages of proxy IPs in total, roughly 200,000+ proxies. If you don't mind the trouble you can work through them slowly, but be friendly about it! If you want something that just works, well, as the saying goes, money makes the mare go: paid proxies are the reliable option.
Below is a simple proxy-IP scraping and testing script that I wrote myself in R, following the ideas of the article above.
Load the required packages:
Library ("RCurl") library ("XML") library ("dplyr")
Get available User-Agents
# found some usable User-Agent strings on this page (the URL was lost
# from the source; pass it in as `url`). The page structure and the
# XPath below are a reconstruction of the garbled original.
GetUserAgent <- function(url){
  web    <- getURL(url)
  parsed <- htmlParse(web, encoding = "UTF-8")
  UserAgent <- parsed %>%
    xpathSApply("//li/a", xmlValue) %>%
    .[1:(length(.) - 1)]   # drop the last, non-UA entry (per the surviving fragment)
  return(UserAgent)
}
# obtain the UA (User-Agent) list; the source URL did not survive in the original
ua_url      <- "..."                 # fill in the UA listing page here
myuseragent <- GetUserAgent(ua_url)
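The crawling and testing code itself did not survive in the source, so what follows is a minimal sketch of how the steps described above could look. The Xici URL pattern, the table column positions, and the 5-second timeout are all assumptions of mine, not the original script:

# scrape one page of Xici high-anonymity proxies (URL pattern assumed)
GetProxy <- function(page){
  url <- paste0("http://www.xicidaili.com/nn/", page)
  web <- getURL(url, useragent = sample(myuseragent, 1))
  tbl <- readHTMLTable(htmlParse(web, encoding = "UTF-8"),
                       which = 1, stringsAsFactors = FALSE)
  paste(tbl[[2]], tbl[[3]], sep = ":")  # assumed: column 2 = IP, column 3 = port
}

# the first 6 pages, as described above
proxies <- unlist(lapply(1:6, GetProxy))

# test each proxy one by one against the Baidu home page,
# the slow sequential check mentioned earlier
TestProxy <- function(p){
  res <- tryCatch(
    getURL("https://www.baidu.com/", proxy = p, timeout = 5,
           useragent = sample(myuseragent, 1)),
    error = function(e) NA_character_
  )
  !is.na(res) && nchar(res) > 0
}

usable <- proxies[vapply(proxies, TestProxy, logical(1))]

If you later want to speed this up, the sequential vapply over TestProxy is the natural place to swap in parallel::mclapply, which would play the role the process pool did in the Python original.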