This article analyzes, with examples, how crawler proxy IPs can be used to rapidly increase a blog's view count, covering the analysis and the workarounds in detail, in the hope of helping readers who want to solve this problem find a simpler approach.
First of all, what the title describes is not the real goal; the point is to understand websites' anti-crawling mechanisms in more detail. If you genuinely want a higher view count, you need genuinely high-quality content.
1. Anti-crawling based on Headers
Checking the Headers of the user's request is the most common anti-crawling strategy. Many websites check the User-Agent field in the Headers, and some also check the Referer (the hotlink protection on some resource sites works by checking the Referer).
If you run into this kind of mechanism, add the Headers directly to the crawler: copy the browser's User-Agent into the crawler's Headers, or change the Referer value to the target site's domain. Anti-crawlers that inspect Headers can thus be bypassed simply by modifying or adding Headers in the crawler.
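As a quick illustration (my addition, not from the original article; the URL and header values are placeholders), spoofing these two headers with Python's urllib might look like this:
    # A minimal sketch of supplying browser-like Headers; the URL is a placeholder.
    from urllib import request

    url = 'http://example.com/'  # hypothetical target page
    headers = {
        # A User-Agent copied from a real browser
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
        # A Referer pointing at the target site's own domain, for hotlink checks
        'Referer': 'http://example.com/',
    }
    req = request.Request(url, headers=headers)
    html = request.urlopen(req).read().decode('utf-8')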
2. Anti-crawling based on user behavior
Some websites instead detect user behavior, such as the same IP visiting the same page many times within a short period, or the same account performing the same operation many times within a short period.
Most websites fall into the first case, which can be handled with proxy IPs. You can scrape public proxy IPs, verify them, and save them to a file, but this approach is not advisable: public proxy IPs fail very often, so buying proxies from a vendor that provides proxy IPs is the better option.
For the second case, you can wait a random interval of a few seconds after each request before sending the next one, as in the sketch below. On some websites with logic flaws, the restriction that the same account cannot make the same request repeatedly in a short period can be bypassed by requesting a few times, logging out, logging back in, and continuing to request.
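For the random-interval tactic, a minimal sketch (the loop and URL are illustrative, not from the article):
    # A minimal sketch of pacing requests with random delays.
    import random
    import time
    from urllib import request

    for _ in range(3):
        html = request.urlopen('http://example.com/').read()  # placeholder URL
        # Wait a random 2-5 seconds so the access pattern looks less mechanical
        time.sleep(random.uniform(2, 5))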
3. Anti-crawling based on cookies
Some sites check cookies to determine whether the visitor is a valid user; this is common on websites that require login. Some sites go further and dynamically refresh the login validation.
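A minimal sketch of carrying cookies across requests with urllib's cookie support (the login URL and form fields below are hypothetical):
    # Cookies set by the server (e.g. at login) are stored in the jar and
    # replayed automatically on later requests through the same opener.
    from http import cookiejar
    from urllib import parse, request

    jar = cookiejar.CookieJar()
    opener = request.build_opener(request.HTTPCookieProcessor(jar))

    # Hypothetical login request; any Set-Cookie header lands in the jar
    data = parse.urlencode({'user': 'alice', 'passwd': 'secret'}).encode()
    opener.open('http://example.com/login', data=data)

    # This request carries the stored cookies, so the site sees a logged-in user
    html = opener.open('http://example.com/profile').read()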
4. Restricting access from certain IPs
Proxy IPs can be obtained from many websites. Since crawlers can use these proxy IPs to crawl sites, websites can use the same lists in reverse: by scraping those IPs and saving them on the server, they can block crawlers that come in through public proxy IPs.
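From the server's side this check can be as simple as a set lookup; a minimal sketch (the blocklist addresses are made up):
    # A minimal server-side sketch: reject visitors whose IP is on a harvested
    # proxy blocklist. The addresses here are made-up examples.
    KNOWN_PROXY_IPS = {'203.0.113.5', '198.51.100.7'}

    def should_block(client_ip):
        # True if the client arrived through a known public proxy IP
        return client_ip in KNOWN_PROXY_IPS

    print(should_block('203.0.113.5'))  # True
    print(should_block('192.0.2.1'))    # False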
Now, let's put this into practice and write a crawler that accesses a site through a proxy IP.
First, fetch the proxy IPs that the crawler will use.
# Reconstruction of the article's proxy-fetching code. The original was garbled
# in places; the regexes for the port column are inferred from context.
import re
from urllib import request

def get_proxy_ip():
    headers = {
        'Host': 'www.16yun.cn',  # 16yun (Yiniuyun) proxy vendor
        'User-Agent': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)',
        'Accept': r'application/json, text/javascript, */*; q=0.01',
        'Referer': r'http://www.xicidaili.com/',
    }
    req = request.Request(r'http://www.16yun.cn/nn/', headers=headers)
    response = request.urlopen(req)
    html = response.read().decode('utf-8')
    proxy_list = []
    # Match IPv4 addresses in the page
    ip_list = re.findall(r'\d+\.\d+\.\d+\.\d+', html)
    # Match the port table cells (pattern inferred; the original line was garbled)
    port_list = re.findall(r'<td>\d+</td>', html)
    for i in range(len(ip_list)):
        ip = ip_list[i]
        # Strip the surrounding <td> tags to get the bare port number
        port = re.sub(r'<td>|</td>', '', port_list[i])
        proxy = '%s:%s' % (ip, port)
        proxy_list.append(proxy)
    return proxy_list
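The function above only collects proxies; to actually send a request through one of them (my addition, not shown in the original), urllib's ProxyHandler can be used:
    # A minimal sketch of routing one request through a fetched proxy.
    from urllib import request

    proxies = get_proxy_ip()                              # defined above
    handler = request.ProxyHandler({'http': proxies[0]})  # e.g. '1.2.3.4:8080'
    opener = request.build_opener(handler)
    html = opener.open('http://example.com/', timeout=10).read()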
That concludes this example-based look at using crawler proxy IPs to rapidly increase blog view counts. I hope the above content is of some help; if you still have questions, you can follow the industry information channel for more related knowledge.