2025-04-10 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/03 Report--
This article walks through worked examples of Python crawlers and the techniques they use to get around anti-crawling measures. It is meant as a practical reference for readers who are interested in the topic.
1. Browser simulation (Headers)
Browser simulation is one of the most common ways to get past anti-crawling checks. Consider: if a website is visited over and over by exactly the same browser version, it will probably treat the visitor as a robot. The countermeasure is simple: send different browser version information on each visit. First, let's look at how to find your own browser's information.
How to find browser information
Open the browser and press F12 (or right-click and choose Inspect).
Click the Network tab, shown in the following figure.
Press Ctrl+R (Mac: Command+R) to reload the page and capture its requests.
After the previous step, click any item in the Name column and the following page appears. The content in the red box is the browser information we are looking for.
Note: some websites also check the Referer header, whose job is to tell the server which URL you came from. Sites such as "P station" verify it, so we can find the Referer value in the browser in the same way as above. The red box in the following figure shows it:
With the steps above we obtain the browser's version information (its User-Agent string). Once we have several different User-Agent strings, we can simulate different browsers.
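As a rough sketch of that idea, the helper below picks a random User-Agent for each request and optionally adds a Referer. The name `build_headers` and the three-entry pool are illustrative assumptions, not part of the original article; a real pool would be much larger.

```python
import random

# Illustrative pool (assumption): three User-Agent strings chosen for the example.
USER_AGENT_POOL = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1",
    "Opera/9.80 (Windows NT 6.0) Presto/2.12.388 Version/12.14",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.1 (KHTML, like Gecko) "
    "Chrome/14.0.835.202 Safari/535.1",
]


def build_headers(referer=None):
    """Return request headers with a randomly chosen User-Agent.

    `referer` is optional; when given, it is sent as the Referer header.
    """
    headers = {"User-Agent": random.choice(USER_AGENT_POOL)}
    if referer is not None:
        headers["Referer"] = referer
    return headers
```

The returned dictionary can be passed straight to requests, for example requests.get(url, headers=build_headers()), so that each request may present a different browser.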
The way to use the User-Agent in Python is as follows:

    import requests

    # Replace the placeholder strings with the values copied from the browser;
    # url is the address of the page to fetch.
    headers = {'Referer': 'specific Referer', 'User-Agent': 'specific user-agent'}
    requests.get(url, headers=headers)

Commonly used request header (browser simulation) values are as follows:

    User_Agent = [
        "Mozilla/5.0 (iPod; U; CPU iPhone OS 4_3_2 like Mac OS X; zh-cn) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8H7 Safari/6533.18.5",
        "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_2 like Mac OS X; zh-cn) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8H7 Safari/6533.18.5",
        "MQQBrowser/25 (Linux; U; 2.3.3; zh-cn; HTC Desire S Build/GRI40;480*800)",
        "Mozilla/5.0 (Linux; U; Android 2.3.3; zh-cn; HTC_DesireS_S510e Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
        "Mozilla/5.0 (SymbianOS/9.3; U; Series60/3.2 NokiaE75-1 /110.48.125 Profile/MIDP-2.1 Configuration/CLDC-1.1) AppleWebKit/413 (KHTML, like Gecko) Safari/413",
        "Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; zh-cn) AppleWebKit/533.17.9 (KHTML, like Gecko) Mobile/8J2",
        "Mozilla/5.0 (Windows NT 5.2) AppleWebKit/534.30 (KHTML, like Gecko) Chrome/12.0.742.122 Safari/534.30",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.835.202 Safari/535.1",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/534.51.22 (KHTML, like Gecko) Version/5.1.1 Safari/534.51.22",
        "Mozilla/5.0 (iPhone; CPU iPhone OS 5_0 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9A5313e Safari/7534.48.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.835.202 Safari/535.1",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0; SAMSUNG OMNIA7)",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0; XBLWP7; ZuneWP7)",
        "Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0",
        "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.2; Trident/4.0; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; .NET4.0C)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET4.0E; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; .NET4.0C)",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727)",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
        "Opera/9.80 (Windows NT 5.1; U; zh-cn) Presto/2.9.168 Version/11.50",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)",
        "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727; .NET CLR 3.0.04506.648; .NET CLR 3.5.21022; .NET4.0E; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; .NET4.0C)",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/533.21.1 (KHTML, like Gecko) Version/5.0.5 Safari/533.21.1",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; ) AppleWebKit/534.12 (KHTML, like Gecko) Maxthon/3.0 Safari/534.12",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727; TheWorld)",
        "Opera/9.80 (X11; Linux i686; Ubuntu/14.10) Presto/2.12.388 Version/12.16",
        "Opera/9.80 (Windows NT 6.0) Presto/2.12.388 Version/12.14",
        "Mozilla/5.0 (Windows NT 6.0; rv:2.0) Gecko/20100101 Firefox/4.0 Opera 12.14",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0) Opera 12.14",
        "Opera/12.80 (Windows NT 5.1; U; en) Presto/2.10.289 Version/12.02",
        "Opera/9.80 (Windows NT 6.1; U; es-ES) Presto/2.9.181 Version/12.00",
        "Opera/9.80 (Windows NT 5.1; U; zh-sg) Presto/2.9.181 Version/12.00",
        "Opera/12.0 (Windows NT 5.2; U; en) Presto/22.9.168 Version/12.00",
        "Opera/12.0 (Windows NT 5.1) Presto/22.9.168 Version/12.00",
        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1",
        "Mozilla/5.0 (Windows NT 6.3; rv:36.0) Gecko/20100101 Firefox/36.0",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10; rv:33.0) Gecko/20100101 Firefox/33.0",
        "Mozilla/5.0 (X11; Linux i586; rv:31.0) Gecko/20100101 Firefox/31.0",
        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20130401 Firefox/31.0",
        "Mozilla/5.0 (Windows NT 5.1; rv:31.0) Gecko/20100101 Firefox/31.0",
        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:29.0) Gecko/20120101 Firefox/29.0",
        "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/29.0",
        "Mozilla/5.0 (X11; OpenBSD amd64; rv:28.0) Gecko/20100101 Firefox/28.0",
        "Mozilla/5.0 (X11; Linux x86_64; rv:28.0) Gecko/20100101 Firefox/28.0",
        "Mozilla/5.0 (Windows NT 6.1; rv:27.3) Gecko/20130101 Firefox/27.3",
        "Mozilla/5.0 (Windows NT 6.2; Win64; x64; rv:27.0) Gecko/20121011 Firefox/27.0",
        "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:25.0) Gecko/20100101 Firefox/25.0",
        "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:24.0) Gecko/20100101 Firefox/24.0",
        "Mozilla/5.0 (Windows NT 6.0; WOW64; rv:24.0) Gecko/20100101 Firefox/24.0",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:24.0) Gecko/20100101 Firefox/24.0",
    ]

2. IP proxy
Besides repeated visits from the same browser, a site may also notice repeated visits from the same IP address, which can easily get that IP blocked entirely. For a personal IP this is a minor nuisance, but if a whole company's IP can no longer reach a website, the consequences speak for themselves.
With IP addresses, it is necessary to control not only how the address changes but also the access speed. After all, a program can send requests far faster than any human visitor.
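One way to keep the access speed under control is to pause for a slightly different interval before each request. The sketch below is only an illustration: `polite_delay` and its default values are assumptions, and the `sleep` parameter exists so the function can be exercised without actually waiting.

```python
import random
import time


def polite_delay(base=5.0, jitter=2.0, sleep=time.sleep):
    """Pause for `base` seconds plus a random extra of up to `jitter` seconds.

    Returns the number of seconds that was requested from `sleep`.
    """
    delay = base + random.uniform(0, jitter)
    sleep(delay)
    return delay
```

Calling polite_delay() between requests spaces them out, and the random part avoids a perfectly regular rhythm.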
Python uses an IP proxy in the following way:

    import requests

    # Replace 'IP address' with the address of the proxy.
    proxies = {"http": 'IP address'}
    requests.get(url, headers=headers, proxies=proxies)
Note: as for where to find proxy IPs, there is no need to worry; a quick search on the Internet turns up plenty of them.
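To change the IP between requests, the proxies can be cycled in turn. The following is a minimal sketch under stated assumptions: `proxy_cycle` is a hypothetical helper, and the addresses are placeholders from the reserved documentation range 203.0.113.0/24, not working proxies.

```python
import itertools

# Placeholder addresses (assumption): replace with real proxies before use.
PROXY_ADDRESSES = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
]


def proxy_cycle(addresses):
    """Yield a requests-style proxies dict for each address, round-robin, forever."""
    for address in itertools.cycle(addresses):
        yield {"http": address, "https": address}
```

With proxies = proxy_cycle(PROXY_ADDRESSES), each call can use the next address via requests.get(url, headers=headers, proxies=next(proxies)).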
Control the access frequency with the time module:

    import time

    time.sleep(5)  # pause for 5 seconds between requests

3. Cookie simulation
We often run into a 403 error when visiting a URL. It usually means we do not have permission to access the requested resource, typically because no cookie was sent or the cookie was wrong. A cookie works like a pass for a website, and you will find that it differs between the logged-in and logged-out states.
Obtain cookie manually
We can get the cookie manually in the same way as we found the User-Agent, since it appears among the request headers in the Network panel.
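A cookie copied from the browser is a single string of name=value pairs separated by semicolons. As a small sketch, the hypothetical helper below splits such a string into a dictionary; the sample names `sessionid` and `theme` are made up for illustration.

```python
def cookie_header_to_dict(raw_header):
    """Split a copied Cookie header value into a {name: value} dictionary."""
    cookies = {}
    for pair in raw_header.split(";"):
        name, sep, value = pair.strip().partition("=")
        if sep and name:  # skip empty or malformed fragments
            cookies[name] = value
    return cookies


# Made-up example of a string copied from the browser's request headers.
copied = "sessionid=abc123; theme=dark"
parsed = cookie_header_to_dict(copied)
```

The resulting dictionary can be handed to requests through its cookies parameter, for example requests.get(url, cookies=parsed), instead of pasting the whole string into the headers.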
Get cookie automatically
We can get the cookie automatically by using a requests session.
The sample code is as follows:
    import requests
    from http.cookiejar import LWPCookieJar

    session = requests.session()
    session.cookies = LWPCookieJar(filename='Cookies.txt')

    def login():
        name = input("enter account:")
        password = input("enter password:")
        url = "url"  # placeholder: the site's login URL
        data = {
            "ck": "",
            "name": name,
            "password": password,
            "remember": "True",
            "ticket": "",
        }
        response = session.post(url, data=data)
        print(response.text)
        # Save the cookies; ignore_discard=True also keeps session cookies,
        # which would otherwise be left out of the file.
        session.cookies.save(ignore_discard=True)
This way our cookies are saved to Cookies.txt.
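The saving step can be tried offline with a hand-built cookie. The sketch below is an illustration only: the cookie name, value and domain are invented, and `round_trip` is a hypothetical helper. It also shows why the ignore_discard flag matters: a cookie without an expiry date is marked as a session cookie and is skipped on save unless ignore_discard=True is passed.

```python
import os
import tempfile
from http.cookiejar import Cookie, LWPCookieJar


def make_session_cookie(name, value, domain):
    """Build a cookie with no expiry date (a session cookie, discard=True)."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


def round_trip(ignore_discard):
    """Save one session cookie to a temp file, reload it, return {name: value}."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "Cookies.txt")
        jar = LWPCookieJar(filename=path)
        jar.set_cookie(make_session_cookie("sessionid", "abc123", "example.com"))
        jar.save(ignore_discard=ignore_discard)

        loaded = LWPCookieJar(filename=path)
        loaded.load(ignore_discard=True)
        return {cookie.name: cookie.value for cookie in loaded}
```

Comparing round_trip(True) with round_trip(False) shows whether the session cookie made it into the file.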
You can load the saved cookies into a session as follows:

    session.cookies = LWPCookieJar(filename='Cookies.txt')
    session.cookies.load(ignore_discard=True)

Use cookies
Once we have the cookies, using them is easy: pass them in the request headers, just as we did with the User-Agent.

    # Replace the placeholder strings with the real values.
    headers = {'Referer': 'specific Referer', 'User-Agent': 'specific user-agent', 'Cookie': 'cookie'}
    requests.get(url, headers=headers)