
How to use a Python crawler to inflate the view count of CSDN articles


This article mainly introduces how to use a Python crawler to inflate the view count of CSDN articles. Many people have doubts about this in day-to-day practice, so the editor has looked through various materials and put together a simple, easy-to-follow method. I hope it helps answer your questions. Please follow along and study!

Looking at my pitifully small view count, I suddenly had the idea of using a crawler to inflate it, mainly in the spirit of learning by trying.

In fact, there is off-the-shelf software on the market that can generate traffic, such as Traffic Wizard, and honestly it feels more polished than the code we write ourselves.

First version: the following code is borrowed from the Internet and runs on Python 3.

import urllib.request
import urllib.error
import time

# build_opener() lets the script imitate a browser when opening pages
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
print('start brushing:')
tempUrl = 'https://blog.csdn.net/Lin_QC/article/details/88966839'
for j in range(2000):
    try:
        opener.open(tempUrl)
        time.sleep(7)
        print('%d %s' % (j, tempUrl))
    except urllib.error.HTTPError:
        print('urllib.error.HTTPError')
        time.sleep(1)
    except urllib.error.URLError:
        print('urllib.error.URLError')
        time.sleep(1)

The code simply uses the crawler to open the page over and over to refresh the view count, but this method hits a bottleneck: once the count has been refreshed past a certain point, CSDN's server blocks the IP, and the view count stops growing.
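As a side note (a minimal Python 3 sketch of my own, assuming the block surfaces as an HTTP error status rather than some other failure mode), you can at least see when this happens by inspecting the error:

import time
import urllib.request
import urllib.error

opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
url = 'https://blog.csdn.net/Lin_QC/article/details/88966839'

try:
    opener.open(url, timeout=3)
except urllib.error.HTTPError as e:
    # If the IP has been blocked, the failure shows up as an HTTP error;
    # e.code carries the status the server answered with.
    print('request rejected, HTTP status', e.code)
    time.sleep(60)  # back off instead of hammering the server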

That is what led to the second version.

The https://www.xicidaili.com website lists plenty of proxy IPs, and using these proxies keeps the CSDN server from blocking our access.
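For reference, here is a minimal Python 3 sketch of sending a single request through such a proxy with urllib; the proxy address shown is a made-up placeholder, not a real entry from the list:

import urllib.request

# Hypothetical proxy address; substitute one scraped from xicidaili.
proxy_support = urllib.request.ProxyHandler({'http': 'http://1.2.3.4:8080'})
opener = urllib.request.build_opener(proxy_support)
opener.addheaders = [('User-agent', 'Mozilla/5.0')]

# The request now goes out through the proxy, so CSDN sees the
# proxy's IP rather than ours.
resp = opener.open('https://blog.csdn.net/Lin_QC/article/details/88966839', timeout=3)
print(resp.getcode())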

First of all, I wrote a script to fetch proxy IPs. In my own experiments, the domestic HTTP proxies were relatively stable, so we crawl the proxy IP information from the page https://www.xicidaili.com/wt/1 and store it in a file named proxy. The following code is based on Python 2, so be careful not to get the version wrong.

Proxy_IP.py:

import urllib2
import BeautifulSoup

User_Agent = 'Mozilla/5.0 (Windows NT 6.3; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0'
header = {}
header['User-Agent'] = User_Agent

url = 'https://www.xicidaili.com/wt/1'
req = urllib2.Request(url, headers=header)
res = urllib2.urlopen(req).read()

soup = BeautifulSoup.BeautifulSoup(res)
ips = soup.findAll('tr')  # each table row holds one proxy entry
f = open("proxy", "w")
for x in range(1, len(ips)):  # skip the table header row
    ip = ips[x]
    tds = ip.findAll("td")
    # columns 1 and 2 are the IP address and the port
    ip_temp = tds[1].contents[0] + "," + tds[2].contents[0] + "\n"
    print tds[1].contents[0] + "\t" + tds[2].contents[0]
    f.write(ip_temp)

By running the code above we obtain a large number of proxy IPs, and we can then use them to visit the blog posts.

csdnfake.py:

import urllib2
import socket
import time
import random

socket.setdefaulttimeout(3)

user_agent_list = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.85 Safari/537.36 115Browser/6.0.3',
    'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
    'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
    'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)',
    'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
    'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
    'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
]

# Read the "ip,port" pairs saved by Proxy_IP.py and turn each one
# into a proxy dict that urllib2 understands.
f = open("proxy")
lines = f.readlines()
proxys = []
for i in range(0, len(lines)):
    ip = lines[i].strip().split(",")
    proxy_host = "http://" + ip[0] + ":" + ip[1]
    print proxy_host
    proxy_temp = {"http": proxy_host}
    proxys.append(proxy_temp)

urls = {
    "https://blog.csdn.net/Lin_QC/article/details/88966839",
    "https://blog.csdn.net/Lin_QC/article/details/88930018",
    "https://blog.csdn.net/Lin_QC/article/details/88642949",
    "https://blog.csdn.net/Lin_QC/article/details/84568170",
    "https://blog.csdn.net/Lin_QC/article/details/84451279",
    "https://blog.csdn.net/Lin_QC/article/details/84927503",
}

j = 1
for i in range(100):  # outer repeat count was garbled in the source; any large number works
    for proxy in proxys:
        for url in urls:
            try:
                user_agent = random.choice(user_agent_list)
                proxy_support = urllib2.ProxyHandler(proxy)
                opener = urllib2.build_opener(proxy_support, urllib2.HTTPHandler)
                urllib2.install_opener(opener)
                req = urllib2.Request(url)
                req.add_header('User-Agent', user_agent)  # apply the random browser header
                c = urllib2.urlopen(req)
                print "successful", j
                j += 1
                time.sleep(5)  # CSDN ignores hits that arrive too quickly
            except Exception, e:
                print proxy
                print e
                continue

user_agent_list is a collection of browser User-Agent headers; sending one of these lets each request imitate a real browser visiting the blog.
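In Python 3 terms (a sketch of my own, using a trimmed-down list; any of the strings above would do), picking a random User-Agent per request looks like this:

import random
import urllib.request

user_agent_list = [
    'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
    'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
]

# A fresh random User-Agent for every request keeps the traffic from
# all appearing to come from one browser.
req = urllib.request.Request(
    'https://blog.csdn.net/Lin_QC/article/details/88966839',
    headers={'User-Agent': random.choice(user_agent_list)},
)
print(urllib.request.urlopen(req, timeout=3).getcode())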

Each visit is followed by a five-second pause, mainly because requests that arrive too fast are not counted by CSDN.
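One possible refinement (my own suggestion, not part of the original code) is to jitter the pause rather than sleeping exactly five seconds every time:

import random
import time

# Sleep somewhere between 5 and 9 seconds; irregular pacing looks
# less mechanical than a fixed interval.
time.sleep(5 + random.uniform(0, 4))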

The effect is obvious: there is a significant gap between the view counts of the posts that were visited this way and those that were not.

At this point, the study of "how to use a Python crawler to inflate the view count of CSDN articles" is over. I hope it has resolved your doubts. Pairing theory with practice is the best way to learn, so go and try it! If you want to keep learning more on related topics, please continue to follow the site; the editor will keep working hard to bring you more practical articles!
