This article shows how to use the requests library to crawl Baidu Tieba user information in Python. The content is concise and easy to follow; I hope the detailed walkthrough below gives you something useful.
1. Install the required packages
requests, which is used to send GET or POST requests and read the responses.
pip install requests
BeautifulSoup (bs4), which is used to parse the downloaded HTML pages. It is convenient and simple, though not the fastest parser.
Besides this package, you could also use XPath, CSS selectors, or even regular expressions; use whatever you prefer. This article uses BeautifulSoup.
pip install bs4
pymongo, the Python driver for MongoDB. Some of the crawled data is dirty, so a non-relational database is a better fit for storage, and MongoDB is a non-relational database.
pip install pymongo
Because I use a cloud database, dnspython is also needed. If you do not use the cloud database offered on MongoDB's official website, you can skip this.
pip install dnspython
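The parsing code later calls self.collection.insert_one(...), but the connection setup itself is never shown in the article. A minimal sketch of that setup, assuming a local MongoDB instance and placeholder database/collection names:

from pymongo import MongoClient

# Placeholder connection: a local MongoDB. For the cloud database mentioned above,
# a mongodb+srv:// URI from MongoDB Atlas would go here (that is what needs dnspython).
client = MongoClient("mongodb://localhost:27017/")
db = client["tieba"]            # hypothetical database name
collection = db["users"]        # hypothetical collection name

collection.insert_one({"name": "example_user"})   # same call style the spider uses later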
2. Analyze the page
First, let's open the "python" tieba.
At this point the page link is http://tieba.baidu.com/f?ie=utf-8&kw=python&fr=search&red_tag=s3038027151
Click "next page" a few times, then page back towards the first page,
and observe links like http://tieba.baidu.com/f?kw=python&ie=utf-8&pn=0 and http://tieba.baidu.com/f?kw=python&ie=utf-8&pn=50.
Across these links the pattern is clear: kw is the search keyword and pn controls paging; the first page is pn=0, the second 50, the third 100. Splicing URLs according to this rule and requesting them in Postman confirms the pattern, so we can send requests to fetch the list pages.
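As a quick sanity check of this pattern, a minimal sketch (the keyword and the three-page range are arbitrary examples):

import requests

keyword = "python"
for page in range(0, 150, 50):      # pn = 0, 50, 100 -> the first three pages
    url = "http://tieba.baidu.com/f?kw=%s&ie=utf-8&pn=%s" % (keyword, page)
    response = requests.get(url)
    print(url, response.status_code)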
Now that we can fetch the list pages, how do we get the user information?
Hovering the mouse over a username pops up a card with that user's information, so there should be an interface behind it that requests the user's data.
Open the developer tools and hover over the username again.
Sure enough, we find the request, and the response is JSON. Paste the response into an online JSON viewer and you can see it is exactly the data we want (the raw data is not reproduced here).
The request looks like this: http://tieba.baidu.com/home/get/panel?ie=utf-8&un=du_%E5%B0%8F%E9%99%8C.
After requesting a few different users, it becomes clear that the un parameter is what distinguishes them; judging by experience, it should be the user's registered username.
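A minimal sketch of calling this endpoint directly; the un value is simply the one from the example URL above, and whether the interface still responds this way today is not guaranteed:

import requests

un = "du_%E5%B0%8F%E9%99%8C"       # URL-encoded username taken from the example request
response = requests.get("http://tieba.baidu.com/home/get/panel?ie=utf-8&un=" + un)
print(response.json())              # the JSON payload with the user's profile fields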
So where do we get this field?
Since the page requests the back-end interface with this field, the field must be present somewhere in the page source. So we open the developer tools, look at the source, and navigate to the username element on the page.
There it is: the un field sits in the author link's href, so we can use it to splice together the panel URL. Testing confirms that this works.
3. Code
With the analysis done, it's time to write the code.
First, request the list pages.
Send the GET request for each page, then hand the HTML to parseList():
def tiebaSpider(self, key, beginPage, endPage):
    # Build the url page range and the request header
    beginPage = (beginPage - 0) * 50
    endPage = (endPage + 1) * 50
    headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36"}
    for page in range(beginPage, endPage, 50):
        url = "http://tieba.baidu.com/f?kw=%s&ie=utf-8&pn=%s" % (key, page)
        # Send the GET request for this list page
        response = requests.get(url, headers=headers)
        self.parseList(response.text)
Parse the tieba list page:
We parse the page with BeautifulSoup.
find() returns the first matching element; the first argument is the HTML tag name, and if you want to match by id you pass the id parameter.
find_all() returns all matching elements; to match by class, use the class_ parameter.
If you search by tag alone, neither id nor class_ is needed. Both methods are used below.
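As a small illustration of find() and find_all() before the real parsing code, here is a toy example; the HTML snippet is made up to mimic the structure described below, not copied from Tieba:

from bs4 import BeautifulSoup

html = ('<ul id="thread_list">'
        '<li><span class="frs-author-name-wrap">'
        '<a href="/home/main/?un=alice&fr=frs">alice</a>'
        '</span></li></ul>')
soup = BeautifulSoup(html, "html.parser")

ul = soup.find("ul", id="thread_list")                         # first <ul> with this id
spans = ul.find_all("span", class_="frs-author-name-wrap")     # all matching <span> tags
print(spans[0].find("a")["href"])                              # -> /home/main/?un=alice&fr=frs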
In the parsing code, I first find the a tag and then read its href attribute with a['href'].
Once we have the attribute value, we cut out the part after ?un= and before the first &.
If the resulting value is not empty, we send a request to get the user information:
def parseList(self, response):
    # Parse the list page: build the BeautifulSoup object
    soup = BeautifulSoup(response, 'html.parser')
    # Get the ul tag that holds the thread list
    ul = soup.find("ul", id='thread_list')
    # Get the author-name spans under it, based on the page structure
    liList = ul.find_all('span', class_='frs-author-name-wrap')
    # Extract the needed data
    for li in liList:
        a = li.find('a')
        un = a['href'].split('&')[0].split('?un=')[-1]
        if un != '' and un is not None:
            # Splice together the panel request and fetch the user information
            response = requests.get("http://tieba.baidu.com/home/get/panel?ie=utf-8&un=" + un)
            print("http://tieba.baidu.com/home/get/panel?ie=utf-8&un=" + un)
            print(response.url)
            self.parseDetail(response.text)
Parse the user information:
Since the response is JSON, we can parse it directly with the json package and store the result in MongoDB.
# Parse the user information
def parseDetail(self, response):
    try:
        info = json.loads(response)  # json.loads turns the JSON string into a dict
        data = info['data']
        result = {}
        result['name'] = data['name']
        result['name_show'] = data['name_show']
        sex = data['sex']
        if sex == 'female':
            result['sex'] = 'female'
        elif sex == 'male':
            result['sex'] = 'male'
        else:
            result['sex'] = 'unknown'
        result['tb_age'] = data['tb_age']
        result['id'] = data['id']
        result['post_num'] = data['post_num']
        result['tb_vip'] = data['tb_vip']
        result['followed_count'] = data['followed_count']
        self.collection.insert_one(result)
    except:
        pass
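The article never shows the imports or the class that holds these three methods together. One possible skeleton, as a sketch (the class name TiebaSpider, the local MongoDB URI, and the database/collection names are assumptions, not from the original):

import json
import requests
from bs4 import BeautifulSoup
from pymongo import MongoClient


class TiebaSpider:
    def __init__(self):
        # Placeholder local MongoDB; a cloud (Atlas) URI would go here instead
        client = MongoClient("mongodb://localhost:27017/")
        self.collection = client["tieba"]["users"]

    # tiebaSpider(), parseList() and parseDetail() from the sections above
    # would be pasted here as methods of this class.


if __name__ == "__main__":
    spider = TiebaSpider()
    # With the three methods in place, this would crawl the first three
    # list pages of the "python" tieba:
    # spider.tiebaSpider("python", 0, 2)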
4. Summary
At this point, the whole project is done.
But there is still room for improvement. I will write down my ideas here; you can try them yourself as you study.
1. There is no deduplication, which leads to a lot of duplicate data.
Solution: keep a record of the users whose information has already been requested, and before requesting again, first check whether that user has already been fetched.
2. Anti-scraping: when requesting the panel link, redirections appear. Testing in Postman shows the link itself is fine; the requests are simply blocked when they come too frequently.
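A rough sketch of both ideas combined, using a seen set for deduplication plus a delay between requests; the one-second pause is an arbitrary value, not something the article specifies:

import time
import requests

seen = set()

def fetch_user(un):
    if un in seen:            # already requested: skip it to avoid duplicate data
        return None
    seen.add(un)
    time.sleep(1)             # slow down so frequent requests are less likely to be blocked
    return requests.get("http://tieba.baidu.com/home/get/panel?ie=utf-8&un=" + un)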
The above is how to use requests to crawl Baidu Tieba user information in Python; I hope you found something useful in it.