
How to use a Scrapy framework crawler to crawl the Weibo hot search in Python


This article describes how to use a Scrapy framework crawler to crawl the Weibo hot search in Python. The editor thinks it is very practical, so it is shared here as a reference; follow along with the editor to have a look.

The main functions implemented are:

0. Of course, it bypasses the site's various anti-crawling measures (see the settings sketch after this list).

1. Crawls all the main content of the hot search list.

2. Crawls the Weibo posts related to each hot search.

3. Crawls the comments under each related post, together with details of the commenting users.

4. Automatic paging, so that in theory any detail related to a hot search can be pulled down. The amount of data is fairly large, though, so it is recommended to use a database to optimize the crawler program (a database had not been learned yet at the time, so the data was stored locally in a fixed format).
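The anti-crawling handling here boils down to carrying a logged-in cookie (see start_requests in the spider below) and pausing between requests. As a rough illustration only, a hypothetical settings.py for such a project might set the values below; none of them come from the original article, and the pipeline path assumes a project layout named weibo:

    # settings.py -- hypothetical anti-crawling related settings (not from the original project)
    BOT_NAME = 'weibo'

    # Weibo's robots.txt disallows crawlers, so the project has to ignore it
    ROBOTSTXT_OBEY = False

    # pretend to be an ordinary browser
    USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0 Safari/537.36')

    # slow requests down and keep cookies between them
    DOWNLOAD_DELAY = 3
    COOKIES_ENABLED = True

    # hand crawled items to the persistence pipeline sketched later in this article
    ITEM_PIPELINES = {
        'weibo.pipelines.WeiboPipeline': 300,
    }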

(Not yet implemented:)

Use the crawled data to build a social network: with Python data analysis, the crawled users could be linked together into a social network graph.
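As a rough sketch of that idea (not part of the original project), records written by the crawler could be loaded and linked into a graph with networkx. The field names 'author' and 'comment_user' and the JSON-lines file layout below are assumptions:

    # hypothetical sketch: build a social network from crawled comment records
    import json
    import networkx as nx

    def build_graph(jsonl_path):
        g = nx.Graph()
        with open(jsonl_path, encoding='utf-8') as fp:
            for line in fp:
                record = json.loads(line)
                # connect the post author with each user who commented on the post
                author = record.get('author')
                commenter = record.get('comment_user')
                if author and commenter:
                    g.add_edge(author, commenter)
        return g

    # g = build_graph('./weibo_data/some_topic.jsonl')
    # print(g.number_of_nodes(), g.number_of_edges())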

Project structure:

Weibo.py

It is used to crawl the required data. After the callback parses the response, the data is handed to an item, and the item is then passed to the pipeline for processing, such as persisting the data.

import scrapy
from copy import deepcopy
from time import sleep
import json
from lxml import etree
import re


class WeiboSpider(scrapy.Spider):
    name = 'weibo'
    start_urls = ['https://s.weibo.com/top/summary?Refer=top_hot&topnav=1&wvr=6']
    home_page = "https://s.weibo.com/"

    # carry the cookie when initiating the first request
    def start_requests(self):
        cookies = ""  # paste a logged-in cookie string here
        cookies = {i.split("=")[0]: i.split("=")[1] for i in cookies.split("; ")}
        yield scrapy.Request(self.start_urls[0], callback=self.parse, cookies=cookies)

    # parse the hot-search titles and their links
    def parse(self, response, **kwargs):
        page_text = response.text
        with open('first.html', 'w', encoding='utf-8') as fp:
            fp.write(page_text)
        item = {}
        tr = response.xpath('//*[@id="pl_top_realtimehot"]/table//tr')[1:]
        for t in tr:
            item['title'] = t.xpath('./td[2]//text()').extract()[1]
            print('title:', item['title'])
            # item['domain_id'] = response.xpath('//input[@id="sid"]/@value').get()
            # item['description'] = response.xpath('//div[@id="description"]').get()
            detail_url = self.home_page + t.xpath('./td[2]//@href').extract_first()
            item['href'] = detail_url
            print("href:", item['href'])
            yield scrapy.Request(detail_url, callback=self.parse_item, meta={'item': deepcopy(item)})
            sleep(3)
        # item: {'title': ..., 'href': ...}

    # parse every post on the result page under each hot search
    def parse_item(self, response, **kwargs):
        item = response.meta['item']
        div_list = response.xpath('//div[@id="pl_feedlist_index"]//div[@class="card-wrap"]')[1:]
        # create a local text store named after the title
        name = item['title']
        file_path = './' + name
        for div in div_list:
            author = div.xpath('.//div[@class="info"]/div[2]/a/@nick-name').extract_first()
            brief_con = div.xpath('.//p[@node-type="feed_list_content_full"]//text()').extract()
            if not brief_con:  # fall back when the full-text node is missing
                brief_con = div.xpath('.//p[@class="txt"]//text()').extract()
            brief_con = ''.join(brief_con)
            print("brief_con:", brief_con)
            link = div.xpath('.//p[@class="from"]/a/@href').extract_first()
            if author is None or link is None:
                continue
            link = "https:" + link + '_&type=comment'
            news_id = div.xpath('./@mid').extract_first()
            print("news_id:", news_id)
            news_time = div.xpath(".//p[@class='from']/a/text()").extract()
            news_time = ''.join(news_time)
            print("news_time:", news_time)
            print("author is:", author)
            item['author'] = author
            item['news_id'] = news_id
            item['news_time'] = news_time
            item['brief_con'] = brief_con
            item['details_url'] = link
            # json link template:
            # https://weibo.com/aj/v6/comment/big?ajwvr=6&id=4577307216321742&from=singleWeiBo
            link = "https://weibo.com/aj/v6/comment/big?ajwvr=6&id=" + news_id + "&from=singleWeiBo"
            yield scrapy.Request(link, callback=self.parse_detail, meta={'item': deepcopy(item)})

    # parse the comments under each post and the commenting users' details
    # post page:   https://weibo.com/1649173367/JwjbPDW00?refer_flag=1001030103__&type=comment
    # json packet: https://weibo.com/aj/v6/comment/big?ajwvr=6&id=4577307216321742&from=singleWeiBo&__rnd=1606879908312
    def parse_detail(self, response, **kwargs):
        item = response.meta['item']
        all_html = json.loads(response.text)['data']['html']
        with open('3.html', 'w', encoding='utf-8') as fp:
            fp.write(all_html)
        tree = etree.HTML(all_html)
        # username = tree.xpath('//div[@class="list_con"]/div[@class="WB_text"]/a[1]/text()')
        # usertime = re.findall('(.*?)', all_html)
        # comment = tree.xpath('//div[@class="list_con"]/div[@class="WB_text"]//text()')
        # print(usertime)
        # because every comment is preceded by a Chinese colon "：", a regex is particularly handy here
        # comment = re.findall(r'：(.*?)
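The spider above yields plain dicts, and the article only says the data was stored locally "according to a certain format". A possible pipeline for that persistence step might look like the sketch below; the class name, output directory, and JSON-lines format are assumptions rather than the original project's code:

    # pipelines.py -- hypothetical local-storage pipeline (the original format is not shown)
    import json
    import os


    class WeiboPipeline:
        def __init__(self):
            self.out_dir = './weibo_data'
            os.makedirs(self.out_dir, exist_ok=True)

        def process_item(self, item, spider):
            # one file per hot-search title, one JSON object per line
            # (a real project would sanitize the title before using it as a file name)
            file_path = os.path.join(self.out_dir, f"{item['title']}.jsonl")
            with open(file_path, 'a', encoding='utf-8') as fp:
                fp.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
            return item

Enabling it through ITEM_PIPELINES (as in the settings sketch earlier) would make Scrapy call process_item for every dict the spider yields.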
