
How to use a Python crawler to collect official account articles and links




This article mainly explains how to use a Python crawler to collect official account articles and links. The explanation is simple, clear and easy to follow. Please follow the editor's train of thought to study how to use a Python crawler to collect official account articles and links.

Packet capture

We need to capture the URL of the official account article request with a packet capture tool; you can refer to a previously written article on preparing a Python crawler for an APP. This time, it is easier to capture the official account article list directly from WeChat on PC.

I use the packet capture tool Charles as an example. Make sure the option that allows requests from the computer to be captured is checked; it is generally enabled by default.

To filter out other irrelevant requests, we set the domain name we want to capture at the bottom left.

After opening WeChat on PC and the official account article list of "Python knowledge Circle", Charles captures a large number of requests, among which we find the ones we need. The returned JSON contains the title, summary, link and other information of each article, all under comm_msg_info.

These are the data returned for the request; the request URL itself can be checked in the Overview panel.

With all this information obtained from the packet capture, we can write a crawler to collect the information of all the articles and save it.

Initialization function

Slide up through the official account's history article list. After more articles are loaded, we find that only the offset parameter changes in the link. We create an initialization function and add the proxy IP, the request headers and other information; the request headers contain User-Agent, Cookie and Referer.

All of this information can be seen in the packet capture tool.
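Since the article does not show this function directly, here is a minimal sketch of what such an initialization could look like; the class name WxCrawler, the base_url and all header and proxy values are placeholders and should be replaced with what you copied from the packet capture tool.

import csv
import json
import random
import time

import requests


class WxCrawler(object):
    def __init__(self):
        # Request URL captured with Charles; only the offset parameter changes when more articles load
        self.base_url = 'https://mp.weixin.qq.com/mp/profile_ext?action=getmsg&offset={}'  # placeholder, copy yours from Charles
        self.offset = 0
        # Request headers copied from the packet capture tool
        self.headers = {
            'User-Agent': 'copy the User-Agent from Charles',
            'Cookie': 'copy the Cookie from Charles',
            'Referer': 'copy the Referer from Charles',
        }
        # Proxy IP, optional
        self.proxy = {'https': 'http://ip:port'}  # placeholder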

Request data

Having analyzed the request link from the packet capture, we can request it with the requests library and check whether the status code is 200; a 200 means the response is normal. We then build a function parse_data() to parse and extract the information we need.

def request_data(self):
    try:
        response = requests.get(self.base_url.format(self.offset), headers=self.headers, proxies=self.proxy)
        print(self.base_url.format(self.offset))
        if 200 == response.status_code:
            self.parse_data(response.text)
    except Exception as e:
        print(e)
        time.sleep(2)
        pass

Extract data

By analyzing the returned JSON data, we can see that all the data we need is under app_msg_ext_info.
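For reference, here is a simplified sketch of the response shape, reconstructed only from the fields used in the parsing code below (a real response contains many more fields); note that general_msg_list is itself a JSON string and has to be parsed a second time.

# Simplified, assumed shape of the returned JSON (only the fields used below)
example_response = {
    "ret": 0,
    "msg_count": 10,
    "general_msg_list": '{"list": [{"app_msg_ext_info": {"title": "...", "digest": "...", "content_url": "https://..."}}]}'
}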

We use json.loads to parse the returned JSON and store the columns we need in a csv file, with three columns: title, summary and article link. Other information can be added as needed.

def parse_data(self, responseData):
    all_datas = json.loads(responseData)
    if 0 == all_datas['ret'] and all_datas['msg_count'] > 0:
        # general_msg_list is itself a JSON string and must be parsed again
        summy_datas = all_datas['general_msg_list']
        datas = json.loads(summy_datas)['list']
        a = []
        for data in datas:
            try:
                title = data['app_msg_ext_info']['title']
                title_child = data['app_msg_ext_info']['digest']
                article_url = data['app_msg_ext_info']['content_url']
                info = {}
                info['title'] = title
                info['subtitle'] = title_child
                info['article link'] = article_url
                a.append(info)
            except Exception as e:
                print(e)
                continue
        print('writing file')
        with open('Python official account article collection 1.csv', 'a', newline='', encoding='utf-8') as f:
            fieldnames = ['title', 'subtitle', 'article link']  # control the order of the columns
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(a)
        print("write successful")
        print('-')
        time.sleep(int(format(random.randint(2, 5))))
        # request the next page of articles
        self.offset = self.offset + 10
        self.request_data()
    else:
        print('data capture is complete!')

In this way, the crawled results are saved in csv format.
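Putting the pieces together, and assuming the two methods above belong to the class sketched in the initialization section (the name WxCrawler is my own, not from the article), running the crawler is just:

if __name__ == '__main__':
    crawler = WxCrawler()   # assumed class name from the earlier initialization sketch
    crawler.request_data()  # keeps paging through the article list until no data is left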

When you run the code, you may encounter an SSLError; the quickest workaround is to remove the "s" from the https in base_url and run it again.
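As a hedged illustration of that workaround (the URL below is a placeholder, not the one from the article):

import requests

# Quick workaround from the text: drop the 's' from https in base_url
base_url = 'http://mp.weixin.qq.com/mp/profile_ext?action=getmsg&offset={}'  # placeholder url

# An alternative is to keep https and skip certificate verification, which requests supports:
# requests.get(base_url.format(0), verify=False)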

Save links in Markdown format

People who write articles regularly know that articles are usually written in Markdown format, so that the formatting stays the same no matter which platform they are published on.

In Markdown, a link is written as [article title](article url link), so we just add one more column when saving the information. Since the title and article link have already been obtained, building the Markdown-format url is simple.

md_url = '[{}]'.format(title) + '({})'.format(article_url)
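A sketch of how that extra column could be wired into parse_data above; the example values and the column name "markdown link" are my own choices, not from the article.

title = 'Example article'                  # would come from app_msg_ext_info['title']
article_url = 'https://example.com/post'   # would come from app_msg_ext_info['content_url']

md_url = '[{}]'.format(title) + '({})'.format(article_url)
info = {'title': title, 'subtitle': '...', 'article link': article_url}
info['markdown link'] = md_url  # the extra column

fieldnames = ['title', 'subtitle', 'article link', 'markdown link']  # include the new column in the csv header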

After the crawl is complete, the effect is as follows.

We can paste all the Markdown links into a note in Markdown format; most note-taking software supports creating Markdown files.

Thank you for reading. That is the content of "how to use a Python crawler to collect official account articles and links". After studying this article, I believe you have a deeper understanding of the topic, though the specific usage still needs to be verified in practice. The editor will push more related articles for you, welcome to follow!
