
How to Crawl WeChat Official Account Articles with Python


This article introduces the basics of crawling WeChat official account articles with Python. Many people run into difficulties with this in practice, so let the editor walk you through how to handle these situations. I hope you read carefully and come away with something!

To implement the crawler, we need the following tools:

Chrome browser

Knowledge of Python 3 syntax

The Python requests library

In addition, this crawler uses the material-editing interface in the WeChat official account backend. The principle is that when we insert a hyperlink, WeChat calls a dedicated API (see below) to fetch the article list of the specified official account. Therefore, we also need an official account of our own.

Let's get started.

Log in to the WeChat official account backend, click "Material Management", click "New image-and-text message", and then click the "hyperlink" button at the top.

Next, press F12 to open Chrome's developer tools and select the Network tab.

Then, in the hyperlink dialog that just opened, click "Select another official account" and enter the account you want to crawl (for example, China Mobile).

The Network panel will now show some new requests, and the ones whose names start with "appmsg" are the ones we need to analyze.

Let's take apart the request URL:

https://mp.weixin.qq.com/cgi-bin/appmsg?action=list_ex&begin=0&count=5&fakeid=MzI1MjU5MjMzNA==&type=9&query=&token=143406284&lang=zh_CN&f=json&ajax=1

It breaks down into three parts:

https://mp.weixin.qq.com/cgi-bin/appmsg: the base of the request

action=list_ex: a pattern often used by dynamic websites, where different parameter values generate different pages or return different results

begin=0&count=5&fakeid=...: the parameters after the "?", i.e. begin=0, count=5, and so on

By paging through the results, we find that only begin changes from request to request, increasing by 5 each time, which is exactly the value of count.
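To make the paging rule concrete, here is a minimal sketch of how a page index maps to the request parameters (the fakeid and token values are the example values from the URL above, not real credentials):

# Minimal sketch: build the query parameters for page n of the article list.
# fakeid and token are the example values from the URL above, not real credentials.
def page_params(n, count=5):
    return {
        "action": "list_ex",
        "begin": str(n * count),  # page 0 -> begin=0, page 1 -> begin=5, ...
        "count": str(count),
        "fakeid": "MzI1MjU5MjMzNA==",
        "type": "9",
        "query": "",
        "token": "143406284",
        "lang": "zh_CN",
        "f": "json",
        "ajax": "1",
    }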

Next, we fetch the same resource with Python. However, you cannot get it by simply running the following code:

import requests

url = "https://mp.weixin.qq.com/cgi-bin/appmsg?action=list_ex&begin=0&count=5&fakeid=MzI1MjU5MjMzNA==&type=9&query=&token=1957521839&lang=zh_CN&f=json&ajax=1"
requests.get(url).json()
# {'base_resp': {'ret': 200003, 'err_msg': 'invalid session'}}

We can get the resource in the browser because we are logged in to the WeChat official account backend, while Python has no such login state, so the request is invalid. We need to set the headers parameter in requests, passing in Cookie and User-Agent to simulate the login.

Since the header information changes from session to session, I keep it in a separate file, "wechat.yaml", with the following content:

cookie: ua_id=wuzWM9FKE14...
user_agent: Mozilla/5.0...

Then we just need to read it:

# read cookie and user_agent
import yaml

with open("wechat.yaml", "r") as file:
    file_data = file.read()
config = yaml.safe_load(file_data)

headers = {
    "Cookie": config['cookie'],
    "User-Agent": config['user_agent'],
}

requests.get(url, headers=headers, verify=False).json()

In the returned JSON we can see each article's title (title), summary (digest), link (link), push time (update_time) and cover image URL (cover).

appmsgid is the unique identifier of each push, and aid is the unique identifier of each individual article.
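When the request does succeed, a short sketch like the following walks one page of results and prints the fields just mentioned (it reuses the url and headers defined above):

# Sketch: fetch one page and print the fields described above;
# assumes `url` and `headers` are the ones defined earlier.
data = requests.get(url, headers=headers, verify=False).json()
for item in data["app_msg_list"]:
    print(item["appmsgid"], item["aid"], item["title"],
          item["digest"], item["link"], item["update_time"], item["cover"])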

In fact, besides Cookie, the token parameter in the URL is also used to restrict crawlers, so the earlier request may well return {'base_resp': {'ret': 200040, 'err_msg': 'invalid csrf token'}} instead.

Next, we write a loop that fetches the JSON of all the articles and saves it.

import json
import requests
import time
import random
import yaml

with open("wechat.yaml", "r") as file:
    file_data = file.read()
config = yaml.safe_load(file_data)

headers = {
    "Cookie": config['cookie'],
    "User-Agent": config['user_agent'],
}

# request parameters
url = "https://mp.weixin.qq.com/cgi-bin/appmsg"
begin = "0"
params = {
    "action": "list_ex",
    "begin": begin,
    "count": "5",
    "fakeid": config['fakeid'],
    "type": "9",
    "token": config['token'],
    "lang": "zh_CN",
    "f": "json",
    "ajax": "1",
}

# storage for results
app_msg_list = []
# we don't know how many articles the account has, so use a while loop;
# this also makes it easy to set the page count when rerunning
i = 0
while True:
    begin = i * 5
    params["begin"] = str(begin)
    # pause for a few random seconds to avoid being blocked for requesting too fast
    time.sleep(random.randint(1, 10))
    resp = requests.get(url, headers=headers, params=params, verify=False)
    # WeChat flow control: exit
    if resp.json()['base_resp']['ret'] == 200013:
        print("frequency control, stop at {}".format(str(begin)))
        break
    # if the returned content is empty, we are done
    if len(resp.json()['app_msg_list']) == 0:
        print("all articles parsed")
        break
    app_msg_list.append(resp.json())
    # next page
    i += 1

In the code above, I also store fakeid and token in "wechat.yaml": fakeid is the unique identifier of each official account, while token changes frequently. Both values can be obtained either by parsing the URL or from the developer tools.
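For the URL route, the standard library is enough to pull both values out of the link copied from the developer tools; a small sketch:

# Sketch: recover token and fakeid from the request URL copied out of
# Chrome's developer tools, using only the standard library.
from urllib.parse import urlparse, parse_qs

url = "https://mp.weixin.qq.com/cgi-bin/appmsg?action=list_ex&begin=0&count=5&fakeid=MzI1MjU5MjMzNA==&type=9&query=&token=143406284&lang=zh_CN&f=json&ajax=1"
qs = parse_qs(urlparse(url).query)
print(qs["token"][0])   # 143406284
print(qs["fakeid"][0])  # MzI1MjU5MjMzNA==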

After crawling for a while, you will run into the following problem:

{'base_resp': {'err_msg': 'freq control', 'ret': 200013}}

At this point, trying to insert a hyperlink in the official account backend also triggers a corresponding prompt.

This is the official account platform's rate limit, and it usually lifts after 30-60 minutes. To handle it properly, you may need to register multiple official accounts, do battle with the WeChat official account login system, and perhaps set up a proxy pool.
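If you instead want the script to ride out the limit unattended, one simple option is to sleep and retry. This is only a sketch; the one-hour pause is a guess based on the 30-60 minute window above, not a documented value:

import time
import requests

# Rough sketch: when WeChat answers with ret == 200013 (flow control),
# back off for an hour and retry instead of exiting immediately.
def get_page_with_backoff(url, headers, params, max_retries=3):
    for _ in range(max_retries):
        data = requests.get(url, headers=headers, params=params, verify=False).json()
        if data["base_resp"]["ret"] != 200013:
            return data
        time.sleep(60 * 60)  # wait an hour before trying again
    return None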

But I don't need an industrial-strength crawler; I just want to crawl my own official account's articles. So I simply wait an hour, log back into the backend, fetch a fresh cookie and token, and run the script again. I have no wish to challenge other people's livelihood with my hobby.

Finally, save the result in JSON format:

# save the result as JSON
json_name = "mp_data_{}.json".format(str(begin))
with open(json_name, "w") as file:
    file.write(json.dumps(app_msg_list, indent=2, ensure_ascii=False))

Or extract four columns (article identifier, title, URL, and push time) and save them as CSV:

info_list = []
for msg in app_msg_list:
    if "app_msg_list" in msg:
        for item in msg["app_msg_list"]:
            info = '"{}","{}","{}","{}"'.format(
                str(item["aid"]), item["title"],
                item["link"], str(item["update_time"]))
            info_list.append(info)

# save as csv
with open("app_msg_list.csv", "w") as file:
    file.write("\n".join(info_list))

The final code is as follows (it may contain bugs, so use with care). Usage: python wechat_parser.py wechat.yaml

import json
import requests
import time
import random
import os
import yaml
import sys

if len(sys.argv) < 2:
    print("too few arguments")
    sys.exit(1)

yaml_file = sys.argv[1]
if not os.path.exists(yaml_file):
    print("yaml_file does not exist")
    sys.exit(1)

with open(yaml_file, "r") as file:
    file_data = file.read()
config = yaml.safe_load(file_data)

headers = {
    "Cookie": config['cookie'],
    "User-Agent": config['user_agent'],
}

# request parameters
url = "https://mp.weixin.qq.com/cgi-bin/appmsg"
begin = "0"
params = {
    "action": "list_ex",
    "begin": begin,
    "count": "5",
    "fakeid": config['fakeid'],
    "type": "9",
    "token": config['token'],
    "lang": "zh_CN",
    "f": "json",
    "ajax": "1",
}

# storage for results; resume from an existing file if present
if os.path.exists("mp_data.json"):
    with open("mp_data.json", "r") as file:
        app_msg_list = json.load(file)
else:
    app_msg_list = []

# we don't know how many articles the account has, so use a while loop;
# starting from the number of pages already fetched supports rerunning
i = len(app_msg_list)
while True:
    begin = i * 5
    params["begin"] = str(begin)
    # pause for a few random seconds to avoid being blocked for requesting too fast
    time.sleep(random.randint(1, 10))
    resp = requests.get(url, headers=headers, params=params, verify=False)
    # WeChat flow control: exit
    if resp.json()['base_resp']['ret'] == 200013:
        print("frequency control, stop at {}".format(str(begin)))
        break
    # if the returned content is empty, we are done
    if len(resp.json()['app_msg_list']) == 0:
        print("all articles parsed")
        break
    app_msg_list.append(resp.json())
    # next page
    i += 1

# save the result as JSON
json_name = "mp_data.json"
with open(json_name, "w") as file:
    file.write(json.dumps(app_msg_list, indent=2, ensure_ascii=False))

That concludes "How to Crawl WeChat Official Account Articles with Python". Thank you for reading! If you want to learn more about the industry, you are welcome to follow this site, where the editor will keep publishing practical articles for you.
