2025-01-16 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/02 Report--
Today, let's talk about how to crawl WeChat official account articles with Python, a topic many people may not know much about. To help you understand it better, the editor has summarized the following content. I hope you get something out of this article.
Selenium introduction
Selenium is a tool for automated testing of web applications that drives a real browser. Through code, it can interact with the elements on a page and extract the information they hold. One major advantage is that you do not have to construct requests by hand: the requests it sends are the ones a real browser sends, and its behavior looks more like that of an ordinary user, so it is less likely to trigger anti-crawler measures. You can also step in manually during a crawl when necessary (for example, to log in or enter a CAPTCHA).
Selenium is often the fallback when a site's anti-crawling measures are strict and no other approach gets anywhere. It has drawbacks, of course: every operation has to wait for the page to load before the script can continue, so it is slow and inefficient (in some cases, running the browser in headless mode, without a graphical interface, improves efficiency a little).
Requirements analysis and code implementation
The requirement is clear: get the title, date, and link of every article published by an official account. WeChat's own articles can only be browsed through its app, and crawling an app is more complicated. A convenient alternative is to search through Sogou WeChat. However, requesting the pages directly with a library such as Requests runs into anti-crawling measures such as cookie checks and JavaScript encryption, so today we use Selenium instead.
First, import the required libraries and instantiate the browser object:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.wait import WebDriverWait
    # the three imports above are needed for explicit waits, which will be mentioned in a moment
    import time
    import datetime

    driver = webdriver.Chrome()
    driver.get('https://weixin.sogou.com/')
The code above opens the Sogou WeChat search page. Next, enter text into the search box and click "search article" (rather than "search official account", because Sogou has removed the ability to get an account's articles directly from the account search).

    wait = WebDriverWait(driver, 10)
    # wait up to 10 s for the search box to be present in the page
    search_box = wait.until(EC.presence_of_element_located((By.NAME, 'query')))
    search_box.send_keys('early Python')
    # click the "search article" button
    driver.find_element(By.XPATH, "//input[@class='swz']").click()

The logic is to set a maximum waiting time of 10 s and, once the input box has loaded within that time, type the official account name into it. Here we take "early Python" as an example, then locate the "search article" button by its XPath and click it. An explicit wait is used here. When Selenium requests a page, the response time depends on network speed; if the code runs on before an element has loaded, it raises an error and stops. The solution is to wait.
An implicit wait, set with driver.implicitly_wait(10), applies whenever the driver tries to find an element: if the element is not found immediately, the driver keeps trying for up to the given time. An explicit wait has a specific condition, and the subsequent code runs only once that condition is met, as in the code used here. You can also use the time module to sleep for a fixed period between steps before running the subsequent code.
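To make the idea of an explicit wait concrete, the sketch below shows roughly what one does: poll a condition until it returns a truthy value or a timeout expires. It is a simplified, browser-free illustration and not Selenium's actual implementation; wait_until, poll_interval, and the fake find_element condition are hypothetical names introduced only for this example.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll_interval)


# Stand-in for a page whose element only "appears" on the third lookup.
attempts = {"count": 0}


def find_element():
    attempts["count"] += 1
    return "search box" if attempts["count"] >= 3 else None


element = wait_until(find_element, timeout=2.0, poll_interval=0.01)
print(element, "found after", attempts["count"], "lookups")
```

A fixed time.sleep always pays the full delay, whereas a polling wait like this returns as soon as the condition holds and fails loudly if it never does.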
In addition, only the first 10 pages of results are available without logging in. To view later pages, you need to log in by scanning a WeChat QR code.
So from here on, the execution logic of the code is:
After traversing page 10, click "Log in" automatically. Manual intervention is required at this point: complete the login by scanning the QR code.
Detect whether the login is complete (this can be reduced to checking whether the "next page" button appears). Once it is, continue traversing from page 11 to the last page (the one without a "next page" button).
Since the pages are traversed in two passes, the parsing logic can be wrapped in a function:

    num = 0

    def get_news():
        global num  # global counter used to number the articles that match
        time.sleep(1)
        news_lst = driver.find_elements(By.XPATH, "//li[contains(@id,'sogou_vr_11002601_box')]")
        for news in news_lst:
            # get the official account the article comes from
            source = news.find_elements(By.XPATH, 'div[2]/div/a')[0].text
            # skip results from other accounts ('early' is part of the example account name)
            if 'early' not in source:
                continue
            num += 1
            # get the article title
            title = news.find_elements(By.XPATH, 'div[2]/h4/a')[0].text
            # get the publication date of the article
            date = news.find_elements(By.XPATH, 'div[2]/div/span')[0].text
            # recent articles may show "1 day ago", "12 hours ago" or "30 minutes ago";
            # use the datetime module to subtract that difference from the current time
            # and convert the result to YYYY-MM-DD
            # (these strings must match the wording actually shown on the results page)
            if 'ago' in date:
                today = datetime.datetime.today()
                if 'day' in date:
                    delta = datetime.timedelta(days=int(date[0]))
                elif 'hour' in date:
                    delta = datetime.timedelta(hours=int(date.replace('hours ago', '')))
                else:
                    delta = datetime.timedelta(minutes=int(date.replace('minutes ago', '')))
                date = str((today - delta).strftime('%Y-%m-%d'))
            # normalize absolute dates to zero-padded YYYY-MM-DD
            date = datetime.datetime.strptime(date, '%Y-%m-%d').strftime('%Y-%m-%d')
            # get the article url
            url = news.find_elements(By.XPATH, 'div[2]/h4/a')[0].get_attribute('href')
            print(num, title, date)
            print(url)
            print('-' * 10)

    for i in range(10):
        get_news()
        if i == 9:
            # page 10 has been parsed, so there is no need to click "next page"
            break
        driver.find_element(By.ID, "sogou_next").click()

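The relative-date handling in get_news can also be pulled out into a small helper that is easy to check without a browser. The following is a minimal sketch under stated assumptions: normalize_date and its now parameter are hypothetical names introduced here, and the English phrases follow the examples quoted in this article, so the wording on the live page may differ.

```python
import datetime


def normalize_date(text, now=None):
    """Convert a result-page date string to 'YYYY-MM-DD'.

    Accepts relative forms such as '1 day ago', '12 hours ago' and
    '30 minutes ago', as well as absolute dates like '2020-1-5'.
    """
    now = now or datetime.datetime.today()
    text = text.strip()
    if text.endswith('ago'):
        amount = int(text.split()[0])  # leading number, e.g. 12
        if 'day' in text:
            delta = datetime.timedelta(days=amount)
        elif 'hour' in text:
            delta = datetime.timedelta(hours=amount)
        else:
            delta = datetime.timedelta(minutes=amount)
        return (now - delta).strftime('%Y-%m-%d')
    # Absolute date: re-parse so that month and day are zero-padded.
    return datetime.datetime.strptime(text, '%Y-%m-%d').strftime('%Y-%m-%d')


# A fixed reference time (an arbitrary value chosen for the example).
now = datetime.datetime(2020, 6, 2, 9, 0)
print(normalize_date('1 day ago', now))
print(normalize_date('12 hours ago', now))
print(normalize_date('30 minutes ago', now))
print(normalize_date('2020-1-5', now))
```

Passing the reference time in as an argument keeps the function deterministic, which is what makes it possible to test; inside a crawler it can simply be called with the date text alone.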
The next step is to click "Log in" and then scan the QR code manually. A while True loop can check whether the login has succeeded, that is, whether the "next page" button has appeared. If it has, break out of the loop and click the button to continue with the code below; otherwise sleep for 3 seconds and check again:
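
    driver.find_element(By.NAME, 'top_login').click()
    while True:
        try:
            # the "next page" button only appears once the login has completed
            next_page = driver.find_element(By.ID, "sogou_next")
            break
        except Exception:
            # not logged in yet: wait 3 seconds and check again
            time.sleep(3)
    next_page.click()
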
[Figure: result of running the login step]
Then traverse the remaining pages. Since you don't know which page is the last one, use a while loop that repeatedly calls the page-parsing function and clicks "next page"; when there is no next page, the loop ends:

    while True:
        get_news()
        try:
            driver.find_element(By.ID, "sogou_next").click()
        except Exception:
            # no "next page" button: the last page has been reached
            break

    # finally, quit the browser
    driver.quit()

After reading the above, do you have a better understanding of how to crawl WeChat official account articles with Python? If you want to learn more about this or related topics, please follow the industry information channel. Thank you for your support.