This article shows how to use Python to crawl APP data from Dangdang. The technique is very practical, so it is shared here; hopefully you will get something out of it. Let's take a look.
Goal
Scenario: sometimes, when crawling a Web page or an app the traditional way, the target's anti-crawling measures make it hard to get the data you want. In that case, consider combining "Appium" with "mitmproxy" to crawl the data.
Appium drives the automated operation of the app, while mitmproxy intercepts the request data, parses it, and saves it to a database.
Today's goal is to crawl product data from the Dangdang app and save it to a MongoDB database.
Preparatory work
First, install Charles and Appium Desktop on the PC and set up the mitmproxy environment.
# install the mitmproxy package
pip3 install mitmproxy
# install pymongo
pip3 install pymongo
In addition, prepare an Android phone and configure the Android development environment on the PC.
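Before moving on, a quick sanity check that the tooling is in place can save time; a minimal sketch (the exact version strings will differ on your machine):

# quick environment check (a sketch; adjust to your setup)
import subprocess

import mitmproxy.version
import pymongo

# the connected phone should appear with the "device" status
print(subprocess.run(['adb', 'devices'], capture_output=True, text=True).stdout)
print('mitmproxy', mitmproxy.version.VERSION)
print('pymongo', pymongo.version)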
Crawling approach
1. With the manual proxy configured on the phone, open Charles to capture the client's network requests in real time.
Open the Dangdang product search page and search for the keyword "Python". In Charles you can see the URL of the current request, which contains "word=Python".
Write the mitmproxy script file, overriding the response() function to filter the request URLs, extract the useful fields, and save them to the MongoDB database.
import json

from pymongo import MongoClient


class DangDangMongo(object):
    """Initialize the MongoDB database"""
    def __init__(self):
        self.client = MongoClient('localhost')
        self.db = self.client['admin']
        self.db.authenticate("root", "xag")
        self.dangdang_book_collection = self.db['dangdang_book']


def response(flow):
    # filter the request URL
    if 'keyword=Python' in flow.request.url:
        data = json.loads(flow.response.text)
        # book products
        products = data.get('products') or []
        product_datas = []
        for product in products:
            # book ID
            product_id = product.get('id')
            # title
            product_name = product.get('name')
            # book price
            product_price = product.get('price')
            # author
            authorname = product.get('authorname')
            # publisher
            publisher = product.get('publisher')
            product_datas.append({
                'product_id': product_id,
                'product_name': product_name,
                'product_price': product_price,
                'authorname': authorname,
                'publisher': publisher
            })
        if product_datas:
            DangDangMongo().dangdang_book_collection.insert_many(product_datas)
            print('data inserted successfully')
First set the client's manual proxy to listen on port 8080, then run the "mitmdump" command; as you scroll through the product pages, the data is written to the database.
mitmdump -s script_dangdang.py
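To confirm that records are actually landing in MongoDB, you can query the collection directly; a minimal sketch, assuming the same connection details as the script above:

# verify the inserts landed (same credentials as the mitmproxy script)
from pymongo import MongoClient

client = MongoClient('localhost')
db = client['admin']
db.authenticate('root', 'xag')  # pymongo 3.x style auth, matching the script above
for doc in db['dangdang_book'].find().limit(3):
    print(doc)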
2. Next, we use Appium to automate the operation.
First open Appium Desktop and start the service.
After getting the package name and the initial Activity, you can use WebDriver to simulate opening the Dangdang APP.
self.caps = {
    'automationName': DRIVER,
    'platformName': PLATFORM,
    'deviceName': DEVICE_NAME,
    'appPackage': APP_PACKAGE,
    'appActivity': APP_ACTIVITY,
    'platformVersion': ANDROID_VERSION,
    'autoGrantPermissions': AUTO_GRANT_PERMISSIONS,
    'unicodeKeyboard': True,
    'resetKeyboard': True
}
self.driver = webdriver.Remote(DRIVER_SERVER, self.caps)
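The snippet above relies on configuration constants that the article never defines. A plausible sketch follows; every value is an assumption to adjust for your own device (only the package name is confirmed by the element IDs used later):

# assumed configuration for the desired capabilities above; all values are illustrative
DRIVER_SERVER = 'http://localhost:4723/wd/hub'  # Appium server address
DRIVER = 'UiAutomator2'                         # Appium automation engine
PLATFORM = 'Android'
DEVICE_NAME = 'emulator-5554'                   # from `adb devices`
APP_PACKAGE = 'com.dangdang.buy2'               # matches the element IDs below
APP_ACTIVITY = '.StartupActivity'               # hypothetical launch Activity; check with `adb shell dumpsys`
ANDROID_VERSION = '9'                           # your phone's Android version
AUTO_GRANT_PERMISSIONS = True                   # skip permission pop-ups
KEY_WORD = 'Python'                             # search keyword, matching the mitmproxy URL filter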
Then use uiautomatorviewer, a tool that comes with Android SDK, to get the element information, and use WebDriver in Appium to manipulate UI elements.
When the app is opened for the first time, several dialogs may pop up: the red packet rain dialog, the newcomer exclusive red packet dialog, and the switch city dialog. Each one is dismissed by finding its close button by element ID and clicking it.
Here a new thread is created to handle these dialogs separately.
import threading


class ExtraJob(threading.Thread):
    def __init__(self, driver):
        super().__init__()
        self.driver = driver
        # flag used to pause/resume the thread
        self.__flag = threading.Event()
        self.__flag.set()
        # flag used to stop the thread
        self.__running = threading.Event()
        self.__running.set()

    def stop(self):
        self.__running.clear()

    def run(self):
        while self.__running.is_set():
            # returns immediately when the flag is set; otherwise blocks until it is set again
            self.__flag.wait()
            # 1.0 [red packet rain] dialog
            red_packet_element = is_element_exist(self.driver, 'com.dangdang.buy2:id/close')
            if red_packet_element:
                red_packet_element.click()
            # 1.1 [newcomer coupon] dialog
            new_welcome_page_sure_element = is_element_exist(self.driver, 'com.dangdang.buy2:id/dialog_cancel_tv')
            if new_welcome_page_sure_element:
                new_welcome_page_sure_element.click()
            # 1.2 [switch location] dialog
            change_city_cancel_element = is_element_exist(self.driver, 'com.dangdang.buy2:id/left_bt')
            if change_city_cancel_element:
                change_city_cancel_element.click()


extra_job = ExtraJob(dangdang.driver)
extra_job.start()
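The thread above calls an is_element_exist() helper that the article never shows. A minimal sketch, assuming it simply looks the element up by ID and swallows the not-found error:

# hypothetical helper used by ExtraJob: return the element, or None if it is not on screen
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

def is_element_exist(driver, element_id):
    try:
        return driver.find_element(By.ID, element_id)
    except NoSuchElementException:
        return None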
The next step is to tap the search box, enter the keyword, and then tap the search button to start the search.
# 1. search box
search_element_pro = self.wait.until(EC.presence_of_element_located((By.ID, 'com.dangdang.buy2:id/index_search')))
search_element_pro.click()
search_input_element = self.wait.until(EC.presence_of_element_located((By.ID, 'com.dangdang.buy2:id/search_text_layout')))
search_input_element.set_text(KEY_WORD)
# 2. search dialog, start searching
search_btn_element = self.wait.until(EC.element_to_be_clickable((By.ID, 'com.dangdang.buy2:id/search_btn_search')))
search_btn_element.click()
# 3. sleep for 3 seconds to make sure the first page has fully loaded
time.sleep(3)
After the first page has fully loaded, keep swiping up the page until no new data appears; mitmproxy saves the captured data to the MongoDB database automatically.
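The scrolling loop below also depends on swipe-coordinate constants the article leaves undefined; a plausible sketch, assuming a 1080 × 1920 screen (all values are illustrative):

# assumed swipe coordinates for the loop below; tune for your screen resolution
FLICK_START_X = 540    # horizontal midpoint of a 1080-px-wide screen
FLICK_START_Y = 600    # y coordinate where the upward swipe ends
FLICK_DISTANCE = 900   # vertical length of each swipe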
while True:
    str1 = self.driver.page_source
    # swipe up one screen
    self.driver.swipe(FLICK_START_X, FLICK_START_Y + FLICK_DISTANCE, FLICK_START_X, FLICK_START_Y)
    time.sleep(1)
    str2 = self.driver.page_source
    if str1 == str2:
        print('stop sliding')
        # stop the dialog-handling thread
        extra_job.stop()
        break
    print('continue sliding')

Result
First use mitmdump to start the request-listening service, then execute the crawl script.
The app opens automatically and, after a series of operations, reaches the product list; it then scrolls the page automatically while mitmproxy saves the useful data to the MongoDB database.
The above is how Python crawls Dangdang APP data. Hopefully some of the techniques here are ones you can see or use in your daily work, and that you learned something from this article.