How to use selenium to crawl Taobao product information in Python

2025-09-20 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)05/31 Report--

This article explains how to use selenium to crawl Taobao product information in Python. The approach is simple and practical: a selenium-driven browser searches for a keyword and pages through the results, PyQuery parses each result page, and the extracted product records are stored in MongoDB.

The specific code is as follows:

# encoding=utf-8
__author__ = 'Jonny'
__location__ = "Xi'an"
__date__ = '2018-05-14'

'''
Required libraries: requests, pymongo, pyquery, selenium
Development process:
    Search keywords: drive a browser with selenium to search for a keyword and get the product list
    Analyze the page count and turn pages: read the number of result pages, simulate page turns, and get the product list of each subsequent page
    Extract the product content: parse the page source with PyQuery to obtain the product list information
    Store in MongoDB: save the product information in the MongoDB database
'''
import requests
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from pyquery import PyQuery as pq
import pymongo
import re
import time

browser = webdriver.Chrome()
wait = WebDriverWait(browser, 10)  # wait up to 10 seconds for each condition
client = pymongo.MongoClient('localhost', 27017)
mongo = client['taobao']


def searcher():
    url = 'https://www.taobao.com/'
    browser.get(url=url)
    try:
        # Check that the page has loaded, within the waiting time set above
        # Check 1: the search input box is present
        input_kw = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '#q')))
        # Check 2: the search button next to the input box is clickable
        submit = wait.until(EC.element_to_be_clickable(
            (By.CSS_SELECTOR, '#J_TSearchForm > div.search-button > button')))
        input_kw.send_keys('menswear')
        submit.click()
        # Wait for the result page to load so the total page count can be read reliably
        page_counts = wait.until(EC.presence_of_element_located(
            (By.CSS_SELECTOR, '#mainsrp-pager > div > div > div > div.total')))
        parse_page()
        return page_counts.text
    except TimeoutException as e:
        print(e.args)
        return searcher()  # retry on timeout


# Turn to the given result page
def next_page(page_number):
    try:
        # Check 1: the page-jump input box is present
        input_page = wait.until(EC.presence_of_element_located(
            (By.CSS_SELECTOR, '#mainsrp-pager > div > div > div > div.form > input')))
        # Check 2: the confirm button is clickable
        submit = wait.until(EC.element_to_be_clickable(
            (By.CSS_SELECTOR, '#mainsrp-pager > div > div > div > div.form > span.btn.J_Submit')))
        input_page.clear()  # remove the previous page number before typing the new one
        input_page.send_keys(str(page_number))
        submit.click()
        # The page turn succeeded once the highlighted pager item shows the requested number
        wait.until(EC.text_to_be_present_in_element(
            (By.CSS_SELECTOR, '#mainsrp-pager > div > div > div > ul > li.item.active'),
            str(page_number)))
        parse_page()
    except TimeoutException as e:
        print(e.args)
        next_page(page_number)  # retry on timeout


# Parse the current result page
def parse_page():
    wait.until(EC.presence_of_element_located(
        (By.CSS_SELECTOR, '#mainsrp-itemlist .items .item')))
    html = browser.page_source
    doc = pq(html)
    items = doc('#mainsrp-itemlist .items .item').items()
    for item in items:
        goods = {
            'image': item.find('.pic .img').attr('src'),
            'price': item.find('.price').text(),
            'deal': item.find('.deal-cnt').text()[:-3],  # drop the 3-character suffix after the number
            'title': item.find('.title').text(),
            'shop': item.find('.shop').text(),
            'location': item.find('.location').text(),
        }
        print(goods)
        data_storage(goods)


# Store one product record in the database
def data_storage(goods):
    try:
        # insert_one replaces Collection.insert, which PyMongo 4 removed
        if mongo['mongo_sheet'].insert_one(goods):
            print('Stored successfully')
    except Exception as e:
        print('Failed to store', goods, e)


def main():
    text = searcher()
    print(text)
    # Get the total number of pages
    pages = int(re.compile(r'(\d+)').search(text).group(0))
    print(pages)
    # Page 1 was already parsed by searcher(), so start from page 2
    for i in range(2, pages + 1):
        next_page(i)
    browser.close()


if __name__ == '__main__':
    main()

By now you should have a better understanding of "how to use selenium to crawl Taobao product information in Python". Try running it yourself.
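Two steps in the script depend on the exact text the page renders: main() pulls the page count out of the pager label with a regular expression, and parse_page() trims the last three characters of the deal count with [:-3]. The sketch below isolates both steps as small helpers that can be checked without a browser. The helper names parse_total_pages and parse_deal_count and the sample strings are assumptions made for illustration; the article does not show the actual label text.

```python
import re
from typing import Optional


def parse_total_pages(pager_text: str) -> Optional[int]:
    """Return the first integer found in the pager label, or None if there is none."""
    match = re.search(r'\d+', pager_text)
    return int(match.group(0)) if match else None


def parse_deal_count(deal_text: str) -> Optional[int]:
    """Return the integer at the start of a deal-count label, or None if there is none."""
    match = re.match(r'\s*(\d+)', deal_text)
    return int(match.group(1)) if match else None


if __name__ == '__main__':
    # Hypothetical sample strings, not captured from a real page.
    print(parse_total_pages('100 pages in total,'))
    print(parse_deal_count('1234 paid'))
    print(parse_total_pages('no digits here'))
```

Returning None instead of raising leaves the decision to the caller: re.search returns None when the label has no digits, so calling .group(0) directly on the result, as main() does, raises AttributeError in that case. Matching the leading digits also avoids relying on the suffix being exactly three characters long.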