2025-10-25 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/03 Report--
This article explains how to use a Python crawler to collect Taobao product information and export it to an Excel spreadsheet. The method is simple, fast, and practical. Let's walk through it step by step.
One: analyze the structure of the Taobao URL
1. Our first requirement is to enter a product name and get back the corresponding information.
So we pick an arbitrary product and look at its URL. Here we search for a schoolbag; when the results page opens, its URL is:
https://s.taobao.com/search?q=%E4%B9%A6%E5%8C%85&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.2017.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170306
The URL alone may not tell us much, but comparing it with the search term we typed gives a clue.
The value of the q parameter is the name of the product we searched for, in percent-encoded form.
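To see how the product name maps to the q value, here is a minimal sketch using the standard library's urllib.parse. The variable names are only for illustration and are not part of the original script.

```python
from urllib.parse import quote, unquote

# The search term behind the example URL: "书包" (schoolbag).
keyword = "书包"

# quote() percent-encodes the UTF-8 bytes of the string.
encoded = quote(keyword)
print(encoded)           # %E4%B9%A6%E5%8C%85

# unquote() reverses the encoding and recovers the product name.
decoded = unquote(encoded)
print(decoded)           # 书包
```

The encoded string is exactly what appears after q= in the URL above.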
2. Our second requirement is to crawl as many result pages as the number the user enters.
So let's look at how the URLs of the next few pages are composed.
From this we can conclude that paging is controlled by the final s parameter: s = 44 * (page number - 1).
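The paging rule can be written as a tiny helper. This is only a sketch: page_offset and ITEMS_PER_PAGE are hypothetical names, and the step of 44 is the value observed in the URLs above, which the site may change at any time.

```python
ITEMS_PER_PAGE = 44  # step observed in the s parameter between consecutive pages


def page_offset(page_number):
    """Return the value of the s parameter for a 1-based page number."""
    if page_number < 1:
        raise ValueError("page_number must be 1 or greater")
    return ITEMS_PER_PAGE * (page_number - 1)


# Offsets for the first four result pages.
offsets = [page_offset(p) for p in range(1, 5)]
print(offsets)  # [0, 44, 88, 132]
```

Page 1 corresponds to an offset of 0; the first-page URL shown earlier simply leaves the s parameter out.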
Two: check the page source and extract information with the re library
1. View the source code
The page source contains the fields we need.
2. Extract the information with the re library
    a = re.findall(r'"raw_title":"(.*?)"', html)
    b = re.findall(r'"view_price":"(.*?)"', html)
    c = re.findall(r'"item_loc":"(.*?)"', html)
    d = re.findall(r'"view_sales":"(.*?)"', html)
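To check the patterns without hitting the site, we can run them against a small sample. In the sketch below the field names come from the article, but the sample string and its values are made up, and extract is a hypothetical helper; the real page source may be shaped differently.

```python
import re

# Hypothetical fragment shaped like the data embedded in the page source.
# The values are invented for illustration.
html = (
    '{"raw_title":"Canvas schoolbag","view_price":"59.00",'
    '"item_loc":"Zhejiang Hangzhou","view_sales":"1200 paid"},'
    '{"raw_title":"Leather backpack","view_price":"128.00",'
    '"item_loc":"Guangdong Guangzhou","view_sales":"300 paid"}'
)


def extract(field, text):
    """Return every value stored under the given quoted field name."""
    # (.*?) is lazy, so each match stops at the first closing quote.
    return re.findall(r'"' + re.escape(field) + r'":"(.*?)"', text)


titles = extract("raw_title", html)
prices = extract("view_price", html)
print(titles)  # ['Canvas schoolbag', 'Leather backpack']
print(prices)  # ['59.00', '128.00']
```

The lazy quantifier matters here: a greedy (.*) would run to the last quote in the text and return one long, wrong match.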
Three: write the functions
Here I write three functions. The first one fetches the HTML page:
    def GetHtml(url):
        # headers is the module-level dict defined in the main block
        r = requests.get(url, headers=headers)
        r.raise_for_status()                  # raise an exception on HTTP error codes
        r.encoding = r.apparent_encoding      # use the encoding detected from the content
        return r
The second one builds the list of page URLs to fetch:
    def Geturls(q, x):
        # URL of the first results page
        url = "https://s.taobao.com/search?q=" + q + "&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm" \
              "=a21bo.2017.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170306"
        urls = []
        urls.append(url)
        if x == 1:
            return urls
        # Pages 2..x: the s parameter advances by 44 per page
        for i in range(1, x):
            url = "https://s.taobao.com/search?q=" + q + "&commend=all&ssid=s5-e&search_type=item" \
                  "&sourceId=tb.index&spm=a21bo.2017.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170306" \
                  "&bcoffset=3&ntoffset=3&p4ppushleft=1%2C48&s=" + str(i * 44)
            urls.append(url)
        return urls
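As an alternative to string concatenation, the query string can be built with urllib.parse.urlencode, which also takes care of percent-encoding the product name. This is a sketch, not the article's function: build_search_urls and BASE_URL are hypothetical names, and it keeps only the q and s parameters for brevity. Whether Taobao accepts a URL without the other parameters has not been verified.

```python
from urllib.parse import urlencode

BASE_URL = "https://s.taobao.com/search"


def build_search_urls(keyword, pages):
    """Build one search URL per results page (assumes 44 items per page)."""
    urls = []
    for page in range(pages):
        # urlencode percent-encodes the values and joins the pairs with "&".
        params = {"q": keyword, "s": page * 44}
        urls.append(BASE_URL + "?" + urlencode(params))
    return urls


urls = build_search_urls("书包", 2)
for url in urls:
    print(url)
```

Keeping the parameters in a dict makes it easy to add the remaining ones from the original URL if the shorter form turns out not to work.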
The third one extracts the product information we need and writes it into the Excel sheet:
    def GetxxintoExcel(html):
        global count  # global row counter used when filling in the Excel sheet
        a = re.findall(r'"raw_title":"(.*?)"', html)   # (.*?) lazily matches any characters
        b = re.findall(r'"view_price":"(.*?)"', html)
        c = re.findall(r'"item_loc":"(.*?)"', html)
        d = re.findall(r'"view_sales":"(.*?)"', html)
        x = []
        for i in range(len(a)):
            try:
                x.append((a[i], b[i], c[i], d[i]))     # collect the fields of each item into a new list
            except IndexError:
                break
        for i in range(len(x)):
            # worksheet.write takes the row first, then the column, then the data to write
            worksheet.write(count + i + 1, 0, x[i][0])
            worksheet.write(count + i + 1, 1, x[i][1])
            worksheet.write(count + i + 1, 2, x[i][2])
            worksheet.write(count + i + 1, 3, x[i][3])
        count = count + len(x)  # the next page starts right after the rows written so far
        print("completed")
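The try/except loop above stops at the first missing field, which is the same as keeping only as many rows as the shortest list has. A sketch of that pairing step with zip, separated from the Excel writing so it can be tried on its own. collect_rows is a hypothetical helper, and start_row plays the role of count + 1.

```python
def collect_rows(titles, prices, locations, sales, start_row=1):
    """Pair up the four field lists and compute (row, column, value) cells."""
    # zip stops at the shortest list, so incomplete items are dropped.
    rows = list(zip(titles, prices, locations, sales))
    cells = []
    for offset, row in enumerate(rows):
        for col, value in enumerate(row):
            cells.append((start_row + offset, col, value))
    return cells


# Two titles but only one sales figure: only the first item is complete.
cells = collect_rows(
    ["Bag A", "Bag B"], ["59.00", "128.00"], ["Hangzhou", "Guangzhou"], ["10 paid"]
)
print(cells)
```

Each tuple can then be passed straight to worksheet.write(row, col, value).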
Four: write the main function
    if __name__ == "__main__":
        count = 0
        # The cookie is unique to each user, so fill in your own. Because of the
        # anti-crawling mechanism, crawling too fast may require refreshing your cookie.
        headers = {
            "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36",
            "cookie": ""
        }
        q = input("enter the goods")
        x = int(input("how many pages do you want to climb"))
        urls = Geturls(q, x)
        workbook = xlsxwriter.Workbook(q + ".xlsx")
        worksheet = workbook.add_worksheet()
        worksheet.set_column('A:A', 70)
        worksheet.set_column('B:B', 20)
        worksheet.set_column('C:C', 20)
        worksheet.set_column('D:D', 20)
        worksheet.write('A1', 'name')
        worksheet.write('B1', 'price')
        worksheet.write('C1', 'region')
        worksheet.write('D1', 'number of payments')
        for url in urls:
            html = GetHtml(url)
            GetxxintoExcel(html.text)
            time.sleep(5)  # pause between pages to avoid crawling too fast
        workbook.close()
        # Do not open the Excel file until the program has finished.
        # The file is written to the current directory.
Five: complete code
    import re
    import requests
    import xlsxwriter
    import time


    def GetxxintoExcel(html):
        global count
        a = re.findall(r'"raw_title":"(.*?)"', html)
        b = re.findall(r'"view_price":"(.*?)"', html)
        c = re.findall(r'"item_loc":"(.*?)"', html)
        d = re.findall(r'"view_sales":"(.*?)"', html)
        x = []
        for i in range(len(a)):
            try:
                x.append((a[i], b[i], c[i], d[i]))
            except IndexError:
                break
        for i in range(len(x)):
            worksheet.write(count + i + 1, 0, x[i][0])
            worksheet.write(count + i + 1, 1, x[i][1])
            worksheet.write(count + i + 1, 2, x[i][2])
            worksheet.write(count + i + 1, 3, x[i][3])
        count = count + len(x)
        print("completed")


    def Geturls(q, x):
        url = "https://s.taobao.com/search?q=" + q + "&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm" \
              "=a21bo.2017.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170306"
        urls = []
        urls.append(url)
        if x == 1:
            return urls
        for i in range(1, x):
            url = "https://s.taobao.com/search?q=" + q + "&commend=all&ssid=s5-e&search_type=item" \
                  "&sourceId=tb.index&spm=a21bo.2017.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170306" \
                  "&bcoffset=3&ntoffset=3&p4ppushleft=1%2C48&s=" + str(i * 44)
            urls.append(url)
        return urls


    def GetHtml(url):
        r = requests.get(url, headers=headers)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r


    if __name__ == "__main__":
        count = 0
        headers = {
            "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36",
            "cookie": ""
        }
        q = input("enter the goods")
        x = int(input("how many pages do you want to climb"))
        urls = Geturls(q, x)
        workbook = xlsxwriter.Workbook(q + ".xlsx")
        worksheet = workbook.add_worksheet()
        worksheet.set_column('A:A', 70)
        worksheet.set_column('B:B', 20)
        worksheet.set_column('C:C', 20)
        worksheet.set_column('D:D', 20)
        worksheet.write('A1', 'name')
        worksheet.write('B1', 'price')
        worksheet.write('C1', 'region')
        worksheet.write('D1', 'number of payments')
        for url in urls:
            html = GetHtml(url)
            GetxxintoExcel(html.text)
            time.sleep(5)
        workbook.close()
By now you should have a clearer understanding of how to use a Python crawler to collect Taobao product information and export it to an Excel spreadsheet. The best way to consolidate it is to try it out yourself.