Network Security Internet Technology Development Database Servers Mobile Phone Android Software Apple Software Computer Software News IT Information

In addition to Weibo, there is also WeChat

Please pay attention

WeChat public account

Shulou

Python crawler how to collect Taobao merchandise information and import it into EXCEL form

2025-01-18 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Development >

Share

Shulou(Shulou.com)06/03 Report--

This article mainly explains "Python crawler how to collect Taobao commodity information and import EXCEL form", interested friends may wish to have a look. The method introduced in this paper is simple, fast and practical. Next let the editor to take you to learn "Python crawler actual combat how to collect Taobao commodity information and import EXCEL form"!

First, analyze the composition of Taobao URL

1. Our first requirement is to enter the product name and return the corresponding information.

So we choose a random item here to observe its URL, here we choose a schoolbag, open the web page, we can see that his URL is:

Https://s.taobao.com/search?q=%E4%B9%A6%E5%8C%85&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.2017.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170306

We may not see anything from this url alone, but we can see some clues from the picture.

We find that the parameter after Q is the name of the item we want to get.

two。 Our second requirement is to climb the page number of the product according to the entered number.

So let's take a look at the composition of the next few pages of URL.

From this we can conclude that the paging is based on the value of the last s = (44 (number of pages-1)).

Check the source code of the web page and use the re library to extract information

1. View the source code

Some of the information here is what we need.

Extracting information from 2.re library

A = re.findall (r'"raw_title": "(. *?)', html) b = re.findall (r'" view_price ":" (. *?) ", html) c = re.findall (r'" item_loc ":" (. *?) ", html) d = re.findall (r'" view_sales ":" (. *?) "', html)

Three: fill in the function

Here I write three functions, the first function to get the html web page, the code is as follows:

Def GetHtml (url): r = requests.get (url,headers = headers) r.raise_for_status () r.encoding = r.apparent_encoding return r

The second URL code for getting a web page is as follows:

Def Geturls (Q X): url = "https://s.taobao.com/search?q=" + Q +" & imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm "\" = a21bo.2017.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170306 "urls = [] urls.append (url) if x = 1: return urls for i in range (1) X): url = "https://s.taobao.com/search?q="+ Q +" & commend=all&ssid=s5-e&search_type=item "\" & sourceId=tb.index&spm=a21bo.2017.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170306 "\" & bcoffset=3&ntoffset=3&p4ppushleft=1%2C48&s= "+ str (I * 44) urls.append (url) return urls

The third one is used to get the product information we need and write it into the Excel table code as follows:

Def GetxxintoExcel (html): global count# defines a global variable count for filling in the following excel table a = re.findall (r'"raw_title": "(. *?)", html) # (. *?) matches any character b = re.findall (r'"view_price": "(. *?)", html) c = re.findall (r'"item_loc": "(. *?)"' Html) d = re.findall (r'"view_sales": "(. *?)", html) x = [] for i in range (len (a)): try: x.append ((a [I], b [I], c [I] D [I]) # put the obtained information into a new list except IndexError: break I = 0 for i in range (len (x)): worksheet.write (count + I + 1, 0, x [I] [0]) # worksheet.write method to write data, the first number is the row position The second number is the column, and the third is the data information written. Worksheet.write (count + I + 1,1, x [I] [1]) worksheet.write (count + I + 1,2, x [I] [2]) worksheet.write (count + I + 1,3, x [I] [3]) count = count + len (x) # the number of rows written next time is the length + 1 return print ("completed")

Four: fill in the main function

If _ _ name__ = = "_ main__": count = 0 headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36", "cookie": "# cookie is unique to everyone. Because of the anti-crawling mechanism, crawling too fast may need to refresh your Cookie. Q = input ("enter the goods") x = int (input ("how many pages do you want to climb") urls = Geturls (QMagnex) workbook = xlsxwriter.Workbook (Q + ".xlsx") worksheet = workbook.add_worksheet () worksheet.set_column ('A VOLTH, 70) worksheet.set_column ('BRV Bobby, 20) worksheet.set_column (' CJV CRS, 20) worksheet.set_column ('DRO D' 20) worksheet.write ('A1, 'name') worksheet.write ('B1, 'price') worksheet.write ('C1, 'region') worksheet.write ('D1 payment, 'number of payments) for url in urls: html = GetHtml (url) s = GetxxintoExcel (html.text) time.sleep (5) workbook.close () # do not open excel until the end of the program The excel table is in the current directory

Five: complete code

Import re import requests import xlsxwriter import time def GetxxintoExcel (html): global count a = re.findall (r'"raw_title": "(. *?)", html) b = re.findall (r'"view_price": "(. *?)", html) c = re.findall (r'"item_loc": "(. *?)", html) d = re.findall (r'"view_sales": "(. *?)"' Html) x = [] for i in range (len (a)): try: x.append ((a [I], b [I], c [I], d [I])) except IndexError: break I = 0 for i in range (len (x)): worksheet.write (count + I + 1,0, x [I] [0]) worksheet.write (count + I + 1,1 X [I] [1]) worksheet.write (count + I + 1,2, x [I] [2]) worksheet.write (count + I + 1,3, x [I] [3]) count = count + len (x) return print ("completed") def Geturls (Q X): url = "https://s.taobao.com/search?q=" + Q +" & imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm "\" = a21bo.2017.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170306 "urls = [] urls.append (url) if x = 1: return urls for i in range (1) X): url = "https://s.taobao.com/search?q="+ Q +" & commend=all&ssid=s5-e&search_type=item "\" & sourceId=tb.index&spm=a21bo.2017.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170306 "\" & bcoffset=3&ntoffset=3&p4ppushleft=1%2C48&s= "+ str (I * 44) urls.append (url) Return urls def GetHtml (url): r = requests.get (url Headers = headers) r.raise_for_status () r.encoding = r.apparent_encoding return r if _ _ name__ = = "_ main__": count = 0 headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0 Win64 X64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36 "," cookie ":"} Q = input (" enter the goods ") x = int (input (" how many pages do you want to climb ") urls = Geturls (QQuery x) workbook = xlsxwriter.Workbook (Q +" .xlsx ") worksheet = workbook.add_worksheet () worksheet.set_column ('ARO' 70) worksheet.set_column ('Blazer Beverage, 20) worksheet.set_column (' CRV Che, 20) worksheet.set_column ('DRV Duan, 20) worksheet.write (' A1, 'name') worksheet.write ('B1, 'price) worksheet.write (' C1, 'region') worksheet.write ('D1' Xx = [] for url in urls: html = GetHtml (url) s = GetxxintoExcel (html.text) time.sleep (5) workbook.close () so far I believe that everyone on the "Python crawler how to collect Taobao commodity information and import EXCEL form" have a deeper understanding, might as well to the actual operation of it! Here is the website, more related content can enter the relevant channels to inquire, follow us, continue to learn!

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

Views: 0

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.

Share To

Development

Wechat

© 2024 shulou.com SLNews company. All rights reserved.

12
Report