
How does Python crawl the data of a down jacket and draw a visualization map?

2025-04-13 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/02 Report--

This article shows you how to use Python to crawl JD.com's down jacket data and draw visualizations. The walkthrough is concise and easy to follow; I hope you get something out of it.

Until recently, friends in Guangzhou and Shenzhen were probably still wearing short sleeves, envying the snowy scenes up north. Then, just last week, a cold snap hit Guangzhou and Shenzhen too, and everyone joined the "cold-wave group chat" one after another.

To help everyone fight the cold, I scraped JD.com's down jacket data. Why not Tmall? The reason is simple: its slider CAPTCHA is a bit troublesome.

Data acquisition

JD.com loads its search results dynamically via Ajax, so the data can only be collected by parsing the underlying API or by driving a real browser with an automation tool such as Selenium. Earlier posts on this account have covered dynamic-page crawling; interested readers can look those up.

This time I used Selenium. My Chrome browser updates quickly, which kept breaking the old ChromeDriver, so I disabled the browser's automatic updates and downloaded a driver matching my current version.

Then I used Selenium to search JD.com for "down jacket", logged in by scanning the QR code with my phone, and collected each product's name, price, shop name, comment count, and other fields.
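Before the full Selenium script, here is what the extraction step (pulling the name and price out of the rendered page source) might look like. The markup and tag classes below are a simplified assumption for illustration, not JD.com's real HTML; the actual script targets the site's own CSS selectors.

```python
import re

def parse_items(html):
    """Extract (title, price) pairs from a simplified search-result fragment.

    The <em class="title"> / <i class="price"> structure is hypothetical;
    a real parser would use lxml/XPath against JD.com's actual markup.
    """
    pattern = re.compile(
        r'<em class="title">(?P<title>.*?)</em>.*?'
        r'<i class="price">(?P<price>[\d.]+)</i>',
        re.S,
    )
    return [(m.group("title"), float(m.group("price"))) for m in pattern.finditer(html)]

sample = (
    '<li><em class="title">Down jacket A</em><i class="price">299.0</i></li>'
    '<li><em class="title">Down jacket B</em><i class="price">458.5</i></li>'
)
print(parse_items(sample))  # [('Down jacket A', 299.0), ('Down jacket B', 458.5)]
```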

```python
from selenium import webdriver
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from lxml import etree
import random
import json
import csv
import time

browser = webdriver.Chrome('/dish J learn Python/JD.com/chromedriver')
wait = WebDriverWait(browser, 50)  # maximum wait time
url = 'https://www.jd.com/'
data_list = []  # global list that stores the scraped rows
keyword = "down jacket"  # search keyword

def page_click(page_number):
    try:
        # scroll to the bottom of the page
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        time.sleep(random.randint(1, 3))  # random delay
        # next-page button
        button = wait.until(EC.element_to_be_clickable(
            (By.CSS_SELECTOR, '#J_bottomPage > span.p-num > a.pn-next > em')))
        button.click()
        # wait until the first 30 items are loaded
        wait.until(EC.presence_of_all_elements_located(
            (By.CSS_SELECTOR, "#J_goodsList > ul > li:nth-child(30)")))
        # scroll to the bottom again to load the last 30 items
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        wait.until(EC.presence_of_all_elements_located(
            (By.CSS_SELECTOR, "#J_goodsList > ul > li:nth-child(60)")))
        # confirm the page turned: the highlighted page button shows the expected number
        wait.until(EC.text_to_be_present_in_element(
            (By.CSS_SELECTOR, "#J_bottomPage > span.p-num > a.curr"), str(page_number)))
        html = browser.page_source  # grab the rendered page source
        prase_html(html)  # call the parsing function to extract the data
    except TimeoutError:
        return page_click(page_number)
```

Data cleaning

Import the data:

```python
import pandas as pd
import numpy as np

df = pd.read_csv("/dish J learn Python/JD.com/down jacket.csv")
df.sample(10)
```

Rename the columns:

```python
df = df.rename(columns={'title': 'product name', 'price': 'price',
                        'shop_name': 'shop name', 'comment': 'comments'})
```

View the data info:

```python
df.info()
'''
1. There may be duplicate rows
2. The shop name column has a missing value
3. The comment counts need cleaning
'''
# RangeIndex: 4950 entries, 0 to 4949
# Data columns (total 4 columns):
#  #   Column        Non-Null Count  Dtype
#  0   product name  4950 non-null   object
#  1   price         4950 non-null   float64
#  2   shop name     4949 non-null   object
#  3   comments      4950 non-null   object
# dtypes: float64(1), object(3)
# memory usage: 154.8+ KB
```

Deduplicate:

```python
df = df.drop_duplicates()
```

Fill the missing value:

```python
df["shop name"] = df["shop name"].fillna("John Doe")
```

Product name cleaning

Thickness

```python
tmp = []
for i in df["product name"]:
    if "thick" in i:
        tmp.append("thick")
    elif "thin" in i:
        tmp.append("thin")
    else:
        tmp.append("other")
df['thickness'] = tmp
```

Version type

```python
tmp = []
for i in df["product name"]:
    if "slim" in i:  # the first keyword was garbled in the source; "slim" is assumed from context
        tmp.append("slim")
    elif "loose" in i:
        tmp.append("loose")
    else:
        tmp.append("other")
df['version'] = tmp
```
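The keyword if/elif chains used for thickness and fit above all follow the same pattern, which can be factored into one small helper. The function name and rule lists below are my own, not from the original script:

```python
def classify(name, rules, default="other"):
    """Return the label for the first keyword found in name.

    rules is an ordered list of (keyword, label) pairs; order matters,
    exactly like the if/elif chains used for the product-name columns.
    """
    for keyword, label in rules:
        if keyword in name:
            return label
    return default

thickness_rules = [("thick", "thick"), ("thin", "thin")]
print(classify("thick hooded down jacket", thickness_rules))  # thick
```

With this helper, each cleaned column reduces to one line, e.g. `df['thickness'] = [classify(n, thickness_rules) for n in df["product name"]]`.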

Style

```python
tmp = []
for i in df["product name"]:
    if "Korean" in i:
        tmp.append("Korean style")
    elif "business" in i:
        tmp.append("business style")
    elif "casual" in i:
        tmp.append("casual style")
    # a fourth keyword branch was garbled in the source and is omitted here
    else:
        tmp.append("other")
df['style'] = tmp
```

Price cleaning

```python
df["price range"] = pd.cut(df["price"],
                           [0, 100, 300, 500, 700, 1000, 1000000],
                           labels=['under 100 yuan', '100-300 yuan', '300-500 yuan',
                                   '500-700 yuan', '700-1000 yuan', 'over 1000 yuan'],
                           right=False)
```

Comment count cleaning

```python
import re

df['number'] = [re.findall(r'\d+\.{0,1}\d*', i)[0] for i in df['comments']]  # extract the digits
df['number'] = df['number'].astype('float')  # convert to numeric
df['unit'] = [''.join(re.findall(r'(万)', i)) for i in df['comments']]  # extract the "万" (ten-thousand) unit
df['unit'] = df['unit'].apply(lambda x: 10000 if x == '万' else 1)
df['comments'] = df['number'] * df['unit']  # actual comment count
df['comments'] = df['comments'].astype("int")
df.drop(['number', 'unit'], axis=1, inplace=True)
```

Shop name cleaning

```python
df["shop type"] = df["shop name"].str[-3:]
```

Visualization

Import the plotting libraries:

```python
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.rcParams['font.sans-serif'] = ['SimHei']  # font that can render Chinese labels
plt.rcParams['axes.unicode_minus'] = False    # keep the minus sign from rendering as a box

import jieba
import re
from pyecharts.charts import *
from pyecharts import options as opts
from pyecharts.globals import ThemeType
import stylecloud
from IPython.display import Image
```
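The comment-count conversion above (a "万" suffix means ten thousand) can also be written as one standalone function, which makes the number/unit logic easier to test. The function name is my own:

```python
import re

def parse_comment_count(text):
    """Convert JD-style comment counts such as '2.5万+' or '800+' to integers.

    '万' means ten thousand; this mirrors the number/unit columns built
    during the cleaning step.
    """
    number = float(re.findall(r'\d+\.{0,1}\d*', text)[0])  # leading numeric part
    unit = 10000 if '万' in text else 1
    return int(number * unit)

print(parse_comment_count('2.5万+'))  # 25000
print(parse_comment_count('800+'))   # 800
```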

Correlation analysis

Price distribution histogram

```python
sns.set_style('white')
fig, axes = plt.subplots(figsize=(15, 8))
sns.distplot(df["price"], color="salmon", bins=10)  # distplot is deprecated in newer seaborn; histplot is the modern equivalent
plt.xticks(fontsize=16)
plt.yticks(fontsize=16)
axes.set_title("Price distribution histogram")
```

Comment count histogram

```python
sns.set_style('white')
fig, axes = plt.subplots(figsize=(15, 8))
sns.distplot(df["comments"], color="green", bins=10, rug=True)
plt.xticks(fontsize=16)
plt.yticks(fontsize=16)
axes.set_title("Comment count histogram")
```

Relationship between comment count and price

```python
fig, axes = plt.subplots(figsize=(15, 8))
sns.regplot(x='comments', y='price', data=df, color='orange', marker='*')
plt.xticks(fontsize=16)
plt.yticks(fontsize=16)
```

Price range distribution

```python
df2 = df["price range"].astype("str").value_counts()
print(df2)
df2 = df2.sort_values(ascending=False)
regions = df2.index.to_list()
values = df2.to_list()
c = (
    Pie(init_opts=opts.InitOpts(theme=ThemeType.DARK))
    .add("", list(zip(regions, values)))
    .set_global_opts(
        legend_opts=opts.LegendOpts(is_show=False),
        title_opts=opts.TitleOpts(title="Down jacket price range distribution",
                                  subtitle="Data source: JD.com\nChart: vegetable J learn Python",
                                  pos_top="0.5%", pos_left='left'))
    .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {d}%", font_size=14))
)
c.render_notebook()
```
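For readers without pandas at hand, the bucketing that `pd.cut` and `value_counts` perform here can be reproduced with the standard library. The bins and right-open intervals match the cleaning step; the sample prices are made up, not the scraped data:

```python
from bisect import bisect_right
from collections import Counter

# Same right-open intervals as pd.cut(..., right=False): [0,100), [100,300), ...
bins = [100, 300, 500, 700, 1000]
labels = ['under 100 yuan', '100-300 yuan', '300-500 yuan',
          '500-700 yuan', '700-1000 yuan', 'over 1000 yuan']

def price_band(price):
    """Return the label of the interval containing price."""
    return labels[bisect_right(bins, price)]

prices = [89.0, 259.0, 259.0, 899.0, 1299.0]  # made-up sample
print(Counter(price_band(p) for p in prices))
```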

Comment count TOP10 shops

```python
df5 = df.groupby('shop name')['comments'].mean()
df5 = df5.sort_values(ascending=True)
df5 = df5.tail(10)
print(df5.index.to_list())
print(df5.to_list())
c = (
    Bar(init_opts=opts.InitOpts(theme=ThemeType.DARK, width="1100px", height="600px"))
    .add_xaxis(df5.index.to_list())
    .add_yaxis("", df5.to_list())
    .reversal_axis()  # swap the x and y axes
    .set_global_opts(
        title_opts=opts.TitleOpts(title="Comment count TOP10 shops",
                                  subtitle="Data source: JD.com\tChart: brother J",
                                  pos_left='left'),
        xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(font_size=11)),  # x-axis label size
        yaxis_opts=opts.AxisOpts(axislabel_opts={"rotate": 30}))  # rotate y-axis labels
    .set_series_opts(label_opts=opts.LabelOpts(font_size=16, position='right'))
)
c.render_notebook()
```
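The groupby-mean aggregation behind this chart boils down to summing and counting per key. A plain-Python equivalent of `df.groupby('shop name')['comments'].mean()`, with made-up shop names for illustration:

```python
from collections import defaultdict

def mean_by_group(rows):
    """rows: iterable of (shop, comment_count) -> {shop: mean comment count}."""
    totals = defaultdict(lambda: [0.0, 0])  # shop -> [sum, count]
    for shop, count in rows:
        totals[shop][0] += count
        totals[shop][1] += 1
    return {shop: s / n for shop, (s, n) in totals.items()}

rows = [("Shop A", 100), ("Shop A", 300), ("Shop B", 50)]
print(mean_by_group(rows))  # {'Shop A': 200.0, 'Shop B': 50.0}
```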

Average price by version

```python
df5 = df.groupby('version')['price'].mean()
df5 = df5.sort_values(ascending=True)[:2]
df5 = df5.round(2)
print(df5.index.to_list())
print(df5.to_list())
c = (
    Bar(init_opts=opts.InitOpts(theme=ThemeType.DARK, width="1000px", height="500px"))
    .add_xaxis(df5.index.to_list())
    .add_yaxis("", df5.to_list())
    .reversal_axis()  # swap the x and y axes
    .set_global_opts(
        title_opts=opts.TitleOpts(title="Average down jacket price by version",
                                  subtitle="Data source: JD.com\tChart: brother J",
                                  pos_left='left'),
        xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(font_size=11)),  # x-axis label size
        yaxis_opts=opts.AxisOpts(axislabel_opts={"rotate": 30}))  # rotate y-axis labels
    .set_series_opts(label_opts=opts.LabelOpts(font_size=16, position='right'))
)
c.render_notebook()
```

Average price by thickness

```python
df5 = df.groupby('thickness')['price'].mean()
df5 = df5.sort_values(ascending=True)[:2]
df5 = df5.round(2)
print(df5.index.to_list())
print(df5.to_list())
c = (
    Bar(init_opts=opts.InitOpts(theme=ThemeType.DARK, width="1000px", height="500px"))
    .add_xaxis(df5.index.to_list())
    .add_yaxis("", df5.to_list())
    .reversal_axis()  # swap the x and y axes
    .set_global_opts(
        title_opts=opts.TitleOpts(title="Average down jacket price by thickness",
                                  subtitle="Data source: JD.com\tChart: brother J",
                                  pos_left='left'),
        xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(font_size=11)),  # x-axis label size
        yaxis_opts=opts.AxisOpts(axislabel_opts={"rotate": 30}))  # rotate y-axis labels
    .set_series_opts(label_opts=opts.LabelOpts(font_size=16, position='right'))
)
c.render_notebook()
```

Average price by style

```python
df5 = df.groupby('style')['price'].mean()
df5 = df5.sort_values(ascending=True)[:4]
df5 = df5.round(2)
print(df5.index.to_list())
print(df5.to_list())
c = (
    Bar(init_opts=opts.InitOpts(theme=ThemeType.DARK, width="1000px", height="500px"))
    .add_xaxis(df5.index.to_list())
    .add_yaxis("", df5.to_list())
    .reversal_axis()  # swap the x and y axes
    .set_global_opts(
        title_opts=opts.TitleOpts(title="Average down jacket price by style",
                                  subtitle="Data source: JD.com\tChart: brother J",
                                  pos_left='left'),
        xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(font_size=11)),  # x-axis label size
        yaxis_opts=opts.AxisOpts(axislabel_opts={"rotate": 30}))  # rotate y-axis labels
    .set_series_opts(label_opts=opts.LabelOpts(font_size=16, position='right'))
)
c.render_notebook()
```

That is how Python crawls down jacket data and draws visualizations. Did you pick up any new knowledge or skills? If you want to learn more or broaden your knowledge, you are welcome to follow the industry information channel.
