How to crawl Dangdang book data with Python and visualize it

2025-04-06 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/02 Report--

Many beginners are unsure how to crawl book data from Dangdang with Python and then visualize it. This article walks through the process step by step for anyone who needs it.

I. Development environment

Python 3.8

Pycharm 2021.2 Professional Edition

II. Modules used

csv: built-in module that saves the crawled data to a CSV table.

requests: data request module (pip install requests).

parsel: data parsing module that extracts data with CSS selectors (pip install parsel).

III. Crawler implementation steps

Import the required modules

Send the request, using Python code to simulate a browser

Parse the data and extract the content we want

Crawl multiple pages

Save the data to a CSV table

1. Import the required modules

    import requests  # data request module; third-party, requires pip install requests
    import parsel  # data parsing module; third-party, requires pip install parsel
    import csv  # built-in module for saving data in CSV form
    import time  # built-in time module

2. Send the request, using Python code to simulate a browser

The headers (request headers) let the Python code disguise itself as a browser when it sends a request to the server.

User-Agent: the user agent, which carries the browser's basic identity information.

If requests reports "Invalid return character or leading space in header: User-Agent", remove the stray spaces or line breaks from the User-Agent value.

The get method of the requests module sends a request to the url address with the headers above, and the response variable receives the returned data.
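To see what the headers dictionary does without contacting any server, the sketch below builds the request and inspects it before it is sent. `requests.Request(...).prepare()` is used here only for inspection, and the shortened User-Agent string is a placeholder; the article's own code calls `requests.get` directly. The sketch also shows the leading-space mistake mentioned above being rejected.

```python
import requests
from requests.exceptions import InvalidHeader

url = 'http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-1'

# A well-formed header: no stray space around the name or the value.
good_headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
prepared = requests.Request('GET', url, headers=good_headers).prepare()
print(prepared.headers['User-Agent'])

# A value copied with a leading space is rejected before anything is sent.
bad_headers = {'User-Agent': ' Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
try:
    requests.Request('GET', url, headers=bad_headers).prepare()
    rejected = False
except InvalidHeader as error:
    rejected = True
    print('rejected:', error)
```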

    url = f'http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-{page}'
    # headers: request headers, a dictionary
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36'
    }
    response = requests.get(url=url, headers=headers)

3. Parse the data and extract the content we want

    selector = parsel.Selector(response.text)  # convert the HTML string into a Selector object
    # CSS selectors extract the corresponding data
    lis = selector.css('ul.bang_list li')
    for li in lis:
        # .name locates the tag whose class is "name"; a is the a tag;
        # attr() reads an attribute (here title); get() returns the data
        title = li.css('.name a::attr(title)').get()  # title
        # ::text gets the text inside the tag directly
        comment = li.css('.star a::text').get().replace('条评论', '')  # number of comments, without the "条评论" suffix
        recommend = li.css('.star .tuijian::text').get().replace('推荐', '')  # recommended quantity, without the "推荐" suffix
        author = li.css('.publisher_info a:nth-child(1)::attr(title)').get()  # author
        publish = li.css('div:nth-child(6) a::text').get()  # publisher
        price_n = li.css('.price .price_n::text').get()  # selling price
        price_r = li.css('.price .price_r::text').get()  # original price
        price_s = li.css('.price .price_s::text').get()  # discount
        price_e = li.css('.price .price_e .price_n::text').get()  # ebook price
        href = li.css('.name a::attr(href)').get()  # details page
        # the keys must match the fieldnames passed to csv.DictWriter
        dit = {
            'title': title,
            'number of comments': comment,
            'recommended quantity': recommend,
            'author': author,
            'publisher': publish,
            'selling price': price_n,
            'original price': price_r,
            'discount': price_s,
            'ebook price': price_e,
            'details page': href,
        }
        csv_writer.writerow(dit)  # save the data to the CSV file
        print(title, comment, recommend, author, publish, price_n, price_r, price_s, price_e, href, sep=' | ')

4. Crawl multiple pages

    for page in range(1, 26):
        # string formatting with an f-string
        print(f'Crawling the data on page {page}')
        time.sleep(1.5)
        url = f'http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-{page}'

5. Save the data to a CSV table

    # create the file to save to
    f = open('Dangdang Books.csv', mode='a', encoding='utf-8', newline='')
    csv_writer = csv.DictWriter(f, fieldnames=[
        'title',
        'number of comments',
        'recommended quantity',
        'author',
        'publisher',
        'selling price',
        'original price',
        'discount',
        'ebook price',
        'details page',
    ])
    csv_writer.writeheader()  # write the header

In the complete script, the file and csv_writer from step 5 are created first, and the code from steps 2 and 3 runs inside the loop from step 4.
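The five steps are presented separately, but at run time the CSV writer has to exist before the loop that calls `writerow`. The sketch below shows one possible way to order them. To stay runnable offline, `fetch_rows` is a hypothetical stand-in that returns fake rows instead of requesting and parsing a page, the output goes to an in-memory buffer instead of a file, and only two columns and three pages are used.

```python
import csv
import io
import time

FIELDNAMES = ['title', 'author']  # shortened column list for the sketch


def fetch_rows(page):
    """Hypothetical stand-in for steps 2 and 3 (request + parse) of one page."""
    return [{'title': f'Book {page}-{n}', 'author': f'Author {n}'} for n in (1, 2)]


def crawl(out, pages, delay=0.0):
    """Write the header once, then append the rows of every page."""
    writer = csv.DictWriter(out, fieldnames=FIELDNAMES)
    writer.writeheader()          # step 5 comes first: the writer exists before the loop
    total = 0
    for page in pages:            # step 4: multi-page crawl
        for row in fetch_rows(page):
            writer.writerow(row)  # keys must match FIELDNAMES exactly
            total += 1
        time.sleep(delay)         # pause between pages, as in the article
    return total


buffer = io.StringIO()
count = crawl(buffer, range(1, 4))
print(count)
print(buffer.getvalue().splitlines()[0])
```

With a real file, `with open(path, mode='w', encoding='utf-8', newline='') as f:` closes the file automatically. Note that with `mode='a'`, every run appends another header row to the existing file.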

Run the code, as shown in the following figure:

IV. Data visualization

1. Import the required modules

    import pandas as pd
    from pyecharts.charts import *
    from pyecharts.globals import ThemeType  # for setting the theme
    from pyecharts.commons.utils import JsCode
    import pyecharts.options as opts

2. Import the data

    df = pd.read_csv('book information.csv', encoding='utf-8', engine='python')
    df.head()
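The scraped price columns are text, so they usually need converting to numbers before they can be charted. The article does not show this step; the sketch below is one possible way, assuming the prices were saved as strings such as `¥35.00` and using an in-memory CSV with made-up rows in place of the real file.

```python
import io

import pandas as pd

# Made-up rows standing in for the scraped CSV file.
csv_text = """title,selling price,original price
Book A,¥35.00,¥50.00
Book B,¥12.50,¥25.00
Book C,,¥30.00
"""

df = pd.read_csv(io.StringIO(csv_text))

for column in ['selling price', 'original price']:
    # Drop the currency sign, then convert; missing or unparseable cells become NaN.
    df[column] = pd.to_numeric(
        df[column].str.replace('¥', '', regex=False), errors='coerce'
    )

print(df.dtypes)
print(df)
```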

3. Visualization

Overall price range of books:

    pie1 = (
        Pie(init_opts=opts.InitOpts(theme='dark', width='1000px', height='600px'))
        .add('', datas_pair_1, radius=['35%', '60%'])
        .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}:{d}%"))
        .set_global_opts(
            title_opts=opts.TitleOpts(
                title="Dangdang Books\n\nOriginal price range",
                pos_left='center',
                pos_top='center',
                title_textstyle_opts=opts.TextStyleOpts(
                    color='#F0F8FF',
                    font_size=20,
                    font_weight='bold',
                ),
            )
        )
        .set_colors(['#EF9050', '#3B7BA9', '#6FB27C', '#FFAF34', '#D8BFD8', '#00BFFF', '#7FFFAA'])
    )
    pie1.render_notebook()

    pie1 = (
        Pie(init_opts=opts.InitOpts(theme='dark', width='1000px', height='600px'))
        .add('', datas_pair_2, radius=['35%', '60%'])
        .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}:{d}%"))
        .set_global_opts(
            title_opts=opts.TitleOpts(
                title="Dangdang Books\n\nPrice range",
                pos_left='center',
                pos_top='center',
                title_textstyle_opts=opts.TextStyleOpts(
                    color='#F0F8FF',
                    font_size=20,
                    font_weight='bold',
                ),
            )
        )
        .set_colors(['#EF9050', '#3B7BA9', '#6FB27C', '#FFAF34', '#D8BFD8', '#00BFFF', '#7FFFAA'])
    )
    pie1.render_notebook()
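`datas_pair_1` and `datas_pair_2` are not defined in the code shown. `Pie.add` takes a list of (name, value) pairs, so one plausible way to build them is to count how many prices fall in each range. The bin edges, labels, and sample prices in this sketch are illustrative assumptions, not the article's values.

```python
from bisect import bisect_right
from collections import Counter

# Illustrative bin edges in yuan; each bin includes its lower edge,
# and the last bin is open-ended.
EDGES = [20, 40, 60, 100]
LABELS = ['0-20', '20-40', '40-60', '60-100', '100+']


def price_pairs(prices):
    """Count prices per range and return [label, count] pairs for Pie.add."""
    counts = Counter(LABELS[bisect_right(EDGES, price)] for price in prices)
    # Keep the label order and skip empty ranges.
    return [[label, counts[label]] for label in LABELS if counts[label]]


sample_prices = [12.5, 19.9, 35.0, 45.0, 58.0, 59.0, 128.0]
datas_pair_1 = price_pairs(sample_prices)
print(datas_pair_1)
```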

Bar chart of the number of books in each publishing house:

    bar = (
        Bar(init_opts=opts.InitOpts(height='500px', width='1000px', theme='dark'))
        .add_xaxis(counts.index.tolist())
        .add_yaxis(
            'Number of books per publisher',
            counts.values.tolist(),
            label_opts=opts.LabelOpts(is_show=True, position='top'),
            itemstyle_opts=opts.ItemStyleOpts(
                color=JsCode("""new echarts.graphic.LinearGradient(
                    0, 0, 0, 1,
                    [{offset: 0, color: 'rgb(255,99,71)'}, {offset: 1, color: 'rgb(32,178,170)'}])""")
            ),
        )
        .set_global_opts(
            title_opts=opts.TitleOpts(title='Bar chart of the number of books per publisher'),
            xaxis_opts=opts.AxisOpts(
                name='Book title',
                type_='category',
                axislabel_opts=opts.LabelOpts(rotate=90),
            ),
            yaxis_opts=opts.AxisOpts(
                name='Quantity',
                min_=0,
                max_=29.0,
                splitline_opts=opts.SplitLineOpts(
                    is_show=True,
                    linestyle_opts=opts.LineStyleOpts(type_='dash'),
                ),
            ),
            tooltip_opts=opts.TooltipOpts(trigger='axis', axis_pointer_type='cross'),
        )
        .set_series_opts(
            markline_opts=opts.MarkLineOpts(
                data=[
                    opts.MarkLineItem(type_='average', name='Mean'),
                    opts.MarkLineItem(type_='max', name='Maximum'),
                    opts.MarkLineItem(type_='min', name='Minimum'),
                ]
            )
        )
    )
    bar.render_notebook()
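`counts` is not defined in the code shown. Its use of `.index` and `.values` suggests a pandas Series, most likely the number of rows per publisher. A minimal sketch under that assumption, with made-up data and a `publisher` column name matching the CSV header:

```python
import pandas as pd

# Made-up rows; the real data would come from the scraped CSV.
df = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D', 'E', 'F'],
    'publisher': [
        'North Press', 'South Press', 'North Press',
        'East Press', 'North Press', 'South Press',
    ],
})

# value_counts() returns a Series sorted by count, largest first.
counts = df['publisher'].value_counts()

x_labels = counts.index.tolist()   # publisher names for add_xaxis
y_values = counts.values.tolist()  # book counts for add_yaxis
print(x_labels, y_values)
```

Under the same assumption, the hard-coded `max_=29.0` on the y-axis could be derived from the data with `int(counts.max())`.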

Highest number of book reviews Top20:

    bar = (
        Bar(init_opts=opts.InitOpts(height='500px', width='1000px', theme='dark'))
        .add_xaxis(price_top.index.tolist())
        .add_yaxis(
            'Book unit price',
            price_top.values.tolist(),
            label_opts=opts.LabelOpts(is_show=True, position='top'),
            itemstyle_opts=opts.ItemStyleOpts(
                color=JsCode("""new echarts.graphic.LinearGradient(
                    0, 0, 0, 1,
                    [{offset: 0, color: 'rgb(255,99,71)'}, {offset: 1, color: 'rgb(32,178,170)'}])""")
            ),
        )
        .set_global_opts(
            title_opts=opts.TitleOpts(title='Detailed bar chart of the most expensive books'),
            xaxis_opts=opts.AxisOpts(
                name='Book title',
                type_='category',
                axislabel_opts=opts.LabelOpts(rotate=90),
            ),
            yaxis_opts=opts.AxisOpts(
                name='Unit price / yuan',
                min_=0,
                max_=1080.0,
                splitline_opts=opts.SplitLineOpts(
                    is_show=True,
                    linestyle_opts=opts.LineStyleOpts(type_='dash'),
                ),
            ),
            tooltip_opts=opts.TooltipOpts(trigger='axis', axis_pointer_type='cross'),
        )
        .set_series_opts(
            markline_opts=opts.MarkLineOpts(
                data=[
                    opts.MarkLineItem(type_='average', name='Mean'),
                    opts.MarkLineItem(type_='max', name='Maximum'),
                    opts.MarkLineItem(type_='min', name='Minimum'),
                ]
            )
        )
    )
    bar.render_notebook()
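`price_top` is likewise not defined in the code shown. Given the chart title, it is presumably a pandas Series of the highest prices indexed by book title. One way to build such a Series, assuming the prices are already numeric and using made-up rows and hypothetical column names:

```python
import pandas as pd

# Made-up rows; the real data would come from the cleaned CSV.
df = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'selling price': [35.0, 1080.0, 12.5, 99.0],
})

# Index by title, then keep the N largest prices in descending order.
# N is 3 here to fit the tiny sample; a Top 20 chart would use nlargest(20).
price_top = df.set_index('title')['selling price'].nlargest(3)

top_titles = price_top.index.tolist()   # titles for add_xaxis
top_prices = price_top.values.tolist()  # prices for add_yaxis
print(top_titles, top_prices)
```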
