2025-04-06 Update. From: SLTechnology News & Howtos > Development
Shulou(Shulou.com)06/02 Report--
Many beginners are unclear about how to crawl book data from the web with Python and then visualize it. To help solve this problem, this article walks through the process in detail; readers who need this can follow along, and I hope you gain something from it.
I. Development environment
Python 3.8
PyCharm 2021.2 Professional Edition
II. Modules used
csv: built-in module, used to save the crawled data into a table.
requests >>> pip install requests: data request module.
parsel >>> pip install parsel: data parsing module (CSS selectors to extract data).
III. Crawler code implementation steps
1. Import the required modules
2. Send the request: use Python code to simulate a browser sending the request
3. Parse the data and extract the content we want
4. Crawl multiple pages
5. Save the data into a csv table
1. Import the required modules

```python
import requests  # data request module (third-party: pip install requests)
import parsel    # data parsing module (third-party: pip install parsel)
import csv       # built-in module for saving data into a csv table
import time      # built-in time module
```

2. Send the request: simulate a browser sending the request with Python code
The headers request header lets the Python code disguise itself as a browser when sending a request to the server. User-Agent is the basic identity of the browser (the user agent). Note: requests raises "Invalid return character or leading space in header" if the User-Agent value contains stray return characters or leading spaces, so do not leave spaces in it. Use the get method of the requests module to send a request to the url address, carrying the header parameters above, and receive the returned data in the response variable.
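To illustrate the header-hygiene point above, here is a minimal sketch with a hypothetical helper (check_header_value is not part of requests) that flags User-Agent values requests would reject:

```python
def check_header_value(value):
    """Return True if a header value is safe to send:
    no leading/trailing whitespace and no CR/LF characters."""
    return value == value.strip() and '\r' not in value and '\n' not in value

ua = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
      '(KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36')
print(check_header_value(ua))        # True: the value is clean
print(check_header_value(' ' + ua))  # False: a leading space makes it invalid
```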
```python
url = f'http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-{page}'
# headers request header: dictionary data type
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36'
}
response = requests.get(url=url, headers=headers)
```

3. Parse the data and extract the desired content

```python
selector = parsel.Selector(response.text)  # turn the acquired html string into a Selector object
# css selectors extract the corresponding data
lis = selector.css('ul.bang_list li')
for li in lis:
    # .name locates the class name, a the tag; ::attr(title) takes the title attribute; get() fetches it
    title = li.css('.name a::attr(title)').get()                                 # title
    # ::text gets the text inside the tag directly
    comment = li.css('.star a::text').get().replace('comment', '')               # number of comments
    recommend = li.css('.star .tuijian::text').get().replace('recommended', '')  # recommended quantity
    author = li.css('.publisher_info a:nth-child(1)::attr(title)').get()         # author
    publish = li.css('div:nth-child(6) a::text').get()                           # publisher
    price_n = li.css('.price .price_n::text').get()                              # selling price
    price_r = li.css('.price .price_r::text').get()                              # original price
    price_s = li.css('.price .price_s::text').get()                              # discount
    price_e = li.css('.price .price_e .price_n::text').get()                     # ebook price
    href = li.css('.name a::attr(href)').get()                                   # details page
    dit = {
        'title': title,
        'number of comments': comment,
        'recommended quantity': recommend,
        'author': author,
        'publisher': publish,
        'selling price': price_n,
        'original price': price_r,
        'discount': price_s,
        'ebook price': price_e,
        'details page': href,
    }
    csv_writer.writerow(dit)  # save the row to csv
    print(title, comment, recommend, author, publish, price_n, price_r, price_s, price_e, href, sep=' | ')
```
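The .replace() calls in the loop strip label text from the scraped strings. A self-contained sketch of that kind of cleanup, where clean_count and clean_price are hypothetical helpers and the 'comment' suffix and currency sign are assumptions about the page text:

```python
def clean_count(text):
    """Strip a trailing label such as 'comment' from a scraped count string."""
    return text.replace('comment', '').strip() if text else ''

def clean_price(text):
    """Drop a currency sign and convert a scraped price like '¥45.50' to float."""
    return float(text.replace('¥', '').strip()) if text else None

print(clean_count('27233 comment'))  # '27233'
print(clean_price('¥45.50'))         # 45.5
```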
4. Multi-page crawl

```python
for page in range(1, 26):
    # string formatting method
    print(f'Crawling the data on page {page}')
    time.sleep(1.5)
    url = f'http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-{page}'
```

5. Save the data into a csv table

```python
# create the file and writer (this must run before the crawl loop uses csv_writer)
f = open('Dangdang Books.csv', mode='a', encoding='utf-8', newline='')
csv_writer = csv.DictWriter(f, fieldnames=[
    'title', 'number of comments', 'recommended quantity', 'author', 'publisher',
    'selling price', 'original price', 'discount', 'ebook price', 'details page',
])
csv_writer.writeheader()  # write the header row
```
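Because csv.DictWriter requires the fieldnames to match the dict keys passed to writerow exactly, a small standard-library demonstration may help (io.StringIO stands in for the real file here):

```python
import csv
import io

# An in-memory buffer avoids touching disk; with a real file,
# open(..., newline='') prevents blank lines between rows on Windows.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=['title', 'selling price'])
writer.writeheader()
writer.writerow({'title': 'Example Book', 'selling price': '45.50'})
print(buf.getvalue())
```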
Run the code, as shown in the following figure:
IV. Data visualization

1. Import the required modules

```python
import pandas as pd
from pyecharts.charts import *
from pyecharts.globals import ThemeType  # to set the theme
from pyecharts.commons.utils import JsCode
import pyecharts.options as opts
```

2. Import the data

```python
df = pd.read_csv('book information .csv', encoding='utf-8', engine='python')
df.head()
```
3. Visualization
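The chart snippets below use variables such as datas_pair_1, datas_pair_2, counts, and price_top that this excerpt never defines. A minimal sketch, assuming hypothetical sample data, of how such (name, value) pairs could be assembled with the standard library before being handed to pyecharts:

```python
from collections import Counter

# Hypothetical cleaned rows; in the article these come from the scraped csv.
publishers = ['People Press', 'Commercial Press', 'People Press']
prices = [25.0, 48.5, 69.9, 120.0]

# Publisher counts for the bar chart, as (name, count) pairs.
counts = Counter(publishers).most_common()

def bucket(p):
    """Assign a price to a label for the pie chart (bucket edges are assumptions)."""
    if p < 30: return '0-30 yuan'
    if p < 60: return '30-60 yuan'
    if p < 100: return '60-100 yuan'
    return '100+ yuan'

# Price-range pairs for the pie chart.
datas_pair_1 = Counter(bucket(p) for p in prices).most_common()
print(counts)
print(datas_pair_1)
```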
Overall price range of books:
```python
pie1 = (
    Pie(init_opts=opts.InitOpts(theme='dark', width='1000px', height='600px'))
    .add('', datas_pair_1, radius=['35%', '60%'])
    .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}:{d}%"))
    .set_global_opts(
        title_opts=opts.TitleOpts(
            title="Dangdang Books\n\nOriginal price range",
            pos_left='center',
            pos_top='center',
            title_textstyle_opts=opts.TextStyleOpts(color='#F0F8FF', font_size=20, font_weight='bold'),
        )
    )
    .set_colors(['#EF9050', '#3B7BA9', '#6FB27C', '#FFAF34', '#D8BFD8', '#00BFFF', '#7FFFAA'])
)
pie1.render_notebook()
```
```python
pie1 = (
    Pie(init_opts=opts.InitOpts(theme='dark', width='1000px', height='600px'))
    .add('', datas_pair_2, radius=['35%', '60%'])
    .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}:{d}%"))
    .set_global_opts(
        title_opts=opts.TitleOpts(
            title="Dangdang Books\n\nSelling price range",
            pos_left='center',
            pos_top='center',
            title_textstyle_opts=opts.TextStyleOpts(color='#F0F8FF', font_size=20, font_weight='bold'),
        )
    )
    .set_colors(['#EF9050', '#3B7BA9', '#6FB27C', '#FFAF34', '#D8BFD8', '#00BFFF', '#7FFFAA'])
)
pie1.render_notebook()
```
Bar chart of the number of books from each publisher:
```python
bar = (
    Bar(init_opts=opts.InitOpts(height='500px', width='1000px', theme='dark'))
    .add_xaxis(counts.index.tolist())
    .add_yaxis(
        'Number of books per publisher',
        counts.values.tolist(),
        label_opts=opts.LabelOpts(is_show=True, position='top'),
        itemstyle_opts=opts.ItemStyleOpts(color=JsCode(
            """new echarts.graphic.LinearGradient(0, 0, 0, 1,
                [{offset: 0, color: 'rgb(255,99,71)'},
                 {offset: 1, color: 'rgb(32,178,170)'}])"""
        )),
    )
    .set_global_opts(
        title_opts=opts.TitleOpts(title='Bar chart of the number of books per publisher'),
        xaxis_opts=opts.AxisOpts(
            name='Publisher',
            type_='category',
            axislabel_opts=opts.LabelOpts(rotate=90),
        ),
        yaxis_opts=opts.AxisOpts(
            name='Quantity',
            min_=0,
            max_=29.0,
            splitline_opts=opts.SplitLineOpts(is_show=True, linestyle_opts=opts.LineStyleOpts(type_='dash')),
        ),
        tooltip_opts=opts.TooltipOpts(trigger='axis', axis_pointer_type='cross'),
    )
    .set_series_opts(markline_opts=opts.MarkLineOpts(data=[
        opts.MarkLineItem(type_='average', name='mean'),
        opts.MarkLineItem(type_='max', name='maximum'),
        opts.MarkLineItem(type_='min', name='minimum'),
    ]))
)
bar.render_notebook()
Most expensive books (unit price), detailed bar chart:
```python
bar = (
    Bar(init_opts=opts.InitOpts(height='500px', width='1000px', theme='dark'))
    .add_xaxis(price_top.index.tolist())
    .add_yaxis(
        'Book unit price',
        price_top.values.tolist(),
        label_opts=opts.LabelOpts(is_show=True, position='top'),
        itemstyle_opts=opts.ItemStyleOpts(color=JsCode(
            """new echarts.graphic.LinearGradient(0, 0, 0, 1,
                [{offset: 0, color: 'rgb(255,99,71)'},
                 {offset: 1, color: 'rgb(32,178,170)'}])"""
        )),
    )
    .set_global_opts(
        title_opts=opts.TitleOpts(title='Detailed bar chart of the most expensive books'),
        xaxis_opts=opts.AxisOpts(
            name='Book title',
            type_='category',
            axislabel_opts=opts.LabelOpts(rotate=90),
        ),
        yaxis_opts=opts.AxisOpts(
            name='Unit price / yuan',
            min_=0,
            max_=1080.0,
            splitline_opts=opts.SplitLineOpts(is_show=True, linestyle_opts=opts.LineStyleOpts(type_='dash')),
        ),
        tooltip_opts=opts.TooltipOpts(trigger='axis', axis_pointer_type='cross'),
    )
    .set_series_opts(markline_opts=opts.MarkLineOpts(data=[
        opts.MarkLineItem(type_='average', name='mean'),
        opts.MarkLineItem(type_='max', name='maximum'),
        opts.MarkLineItem(type_='min', name='minimum'),
    ]))
)
bar.render_notebook()
```
Did you find the above content helpful? If you want to learn more or read more related articles, please follow the industry information channel. Thank you for your support.
© 2024 shulou.com SLNews company. All rights reserved.