In this article, the editor shares how to use Python to crawl all the data analysis books on Dangdang. Most readers may not know much about this, so the article is shared for your reference. I hope you gain a lot from reading it. Let's learn about it together.
1. Crawl the target
To crawl Dangdang's book information, first open the Dangdang website and search with "data analysis" as the keyword to list all matching books, as shown below:
This crawl extracts 11 fields: (1) serial number within the page, (2) product ID, (3) title, (4) book price, (5) original price, (6) discount, (7) e-book price, (8) author, (9) publication time, (10) publisher, (11) book reviews.
2. Crawling process
(1) Determine the URL address
Analyze the web page: after entering the "data analysis" keyword and searching, scroll to the bottom of the results page to see the pagination shown in the figure below:
The results are paginated, so click pages 2, 3 and 1 in turn and copy each page's URL:
http://search.dangdang.com/?key=%CA%FD%BE%DD%B7%D6%CE%F6&act=input&page_index=2
http://search.dangdang.com/?key=%CA%FD%BE%DD%B7%D6%CE%F6&act=input&page_index=3
http://search.dangdang.com/?key=%CA%FD%BE%DD%B7%D6%CE%F6&act=input&page_index=1
Comparing the URLs of the pages, the only difference is the value of page_index, so the base URL can be confirmed as:
http://search.dangdang.com/?key=%CA%FD%BE%DD%B7%D6%CE%F6&act=input&page_index=
The value of page_index can then be appended to the address in turn through a loop.
The code is as follows:
urls = ['http://search.dangdang.com/?key=%CA%FD%BE%DD%B7%D6%CE%F6&act=input&page_index={}'.format(i) for i in range(1, 101)]
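For readability, the key parameter in these URLs is just the keyword "数据分析" ("data analysis") percent-encoded in GBK. A minimal sketch of building the same list from a plain-text keyword, assuming Dangdang accepts GBK-encoded query strings as the captured URLs suggest:

from urllib.parse import quote

keyword = '数据分析'  # "data analysis"
key = quote(keyword, encoding='gbk')  # -> %CA%FD%BE%DD%B7%D6%CE%F6
urls = ['http://search.dangdang.com/?key={}&act=input&page_index={}'.format(key, i)
        for i in range(1, 101)]  # pages 1 to 100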
(2) Determine the crawling nodes
With the URL address, you can request the page and parse it with the lxml library to extract the specific information. Right-click the page and choose "Inspect", as shown below:
By searching the page's HTML for the displayed information, you can find the li element that corresponds to each book. The details are shown in the following figure:
You can see the book title, price and other information, which can then be extracted one by one with XPath. The detailed code is as follows:
html = requests.get(url, headers=headers)
# html.encoding = "utf-8"
# print('whether the first request returned normally:', html)
html.encoding = html.apparent_encoding  # fix the garbled encoding
selector = etree.HTML(html.text)
# print(selector)
datas = selector.xpath('//div[@class="con shoplist"]')
# print(datas)
for data in datas:
    Classs = data.xpath('div/ul/li/@class')                       # line1 - line60
    IDDs = data.xpath('div/ul/li/@id')                            # product ID
    titles = data.xpath('div/ul/li/a/@title')                     # title
    prices = data.xpath('div/ul/li/p[3]/span[1]/text()')          # book price
    source_prices = data.xpath('div/ul/li/p[3]/span[2]/text()')   # original book price
    discounts = data.xpath('div/ul/li/p[3]/span[3]/text()')       # book discount
    # dian_prices = data.xpath('div/ul/li/p[3]/a[2]/i/text()')    # e-book price
    authors = data.xpath('div/ul/li/p[5]/span[1]/a[1]/@title')    # author
    publish_times = data.xpath('div/ul/li/p[5]/span[2]/text()')   # publication time
    publishs = data.xpath('div/ul/li/p[5]/span[3]/a/text()')      # publisher
    comments = data.xpath('div/ul/li/p[4]/a/text()')              # book reviews
    urls = data.xpath('div/ul/li/a/@href')                        # detail-page URL
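For completeness, the code above assumes that the requests library, lxml's etree and a headers dictionary carrying a browser User-Agent are already set up, and the database section below additionally needs pymysql. A minimal sketch of that setup (the User-Agent string is only illustrative, not taken from the original article):

import requests
import pymysql
from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'  # illustrative browser UA
}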
Note: if you want to crawl the e-book price, be aware that some books have no e-book edition, so extracting it directly here would leave the columns misaligned. Instead, extract each book's detail-page URL and recursively crawl the detail page, handling null values there, to avoid misaligned rows.
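A minimal sketch of that idea is shown below; get_dian_price is a hypothetical helper, not code from the original article, and the XPath inside it is a placeholder that would need to match the real detail-page layout:

def get_dian_price(detail_url):
    # Hypothetical helper: fetch the book's detail page and return its e-book price,
    # or an empty string when the book has no e-book edition.
    try:
        page = requests.get(detail_url, headers=headers)
        page.encoding = page.apparent_encoding
        sel = etree.HTML(page.text)
        price = sel.xpath('//span[@id="dd-price"]/text()')  # placeholder XPath, adjust to the real page
        return price[0].strip() if price else ''
    except requests.RequestException:
        return ''

# urls here are the detail-page links collected in the XPath step above
dian_prices = [get_dian_price(u) for u in urls]  # one value per book keeps the columns aligned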
(3) Save the crawled information to the database
Here we store the crawled information in a database, so we need to connect to the database and create a table for subsequent storage. The connection and table-creation code is as follows:
db = pymysql.connect(host='localhost', user='root',
                     passwd='database password',   # replace with your MySQL password
                     db='Learn_data',              # database name
                     port=3306, charset='utf8')
print("Database connected")
cursor = db.cursor()
cursor.execute("DROP TABLE IF EXISTS Learn_data.dangdangweb_info_detail")
sql = """CREATE TABLE IF NOT EXISTS Learn_data.dangdangweb_info_detail (
    id INT AUTO_INCREMENT PRIMARY KEY,
    Class CHAR(60), IDD CHAR(60), title CHAR(255), price CHAR(60),
    source_price CHAR(60), discount CHAR(60), author CHAR(255),
    publish_time CHAR(60), publish CHAR(255), comment CHAR(60), dian_price CHAR(60)
) DEFAULT CHARSET=utf8"""   # column lengths are illustrative; size them to your data
cursor.execute(sql)
The crawled data is stored in the table as follows:
cursor.execute("insert into dangdangweb_info_detail (Class, IDD, title, price, source_price, discount, author, publish_time, publish, comment, dian_price)"
               " values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)",
               (str(Class), str(IDD), str(title), str(price), str(source_price), str(discount),
                str(author), str(publish_time), str(publish), str(comment), str(dian_price[0])))
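The insert statement above stores one book at a time, while the XPath step collected parallel lists; the article does not show the glue code, so the zip loop below is only an assumed sketch of how the two pieces could be connected:

rows = zip(Classs, IDDs, titles, prices, source_prices, discounts,
           authors, publish_times, publishs, comments, dian_prices)
for Class, IDD, title, price, source_price, discount, \
        author, publish_time, publish, comment, dian_price in rows:
    cursor.execute(
        "insert into dangdangweb_info_detail (Class, IDD, title, price, source_price,"
        " discount, author, publish_time, publish, comment, dian_price)"
        " values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)",
        (str(Class), str(IDD), str(title), str(price), str(source_price), str(discount),
         str(author), str(publish_time), str(publish), str(comment), str(dian_price)))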
Finally, you must call db.commit() after the inserts and then close the database; otherwise the data will not actually be written to the table.
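A minimal sketch of that final step:

db.commit()     # commit the transaction so the inserted rows are actually persisted
cursor.close()
db.close()      # release the connection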
3. Crawl result
Finally, after putting the above code together, the crawl runs normally. A screenshot of the stored result is shown below:
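If you want to spot-check the stored rows without a GUI client, a quick query run before closing the connection works as well (a sketch against the table created above):

cursor.execute("SELECT title, price, author, publish FROM Learn_data.dangdangweb_info_detail LIMIT 5")
for row in cursor.fetchall():
    print(row)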
The above is the full content of the article "how to use Python to crawl all the data analysis books on Dangdang". Thank you for reading! I hope the shared content helps you; if you want to learn more, welcome to follow the industry information channel!