This article shows how to use Python to crawl, in one pass, the Douban information for the books you care about. The approach is quite practical, so I am sharing it here; I hope you get something out of it. Without further ado, let's get into it.
Preface
Most Douban crawler write-ups cover the top 100 movies, movie reviews, the top 100 books, popular books and so on. The need I ran into recently was different: given a list of book titles (or an Excel file), crawl the corresponding bibliographic information from each book's Douban page, that is, the publisher, publication date, ISBN, price, rating, number of raters and other fields, then pull everything into pandas for processing and finally run some data analysis.
Source of demand
While organizing my reading list recently, I needed to fill in the publisher, publication date, ISBN, rating and similar attributes for several hundred book titles. The Excel book list is shown in Figure 1 below. Batch work like this has to be done with a crawler; I searched and found no article covering this exact case, and my own attempt ran into some interesting problems, so I am writing up my approach and process here.
Figure 1, screenshot of the book list data part
Crawling process
Page analysis
First, analyze the Douban Books home page, book.douban.com. When you search for a title directly, the search parameters are written into the URL, so the obvious idea is to request 'https://book.douban.com/subject_search?search_text={0}&cat=1001'.format(title) and simply vary the search_text parameter. Press F12 on that page to open the console, though, and the result is disappointing: the HTML returned by this URL contains no data, as shown in Figure 2. I spent some time looking for the JSON that is returned asynchronously and never found it (if anyone locates where the book data lives on the subject_search?search_text={0}&cat=1001 pages, please let me know). At this point Selenium or some other interface has to be considered.
Figure 2, html screenshot based on search url
JSON analysis
The Douban Books search box shows search suggestions, so open the Network tab in the console: the suggestion request returns JSON directly. Searching for "A Brief History of the Future", for example, gives the result below:
Figure 3, search suggestions for "A Brief History of the Future"
The useful fields in the returned JSON are: title (the book title), url (the book's Douban page), pic (the location of the cover image), and so on. Since the input here only has book titles, each title is matched against the returned JSON; with extra attributes such as author or publication year the match could be verified more reliably. For simplicity, the code below simply takes the first item of the returned JSON.
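As a quick sanity check before writing the full crawler, the suggestion endpoint can be probed for a single title. This is only a minimal sketch (the query string is just a placeholder; the fields read are the ones described above):

import json
import requests

r = requests.get('https://book.douban.com/j/subject_suggest?q={0}'.format('A Brief History of the Future'))
hits = json.loads(r.text)
if hits:
    print(hits[0]['title'], hits[0]['url'])  # take the first suggestion, as the main code below does
else:
    print('no suggestion returned for this title')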
Basic code
The returned url points to the page that holds the information we need to crawl. Once this path works end to end, the real code can be written. The code below is organized jupyter notebook style, i.e. split into fairly small cells. First, import the required libraries:
import json
import requests
import pandas as pd
from lxml import etree
Read the book list from Excel; only the title column is used and the other columns are ignored:
bsdf = pd.read_excel('booklistfortest.xlsx')
blst = list(bsdf['title'])  # list of titles
# bsdf.head(3)
We then loop over the title list; the attributes of each book are stored in a dictionary, and the list collects all of these dictionaries.
The search-suggestion JSON is fetched with requests.get('https://book.douban.com/j/subject_suggest?q={0}'.format(bn)), where bn is the title string.
Crawler parsing is generally done with BeautifulSoup or XPath; I prefer XPath, so the code below parses the text mainly with XPath.
Take the rating as an example: press Ctrl+Shift+I, or right-click the rating and choose Inspect, to locate the HTML that corresponds to it; then right-click the rating's markup and choose Copy -> Copy XPath. For the rating this gives: //*[@id="interest_sectl"]/div/div[2]/strong.
Figure 4, copy the xpath of the score
The rating can then be read with con.xpath('//*[@id="interest_sectl"]/div/div[2]/strong/text()'); this returns a list, of which we normally take element 0. The other fields work the same way, except that the author and publisher attributes sit in a looser structure and need special handling.
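Note that con.xpath(...) returns a list, and the list is empty when a page lacks the element, so taking [0] blindly can raise IndexError. A small helper like the following (my addition, not part of the original code) keeps a long run from crashing on odd pages:

def first_or_blank(con, xp):
    # return the first text node matched by the XPath, or '' when nothing matches
    hits = con.xpath(xp)
    return hits[0].strip() if hits else ''

# e.g. score = first_or_blank(con, '//*[@id="interest_sectl"]/div/div[2]/strong/text()')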
Figure 5, the loosely structured bibliographic info section
The key "Publisher" can be located with //*[@id="info"]/span[2], but the value of the attribute, i.e. which publisher it actually is, is loose text hanging directly on the info node. For this kind of variable-length HTML region a fixed XPath expression cannot be written; we have to sort out the HTML tree structure of the info block. Analyzing the info sections of several concrete pages gives the following tree structure:
Figure 6, the HTML tree of the info section
What we want to extract is data such as {'Publishing House': 'Citic Publishing Group'}. From the tree structure we can see that the keys (such as "Publisher") live in span elements, while the values may be bare text or may be wrapped in child elements of the span; either way, each key-value pair is terminated by a br tag. Taking these cases into account, the code is as follows:
def getBookInfo(binfo, cc):
    i = 0
    rss = {}
    k = ''
    v = ''
    flag = 0
    clw = []
    for c in cc:
        if '\n' in c:
            if '\xa0' in c:
                clw.append(c)
        else:
            clw.append(c)
    for m in binfo[0]:
        if m.tag == 'span':
            mlst = m.getchildren()
            if len(mlst) == 0:
                k = m.text.replace(':', '').replace(' ', '')
                if '\xa0' in clw[i]:
                    flag = 1  # the value has to be taken from the a tag that follows
                else:
                    v = clw[i].replace('\n', '').replace(' ', '')
                i += 1
            elif len(mlst) > 0:  # there is a sub-span under this span; judging by m.attrib == {} is not accurate enough
                for n in mlst:
                    if n.tag == 'span':
                        k = n.text.replace(':', '').replace(' ', '')  # no further span below, so no need for recursion
                    elif n.tag == 'a':
                        v = n.text.replace('\n', '').replace(' ', '')
        elif m.tag == 'a':
            if flag == 1:  # could this if be dropped?
                v = m.text.replace('\n', '').replace(' ', '')
                flag = 0
        elif m.tag == 'br':
            if k == '':
                print(i, 'err')
            else:
                rss[k] = v
        else:
            print(m.tag)
    return rss
To call this from the main loop, the logic above is wrapped into a function; getBookInfo() returns a dictionary that then has to be merged into the existing one. For merging dictionaries, one option is d = dict(d, **dw), where d is the old dictionary and dw is the new dictionary to merge into d; the simpler way is the d.update(dw) method. The code below uses update.
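With toy values, the two merge options look like this:

d = {'bname': 'some title'}
dw = {'Publishing House': 'some publisher', 'score': '8.0'}
merged = dict(d, **dw)   # builds a new dictionary containing both
d.update(dw)             # or merge dw into d in place; this is what the main loop uses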
Main loop code:
rlst = []
for bn in blst:
    res = {}
    r = requests.get('https://book.douban.com/j/subject_suggest?q={0}'.format(bn))
    rj = json.loads(r.text)
    # rj could be verified and filtered here
    html = requests.get(rj[0]['url'])  # could also consider checking the other returned matches
    con = etree.HTML(html.text)
    bname = con.xpath('//*[@id="wrapper"]/h2/span/text()')[0]  # compare with bn
    res['bname_sq'] = bn
    res['bname'] = bname
    res['dbid'] = rj[0]['id']  # no need to save the url, the id is enough
    # grabbing the whole info block is enough here; matching the needed elements more cleverly is still to be done
    binfo = con.xpath('//*[@id="info"]')
    cc = con.xpath('//*[@id="info"]/text()')
    res.update(getBookInfo(binfo, cc))  # call the function above to process binfo
    bmark = con.xpath('//*[@id="interest_sectl"]/div/div[2]/strong/text()')[0]
    if bmark == ' ':
        bits = con.xpath('//*[@id="interest_sectl"]/div/div[2]/div/div[2]/span/a/text()')[0]
        if bits == 'insufficient number of evaluators':  # the on-page text (shown here translated)
            res['score'] = ''
            res['number of evaluators'] = 'insufficient number of evaluators'
        else:
            res['score'] = ''
            res['number of evaluators'] = ''
    else:
        res['score'] = bmark.replace(' ', '')
        bmnum = con.xpath('//*[@id="interest_sectl"]/div/div[2]/div/div[2]/span/a/span/text()')[0]
        res['number of evaluators'] = bmnum
    rlst.append(res)
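The loop above fires two requests per title back to back and assumes every call succeeds. On a list of several hundred titles it is worth adding a short pause and basic error handling; here is a minimal sketch of that idea (the function name, retry count and pause length are my own choices, not from the original code; it reuses the requests, json and etree imports above):

import time

def fetch_book_page(bn, retries=3, pause=1.5):
    # return the parsed lxml tree of the first Douban match for title bn, or None on failure
    for _ in range(retries):
        try:
            r = requests.get('https://book.douban.com/j/subject_suggest?q={0}'.format(bn), timeout=10)
            r.raise_for_status()
            rj = json.loads(r.text)
            if not rj:  # no suggestion for this title
                return None
            page = requests.get(rj[0]['url'], timeout=10)
            page.raise_for_status()
            return etree.HTML(page.text)
        except requests.RequestException:
            time.sleep(pause)  # back off briefly before retrying
    return None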
The collected data can be standardized, analyzed and then written out. The list built above, rlst = [{'title': 'a', 'Publishing House': 'b'}, {...}, ...], can be converted to a DataFrame directly:
outdf = pd.DataFrame(rlst)  # convert to a DataFrame
outdf.to_excel('out_douban_binfo.xlsx', index=False)  # write out the data
Figure 7, Overview of crawled data
Statistical analysis of basic data
The bsdf read at the beginning has attributes such as title, author and reading time; because the crawled data may have missing values, the two tables are combined for analysis. The analysis dimensions include title, author, reading time, publisher, page count and so on. First merge the two tables with merge, then look at some basic statistics.
bdf = bsdf.merge(outdf, on='title', how='left')  # merge the two tables
# basic statistics
print('There are {0} books, {1} authors, {2} publishers'.format(
    len(bdf), len(set(bdf['author'])), len(set(bdf['Publishing House']))))
The output: 421 books, 309 authors and 97 publishers in total.
Next, look at the most frequent authors and publishers. bdf['author'].value_counts().head(7) outputs the seven authors that appear most often in the book list, and publishers work the same way. The calls and the resulting counts are shown below (Figure 8):
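Concretely, assuming the merged bdf from above:

top_authors = bdf['author'].value_counts().head(7)               # 7 most frequent authors
top_publishers = bdf['Publishing House'].value_counts().head(7)  # 7 most frequent publishers
print(top_authors)
print(top_publishers)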
Figure 8, statistics of publishers and authors
Judging from the appearance counts, the six most frequent authors are all novelists. We can also check which books by Wu Jun are in the list:
bdf.loc[bdf['author'] == 'Wu Jun', ['title', 'Reading time', 'Reading situation', 'Publishing House']]
# output:
'''
     title                                           Reading time  Reading situation  Publishing House
103  The Beauty of Mathematics                       2016-10-20    P5                 People's Posts and Telecommunications Press
233  The Intelligence Era                            2017-06-22    P4                 Citic Publishing House
     The Riddle of Silicon Valley                    2017-07-01    P4                 People's Posts and Telecommunications Press
     The Essence of Business and the Wisdom of Life  2018-10-21    P4                 Citic Press
'''
Next, count the number of books read per month:
import matplotlib.pyplot as plt  # matplotlib is used for plotting
%matplotlib inline
bdf['reading year'] = bdf['reading time'].apply(lambda x: x.strftime('%Y-%m'))
read_date = bdf['reading year'].value_counts()  # number of books read per month
read_date = pd.DataFrame(read_date, columns=['reading year'])  # convert from Series to DataFrame
read_date = read_date.sort_index()
plt.figure(figsize=(15, 5))
plt.xticks(rotation=90)  # format the time labels
plt.plot(read_date)  # with %matplotlib inline in jupyter there is no need for plt.show()
Figure 9, monthly readings, timeline line chart
Is there a pattern that repeats by month across the different years? A pivot table is the most convenient way to count this, and pandas provides pivot_table.
import numpy as np
bdf['reading year'] = bdf['reading time'].apply(lambda x: x.strftime('%Y'))
bdf['reading month'] = bdf['reading time'].apply(lambda x: x.strftime('%m'))  # .year and .month would also work here
r_dd = bdf.loc[:, ['reading year', 'reading month']]
r_dd['val'] = 1  # initialize the value to aggregate
r_dd = pd.pivot_table(r_dd, values='val', index=['reading month'], columns=['reading year'], aggfunc=np.sum).fillna(value=0)
# the details of this code can be found in the jupyter notebook output on my github
r_dd = r_dd.loc[:, ['2016', '2017', '2018']]  # months in the other years are incomplete, so only these three years are kept
plt.figure()
r_dd.plot(xticks=range(1, 13), figsize=(12, 5))
Figure 10, monthly readings by year
Over these three years, reading volume is generally higher in February and July; before July the monthly count rose year over year, while from August to December it fell year over year, and the single largest month was November 2016 with more than 40 books.
The score is a numerical variable, so a box plot is used to show its characteristics:
b_rank = pd.DataFrame(bdf['score'])  # score distribution (box plot)
b_rank.boxplot()
# in addition, the top 10 by score:
# bdf.sort_values(by='score', ascending=False).head(10).loc[:, ['title', 'author', 'reading time', 'Reading situation', 'Publishing House', 'score']]
Figure 11, book scoring box diagram
From the box chart, the average score of the books with ratings is about 7.8, 75% of the books are above 7.2, and some books are below 4.
Figure 12, books related to the data in the book list
37 titles in the book list contain the word "data" directly; the number of books actually related to data science should be higher than that.
Possible further analyses:
a word cloud of the titles and authors read
the provinces where the publishers are located
fitting the crawled page counts against the books' word counts, handling word count and page count together
converting the price attribute, which mixes several currencies, by exchange rate and then looking at the price distribution (a rough sketch of this conversion follows below)
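For the price idea, a rough sketch of the conversion could look like the following; the column name 'pricing', the currency detection and the exchange rates are all assumptions for illustration, not values from this article:

import re

rates_to_cny = {'CNY': 1.0, 'USD': 7.0, 'HKD': 0.9, 'TWD': 0.22, 'JPY': 0.05}  # placeholder rates

def to_cny(price_str):
    # parse a price string such as '45.00' or 'USD 25.99' and convert it to CNY; None if unparseable
    if not isinstance(price_str, str):
        return None
    m = re.search(r'(\d+(?:\.\d+)?)', price_str)
    if not m:
        return None
    value = float(m.group(1))
    for code, rate in rates_to_cny.items():
        if code in price_str.upper():
            return value * rate
    return value * rates_to_cny['CNY']  # default to CNY when no currency code is found

# bdf['price_cny'] = bdf['pricing'].apply(to_cny)   # column name assumed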
Summary
Above, a concrete requirement was used as practice for a problem that a crawler can solve. Douban is relatively easy to crawl, and the analysis of the bibliographic information is still quite meaningful. I did it with XPath; with BeautifulSoup the implementation would look different, but the process of analyzing the problem and building up the HTML tree structure is the same. The code above is fairly simple and does not include much validation or exception handling; comments and suggestions are welcome. For reference, a tiny sketch of the BeautifulSoup flavor follows.
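This is only a sketch for comparison; the URL is a placeholder and the CSS selectors are my assumptions about the page, not code from this article:

import requests
from bs4 import BeautifulSoup

page = requests.get('https://book.douban.com/subject/xxxxxxx/')  # placeholder book URL
soup = BeautifulSoup(page.text, 'lxml')
rating_tag = soup.select_one('#interest_sectl strong')           # the rating element
rating = rating_tag.get_text(strip=True) if rating_tag else ''
print(rating)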