How to use Python to crawl all Python books on Dangdang

2025-01-19 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article shows how to use Python to crawl all the Python books on Dangdang. Many readers may not be familiar with the process, so it is shared here for reference. Let's walk through it together.

1 Determine the crawling target

Any website can be crawled; it only depends on whether you want to crawl it. The target chosen here is Dangdang, and the content to crawl is the information for every book on the result pages returned by searching for the keyword Python, as shown in the following figure:

The crawl collects three items for each book:

The cover image of the book

The title of the book

The link to the book's page

Finally, these three items are saved to a csv file.

2 Crawling process

The page DOM tree of every site is different. So we first analyze the page to be crawled, then decide which content we want, and then define the rules the program uses to extract it.

2.1 Determine the URL address

We can use the browser to determine the URL address and provide the entry address for the urllib request. Next, we determine the request address step by step.

When the search results page is 1, the URL address is as follows:

When the search results page is 3, the URL address is as follows:

When the search results page is 21, the last page, the URL address is as follows:

From the picture above, we find that the URL addresses differ only in the value of page_index, so the base URL address is http://search.dangdang.com/?key=python&act=input&show=big&page_index=. The value of page_index can be appended to this address in turn through a loop. Therefore, the urllib request code can be written as follows:
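A minimal sketch of such a request loop, assuming Python 3's urllib.request and a last page of 21 as observed above. The names build_url, fetch_page and iter_pages, the User-Agent header and the timeout are illustrative assumptions, not taken from the original source code:

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_URL = "http://search.dangdang.com/"
LAST_PAGE = 21  # last results page observed above; this may change over time


def build_url(page_index, keyword="python"):
    """Build the search URL for one results page."""
    params = {"key": keyword, "act": "input", "show": "big",
              "page_index": page_index}
    return BASE_URL + "?" + urlencode(params)


def fetch_page(page_index):
    """Download one results page and return its raw bytes."""
    # The User-Agent header is an assumption; some sites reject the default one.
    request = Request(build_url(page_index),
                      headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=10) as response:
        return response.read()


def iter_pages(first=1, last=LAST_PAGE):
    """Yield (page_index, html_bytes) for every results page in turn."""
    for page_index in range(first, last + 1):
        yield page_index, fetch_page(page_index)
```

Returning the raw bytes rather than a decoded string leaves the choice of character encoding to the parser that consumes the page.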

2.2 Determine the crawl node

With the URL address, we can use urllib to get the HTML content of the page. Next, we need to find the pattern of the nodes to crawl so that BeautifulSoup can parse them. For this we bring out a powerful tool: the developer tools of the Chrome browser (opened by pressing F12). Press F12 and inspect the element of each book in turn (right-click on the page and choose "Inspect"). The specific results are as follows:

From the above picture, we can see the parsing rule: the node of each book is an a tag. The a tag carries the title and href attributes, and its child img tag carries the src attribute; these correspond to the title of the book, the link page of the book, and the cover of the book respectively. Isn't it a little exciting to find exactly the information we are after? With the parsing rule in hand, the BeautifulSoup parsing code can be written as follows:
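A minimal sketch of that parsing rule, assuming the bs4 package is installed. SAMPLE_HTML is a made-up fragment that only mimics the structure described above (the real Dangdang markup may differ), and parse_books is an illustrative name:

```python
from bs4 import BeautifulSoup

# Made-up fragment mimicking the structure described above; not real page data.
SAMPLE_HTML = """
<ul>
  <li><a title="Python Basics" href="http://example.com/book/1">
        <img src="http://example.com/cover/1.jpg"></a></li>
  <li><a title="Python Crawlers" href="http://example.com/book/2">
        <img src="http://example.com/cover/2.jpg"></a></li>
  <li><a href="http://example.com/help">Help</a></li>
</ul>
"""


def parse_books(html):
    """Return a list of (cover, title, link) tuples found in html."""
    soup = BeautifulSoup(html, "html.parser")
    books = []
    # Keep only <a> tags that carry both a title and an href attribute...
    for a_tag in soup.find_all("a", title=True, href=True):
        img_tag = a_tag.find("img")
        # ...and that wrap an <img> with a src attribute (the cover).
        if img_tag is None or not img_tag.get("src"):
            continue
        books.append((img_tag["src"], a_tag["title"], a_tag["href"]))
    return books


for cover, title, link in parse_books(SAMPLE_HTML):
    print(title, link, cover)
```

On a real results page this attribute-only rule may match more anchors than wanted; narrowing it by a class name seen in the developer tools would be the natural refinement.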

The running results are as follows:

This shows that the rule we just defined correctly extracts the content we need.

2.3 Save crawl information

I have a habit when writing crawlers: every crawl saves its content to a file, which makes it easy to check and reuse later. If the amount of crawled data is large, it can also be used for data analysis. For convenience, I save the data to a csv file. When writing data to a file with Python, garbled Chinese text is a common worry, and using the csv library alone may not remove it. So we use csv together with codecs, which lets us specify the file encoding when writing the csv file. In this way, the problem of garbled Chinese text is easily solved. The specific code is as follows:
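A minimal sketch of combining csv and codecs, assuming Python 3 and UTF-8 as the file encoding. The function name save_books and the header row are illustrative assumptions; the file name PythonBook.csv is the one used in this article:

```python
import codecs
import csv


def save_books(books, path="PythonBook.csv"):
    """Write (cover, title, link) rows to a UTF-8 encoded csv file."""
    # codecs.open lets us state the file encoding explicitly.
    with codecs.open(path, "w", "utf-8") as csvfile:
        writer = csv.writer(csvfile)
        # Header names are illustrative, not from the original source.
        writer.writerow(["cover", "title", "link"])
        for cover, title, link in books:
            writer.writerow([cover, title, link])
```

On Python 3, the built-in open(path, "w", newline="", encoding="utf-8") serves the same purpose; codecs is kept here because the text above names it.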

At this point, you may ask why the encoding is not set to gb2312, so that the file is not garbled when opened with Excel. The reason is that when a book title consisted entirely of English words, writer.writerow() raised an encoding error under gb2312.

If you want to open the PythonBook.csv file with Excel, perform the following steps:

1) Open Excel

2) Choose "Data" -> "From Text"

3) Select the CSV file; the Text Import Wizard appears

4) Select "Delimited", then click Next

5) Check "Comma", uncheck "Tab", click Next, then Finish

6) In the "Import Data" dialog box, click OK directly.

3 Crawling results

Finally, we can combine the code above into one program. The full code is not posted here; you can read the original text to see the source code. Here is a screenshot of the crawling result:

