
How to crawl electronic textbooks with Python for children studying at home


This article gives a detailed analysis of how to crawl electronic textbooks with Python for children studying at home, in the hope of helping readers who face this problem find a simpler and easier approach.

During this period of nationwide epidemic prevention, primary and secondary school students have started taking classes online at home. Many children who could not borrow printed books have to read electronic textbooks online, and some of these textbooks are just web links sent by teachers. Opening a web page every time to read is not only a waste of data but also inconvenient. Today we will use Python's crawling capability to download a web-linked textbook and turn it into a local PDF file, so that children can read it at any time. The online textbook crawled in this article's example is shown in the figure below:

Figure 1: Home page of the electronic textbook

The implementation can be divided into two parts:

Use Python to crawl all the textbook pictures from the website.

Merge the pictures into a single PDF file.

Specific process:

First, crawl the textbook pictures

A crawler involves four steps: make a request, get the web page, parse the content, and save the content.

As explained in the earlier post on batch-crawling web images with Python, every picture in a web page has its own URL. To crawl a picture, you first extract the picture's URL from the page, and then request the picture from that URL.

1. Issue a request:

First, find the appropriate URL. Because this is a static page, we can use the URL in the browser's address bar directly. The red box in Figure 2 below shows the URL to use; just copy it.

Figure 2: The URL in the browser address bar can be used to make the request

The web address is: http://www.shuxue9.com/beishida/cz8x/ebook/1.html

2. Send a request to get a response:

import requests

url = "http://www.shuxue9.com/beishida/cz8x/ebook/1.html"
response = requests.get(url)

3. Parse the response to get the content of the web page:

from bs4 import BeautifulSoup

soup = BeautifulSoup(response.content, 'lxml')

4. Analyze the content of the web page and obtain the URL of the picture:

jpg_url = soup.find('div', class_="center").find('a').find('img')['src']

5. Send a request to the picture URL and get the picture (the URL points directly to the image file, so no parsing with find is needed):

jpg = requests.get(jpg_url).content
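One caveat: if the img tag's src attribute turns out to be a relative path rather than a full address (an assumption; the original tutorial treats it as absolute), a minimal sketch using urllib.parse.urljoin can turn it into an absolute URL before requesting the picture:

from urllib.parse import urljoin

# only needed if jpg_url is relative, e.g. "/images/1.jpg"
jpg_url = urljoin(url, jpg_url)
jpg = requests.get(jpg_url).content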

6. Save the picture:

f = open(set_path() + number + '.jpg', 'wb')   # number is the page number as a string
f.write(jpg)
f.close()

set_path() builds, in advance, the folder used to store the images; see the code below. Alternatively, you can simply write the path you want to use directly:

import os

def set_path():
    path = r'./textbook_images'   # replace with whatever folder you want the images saved to
    if not os.path.exists(path):
        os.makedirs(path)
    paths = path + '/'
    return paths

7. Existing problems:

The steps above complete the crawling of the textbook pictures, but when we open the folder we find that only one picture was downloaded and none of the following pages. This is because each page of the book has a different URL. Analyzing them, we find that the URLs of the e-textbook's pages follow a clear pattern:

Page 1 URL: http://www.shuxue9.com/beishida/cz8x/ebook/1.html

Page 2 URL: http://www.shuxue9.com/beishida/cz8x/ebook/2.html

...

Page n URL: http://www.shuxue9.com/beishida/cz8x/ebook/n.html

The picture URL on each page is different and irregular, but we can visit the page URLs in a loop according to the pattern above: after fetching one picture, the loop automatically moves on to the next page URL, until all the pictures have been downloaded.

8. Set cyclic extraction:

All of the steps above are wrapped in a for loop. According to the website, there are 152 pages in total. After setting up the loop, the complete code is:

import requests, os
from bs4 import BeautifulSoup

for i in range(1, 153):
    # issue the request
    url = "http://www.shuxue9.com/beishida/cz8x/ebook/{}.html".format(i)
    response = requests.get(url)
    # parse the web page
    soup = BeautifulSoup(response.content, 'lxml')
    # extract the picture URL from the page
    jpg_url = soup.find('div', class_="center").find('a').find('img')['src']
    # request the picture itself
    jpg = requests.get(jpg_url).content
    # set the path used to save the pictures (replace with your own folder)
    p = r'./textbook_images'
    if not os.path.exists(p):
        os.makedirs(p)
    # save the picture
    f = open(p + '/' + str(i) + '.jpg', 'wb')
    f.write(jpg)
    f.close()
print("download complete")

Run the program and all the textbook pictures are downloaded in one go; the result is shown below:

Figure 3: Running the program to download the pictures

Figure 4: The downloaded pictures

Second, merge the pictures into a PDF file

Once the pictures are downloaded, it is more convenient to read them as a single PDF file. There is dedicated software for this, but the free trial versions can only merge a few pictures. Here is a free and widely available method: use PowerPoint to merge multiple pictures into one PDF file.

Create a new blank PowerPoint file and click Insert > Photo Album > New Photo Album.

In the pop-up dialog, click "File/Disk" in the upper left corner to import all the pictures you just downloaded. After importing, they appear as shown in the red box on the right of the figure; then click "Create" and save the file in PDF format.
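If you prefer to stay in Python instead of using PowerPoint, a minimal sketch with the Pillow library can also merge the downloaded JPGs into one PDF. Pillow is not part of the original tutorial, and the folder and output names below are assumptions; adjust them to the path you used when downloading:

import os
from PIL import Image   # pip install pillow

folder = r'./textbook_images'          # same folder the pictures were saved to
# sort files numerically: 1.jpg, 2.jpg, ..., 152.jpg
files = sorted(os.listdir(folder), key=lambda name: int(os.path.splitext(name)[0]))
pages = [Image.open(os.path.join(folder, name)).convert('RGB') for name in files]
# the first image starts the PDF; the rest are appended as extra pages
pages[0].save('textbook.pdf', save_all=True, append_images=pages[1:])

This keeps the pages in reading order because the file names are the page numbers, which is why the numeric sort matters.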

Summary:

At this point we have crawled the pictures of the electronic textbook from the web page and generated a local PDF file. How to find and extract a picture's URL from a web page was described in detail in an earlier article; if you have any questions, feel free to ask or leave a message.

Here is another easy way to find a picture's URL in a web page: open the developer tools, click the arrow (element picker) icon in the upper left corner, then click the picture on the page whose URL you want; the location of the image URL in the source is automatically highlighted. As shown below:

This is the answer to how to crawl electronic textbooks with Python for children taking classes at home. I hope the content above is helpful. If you still have questions, you can follow the industry information channel for more related knowledge.
