How a Python crawler converts tutorials into PDF e-books


This article introduces how a Python crawler can convert an online tutorial into a PDF e-book. People often run into trouble with tasks like this in practice, so let's walk through how to handle them. I hope you read it carefully and come away with something useful!

Before we start writing the crawler, let's analyze the page structure of the site [1]. On the left side of the page is the directory outline of the tutorial; each URL corresponds to an article on the right, with the article title at the top and the article body in the middle. The body text is what we care about: the data we want to crawl is the body of every page. Below it is the user comment area, which is of little use to us and can be ignored.

Tool preparation

After figuring out the basic structure of the site, you can prepare the toolkits the crawler depends on. requests and BeautifulSoup are two workhorses of crawling: requests handles the network requests, and BeautifulSoup manipulates the html data. With these two, we don't need a crawler framework like scrapy; for a small program like this it would be overkill. In addition, since the html files are to be converted to pdf, we need library support for that as well: wkhtmltopdf is an excellent multi-platform html-to-pdf conversion tool, and pdfkit is its Python wrapper. First install the following dependency packages, then install wkhtmltopdf.

pip install requests
pip install beautifulsoup4
pip install pdfkit
pip install html5lib   # parser used by BeautifulSoup in the code below

On the Windows platform, download the stable build from the wkhtmltopdf official website [2] and install it. After installation, add the program's executable directory to the system PATH environment variable; otherwise pdfkit cannot find wkhtmltopdf and will raise the error "No wkhtmltopdf executable found" (a workaround is sketched after the install commands below). Ubuntu and CentOS can install it directly from the command line:

$ sudo apt-get install wkhtmltopdf   # ubuntu
$ sudo yum install wkhtmltopdf       # centos
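If adding wkhtmltopdf to PATH is not convenient, pdfkit can also be pointed at the executable directly through its configuration object. A minimal sketch, assuming an install location (the path below is only an example and must be adjusted for your machine):

import pdfkit

# Example path only; point this at wherever wkhtmltopdf was actually installed.
config = pdfkit.configuration(wkhtmltopdf=r"C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe")

# Passing the configuration explicitly avoids the "No wkhtmltopdf executable found" error
# when the executable is not on PATH.
pdfkit.from_file("a.html", "out.pdf", configuration=config)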

Crawler implementation

You can start coding once everything is ready, but organize your thoughts before writing the code. The purpose of the program is to save the html body of every URL locally, and then use pdfkit to convert these files into a single pdf file. Let's split the task: first save the html body of one URL locally, then find all the URLs and perform the same operation on each of them.

Use the Chrome browser to inspect the page: press F12 and find the div tag that holds the article body (the div with class x-wiki-content). After loading the entire page with requests, you can use BeautifulSoup to manipulate the HTML DOM elements and extract the body content.

The specific implementation code is as follows: use soup.find_all to locate the body tag, then save its contents to the file a.html.

import requests
from bs4 import BeautifulSoup

def parse_url_to_html(url):
    # Download the page and extract the article body (the div with class "x-wiki-content").
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html5lib")
    body = soup.find_all(class_="x-wiki-content")[0]
    html = str(body)
    # The file is opened in binary mode, so encode the string before writing.
    with open("a.html", 'wb') as f:
        f.write(html.encode('utf-8'))

The second step is to parse all the URLs on the left side of the page. In the same way, find the menu tag on the left.

The logic of the implementation: there are two elements on the page with the class attribute uk-nav uk-nav-side, and the real directory list is the second one. Once all the URLs are obtained, the URL-to-html function written in the first step can be applied to each of them.

def get_url_list():
    """
    Get the list of all article URLs from the directory on the left
    """
    response = requests.get("http://www.liaoxuefeng.com/wiki/0014316089557264a6b348958f449949df42a6d3a2e542c000")
    soup = BeautifulSoup(response.content, "html5lib")
    # There are two elements with class "uk-nav uk-nav-side"; the real directory list is the second one.
    menu_tag = soup.find_all(class_="uk-nav uk-nav-side")[1]
    urls = []
    for li in menu_tag.find_all("li"):
        url = "http://www.liaoxuefeng.com" + li.a.get('href')
        urls.append(url)
    return urls

The final step is to convert the html files into a pdf file. This is very easy, because pdfkit encapsulates all the logic; you just need to call the function pdfkit.from_file.

import pdfkit

def save_pdf(htmls, file_name):
    """
    Convert all html files into a single pdf file
    """
    options = {
        'page-size': 'Letter',
        'encoding': "UTF-8",
        'custom-header': [
            ('Accept-Encoding', 'gzip')
        ]
    }
    # pdfkit.from_file accepts either a single file or a list of input files.
    pdfkit.from_file(htmls, file_name, options=options)

Run the save_pdf function, and the pdf file of the e-book is generated.
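For reference, the three functions can be tied together with a small driver like the sketch below. The per-page file names ("0.html", "1.html", ...) and the output name "tutorial.pdf" are illustrative choices, not part of the original article:

import os

def main():
    # Crawl every article in the directory and bundle them into one pdf.
    urls = get_url_list()
    htmls = []
    for index, url in enumerate(urls):
        parse_url_to_html(url)               # writes the body of this page to a.html
        page_file = "{}.html".format(index)
        os.rename("a.html", page_file)       # keep one html file per page (illustrative naming)
        htmls.append(page_file)
    save_pdf(htmls, "tutorial.pdf")          # pdfkit.from_file accepts a list of input files

if __name__ == "__main__":
    main()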

This is the end of "How a Python crawler converts tutorials into PDF e-books". Thank you for reading. If you want to learn more about the topic, you can follow this site, where the editor will keep publishing practical, high-quality articles for you!

