
How to Convert Liao Xuefeng's Tutorial into a PDF E-book with a Python Crawler


Today I would like to walk through how a Python crawler can convert Liao Xuefeng's tutorial into a PDF e-book. The topic may be unfamiliar to many readers, so I have summarized the steps below; I hope you get something out of this article.

There seems to be nothing more appropriate for writing a crawler than Python. The Python community provides so many crawler tools that the choice can be dazzling, and with the various libraries that can be used directly you can have a working crawler in minutes. Today we will write one that downloads Liao Xuefeng's Python tutorial and turns it into a PDF e-book for convenient offline reading.

Before we start writing the crawler, let's analyze the page structure of the site. The left side of the page is the directory outline of the tutorial, and each URL corresponds to an article on the right. At the top right is the title of the article, and in the middle is the article body, which is the content we care about: the data we want to crawl is the body of every page. Below the body is the user comment area, which is of little use to us, so we can ignore it.

Tool preparation

After figuring out the basic structure of the site, we can prepare the packages the crawler depends on. requests and beautifulsoup are two great crawler workhorses: requests handles the network requests, and beautifulsoup manipulates the HTML data. With these two, we don't need a crawler framework like scrapy; for a small program like this, that would be using a sledgehammer to crack a nut. In addition, since we are converting HTML files to PDF, we need library support for that as well: wkhtmltopdf is a very good tool that performs HTML-to-PDF conversion on multiple platforms, and pdfkit is the Python wrapper around wkhtmltopdf. First install the following dependency packages, then install wkhtmltopdf:

pip install requests
pip install beautifulsoup4
pip install pdfkit

On Windows, download the stable build from the wkhtmltopdf official website and install it directly. After installation, add the program's executable directory to the system PATH environment variable; otherwise pdfkit cannot find wkhtmltopdf and raises the error "No wkhtmltopdf executable found". On Ubuntu and CentOS it can be installed directly from the command line:

$ sudo apt-get install wkhtmltopdf  # ubuntu
$ sudo yum install wkhtmltopdf      # centos
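As an aside, if modifying PATH on Windows is inconvenient, pdfkit can also be pointed at the executable directly. A minimal sketch, assuming a typical install path that you should adjust for your machine:

import pdfkit

# Point pdfkit at the wkhtmltopdf executable explicitly instead of
# relying on a PATH lookup (the path below is only an assumed example).
config = pdfkit.configuration(
    wkhtmltopdf=r"C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe")

# Pass the configuration object to any conversion call.
pdfkit.from_url("http://example.com", "out.pdf", configuration=config)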

Crawler implementation

You can start coding once everything is ready, but organize your thoughts before writing any code. The goal of the program is to save the HTML body of every URL locally, and then use pdfkit to convert these files into a single PDF file. Let's split the task: first save the HTML body corresponding to one URL locally, then find all the URLs and perform the same operation on each.

Open the page in the Chrome browser and press F12 to find the div tag corresponding to the body: the div with class x-wiki-content is the body of the page. After loading the entire page locally with requests, we can use beautifulsoup to manipulate the HTML DOM elements and extract the body content.

The concrete implementation is as follows: use the soup.find_all function to find the body tag, then save the contents of the body to the a.html file.

import requests
from bs4 import BeautifulSoup


def parse_url_to_html(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html5lib")
    # The article body lives in the div with class "x-wiki-content".
    body = soup.find_all(class_="x-wiki-content")[0]
    html = str(body)
    with open("a.html", "wb") as f:
        f.write(html.encode("utf-8"))

The second step is to parse out all the URLs on the left side of the page. In the same way, find the menu tag on the left.

The logic of the concrete implementation: there are two elements on the page with the class attribute uk-nav uk-nav-side, and the real directory list is the second one. Once all the URLs have been obtained, the URL-to-HTML function written in the first step can be applied to each of them.

def get_url_list():
    """Get the list of all URLs in the directory."""
    response = requests.get("http://www.liaoxuefeng.com/wiki/0014316089557264a6b348958f449949df42a6d3a2e542c000")
    soup = BeautifulSoup(response.content, "html5lib")
    # The real directory list is the second element with this class.
    menu_tag = soup.find_all(class_="uk-nav uk-nav-side")[1]
    urls = []
    for li in menu_tag.find_all("li"):
        url = "http://www.liaoxuefeng.com" + li.a.get("href")
        urls.append(url)
    return urls

The last step is to convert the HTML into a PDF file. This part is very easy, because pdfkit encapsulates all the logic; you only need to call the function pdfkit.from_file.

import pdfkit


def save_pdf(htmls):
    """Convert all html files to a single pdf file."""
    options = {
        "page-size": "Letter",
        "encoding": "UTF-8",
        "custom-header": [("Accept-Encoding", "gzip")],
    }
    # file_name is the output pdf path, defined elsewhere in the full code.
    pdfkit.from_file(htmls, file_name, options=options)

Execute the save_pdf function and the PDF file of the e-book is generated.
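To see how the pieces fit together end to end, here is a minimal driver sketch. It assumes parse_url_to_html and save_pdf are extended to accept output file names; the loop, the temporary file names, and the output name are my own illustration, not code from the original:

def main():
    urls = get_url_list()
    htmls = []
    # Save each article body to its own temporary html file
    # (assumes parse_url_to_html takes the output name as a parameter).
    for index, url in enumerate(urls):
        name = "temp_%d.html" % index
        parse_url_to_html(url, name)
        htmls.append(name)
    # Merge all the html files into one pdf (assumes save_pdf takes
    # the output file name as a parameter).
    save_pdf(htmls, "liaoxuefeng-python-tutorial.pdf")


if __name__ == "__main__":
    main()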

The total amount of code adds up to less than 50 lines. But wait: the code given above actually omits some details, such as how to get the article title; the img tags in the body use relative paths, so to display the pictures properly in the PDF the relative paths must be changed to absolute paths; and the saved temporary HTML files have to be deleted. These details are all handled in the code on GitHub.
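For example, one of those omitted details, rewriting the relative img paths so the pictures render in the PDF, could be handled with a few lines of beautifulsoup before saving the body. A minimal sketch; the helper name and base URL handling are my own illustration, not the code from GitHub:

from urllib.parse import urljoin


def fix_img_paths(soup, base_url="http://www.liaoxuefeng.com"):
    """Rewrite relative img src attributes to absolute URLs so
    wkhtmltopdf can fetch the images when rendering the pdf."""
    for img in soup.find_all("img"):
        src = img.get("src", "")
        if src and not src.startswith("http"):
            img["src"] = urljoin(base_url, src)
    return soup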

After reading the above, do you have a better understanding of how a Python crawler can convert Liao Xuefeng's tutorial into a PDF e-book? Thank you for your support.
