How to crawl forum articles and save them as PDF
This article introduces how to crawl forum articles and save them as PDF. Many people run into trouble with this in practice, so let the editor walk you through how to handle these situations. I hope you read carefully and come away with something useful!
Basic development environment
Python 3.6
PyCharm
wkhtmltopdf
Use of related modules
pdfkit
requests
parsel
Install Python, add it to your PATH environment variable, and install the required modules with pip.
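For example, all three modules can be installed in one command (their PyPI package names are pdfkit, requests, and parsel):

pip install pdfkit requests parsel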
I. Target requirements
Crawl article content from CSDN and save it in PDF format.
II. Web page data analysis
To save the content of a web article as PDF, you must first install the wkhtmltopdf software; without it, the conversion cannot work. You can search for it online and download it.
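As a quick sanity check that pdfkit can find wkhtmltopdf, here is a minimal sketch; the Windows install path is an assumption, so adjust it to wherever wkhtmltopdf.exe lives on your machine:

import pdfkit

# point pdfkit at the wkhtmltopdf executable (this path is an assumption; adjust as needed)
config = pdfkit.configuration(wkhtmltopdf='C:\\Program Files\\wkhtmltopdf\\bin\\wkhtmltopdf.exe')
# convert any public page straight to PDF to confirm the toolchain works
pdfkit.from_url('https://example.com', 'test.pdf', configuration=config)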
As mentioned in previous articles, crawling the text itself is not difficult.
To get each article's content, you must first crawl the URL address of every article from the blog's list page.
The specific analysis process has also been shared in previous articles, so it is skipped here; a minimal sketch of the URL-collection step follows below.
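The sketch below assumes the same list-page URL and CSS selector as the full script later in this article; CSDN may refuse requests that lack proper headers or cookies, so treat it as illustrative rather than guaranteed to run unmodified:

import requests
import parsel

# fetch one page of the blog's article list (headers and cookies omitted here;
# CSDN's anti-scraping measures may require them in practice)
response = requests.get('https://blog.csdn.net/fei347795790/article/list/1')
selector = parsel.Selector(response.text)
# every article link sits inside the .article-list container
urls = selector.css('.article-list h5 a::attr(href)').getall()
print(urls)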
The result: CSDN blog articles crawled with Python and turned into PDF files.
Complete implementation code

import os
import pdfkit
import requests
import parsel

html_str = """
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Document</title>
</head>
<body>
{article}
</body>
</html>
"""


def save(article, title):
    pdf_path = 'pdf\\' + title + '.pdf'
    html_path = 'html\\' + title + '.html'
    html = html_str.format(article=article)
    with open(html_path, mode='w', encoding='utf-8') as f:
        f.write(html)
        print('{} download completed'.format(title))
    # the path where the wkhtmltopdf exe file is stored
    config = pdfkit.configuration(wkhtmltopdf='C:\\Program Files\\wkhtmltopdf\\bin\\wkhtmltopdf.exe')
    # turn the html file into a pdf file through pdfkit
    pdfkit.from_file(html_path, pdf_path, configuration=config)


def main(html_url):
    # make sure the output folders exist (the original code assumes they already do)
    os.makedirs('pdf', exist_ok=True)
    os.makedirs('html', exist_ok=True)
    # request headers
    headers = {
        "Host": "blog.csdn.net",
        "Referer": "https://blog.csdn.net/qq_41359265/article/details/102570971",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36",
    }
    # user information
    cookie = {'Cookie': 'your own cookie'}
    response = requests.get(url=html_url, headers=headers, cookies=cookie)
    selector = parsel.Selector(response.text)
    # collect the url of every article on the list page
    urls = selector.css('.article-list h5 a::attr(href)').getall()
    for html_url in urls:
        response = requests.get(url=html_url, headers=headers, cookies=cookie)
        # response.text is the page source (a string); requesting too fast may
        # trigger anti-scraping measures
        # extract the article part with css selectors
        sel = parsel.Selector(response.text)
        article = sel.css('article').get()
        # the title selector may need adjusting for the current CSDN page layout
        title = sel.css('h1::text').get()
        save(article, title)


if __name__ == '__main__':
    url = 'https://blog.csdn.net/fei347795790/article/list/1'
    main(url)
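A note on the design: the script writes each article to a local HTML file first and then converts that file with pdfkit.from_file(), rather than converting the live URL; this way the PDF contains only the extracted article block instead of the full page with sidebars and ads. Also be aware that article titles can contain characters that are illegal in Windows file names (such as ? or :); if a save fails, strip those characters from the title before building the file paths.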
This is the end of "how to crawl forum articles and save them as PDF" in Python. Thank you for reading. If you want to learn more about the industry, you can follow this site, where the editor will publish more high-quality practical articles for you!