How does Python crawl CSDN articles and convert them to PDF files

2025-04-06 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/02 Report--

Today, I will talk to you about how Python crawls CSDN articles and converts them into PDF files. Many people may not know much about this, so the following walkthrough has been put together in the hope that you will get something out of it.

1. Import modules

```python
import requests  # sends HTTP requests (third-party: pip install requests)
import parsel    # data parsing module (third-party: pip install parsel)
import os        # file operations
import re        # regular expressions
import pdfkit    # HTML-to-PDF conversion (pip install pdfkit)
```

2. Create folders

```python
filename = 'pdf\\'     # folder for the generated PDF files
filename_1 = 'html\\'  # folder for the downloaded HTML files
if not os.path.exists(filename):    # if the folder does not exist,
    os.mkdir(filename)              # create it automatically
if not os.path.exists(filename_1):
    os.mkdir(filename_1)
```

3. Send the request

If Python code sends a request without any disguise, the server identifies it as a crawler and returns no data. The normal flow is: the client (browser) sends a request to the server, the server receives it and returns a response. The `headers` request header disguises the Python code as a browser. Its fields can be inspected and copied from the browser's developer tools, and not all of them are needed:

- user-agent: basic information about the browser (a wolf in sheep's clothing, so the request blends in with the flock);
- cookie: user information, used to check whether you are logged in (some sites, such as parts of Bilibili, only show data after login);
- referer: anti-hotlinking, i.e. which page the request jumped from (Bilibili video content, image downloads, VIPSHOP product data);
- the exact fields to send depend on the site being crawled.

The request method (get or post) can also be seen in the developer tools; search, login and query actions are usually post requests.

```python
for page in range(1, 11):
    print(f'=== crawling page {page} ===')
    url = f'https://blog.csdn.net/qdPython/article/list/{page}'
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/92.0.4515.159 Safari/537.36'
    }
    response = requests.get(url=url, headers=headers)
```

4. Parse the data

The retrieved HTML string has to be wrapped in a `parsel.Selector` object before it can be queried; `getall()` returns a list:

```python
    # (continues inside the page loop from step 3)
    selector = parsel.Selector(response.text)
    href = selector.css('.article-list a::attr(href)').getall()
```

5. Extract every element in the list

For each article URL, send a request for the detail page, pull out the title and body, and save the body into an HTML file (`html_str` is a page template containing an `{article}` placeholder; its definition is not shown in the extracted listing):

```python
    for index in href:
        # request the article detail page
        response_1 = requests.get(url=index, headers=headers)
        selector_1 = parsel.Selector(response_1.text)
        title = selector_1.css('#articleContentId::text').get()
        new_title = change_title(title)
        content_views = selector_1.css('#content_views').get()
        html_content = html_str.format(article=content_views)
        html_path = filename_1 + new_title + '.html'
        pdf_path = filename + new_title + '.pdf'
        with open(html_path, mode='w', encoding='utf-8') as f:
            f.write(html_content)
        print('saving:', title)
```

6. Replace special characters

Windows forbids certain characters in file names, so the title is sanitised before it is used as one (this function must be defined before the loop in step 5 runs):

```python
def change_title(name):
    # characters that are illegal in Windows file names
    mode = re.compile(r'[\\/:*?"|]')
    new_name = re.sub(mode, '_', name)
    return new_name
```
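Step 5 formats the article body into `html_str`, but the template's definition was lost from the listing above. A minimal stand-in (an assumption, not the author's original) is any complete HTML page with an `{article}` placeholder for `format()` to fill:

```python
# Hypothetical reconstruction: the original html_str is not shown in the
# article. Any page skeleton with an {article} placeholder will do.
html_str = """<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8">
    <title>CSDN article</title>
</head>
<body>
{article}
</body>
</html>
"""
```

Declaring the charset in the template matters, because the article bodies are written to disk as UTF-8.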

Run the code to download the HTML file:

7. Convert to PDF file

```python
config = pdfkit.configuration(wkhtmltopdf=r'C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe')
pdfkit.from_file(html_path, pdf_path, configuration=config)
```
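Since pdfkit only shells out to the wkhtmltopdf binary, the conversion can also be run as a separate pass over everything already saved in the `html` folder. The sketch below is an assumption built on the article's folder layout, not part of the original code; `pdf_path_for` and `convert_all` are hypothetical helpers, and the pdfkit call still needs wkhtmltopdf installed at the configured path:

```python
import os

def pdf_path_for(html_path, pdf_dir='pdf'):
    """Map 'html/<name>.html' to '<pdf_dir>/<name>.pdf'."""
    name = os.path.splitext(os.path.basename(html_path))[0]
    return os.path.join(pdf_dir, name + '.pdf')

def convert_all(html_dir='html', pdf_dir='pdf',
                wkhtmltopdf=r'C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe'):
    # Deferred import so the path helper above works even when pdfkit is
    # not installed; pdfkit itself is only a wrapper around wkhtmltopdf.
    import pdfkit
    config = pdfkit.configuration(wkhtmltopdf=wkhtmltopdf)
    for entry in os.listdir(html_dir):
        if entry.endswith('.html'):
            src = os.path.join(html_dir, entry)
            pdfkit.from_file(src, pdf_path_for(src, pdf_dir), configuration=config)
```

Running the conversion as a second pass means a crash in wkhtmltopdf does not lose the already-downloaded HTML.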

After reading the above, do you have a better understanding of how Python crawls CSDN articles and converts them into PDF files? If you want to learn more, please follow the industry information channel. Thank you for your support.
