How to use Python code to convert PDF files into Word format in batches

2025-01-17 Update From: SLTechnology News&Howtos

Shulou (Shulou.com), 06/02

This article shows how to use Python code to convert PDF files into Word format in batches. The editor finds the approach very practical and is sharing it here in the hope that you get something out of it.

In daily work or study, we often run into this kind of frustration:

"Xiao Ren, please type up the contents of this PDF and send it to me."

Unfortunately, a 2 MB, 12-page PDF is not something you can finish retyping by hand.

When studying, you will often find that documents are in PDF format, which is inconvenient to edit and reuse, so they need to be converted into Word files. Much of the software you can download from the Internet either converts only the first five pages (WPS, for example) or charges a fee. Is there really no free way to convert?

So Rookie Analysis brings you a free, simple, and fast approach: use Python to process PDF files in batches, extract the content you want, and save it as Word.

Before implementing the PDF-to-Word function, we need an environment for writing and running Python, along with the relevant dependency packages. For writing the code we recommend PyCharm; on a local machine, Anaconda makes installation and deployment very convenient.

The PDF-to-Word function relies on the following classes from the pdfminer package:

PDFParser (document parser), PDFDocument (document object), PDFResourceManager (resource manager), PDFPageInterpreter (page interpreter), PDFPageAggregator (aggregator), LAParams (layout analysis parameters)

I. Preliminary preparation

Note: Rookie Analysis uses Python 3.6 under Windows 7

1. Install the pdfminer3k module

After installing Anaconda, you can install the module directly with pip: pip install pdfminer3k

2. If the installation does not succeed, try the following method

First download pdfminer3k from https://pypi.python.org/pypi/pdfminer3k, then install it manually.

Extract the downloaded pdfminer3k archive to D: or another suitable drive, open the Run dialog with Win+R, and type cmd.

In the command prompt, type D: to switch to drive D, run cd pdfminer3k (the folder where the archive was extracted), and then run python setup.py install to install the package.

If "Finished" is displayed at the end, the installation succeeded.
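
Either installation route can be double-checked from Python itself. The following is a minimal sketch, assuming pdfminer3k installs its code under the top-level package name pdfminer (the name used by the imports in this article); is_installed is a helper name made up for this illustration.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found on the import path."""
    # find_spec returns None when no module of that name is available.
    return importlib.util.find_spec(module_name) is not None


# Assumption: pdfminer3k exposes the top-level package "pdfminer".
print("pdfminer available:", is_installed("pdfminer"))
```

If this prints False, the package was most likely installed into a different Python environment than the one running the script.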

II. Code practice

1. Import the related packages

from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.layout import LAParams
from pdfminer.converter import PDFPageAggregator

The overall idea is to construct the document objects, parse the document, and extract the required content.

Construct the document objects

Construct the interpreter

2. Import the PDF file to be parsed

Place the file to be parsed in the same directory as the script, as shown in the figure:

Contents of test.pdf
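
Before handing a file to the parser, it can help to confirm that it really is a PDF. The sketch below relies on the fact that PDF files normally begin with the marker %PDF-; looks_like_pdf is a helper name introduced here for illustration and is not part of pdfminer.

```python
def looks_like_pdf(path):
    """Return True if the file starts with the PDF header marker '%PDF-'."""
    with open(path, "rb") as f:
        # Only the first five bytes are needed for the check.
        return f.read(5) == b"%PDF-"
```

Calling looks_like_pdf('test.pdf') before parsing gives an early, readable failure for files that were renamed to .pdf without actually being PDFs.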

3. The specific code is as follows:

from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.layout import LAParams
from pdfminer.converter import PDFPageAggregator
from pdfminer.pdfinterp import PDFTextExtractionNotAllowed


def parse():
    # Open the local PDF document in binary read mode ('rb')
    fn = open('test.pdf', 'rb')
    # Create a PDF document parser for the open file
    parser = PDFParser(fn)
    # Create a PDF document object
    doc = PDFDocument()
    # Connect the parser and the document object to each other
    parser.set_document(doc)
    doc.set_parser(parser)
    # Provide the initialization password, for example:
    #     doc.initialize("lianxipython")
    # If the document has no password, pass an empty string
    doc.initialize("")
    # Check whether the document allows text extraction; raise if it does not
    if not doc.is_extractable:
        raise PDFTextExtractionNotAllowed
    else:
        # Create the PDF resource manager
        resource = PDFResourceManager()
        # Create the layout analysis parameters
        laparams = LAParams()
        # Create the aggregator, the device that collects each page's content
        device = PDFPageAggregator(resource, laparams=laparams)
        # Create the interpreter, which turns the page data into
        # objects that Python can work with
        interpreter = PDFPageInterpreter(resource, device)
        # doc.get_pages() yields the pages; process them one at a time
        for page in doc.get_pages():
            # Parse a single page with the interpreter's process_page() method
            interpreter.process_page(page)
            # Fetch the parsed content with the aggregator's get_result() method
            layout = device.get_result()
            # layout is an LTPage object holding the objects parsed from the page
            for out in layout:
                # Only objects with a get_text() method carry the text we want
                if hasattr(out, "get_text"):
                    print(out.get_text())
                    with open('test.txt', 'a') as f:
                        f.write(out.get_text() + '\n')


if __name__ == '__main__':
    parse()
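
The parse() function above handles one hard-coded file, test.pdf. For batch use, the same extraction can be applied to every PDF in a folder. The following is a minimal sketch under stated assumptions: batch_extract is a name made up for this illustration, and extract_text stands for a variant of parse() that takes a file path and returns the extracted text as a string, which the article does not define.

```python
from pathlib import Path


def batch_extract(src_dir, out_dir, extract_text):
    """Run extract_text on every .pdf in src_dir and write one .txt per PDF.

    extract_text is a caller-supplied function (hypothetical here) that
    takes a Path to a PDF and returns its text as a string.
    """
    src = Path(src_dir)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    # sorted() makes the processing order predictable.
    for pdf_path in sorted(src.glob("*.pdf")):
        text = extract_text(pdf_path)
        target = out / (pdf_path.stem + ".txt")
        # Explicit UTF-8 avoids depending on the platform default encoding.
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written
```

Passing the extractor in as an argument keeps the folder loop independent of pdfminer, so the loop can be tried out with a stand-in function before real PDFs are involved.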

Running the script on test.pdf writes the extracted text to test.txt.
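
The script stops at a plain text file, so one more step is needed to end up with a Word document. The sketch below is one possible way to do that, assuming the third-party python-docx package is installed (pip install python-docx), which the article itself does not use. Both function names are made up for this illustration, and the sketch assumes that text blocks in test.txt are separated by blank lines.

```python
def split_paragraphs(text):
    """Split extracted text into paragraphs on blank lines, dropping empties."""
    blocks = text.replace("\r\n", "\n").split("\n\n")
    # Rejoin the wrapped lines of each block with single spaces
    # (this suits space-delimited languages such as English).
    return [" ".join(block.split()) for block in blocks if block.strip()]


def save_as_docx(text, docx_path):
    """Write each paragraph of text into a new .docx file."""
    # Imported here so split_paragraphs works without python-docx installed.
    # Assumption: python-docx is available in the environment.
    from docx import Document

    document = Document()
    for paragraph in split_paragraphs(text):
        document.add_paragraph(paragraph)
    document.save(str(docx_path))
```

For example, reading test.txt into a string and passing it to save_as_docx along with the path 'test.docx' would produce a Word file containing the text only; images, tables, and the original page layout are not carried over by this approach.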

Conclusion: that wraps up this introduction to batch PDF-to-Word conversion with Python.

The above is how to use Python code to convert PDF files into Word format in batches. Some of these points are likely to come up in everyday work, and I hope this article helps you learn more about them.
