How to Easily Convert PDF to Word with Python

2025-01-23 Update From: SLTechnology News&Howtos

This article shows how to convert a PDF to a Word document with Python. It covers text extraction first and image extraction second, and I hope you find it useful.

In daily work and study, we often need to convert the text in a PDF into Word form, that is, from a read-only format into an editable one. Most of us turn to online tools for this, but their quality is uneven and they often fail to meet our needs.

Today we will use Python to convert PDF content into a Word document. We will also extract the images in the PDF and save them to a folder of our choice.

01. Text extraction

The first step is to extract the text from the PDF, as shown in the following figure:

PDF text is read-only and cannot be edited directly, so we extract the text from the PDF and write it to a Word file, where it can then be revised. For text extraction we use the pdfminer library, whose main functions are shown in the following figure:

The program first defines the get_content_from_pdf function, which returns the data extracted from the PDF.

Next we create a PDFResourceManager object to store shared resources, a PDFPageAggregator object to assemble the processed page into the format we need, and a PDFPageInterpreter object to process the page content.

In the program, page_index sets which pages to extract. For each of those pages, the PDFPageInterpreter object interprets the page information.

Finally, the PDFPageAggregator object is used to obtain the processed result for each page.
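The original program was published as an image, so the steps above can only be sketched here, assuming the pdfminer.six package. The function name get_content_from_pdf and the name page_index come from the article; treating page_index as a collection of zero-based page numbers, the is_selected helper, and returning a list of text blocks are assumptions made for illustration.

```python
from typing import Iterable, List, Optional, Set


def is_selected(page_number: int, selected: Optional[Set[int]]) -> bool:
    """Return True if this zero-based page should be extracted (None = all)."""
    return selected is None or page_number in selected


def get_content_from_pdf(pdf_path: str,
                         page_index: Optional[Iterable[int]] = None) -> List[str]:
    """Return the text blocks found on the selected pages of a PDF."""
    # pdfminer.six is imported here so is_selected() stays usable without it.
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams, LTTextContainer
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    # Assumption: page_index holds zero-based page numbers.
    selected = None if page_index is None else set(page_index)

    resource_manager = PDFResourceManager()  # stores shared resources
    device = PDFPageAggregator(resource_manager, laparams=LAParams())
    interpreter = PDFPageInterpreter(resource_manager, device)

    texts: List[str] = []
    with open(pdf_path, "rb") as pdf_file:
        for page_number, page in enumerate(PDFPage.get_pages(pdf_file)):
            if not is_selected(page_number, selected):
                continue
            interpreter.process_page(page)  # interpret the page content
            layout = device.get_result()    # parsed objects of this page
            for element in layout:
                # Keep only text containers; images are handled separately.
                if isinstance(element, LTTextContainer):
                    texts.append(element.get_text())
    return texts
```

With a hypothetical file example.pdf, calling get_content_from_pdf("example.pdf", page_index=[0]) would return only the text blocks of the first page.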

The layout here contains every kind of object parsed from the page, including text and images. However, I found that pdfminer performs poorly at extracting images, so images are handled separately with the fitz library, which gave very good results. With that said, let's first look at the results for the text.

Our PDF is a two-page document, and we had the program extract only the text of the first page. As the picture above shows, the program extracted the first page's text completely and without errors.
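The article does not show how the extracted text is written to the Word file. One possible way is sketched below, assuming the python-docx package, which the article does not name. The helper names to_paragraphs and write_to_word are hypothetical, and joining wrapped lines with a space is a simplification that suits space-delimited languages.

```python
from typing import Iterable, List


def to_paragraphs(chunks: Iterable[str]) -> List[str]:
    """Turn extracted text blocks into clean one-line paragraphs."""
    paragraphs = []
    for chunk in chunks:
        # Join the wrapped lines of a block and drop empty lines.
        text = " ".join(line.strip() for line in chunk.splitlines()
                        if line.strip())
        if text:
            paragraphs.append(text)
    return paragraphs


def write_to_word(chunks: Iterable[str], docx_path: str) -> None:
    """Write each cleaned text block as one paragraph of a .docx file."""
    # python-docx (an assumed choice) is imported lazily, so that
    # to_paragraphs() can be used without the package installed.
    from docx import Document

    document = Document()
    for paragraph in to_paragraphs(chunks):
        document.add_paragraph(paragraph)
    document.save(docx_path)
```

The text blocks returned by the extraction step can be passed directly as chunks. Only plain text is carried over; fonts, tables, and page layout are not reproduced.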

02. Picture extraction

With text handled, let's look at how to extract images from the PDF and save them locally. The image extraction program is as follows:

In the above program, we use the fitz library to read the objects in the PDF document and then decide by string matching whether each object is an image. If it is not, we skip it.

If the object is an image, we extract it by creating a Pixmap object and save it to the path we specify. The results are shown below:

As the picture above shows, the image was extracted correctly, which is what we set out to do. I also tried a PDF containing multiple images, and that worked without problems as well: extracting all the images from a PDF document takes only a few seconds.
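The image extraction program also appeared only as an image, so the procedure described in this section is sketched below, assuming a recent PyMuPDF release (imported as fitz). The matching pattern, the function names is_image_object and extract_images, and the output file naming are assumptions, not the article's original code.

```python
import os
import re
from typing import List

# Assumed pattern: image objects declare "/Subtype /Image" in their dictionary.
IMAGE_PATTERN = re.compile(r"/Subtype\s*/Image")


def is_image_object(object_source: str) -> bool:
    """Decide by string matching whether a PDF object is an image."""
    return IMAGE_PATTERN.search(object_source) is not None


def extract_images(pdf_path: str, out_dir: str) -> List[str]:
    """Save every image object in the PDF as a PNG; return the saved paths."""
    import fitz  # PyMuPDF, imported lazily

    os.makedirs(out_dir, exist_ok=True)
    saved: List[str] = []
    doc = fitz.open(pdf_path)
    try:
        # Walk all objects in the PDF cross-reference table.
        for xref in range(1, doc.xref_length()):
            if not is_image_object(doc.xref_object(xref)):
                continue  # not an image: skip
            pix = fitz.Pixmap(doc, xref)
            if pix.n - pix.alpha > 3:
                # CMYK data must be converted to RGB before saving as PNG.
                pix = fitz.Pixmap(fitz.csRGB, pix)
            path = os.path.join(out_dir, f"image_{xref}.png")
            pix.save(path)
            saved.append(path)
    finally:
        doc.close()
    return saved
```

Naming each file after the object's xref number keeps the names unique within one document.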

That is how to convert PDF to Word with Python. If you have run into a similar problem, the analysis above may help.
