How to generate DOCX or Excel files from scanned PDFs with Python

Updated 2025-04-05. From: SLTechnology News&Howtos

Shulou (Shulou.com), 06/01

This article explains in detail how to use Python to generate DOCX or Excel files from scanned PDFs. The editor finds the approach practical and shares it here as a reference.

1. Problem description

For a project, we needed to extract the contents of scanned PDF files, but every product we found online that can do this requires payment. With no budget for that, I wrote the code to do it myself.

For example, Figure-1 shows a table picture in the PDF, and Figure-2 shows the generated result.

Figure-1

Figure-2

2. Implementation process

The overall steps are: read the PDF file -> render each page to a picture -> use OCR to get the picture content -> write the result to DOCX or Excel
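The flow above can be sketched as a small driver that takes each stage as a function. This is only an illustrative outline: `run_pipeline`, `fake_render`, and `fake_ocr` are hypothetical names that do not come from the article's code, and the stand-in stages replace the PDF renderer and the OCR service so that the sketch runs on its own.

```python
from typing import Callable, Iterable, List


def run_pipeline(
    pdf_path: str,
    render_pages: Callable[[str], Iterable[bytes]],
    recognize: Callable[[bytes], List[str]],
    write_output: Callable[[List[List[str]]], None],
) -> int:
    """Run render -> OCR -> write, and return the number of pages processed."""
    pages_text: List[List[str]] = []
    for image_bytes in render_pages(pdf_path):
        # One list of recognized text lines per page image.
        pages_text.append(recognize(image_bytes))
    write_output(pages_text)
    return len(pages_text)


# Hypothetical stand-in stages, used only to make the sketch runnable.
def fake_render(path: str) -> List[bytes]:
    return [b"page-0", b"page-1"]


def fake_ocr(image: bytes) -> List[str]:
    return [image.decode() + " text"]


collected: List[List[str]] = []
page_count = run_pipeline("scan.pdf", fake_render, fake_ocr, collected.extend)
print(page_count, collected)
```

Passing the stages in as functions keeps the order of the steps explicit and makes it possible to try the flow without a PDF library or an OCR account.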

3. Function code

3.1 PDF to picture

    import fitz  # PyMuPDF: render PDF pages to pictures
    from aip import AipOcr  # Baidu AI picture recognition
    import time  # pause between requests to avoid errors
    import docx  # save the recognition result as a docx file
    from docx.oxml.ns import qn  # set the East Asian font of the docx file

    # Your APPID, AK and SK
    APP_ID = 'xxxxxx'
    API_KEY = 'xxxxxxxx'
    SECRET_KEY = 'xxxxxxxxxxxxxxxxxxxxxxx'

    client = AipOcr(APP_ID, API_KEY, SECRET_KEY)


    def pdf_image(pdfPath, imgPath, zoom_x=10, zoom_y=10, rotation_angle=0):
        '''
        Convert a PDF to pictures.
        pdfPath: path of the pdf file
        imgPath: path where the pictures are saved
        zoom_x: scaling factor in the x direction
        zoom_y: scaling factor in the y direction
        rotation_angle: rotation angle
        zoom_x and zoom_y generally take the same value; the higher the value,
        the higher the image resolution (a factor of 1 corresponds to 72 dpi).
        Returns the name and number of pages of the target pdf.
        '''
        # get the pdf file name
        name = pdfPath.split("\\")[-1].split('.pdf')[0]
        # open the PDF file
        pdf = fitz.open(pdfPath)
        # get the number of pdf pages
        # (older PyMuPDF releases spell these calls pageCount, getPixmap and writePNG)
        num = pdf.page_count
        # read the PDF page by page
        for pg in range(0, num):
            page = pdf[pg]
            # set zoom and rotation factor
            trans = fitz.Matrix(zoom_x, zoom_y).prerotate(rotation_angle)
            pm = page.get_pixmap(matrix=trans, alpha=False)
            # write the image
            pm.save(imgPath + name + "_" + str(pg) + ".png")
        pdf.close()
        return name, num


    def ReadDetail_docx(imgPath, name, num):
        '''
        Read the pictures into a docx file.
        imgPath: path where the pictures are located; the generated docx is
        saved in the same path
        name: pdf name (without suffix)
        num: number of pdf pages
        name and num can be taken from the return value of pdf_image.
        '''
        # create an empty doc document
        doc = docx.Document()
        # set the global font (use a font that covers the script of your document)
        doc.styles["Normal"].font.name = u"Arial"
        doc.styles["Normal"]._element.rPr.rFonts.set(qn('w:eastAsia'), u"Arial")
        # read the pictures
        for n in range(0, num):
            with open(imgPath + name + "_" + str(n) + ".png", 'rb') as f:
                img = f.read()
            time.sleep(0.1)
            message = client.basicAccurate(img)
            content = message.get('words_result')
            # write the content to the doc document, one paragraph per line
            for i in range(len(content)):
                doc.add_paragraph(content[i].get('words'))
        # save the doc document
        doc.save(imgPath + name + '.docx')


    def pdf_to_docx(pdfPath, imgPath, zoom_x=10, zoom_y=10, rotation_angle=0):
        print("converting pdf files to pictures...")
        # call function 1 to convert the pdf to pictures and get the file
        # name and the number of pages
        name_, num_ = pdf_image(pdfPath, imgPath, zoom_x, zoom_y, rotation_angle)
        print("converted successfully!")
        # The DOCX step is commented out here; uncomment it to read the
        # pictures page by page and save the text line by line in a docx file.
        # print("reading picture content...")
        # ReadDetail_docx(imgPath, name_, num_)
        # print("the pdf file named {}.pdf has {} pages and has been "
        #       "successfully converted to a docx file!".format(name_, num_))


    # pdf storage path
    pdf_path = "JRT 0197-2020 Financial data Security data Security grading Guide.pdf"
    # storage path for the images and the generated docx file
    img_path = "G:\\imges\\"
    # call the function
    pdf_to_docx(pdf_path, img_path)

3.2 Table picture text recognition to Excel

    # picture recognition
    from aip import AipOcr
    # time module
    import time
    # web page acquisition
    import requests
    # operating system interface module
    import os

    # folder that holds the page pictures; set in the main block below
    image_path = ''


    # get all pictures in the folder
    def get_image():
        images = []
        # paths to all files in the folder (including files in subdirectories)
        for root, dirs, files in os.walk(image_path):
            path = [os.path.join(root, name) for name in files]
            images.extend(path)
        return images


    def Image_Excel(APP_ID, API_KEY, SECRET_KEY):
        # call the Baidu AI interface
        client = AipOcr(APP_ID, API_KEY, SECRET_KEY)
        # loop through the files
        images = get_image()
        for image in images:
            # open the picture in binary mode and read it
            with open(image, 'rb') as img_open:
                img_read = img_open.read()
            # call the table recognition module to recognize the picture
            table = client.tableRecognitionAsync(img_read)
            # get the request ID
            request_id = table['result'][0]['request_id']
            # get the table processing result
            result = client.getTableRecognitionResult(request_id)
            # wait until the processing status is "completed"
            # (the service reports ret_msg in Chinese; '已完成' means "completed")
            while result['result']['ret_msg'] != '已完成':
                time.sleep(2)  # pause for 2 seconds and then refresh
                result = client.getTableRecognitionResult(request_id)
            # get the download address
            download_url = result['result']['result_data']
            print(download_url)
            # get the table data
            excel_data = requests.get(download_url)
            # name the table after the picture
            xlsx_name = os.path.splitext(image)[0] + ".xlsx"
            # create the excel file, write the data and save it
            with open(xlsx_name, 'wb') as xlsx:
                xlsx.write(excel_data.content)


    if __name__ == '__main__':
        image_path = r"G:\imgs\\"
        APP_ID = 'xxxxxxxx'
        API_KEY = 'xxxxxxx'
        SECRET_KEY = 'xxxxxxxxxxxxxxxxxxxxxx'
        Image_Excel(APP_ID, API_KEY, SECRET_KEY)

4. Case description

In this case, I took the scanned file "JRT 0197-2020 Financial data Security data Security grading Guide.pdf" and wrote the data from its tables into Excel files.
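The table recognition step in section 3.2 polls the service in a loop until the task finishes, with no upper bound on the wait. For a multi-page scan like this one, it may be worth bounding that wait. The following is a minimal sketch under stated assumptions: `wait_for_result`, its parameters, and the `status` field are hypothetical, and the fake responses stand in for the real service.

```python
import time
from typing import Any, Callable, Dict


def wait_for_result(
    fetch: Callable[[], Dict[str, Any]],
    is_done: Callable[[Dict[str, Any]], bool],
    interval: float = 2.0,
    timeout: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call fetch() until is_done(result) is true or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = fetch()
        if is_done(result):
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("task did not finish within the timeout")
        sleep(interval)


# Hypothetical responses standing in for repeated status queries.
_responses = iter([
    {"status": "running"},
    {"status": "running"},
    {"status": "done"},
])

final = wait_for_result(
    fetch=lambda: next(_responses),
    is_done=lambda r: r["status"] == "done",
    sleep=lambda seconds: None,  # skip real waiting in this demonstration
)
print(final)
```

In the article's code, `fetch` would wrap the call that queries the recognition result and `is_done` would check the returned status, so a stuck task raises an error instead of looping forever.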

That concludes this article on how to generate DOCX or Excel files from scanned PDFs with Python. I hope it is of some help to you; if you found it useful, please share it so that more people can see it.
