2025-04-04 Update From: SLTechnology News&Howtos
Shulou (Shulou.com) 06/03 Report
This article explains how to convert PDF files to text with Python. Many people run into questions about this in day-to-day work, so the editor has reviewed various materials and put together a simple, easy-to-follow method. We hope it helps answer the question of how to use Python to convert PDF to text. Let's get started.
Contents
I. Preface
Why not use a traditional pdf-to-text tool?
II. Implementation process
2.1. Converting pdf to text with deep-learning-based OCR
2.1.1. Convert pdf to image
2.1.2. Detect and recognize text in the image
2.1.3. Sample output
I. Preface
For many people, converting PDF to editable text is a real need, but there is no easy way to do it. The approach described here was tried on pdf slides, and the results are quite good.
Traditional lectures are usually accompanied by many pdf slides. If you want to take notes on a lecture, you generally have to copy a lot of content out of the pdf and then add to it.
Recently, Lucas Soares, a senior machine learning engineer at K1 Digital, has been trying to automate the transcription of pdf slides. He uses OCR (optical character recognition) so that the slide contents can be edited directly in Markdown files, which avoids copying and pasting pdf content by hand.
(Picture: project author Lucas Soares.)
Why not use a traditional pdf-to-text tool?
Lucas Soares found that traditional tools tend to create more problems that then take time to solve. He tried traditional Python packages but ran into many issues (for example, having to write complex regular expression patterns to parse the final output), so he decided to try object detection and OCR instead.
II. Implementation process
The basic process can be divided into the following steps:
Convert the pdf to images
Detect and recognize text in the images
Show the sample output
2.1. Converting pdf to text with deep-learning-based OCR
2.1.1. Convert pdf to image
The pdf slides used by Soares come from David Silver's reinforcement learning course (see the address of the pdf slides below). The pdf2image package is used to convert each slide to png image format.
Sample pdf slides.
Address: https://www.davidsilver.uk/wp-content/uploads/2020/03/intro_RL.pdf
The code is as follows:
from pdf2image import convert_from_path
from pdf2image.exceptions import (
    PDFInfoNotInstalledError,
    PDFPageCountError,
    PDFSyntaxError,
)

pdf_path = "path/to/file/intro_RL_Lecture1.pdf"

# Render every page of the pdf as a PIL image
images = convert_from_path(pdf_path)

# Save each page as image0.png, image1.png, ...
for i, image in enumerate(images):
    fname = "image" + str(i) + ".png"
    image.save(fname, "PNG")
After processing, all pdf slides are converted into pictures in png format.
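One practical detail: the next step walks through these files in sorted order, and plain string sorting places image10.png before image2.png. A minimal sketch of one way around this, assuming a hypothetical helper page_filename (it is not part of pdf2image) that zero-pads the page index:

```python
def page_filename(index, total, prefix="image", ext=".png"):
    """Build a zero-padded file name so that string sorting matches page order.

    Hypothetical helper for illustration; not part of pdf2image.
    """
    width = len(str(max(total - 1, 0)))  # digits needed for the largest index
    return f"{prefix}{index:0{width}d}{ext}"


# Plain names sort lexicographically, which breaks page order past 10 pages.
plain = [f"image{i}.png" for i in range(12)]
padded = [page_filename(i, 12) for i in range(12)]

print(sorted(plain)[:4])
print(sorted(padded)[:4])
```

If padded names are used when saving, any later reference to a specific file (such as image7.png) needs the same padding.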
2.1.2. Detect and recognize text in the image
To detect and recognize text in the png images, Soares uses the text detector from the ocr.pytorch library. Follow the instructions in that repository to download the models and save them in the checkpoint folder.
ocr.pytorch library address: https://github.com/courao/ocr.pytorch
The code is as follows:
# adapted from this source: https://github.com/courao/ocr.pytorch
# IPython/Jupyter magics (notebook only): reload edited modules automatically
%load_ext autoreload
%autoreload 2

import os
from ocr import ocr
import time
import shutil
import numpy as np
import pathlib
from PIL import Image
from glob import glob
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
import pytesseract


def single_pic_proc(image_file):
    # Load the image as an RGB array and run detection + recognition on it
    image = np.array(Image.open(image_file).convert('RGB'))
    result, image_framed = ocr(image)
    return result, image_framed


image_files = glob('./input_images/*.*')
result_dir = './output_images_with_boxes/'

# If the output folder exists we will remove it and redo it.
if os.path.exists(result_dir):
    shutil.rmtree(result_dir)
os.mkdir(result_dir)

for image_file in sorted(image_files):
    result, image_framed = single_pic_proc(image_file)  # detecting and recognizing the text
    filename = pathlib.Path(image_file).name
    output_file = os.path.join(result_dir, image_file.split('/')[-1])
    txt_file = os.path.join(result_dir, image_file.split('/')[-1].split('.')[0] + '.txt')
    # Save the image with the detected boxes drawn on it
    Image.fromarray(image_framed).save(output_file)
    # Write one recognized text line per row
    with open(txt_file, 'w') as txt_f:
        for key in result:
            txt_f.write(result[key][1] + '\n')
The code sets up the input and output folders, iterates over all the input images (the converted pdf slides), runs the detection and recognition models from the ocr module on each one through the single_pic_proc() function, and saves the results to the output folder.
Detection is based on a PyTorch implementation of CTPN and recognition on a PyTorch implementation of CRNN; both models live in the ocr module.
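The loop above builds its output paths by splitting strings on '/' and '.'. As a rough alternative sketch, the same paths can be derived with pathlib, which also copes with other path separators. The helper name output_paths is hypothetical and only for illustration:

```python
from pathlib import Path


def output_paths(image_file, result_dir):
    """Return (framed image path, text path) inside result_dir for one input image.

    Hypothetical helper for illustration; the name is an assumption.
    """
    src = Path(image_file)
    out_dir = Path(result_dir)
    # src.name keeps the extension ("image7.png"); src.stem drops the last suffix ("image7").
    return out_dir / src.name, out_dir / (src.stem + ".txt")


img_out, txt_out = output_paths("./input_images/image7.png", "./output_images_with_boxes/")
print(img_out)
print(txt_out)
```

One difference from the string-splitting version: for a name with several dots, such as a.b.png, stem gives a.b, while split('.')[0] gives a.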
2.1.3. Sample output
The code is as follows:
import pathlib
import cv2 as cv

output_dir = pathlib.Path("./output_images_with_boxes")

# Alternative: pick a random output image instead of a fixed one
# image = cv.imread(str(np.random.choice(list(output_dir.iterdir()), 1)[0]))
image = cv.imread(f"{output_dir}/image7.png")
size_reshaped = (int(image.shape[1]), int(image.shape[0]))  # (width, height)
image = cv.resize(image, size_reshaped)
cv.imshow("image", image)
cv.waitKey(0)  # wait for a key press before closing the window
cv.destroyAllWindows()
The following picture shows the original pdf slide on the left and the output after text detection on the right, with very high accuracy.
The text recognition output is as follows:
filename = f"{output_dir}/image7.txt"
with open(filename, "r") as text:
    for line in text.readlines():
        print(line.strip("\n"))
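Since the stated goal is to work with the slide contents in Markdown, the per-slide text files can be joined into one Markdown document. This is a minimal sketch under assumptions: the function names and the "## Slide N" layout are made up for illustration, and the demo uses stand-in text files, not real OCR output.

```python
from pathlib import Path
import re
import tempfile


def slide_number(path):
    """Extract the trailing integer from names like 'image7.txt' (0 if none)."""
    match = re.search(r"(\d+)$", path.stem)
    return int(match.group(1)) if match else 0


def build_markdown(text_dir):
    """Join all .txt files in text_dir into one Markdown string, one section per slide."""
    sections = []
    # Sort numerically so that image10.txt comes after image2.txt
    for txt in sorted(Path(text_dir).glob("*.txt"), key=slide_number):
        lines = [ln.strip() for ln in txt.read_text(encoding="utf-8").splitlines()]
        body = "\n".join(ln for ln in lines if ln)  # drop blank lines
        sections.append(f"## Slide {slide_number(txt)}\n\n{body}\n")
    return "\n".join(sections)


# Demo with made-up text files standing in for real OCR output.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "image2.txt").write_text("Second slide\n", encoding="utf-8")
    Path(tmp, "image10.txt").write_text("Tenth slide\n\n", encoding="utf-8")
    notes = build_markdown(tmp)

print(notes)
```

In practice, build_markdown would be pointed at the output folder used above, and the returned string written to a .md file.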
With the method above, you end up with a very capable tool for transcribing documents, from detecting and recognizing handwritten notes to detecting and recognizing random text in photos.
Having your own OCR tool to handle text content is much better than relying on external software to transcribe documents.
This concludes the walkthrough of how to use Python to convert PDF into text. We hope it answers your questions. Combining theory with practice is the best way to learn, so go and try it!