In addition to Weibo, there is also WeChat
Please pay attention
WeChat public account
Shulou
2025-04-02 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Internet Technology >
Share
Shulou(Shulou.com)06/01 Report--
This article mainly explains "Python GUI programming how to make a document image extraction software", interested friends may wish to have a look. The method introduced in this paper is simple, fast and practical. Next let the editor to take you to learn "Python GUI programming how to make a document image extraction software" bar!
This article will further explain how to use Python to extract images from PDF and Word, and make a multi-file image extraction software combined with the GUI framework PysimpleGUI explained earlier. The results are as follows:
This article will be divided into the following parts:
PDF,Word,Excel file image extraction
Construct a picture extractor GUI framework
Integrate the code and package it
The main Python modules involved are:
PIL
PySimpleGUI
Re
Win32
Os
Zipfile
Fitz
Preparatory work
First use pip to install related dependent modules
Pip install pillow # this is the installation of the module PTL pip install pypiwin32 # this is the installation of win32 pip install os pip install zipfilepip install PyMuPDF # this is the reference fitz to PDF operation of the package pip install PySimpleGui one, extract the embedded images of each file
As mentioned in the previous article, there are two ways to read Excel. One is to change the suffix name to .zip format for extraction, and the other is to copy and save the image of Excel through the Pillow module. In our three file formats of image extraction, Excel image extraction method is the same as before.
The image extraction method in Word is similar to the .zip extraction method, and the PDF image extraction method uses a new module. Since the two methods of extracting images in Excel have been mentioned in the previous article, we will only talk about the extraction methods in PDF and Word.
1.1 ideas for extracting Word pictures
As before, let's look at the code before we explain it.
Path = values ["lujing"] count = 1for file in os.listdir (path): new_file = file.replace (".docx", ".zip") os.rename (os.path.join (path,file), os.path.join (path) New_file) count+=1 number = 0 craterDir = values ["lujing"] +'/'# the path to the folder where the zip files are stored saveDir = values ["lujing"] +'/'# the path where the pictures are stored list_dir = os.listdir (craterDir) # get all the zip file names for i in range (len (list_dir)): if 'zip' not in list_ Diri [I ]: list_ dirt [I] =''while''in list_dir: list_dir.remove ('') for zip_name in list_dir: # default mode r Read azip = zipfile.ZipFile (craterDir + zip_name) # return all folders and files namelist = (azip.namelist ()) for idx in range (0 Len (namelist): if namelist [idx] [: 11] = 'word/media/':# picture is in this path img_name = saveDir + str (number) +' .jpg 'f = azip.open (namelist [IDX]) img = Image.open (f ) img = img.convert ("RGB") img.save (img_name "JPEG") number + = 1 azip.close () # close the file Must have, free memory
The code here is the same as the code in GUI to extract Excel images through .zip.
"
Path = values ["lujing"] here is the value that reads the key * * "lujing" * * in GUI, that is, the file storage location, which is used for os module reading and operation.
New_file = file.replace (".docx", ".zip") is a replacement suffix, and if it is Excel, change .docx to .xlsx or xls.
CraterDir = values ["lujing"] +'/ 'this is the path of the folder where the zip files are stored. Note that you need to add / after the key "lujing" is returned here.
SaveDir = values ["lujing"] +'/ 'this is the path where the image is stored. In the same way, add a / number as above.
"
Finally, when talking about the proportion of Excel extraction, the biggest difference is the following code
If namelist [idx] [: 11] = 'word/media/': careful readers will find that the value in square brackets has changed in proportion to Excel extraction.
It's easy to understand that we can print a list of names [idx] and find that indexes 0 to 10 are the location of 'word/media/'. In Excel, it is the top nine.
Interested readers can take a look at the previous article, where there is a detailed analysis of this code, which is not introduced here.
1.2 the idea of extracting PDF pictures
As with the previous excel extraction images, add four images to a pdf, and we compress it into a zip file
After reading
The related code is as follows:
Def pdf2pic (path, pic_path): doc = fitz.open (path) nums = doc._getXrefLength () imgcount = 0 for i in range (1 Nums): text = doc._getXrefString (I) if ('Width 2550' in text) and (' Height 3300'in text) or ('thumbnail' in text): continue checkXO = r "/ Type (? = * / XObject)" checkIM = r "/ Subtype (? = * / Image)" isXObject = re.search (checkXO, text) isImage = re.search (checkIM) Text) if not isXObject or not isImage: continue imgcount + = 1 pix = fitz.Pixmap (doc, I) img_name = "img {} .png" .format (imgcount) if pix.n < 5: try: pix.writePNG (os.path.join (pic_path) Img_name) pix = None except: pix0 = fitz.Pixmap (fitz.csRGB, pix) pix0.writePNG (os.path.join (pic_path) Img_name) pix0 = Noneif _ _ name__ ='_ _ main__': path = values ["lujing"] +'/'+ values ["wenjian"] pic_path = values ["lujing"] pdf2pic (path, pic_path)
Let's talk about the idea of this code first. Since PDF cannot be extracted by changing its suffix like Excel and Word, a module PyMuPDF of python is used here, and the process is as follows
Read PDF and traverse every page
Filter useless elements and get pictures with regular expressions
Generate and save pictures
Fitz.open (path) is to open the PDF folder, where the path needs to get the user's file placement path to the file name in the GUI interface.
For i in range (1, nums): text = doc._getXrefString (I)
This is our first step to remerge the calendar, putting the reintroduced string content into the text
If ('Width 2550' in text) and ('Height 3300' in text) or ('thumbnail' in text): continue
Since the background of each page of PDF is a picture, we can find these background images by width and height and remove them with continue.
CheckXO = r "/ Type (? = * / XObject)" checkIM = r "/ Subtype (? = * / Image) isXObject = re.search (checkXO, text) isImage = re.search (checkIM, text) if not isXObject or not isImage: continue
Through experiments, we find that the corresponding string of the picture is in checkxo and checkIM. Therefore, the regular expression is used to extract the index in the text, and the search function of the re module is used. If it is not these two strings, continue to remove them.
Pix = fitz.Pixmap (doc, I) img_name = "img {} .png" .format (imgcount) if pix.n < 5: try: pix.writePNG (os.path.join (pic_path, img_name)) pix = None except: pix0 = fitz.Pixmap (fitz.csRGB, pix) pix0.writePNG (os.path.join (pic_path, img_name) pix0 = None
This is the last step to generate and save the picture.
"
Pix = fitz.Pixmap (doc, I) is the generate image statement, doc represents the PDF file, and I represents the index value that traverses each picture object.
Img_name = "img {} .png" .format (imgcount) is a statement that sets the image name, using the sequence number of the extracted picture as the naming format.
"
If pix.n
Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.
Views: 0
*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.
Continue with the installation of the previous hadoop.First, install zookooper1. Decompress zookoope
"Every 5-10 years, there's a rare product, a really special, very unusual product that's the most un
The rebuild server detects the list of rebuild requests.
© 2024 shulou.com SLNews company. All rights reserved.