How to Make Document Image Extraction Software with Python GUI Programming


2025-07-01 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article explains how to build document image extraction software with Python GUI programming. The method introduced here is simple, fast, and practical. Let's walk through it.

It goes further than the earlier articles by showing how to use Python to extract images from PDF and Word files, and it builds a multi-file image extraction tool with PySimpleGUI, the GUI framework covered previously. The results are as follows:

This article will be divided into the following parts:

Image extraction from PDF, Word, and Excel files

Building the GUI framework for the image extractor

Integrate the code and package it

The main Python modules involved are:

PIL

PySimpleGUI

win32 (installed as pypiwin32)

zipfile

fitz (installed as PyMuPDF)

Preparatory work

First, use pip to install the dependencies:

    pip install pillow       # installs the PIL module
    pip install pypiwin32    # installs the win32 modules
    pip install PyMuPDF      # installs fitz, the package used for PDF operations
    pip install PySimpleGUI

The os and zipfile modules are part of the Python standard library and do not need to be installed with pip.

1. Extract the embedded images of each file

As mentioned in the previous article, there are two ways to extract images from Excel. One is to change the file suffix to .zip and extract the images from the archive; the other is to copy each image out of Excel and save it through the Pillow module. Of the three file formats covered here, Excel uses the same extraction method as before.

Image extraction from Word is similar to the .zip method, while PDF extraction uses a new module. Since both Excel methods were covered in the previous article, this section only discusses PDF and Word.

1.1 Ideas for extracting Word pictures

As before, let's look at the code before we explain it.

    import os
    import zipfile
    from PIL import Image

    # values is the dictionary returned by the GUI (see the explanation below)
    path = values["lujing"]
    count = 1
    for file in os.listdir(path):
        new_file = file.replace(".docx", ".zip")
        os.rename(os.path.join(path, file), os.path.join(path, new_file))
        count += 1

    number = 0
    craterDir = values["lujing"] + '/'  # the folder where the zip files are stored
    saveDir = values["lujing"] + '/'    # the folder where the pictures are saved
    list_dir = os.listdir(craterDir)    # get all the file names
    # keep only the zip files
    for i in range(len(list_dir)):
        if 'zip' not in list_dir[i]:
            list_dir[i] = ''
    while '' in list_dir:
        list_dir.remove('')

    for zip_name in list_dir:
        # default mode is 'r' (read)
        azip = zipfile.ZipFile(craterDir + zip_name)
        # list all folders and files in the archive
        namelist = azip.namelist()
        for idx in range(0, len(namelist)):
            if namelist[idx][:11] == 'word/media/':  # pictures are under this path
                img_name = saveDir + str(number) + '.jpg'
                f = azip.open(namelist[idx])
                img = Image.open(f)
                img = img.convert("RGB")
                img.save(img_name, "JPEG")
                number += 1
        azip.close()  # close the file; required to free memory

This is the same code the GUI uses to extract Excel images through .zip.

path = values["lujing"] reads the value stored under the key "lujing" in the GUI, that is, the folder where the files are stored. The os module uses it for reading and renaming.

new_file = file.replace(".docx", ".zip") replaces the suffix. For Excel, change .docx to .xlsx or .xls (note that only .xlsx files are zip archives; the older binary .xls format is not).
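Renaming is not strictly required: zipfile.ZipFile recognises an archive by its content rather than its suffix, so a .docx file can be opened in place. The following is a minimal sketch under that assumption, not the article's code. The function name extract_docx_images and its parameters are hypothetical, and it copies the raw image bytes instead of converting them to JPEG with Pillow.

```python
import os
import zipfile


def extract_docx_images(docx_path, save_dir):
    """Copy every file under word/media/ out of a .docx archive.

    Hypothetical helper: it opens the .docx directly instead of
    renaming it to .zip first.
    """
    os.makedirs(save_dir, exist_ok=True)
    saved = []
    with zipfile.ZipFile(docx_path) as archive:  # closed automatically
        for name in archive.namelist():
            # Skip everything outside word/media/ and directory entries.
            if not name.startswith("word/media/") or name.endswith("/"):
                continue
            # Keep the original file name and extension (image1.png, ...).
            target = os.path.join(save_dir, os.path.basename(name))
            with archive.open(name) as src, open(target, "wb") as dst:
                dst.write(src.read())
            saved.append(target)
    return saved
```

Because the source file is never renamed, the folder is left unchanged if extraction fails halfway.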

craterDir = values["lujing"] + '/' is the path of the folder where the zip files are stored. Note that a / must be appended to the value returned for the key "lujing".

saveDir = values["lujing"] + '/' is the path where the images are saved. As above, a / is appended.

Finally, compared with the Excel extraction, the biggest difference is the following line:

    if namelist[idx][:11] == 'word/media/':

Careful readers will notice that the number in the slice has changed compared with the Excel version.

The reason is easy to see: printing namelist[idx] shows that characters 0 to 10 of each picture entry spell 'word/media/'. In Excel, the prefix is the first nine characters.
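The slice lengths can be checked directly. The sketch below assumes the usual layout of Office archives, where Word keeps embedded media under word/media/ and Excel under xl/media/. The sample entry names are invented for illustration. str.startswith selects the same entries without counting characters.

```python
word_prefix = "word/media/"
excel_prefix = "xl/media/"   # assumed Excel counterpart of the Word prefix

print(len(word_prefix))      # 11, hence the slice [:11]
print(len(excel_prefix))     # 9, hence the slice [:9]

# Invented sample of what azip.namelist() might return for a .docx file.
namelist = ["[Content_Types].xml", "word/document.xml", "word/media/image1.png"]

by_slice = [n for n in namelist if n[:11] == word_prefix]
by_prefix = [n for n in namelist if n.startswith(word_prefix)]
print(by_slice == by_prefix)  # both approaches select the same entries
```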

Interested readers can take a look at the previous article, where there is a detailed analysis of this code, which is not introduced here.

1.2 Ideas for extracting PDF pictures

As in the earlier Excel image extraction, we add four images to a PDF and compress it into a zip file, then read it:

The related code is as follows:

    import os
    import re
    import fitz  # provided by PyMuPDF

    # Note: this code uses the method names of older PyMuPDF releases.
    # Recent releases call them doc.xref_length(), doc.xref_object(i)
    # and pix.save(...).
    def pdf2pic(path, pic_path):
        doc = fitz.open(path)
        nums = doc._getXrefLength()
        imgcount = 0
        for i in range(1, nums):
            text = doc._getXrefString(i)
            # skip page backgrounds and thumbnails
            if ('Width 2550' in text) and ('Height 3300' in text) or ('thumbnail' in text):
                continue
            checkXO = r"/Type(?= */XObject)"
            checkIM = r"/Subtype(?= */Image)"
            isXObject = re.search(checkXO, text)
            isImage = re.search(checkIM, text)
            # skip anything that is not an image object
            if not isXObject or not isImage:
                continue
            imgcount += 1
            pix = fitz.Pixmap(doc, i)
            img_name = "img{}.png".format(imgcount)
            if pix.n < 5:
                try:
                    pix.writePNG(os.path.join(pic_path, img_name))
                    pix = None
                except Exception:
                    # convert to RGB first, then save
                    pix0 = fitz.Pixmap(fitz.csRGB, pix)
                    pix0.writePNG(os.path.join(pic_path, img_name))
                    pix0 = None

    if __name__ == '__main__':
        # values is the dictionary returned by the GUI
        path = values["lujing"] + '/' + values["wenjian"]
        pic_path = values["lujing"]
        pdf2pic(path, pic_path)

Let's first go through the idea behind this code. Since the images in a PDF cannot be extracted by changing the suffix as with Excel and Word, the Python module PyMuPDF is used here. The process is as follows:

Read the PDF and traverse every object in it

Filter out useless elements and identify the pictures with regular expressions

Generate and save the pictures

fitz.open(path) opens the PDF file. The path must combine the folder chosen by the user in the GUI with the file name.

    for i in range(1, nums):
        text = doc._getXrefString(i)

This is the first step, the traversal: the string describing each object is stored in text.

    if ('Width 2550' in text) and ('Height 3300' in text) or ('thumbnail' in text):
        continue

Since the background of each page of the PDF is a picture, we can identify these background images by their width and height and skip them with continue.
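Note how Python groups this condition: and binds more tightly than or, so it reads as (width and height both match) or (the text mentions a thumbnail). The small sketch below writes the grouping out. The helper name is_skipped and the sample strings are invented for illustration.

```python
def is_skipped(text):
    """Mirror the article's filter with the grouping made explicit."""
    is_background = ('Width 2550' in text) and ('Height 3300' in text)
    is_thumbnail = 'thumbnail' in text
    return is_background or is_thumbnail


# Invented examples of object definition strings.
print(is_skipped("/Width 2550 /Height 3300"))   # True: full-page background
print(is_skipped("/Width 2550 /Height 100"))    # False: only the width matches
print(is_skipped("/thumbnail /Width 64"))       # True: thumbnail
```

The width and height values are specific to the document being processed, so they would need adjusting for PDFs with a different page size.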

    checkXO = r"/Type(?= */XObject)"
    checkIM = r"/Subtype(?= */Image)"
    isXObject = re.search(checkXO, text)
    isImage = re.search(checkIM, text)
    if not isXObject or not isImage:
        continue

Through experiments, we find that picture objects contain the strings matched by checkXO and checkIM. The search function of the re module looks for both patterns in text; if either one is missing, the object is skipped with continue.
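To see what the two patterns match, here is a small sketch run on invented object strings shaped like PDF dictionaries; the strings PyMuPDF actually returns may be laid out differently. The lookahead (?= */XObject) requires /Type to be followed by optional spaces and then /XObject, without consuming them. The helper name looks_like_image is hypothetical.

```python
import re

checkXO = r"/Type(?= */XObject)"
checkIM = r"/Subtype(?= */Image)"


def looks_like_image(text):
    """Return True when the object string declares an image XObject."""
    return bool(re.search(checkXO, text)) and bool(re.search(checkIM, text))


# Invented samples; only the first declares both /XObject and /Image.
image_obj = "<< /Type /XObject /Subtype /Image /Width 640 /Height 480 >>"
form_obj = "<< /Type /XObject /Subtype /Form >>"
page_obj = "<< /Type /Page /Parent 2 0 R >>"

for sample in (image_obj, form_obj, page_obj):
    print(looks_like_image(sample))
```

Only the first sample passes both checks; the form object fails the /Subtype test and the page object fails the /Type test.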

    pix = fitz.Pixmap(doc, i)
    img_name = "img{}.png".format(imgcount)
    if pix.n < 5:
        try:
            pix.writePNG(os.path.join(pic_path, img_name))
            pix = None
        except Exception:
            pix0 = fitz.Pixmap(fitz.csRGB, pix)
            pix0.writePNG(os.path.join(pic_path, img_name))
            pix0 = None

This is the last step: generating and saving the picture.

pix = fitz.Pixmap(doc, i) is the statement that generates the image; doc is the PDF document and i is the index of the object being traversed.

img_name = "img{}.png".format(imgcount) sets the image name, using the sequence number of the extracted picture.

If pix.n
