How to extract all the pictures from Word files with Python


2025-07-06 Update. From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article explains how to extract all the pictures from Word files with Python. It is a common question in day-to-day work, and the method collected here is simple and easy to use. Let's walk through it.

Preface

At the office I occasionally need to extract the pictures from Word documents, so I decided to write a tool that extracts them automatically.

Notes on using the script:

Scenario 1: if you have a folder and all the Word files sit directly inside it (one level deep), you can use the script as is.

Scenario 2: if the folder is cluttered with all kinds of files and you are not sure where the Word documents are, use Everything to find the Word documents and move them into one folder by hand. The script could do this as well, but it would then have to deal with files that share the same name, which would make the code longer, so I move the files manually with Everything. The code is already much longer than I expected.

3: after the preprocessing in the first two steps, you can run the script directly.

4: the comments in the script are detailed, so I won't repeat them here.

5: currently only the docx format is supported. Supporting doc would mean converting doc to docx first, which is somewhat slow, and I don't need it. If you are interested, the conversion method is described at the end of the article, and you can add this feature yourself.
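The reason only docx works is that a docx file is an ordinary ZIP archive, and Word stores embedded pictures under word/media/ inside it. As a rough sketch of that idea (not the author's script), the pictures can be read straight out of the archive without unpacking it, which also sidesteps the same-name problem from Scenario 2 because no documents have to be moved. The function names iter_docx_images and extract_unique_images are made up for this illustration, and walking subfolders recursively is an assumption that goes beyond what the script does.

```python
import hashlib
import os
import zipfile

IMAGE_SUFFIXES = ('.png', '.jpg', '.jpeg')


def iter_docx_images(docx_path):
    """Yield (suffix, data) for each picture stored in one docx archive."""
    with zipfile.ZipFile(docx_path) as archive:
        for name in archive.namelist():
            suffix = os.path.splitext(name)[1].lower()
            # Embedded pictures live under word/media/ in the archive
            if name.startswith('word/media/') and suffix in IMAGE_SUFFIXES:
                yield suffix, archive.read(name)


def extract_unique_images(source_dir, output_dir):
    """Walk source_dir, save each distinct picture once, return how many were saved."""
    os.makedirs(output_dir, exist_ok=True)
    seen = set()  # MD5 digests of pictures already written
    count = 0
    for folder, _subfolders, files in os.walk(source_dir):
        for file in sorted(files):
            docx_path = os.path.join(folder, file)
            # Skip other file types and anything that is not a real ZIP archive
            if not file.lower().endswith('.docx') or not zipfile.is_zipfile(docx_path):
                continue
            for suffix, data in iter_docx_images(docx_path):
                digest = hashlib.md5(data).hexdigest()
                if digest in seen:
                    continue  # duplicate picture, skip it
                seen.add(digest)
                count += 1
                # Name pictures 1.png, 2.jpg, ... so nothing gets overwritten
                with open(os.path.join(output_dir, str(count) + suffix), 'wb') as out:
                    out.write(data)
    return count
```

Because duplicates are skipped before anything is written, this variant needs no Recycle Bin step; whether that trade-off suits you depends on whether you want to review the duplicates afterwards.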

The code

import zipfile
import os
import shutil
import hashlib
import send2trash  # third-party package

# Assume all the Word documents sit under one path that also contains all kinds of other files.
# Use Everything, or "filter file.py", to move all the docx files to C:\Users\asuka\Desktop\123,
# then unpack each docx document one by one and extract the images.
# It is strongly recommended to use Everything to filter out the Word documents: if two documents
# share a name you can handle that by hand, whereas handling it in code would be a lot of trouble.


# Unpack a docx file (a ZIP archive) into the folder that contains it
def extract_zip(zip_path):
    with zipfile.ZipFile(zip_path) as archive:  # create a ZipFile object
        archive.extractall(os.path.dirname(zip_path))  # extract next to the docx file


# Collect all the images.
# Testing showed that different Word files use the same image names once unpacked,
# so the pictures cannot simply be moved, or files would overwrite each other.
# Every file found is therefore renamed.
def get_picture(demo_path):
    count = 1  # used to rename the images
    for current_folder, list_folders, files in os.walk(demo_path):
        for f in files:
            if f.endswith(('png', 'jpg', 'jpeg')):  # image types to collect
                move_f = os.path.join(current_folder, f)  # path of the file to move
                # new path: an incrementing number plus the original suffix
                new_file_path = os.path.join(path2, str(count) + '.' + f.rpartition('.')[-1])
                shutil.move(move_f, new_file_path)  # move the file
                count += 1
    print('[-] Extracted {} pictures in total'.format(count - 1))


# Deduplicate the pictures.
# Compute the MD5 value of each picture and remove duplicates accordingly.
# Duplicate files are sent to the Recycle Bin.
def only_one(test_path):
    md5_list = []
    count = 0
    for current_folder, list_folders, files in os.walk(test_path):
        for file in files:
            picture_path = os.path.join(current_folder, file)  # path of each picture
            with open(picture_path, 'rb') as f:  # compute the MD5 value of the picture
                md5_value = hashlib.md5(f.read()).hexdigest().upper()
            if md5_value in md5_list:
                send2trash.send2trash(picture_path)  # MD5 seen before: delete this picture
                count += 1
                print('[-] Deleted duplicate picture: ' + str(file))
            else:
                md5_list.append(md5_value)  # MD5 not seen yet: add it to the list
    print('[-] Duplicate pictures deleted: {}'.format(count))


print('[+] Pictures can only be extracted from Word documents with the docx suffix!')
path = input('[+] Please enter the folder where the Word documents are located: ')
os.chdir(path)
print('[+] Please enter a path to store all the pictures')
print('[+] or press Enter and I will store the pictures on your desktop')
path2 = input('')  # path2 stores all the picture files
if not path2:
    desktop_path = os.path.join(os.path.expanduser('~'), 'Desktop')  # desktop path
    path2 = os.path.join(desktop_path, 'Pictures in all word files')
    os.makedirs(path2, exist_ok=True)

files = os.listdir(path)  # all files in the specified folder
for file in files:
    if file.endswith('docx'):  # other file types under path are ignored
        filename = file.rpartition('.')[0]  # file name without the suffix
        file_path = os.path.join(path, filename)
        os.makedirs(file_path, exist_ok=True)  # create a folder named after the document
        shutil.move(os.path.join(path, file), file_path)  # move the document into that folder
        word_path = os.path.join(file_path, file)  # path of the Word file after the move
        extract_zip(word_path)  # no need to change the suffix, unpack the docx directly

get_picture(path)
only_one(path2)
print('[-] Pictures remaining: {}'.format(len(os.listdir(path2))))

Appendix: doc to docx

This section shows how to convert between the two formats.

A few things to note:

Microsoft Office must be installed. Kingsoft WPS is not supported yet.

The conversion is a little slow, but acceptable.

To convert to another format, change the suffix in the target file name and pass the matching file format parameter to SaveAs.
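To make the last note concrete, here is a small sketch of how the target file name and the SaveAs format parameter can be kept in step. The helper save_as_arguments and the SAVE_FORMATS table are hypothetical names introduced for this example. The numbers are taken from Word's WdSaveFormat enumeration, and only 0 and 12 are actually used in this article, so treat the other entries as assumptions to verify against the Office documentation.

```python
import os

# Subset of Word's WdSaveFormat values (assumed from the Office documentation)
SAVE_FORMATS = {
    '.doc': 0,    # wdFormatDocument
    '.rtf': 6,    # wdFormatRTF
    '.docx': 12,  # wdFormatXMLDocument
    '.pdf': 17,   # wdFormatPDF
}


def save_as_arguments(fn, new_suffix):
    """Return the (file name, format code) pair to pass to doc.SaveAs."""
    suffix = new_suffix.lower()
    if suffix not in SAVE_FORMATS:
        raise ValueError('no format code known for ' + new_suffix)
    base, _old_suffix = os.path.splitext(fn)  # drop the current suffix
    return base + suffix, SAVE_FORMATS[suffix]
```

With a document already opened through win32com, the call would then look like doc.SaveAs(*save_as_arguments(fn, '.docx')), so the suffix and the format code cannot drift apart.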

Code

About lines 9 and 19:

Line 9, doc.SaveAs("{}x".format(fn), 12):

"{}x".format(fn) turns C:\Users\asuka\Desktop\11\123.doc into C:\Users\asuka\Desktop\11\123.docx. The first argument gives the path and file name to save to, and 12 means the file is stored in docx format, so the suffix and the format match.

Line 19, doc.SaveAs("{}".format(fn[:-1]), 0):

"{}".format(fn[:-1]) turns C:\Users\asuka\Desktop\11\456.docx into C:\Users\asuka\Desktop\11\456.doc. The first argument gives the path and file name to save to, and 0 means the file is stored in doc format, so the suffix and the format match.

from win32com import client


# Convert doc to docx
def doc2docx(fn):
    word = client.Dispatch("Word.Application")  # start the Word application
    # for file in files:
    doc = word.Documents.Open(fn)  # open the Word file
    doc.SaveAs("{}x".format(fn), 12)  # save with the ".docx" suffix; 12 or 16 means docx
    doc.Close()  # close the original Word file
    word.Quit()


# Convert docx to doc
def docx2doc(fn):
    word = client.Dispatch("Word.Application")  # start the Word application
    # for file in files:
    doc = word.Documents.Open(fn)  # open the Word file
    doc.SaveAs("{}".format(fn[:-1]), 0)  # save with the ".doc" suffix; 0 means doc
    print(fn[:-1])
    doc.Close()  # close the original Word file
    word.Quit()


doc2docx(r'C:\Users\asuka\Desktop\11\123.doc')
docx2doc(r'C:\Users\asuka\Desktop\11\456.docx')

That concludes this look at how to extract all the pictures from Word files with Python. Theory is best learned alongside practice, so go and try it!

