
How to use Python to find duplicate files and delete them


In this issue, the editor brings you a script showing how to use Python to find duplicate files and delete them. The article works through the problem step by step from a practical point of view; I hope you get something out of it.

In real life, duplicate files are a common nuisance: the same file may sit in both directory A and directory B, and, worse, identical files may carry different names. With only a few files the situation is manageable, even if the worst case means comparing them by hand one by one, though even then it is hard to trust your eyes. With a large number of files, manual comparison becomes an impossible mission.

The following script consists of four modules: diskwalk, checksum, find_dupes, and delete. The diskwalk module traverses a directory tree: given a path, it returns every file under that path. The checksum module computes the MD5 digest of a file. find_dupes imports diskwalk and checksum and decides whether two files are the same by comparing their MD5 values. delete is the deletion module. The details are as follows:

1. diskwalk.py

import os, sys

class diskwalk(object):
    def __init__(self, path):
        self.path = path

    def paths(self):
        path = self.path
        path_collection = []
        # Walk the directory tree and collect the full path of every file
        for dirpath, dirnames, filenames in os.walk(path):
            for file in filenames:
                fullpath = os.path.join(dirpath, file)
                path_collection.append(fullpath)
        return path_collection

if __name__ == '__main__':
    for file in diskwalk(sys.argv[1]).paths():
        print(file)
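For comparison, on Python 3 the same traversal can be written with pathlib. This is only a sketch of an alternative, not part of the original script:

import sys
from pathlib import Path

def paths(root):
    # rglob('*') walks the tree recursively; keep only regular files
    return [str(p) for p in Path(root).rglob('*') if p.is_file()]

if __name__ == '__main__':
    for f in paths(sys.argv[1]):
        print(f)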

2. checksum.py

import hashlib, sys

def create_checksum(path):
    # Open in binary mode so the digest is consistent for any file type
    fp = open(path, 'rb')
    checksum = hashlib.md5()
    # Read in 8 KB chunks so large files do not have to fit in memory
    while True:
        buffer = fp.read(8192)
        if not buffer:
            break
        checksum.update(buffer)
    fp.close()
    checksum = checksum.digest()
    return checksum

if __name__ == '__main__':
    create_checksum(sys.argv[1])
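If you are on Python 3.11 or later, the manual read-update loop can be replaced by hashlib.file_digest, which does the chunked reading internally. A sketch under that version assumption:

import hashlib

def create_checksum_py311(path):
    # Assumes Python 3.11+, where hashlib.file_digest() is available
    with open(path, 'rb') as fp:
        return hashlib.file_digest(fp, 'md5').digest()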

3. find_dupes.py

from checksum import create_checksum
from diskwalk import diskwalk
from os.path import getsize
import sys

def findDupes(path):
    record = {}
    dup = {}
    d = diskwalk(path)
    files = d.paths()
    for file in files:
        # Two files match only if both their size and MD5 digest agree
        compound_key = (getsize(file), create_checksum(file))
        if compound_key in record:
            dup[file] = record[compound_key]
        else:
            record[compound_key] = file
    return dup

if __name__ == '__main__':
    for file in findDupes(sys.argv[1]).items():
        print("The duplicate file is %s" % file[0])
        print("The original file is %s\n" % file[1])

The findDupes function returns the dictionary dup, whose keys are the duplicate files and whose values are the corresponding originals. This answers a question many readers ask: how do you decide which copy is the duplicate? The first file seen with a given (size, MD5) key is stored in record as the original; any later file with the same key is reported in dup as a duplicate.
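To see the compound (size, MD5) key at work, here is a small self-test; the temporary directory and file contents are invented for illustration:

import os, tempfile
from find_dupes import findDupes

# Hypothetical test data: a.txt and b.txt have identical contents
tmp = tempfile.mkdtemp()
for name, data in [('a.txt', b'hello'), ('b.txt', b'hello'), ('c.txt', b'world')]:
    with open(os.path.join(tmp, name), 'wb') as f:
        f.write(data)

# Expect one entry: whichever of a.txt/b.txt os.walk visits second,
# mapped to the one it visited first
print(findDupes(tmp))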

4. delete.py

import os, sys

class deletefile(object):
    def __init__(self, file):
        self.file = file

    def delete(self):
        # Delete the file outright
        print("Deleting %s" % self.file)
        os.remove(self.file)

    def dryrun(self):
        # Trial run: report the file but leave it in place
        print("Dry Run: %s [NOT DELETED]" % self.file)

    def interactive(self):
        # Ask for confirmation before deleting (raw_input is Python 2)
        answer = raw_input("Do you really want to delete: %s [Y/N] " % self.file)
        if answer.upper() == 'Y':
            os.remove(self.file)
        else:
            print("Skipping: %s" % self.file)
        return

if __name__ == '__main__':
    from find_dupes import findDupes
    dup = findDupes(sys.argv[1])
    for file in dup.iterkeys():
        delete = deletefile(file)
        # delete.dryrun()
        delete.interactive()
        # delete.delete()

The deletefile class defines three methods, each a different way to remove a file: delete removes the file outright, dryrun is a trial run that reports the file without deleting it, and interactive prompts the user to confirm each deletion. Together they cover the common use cases.
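A quick illustration of the three modes (the file path here is hypothetical):

from delete import deletefile

d = deletefile('/tmp/example_duplicate.txt')  # hypothetical path
d.dryrun()         # reports the file, deletes nothing
# d.interactive()  # prompts Y/N before deleting
# d.delete()       # deletes immediately, no confirmation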

Summary: each of the four modules is self-contained and can be used on its own; chained together, they let you batch-delete duplicate files under any given path.

Finally, here is a full version of the script, compatible with both Python 2 and Python 3.

#!/usr/bin/python
# -*- coding: UTF-8 -*-
from __future__ import print_function
import os, sys, hashlib

class diskwalk(object):
    def __init__(self, path):
        self.path = path

    def paths(self):
        path = self.path
        files_in_path = []
        # Collect the full path of every file under the given directory
        for dirpath, dirnames, filenames in os.walk(path):
            for each_file in filenames:
                fullpath = os.path.join(dirpath, each_file)
                files_in_path.append(fullpath)
        return files_in_path

def create_checksum(path):
    # MD5 of the file, read in binary mode in 8 KB chunks
    fp = open(path, 'rb')
    checksum = hashlib.md5()
    while True:
        buffer = fp.read(8192)
        if not buffer:
            break
        checksum.update(buffer)
    fp.close()
    return checksum.digest()

def findDupes(path):
    record = {}
    dup = {}
    d = diskwalk(path)
    files = d.paths()
    for each_file in files:
        compound_key = (os.path.getsize(each_file), create_checksum(each_file))
        if compound_key in record:
            dup[each_file] = record[compound_key]
        else:
            record[compound_key] = each_file
    return dup

class deletefile(object):
    def __init__(self, file_name):
        self.file_name = file_name

    def delete(self):
        print("Deleting %s" % self.file_name)
        os.remove(self.file_name)

    def dryrun(self):
        print("Dry Run: %s [NOT DELETED]" % self.file_name)

    def interactive(self):
        try:
            # raw_input exists only in Python 2; fall back to input on Python 3
            answer = raw_input("Do you really want to delete: %s [Y/N] " % self.file_name)
        except NameError:
            answer = input("Do you really want to delete: %s [Y/N] " % self.file_name)
        if answer.upper() == 'Y':
            os.remove(self.file_name)
        else:
            print("Skipping: %s" % self.file_name)
        return

def main():
    directory_to_check = sys.argv[1]
    duplicate_file = findDupes(directory_to_check)
    for each_file in duplicate_file:
        delete = deletefile(each_file)
        delete.interactive()

if __name__ == '__main__':
    main()

The first command-line argument is the directory to scan, e.g. python dedupe.py /path/to/check (where dedupe.py is whatever name you saved the script under).
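One refinement worth noting, though it is not in the original script: computing MD5 is far more expensive than reading a file's size, so you can group files by size first and hash only the files whose sizes collide. A sketch, reusing the diskwalk class and create_checksum function from the full script above:

import os
from collections import defaultdict

def findDupesFaster(path):
    # Pass 1: bucket files by size (a cheap metadata lookup)
    by_size = defaultdict(list)
    for each_file in diskwalk(path).paths():
        by_size[os.path.getsize(each_file)].append(each_file)

    # Pass 2: hash only the buckets that hold more than one file
    dup = {}
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # unique size, cannot have a duplicate
        record = {}
        for each_file in same_size:
            digest = create_checksum(each_file)
            if digest in record:
                dup[each_file] = record[digest]
            else:
                record[digest] = each_file
    return dup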

That is how to use Python to find duplicate files and delete them with a script. If you have run into a similar problem, the analysis above may help you work through it.
