How to automatically clean up a large number of files under Linux


This article introduces how to automatically clean up a large number of files under Linux. In real operations many people run into exactly this kind of problem, so let the editor walk you through how to handle these situations. I hope you read it carefully and get something out of it!

The need for automatic file cleaning

System administrators manage an enterprise's most valuable asset: its data. And since Linux holds roughly half of the enterprise server operating system market, Linux system administrators are among the most important custodians of that asset. The administrator's responsibility is to make limited IT resources hold the most valuable data. When IBM launched a 3.5-inch 1GB hard disk in 1991, an administrator could manage files by hand and know every file on the disk; today's PB-scale storage devices bring unprecedented challenges to file management.

Anyone who has used Linux knows how to delete a file. But how would you handle the following deletion tasks?

Delete all files ending with a specific suffix across an entire file system;

Delete one specified file in a file system containing a million files;

Delete 100,000 files created on a specified date from a file system containing several million files;

In a file system holding hundreds of millions of files, run a cleanup every day that deletes the millions of files generated a year ago.

The rest of this article discusses strategies and methods for carrying out the deletions above. If these operations already look easy to you, feel free to skip this article.

For file system cleanup, the work can be roughly divided into two categories: cleaning up expired files and cleaning up junk files.

Expired files

All data has a life cycle. The data life cycle curve tells us that data is most valuable during the period after it is generated, and its value declines as time passes. When the life cycle ends, these expired files should be deleted to free storage space for more valuable data.

Junk files

While a system runs it produces all kinds of temporary files: scratch files written by applications, trace files generated by system errors, core dumps, and so on. Once they have been dealt with, these files lose their retention value and can collectively be called junk files. Cleaning junk files promptly helps with system maintenance and management and keeps the system running stably and efficiently.

Overview of automatic cleanup of files

Characteristics and methods of automatic file cleaning

Deleting a file at a known absolute path is something rm can do directly. If we only know the file name but not its path, we can locate it with `find` and then delete it. By extension, if we can find every file that matches a set of preset conditions, we can delete them. That is the basic idea of automatic file cleaning: generate a list of files to be deleted according to preset conditions, then run a periodic cleanup task that performs the deletions.
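
As a minimal sketch of that idea (the path /data/archive, the *.log suffix and the 30-day threshold are illustrative assumptions, not values from this article), the list can be produced by find and consumed by a separate deletion step:

# Step 1: generate the to-delete list from preset conditions
find /data/archive -type f -name '*.log' -mtime +30 > /tmp/to_delete.lst

# Step 2: the periodic cleanup task deletes every file named in the list
# (-d '\n' keeps names containing spaces intact; -r skips the run if the list is empty)
xargs -r -d '\n' rm -f < /tmp/to_delete.lst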

For expired files the common marker is a timestamp; depending on the file system this may be the creation time, access time, expiration time, or some other time attribute. Because most expired files live in archiving systems, their numbers are huge: in a large system the files expiring each day can reach hundreds of thousands or even millions. Scanning the file system and generating the file list for that many files takes a long time, so cleanup performance is a problem such administrators have to consider.

Junk files may be files stored in a specific directory, files ending with a particular suffix, or zero-size or oversized files produced by system errors. These files are usually fewer in number, but they come in many kinds and the situations are more complicated, so the administrator's experience is needed to formulate more detailed query conditions: scan periodically, generate a file list, and then process it further.
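
As an illustration of such administrator-defined conditions (the directories, patterns and the size check below are assumptions for the sketch, not rules from this article), several typical junk-file queries can be expressed with find and merged into one list:

# Scratch files left behind under an application's temporary directory
find /data/app/tmp -type f -name '*.tmp' > junk.lst
# Core dumps anywhere under the data tree
find /data -type f -name 'core.*' >> junk.lst
# Zero-size files produced by failed jobs
find /data -type f -size 0 >> junk.lst
# Review junk.lst before handing it to the deletion step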

Introduction to related Linux commands

Common file system management commands include `ls`, `rm`, `find` and so on. Since these are everyday system administration commands, they are not repeated here; see the command help or the Linux manual pages for detailed usage. Because file systems at this scale are generally built on dedicated file systems, those file systems provide their own commands for management. The practice section of this article takes IBM's GPFS file system as an example, so a few GPFS management commands are briefly introduced below.

mmlsattr

This command is mainly used to view the extended attributes of files in the GPFS file system, such as storage pool information, expiration time, and so on.
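
A brief sketch of how the command might be used; the file path is an assumption and the exact output fields vary with the GPFS version:

# Show the extended GPFS attributes of one file, including its storage pool and fileset
mmlsattr -L /data/project/report.dat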

mmapplypolicy

GPFS uses policies to manage files, and this command can perform various operations on the GPFS file system according to user-defined policy files, which is very efficient.

Difficulties in automatic cleaning of a large number of files

Linux file deletion mechanism

Linux controls file deletion through link counts: a file is deleted only when it has no links left. Each file has two counters, i_count and i_nlink. i_count is the number of current users of the file, and i_nlink is the number of on-disk links; put another way, i_count is an in-memory reference counter and i_nlink is an on-disk reference counter. i_count increases when a process opens the file, and i_nlink increases when a hard link to the file is created.

What rm does is decrease i_nlink. This raises a question: what happens if a file is being used by a process and the user deletes it with rm? After the rm, `ls` and other file management commands can no longer find the file, but the process keeps running normally and can still read its contents correctly. That is because the `rm` operation only brings i_nlink down to 0; since the file is still in use by the process, i_count is not 0, so the system does not actually remove the file. In other words, i_nlink reaching zero is the precondition for deletion, but the file is only truly removed once i_count drops to zero as well.
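
The following small experiment makes the mechanism visible; it is only a sketch, and demo.txt and the background tail reader are invented for the demonstration:

echo hello > demo.txt
tail -f demo.txt &               # a process now holds the file open, so i_count > 0
rm demo.txt                      # removes the directory entry: i_nlink drops to 0
ls demo.txt                      # "No such file or directory" as far as ls is concerned
ls -l /proc/$!/fd | grep demo    # the open descriptor still shows "demo.txt (deleted)"
kill $!                          # once the reader exits, i_count reaches 0 and the space is freed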

When deleting individual files we may not need to care about this mechanism at all, but for mass file deletion it becomes a very important factor. Allow me to elaborate in the following sections; for now, just keep the Linux file deletion mechanism in mind.

Generating the list of files to be deleted

When a folder holds 10 files, `ls` shows everything at a glance, and you can even use `ls` to inspect the detailed attributes of every file. When the count grows to 100, `ls` is probably only good for listing names; at 1,000, paging through a few more screens may still be acceptable; at 10,000, `ls` may take a long time to return; at 100,000, many systems barely respond, or a wildcard expansion fails with "Argument list too long". It is not only `ls`: other common Linux management commands hit similar problems, because the shell and kernel limit the total length of a command's argument list. Even if that limit could be raised, it would not improve the efficiency of the command itself. For a very large file system, the time spent waiting for ordinary file management commands such as `ls` and `find` to return is unacceptable.
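
A sketch of the difference (the directory and pattern are assumptions): expanding a wildcard over hundreds of thousands of files blows past the argument-length limit, while letting find pass the names to rm in batches does not:

# Fails with "Argument list too long" once the expanded file list exceeds the limit
rm /data/logs/*.tmp

# Works regardless of the number of matches: find batches the arguments itself
find /data/logs -maxdepth 1 -type f -name '*.tmp' -exec rm -f {} +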

So how do we generate the to-delete list on an even larger file system? A high-performance file system index is a good answer, but building such an index is the privilege of a few (which also explains why Google and Baidu can make so much money). Fortunately, file systems of this size generally sit on high-performance file systems that provide very powerful management functions. For example, the mmapplypolicy command of IBM's General Parallel File System (GPFS), mentioned earlier, scans inodes directly to sweep the whole file system quickly and can return a list of files matching specified conditions. The practice section below demonstrates how to obtain a file list based on timestamps and file types.

Effect of deadlock on file deletion performance

Consider a system that runs a file deletion task every day: it first generates the list of files to be deleted, then takes the list as input and performs the deletions. If the list on some day is so large that the deletion task has not finished by the time the next day's task starts, what happens?

A file that was not deleted on the first day shows up again in the second day's list, so the second day's deletion process also tries to delete it. Now the first day's process and the second day's process are both trying to delete the same files, the system throws a large number of unlink failures, and deletion performance drops sharply. Because of the slowdown, the second day's files are also left undeleted, the third day's process makes the contention worse, and the system slides into a vicious circle of ever-declining deletion performance.

Can the problem be solved by simply deleting the first day's to-delete list file? No. As the Linux file deletion mechanism described earlier shows, deleting the list file only brings its i_nlink to zero. As long as the first day's deletion process has not finished, the list file's i_count is not zero, so the file is not actually removed. The first day's list is only truly deleted after that process has worked through every file in the list and exited.

At the very least, before a new deletion process starts, we need to terminate any other deletion processes still running in the system, so that this deletion deadlock cannot occur. Even so there are drawbacks: in the extreme case where the deletion process keeps failing to finish within one cycle, the to-delete list keeps growing, the file scan takes longer and longer, the time left for the deletion work shrinks, and we fall into another vicious circle.
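
One common way to guarantee that a new run never overlaps a still-running one is to wrap the cron job in a lock. This is only a sketch (the lock file path and script name are assumptions), not part of the setup described later in this article:

# In crontab: skip tonight's run entirely if the previous deletion job still holds the lock
0 2 * * * flock -n /var/lock/trash_clear.lock /path/trash_clear.sh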

Practical experience also tells us that when the to-delete list is particularly large, the performance of the deletion process itself drops. An input file of appropriate size keeps the process running effectively, so splitting the full to-delete list into a series of fixed-size files lets the deletion run stably and efficiently. Splitting into multiple files also lets us run several deletion processes concurrently, as long as storage and host performance allow.
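
A hedged sketch of that idea (the list name, the 10,000-line chunk size, and the parallelism of 4 are assumptions):

# Break the full list into sublists of 10000 file names each
split -l 10000 -d trash.lst trash_split_

# Run up to 4 deletion workers at once, each consuming one sublist
ls trash_split_* | xargs -P 4 -I {} sh -c 'xargs -r rm -f < "$1"' _ {}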

Best practices for automatic cleaning of large numbers of files

Best practice for large-scale automatic file cleanup under the GPFS file system

The following is a practice of automatic file cleaning on a GPFS file system holding hundreds of millions of files. The hardware environment is two IBM x3650 servers and a DS4200 disk array with 50TB of storage capacity, running Linux and GPFS v3.2. The goal is to run a file cleanup at 2:00 AM every day, deleting files that have not been accessed for 30 days and all files ending with tmp.

An mmapplypolicy scan shows that there are 323784950 files and 158696 directories on the system.

The code is as follows:

.

[I] Directories scan: 323784950 files, 158696 directories

0 other objects, 0 'skipped' files and/or errors.

.

Define the search rules as follows and save them as trash_rule.txt:

The code is as follows:

RULE EXTERNAL LIST 'trash_list' EXEC ''
RULE 'exp_scan_rule' LIST 'trash_list' FOR FILESET ('data')
    WHERE DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME) > 30
RULE 'tmp_scan_rule' LIST 'trash_list' FOR FILESET ('data') WHERE NAME LIKE '%.tmp'

Run mmapplypolicy, together with grep and awk, to generate the complete list of files to be deleted, then use the split command to break the complete list into sublists of 10000 file names each:

The code is as follows:

mmapplypolicy /data -P trash_rule.txt -L 3 | grep "/data" | awk '{print $1}' > trash.lst
split -a 4 -l 10000 -d trash.lst trash_split_

Execute the following command to delete:

The code is as follows:

for a in trash_split_*
do
    rm `cat $a`
done
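
The backquoted rm `cat $a` above puts all 10,000 names from a sublist onto one command line, which normally still fits under the argument-length limit at this sublist size. An equivalent form using xargs (a sketch, not the original script; either form can serve as the body of trash_clear.sh) avoids that limit entirely:

for a in trash_split_*
do
    xargs -r rm -f < "$a"
done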

Save the above operation as trash_clear.sh, and then define the crontab task as follows:

The code is as follows:

0 2 * * * /path/trash_clear.sh

First run the deletion task manually. The scan result for the files to be deleted is as follows:

The code is as follows:

[I] GPFS Policy Decisions and File Choice Totals:

Chose to migrate 0KB: 0 of 0 candidates

Chose to premigrate 0KB: 0 candidates

Already co-managed 0KB: 0 candidates

Chose to delete 0KB: 0 of 0 candidates

Chose to list 1543192KB: 1752274 of 1752274 candidates

0KB of chosen data is illplaced or illreplicated

While the deletion is running, we can use the following command to measure how many files are deleted per minute. As the output below shows, the deletion rate is about 1546 files per minute:

The code is as follows:

df -i /data; sleep 60; df -i /data
Filesystem        Inodes     IUsed      IFree IUse% Mounted on
/dev/data     2147483584 322465937 1825017647   16% /data
Filesystem        Inodes     IUsed      IFree IUse% Mounted on
/dev/data     2147483584 322467483 1825016101   16% /data
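
For convenience, the change can also be computed directly instead of being read off by eye; this sketch assumes the df -i output format shown above:

# Sample the used-inode count twice, 60 seconds apart, and print the change
i1=$(df -i /data | awk 'NR==2 {print $3}')
sleep 60
i2=$(df -i /data | awk 'NR==2 {print $3}')
echo "change in used inodes over one minute: $((i2 - i1))"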

The whole deletion run is timed with the `time` command. As the output shows, the deletion took 1168 minutes in total (about 19.5 hours):

The code is as follows:

time trash_clear.sh
real    1168m0.158s
user    57m0.168s
sys     2m0.056s

Of course, the GPFS file system itself provides other cleanup mechanisms; for example, mmapplypolicy can perform the file deletions directly, which makes even more efficient cleanup possible. The purpose of this article is to discuss a general approach to cleaning up a large number of files, so cleanup based on features specific to one file system is not discussed further; interested readers can try it themselves.

That is the end of "How to automatically clean up a large number of files under Linux". Thank you for reading. If you want to learn more about the industry, you can follow the site, where the editor will keep publishing practical articles!
