
How to fix a distributed file system failure in a production environment

2025-01-30 Update From: SLTechnology News&Howtos


This article explains how to troubleshoot and fix a distributed file system that stopped accepting uploads in a production environment. It walks through locating the problem, analyzing the cause, and applying the fix.

Locating the problem

After logging in to the server and checking the system's access log, I found the following exception in the log file.

org.csource.common.MyException: getStoreStorage fail, errno code: 28
    at org.csource.fastdfs.StorageClient.newWritableStorageConnection(StorageClient.java:1629)
    at org.csource.fastdfs.StorageClient.do_upload_file(StorageClient.java:639)
    at org.csource.fastdfs.StorageClient.upload_file(StorageClient.java:162)
    at org.csource.fastdfs.StorageClient.upload_file(StorageClient.java:180)

The stack trace shows that the failure occurs while uploading a file: the client cannot obtain a writable storage connection. This log entry is the key piece of evidence for the rest of the troubleshooting.
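The error code in the exception is worth decoding before going further. On Linux, errno 28 is ENOSPC ("no space left on device"). The sketch below pulls the code out of a log line and describes it; the helper names `errno_from_log` and `describe_errno` are made up for illustration, and only code 28 is mapped.

```shell
# errno_from_log: read log text on stdin and print the first
# "errno code" value found (hypothetical helper, not part of FastDFS).
errno_from_log() {
    sed -n 's/.*errno code: \([0-9][0-9]*\).*/\1/p' | head -n 1
}

# describe_errno CODE: print a short description; only 28 is mapped here.
describe_errno() {
    case "$1" in
        28) echo "ENOSPC: no space left on device" ;;
        *)  echo "errno $1: not mapped in this sketch" ;;
    esac
}

# The first line of the exception from the access log.
line='org.csource.common.MyException: getStoreStorage fail, errno code: 28'
code=$(printf '%s\n' "$line" | errno_from_log)
describe_errno "$code"
```

To scan a whole log file instead of a single line, pipe the file into `errno_from_log`.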

Analyzing the cause

Since uploads were failing, I first checked whether previously uploaded files could still be accessed. They could, which confirmed that the problem was limited to uploading.

A distributed file system in production normally runs without trouble, so when uploads suddenly fail, the most likely cause is that the server has run out of disk space. I followed that line of thinking.

I ran df -h to check the server's storage usage, which had reached 91%.

So disk space could well be the cause. The next step was to confirm it.

I opened tracker.conf in the /etc/fdfs/ directory and saw that the reserved storage space was set to 10% (note: the distributed file system here is FastDFS).

At this point it was clear that the upload failure was caused by insufficient disk space.

To summarize: 91% of the server's disk space was in use, leaving 9% free, while the distributed file system was configured to reserve 10%. When a file was uploaded, the system detected that the remaining disk space was below the 10% reserve, threw an exception, and refused the upload.

With the cause determined, the next step was to fix it.
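The comparison described above can be sketched as a small shell check. The function names `used_pct` and `uploads_blocked` are hypothetical, and the strict "less than" comparison follows the description in this article rather than the FastDFS source, so treat it as an approximation.

```shell
# used_pct PATH: print the used-space percentage of the filesystem
# that holds PATH, parsed from POSIX-format df output.
used_pct() {
    df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# uploads_blocked USED RESERVED: exit 0 when the free percentage is
# below the reserved percentage, i.e. when uploads would be refused.
uploads_blocked() {
    free=$((100 - $1))
    [ "$free" -lt "$2" ]
}

# Values from this incident: 91% used, 10% reserved.
if uploads_blocked 91 10; then
    echo "free space is below the reserved threshold; uploads will be refused"
fi
```

On a live server the two helpers could be combined, for example `uploads_blocked "$(used_pct /opt)" 10`, where `/opt` stands in for whatever filesystem holds the storage directory.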

Solving the problem

There are two ways to solve this problem: delete unwanted files, or expand the disk space.

Delete unwanted files

Use this approach with caution. I will only introduce it briefly, with a few ways to delete files recursively.

Recursively delete files with the .pyc extension

find . -name '*.pyc' -exec rm -rf {} \;

Print files of a specified size under the current directory

find . -name "*" -size 145800c -print

Recursively delete files of a specified size (145800 bytes)

find . -name "*" -size 145800c -exec rm -rf {} \;

Recursively delete files of a specified size and print their names

find . -name "*" -size 145800c -print -exec rm -rf {} \;

Here is a brief description of the parts of these commands.

"." starts a recursive search from the current directory.

-name '*.exe' matches by name, finding all directories or files whose names end in .exe.

-type f restricts the results to regular files.

-print outputs the path of each match.

-size 145800c matches files whose size is exactly 145800 bytes (the c suffix means bytes).

-exec rm -rf {} \; runs rm -rf on each result of the preceding search; {} is replaced by the matched path and \; ends the command.
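Because deletion cannot be undone, one cautious pattern is to list the matches first and delete only after reviewing them. The sketch below runs in a throwaway directory created with mktemp (the file names are invented for the demonstration) and adds -type f so that directories are never matched, which also means plain rm -f is enough.

```shell
# Build a small demo tree so nothing real is touched.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/pkg"
touch "$demo_dir/a.pyc" "$demo_dir/pkg/b.pyc" "$demo_dir/keep.py"

# Step 1: dry run. Print what would be removed (regular files only).
find "$demo_dir" -type f -name '*.pyc' -print

# Step 2: delete exactly the same set of files.
find "$demo_dir" -type f -name '*.pyc' -exec rm -f {} \;

# Step 3: show what is left; only keep.py should remain.
find "$demo_dir" -type f -print
```

Running the same expression for the dry run and for the deletion keeps the two steps in sync, so what was reviewed is what gets removed.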

Expand disk space

This is the method Glacier recommends, and it is the one I used to repair the failure in the production environment.

Checking the server's disks showed that the /data directory has a full 5 TB of space. So why had operations not pointed the file system's data storage directory at /data? I started migrating the file system's data storage directory to /data, as described in the following steps.

Note: here I simply simulate migrating the data under /opt/fastdfs_storage_data to /data.

(1) Copy the files to migrate the data

cp -r /opt/fastdfs_storage_data /data
cp -r /opt/fastdfs_storage /data
cp -r /opt/fastdfs_tracker /data
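Before switching the configuration over, it may be worth confirming that the copy is complete. The sketch below is a variation rather than the procedure used in this article: it uses cp -a (assumed to be available, as in GNU and BSD cp) to preserve ownership, permissions, and timestamps, then compares the trees with diff -r. It runs against temporary directories with an invented sample file, not the real /opt and /data paths.

```shell
# Stand-ins for /opt and /data, created only for this demonstration.
src_root=$(mktemp -d)
dst_root=$(mktemp -d)
mkdir -p "$src_root/fastdfs_storage_data/data/00"
echo "sample" > "$src_root/fastdfs_storage_data/data/00/file.txt"

# Copy the tree, preserving ownership, permissions, and timestamps.
cp -a "$src_root/fastdfs_storage_data" "$dst_root"

# Compare source and destination recursively; diff exits 0 when identical.
if diff -r "$src_root/fastdfs_storage_data" "$dst_root/fastdfs_storage_data" >/dev/null; then
    copy_ok=yes
else
    copy_ok=no
fi
echo "copy verified: $copy_ok"
```

On a multi-terabyte store a full diff -r can take a long time, so comparing file counts or sizes first is a reasonable shortcut.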

(2) Modify the paths

The following FastDFS configuration files need to be changed: /etc/fdfs/storage.conf, mod_fastdfs.conf, client.conf, and tracker.conf.

/etc/fdfs/storage.conf

store_path0=/data/fastdfs_storage_data
base_path=/data/fastdfs_storage

/etc/fdfs/mod_fastdfs.conf

store_path0=/data/fastdfs_storage_data (appears in two places)
base_path=/data/fastdfs_storage

/etc/fdfs/client.conf

base_path=/data/fastdfs_tracker

/etc/fdfs/tracker.conf

base_path=/data/fastdfs_tracker

Re-create the M00 symbolic link to the storage directory:

ln -s /data/fastdfs_storage_data/data/ /data/fastdfs_storage_data/data/M00
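The path changes above can also be scripted instead of edited by hand. The following is a minimal sketch that rewrites base_path and store_path0 with sed; it operates on a temporary sample file whose old values (/opt/...) are taken from the migration source in this article, and it assumes each key appears at the start of a line. Back up the real files under /etc/fdfs/ before trying anything similar on them.

```shell
# A sample config standing in for /etc/fdfs/storage.conf.
conf=$(mktemp)
printf '%s\n' \
    'base_path=/opt/fastdfs_storage' \
    'store_path0=/opt/fastdfs_storage_data' > "$conf"

# Rewrite both keys, tolerating optional spaces around "=".
# Writing to a new file and renaming avoids relying on sed -i.
sed -e 's|^base_path[[:space:]]*=.*|base_path=/data/fastdfs_storage|' \
    -e 's|^store_path0[[:space:]]*=.*|store_path0=/data/fastdfs_storage_data|' \
    "$conf" > "$conf.new" && mv "$conf.new" "$conf"

# Show the result.
cat "$conf"
```

Because the substitution matches every line that starts with the key, it also covers a file in which store_path0 appears in two places.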

(3) Kill the processes and restart the services (tracker and storage)

Run the following commands in order:

pkill -9 fdfs
service fdfs_trackerd start
service fdfs_storaged start

(4) Modify the file read path in the nginx configuration

location ~ /group1/M00 { root /data/fastdfs_storage_data/data; }

(5) Reload nginx

cd /opt/nginx/sbin
./nginx -s reload
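A reload with a broken configuration is easy to avoid by testing the configuration first with nginx -t. The wrapper below is a sketch, not part of the original procedure; the function name `reload_if_valid` is hypothetical, and the binary path is passed in as an argument so that nothing is assumed about where nginx is installed.

```shell
# reload_if_valid NGINX_BIN: run the config test and reload only if it passes.
reload_if_valid() {
    nginx_bin=$1
    if "$nginx_bin" -t; then
        # Config is syntactically valid; signal the master process to reload.
        "$nginx_bin" -s reload
    else
        echo "nginx config test failed; not reloading" >&2
        return 1
    fi
}

# Example call for the install location used in this article:
#   reload_if_valid /opt/nginx/sbin/nginx
```

If the test fails, nginx keeps running with its previous configuration and the error output from nginx -t points at the offending line.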
