How to analyze the problems of using MooseFS

2026-03-14 Update From: SLTechnology News&Howtos shulou

Shulou(Shulou.com)06/01 Report--

This article describes how to analyze problems that come up when using MooseFS (MFS). It is fairly detailed, and I hope it is useful as a reference.

As the amount of data has grown, some problems have appeared in our use of MFS. Below are some analyses and a summary.

Let's start with the two kinds of problems that show up most often in the logs when MFS goes wrong:

Connection interruption

Bad blocks

For the connection interruption problem, the following errors appear on the Master side:

mfsmaster[15861]: connection with client(ip: 10.11.18.175) has been closed by peer
Indicates that the connection between the Client and the Master was broken.

mfsmaster[15861]: connection with ML(10.11.19.76) has been closed by peer
Indicates that the connection between the Metalogger and the Master was broken.

mfsmaster[15861]: connection with CS(10.11.18.199) has been closed by peer
Indicates that the connection between the ChunkServer and the Master was broken.
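To see which peers disconnect most often, the three message forms above can be tallied from the Master log. The following is a minimal sketch, not part of MooseFS: the regular expression is derived only from the sample lines shown here, and the function name `count_disconnects` is made up for illustration.

```python
import re
from collections import Counter

# Pattern derived from the sample lines above; the peer labels
# (client, ML, CS) are the ones that appear in those lines.
DISCONNECT_RE = re.compile(
    r"connection with (?P<kind>client|ML|CS)\((?:ip: )?(?P<ip>[\d.]+)\) "
    r"has been closed by peer"
)

PEER_NAMES = {"client": "Client", "ML": "Metalogger", "CS": "ChunkServer"}


def count_disconnects(lines):
    """Count 'closed by peer' events per (peer type, IP address)."""
    counts = Counter()
    for line in lines:
        match = DISCONNECT_RE.search(line)
        if match:
            counts[(PEER_NAMES[match["kind"]], match["ip"])] += 1
    return counts


sample = [
    "mfsmaster[15861]: connection with client(ip: 10.11.18.175) has been closed by peer",
    "mfsmaster[15861]: connection with ML(10.11.19.76) has been closed by peer",
    "mfsmaster[15861]: connection with CS(10.11.18.199) has been closed by peer",
    "mfsmaster[15861]: connection with CS(10.11.18.199) has been closed by peer",
]
for (peer, ip), n in sorted(count_disconnects(sample).items()):
    print(f"{peer} {ip}: {n}")
```

A peer that shows up far more often than the others is a reasonable place to start looking for timeouts rather than occasional network glitches.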

The possible causes are as follows:

Network glitch: this is normal. MFS reconnects automatically, so it is not a problem.

Active disconnection by the Client or ChunkServer: killing the process, for example, also produces this error.

Connection timeout from the ChunkServer or Client to the Master: this also closes the connection. A timeout can have two causes:

Too many Client requests fill the Master's request queue, so the connection times out.

Slow network response causes the timeout (as distinct from a brief network glitch).

Solution:

Interruptions caused by network glitches or by active disconnection can be ignored. Focus on those caused by timeouts:

Timeouts from too many Client requests: control the request load coming from Clients, such as very highly concurrent reads, writes, and deletes. Another operation to watch is ls. On Linux, a command that has to expand every name in a very large directory (for example, 100,000 files) can fail with "Argument list too long". Similarly, when traversing a directory on MFS, keep in mind that traversing too many files can lead to timeouts and, in turn, interrupted connections.

Timeouts from slow network response: allocate bandwidth sensibly and optimize the network environment.
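One way to avoid an unbounded ls on a huge directory is to read entries lazily and stop at a cap. The sketch below uses Python's `os.scandir`; the function name `list_capped` and the default limit of 1000 are hypothetical. Whether stopping early actually reduces load on the Master depends on how the MFS client fetches directory contents, which this article does not cover, so treat it as a client-side safeguard only.

```python
import os
import tempfile
from itertools import islice


def list_capped(path, limit=1000):
    """Return at most `limit` entry names from `path`, plus a truncation flag.

    os.scandir yields entries lazily, so iteration stops after
    limit + 1 entries instead of collecting the whole directory.
    """
    with os.scandir(path) as entries:
        names = [entry.name for entry in islice(entries, limit + 1)]
    truncated = len(names) > limit
    return names[:limit], truncated


# Small demonstration on a temporary directory rather than a real MFS mount.
with tempfile.TemporaryDirectory() as demo_dir:
    for i in range(5):
        open(os.path.join(demo_dir, f"file_{i}.jpg"), "w").close()
    names, truncated = list_capped(demo_dir, limit=3)
    print(len(names), truncated)
```

The truncation flag lets the caller report that a directory is too large to list in full instead of silently returning a partial result.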

Remarks:

After the connection from a Client or ChunkServer to the Master is interrupted, the Client or ChunkServer automatically reconnects and registers again.

Bad Block Problem

The following errors appear on the Master side:

mfsmaster[3250]: chunkserver has nonexistent chunk (000000000002139F_00000001), so create it for future deletion
mfsmaster[3250]: (10.11.18.199:9422) chunk: 000000000002139F creation status: 20
mfsmaster[3250]: chunk 000000000002139F has only invalid copies (1) - please repair it manually
mfsmaster[3250]: chunk 000000000002139F_00000001 - invalid copy on (10.11.18.199 - ver:00000000)
mfsmaster[3250]: currently unavailable chunk 000000000002139F (inode: 135845 ; index: 23)

These log entries mean that the Master has metadata for a block, but the ChunkServer does not have the block. The system automatically creates the block on the ChunkServer so that it can be deleted later. Because the created block has no content, it is an invalid copy, and the block cannot be accessed.
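The "currently unavailable chunk" line carries the chunk ID together with the inode and chunk index of the affected file, which is what you need to track down the damaged file. Here is a minimal sketch of extracting those fields. The pattern is based only on the sample line above, and `unavailable_chunks` is a name made up for illustration.

```python
import re

# Pattern based on the "currently unavailable chunk" sample line above.
UNAVAILABLE_RE = re.compile(
    r"currently unavailable chunk (?P<chunk>[0-9A-F]{16}) "
    r"\(inode: (?P<inode>\d+) ; index: (?P<index>\d+)\)"
)


def unavailable_chunks(lines):
    """Yield (chunk_id, inode, index) for each matching log line."""
    for line in lines:
        match = UNAVAILABLE_RE.search(line)
        if match:
            yield match["chunk"], int(match["inode"]), int(match["index"])


sample = [
    "mfsmaster[3250]: chunk 000000000002139F has only invalid copies (1) - please repair it manually",
    "mfsmaster[3250]: currently unavailable chunk 000000000002139F (inode: 135845 ; index: 23)",
]
for chunk_id, inode, index in unavailable_chunks(sample):
    print(chunk_id, inode, index)
```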

There can be many causes, for example:

During a large file transfer on the Client side, the Master host's power was forcibly cut, so the Master shut down uncleanly. After repairing with mfsmetarestore -a, the Master log reported bad blocks.

The location where the ChunkServer stores csstats.mfs ran out of space, so file blocks could not be written, which also causes block errors.

Block files were deleted manually on the ChunkServer.

After a file was deleted, the Master terminated abnormally and was restarted, but no changelog.mfs was available to restore from, which also causes bad blocks.

There are probably many more causes; I will add them as I run into them.

Solution:

On the Client, use mfsfilerepair to repair the file.

As I understand it, there are two kinds of bad blocks:

In the first, no ChunkServer node has the data. The repair actually generates the chunk and fills the missing content with zeros; such chunks should be deleted afterwards.

In the second, some nodes still have the data block. The repair copies from the existing block, and the block does not need to be deleted.
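Before running the repair, you need the path of the file whose inode appears in the Master log. The sketch below searches a subtree for a matching inode number and builds (but does not run) the repair command. It assumes that the inode number in the Master log matches `st_ino` as seen through the client mount, which should be verified on your own system. The helper names are hypothetical, and the article does not show any mfsfilerepair options, so none are used. Walking a tree is itself a heavy traversal, so point `root` at the smallest subtree that can contain the file.

```python
import os


def paths_for_inode(root, inode):
    """Yield every path under `root` whose inode number equals `inode`."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.lstat(path).st_ino == inode:
                    yield path
            except OSError:
                # The entry disappeared or is unreadable; skip it.
                continue


def repair_command(path):
    """Build the argument list for repairing one file (not executed here)."""
    return ["mfsfilerepair", path]


# Example (hypothetical mount point and subtree):
#   for path in paths_for_inode("/mnt/mfs/photos", 135845):
#       print(" ".join(repair_command(path)))
```

Printing the command instead of running it leaves room to check the path first, since a repair that fills missing data with zeros changes the file's contents.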

The following log messages may appear after the repair:

mfsmaster[3250]: chunk hasn't been deleted since previous loop - retry
mfsmaster[3250]: (10.11.18.199:9422) chunk: 000000000002139F deletion status: 13

After the Client performs an mv or rm operation on the affected file, the Master no longer logs these messages. For example:

mv 80499644316259743_s.jpg 80499644316259743_s_1.jpg