This article walks through how the file system LFS was designed. After reading it, I hope you come away with something useful.
LFS is an ultra-fast file system that can store both large and small files. And it is more than a file system: I also use it as a database.
My benchmarks show that its read and write speed is nearly the same as reading and writing files directly in C.
Project address: github git@osc
Note: this is an open source project and you are free to use it. For development contributions, however, I need to review and confirm whether the work can be undertaken, so you need to @ME first.
Before designing, two goals need to be clear: high concurrency and ultra-fast reads and writes.
To achieve high concurrency, concurrent operations must be isolated from one another, each doing its own work without interference. Concretely: A and B can read the same file or different files at the same time, and can write different files at the same time, but they cannot write the same file at the same time.
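To make that rule concrete, here is a minimal sketch of mine (not from the LFS source) using one read-write lock per file ID: readers of the same or different files proceed together, writers to different files proceed together, but two writers to the same file are serialized.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch: one read-write lock per file ID enforces the rule
// "A and B may read the same file or write different files concurrently,
//  but may not write the same file at the same time".
public class FileLocks {
    private final ConcurrentHashMap<Integer, ReentrantReadWriteLock> locks =
            new ConcurrentHashMap<>();

    private ReentrantReadWriteLock lockFor(int fileId) {
        return locks.computeIfAbsent(fileId, id -> new ReentrantReadWriteLock());
    }

    public void read(int fileId, Runnable body) {
        ReentrantReadWriteLock lock = lockFor(fileId);
        lock.readLock().lock();          // many concurrent readers allowed
        try { body.run(); } finally { lock.readLock().unlock(); }
    }

    public void write(int fileId, Runnable body) {
        ReentrantReadWriteLock lock = lockFor(fileId);
        lock.writeLock().lock();         // exclusive per file, not global
        try { body.run(); } finally { lock.writeLock().unlock(); }
    }
}
```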
To achieve ultra-fast reads and writes, file operations must stay fast even under high concurrency, and how files are located is critical. In fact, the file-location design is precisely what makes the high concurrency possible.
With that, my job becomes clear and simple: implement a very good file-location scheme.
For speed, I did not use file names (they are unnecessary in my application scenario); instead, a file's name is simply its file ID. I use an int for this, which means I can address up to 2^32 - 1 files, already large enough to hold almost all the files of a large service. (To reach mass storage and distribution, we can very simply extend this file ID into a globally unique one. Why is that easy? Because file IDs are self-growing, we can ignore the int limit, meaning file IDs can be variable-length.)
File IDs are self-growing, so for upper-level applications the file name is generated by the file system. The advantage is that the name stays small, and because the scheme is designed this way, the file system knows how to assign a file ID to an upper-level application; from that file ID, an ideal file-location strategy can be realized. In my implementation there is nothing mysterious about the file ID: no algorithm is needed to compute it, it simply grows on its own, which is just right.
Set the file ID aside for a moment; mark it, and we will come back to it. Let's first look at how to locate file content quickly. The best way is to know directly where the content is stored (its start and its end), so we use two ints to hold these two pieces of information, and then we can jump straight to the actual content. So far so good. The problem is that we have many files, each needing its own location record, so we dedicate a separate file to storing the location records that map to the actual content. I made an index file to record where each file's content is stored (some people call this metadata, but I don't see it that way; in LFS it is just an index file, and you will see later why it is named that). With this index file I can record the location information of many files: look it up in the index and you know where the content lives. Very simple.

However, as files multiply, the actual content grows much faster than the index, and at some point we can no longer write new content into a single file. We are limited by the operating system's file format, which may not let a single file grow arbitrarily large; and since we used two ints to record locations, the content file cannot exceed 4G anyway. So we chunk the file that stores the actual content, using many, many chunk files to hold content beyond the single-file limit; in LFS these are called Blocks.
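As a sketch of the jump-straight-to-content idea (my illustration, with hypothetical names): once the index yields the start and end offsets, reading the content is one seek plus one bounded read, with no scanning.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical sketch: given the (start, end) offsets recorded in the index,
// the file content is fetched with a single seek and a single bounded read.
public class BlockReader {
    // Reads bytes [start, end) from the given block file.
    public static byte[] readContent(String blockPath, int start, int end)
            throws IOException {
        try (RandomAccessFile block = new RandomAccessFile(blockPath, "r")) {
            byte[] content = new byte[end - start];
            block.seek(start);              // jump straight to the content
            block.readFully(content);       // no scanning, no lookup
            return content;
        }
    }
}
```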
Now we can store a great deal of data, but the one-to-one mapping between Index and Block is broken (in fact, I like that this happens, because it is the first step toward high concurrency). Why say that? You could keep the correspondence, with each Index matched to exactly one Block. But then there would be many Indexes, and if a Block stored only one file, its Index would be tiny (only 8 bytes), leaving the Indexes severely fragmented; and once the Indexes are fragmented, locating file content through them slows down. So we break the Index-to-Block correspondence and use another way to relate Blocks to Indexes: each Index records the Block ID for each file, which adds one more int per record. This way one Index can reference many Blocks, never fragments, and grows smoothly.
This gives us one kind of concurrency. Since every file records its own Block ID, operating on different files means operating on different Blocks; and because each Block is independent, any number of Blocks can be processed freely at the same time without affecting one another, as long as no two operations touch the same Block at the same moment. So far, we have achieved partial concurrency.
The location information recorded in the Index can also be processed concurrently: once a record is written it never changes, so index reads can be fully concurrent, giving us high-concurrency reads.
However, there is a problem the previous paragraphs glossed over: how is the Index written?
When LFS receives a file-write request, it must find a free Block and record the Block ID plus the location within that Block in the Index; that is, each Index record is 12 bytes. Single-threaded, there is no problem: the Index just keeps appending. But under concurrency we run into trouble: which thread writes the Index's contents? The records might get interleaved and corrupted. So here our road to high concurrency is blocked, and index writing becomes single-threaded. Fortunately, each Index write is tiny, only 12 bytes at a time, so it is fast, which limits the impact on concurrency.
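A minimal sketch of that single-writer constraint (my illustration, not the LFS source): appends to one Index are serialized so the 12-byte records (Block ID, start, end) can never interleave.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: appends to one Index file are serialized, so the
// 12-byte records (blockId, start, end) can never interleave or corrupt.
public class IndexWriter {
    private final FileChannel channel;

    public IndexWriter(String indexPath) throws IOException {
        this.channel = FileChannel.open(Paths.get(indexPath),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    // synchronized = exactly one writer thread per Index at a time.
    public synchronized void append(int blockId, int start, int end)
            throws IOException {
        ByteBuffer record = ByteBuffer.allocate(12);
        record.putInt(blockId).putInt(start).putInt(end).flip();
        while (record.hasRemaining()) {
            channel.write(record);       // a tiny 12-byte write, so fast
        }
    }
}
```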
Just as with Blocks, we can use many Indexes to regain concurrency: only one thread writes each Index, so write concurrency goes up by an order of magnitude.
At this point, let's spell out the index record format: BlockID:int, start:int, end:int, 12 bytes in total, one record per file. But... wait. Where's our file ID?
The answer is: No file ID.
Now let's return to the file ID we marked earlier. As I said, no file ID is stored, so how do we find the corresponding file content, that is, the location information in the Index?
Honestly, I dislike the file ID business, so I avoided it: there is really no need to store it. If we used file names we would have to store them, but fortunately LFS committed to file IDs from its earliest design, so why would I waste bytes storing them? Because of the design above, we never store file IDs, which means there is no lookup process anywhere in LFS, not even in the Index.
The scheme LFS uses is to find content by calculation, and only calculation. Since every record in the Index is the same 12 bytes, LFS multiplies the file ID passed in by the application layer by 12 to get that record's position in the Index, then reads those 12 bytes and operates on the corresponding Block. Does the index need to be hashed or sorted? Apparently not. That is why I call it an index: it plays a far more active role than mere metadata.
However, because there are multiple Index files, we first need to know which Index holds a given file ID. Fortunately, the nominal size of every Index is the same, that is, each Index records the same number of file locations, and each Index is numbered. Suppose Index 0 stores the locations of file IDs [0 - 99] and Index 1 stores file IDs [100 - 199]. To read file ID 109: Math.floor(109 / 100) = 1, so it lives in Index number 1; the slot within that Index is 109 - (1 * 100) = 9; and since each record is 12 bytes, the offset within the Index is 9 * 12 = 108. Read the next 12 bytes at that offset to get the Block information, then operate on the Block accordingly.
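The same arithmetic in code form (a sketch; the capacity of 100 records per Index comes from the example above, and naming each Index file by its number is my assumption, not necessarily how LFS lays files out):

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical sketch of the pure-calculation lookup from the example:
// file ID 109, 100 records per Index, 12 bytes per record.
public class IndexLookup {
    static final int RECORDS_PER_INDEX = 100;  // example value from the text
    static final int RECORD_SIZE = 12;         // blockId + start + end

    // Returns { blockId, start, end } for the given file ID.
    public static int[] locate(String indexDir, int fileId) throws IOException {
        int indexNo = fileId / RECORDS_PER_INDEX;            // 109 / 100 = 1
        int slot    = fileId - indexNo * RECORDS_PER_INDEX;  // 109 - 100 = 9
        long offset = (long) slot * RECORD_SIZE;             // 9 * 12 = 108
        try (RandomAccessFile index =
                     new RandomAccessFile(indexDir + "/" + indexNo, "r")) {
            index.seek(offset);
            return new int[] { index.readInt(),   // blockId
                               index.readInt(),   // start
                               index.readInt() }; // end
        }
    }
}
```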
Because file IDs and Indexes are implicitly associated, allocating a free Index slot under concurrency is the same as allocating a file ID, which is exactly what yields the self-growth of file IDs. It really is just right.
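A sketch of that implicit association (my illustration): handing out the next free slot of an Index *is* generating the next file ID, with no separate counter or algorithm.

```java
// Hypothetical sketch: because file IDs and Index slots are implicitly
// linked, allocating the next free slot of an Index is the same as
// generating the next file ID.
public class FileIdAllocator {
    static final int RECORDS_PER_INDEX = 100;  // example value from the text

    private final int indexNo;   // which Index this allocator owns
    private int recordCount;     // records already written to that Index

    public FileIdAllocator(int indexNo, int existingRecords) {
        this.indexNo = indexNo;
        this.recordCount = existingRecords;
    }

    // One thread writes per Index, so a plain increment is safe here.
    public synchronized int nextFileId() {
        return indexNo * RECORDS_PER_INDEX + recordCount++;
    }
}
```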
This is the core principle of LFS. What reason could I have to be unhappy with it?
The core principle looks simple, but LFS implements much more: it supports updates and deletions, and updates especially are genuinely complex to implement.
Does LFS use extension blocks for updates the way other file systems do? No. Why would I? We already have Blocks, so updates can work directly on Blocks. LFS provides several update strategies; let's describe the simplest one first:
One way is to delete the old content and rewrite the file: the ordinary write process is simply repeated, except that the file ID is not incremented and the content is written directly under the specified file ID. This also means that even on a first write (before LFS has incremented up to that file ID) you can write content under a file ID you specify, rather than letting LFS generate the ID first (though I personally don't recommend this).
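A sketch of that path (my illustration; it assumes the new content has already been written into a free Block elsewhere): the file's existing 12-byte Index record is simply overwritten in place, and the file ID never changes.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical sketch of the simplest update: write the new content to a
// free Block, then overwrite the file's 12-byte Index record in place.
public class SimpleUpdate {
    static final int RECORDS_PER_INDEX = 100;
    static final int RECORD_SIZE = 12;

    public static void rewrite(String indexDir, int fileId,
                               int newBlockId, int newStart, int newEnd)
            throws IOException {
        int indexNo = fileId / RECORDS_PER_INDEX;
        long offset = (long) (fileId % RECORDS_PER_INDEX) * RECORD_SIZE;
        try (RandomAccessFile index =
                     new RandomAccessFile(indexDir + "/" + indexNo, "rw")) {
            index.seek(offset);
            index.writeInt(newBlockId);  // point the record at the new copy;
            index.writeInt(newStart);    // the old bytes become a fragment
            index.writeInt(newEnd);      // for the planned defragmenter
        }
    }
}
```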
The other way is to modify the original data in place, without deleting the old content (it is processed differently). If the new content is larger than the original, a new Block is allocated for the overflow. This process can trigger data migration, but fortunately LFS's multiple update modes include ways to avoid migration, and also ways to allocate fewer Blocks, to maximize update efficiency. (I love this approach; it fits my own workload perfectly, and it is great for content that updates frequently or keeps growing.)
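One plausible reading of the in-place path, sketched below (my guess at the mechanics; how LFS actually chains the overflow back to the original extent is not described in the article, so it is left abstract here):

```java
// Hypothetical sketch of the in-place update: content that still fits in
// the old extent is overwritten where it lies; only the overflow goes to
// a freshly allocated Block.
public class InPlaceUpdate {
    public static void update(byte[] newContent, int oldStart, int oldEnd,
                              Block oldBlock, BlockAllocator allocator) {
        int room = oldEnd - oldStart;
        if (newContent.length <= room) {
            // Fits: overwrite in place, no migration, no new Block.
            oldBlock.write(oldStart, newContent, 0, newContent.length);
        } else {
            // Overflow: keep the first `room` bytes in place...
            oldBlock.write(oldStart, newContent, 0, room);
            // ...and put the remainder into a newly allocated Block.
            Block overflow = allocator.allocate();
            overflow.append(newContent, room, newContent.length - room);
        }
    }

    // Minimal interfaces assumed only for this sketch.
    interface Block {
        void write(int offset, byte[] src, int from, int len);
        void append(byte[] src, int from, int len);
    }
    interface BlockAllocator { Block allocate(); }
}
```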
Since there are many update modes to cover different needs, the update functionality is complex to implement, but the external API stays simple. In fact, LFS has no update API at all: the file-write API doubles as the update API, which makes life easier for outer applications (why must there be a separate update API? No!).
LFS chunks everything into many files, which means almost every part can run concurrently. The default size of each file is 64 MB. (Why? Not because it's a popular number, but because I think 64 MB can be loaded into memory very quickly, even on poorly configured machines. As you can see, there is hardly any CPU computation involved, so LFS is better suited to machines that are cheap but have excellent IO, which keeps costs down. Cost is one of LFS's requirements; I hope anyone can afford to run it.) Thanks to the principles described above, it can store large and small files at the same time.
And I don't use LFS only as a file system; personally, I mainly use its database features, which I believe you can now understand.
In addition, the LFS roadmap includes defragmentation, for merging fragments and the like; I hope interested partners will join in. I have actually thought of a way to make it easier, but parts of it remain unfinished.
That is how the file system LFS is designed. If you have similar questions, the analysis above may help you work through them.