An In-Depth Analysis of the Google GFS File System


This article presents an in-depth analysis of the Google GFS file system. The content is quite detailed; interested readers are encouraged to work through it, and we hope it proves helpful.

Abstract

We have designed and implemented the Google File System (GFS), a scalable distributed file system for large-scale data-intensive applications. GFS runs on inexpensive commodity hardware, yet it still provides fault tolerance and high aggregate performance to a large number of clients.

Although GFS shares many design goals with earlier distributed file systems, our design is driven by observations of our application workloads and technological environment, both current and anticipated, which differ markedly from the assumptions behind earlier file systems. This has led us to re-examine traditional file system trade-offs and to explore a radically different design.

GFS fully meets our storage needs. It is widely deployed within Google as the storage platform for the generation and processing of data used by our services, as well as for research and development efforts that require large data sets. The largest cluster to date uses thousands of disks across more than a thousand machines, provides hundreds of terabytes of storage, and is concurrently accessed by hundreds of clients.

In this paper, we present the file system interface extensions that support distributed applications, discuss many aspects of our design, and report measurements from both micro-benchmarks and real production systems.

General Terms

Design, reliability, performance, measurement

Keywords

Fault tolerance, scalability, data storage, cluster storage

1. Introduction

To meet Google's rapidly growing data processing needs, we have designed and implemented the Google File System (GFS). GFS shares many of the same design goals as earlier distributed file systems, such as performance, scalability, reliability, and availability. However, our design is also driven by observations of our own application workloads and technological environment, both current and anticipated, which depart significantly from the assumptions of earlier file systems. We have therefore re-examined traditional file system trade-offs and explored a radically different design.

First, component failures are treated as the norm rather than the exception. GFS consists of hundreds or even thousands of storage machines built from inexpensive commodity parts and is accessed by a comparable number of clients at the same time. The quantity and quality of the components virtually guarantee that at any given time some components will have failed and some will not recover from their current failures. We have seen problems caused by application bugs, operating system bugs, human errors, and failures of disks, memory, connectors, networking, and power supplies. Therefore, constant monitoring, error detection, fault tolerance, and automatic recovery must be integral parts of GFS.

Second, our files are huge by traditional standards. Multi-GB files are very common. Each file typically contains many application objects, such as web documents. When we routinely work with fast-growing data sets of many TB comprising hundreds of millions of objects, it is unwieldy to manage hundreds of millions of KB-sized small files even when the file system could support them. As a result, design assumptions and parameters such as I/O operations and block sizes have to be revisited.

Third, most files are mutated by appending new data rather than overwriting existing data. Random writes within a file are practically non-existent in our workloads. Once written, files are only read, and often only read sequentially. A wide variety of data shares these characteristics: huge data sets scanned by data analysis programs, continuous data streams generated by running applications, archival data, and intermediate data produced on one machine and processed on another, whether simultaneously or later. Given this access pattern on huge files, client-side caching of data blocks loses its appeal, and appending becomes the focus of performance optimization and atomicity guarantees.

Fourth, co-designing the applications and the file system API benefits the overall system by increasing its flexibility. For example, we relaxed the GFS consistency model, which greatly simplified the design of GFS without imposing an onerous burden on applications. We also introduced an atomic record append operation so that multiple clients can append to the same file concurrently without extra synchronization to guarantee data consistency. These issues are discussed in more detail later in this article.

Google has deployed multiple GFS clusters for different purposes. The largest has over 1000 storage nodes, over 300 TB of disk storage, and is heavily and continuously accessed by hundreds of clients on distinct machines.

2. Design Overview

2.1 Design Expectations

In designing a file system for our needs, we have been guided by goals that offer both opportunities and challenges. We alluded to some key points earlier; here we lay out the design expectations in more detail.

◆ The system is built from many inexpensive commodity components, and component failures are the norm. The system must constantly monitor its own state, treat component failures as a normal condition, and detect, tolerate, and promptly recover from failed components.

◆ The system stores a modest number of large files. We expect a few million files, each typically 100 MB or larger. Multi-GB files are also common and should be managed efficiently. The system must support small files as well, but it need not optimize for them.

◆ The workload consists primarily of two kinds of reads: large streaming reads and small random reads. Large streaming reads usually read hundreds of KB at a time, and more commonly 1 MB or more. Successive operations from the same client usually read through a contiguous region of the same file. A small random read typically reads a few KB at some arbitrary offset. Performance-conscious applications often batch and sort their small reads and then read them in order, so as to advance steadily through the file rather than moving the read position back and forth.

◆ The workload also includes many large, sequential writes that append data to files. Typical write sizes are similar to those of large reads. Once written, files are seldom modified again. The system supports small writes at arbitrary positions in a file, but they may be inefficient.

◆ The system must efficiently implement well-defined (alex Note: well-defined) semantics for multiple clients appending data to the same file in parallel. Our files are often used as "producer-consumer" queues or for many-way merging. Hundreds of producers, each running on a separate machine, may append to the same file concurrently. Atomic appends with minimal synchronization overhead are essential. The file may be read later, or a consumer may read it while it is being appended to.

◆ High sustained bandwidth is more important than low latency. Most of our target applications place a premium on processing data in bulk at a high rate, while few have stringent response-time requirements for an individual read or write.

2.2 Interface

GFS provides a familiar file system interface, although it does not implement a standard API such as POSIX. Files are organized hierarchically in directories and identified by path names. We support the usual operations to create, delete, open, close, read, and write files.

In addition, GFS provides snapshot and record append operations. Snapshots create a copy of a file or directory tree at a very low cost. Record append operation allows multiple clients to append data to a file at the same time, while ensuring that the append operation of each client is atomic. This is useful for merging multiple results and for "producer-consumer" queues, where multiple clients can append data to a file at the same time without the need for additional synchronous locking. We found that these types of files are very important for building large distributed applications. Snapshot and record append operations are discussed in sections 3.4 and 3.3, respectively.

2.3 Architecture

A GFS cluster contains a single Master node (alex Note: a single Master node here means that there is only one logical Master component in the GFS system. We will also mention Master node replication later, so, for ease of understanding, we treat the Master node as a logical concept. A logical Master node consists of two physical hosts, that is, two Master servers), multiple Chunk servers, and is accessed by multiple clients at the same time, as shown in figure 1. All of these machines are typically ordinary Linux machines running user-level service processes. It is easy to run a Chunk server and a client on the same machine, as long as machine resources permit and we can accept the lower stability caused by running possibly unreliable application code.

Files stored in GFS are divided into fixed-size Chunks. When a Chunk is created, the Master server assigns it an immutable, globally unique 64-bit Chunk identifier. Chunk servers store Chunks as Linux files on local disks and read or write Chunk data specified by a Chunk identifier and byte range. For reliability, each Chunk is replicated on multiple Chunk servers. By default we store three replicas, though users can designate different replication levels for different regions of the file namespace.

The Master node maintains all file system metadata. This includes the namespace, access control information, the mapping from files to Chunks, and the current locations of Chunks. The Master node also controls system-wide activities such as Chunk lease management (alex Note: BDB also has a notion of lease; I am not sure whether it is the same), garbage collection of orphaned Chunks (alex Note: orphaned chunks), and Chunk migration between Chunk servers. The Master node periodically communicates with each Chunk server in heartbeat messages to give it instructions and collect its state.

GFS client code is linked into applications as a library. The client code implements the GFS file system API and communicates with the Master node and the Chunk servers to read and write data on behalf of the application. Clients interact with the Master only for metadata operations; all data-bearing communication goes directly to the Chunk servers. We do not provide the POSIX API, so GFS calls do not need to hook into the Linux vnode layer.

Neither the client nor the Chunk server caches file data. Client caches offer little benefit because most applications either stream through huge files or have working sets too large to be cached. Not having them also simplifies the client and the overall system by eliminating cache coherence issues. (Clients do cache metadata, however. Chunk servers need not cache file data because Chunks are stored as local files, so the Linux operating system's buffer cache already keeps frequently accessed data in memory.)

2.4 single Master node

Having a single Master node vastly simplifies our design. A single Master can make sophisticated Chunk placement and replication decisions using global knowledge. However, we must minimize its involvement in reads and writes so that it does not become a bottleneck. Clients never read or write file data through the Master node. Instead, a client asks the Master which Chunk servers it should contact, caches this metadata for a limited time, and interacts with the Chunk servers directly for subsequent operations.

Let us use figure 1 to explain the flow of a simple read. First, using the fixed Chunk size, the client translates the file name and byte offset specified by the application into a Chunk index within the file. Then, it sends the Master node a request containing the file name and Chunk index. The Master replies with the corresponding Chunk identifier and the locations of the replicas. The client caches this information using the file name and Chunk index as the key.

The client then sends a request to one of the replicas, usually the closest one. The request specifies the Chunk identifier and a byte range within that Chunk. Further reads of the same Chunk require no more client-Master interaction until the cached information expires or the file is reopened. In fact, the client typically asks for multiple Chunks in the same request, and the Master may also include information for Chunks immediately following those requested. In practice, this extra information sidesteps several future client-Master interactions at practically no extra cost.
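
To make the read path concrete, below is a minimal Python sketch of the client-side logic just described. The chunk index is simply the byte offset divided by the fixed chunk size; the master and chunkserver stubs, the cache layout, and the method names are hypothetical, and reads that span a chunk boundary are ignored for brevity.

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the GFS chunk size

class Client:
    def __init__(self, master, chunkservers):
        self.master = master               # stub that answers chunk lookups
        self.chunkservers = chunkservers   # map: server address -> RPC stub
        self.cache = {}                    # (file_name, chunk_index) -> (handle, replicas)

    def read(self, file_name, offset, length):
        # 1. Translate the byte offset into a chunk index.
        chunk_index = offset // CHUNK_SIZE
        key = (file_name, chunk_index)
        # 2. Ask the Master only on a cache miss; otherwise reuse cached locations.
        if key not in self.cache:
            handle, replicas = self.master.find_chunk(file_name, chunk_index)
            self.cache[key] = (handle, replicas)
        handle, replicas = self.cache[key]
        # 3. Read the byte range directly from one replica (e.g. the closest).
        chunk_offset = offset % CHUNK_SIZE
        replica = replicas[0]
        return self.chunkservers[replica].read_chunk(handle, chunk_offset, length)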

2.5 Chunk Size

Chunk size is one of the key design parameters. We have chosen 64 MB, which is much larger than typical file system block sizes. Each Chunk replica is stored as a plain Linux file on a Chunk server and is extended only as needed. Lazy space allocation avoids wasting space due to internal fragmentation, perhaps the greatest objection against such a large Chunk size.

A large Chunk size offers several important advantages. First, it reduces clients' need to interact with the Master node, because reads and writes on the same Chunk require only one initial request to the Master for Chunk location information. The reduction is especially significant for our workloads because applications mostly read and write large files sequentially. Even for small random reads, the benefit remains: the client can comfortably cache all the Chunk location information for a multi-TB working set. Second, since a client is more likely to perform many operations on a given Chunk, it can reduce network overhead by keeping a persistent TCP connection to the Chunk server over an extended period of time. Third, a large Chunk size reduces the amount of metadata the Master node must store. This allows us to keep all the metadata in memory, which in turn brings other advantages that we will discuss in Section 2.6.1.

On the other hand, a large Chunk size, even with lazy space allocation, has its disadvantages. A small file consists of a small number of Chunks, perhaps just one. The Chunk servers storing those Chunks may become hot spots if many clients access the same small file repeatedly. In practice, hot spots have not been a major issue because our applications mostly read large multi-Chunk files sequentially.

However, hot spots did develop when GFS was first used by a batch-queue system: an executable was written to GFS as a single-Chunk file and then started on hundreds of machines at the same time. The few Chunk servers storing this executable were overloaded by hundreds of simultaneous client requests. We fixed this problem by storing such executables with a higher replication factor and by staggering the start times of the batch-queue applications. A potential long-term solution is to allow clients to read data from other clients in such situations.

2.6 metadata

The Master server (alex Note: note the difference between the logical Master node and a physical Master server. What we describe from here on is the behavior of each physical Master server, such as storage and memory, so we use the physical name) stores three major types of metadata: the file and Chunk namespaces, the mapping from files to Chunks, and the locations of each Chunk's replicas. All metadata is kept in the Master server's memory. The first two types (namespaces and file-to-Chunk mappings) are also kept persistent by logging mutations to an operation log stored on the Master server's local disk and replicated to remote Master servers. Using a log allows us to update the Master server's state simply and reliably without risking inconsistencies in the event of a Master crash. The Master server does not store Chunk location information persistently. Instead, it asks each Chunk server about its Chunks at Master startup and whenever a Chunk server joins the cluster.
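
A rough Python sketch of the three kinds of in-memory metadata may help fix the picture. The structures and names here are hypothetical; the point is that only the first two maps are ever logged persistently, while chunk locations are rebuilt from chunkserver reports.

class MasterMetadata:
    def __init__(self):
        self.namespace = {}        # full path -> file attributes (ACLs, ...)   [logged]
        self.file_chunks = {}      # full path -> ordered list of chunk handles  [logged]
        self.chunk_locations = {}  # chunk handle -> set of chunkserver addresses [not logged]

    def register_chunkserver_report(self, server, handles):
        # Called when a chunkserver starts up (or heartbeats) and reports the
        # chunks it holds; locations are never persisted on the Master itself.
        for h in handles:
            self.chunk_locations.setdefault(h, set()).add(server)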

2.6.1 data structures in memory

Because the metadata is kept in memory, the Master server is very fast. Moreover, the Master server can scan all the state information saved by itself periodically in the background simply and efficiently. This periodic status scan is also used to achieve Chunk garbage collection, re-copy data when the Chunk server fails, achieve load balancing across Chunk servers through Chunk migration, and disk usage statistics. These behaviors are discussed in depth in chapters 4.3 and 4.4.

The method of keeping all metadata in memory has a potential problem: the number of Chunk and the carrying capacity of the entire system are limited by the amount of memory that the Master server has. But in practical application, this is not a serious problem. The Master server needs less than 64 bytes of metadata to manage a 64MB Chunk. Since most files contain multiple Chunk, most of the Chunk is full, except that the last Chunk of the file is partially populated. Similarly, the data size of each file in the namespace is usually less than 64 bytes, because the saved file name is compressed using a prefix compression algorithm.

Even if we needed to support a larger file system, the cost of adding extra memory to the Master server is small, and for that small cost we gain the simplicity, reliability, performance, and flexibility of keeping all the metadata in memory.

2.6.2 Chunk location Information

The Master server does not keep information about which Chunk server holds a copy of the specified Chunk. The Master server simply polls the Chunk server at startup to get this information. The Master server can guarantee that the information it holds is always up-to-date because it controls the allocation of all Chunk locations and monitors the status of the Chunk server through periodic heartbeat information.

At first we tried to keep Chunk location information persistently on the Master server, but we decided it was much simpler to poll the Chunk servers at startup and periodically thereafter. This eliminated the problem of keeping the Master server and the Chunk servers in sync as Chunk servers join and leave the cluster, change names, fail, and restart. In a cluster with hundreds of servers, these events happen all too often.

Another way to understand this design decision is to realize that a Chunk server has the final word over which Chunks it does or does not have on its own disks. There is no point in trying to maintain a consistent global view of this information on the Master server, because errors on a Chunk server may cause Chunks to vanish spontaneously (e.g., a disk may go bad and be disabled) or an operator may rename a Chunk server.

2.6.3 Operation Log

The operation log contains a historical record of critical metadata changes. It is central to GFS. Not only is it the only persistent record of metadata, it also serves as a logical timeline that defines the order of concurrent operations (alex Note: that is, the sequence number in the log serves as the logical time at which an operation occurs, similar to the LSN in a transactional system). Files and Chunks, as well as their versions (see Section 4.5), are all uniquely and permanently identified by the logical times at which they were created.

Since the operation log is critical, we must store it reliably and not make changes visible to clients until the metadata changes have been made persistent. Otherwise, we could effectively lose the whole file system or recent client operations even if the Chunks themselves survive. Therefore, we replicate the log on multiple remote machines and respond to a client operation only after flushing the corresponding log record to disk both locally and remotely. The Master server batches several log records together before flushing, thereby reducing the impact of flushing and replication on overall system throughput.

During disaster recovery, the Master server recovers its file system state by replaying the operation log. To keep startup time short, we must keep the log small (alex Note: the fewer log records that have to be replayed, the better). The Master server checkpoints its state whenever the log grows beyond a certain size (alex Note: a Checkpoint is a snapshot of the system state; earlier log files can then be deleted), writing all of its state to a Checkpoint file. During recovery, the Master server restores its state by loading the latest Checkpoint file from disk and replaying only the limited number of log records written after that Checkpoint. The Checkpoint is stored in a compact B-tree-like form that can be mapped directly into memory and used for namespace lookups without extra parsing. This further speeds up recovery and improves availability.

Because it takes time to create a Checkpoint file, the internal state of the Master server is organized into a format that ensures that ongoing modifications are not blocked during the Checkpoint process. The Master server uses separate threads to switch to a new log file and create a new Checkpoint file. The new Checkpoint file includes all changes before switching. For a cluster of millions of files, it takes about a minute to create a Checkpoint file. After the creation is completed, the Checkpoint file is written to the local and remote hard disk.

Recovery needs only the latest complete Checkpoint file and the subsequent log files. Older Checkpoints and log files can be deleted, though we keep a few around to guard against catastrophes (alex Note: catastrophic failures). A failure during checkpointing does not affect correctness, because the recovery code detects and skips incomplete Checkpoint files.
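
The recovery procedure described above (load the newest checkpoint, then replay only the log records written after it) can be sketched as follows. The file names, JSON record format, and mutation types here are invented for illustration; the real checkpoint and log formats are not specified at this level of detail.

import json, glob, os

def apply_mutation(state, record):
    # Hypothetical log record, e.g. {"seq": 42, "op": "create", "path": "/a/b", "chunks": []}
    if record["op"] == "create":
        state[record["path"]] = record.get("chunks", [])
    elif record["op"] == "delete":
        state.pop(record["path"], None)

def recover(state_dir):
    # Assume checkpoint files sort correctly by name (e.g. zero-padded numbers).
    checkpoints = sorted(glob.glob(os.path.join(state_dir, "checkpoint-*.json")))
    state, last_seq = {}, 0
    if checkpoints:
        with open(checkpoints[-1]) as f:
            snapshot = json.load(f)
        state, last_seq = snapshot["state"], snapshot["last_seq"]
    # Replay only operation-log records with a sequence number after the checkpoint.
    with open(os.path.join(state_dir, "operation.log")) as log:
        for line in log:
            record = json.loads(line)
            if record["seq"] > last_seq:
                apply_mutation(state, record)
    return state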

2.7 consistency model

GFS supports a loose consistency model that supports our highly distributed applications while maintaining the advantages of relative simplicity and ease of implementation. In this section, we discuss the mechanism for ensuring GFS consistency and what it means to the application. We also focus on how GFS manages these consistency assurance mechanisms, but the details of implementation will be discussed in other parts of this paper.

2.7.1 GFS consistency guarantee mechanism

Modifications to the file namespace (for example, file creation) are atomic. They are controlled only by the Master node: namespace locks provide atomicity and correctness (Chapter 4.1); the operation log of the Master node defines the global order of these operations (Chapter 2.6.3).

The state of a file region (alex Note: "region" is difficult to render in Chinese; it means the range of the file affected by a modification operation) after a data mutation depends on the type of mutation, whether it succeeds or fails, and whether there are concurrent mutations. Table 1 summarizes the results. A file region is "consistent" if all clients always see the same data, regardless of which replica they read from. A region is "defined" after a mutation if it is consistent and clients see what the mutation wrote in its entirety. When a mutation succeeds without interference from concurrent writers, the affected region is defined (and by implication consistent): all clients can see what the mutation has written. Concurrent successful mutations leave the region consistent but undefined: all clients see the same data, but it may not reflect what any single mutation has written; typically it consists of mingled fragments from multiple mutations. A failed mutation makes the region inconsistent (hence also undefined): different clients may see different data at different times. We describe below how applications can distinguish defined regions from undefined regions. Applications do not need to further distinguish between different kinds of undefined regions.

Data mutations may be writes or record appends. A write causes data to be written at an application-specified file offset. A record append causes the data to be appended atomically at least once even in the presence of concurrent mutations, but at an offset of GFS's choosing (Chapter 3.3) (alex Note: this sentence is a bit confusing; it means that every append will succeed, but it may be executed more than once, and the offset of each append is chosen by GFS itself). (By contrast, a "regular" append is merely a write at the offset the client believes to be the current end of file.) GFS returns that offset to the client; it marks the beginning of a defined region that contains the appended record. In addition, GFS may insert padding or record duplicates in between. The regions they occupy are considered inconsistent and are usually dwarfed by the amount of user data.

After a sequence of successful mutations, GFS guarantees that the mutated file region is defined and contains the data written by the last mutation. GFS achieves this by (a) applying mutations to all replicas of a Chunk in the same order (Chapter 3.1), and (b) using Chunk version numbers to detect any replica that has become stale because it missed mutations while its Chunk server was down (Chapter 4.5). Stale replicas are never involved in a mutation, and the Master server no longer returns their locations to clients. They are garbage collected at the earliest opportunity.

Since clients cache Chunk locations, they may read from a stale replica before that information is refreshed. This window is limited by the cache entry's timeout and by the next open of the file, which purges all Chunk information for that file from the cache. Moreover, since most of our files are append-only, a stale replica usually returns a premature end of Chunk rather than outdated data. When a Reader retries and contacts the Master server, it immediately gets current Chunk locations. (Two terms, Reader and Writer, are used in this article to denote programs that perform GFS read and write operations.)

Long after a successful mutation, component failures can of course still corrupt or destroy data. GFS identifies failed Chunk servers through regular handshakes between the Master server and all Chunk servers and detects data corruption by checksumming (Chapter 5.2). Once a problem surfaces, the data is restored from valid replicas as soon as possible (Chapter 4.3). A Chunk is lost irreversibly only if all its replicas are lost before GFS can detect the error and react, typically within minutes (alex Note: the time for the Master node to detect the error and take action). Even in this case, the Chunk merely becomes unavailable, not corrupted: applications receive a clear error rather than corrupt data.

2.7.2 implementation of the program

Applications written for GFS can accommodate this relaxed consistency model with a few simple techniques already needed for other purposes: relying on appends rather than overwrites, checkpointing, and writing self-validating, self-identifying records.

In practice, all of our applications mutate files by appending rather than overwriting whenever possible. In one typical use, a Writer generates a file by writing it from beginning to end. Once all the data has been written, the application atomically renames the file to a permanent name, or periodically checkpoints how much has been successfully written. Checkpoints may also include application-level checksums. Readers verify and process only the file region up to the last checkpoint, which is known to be in the defined state. This approach has served our consistency and concurrency needs well. Appending is far more efficient and more resilient to application failures than random writes. Checkpointing allows Writers to restart incrementally and keeps Readers from processing successfully written data that is still incomplete from the application's point of view.

Consider another typical use: many applications append to the same file concurrently, for example to merge results or as a producer-consumer queue. Record append's "append at least once" semantics preserves each Writer's output. Readers deal with the occasional padding and duplicates as follows. Each record prepared by a Writer contains extra information such as a checksum so that its validity can be verified. A Reader can identify and discard extra padding and record fragments using the checksums. If it cannot tolerate the occasional duplicates (for example, if they would trigger non-idempotent operations), it can filter them out using unique identifiers in the records, which are often needed anyway to name the corresponding application entities such as web documents. These record I/O functionalities (except duplicate removal) are in library code shared by our applications and applicable to other file interface implementations at Google. With them, the same sequence of records, plus rare duplicates, is always delivered to the Reader.
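
The Writer/Reader conventions above can be sketched in a few lines of Python: each record carries a checksum so padding and fragments can be detected, and a unique identifier so duplicates introduced by retried record appends can be dropped. The on-disk record framing used here is purely hypothetical.

import struct, zlib

def encode_record(record_id: bytes, payload: bytes) -> bytes:
    # Frame: [crc32][body length][id length][record id][payload]
    body = struct.pack(">I", len(record_id)) + record_id + payload
    checksum = zlib.crc32(body)
    return struct.pack(">II", checksum, len(body)) + body

def decode_records(data: bytes):
    seen, pos = set(), 0
    while pos + 8 <= len(data):
        checksum, length = struct.unpack_from(">II", data, pos)
        body = data[pos + 8 : pos + 8 + length]
        if len(body) < length or zlib.crc32(body) != checksum:
            pos += 1          # padding or a record fragment: skip forward and resync
            continue
        id_len = struct.unpack_from(">I", body)[0]
        record_id, payload = body[4:4 + id_len], body[4 + id_len:]
        if record_id not in seen:   # filter duplicates caused by retried appends
            seen.add(record_id)
            yield payload
        pos += 8 + length
    # trailing bytes shorter than a header are treated as padding and ignored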

3. System interaction

An important principle in the design of this system is to minimize the Master node's involvement in all operations. With that in mind, we now describe how the client, Master server, and Chunk servers interact to implement data mutations, atomic record appends, and snapshots.

3.1 lease and change order

(alex Note: lease is a term in the database)

A mutation is an operation that changes the contents or metadata of a Chunk, such as a write or a record append. Each mutation is performed at all of the Chunk's replicas. We use leases to maintain a consistent mutation order across replicas. The Master node grants a Chunk lease to one of the replicas, which we call the primary Chunk. The primary picks a serial order for all mutations to the Chunk, and all replicas follow this order when applying mutations. Thus, the global mutation order is defined first by the lease grant order chosen by the Master node, and within a lease by the serial numbers assigned by the primary.

The lease mechanism is designed to minimize management overhead at the Master node. A lease has an initial timeout of 60 seconds. However, as long as the Chunk is being mutated, the primary can request extensions from the Master node and will typically receive them. These extension requests and grants are piggybacked on the heartbeat messages regularly exchanged between the Master node and the Chunk servers. The Master node may sometimes try to revoke a lease before it expires (for example, when it wants to disable mutations on a file that is being renamed). Even if the Master node loses communication with the primary, it can safely grant a new lease to another replica after the old lease expires.
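
Below is a minimal sketch of the Master-side lease bookkeeping, assuming the 60-second initial timeout described above and extensions piggybacked on heartbeats. The class and method names are hypothetical.

import time

LEASE_SECONDS = 60

class LeaseTable:
    def __init__(self):
        self.leases = {}  # chunk handle -> (primary replica, expiry time)

    def grant_or_lookup(self, handle, replicas):
        primary, expiry = self.leases.get(handle, (None, 0))
        if primary is None or expiry <= time.time():
            primary = replicas[0]                       # pick an up-to-date replica
            expiry = time.time() + LEASE_SECONDS
            self.leases[handle] = (primary, expiry)
        return primary

    def extend_on_heartbeat(self, handle, primary):
        # The primary asks for an extension while the chunk is being mutated.
        current, expiry = self.leases.get(handle, (None, 0))
        if current == primary and expiry > time.time():
            self.leases[handle] = (primary, time.time() + LEASE_SECONDS)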

In figure 2, we show the control flow of the write operation according to the step number.

◆ The client asks the Master node which Chunk server holds the current lease for the Chunk and the locations of the other replicas. If no one holds a lease, the Master node grants one to a replica it chooses (this step is not shown in the figure).

◆ The Master node replies with the identity of the primary and the locations of the other (secondary) replicas. The client caches this data for future mutations. It needs to contact the Master node again only when the primary becomes unreachable or replies that it no longer holds a lease.

◆ The client pushes the data to all the replicas. A client can do so in any order. Each Chunk server stores the data in an internal LRU buffer cache until the data is used or aged out. Because the data flow carries the bulk of the network load, decoupling it from the control flow lets us schedule the data flow based on the network topology regardless of which Chunk server is the primary, improving performance. Chapter 3.2 discusses this further.

◆ Once all the replicas have acknowledged receiving the data, the client sends a write request to the primary Chunk server. The request identifies the data pushed earlier to all of the replicas. The primary assigns consecutive serial numbers to all the mutations it receives, possibly from multiple clients, which provides the necessary serialization. It applies the mutations to its own local state in serial number order (alex Note: "applies the mutation to its own local state" is confusing when translated literally; it simply means the primary performs these operations in order and updates its own state).

◆ The primary Chunk forwards the write request to all secondary replicas. Each secondary replica applies the mutations in the same serial number order assigned by the primary.

◆ All the secondary replicas reply to the primary Chunk, indicating that they have completed the operation.

◆ The primary Chunk server (alex Note: the Chunk server where the primary Chunk is located) replies to the client. Any errors encountered at any of the replicas are reported to the client. In case of errors, the write may have succeeded at the primary and an arbitrary subset of the secondary replicas. (If it had failed at the primary, it would not have been assigned a serial number and forwarded.) The client request is considered to have failed, and the modified region is left in an inconsistent state. Our client code handles such errors by retrying the failed mutation: it makes a few attempts at steps (3) through (7) before falling back to retrying the write from the beginning.
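
The following sketch condenses steps 3 through 7 above from the primary's point of view, assuming the data has already been pushed to every replica in step 3. All classes and RPC stubs are hypothetical, and real error handling is richer than shown.

class PrimaryChunk:
    def __init__(self, handle, local_store, secondaries):
        self.handle = handle
        self.local_store = local_store    # applies mutations to the local replica
        self.secondaries = secondaries    # RPC stubs for the secondary replicas
        self.next_serial = 0

    def write(self, data_id, offset):
        # Assign a serial number so every replica applies mutations in the same
        # order, apply locally first, then forward the request to all secondaries.
        serial = self.next_serial
        self.next_serial += 1
        self.local_store.apply(self.handle, data_id, offset, serial)
        errors = []
        for s in self.secondaries:
            try:
                s.apply(self.handle, data_id, offset, serial)
            except IOError as e:
                errors.append(e)          # reported back to the client, which retries
        return ("ok", serial) if not errors else ("error", errors)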

If a write by the application is large or straddles a Chunk boundary, the GFS client code breaks it into multiple write operations. They all follow the control flow described above but may be interleaved with and overwritten by concurrent operations from other clients. Therefore, the shared file region may end up containing fragments from different clients, although the replicas will be identical because the individual operations are completed in the same order on all replicas. This leaves the file region in the consistent but undefined state described in Section 2.7.

3.2 data flow

To use the network efficiently, we decouple the data flow from the control flow. While control flows from the client to the primary and then to all secondaries, data is pushed linearly along a carefully picked chain of Chunk servers in a pipelined fashion. Our goals are to fully utilize each machine's network bandwidth, avoid network bottlenecks and high-latency links, and minimize the latency of pushing through all the data.

To take full advantage of the bandwidth of each machine, data is pushed sequentially along a chain of Chunk servers, rather than distributed in other topologies (for example, tree topologies). In linear push mode, all the egress bandwidth of each machine is used to transmit data at the fastest speed, rather than allocating bandwidth among multiple recipients.

To avoid network bottlenecks and high-latency links (e.g., inter-switch links are often both) as much as possible, each machine forwards the data to the "closest" machine in the network topology that has not yet received it. Suppose the client is pushing data to Chunk servers S1 through S4. It sends the data to the closest Chunk server, say S1. S1 forwards it to the closest of S2 through S4, say S2. Similarly, S2 forwards it to whichever of S3 and S4 is closer to S2, and so on. Our network topology is simple enough that "distances" can be accurately estimated from IP addresses.

Finally, we minimize latency by pipelining the data transfer over TCP connections. Once a Chunk server receives some data, it starts forwarding immediately. Pipelining is especially helpful to us because we use a switched network with full-duplex links. Sending the data immediately does not reduce the receive rate. Without network congestion, the ideal elapsed time for transferring B bytes to R replicas is B/T + RL, where T is the network throughput and L is the latency of transferring bytes between two machines. Our network links are typically 100 Mbps (T), and L is far below 1 ms, so 1 MB can ideally be distributed in about 80 ms.
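
A quick back-of-the-envelope check of the B/T + RL estimate, using the figures quoted above (1 MB payload, 100 Mbps links, three replicas) and a generous 1 ms per-hop latency as an assumed upper bound:

def pipelined_push_seconds(bytes_to_send, replicas, throughput_bps, latency_s):
    # Ideal elapsed time for a pipelined push: B/T + R*L
    return bytes_to_send * 8 / throughput_bps + replicas * latency_s

t = pipelined_push_seconds(bytes_to_send=1_000_000, replicas=3,
                           throughput_bps=100e6, latency_s=0.001)
print(f"{t * 1000:.0f} ms")   # prints 83 ms, consistent with the "about 80 ms" figure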

3.3 Atomic Record Appends

GFS provides an atomic append operation called record append. In a traditional write, the client specifies the offset at which data is to be written. Concurrent writes to the same region are not serializable: the region may end up containing data fragments from multiple clients. With record append, however, the client specifies only the data. GFS appends it to the file at least once atomically (that is, as one continuous sequence of bytes) at an offset of GFS's choosing and returns that offset to the client. This is similar to writing to a file opened in O_APPEND mode in the Unix programming environment, but without the race conditions of multiple writers doing so concurrently.

Record appending is frequently used in our distributed applications, where many clients write data to the same file in parallel. If we use traditional file write operations, the client needs additional complex and expensive synchronization mechanisms, such as using a distributed lock manager. In our work, such files are usually used in queuing systems with multiple producers / single consumers, or result files that merge data from multiple clients.

Record append is a kind of mutation and follows the control flow in Section 3.1, with only a little extra logic at the primary. The client pushes the data to all replicas of the last Chunk of the file and then sends its request to the primary. The primary checks whether appending the record would cause the Chunk to exceed the maximum size (64 MB). If so, it pads the Chunk to the maximum size, tells all the secondary replicas to do the same, and replies to the client telling it to retry the record append on the next Chunk. (Record append data is restricted to at most 1/4 of the maximum Chunk size, so that even in the worst case the amount of fragmentation stays within a controllable range.) In the common case the record fits within the maximum size, so the primary appends the data to its own replica, tells the secondary replicas to write the data at the exact same offset, and finally replies success to the client.

If a record append fails at any replica, the client retries the operation. As a result, replicas of the same Chunk may contain different data, possibly including duplicates of the same record in whole or in part. GFS does not guarantee that all replicas are bytewise identical; it only guarantees that the data is written at least once as an atomic unit. This property follows readily from a simple observation: for the operation to report success, the data must have been written at the same offset on all replicas of some Chunk. After this, all replicas are at least as long as the end of that record, and therefore any future record will be assigned a higher offset or a different Chunk, even if a different replica later becomes the primary. In terms of our consistency guarantees, the regions in which successful record append operations have written their data are defined (hence consistent), whereas intervening regions are inconsistent (hence undefined). As we discussed in Section 2.7.2, our applications can deal with inconsistent regions.
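
Here is a sketch of the extra logic at the primary for record append, as described above: pad and ask the client to retry on the next chunk when the record would overflow the 64 MB boundary, otherwise append at an offset chosen by the primary and have every secondary write at the same offset. The chunk object and RPC stubs are hypothetical.

CHUNK_SIZE = 64 * 1024 * 1024

def record_append_at_primary(chunk, record, secondaries):
    if chunk.used + len(record) > CHUNK_SIZE:
        pad = CHUNK_SIZE - chunk.used
        chunk.pad(pad)                         # fill the local replica to the boundary
        for s in secondaries:
            s.pad(chunk.handle, pad)           # tell the secondaries to do the same
        return ("retry_on_next_chunk", None)
    offset = chunk.used                        # offset chosen by GFS, not the client
    chunk.write(offset, record)
    for s in secondaries:
        s.write(chunk.handle, offset, record)  # same offset on every replica
    return ("ok", offset)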

3.4 Snapshot

(alex Note: this section is very difficult to understand. In general, it describes what snapshots are, the COW technology used by snapshots, and how snapshots do not interfere with the current operation.)

The snapshot operation makes a copy of a file or a directory tree (the "source") almost instantaneously, while causing minimal interference with other ongoing operations. Our users use it to quickly create branch copies of huge data sets (and often recursive copies of those copies), or to back up the current state before experimenting with changes that can later be committed or rolled back easily.

Like AFS (alex Note: AFS, the Andrew File System, a distributed file system), we use standard copy-on-write techniques to implement snapshots. When the Master node receives a snapshot request, it first revokes any outstanding leases on the Chunks of the files it is about to snapshot. This ensures that any subsequent writes to these Chunks will require an interaction with the Master to find the lease holder. This gives the Master node an opportunity to create a new copy of the Chunk first.

After the lease is cancelled or expired, the Master node logs the operation to the hard disk. The Master node then reflects the changes in this log record to the state saved in memory by copying the metadata of the source file or directory. The newly created snapshot file and the source file point to exactly the same Chunk address.

After the snapshot operation, the first time a client wants to write to a Chunk C, it sends a request to the Master node to find the current lease holder. The Master node notices that the reference count for Chunk C is greater than one (alex Note: I don't quite understand why it is greater than 1. Didn't the snapshot release the reference count?). It defers replying to the client request and instead picks a new Chunk handle C'. It then asks each Chunk server that has a current replica of C to create a new Chunk called C'. By creating the new Chunk on the same Chunk servers as the original, we ensure that the data can be copied locally rather than over the network (our disks are about three times as fast as our 100 Mb Ethernet links). From this point, request handling is no different from that for any other Chunk: the Master node grants one of the replicas of the new Chunk C' a lease and replies to the client, which can then write the Chunk normally, not knowing that it has just been created from an existing Chunk.
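
The copy-on-write bookkeeping on the Master can be sketched as follows: a snapshot duplicates only metadata and bumps per-chunk reference counts, and the first write to a shared chunk triggers creation of a new chunk handle. The data structures and the use of a UUID as a stand-in for a 64-bit handle are hypothetical.

import uuid

class SnapshotMaster:
    def __init__(self):
        self.file_chunks = {}   # path -> ordered list of chunk handles
        self.refcount = {}      # chunk handle -> number of files referencing it

    def snapshot(self, src_path, dst_path):
        handles = list(self.file_chunks[src_path])
        self.file_chunks[dst_path] = handles          # duplicate metadata only
        for h in handles:
            self.refcount[h] = self.refcount.get(h, 1) + 1

    def prepare_write(self, path, index):
        handle = self.file_chunks[path][index]
        if self.refcount.get(handle, 1) > 1:          # chunk is shared: copy it first
            new_handle = uuid.uuid4().hex             # stand-in for a 64-bit handle
            # (in GFS, the chunkservers holding `handle` would clone it locally)
            self.refcount[handle] -= 1
            self.refcount[new_handle] = 1
            self.file_chunks[path][index] = new_handle
            handle = new_handle
        return handle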

4. Operation of Master node

The Master node performs all namespace operations. In addition, it manages all Chunk replicas in the entire system: it determines the storage location of Chunk, creates new Chunk and its replicas, coordinates various system activities to ensure that Chunk is fully replicated, load balances among all Chunk servers, and reclaims unused storage space. In this section we discuss the above topics.

4.1 Namespace management and locking

Many Master node operations can take a long time: for example, a snapshot has to revoke the leases on all Chunks covered by the snapshot at the Chunk servers. We do not want to delay other Master operations while these are running. Therefore, we allow multiple operations to be active at once and use locks over regions of the namespace to ensure proper serialization.

Unlike many traditional file systems, GFS does not have a data structure that lists all the files in the directory for each directory implementation. GFS also does not support links to files or directories (that is, hard links or symbolic links in Unix terminology). Logically, the namespace of GFS is a lookup table of full-path and metadata mapping relationships. With prefix compression, this table can be efficiently stored in memory. On the tree structure that stores namespaces, each node (the file name of the absolute path or the directory name of the absolute path) has an associated read-write lock.

Each Master node operation acquires a set of locks before it runs. Typically, if it involves /d1/d2/.../dn/leaf, it will acquire read locks on the directory names /d1, /d1/d2, ..., /d1/d2/.../dn, and either a read lock or a write lock on the full pathname /d1/d2/.../dn/leaf. Note that, depending on the operation, leaf may be either a file or a directory.

We now illustrate how this locking mechanism prevents a file /home/user/foo from being created while /home/user is being snapshotted to /save/user. The snapshot operation acquires read locks on /home and /save, and write locks on /home/user and /save/user. The file creation operation acquires read locks on /home and /home/user, and a write lock on /home/user/foo. The two operations are serialized properly because they try to obtain conflicting locks on /home/user. File creation does not require a write lock on the parent directory because there is no "directory", or inode-like, data structure to be protected from modification. The read lock on the name is sufficient to prevent the parent directory from being deleted.

A nice property of this locking scheme is that it allows concurrent mutations in the same directory. For example, multiple files can be created concurrently in the same directory: each operation acquires a read lock on the directory name and a write lock on the file name. The read lock on the directory name suffices to prevent the directory from being deleted, renamed, or snapshotted. The write locks on file names serialize attempts to create a file with the same name twice.

Because namespaces may have many nodes, read-write locks adopt a lazy allocation strategy and are deleted immediately when they are no longer in use. Similarly, locks should be acquired in a globally consistent order to avoid deadlocks: first sorted by the hierarchy of namespaces and sorted in dictionary order within the same level.
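
A simplified sketch of per-path locking in the order just described: read locks on every ancestor directory, a lock on the leaf, and a single global acquisition order (by namespace level, then lexicographically) to avoid deadlock. Python's threading.RLock is used as a stand-in for a true reader-writer lock, so this sketch does not actually allow concurrent readers.

import threading
from collections import defaultdict

class NamespaceLocks:
    def __init__(self):
        # One lock object per absolute path, allocated lazily on first use.
        self.locks = defaultdict(threading.RLock)

    def paths_to_lock(self, full_path):
        parts = full_path.strip("/").split("/")
        ancestors = ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]
        return ancestors, full_path

    def acquire_for_create(self, full_path):
        ancestors, leaf = self.paths_to_lock(full_path)
        # Sort by namespace level, then lexicographically, for a consistent global order.
        for p in sorted(ancestors + [leaf], key=lambda s: (s.count("/"), s)):
            self.locks[p].acquire()      # ancestors: read intent, leaf: write intent

    def release(self, full_path):
        ancestors, leaf = self.paths_to_lock(full_path)
        for p in sorted(ancestors + [leaf], key=lambda s: (s.count("/"), s), reverse=True):
            self.locks[p].release()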

4.2 location of the copy

GFS cluster is a highly distributed multi-tier layout structure, rather than a flat structure. A typical topology is that hundreds of Chunk servers are installed on many racks. The Chunk server is accessed by hundreds of clients from the same or different racks in turn. Communication between two machines on different racks may span one or more network switches. In addition, the access bandwidth of the rack may be smaller than the combined bandwidth of all the machines in the rack. Multi-tier distributed architecture poses unique challenges to the flexibility, reliability, and availability of data.

The Chunk replica placement policy serves two purposes: maximize data reliability and availability, and maximize network bandwidth utilization. For both, it is not enough to spread replicas across machines, which only guards against disk or machine failures and only fully utilizes each machine's network bandwidth. We must also spread Chunk replicas across racks. This ensures that some replicas of a Chunk will survive and remain available even if an entire rack is damaged or offline (for example, due to failure of a shared resource such as a power circuit or a network switch). It also means that traffic, especially reads, for a Chunk can exploit the aggregate bandwidth of multiple racks. On the other hand, write traffic has to flow through multiple racks, a trade-off we make willingly.

4.3 create, re-copy, re-load balancing

Chunk replicas are created for three reasons: Chunk creation, re-replication, and rebalancing.

When the Master node creates a Chunk, it chooses where to place the initially empty replicas. It considers several factors. (1) We want to place new replicas on Chunk servers with below-average disk space utilization. Over time this will equalize disk utilization across Chunk servers. (2) We want to limit the number of "recent" creations on each Chunk server. Although creation itself is cheap, it reliably predicts imminent heavy write traffic, because Chunks are created only when demanded by writes, and in our append-once-read-many workload they typically become practically read-only once they have been completely written. (3) As discussed above, we want to spread replicas of a Chunk across racks.

When the number of valid copies of Chunk is less than the replication factor specified by the user, the Master node replicates it again. This may be caused by several reasons: a Chunk server is unavailable, the Chunk server reports that a copy it stores is corrupted, a disk in the Chunk server is unavailable due to an error, or the replication factor for the Chunk replica has increased. Each Chunk that needs to be replicated is sorted according to several factors. One factor is the difference between the number of existing replicas of Chunk and the replication factor. For example, a Chunk that loses two copies has a higher priority than a Chunk that loses one copy. In addition, we prefer to re-copy the Chunk of the active (live) file over the Chunk of the recently deleted file (see Section 4.4). Finally, to minimize the impact of a failed Chunk on a running application, we increase the priority of the Chunk that blocks the client program's processing flow.
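
The prioritization factors listed above could be combined roughly as in the sketch below; the specific weights and the chunk attributes are invented for illustration, and only the ordering criteria come from the text.

def replication_priority(chunk):
    # How far the chunk is from its replication goal dominates the score.
    missing = chunk.replication_goal - chunk.live_replicas   # e.g. 3 - 1 = 2
    score = missing * 100
    if chunk.file_is_live:          # prefer chunks of live files over recently deleted ones
        score += 10
    if chunk.blocking_client:       # boost chunks that are stalling client progress
        score += 50
    return score

def pick_next_chunk(chunks_needing_replication):
    return max(chunks_needing_replication, key=replication_priority)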

The Master node picks the highest-priority Chunk and "clones" it by instructing some Chunk server to copy the Chunk data directly from an existing valid replica. The new replica is placed with goals similar to those for creation: equalizing disk space utilization, limiting active clone operations on any single Chunk server, and spreading replicas across racks. To keep cloning traffic from overwhelming client traffic, the Master node limits the number of active clone operations both for the cluster and for each Chunk server. Additionally, each Chunk server limits the bandwidth it spends on each clone operation by throttling its read requests to the source Chunk server.

Finally, the Master server periodically rebalances the replicas: it checks the current replica distribution, and then moves the replicas in order to make better use of hard disk space and load balancing more efficiently. And in the process, the Master server gradually fills a new Chunk server, rather than overloading it with new Chunk in a short period of time. The storage location selection strategy for the new copy is the same as discussed above. In addition, the Master node must select which copy to remove. Typically, the Master node removes replicas from Chunk servers where the remaining space is below average, thus balancing the overall hard disk usage of the system.

4.4 garbage collection

GFS does not reclaim available physical space immediately after the file is deleted. GFS space collection uses a lazy strategy and is only performed during file and Chunk-level regular garbage collection. We find that this method makes the system simpler and more reliable.

4.4.1 Mechanism

When a file is deleted by the application, the Master node logs the deletion immediately, just like any other change. However, instead of reclaiming resources immediately, the file is merely renamed to a hidden name that includes the deletion timestamp. During the Master node's regular scan of the file system namespace, it removes any such hidden files that have existed for more than three days (the interval is configurable). Until then, the file can still be read under the new, special name and can be undeleted by renaming it back to a normal name. When the hidden file is removed from the namespace, its in-memory metadata on the Master server is erased. This effectively severs the file's links to all of its Chunks (alex Note: the original text is "This effectively severs its links to all its chunks").

During a similar regular scan of the Chunk namespace, the Master node finds the orphan Chunk (Chunk that is not contained by any file) and deletes their metadata. In the heartbeat information that the Chunk server interacts with the Master node, it reports the information of the Chunk subset it owns, and the Master node replies to the Chunk server which Chunk no longer exists in the metadata saved by the Master node. The Chunk server can delete copies of these Chunk at will.
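
A sketch of the lazy-deletion mechanism just described: deletion is a rename to a hidden, timestamped name, and a periodic namespace scan reclaims anything older than the grace period (three days by default). The hidden-name format is hypothetical.

import time

GRACE_SECONDS = 3 * 24 * 3600   # three-day default grace period

def delete_file(namespace, path):
    hidden = f"{path}.__deleted__.{int(time.time())}"
    namespace[hidden] = namespace.pop(path)   # still readable; undelete by renaming back

def scan_namespace(namespace, now=None):
    now = now or time.time()
    for name in list(namespace):
        if ".__deleted__." in name:
            deleted_at = int(name.rsplit(".", 1)[-1])
            if now - deleted_at > GRACE_SECONDS:
                # Metadata is gone; orphaned chunks are reclaimed later by the
                # chunk-level scan and the heartbeat exchange with chunkservers.
                del namespace[name]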

4.4.2 discussion

Although distributed garbage collection is a difficult problem that needs complex solutions in the field of programming languages, it is very simple in GFS systems. We can easily get all references to Chunk: they are only stored in the file-to-block mapping table on the Master server. We can also easily get copies of all Chunk: they are all stored as Linux files in the specified directory of the Chunk server. All copies that are not recognized by Master nodes are "junk".

Garbage collection has several advantages over eager deletion for reclaiming space. First, it is simple and reliable in a large-scale distributed system where component failures are common. Chunk creation may succeed on some Chunk servers but not others, leaving replicas that the Master node does not know exist. Replica deletion messages may be lost, and the Master node has to remember to resend them across failures, both its own failures and the Chunk server's (alex Note: "its own" refers to the Master's own failures). Garbage collection provides a uniform and dependable way to clean up any replicas not known to be useful. Second, it merges storage reclamation into the regular background activities of the Master node, such as the regular namespace scans and handshakes with Chunk servers. Thus, it is done in batches and the cost is amortized. Moreover, it is done only when the Master node is relatively free, so the Master can respond more promptly to client requests that demand timely attention. Third, the delay in reclaiming storage provides a safety net against accidental, irreversible deletion.

In our experience, the main disadvantage of delayed reclamation is that it hinders users' efforts to fine-tune storage usage when space is tight. Applications that repeatedly create and delete temporary files cannot reuse the freed space immediately. We address this by expediting reclamation when an already-deleted file is explicitly deleted again. We also allow users to apply different replication and reclamation policies to different parts of the namespace. For example, users can specify that files under a certain directory tree are stored without replication, and that deleted files there are removed from the file system instantly and irrevocably.

4.5 expired copy detection

When a Chunk server fails, its Chunk replicas may become stale because they miss modification operations while the server is down. The Master node maintains a version number for each Chunk to distinguish up-to-date replicas from stale ones.

Whenever the Master node grants a new lease on a Chunk, it increases the Chunk's version number and informs the up-to-date replicas. The Master node and these replicas all record the new version number in their persistent state. This happens before any client is notified and therefore before any writing to the Chunk can begin. If the Chunk server holding some replica is down at that moment, that replica's version number is not advanced. When the Chunk server restarts and reports its set of Chunk and their version numbers, the Master node detects that the server holds stale Chunk. If the Master node sees a version number higher than the one in its own records, it assumes that it failed while granting the lease, and adopts the higher version number as the current one.
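
The version-number protocol can be sketched in Python roughly as follows. The class and field names are ours, not GFS's, and persistence is elided; the point is the ordering (bump the version before clients are told) and the two comparison cases described above.

class MasterChunkRecord:
    """Toy model of stale-replica detection via chunk version numbers."""
    def __init__(self, version=1):
        self.version = version            # persisted by the master

    def grant_lease(self, live_replicas):
        # Bump the version before any client is told about the new lease;
        # only replicas that are currently reachable learn the new number.
        self.version += 1
        for replica in live_replicas:
            replica["version"] = self.version   # replicas also persist this
        return self.version

    def on_chunkserver_report(self, reported_version):
        if reported_version < self.version:
            return "stale"        # replica missed mutations while it was down
        if reported_version > self.version:
            # The master must have failed while granting a lease; adopt the
            # higher number as current.
            self.version = reported_version
        return "current"

rec = MasterChunkRecord()
up = [{"version": 1}]
down = {"version": 1}                 # this replica's server was offline
rec.grant_lease(up)
assert rec.on_chunkserver_report(down["version"]) == "stale"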

The Master node removes stale replicas during its regular garbage collection. Before that happens, it effectively treats a stale replica as if it did not exist at all when replying to client requests for Chunk information. As another safeguard, the Master node includes the Chunk version number whenever it tells a client which Chunk server holds a lease, or instructs a Chunk server to clone a Chunk from another Chunk server. The client or the Chunk server verifies the version number when performing the operation, so that it always accesses up-to-date data.

5. Fault tolerance and diagnosis

One of the greatest challenges in designing GFS is coping with frequent component failures. The quantity and quality of the components make such failures far more common than ordinary system accidents: we can fully trust neither the machines nor the disks. Component failures can make the system unavailable or, worse, produce corrupted data. We discuss how we meet these challenges and the tools built into GFS for diagnosing problems when failures inevitably occur.

5.1 High availability

Among the hundreds of servers in the GFS cluster, there must be some servers that are unavailable at any given time. We use two simple but effective strategies to ensure high availability of the entire system: fast recovery and replication.

5.1.1 Fast recovery

Both the Master server and the Chunk servers are designed to restore their state and restart in seconds, no matter how they were terminated. In fact, we do not distinguish between normal and abnormal termination; servers are routinely shut down simply by killing the process. Clients and other servers experience a minor hiccup (alex Note: a minor hiccup): outstanding requests time out, the caller reconnects to the restarted server and retries. Section 6.2.2 reports the observed startup times.

5.1.2 Chunk replication

As discussed earlier, each Chunk is replicated onto multiple Chunk servers on different racks. Users can specify different replication levels for different parts of the file namespace; the default is three. The Master node clones existing replicas as needed to keep each Chunk fully replicated when a Chunk server goes offline or when a corrupted replica is detected through checksum verification (see Section 5.2) (alex Note: each Chunk has a number of replicas determined by the replication factor, the default is 3). Although replication has served us well, we are also exploring other forms of cross-server redundancy, such as parity or erasure codes, for our growing read-only storage requirements. Because our workload is dominated by append writes and reads, with very few random writes, we expect that implementing these more complex redundancy schemes in our loosely coupled system would be challenging but not impossible.

5.1.3 replication of Master servers

To make the Master server reliable, its state is also replicated. All of the Master's operation log and checkpoint files are replicated on multiple machines. A change to the Master's state is considered committed only after its log record has been flushed to the local disk and to the Master's log replicas. For simplicity, one Master process remains in charge of all mutations as well as background activities, such as garbage collection, that change the system's internal state. When it fails, it can restart almost instantly. If its machine or disk fails, monitoring infrastructure outside GFS starts a new Master process on another machine with the replicated operation log. Clients reach the Master only through a canonical name (for example gfs-test), which is a DNS alias, so when the Master process is relocated to another machine only the alias's target needs to change.
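
The commit rule described above, namely acknowledging a mutation only after its log record is durable both locally and on the log replicas, can be sketched roughly like this (hypothetical interfaces; the real system batches records and must handle replica failures):

class OperationLog:
    """Toy write-ahead log: a record commits only after it is durable
    locally and on every log replica (hypothetical interface)."""
    def __init__(self, local, replicas):
        self.local = local            # object with append(record) -> bool
        self.replicas = replicas      # remote log copies, same interface

    def commit(self, record):
        ok = self.local.append(record)
        for replica in self.replicas:
            ok = replica.append(record) and ok
        return ok                     # apply the mutation only if this is True

class FakeDisk:
    def __init__(self):
        self.records = []
    def append(self, record):
        self.records.append(record)   # stands in for write + fsync
        return True

log = OperationLog(FakeDisk(), [FakeDisk(), FakeDisk()])
assert log.commit({"op": "create", "path": "/logs/a"})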

In addition, GFS has "shadow" Master servers that provide read-only access to the file system when the "primary" Master server is down. They are shadows, not mirrors: their state may lag slightly behind the primary, typically by fractions of a second. They improve read availability for files that are not being actively modified, and for applications that tolerate slightly stale results. In fact, since file content is read from the Chunk servers, applications do not observe stale file content; what may be stale within this short window is file metadata, such as directory contents or access control information.

To keep itself up to date, a shadow Master server reads a replica of the growing operation log and applies the same sequence of changes to its data structures exactly as the primary does. Like the primary, it polls the Chunk servers at startup (and infrequently thereafter) to locate Chunk replicas, and it exchanges frequent handshake messages with them to monitor their status. It depends on the primary Master server only for replica location updates that result from the primary's own decisions to create and delete replicas.

5.2 data integrity

Each Chunk server uses checksums to detect corruption of the data it stores. Given that a GFS cluster often has hundreds of machines and thousands of disks, it regularly experiences disk failures that corrupt or destroy data on both the read and write paths (Section 7 describes one cause). We can recover from corruption using other Chunk replicas, but it would be impractical to detect corruption by comparing replicas across Chunk servers. Moreover, divergent replicas may be legal: the semantics of GFS mutations, in particular the atomic record append discussed earlier, do not guarantee identical replicas (alex Note: replicas are not byte-wise identical). Therefore each Chunk server must independently verify the integrity of its own copy by maintaining checksums.

We break each Chunk into 64KB blocks, each with a corresponding 32-bit checksum. Like other metadata, checksums are kept in memory and stored persistently with logging, separate from user data.
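
As an illustration of this layout, the following Python sketch keeps one CRC32 per 64KB block and verifies every block touched by a read before returning data, as the read path described below requires. CRC32 stands in for whatever 32-bit checksum GFS actually uses, and the function names are ours.

import zlib

BLOCK = 64 * 1024   # 64KB checksum blocks, as described above

def block_checksums(chunk_bytes):
    """Return one 32-bit CRC per 64KB block of a chunk replica."""
    return [zlib.crc32(chunk_bytes[i:i + BLOCK]) & 0xFFFFFFFF
            for i in range(0, len(chunk_bytes), BLOCK)]

def verify_range(chunk_bytes, checksums, offset, length):
    """Verify every block overlapped by a read before returning any data."""
    first, last = offset // BLOCK, (offset + length - 1) // BLOCK
    for b in range(first, last + 1):
        data = chunk_bytes[b * BLOCK:(b + 1) * BLOCK]
        if (zlib.crc32(data) & 0xFFFFFFFF) != checksums[b]:
            raise IOError(f"checksum mismatch in block {b}")
    return chunk_bytes[offset:offset + length]

chunk = bytes(200 * 1024)                 # a 200KB replica for illustration
sums = block_checksums(chunk)
assert verify_range(chunk, sums, 70 * 1024, 10 * 1024) == chunk[70 * 1024:80 * 1024]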

For reads, the Chunk server verifies the checksum of every block that overlaps the read range before returning any data to the requester, whether a client or another Chunk server. Therefore Chunk servers do not propagate corrupted data to other machines. If a block's checksum does not match, the Chunk server returns an error to the requester and reports the mismatch to the Master server. In response, the requester reads from another replica, while the Master server clones the Chunk from another replica. After a valid new replica is in place, the Master server instructs the Chunk server with the corrupted replica to delete it.

Checksums have little effect on read performance, for several reasons. Most reads span at least a few checksum blocks, so only a small amount of extra data needs to be read and checksummed for verification. GFS client code further reduces this overhead by trying to align reads on checksum block boundaries. Moreover, checksum lookup and comparison on the Chunk server require no I/O, and checksum calculation can often be overlapped with I/O.

Checksum computation is heavily optimized for writes that append to the end of a Chunk (as opposed to writes that overwrite existing data), because such writes dominate our workload. We incrementally update the checksum of the last partial checksum block and compute new checksums for any checksum blocks filled entirely by the append. Even if the last partial block is already corrupted and we fail to detect it now, the new checksum will not match the stored data, so the corruption will be detected the next time the block is read.

In contrast, if a write overwrites an existing range of the Chunk, we must read and verify the first and last blocks of the range being overwritten before performing the write; after the write completes, we compute and record the new checksums. If we did not verify the first and last blocks, the new checksums could hide corruption in the portions of those blocks that the write does not cover.
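
A rough sketch of the two update paths just described, under the same CRC32-per-64KB-block assumption as above (again not the actual implementation): appends touch only the trailing partial block plus any brand-new blocks, while overwrites must verify the first and last partially covered blocks before the data is replaced.

import zlib

BLOCK = 64 * 1024

def append_update(chunk, checksums, new_data):
    """Append path: recompute only the last partial block's checksum and
    add fresh checksums for blocks created entirely by the append."""
    chunk += new_data
    first_dirty = (len(chunk) - len(new_data)) // BLOCK   # last partial block
    for b in range(first_dirty, (len(chunk) - 1) // BLOCK + 1):
        crc = zlib.crc32(chunk[b * BLOCK:(b + 1) * BLOCK]) & 0xFFFFFFFF
        if b < len(checksums):
            checksums[b] = crc
        else:
            checksums.append(crc)
    return chunk, checksums

def overwrite_update(chunk, checksums, offset, new_data):
    """Overwrite path: verify the first and last partially covered blocks
    before writing, so a corruption hiding in the uncovered part of those
    blocks cannot be masked by the new checksum."""
    for b in {offset // BLOCK, (offset + len(new_data) - 1) // BLOCK}:
        block = chunk[b * BLOCK:(b + 1) * BLOCK]
        if (zlib.crc32(block) & 0xFFFFFFFF) != checksums[b]:
            raise IOError(f"refusing overwrite: block {b} already corrupt")
    chunk = chunk[:offset] + new_data + chunk[offset + len(new_data):]
    for b in range(offset // BLOCK, (offset + len(new_data) - 1) // BLOCK + 1):
        checksums[b] = zlib.crc32(chunk[b * BLOCK:(b + 1) * BLOCK]) & 0xFFFFFFFF
    return chunk, checksums

chunk = bytes(100 * 1024)
sums = [zlib.crc32(chunk[i:i + BLOCK]) & 0xFFFFFFFF for i in range(0, len(chunk), BLOCK)]
chunk, sums = append_update(chunk, sums, b"x" * 1024)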

During idle periods, Chunk servers scan and verify the contents of inactive Chunk. This lets us detect corruption in Chunk that are rarely read. Once corruption is found, the Master can create a new, uncorrupted replica and delete the corrupted one. This also prevents an inactive but corrupted replica from fooling the Master node into thinking it has enough valid replicas of a Chunk.

5.3 Diagnostic tool

Extensive and detailed diagnostic logs have helped immeasurably in problem isolation, debugging, and performance analysis, at very little cost. Without logs it is hard to understand transient, non-repeatable interactions between machines. GFS servers generate diagnostic logs that record many significant events (such as Chunk servers going up and down) and all RPC requests and replies. These diagnostic logs can be deleted freely without affecting the correctness of the system; however, we try to keep them for as long as storage space allows.

The RPC log contains detailed records of all requests and responses that occur on the network, but does not include file data that is read and written. By matching requests and responses and collecting RPC log records on different machines, we can replay all message interactions to diagnose the problem. Logs are also used to track load tests and performance analysis.

Logs have little impact on performance (far less than the benefits) because these logs are written sequentially and asynchronously. Recent event logs are kept in memory and can be used for continuous online monitoring.

6. Measurement

In this section, we will use some small-scale benchmarks to show some inherent bottlenecks in the architecture and implementation of the GFS system, as well as benchmark data from real GFS clusters used internally by Google.

6.1 small-scale benchmark

We measured performance on a GFS cluster consisting of 1 Master server, 2 Master server replication nodes, 16 Chunk servers, and 16 clients. Note that this cluster configuration scheme is used only for ease of testing. A typical GFS cluster has hundreds of Chunk servers and hundreds of clients.

All machines have the same configuration: two PIII 1.4GHz processors, 2GB memory, two 80G/5400rpm hard drives, and 100Mbps full-duplex Ethernet connected to a HP2524 switch. All 19 servers in the GFS cluster are connected to one switch and all 16 clients are connected to another switch. A 1Gbps line connection is used between the two switches.

6.1.1 read

N clients read from the file system simultaneously. Each client reads a randomly selected 4MB region from a 320GB file set. The read is repeated 256 times, so each client ends up reading 1GB of data. The Chunk servers taken together have only 32GB of memory, so we expect at most 10% of the reads to hit the Linux file system buffer cache. Our results should be close to those of a cold-cache read test.

Figure 3: aggregate throughput. The top curve shows the theoretical upper limit imposed by our network topology; the bottom curve shows measured throughput, with error bars giving 95% confidence intervals, which in some cases are too small to see because the measurements varied little.

Figure 3 (a) shows the aggregate read rate for N clients and its theoretical limit. The limit peaks at an aggregate of 125MB/s when the 1Gbps link between the two switches is saturated, or 12.5MB/s per client when its 100Mbps NIC is saturated, whichever applies. The observed read rate for a single client is 10MB/s, or 80% of the per-client limit. The aggregate read rate reaches 94MB/s for 16 clients, about 75% of the 125MB/s link limit, or 6MB/s per client. The efficiency drops from 80% to 75% mainly because, as the number of readers increases, so does the probability that multiple readers simultaneously read from the same Chunk server.

6.1.2 write

N clients write simultaneously to N distinct files. Each client writes 1GB of data to a new file in a series of 1MB writes. Figure 3 (b) shows the aggregate write rate and its theoretical limit. The limit plateaus at 67MB/s because we must write each byte to 3 of the 16 Chunk servers, each with a 12.5MB/s input connection.
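
For readers who want to reproduce the limits quoted in this subsection and the previous one, the arithmetic is simple; a few lines of Python under the stated topology assumptions:

# Theoretical limits implied by the test topology (100Mbps NIC per machine,
# 1Gbps inter-switch link; 1Mbps taken as 0.125MB/s).
client_nic = 12.5          # 100Mbps full duplex per client, in MB/s
inter_switch = 125.0       # 1Gbps link between the two switches, in MB/s
chunkservers = 16
replicas = 3

read_limit_one_client = client_nic                     # 12.5 MB/s
read_limit_16_clients = inter_switch                   # 125 MB/s, link saturates

# Every byte written lands on 3 of the 16 chunkservers, each fed at 12.5 MB/s.
write_limit = chunkservers * client_nic / replicas     # ~67 MB/s
print(read_limit_one_client, read_limit_16_clients, round(write_limit, 1))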

The write rate for one client is 6.3MB/s, about half of the theoretical limit. The main culprit is our network protocol stack: it does not interact well with the pipelining scheme we use to push data to Chunk replicas, and delays in propagating data from one replica to another reduce the overall write rate.

The aggregate write rate for 16 clients reaches 35MB/s (2.2MB/s per client), again about half the theoretical limit. As in the read case, it becomes more likely that multiple clients write concurrently to the same Chunk server as the number of clients increases. Moreover, collisions are more likely for 16 writers than for 16 readers because each write involves three different replicas.

The speed of writing is slower than we thought. In practice, this is not our main problem, because even if latency can be felt on a single client, it does not have a significant impact on the overall write bandwidth when there are a large number of clients.

6.1.3 record append

Figure 3 (c) shows record append performance. N clients append to a single file simultaneously. Performance is limited by the network bandwidth of the Chunk servers that store the last Chunk of the file, independent of the number of clients. The append rate starts at 6.0MB/s for one client and drops to 4.8MB/s for 16 clients, due mostly to congestion and to variance in the network transfer rates seen by different clients.

Our applications tend to produce multiple such files concurrently. In other words, N clients append to M shared files simultaneously, where both N and M are in the dozens or hundreds. Therefore, in practice, Chunk server network congestion has not been a significant issue: a client can make progress writing one file while the Chunk servers for another file are busy.

6.2 clusters in practical applications

We now take a closer look at two clusters in use within Google that are representative. Cluster A is usually used by hundreds of engineers for research and development. A typical task is initiated manually and runs for up to several hours; it reads through from a few MB to a few TB of data, transforms or analyzes it, and writes the results back to the cluster. Cluster B is used mainly for production data processing; its tasks last much longer and continuously generate and process multi-TB data sets with little human intervention. In both cases, a single "task" consists of many processes on many machines reading and writing many files simultaneously.

6.2.1 Storage

As the first five rows of the table above show, both clusters are made up of hundreds of Chunk servers, support many TB of disk space, and hold a large amount of data without being full. "Used space" includes all Chunk replicas. Since virtually all files are replicated three times, the two clusters actually store 18TB and 52TB of file data respectively.

The two clusters hold similar numbers of files, but cluster B has a larger proportion of dead files: files that were deleted or replaced by a new version but whose storage space has not yet been reclaimed. Cluster B also has more Chunk because its files tend to be larger.

6.2.2 metadata

In aggregate, the Chunk servers store more than ten GB of metadata, most of it the checksums for 64KB blocks of user data. The only other metadata kept on the Chunk servers is the Chunk version number described in Section 4.5.

The metadata kept on the Master server is far smaller, only tens of MB, or roughly 100 bytes per file on average. This agrees with our assumption that the size of the Master's memory does not limit the system's capacity in practice. Most of the per-file metadata consists of file names stored with prefix compression. Other metadata on the Master server includes file ownership and permissions, the mapping from files to Chunk, and each Chunk's current version number. In addition, for every Chunk we store its current replica locations and a reference count, the latter used to implement copy-on-write (alex Note: COW, copy-on-write).
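
Prefix compression of sorted path names is a standard technique; the following small Python sketch (not GFS's actual encoding) shows the idea of storing each name as a shared-prefix length plus a suffix relative to its predecessor, which is why the per-file metadata can stay near 100 bytes.

def prefix_compress(sorted_paths):
    """Store each path as (shared-prefix length, suffix) relative to the
    previous path; full file names are never kept twice."""
    out, prev = [], ""
    for path in sorted_paths:
        shared = 0
        while shared < min(len(prev), len(path)) and prev[shared] == path[shared]:
            shared += 1
        out.append((shared, path[shared:]))
        prev = path
    return out

paths = sorted(["/logs/2003/03/01.log", "/logs/2003/03/02.log", "/logs/2003/04/01.log"])
print(prefix_compress(paths))
# [(0, '/logs/2003/03/01.log'), (15, '2.log'), (12, '4/01.log')]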

Each individual server, whether a Chunk server or the Master server, holds only 50MB to 100MB of metadata. Recovery is therefore fast: it takes only a few seconds to read this metadata from disk before the server is able to answer client requests. However, the Master server remains somewhat hobbled for a period, typically 30 to 60 seconds, until it has finished polling all Chunk servers for their Chunk location information.

6.2.3 read and write rate

Table 3 shows read and write rates over various time periods. Both clusters had been up for about one week when these measurements were taken; both had recently been restarted to upgrade to a new version of GFS.

The average write rate since the restart was less than 30MB/s. When we took these measurements, cluster B was in the middle of a burst of writes at about 100MB/s, which produced a 300MB/s network load because each Chunk has three replicas.

The read rates were much higher than the write rates. As we assumed, the total workload contains far more reads than writes, and both clusters were performing heavy read activity. In particular, cluster A had sustained a read rate of 580MB/s for the preceding week; its network configuration can support 750MB/s, so it was using its resources efficiently. Cluster B can support peak read rates of 1300MB/s, but its applications were using only 380MB/s.

6.2.4 load of Master server

The data in Table 3 shows that approximately 200 to 500 operation requests are sent to the Master server per second. The Master server can easily cope with this request speed, so the processing power of the Master server is not the bottleneck of the system.

In earlier versions of GFS, the Master server was occasionally a bottleneck for some workloads. It spent most of its time sequentially scanning large directories (containing tens of thousands of files) looking for particular files. We therefore changed the Master's data structures to allow efficient binary search through the namespace. The Master can now easily handle many thousands of file accesses per second. If necessary, we could speed lookups up further by placing a name lookup cache in front of the namespace data structure.
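
The gain comes from replacing a linear directory scan with binary search over a sorted representation of the namespace. A simplified Python sketch of that idea follows; GFS's real structure also uses prefix compression and per-node locks, which are omitted here, and the class name is ours.

import bisect

class FlatNamespace:
    """Namespace kept as a sorted list of full path names so that lookups
    and directory listings use binary search instead of a linear scan."""
    def __init__(self, paths):
        self.paths = sorted(paths)

    def lookup(self, path):
        i = bisect.bisect_left(self.paths, path)
        return i < len(self.paths) and self.paths[i] == path

    def list_dir(self, directory):
        # All children of /dir sort contiguously between "/dir/" and "/dir0"
        # ("0" is the character right after "/" in ASCII).
        lo = bisect.bisect_left(self.paths, directory + "/")
        hi = bisect.bisect_left(self.paths, directory + "0")
        return self.paths[lo:hi]

ns = FlatNamespace(["/a/x", "/a/y", "/b/z"])
assert ns.lookup("/a/y") and ns.list_dir("/a") == ["/a/x", "/a/y"]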

6.2.5 recovery time

When a Chunk server fails, some Chunk may become under-replicated and must be cloned to restore their replication level. The time it takes to restore all such Chunk depends on the amount of available resources. In one experiment, we killed a single Chunk server in cluster B. The Chunk server held about 15000 Chunk containing 600GB of data. To limit the impact of cloning on running applications and to give the GFS scheduler leeway to correct its decisions, our default settings limit the cluster to 91 concurrent clone operations (40% of the number of Chunk servers), with each clone operation allowed at most 6.25MB/s (50Mbps). All Chunk were restored in 23.2 minutes, at an effective replication rate of 440MB/s.
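
A quick back-of-the-envelope check of these numbers, in Python, with the number of Chunk servers inferred (as an assumption) from the statement that 91 clones is 40% of them:

# Rough check of the recovery numbers quoted above (values taken from the text).
chunkservers = 227            # assumption: 91 is stated to be 40% of the servers
concurrent_clones = 91
per_clone_limit_mb_s = 6.25   # 50 Mbps cap per clone operation

cluster_clone_cap = concurrent_clones * per_clone_limit_mb_s   # about 569 MB/s
data_gb, minutes = 600, 23.2
effective_rate = data_gb * 1024 / (minutes * 60)               # about 441 MB/s
print(round(cluster_clone_cap, 1), round(effective_rate, 1))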

In another experiment, we killed two Chunk servers, each with roughly 16000 Chunk and 660GB of data. This double failure reduced 266 Chunk to a single replica. These 266 Chunk were cloned at a higher priority and were all restored to at least two replicas within 2 minutes, putting the cluster in a state where it could tolerate another Chunk server failure without losing data.

6.3 workload Analysis (Workload Breakdown)

In this section, we show a detailed analysis of the workload of two GFS clusters, which are similar but not identical to those in Section 6.2. Cluster X is used for research and development, and cluster Y is used for production data processing.

6.3.1 Methodology and precautions

The results in this section include only requests originated by clients, so they reflect the entire workload our applications generate for the file system. They do not include inter-server requests made to carry out a client request, nor requests for internal background activities of GFS, such as forwarded writes or rebalancing.

We derive and reconstruct statistics about I/O operations from the actual RPC request logs recorded by the GFS servers. For example, a GFS client may split a read into several RPCs to increase parallelism, and we infer the original read from those RPCs. Since our access patterns are highly stylized, we expect any mismatch to be lost in the noise (alex Note: Since our access patterns are highly stylized, we expect any error to be in the noise). Applications could supply more accurate data through explicit logging, but recompiling and restarting thousands of running clients for that purpose is impractical, and collecting the results from so many machines would be cumbersome.

One should be careful not to over-generalize from our workload data (alex note: do not use the data in this section as the basic guiding data). Since Google completely controls both GFS and its applications, the applications tend to be tuned for GFS, and conversely GFS is designed for these applications. Such mutual influence may also exist between general applications and file systems, but the effect is likely more pronounced in our case.

6.3.2 Chunk server workload

Table 4 shows the distribution of operations by the amount of data involved. Read sizes exhibit a bimodal distribution. Small reads (under 64KB) come from seek-intensive clients that look up small pieces of data within huge files. Large reads (over 512KB) come from long sequential reads through entire files.

On cluster Y, a significant number of reads return no data at all. Our applications, especially those in the production systems, often use files as producer-consumer queues: producers append to a file concurrently while a consumer reads the end of the file. Occasionally the consumer outpaces the producers and no data is returned. Cluster X shows this less often because it is usually used for short-lived data analysis tasks rather than long-running distributed applications.

Write sizes also exhibit a bimodal distribution. Large writes (over 256KB) typically result from significant buffering within the writers. Writers that buffer less data because they checkpoint or synchronize frequently, or that simply generate less data, account for the smaller writes (under 64KB).

As for record appends, cluster Y sees a much higher percentage of large record appends than cluster X does. This is because cluster Y is used by our production systems, which are more aggressively tuned for GFS.

Table 5 shows the total amount of data transferred for operations of various sizes. For all kinds of operations, the larger operations (over 256KB) account for most of the bytes transferred. Small reads (under 64KB) transfer a small but still significant portion of the read data because of the random seek workload.

6.3.3 record append vs. write operations

Record append operations are heavily used, especially in our production systems. For cluster X, the ratio of record appends to ordinary writes is 108:1 by bytes transferred and 8:1 by operation count. For cluster Y, our production system, the ratios are 3.7:1 and 2.5:1 respectively. Moreover, these ratios suggest that on both clusters individual record appends tend to be larger than individual writes. For cluster X, however, the overall use of record append during the measured period was fairly low, so the results are likely skewed by one or two applications with particular buffer size choices.

As expected, our data mutation workload is dominated by appending rather than overwriting. We measured the amount of data overwritten on primary replicas; this approximates the case of a client deliberately overwriting previously written data rather than appending new data. For cluster X, overwriting accounts for under 0.0001% of the bytes written and under 0.0003% of the write operations. For cluster Y, both ratios are 0.05%. Although this is small, it is still higher than we expected. It turns out that most of these overwrites come from client retries after errors or timeouts; they are not part of the workload per se but a consequence of the retry mechanism.

6.3.4 workload for Master

Table 6 shows a breakdown of requests to the Master server by type. Most requests ask for Chunk locations (FindLocation) on behalf of reads, and for lease holder information (FindLeaseHolder) on behalf of data mutations.

Clusters X and Y see significantly different numbers of Delete requests because cluster Y stores production data sets that are regularly regenerated and replaced with newer versions. Some of this difference is further hidden in Open requests, because an old version of a file may be implicitly deleted by being opened for writing from scratch (like mode "w" in the UNIX open call).

FindMatchingFiles is a pattern-matching request that supports "ls" and similar file system operations. Unlike other Master requests, it may process a large part of the namespace and so can be expensive. Cluster Y sees it much more often because automated data-processing tasks examine parts of the file system to understand global application state, whereas applications on cluster X are under more explicit user control and usually know in advance the names of all the files they need.

7. experience

In the process of building and deploying GFS, we experienced a variety of problems, some operational and some technical.

At first, GFS was conceived as the back-end file system for our production systems. Over time, its use evolved to include research and development tasks. We started by adding little things such as permissions and quotas; GFS now includes rudimentary support for these. While our production systems are well disciplined and controlled, users are not always so, and more infrastructure is needed to keep users from interfering with one another.

Our biggest problems have been disk and Linux related. Many of our disks claimed to the Linux IDE driver that they supported a range of protocol versions, but in fact responded reliably only to the more recent ones. Since the protocol versions are very similar, the drives mostly worked, but occasionally the mismatch caused the drive and the kernel to disagree about the drive's state, silently corrupting data because of problems in the kernel. This problem motivated our use of checksums to detect data corruption, and in parallel we modified the kernel to handle these protocol mismatches.

Earlier we had some problems with the Linux 2.2 kernel related to the cost of fsync(): its cost is proportional to the size of the whole file rather than to the size of the modified portion. This was a problem for our large operation log files, especially before we implemented checkpoints. We worked around it for a time by using synchronous writes and eventually migrated to the Linux 2.4 kernel.

Another Linux problem involved a single reader-writer lock that any thread in an address space must hold while paging data in from disk (reader lock) or while modifying the address space in an mmap() call (writer lock). We saw transient timeouts in our system even under light load and spent a lot of effort looking for resource bottlenecks or sporadic hardware failures. Eventually we found that this single lock blocked the primary network thread from mapping new data into memory while the disk thread was paging in previously mapped data. Since we are mainly limited by the network interface rather than by memory copy bandwidth, we worked around this by replacing mmap() with pread(), at the cost of an extra copy.
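
For illustration only, the two access patterns look roughly like this in Python (the actual servers are C++ against the Linux 2.2 kernel; Python's mmap and os.pread merely mirror the system calls involved, and the function names are ours):

import mmap
import os

def read_with_mmap(path, offset, length):
    # Page faults on the mapped region take the address-space reader lock.
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return m[offset:offset + length]

def read_with_pread(path, offset, length):
    # Positional read: one extra copy into a buffer, but no mmap involvement.
    # (os.pread is available on Unix systems.)
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.pread(fd, length, offset)
    finally:
        os.close(fd)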

Despite occasional other problems, the availability of Linux source code has allowed us to quickly explore and understand system behavior. When appropriate, we improve the kernel and share the changes with the open source community.

8. Related work

Like other large distributed file systems such as AFS [5], GFS provides a location-independent namespace, which allows data to be moved transparently for load balancing or fault tolerance. Unlike AFS, GFS spreads a file's data across different storage servers, in a manner more akin to xFS [1] and Swift [3], in order to deliver aggregate performance and increased fault tolerance.

Since disks are relatively cheap and replication is much simpler than more sophisticated RAID [9] approaches, GFS currently uses only replication for redundancy and therefore consumes more raw storage than xFS or Swift (alex Note: Raw storage, bare disk space).

In contrast to file systems such as AFS, xFS, Frangipani [12], and Intermezzo [6], GFS does not provide any caching at the file system level. Our target workloads have little reuse within a single application run, because they either stream through a large data set or randomly seek within it and read small amounts of data each time.

Some distributed file systems, such as Frangipani, xFS, Minnesota's GFS [11], and GPFS [10], remove the centralized server and rely on distributed algorithms for consistency and management. We chose the centralized approach to simplify the design, increase reliability, and gain flexibility. In particular, a centralized Master makes it much easier to implement sophisticated Chunk placement and replication policies, since the Master already holds most of the relevant information and controls how it changes. We address fault tolerance by keeping the Master's state small and fully replicated on other machines. Scalability and high availability (for reads) are currently provided by our shadow Master mechanism. Updates to the Master's state are made persistent by appending to a write-ahead log. If needed, we could adapt a primary-copy scheme such as the one in Harp [7] to provide high availability with stronger consistency guarantees than we currently offer.

We address a problem similar to the one faced by Lustre [8]: delivering aggregate performance to a large number of clients. However, we have simplified the problem significantly by focusing on the needs of our own applications rather than building a POSIX-compliant file system. In addition, GFS assumes a large number of unreliable components, so fault tolerance is central to our design.

GFS most closely resembles the NASD architecture [4]. While NASD is based on network-attached disk drives, GFS uses commodity machines as Chunk servers, as done in the NASD prototype. Unlike the NASD work, our Chunk servers use lazily allocated fixed-size Chunk rather than variable-length objects. Additionally, GFS implements features such as rebalancing, replication, and recovery that are required in a production environment.

Unlike Minnesota's GFS and NASD, we do not seek to alter the model of the storage device. We focus on addressing the day-to-day data processing needs of complicated distributed systems built from ordinary commodity components.

The producer-consumer queues enabled by atomic record append address a problem similar to the distributed queues in River [2]. River uses memory-based queues distributed across machines together with careful data flow control, while GFS uses a persistent file that many producers can append to concurrently. The River model supports m-to-n distributed queues but lacks the fault tolerance that comes with persistent storage, whereas GFS supports only m-to-1 queues. Multiple consumers can read the same file, but they must coordinate to partition the incoming load.
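
A toy model of such an m-to-1 queue built on record append might look like the sketch below. The AppendOnlyFile class is a stand-in for a GFS file opened for record append (a list plus a lock instead of replicated chunks), so only the usage pattern is meaningful.

import threading

class AppendOnlyFile:
    """Stand-in for a GFS file opened for record append: many producers
    append concurrently, and each record lands atomically at least once
    at some offset chosen by the system."""
    def __init__(self):
        self._records = []
        self._lock = threading.Lock()

    def record_append(self, payload):
        with self._lock:
            self._records.append(payload)
            return len(self._records) - 1    # offset assigned by the system

    def read_from(self, offset):
        return self._records[offset:]

queue = AppendOnlyFile()

def producer(name):
    for i in range(3):
        queue.record_append(f"{name}-{i}")

threads = [threading.Thread(target=producer, args=(f"p{n}",)) for n in range(4)]
for t in threads: t.start()
for t in threads: t.join()

# A single consumer reads from where it left off (m producers, 1 consumer).
cursor = 0
batch = queue.read_from(cursor)
cursor += len(batch)
print(len(batch), "records consumed")   # 12 records; interleaving order not guaranteed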

9. Concluding remarks

The Google file system demonstrates the qualities essential for supporting large-scale data processing workloads on commodity hardware. While some design decisions are specific to our unusual setting, many apply to data processing tasks of similar magnitude and cost consciousness.

We began by reexamining traditional file system assumptions in light of our current and anticipated application workloads and technological environment. Our observations led to a radically different design: we treat component failures as the norm rather than the exception, optimize for huge files that are mostly appended to (perhaps concurrently) and then read (usually sequentially), and both extend and relax the standard file system interface to improve the overall system.

Our system provides fault tolerance through constant monitoring, replication of crucial data, and fast, automatic recovery. Chunk replication lets us tolerate Chunk server failures. The frequency of such failures motivates an online repair mechanism that regularly and transparently repairs damage and re-creates lost replicas as soon as possible. In addition, we use checksums to detect data corruption at the disk or IDE subsystem level, which becomes all too common given the number of disks in the system.

Our design delivers high aggregate throughput to many concurrent readers and writers. We achieve this by separating control from data transfer: control flows through the Master server, while data flows directly between Chunk servers and clients. Master involvement in common operations is minimized by the large Chunk size and by Chunk leases, which delegate authority over data mutations to the primary replica. These measures keep the simple, centralized Master from becoming a bottleneck. We believe that improvements to our network protocol stack will lift the current limit on the write throughput seen by an individual client.

GFS has successfully met our storage needs. Within Google it is widely used as a storage platform for research and development as well as for production data processing. It is an important tool that enables us to continue to innovate and attack problems on the scale of the entire web.

This concludes our walkthrough of how to analyze the Google GFS file system in depth. We hope the content above has been helpful and has taught you something new. If you found this article useful, please share it so that more people can see it.
