Detailed Explanation of GlusterFS Distributed File System Cluster Theory


In enterprises, important data is generally stored on hard disks. Although disk performance keeps improving, no matter how fast disk access becomes, what an enterprise looks for first is reliability, and only then efficiency. If data is at risk of being lost, even the best hardware cannot make up for the damage to the business. In addition, the rise of cloud computing in recent years has placed even higher demands on storage. Distributed storage has therefore gradually gained acceptance: it offers better performance, high scalability, and reliability. Most distributed solutions store metadata such as the directory structure on dedicated metadata servers, which provide the index for the whole distributed store; however, once the metadata server is damaged, the entire distributed storage stops working. Below we introduce a distributed storage solution that needs no metadata server: GlusterFS.

I. A brief introduction to GlusterFS

1. GlusterFS

GlusterFS is an open-source distributed file system and the core of the Gluster Scale-Out storage solution. It is highly scalable for data storage and can support PB-level storage capacity by adding nodes. GlusterFS aggregates scattered storage resources over TCP/IP or InfiniBand RDMA networks, provides storage services, and manages data under a single global namespace. Based on a stackable user-space design with no metadata server, GlusterFS delivers excellent performance for a wide variety of data workloads.

GlusterFS is mainly composed of storage servers, clients, and an optional NFS/Samba storage gateway (used as needed). As shown in the figure:

The biggest design feature of the GlusterFS architecture is that there is no metadata server component, which helps improve the performance, reliability, and stability of the whole system. Traditional distributed file systems mostly keep metadata on metadata servers, which hold the directory information and directory structure of the storage nodes. This design is very efficient for browsing directories, but it also has defects, such as a single point of failure: once the metadata server fails, even if the storage nodes are highly redundant, the whole storage system collapses. The GlusterFS distributed file system, by contrast, is designed without a metadata server, which gives it strong horizontal scalability, high reliability, and good storage efficiency. GlusterFS supports interconnection over TCP/IP and high-speed InfiniBand RDMA networks; clients can access data through the native GlusterFS protocol, while terminals that do not run a GlusterFS client can access data through the storage gateway using the standard NFS/CIFS protocols.
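
As an illustration of the two access paths, here is a minimal sketch with hypothetical host and volume names, assuming a volume named dis-volume exported by server1 and that Gluster's built-in NFS service is enabled:

[root@localhost ~]# mount -t glusterfs server1:/dis-volume /mnt/gluster
// Native access: the client runs the GlusterFS/FUSE client and speaks the GlusterFS protocol directly.
[root@localhost ~]# mount -t nfs -o vers=3 server1:/dis-volume /mnt/nfs
// Gateway access: a host without a GlusterFS client mounts the same volume over standard NFSv3.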

2. GlusterFS features

Scalability and high performance. GlusterFS combines two characteristics to provide high-capacity storage solutions:

(1) The Scale-Out architecture improves storage capacity and performance by adding storage nodes (disk, compute, and I/O resources can all be added independently), and supports high-speed interconnects such as 10GbE and InfiniBand.

(2) The Gluster elastic hash removes GlusterFS's dependence on a metadata server. GlusterFS uses an elastic hashing algorithm to locate data in the storage pool, abandoning the traditional approach of locating data through a metadata server. GlusterFS can intelligently locate any data fragment (fragments are stored on different nodes) without consulting an index or querying a metadata server. This mechanism enables horizontal scaling of storage, eliminates the single point of failure and the performance bottleneck, and achieves truly parallel data access.

High availability. GlusterFS can automatically replicate files (similar to RAID 1) by configuring certain volume types, so that even if a node fails, data access is not affected. When data becomes inconsistent, the automatic repair function restores it to the correct state; repair runs incrementally in the background and does not consume excessive system resources. GlusterFS can work with virtually any storage, because it does not define its own private data file format; it stores files on the operating system's standard disk file systems (such as EXT3 or XFS), and the data can still be accessed in the traditional way.

Global unified namespace. The global unified namespace aggregates all storage resources into a single virtual storage pool, shielding users and applications from the physical storage details. Storage resources (similar to LVM) can be flexibly expanded or shrunk as needed in a production environment. With multiple nodes, the global unified namespace can also balance load across nodes, which greatly improves access efficiency.

Elastic volume management. GlusterFS stores data in logical volumes, which are carved out of logically independent storage pools. Storage pools can be added and removed online without interrupting service; logical volumes can grow or shrink online on demand and can be balanced across multiple nodes. File system configuration changes can likewise be made and applied online in real time, to adapt to workload changes or for online performance tuning.

Based on standard protocols. The Gluster storage service supports NFS, CIFS, HTTP, FTP, SMB, and the Gluster native protocol, and is fully compatible with the POSIX standard. Existing applications can access data in Gluster without any modification, or use a dedicated API for higher efficiency. This is very useful when deploying Gluster in a public cloud environment, where Gluster abstracts away the cloud provider's private APIs and exposes standard POSIX interfaces.

3. GlusterFS terms

Brick (storage block): a dedicated partition provided by a host in the trusted storage pool for physical storage; it is the basic storage unit in GlusterFS, exposed as a storage directory on a server in the trusted storage pool. The storage directory is identified by the server plus an absolute path, written as SERVER:EXPORT, for example 192.168.1.4:/data/mydir.

Volume (logical volume): a logical volume is a collection of Bricks. A volume is the logical device on which data is stored, similar to a logical volume in LVM. Most Gluster management operations are performed on volumes.

FUSE: a kernel module that allows users to create their own file systems without modifying kernel code.

VFS: the interface that kernel space provides to user space for accessing disks.

Glusterd (background management process): runs on every node in the storage cluster.
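
As a brief sketch of these terms and of the elastic volume management described above (host names and paths are hypothetical examples, not taken from the article): servers are first joined into the trusted storage pool, their Bricks are combined into a Volume, and the Volume can later be expanded online:

[root@localhost ~]# gluster peer probe 192.168.1.4
// Adds the host to the trusted storage pool.
[root@localhost ~]# gluster volume create myvol 192.168.1.3:/data/mydir 192.168.1.4:/data/mydir
// Combines two Bricks (written as SERVER:EXPORT) into a logical Volume named myvol.
[root@localhost ~]# gluster volume add-brick myvol 192.168.1.5:/data/mydir
// Elastic volume management: a Brick can be added to the Volume online, followed by a rebalance if desired.
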
4. Modular stack architecture

As shown in the figure:

GlusterFS adopts a modular, stackable architecture and can be configured for customized application environments, such as large-file storage, massive small-file storage, cloud storage, and multi-protocol access. Complex functionality is achieved by combining modules in various ways. For example, the Replicate module implements RAID 1 and the Stripe module implements RAID 0; combining the two yields RAID 10 or RAID 01, achieving both higher performance and higher reliability.

GlusterFS has a modular, stackable architectural design. The modules are called Translators, a powerful mechanism provided by GlusterFS: with this well-defined interface, the functionality of the file system can be extended efficiently and easily.

(1) The design is highly modular, and the module interfaces of the server and the client are compatible, so the same translator can be loaded on both the client and the server.

(2) All functionality in GlusterFS is implemented through translators, and the client side is more complex than the server side, so the functional focus is mainly on the client.
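
For illustration only, a simplified, hypothetical client-side volume file shows how translators are stacked (the exact translators and options that GlusterFS generates depend on the volume type and version):

volume dis-volume-client-0
  type protocol/client
  option remote-host server1
  option remote-subvolume /dir1
end-volume

volume dis-volume-client-1
  type protocol/client
  option remote-host server2
  option remote-subvolume /dir2
end-volume

volume dis-volume-replicate-0
  type cluster/replicate
  subvolumes dis-volume-client-0 dis-volume-client-1
end-volume

Each volume ... end-volume block defines one translator instance, and the subvolumes line stacks it on top of the translators below it.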

II. The working principle of GlusterFS

1. GlusterFS workflow

The GlusterFS data access process is shown in the figure:

The figure gives only an overview of GlusterFS data access; the rough flow is as follows:

(1) The client or application accesses the data through the mount point of GlusterFS (a mount example follows this list).

(2) The Linux kernel receives the request and handles it through the VFS API.

(3) VFS hands the request to the FUSE kernel module, which registers an actual file system, FUSE, with the system; the FUSE file system then passes the data to the GlusterFS client via the /dev/fuse device file. You can think of the FUSE file system as a proxy.

(4) After the GlusterFS client receives the data, it processes the data according to its configuration file.

(5) After processing by the GlusterFS client, the data is transferred over the network to the remote GlusterFS server and written to the server's storage devices.
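
A minimal sketch of step (1), assuming a volume named dis-volume that has already been created and started on server1 (all names are illustrative):

[root@localhost ~]# modprobe fuse
// Makes sure the FUSE kernel module is loaded (it usually is by default).
[root@localhost ~]# mount -t glusterfs server1:/dis-volume /mnt/gluster
// Mounts the volume through the GlusterFS/FUSE client; I/O on /mnt/gluster then follows steps (2) to (5).
[root@localhost ~]# df -hT /mnt/gluster
// Shows the mounted volume with its fuse.glusterfs file system type.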

2. Elastic HASH algorithm

The elastic HASH algorithm is based on the Davies-Meyer algorithm and maps each name into a 32-bit integer range. Assuming the logical volume contains N storage units (Bricks), the 32-bit integer range is divided into N contiguous subspaces, each corresponding to one Brick. When a user or application accesses a name in the namespace, the hash of that name is computed, and the Brick holding the data is located according to which 32-bit subspace the hash value falls into.
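
As a worked illustration (numbers chosen for clarity, not taken from GlusterFS output): with N = 4 Bricks, the 32-bit space from 0 to 2^32 - 1 is split into four equal contiguous subranges of 2^30 values each:

Brick1: 0x00000000 - 0x3FFFFFFF
Brick2: 0x40000000 - 0x7FFFFFFF
Brick3: 0x80000000 - 0xBFFFFFFF
Brick4: 0xC0000000 - 0xFFFFFFFF

A file whose name hashes to, say, 0x9A000000 falls into Brick3's subrange and is therefore stored on that Brick.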

The advantages of the elastic HASH algorithm are as follows:

It ensures that data is evenly distributed across the Bricks; it removes the dependence on a metadata server, thereby eliminating the single point of failure and the access bottleneck.

Now assume we create a GlusterFS volume with four Bricks; the 2^32 hash range is divided evenly among the four Bricks' mount directories on the servers. GlusterFS assigns hash distribution ranges to directories rather than to machines. As shown in the figure:

Brick* represents a directory; the distribution range is stored in the extended attributes of each Brick's mount-point directory.
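
As a hedged illustration of how this can be inspected (assuming a Brick directory /dir1 on a storage node; trusted.glusterfs.dht is the extended attribute GlusterFS's DHT layer uses for layout ranges):

[root@localhost ~]# getfattr -n trusted.glusterfs.dht -e hex /dir1
// Prints, in hex, the hash range assigned to this Brick directory.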

Four files are created in the volume. When a file is accessed, its HASH value is computed with the fast hash function, and the file is mapped to the server Brick whose subspace contains that hash value, as shown in the figure:

3. Volume types of GlusterFS

GlusterFS supports seven volume types, which can meet the needs of different applications for high performance and high availability. The seven types are:

(1) Distributed volume (Distribute volume): files are distributed across the Brick servers by the HASH algorithm; this is the basis of GlusterFS. Hashing whole files to different Bricks only expands the disk space; if a disk is damaged, the data on it is lost. It is file-level RAID 0 and has no fault tolerance.

(2) Stripe volume (Stripe volume): similar to RAID 0, files are split into data blocks and distributed to multiple Brick servers in a round-robin fashion. Storage is block-based and large files are supported; the larger the file, the higher the read efficiency.

(3) Replicated volume (Replica volume): files are synchronized to multiple Bricks so that several copies exist; it is file-level RAID 1 and provides fault tolerance. Because data is replicated on multiple Bricks, read performance improves greatly, but write performance drops.

(4) Distributed stripe volume (Distribute Stripe volume): the number of Brick servers is a multiple of the stripe count (the number of Bricks a file is striped across); it combines the characteristics of distributed and stripe volumes.

(5) Distributed replicated volume (Distribute Replica volume): the number of Brick servers is a multiple of the replica count (the number of data copies); it combines the characteristics of distributed and replicated volumes.

(6) Striped replicated volume (Stripe Replica volume): similar to RAID 10, with the characteristics of both stripe and replicated volumes.

(7) Distributed striped replicated volume (Distribute Stripe Replica volume): a composite of the three basic volume types, usually used for Map/Reduce-like applications.

Several important volume types are described in detail below:

1. Distributed volume

Distributed volumes are the default volume type in GlusterFS; when a volume is created without specifying a type, a distributed volume is created. In this mode, files are not split into blocks: each file is stored whole on one server node, directly on the local file system, so most Linux commands and tools continue to work normally. The hash value is saved in extended file attributes. The underlying file systems currently supported include ext3, ext4, ZFS, XFS, and so on.

Because the local file system is used directly, access efficiency is not improved; it can even drop due to the extra network communication. In addition, very large files are hard to support, because distributed volumes do not split files into blocks: although ext4 can already support single files of up to 16 TB, the capacity of the local storage device is ultimately limited.

As shown in the figure:

As shown in the figure: File1 and File2 are stored on Server1, while File3 is stored on Server2. From the user's point of view the placement looks random (it is determined by the hash): a file resides entirely on either Server1 or Server2 and cannot be split into blocks stored on Server1 and Server2 at the same time.

Distributed volumes have the following characteristics:

Files are distributed across different servers without redundancy; the volume can be expanded cheaply and easily; a single point of failure can cause data loss; data protection depends on the underlying storage.

Commands to create distributed volumes:

[root@localhost ~]# gluster volume create dis-volume server1:/dir1 server2:/dir2
// Creates a distributed volume named dis-volume; files will be distributed across server1:/dir1 and server2:/dir2 according to the HASH algorithm.
Creation of dis-volume has been successful
Please start the volume to access data

2. Stripe volume

Stripe mode is equivalent to RAID 0: a file is divided into N blocks (N is the number of stripe nodes) by offset and stored round-robin across the Brick server nodes. Each node stores its blocks as ordinary files in the local file system and records the total number of blocks and the sequence number of each block in extended attributes. The number of stripes specified at creation time must equal the number of storage servers contained in the volume's Bricks. Stripe volumes perform well especially when storing large files, but they provide no redundancy.

As shown in the figure:

The file is divided into six segments: segments 1, 3, and 5 are stored on Server1, and segments 2, 4, and 6 on Server2.

Stripe volumes have the following characteristics:

Data is divided into smaller blocks and distributed to different stripes across the Brick server group; the distribution reduces load, and the smaller blocks speed up access; there is no data redundancy.

Command to create a stripe volume:

[root@localhost ~]# gluster volume create stripe-volume stripe 2 transport tcp server1:/dir1 server2:/dir2
// Creates a stripe volume named stripe-volume; files will be striped round-robin across server1:/dir1 and server2:/dir2.
Creation of stripe-volume has been successful
Please start the volume to access data

3. Replicated volume

Replicated mode, also known as AFR, is equivalent to RAID 1: one or more copies of each file are kept, and every node holds the same content and directory structure. Because copies must be kept, disk utilization is low. If the storage space on the nodes differs, the capacity of the smallest node is taken as the total capacity of the volume, following the bucket effect. When configuring a replicated volume, the replica count must equal the number of storage servers contained in the volume's Bricks. Replicated volumes are redundant: even if one node fails, normal use of the data is not affected.

As shown in the figure:

File1 and File2 are stored on both Server1 and Server2; that is, the files on Server2 are copies of the files on Server1.

Replication volumes have the following characteristics:

Every server in the volume keeps a complete copy; the number of replicas can be chosen when the volume is created; there must be at least two Brick servers; the volume is redundant.

Commands to create replication volumes:

[root@localhost ~]# gluster volume create rep-volume replica 2 transport tcp server1:/dir1 server2:/dir2
// Creates a replicated volume named rep-volume; each file will be stored as two copies at the same time, one in the server1:/dir1 Brick and one in the server2:/dir2 Brick.
Creation of rep-volume has been successful
Please start the volume to access data

4. Distributed stripe volume

A distributed stripe volume combines the functions of distributed and stripe volumes and is mainly used for access to large files. Creating a distributed stripe volume requires at least four servers.

As shown in the figure:

As shown in the figure, File1 and File2 are placed on Server1 and Server2 respectively by the distributed-volume function. On Server1, File1 is divided into four segments, of which segments 1 and 3 go into the exp1 directory and segments 2 and 4 into the exp2 directory on Server1. On Server2, File2 is likewise divided into four segments, just like File1.

Commands to create distributed stripe volumes:

[root@localhost ~]# gluster volume create dis-stripe stripe 2 transport tcp server1:/dir1 server2:/dir2 server3:/dir3 server4:/dir4
// Creates a distributed stripe volume named dis-stripe. When configuring a distributed stripe volume, the number of storage servers contained in the volume's Bricks must be a multiple (2x or more) of the stripe count.
Creation of dis-stripe has been successful
Please start the volume to access data

Note: when creating a volume, if the number of storage servers equals the stripe or replica count, a plain stripe volume or replicated volume is created; if the number of storage servers is two or more times the stripe or replica count, a distributed stripe volume or distributed replicated volume is created.
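
Whatever the type, a newly created volume must be started before it can be used; a minimal, hypothetical follow-up for the dis-stripe volume above might look like this:

[root@localhost ~]# gluster volume start dis-stripe
// Starts the volume so that clients are allowed to mount it.
[root@localhost ~]# gluster volume info dis-stripe
// Shows the volume type, status, and list of Bricks.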

5. Distributed replicated volume

A distributed replicated volume combines the characteristics of distributed and replicated volumes and is mainly used in scenarios where redundancy is required, as shown in the figure:

As shown in the figure: File1 and File2 are each placed onto one of the replica groups by the distributed-volume function. When File1 is stored, it gets two identical copies according to the behaviour of replicated volumes, one in the exp1 directory on Server1 and one in the exp2 directory on Server2. When File2 is stored, it likewise gets two identical copies, one in the exp3 directory on Server3 and one in the exp4 directory on Server4.

Commands to create distributed replication volumes:

[root@localhost ~]# gluster volume create dis-rep replica 2 transport tcp server1:/dir1 server2:/dir2 server3:/dir3 server4:/dir4
// Creates a distributed replicated volume named dis-rep. When configuring a distributed replicated volume, the number of storage servers contained in the volume's Bricks must be a multiple (2x or more) of the replica count.
Creation of dis-rep has been successful
Please start the volume to access data

If there are eight servers and the replica count is 2, then by the order of the server list, servers 1 and 2 form one replica group, servers 3 and 4 form another, servers 5 and 6 another, and servers 7 and 8 another. If the replica count is 4, then servers 1 through 4 form one replica group and servers 5 through 8 form another.
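
For example (with hypothetical server and directory names), a replica-4 distributed replicated volume over eight servers could be created like this; by the ordering rule above, the first four Bricks form one replica group and the last four form the second:

[root@localhost ~]# gluster volume create dis-rep4 replica 4 transport tcp server1:/dir1 server2:/dir2 server3:/dir3 server4:/dir4 server5:/dir5 server6:/dir6 server7:/dir7 server8:/dir8
// Bricks are grouped into replica sets in the order listed: servers 1-4, then servers 5-8.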

These theoretical concepts are not particularly difficult; a careful read-through should be enough to grasp them, so there is no more to add here.

This blog post focuses on the theory; for hands-on practice, refer to the follow-up post on deploying a GlusterFS distributed file system.
