In this issue, the editor takes you through a deep analysis of the Lustre architecture. The article is rich in content and examines the topic from a professional point of view; I hope you get something out of reading it.
The Lustre architecture is a cluster storage architecture whose core component is the Lustre file system. The file system runs on the Linux operating system and provides a UNIX file system interface that conforms to the POSIX standard.
What is the Lustre file system?
The Lustre architecture is used for many different kinds of clusters. It is best known for powering many of the world's largest high-performance computing (HPC) clusters, serving tens of thousands of clients with petabytes (PB) of storage and hundreds of GB per second of throughput. Many HPC sites use the Lustre file system as a site-wide global file system serving dozens of clusters.
Lustre file systems can scale capacity and performance on demand, which reduces the need to deploy multiple separate file systems (such as one per compute cluster), avoids data replication between compute clusters, and simplifies storage management. The Lustre file system aggregates not only the storage capacity of many servers but also their I/O throughput, and it scales by adding servers: adding servers dynamically makes it easy to grow the throughput and capacity of the whole cluster.
Although the Lustre file system can run in many environments, it is not the best choice for every application. It is most appropriate when a single server cannot provide the required capacity. In some cases, the Lustre file system performs better than other file systems even in a single-server environment because of its strong locking and data consistency.
The Lustre file system is currently not well suited to a "peer-to-peer" usage model in which clients and servers run on the same nodes, each sharing a small amount of storage. Because Lustre lacks data replication at the software level, if a client or server node fails, the data stored on that node is inaccessible until the node restarts.
Lustre file system features
The Lustre file system can run on kernels from various vendors. A Lustre file system scales up or down with respect to the number of client nodes, disk storage, and bandwidth. Scalability and performance depend on the available disks, the network bandwidth, and the processing power of the servers in the system.
The Lustre file system can be deployed in a variety of configurations that can scale far beyond the size and performance observed in production systems so far. The main features of the Lustre file system are described below:
Performance-enhanced ext4 file system: the Lustre file system uses an improved version of the ext4 journaling file system, named ldiskfs, to store data and metadata. This version improves performance and provides the additional functionality required by the Lustre file system.
"in Lustre 2.4 or later, ZFS can be used as a backup file system for MDT,OST and MGS storage for Lustre." This enables Lustre to take advantage of the scalability and data integrity features of ZFS to achieve a single storage goal.
POSIX standard compliance: the Lustre file system client has been tested against a complete POSIX test suite and, with very few exceptions, behaves like a local file system such as ext4. In a cluster, most operations are atomic, so clients never see corrupted data or metadata. The Lustre software supports mmap() file I/O (a minimal mmap() example is sketched after this feature list).
High-performance heterogeneous networking: the Lustre software supports a variety of high-performance, low-latency networks and can use remote direct memory access (RDMA) for fast, efficient network transfers over advanced fabrics such as InfiniBand and Intel Omni-Path. Lustre routing can bridge multiple RDMA networks for best performance, and the Lustre software also includes integrated network diagnostics.
High availability: the Lustre file system supports active/active failover using shared storage partitions for OSS targets (OSTs). Lustre 2.3 and earlier support active/passive failover using a shared storage partition for the MDS target (MDT). The Lustre file system can work with a variety of high-availability (HA) managers to provide automated failover with no single point of failure (NSPF), which allows applications to recover transparently. Multiple-mount protection (MMP) provides comprehensive protection against errors in high-availability configurations that would otherwise corrupt the file system.
In Lustre 2.4 and later, active/active failover can be configured for multiple MDTs. This allows the metadata performance of the Lustre file system to be scaled by adding MDT storage devices and MDS nodes.
Security: by default, TCP connections are only accepted from privileged ports. UNIX group membership is verified on the MDS.
Access control lists (ACLs) and extended attributes: the Lustre security model follows UNIX file system principles, enhanced with POSIX ACLs. Additional features such as root squash are also supported.
Interoperability: the Lustre file system runs on a variety of CPU architectures and on mixed-endian clusters, and it maintains interoperability between successive major releases of the Lustre software.
Object-based architecture: clients are isolated from the on-disk file structure, so the storage architecture can be upgraded without affecting the clients.
Byte-granularity file locking and fine-granularity metadata locking: many clients can read and modify the same file or directory concurrently. The Lustre distributed lock manager (LDLM) ensures that files are consistent among all clients and servers in the file system. The MDT lock manager handles inode permissions and pathnames, while each OST has its own lock manager for the file stripes stored on it, so locking performance scales as the file system grows.
Quotas: user, group, and project quotas are available for Lustre file systems.
Capacity growth: the size of a Lustre file system and its aggregate cluster bandwidth can be increased without interrupting service by adding new OSTs and MDTs to the cluster.
Controlled file layout: the layout of files across OSTs can be configured on a per-file, per-directory, or per-file-system basis. This allows file I/O within a single file system to be tuned to specific application requirements. The Lustre file system uses RAID-0 striping and balances space usage across OSTs.
Network data integrity protection: a checksum of all data sent from the client to the OSS guards against corruption during transmission.
MPI I/O: the Lustre architecture has a dedicated MPI ADIO layer that optimizes parallel I/O to match the underlying file system architecture.
NFS and CIFS export: Lustre files can be re-exported using NFS (via Linux knfsd) or CIFS (via Samba) so that they can be shared with non-Linux clients such as Microsoft Windows and Apple Mac OS X.
Disaster recovery tools: the Lustre file system provides an online distributed file system check (LFSCK) that can restore consistency between storage components after a major file system error. A Lustre file system can keep operating even when file system inconsistencies are present, and LFSCK can run while the file system is in use, so LFSCK does not need to complete before the file system is returned to production.
Performance monitoring: the Lustre file system provides a variety of mechanisms to check performance and make adjustments.
Open source: the Lustre software is licensed under GPL 2.0 and runs on the Linux operating system.
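As a concrete illustration of the POSIX compliance and mmap() support listed above, the following minimal C sketch memory-maps a file on a Lustre client mount and reads its first byte. The mount point and file name are hypothetical, and nothing here is Lustre-specific: the same standard POSIX calls work on a local ext4 file system.

/* Minimal POSIX mmap() read sketch. The path is hypothetical; any file on a
 * mounted Lustre client (or a local ext4 file system) behaves the same way. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/mnt/lustre/example.dat";   /* hypothetical mount point */
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return EXIT_FAILURE; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); close(fd); return EXIT_FAILURE; }
    if (st.st_size == 0) { fprintf(stderr, "empty file\n"); close(fd); return EXIT_FAILURE; }

    /* Map the whole file read-only; Lustre clients support mmap() file I/O. */
    char *data = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); close(fd); return EXIT_FAILURE; }

    printf("first byte: 0x%02x, size: %lld bytes\n",
           (unsigned char)data[0], (long long)st.st_size);

    munmap(data, st.st_size);
    close(fd);
    return EXIT_SUCCESS;
}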
Introduction to Lustre components
A typical Lustre installation includes a management server (MGS) and one or more Lustre file systems, interconnected by the Lustre network (LNet). The basic configuration of the Lustre file system components is shown in the figure below:
Management Server (MGS)
The MGS stores configuration information for all the Lustre file systems in a cluster and provides this information to the other Lustre components. Each Lustre target contacts the MGS to provide its information, and Lustre clients contact the MGS to retrieve it. It is preferable for the MGS to have its own storage so that it can be managed independently, but the MGS can also be co-located with an MDS and share its storage, as shown in the figure above.
Lustre file system components
Metadata Server (MDS): the MDS makes the metadata stored on one or more MDTs available to Lustre clients. Each MDS manages the names and directories in the Lustre file system and provides network request handling for one or more local MDTs.
Metadata Target (MDT): in Lustre 2.3 and earlier, each file system has exactly one MDT. The MDT stores metadata (such as file names, directories, permissions, and file layouts) on storage attached to an MDS. Although an MDT on shared storage can be available to multiple MDSs, only one MDS can access it at a time. If the active MDS fails, a standby MDS can serve the MDT and make it available to clients. This is called MDS failover.
Since Lustre 2.4, multiple MDTs are supported in the Distributed Namespace Environment (DNE). In addition to the primary MDT that holds the root directory of the file system, additional MDS nodes can be added, each with its own MDT, to hold subdirectory trees of the file system.
Since Lustre 2.8, DNE also allows the file system to distribute the files of a single directory across multiple MDT nodes. A directory distributed across multiple MDTs is called a striped directory.
Object Storage Server (OSS): the OSS provides file I/O services and network request handling for one or more local OSTs. Typically, an OSS serves between two and eight OSTs, each up to 16 TB in size. A typical configuration is an MDT on a dedicated node, two or more OSTs on each OSS node, and a client on each of a large number of compute nodes.
Object Storage Target (OST): user file data is stored in one or more objects, each on a separate OST in the Lustre file system. The number of objects per file is configurable by the user and can be tuned to optimize performance for a given workload.
Lustre clients: Lustre clients are compute, visualization, or desktop nodes that run the Lustre client software and mount the Lustre file system.
The Lustre client software provides an interface between the Linux virtual file system and the Lustre servers. The client software includes a management client (MGC), a metadata client (MDC), and multiple object storage clients (OSCs), with one OSC for each OST in the file system.
A logical object volume (LOV) aggregates the OSCs to provide transparent access across all the OSTs. As a result, a client that mounts the Lustre file system sees a single, coherent, synchronized namespace. Several clients can write to different parts of the same file simultaneously, while other clients read from the file at the same time.
Analogous to the LOV for file access, a logical metadata volume (LMV) aggregates the MDCs to provide transparent access across all the MDTs. This allows the client to see the directory tree on multiple MDTs as a single coherent namespace, and striped directories are merged on the client to form a single directory visible to users and applications.
Lustre Network (LNet)
Lustre Networking (LNet) is a custom networking API that provides the communication infrastructure handling metadata and file I/O data for the Lustre file system servers and clients.
Lustre file system cluster
At scale, a Lustre file system cluster can include hundreds of OSSs and thousands of clients (as shown in the figure below). More than one type of network can be used in a Lustre cluster, and shared storage between OSSs enables failover.
Lustre file system storage and I/O
Lustre file identifiers (FIDs) were introduced in Lustre 2.0 to replace the UNIX inode numbers used to identify files or objects. A FID is a 128-bit identifier containing a unique 64-bit sequence number, a 32-bit object identifier (OID), and a 32-bit version number. The sequence number is unique across all Lustre targets in the file system (OSTs and MDTs). This change made it possible to support multiple MDTs and ZFS (both introduced in Lustre 2.4).
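The 64/32/32-bit split described above can be pictured as a simple C structure. The field names below simply mirror the description in this section (sequence, object ID, version) and are illustrative; the authoritative definition lives in the Lustre source tree.

#include <stdint.h>

/* Illustrative layout of a 128-bit Lustre FID as described above:
 * a 64-bit sequence number that is unique across all targets (OSTs and MDTs),
 * a 32-bit object identifier (OID) within that sequence,
 * and a 32-bit version number. */
struct lustre_fid_example {
    uint64_t seq;   /* unique sequence number across all targets */
    uint32_t oid;   /* object identifier within the sequence     */
    uint32_t ver;   /* version number                            */
};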
At the same time, this release introduced an ldiskfs feature called FID-in-dirent (also known as dirdata), in which the FID is stored in the parent directory as part of the file name. This feature significantly improves the performance of the ls command by reducing disk I/O. The FID-in-dirent is generated when a file is created.
In Lustre 2.4, the LFSCK file system consistency checking tool added the ability to enable FID-in-dirent on existing files. Specifically, it can:
Generate IGIF-mode FIDs for existing files from a version 1.8 file system.
Verify the FID-in-dirent of each file and regenerate the FID-in-dirent if it is invalid or missing.
Verify each linkEA entry and regenerate it if it is invalid or missing. The linkEA consists of the file name and the parent FID and is stored in the file itself as an extended attribute; it can therefore be used to reconstruct the full path name of the file.
Information about where file data is located on the OSTs is stored as an extended attribute called the layout EA in the MDT object identified by the file's FID (as shown in the figure below). If the file is a regular file (that is, not a directory or symbolic link), the MDT object points to 1 to N OST objects that contain the file data. If the MDT object points to a single object, all of the file data is stored in that object. If it points to more than one object, the file data is striped across the objects in RAID-0 fashion, with each object stored on a different OST.
When a client reads or writes a file, it first obtains the layout EA from the file's MDT object and then uses this information to perform I/O on the file, interacting directly with the OSS nodes where the objects are stored. The process is shown in the figure below.
The available bandwidth of a Lustre file system is determined as follows (restated as formulas after this list):
The network bandwidth equals the aggregate bandwidth from the OSSs to the targets.
The disk bandwidth equals the aggregate disk bandwidth of the storage targets (OSTs), limited by the network bandwidth.
The total bandwidth is equal to the minimum of disk bandwidth and network bandwidth.
The available file system space is equal to the sum of the free space of all OST.
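Restating the list above as formulas, with B for bandwidth and S for space:

\[
B_{\mathrm{network}} = \sum_{\mathrm{OSS}} B_{\mathrm{OSS}},\qquad
B_{\mathrm{disk}} = \sum_{\mathrm{OST}} B_{\mathrm{OST}},\qquad
B_{\mathrm{total}} = \min\bigl(B_{\mathrm{disk}},\ B_{\mathrm{network}}\bigr),\qquad
S_{\mathrm{available}} = \sum_{\mathrm{OST}} S^{\mathrm{free}}_{\mathrm{OST}}
\]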
Striping in the Lustre file system
One of the main reasons for the Lustre file system's high performance is its ability to stripe data across multiple OSTs in a round-robin fashion. Users can configure the number of stripes, the stripe size, and the target OSTs for each file as needed. Striping improves performance when the aggregate bandwidth to a single file exceeds the bandwidth of a single OST, and it also helps when a single OST does not have enough free space to hold an entire file.
As shown in the figure below, striping allows segments or "chunks" of a file to be stored on different OSTs. In the Lustre file system, data is striped across a certain number of objects in RAID-0 fashion; the number of objects in a file is called the stripe_count. Each object contains chunks of the file's data, and when the data written to a particular object exceeds the stripe_size, the next chunk of the file is stored on the next object. Default values for stripe_count and stripe_size are set per file system; the defaults are a stripe_count of 1 and a stripe_size of 1 MB. Users can change these values on a per-directory or per-file basis (a minimal sketch of the mapping arithmetic and a programmatic layout example follow below).
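The round-robin mapping just described can be sketched as a small function: given a byte offset in a file, the stripe_size, and the stripe_count, it returns which object holds that byte and the offset within that object. This is a minimal sketch of the arithmetic only, not Lustre's actual layout code.

#include <stdint.h>

/* Map a file offset to (object index, offset within object) under RAID-0
 * style striping, as described above. Minimal sketch; Lustre's real layout
 * code handles many additional cases. */
static void map_offset(uint64_t file_offset,
                       uint64_t stripe_size,     /* file system default: 1 MB */
                       uint32_t stripe_count,    /* file system default: 1    */
                       uint32_t *object_index,
                       uint64_t *object_offset)
{
    uint64_t stripe_number = file_offset / stripe_size;    /* which chunk overall        */
    uint64_t stripe_in_obj = stripe_number / stripe_count; /* which chunk in that object */

    *object_index  = (uint32_t)(stripe_number % stripe_count);
    *object_offset = stripe_in_obj * stripe_size + (file_offset % stripe_size);
}

With the defaults quoted above (stripe_count = 1, stripe_size = 1 MB), every offset maps to object 0, so the whole file lives in a single OST object.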
In the figure below, the stripe_size of file C is larger than the stripe_size of file A, allowing more data to be stored in a single stripe of file C. The stripe_count of file A is 3, so its data is striped across three objects, while the stripe_count of file B and of file C is 1. No space is reserved on the OSTs for unwritten data.
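Per-file layouts like those of files A, B, and C are normally set with the lfs setstripe command or programmatically through liblustreapi. The sketch below assumes the llapi_file_create() call from <lustre/lustreapi.h> (stripe size, starting OST index, stripe count, stripe pattern); treat the exact signature as an assumption and check the headers shipped with your Lustre release. It creates a file striped across three OSTs with a 1 MB stripe, matching file A.

/* Hedged sketch: create a file striped over 3 OSTs with a 1 MB stripe using
 * liblustreapi (link with -llustreapi). Verify llapi_file_create() against
 * your installed <lustre/lustreapi.h>; the mount point is hypothetical. */
#include <stdio.h>
#include <lustre/lustreapi.h>

int main(void)
{
    const char *path = "/mnt/lustre/striped_file";
    int rc = llapi_file_create(path,
                               1 << 20,   /* stripe_size: 1 MB                */
                               -1,        /* stripe_offset: let Lustre choose */
                               3,         /* stripe_count: 3 OSTs             */
                               0);        /* stripe_pattern: default (RAID-0) */
    if (rc != 0)
        fprintf(stderr, "llapi_file_create failed: %d\n", rc);
    return rc ? 1 : 0;
}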
The maximum file size is not limited by the size of a single target. In a Lustre file system, a file can be striped across multiple objects (up to 2000), and each object can be up to 16 TB with ldiskfs or up to 256 PB with ZFS. This gives a maximum file size of 31.25 PB for ldiskfs-backed files and 8 EB for ZFS-backed files. The file size in a Lustre file system is limited only by the free space on the OSTs, and Lustre can support files of up to 2^63 bytes (8 EB).
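Spelled out with binary prefixes, the ldiskfs figure above follows directly from the per-object limit, while for ZFS the 2^63-byte limit is the binding constraint (2000 objects at 256 PB each would far exceed it):

\[
2000\ \text{objects} \times 16\ \text{TB/object} = 32{,}000\ \text{TB} = \frac{32{,}000}{1024}\ \text{PB} = 31.25\ \text{PB (ldiskfs)},
\qquad
2^{63}\ \text{bytes} = 8\ \text{EB (ZFS)}.
\]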
Note: prior to Lustre 2.2, the maximum number of stripes per file was 160. Although a single file can be striped across at most 2000 objects, a Lustre file system can have thousands of OSTs.
The above is the editor's deep analysis of the Lustre architecture. If you happen to have similar questions, you can refer to the analysis above. If you want to learn more, you are welcome to follow the industry information channel.