
Why do you need to pay attention to Ceph


This article explains in detail why Ceph deserves attention. It is quite practical, so it is shared here for reference; I hope you gain something from reading it.

Why we need to focus on Ceph

Among the many storage projects in today's open-source world, each has its own focus, but all ultimately serve enterprise IT infrastructure. So what exactly do enterprise IT infrastructure managers need from storage, and are those needs being met? Based on research into enterprise storage products, the author has tried to summarize them in the figure below.

Figure I

From the figure above, we can see that storage interfaces, scalability, operations, and cost form the four focal points of enterprise storage products. Almost every storage product, whether hardware (SAN) or software, concentrates on some subset of these, whether for reasons of cost or of scalability. So let's see how Ceph positions itself.

Figure II

Ceph meets diverse enterprise needs through its three storage interfaces, covers this requirements matrix through the scalability it has emphasized from the beginning, and targets petabyte-scale storage through its distributed, fault-tolerant architecture.

Figure IV

The diagram above also shows the transformation Ceph can bring to enterprise IT architecture. Having seen the features Ceph offers, the next question, especially for anyone unfamiliar with Ceph, is how these features are actually realized and implemented.

Ceph Architecture

The following is a classic Ceph module architecture diagram.

Figure V

The bottom layer is RADOS, the foundation of Ceph's distributed storage; all of the storage interfaces are implemented on top of it. RADOS itself is an object store: it maintains the cluster state and handles data distribution. RADOS is usually referred to as the Ceph Cluster, because the storage interfaces above it, such as CephFS, are all built on its interface.

Why use an object storage model at the bottom layer?

Compared with fixed-size block storage, objects live under their own names in a flat namespace, can be of variable size, and offer rich semantics through a simple API.

Compared with file storage, there is no tree hierarchy that is hard to distribute, write semantics never span multiple objects, and access can be parallelized more easily.

What are the components of RADOS?

OSD: each disk, SSD, RAID group, or other physical storage device becomes an OSD. It is mainly responsible for storing and finding objects, and for distributing them to, and recovering them from, the object's replica nodes.

Monitor: maintains cluster membership and state, and provides strongly consistent decision making (similar to ZooKeeper); see the small sketch after this list.
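
As a small illustration of the monitors' role, the sketch below uses the Python librados bindings to ask the monitors for the cluster status. The config-file path is an assumption, and the exact layout of the returned JSON varies between Ceph releases.

```python
import json
import rados  # python-rados bindings shipped with Ceph

# Connect using a local ceph.conf and default credentials (assumed paths).
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

# The monitors hold the authoritative cluster map; "status" is the same
# information that the `ceph status` CLI prints.
ret, outbuf, errs = cluster.mon_command(
    json.dumps({"prefix": "status", "format": "json"}), b'')
status = json.loads(outbuf)
print(status.get("fsid"))                      # cluster identity
print(status.get("health", {}).get("status"))  # e.g. HEALTH_OK (layout varies by release)

cluster.shutdown()
```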

RADOS Distribution Policy: the CRUSH Algorithm

Figure VI

Using the flow diagram above, the author will try to explain how RADOS distributes objects across different OSDs. First, a few storage concepts defined by Ceph need to be understood. A Pool is a namespace: when a client stores objects in RADOS, it must specify a Pool. Pools are defined in the configuration, where the range of OSD nodes and the number of PGs for each Pool can be specified. A PG (Placement Group) is a concept inside a Pool, an intermediate logical layer between objects and OSDs. An object is first mapped to a PG by a simple hash algorithm, so the PG for a given object is fixed. Each PG then has one primary OSD and several secondary OSDs, and the objects in the PG are stored across these OSDs. The strategy that distributes PGs across OSDs is CRUSH, the core mechanism Ceph uses to spread data evenly.

It should be noted that the entire CRUSH calculation runs on the client, so the client itself needs to hold a copy of the Cluster Map, which it obtains from the Monitors. From this we can also see that the Monitor's main responsibility is to maintain this Cluster Map and guarantee its strong consistency.

CRUSH uses a pseudo-random algorithm to ensure that data is distributed evenly. Its inputs are the PG, the cluster state, and the placement policy, and for the same object name it produces a consistent result for reads (note: reads) even if the latter two inputs change. The CRUSH algorithm is also configurable: different distribution strategies can be obtained by adjusting the PG number and OSD weights. This configurable distribution strategy greatly enhances Ceph's flexibility.
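
To make the two-step mapping concrete, here is a toy Python sketch. It is not the real CRUSH implementation (which uses its own hashing and bucket types); the pool size, OSD names, and hash choices are invented purely to illustrate the deterministic, client-side object-to-PG-to-OSD calculation described above.

```python
import hashlib

PG_NUM = 128                              # pg_num of a hypothetical pool
OSDS = ["osd.%d" % i for i in range(12)]  # a hypothetical set of OSDs
REPLICAS = 3

def object_to_pg(object_name):
    """Step 1: a plain hash of the object name picks a stable PG."""
    h = int(hashlib.md5(object_name.encode()).hexdigest(), 16)
    return h % PG_NUM

def pg_to_osds(pg_id):
    """Step 2: a pseudo-random but deterministic choice of one primary
    OSD and (REPLICAS - 1) secondaries for that PG, computed client-side."""
    h = int(hashlib.md5(("pg-%d" % pg_id).encode()).hexdigest(), 16)
    start = h % len(OSDS)
    return [OSDS[(start + i) % len(OSDS)] for i in range(REPLICAS)]

pg = object_to_pg("my-object")
print(pg, pg_to_osds(pg))   # the same inputs always give the same placement
```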

Figure VII

Ceph Usage Scenarios

Librados

Figure VIII

Looking back at the architecture diagram we saw earlier, we have now covered RADOS and what it does; next we turn to Librados, which provides direct access to RADOS. Librados offers bindings for C, C++, Java, Python, Ruby, and PHP.

Figure IX

The difference between Librados and the RADOSGW described later is that Librados accesses RADOS directly, without the overhead of the HTTP protocol. It supports single atomic operations such as updating data and attributes together, CAS operations, and object-granularity snapshots. It is implemented directly on the RADOS API, so it is essentially a wrapper library over RADOS.
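
As a minimal example of such direct access, the sketch below uses the Python librados bindings to write and read one object; the pool name "rbd" and the config-file path are assumptions for illustration.

```python
import rados

# A minimal librados round trip; pool name and config path are assumptions.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

ioctx = cluster.open_ioctx('rbd')            # I/O context bound to one pool
ioctx.write_full('greeting', b'hello ceph')  # store an object in one shot
ioctx.set_xattr('greeting', 'lang', b'en')   # attributes live alongside the data
print(ioctx.read('greeting'))                # -> b'hello ceph'

ioctx.close()
cluster.shutdown()
```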

RadosGW

Figure X

RadosGW sits on top of Librados and provides a RESTful interface compatible with the S3 and Swift APIs. It also offers bucket namespaces (similar to folders) and user accounts, and keeps usage records for billing purposes. The trade-off is the added overhead of the HTTP protocol.

RadosGW turns a Ceph Cluster into a distributed object store of the same kind as Amazon S3 or OpenStack Swift. Enterprises can also use it directly as a medium for storing and distributing data.
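
Because the interface is S3-compatible, ordinary S3 clients work against RadosGW. The sketch below uses boto3; the endpoint URL and credentials are placeholders for values created with radosgw-admin.

```python
import boto3

# Point a standard S3 client at the RGW endpoint; URL and keys are placeholders.
s3 = boto3.client(
    's3',
    endpoint_url='http://rgw.example.com:7480',
    aws_access_key_id='ACCESS_KEY',
    aws_secret_access_key='SECRET_KEY',
)

s3.create_bucket(Bucket='demo-bucket')
s3.put_object(Bucket='demo-bucket', Key='hello.txt', Body=b'hello from rgw')
print(s3.get_object(Bucket='demo-bucket', Key='hello.txt')['Body'].read())
```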

RBD

Block storage is another major pillar of Ceph, which currently provides block devices to virtual machines and to hosts through two different paths.

Figure XI

In the first path, the Ceph Cluster provides block devices to virtual machines. LibRBD is a block-device interface implemented on top of Librados that maps a block device onto many objects. With LibRBD you can create a block device (a "container") and attach it to a VM through QEMU/KVM. Because the container is decoupled from the VM, the same block device can be bound to different VMs.

Figure XII

In the second path, the Ceph Cluster provides block devices to hosts through the RBD kernel module (rbd.ko). The difference here is that the role of Librados is played by a kernel module named libceph, because rbd.ko needs a Librados counterpart that also lives in kernel space. From this we can see that Ceph actually maintains a large number of libraries of uneven quality, so using them well requires people who understand Ceph. It is precisely this diversity of libraries that gives Ceph its diverse storage interfaces, rather than the different interfaces being bolted on reluctantly; different storage interfaces take completely different paths.

Both of the above approaches store the virtual block device as fragments (objects) in RADOS (the Ceph Cluster), both use data striping to improve parallel transfer, and both support block-device snapshots and COW (copy-on-write) clones. RBD also supports live migration. Both OpenStack and CloudStack use the first approach to provide block devices to virtual machines.
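
As a hedged illustration of the snapshot-and-clone workflow, the sketch below uses the Python RBD bindings; the pool and image names are invented, and cloning assumes a format-2 image with the layering feature, which is the default on recent Ceph releases.

```python
import rados
import rbd

# Create an image, snapshot it, and make a copy-on-write clone of the snapshot.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')

rbd_inst = rbd.RBD()
rbd_inst.create(ioctx, 'base-image', 10 * 1024**3)   # 10 GiB, thin-provisioned

with rbd.Image(ioctx, 'base-image') as img:
    img.create_snap('golden')        # point-in-time snapshot
    img.protect_snap('golden')       # snapshots must be protected before cloning

# The clone shares blocks with the snapshot until they are overwritten (COW).
rbd_inst.clone(ioctx, 'base-image', 'golden', ioctx, 'vm-disk-01')

ioctx.close()
cluster.shutdown()
```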

Figure XIII

Figure XIV

The illustrations above also show how storage capacity can be used sparingly and efficiently when there are a large number of VMs. When many VMs build their volumes from the same snapshot, capacity is not consumed up front; blocks are only allocated on write (COW). This is a feature many storage vendors emphasize in VDI solutions, where storage cost is the dominant factor, and thin provisioning plus data parallelism make Ceph considerably more attractive for VDI.

Ceph's block storage is currently its most recommended and most rapidly developing module, because it offers an interface users are already familiar with and is widely accepted and supported by the currently popular OpenStack and CloudStack. The decoupling of compute and storage, live migration, efficient snapshots, and clone/restore are all compelling features of Ceph block storage.

CephFS

CephFS is a petabyte-scale distributed file system built on RADOS. It introduces a new component, the MDS (Metadata Server), which mainly serves the metadata of a POSIX-compatible file system, such as directory and file metadata. The MDS itself stores this metadata in RADOS (the Ceph Cluster), so metadata access is also parallelized, which greatly speeds up file operations. Note that the MDS only serves metadata to clients; it never serves file data.

Figure XV

As the figure above shows, when a client opens a file, it queries and updates the corresponding metadata on the MDS, such as the list of objects the file consists of, and then reads the file data directly from RADOS (the Ceph Cluster) using that object information.

Figure XVI

Since CephFS is a distributed file system, it needs to balance load across MDSs to avoid hotspots when files differ in popularity and size. As the diagram above shows, the five MDSs manage directory subtrees of different "sizes" and adjust dynamically to changing access hotspots and file sizes.

Among the benefits the MDS brings are fast directory listing operations, such as retrieving the sizes, counts, and timestamps of files under a directory. It also supports file snapshots.

CephFS can currently be accessed in several ways:

1. Linux kernel client: the file system can be mounted locally and accessed directly, for example mount -t ceph 8.8.8.8:/ /mnt/cephfs (where /mnt/cephfs is an example mount point).

2. ceph-fuse: CephFS can be mounted from user space via ceph-fuse, for example ceph-fuse -m 192.168.0.1:6789 /home/username/cephfs.

3. libcephfs.so: applications can link against libcephfs, which can even replace HDFS underneath different Hadoop versions and even HBase; a minimal sketch using its Python bindings follows this list.
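
A rough sketch of option 3 using the libcephfs Python bindings is shown below; the directory and file paths are made up, and binding details may differ slightly between Ceph versions.

```python
import cephfs  # Python bindings over libcephfs.so

# User-space access to CephFS without a kernel mount; paths are illustrative.
fs = cephfs.LibCephFS(conffile='/etc/ceph/ceph.conf')
fs.mount()                              # attach to the file system root

fs.mkdirs('/demo', 0o755)
fd = fs.open('/demo/hello.txt', 'w', 0o644)
fs.write(fd, b'hello cephfs', 0)        # file data goes to RADOS;
fs.close(fd)                            # metadata updates go through the MDS

fs.unmount()
fs.shutdown()
```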

Other Aspects of Ceph

QoS mechanism: the Ceph Cluster supports several QoS-related settings, such as prioritizing cluster rebalancing and data recovery so that they do not interfere with normal client traffic.

Geo-replication: object storage replicated across geographic locations.

OpenStack: Ceph is deeply integrated with OpenStack, and all OpenStack storage (except the databases) can use Ceph as a backend. For example, Keystone and Swift can use RadosGW, while Cinder, Glance, and Nova use RBD for block storage; all VM images can even be stored on CephFS.

Ceph still has plenty of room for optimization. At present the RADOS I/O path is overly complex and thread management is not well constrained; optimizing small-I/O performance in RADOS would greatly improve overall cluster performance. Many people also worry that supporting such diverse storage interfaces must compromise the performance of each one. Architecturally, however, RADOS is purely a distributed storage mechanism: every interface sees the same unified storage pool, and the interfaces are separate from one another and do not affect each other.

About "why need to pay attention to Ceph" this article is shared here, I hope the above content can be of some help to everyone, so that you can learn more knowledge, if you think the article is good, please share it for more people to see.
