
Ceph architecture and components


Architecture of Ceph Storage

Ceph storage consists of several different daemons that are independent of each other, and each component provides specific functionality.

Ceph core components

RADOS (Reliable Autonomic Distributed Object Store: reliable, autonomic, distributed object storage) is the foundation of the ceph storage components. Ceph stores everything as objects, and RADOS stores those objects regardless of their data type. The RADOS layer ensures data consistency and reliability, including data replication, fault detection and recovery, and data migration and rebalancing between nodes. RADOS consists of a large number of storage devices and node clusters. RADOS is developed in C++.

LIBRADOS (the basic library, also known as the RADOS library) abstracts and encapsulates RADOS and provides an API upward for applications developed directly on top of RADOS. RADOS is an object storage system, and LIBRADOS exposes the API for that object store, with support for C, C++, Java, Python, Ruby and PHP. LIBRADOS is also the foundation of services such as RBD and RGW.

RADOSGW (the Ceph object gateway, also known as the RADOS Gateway, RGW) provides a gateway with S3- and Swift-compatible RESTful APIs, supporting multi-tenancy and OpenStack authentication. Compared to LIBRADOS, it provides a higher level of API abstraction and makes it easier to work with S3- and Swift-style scenarios.

RBD (RADOS Block Device) provides a standard block device interface: block storage that can be mapped, formatted, and mounted on a server like any other disk. It is often used to create volumes and cloud disks in virtualization scenarios. Red Hat has integrated the RBD driver into KVM/QEMU.
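As a rough illustration of that workflow (not part of the original article), the following sketch creates, maps, formats and mounts an RBD image; the pool name rbd-pool, image name disk01 and device path /dev/rbd0 are placeholders:

rbd create rbd-pool/disk01 --size 10240   # create a 10 GB image named disk01 in pool rbd-pool
rbd map rbd-pool/disk01                   # map the image to a local block device, e.g. /dev/rbd0
mkfs.xfs /dev/rbd0                        # format it like any other disk
mount /dev/rbd0 /mnt/disk01               # and mount it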

CephFS (Ceph File System) is a POSIX-compatible distributed file system. It relies on the MDS daemon for its metadata. CephFS has native Linux kernel driver support, so you can mount it with the standard mount command (a userspace client built on libcephfs is also available), and it can be re-exported over CIFS and SMB.
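A minimal mount sketch, assuming a monitor at 192.168.1.10 and a mount point /mnt/cephfs (both placeholders, not from the original article):

mount -t ceph 192.168.1.10:6789:/ /mnt/cephfs -o name=admin,secret=<admin-key>   # kernel client
ceph-fuse /mnt/cephfs                                                            # or the FUSE client built on libcephfs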

Ceph RADOS

RADOS is the core of the ceph storage system, also known as the ceph storage cluster. All the excellent features of ceph are provided by RADOS, including distributed object storage, high availability, self-healing, and self-management.

RADOS component architecture diagram:

RADOS contains two core components: Ceph Monitor and Ceph OSD.

Ceph Monitor

Ceph Monitor is responsible for monitoring the running status of the entire cluster. To achieve high availability, it is usually deployed as a small cluster (typically 3 nodes, with one leader and two followers; when the leader fails, the remaining monitors elect a new leader, ensuring the high availability of the system).
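A few standard commands for inspecting the monitors and their quorum, added here as a brief illustration of the behaviour described above:

ceph mon stat         # one-line summary of the monitors and the current quorum
ceph quorum_status    # detailed quorum information, including the current leader
ceph mon dump         # dump the Mon Map maintained by the monitors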

The information in the Monitor is provided by the daemons running on the cluster members and includes the status of the nodes and the configuration of the cluster. The Monitor is mainly used to manage the Cluster Map, which is the key data structure of the whole RADOS system, similar to metadata. The Cluster Map held by the Mon includes the Mon Map, OSD Map, PG Map, MDS Map, CRUSH Map, and so on:

Mon Map: records the information of the mon cluster and the cluster ID. Command: ceph mon dump

OSD Map: records the information of the osd cluster and the cluster ID. Command: ceph osd dump

PG Map: a pg (placement group) is the carrier in which objects are stored. The number of pgs is specified when a ceph storage pool is created and is related to the configured number of replicas; for example, with 3 replicas there are three identical copies of each pg on three different osds. On an osd, a pg exists as a directory, and the pg is the unit of storage. The usage of the pgs also reflects the current storage state of the cluster: a PG can be in a variety of states, and the different states reflect the current health of the cluster. Command: ceph pg dump

CRUSH Map: contains information about the cluster's storage devices and the failure hierarchy, and the rules that define failure domains for storing data. Command: ceph osd crush dump

MDS Map: records the state of the MDS. The MDS is the metadata service for CephFS and is only needed when CephFS is used; in that case at least one MDS service is required in the cluster.

The Ceph Monitor cluster maintains these Maps but does not provide storage services for client data. Clients and other cluster nodes regularly check and update the Maps maintained by the Monitor. When clients read or write data, they request the Map data from the Monitor and then interact with the OSDs directly.

Ceph Monitor is a lightweight daemon that usually does not consume many resources; it normally only needs a few GB of disk space to store its logs.

Ceph OSD

OSD is an important component of Ceph storage. OSDs store data as objects on the disks of each node in the cluster, and most of the work of storing data is done by the OSD daemons.

A Ceph cluster usually contains multiple OSDs. For any read or write operation, once the client has obtained the Cluster Map from the Ceph Monitor, it performs the I/O directly with the OSDs, without Monitor intervention. This makes reads and writes faster, because no additional layer of data handling adds overhead.

Data is usually kept in multiple identical copies: each data object has a master copy and several slave replicas, which are distributed on different nodes by default. When the master copy fails (due to a disk or node failure), the Ceph OSD daemon promotes a slave copy to master and, at the same time, creates a new replica, which ensures the reliability of the data.

A Ceph OSD must run on a valid Linux partition. The file system can be BTRFS, XFS or EXT4; XFS is recommended.
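A few commands commonly used to inspect OSDs and their placement, added here as a brief illustration:

ceph osd stat    # number of OSDs and how many are up/in
ceph osd tree    # OSDs arranged in the CRUSH hierarchy (host, rack, ...)
ceph osd df      # per-OSD space utilization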

Journal buffer

When Ceph writes data, it writes the data to a separate storage area before writing it to the backing storage. This buffer area is called the journal; it can live on the same disk as the OSD, on a separate disk, or on a dedicated SSD. By default, the journal flushes data to the backing storage every 5 seconds. A common journal size is 10 GB, and the larger the partition, the better.

When an SSD is used for the journal, Ceph's write performance improves significantly. Each SSD should hold the journals of at most 4 to 5 OSDs; once this limit is exceeded, the SSD itself becomes a performance bottleneck.
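On releases that still use the FileStore backend, the journal can be placed on a separate SSD partition when the OSD is created. A minimal sketch, assuming /dev/sdb is the data disk and /dev/sdc1 is an SSD partition reserved for the journal (device names are placeholders):

ceph-volume lvm create --filestore --data /dev/sdb --journal /dev/sdc1

Newer BlueStore-based OSDs use a different layout (--block.db / --block.wal) rather than a FileStore journal.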

Ceph CRUSH algorithm

The CRUSH algorithm is at the core of how Ceph reads and writes data. The following figure shows the process of reading, writing and rebalancing data using the CRUSH algorithm:

A Ceph client stores data by calling an API (librados, RGW, RBD, or libcephfs). It first obtains a copy of the cluster Map from a Monitor, from which it gets the status and configuration information of the Ceph cluster. The client then cuts its data into fixed-size pieces and numbers them, forming multiple objects. Through a series of calculations, each object is assigned to a PG in the target storage pool (a pool contains multiple PGs), and the CRUSH rules then determine the primary OSD for that PG. Once the OSD location is determined, the client writes the data directly to the primary OSD. After the data is written to the primary OSD, replication takes place and the PG data is synchronized to the replica OSDs.
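The placement that CRUSH computes can be observed from the command line, as demonstrated with real output later in this article; a generic sketch (pool and object names are placeholders):

ceph osd map {pool-name} {object-name}    # show which PG and which OSDs the object maps to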

A Pool is a logical partition that contains multiple PGs, and each Pool is distributed across multiple host nodes.

Operation of the storage pool

The storage pool (pool) is the logical partition that manages PGs. A pool ensures the high availability of data by keeping the required number of replicas; in recent versions the default number of replicas is 3.

In addition, we can use SSD drives to create faster pools, use pools for snapshot capabilities, and assign permissions to the users who access a pool. The common pool operations are listed below.

Create a pool:

ceph osd pool create {pool-name} {pg-number} {pgp-number} [replicated] [crush-ruleset-name] [expected-num-objects]
ceph osd pool create test-pool 9    # create a pool called test-pool with 9 pgs

View current storage pool information:

ceph osd lspools
rados lspools
ceph osd dump | grep -i pool    # view the number of copies and other details

Set the number of copies (the default used to be 2; since the Firefly release the default is 3):

ceph osd pool set test-pool size 3    # set the number of copies to 3
ceph osd dump | grep -i pool          # verify the number of copies

Rename a pool:

ceph osd pool rename test-pool pool-1    # current name, then target name
ceph osd lspools                         # verify the new name

Take data snapshots of a pool:

rados -p pool-1 put obj-1 /etc/hosts    # add object obj-1 to the pool, backed by the file /etc/hosts
rados -p pool-1 ls                      # view the objects in the pool
rados mksnap snapshot01 -p pool-1       # create a snapshot of the pool-1 pool named snapshot01
rados lssnap -p pool-1                  # view snapshot information
rados -p pool-1 rm obj-test             # delete the obj-test object
rados -p pool-1 listsnaps obj-test      # view the snapshots of the obj-test object in pool-1
obj-test:
cloneid    snaps    size    overlap
1          1        5       []
rados rollback -p pool-1 obj-test snapshot01    # restore a snapshot: specify the storage pool, object name and snapshot name

Get pool parameters, such as size or other parameters:

ceph osd pool get pool-1 size
ceph osd pool get pool-1 {value}

Set a configuration parameter of the storage pool:

ceph osd pool set pool-1 size 3

Delete the pool (its snapshots are deleted with it); the deletion requires the pool name to be confirmed twice:

ceph osd pool delete pool-1 pool-1 --yes-i-really-really-mean-it

Ceph data management

Here is an example of how ceph manages data.

Copy storage of PG

Create a storage pool pool-1:

ceph osd pool create pool-1 8 8    # create storage pool pool-1, specifying 8 pgs and 8 pgps

Check the ID of pool:

[root@local-node-1 ~]# ceph osd lspools
1 .rgw.root
2 default.rgw.control
3 default.rgw.meta
4 default.rgw.log
8 pool-1

View the OSDs to which the PGs are assigned:

[root@local-node-1 ~]# ceph pg dump | grep ^8 | awk '{print $1 "\t" $17}'
dumped all
(each PG of pool 8, from 8.0 to 8.7, is listed together with the set of OSDs it maps to, for example 8.3 [2,0,1])

This shows that PGs 8.0 to 8.7 are each stored as 3 copies, placed on different OSDs.
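A single PG's placement can also be queried directly; a brief sketch using PG 8.0 from the listing above:

ceph pg map 8.0    # print the up and acting OSD sets for PG 8.0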

File storage

We add the hosts file as an object to the pool created above:

[root@local-node-1 ~]# rados -p pool-1 put obj1 /etc/hosts
[root@local-node-1 ~]# rados -p pool-1 ls
obj1

View the mapping between the object obj1 and the OSDs:

[root@local-node-1 ~]# ceph osd map pool-1 obj1
osdmap e72 pool 'pool-1' (8) object 'obj1' -> pg 8.6cf8deff (8.7) -> up ([0,2,1], p0) acting ([0,2,1], p0)

The meaning of the above output:

osdmap e72: the version number of the OSD map is 72
pool 'pool-1' (8): the name of the pool is pool-1, and its ID is 8
object 'obj1': the object name
pg 8.6cf8deff (8.7): the pg number, indicating that obj1 belongs to PG 8.7
up ([0,2,1], p0): the up set of the PG, containing osd 0, 2 and 1; p0 indicates that osd.0 is the primary
acting ([0,2,1], p0): the acting set, where osd.0 holds the master copy, osd.2 the second copy and osd.1 the third copy

By default, copies of the same PG are distributed on different physical nodes.
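This placement behaviour comes from the CRUSH rule attached to the pool, whose failure domain is the host by default. A hedged way to inspect it (replicated_rule is the default rule name on recent releases):

ceph osd crush rule ls                      # list the CRUSH rules
ceph osd crush rule dump replicated_rule    # show the rule's steps, e.g. chooseleaf ... type host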

Recovery and dynamic balance

By default, when a node in the cluster fails, Ceph marks the OSDs of the failed node as down. If they do not recover within 300 s, the cluster marks them out and begins to recover the data they held, rebalancing it onto the remaining OSDs.
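The interval before recovery starts, and whether OSDs are marked out at all, can be inspected and tuned; a hedged sketch (option names and defaults vary between releases, and <mon-id> is a placeholder):

ceph daemon mon.<mon-id> config get mon_osd_down_out_interval    # how long a down OSD may stay before it is marked out
ceph osd set noout      # temporarily prevent OSDs from being marked out, e.g. during planned maintenance
ceph osd unset noout    # re-enable automatic marking afterwards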
