
An Analysis of Ceph Principles


This article walks through the principles behind Ceph. The content is concise and easy to follow, and the detailed introduction below should leave you with a clear picture of how the system works.

1 Introduction to the overall architecture

1.1 General introduction

Ceph is an open source, software-defined storage system. Born in 2004, it began as a project dedicated to developing a next-generation, high-performance distributed file system. Ceph can be deployed on any x86 server and offers good scalability, compatibility and reliability. It provides file system services (cephfs), block services (rbd) and object storage services (rgw), making it a unified storage system. The architecture supports massive data sets: a cluster can scale to PB capacity, the system itself has no hot spot, data addressing relies on computation rather than lookup, and the cluster maintains its own data state and repairs itself. It is an excellent distributed storage system.

1.2 overall architecture

Figure 1 panoramic view of ceph

The Ceph architecture consists of the Rados cluster, the librados interface layer, and three storage services: rgw, rbd and cephfs.

Rados cluster: Rados is the core of the Ceph system, covering both distributed cluster management and data management; the scalability and high availability of the cluster come from here. Its main components are monitor, osd and mds. Monitor is the key cluster service, ensuring the consistency of cluster metadata and the availability of the cluster. Osd is the data service of Ceph, responsible for persisting business data to disk, monitoring data state, and handling data migration and recovery. Mds is the metadata service of cephfs, maintaining the file system's superblock information, directory structure, file information and so on. In general, if you do not use cephfs, you do not need to deploy mds.

Librados interface layer: a unified, encapsulated interface layer that provides interfaces for connecting to the cluster, creating pools, reading and writing objects, and so on. It serves as a base library for the upper layers, such as librbd, libcephfs and librgw, and third-party applications can call librados directly for secondary development on top of Ceph.

Client: the Ceph clients include three types, rbd, rgw and cephfs, together with the librbd, libcephfs and librgw development libraries. They provide storage services externally: for example, rbd can export SCSI block devices, cephfs can be mounted on Linux hosts as a file system or exported as a network file service over cifs/nfs, and rgw directly provides S3 or swift object storage services.

2 Cluster management

2.1 Monitor

The Monitor service is one of the core services of Ceph. Its main job is to hold the global configuration and system information for the whole cluster; in a sense, the Ceph cluster is embodied in the Monitor cluster. Monitor is deployed as an independent service, and multiple Monitors form a highly available cluster that uses the paxos algorithm to keep cluster data consistent. The data managed by Monitor includes:

1. Monitor Map: the fsid of the cluster, the ip addresses and ports of all monitors, and the epoch

2. OSD Map: the fsid of the cluster, the status and listening address of every osd, pool information, pg numbers, and so on

3. MDS Map: the list and status of all mds services, plus the data pool and metadata pool

4. PG Map: information on every pg, including status, version, etc.

5. CRUSH Map: a tree structure describing storage devices, failure domains, and so on

2.2 heartbeat management

The Monitor learns the working state of each OSD service through heartbeats and updates the corresponding maps when that state changes.

First, OSDs heartbeat each other. Each OSD checks up to osd_heartbeat_min_peers peer OSD services; peer osds are simply the osds adjacent to it, and the default number is 10. These adjacent osds come first from the osd lists of the PGs it shares, obtained from the pg_map. If the resulting osd list exceeds osd_heartbeat_min_peers, the excess is dropped; if it falls short, it is topped up with osds that are currently up.

The check works as follows: every osd_heartbeat_interval seconds (6 by default) an osd pings its peers; if a peer osd does not reply within osd_heartbeat_grace (20s by default), that osd is marked as down.

Fig. 2 Inter-osd peer heartbeat

The Monitor cannot always judge an osd state change across failure domains on its own, so the number of reporters required from different failure domains is configurable; the default is 1.

Fig. 3 check of different failure domains

As shown in figure 3, osd1 and osd2 belong to two different failure domains.

If an osd cannot exchange heartbeats with any of its peer osds, it asks the Monitor for the latest osd map; the timeout, osd_mon_heartbeat_interval, defaults to 30s.

If the Monitor receives no message from an osd within mon_osd_report_timeout (900s by default), it marks that osd as down.

Normally an osd reports events to the Monitor at most every osd_mon_report_interval_min (5s by default), covering events such as startup, osd failure and up_thru changes. The longest an osd may go without reporting is osd_mon_report_interval_max, which defaults to 120s; that is, every 120s the osd must report to the Monitor whether or not its state has changed.

Figure 4 osd reports to monitor
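The reporting cadence just described can be sketched as a small decision function. This is an illustrative sketch only, not Ceph code: should_report, elapsed_since_last_report and pending_event are invented names for the example.

# Toy model of the osd -> monitor reporting cadence described above.
OSD_MON_REPORT_INTERVAL_MIN = 5    # s: minimum gap between reports when events occur
OSD_MON_REPORT_INTERVAL_MAX = 120  # s: an unchanged osd still reports at least this often

def should_report(elapsed_since_last_report: float, pending_event: bool) -> bool:
    """Decide whether the osd should send a report to the monitor now."""
    if pending_event and elapsed_since_last_report >= OSD_MON_REPORT_INTERVAL_MIN:
        return True   # startup, failure, up_thru change, etc., throttled to the min interval
    if elapsed_since_last_report >= OSD_MON_REPORT_INTERVAL_MAX:
        return True   # periodic keep-alive even with no state change
    return False

assert should_report(6, pending_event=True) is True
assert should_report(3, pending_event=True) is False    # throttled by the min interval
assert should_report(130, pending_event=False) is True  # max interval forces a report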

3 Data read and write

3.1 OSD

OSD (object storage daemon) is responsible for reading and writing data in the Ceph cluster and for reporting the status of the OSDs it monitors to the Monitor. It also manages data migration, replication, data balancing and data recovery. It is the core data-management service of the Ceph cluster and the data path down to the disks.

OSD consists of two main components: PG and ObjectStore. PG (placement group) is the basic logical unit of data management in Ceph and takes part in every data-related process, such as reading and writing, migration and recovery. ObjectStore is the module responsible for local reads and writes; filestore and bluestore are the common implementations, and the current version of onestor uses filestore. Filestore operates on a local file system, so business data is ultimately written to files on the local disk, using the xfs file system by default. Bluestore operates directly on the raw disk, submitting data through block IO, and its performance is greatly improved compared with filestore; the latest versions of Ceph default to Bluestore.

3.2 Read and write process

Fig. 5 data reading and writing process

There are several main steps for data to be stored on Ceph:

1. The Client obtains the latest cluster map from the Monitor and uses it to determine the osd information.

2. Hash from object to pg: hash the id of the object being written, take the resulting hash value modulo pg_num of the current pool, and the result is the id of the pg that holds the object.

pg_id = hash(object_name) % pg_num

The pool id is then prepended to that hash value; for example, in 4.f1, 4 is the pool id and f1 is the hash value computed above, and together they form the pg id (a short sketch of this mapping appears after this list).

3. Mapping from pg to OSD: the data has to land on disk eventually, so the pg must be mapped to OSDs. This mapping uses Ceph's own CRUSH algorithm, which selects suitable OSDs by computation; CRUSH is essentially a pseudo-random algorithm. Once the OSD is found, the Client communicates with it directly, establishes a network connection, and sends the data to that OSD service for processing.

4. The OSD hands the received data to ObjectStore, which completes the write to local storage; at the same time the data is stored on both the primary and the replica OSDs.
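The object-to-pg step above can be sketched in a few lines. This is a simplified illustration only: real Ceph uses its own rjenkins-based hash and a stable modulo of pg_num, whereas crc32 stands in here so the example is self-contained.

import zlib

def object_to_pg(object_name: str, pool_id: int, pg_num: int) -> str:
    """Toy version of the object -> pg mapping described above."""
    h = zlib.crc32(object_name.encode())   # hash the object name (stand-in for rjenkins)
    pg_seed = h % pg_num                   # take the modulo of the pool's pg_num
    return f"{pool_id}.{pg_seed:x}"        # pg id = "<pool_id>.<hex seed>", e.g. "4.f1"

print(object_to_pg("rbd_data.1234", pool_id=4, pg_num=256))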

The OSD write path satisfies strong consistency: the write does not return until every copy of the data has been persisted to disk.

Fig. 6 Master-slave OSD writing process

1. Client first sends the write request to the master OSD.

2. After receiving the Client request, the master OSD immediately forwards the write request to the slave OSDs and also writes the data to its own local storage.

3. The master OSD collects the write acks from the slave OSDs, confirms that it has also handled the write correctly itself, and finally returns the ack to the Client. During the write, it must wait until every OSD has returned success before a success message can be returned to the Client.
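A minimal sketch of this strong-consistency write path, with invented names (write_local, send_to_replica) standing in for the local ObjectStore write and the replica RPC; the real OSD pipeline is asynchronous and far more involved.

from concurrent.futures import ThreadPoolExecutor

def primary_write(obj_id: str, data: bytes, replicas, write_local, send_to_replica) -> bool:
    """Toy primary-OSD write: forward to replicas, write locally, ack only when all succeed."""
    with ThreadPoolExecutor() as pool:
        # forward the write to every replica OSD and write to local storage in parallel
        futures = [pool.submit(send_to_replica, osd, obj_id, data) for osd in replicas]
        local_ok = write_local(obj_id, data)
        replica_ok = all(f.result() for f in futures)
    # only when the local write and every replica ack succeed is the client acked
    return local_ok and replica_ok

# usage with trivial stand-ins
store = {}
ok = primary_write("obj1", b"hello", replicas=["osd.1", "osd.2"],
                   write_local=lambda oid, d: store.update({oid: d}) or True,
                   send_to_replica=lambda osd, oid, d: True)
print(ok)  # True once every copy has been persisted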

3.3 POOL and PG

Pool is a logical pool of storage resources; every storage service Ceph provides is backed by a Pool. There are two kinds of Pool: replicated and erasure-coded. A Pool is made up of many PGs, and information about Pools is stored in the osd map.

PG is a collection of object data; objects in the same collection share the same storage strategy, for example their replicas are stored on the same list of OSDs. An object belongs to exactly one PG, a PG contains multiple objects, a PG is stored on multiple OSDs, and an OSD hosts multiple PGs.

Figure 7 shows how object T1 is distributed to pg 1.3e when the pool has two replicas.

Fig. 7 correspondence between pg and osd

3.4 CRUSH algorithm

CRUSH (Controlled Replication Under Scalable Hashing) is the key addressing algorithm in Ceph; it solves the addressing path from object to disk. Because placement is computed rather than looked up, the cluster needs no central directory node, which greatly reduces the amount of metadata. CRUSH also spreads data as evenly as possible across the storage devices.

The main scenarios in which the CRUSH algorithm is used are data io, pool creation, osd status changes (up/down), and osd addition or removal. As described in Section 3.2, CRUSH solves the problem of mapping a PG to a list of OSDs. Expressed as a function:

crush(pg.x) -> (osd.1, osd.2, osd.3, ..., osd.N)

Figure 8 CRUSH Map

The CRUSH map can be thought of as an abstraction of the data center; its purpose is to find suitable locations for storing copies of the data. Its content includes the hierarchical organizational structure (Hierarchical CRUSH Map), the replica selection rules (Placement Rules), and the containers (Buckets).

Hierarchical CRUSH Map: logically, the organizational structure is a tree. It contains devices and buckets; a device is usually an osd service and sits at a leaf node. A bucket is a container for devices and appears as an interior node of the CRUSH Map. Bucket types include osd (device), host, chassis, rack, row, pdu, pod, room, datacenter, region and root, which describe storage locations within the CRUSH Map. Each bucket contains multiple devices.

Bucket weight: the weight is an indicator of storage capacity, expressed as a double-precision number, with 1 representing 1T. The weight of a bucket is the sum of the weights of the child buckets or devices it contains.

Placement Rules determine the rules for selecting object replicas; they define which buckets or devices replicas are chosen from. This lets different pools be defined that place data on different sets of disks.

Table 1 bucket definition

# buckets
host cvknode146 {
    id -2    # do not change unnecessarily
    # weight 2.160
    alg straw2
    hash 0   # rjenkins1
    item osd.1 weight 1.080
    item osd.3 weight 1.080
}
host cvknode145 {
    id -3    # do not change unnecessarily
    # weight 2.160
    alg straw2
    hash 0   # rjenkins1
    item osd.0 weight 1.080
    item osd.4 weight 1.080
}
host cvknode144 {
    id -4    # do not change unnecessarily
    # weight 2.160
    alg straw2
    hash 0   # rjenkins1
    item osd.2 weight 1.080
    item osd.5 weight 1.080
}
rack rack0 {
    id -7    # do not change unnecessarily
    # weight 6.480
    alg straw2
    hash 0   # rjenkins1
    item cvknode145 weight 2.160
    item cvknode144 weight 2.160
    item cvknode146 weight 2.160
}
root partition0 {
    id -5    # do not change unnecessarily
    # weight 6.480
    alg straw2
    hash 0   # rjenkins1
    item rack0 weight 6.480
}

# rules
rule partition0_rule {
    ruleset 1
    type replicated
    min_size 1
    max_size 10
    step take partition0
    step chooseleaf firstn 0 type host
    step emit
}
rule partition0_ec_rule_1 {
    ruleset 2
    type erasure
    min_size 3
    max_size 20
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take partition0
    step chooseleaf indep 0 type host
    step emit
}

The buckets defined by the CRUSH map in Table 1 can be represented as a tree diagram:

Figure 9 crush map graphical

As shown in figure 9, root is the entry point of the crush map and defines the bucket rack0, which contains three hosts, each containing two osd devices. The buckets define straw2 as the random selection algorithm and rjenkins1 as the hash algorithm. The hash algorithm is used to compute a random value, and the selection algorithm chooses an OSD based on that value.

The selection rules defined in Table 1 are partition0_rule and partition0_ec_rule_1. The prototype of a rule is as follows:

rule <rule-name> {
    ruleset {id}
    type [replicated | erasure]
    min_size {min-size}
    max_size {max-size}
    step take {bucket-name} [class {device-class}]
    step [choose | chooseleaf] [firstn | indep] {num} type {bucket-type}
    step emit
}

ruleset: the id of the current rule

type: the storage mode of the pool (replicated or erasure)

min_size: if the number of replicas of the Pool is less than min_size, this rule is not used

max_size: if the number of replicas of the Pool is greater than max_size, this rule is not used

step take {bucket-name} [class {device-class}]:

Select a bucket, usually a root bucket, and use it as the starting point for traversing the tree. A custom device class can also be specified to restrict the selection.

step choose firstn {num} type {bucket-type}:

Depth-first, select {num} sub-buckets of type {bucket-type}.

- If {num} = 0, choose pool-num-replicas buckets (all available).

- If {num} > 0 and {num} < pool-num-replicas, choose that many buckets.

- If {num} < 0, choose pool-num-replicas + {num} buckets.

In other words: if num is 0, as many buckets are selected as the pool has replicas; if num is greater than 0 and less than the pool replica count, num buckets are returned; if num is less than 0, the number returned is the pool replica count minus the absolute value of num.

step chooseleaf firstn {num} type {bucket-type}:

The same as the previous rule, except that chooseleaf descends to leaf nodes, which are generally osds.

step emit:

Output the selection result.

3.4.1 Straw algorithm

With the CRUSH Map organizational structure and the Placement Rules selection rules in place, picking a suitable device from a bucket comes down to a random-selection algorithm. The random algorithms currently supported by CRUSH are:

Table 2 Random selection algorithms

Uniform: suitable when every item has the same weight, items are rarely added or removed, and the number of items is fairly fixed.

List: items are kept in a linked list and may have arbitrary weights, but some nodes end up with a disproportionately high probability of being selected.

Tree: uses a binary search tree, so lookup is O(log n) even with a large number of items, but selection probability can still be skewed, and adding, removing or reweighting nodes introduces extra reorganization overhead.

Straw: a fairer algorithm than List and Tree; every node gets a fair chance to compete, and the larger a node's weight, the higher its chance of being selected.

Straw2: an improved version of Straw that reduces the amount of data migrated when nodes are removed or moved.

In more detail, the Straw algorithm gives every item in a bucket as fair a selection chance as possible, with higher-weight items selected with higher probability. Execution recursively traverses the buckets until a suitable device is found: larger weights tend to win, and ties between equal weights are broken randomly.

Figure 10 straw code snippet

As the code shows:

1. crush(pg_id, osd_id, r) => draw, where r is a constant (the attempt number)

2. (draw & 0xffff) * osd_weight => straw

3. Take the item with the maximum straw value (high_draw) and return it

Here draw is a random number; it is multiplied by the weight to obtain a signature value, and the OSD with the largest signature value is selected. To avoid always picking the item with the largest weight, the quality of the random number generation matters a great deal.
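A hedged sketch of the straw draw in steps 1-3 above. The rjenkins mixing is replaced by a generic keyed hash so the example runs standalone; crush_hash and the weights are illustrative only, not Ceph's actual implementation.

import hashlib

def crush_hash(pg_id: int, osd_id: int, r: int) -> int:
    """Stand-in for Ceph's rjenkins hash: deterministic mixing of (pg_id, osd_id, r)."""
    digest = hashlib.sha256(f"{pg_id}:{osd_id}:{r}".encode()).digest()
    return int.from_bytes(digest[:4], "big")

def straw_select(pg_id: int, osds: dict, r: int = 0) -> int:
    """Pick one osd: draw = hash(pg, osd, r); straw = (draw & 0xffff) * weight; keep the max."""
    best_osd, best_straw = None, -1.0
    for osd_id, weight in osds.items():
        draw = crush_hash(pg_id, osd_id, r)
        straw = (draw & 0xffff) * weight
        if straw > best_straw:
            best_osd, best_straw = osd_id, straw
    return best_osd

# same inputs always give the same answer (pseudo-random), and larger weights win more often
osds = {0: 1.080, 1: 1.080, 2: 2.160}
print([straw_select(pg, osds) for pg in range(10)])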

Fig. 11 rjenkins hash algorithm

Three groups of input data are mixed repeatedly to produce a value that is as random as possible. Note that the same inputs always yield the same value, which is why CRUSH is a pseudo-random algorithm.

For reference, see the source in src/crush/mapper.c; the entry function is crush_do_rule.

3.5 ObjectStore module

ObjectStore is the last module data passes through in Ceph before it reaches disk; it is responsible for ensuring that business data is written safely, reliably and efficiently. ObjectStore defines the transaction operations, which each concrete local storage backend must implement. ObjectStore currently has four native implementations:

1. FileStore: the local storage most widely used in the H and J releases; it reads and writes objects through a file system.

2. BlueStore: the local storage backend introduced with the L release and Ceph's default going forward; it abandons the file system and manipulates block devices directly for the best performance.

3. KStore: uses a KV storage system as local storage.

4. MemStore: data and metadata are stored in memory, mainly for testing and verification.

FileStore and BlueStore are currently used in production environments.

3.5.1 FileStore

FileStore uses the xfs file system by default to save data. It implements the ObjectStore interface via the POSIX file system interface, and each object is ultimately stored as a file.

Figure 12 filestore structure

FileStore includes two modules: FileJournal and DBObjectMap. FileStore introduces FileJournal to improve the write-transaction throughput and atomicity of ObjectStore; it is the equivalent of a database WAL (write-ahead log) and exists to guarantee the integrity of each write transaction. The journal is written with direct io, after which the transaction is committed to FileStore's queue to complete the write. If a crash occurs, the OSD replays the journal during recovery.

FileStore therefore writes the journal first and the data disk afterwards. The write is acknowledged to the upper layer once the journal is written, but the data can only be read once it has been applied to the data disk. Even so, performance is still good for workloads with a large number of small random writes. Because a log is written before the data, FileStore suffers from write amplification.
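A toy sketch of that journal-first ordering; journal, data_files and the ack callback are invented for the illustration and do not correspond to FileStore's actual classes.

journal = []          # stand-in for the direct-io journal (WAL)
data_files = {}       # stand-in for the xfs-backed object files

def filestore_write(obj_id: str, data: bytes, ack):
    """Toy FileStore transaction: journal first, ack, then apply to the data file."""
    journal.append((obj_id, data))   # 1. write-ahead log via direct io
    ack()                            # 2. the upper layer is acked once the journal is durable
    data_files[obj_id] = data        # 3. the transaction is applied to the file system afterwards

def recover():
    """After a crash, replay the journal so the data files catch up."""
    for obj_id, data in journal:
        data_files[obj_id] = data

filestore_write("obj1", b"x" * 4096, ack=lambda: print("acked after journal write"))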

DBObjectMap is the module dedicated to managing object attributes, with two implementations: xattr and omap. Xattr uses the file system's extended attributes and is limited by their maximum length, so it suits small amounts of data. Omap uses leveldb key-value storage; attributes that exceed the xattr limit can be stored in omap.

3.5.2 Bluestore

Figure 13 overall structure of bluestore

To solve the write amplification caused by filestore writing a journal before the data, Bluestore was optimized accordingly for ssd. Bluestore operates on the raw disk directly, avoiding the overhead of a file system (xfs, ext4) as much as possible. Disk space on the raw device is managed by an allocator (BitmapAllocator by default). The metadata Bluestore generates in the io path is stored in a rocksdb kv database. Rocksdb itself depends on a file system, but it abstracts the underlying system as an env, and a user program only needs to implement the corresponding interface to supply that abstraction; Bluestore implements BlueRocksEnv, and BlueFS backs BlueRocksEnv. The logs and data files of BlueFS are kept on the raw disk, which can be shared with user data or placed on separately specified devices.

Figure 14 osd device data Partition

Osd directory partition: occupies 100M by default and is mounted with the xfs file system. It holds basic descriptive information, including whoami (the osd number), the osd type, the osd magic word, the entry to the block device, and so on.

Block dev label partition: occupies 4K by default. It stores the bluestore_bdev_label_t structure, including osd_uuid, the block device size, device description, creation time, etc.

Bluefs super block partition: occupies 4K by default. It stores the bluefs_super_t structure, including osd_uuid, version, block_size, etc.

DB data partition: by default occupies max(1G, 4% of the total block device size) and stores DB data and log data.

User data partition: occupies the remaining space by default and stores user business data.

Metadata mapping relationship:

Figure 15 metadata mapping relationship

An object corresponds to one onode; an onode contains multiple extents; each extent is stored in one or more blobs; and each blob maps to one or more pextents, which are concrete physical blocks. This completes the mapping from logical extent to physical extent.
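The onode -> extent -> blob -> pextent chain in Figure 15 can be written down as plain data structures. This is only a schematic of the relationship; the field names are simplified and are not BlueStore's actual types.

from dataclasses import dataclass, field
from typing import List

@dataclass
class PExtent:            # a physically allocated region on the raw device
    offset: int
    length: int

@dataclass
class Blob:               # one blob maps to one or more physical extents
    pextents: List[PExtent] = field(default_factory=list)

@dataclass
class Extent:             # a logical range of the object, stored in some blob
    logical_offset: int
    length: int
    blob: Blob

@dataclass
class Onode:              # one object corresponds to one onode holding many extents
    object_name: str
    extents: List[Extent] = field(default_factory=list)

# object offset 0..64K lives in a blob backed by a 64K physical extent at 1 MiB
onode = Onode("obj1", [Extent(0, 65536, Blob([PExtent(offset=1 << 20, length=65536)]))])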

The BlueStore read path is relatively simple: if the buffered-read flag is set, data is served from the cache; on a miss it is read from disk and then added to the cache.

The BlueStore write path is more complicated and is roughly divided into large writes and small writes. Whether a write is small or large is determined by comparing it against min_alloc_size, the minimum allocation unit of a blob, which defaults to 64K here: writes smaller than 64K take the small-write path, and writes of 64K or more take the large-write path.

Figure 16 write scenarios

Small writes are further divided into overwrites and non-overwrites. Non-overwriting small writes and large writes carry no penalty: the data is written first and the metadata is updated afterwards. Overwrite scenarios generate WAL log entries, and the WAL must be processed before the business data is written.
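A hedged sketch of that decision: compare the write against min_alloc_size, and route small overwrites through the WAL. The function name and its return labels are invented for illustration; the real BlueStore logic also considers alignment and blob state.

MIN_ALLOC_SIZE = 64 * 1024  # 64K in this article's example

def classify_write(length: int, overwrites_existing_data: bool) -> str:
    """Toy classification of a BlueStore write as described above."""
    if length >= MIN_ALLOC_SIZE:
        return "large write: allocate a new blob, write data then update metadata"
    if not overwrites_existing_data:
        return "small write, no overwrite: write data then update metadata (no penalty)"
    return "small overwrite: record a WAL entry first, then apply the data"

print(classify_write(128 * 1024, overwrites_existing_data=False))
print(classify_write(4096, overwrites_existing_data=True))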

The overall performance is much better than that of filestore.

4 data peering and recovery

The basic logical unit responsible for data IO and data recovery in Ceph is the PG (Placement Group); the PG is the data carrier in Ceph and can be understood as a directory or collection containing multiple objects.

A PG has its own state; when the PG's health is abnormal, its state changes and data may migrate.

active+clean: the healthy state of a PG.

Degraded: the degraded state. With a two-replica configuration, if one of the osds goes down, the PG enters this state; it can still serve IO.

Recovery: an osd went offline for a while but the pg remained readable and writable. When the failed osd comes back, the data it holds is older than the versions on the other osds, so recovery runs to bring the data back into consistency.

Remapped: if an osd goes offline because of a hard disk failure, the failure is permanent and a new osd must be assigned for remapping. The newly selected osd starts out empty and all the data must be copied to it from the healthy osds, so the state is usually remapped+backfilling.

Peered: this state appears when so many osds have failed that the pg can no longer be read or written. It can be understood as the surviving osds waiting for the other replicas to come back online.

4.1 data peering mechanism

Peering is the process by which the group of OSDs that hold a PG bring their views of that PG's state into agreement. When the PG reaches active, peering has completed, but the data on the OSDs is not necessarily consistent yet.

Scenarios that trigger Peering:

- When the system initializes, an OSD starts and reloads PGs, or a PG is created, the PG initiates a Peering process.

- When an OSD fails, or OSDs are added or removed, the acting set of the PG changes and the PG also initiates a Peering process.

Key concepts in peering

1. Acting set and up set

The acting set is the ordered list of OSDs serving a PG; the first OSD in the list is the primary OSD. The up set is normally the same as the acting set; they differ only when a temporary PG is created because of an OSD failure, in which case the acting set and up set are no longer identical.

2. Temporary pg

Suppose the primary OSD osd.0 in the original acting set [0, 1, 2] goes down and CRUSH recomputes the mapping as [3, 1, 2]. At this point osd.3 has none of the PG's data and must backfill, so it cannot serve reads. A temporary PG is therefore created: the monitor is told to let osd.1 act as the temporary primary, so the up set becomes [3, 1, 2] while the acting set is [1, 3, 2]. Once backfill finishes, the temporary pg is removed and the two lists become the same again (a small sketch appears after this list).

3. Authoritative History (authoritative log)

This is the complete, continuous sequence of operation log records for the pg; it serves as the basis for data recovery.
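The temporary-pg example from item 2, rewritten as data: a minimal illustration of how the up set and acting set diverge while a temporary primary serves IO during backfill. The variable names are for illustration only.

# Before the failure: osd.0 is the primary of the PG
up_set     = [0, 1, 2]
acting_set = [0, 1, 2]

# osd.0 dies; CRUSH recomputes the mapping and now picks osd.3 first,
# but osd.3 has no data yet and must backfill, so it cannot serve reads.
up_set     = [3, 1, 2]          # what CRUSH says the PG should live on

# A temporary PG asks the monitor to let osd.1 act as primary in the meantime.
acting_set = [1, 3, 2]          # who actually serves IO during backfill

# Once backfill finishes, the temporary pg is removed and the two lists converge.
acting_set = up_set
print(up_set, acting_set)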

The Peering process falls into three main steps, followed by activation:

- GetInfo: the primary osd of the pg sends messages to each replica OSD to collect their pg_info.

- GetLog: by comparing the pg_info of each osd, one osd (auth_log_shard) is chosen as holding the authoritative log. If the primary osd is not the one with the authoritative log, it fetches that log; afterwards the primary osd also holds the authoritative log.

- GetMissing: pull the pg logs (in part or in full) from the other OSDs and, by comparing them with the local authoritative log, work out which objects each OSD is missing, as the basis for the subsequent recovery process.

- Active: activate the primary osd and send notify messages to activate the corresponding replica osds.

Put simply, peering does not recover any data; it unifies the state recorded on each osd and identifies which objects need to be recovered, in preparation for the next step.
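A toy illustration of the GetInfo/GetLog idea: gather each osd's pg_info and pick the authoritative one. The comparison key used here (last epoch, last version) is a simplification of the real election criteria, and the types are invented for the example.

from typing import List, NamedTuple

class PgInfo(NamedTuple):
    osd: int
    last_epoch: int     # epoch of the last update this osd has seen for the PG
    last_version: int   # version of the last entry in its PG log

def choose_auth_log(infos: List[PgInfo]) -> PgInfo:
    """Pick the osd whose PG log is 'most complete' (simplified criterion)."""
    return max(infos, key=lambda i: (i.last_epoch, i.last_version))

infos = [PgInfo(0, 40, 120), PgInfo(1, 42, 118), PgInfo(2, 42, 125)]
print(choose_auth_log(infos))  # osd.2 holds the authoritative log in this toy example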

4.2 data recovery

After Peering completes, it is known whether any data needs to be recovered. There are two repair paths: recovery and backfill.

Recovery repairs inconsistent objects based on the missing-object records derived from the PG log.

Backfill repairs a copy by rescanning all objects in the PG and comparing them against a complete copy to find what is missing. Backfill is used when an OSD has been down so long that the PG log no longer covers the gap, or when data migrates because new OSDs were added.
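The choice between the two repair paths can be sketched as a simple check on whether the PG log still covers the divergence; the function and parameter names are invented for the illustration.

def choose_repair(missing_covered_by_pg_log: bool, is_new_osd: bool) -> str:
    """Toy decision between recovery and backfill as described above."""
    if is_new_osd or not missing_covered_by_pg_log:
        # the log no longer covers the gap (or the osd is brand new): rescan and copy objects
        return "backfill"
    # the PG log records exactly which objects are missing: repair only those
    return "recovery"

print(choose_repair(missing_covered_by_pg_log=True, is_new_osd=False))   # recovery
print(choose_repair(missing_covered_by_pg_log=False, is_new_osd=False))  # backfill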

5 data end-to-end consistency

The io path in a storage system is complex. A traditional storage stack includes the application layer, the kernel file system, the block layer, the SCSI layer, the HBA and the disk controller, and errors can creep in at every layer. Traditional end-to-end solutions therefore focus on per-block checksums.

Because Ceph sits at the application layer, encapsulating fixed-size data blocks and adding parity data would cause serious performance problems, so Ceph only introduces a Scrub mechanism (read-verify) to ensure data correctness.

Put simply, Ceph's OSDs periodically start Scrub threads that scan a subset of objects and compare them with the other replicas to check consistency. If an inconsistency is found, Ceph surfaces the error to the user to resolve.

By the content scanned, Scrub comes in two forms:

- scrub: only compares the metadata of each object's replicas to check consistency. Because only metadata is checked, the read and compute cost is small; it is a lightweight check.

- deep-scrub: further checks whether the data content of the objects is consistent. This deep scan reads almost all the data on the disk and computes crc32 checksums, so it is expensive and consumes considerable system resources.
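A minimal sketch of the difference between the two checks: scrub compares object metadata across replicas, while deep-scrub also checksums the data (crc32 via zlib here). The replica layout and field names are invented for the example.

import zlib

# each replica: object name -> (size, data)
replica_a = {"obj1": (5, b"hello"), "obj2": (3, b"abc")}
replica_b = {"obj1": (5, b"hellO"), "obj2": (3, b"abc")}   # obj1 data silently corrupted

def scrub(replicas):
    """Light scrub: compare only metadata (here, object set and sizes)."""
    names = [set(r) for r in replicas]
    sizes = [{k: v[0] for k, v in r.items()} for r in replicas]
    return all(n == names[0] for n in names) and all(s == sizes[0] for s in sizes)

def deep_scrub(replicas):
    """Deep scrub: additionally read the data and compare crc32 checksums."""
    crcs = [{k: zlib.crc32(v[1]) for k, v in r.items()} for r in replicas]
    return scrub(replicas) and all(c == crcs[0] for c in crcs)

print(scrub([replica_a, replica_b]))       # True: metadata still matches
print(deep_scrub([replica_a, replica_b]))  # False: the corruption only shows up in a deep scrub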

By the way it scans, Scrub is divided into two types:

- Online scanning: does not affect the normal business of the system.

- Offline scanning: requires business to be stopped or frozen.

Ceph's Scrub is an online scan: it runs without interrupting the system's business as a whole, but the Scrub of a specific object locks that object and blocks client access to it until the Scrub finishes.

The above is an analysis of how Ceph works; hopefully you have picked up some useful knowledge or skills from it.
