
Analyzing the Working Principles and Workflows of Ceph


Today I would like to talk with you about how Ceph works and how its main processes unfold. Many readers may not know much about this, so the following material has been summarized to make it easier to understand; I hope you get something out of this article.

This article first introduces the computation-based object addressing mechanism at the core of RADOS, then explains the workflow of object access, then describes how RADOS cluster maintenance works, and finally reviews the technical advantages of this design in light of Ceph's structure and principles.

5.1 Addressing process

The addressing process in the Ceph system is shown in the following figure [1].

Several concepts on the left side of the figure above are described as follows:

File - the file here is a file that a user needs to store or access. For an object storage application built on Ceph, this file corresponds to the "object" in the application, that is, the unit of data the user directly manipulates.

Object - the object here is the "object" as seen by RADOS. It differs from the file above in that its maximum size is capped by RADOS (usually 2 MB or 4 MB) so that the underlying storage can be organized and managed. Therefore, when an upper-layer application stores a large file into RADOS, it must split the file into a series of uniformly sized objects (the last one may be smaller). To avoid confusion, this article uses the terms file and object explicitly rather than a single ambiguous word for both.

PG (Placement Group) - as the name implies, a PG's role is to organize and map the storage of objects. Specifically, a PG is responsible for organizing a number of objects (possibly thousands or more), but each object maps to exactly one PG; that is, the relationship between PG and object is "one to many". At the same time, a PG is mapped to n OSDs, and each OSD carries a large number of PGs, so the relationship between PG and OSD is "many to many". In practice, n is at least 2, and at least 3 in a production environment. A single OSD typically hosts hundreds of PGs. The choice of PG count affects the uniformity of data distribution, a point expanded on below.

OSD - the object storage device, described in detail earlier, so it is not expanded on here. The only point worth noting is that the number of OSDs also affects the uniformity of data distribution in the system, so it should not be too small. In practice it should be at least on the order of tens or hundreds for the design of Ceph to deliver its intended advantages.

Failure domain - this concept is not given a definition in the paper, but readers with some background in distributed storage systems should be able to grasp its general meaning.

Based on the definitions above, the addressing process can be explained. Specifically, addressing in Ceph goes through at least three mappings:

(1) File -> object mapping

The purpose of this mapping is to map the file a user wants to operate on into objects that RADOS can handle. The mapping is very simple: the file is essentially segmented according to the maximum object size, much like the striping process in RAID. This segmentation brings two benefits: first, it turns files of unbounded size into uniformly sized objects of bounded maximum size that RADOS can manage efficiently; second, it turns serial processing of a single file into parallel processing of multiple objects.

Each object produced by the segmentation receives a unique oid, that is, an object id. Its generation is an equally simple linear mapping. In the figure, ino is metadata of the file being operated on, which can be simply understood as the file's unique id; ono is the sequence number of an object produced by segmenting the file; and the oid is simply the file id with this sequence number appended. For example, if a file whose id is filename is split into three objects, their sequence numbers are 0, 1, and 2, and the resulting oids are filename0, filename1, and filename2.
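
To make the striping step concrete, here is a minimal Python sketch of the file -> object mapping. The 4 MB object size and the plain string concatenation for the oid are illustrative assumptions taken from the description above, not Ceph's actual naming code.

    # Minimal sketch of the file -> object mapping. OBJECT_SIZE and the
    # oid scheme (ino with the stripe number appended, as in the
    # filename0/filename1/filename2 example) are illustrative only.
    OBJECT_SIZE = 4 * 1024 * 1024  # assume a 4 MB maximum object size

    def file_to_objects(ino: str, data: bytes, object_size: int = OBJECT_SIZE):
        """Split a file into fixed-size stripes, each with its own oid."""
        count = max(1, -(-len(data) // object_size))  # ceiling division
        return [(f"{ino}{ono}",                       # oid = ino + ono
                 data[ono * object_size:(ono + 1) * object_size])
                for ono in range(count)]

    # A 9 MB file yields filename0 and filename1 at 4 MB each, plus a
    # smaller filename2 holding the final 1 MB.
    print([oid for oid, _ in file_to_objects("filename", b"x" * 9 * 1024 * 1024)])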

The implicit requirement here is that the ino must be unique; otherwise the subsequent mappings cannot work correctly.

(2) Object -> PG mapping

After the file is mapped to one or more objects, each object needs to be mapped independently to a PG. This mapping is also simple; as shown in the figure, the formula is:

hash(oid) & mask -> pgid

As can be seen, the computation has two steps. First, a static hash function specified by the Ceph system computes the hash of the oid, mapping the oid to a pseudo-random value with an approximately uniform distribution. Then, this pseudo-random value is combined with mask by a bitwise AND to obtain the final PG sequence number (pgid). By RADOS's design, if the total number of PGs is m (where m should be an integer power of 2), then mask has the value m-1. The overall effect of the hash and the bitwise operation is therefore an approximately uniform random selection among all m PGs. Based on this mechanism, when there are large numbers of objects and PGs, RADOS can guarantee an approximately uniform mapping between objects and PGs. And because objects are segmented from files, most objects have the same size, so this mapping ultimately ensures that the total amount of object data stored in each PG is approximately uniform.
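
As a sketch, the hash-and-mask computation can be written out as follows. The particular hash function here (SHA-1 truncated to 32 bits) is only a stand-in for the static hash Ceph actually specifies; the structure of the calculation is what matters.

    import hashlib

    def object_to_pg(oid: str, pg_num: int) -> int:
        """hash(oid) & mask -> pgid, with mask = pg_num - 1."""
        # pg_num plays the role of m and must be an integer power of 2,
        # so that pg_num - 1 is an all-ones bit mask.
        assert pg_num > 0 and pg_num & (pg_num - 1) == 0
        h = int.from_bytes(hashlib.sha1(oid.encode()).digest()[:4], "big")
        return h & (pg_num - 1)

    # The three objects from the earlier example scatter across 256 PGs:
    for oid in ("filename0", "filename1", "filename2"):
        print(oid, "-> pg", object_to_pg(oid, pg_num=256))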

It is not difficult to see that "large numbers" have been repeatedly emphasized here. Only with large numbers of objects and PGs does the approximate uniformity of the pseudo-random mapping hold, and only then is the uniformity of data storage in Ceph guaranteed. To keep the numbers large, on the one hand the maximum object size should be configured reasonably, so that files of a given size are split into more objects; on the other hand, Ceph recommends that the total number of PGs be hundreds of times the total number of OSDs, so that there are enough PGs for the mapping. A small sizing helper is sketched below.
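
The sketch below makes the sizing guidance concrete. The ratio of 200 is an illustrative deployment choice within the "hundreds" range mentioned above, not a value fixed by Ceph; the power-of-two rounding reflects the requirement that m be an integer power of 2.

    def suggested_pg_num(osd_count: int, ratio: int = 200) -> int:
        """Total PGs around (ratio x OSD count), rounded up to a power of 2."""
        target = osd_count * ratio
        pg_num = 1
        while pg_num < target:
            pg_num *= 2
        return pg_num

    print(suggested_pg_num(osd_count=40))  # 40 * 200 = 8000, rounded up to 8192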

(3) PG -> OSD mapping

The third mapping maps the PG, the logical organizational unit of objects, onto OSDs, the actual storage units for the data. As shown in the figure, RADOS uses an algorithm called CRUSH: it takes the pgid as input and produces a set of n OSDs. These n OSDs are jointly responsible for storing and maintaining all objects in the PG. As mentioned earlier, n can be configured according to the reliability required in practice, and is usually 3 in a production environment. For each OSD, the OSD daemon running on it handles the storage and access of the objects mapped to the local file system, as well as metadata maintenance and other duties.
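
The following sketch is NOT CRUSH itself; it is a rendezvous-hashing stand-in that shares the properties the text relies on: the OSD set is computable from the pgid and the cluster membership alone, the choice is pseudo-random, and (as discussed below) it is largely stable when OSDs come and go. Real CRUSH additionally walks a weighted hierarchy of hosts and racks under administrator-defined placement rules; all names here are illustrative.

    import hashlib

    def _rank(pgid: int, osd: str) -> int:
        """Deterministic pseudo-random score for a (PG, OSD) pair."""
        digest = hashlib.sha1(f"{pgid}:{osd}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    def pg_to_osds(pgid: int, osds: list, n: int = 3) -> list:
        """Pick the n OSDs responsible for a PG (first acts as Primary)."""
        return sorted(osds, key=lambda osd: _rank(pgid, osd), reverse=True)[:n]

    cluster = [f"osd.{i}" for i in range(10)]
    print(pg_to_osds(pgid=137, osds=cluster))  # a deterministic set of 3 OSDs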

Unlike the hash algorithm used in the object -> PG mapping, the result of the CRUSH calculation is not absolutely invariant; it is affected by two main factors:

One is the current system state, that is, the cluster map mentioned in "Ceph Analysis Series IV - logical structure". When the state or number of OSDs in the system changes, the cluster map may change, and this change affects the mapping between PGs and OSDs.

The other is the storage policy configuration. The policy here mainly concerns data safety. Through policy configuration, a system administrator can require, for example, that the three OSDs hosting the same PG be located on different servers, or even on different racks in the data center, thereby further improving storage reliability.

Therefore, the mapping between PG and OSD is fixed only while the system state (cluster map) and storage policy remain unchanged. In practice, a policy rarely changes once configured, while the system state changes either because of equipment failure or because the storage cluster is expanded. Fortunately, Ceph provides automated support for such changes, so even if the PG-to-OSD mapping changes, applications are not disturbed. In fact, Ceph exploits this dynamic mapping deliberately: it is precisely the dynamic nature of CRUSH that lets Ceph migrate a PG to a different OSD combination on demand, thereby automatically achieving high reliability, data re-balancing, and other properties.

One reason CRUSH, rather than an ordinary hash algorithm, is used for this mapping is the configurability just described: the administrator's configuration parameters can determine the policy for mapping OSDs to physical locations. The other is CRUSH's special "stability": when a new OSD joins the system and the system scale grows, the mapping between most PGs and OSDs does not change; only a small number of PG mappings change and trigger data migration. An ordinary hash algorithm can provide neither this configurability nor this stability. The design of the CRUSH algorithm is thus one of the core elements of Ceph; for a detailed introduction see [2].
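
The stability property can be checked directly with the pg_to_osds() sketch above: adding an eleventh OSD should remap only the PGs whose top-3 ranking the newcomer enters, roughly 3/11 of them, while every other PG keeps its OSD set unchanged.

    # Count how many of 1024 PGs change their OSD set when osd.10 joins.
    before = {pg: pg_to_osds(pg, [f"osd.{i}" for i in range(10)]) for pg in range(1024)}
    after = {pg: pg_to_osds(pg, [f"osd.{i}" for i in range(11)]) for pg in range(1024)}
    moved = sum(1 for pg in range(1024) if before[pg] != after[pg])
    print(f"{moved} of 1024 PGs remapped")  # roughly 3/11, not a full reshuffle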

At this point, Ceph has completed the entire mapping from file to object, to PG, and to OSD through the three mappings. Looking over the whole process, we can see that it requires no global table lookups of any kind. As for the only global data structure, the cluster map, it will be introduced later; here it suffices to point out that maintaining and operating on the cluster map is lightweight and does not adversely affect the system's scalability, performance, or other qualities.

A possible puzzle is: why are the second and third mappings both needed? Isn't one of them redundant? Sage does not say much about this in his paper; the author's own analysis is as follows:

We can approach it from the opposite direction: what would happen without the PG layer of mapping? In that case, some algorithm would have to map objects directly to a set of OSDs. If this algorithm were a fixed hash mapping, then an object would be pinned to a fixed set of OSDs. If one or more of those OSDs were damaged, the object could not be automatically migrated to other OSDs (the mapping function does not allow it); and when new OSDs were added to expand capacity, objects could not be re-balanced onto the new OSDs (again, the mapping function does not allow it). These limitations run against the design goals of high reliability and high automation in Ceph.

If instead a dynamic algorithm (for example, still CRUSH) were used for this direct mapping, the problems of static mapping would seem to be avoided. However, the result would be an explosive growth in the local metadata each OSD must process, and the resulting computational complexity and maintenance workload would also be unbearable.

For example, under Ceph's existing mechanism, an OSD regularly exchanges information with the other OSDs that co-host its PGs, to determine whether they are working properly and whether maintenance operations are needed. Because one OSD carries roughly several hundred PGs, and each PG usually involves three OSDs, an OSD performs on the order of hundreds to thousands of information exchanges over a given period.

Without PGs, however, an OSD would need to exchange information with the other OSDs co-hosting each of its objects. Since each OSD is likely to carry millions of objects, the exchanges required of one OSD over the same period would soar to millions or even tens of millions. The cost of maintaining such state is clearly too high.
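
A back-of-envelope comparison, using illustrative values in the ranges the text mentions, shows the scale of the difference:

    # Peering fan-out per OSD, with and without the PG layer. The inputs
    # are illustrative values in the ranges mentioned above.
    pgs_per_osd = 300            # "hundreds of PGs" on one OSD
    replicas = 3                 # OSDs per PG, so 2 peers per PG
    objects_per_osd = 2_000_000  # "millions of objects" on one OSD

    print("with PG:   ", pgs_per_osd * (replicas - 1), "exchanges")      # ~600
    print("without PG:", objects_per_osd * (replicas - 1), "exchanges")  # ~4,000,000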

To sum up, the author believes that introducing PG brings at least two benefits: on the one hand, it realizes a dynamic mapping between objects and OSDs, leaving room for Ceph's reliability, automation, and other features; on the other hand, it effectively simplifies the organization of data storage and greatly reduces the system's maintenance and management overhead. Understanding this is important for a thorough grasp of Ceph's object addressing mechanism.

5.2 Data operation flow

First of all, we will take the file writing process as an example to illustrate the data operation process.

To simplify the explanation and make it easier to understand, several assumptions are made here. First, assume the file to be written is small enough not to require striping, so it maps to just one object. Second, assume that each PG in the system is mapped to three OSDs.

Based on the above assumptions, the file write process can be represented by the following figure [3]:

As shown in the figure, when a client needs to write a file to a Ceph cluster, it first completes the addressing process described in Section 5.1 locally, turning the file into an object and then finding the set of three OSDs that will store the object. The three OSDs each have a different rank: the OSD ranked first is the Primary OSD of the group, and the other two are the Secondary OSD and the Tertiary OSD.

Having found the three OSDs, the client communicates directly with the Primary OSD to initiate the write (step 1). On receiving the request, the Primary OSD issues write operations to the Secondary OSD and the Tertiary OSD (steps 2 and 3). When the Secondary and Tertiary OSDs each complete their write, they send confirmations to the Primary OSD (steps 4 and 5). Once the Primary OSD has confirmed that the other two OSDs have finished writing, it completes its own write and confirms to the client that the object write is done (step 6).

The reason for this write sequence is essentially to guarantee the reliability of the write and avoid data loss as far as possible. At the same time, because the client only needs to send data to the Primary OSD, external network bandwidth and overall access latency are somewhat optimized in Internet usage scenarios.

Of course, this reliability mechanism inevitably adds latency; in particular, waiting until every OSD has written the data to disk before acknowledging the client could make the overall delay unbearable. Ceph therefore acknowledges the client twice. Once each OSD has written the data into its memory buffer, a first acknowledgment is sent to the client, which may then continue executing. Once each OSD has written the data to disk, a final acknowledgment is sent, and the client may then delete its local copy as appropriate.
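
Put together, steps 1 through 6 and the two acknowledgements look roughly like the following single-process sketch. The class and method names are invented for illustration; the point is the shape of the protocol, not Ceph's actual implementation.

    class OSD:
        """Toy OSD with a memory buffer (fast) and a disk store (durable)."""
        def __init__(self, name: str):
            self.name, self.buffer, self.disk = name, {}, {}

        def write_buffered(self, oid: str, data: bytes) -> str:
            self.buffer[oid] = data
            return f"{self.name} buffered"

        def flush(self, oid: str) -> str:
            self.disk[oid] = self.buffer[oid]
            return f"{self.name} committed"

    def client_write(oid: str, data: bytes, primary: OSD, replicas: list):
        acks = [primary.write_buffered(oid, data)]               # step 1
        acks += [r.write_buffered(oid, data) for r in replicas]  # steps 2-5
        print("first ack, client unblocked:", acks)              # step 6
        commits = [o.flush(oid) for o in [primary, *replicas]]
        print("final ack, client may drop local copy:", commits)

    client_write("filename0", b"...", OSD("osd.4"), [OSD("osd.9"), OSD("osd.1")])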

Analyzing this process shows that, under normal circumstances, a client can complete the OSD addressing operation independently, without relying on other system modules. Large numbers of clients can therefore operate concurrently against large numbers of OSDs. Likewise, if a file is split into multiple objects, those objects can be sent to multiple OSDs in parallel.

From the OSD's perspective, because the same OSD plays different roles in different PGs, its workload is also spread as evenly as possible, preventing any single OSD from becoming a performance bottleneck.

To read data, the client only needs to perform the same addressing process and contact the Primary OSD directly. In the current Ceph design, read data is served only by the Primary OSD, although there have been discussions about spreading the read load to improve performance.

5.3 Cluster maintenance

As mentioned in the earlier introduction, several monitors are jointly responsible for discovering and recording the state of all OSDs in the Ceph cluster, and together they form the master version of the cluster map, which is then propagated to all OSDs and clients. OSDs use the cluster map for data maintenance, while clients use it for data addressing.

Within the cluster, the monitors all perform essentially the same function, and their relationship can be loosely understood as primary-backup. The discussion below therefore does not distinguish between individual monitors.

Somewhat surprisingly, the monitors do not actively poll the current state of each OSD. Instead, OSDs report their state to the monitors. There are two common triggers for a report: a new OSD joining the cluster, or an OSD discovering a fault in itself or in another OSD. On receiving such reports, the monitors update the cluster map and propagate it. The details are described below.

The actual contents of the cluster map include:

(1) Epoch, the version number. The epoch of a cluster map is a monotonically increasing sequence: the larger the epoch, the newer the cluster map. Any two OSDs or clients holding different versions of the cluster map can therefore decide whose is current simply by comparing epochs, and the monitors always hold the largest epoch, that is, the latest cluster map. When two communicating parties find that their epochs differ, by default they first synchronize to the cluster map of the higher-versioned party and then proceed with subsequent operations.

(2) Network address of each OSD.

(3) The state of each OSD. The OSD state is described along two dimensions: up or down (whether the OSD is working normally) and in or out (whether the OSD carries at least one PG). Any OSD is therefore in one of four states, listed below; a small sketch of the two dimensions follows this list.

-- Up and in: the OSD is running normally and carries the data of at least one PG. This is the standard working state of an OSD.

-- Up and out: the OSD is running normally but does not yet host any PG and holds no data. A new OSD enters this state as soon as it joins the Ceph cluster; a repaired OSD also re-enters this state when it rejoins.

-- Down and in: the OSD has encountered an exception but still carries at least one PG, in which data is still stored. An OSD in this state has only just been found to be abnormal; it may yet recover, or it may have failed completely.

-- Down and out: the OSD has failed completely and no longer carries any PG.

(4) The configuration parameters of the CRUSH algorithm, which convey the physical hierarchy of the Ceph cluster (cluster hierarchy) and the location mapping rules (placement rules).
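
The two state dimensions in item (3) combine into the four states just listed. A compact representation, with illustrative names:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class OSDState:
        up: bool    # True = daemon working normally, False = "down"
        in_: bool   # True = carries at least one PG, False = "out"

        def describe(self) -> str:
            return ("up" if self.up else "down") + " and " + ("in" if self.in_ else "out")

    print(OSDState(up=True, in_=False).describe())   # a freshly added OSD
    print(OSDState(up=False, in_=True).describe())   # just failed, data still mapped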

Given this definition of the cluster map, version changes are usually triggered only by changes to (3) and (4), and of the two, (3) changes far more often. This is reflected in the following walkthrough of how an OSD's working state changes.

When a new OSD comes online, it first communicates with the monitors according to its configuration. The monitors add it to the cluster map, set its state to up and out, and send the latest version of the cluster map to the new OSD.

On receiving the cluster map from the monitors, the new OSD computes which PG it should host (to simplify the discussion, assume the new OSD initially hosts only one PG), as well as the other OSDs hosting the same PG. The new OSD then contacts those OSDs. If the PG is currently degraded (that is, the number of OSDs hosting it is below normal: where there should normally be 3, there are only 2 or 1, usually the result of an OSD failure), the other OSDs copy all objects and metadata in this PG to the new OSD. Once the copy finishes, the new OSD is set to the up and in state, and the cluster map is updated accordingly. This is in effect an automated failure-recovery process. Of course, even if no new OSD joins, a degraded PG will recruit other OSDs to carry out failure recovery.

If the PG is currently healthy, the new OSD replaces one of the existing OSDs (the Primary OSD is re-selected within the PG) and takes over its data. After the data replication completes, the new OSD is set to up and in, and the replaced OSD exits the PG (though its state usually remains up and in, because it has other PGs to host). The cluster map is updated accordingly. This is in effect an automated data re-balancing process.

If an OSD finds that another OSD co-hosting one of its PGs cannot be reached, it reports the situation to the monitors. Likewise, if an OSD daemon finds its own working state abnormal, it actively reports the exception to the monitors. In either case the monitors set the problematic OSD's state to down and in. If the OSD does not recover within a preset grace period, its state is set to down and out; if it does recover, its state returns to up and in. After any of these state changes, the monitors update the cluster map and propagate it. This is in effect an automated failure-detection process.
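
The transitions described in the preceding paragraphs can be summarized as a small state machine over the OSDState sketch above. The grace period value is an invented placeholder, not a Ceph default.

    GRACE_PERIOD_SECONDS = 600  # placeholder; the real timeout is configurable

    def on_join(_: OSDState) -> OSDState:
        return OSDState(up=True, in_=False)    # new OSD: up and out

    def on_data_copied(_: OSDState) -> OSDState:
        return OSDState(up=True, in_=True)     # recovery/re-balancing done

    def on_failure_report(s: OSDState) -> OSDState:
        return OSDState(up=False, in_=s.in_)   # typically down and in

    def on_grace_expired(_: OSDState) -> OSDState:
        return OSDState(up=False, in_=False)   # down and out: give up on it

    state = on_failure_report(OSDState(up=True, in_=True))
    print(state.describe())                    # "down and in"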

As the foregoing shows, even for a Ceph cluster of thousands of OSDs or more, the cluster map data structure is not alarmingly large, and its state does not update frequently. Even so, Ceph optimizes how cluster map information is diffused, to reduce the associated computation and communication load.

First, cluster map information is diffused incrementally. If two communicating parties find that their epochs differ, the party with the newer version sends only the difference between the two cluster maps to the other.

Second, cluster map information is diffused asynchronously and lazily. That is, the monitors do not broadcast each new version to all OSDs after every cluster map update; instead, they reply with the update when an OSD reports information to them. Similarly, when OSDs communicate with each other, each sends updates to any peer whose version is older than its own.

These mechanisms let Ceph avoid the broadcast storms that cluster map version updates could otherwise cause. Although diffusion is asynchronous and lazy, according to the conclusions of Sage's paper, in a Ceph cluster of n OSDs any version update reaches every OSD in the cluster within O(log n) time.
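
A sketch of the incremental, lazy reply: when one party reports in with an older epoch, the other sends only the missing increments rather than a full map. The delta store here is an invented stand-in for whatever the newer party retains.

    # Increments held by the newer party, keyed by the epoch they produce.
    incrementals = {42: "delta 41->42", 43: "delta 42->43", 44: "delta 43->44"}

    def updates_for(my_epoch: int, peer_epoch: int) -> list:
        """Return only the increments the peer is missing (possibly none)."""
        if peer_epoch >= my_epoch:
            return []
        return [incrementals[e] for e in range(peer_epoch + 1, my_epoch + 1)]

    print(updates_for(my_epoch=44, peer_epoch=42))  # two deltas, not a full map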

A question that may arise: since diffusion is asynchronous and lazy, during propagation different OSDs will inevitably see inconsistent cluster maps; does that cause problems? The answer is no. In fact, as long as a client agrees on the cluster map state with the OSDs of the PG it is accessing, the access proceeds correctly. And if the client or one of the OSDs in that PG is inconsistent with the other parties, then by Ceph's mechanism design those parties first synchronize their cluster maps to the latest state, perform any necessary data re-balancing, and then continue the access normally.

Through the above introduction, we can see in outline how Ceph, based on the cluster map mechanism, has monitors, OSDs, and clients cooperate to maintain cluster state and carry out data access. In particular, on top of this mechanism, automatic data backup, data re-balancing, failure detection, and failure recovery all come about naturally, with no need for complex additional design. That is genuinely impressive.

After reading the above, do you have a better understanding of how Ceph works and how its processes unfold? If you want to learn more, please follow the industry information channel. Thank you for your support.
