Example Analysis of OSD, OSDMap, PG and PGMap in Ceph

2025-02-24 Update From: SLTechnology News&Howtos

Shulou (Shulou.com) 05/31 Report

This article walks through OSD, OSDMap, PG and PGMap in Ceph in detail. I found the material quite practical, so I am sharing it here as a reference, and I hope you get something out of it.

Ceph aims to provide petabyte-scale cluster storage together with automatic failure recovery and convenient scaling up and down. In a typical distributed storage system these capabilities rely on a Metadata Server, because a fully decentralized system has a hard time with data migration and expansion. A Metadata Server, on the other hand, must be kept from becoming a single point of failure or a data bottleneck. Ceph offers more flexible and more robust automatic failure handling and recovery, which makes a Metadata Server indispensable, so to keep it from becoming a bottleneck the key question is which metadata it should maintain. The Monitor acts as Ceph's Metadata Server and maintains the cluster information, which consists of six maps: MONMap, OSDMap, PGMap, LogMap, AuthMap and MDSMap. PGMap and OSDMap are the two most important of these and are the focus of this article.

OSDMap

OSDMap holds the information about all OSD nodes in the Ceph cluster. Every OSD change, such as a process exiting, a node joining or leaving, or a node's weight changing, is reflected in this map. The map is not held by the Monitor alone: OSD nodes and clients also obtain it from the Monitor, so in practice every participant (OSD, Monitor and client) has to be treated as a "client" of the map, and each "client" may hold a different version of OSDMap. When the authoritative OSDMap held by the Monitor changes, the Monitor does not send it to every "client". Instead, it pushes the new map only to the "clients" that need to know about the change. For example, adding a new OSD causes some PGs to migrate, so the OSDs holding those PGs are notified. In addition, the Monitor randomly selects some other OSDs and sends them the new OSDMap.

So how does OSDMap spread gradually? Suppose OSD.a and OSD.b have received the new OSDMap, and OSD.c and OSD.d share some PGs with OSD.a and OSD.b. Their messages to each other carry the OSDMap epoch. If OSD.c and OSD.d find that their version is lower, they actively pull the OSDMap from the Monitor, and in some cases OSD.a and OSD.b actively push their own, newer OSDMap to OSD.c and OSD.d. The new OSDMap therefore spreads among the nodes over the following period of time. When the cluster is idle, the new map may take longer to propagate, but this does not affect state consistency between OSDs: an OSD that has not received the new map has had no traffic that depends on it, so it does not yet need to know about the change.
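The epoch-based spreading described above can be sketched as a small simulation. This is an illustrative model only, not Ceph's implementation: the `Monitor` and `Osd` classes, their methods, and the idea of a map as an `(epoch, payload)` pair are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Monitor:
    """Hypothetical monitor: holds the authoritative (epoch, payload) map."""
    epoch: int = 1
    payload: str = "initial"

    def publish(self, payload: str) -> None:
        # Every change to the map produces a new epoch.
        self.epoch += 1
        self.payload = payload


@dataclass
class Osd:
    """Hypothetical OSD: holds whatever map version it last learned about."""
    name: str
    epoch: int = 1
    payload: str = "initial"

    def fetch_from(self, monitor: Monitor) -> None:
        # Pull the latest map from the monitor.
        self.epoch, self.payload = monitor.epoch, monitor.payload

    def send_to(self, peer: "Osd", monitor: Monitor) -> None:
        # Each message carries the sender's epoch; whoever is behind catches up.
        if peer.epoch < self.epoch:
            peer.fetch_from(monitor)  # receiver is behind: it pulls from the monitor
        elif peer.epoch > self.epoch:
            # receiver is ahead: it pushes its newer map back to the sender
            self.epoch, self.payload = peer.epoch, peer.payload


mon = Monitor()
osds = {n: Osd(n) for n in "abcd"}

mon.publish("new OSD added")
for n in "ab":  # only the affected OSDs are notified directly
    osds[n].fetch_from(mon)
before = {n: o.epoch for n, o in osds.items()}

osds["a"].send_to(osds["c"], mon)  # c notices a's higher epoch and pulls
osds["d"].send_to(osds["b"], mon)  # d is behind, so b pushes its map back
after = {n: o.epoch for n, o in osds.items()}
print(before, after)
```

Nothing forces OSD.c or OSD.d to update until they exchange a message with a peer that is ahead, which matches the point that an idle cluster propagates a new map more slowly.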

Ceph avoids having to synchronize cluster state everywhere at once by managing multiple versions (epochs) of OSDMap. This is what lets it tolerate the state changes caused by node changes in clusters with thousands of OSDs.
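One way to picture "managing multiple versions" is a history that records the change behind each epoch, so that a node holding an older epoch can be brought forward without a cluster-wide synchronization step. The sketch below assumes such a per-epoch change log. The `MapHistory` class and its method names are hypothetical and chosen only for illustration.

```python
class MapHistory:
    """Hypothetical store of map changes, keyed by the epoch each one produced."""

    def __init__(self) -> None:
        self.latest = 0
        self.changes = {}  # epoch -> description of the change

    def commit(self, change: str) -> int:
        # Record a change and return the new epoch it created.
        self.latest += 1
        self.changes[self.latest] = change
        return self.latest

    def catch_up(self, have_epoch: int) -> list:
        # Return only the changes newer than the epoch the caller already has.
        return [(e, self.changes[e]) for e in range(have_epoch + 1, self.latest + 1)]


history = MapHistory()
history.commit("osd.0 added")
history.commit("osd.1 added")
history.commit("osd.0 weight changed")

stale = history.catch_up(1)    # a node still at epoch 1
current = history.catch_up(3)  # a node already at the latest epoch
print(stale, current)
```

Because each node only needs the changes after its own epoch, nodes at different versions can coexist and update independently.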

When an OSD crashes unexpectedly, the other OSDs that exchange heartbeats with it find that they can no longer reach it and report this to the Monitor. The Monitor then temporarily marks the OSD as down (it is marked out only if it stays down), and every PG whose Primary was on that OSD hands the Primary role over to another OSD (explained below).
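The failure-handling flow in this paragraph can be sketched as follows. This is a simplified model under stated assumptions: the `MonitorSketch` class, the `min_reporters` threshold, and the rule "the first non-failed OSD in a PG's ordered OSD list is its Primary" are illustrative choices, not a description of Ceph's actual code.

```python
class MonitorSketch:
    """Hypothetical monitor that marks an OSD as failed after enough peer reports."""

    def __init__(self, osds, min_reporters=2):
        self.failed = set()
        self.known = set(osds)
        self.min_reporters = min_reporters  # assumed threshold for this sketch
        self.reports = {}                   # target OSD -> set of reporting OSDs

    def report_failure(self, reporter: str, target: str) -> None:
        # Heartbeat peers report a target they can no longer reach.
        self.reports.setdefault(target, set()).add(reporter)
        if len(self.reports[target]) >= self.min_reporters:
            self.failed.add(target)

    def primary(self, pg_osds):
        # Assumed rule: the first non-failed OSD in the PG's list acts as Primary.
        for osd in pg_osds:
            if osd not in self.failed:
                return osd
        return None


mon = MonitorSketch(["osd.1", "osd.2", "osd.3"])
pg_osds = ["osd.1", "osd.2", "osd.3"]

p0 = mon.primary(pg_osds)
mon.report_failure("osd.2", "osd.1")
p1 = mon.primary(pg_osds)  # one report is below the threshold
mon.report_failure("osd.3", "osd.1")
p2 = mon.primary(pg_osds)  # enough reports: the Primary role moves on
print(p0, p1, p2)
```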

PG and PGMap

In Ceph, a PG has a state machine with more than ten states and dozens of events to deal with the anomalies a PG may face. Each PG is like a family: the data the PG holds is its wealth, while an OSD is just a castle. Each castle provides residence for multiple families, but to make sure the wealth is passed on, each family sets up residences in several castles. A castle (OSD) provides the PG with only a communication address (IP:Port) and some infrastructure (such as OSDMap and the messaging mechanism). If a castle has an accident, the family's residences in the other castles update their status and choose a new castle as a replacement residence. If the castle later recovers, every family in it contacts its residences in the other castles to learn how the wealth changed during the accident. The point of this analogy is that an Object (that is, user data) follows its PG, not an OSD.
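The idea that an Object follows its PG rather than an OSD can be illustrated with a two-step lookup: object name to PG, then PG to a list of OSDs. The sketch below is a stand-in, not Ceph's placement logic: it uses SHA-256 and a plain dictionary where Ceph uses its own hashing and placement algorithm, and the names `object_to_pg`, `locate`, and the OSD labels are hypothetical.

```python
import hashlib


def object_to_pg(name: str, pg_num: int) -> int:
    # Stand-in hash: the object name alone decides the PG.
    digest = hashlib.sha256(name.encode()).digest()
    return int.from_bytes(digest[:8], "big") % pg_num


def locate(name: str, pg_num: int, pg_to_osds: dict):
    # Step 1: object -> PG. Step 2: PG -> the OSDs currently hosting it.
    pg = object_to_pg(name, pg_num)
    return pg, pg_to_osds[pg]


pg_num = 4
# Hypothetical placement table: each PG lives on two of three OSDs.
pg_to_osds = {pg: [f"osd.{pg % 3}", f"osd.{(pg + 1) % 3}"] for pg in range(pg_num)}

pg_before, osds_before = locate("photo.jpg", pg_num, pg_to_osds)

# A "castle" fails: every PG that lived there moves to osd.3 instead.
lost = osds_before[0]
moved = {pg: ["osd.3" if o == lost else o for o in osds]
         for pg, osds in pg_to_osds.items()}

pg_after, osds_after = locate("photo.jpg", pg_num, moved)
print(pg_before, osds_before, pg_after, osds_after)
```

Only the PG-to-OSD table changes when an OSD is replaced. The object's PG stays the same, so nothing about the object itself has to be tracked per OSD.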

From the description above we can see that the Monitor holds the OSD status and PG status of the whole cluster. Each PG owns a portion of the Objects and is responsible for maintaining the information about them. The Monitor holds no Object-level information. Each PG therefore has to maintain its own PG state to keep its Objects consistent. The data of each PG, together with the records needed for failure recovery and migration, is maintained by the PG itself, which means it is stored on the OSDs where that PG resides.

PGMap is the status of all PGs as maintained by the Monitor, while each OSD holds the status of its own PGs. A PG migration requires a decision by the Monitor, which is then reflected in PGMap, and the OSDs involved are notified to change their PG status. After a new OSD starts and is added to OSDMap, the Monitor tells it which PGs to create and maintain. When there are multiple replicas, the PG's Primary OSD actively contacts the OSDs holding the replica copies of the PG and exchanges the PG's status, including its recent history. Generally speaking, a new OSD obtains all of the PG's data from the other copies and gradually catches up. If the OSD already holds information about the PG, the Primary compares the PG histories and brings the PG information into agreement. This process is called Peering: a "discussion" initiated by the PG's Primary OSD, in which the OSDs that hold this PG compare PG information and history with one another until they agree.
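The comparison of histories during Peering can be sketched in a very reduced form. This sketch assumes each copy of a PG keeps an ordered log of `(version, operation)` entries and that the copy with the highest last version is treated as authoritative. Real Peering handles many more cases, and the function name `peer` and the data layout are hypothetical.

```python
def peer(logs: dict):
    """Compare per-OSD logs for one PG.

    logs maps an OSD name to its list of (version, operation) entries,
    ordered by increasing version. Returns the OSD whose log is treated as
    authoritative and, for every OSD, the entries it still has to apply.
    """
    def last_version(osd: str) -> int:
        return logs[osd][-1][0] if logs[osd] else 0

    # Assumed rule: the copy with the newest entry is authoritative.
    auth_osd = max(logs, key=last_version)
    auth_log = logs[auth_osd]

    missing = {
        osd: [entry for entry in auth_log if entry[0] > last_version(osd)]
        for osd in logs
    }
    return auth_osd, missing


logs = {
    "osd.1": [(1, "write a"), (2, "write b"), (3, "write c")],  # up to date
    "osd.2": [(1, "write a"), (2, "write b")],                  # missed one write
    "osd.4": [],                                                # newly added, empty
}
auth_osd, missing = peer(logs)
print(auth_osd, missing)
```

Here the new, empty OSD receives the full history, while the OSD that was briefly unavailable receives only the entry it missed, which mirrors the two cases described in the paragraph above.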

That concludes this example analysis of OSD, OSDMap, PG and PGMap in Ceph. I hope the content above is of some help. If you found the article useful, please share it so more people can see it.
