

An introduction to the concepts and components of Ceph


First: a basic introduction to Ceph

Ceph is a distributed storage system that is reliable, automatically rebalancing, and automatically recovering. By use case, Ceph can be divided into three parts: object storage, block device storage, and file system services. Ceph's advantage over other storage systems is that it does not just store data; it also makes full use of the computing power on the storage nodes. When each piece of data is stored, Ceph calculates where that data should be placed and balances the data distribution as much as possible. At the same time, thanks to its design, which relies on methods such as the CRUSH algorithm and hash rings, it avoids the traditional single point of failure, and its performance does not degrade as the cluster scales out.

Second: an introduction to the core components

Ceph OSD (required)

The full name is Object Storage Device. Its main functions include storing data and handling data replication, recovery, backfilling, and rebalancing of the data distribution, as well as reporting monitoring information to the Ceph Monitors.
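As a quick illustration (a minimal sketch assuming a running cluster and an admin keyring), the OSD layout and usage can be inspected with the ceph CLI:

ceph osd stat    # number of OSDs and how many are up/in
ceph osd tree    # the CRUSH hierarchy: hosts and the OSDs on them
ceph osd df      # per-OSD utilization, showing how well data is balanced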

Ceph Monitor (required)

The Ceph Monitor's main function is to maintain the health of the entire cluster and to provide consistent decision-making. It holds the cluster maps, including the Monitor map; the monitor itself does not store any of the cluster's data.
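A minimal sketch (assuming the ceph CLI and a reachable monitor) of checking monitor membership and quorum:

ceph mon stat       # monitor count, quorum members, and map epoch
ceph quorum_status  # which monitors are currently in quorum
ceph mon dump       # the monitor map itself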

Ceph Manager (required)

The Ceph Manager daemon (ceph-mgr) is responsible for tracking runtime metrics and the current state of the Ceph cluster, including storage utilization, current performance metrics, and system load. The Ceph Manager daemon also manages and exposes cluster information through Python plug-ins (modules), including the web-based Ceph Manager Dashboard and a REST API. High availability normally requires at least two managers.
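For example (a sketch, assuming a release where the dashboard module ships with ceph-mgr), the dashboard can be enabled and the exposed services listed:

ceph mgr module ls                 # modules that are enabled or available
ceph mgr module enable dashboard   # turn on the web-based dashboard
ceph mgr services                  # URLs of services exposed by ceph-mgr, e.g. the dashboard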

Ceph MDS (optional)

The full name is Ceph Metadata Server; it stores the metadata of Ceph's file system (CephFS). It does not have to be installed; it is only needed when you use CephFS.

Third: an introduction to the basic components

RADOS

RADOS is itself a complete distributed object storage system with reliability, intelligence, and distribution built in. Ceph's high reliability, high scalability, high performance, and high degree of automation are all provided by this layer, and user data is ultimately stored through it. RADOS can be considered the core of Ceph, and it is composed mainly of two parts: the OSDs and the Monitors.
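As a rough illustration (a sketch assuming admin access), the overall state of this layer, including the monitors, managers, and OSDs it is built from, can be summarized with:

ceph -s   # cluster health, monitor/manager/OSD services, and a pool/PG summary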

Librados

Librados is a library that allows applications to access and interact with the RADOS system. It supports a variety of programming languages, such as C, C++, Python, and so on.

RADOSGW

RADOSGW is a gateway based on the currently popular RESTful protocol and is compatible with the S3 and Swift APIs. It is only needed when you use object storage.
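For example (a sketch assuming a running radosgw instance; the user id and display name are made up), an S3-style user can be created with radosgw-admin and its keys used by any S3 client:

radosgw-admin user create --uid=demo --display-name="Demo User"   # prints the access_key and secret_key
radosgw-admin user info --uid=demo                                # show the user and its keys again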

RBD

RBD provides a distributed block device through the Linux kernel client and the QEMU/KVM driver. It can be understood as carving a disk out of the Ceph cluster, much like an LVM logical volume on Linux, on which users can directly create a file system and mount it.
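A minimal sketch (the pool, image name, and size are examples; it assumes the rbd kernel module and a client keyring are in place) of creating, mapping, and mounting an RBD image:

ceph osd pool create rbdpool 64     # a pool to hold RBD images
rbd pool init rbdpool
rbd create rbdpool/disk1 --size 10240   # 10 GiB image (size given in MB)
rbd map rbdpool/disk1                   # expose it as a /dev/rbdX device via the kernel client
mkfs.ext4 /dev/rbd0                     # put a file system on it (the device name may differ)
mkdir -p /mnt/disk1
mount /dev/rbd0 /mnt/disk1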

CephFS

CephFS provides a POSIX-compatible file system through the Linux kernel client and through FUSE. When a Linux system's kernel does not support the ceph mount, or more advanced operations are required, ceph-fuse can be used instead.
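Both mount paths can be sketched as follows (the monitor address, user name, and keyring path are placeholders; it assumes an MDS is running and a CephFS file system exists):

mkdir -p /mnt/cephfs
mount -t ceph 192.168.1.10:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret   # kernel client
ceph-fuse /mnt/cephfs   # FUSE client; reads the config and keyring from /etc/ceph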

Fourth: terminology and noun interpretation

CRUSH

CRUSH is the data distribution algorithm used by Ceph. Similar in spirit to a consistent hash, it distributes data to the desired locations.
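The compiled CRUSH map and its rules can be inspected like this (a sketch; the file names are arbitrary):

ceph osd crush rule ls              # the CRUSH rules defined in the cluster
ceph osd getcrushmap -o crush.bin   # export the compiled CRUSH map
crushtool -d crush.bin -o crush.txt # decompile it into readable text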

Map

As mentioned above, the monitor component is responsible for monitoring the operation of the entire cluster, such as the state of the nodes and the cluster configuration information, which is maintained by the daemons that track cluster membership. How is this information stored? The answer is in maps. The maps kept by the Ceph monitor mainly include the following:

Monitor map: includes end-to-end information about the monitor nodes, including the Ceph cluster ID, the monitor hostnames, IPs, and ports, along with the current version and the latest change information. View it with "ceph mon dump".
OSD map: includes commonly used information such as the cluster ID, the version of the OSD map and its last modification time, pool-related information (mainly the pool name, pool ID, type, replica count, and PG/PGP counts), as well as the number of OSDs, their status, weight, latest clean interval, and OSD host information. View it with "ceph osd dump".
PG map: includes the current PG version, the timestamp, the latest OSD map version, the space usage ratio and the near-full percentage, as well as the details of each PG: its ID, object count, status, OSD state, and deep-scrub details. View it with "ceph pg dump".
CRUSH map: includes the cluster's storage device information, the failure-domain hierarchy, and the rules defined for storing data within the failure domains. View it with "ceph osd crush dump".
MDS map: includes the version of the current MDS map, its creation and modification times, the data and metadata pool IDs, the number of MDS daemons in the cluster, and the MDS status. View it with "ceph mds dump".
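Summarized as commands (a sketch; on newer releases "ceph mds dump" has been replaced by "ceph fs dump"):

ceph mon dump        # monitor map
ceph osd dump        # OSD map and pool details
ceph pg dump         # PG map (verbose)
ceph osd crush dump  # CRUSH map
ceph mds dump        # MDS map ("ceph fs dump" on newer releases)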

Copy

A copy is the number of replicas of the data stored in Ceph; it can be understood as the number of backup copies of a file. The default replica count in Ceph is 3, that is, one primary, one secondary, and one tertiary. Only the primary OSD's copy handles client requests, and it then writes the data to the other OSDs.
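The replica count is a per-pool setting (a sketch; "testpool" follows the example below):

ceph osd pool get testpool size       # current number of replicas
ceph osd pool set testpool size 3     # one primary plus two additional copies
ceph osd pool get testpool min_size   # minimum replicas that must be available to serve I/O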

As shown below, there is an object called object1 in a pool called testpool. After obtaining its map information, you can see that

this object has its primary on osd.1 and its secondary and tertiary on osd.0 and osd.2, i.e. there are 3 replicas and each of these OSDs stores one copy.

[root@ceph-1]# ceph osd map testpool object1
osdmap e220 pool 'testpool' (38) object 'object1' -> pg 38.bac5debc (38.0) -> up ([1,0,2], p1) acting ([1,0,2], p1)

Explanation of the other fields

osdmap e220: the version number of this osd map
pool 'testpool' (38): the name and ID of the pool
object 'object1': the name of the object
pg 38.bac5debc (38.0): the PG the object maps to, i.e. pg 38.0
up ([1,0,2], p1): the up set, listing in order the OSDs that hold the replicas: osd.1 (primary), osd.0 (secondary), and osd.2 (tertiary)
acting ([1,0,2], p1): the acting set, which is usually the same as the up set. To understand when they differ you need to understand pg_temp. Suppose the acting set of a pg is [0,1,2] and osd.0 fails, so the CRUSH algorithm reassigns the up set of that pg to [3,1,2]. osd.3 is now the primary OSD of the pg, but it cannot yet serve reads for the pg because it has no data on it. So a temporary pg mapping (pg_temp) is requested from the monitor, with osd.1 as the temporary primary OSD; the acting set then becomes [1,3,2] while the up set is [3,1,2], and this is where the acting set and the up set differ. When osd.3 finishes backfilling, the pg's acting set is restored to match the up set, i.e. both the acting set and the up set are [3,1,2].
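The up set, acting set, and current state of a single PG can also be examined directly (a sketch; 38.0 is the PG from the example above):

ceph pg map 38.0     # the up set and acting set of the PG
ceph pg 38.0 query   # detailed JSON, including peering and backfill state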

Object

The object is the lowest-level storage unit in Ceph. Each object contains metadata and the raw data. When a user stores data in a Ceph cluster, the data is split into multiple objects; the size of each object is configurable and defaults to 4MB. The object can be regarded as the smallest unit of Ceph storage.
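Objects can be written and inspected directly with the rados CLI (a sketch; the pool, object, and file names are examples; note that rados put stores the file as a single object, while higher layers such as RBD, CephFS, and RGW are what stripe large data into 4MB objects):

echo "hello ceph" > demo.txt
rados -p testpool put object1 demo.txt   # store the file as one object
rados -p testpool stat object1           # size and modification time of the object
rados -p testpool ls                     # list the objects in the pool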

PG and PGP

A PG (placement group) is used to store objects.

PGP can be thought of as the set of permutations in which a PG's replicas are placed on OSDs. It does not affect the number of replicas, only the ordering of the replicas.

Pool

A pool is a logical storage concept. When we create a pool, we need to specify its pg and pgp counts. A Ceph pool is a logical partition for storing objects: each pool contains a certain number of PGs, which map a certain number of objects onto different OSDs within the cluster.

Therefore, each pool is distributed across all nodes of the cluster, i.e. the pool spans the entire cluster, which provides ample flexibility.
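Creating a pool with explicit PG and PGP counts looks like this (a sketch; the name and numbers are examples):

ceph osd pool create testpool 128 128   # pool name, pg_num, pgp_num
ceph osd pool get testpool pg_num
ceph osd pool get testpool pgp_num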

Fifth: distinguishing easily confused points

The relationship between object and pg

Because the number of objects is huge, Ceph introduces the concept of the PG to manage objects. Each object is ultimately mapped to a PG by the CRUSH calculation, and a single PG can contain many objects.

The relationship between pg and osd

A PG also needs to be mapped, via the CRUSH calculation, onto OSDs for storage. If there are three replicas, each PG is mapped to three OSDs, e.g. [osd.0, osd.1, osd.2]; osd.0 then holds the primary copy of the PG and osd.1 and osd.2 hold the secondary copies, which guarantees data redundancy.
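The PG-to-OSD mapping can be listed for the whole cluster or for a single PG (a sketch; 38.0 is the example PG used earlier):

ceph pg dump pgs_brief   # every PG with its up set and acting set of OSDs
ceph pg map 38.0         # the mapping of one specific PG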

The relationship between pg and pool

A pool is also a logical storage concept. When we create a storage pool, we need to specify its pg and pgp counts. Logically, a PG belongs to a pool, just as an object belongs to a PG.

The relationship between pg and pgp

PGs are used to store objects, and PGP can be thought of as the number of permutations in which a PG's replicas are laid out across OSDs. For example, take three OSDs (osd.0, osd.1, osd.2) with the replica count set to 3, which is the Ceph default. If pgp is 1, an object can only be placed in the single ordering osd.0, osd.1, osd.2. If pgp is 2, the object may be placed in one of two orderings, say osd.0, osd.1, osd.2 or osd.1, osd.0, osd.2. If pgp is 3, there are three possible orderings. So pgp does not affect the number of replicas of a PG; it only affects how many orderings of OSDs are selectable for placing a PG's replicas. In other words, the role of pgp is to balance the data across the OSDs in the cluster. PG is the number of "directories" in which a given pool stores its objects, while PGP is the number of OSD distribution combinations available to the pool's PGs. Increasing pg_num causes the data inside existing PGs to split into the newly created PGs on the same OSDs; increasing pgp_num causes the placement of some PGs to change, but does not change the objects inside a PG.
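This is exactly what happens when a pool's PG counts are grown (a sketch; the pool name and values are examples):

ceph osd pool set testpool pg_num 256    # data in existing PGs splits into new PGs on the same OSDs
ceph osd pool set testpool pgp_num 256   # CRUSH may now redistribute the new PGs to different OSDs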

The relationship between stored data, objects, PGs, PGPs, pools, OSDs, and storage disks

A 12MB file is divided into three objects, objectA, objectB, and objectC, which are stored in three PGs, pgA, pgB, and pgC, and these three PGs are managed by poolA, poolB, and poolC respectively. Which OSDs each PG is placed on is selectable, and how many choices there are is decided by pgp. If pgp is set to 1, the figure shows the one possible PG distribution ordering, and it is the only one. If pgp is 2, then besides the distribution shown in the figure there is one other possible ordering, which might be pgA on osd1, pgB on osd3, and pgC on osd2; of course other orderings are conceivable, but with pgp set to 2 there are only two distributions to choose from.

