Parsing Ceph: Snapshot
Wang, Haomai | December 27, 2013
Developers often ask on the mailing list how Ceph Snapshot is implemented. Documentation is sparse and the code structure is complex, so Ceph Snapshot is not easy to figure out. While recently refactoring DBObjectMap in the Ceph storage-engine layer, which involves handling clones between Snapshots, I re-traced the snapshot-related code paths, which account for a large share of the Ceph IO path. This article does not focus on code or data structures; instead it presents the implementation of Snapshot from a high-level design perspective.
Before reading on, be sure you understand the basics of Ceph and its usage scenarios: why Ceph, and how Ceph is used.
Ceph Snapshot usage scenario
Most people who try Ceph Snapshots start with Ceph's RBD library, i.e. block storage. Using librbd, volumes and Snapshots can be created quickly with simple commands:
rbd create image-name --size 1024 -p pool
rbd snap create pool/image-name --snap snap-name
The first command creates a volume named "image-name". During this process librbd only creates metadata; it does not actually request space from Ceph. More details on how librbd uses RADOS to implement block storage and its management will be covered in a future article.
The second command creates a Snapshot named "snap-name" for the "image-name" volume. Once it exists, writes to "image-name" can be rolled back at any time to the data as it was when "snap-name" was taken, with a command such as:
rbd snap rollback pool/image-name --snap snap-name
When users actually try this, they find that Ceph's volume operations and management are very lightweight: creating a volume takes about the same time regardless of volume size or cluster size, because behind the scenes it is essentially the same operation. Developers are more interested in how Snapshot is implemented, because the implementation determines how Snapshots can be used effectively.
Ceph Snapshot implementation
Before going into detail, it is important to understand that Ceph has the concept of a Pool, which is what the -p pool option in the commands above refers to. A Ceph cluster can contain multiple Pools; each Pool is a logical isolation unit, and different Pools can handle data in completely different ways. For example, Replication Size (number of replicas), Placement Groups (PG), CRUSH Rules, Snapshots, and Ownership are all isolated per Pool.
Therefore, every operation on Ceph must specify a Pool. The image operations above are performed on a Pool named "pool", and the Image named "image-name" is stored in that "pool".
Besides the concept of Pool, Ceph essentially has two Snapshot modes, and the two cannot be applied to the same Pool at the same time:
Pool Snapshot: take a Snapshot of the entire Pool; all objects in that Pool are affected.
Self Managed Snapshot: a user-managed Snapshot; put simply, which objects in the Pool are affected is controlled by the user. The "user" here is usually an application such as librbd.
The rbd operations above essentially use the second mode, so let's first look at how that mode is implemented.
As mentioned earlier, Snapshots are also isolated per Pool, and the two Snapshot modes are implemented in largely the same way; how they are used is the main reason they are kept separate. Each Pool has a snap_seq field, which can be thought of as the global version of the entire Pool. Every Object stored in Ceph also carries a snap_seq: each Object has a Head version and possibly a set of Snapshot objects, and both the Head version and the snapshot objects carry a snap_seq. Let's see how librbd uses this field to create a Snapshot.
1. The user requests a Snapshot named "snap-name" for the "image-name" image in "pool".
2. librbd asks the Ceph Monitor for a new snap sequence for "pool"; the Ceph Monitor increments the Pool's snap_seq and returns the new value to librbd.
3. librbd replaces the image's snap_seq with the new value and assigns the original snap_seq to the newly created Snapshot named "snap-name".
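To make these steps concrete, here is a minimal Python sketch of the bookkeeping they describe. The names below (pool_state, image, create_self_managed_snapshot) are hypothetical stand-ins for Monitor and librbd state, not actual Ceph or librbd APIs.

# Hypothetical sketch of self-managed Snapshot creation (not librbd code).
def create_self_managed_snapshot(pool_state, image, snap_name):
    # The Monitor increments the Pool's snap_seq and returns the new value to librbd.
    pool_state["snap_seq"] += 1
    new_seq = pool_state["snap_seq"]
    # The image's previous snap_seq becomes the seq of the named Snapshot...
    image["snaps"][snap_name] = image["snap_seq"]
    # ...and the image Head now writes with the newly issued snap_seq.
    image["snap_seq"] = new_seq
    return new_seq

# Example: with pool_state = {"snap_seq": 0} and image = {"snap_seq": 0, "snaps": {}},
# calling create_self_managed_snapshot(pool_state, image, "snap-name") leaves
# image == {"snap_seq": 1, "snaps": {"snap-name": 0}}.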
From these three steps, readers familiar with version-control implementations can roughly guess how Ceph implements Snapshot: each Snapshot is tied to a snap_seq, and the Image can be regarded as the Head version of that Snapshot chain. Every IO operation is sent to the Ceph OSD together with a snap_seq, and the OSD checks the snap_seq of the object the IO touches. For example, if "object-1" is a data object of "image-name", its initial snap_seq is the snap_seq of "image-name". After a Snapshot is created, the next write to "object-1" carries the new snap_seq; on receiving the request, Ceph compares it with the Head version of "object-1" and finds that the write's snap_seq is greater than the object's. It then clones a new Object Head version from the original "object-1", keeps the original "object-1" as a Snapshot object, and attaches the new snap_seq (the one librbd previously obtained) to the new Object Head.
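The copy-on-write check on the OSD side can be sketched in the same spirit. This is an illustrative Python model with invented structures (ObjectVersion, heads, snaps), not OSD code.

import copy

class ObjectVersion:
    """Stand-in for one stored version of an object (a Head or a Snapshot object)."""
    def __init__(self, snap_seq, data=b""):
        self.snap_seq = snap_seq
        self.data = data

def handle_write(heads, snaps, obj_name, data, write_snap_seq):
    """heads: name -> Head ObjectVersion; snaps: name -> {snap_seq: ObjectVersion}."""
    head = heads[obj_name]
    if write_snap_seq > head.snap_seq:
        # The write carries a newer snap_seq than the Head: preserve the current
        # Head as a Snapshot object before the write proceeds.
        snaps.setdefault(obj_name, {})[head.snap_seq] = copy.deepcopy(head)
        head.snap_seq = write_snap_seq  # the Head moves forward to the new snap_seq
    head.data = data  # writes only ever land on the Head version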
Ceph's actual implementation is of course much more complex than this, with more corner cases to handle and with the management of Object Snaps on top.
That is the second Snapshot mode; the first mode is actually simpler. Whereas the second mode has the application (librbd) obtain and manage snap_seq itself, in the first mode a command such as "rados mksnap snap-name -p pool" takes a Snapshot of the whole pool, and the application does not need to know anything about snap_seq. The command increments the snap_seq of "pool", and all objects under "pool" are affected from then on, because every subsequent IO operation automatically inherits the Pool's snap_seq and clones the object. CephFS uses this mode to manage global Snapshots.
To put it simply, the difference between the two modes lies in whether the application attaches a snap_seq itself when it issues an IO request.
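Reusing the hypothetical handle_write() from the sketch above, that difference fits in one line: in Self Managed mode the application supplies its own snap_seq with the request, while in Pool mode the request carries none and the Pool's current snap_seq is inherited.

def submit_write(heads, snaps, pool_snap_seq, obj_name, data, client_snap_seq=None):
    # Self Managed Snapshot: the application (e.g. librbd) supplies client_snap_seq.
    # Pool Snapshot: no snap_seq from the client, so the Pool's snap_seq is inherited.
    effective_seq = client_snap_seq if client_snap_seq is not None else pool_snap_seq
    handle_write(heads, snaps, obj_name, data, effective_seq)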
Storage Management of Object Snapshot
The preceding sections described how snap_seq is used to locate the corresponding objects in the underlying storage and return them. So how does the underlying storage engine manage the different versions of an Object?
First of all, every Object is accessed through the ObjectStore interface. The Ceph master branch currently provides both MemStore and FileStore, with FileStore being the default storage backend. A future article will cover the FileStore implementation in detail.
In Ceph, each Object has three kinds of storage: the main Object data storage, xattr storage, and omap storage. The Object data storage holds the user's actual data. Xattr storage is mainly used to provide extended-attribute storage for CephFS. Omap storage can be understood as a KV store associated with a particular object. An Object's metadata (pool, PG, name, and so on) is managed by an object_info_t structure, and a SnapSetContext structure manages its Snapshots; both are persisted in the object's KV storage. The default FileStore uses LevelDB as the key-value store and maps and manages it through the DBObjectMap class.
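As a purely illustrative summary (the key names below are invented, not Ceph's on-disk format), one object's state can be pictured as split across the three interfaces like this:

object_record = {
    # main data storage: the user's payload, backed by a file under FileStore
    "data": b"...object payload...",
    # xattr storage: small extended attributes, used mainly by CephFS
    "xattr": {"example.attr": b"..."},
    # omap storage: per-object KV pairs, kept in LevelDB via DBObjectMap;
    # object_info_t and SnapSetContext are serialized into entries such as:
    "omap": {
        "object_info": b"<serialized object_info_t: pool, PG, name, ...>",
        "snapset": b"<serialized SnapSetContext: clone list and their snap_seqs>",
    },
}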
The most important operation for Snapshots is Clone. At the FileStore level, an Object's data is stored as a file, so cloning between Objects depends on the file system backing the OSD data directory: on Ext4 or XFS the data is simply copied, while on Btrfs the ioctl command BTRFS_IOC_CLONE_RANGE is used. Cloning of KV data implements a COW strategy through an ingenious key mapping (slightly more involved, to be explained later). Xattrs, on the other hand, are fully copied (xattrs are rarely used in Ceph).
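A minimal sketch of that clone strategy for the data file, assuming a Linux host: try a reflink-style clone first (FICLONE is the generic alias of the Btrfs whole-file clone ioctl, the simpler cousin of the range clone named above), and fall back to a full byte-for-byte copy as on Ext4/XFS. This illustrates the idea; it is not FileStore code.

import fcntl, shutil

FICLONE = 0x40049409  # Linux ioctl request: share the source file's extents (COW clone)

def clone_object_file(src_path, dst_path):
    """Clone one object's backing file, sharing extents when the file system allows it."""
    try:
        with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
            fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())  # Btrfs: cheap COW clone
    except OSError:
        shutil.copyfile(src_path, dst_path)  # Ext4/XFS: plain data copy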