What are the basic data structures of Ceph


This article explains the basic data structures behind Ceph's RBD: pools, images (volumes), snapshots, and clones, looking at each one both from the user's point of view and from inside the Ceph system.

1 Pool (pool)

The concept of Pool has been mentioned earlier, and Ceph supports rich operations on Pool, including:

List, create and delete pools:
ceph osd pool create {pool-name} {pg-num} [{pgp-num}] [replicated] [crush-ruleset-name]
ceph osd pool create {pool-name} {pg-num} {pgp-num} erasure [erasure-code-profile] [crush-ruleset-name]
ceph osd pool delete {pool-name} [{pool-name} --yes-i-really-really-mean-it]

Quota (QoS) support:
ceph osd pool set-quota {pool-name} [max_objects {obj-count}] [max_bytes {bytes}]

Pool snapshot creation and deletion:
ceph osd pool mksnap {pool-name} {snap-name}
ceph osd pool rmsnap {pool-name} {snap-name}

Metadata modification:
ceph osd pool set {pool-name} {key} {value}

Set the number of object replicas (note that the count includes the object itself):
ceph osd pool set {pool-name} size {num-replicas}

Set the minimum number of replicas that must be available for I/O in degraded mode:
ceph osd pool set data min_size 2

2 Volume (image)

2.1 What the image user sees

An image corresponds to LVM's Logical Volume. It is a simple block device that is striped into N sub-blocks, each of which is stored as an object in the RADOS object store. For example:

# create a 100 GB RBD image named 'myimage'; by default it is striped into 25600 objects of 4 MB each
rbd create mypool/myimage --size 102400

# also a 100 GB RBD image, but striped into 12800 objects of 8 MB each
rbd create mypool/myimage --size 102400 --order 23
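The object count follows directly from the image size divided by the object size (2^order bytes): 102400 MB / 4 MB = 25600 objects at the default order 22, and 102400 MB / 8 MB = 12800 objects at order 23. A hedged way to confirm this is rbd info; the commented lines below show the expected shape of its output (following the rbd info format shown later in this article), not captured output:

rbd info mypool/myimage
#   size 102400 MB in 12800 objects
#   order 23 (8192 kB objects)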

# map the image to a Linux host, where it shows up as a device such as /dev/rbd1
rbd map mypool/myimage

# write data to /dev/rbd1
dd if=/dev/zero of=/dev/rbd1 bs=1048576 count=4

# delete the image
rbd rm mypool/myimage

2.2 What the Ceph system sees of an image

Next, let's look at what happens inside Ceph when we work with an image.

(1) Create a new image

First, create a 100 GB image in an empty pool:

root@ceph2:~# rbd create -p pool100 image1 --size 102400 --image-format 2
root@ceph2:~# rbd list pool100
image1

At this point, a few more objects appear in the pool:

root@ceph2:~# rados -p pool100 ls
rbd_directory
rbd_id.image1
rbd_header.a89c2ae8944a

As the names suggest, these objects do not hold image data but metadata such as the image's ID and header. The ID and name of every image in the pool are recorded in rbd_directory:

root@ceph2:~# rados -p pool100 listomapvals rbd_directory
id_a89c2ae8944a
value: (10 bytes) :
0000 : 06 00 00 00 69 6d 61 67 65 31                   : ....image1

name_image1
value: (16 bytes) :
0000 : 0c 00 00 00 61 38 39 63 32 61 65 38 39 34 34 61 : ....a89c2ae8944a

rbd_header holds the image's metadata:

root@ceph2:~# rados -p pool100 listomapvals rbd_header.a89c2ae8944a
features
value: (8 bytes) :
0000 : 01 00 00 00 00 00 00 00 : ........

object_prefix
value: (25 bytes) :
0000 : 15 00 00 00 72 62 64 5f 64 61 74 61 2e 61 38 39 : ....rbd_data.a89
0010 : 63 32 61 65 38 39 34 34 61                      : c2ae8944a

order
value: (1 bytes) :
0000 : 16 : .

size
value: (8 bytes) :
0000 : 00 00 00 00 19 00 00 00 : ........

snap_seq
value: (8 bytes) :
0000 : 00 00 00 00 00 00 00 00 : ........

This metadata is exactly what the following command reports:

root@ceph2:~# rbd -p pool100 info image1
rbd image 'image1':
        size 102400 MB in 25600 objects
        order 22 (4096 kB objects)
        block_name_prefix: rbd_data.a89c2ae8944a
        format: 2
        features: layering

It also shows that the image's data objects will all be named with the prefix rbd_data.a89c2ae8944a. For a freshly created image there are no data objects at all, so the actual storage consumed is only the tiny amount taken by the metadata objects.

(2) Write 8 MB of data to the image (each object in this pool is 4 MB)

root@ceph2:~# rbd map pool100/image1
root@ceph2:~# rbd showmapped
id pool    image  snap device
1  pool100 image1 -    /dev/rbd1

root@ceph2:~# dd if=/dev/zero of=/dev/rbd1 bs=1048576 count=8
8+0 records in
8+0 records out
8388608 bytes (8.4 MB) copied, 0.316369 s, 26.5 MB/s

Now look at the objects in the pool again:

root@ceph2:~# rados -p pool100 ls
rbd_directory
rbd_id.image1
rbd_data.a89c2ae8944a.0000000000000000
rbd_data.a89c2ae8944a.0000000000000001
rbd_header.a89c2ae8944a

Two more 4 MB data objects have appeared. Next, find out which OSDs hold the first of them:

root@ceph2:~# ceph osd map pool100 rbd_data.a89c2ae8944a.0000000000000000
osdmap e81 pool 'pool100' (7) object 'rbd_data.a89c2ae8944a.0000000000000000' -> pg 7.df059252 (7.52) -> up ([8,6,7], p8) acting ([8,6,7], p8)

The PG's ID is 7.52, its primary OSD is 8, and the secondary OSDs are 6 and 7. The OSD tree shows that OSD 8 lives on node ceph4:

root@ceph4:/data/osd2/current/7.52_head# ceph osd tree
# id    weight   type name       up/down reweight
-1      0.1399   root default
-4      0.03998      host ceph4
5       0.01999          osd.5   up      1
8       0.01999          osd.8   up      1

Log in to ceph4 and check the /var/lib/ceph/osd directory; it contains a ceph-8 directory:

root@ceph4:/var/lib/ceph/osd# ls -l
total 0
lrwxrwxrwx 1 root root  9 Sep 18 02:59 ceph-5 -> /data/osd
lrwxrwxrwx 1 root root 10 Sep 18 08:22 ceph-8 -> /data/osd2

Searching that OSD's current directory for this image's objects turns up the header object and, under the directory whose name starts with 7.52, the two data files:

root@ceph4:/data/osd2/current# find . -name '*a89c2ae8944a*'
./7.5c_head/rbd\uheader.a89c2ae8944a__head_36B2DADC__7
./7.52_head/rbd\udata.a89c2ae8944a.0000000000000001__head_9C6139D2__7
./7.52_head/rbd\udata.a89c2ae8944a.0000000000000000__head_DF059252__7

From this we can see:

(1) An RBD image is a simple block device: it can be mapped directly to a host as a device, and users can write binary data to it straight away.

(2) The image's data is stored as a number of data objects in the RADOS object store.

(3) The image's data space is thin-provisioned: Ceph does not pre-allocate storage, but creates each data object only when data is actually written to the corresponding region.

(4) Each data object is stored as multiple replicas (see the sketch after this list).
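All of this can be checked with the commands already used in this section. The following is a minimal, hedged sketch; image2 and the /dev/rbdX and <image-id> placeholders are illustrative, not taken from the walkthrough above:

rbd create -p pool100 image2 --size 102400 --image-format 2
rados -p pool100 ls | grep rbd_data            # nothing yet: no data objects, i.e. thin provisioning
rbd map pool100/image2                          # note the device it reports, e.g. /dev/rbdX
dd if=/dev/zero of=/dev/rbdX bs=1048576 count=4
rados -p pool100 ls | grep rbd_data             # now exactly one 4 MB data object exists
ceph osd map pool100 rbd_data.<image-id>.0000000000000000   # its acting set lists one OSD per replica
ceph osd pool get pool100 size                  # the pool's replica count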

3 Snapshot (snapshot)

3.1 What the snapshot user sees

A snapshot of an RBD image is a read-only copy of the state of the image at a particular point in time. Note that I/O to the image should be stopped before the snapshot is taken; if the image contains a file system, the file system must be in a consistent state.
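For example, if the mapped image carries a mounted file system, one hedged way to quiesce it around the snapshot is fsfreeze (the mount point /mnt/rbd1 is only an assumption for illustration):

fsfreeze --freeze /mnt/rbd1               # flush the file system and block new writes
rbd snap create mypool/myimage@snap1      # take the snapshot while the file system is consistent
fsfreeze --unfreeze /mnt/rbd1             # resume I/O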

Users can operate on snapshots with the rbd tool or through the APIs:

rbd create -p pool101 --size 102400 image1 --format 2     # create an image

rbd snap create pool101/image1@snap1       # create a snapshot
rbd snap ls pool101/image1                 # list snapshots
rbd snap protect pool101/image1@snap1      # protect the snapshot
rbd snap unprotect pool101/image1@snap1    # remove the protection
rbd snap rollback pool101/image1@snap1     # roll the image back to the snapshot; this is time-consuming and rbd shows a progress bar
rbd snap rm pool101/image1@snap1           # delete a snapshot

rbd snap purge pool101/image1              # delete all snapshots of the image

rbd clone pool101/image1@snap1 image1snap1clone1    # create a clone

rbd children pool101/image1@snap1          # list all clones of the snapshot

3.2 What the Ceph system sees of a snapshot

Let's also take a look at the inner workings of snapshots.

(1) Create image1, write 4 MB of data to it, and then create a snapshot: rbd snap create pool101/image1@snap1

(2) At this point Ceph does not create any new objects in the pool; in other words, no storage space is allocated for snap1's data objects yet.

root@ceph2:~# rbd map pool101/image1
root@ceph2:~# rbd showmapped
id pool    image  snap device
1  pool100 image1 -    /dev/rbd1
2  pool101 image1 -    /dev/rbd2
root@ceph2:~# dd if=/dev/sda1 of=/dev/rbd2 bs=1048576 count=4
4+0 records in
4+0 records out
4194304 bytes (4.2 MB) copied, 0.123617 s, 33.9 MB/s
root@ceph2:~# rados -p pool101 ls
rbd_directory
rb.0.fc9d.238e1f29.000000000000
image1.rbd
root@ceph2:~# rbd snap create pool101/image1@snap1
root@ceph2:~# rbd snap ls pool101/image1
SNAPID NAME      SIZE
    10 snap1 102400 MB
root@ceph2:~# rados -p pool101 ls
rbd_directory
rb.0.fc9d.238e1f29.000000000000
image1.rbd

(3) Ceph records the snapshot's information in the rbd_header.{image_id} object:

root@ceph2:~# rados -p pool101 listomapvals rbd_header.a9262ae8944a
snapshot_0000000000000006
value: (74 bytes) :
0000 : 03 01 44 00 00 00 06 00 00 00 00 00 00 00 05 00 : ..D.............
0010 : 00 00 73 6e 61 70 31 00 00 00 00 19 00 00 00 01 : ..snap1.........
0020 : 00 00 00 00 00 00 00 01 01 1c 00 00 00 ff ff ff : ................
0030 : ff ff ff ff ff 00 00 00 00 fe ff ff ff ff ff ff : ................
0040 : ff 00 00 00 00 00 00 00 00 00                   : ..........

(4) Write 4 MB of data to image1 again (this actually overwrites the data in the first object). An extra 4 MB file now shows up in the data directory:

root@ceph4:/data/osd2/current/8.2c_head# ls /data/osd/current/8.3e_head/ -l
total 8200
-rw-r--r-- 1 root root 4194304 Sep 28 03:25 rb.0.fc9d.238e1f29.000000000000__a_AE14D5BE__8
-rw-r--r-- 1 root root 4194304 Sep 28 03:25 rb.0.fc9d.238e1f29.000000000000__head_AE14D5BE__8

This shows that Ceph implements snapshots with COW (copy-on-write): before an object is written to, it is copied out as the snapshot's data object, and only then is the data in the object modified.

(5) Run dd if=/dev/sda1 of=/dev/rbd1 bs=1048576 seek=4 count=4 to write the [4 MB, 8 MB) range of the image. This creates the second data object. Because it is created after the snapshot was taken, it has no connection to the snapshot.

(6) Create another snapshot and then modify the second data object. A snapshot data object file now appears alongside the second data object in its directory:

root@ceph4:/data/osd2/current/8.2c_head# ls /data/osd/current/8.a_head/ -l
total 4100
-rw-r--r-- 1 root root 4194304 Sep 28 03:31 rb.0.fc9d.238e1f29.000000000001__head_9C84738A__8

root@ceph4:/data/osd2/current/8.2c_head# ls /data/osd/current/8.a_head/ -l
total 8200
-rw-r--r-- 1 root root 4194304 Sep 28 03:35 rb.0.fc9d.238e1f29.000000000001__b_9C84738A__8
-rw-r--r-- 1 root root 4194304 Sep 28 03:35 rb.0.fc9d.238e1f29.000000000001__head_9C84738A__8

Therefore:

(1) A snapshot's data objects are stored in the same directory as the image's data objects.

(2) The granularity of a snapshot is not the whole image but the individual data object in RADOS.

(3) When a snapshot is created, only a few bytes of metadata are added to the image's metadata object; whenever one of the image's data objects is subsequently modified (written), that object is first copied out as the snapshot's data object. This is what copy-on-write means here (see the sketch below).
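One hedged way to watch this copy-on-write behaviour from the outside, without logging in to an OSD node, is to compare the pool's space usage before and after overwriting data that a snapshot covers (command availability and output formats vary by Ceph version):

rados df                                               # note the space used by pool101
dd if=/dev/zero of=/dev/rbd2 bs=1048576 count=4        # overwrite data that snap1 covers
rados df                                               # usage grows: the old contents were copied for the snapshot
rados -p pool101 listsnaps rb.0.fc9d.238e1f29.000000000000   # shows the snapshot copy kept for that object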

4 Clone (clone)

A clone is created by copying the state of a snapshot of an image into a new image. For example, imageA has a snapshot Snapshot-1, and imageB is cloned from Snapshot-1 of imageA. imageB then starts out in exactly the same state as Snapshot-1 and has all the capabilities of a regular image; the difference is that imageB is writable. A sketch of the workflow follows.
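A minimal sketch of that workflow, with the hypothetical names mypool, imageA, and imageB (the snapshot has to be protected before it can be cloned):

rbd snap create mypool/imageA@Snapshot-1
rbd snap protect mypool/imageA@Snapshot-1        # required before cloning
rbd clone mypool/imageA@Snapshot-1 mypool/imageB
rbd children mypool/imageA@Snapshot-1            # lists mypool/imageB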

4.1 What the clone user sees

From the user's point of view, a clone is exactly like any other RBD image: you can snapshot it, read and write it, resize it, and so on, with no restrictions. Creating a clone is also fast, because Ceph only allows clones to be made from a snapshot, and a snapshot is always read-only, so no data has to be copied up front.

rbd clone pool101/image1@snap1 image1snap1clon3

root@ceph2:~# rbd info image1snap1clon3
rbd image 'image1snap1clon3':
        size 102400 MB in 25600 objects
        order 22 (4096 kB objects)
        block_name_prefix: rbd_data.a8f63d1b58ba
        format: 2
        features: layering
        parent: pool101/image1@snap1
        overlap: 102400 MB

4.2 What the Ceph system sees of a clone

From the system's point of view, a clone also relies on COW. Let's dig into it through the following steps:

(1) Create a clone (the snapshot must be protected first). Three more objects appear in RADOS:

root@ceph2:~# rbd clone pool101/image1@snap1 pool101/image1snap1clone1
root@ceph2:~# rbd ls -p pool101
image1
image1snap1clone1
root@ceph2:~# rados -p pool101 ls
rbd_header.89903d1b58ba
rbd_directory
rbd_id.image1snap1clone1
rbd_id.image1
rbd_children
rbd_header.a9532ae8944a
rbd_data.a9532ae8944a.0000000000000000

The rbd_children object records the parent-child relationship:

root@ceph2:~# rados -p pool101 listomapvals rbd_children
key: (32 bytes) :
0000 : 08 00 00 00 00 00 00 00 0c 00 00 00 61 39 35 33 : ............a953
0010 : 32 61 65 38 39 34 34 61 0e 00 00 00 00 00 00 00 : 2ae8944a........
value: (20 bytes) :
0000 : 01 00 00 00 0c 00 00 00 38 39 39 30 33 64 31 62 : ........89903d1b
0010 : 35 38 62 61                                     : 58ba

Compared with rbd_header.a9532ae8944a, the clone's rbd_header.89903d1b58ba contains one extra piece of information: the parent.

parent
value: (46 bytes) :
0000 : 01 01 28 00 00 00 08 00 00 00 00 00 00 00 0c 00 : ..(.............
0010 : 00 00 61 39 35 33 32 61 65 38 39 34 34 61 0e 00 : ..a9532ae8944a..
0020 : 00 00 00 00 00 00 00 00 00 00 19 00 00 00       : ..............

This parent entry is where the parent-child relationship, and hence the output of rbd children, comes from:

root@ceph2:~# rbd children pool101/image1@snap1
pool101/image1snap1clone1

So a clone, too, is implemented as a copy-on-write view of its snapshot.

(2) Read data from the clone

A clone is still just an RBD image, so reads go to it first; for any data object the clone does not yet own, Ceph reads from the parent snapshot instead, and if the object is not there either it keeps walking up the chain of parent images until a data object is found. This also makes it clear why reading through a clone can be inefficient.
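A hedged way to see this from user space is to map both the parent snapshot and the clone and compare what they return for a range that has never been written through the clone; the device names are assumptions for illustration:

rbd map pool101/image1@snap1                    # map the read-only parent snapshot, say /dev/rbd3
rbd map pool101/image1snap1clone1               # map the clone, say /dev/rbd4
dd if=/dev/rbd3 bs=1048576 count=4 2>/dev/null | md5sum
dd if=/dev/rbd4 bs=1048576 count=4 2>/dev/null | md5sum   # same checksum: the read was served from the parent
rados -p pool101 ls | grep rbd_data.89903d1b58ba           # the clone still has no data objects of its own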

(3) Write data to an object in the clone

When writing, Ceph first checks whether the target data object already exists in the clone image. If it does not, the object is copied from the parent snapshot or image, and only then is the write applied. From that point on, the clone has its own copy of that data object.

root@ceph3:/data/osd3/current/8.32_head# ls -l
total 4100
-rw-r--r-- 1 root root 4194304 Sep 28 05:14 rbd\udata.89903d1b58ba.0000000000000000__head_CEDDC1B2__8
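That full 4194304-byte file is the copy-up at work: when a write covers only part of the object, the whole 4 MB object is still copied from the parent first and then modified. A hedged way to check an object's size without logging in to an OSD node is rados stat (output format varies by version; the object name is taken from the listing above):

rados -p pool101 stat rbd_data.89903d1b58ba.0000000000000000   # reports the object's mtime and its size of 4194304 bytes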

(Figure: the relationship between the objects after the clone is added.)

4.3 Flatten clone

From the analysis above we know that a clone operation essentially only creates metadata objects; the clone's data objects do not exist yet. Every read therefore first tries the clone's own (possibly non-existent) data object, and only after that lookup comes back empty does it go to the parent volume for the corresponding object to decide whether the data exists there. With a multi-level clone chain, a read may have to walk further up the chain, at extra cost, before it finds the data object in an ancestor volume. Only once this volume has its own copy of a data object (that is, after it has been written) is access to the parent unnecessary.

To keep the number of parent-child layers from growing too large, Ceph provides the flatten operation, which copies the data objects shared between a clone and its parent snapshot into the clone and then removes the parent-child relationship.

The rbd tool's flatten command:

rbd flatten <image-name>: fill the clone with data from its parent (make it independent)

If an image is a clone, flattening copies all the shared blocks (data objects) from the parent snapshot and removes the dependency on the parent. The parent snapshot can then be unprotected and, if it has no other clones, deleted. This feature requires the image to be format 2.
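A hedged sketch of that sequence, using the clone and snapshot names from this article:

rbd flatten pool101/image1snap1clone1       # copy all shared data objects into the clone (slow)
rbd children pool101/image1@snap1           # the flattened clone should no longer be listed
rbd snap unprotect pool101/image1@snap1     # allowed once no clone depends on the snapshot
rbd snap rm pool101/image1@snap1            # the snapshot can now be deleted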

Note that flattening is a very time-consuming operation. After it completes, the clone no longer has any relationship with the original parent snapshot and becomes a fully independent image:

root@ceph2:~# rbd flatten image1snap1clon2
Image flatten: 100% complete...done.
root@ceph2:~# rbd info image1snap1clon2
rbd image 'image1snap1clon2':
        size 102400 MB in 25600 objects
        order 22 (4096 kB objects)
        block_name_prefix: rbd_data.fb173d1b58ba
        format: 2
        features: layering

Curiously, the operation also produces a large number of empty object files in the OSD directories belonging to the flattened clone:

root@ceph2:/data/osd/current# ls /data/osd/current/8.7f_head/ -l
total 788
-rw-r--r-- 1 root root 0 Sep 28 05:33 rbd\udata.89903d1b58ba.000000000000014f__head_17C7607F__8
-rw-r--r-- 1 root root 0 Sep 28 05:33 rbd\udata.89903d1b58ba.0000000000000232__head_96D143FF__8
-rw-r--r-- 1 root root 0 Sep 28 05:33 rbd\udata.89903d1b58ba.0000000000000399__head_4D4E557F__8
-rw-r--r-- 1 root root 0 Sep 28 05:33 rbd\udata.89903d1b58ba.00000000000003ae__head_CE165DFF__8
-rw-r--r-- 1 root root 0 Sep 28 05:33 rbd\udata.89903d1b58ba.00000000000003e1__head_42EA8A7F__8
-rw-r--r-- 1 root root 0 Sep 28 05:33 rbd\udata.89903d1b58ba.0000000000000445__head_701607FF__8

Pseudo-code for determining the depth of the parent-child chain and, once it reaches a certain limit, flattening the clone and deleting the parent snapshot:

img = self.rbd.Image(client.ioctx, img_name)            # get the RBD Image object for the given img_name
_pool, parent, snap = self._get_clone_info(img_name)    # if the image named img_name is a clone, get its parent image and parent snapshot
img.flatten()                                            # copy the parent snapshot's data into the clone
parent_volume = self.rbd.Image(client.ioctx, parent)     # get the parent image
parent_volume.unprotect_snap(snap)                       # unprotect the snapshot
parent_volume.remove_snap(snap)                          # delete the snapshot if it has no other clones

5. Summary

(Figure: the header, data objects, and snapshots of an RBD image, and their relationships with the client and the parent.)

This concludes the overview of Ceph's basic data structures: pools, images, snapshots, and clones.
