What is the use of CRUSH data distribution algorithm in ceph 07/11 Update SLTechnology News&Howtos

What is the use of CRUSH data distribution algorithm in ceph

2025-07-11 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Servers >

Shulou(Shulou.com)05/31 Report--

Editor to share with you what is the use of CRUSH data distribution algorithm in ceph, I believe most people do not know much about it, so share this article for your reference, I hope you can learn a lot after reading this article, let's go to know it!

Introduction to CRUSH data Distribution algorithm of ceph

CRUSH is a module of ceph, which mainly solves the problem of controllable, scalable and decentralized distribution of data copies.

1 Summary

With the emergence of large-scale distributed storage systems (PB-level data and hundreds of storage devices). These systems must balance the distribution of data and load (improve resource utilization), maximize system performance, and deal with system expansion and hardware failures. Ceph designed CRUSH (a scalable pseudorandom data distribution algorithm), which is used in distributed object storage systems and can effectively map data objects to storage devices (no central equipment is needed). Because the structure of large systems is dynamic, CRUSH can handle the addition and removal of storage devices and minimize data migration caused by the addition and movement of storage devices.

2 introduction

The object storage device (object-based storage devices) manages the allocation of disk blocks and provides a read-write interface for objects. In some object storage systems, the data of each file is divided into multiple objects, which are distributed throughout the storage cluster. The object storage system simplifies the data layer (replacing the block list with the object list and handing over the low-level block allocation problem to each device). The basic problem of object storage system is how to distribute data to thousands of storage devices.

One robust solution is to randomly distribute data to storage devices, which ensures load balancing and the mixing of new and old data. However, simple HASH distribution can not effectively deal with the change in the number of devices, resulting in a large number of data migration. Ceph developed CRUSH (Controoled Replication Under Scalable Hashing), a pseudorandom data distribution algorithm that can effectively distribute copies of objects in hierarchical storage clusters. CRUSH implements a pseudo-random (deterministic) function that takes either object id or object group id and returns a set of storage devices that hold a copy of the object. CRUSH requires cluster map (describing the hierarchical structure of the storage cluster) and replica distribution policy (rule).

CRUSH has two key advantages:

Any component can independently calculate the location of each object (decentralized).

Only a small amount of cluster map is required, which needs to be changed only when adding devices are removed.

3 CRUSH algorithm

The CRUSH algorithm calculates the distribution of data objects by the weight of each device. The distribution of objects is determined by cluster map and data distribution policy. Cluster map describes the available storage resources and hierarchy (such as how many racks there are, how many servers are on each rack, and how many disks are on each server). Data distribution policy consists of placement rules. Rule determines how many replicas there are for each data object, and the limitations on which these replicas are stored (for example, three replicas in different racks).

CRUSH calculates x to a set of OSD collections (OSD is an object storage device):

(osd0, osd1, osd2... Osdn) = CRUSH (x)

CRUSH makes use of the multi-parameter HASH function, and the parameters in the HASH function include x, making the set from x to OSD deterministic and independent. CRUSH only uses cluster map, placement rules, x. CRUSH is a pseudo-random algorithm, and there is no correlation between the results of similar inputs.

3.1Tier Cluster map

Cluster map consists of device and bucket, both of which have id and weight values. Bucket can contain any number of item. Item can be both devices or both buckets. The administrator controls the weight of the storage device. The weight is related to the capacity of the storage device. The weight of a Bucket is defined as the sum of the weights of all the item it contains. CRUSH is based on four different bucket type, each with a different selection algorithm.

3.2 copy distribution

The distribution of copies on the storage device affects the security of the data. Cluster map reflects the physical structure of the storage system. CRUSH placement policies decides to distribute copies of objects in different areas (failure in one area does not affect other areas). Each rule contains a series of operations (used in hierarchies).

These actions include:

Tack (a): select an item, usually bucket, and return all the item contained in the bucket. These item are parameters for subsequent operations, and these item make up the vector I.

Select (n, t): iterates over each item (item in vector I), traversing down for each item (item in vector I) (traversing the item contained in this item), returns n different item (item with type is t), and puts all these item into vector I. The select function calls the c (r, x) function, which pseudo-randomly selects an item in each bucket.

Emit: put the vector I into result.

There is a definite type of storage device. Each bucket has a type attribute value, which is used to distinguish different bucket types (such as "row", "rack", "host", etc., type can be customized). Rules can contain multiple take and emit statement blocks, which allows you to select replica storage target from different storage pools.

3.3 conflicts, failures, overloads

The select (n, t) operation cycles through the selection of rust 1, … , n copies, r as the selection parameter. In this process, if the selected item encounters three situations (conflict, failure, overload), the CRUSH will refuse to select the item and reselect the item using r' (related to r, number of errors, firstn parameter) as the selection parameter.

Conflict: this item is already in the vector I and has been selected.

Failure: the equipment fails and cannot be selected.

Overload: the capacity of the device exceeds the warning line, and there is no space left to store data objects.

Failed and overloaded devices are marked on the cluster map (still in the system), which avoids unnecessary data migration.

3.4 MAP change and data migration

When a storage device is added or removed, or when a storage device fails (when the cluster map changes), the data in the storage system is migrated. A good data distribution algorithm can minimize the size of data migration.

Type of 3.5 Bucket

The CRUSH mapping algorithm solves the contradictory goals of efficiency and scalability. And when the storage cluster changes, data migration can be minimized and balanced distribution can be restored. CRUSH defines four kinds of buckets with different algorithms. Each bucket is based on a different data structure and has a different c (rMagnex) pseudorandom selection function.

Different bucket have different performance and features:

Uniform Buckets: applies to item with the same weight, and bucket rarely adds or removes item. Its search speed is the fastest.

List Buckets: its structure is a linked list structure, and the item it contains can have any weight. CRUSH starts from the header to find the location of the copy, it first gets the weight Wh of the header item, the sum of the weights of all the item in the remaining linked list, and then according to hash (x, r, item) get a value v of [0q1]. If the value v is in the [0~Wh/Ws), the copy is in the header item and returns the id of the header item. Otherwise, continue to traverse the remaining linked list.

Tree Buckets: the search complexity of linked lists is O (n) and that of decision trees is O (log n). Item is the leaf node of the decision tree, and other nodes in the decision tree know the weight of its left and right subtrees, and the weight of the nodes is equal to the sum of the weights of the left and right subtrees. CRUSH starts to find the location of the copy from the root node, it first gets the weight Wl of the left subtree of the node, gets the weight Wn of the node, and then according to hash (x, r, node_id) gets a value v of [0room1]. If the value v is in [0~Wl/Wn), the copy is in the left subtree, otherwise it is in the right subtree. Continue traversing the node until you reach the leaf node. The key to Tree Bucket is that when adding and removing leaf nodes, the node_id of other nodes in the decision tree remains the same. The identification of the node_id of the node in the decision tree is determined by the sequence traversal of the binary tree (node_id is not equal to the id of item, nor is it equal to the weight of the node).

Straw Buckets: this type gives all item contained in bucket a level playing field (unlike list and tree that require traversal). This algorithm is like drawing lots, where all item have a chance to be won (only the longest lots can be drawn). The length of each lot is determined by length = f (Wi) * hash (x, r, I). F (Wi) is related to the weight of item, and I is the id number of item. C (r, x) = MAXi (f (Wi) * hash (x, r, I)).

Fig. 1 the structure of Tree Buckets

The above is all the content of the article "what is the use of CRUSH data distribution algorithm in ceph". Thank you for reading! I believe we all have a certain understanding, hope to share the content to help you, if you want to learn more knowledge, welcome to follow the industry information channel!

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.