This article walks through how capacity is calculated and managed in Ceph. It goes into a fair amount of detail and should be a useful reference for anyone interested in the topic.
Capacity Calculation and Management in Ceph
After deploying a Ceph cluster, we generally use the ceph df command to check the cluster's capacity status. But how does Ceph actually calculate and manage capacity? Anyone who has used ceph df has probably wondered how its output is computed, and why the free space of all pools sometimes adds up to the free space under GLOBAL and sometimes does not. With these questions in mind, let's analyze the implementation of ceph df to see how Ceph calculates and manages capacity.
In general, the output of ceph df is as follows:
ceph df

[root@study-1] # ceph df
GLOBAL:
    SIZE     AVAIL      RAW USED     %RAW USED
    196G     99350M     91706M       45.55
POOLS:
    NAME     ID     USED       %USED     MAX AVAIL     OBJECTS
    rbd      1      20480k     0.02      49675M        11
    x        2      522        0         49675M        11
As you can see from the output above, Ceph reports capacity along two dimensions: GLOBAL and POOLS. The GLOBAL section shows SIZE, AVAIL, RAW USED, and %RAW USED, while the POOLS section shows USED, %USED, MAX AVAIL, and OBJECTS for each pool.
Let's focus on RAW USED and MAX AVAIL here; once these two are clear, the rest follows easily.
A rough check already raises questions: RAW USED in GLOBAL is 91706M, which is much larger than the pool usage below even after accounting for replication (20480k*3 + 522bytes*3). Likewise, the sum of MAX AVAIL over all pools does not equal AVAIL in GLOBAL. To see why, we need to dig into the code.
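To make the gap concrete, here is a small back-of-the-envelope check. This is illustrative code, not Ceph code; the x3 replication factor is the assumption already used in the sentence above.

#include <cstdio>

int main() {
  // numbers taken from the sample `ceph df` output above
  double rbd_used_mb = 20480.0 / 1024.0;            // 20480k  -> 20 MB
  double x_used_mb   = 522.0 / (1024.0 * 1024.0);   // 522 B   -> ~0.0005 MB
  double replicas    = 3.0;                         // assumed replication factor

  double raw_from_pools  = (rbd_used_mb + x_used_mb) * replicas;  // ~60 MB
  double raw_used_global = 91706.0;                 // MB, from the GLOBAL section

  printf("raw usage implied by pools: %.1f MB\n", raw_from_pools);
  printf("GLOBAL RAW USED:            %.0f MB\n", raw_used_global);
  return 0;
}

The pool-level usage implied by the POOLS section is only about 60 MB, while RAW USED is tens of gigabytes, so the two clearly are not computed the same way.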
Analysis.
Ceph commands generally go to the Monitor first; requests the Monitor can handle itself are processed directly, and the rest are forwarded.
Let's look at how the Monitor handles the ceph df command. Command processing happens mainly in the Monitor::handle_command function.
handle_command
} else if (prefix == "df") {
    bool verbose = (detail == "detail");
    if (f)
      f->open_object_section("stats");

    pgmon()->dump_fs_stats(ds, f.get(), verbose);
    if (!f)
      ds << '\n';
    pgmon()->dump_pool_stats(ds, f.get(), verbose);

    if (f) {
      f->close_section();
      f->flush(ds);
      ds << '\n';
    }

As you can see, two functions do the real work here: one is pgmon()->dump_fs_stats, and the other is pgmon()->dump_pool_stats.
GLOBAL dimension
Starting with PGMonitor::dump_fs_stats:
dump_fs_stats
void PGMonitor::dump_fs_stats(stringstream &ss, Formatter *f, bool verbose) const
{
  if (f) {
    f->open_object_section("stats");
    f->dump_int("total_bytes", pg_map.osd_sum.kb * 1024ull);
    f->dump_int("total_used_bytes", pg_map.osd_sum.kb_used * 1024ull);
    f->dump_int("total_avail_bytes", pg_map.osd_sum.kb_avail * 1024ull);
    if (verbose) {
      f->dump_int("total_objects", pg_map.pg_sum.stats.sum.num_objects);
    }
    f->close_section();
  }
}
The GLOBAL numbers therefore come from pg_map.osd_sum, which is simply the sum of the osd_stat reported by every OSD. Each OSD fills in its own osd_stat in OSDService::update_osd_stat, using statfs on the file system backing its object store:

update_osd_stat
void OSDService::update_osd_stat(vector<int>& hb_peers)
{
  Mutex::Locker lock(stat_lock);

  osd_stat.hb_in.swap(hb_peers);
  osd_stat.hb_out.clear();

  osd->op_tracker.get_age_ms_histogram(&osd_stat.op_queue_age_hist);

  // fill in osd stats too
  struct statfs stbuf;
  int r = osd->store->statfs(&stbuf);
  if (r < 0) {
    derr << "statfs() failed: " << cpp_strerror(r) << dendl;
    return;
  }

  uint64_t bytes = stbuf.f_blocks * stbuf.f_bsize;
  uint64_t used = (stbuf.f_blocks - stbuf.f_bfree) * stbuf.f_bsize;
  uint64_t avail = stbuf.f_bavail * stbuf.f_bsize;

  osd_stat.kb = bytes >> 10;
  osd_stat.kb_used = used >> 10;
  osd_stat.kb_avail = avail >> 10;

  osd->logger->set(l_osd_stat_bytes, bytes);
  osd->logger->set(l_osd_stat_bytes_used, used);
  osd->logger->set(l_osd_stat_bytes_avail, avail);

  check_nearfull_warning(osd_stat);

  dout(20) << "update_osd_stat " << osd_stat << dendl;
}

So the GLOBAL dimension is nothing more than the statfs results of each OSD's backing file system, summed up by the Monitor.

POOLS dimension

The POOLS side is produced by pgmon()->dump_pool_stats, which aggregates the statistics of the individual PGs for each pool. Each PG's stats are updated through ctx->delta_stats; see ReplicatedPG::do_osd_ops for details. For example, starting with the handling of a WRITE: when an op of type CEPH_OSD_OP_WRITE is processed, write_update_size_and_usage() is called, and ctx->delta_stats is updated inside it. When the IO finishes, i.e. is applied and committed, publish_stats_to_osd() is called.
Here, the stat_queue_item of each changed PG is added to pg_stat_queue, and osd_stat_updated is set to true. Once queued, the tick timer sends the PG status to the Monitor via send_pg_stats() in the C_Tick_WithoutOSDLock context. That is how the Monitor learns about PG changes.
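The mechanism is essentially a dirty queue flushed by a timer. Below is a minimal standalone sketch of that pattern; it is not Ceph code, the types are hypothetical, and the names are only kept similar to the real ones for orientation.

#include <cstdio>
#include <deque>
#include <string>

// hypothetical stand-in for the real per-PG statistics, for illustration only
struct PGStat { std::string pgid; long used_bytes; };

static std::deque<PGStat> pg_stat_queue;   // filled by publish_stats_to_osd()
static bool osd_stat_updated = false;

// called once an op has been applied and committed
void publish_stats_to_osd(const PGStat &st) {
  pg_stat_queue.push_back(st);
  osd_stat_updated = true;
}

// driven periodically by the tick timer
void send_pg_stats() {
  if (!osd_stat_updated) return;
  while (!pg_stat_queue.empty()) {
    const PGStat &st = pg_stat_queue.front();
    printf("report to mon: pg %s used=%ld bytes\n", st.pgid.c_str(), st.used_bytes);
    pg_stat_queue.pop_front();
  }
  osd_stat_updated = false;
}

int main() {
  publish_stats_to_osd({"1.2a", 20480 * 1024});  // a write just committed
  send_pg_stats();                               // the tick fires and flushes the queue
  return 0;
}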
The free space figure, MAX AVAIL, is a bit more involved to calculate. Ceph first computes a value called Available, and then derives MAX AVAIL from it according to the pool's replication policy. Available is computed in get_rule_avail(). In that function, a map of OSDs to weights is first obtained via get_rule_weight_osd_map().
Note that these weights are normally less than 1, because each is divided by sum, where sum is the total weight of all the OSDs covered by the pool's rule. Once the weight map is obtained, each OSD's kb_avail from pg_map.osd_stat is divided by its weight, and the smallest result is taken as the value of Available.
That description is a bit abstract, so here is an example. Suppose our pool spans three OSDs, each with kb_avail of 400G, and their crush weights are
{osd_0: 0.9, osd_1: 0.8, osd_2: 0.7}. The sum is 2.4, so the computed weight values are {osd_0: 0.9/2.4, osd_1: 0.8/2.4, osd_2: 0.7/2.4}.
Each OSD's free space is then divided by its weight value and the minimum is taken; with equal free space the minimum comes from the OSD with the largest weight, so Available = 400G / (0.9/2.4), roughly 1067G. As a formula, which may be more intuitive:

    Available = min over all OSDs of ( kb_avail_i / (crush_weight_i / sum of crush weights) )
Then, depending on the pool's replication strategy, MAX AVAIL is derived differently. For a replicated pool it is simply Available divided by the number of replicas; for an erasure-coded pool it is Available * k / (k + m).
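Putting the pieces together, here is a minimal standalone sketch of the MAX AVAIL logic described above, using the hypothetical three-OSD numbers from the example. This is illustrative code, not Ceph's implementation, and the pool parameters (3-way replication, EC k=2 m=1) are assumptions.

#include <cstdio>
#include <map>

int main() {
  // crush weights and free space (in KB) of the example OSDs
  std::map<int, double> crush_weight = {{0, 0.9}, {1, 0.8}, {2, 0.7}};
  std::map<int, double> kb_avail     = {{0, 400e6}, {1, 400e6}, {2, 400e6}}; // ~400G each

  double sum = 0;
  for (const auto &p : crush_weight) sum += p.second;      // 2.4

  // Available = min over OSDs of (kb_avail / normalized weight)
  double available = -1;
  for (const auto &p : crush_weight) {
    double w = p.second / sum;                             // normalized weight, e.g. 0.9/2.4
    double proj = kb_avail[p.first] / w;                   // data that would fill this OSD
    if (available < 0 || proj < available) available = proj;
  }

  // assumed pool parameters: 3-way replication, or EC with k=2, m=1
  double rep_size = 3, k = 2, m = 1;
  printf("Available        = %.0f KB\n", available);
  printf("MAX AVAIL (rep)  = %.0f KB\n", available / rep_size);
  printf("MAX AVAIL (EC)   = %.0f KB\n", available * k / (k + m));
  return 0;
}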
So in general the sum of MAX AVAIL across pools does not equal the AVAIL in GLOBAL, but it can come very close (at the GB scale the difference is usually negligible).
Summary
At this point we know that Ceph calculates capacity per dimension. The GLOBAL dimension is the more accurate one, because it is derived from statfs on the file system backing each OSD.
The POOLS dimension has to take the pool's replication policy, the CRUSH rule, and the OSD weights into account, so its calculation is considerably more involved. Capacity management itself happens mainly on the OSD side; the OSDs pass their statistics to the Monitors, which maintain the aggregated view.
The calculation of the OSD weight values is the trickiest part, so the function that computes them is attached below with some comments added, for readers who want to work through it.
int CrushWrapper::get_rule_weight_osd_map(unsigned ruleno, map<int,float> *pmap)
{
  if (ruleno >= crush->max_rules)
    return -ENOENT;
  if (crush->rules[ruleno] == NULL)
    return -ENOENT;
  crush_rule *rule = crush->rules[ruleno];

  // build a weight map for each TAKE in the rule, and then merge them
  for (unsigned i = 0; i < rule->len; ++i) {
    map<int,float> m;
    float sum = 0;
    if (rule->steps[i].op == CRUSH_RULE_TAKE) {   // only TAKE steps are considered
      int n = rule->steps[i].arg1;
      if (n >= 0) {                // n >= 0 means the take is an osd, otherwise it is a bucket
        m[n] = 1.0;                // taking an osd directly, so its real weight does not matter
        sum = 1.0;
      } else {                     // not an osd but a bucket
        list<int> q;
        q.push_back(n);            // push the bucket id onto the queue
        // breadth first iterate the OSD tree
        while (!q.empty()) {
          int bno = q.front();     // take a bucket id off the queue
          q.pop_front();
          crush_bucket *b = crush->buckets[-1-bno];   // look the bucket up by its index
          assert(b);                                  // the bucket must exist
          for (unsigned j = 0; j < b->size; ++j) {    // walk the items array of the bucket
            int item_id = b->items[j];
            if (item_id >= 0) { // it's an OSD
              float w = crush_get_bucket_item_weight(b, j);  // take the weight of this osd
              m[item_id] = w;                                // record it in m
              sum += w;                                      // accumulate the total weight
            } else { // not an OSD, expand the child later
              q.push_back(item_id);  // push the child bucket; the whole tree gets visited breadth-first
            }
          }
        }
      }
    }
    for (map<int,float>::iterator p = m.begin(); p != m.end(); ++p) {
      map<int,float>::iterator q = pmap->find(p->first);
      if (q == pmap->end()) {      // the pmap passed in starts empty, so the first hit always lands here
        (*pmap)[p->first] = p->second / sum;
      } else {                     // the same osd may appear under more than one take
        q->second += p->second / sum;
      }
    }
  }
  return 0;
}

That's all for "Capacity Calculation and Management in Ceph". Thank you for reading, and I hope it helps.