
Example Analysis of Ceph Cluster Capacity Reduction and Related Fault Handling


This article presents an example analysis of Ceph cluster capacity reduction and related fault handling. It is quite detailed and has real reference value; interested readers are encouraged to read on!

Introduction

Due to a current shortage of machines, we need to hand over a batch of machines from my cluster to other businesses. Here the problem arises: removing machines from the cluster means the data will be redistributed, and failures can easily occur during the data migration.

The handling process

Handling the machines corresponding to the test POOLs

There are many POOLs in the cluster. Some POOLs hold customer data and are very important; others are POOLs I use for testing, and the OSDs corresponding to those can be deleted directly. Even if the cluster reports pg exceptions, there is no need to worry: delete the corresponding POOL after deleting its OSDs, and the pg exceptions will disappear.
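Deleting a test POOL once its OSDs are gone looks like the following; testpool is a placeholder name, the pool name must be typed twice as confirmation, and on newer releases the monitors must also permit deletion (mon_allow_pool_delete=true):

ceph osd pool delete testpool testpool --yes-i-really-really-mean-it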

Note: to avoid data migration while the OSDs are being shut down, set the norecover flag first.

ceph osd set norecover
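The health output later in this article shows noscrub and nodeep-scrub set as well, so pausing scrubbing appears to have been a companion step; every flag is unset the same way once the work is done. A sketch:

ceph osd set noscrub          # pause regular scrubbing during the maintenance
ceph osd set nodeep-scrub     # pause deep scrubbing as well
# ... perform the maintenance ...
ceph osd unset norecover      # re-enable recovery afterwards
ceph osd unset noscrub
ceph osd unset nodeep-scrub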

The commands to delete all OSD information on the corresponding host are as follows:

killall -9 ceph-osd
for i in {108..119}
do
    ceph osd out osd.$i
    ceph osd crush remove osd.$i
    ceph auth del osd.$i
    ceph osd rm $i
done
ceph osd crush remove hostname
removed item id -10 name 'hostname' from crush map
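A quick way to confirm the host and its OSDs are really gone from the CRUSH map (hostname is the same placeholder as above):

ceph osd tree | grep hostname    # prints nothing once the host has been removed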

Handling the machines corresponding to the business POOL

The business POOL is distributed across 10 machines, and 5 of those 10 must now be released, which involves data migration. There are three ways to handle it.

Method 1: set out

Set each machine to be removed to the out state in turn: when one machine's data has been moved away, move on to the next, and the system takes care of migrating the data.
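A minimal sketch of method 1, assuming the OSD ids on the machine being drained are known (the range below is purely illustrative):

for i in {96..107}    # hypothetical OSD ids on the machine being drained
do
    ceph osd out osd.$i
done
ceph -w               # watch the rebalance finish before starting the next machine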

Method 2: set the weight

Adjust the CRUSH weight of each machine to be taken offline to 0, and the system takes care of migrating the data.
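A sketch of method 2; reweighting an OSD to 0 drains its data without marking it out (ids again illustrative):

for i in {96..107}    # hypothetical OSD ids on the machine being drained
do
    ceph osd crush reweight osd.$i 0
done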

Method 3: construct a new rule

Build a new group and put the machines you want to keep under it.

Build a new crush rule that takes from the new group.

Set the business pool's rule to the new crush rule.

This is the fastest way, since it involves only one migration; once the data has finished migrating, you can stop and remove the unneeded OSDs.
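A sketch of method 3 with the ceph CLI; newgroup, newrule, businesspool, and the hostnames are all placeholders, and the pool property is crush_ruleset on older releases (crush_rule, set by name, on newer ones):

ceph osd crush add-bucket newgroup root                    # new root bucket for the machines to keep
for h in host1 host2 host3 host4 host5                     # hypothetical hostnames to keep
do
    ceph osd crush move $h root=newgroup
done
ceph osd crush rule create-simple newrule newgroup host    # replicated rule taking from newgroup
ceph osd pool set businesspool crush_ruleset 1             # the new rule's id, per 'ceph osd crush rule dump'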

Problems encountered during the process

Symptom: the cluster status showed a small number of PGs in abnormal states, active+remapped+backfilling and active+remapped:

[root@gnop029-ct-zhejiang_wenzhou-16-11 ~]# ceph -s
    cluster c6e7e7d9-2b91-4550-80b0-6fa46d0644f6
     health HEALTH_WARN
            2 pgs backfilling
            3 pgs stuck unclean
            recovery 24/2148593 objects misplaced (0.001%)
            norecover,noscrub,nodeep-scrub flag(s) set
     monmap e3: 3 mons at {a=101.71.4.11:6789/0,b=101.71.4.12:6789/0,c=...}
     osdmap e69909: 120 osds: 120 up, 120 in; 3 remapped pgs
            flags norecover,noscrub,nodeep-scrub
      pgmap v8678900: 10256 pgs, 16 pools, 2763 GB data, 1047 kobjects
            7029 GB used, 197 TB / 214 TB avail
            24/2148593 objects misplaced (0.001%)
               10253 active+clean
                   2 active+remapped+backfilling
                   1 active+remapped
[root@ceph]# ceph pg dump_stuck unclean
ok
pg_stat state                        up       up_primary  acting    acting_primary
23.1c1  active+remapped+backfilling  [59,37]  59          [76,84]   76
23.23b  active+remapped              [35,7]   35          [82,119]  82
23.221  active+remapped+backfilling  [15]     15          [70,82]   70

Then I turned scrub and deep-scrub back on; after all the pgs had been scanned, they returned to active+clean.
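Turning scrubbing back on is simply unsetting the flags shown in the health output above:

ceph osd unset noscrub
ceph osd unset nodeep-scrub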

While data migration is in progress, some osd processes occasionally exit because the load is too high. This calls for work on two fronts (sketches for both follow below):

Reduce the number of osd backfill threads to lower the osd workload.

Bring any osd that goes down back up immediately; otherwise many pgs will show abnormal states, and those pgs quickly return to normal once the osd is back.
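For the first point, a common way to throttle backfill pressure is injecting the recovery settings at runtime (the values are illustrative; lower means gentler):

ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'

For the second point, a sketch of restarting a down osd quickly; the service manager depends on the release and distro, and 108 is a placeholder id:

systemctl start ceph-osd@108        # systemd deployments
/etc/init.d/ceph start osd.108      # sysvinit deployments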

That is all of "Example Analysis of Ceph Cluster Capacity Reduction and Related Fault Handling". Thank you for reading! I hope the content is helpful to you.
