
Detailed explanation of various states of PG in distributed storage Ceph


Author: Li Hang Source: talkwithtrend

Original link: https://mp.weixin.qq.com/s/a3mTVNQWYvMuEz8X_uCRMA

1. PG introduction

This article shares a detailed explanation of the various states of a PG in Ceph. The PG is one of Ceph's most complex and hardest-to-understand concepts, for the following reasons:

- At the architectural level, PG sits in the middle of the RADOS layer.

a. It is responsible for receiving and processing requests from clients.

b. It is then responsible for translating those requests into transactions that the local object store can understand.

- PG is the basic unit that makes up a storage pool; many features of the storage pool are implemented directly at the PG level.

- The replication strategy across failure domains requires a PG to write across nodes, so data synchronization and recovery between nodes also rely on the PG.

2. PG status table

The normal state of a PG is 100% active + clean, which means that every PG is accessible and all of its replicas are available. Anything else is reported by Ceph as a PG warning or error state. The states examined in detail below are Degraded, Peered, Remapped, Recovering, Backfilling, Stale, Inconsistent, and Down.
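As a practical starting point, the overall PG state distribution can be inspected with a few standard Ceph commands. This is a hedged illustration using generic CLI calls, not output captured from the article's test cluster:

$ ceph pg stat                # one-line summary of PG counts and their states
$ ceph health detail          # per-PG warnings and errors, as used in the fault simulations below
$ ceph pg dump_stuck unclean  # list PGs stuck in a non-clean state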

3. Detailed explanation of status and recurrence of fault simulation

3.1 Degraded

3.1.1 description

Degraded: as noted above, each PG has three replicas, each stored on a different OSD. In the non-failure case the PG is in the active+clean state; if the OSD holding one of the PG's replicas (for example osd.4) goes down, the PG becomes degraded.
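For context, the replica counts used here come from the pool settings. Assuming the same test_pool used later in this article, they can be checked with the standard pool-get commands (illustrative, not output from the article's cluster):

$ ceph osd pool get test_pool size        # expected to report size: 3
$ ceph osd pool get test_pool min_size    # expected to report min_size: 2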

3.1.2 Fault simulation

Stop osd.1

$ systemctl stop ceph-osd@1

View PG status

$ bin/ceph pg stat
20 pgs: 20 active+undersized+degraded; 14512 kB data, 302 GB used, 6388 GB / 6691 GB avail; 12 objects degraded (33.333%)

View cluster monitoring status

Client IO operation

Fault summary:

To simulate the failure (size = 3, min_size = 2), we manually stop osd.1 and then check the PG status. Its current state is active+undersized+degraded: when an OSD hosting one of a PG's replicas goes down, the PG enters the undersized+degraded state. The acting set [0,2] means that two replicas are still alive, on osd.0 and osd.2, and at this point the client can still read and write IO normally.

3.1.3 Summary

Degraded means that after some failure, such as an OSD going down, Ceph marks all PGs on that OSD as Degraded.

A degraded cluster can still read and write data normally; a degraded PG is only a minor blemish, not a serious problem.

Undersized means that the PG currently has 2 surviving replicas, which is fewer than the pool size of 3. This marker indicates that the number of available replicas is insufficient; it is likewise not a serious problem.
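To end the simulation and clear the degraded/undersized flags, the stopped OSD only needs to be started again; a minimal sketch mirroring the stop command above:

$ systemctl start ceph-osd@1   # once osd.1 rejoins, the PG recovers and returns to active+clean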

3.2 Peered

3.2.1 description

"Peering has been completed, but the current Acting Set size of PG is less than the minimum number of replicas (min_size) specified by the storage pool."

3.2.2 Fault simulation

a. Stop two replicas: osd.1 and osd.0

$ systemctl stop ceph-osd@1
$ systemctl stop ceph-osd@0

b. View cluster health status

c. Client IO operation (IO hangs)

Read the object to a file; the IO hangs

$ bin/rados -p test_pool get myobject ceph.conf.old

Fault summary:

- Now only osd.2 is still alive for this PG, and the PG has one more state: peered, which literally means to look closely; here it can be understood as negotiating and searching for peers.

- When you read the object, the command hangs. Why can't the content be read? Because we set min_size=2: if fewer than 2 replicas survive (here only 1), the PG does not respond to external IO requests.

d. Adjusting min_size to 1 resolves the IO hang.

Set min_size = 1

$ bin/ceph osd pool set test_pool min_size 1
set pool 1 min_size to 1

e. View cluster monitoring status

f. Client IO operation

Read objects to a file

$ ll -lh ceph.conf*
-rw-r--r-- 1 root root 6.1K Jun 25 14:01 ceph.conf
-rw-r--r-- 1 root root 6.1K Jul 3 20:11 ceph.conf.old
-rw-r--r-- 1 root root 6.1K Jul 3 20:11 ceph.conf.old.1

Fault summary:

- As you can see, the Peered state is gone and client file IO reads and writes work normally again.

- With min_size=1, as long as one replica in the cluster is alive, the PG can respond to external IO requests.

3.2.3 Summary

The Peered state can be understood here as the PG waiting for its other replicas to come online.

With min_size = 2, at least two replicas must be alive for the Peered state to be cleared.

A PG in the Peered state cannot respond to external requests, and IO hangs.
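After the experiment, min_size should presumably be set back to its original value so the pool does not keep serving IO with a single surviving replica; a minimal sketch using the same command form as above:

$ ceph osd pool set test_pool min_size 2   # restore the original minimum replica count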

3.3 Remapped

3.3.1 description

After Peering completes, if the PG's current Acting Set is inconsistent with its Up Set, the PG is in the Remapped state.

3.3.2 Fault simulation

a. Stop osd.x

$ systemctl stop ceph-osd@x

b. Start osd.x after an interval of 5 minutes

$ systemctl start ceph-osd@x

c. View PG status

d. Client IO operation

rados reads and writes normally

$ rados -p test_pool put myobject /tmp/test.log

3.3.3 Summary

When an OSD goes down or the cluster is expanded, CRUSH recalculates which OSDs the PG maps to, and the PG is remapped to other OSDs.

While a PG is in the Remapped state, its current Acting Set is inconsistent with its Up Set.

The client IO can read and write normally.
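To observe the Up Set versus Acting Set mismatch that defines Remapped, the mapping of a single PG can be queried directly. ceph pg map is a standard command; the PG id 3.7f is simply borrowed from the later examples as an assumption:

$ ceph pg map 3.7f
# prints the osdmap epoch together with the PG's up set and acting set;
# while the PG is remapped the two lists differ, and they converge again once data migration finishes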

3.4 Recovery

3.4.1 description

Recovery is the process by which a PG synchronizes and repairs objects whose data is inconsistent, using the PGLog.

3.4.2 Fault simulation

a. Stop osd.x

$ systemctl stop ceph-osd@x

b. Start osd.x at an interval of 1 minute

$ systemctl start ceph-osd@x

c. View cluster monitoring status

$ ceph health detail
HEALTH_WARN Degraded data redundancy: 183/57960 objects degraded (0.316%), 17 pgs unclean, 17 pgs degraded
PG_DEGRADED Degraded data redundancy: 183/57960 objects degraded (0.316%), 17 pgs unclean, 17 pgs degraded
    pg 1.19 is active+recovery_wait+degraded, acting [29,9,17]

3.4.3 Summary

Recovery recovers data through recorded PGLog.

If the missed updates are within the recorded PGLog (osd_max_pg_log_entries = 10000 entries), the data can be recovered incrementally through the PGLog.
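The osd_max_pg_log_entries value mentioned above can be read from a running OSD through its admin socket; a hedged example, assuming the command is run on the node hosting osd.0:

$ ceph daemon osd.0 config get osd_max_pg_log_entries
{
    "osd_max_pg_log_entries": "10000"
}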

3.5 Backfill

3.5.1 description

When a PG replica can no longer recover its data through the PGLog alone, full synchronization is required: all objects are copied in full from the current Primary.

3.5.2 Fault simulation

a. Stop osd.x

$ systemctl stop ceph-osd@x

b. Start osd.x at an interval of 10 minutes

$ systemctl start ceph-osd@x

c. View cluster health status

$ ceph health detail
HEALTH_WARN Degraded data redundancy: 6/57927 objects degraded (0.010%), 1 pg unclean, 1 pg degraded
PG_DEGRADED Degraded data redundancy: 6/57927 objects degraded (0.010%), 1 pg unclean, 1 pg degraded
    pg 3.7f is active+undersized+degraded+remapped+backfilling, acting [21,29]

3.5.3 Summary

When the data cannot be recovered according to the recorded PGLog, you need to perform the Backfill process to recover the full data.

If the number of missed updates exceeds osd_max_pg_log_entries = 10000 entries, the data has to be recovered in full.
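Backfill throughput is bounded by the osd_max_backfills setting; checking it, and temporarily raising it during a large rebuild, follows the same pattern as above. This is a generic illustration of standard commands, not a step from the original article:

$ ceph daemon osd.0 config get osd_max_backfills          # read the current limit from a running OSD
$ ceph tell osd.* injectargs '--osd-max-backfills 2'      # runtime change; reverts when the OSDs restart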

3.6 Stale

3.6.1 description

The mon detects that the OSD hosting the PG's Primary is down.

The Primary has not reported PG-related information to the mon within the timeout (for example, because of network congestion).

All three replicas of the PG are down.

3.6.2 Fault simulation

a. Stop the OSDs hosting the three replicas of the PG; first stop osd.23

$ systemctl stop ceph-osd@23

b. Then stop osd.24

$ systemctl stop ceph-osd@24

c. View the status of PG 1.45 after stopping two replicas

d. Stop the third replica of PG 1.45, osd.10

$ systemctl stop ceph-osd@10

e. View the status of PG 1.45 after stopping all three replicas

f. Client IO operation

Read/write on the mounted disk; the IO hangs

$ ll /mnt/

Fault summary:

After stopping two replicas of the same PG, the state is undersized+degraded+peered.

After stopping the third replica of the same PG, the state becomes stale+undersized+degraded+peered.

3.6.3 Summary

When all three replicas of a PG are down, the stale state appears.

At this point the PG cannot serve client reads and writes, and IO hangs.

The stale state also appears when the Primary fails to report PG-related information to the mon within the timeout (for example, because of network congestion).
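Stale PGs can also be listed directly; ceph pg dump_stuck stale is a standard query, shown here as a generic illustration:

$ ceph pg dump_stuck stale
# lists every PG whose Primary has not reported to the monitors recently, together with its last known acting set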

3.7 Inconsistent

3.7.1 description

Through Scrub, the PG has detected that one or more objects are inconsistent between some of its PG instances (replicas).

3.7.2 Fault simulation

a. Delete an object file from the osd.34 replica of PG 3.0

$ rm -rf /var/lib/ceph/osd/ceph-34/current/3.0_head/DIR_0/1000000697c.0000122c__head_19785300__3

b. Manually run a scrub on PG 3.0

$ ceph pg scrub 3.0
instructing pg 3.0 on osd.34 to scrub

c. Check the cluster monitoring status

$ ceph health detail
HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
OSD_SCRUB_ERRORS 1 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
    pg 3.0 is active+clean+inconsistent, acting [34,23,...]

d. Fix PG 3.0

Fault summary:

When the data among the three replicas of a PG is inconsistent and you want to repair the inconsistent data files, you only need to run the ceph pg repair command; Ceph copies the missing or damaged files from the other replicas to repair the data.
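The repair in step d. comes down to the command below; rados list-inconsistent-obj (available in recent Ceph releases) is an optional way to first see which objects scrub flagged. The PG id 3.0 matches the simulation above:

$ rados list-inconsistent-obj 3.0 --format=json-pretty   # optional: inspect the flagged objects
$ ceph pg repair 3.0                                     # copy good replicas over the damaged one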

3.7.3 Summary

When an OSD goes down briefly, the cluster still has two replicas and writes proceed normally, but the data on osd.34 is not updated. A little later osd.34 comes back online; its data is now stale, and the other OSDs restore the missing data to osd.34 to bring it up to date. During this recovery the PG state changes from inconsistent -> recovering -> clean and finally returns to normal.

This is a scenario of cluster failure self-healing.

3.8 Down

3.8.1 description

During Peering, the PG detects an Interval that cannot be skipped (for example, during that Interval the PG completed Peering and switched to the Active state, so it may have processed client read and write requests normally), but the OSDs that remain online are not sufficient to complete the data repair.

3.8.2 Fault simulation

a. Check the number of replicas in PG 3.7f

$ ceph pg dump | grep ^3.7f
dumped all
3.7f 43 0 0 0 0 494927872 1569 1569 active+clean 2018-07-05 02:52:51.512598 21315'80115 11356:111666 [5,21,29] 5 [5,21,29] 5 21315'80115 2018-07-05 02:52:51.512568 6206'80083 2018-06-29 22:51:05.831219

b. Stop the replica osd.21 of PG 3.7f

$ systemctl stop ceph-osd@21

c. View PG 3.7f status

$ ceph pg dump | grep ^3.7f
dumped all
3.7f 66 0 89 0 0 591396864 1615 1615 active+undersized+degraded 2018-07-05 15:29:15.741318 ... [5,29] 5 [5,29] 5 ... 2018-07-05 02:52:51.512568 ... 2018-06-29 22:51:05.831219

d. When the client writes data, make sure the data is written to the replicas [5,29] of PG 3.7f

e. Stop the replica osd.29 in PG 3.7f and check the status of PG 3.7f

f. Stop the replica osd.5 in PG 3.7f and check the status of PG 3.7f

g. Bring the replica osd.21 of PG 3.7f back up (its data is stale at this point) and view the PG status (down)

h. Client IO operation

At this point, the client IO hangs.

$ ll /mnt/

Fault summary:

PG 3.7f starts with three replicas. After osd.21 is stopped, data is written to osd.5 and osd.29. Then osd.29 and osd.5 are stopped as well, and finally osd.21 is brought back up. At this point the data on osd.21 is comparatively old, so the PG goes down; client IO hangs, and the problem can only be fixed by bringing the stopped OSDs back up.

3.8.3 Summary

Typical scenario: A (primary), B, C

1. First, kill B

2. Write new data to A and C

3. Kill A and C

4. Bring up B (a command-level sketch of this sequence follows below)
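Mapped onto the PG 3.7f simulation above (B = osd.21, A and C = osd.5 and osd.29), the scenario is roughly the following sequence. This is a hedged replay of the steps already shown, not additional instructions from the author:

$ systemctl stop ceph-osd@21                       # 1. kill B
$ rados -p test_pool put myobject /tmp/test.log    # 2. write new data, which now lands only on A and C
$ systemctl stop ceph-osd@5
$ systemctl stop ceph-osd@29                       # 3. kill A and C
$ systemctl start ceph-osd@21                      # 4. bring up B; its data is stale, so the PG goes down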

The Down state occurs because an OSD's data for the PG is too old and the other OSDs that are online are not sufficient to complete the data repair.

At this point the PG cannot serve client IO reads and writes, and IO hangs.
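When a PG ends up in the Down state, ceph pg &lt;pgid&gt; query can show what peering is blocked on; a hedged example using the PG from this simulation:

$ ceph pg 3.7f query
# the recovery_state section shows why peering is blocked, including the down OSDs
# that must be brought back before the PG can recover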

Author: Li Hang has many years of low-level development experience, with rich experience in high-performance Nginx development and distributed Redis cache clusters, and has worked on Ceph for about two years. He previously worked at 58.com, Autohome, and the Youku Tudou Group, and currently works in Didi's basic platform operations department, where he is responsible for the development, operation, and maintenance of distributed Ceph clusters. His main technical interests are high-performance Nginx development, distributed caching, and distributed storage.
