(1) Check the cluster status: two OSDs are reported down.
[root@node140 /]# ceph -s
  cluster:
    id:     58a12719-a5ed-4f95-b312-6efd6e34e558
    health: HEALTH_ERR
            noout flag(s) set
            2 osds down
            1 scrub errors
            Possible data damage: 1 pg inconsistent
            Degraded data redundancy: 10191 objects degraded (16.024%), 84 pgs degraded, 122 pgs undersized

  services:
    mon: 2 daemons, quorum node140,node142 (age 3d)
    mgr: admin(active, since 3d), standbys: node140
    osd: 18 osds: 16 up (since 3d), 18 in (since 5d)
         flags noout

  data:
    pools:   2 pools, 384 pgs
    objects: 3.40k objects, 9.8 GiB
    usage:   43 GiB used, 8.7 TiB / 8.7 TiB avail
    pgs:     10191 objects degraded (16.024%)
             261 active+clean
             84  active+undersized+degraded
             38  active+undersized
             1   active+clean+inconsistent
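To see exactly which placement group is inconsistent and get a one-line OSD summary, the standard health commands can be used as well; this is just a verification sketch with stock Ceph CLI calls.
[root@node140 /]# ceph health detail    # lists each problem in detail, including the ID of the inconsistent pg
[root@node140 /]# ceph osd stat         # short summary of how many OSDs are up/in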
(2) View osd status
[root@node140 /]# ceph osd tree
ID CLASS WEIGHT  TYPE NAME        STATUS REWEIGHT PRI-AFF
-1       9.80804 root default
-2       3.26935     host node140
 0   hdd 0.54489         osd.0        up  1.00000 1.00000
 1   hdd 0.54489         osd.1        up  1.00000 1.00000
 2   hdd 0.54489         osd.2        up  1.00000 1.00000
 3   hdd 0.54489         osd.3        up  1.00000 1.00000
 4   hdd 0.54489         osd.4        up  1.00000 1.00000
 5   hdd 0.54489         osd.5        up  1.00000 1.00000
-3       3.26935     host node141
12   hdd 0.54489         osd.12       up  1.00000 1.00000
13   hdd 0.54489         osd.13       up  1.00000 1.00000
14   hdd 0.54489         osd.14       up  1.00000 1.00000
15   hdd 0.54489         osd.15       up  1.00000 1.00000
16   hdd 0.54489         osd.16       up  1.00000 1.00000
17   hdd 0.54489         osd.17       up  1.00000 1.00000
-4       3.26935     host node142
 6   hdd 0.54489         osd.6        up  1.00000 1.00000
 7   hdd 0.54489         osd.7      down  1.00000 1.00000
 8   hdd 0.54489         osd.8      down  1.00000 1.00000
 9   hdd 0.54489         osd.9        up  1.00000 1.00000
10   hdd 0.54489         osd.10       up  1.00000 1.00000
11   hdd 0.54489         osd.11       up  1.00000 1.00000
(3) Check the status of osd.7 and osd.8: both services have already failed and cannot be restarted.
[root@node140 /]# systemctl status ceph-osd@8.service
● ceph-osd@8.service - Ceph object storage daemon osd.8
   Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; enabled; vendor preset: disabled)
   Active: failed (Result: start-limit) since Fri 2019-08-30 17:36:50 CST; 1min 20s ago
  Process: 433642 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id %i (code=exited, status=1/FAILURE)

Aug 30 17:36:50 node140 systemd[1]: Failed to start Ceph object storage daemon osd.8.
Aug 30 17:36:50 node140 systemd[1]: Unit ceph-osd@8.service entered failed state.
Aug 30 17:36:50 node140 systemd[1]: ceph-osd@8.service failed.
Aug 30 17:36:50 node140 systemd[1]: ceph-osd@8.service holdoff time over, scheduling restart.
Aug 30 17:36:50 node140 systemd[1]: Stopped Ceph object storage daemon osd.8.
Aug 30 17:36:50 node140 systemd[1]: start request repeated too quickly for ceph-osd@8.service
Aug 30 17:36:50 node140 systemd[1]: Failed to start Ceph object storage daemon osd.8.
Aug 30 17:36:50 node140 systemd[1]: Unit ceph-osd@8.service entered failed state.
Aug 30 17:36:50 node140 systemd[1]: ceph-osd@8.service failed.
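Before writing the disk off, it can be worth confirming the failure at the hardware level. A hedged sketch, assuming smartmontools is installed and /dev/sdd is the device that backed the failed OSD on node142 (the device name is only a placeholder):
[root@node142 ~]# dmesg | grep -i -E 'sdd|i/o error'    # kernel log usually shows I/O or link errors for a dying disk
[root@node142 ~]# smartctl -H /dev/sdd                  # SMART overall health self-assessment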
(4) OSD hard disk failure and status change
When an OSD's hard disk fails, the OSD status changes to down. After the interval set by mon osd down out interval expires, Ceph marks the OSD out and begins data migration and recovery. To reduce the impact, this behavior can be suppressed first and re-enabled after the disk replacement is complete.
[root@node140 /]# cat /etc/ceph/ceph.conf
[global]
mon osd down out interval = 900
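Note that editing ceph.conf only takes effect once the monitor daemons are restarted. As a sketch (the daemon name mon.node140 is taken from this cluster), the value actually in effect can be read through the admin socket, and on recent releases it can also be injected at runtime:
[root@node140 /]# ceph daemon mon.node140 config show | grep mon_osd_down_out_interval
[root@node140 /]# ceph tell mon.* injectargs '--mon_osd_down_out_interval 900'    # runtime change; exact syntax may vary by release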
(5) Stop data rebalancing
[root@node140 /]# for i in noout nobackfill norecover noscrub nodeep-scrub; do ceph osd set $i; done
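After setting the flags, it is worth confirming they are really in place; the cluster-wide flags are listed by ceph osd dump:
[root@node140 /]# ceph osd dump | grep flags    # should now show noout,nobackfill,norecover,noscrub,nodeep-scrub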
(6) Locate the failed disks
[root@node140 /]# ceph osd tree | grep -i down
7 hdd 0.54489 osd.7 down 0 1.00000
8 hdd 0.54489 osd.8 down 0 1.00000
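To map each down OSD back to its host and physical device before pulling the disk, ceph osd metadata can be queried (the exact field names can differ slightly between releases):
[root@node140 /]# ceph osd metadata 7 | grep -E '"hostname"|"devices"|"bluestore_bdev_dev_node"'
[root@node140 /]# ceph osd metadata 8 | grep -E '"hostname"|"devices"|"bluestore_bdev_dev_node"'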
(7) Unmount the failed OSDs
[root@node142 ~]# umount /var/lib/ceph/osd/ceph-7
[root@node142 ~]# umount /var/lib/ceph/osd/ceph-8
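The osd.7 and osd.8 units are already in the failed state, but as a precaution it does no harm to make sure systemd does not keep retrying them while the disks are being swapped:
[root@node142 ~]# systemctl stop ceph-osd@7.service ceph-osd@8.service
[root@node142 ~]# systemctl disable ceph-osd@7.service ceph-osd@8.service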
(8) Remove the OSDs from the CRUSH map
[root@node142 ~]# ceph osd crush remove osd.7
removed item id 7 name 'osd.7' from crush map
[root@node142 ~]# ceph osd crush remove osd.8
removed item id 8 name 'osd.8' from crush map
(9) Delete the keys of the failed OSDs
[root@node142 ~]# ceph auth del osd.7
updated
[root@node142 ~]# ceph auth del osd.8
updated
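A quick check that the keys are really gone (the grep should return nothing for osd.7 and osd.8):
[root@node142 ~]# ceph auth ls | grep -E 'osd\.(7|8)$'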
(10) Delete the failed OSDs
[root@node142 ~]# ceph osd rm 7
removed osd.7
[root@node142 ~]# ceph osd rm 8
removed osd.8
[root@node142 ~]# ceph osd tree
ID CLASS WEIGHT  TYPE NAME        STATUS REWEIGHT PRI-AFF
-1       8.71826 root default
-2       3.26935     host node140
 0   hdd 0.54489         osd.0        up  1.00000 1.00000
 1   hdd 0.54489         osd.1        up  1.00000 1.00000
 2   hdd 0.54489         osd.2        up  1.00000 1.00000
 3   hdd 0.54489         osd.3        up  1.00000 1.00000
 4   hdd 0.54489         osd.4        up  1.00000 1.00000
 5   hdd 0.54489         osd.5        up  1.00000 1.00000
-3       3.26935     host node141
12   hdd 0.54489         osd.12       up  1.00000 1.00000
13   hdd 0.54489         osd.13       up  1.00000 1.00000
14   hdd 0.54489         osd.14       up  1.00000 1.00000
15   hdd 0.54489         osd.15       up  1.00000 1.00000
16   hdd 0.54489         osd.16       up  1.00000 1.00000
17   hdd 0.54489         osd.17       up  1.00000 1.00000
-4       2.17957     host node142
 6   hdd 0.54489         osd.6        up  1.00000 1.00000
 9   hdd 0.54489         osd.9        up  1.00000 1.00000
10   hdd 0.54489         osd.10       up  1.00000 1.00000
11   hdd 0.54489         osd.11       up  1.00000 1.00000
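On Luminous and later releases, steps (8) through (10) can also be collapsed into a single purge command, which removes the OSD from the CRUSH map, deletes its key, and removes the OSD entry in one go; shown here only as an equivalent alternative:
[root@node142 ~]# ceph osd purge 7 --yes-i-really-mean-it
[root@node142 ~]# ceph osd purge 8 --yes-i-really-mean-it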
(11) Replace the failed hard drives, confirm the new device names, and then recreate the OSDs
[root@node142 ~]# ceph-volume lvm create --data /dev/sdd
[root@node142 ~]# ceph-volume lvm create --data /dev/sdc
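If the create step complains that a replacement disk still carries old LVM metadata or a partition table, the disk can be wiped first with ceph-volume's zap subcommand and the create retried (device names as in this example):
[root@node142 ~]# ceph-volume lvm zap /dev/sdd --destroy
[root@node142 ~]# ceph-volume lvm zap /dev/sdc --destroy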
(12) Verify the newly created OSDs
[root@node142 ~]# ceph-volume lvm list
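Because IDs 7 and 8 were freed in step (10), the rebuilt OSDs are normally assigned those same IDs again; this can be confirmed in the OSD tree:
[root@node142 ~]# ceph osd tree | grep -E 'osd\.(7|8) '    # both should now be listed under host node142 and report up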
(13) After the new OSDs have been added to the CRUSH map, unset the flags so the cluster can rebalance
for i in noout nobackfill norecover noscrub nodeep-scrub; do ceph osd unset $i; done
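Once the flags are cleared, the cluster starts backfilling onto the new OSDs; progress can be watched until the status returns to HEALTH_OK:
[root@node142 ~]# ceph osd dump | grep flags    # confirm the recovery flags are gone
[root@node142 ~]# ceph -s                       # overall status and recovery/backfill progress
[root@node142 ~]# ceph -w                       # follow the cluster log in real time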