This article describes how to deal with Ceph monitor failures caused by an IP address change, and what to do when an OSD disk crashes. It is offered as a reference for anyone facing the same problems; I hope you find it useful.
The company moved offices and the IP addresses of all servers changed. After reconfiguring the IPs on the Ceph servers, the monitor process failed to start: it kept trying to bind to the old IP address, which of course no longer existed. At first I thought the server's IP settings were wrong, but changing the hostname, ceph.conf and so on did not help. Step-by-step analysis showed that the addresses recorded in the monmap were still the old IPs, and Ceph starts the monitor process by reading the monmap, so the monmap itself had to be modified. The method is as follows:
# Add the new monitor locations
# monmaptool --create --add mon0 192.168.32.2:6789 --add osd1 192.168.32.3:6789 \
  --add osd2 192.168.32.4:6789 --fsid 61a520db-317b-41f1-9752-30cedc5ffb9a --clobber monmap

# Retrieve the monitor map
# ceph mon getmap -o monmap.bin

# Check new contents
# monmaptool --print monmap.bin

# Inject the monmap
# ceph-mon -i mon0 --inject-monmap monmap.bin
# ceph-mon -i osd1 --inject-monmap monmap.bin
# ceph-mon -i osd2 --inject-monmap monmap.bin
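For comparison, the Ceph documentation also describes an "extract, edit, inject" procedure that modifies the map each monitor already has instead of rebuilding it from scratch. The following is only a sketch, assuming the same three monitors (mon0, osd1, osd2) and the new addresses shown above; all monitor daemons must be stopped before injecting, and the service name may differ on your system:

# ceph-mon -i mon0 --extract-monmap /tmp/monmap          (dump the map stored by this monitor)
# monmaptool --print /tmp/monmap                         (the old addresses should still be listed here)
# monmaptool /tmp/monmap --rm mon0 --rm osd1 --rm osd2   (drop the stale entries)
# monmaptool /tmp/monmap --add mon0 192.168.32.2:6789 --add osd1 192.168.32.3:6789 --add osd2 192.168.32.4:6789
# ceph-mon -i mon0 --inject-monmap /tmp/monmap           (repeat the injection with -i osd1 / -i osd2 on the other hosts)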
After starting the monitors again, everything works fine.
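To confirm that the monitors really formed a quorum on the new addresses, the standard status commands can be used (a quick check, assuming the monitor names above):

# ceph mon stat          (should list mon0, osd1 and osd2 at the 192.168.32.x addresses)
# ceph quorum_status     (JSON output; "quorum_names" should contain all three monitors)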
However, one OSD disk had failed, as described in an earlier article. After searching around I only found a matching bug report on the Ceph website and could not repair it, so I deleted the OSD and reinstalled it:
# service ceph stop osd.4
# ceph osd crush remove osd.4      (no need to execute this step)
# ceph auth del osd.4
# ceph osd rm 4
# umount /cephmp1
# mkfs.xfs -f /dev/sdc
# mount /dev/sdc /cephmp1
Reinstall the OSD:
# ceph-deploy osd prepare osd2:/cephmp1:/dev/sdf1
# ceph-deploy osd activate osd2:/cephmp1:/dev/sdf1
(running "ceph-deploy osd create" here did not install the OSD properly)
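After the new OSD is activated it should appear in the CRUSH tree again, and the cluster will start backfilling data onto it. A few standard commands for watching this (not specific to this incident, just the usual checks):

# ceph osd tree          (osd.4 should be listed under its host and marked "up")
# ceph -w                (follow recovery/backfill messages while data rebalances)
# ceph df                (overall capacity and per-pool usage once recovery settles)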
After this was done, I restarted the OSD and it ran successfully; Ceph automatically rebalanced the data. The final status was:
[root@osd2 ~]# ceph -s
    cluster 61a520db-317b-41f1-9752-30cedc5ffb9a
     health HEALTH_WARN 9 pgs incomplete; 9 pgs stuck inactive; 9 pgs stuck unclean; 3 requests are blocked > 32 sec
     monmap e3: 3 mons at {mon0=192.168.32.2:6789/0,osd1=192.168.32.3:6789/0,osd2=192.168.32.4:6789/0}, election epoch 76, quorum 0,1,2 mon0,osd1,osd2
     osdmap e689: 6 osds: 6 up, 6 in
      pgmap v189608: 704 pgs, 5 pools, 34983 MB data, 8966 objects
            69349 MB used, 11104 GB / 11172 GB avail
                 695 active+clean
                   9 incomplete
Nine PGs remain in the incomplete state.
[root@osd2 ~]# ceph health detail
HEALTH_WARN 9 pgs incomplete; 9 pgs stuck inactive; 9 pgs stuck unclean; 3 requests are blocked > 32 sec; 1 osds have slow requests
pg 5.95 is stuck inactive for 838842.634721, current state incomplete, last acting [1,4]
pg 5.66 is stuck inactive since forever, current state incomplete, last acting [4,0]
pg 5.de is stuck inactive for 808270.105968, current state incomplete, last acting [0,4]
pg 5.f5 is stuck inactive for 496137.708887, current state incomplete, last acting [0,4]
pg 5.11 is stuck inactive since forever, current state incomplete, last acting [4,1]
pg 5.30 is stuck inactive for 507062.828403, current state incomplete, last acting [0,4]
pg 5.bc is stuck inactive since forever, current state incomplete, last acting [4,1]
pg 5.a7 is stuck inactive for 499713.993372, current state incomplete, last acting [1,4]
pg 5.22 is stuck inactive for 496125.831204, current state incomplete, last acting [0,4]
pg 5.95 is stuck unclean for 838842.634796, current state incomplete, last acting [1,4]
pg 5.66 is stuck unclean since forever, current state incomplete, last acting [4,0]
pg 5.de is stuck unclean for 808270.106039, current state incomplete, last acting [0,4]
pg 5.f5 is stuck unclean for 496137.708958, current state incomplete, last acting [0,4]
pg 5.11 is stuck unclean since forever, current state incomplete, last acting [4,1]
pg 5.30 is stuck unclean for 507062.828475, current state incomplete, last acting [0,4]
pg 5.bc is stuck unclean since forever, current state incomplete, last acting [4,1]
pg 5.a7 is stuck unclean for 499713.993443, current state incomplete, last acting [1,4]
pg 5.22 is stuck unclean for 496125.831274, current state incomplete, last acting [0,4]
pg 5.de is incomplete, acting [0,4]
pg 5.bc is incomplete, acting [4,1]
pg 5.a7 is incomplete, acting [1,4]
pg 5.95 is incomplete, acting [1,4]
pg 5.66 is incomplete, acting [4,0]
pg 5.30 is incomplete, acting [0,4]
pg 5.22 is incomplete, acting [0,4]
pg 5.11 is incomplete, acting [4,1]
pg 5.f5 is incomplete, acting [0,4]
2 ops are blocked > 8388.61 sec
1 ops are blocked > 4194.3 sec
2 ops are blocked > 8388.61 sec on osd.0
1 ops are blocked > 4194.3 sec on osd.0
1 osds have slow requests
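When a PG is stuck incomplete, one way to see what it is waiting for (this did not resolve the problem here, but it narrows down the cause) is to query the PG and read its peering state. A sketch, using pg 5.de from the output above:

# ceph pg 5.de query > /tmp/pg_5.de.json

In the "recovery_state" section of that output, fields such as "down_osds_we_would_probe" and "blocked_by" usually name the OSDs whose history the PG still needs before it can peer; an incomplete state generally means the surviving copies do not contain enough history for the PG to go active.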
I looked into everything I could find, with no result. Here is a passage from someone who had the same experience:
I already tried "ceph pg repair 4.77", stop/start OSDs, "ceph osd lost", "ceph pg force_create_pg 4.77". Most scary thing is "force_create_pg" does not work. At least it should be a way to wipe out an incomplete PG without destroying a whole pool.
I had already tried all of the methods above, but none of them worked. For now I cannot solve it; this feels like a real pitfall.
PS: some common PG operations
[root@osd2 ~]# ceph pg map 5.de
osdmap e689 pg 5.de (5.de) -> up [0,4] acting [0,4]
[root@osd2 ~]# ceph pg 5.de query
[root@osd2 ~]# ceph pg scrub 5.de
instructing pg 5.de on osd.0 to scrub
[root@osd2 ~]# ceph pg 5.de mark_unfound_lost revert
pg has no unfound objects
# ceph pg dump_stuck stale
# ceph pg dump_stuck inactive
# ceph pg dump_stuck unclean
[root@osd2 ~]# ceph osd lost 1
Error EPERM: are you SURE?  this might mean real, permanent data loss.  pass --yes-i-really-mean-it if you really do.
[root@osd2 ~]# ceph osd lost 4 --yes-i-really-mean-it
osd.4 is not down or doesn't exist
[root@osd2 ~]# service ceph stop osd.4
=== osd.4 ===
Stopping Ceph osd.4 on osd2...kill 22287...kill 22287...done
[root@osd2 ~]# ceph osd lost 4 --yes-i-really-mean-it
marked osd lost in epoch 690
[root@osd1 mnt]# ceph pg repair 5.de
instructing pg 5.de on osd.0 to repair
[root@osd1 mnt]# ceph pg repair 5.de
instructing pg 5.de on osd.0 to repair