This article explains how to deploy the distributed storage system Ceph on Ubuntu. The content is straightforward and easy to follow; work through it step by step to learn how to deploy Ceph on an Ubuntu system.
Ceph is a unified storage system that supports three interfaces.
Object: a native API, also compatible with the Swift and S3 APIs
Block: supports thin provisioning, snapshots, and clones
File: a POSIX interface with snapshot support
Ceph is also a distributed storage system, which is characterized by:
High scalability: uses ordinary x86 servers, supports clusters of up to 10,000 servers, and supports expansion from the TB to the PB level.
High reliability: no single point of failure, multiple copies of data, automatic management, automatic repair.
High performance: data is evenly distributed and parallelism is high. Object storage and block storage do not require a metadata server.
Architecture
At the bottom of Ceph is RADOS, the "Reliable, Autonomic, Distributed Object Store". RADOS consists of two components:
OSD: Object Storage Device, which provides storage resources.
Monitor: maintains the global state of the entire Ceph cluster.
RADOS is highly scalable and programmable; Ceph's Object Storage, Block Storage, and FileSystem are all built on top of it. The other two components of Ceph are:
MDS: used to hold the metadata of CephFS.
RADOS Gateway: provides REST interface, compatible with S3 and Swift API.
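To make the three interfaces concrete, they are typically exercised with the rados, rbd, and mount commands. This is only an illustrative sketch; the pool name, image name, and mount point below are placeholders and are not part of the deployment later in this article.
The code is as follows:
rados put my-object ./local-file --pool=data     # object interface: store a file as an object
rbd create my-image --size 1024                  # block interface: create a 1 GB image in the rbd pool
mount -t ceph 192.168.242.128:6789:/ /mnt/cephfs # file interface: mount CephFS on a client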
Mapping
The namespace of Ceph is (Pool, Object), and each Object is mapped to an OSD set (the set of OSDs that stores that Object):
(Pool, Object) → (Pool, PG) → OSD set → Disk
The attributes of a Pool in Ceph are (an example of setting them follows this list):
Number of copies of Object
Number of Placement Groups
The CRUSH Ruleset used
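These attributes can be set from the command line. The sketch below is only an example; the pool name "mypool" and the values shown are placeholders, and on releases of this era the ruleset attribute is named crush_ruleset as shown here.
The code is as follows:
ceph osd pool create mypool 128          # create a pool with 128 placement groups
ceph osd pool set mypool size 3          # keep 3 copies of every object in this pool
ceph osd pool set mypool crush_ruleset 0 # use CRUSH ruleset 0 for this pool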
In Ceph, an Object is first mapped to a PG (Placement Group), and then from the PG to an OSD set. Each Pool has multiple PGs, and each Object finds its PG by hashing its name and taking the result modulo the number of PGs. The PG is then mapped to a set of OSDs (the number of OSDs is determined by the Pool's number of copies); the first OSD is the Primary and the rest are Replicas.
The way of data mapping (Data Placement) determines the performance and scalability of the storage system. The mapping of (Pool, PG) → OSD set is determined by four factors:
CRUSH algorithm: a pseudo-random data distribution algorithm.
OSD Map: contains the current status of all Pools and of all OSDs.
CRUSH Map: contains the current hierarchy of disks, servers, and racks.
CRUSH Rules: the data placement policies. These policies can flexibly control where objects are stored. For example, you can specify that all objects in pool1 are placed on rack 1, with the first copy of each object on server A of rack 1 and the second copy on server B of rack 1; and that all objects in pool2 are spread across racks 2, 3, and 4, with the first copy on the servers of rack 2, the second copy on the servers of rack 3, and the third copy on the servers of rack 4.
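As a concrete illustration, a rule in the CRUSH map looks roughly like the following. This is only a generic sketch that spreads each copy across different hosts; it is not taken from the cluster configured later in this article.
The code is as follows:
rule replicated_rule {
    ruleset 0
    type replicated
    min_size 1
    max_size 10
    step take default                  # start from the root of the CRUSH hierarchy
    step chooseleaf firstn 0 type host # pick as many different hosts as there are copies
    step emit
}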
The client obtains the CRUSH Map, OSD Map, and CRUSH Ruleset from the Monitors, then uses the CRUSH algorithm to compute the OSD set where an Object lives. Ceph therefore needs no central name server; clients communicate directly with OSDs. The pseudo code is as follows:
The code is as follows:
locator = object_name
obj_hash = hash(locator)
pg = obj_hash % num_pg
osds_for_pg = crush(pg)  # returns a list of osds
primary = osds_for_pg[0]
replicas = osds_for_pg[1:]
The advantages of this data mapping are:
Grouping Objects into PGs reduces the amount of metadata that has to be tracked and processed: at the global level we do not need to track the metadata and placement of every object, only the metadata of each PG, and the number of PGs is orders of magnitude smaller than the number of objects.
Increasing the number of PG can balance the load of each OSD and improve parallelism.
Separate failure domains and improve the reliability of data.
Strong consistency
Ceph's read and write path follows a Primary-Replica model: a client sends read and write requests only to the Primary of the OSD set that an Object maps to, which guarantees strong consistency.
Since each Object has exactly one Primary OSD, updates to the Object are serialized there and no synchronization conflicts arise.
When the Primary receives a write request for an Object, it is responsible for forwarding the data to the other Replicas. Only after the data has been persisted on all OSDs does the Primary acknowledge the write, which keeps the copies consistent.
Fault tolerance
In distributed systems, the common faults are network interruption, power outage, server outage, hard disk failure and so on. Ceph can tolerate these faults and repair them automatically to ensure the reliability of data and system availability.
Monitors are Ceph's stewards: they maintain the global state of the cluster. They play a role similar to ZooKeeper, using a quorum and the Paxos algorithm to reach consensus on the global state.
Failed OSDs are repaired automatically and in parallel.
Fault detection:
OSDs exchange heartbeats with one another. When OSD A detects that OSD B is not responding, it reports to the Monitors that OSD B is unreachable; the Monitors then mark OSD B as down and update the OSD Map. If OSD B is still unreachable after M seconds, the Monitors mark OSD B as out (indicating that OSD B is no longer serving data) and update the OSD Map again.
Note: the value of M can be configured in Ceph.
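For reference, both intervals live in ceph.conf. The values below are the usual defaults and are shown only as an illustration, not as tuning advice for this cluster.
The code is as follows:
[osd]
osd heartbeat grace = 20        # seconds without a heartbeat before an OSD is reported down
[mon]
mon osd down out interval = 300 # seconds a down OSD may stay down before it is marked out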
Failure recovery:
When one OSD in the OSD set of a PG is marked down (if the Primary is the one marked down, a Replica becomes the new Primary and handles all read and write requests for its objects), the PG enters the active+degraded state, meaning the PG currently has only N-1 valid copies.
If the OSD still cannot be reached after M seconds, it is marked out, and Ceph recalculates the PG-to-OSD-set mapping (the mapping is likewise recalculated when a new OSD joins the cluster) so that every PG again has N valid copies.
The Primary of the new OSD set first collects the PG log from the old OSD set to obtain an Authoritative History (a complete and fully ordered sequence of operations) and gets the other Replicas to agree on it (that is, the Replicas reach agreement on the state of all the objects in the PG); this process is called Peering.
Once Peering completes, the PG enters the active+recovering state and the Primary migrates and synchronizes the degraded objects to all Replicas, ensuring that these objects again have N copies.
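To watch this on a live cluster, the PG states can be followed from any node with the standard status commands; they are listed here only as a pointer, not as part of the deployment steps below.
The code is as follows:
ceph -w      # stream cluster events and watch PGs move through degraded/recovering back to clean
ceph pg stat # one-line summary of PG states
ceph pg dump # full PG table, including the OSD set each PG maps to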
Let's take a look at deployment and configuration.
System environment: Ubuntu 12.04.2
The code is as follows:
Hostname: s1  osd.0/mon.a/mds.a  ip: 192.168.242.128
Hostname: s2  osd.1/mon.b/mds.b  ip: 192.168.242.129
Hostname: s3  osd.2/mon.c/mds.c  ip: 192.168.242.130
Hostname: s4  client             ip: 192.168.242.131
Passwordless SSH:
Enable root on s1/s2/s3 and configure passwordless SSH between them.
The code is as follows:
cat id_rsa.pub_s* >> authorized_keys
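The surrounding steps look roughly like this; it is a hedged sketch that assumes each node's public key has been copied to the others as id_rsa.pub_s1, id_rsa.pub_s2 and id_rsa.pub_s3 before the cat above is run.
The code is as follows:
ssh-keygen -t rsa                                  # run on s1, s2 and s3, accepting the defaults
scp ~/.ssh/id_rsa.pub s2:~/.ssh/id_rsa.pub_s1      # copy s1's key to the other nodes (repeat for every pair)
cat ~/.ssh/id_rsa.pub_s* >> ~/.ssh/authorized_keys # run on every node to trust the collected keys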
Installation:
The code is as follows:
apt-get install ceph ceph-common ceph-fs-common ceph-mds
Update to the new version:
The code is as follows:
wget -q -O- 'https://ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/release.asc' | sudo apt-key add -
echo deb http://ceph.com/debian/ $(lsb_release -sc) main | tee /etc/apt/sources.list.d/ceph.list
apt-get update
apt-get install ceph
Partition and mount (using btrfs):
The code is as follows:
root@s1:/data/osd.0# df -h | grep osd
/dev/sdb1        20G  180M   19G   1% /data/osd.0
root@s2:/data/osd.1# df -h | grep osd
/dev/sdb1        20G  173M   19G   1% /data/osd.1
root@s3:/data/osd.2# df -h | grep osd
/dev/sdb1        20G  180M   19G   1% /data/osd.2
root@s1:~/.ssh# mkdir -p /tmp/ceph/   (executed on each server)
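The partitions above were prepared roughly as follows. This is a hedged sketch that assumes /dev/sdb1 already exists on every node; adjust the OSD directory per host (osd.1 on s2, osd.2 on s3).
The code is as follows:
apt-get install btrfs-tools  # provides mkfs.btrfs on Ubuntu 12.04
mkfs.btrfs /dev/sdb1         # format the data partition with btrfs
mkdir -p /data/osd.0         # the directory referenced by ceph.conf below
mount /dev/sdb1 /data/osd.0  # mount the partition as the osd data directory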
Configuration:
The code is as follows:
root@s1:/data/osd.0# vim /etc/ceph/ceph.conf
[global]
auth cluster required = none
auth service required = none
auth client required = none
[osd]
osd data = /data/$name
[mon]
mon data = /data/$name
[mon.a]
host = s1
mon addr = 192.168.242.128:6789
[mon.b]
host = s2
mon addr = 192.168.242.129:6789
[mon.c]
host = s3
mon addr = 192.168.242.130:6789
[osd.0]
host = s1
btrfs devs = /dev/sdb1
[osd.1]
host = s2
btrfs devs = /dev/sdb1
[osd.2]
host = s3
btrfs devs = /dev/sdb1
[mds.a]
host = s1
[mds.b]
host = s2
[mds.c]
host = s3
Synchronize the configuration:
The code is as follows:
root@s1:~/.ssh# scp /etc/ceph/ceph.conf s2:/etc/ceph/
ceph.conf                100%  555     0.5KB/s   00:00
root@s1:~/.ssh# scp /etc/ceph/ceph.conf s3:/etc/ceph/
ceph.conf                100%  555     0.5KB/s   00:00
Execute on all servers:
The code is as follows:
rm -rf /data/$name/* /data/mon/*   (make sure there is no data left before initialization)
root@s1:~/.ssh# mkcephfs -a -c /etc/ceph/ceph.conf -k /etc/ceph/ceph.keyring
temp dir is /tmp/mkcephfs.qLmwP4Nd0G
preparing monmap in /tmp/mkcephfs.qLmwP4Nd0G/monmap
/usr/bin/monmaptool --create --clobber --add a 192.168.242.128:6789 --add b 192.168.242.129:6789 --add c 192.168.242.130:6789 --print /tmp/mkcephfs.qLmwP4Nd0G/monmap
/usr/bin/monmaptool: monmap file /tmp/mkcephfs.qLmwP4Nd0G/monmap
/usr/bin/monmaptool: generated fsid c26fac57-4941-411f-a6ac-3dcd024f2073
epoch 0
fsid c26fac57-4941-411f-a6ac-3dcd024f2073
last_changed 2014-05-08 16:08:06.102237
created 2014-05-08 16:08:06.102237
0: 192.168.242.128:6789/0 mon.a
1: 192.168.242.129:6789/0 mon.b
2: 192.168.242.130:6789/0 mon.c
/usr/bin/monmaptool: writing epoch 0 to /tmp/mkcephfs.qLmwP4Nd0G/monmap (3 monitors)
=== osd.0 ===
** WARNING: No osd journal is configured: write latency may be high.
If you will not be using an osd journal, write latency may be
relatively high. It can be reduced somewhat by lowering
filestore_max_sync_interval, but lower values mean lower write
throughput, especially with spinning disks.
2014-05-08 16:08:11.279610 b72cc740 created object store /data/osd.0 for osd.0 fsid c26fac57-4941-411f-a6ac-3dcd024f2073
creating private key for osd.0 keyring /tmp/mkcephfs.qLmwP4Nd0G/keyring.osd.0
creating /tmp/mkcephfs.qLmwP4Nd0G/keyring.osd.0
=== osd.1 ===
pushing conf and monmap to s2:/tmp/mkfs.ceph.5884
** WARNING: No osd journal is configured: write latency may be high.
If you will not be using an osd journal, write latency may be
relatively high. It can be reduced somewhat by lowering
filestore_max_sync_interval, but lower values mean lower write
throughput, especially with spinning disks.
2014-05-08 16:08:21.146302 b7234740 created object store /data/osd.1 for osd.1 fsid c26fac57-4941-411f-a6ac-3dcd024f2073
creating private key for osd.1 keyring /tmp/mkfs.ceph.5884/keyring.osd.1
creating /tmp/mkfs.ceph.5884/keyring.osd.1
collecting osd.1 key
=== osd.2 ===
pushing conf and monmap to s3:/tmp/mkfs.ceph.5884
** WARNING: No osd journal is configured: write latency may be high.
If you will not be using an osd journal, write latency may be
relatively high. It can be reduced somewhat by lowering
filestore_max_sync_interval, but lower values mean lower write
throughput, especially with spinning disks.
2014-05-08 16:08:27.264484 b72b3740 created object store /data/osd.2 for osd.2 fsid c26fac57-4941-411f-a6ac-3dcd024f2073
creating private key for osd.2 keyring /tmp/mkfs.ceph.5884/keyring.osd.2
creating /tmp/mkfs.ceph.5884/keyring.osd.2
collecting osd.2 key
=== mds.a ===
creating private key for mds.a keyring /tmp/mkcephfs.qLmwP4Nd0G/keyring.mds.a
creating /tmp/mkcephfs.qLmwP4Nd0G/keyring.mds.a
=== mds.b ===
pushing conf and monmap to s2:/tmp/mkfs.ceph.5884
creating private key for mds.b keyring /tmp/mkfs.ceph.5884/keyring.mds.b
creating /tmp/mkfs.ceph.5884/keyring.mds.b
collecting mds.b key
=== mds.c ===
pushing conf and monmap to s3:/tmp/mkfs.ceph.5884
creating private key for mds.c keyring /tmp/mkfs.ceph.5884/keyring.mds.c
creating /tmp/mkfs.ceph.5884/keyring.mds.c
collecting mds.c key
Building generic osdmap from /tmp/mkcephfs.qLmwP4Nd0G/conf
/usr/bin/osdmaptool: osdmap file '/tmp/mkcephfs.qLmwP4Nd0G/osdmap'
2014-05-08 16:08:26.100746 b731e740 adding osd.0 at {host=s1, pool=default, rack=unknownrack}
2014-05-08 16:08:26.101413 b731e740 adding osd.1 at {host=s2, pool=default, rack=unknownrack}
2014-05-08 16:08:26.101902 b731e740 adding osd.2 at {host=s3, pool=default, rack=unknownrack}
/usr/bin/osdmaptool: writing epoch 1 to /tmp/mkcephfs.qLmwP4Nd0G/osdmap
Generating admin key at /tmp/mkcephfs.qLmwP4Nd0G/keyring.admin
creating /tmp/mkcephfs.qLmwP4Nd0G/keyring.admin
Building initial monitor keyring
added entity mds.a auth auth(auid = 18446744073709551615 key=AQB3O2tTwDNwLRAAofpkrOMqtHCPTFX36EKAMA== with 0 caps)
added entity mds.b auth auth(auid = 18446744073709551615 key=AQB8O2tT8H8nIhAAq1O2lh5IV/cQ73FUUTOUug== with 0 caps)
added entity mds.c auth auth(auid = 18446744073709551615 key=AQB9O2tTWIfsKRAAVYeueMToC85tRSvlslV/jQ== with 0 caps)
added entity osd.0 auth auth(auid = 18446744073709551615 key=AQBrO2tTOLQpEhAA4MS83CnJRYAkoxrFSvC3aQ== with 0 caps)
added entity osd.1 auth auth(auid = 18446744073709551615 key=AQB1O2tTME0eChAA7U4xSrv7MJUZ8vxcEkILbw== with 0 caps)
added entity osd.2 auth auth(auid = 18446744073709551615 key=AQB7O2tT0FUKERAAQ/EJT5TclI2XSCLAWAZZOw== with 0 caps)
=== mon.a ===
/usr/bin/ceph-mon: created monfs at /data/mon for mon.a
=== mon.b ===
pushing everything to s2
/usr/bin/ceph-mon: created monfs at /data/mon for mon.b
=== mon.c ===
pushing everything to s3
/usr/bin/ceph-mon: created monfs at /data/mon for mon.c
placing client.admin keyring in /etc/ceph/ceph.keyring
The warnings above show that no OSD journal has been configured.
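If you do want a journal, it can be declared per OSD in ceph.conf before running mkcephfs. This is only an example; the path and size below are placeholders, not values used in this cluster.
The code is as follows:
[osd]
osd journal = /data/$name/journal  # file (or raw partition) used as the write-ahead journal
osd journal size = 1000            # journal size in MB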
Next, start the Ceph services. The code is as follows:
root@s1:~# /etc/init.d/ceph -a start
=== mon.a ===
Starting Ceph mon.a on s1...already running
=== mds.a ===
Starting Ceph mds.a on s1...already running
=== osd.0 ===
Starting Ceph osd.0 on s1...
** WARNING: Ceph is still under development. Any feedback can be directed **
** at ceph-devel@vger.kernel.org or http://ceph.newdream.net/. **
starting osd.0 at 0.0.0.0:6801 osd_data /data/osd.0 (no journal)
View status:
The code is as follows:
root@s1:~# ceph -s
2014-05-09 09:37:40.477978    pg v444: 594 pgs: 594 active+clean; 38199 bytes data, 531 MB used, 56869 MB / 60472 MB avail
2014-05-09 09:37:40    mds e23: 1/1/1 up {0=a=up:active}, 2 up:standby
2014-05-09 09:37:40.485601    osd e34: 3 osds: 3 up, 3 in
2014-05-09 09:37:40.486276    log 2014-05-09 09:36:25.843782 mds.0 192.168.242.128 mds.0 1053 1 : [INF] closing stale session client.4104 192.168.242.131:0/2123448720 after 302.954724
2014-05-09 09:37:40.486577    mon e1: 3 mons at {a=192.168.242.128:6789/0, b=192.168.242.129:6789/0, c=192.168.242.130:6789/0}
root@s1:~# for i in 1 2 3; do ceph health; done
2014-05-09 10:05:30.306575 mon 'HEALTH_OK' (0)
2014-05-09 10:05:30.330317 mon 'HEALTH_OK' (0)
2014-05-09 10:05:30 mon 'HEALTH_OK' (0)
Looking at the s1, s2, and s3 logs at the same time, you can see that all three nodes are healthy:
The code is as follows:
2014-05-09 09:39:32.316795 b4bfeb40 mon.a@0(leader) e1 handle_command mon_command(health v 0) v1
2014-05-09 09:39:40.789748 osd e35: 3 osds: 3 up, 3 in
2014-05-09 09:40:00.796979 b4bfeb40 mon.a@0(leader) osd e36: 3 osds: 3 up, 3 in
2014-05-09 09:40:41.781141 b4bfeb40 mon.a@0(leader) e1 handle_command mon_command(health v 0) v1
2014-05-09 09:40:42.409235 b4bfeb40 mon.a@0(leader) e1 handle_command mon_command(health v 0) v1
You may also see the following clock-skew messages in the log:
The code is as follows:
2014-05-09 09:43:13.485212 b49fcb40 log [WRN] : message from mon.0 was stamped 6.050738s in the future, clocks not synchronized
2014-05-09 09:43:13.861985 b49fcb40 log [WRN] : message from mon.0 was stamped 6.050886s in the future, clocks not synchronized
2014-05-09 09:43:14.012633 b49fcb40 log [WRN] : message from mon.0 was stamped 6.050681s in the future, clocks not synchronized
2014-05-09 09:43:15.809439 b49fcb40 log [WRN] : message from mon.0 was stamped 6.050781s in the future, clocks not synchronized
Therefore, before building the cluster, it is best to set up an NTP server within the cluster so that the clocks of all nodes stay synchronized.
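A minimal way to do this on Ubuntu is sketched below; using ntp.ubuntu.com as the upstream server is just an example, and in practice one node inside the cluster usually serves time to the others.
The code is as follows:
apt-get install ntp     # on every node
service ntp stop
ntpdate ntp.ubuntu.com  # force a one-off synchronization while ntpd is stopped
service ntp start       # keep the clocks synchronized from then on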
Next, run the verification steps on the client s4:
The code is as follows:
root@s4:/mnt# mount -t ceph s1:6789:/ /mnt/s1fs/
root@s4:/mnt# mount -t ceph s2:6789:/ /mnt/s2fs/
root@s4:/mnt# mount -t ceph s3:6789:/ /mnt/s3fs/
root@s4:~# df -h
Filesystem               Size  Used Avail Use% Mounted on
/dev/sda1                 79G  1.3G   74G   2% /
udev                     241M  4.0K  241M   1% /dev
tmpfs                    100M  304K   99M   1% /run
none                     5.0M     0  5.0M   0% /run/lock
none                     248M     0  248M   0% /run/shm
192.168.242.130:6789:/    60G  3.6G   56G   6% /mnt/s3fs
192.168.242.129:6789:/    60G  3.6G   56G   6% /mnt/s2fs
192.168.242.128:6789:/    60G  3.6G   56G   6% /mnt/s1fs
root@s4:/mnt/s2fs# touch aa
root@s4:/mnt/s2fs# ls -al /mnt/s1fs
total 4
drwxr-xr-x 1 root root    0 May  8 18:08 ./
drwxr-xr-x 7 root root 4096 May  8 17:28 ../
-rw-r--r-- 1 root root    0 May  8 18:08 aa
root@s4:/mnt/s2fs# ls -al /mnt/s3fs
total 4
drwxr-xr-x 1 root root    0 May  8 18:08 ./
drwxr-xr-x 7 root root 4096 May  8 17:28 ../
-rw-r--r-- 1 root root    0 May  8 18:08 aa
root@s4:/mnt/s2fs# rm -f aa
root@s4:/mnt/s2fs# ls -al /mnt/s1fs/
total 4
drwxr-xr-x 1 root root    0 May  8 2014 ./
drwxr-xr-x 7 root root 4096 May  8 17:28 ../
root@s4:/mnt/s2fs# ls -al /mnt/s3fs/
total 4
drwxr-xr-x 1 root root    0 May  8 18:07 ./
drwxr-xr-x 7 root root 4096 May  8 17:28 ../
Next, let's verify how the cluster behaves when a single node fails.
Stop the Ceph services on s1:
The code is as follows:
root@s1:~# /etc/init.d/ceph stop
=== mon.a ===
Stopping Ceph mon.a on s1...kill 965...done
=== mds.a ===
Stopping Ceph mds.a on s1...kill 1314...done
=== osd.0 ===
Stopping Ceph osd.0 on s1...kill 2265...done
The log on s2 immediately shows the following (much of the output is omitted). In short, the monitors detect the failure, remove the faulty node, fail over automatically, and the cluster recovers:
The code is as follows:
2014-05-09 10:16:44.906370 a5af0b40 -- 192.168.242.129:6802/1495 >> 192.168.242.128:6802/1466 pipe(0xb1e1b1a8 sd=19 pgs=3 cs=3 l=0).fault with nothing to send
2014-05-09 10:16:44.906982 a68feb40 -- 192.168.242.129:6803/1495 >> 192.168.242.128:6803/1467 pipe(0xa6e00d50 sd=17 pgs=1 cs=1 l=0).fault with nothing to send
2014-05-09 10:16:44.907415 a63f9b40 -- 192.168.242.129:6803/1506 >> 192.168.242.128:6803/1480 pipe(0xb1e26d50 sd=20 pgs=1 cs=1 l=0).fault with nothing to send
2014-05-09 10:16:49.028640 b5199b40 mds.0.6 handle_mds_map i am now mds.0.6
2014-05-09 10:16:49.029018 b5199b40 mds.0.6 handle_mds_map state change up:reconnect --> up:rejoin
2014-05-09 10:16:49.029260 b5199b40 mds.0.6 rejoin_joint_start
2014-05-09 10:16:49.032134 b5199b40 mds.0.6 rejoin_done
==> /var/log/ceph/mon.b.log <==
==> /var/log/ceph/mds.b.log <==   (mds.b becomes up:active)
2014-05-09 10:16:49.073252 b5199b40 mds.0.6 recovery_done -- successful recovery!
2014-05-09 10:16:49.073871 b5199b40 mds.0.6 active_start
2014-05-09 10:16:49.073934 b5199b40 mds.0.6 cluster recovered.
==> /var/log/ceph/mon.b.log <==
==> /var/log/ceph/mds.c.log <==
==> /var/log/ceph/osd.2.log <==
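If you prefer to confirm the failover from the command line instead of the logs, the cluster state and the individual OSDs can be checked from s2 or s3. These are standard status commands and are shown here only as a suggestion.
The code is as follows:
ceph -s       # overall health; the PGs should move from degraded back to active+clean
ceph osd tree # osd.0 should be shown as down while s1 is stopped
ceph mon stat # shows the surviving monitor quorum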