Ceph – Distributed Software Defined Storage – Part 1
Ceph is an open source, distributed, software defined storage platform aimed at bringing very large data stores to commodity hardware. Ceph focuses on having truly no single point of failure while remaining community driven.
As a whole, Ceph is broken down into 3 interfaces, each consumed through its own client-side tooling (sketched below):
- Object Gateway – An S3- and Swift-compatible gateway
- Block Device – A standard block device presented to a system or used in virtualization
- File System – A distributed POSIX file system
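As a rough sketch of how each interface is consumed from the client side (none of these commands are run in this article, and the pool, image and mount point names are hypothetical):

# Block Device: create a 1 GB RBD image and map it as a local block device
rbd create rbd/testimage --size 1024
sudo rbd map rbd/testimage

# File System: mount CephFS from a monitor (requires a running MDS)
sudo mount -t ceph cephpi01:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret

# Object Gateway: a radosgw instance speaks S3/Swift over HTTP, so any S3 client (s3cmd, boto, etc.) talks to it directly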
All 3 interfaces live directly within the same RADOS cluster. A cluster is composed of 3 types of daemons: OSDs (Object Storage Daemons), Monitors and Metadata Servers. The OSD is essentially a software layer sitting on top of a hard drive. It is responsible for serving objects to clients and peering with other OSDs. The Monitors are responsible for maintaining cluster state as well as membership. These should always be deployed in odd numbers to avoid the "split brain" scenario (half the monitors decide an OSD is active and the other half believe it is down, resulting in a tie). Last we have the MDS (Metadata Server), which is only required for the distributed filesystem and is not needed to interact with the other 2 interfaces.
Ceph does a few things that set it apart from traditional SAN solutions. First is CRUSH. CRUSH is a placement algorithm used in place of a central metadata lookup to determine which OSD an object is stored on. The monitors are in charge of distributing a CRUSH map of the current cluster state, and based on this map and the algorithm, clients communicate directly with the OSD storing the object; there is no interaction with a metadata server (outside of CephFS). Ceph also introduces peering between the OSDs. Once a monitor determines an OSD is out (total failure), the remaining OSDs begin re-replicating its data until the proper number of replicas is restored. If you have 10 nodes and 1 is lost, only roughly 1/10th of the data needs to be moved, and every surviving node participates, making recovery very fast.
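Because placement is computed rather than looked up, you can ask the cluster where any object would land. As a quick illustration (the object name here is hypothetical, and this is run against the cluster we build later in this article):

ceph@cephpi01:~/cephpi $ ceph osd map rbd myobject

The output reports the placement group and the set of OSDs that CRUSH selects for that object; no metadata server is involved.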
Next I will cover a basic implementation, diving deeper into Ceph in a later article. Disclaimer: I will not be covering any of the prerequisites to the installation as they are distribution specific and may differ based on your number of OSDs. I will be using 3 Raspberry Pi Model 3s: one as a monitor and two as OSD nodes. Please note that at the time of this blog I will be using the Hammer release as opposed to the Jewel release, as Jewel is not yet available for ARM.
After you have completed the prerequisites, created a ceph user on all the nodes, pushed your ssh-key out and installed your distribution specific ceph-deploy package, we will begin here.
The end result:
Cephpi01 – Admin node, Monitor (The admin node is essentially the system with ceph-deploy installed.)
Cephpi02 – OSD
Cephpi03 – OSD
Let's create our directory to store all of our cluster information.
ceph@cephpi01:~ $ mkdir cephpi
ceph@cephpi01:~ $ cd cephpi/
Now we create our cluster, defining the initial monitors; in our case it is cephpi01.
ceph@cephpi01:~/cephpi $ ceph-deploy new cephpi01
If you were to do an "ls" you would see 3 files have been created: a monitor secret key, a cluster configuration file and a log file. Ceph defaults to 3 replicas per object, so it expects at least 3 OSDs before it will report healthy. Now it is time to install Ceph on our 3 nodes.
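One aside before installing: since this cluster only has two OSD hosts, a common tweak (my own suggestion rather than part of the original walkthrough) is to lower the default replica count in the ceph.conf that ceph-deploy just generated, so the cluster can later reach HEALTH_OK with only two hosts to place replicas on.

# added under the [global] section of ./ceph.conf
osd pool default size = 2
osd pool default min size = 1

This is the same file ceph-deploy pushes out to the nodes later during the "ceph-deploy admin" step.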
ceph@cephpi01:~/cephpi $ ceph-deploy install cephpi01 cephpi02 cephpi03
With the Ceph packages installed across all 3 nodes we can initialize our monitor and begin bringing OSDs online.
ceph@cephpi01:~/cephpi $ ceph-deploy mon create-initial
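If you want to confirm the monitor actually formed a quorum before moving on (an optional check, not part of the original steps), you can query its admin socket directly on cephpi01, which works even before the admin keyring has been pushed anywhere:

ceph@cephpi01:~/cephpi $ sudo ceph --admin-daemon /var/run/ceph/ceph-mon.cephpi01.asok mon_status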
You can view a list of disks available on a node by using the following command:
ceph@cephpi01:~/cephpi $ ceph-deploy disk list cephpi02
If you are pointing your OSDs at free disks, as we are, first use the gdisk utility to verify there is a functioning GPT partition table on those disks. We will then deploy an OSD to each of them. I have chosen to use btrfs as my filesystem. There are many well documented advantages to using btrfs with a Ceph OSD that go well outside the scope of this article. While many still believe btrfs is unstable, it is nearly a decade into its development life cycle and is now the default filesystem recommended in SUSE SLES 12 (its development began within about a year of ext4's).
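As a sketch of that preparation step (this assumes the disks are blank and their contents are disposable; disk zap destroys anything on them):

ceph@cephpi01:~/cephpi $ ssh cephpi02 "sudo gdisk -l /dev/sda"    # confirm a valid GPT partition table is reported
ceph@cephpi01:~/cephpi $ ceph-deploy disk zap cephpi02:/dev/sda   # optionally wipe the disk and write a fresh label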
ceph@cephpi01:~/cephpi $ ceph-deploy osd create --fs-type btrfs cephpi02:/dev/sda
ceph@cephpi01:~/cephpi $ ceph-deploy osd create --fs-type btrfs cephpi02:/dev/sdb
ceph@cephpi01:~/cephpi $ ceph-deploy osd create --fs-type btrfs cephpi03:/dev/sda
ceph@cephpi01:~/cephpi $ ceph-deploy osd create --fs-type btrfs cephpi03:/dev/sdb
Now that we have our OSDs sitting on a proper filesystem, let's push out our configuration file and set the proper permissions.
ceph@cephpi01:~/cephpi $ ceph-deploy admin cephpi02 cephpi03
ceph@cephpi01:~/cephpi $ for i in cephpi02 cephpi03; do ssh $i "sudo chmod +r /etc/ceph/ceph.client.admin.keyring"; done
The OSDs will acquire the CRUSH map and begin peering their placement groups. You can now check the health of the cluster by using:
ceph@cephpi01:~/cephpi $ ceph -s
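A couple of related commands that may be useful while the placement groups settle (optional, not part of the original steps):

ceph@cephpi01:~/cephpi $ ceph health detail    # lists any placement groups that are not yet active+clean
ceph@cephpi01:~/cephpi $ ceph -w               # watches cluster events as peering and recovery progress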
You can view the space available and space used by using:
ceph@cephpi01:~/cephpi $ ceph df
GLOBAL:
    SIZE     AVAIL     RAW USED     %RAW USED
    796G     792G      134M         0.02
POOLS:
    NAME     ID     USED     %USED     MAX AVAIL     OBJECTS
    rbd      0      0        0         259G          0
You can also view the live CRUSH hierarchy:
ceph@cephpi01:~/cephpi $ ceph osd tree
ID WEIGHT  TYPE NAME          UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0.77997 root default
-2 0.38998     host cephpi02
 0 0.28000         osd.0           up  1.00000          1.00000
 1 0.10999         osd.1           up  1.00000          1.00000
-3 0.38998     host cephpi03
 2 0.28000         osd.2           up  1.00000          1.00000
 3 0.10999         osd.3           up  1.00000          1.00000
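ceph osd tree shows the hierarchy; if you want the full CRUSH map itself, rules included, you can dump and decompile it. This is an optional aside and the file names below are arbitrary:

ceph@cephpi01:~/cephpi $ ceph osd getcrushmap -o crushmap.bin
ceph@cephpi01:~/cephpi $ crushtool -d crushmap.bin -o crushmap.txt

crushmap.txt is plain text, so you can read the buckets and replication rules directly.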
Ceph has made quite a name for itself, from being adopted as an enterprise solution by Red Hat and SUSE to becoming essentially the de facto storage solution for OpenStack. Providing rapid growth at minimal cost seems to be the direction most technologies are heading, and Ceph fits the current paradigm.
Now we have a general overview of Ceph and its role in the current marketplace. In a later article we will dive deeper into distributing storage to clients, creating new pools, benchmarking, looking at placement groups and testing an OSD failure. Feel free to reach out to me if you require some assistance with the initial preparation of a host/cluster; I opted to leave that out to avoid repeating existing documentation.