Six years ago I wrote this article: Open Source Storage, in which I described how storage would become "Software Defined". Basically I predicted SDS before the acronym was even invented. What I did not see coming is that Oracle would buy Sun and by doing that basically kill off the entire "Open Source" part of that article, but hey, at least you can call yourself an America's Cup sponsor and Larry Ellison's yacht maintainer. 🙂
Fast forward six years to 2015 and the software defined storage landscape has expanded massively. Not only is there a huge number of different solutions available now, but the majority of them have evolved into mature storage platforms with almost infinite scalability in both capacity and performance.
The status quo of storage has been surpassed by new methods of storing and managing data. Block devices in the form of SCSI-addressable disk drives are no longer sufficient to provide exabyte-scale capacity, especially when combined with ever-increasing performance expectations. The entire abstraction layer has been pulled much higher up the stack, where storage virtualisation layers provide the logical addressable space for applications and user data. Don't get me wrong here, I'm not saying that current solutions are no good. They still fit a vast range of applications and provide a well-proven method of delivering business functionality, so these are not going away anytime soon. You will certainly not be fired for buying an HDS, EMC or 3PAR array.
The software storage stacks come in many different flavors, but the five most mature and vendor-supported software stacks are:
- Ceph
- GlusterFS
- HDFS
- Swift (and Cinder)
- Lustre
The first two fall under the umbrella of Red Hat, HDFS is governed by the Apache Hadoop community, and Swift plus Cinder are managed by the storage branch of the OpenStack Foundation. Last but not least, Lustre has its origins all the way back in 2001 at Carnegie Mellon University; it has jumped through some hoops via Sun and Oracle and now falls under OpenSFS, as Oracle no longer wanted to maintain that stack. They put their eggs in the ZFS basket instead. (Nothing wrong with that either. 🙂)
So what do they all do?
As I mentioned, the addressable abstraction layer has been pulled up significantly, all the way up to where the CPUs do the grunt work and DRAM is used for caching, among other things. The storage stack now uses a more distributed method of allocating address space, so that very specialized hardware like ASICs and FPGAs is no longer needed. The storage layer is connected via Ethernet or InfiniBand links and the building blocks act as a cluster. Because both capacity and compute power are distributed, the system is less prone to scalability limits and both grow linearly on demand: you add capacity and compute power in the same building blocks.
That's the short version. 🙂
Most, if not all, of the solutions I mentioned above have advanced data management capabilities where redundancy, data placement, snapshotting and tiering are already built in. In this post I'll primarily highlight Ceph. In later posts I'll try to cover the others.
Ceph is a somewhat solitary solution compared to the others, but that is what makes it so attractive. It is not defined as a file system like GlusterFS, nor is it really tied to a cloud solution like Swift or Cinder. I think its closest companion is Lustre, which uses a similar approach and architecture, although the inner guts are somewhat different.
Ceph's core building block is RADOS (Reliable Autonomic Distributed Object Store), an object-based storage system. This part takes care of the core storage features: storing and distributing data and monitoring the entire platform. Depending on the requirements of the applications that need access to this storage system, three access layers are stacked on top of the RADOS core layer.
The first is the Ceph-FS file system, which allows clients to use POSIX semantics to access the Ceph storage cluster. The Ceph client kernel portion has been merged as of the 2.6.34 kernel, and all major distros provide the required modules and libraries. A simple "mount -t ceph ......" will provide access to the storage platform.
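As a concrete sketch, a kernel-client mount could look like the following. The monitor address and the secret-file path are placeholders; substitute the values for your own cluster.

```shell
# Mount CephFS via the kernel client. 10.0.0.1 is a hypothetical
# monitor address; /etc/ceph/admin.secret holds the client's key.
sudo mkdir -p /mnt/cephfs
sudo mount -t ceph 10.0.0.1:6789:/ /mnt/cephfs \
     -o name=admin,secretfile=/etc/ceph/admin.secret
```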
Ceph storage access
To provide storage services to other applications, a set of libraries has been developed, collectively known as LIBRADOS. These libraries enable third-party developers to use the APIs and access two services in the storage cluster: Ceph-monitor and Ceph-OSD. I'll come back to these later. LIBRADOS also enables the advanced storage services like snapshots.
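To give a feel for the API shape, here is a toy in-memory stand-in that mirrors the flow of the LIBRADOS Python bindings (connect to the cluster, open an I/O context on a pool, read and write objects). This is only an illustration; a real client would import the `rados` module and connect to a live cluster via `rados.Rados(conffile='/etc/ceph/ceph.conf')`.

```python
# Toy in-memory stand-in mirroring the librados Python binding flow.
# Not real Ceph code -- it only shows the shape of the API.

class ToyIoctx:
    """Per-pool I/O context, analogous to rados.Ioctx."""
    def __init__(self, pool):
        self.pool = pool          # dict of object-name -> bytes

    def write_full(self, name, data):
        self.pool[name] = data    # overwrite the whole object

    def read(self, name):
        return self.pool[name]

class ToyRados:
    """Cluster handle, analogous to rados.Rados."""
    def __init__(self):
        self.pools = {}

    def connect(self):
        pass                      # a real client contacts the monitors here

    def open_ioctx(self, pool_name):
        return ToyIoctx(self.pools.setdefault(pool_name, {}))

cluster = ToyRados()
cluster.connect()
ioctx = cluster.open_ioctx("rbd")
ioctx.write_full("hello_object", b"hello ceph")
print(ioctx.read("hello_object"))  # b'hello ceph'
```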
RBD - RADOS Block Device
RBD also uses the LIBRADOS libraries to access the storage cluster. It provides block devices to upper-layer applications like KVM or QEMU clients. The block devices are thinly provisioned and thus space-efficient, and block images can be freely exported from and imported into other Ceph clusters. The block images are not stored in the usual manner; they make good use of the distributed architecture and stripe their data across multiple nodes in the cluster. This enables a significant performance boost for read-intensive applications.
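The striping idea can be sketched with a little arithmetic: an RBD image is carved into fixed-size RADOS objects (4 MiB is the default object size), and it is these objects that get scattered across the cluster. The name prefix below is illustrative, not a real image ID.

```python
# Sketch of how an RBD image is split into fixed-size RADOS objects.
# 4 MiB is the RBD default; "rbd_data.1234" is a made-up prefix.

OBJECT_SIZE = 4 * 1024 * 1024  # 4 MiB

def rbd_objects(image_size, prefix="rbd_data.1234"):
    """Return the names of the objects backing an image of `image_size` bytes."""
    count = (image_size + OBJECT_SIZE - 1) // OBJECT_SIZE  # round up
    return ["%s.%016x" % (prefix, i) for i in range(count)]

# A 1 GiB image maps onto 256 objects, which placement then scatters
# across the cluster's OSDs -- reads of different extents hit
# different disks in parallel. Thin provisioning means an object is
# only actually created once that extent is written.
objs = rbd_objects(1 * 1024**3)
print(len(objs), objs[0])  # 256 rbd_data.1234.0000000000000000
```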
Ceph also provides a REST access method with an AWS S3-compatible interface. This enables cloud platforms like OpenStack, CloudStack, OpenNebula and others to access Ceph in a standard way, and it makes it easier for administrators to migrate between private and public clouds.
The entire cluster is watched over by the two types of services I mentioned above: the Ceph monitor and the Ceph OSD (Object Storage Daemon). The monitor daemon maintains a cluster map and can be placed anywhere in the cluster. Each cluster has at least one monitor, but in practice more should be installed based on location and redundancy requirements. The OSD daemon is the storage workhorse of the solution. This daemon slices and dices your data and, based on the configuration you've specified, places the resulting storage objects on nodes somewhere in the cluster.
Obviously, when scalability reaches into the exabytes, you know that any form of central management of data objects is a no-go. A single point of failure, plus performance degradation in case of contention or congestion on such a centrally managed system, does not allow for the massive scalability required in these infrastructures. The Ceph storage cluster therefore uses the so-called CRUSH algorithm. CRUSH is designed to compute the most effective location for data placement without having to use a central entity for updates and lookups. Based on the state and configuration of the cluster, the algorithm can dynamically adjust these placements and recover failed entities if needed. If you want to do a deep-dive on the algorithm, check here.
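The key property is that placement is computed, not looked up: every client derives the same answer from the cluster map alone. A greatly simplified toy illustrating that idea (real CRUSH walks a weighted hierarchy of failure domains; this just hashes deterministically):

```python
import hashlib

# Toy stand-in for CRUSH-style computed placement: an object name is
# hashed to a placement group (PG), and the PG is mapped to a set of
# OSDs -- deterministically, with no central lookup table.

def pg_for_object(name, pg_num):
    """Map an object name to a placement group (hash mod pg_num)."""
    h = int(hashlib.md5(name.encode()).hexdigest(), 16)
    return h % pg_num

def osds_for_pg(pg, osd_ids, replicas=3):
    """Pick `replicas` distinct OSDs for a PG, deterministically."""
    ranked = sorted(osd_ids,
                    key=lambda o: hashlib.md5(b"%d:%d" % (pg, o)).hexdigest())
    return ranked[:replicas]

osds = list(range(10))
pg = pg_for_object("my_object", pg_num=128)
placement = osds_for_pg(pg, osds)
# Any client performing this computation gets the identical answer:
assert placement == osds_for_pg(pg, osds)
print(pg, placement)
```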
To make sense of the entire cluster, it is managed by all processes according to a configuration recorded in five maps, collectively called the cluster map.
- The Monitor map: contains things like the cluster ID, the number of monitors and where they are located, the latest change time, the current time, etc. This is the instance clients use to obtain the cluster map and start storing and retrieving data on the respective OSDs.
- The OSD map: contains the storage layout of the cluster in the form of pools, replica sizes, placement group numbers (PGs), other OSDs and their status.
- The PG map: contains PG versions (which need to be the same across OSDs), timestamps and the status of each PG.
- The CRUSH map: as described above, the CRUSH map depicts the entire storage hierarchy, from the device level (disks) up through failure domains (for redundancy) and placement rules. The placement rules can take into account device types like flash, JBOD, arrays etc.
- The MDS map: the MDS is a metadata service needed for clients accessing the Ceph cluster via the Ceph-FS method. The MDS map keeps all the metadata that accompanies a stored data object. Things like access time, attributes, authorization etc. are stored in the MDS map.
Ceph provides the Cephx authentication system to keep the bad guys out. Its operation is similar to Kerberos: an authentication key, better known as a ticket or token, is provided by the Ceph monitors and then checked by the OSDs, and access is granted (or not). The authentication algorithm works between Ceph clients and servers only, so it does not propagate to other systems.
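The Kerberos-like flow can be sketched in a few lines: the monitor shares a secret with the OSDs, hands the client a signed ticket, and any OSD can verify that ticket locally without calling back to the monitor. This is only the shape of the idea; real cephx uses session keys and mutual authentication, and the secret below is of course made up.

```python
import hashlib, hmac, time

# Toy illustration of the cephx idea: tickets signed with a secret the
# monitor shares with the OSDs. Hypothetical key, not a real cephx key.
SERVICE_SECRET = b"shared-between-monitor-and-osds"

def monitor_issue_ticket(client, valid_for=3600, now=None):
    """Monitor side: hand the client a payload plus an HMAC signature."""
    expiry = int(now if now is not None else time.time()) + valid_for
    payload = "%s:%d" % (client, expiry)
    sig = hmac.new(SERVICE_SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return payload, sig

def osd_check_ticket(payload, sig, now=None):
    """OSD side: verify the signature and freshness locally."""
    expected = hmac.new(SERVICE_SECRET, payload.encode(), hashlib.sha256).hexdigest()
    _client, expiry = payload.rsplit(":", 1)
    fresh = int(expiry) > int(now if now is not None else time.time())
    return hmac.compare_digest(sig, expected) and fresh

payload, sig = monitor_issue_ticket("client.admin")
print(osd_check_ticket(payload, sig))       # True
print(osd_check_ticket(payload, "forged"))  # False
```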
Each application using the Ceph API is by definition Ceph-cluster aware. This means it knows the layout and state of the cluster and is thus able to participate in the distributed fashion of that cluster. The Ceph client is therefore also able to keep up to date with the state of the cluster and adjust its access accordingly. This enables a very flexible, scalable and highly available storage infrastructure.
I hope this short intro gives you a bird's-eye architectural view of Open Source storage, and of Ceph specifically.
Two books which describe Ceph, and Ceph in an OpenStack environment, are:
These come highly recommended.