December 29, 2016
5 min Read
Ceph, a free-software storage platform that scales to the exabyte level, became a hot topic this year. Ceph is now stable enough to be used by some of the largest companies and projects in the world, including Yahoo!, CERN, and Bloomberg. Hostinger has joined this league of Ceph users, but in our own way: we deployed our larger, better-performing cluster using not RadosGW or RBD, but the very fresh Ceph file system (CephFS).
Ceph is pretty hard to deploy and tune if you want to make the most of your cluster. There are tools that make deploying Ceph easier: ceph-ansible for Ansible, DeepSea for SaltStack, ceph-chef for Chef and, of course, ceph-deploy.
Our decision to use Ceph for the 000webhost.com project was thoroughly researched. At Hostinger we already had experience running both Ceph and Gluster in production, and we are now confident that Ceph is the right choice.
Before migrating all our storage needs to Ceph, Hostinger used Gluster, which is no longer suitable for our present needs. First of all, Gluster is harder to maintain than Ceph. Second, it is not as performant. Unlike Ceph, which has a native kernel client, Gluster is exported either through NFS (or NFS-Ganesha) or through a FUSE client. Even though the Gluster documentation says that FUSE is meant to be used where high performance is required, FUSE can’t compete with kernel clients.
When we mention that we use CephFS in production, we get reactions like “really!!!??” and some laughter. But we are trying to remain an innovative company that keeps its technology stack on the edge, so we continue to use CephFS. Another reason we chose CephFS is that none of the alternatives are good enough. Here at Hostinger we have tried running Ceph RBD images over an NFS share. This solution did provide us with shared storage, but under certain loads (a large number of clients sending a mix of IO-oriented and throughput-oriented read and write requests) NFS (and Ganesha) showed terrible performance with frequent lock-ups. Although it is stated that there is no iSCSI support for Ceph, we managed to export RBD images over iSCSI two years ago and successfully ran a few production projects this way. That setup gave us very decent performance, which we were satisfied with. But it is no longer suitable for our present needs either, because it does not provide shared storage.
For the initial launch of the project, we deployed a cluster that consists of:
For those wondering what NVMe is: it is a fairly new technology, with the first specification released in 2011. Our NVMe drives can achieve up to 3 GBps (24 Gbps) read throughput and up to 2 GBps (16 Gbps) write throughput, a substantial improvement over traditional HDDs and other SSDs.
As this cluster is built for shared hosting, most operations are IOPS-oriented.
It all boils down to latency, and latency is the key:
While you can’t do much about some of these parts, some things are in your control.
There are a lot of kernel parameters that can improve your cluster performance, or make it worse, so you should leave them at defaults unless you really know what you are doing.
By default, Ceph is configured to meet the requirements of spinning disks. But many clusters are SSD-only, a mix of SSDs (for journals) backed by spinning HDDs, or, as we have here, NVMe for cache backed by spinning HDDs. So we had to tune many of the Ceph configuration parameters. Because we use NVMe journals, the parameters with the biggest performance impact for us were those related to writeback throttling. The problem was throttling kicking in too soon: the journals were able to absorb much more load, but were forcibly slowed down by the backing HDDs. You can, but should not, turn off wb_throttling altogether, as that might later introduce huge spikes and slowdowns because the HDDs won’t keep up with the speed of NVMe. We changed these parameters based on our calculations.
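The article does not list the values we ended up with, so the following is only a minimal sketch of where such settings live. It assumes a FileStore OSD on XFS and uses the filestore_wbthrottle_xfs_* option names and defaults documented for Ceph releases of that era (verify both against your own version). The scaling factor is a made-up placeholder, not a recommendation.

```python
# Sketch: emit a ceph.conf [osd] fragment that raises the FileStore
# writeback-throttle thresholds by a common factor.
# ASSUMPTIONS: option names/defaults are the FileStore XFS ones documented
# for Ceph releases of that era; the factor is hypothetical.
WBTHROTTLE_DEFAULTS = {
    "filestore_wbthrottle_xfs_bytes_start_flusher": 41943040,
    "filestore_wbthrottle_xfs_bytes_hard_limit": 419430400,
    "filestore_wbthrottle_xfs_ios_start_flusher": 500,
    "filestore_wbthrottle_xfs_ios_hard_limit": 5000,
    "filestore_wbthrottle_xfs_inodes_start_flusher": 500,
    "filestore_wbthrottle_xfs_inodes_hard_limit": 5000,
}


def scaled_osd_section(factor: int) -> str:
    """Return an [osd] section with every threshold multiplied by `factor`."""
    lines = ["[osd]"]
    for option, default in WBTHROTTLE_DEFAULTS.items():
        lines.append(f"{option} = {default * factor}")
    return "\n".join(lines)


if __name__ == "__main__":
    # 4 is a placeholder; derive the real factor from your own journal
    # and HDD measurements.
    print(scaled_osd_section(4))
```

Raising the start_flusher and hard_limit values together keeps their ratio intact, so flushing still begins well before the hard limit is hit.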
The rule of thumb for Ceph is that you need as many CPU cores as there are OSDs on the node; beyond that, only core frequency counts. And when it comes to CPU frequency, you should be aware of CPU power states, namely C-states. You can turn them off at the cost of higher power consumption.
The Linux kernel has more than one IO scheduler, and different schedulers perform differently under different workloads. For Ceph, you should use the deadline scheduler.
When you are building hardware for your Ceph cluster, you need to take your RAID controller into account if you are going to use one. Not all RAID controllers are made equal. Some controllers will drain the CPU more than others, which means that you will get fewer CPU cycles for Ceph operations. Benchmarks done by the Ceph team depict this very well.
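Before turning C-states off, it helps to see which ones the kernel exposes. Here is a minimal sketch that reads the standard Linux cpuidle sysfs tree for one CPU. The function name is our own for illustration, and the directory may be absent on virtual machines or when cpuidle is not in use.

```python
from pathlib import Path


def list_cstates(cpuidle_dir="/sys/devices/system/cpu/cpu0/cpuidle"):
    """Return (name, disabled) pairs for each idle state under cpuidle_dir.

    Returns an empty list if the directory does not exist.
    """
    base = Path(cpuidle_dir)
    if not base.is_dir():
        return []
    states = []
    # Directories are named state0, state1, ...; sort them numerically.
    for state_dir in sorted(base.glob("state[0-9]*"),
                            key=lambda p: int(p.name[len("state"):])):
        name = (state_dir / "name").read_text().strip()
        disable_file = state_dir / "disable"
        disabled = (disable_file.exists()
                    and disable_file.read_text().strip() == "1")
        states.append((name, disabled))
    return states


if __name__ == "__main__":
    for name, disabled in list_cstates():
        print(f"{name}: {'disabled' if disabled else 'enabled'}")
```

This only inspects the current state. How you actually limit C-states (kernel boot parameters, BIOS settings, or writing to the disable files) depends on your platform.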
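To check which scheduler a disk is currently using, you can read /sys/block/<device>/queue/scheduler, where the active scheduler is the one shown in square brackets. A minimal sketch of parsing that file follows. The helper names are ours, and on newer multi-queue kernels the equivalent scheduler appears as mq-deadline.

```python
import re
from pathlib import Path
from typing import Optional


def active_scheduler(scheduler_file_text: str) -> Optional[str]:
    """Extract the bracketed (active) scheduler, e.g. 'noop [deadline] cfq'."""
    match = re.search(r"\[([^\]]+)\]", scheduler_file_text)
    return match.group(1) if match else None


def schedulers_by_device(sys_block="/sys/block"):
    """Map each block device name to its active scheduler (None if unknown)."""
    result = {}
    base = Path(sys_block)
    if not base.is_dir():
        return result
    for dev in sorted(base.iterdir()):
        sched_file = dev / "queue" / "scheduler"
        if sched_file.is_file():
            result[dev.name] = active_scheduler(sched_file.read_text())
    return result


if __name__ == "__main__":
    for device, scheduler in schedulers_by_device().items():
        print(f"{device}: {scheduler}")
```

Changing the scheduler is done by writing its name to the same file as root. Making the change persistent is usually handled with a udev rule or a kernel boot parameter.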
The filesystem should be configured accordingly, too.
Use XFS for the OSD file system with the following recommended options: noatime, nodiratime, logbsize=256k, logbufs=8, inode64.
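As an illustration, the options above can be assembled into an fstab-style mount entry. This is a sketch only: the device name and mount point are hypothetical, and many deployments let Ceph mount OSDs itself instead of using /etc/fstab.

```python
# The XFS mount options recommended above, in the order listed.
XFS_OSD_OPTIONS = ["noatime", "nodiratime", "logbsize=256k", "logbufs=8", "inode64"]


def fstab_entry(device: str, mount_point: str, options=XFS_OSD_OPTIONS) -> str:
    """Build one /etc/fstab line for an XFS-formatted OSD data partition."""
    return f"{device} {mount_point} xfs {','.join(options)} 0 0"


if __name__ == "__main__":
    # /dev/sdb1 and the ceph-0 path are placeholders for illustration.
    print(fstab_entry("/dev/sdb1", "/var/lib/ceph/osd/ceph-0"))
```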
Our first Ceph cluster was SSD-only, with nodes having 24 standalone OSDs. Benchmarks quickly revealed a bottleneck somewhere at 24 Gbps, even though the LSI HBA has a throughput of 48 Gbps. After digging into how the server is built and how the drives are connected, it became clear that the bottleneck was… the SAS expander, because of the way the drives are connected to it.
Our initial deployment used the FUSE Ceph client because it supports filesystem quotas and lightning-fast file and directory size reports. The reports are fast because the metadata server tracks every file operation, unlike traditional filesystems, where you have to calculate directory sizes every time you need them. But FUSE was unacceptably slow.
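The two figures are consistent with a common wiring pattern, although the article does not spell out the topology. Assuming 6 Gbps SAS-2 lanes, an 8-lane HBA gives 48 Gbps, while a single 4-lane wide port into the expander gives only 24 Gbps shared by all drives behind it. A back-of-the-envelope sketch under those assumptions (raw line rate, ignoring encoding overhead):

```python
# ASSUMPTION: SAS-2 lanes at 6 Gbps raw line rate; the lane counts below
# are a plausible topology, not one confirmed by the article.
SAS2_LANE_GBPS = 6


def link_bandwidth_gbps(lanes: int, lane_gbps: float = SAS2_LANE_GBPS) -> float:
    """Aggregate raw bandwidth of a SAS wide port with `lanes` lanes."""
    return lanes * lane_gbps


def per_drive_share_gbps(uplink_lanes: int, drives: int) -> float:
    """Bandwidth each drive gets when all drives saturate the uplink at once."""
    return link_bandwidth_gbps(uplink_lanes) / drives


if __name__ == "__main__":
    print("HBA, 8 lanes:", link_bandwidth_gbps(8), "Gbps")
    print("Expander uplink, 4 lanes:", link_bandwidth_gbps(4), "Gbps")
    print("Share per drive, 24 drives:", per_drive_share_gbps(4, 24), "Gbps")
```

Under these assumptions, each of the 24 SSDs gets about 1 Gbps when all are busy, far below what a single SSD can deliver.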
Let’s take some simple benchmarks:
The difference is huge. We believe the same applies to the Gluster FUSE client vs. the Gluster kernel client. Oh, wait…
As the kernel client (version 4.10) still does not support quotas, we tried to use the FUSE client and had to rewrite the CephFS mount wrapper mount.fuse.ceph so that it uses ceph-fuse and the filesystem parameters correctly, at the same time solving issues that arose while trying to use the old mount.fuse.ceph with systemd.
To sum up, CephFS is very resilient. Extending storage is simple, and with every Ceph software release, our storage will become more and more stable and performant.
Furthermore, we have bigger plans for how to use Ceph. But more on that is yet to come.