Thursday December 29, 2016
Hostinger Joins Yahoo!, CERN, and Bloomberg by Creating Scalable Hosting With CephFS
Ceph, a free-software storage platform that’s scalable to the exabyte level, became a hot topic this year. Ceph is now so stable that it’s used by some of the largest companies and projects in the world, including Yahoo!, CERN, and Bloomberg. Hostinger has joined the host of Ceph users, too. However, instead of using RadosGW or RBD, we deployed our larger, better-performing cluster with CephFS, Ceph’s new file system.
Ceph Setup Is Pretty Difficult But Not Impossible
Ceph is pretty difficult to deploy and to tune so that you get the most out of your cluster. That said, there are tools that make deploying it easier: ceph-ansible (a set of Ansible playbooks), DeepSea (a collection of Salt files), Ceph-Chef (a set of Chef cookbooks), and, of course, ceph-deploy.
Our decision to use Ceph for the 000webhost.com project was thoroughly researched. At Hostinger, we have already had experience using Ceph and Gluster in production. Now, we are confident that going for Ceph was the right choice.
Not As Performant? The Best or Nothing
Before migrating all our storage needs to Ceph, Hostinger used Gluster, which we’ve outgrown. First of all, Gluster is harder to maintain than Ceph. Secondly, it is not as performant. Unlike Ceph, which has a native kernel client, Gluster is exported either through NFS (or Ganesha-NFS) or through its FUSE client. Even though the Gluster documentation states that the FUSE client is meant to be used where high performance is required, FUSE can’t compete with kernel clients.
When we mentioned at LinuxCon Europe that we use CephFS in production, the reactions ranged from “Really!?” to laughter. But as an innovative company that keeps its technology stack cutting edge, we’ll keep using CephFS. We also chose CephFS because its alternatives are not good enough.
Here at Hostinger, we have experience trying to run Ceph RBD images exported over an NFS share. This solution did provide us with shared storage, but under certain loads (a large number of clients sending a mix of IOPS-oriented and throughput-oriented read and write requests), NFS (and Ganesha) performed terribly, with frequent lock-ups.
Although it has been stated that there is no iSCSI support for Ceph, we managed to export RBD images over iSCSI two years ago and successfully ran a few production projects this way. The performance was very decent and we were satisfied with it. Unfortunately, this solution is no longer suitable for our present needs either, because it does not provide shared storage.
How We Did It
For the initial launch of the project, we deployed a cluster consisting of:
- Three monitor nodes also acting as metadata servers and the cache tier. Each had one NVMe SSD, totaling 800 GB of cache tier for the hottest data.
- Five OSD nodes, each with 12 OSDs and one NVMe SSD for journals, totaling 60 OSDs and 120 TB of usable disk space (360 TB of raw space).
- Separate 10GE fiber networks – one public client network and one private cluster network. Like all of our internal networks, the Ceph cluster runs on an IPv6-only network.
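The capacity figures in the list above can be cross-checked with a little arithmetic. The following Python sketch derives the per-OSD drive size and the replication factor implied by the totals. Both derived values are inferences from the stated 360 TB raw / 120 TB usable / 60 OSD figures, not numbers given in the text.

```python
# Cross-check of the OSD capacity figures quoted above.
# ASSUMPTION: the per-OSD size and replication factor are inferred from the
# article's totals; they are not stated explicitly in the article.

def cluster_capacity(nodes, osds_per_node, raw_tb, usable_tb):
    """Return (total OSDs, raw TB per OSD, implied replication factor)."""
    total_osds = nodes * osds_per_node
    tb_per_osd = raw_tb / total_osds
    replication = raw_tb / usable_tb
    return total_osds, tb_per_osd, replication


total_osds, tb_per_osd, replication = cluster_capacity(
    nodes=5, osds_per_node=12, raw_tb=360, usable_tb=120
)
print(f"{total_osds} OSDs, {tb_per_osd} TB each, implied replication x{replication}")
```

The ratio of raw to usable space is consistent with keeping three copies of every object, but that is a reading of the numbers rather than a documented setting.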
NVMe is a relatively new technology, with the first specifications released in 2011. Our NVMe drives can achieve up to 3 GBps (24 Gbps) read throughput and up to 2 GBps (16 Gbps) write throughput, a substantial improvement over traditional HDDs and other SSDs.
A Long Road of Benchmarking, Tuning, and Lessons Learned
As this cluster is built for shared hosting, most of the operations are IOPS oriented. It all boils down to managing latency:
- Latency induced by the network.
- Latency induced by drives and their controllers.
- Latency induced by the CPU.
- Latency induced by the RAID controller.
- Latency induced by other hardware components.
- Latency induced by the kernel.
- Latency induced by Ceph code.
- Latency induced by wrong kernel, Ceph, network, file system, and CPU configurations.
You can’t do much about some of these, but others are under your control.
Tuning Kernel Configuration Parameters
Many kernel parameters can improve the performance of your cluster or make it worse. That’s why you should leave them at their default values unless you really know what you are doing.
Tuning Ceph Configuration Parameters
Ceph is configured to meet the requirements of spindle disks by default. However, our clusters vary: some are SSD-only, some use SSDs (for journals) backed by spindle HDDs, and some use NVMe for the cache backed by spindle HDDs. Consequently, we had to tune many of the Ceph configuration parameters.
Some of the configuration parameters that impacted our performance the most were related to writeback throttling (as we use NVMe journals). The problem was that throttling was kicking in too soon: the journals were able to absorb much more load but were forcibly slowed down by the backing HDDs. You can turn off wb_throttling altogether, but you should not, as that might later introduce huge spikes and slowdowns because the HDDs won’t be able to keep up with the speed of an NVMe SSD.
In the end, we changed these parameters based on our calculations.
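The exact values are not listed here. As a purely illustrative Python sketch, the snippet below scales the FileStore writeback-throttle thresholds for XFS by a single factor and prints them as a ceph.conf fragment. The default values shown and the factor of 4 are assumptions for illustration only, not the settings used on this cluster; verify the option names and defaults against your own Ceph release before changing anything.

```python
# Illustrative only: scale FileStore XFS writeback-throttle thresholds.
# ASSUMPTION: these are believed to be the FileStore defaults; check them
# against your Ceph release. The scaling factor is hypothetical.
DEFAULTS = {
    "filestore_wbthrottle_xfs_bytes_start_flusher": 41943040,
    "filestore_wbthrottle_xfs_bytes_hard_limit": 419430400,
    "filestore_wbthrottle_xfs_ios_start_flusher": 500,
    "filestore_wbthrottle_xfs_ios_hard_limit": 5000,
    "filestore_wbthrottle_xfs_inodes_start_flusher": 500,
    "filestore_wbthrottle_xfs_inodes_hard_limit": 5000,
}


def scaled_wbthrottle(factor):
    """Return every threshold multiplied by `factor`, as integers."""
    return {name: int(value * factor) for name, value in DEFAULTS.items()}


def to_conf(settings):
    """Render the settings as an [osd] section of ceph.conf."""
    lines = ["[osd]"]
    lines += [f"{name} = {value}" for name, value in settings.items()]
    return "\n".join(lines)


print(to_conf(scaled_wbthrottle(4)))
```

Raising the start-flusher and hard-limit thresholds together lets fast journals absorb more before throttling begins, while still keeping a limit in place so the backing HDDs are not overwhelmed.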
Tuning the CPU
The rule of thumb for Ceph is that you need as many CPU cores as there are OSDs in the node. After this, only the core clock speed counts. When it comes to CPU clock speeds, you should be aware of CPU states, namely C-States. You can turn them off, at the cost of higher power consumption.
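Before changing anything, it helps to see which idle states the kernel exposes. The Python sketch below reads them from the Linux cpuidle sysfs tree. It assumes a Linux host with a cpuidle driver loaded and returns an empty list otherwise; the function name is made up for this example.

```python
from pathlib import Path


def list_c_states(cpu=0, sysfs_root="/sys/devices/system/cpu"):
    """Return [(state name, disabled?)] for one CPU, in state order."""
    cpuidle = Path(sysfs_root) / f"cpu{cpu}" / "cpuidle"
    if not cpuidle.is_dir():
        return []  # no cpuidle driver, e.g. inside some virtual machines
    states = []
    # Directories are named state0, state1, ...; sort them numerically.
    for state_dir in sorted(cpuidle.glob("state*"), key=lambda p: int(p.name[5:])):
        name = (state_dir / "name").read_text().strip()
        disable_file = state_dir / "disable"
        disabled = disable_file.exists() and disable_file.read_text().strip() == "1"
        states.append((name, disabled))
    return states


for name, disabled in list_c_states():
    print(name, "disabled" if disabled else "enabled")
```

This only reports the current state; whether deeper C-States should be disabled depends on how much latency matters relative to power draw on your nodes.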
Choosing the Right IO Scheduler
The Linux kernel has more than one IO scheduler. Each performs differently under different workloads. For Ceph, you should use the deadline scheduler.
When you are building the hardware for your Ceph cluster, you need to take your RAID controller into account if you are going to use one. Each RAID controller has different properties. Some will drain the CPU more than others, which means fewer CPU cycles for Ceph operations. Benchmarks done by Ceph demonstrate this very well.
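To check which scheduler a drive is using, you can read /sys/block/<device>/queue/scheduler, where the kernel lists the available schedulers and marks the active one with square brackets. The Python helper below is a minimal sketch that parses such a line; the function name and sample line are illustrative.

```python
import re


def active_scheduler(scheduler_line):
    """Return the bracketed (active) scheduler name, or None if none is marked."""
    match = re.search(r"\[([^\]]+)\]", scheduler_line)
    return match.group(1) if match else None


# Sample contents of /sys/block/<device>/queue/scheduler on an older kernel.
print(active_scheduler("noop [deadline] cfq"))
```

On newer multi-queue kernels the corresponding scheduler is named mq-deadline, so the names you see will depend on your kernel version.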
Tuning the Underlying Filesystem
The filesystem should be configured accordingly, too. Use XFS for the OSD file system with the following recommended mount options: noatime, nodiratime, logbsize=256k, logbufs=8, inode64.
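As a small sketch of how these options fit together, the Python snippet below builds both a mount command and the matching ceph.conf line. The device path and mount point are hypothetical placeholders, and the snippet only prints the command rather than running it.

```python
# XFS mount options recommended in the text above.
XFS_OPTIONS = ["noatime", "nodiratime", "logbsize=256k", "logbufs=8", "inode64"]


def mount_command(device, mount_point, options=XFS_OPTIONS):
    """Return the mount command as an argument list (not executed here)."""
    return ["mount", "-t", "xfs", "-o", ",".join(options), device, mount_point]


# Hypothetical device and OSD directory, for illustration only.
print(" ".join(mount_command("/dev/sdb1", "/var/lib/ceph/osd/ceph-0")))

# Equivalent setting for the [osd] section of ceph.conf.
print("osd mount options xfs = " + ",".join(XFS_OPTIONS))
```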
Other Hardware Components
Our first Ceph cluster was SSD-only, with nodes having 24 standalone OSDs. Benchmarks quickly revealed a bottleneck at 24 Gbps somewhere, even though the LSI HBA had a throughput of 48 Gbps. Digging into how the server was built and how the drives were connected revealed that the bottleneck was the SAS expander, because of the way the drives were linked to it.
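One plausible reading of those two numbers, sketched below in Python, is a difference in the number of SAS lanes between the HBA and the expander uplink. The 6 Gbps lane speed and the lane counts are assumptions chosen to match the 48 Gbps and 24 Gbps figures; the actual wiring is not described here.

```python
# ASSUMPTION: 6 Gbps per SAS lane; the SAS generation and the lane counts
# are not stated in the article and are picked to match its figures.
LANE_GBPS = 6


def link_bandwidth_gbps(lanes, lane_gbps=LANE_GBPS):
    """Aggregate bandwidth of a wide SAS link with the given lane count."""
    return lanes * lane_gbps


def per_drive_share_gbps(link_gbps, drives):
    """Bandwidth each drive gets if all drives are busy at once."""
    return link_gbps / drives


hba = link_bandwidth_gbps(8)              # e.g. two x4 ports
expander_uplink = link_bandwidth_gbps(4)  # e.g. a single x4 uplink
print(hba, expander_uplink, per_drive_share_gbps(expander_uplink, 24))
```

Under these assumptions, 24 busy SSDs behind a single four-lane uplink would share roughly 1 Gbps each, which is far below what an SSD can deliver.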
The Kernel Client vs the FUSE Client
Our initial deployment used the FUSE Ceph client because it supports filesystem quotas and lightning-fast file and directory size reports. This is made possible by the metadata server tracking every file operation (unlike traditional filesystems, where you have to calculate directory sizes every time you need them). That said, FUSE was unacceptably slow.
Let’s look at some simple benchmarks:
- As this storage is used for the shared hosting platform, we consider WordPress deployment speed to be a good benchmark. Using the FUSE client, it took, on average, 30 seconds to extract and deploy WordPress on our CephFS filesystem. In comparison, using the kernel client, it takes up to 2 seconds.
- The FUSE client took 40 seconds on average to extract Drupal. With the kernel client, it takes up to 4 seconds.
- Another quick benchmark involves extracting the Linux kernel. Using the FUSE client, it took 4 minutes on average to extract linux-4.10.tar.xz on the CephFS filesystem. Using the kernel client, it takes up to 30 seconds.
The difference is massive. We believe the Gluster FUSE client suffers from the same kind of overhead compared with a kernel client.
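Extraction benchmarks like the ones above are easy to reproduce. The Python sketch below times the extraction of an archive into a fresh directory several times and returns the mean. It is an assumed harness for illustration, not the tooling used for the numbers above, and the function name is made up.

```python
import tarfile
import tempfile
import time
from pathlib import Path


def time_extract(archive, dest_root, runs=3):
    """Extract `archive` into fresh directories under `dest_root`.

    Returns the mean extraction time in seconds. Cleanup of each
    directory happens after the timer stops, so it is not counted.
    """
    timings = []
    for _ in range(runs):
        with tempfile.TemporaryDirectory(dir=dest_root) as dest:
            start = time.perf_counter()
            with tarfile.open(archive) as tar:
                tar.extractall(dest)
            timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)
```

Pointing `dest_root` first at a directory on a FUSE mount and then at one on a kernel-client mount of the same CephFS filesystem gives two directly comparable averages. Only extract archives you trust, since extraction writes whatever paths the archive contains.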
As the kernel client (version 4.10) still does not support quotas, we tried to use the FUSE client and had to rewrite the CephFS mount wrapper mount.fuse.ceph to use ceph-fuse and filesystem parameters correctly, simultaneously solving the issues that would arise while trying to use the old mount.fuse.ceph with systemd.
Lessons Learned
- Latency is key.
- You have to inspect all the components.
- Sometimes it’s the little things that make a big difference.
- Default values are not always the best.
- Stop using ceph-deploy; use ceph-ansible or DeepSea (SaltStack) instead.
To sum up, CephFS is very resilient. Extending storage is simple, and with every new Ceph software release, our storage will become more and more stable and performant.
Furthermore, we have bigger plans for Ceph implementations. We’ll talk more about that in the future.