Engineering News

Interview with Balys Kriksciunas: 000webhost at Percona Live 2017

Scaling a million databases on 000webhost

The Percona Live Open Source Database Conference is one of the industry-leading events for scaling businesses and the dynamic communities that develop and use open source database software. MySQL, MongoDB, PostgreSQL, and other world-class open source databases are the main focus, each covered from a strongly technical point of view. Percona Live Europe features in-depth discussions on solution scalability, database architecture, security, and performance, aimed at meeting the needs of a growing business.

000Webhost at Percona Live Europe

000Webhost handles more than a million unique user queries per minute across a total of almost a million databases. Our head of engineering, Balys Kriksciunas, gave a talk on the obstacles encountered at such a scale. In short, he explained why 000Webhost now runs MariaDB in LXC containers and routes queries with HAProxy and ProxySQL, and he presented the timeline of testing and trialing these solutions. Balys also talked about our partnership with René Cannaò, the creator of ProxySQL, and about the query routing logic in our new infrastructure.

000Webhost’s database issue was a common one among hosting providers: a single logical shared platform. Databases were sharded manually across different physical database servers, which in turn were manually assigned to web servers. This logic quickly became inefficient for 000webhost as the user base grew extremely rapidly.

000Webhost now runs on heterogeneous hardware, where any user can be hosted on any server, and it scales seamlessly simply by adding new servers. An automated load-balancing solution migrates users between nodes without any downtime when those nodes run out of capacity or become overloaded, and each web node can route web requests to any other node. The orchestration of these solutions ensures a stable and trusted environment for the end user.

Scaling 1 Million Databases: Roadmap

 

What were the issues encountered at the beginning?

We were dependent on each server’s hardware resources and therefore unable to scale user resources dynamically. As each user is unique, servers had very different workloads: some were close to idle, others were overloaded, even with the same database count. Performance was spiky and we could not guarantee stability.

Because hostnames were hardcoded in user scripts, account migration incurred downtime and was error-prone. For example, if a user running WordPress needed to be migrated from server A to server B, their WordPress configuration files had to be changed manually.

How did you come up with the solutions?

We started brainstorming and figured that if users put localhost in their configuration files, all user migrations would become seamless: their configuration files could remain untouched even if we changed the server hosting the database. This began a chain of logical changes. First, we added an HAProxy instance listening on a local socket on each web server. This allowed all user scripts to connect to localhost, with HAProxy routing the connections to the physical MySQL server in the backend.
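To make the idea concrete, here is a minimal sketch of what a user script looks like from the application side, written in Python with the PyMySQL driver and hypothetical credentials and hostnames; the point is simply that the script only ever talks to localhost, while the local HAProxy listener forwards the connection to whichever backend currently holds the database.

    import pymysql  # pip install pymysql

    # Before: a backend hostname was hardcoded, so moving the database to
    # another server meant editing every script and config file.
    # conn = pymysql.connect(host="db-server-a.example.com", user="u123_wp", ...)

    # After: the script only knows about localhost. HAProxy listens locally on
    # the web node and forwards the connection to the physical MySQL backend,
    # so the account can be migrated without touching this file.
    conn = pymysql.connect(
        host="127.0.0.1",        # local HAProxy listener, not a real DB host
        port=3306,
        user="u123_wp",          # hypothetical per-user credentials
        password="secret",
        database="u123_wordpress",
    )

    with conn.cursor() as cur:
        cur.execute("SELECT VERSION()")
        print(cur.fetchone())
    conn.close()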

What happened next? How did you deal with database issues?

After we implemented HAProxy, only the migration issue was solved; all the issues with the database servers themselves were still present. The main problems were caused by acl_get() and file-locking glitches, so we decided to containerize the physical servers, minimizing the number of databases per instance. We chose LXC over OpenVZ because at the time we used CephFS and needed support for the newest kernel releases. Later we migrated from CephFS to pure SSD storage but stayed with LXC.

With all these changes implemented, did you see any remarkable outcome?

By then we had so many containerized instances that the HAProxy configuration exploded in size, which led us to add an extra query routing layer. First we tried MySQL Proxy, a proxy that could rewrite database queries to remove overloads caused by spikes. However, it could not route by database username, and at our scale it performed poorly. We also tried MaxScale, made by MariaDB, which seemed optimal, but it stored all configuration in physical files; with millions of users and databases, the configuration files became too large to perform well.

Partnering with René Cannaò

Then we discovered ProxySQL. It could rewrite queries based on regular expressions and route them by username to different shards (database instances). It was also able to apply per-user limits and to backlog queries. ProxySQL’s configuration was stored in SQLite, and initial tests showed a great performance boost. We did run into several issues: each time a new user was added, changed, or removed, ProxySQL required a reload of the runtime configuration in memory. With a million-user base, such a reload took quite a long time, sometimes even 20 minutes, and caused global downtime. We did not want to give up on ProxySQL, so we reached out to René, the author of ProxySQL, to help us out. René rewrote some of the logic in ProxySQL and made reloads asynchronous, so they now complete in sub-second time. Engineers at Hostinger appreciate ProxySQL a lot and have also contributed to the ProxySQL project by adding full IPv6 support, which 000Webhost is running on at the moment.
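As a rough illustration of the username-based routing described above, the sketch below uses Python and PyMySQL to talk to ProxySQL’s admin interface (by default on port 6032); the hostnames, hostgroup ID, and user are made up, and this is only one simplified way to register a shard and pin a user to it, not a description of 000Webhost’s actual provisioning scripts.

    import pymysql

    # Connect to ProxySQL's admin interface (default port 6032, admin/admin).
    admin = pymysql.connect(host="127.0.0.1", port=6032,
                            user="admin", password="admin", autocommit=True)

    with admin.cursor() as cur:
        # Register a backend MariaDB container as hostgroup 10 (one shard).
        cur.execute(
            "INSERT INTO mysql_servers (hostgroup_id, hostname, port) "
            "VALUES (10, 'db-container-17.internal', 3306)"
        )
        # Route this user's connections to that shard by default.
        cur.execute(
            "INSERT INTO mysql_users (username, password, default_hostgroup) "
            "VALUES ('u123_wp', 'secret', 10)"
        )
        # Regex-based query rewriting and routing would go into the
        # mysql_query_rules table (match_pattern, replace_pattern,
        # destination_hostgroup); omitted here for brevity.

        # Push the new configuration to runtime and persist it to ProxySQL's
        # SQLite store; this runtime load is the step that used to take
        # minutes at million-user scale before reloads were made asynchronous.
        for stmt in ("LOAD MYSQL SERVERS TO RUNTIME", "SAVE MYSQL SERVERS TO DISK",
                     "LOAD MYSQL USERS TO RUNTIME", "SAVE MYSQL USERS TO DISK"):
            cur.execute(stmt)

    admin.close()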
