
SRE Daily Life: Bleeding CART

As you know, we run Hostinger. Our main website is a critical component: it is the company's business card and, even more, the main gateway for onboarding new clients and generating revenue through the /cart endpoint.

Initially, we had only one instance of our main website, provisioned in a single data center in the United Kingdom that was equipped with an archaic network infrastructure, and we were experiencing lots of DDoS attacks.

A quick solution was to put the website behind Cloudflare to mitigate the DDoS attacks. It worked well, but not for long: we were only treating the symptom, and the actual problem persisted until we suffered huge downtime when the aforementioned data center went down for more than four hours.

Scaling Issues

We decided to scale our application across the globe and launched two application instances per location. At the moment we use our global Anycast in three locations: Singapore, the United States, and the Netherlands. In total, we launched six instances.

Alright, the code was deployed, but now we had a problem with the database: how do we make it easily available and accessible from every location?

Our developers determined that almost all of the requests going to the database were reads, so we decided to start using a geo-distributed Percona XtraDB Cluster. We started one instance per location, three in total.

Later, it turned out that the /cart endpoint mostly writes to the database; the workload was not what we had expected and planned for. We decided to move the /cart logic to Redis, which raised the same question again: how do we scale Redis so the servers don't melt down once more? Redis is mostly used for caching /cart data for one month and for shared PHP sessions, and of course it is not as critical as the database. Moving /cart there would offload some of the write load from the database.

We bootstrapped one Redis instance per location, without any shared state between them. If a request comes to Europe, it uses Europe's Redis instance; if it comes to the US, it uses the US one, and so on.
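In code, the per-location cart caching looks roughly like the sketch below. It is a minimal Python illustration rather than our actual PHP code; the hostname, key layout, and payload format are assumptions.

```python
import json

import redis

# Each location talks only to its local Redis instance (hypothetical hostname).
r = redis.Redis(host="redis.local", port=6379, db=0)

CART_TTL = 60 * 60 * 24 * 30  # keep /cart data for one month


def save_cart(session_id: str, cart: dict) -> None:
    # Store the serialized cart with a 30-day expiry instead of writing it to MySQL.
    r.setex(f"cart:{session_id}", CART_TTL, json.dumps(cart))


def load_cart(session_id: str) -> dict:
    raw = r.get(f"cart:{session_id}")
    return json.loads(raw) if raw else {}
```

Since the instances share no state, a cart created in Europe lives only in Europe's Redis, which is fine as long as the client keeps hitting the same region.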

World Domination Tour

Ok, so the solution wasn't as future-proof as we wanted it to be. We now had Cloudflare on top and the application scaled across multiple locations, but we found another issue: lots of SQLSTATE "MySQL server has gone away" errors coming from our application. I started digging around and spotted that the XtraDB Cluster was throwing timeout errors while forming a quorum. As usual, there was no time to investigate further, so we applied some quick fixes: increased cluster timeouts, more retries, and larger window sizes for the replication buffers. Things got around 5% better, but we still had timeouts and connection drops, and we still experienced downtime. Then I checked the latency between the XtraDB endpoints. Greetings from Asia: there was huge packet loss between Asia and Europe.
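For reference, those quick fixes were of this flavor: bumping the Galera group-communication timeouts and replication window sizes through wsrep_provider_options. The values below are purely illustrative, not our production settings.

```ini
# my.cnf fragment (illustrative values only)
[mysqld]
wsrep_provider_options = "evs.suspect_timeout=PT30S; evs.inactive_timeout=PT1M; evs.keepalive_period=PT3S; evs.send_window=512; evs.user_send_window=512; gcs.fc_limit=256"
```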

I contacted the data center staff to re-route our prefixes through another upstream, and the service was back again. On top of that, the latency between Asia and Europe is around 250 ms, and because XtraDB Cluster acknowledges a write only after every node in the cluster confirms it, each commit costs roughly one round trip; do the math and a single writer tops out at about 1000 ms / 250 ms = 4 writes per second.

The next day, the packet loss between the locations started happening again. We contacted the data center guys, and the problem was fixed, temporarily, once more. Eventually we resolved to get rid of the XtraDB Cluster. The next question was how to keep database connections highly available if one instance goes down. We launched a custom MySQL cluster solution based on ExaZK, and the situation changed a lot.
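ExaZK internals are a story of their own, but the underlying pattern is a health-check process that announces a MySQL service IP over BGP while the local instance is healthy and withdraws it otherwise. The sketch below shows that pattern using the plain ExaBGP process API in Python; the service route, port check, and polling interval are assumptions, and this is not the actual ExaZK implementation.

```python
import socket
import sys
import time

SERVICE_ROUTE = "10.10.10.10/32"  # hypothetical MySQL service IP announced via BGP
MYSQL_ENDPOINT = ("127.0.0.1", 3306)  # local MySQL instance to health-check


def mysql_alive() -> bool:
    # A real check would also verify replication state, not just the TCP port.
    try:
        with socket.create_connection(MYSQL_ENDPOINT, timeout=2):
            return True
    except OSError:
        return False


announced = False
while True:
    healthy = mysql_alive()
    if healthy and not announced:
        # ExaBGP reads routing commands from this process's stdout.
        sys.stdout.write(f"announce route {SERVICE_ROUTE} next-hop self\n")
        sys.stdout.flush()
        announced = True
    elif not healthy and announced:
        sys.stdout.write(f"withdraw route {SERVICE_ROUTE} next-hop self\n")
        sys.stdout.flush()
        announced = False
    time.sleep(1)
```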

[Image: uptime checks chart]

After monitoring for about a week how the application behaved with the new changes, we still noticed huge spikes in response times. We created a New Relic account and started monitoring the whole application itself.

[Image: New Relic transaction time chart]

We got some really handy metrics related to MySQL slow queries and external resources. It was very clear what the problem was: we were hammering our MySQL instances with unnecessary read requests just to fetch the translations for a given language. We decided to generate the translations as JSON files and load them quickly from disk instead of querying the database on every request.

[Image: JSON example instead of queries]

[Image: GitHub merge commit into the master branch]
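Our application is PHP, but the idea fits in a few lines of any language: dump each language's translations into a JSON file once, then read that file at request time instead of querying MySQL. Here is a rough Python sketch; the path, schema, and credentials are made up.

```python
import json
import pathlib

import pymysql  # assumption: any MySQL client would do

OUT_DIR = pathlib.Path("/var/www/translations")  # hypothetical location of the JSON files


def export_translations(lang: str) -> None:
    """One-off export: dump the translations table into a flat JSON file."""
    conn = pymysql.connect(host="127.0.0.1", user="app", password="...", db="app")
    with conn.cursor() as cur:
        # Hypothetical schema: translations(lang, msg_key, msg_value)
        cur.execute("SELECT msg_key, msg_value FROM translations WHERE lang = %s", (lang,))
        rows = cur.fetchall()
    conn.close()
    (OUT_DIR / f"{lang}.json").write_text(json.dumps(dict(rows), ensure_ascii=False))


def translations(lang: str) -> dict:
    """Request path: read the pre-generated JSON file instead of querying MySQL."""
    return json.loads((OUT_DIR / f"{lang}.json").read_text())
```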

With this change, latency dropped noticeably once the JSON files replaced the per-request database queries.

Now we have another problem (not as critical as the database one): the JSON files are around 500 KB in size. They are read on every request, which generates approximately 3k read() syscalls per second. With an 8 KB read buffer, 500 KB / 8 KB ≈ 62 read()s to fully read one language file. For those who are interested, I got those numbers with a Sysdig one-liner.
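Something along these lines does the trick; the path filter and the ten-second window are assumptions, not the exact command we ran.

```sh
# Count read() syscalls hitting the translation files over a 10-second window,
# then divide by 10 to get the per-second rate.
sudo timeout 10 sysdig -p "%evt.time %proc.name %fd.name" \
  "evt.type=read and fd.name contains /translations/" | wc -l
```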

What's a day without surprises? We still noticed a very high rate of timeouts in our monitoring. We quickly went through the monitoring tools we have (Grafana, Prometheus, Graylog) to double-check what was going on and cross-referenced the findings with StatusCake and Pingdom stats.

Well, well, well. The reason was that our top-of-rack switch was faulty and restarted a few times per day. Notice the gaps in the graph below.

[Image: CPU usage graph comparing before and after]

When this happened, ExaZK pointed the application to a live MySQL instance, so the database layer kept working in an HA fashion. Eventually, we replaced the faulty network switch with a new one and started seeing 100% uptime.

At the moment our website is working as shown in the screenshot below.

[Image: uptime checks screenshot showing near 100% uptime]

Improvements on the Roadmap:

Cache the translation JSON files directly in the browser to shift that load to the client side.

Implement GeoDNS to pick the nearest location based on the client's source IP address. This is already tested in our development environment; we are just waiting for a stable release of PowerDNS 4.2 (a rough example of this follows below).

In the future, we would love to implement regional Anycast together with GeoDNS so we can fail over to a live data center in case of failure: one global Anycast prefix plus a region-allocated prefix. The two prefixes overlap, which allows a smooth failover if one region goes down completely. For instance, say the GeoDNS server resolves a CNAME to the address 2A02:4780:C3::1 for the CDN's resolver, and at that moment this region is down; new connections will be redirected to the PoP with the shortest AS path thanks to the overlapping global Anycast network.
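As a rough illustration of the GeoDNS item above: PowerDNS 4.2 brings LUA records with a pickclosest() helper that returns the address geographically nearest to the querying resolver. The name and IPs below are placeholders, not our real records.

```
; hypothetical zone snippet for PowerDNS 4.2 LUA records
www.example.com. 30 IN LUA A "pickclosest({'203.0.113.10', '198.51.100.10', '192.0.2.10'})"
```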
