{"id":1003,"date":"2019-05-27T10:28:49","date_gmt":"2019-05-27T10:28:49","guid":{"rendered":"https:\/\/www.hostinger.com\/blog\/?p=1003"},"modified":"2022-09-26T16:10:25","modified_gmt":"2022-09-26T16:10:25","slug":"sre-daily-life-bleeding-cart","status":"publish","type":"post","link":"https:\/\/www.hostinger.com\/blog\/sre-daily-life-bleeding-cart","title":{"rendered":"SRE Daily Life: Bleeding CART"},"content":{"rendered":"<h2><\/h2><p>As you know, we run <a href=\"https:\/\/hostinger.com\" target=\"_blank\" rel=\"noopener\">Hostinger.com<\/a>. It&rsquo;s a critical component as it&rsquo;s like our company&rsquo;s greeting card. Even more, it&rsquo;s the main gateway for onboarding new clients and making revenue through the <em>\/cart<\/em>.<\/p><p>Initially, we only had one instance of our main website provisioned in a single data center located in the United Kingdom which was equipped with an archaic network infrastructure. We were experiencing lots of DDoS attacks.<\/p><p>A quick solution was to put the website under <a href=\"https:\/\/www.cloudflare.com\/\" target=\"_blank\" rel=\"noopener\">Cloudflare<\/a> to mitigate these attacks. It worked well but not for long. As we were treating the symptoms, the actual problem persisted until we suffered a huge blackout when that aforementioned data center went down for more than 4 hours.<\/p><h2 id=\"h-scaling-issues\">Scaling Issues<\/h2><p>We decided to scale our application across the globe by launching two instances of applications per location. We use our global <strong>Anycast<\/strong> at the moment in three locations: Singapore, the United States, and the Netherlands. 
In total, we launched six instances.<\/p><p>Once the code was deployed, we ran into a problem with the database &ndash; how could we keep it easily available and accessible from every location?<\/p><p>Our developers determined that the database was mostly receiving read requests, so we decided to start using the <strong>Geo Percona XtraDB<\/strong> cluster. We launched one instance per location, three in total.<\/p><p>Later, it turned out that the <strong>\/cart<\/strong> endpoint mostly writes to the database. We had not planned for this workload, so we decided to move the <strong>\/cart<\/strong> logic to <strong>Redis<\/strong>. This raised another question &ndash; how to scale Redis so that the servers would not melt down again? We mostly use it to cache <strong>\/cart<\/strong> data for one month and to store shared PHP sessions. It&rsquo;s not as critical as the database, and it takes some of the write load off the database.<\/p><p>We bootstrapped one Redis instance per location with no shared state between them. A request that arrives in Europe uses the European Redis instance, a request that arrives in the US uses the American one, and so on.<\/p><h2 id=\"h-world-domination-tour\">World Domination Tour<\/h2><p>Ok, so the solution wasn&rsquo;t as future-proof as we wanted it to be. Now we had Cloudflare on top and a scaled application in multiple locations. But we ran into another issue. We noticed lots of &lsquo;SQLSTATE MySQL gone away&rsquo; errors from our application. I started digging around and noticed that the XtraDB cluster was throwing timeout errors while forming a quorum. As usual, there was no time to investigate it further, so we applied some quick fixes: increased cluster timeouts, retries, and larger window sizes for replication buffers. Things worked around 5% better, but we still had timeouts, connection drops, and downtime. Then I checked the latency between XtraDB endpoints. 
There was a lot of packet loss between Asia and Europe.<\/p><p>I contacted the data center staff to re-route our prefixes through another upstream and the service was back online again. The latency between Asia and Europe is around <strong>250ms<\/strong>, so we were limited to roughly 4 writes per second (1 second \/ 250ms), because the XtraDB cluster acknowledges a write only after all the nodes in the cluster confirm it.<\/p><p>The next day, the packet loss between locations started happening again. We contacted the data center staff and managed to fix the problem temporarily again. Eventually, we decided to get rid of the XtraDB cluster.<\/p><p>The next problem was how to keep database connections highly available if one instance goes down. We launched a custom MySQL cluster solution based on <a href=\"https:\/\/www.hostinger.com\/blog\/mysql-setup-at-hostinger-explained\">ExaZK<\/a>. The situation changed a lot.<\/p><p><img decoding=\"async\" class=\"aligncenter wp-image-1006 size-full\" src=\"https:\/\/www.hostinger.com\/blog\/wp-content\/uploads\/sites\/4\/2019\/05\/Screenshot-from-2019-05-22-15-28-52.png\" alt=\"uptime checks chart\" width=\"471\" height=\"421\" srcset=\"https:\/\/imagedelivery.net\/LqiWLm-3MGbYHtFuUbcBtA\/wp-content\/uploads\/sites\/4\/2019\/05\/Screenshot-from-2019-05-22-15-28-52.png\/w=471,fit=scale-down 471w, https:\/\/imagedelivery.net\/LqiWLm-3MGbYHtFuUbcBtA\/wp-content\/uploads\/sites\/4\/2019\/05\/Screenshot-from-2019-05-22-15-28-52.png\/w=300,fit=scale-down 300w\" sizes=\"(max-width: 471px) 100vw, 471px\" \/><\/p><p>We monitored how the application performed with the new changes for about a week but still noticed huge spikes in response times. 
We then created a <a href=\"https:\/\/newrelic.com\" target=\"_blank\" rel=\"noopener\">New Relic<\/a> account and started monitoring the whole application itself.<\/p><h2><img decoding=\"async\" class=\"aligncenter wp-image-1027 size-full\" src=\"https:\/\/www.hostinger.com\/blog\/wp-content\/uploads\/sites\/4\/2019\/05\/Screenshot-from-2019-05-23-10-18-06.png\" alt=\"New Relic transaction time illustrated\" width=\"1030\" height=\"413\" srcset=\"https:\/\/imagedelivery.net\/LqiWLm-3MGbYHtFuUbcBtA\/wp-content\/uploads\/sites\/4\/2019\/05\/Screenshot-from-2019-05-23-10-18-06.png\/w=1030,fit=scale-down 1030w, https:\/\/imagedelivery.net\/LqiWLm-3MGbYHtFuUbcBtA\/wp-content\/uploads\/sites\/4\/2019\/05\/Screenshot-from-2019-05-23-10-18-06.png\/w=300,fit=scale-down 300w, https:\/\/imagedelivery.net\/LqiWLm-3MGbYHtFuUbcBtA\/wp-content\/uploads\/sites\/4\/2019\/05\/Screenshot-from-2019-05-23-10-18-06.png\/w=768,fit=scale-down 768w, https:\/\/imagedelivery.net\/LqiWLm-3MGbYHtFuUbcBtA\/wp-content\/uploads\/sites\/4\/2019\/05\/Screenshot-from-2019-05-23-10-18-06.png\/w=1024,fit=scale-down 1024w, https:\/\/imagedelivery.net\/LqiWLm-3MGbYHtFuUbcBtA\/wp-content\/uploads\/sites\/4\/2019\/05\/Screenshot-from-2019-05-23-10-18-06.png\/w=990,fit=scale-down 990w\" sizes=\"(max-width: 1030px) 100vw, 1030px\" \/><\/h2><p>We got some really handy metrics related to <strong>MySQL<\/strong> slow queries and external resources. It was very clear what the problem was &ndash; we were flooding our MySQL instances with unnecessary read requests that fetched the translations for a given language. 
We decided to generate translation files in the JSON format and load them quickly instead of querying the database with every request.<\/p><p><img decoding=\"async\" class=\"aligncenter wp-image-1005 size-full\" src=\"https:\/\/www.hostinger.com\/blog\/wp-content\/uploads\/sites\/4\/2019\/05\/Screenshot-from-2019-05-22-16-51-00.png\" alt=\"JSON example instead of queries\" width=\"604\" height=\"186\" srcset=\"https:\/\/imagedelivery.net\/LqiWLm-3MGbYHtFuUbcBtA\/wp-content\/uploads\/sites\/4\/2019\/05\/Screenshot-from-2019-05-22-16-51-00.png\/w=604,fit=scale-down 604w, https:\/\/imagedelivery.net\/LqiWLm-3MGbYHtFuUbcBtA\/wp-content\/uploads\/sites\/4\/2019\/05\/Screenshot-from-2019-05-22-16-51-00.png\/w=300,fit=scale-down 300w\" sizes=\"(max-width: 604px) 100vw, 604px\" \/><\/p><p><img decoding=\"async\" class=\"aligncenter wp-image-1004 size-full\" src=\"https:\/\/www.hostinger.com\/blog\/wp-content\/uploads\/sites\/4\/2019\/05\/Screenshot-from-2019-05-22-16-52-17.png\" alt=\"GitHub merge commit into master branch\" width=\"541\" height=\"91\" srcset=\"https:\/\/imagedelivery.net\/LqiWLm-3MGbYHtFuUbcBtA\/wp-content\/uploads\/sites\/4\/2019\/05\/Screenshot-from-2019-05-22-16-52-17.png\/w=541,fit=scale-down 541w, https:\/\/imagedelivery.net\/LqiWLm-3MGbYHtFuUbcBtA\/wp-content\/uploads\/sites\/4\/2019\/05\/Screenshot-from-2019-05-22-16-52-17.png\/w=300,fit=scale-down 300w\" sizes=\"(max-width: 541px) 100vw, 541px\" \/><\/p><p>With this change, we cut the latency noticeably, since JSON files replaced the database requests.<\/p><p>Now we have another problem (not as critical as the one with the database) &ndash; each JSON file is around 500kb in size. The files are read with every request and generate approximately 3k <strong>`read()`<\/strong> syscalls per second. At 8kb per call, it takes 500kb \/ 8kb ~= 62 read()s to fully read one language&rsquo;s file. 
For those who are interested, I got those numbers using this <strong>Sysdig<\/strong> command:<\/p><pre class=\"\"># sysdig evt.args contains \"json\"<\/pre><p>We still noticed a very high rate of timeouts in our monitoring tools. We quickly went through Grafana, Prometheus, and Graylog to double-check what was going on and cross-referenced the results with <a href=\"http:\/\/StatusCake.com\" target=\"_blank\" rel=\"noopener\">StatusCake<\/a> and <a href=\"http:\/\/Pingdom.com\" target=\"_blank\" rel=\"noopener\">Pingdom<\/a> stats.<\/p><p>The issue was that our top-of-rack switch was faulty and restarted a few times per day. You can see the gaps in the graph below.<\/p><p><img decoding=\"async\" class=\"aligncenter wp-image-1025 size-full\" src=\"https:\/\/www.hostinger.com\/blog\/wp-content\/uploads\/sites\/4\/2019\/05\/Screenshot-from-2019-05-22-17-47-04.png\" alt=\"CPU usage graph comparing before and after\" width=\"921\" height=\"234\" srcset=\"https:\/\/imagedelivery.net\/LqiWLm-3MGbYHtFuUbcBtA\/wp-content\/uploads\/sites\/4\/2019\/05\/Screenshot-from-2019-05-22-17-47-04.png\/w=921,fit=scale-down 921w, https:\/\/imagedelivery.net\/LqiWLm-3MGbYHtFuUbcBtA\/wp-content\/uploads\/sites\/4\/2019\/05\/Screenshot-from-2019-05-22-17-47-04.png\/w=300,fit=scale-down 300w, https:\/\/imagedelivery.net\/LqiWLm-3MGbYHtFuUbcBtA\/wp-content\/uploads\/sites\/4\/2019\/05\/Screenshot-from-2019-05-22-17-47-04.png\/w=768,fit=scale-down 768w\" sizes=\"(max-width: 921px) 100vw, 921px\" \/><\/p><p>When this happened, <a href=\"https:\/\/www.hostinger.com\/blog\/mysql-setup-at-hostinger-explained\/\">ExaZK<\/a> started to point to a live MySQL instance and it worked in an HA fashion. 
Eventually, we replaced our faulty network switch with a new one and we started having 100% uptime.<\/p><p>At the moment our website is working as shown in the screenshot below.<\/p><p><img decoding=\"async\" class=\"aligncenter wp-image-1007 size-full\" src=\"https:\/\/www.hostinger.com\/blog\/wp-content\/uploads\/sites\/4\/2019\/05\/Screenshot-from-2019-05-22-15-28-33.png\" alt=\"Uptime checks screenshot showing near 100% uptime\" width=\"475\" height=\"830\" srcset=\"https:\/\/imagedelivery.net\/LqiWLm-3MGbYHtFuUbcBtA\/wp-content\/uploads\/sites\/4\/2019\/05\/Screenshot-from-2019-05-22-15-28-33.png\/w=475,fit=scale-down 475w, https:\/\/imagedelivery.net\/LqiWLm-3MGbYHtFuUbcBtA\/wp-content\/uploads\/sites\/4\/2019\/05\/Screenshot-from-2019-05-22-15-28-33.png\/w=172,fit=scale-down 172w\" sizes=\"(max-width: 475px) 100vw, 475px\" \/><\/p><h2 id=\"h-improvements-in-the-roadmap\"><strong>Improvements in the Roadmap<\/strong><\/h2><p>We&rsquo;re planning to force-cache the translation JSON files directly in the browser to shift the load to the client side. We&rsquo;re also going to implement GeoDNS to pick the nearest location to the client&rsquo;s source IP address. This is already tested in our development environment, but we&rsquo;re waiting for a stable release of PowerDNS 4.2.<\/p><p>In the future, we would love to implement regional <strong>Anycast<\/strong> together with <strong>GeoDNS<\/strong> to fail over to a live data center in case of failure. The idea is one global Anycast prefix plus a region-allocated prefix. The two prefixes overlap, which allows a smooth failover if one region goes down completely. For instance, if your GeoDNS server responds to a CNAME record with an IP of <em>2A02:4780:C3::1<\/em> for the CDN&rsquo;s resolver and at that moment this region is down, new connections will be redirected to the shortest AS-PATH PoP because of the global Anycast overlapped network.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>As you know, we run Hostinger.com. 
It&#8217;s a critical component as it&#8217;s like our company&#8217;s greeting card. Even more, it&#8217;s the main gateway for onboarding new clients and making reve\u2026<\/p>\n","protected":false},"author":39,"featured_media":1030,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[82],"tags":[1809,1808,1813,1810,1812,1811],"hashtags":[],"class_list":["post-1003","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-engineering","tag-anycast","tag-bgp","tag-exazk","tag-mysql","tag-percona","tag-php"],"hreflangs":[],"_links":{"self":[{"href":"https:\/\/www.hostinger.com\/blog\/wp-json\/wp\/v2\/posts\/1003","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.hostinger.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.hostinger.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.hostinger.com\/blog\/wp-json\/wp\/v2\/users\/39"}],"replies":[{"embeddable":true,"href":"https:\/\/www.hostinger.com\/blog\/wp-json\/wp\/v2\/comments?post=1003"}],"version-history":[{"count":34,"href":"https:\/\/www.hostinger.com\/blog\/wp-json\/wp\/v2\/posts\/1003\/revisions"}],"predecessor-version":[{"id":3850,"href":"https:\/\/www.hostinger.com\/blog\/wp-json\/wp\/v2\/posts\/1003\/revisions\/3850"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.hostinger.com\/blog\/wp-json\/wp\/v2\/media\/1030"}],"wp:attachment":[{"href":"https:\/\/www.hostinger.com\/blog\/wp-json\/wp\/v2\/media?parent=1003"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.hostinger.com\/blog\/wp-json\/wp\/v2\/categories?post=1003"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.hostinger.com\/blog\/wp-json\/wp\/v2\/tags?post=1003"},{"taxonomy":"hashtags","embeddable":true,"href":"https:\/\/www.hostinger.com\/blog\/wp-json\/wp\/v2\/hashtags?post=1003"}],"curies":[{"na
me":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}