Before we opened our data centres in the United Kingdom and the United States, we had some problems with our infrastructure. We had no automation, no version control for router configurations, a single uplink per data centre, stretched VLANs between racks, and static routes with ECMP, which left a gap in service whenever an anycast node failed.
As we outlined in a previous blog post, we started automating our network and rolled out our first automated processes in Singapore and the Netherlands. A few weeks ago, we installed redundant uplinks in the Netherlands, using different peers and locations, which eliminated the single point of failure.
We keep an eye on our SLO-defined uptime, which is 99.9%. We always try to exceed this number so that we have room to spend on testing scenarios such as “what would happen if we reboot one of our spine routers?” or “what would happen if we upgrade our servers periodically?”. We carry out these tests on ‘Gaming Day’, which, oddly, has nothing to do with gaming. We named it this because we carry out the tests like a game: you move, you wait for a response from the system, then you try again. Whilst this process is lengthy, it helps keep our systems safe, up to date, and of the best possible quality for our clients.
To avoid spending our available SLO budget, we rely on redundant connections and uplinks, which ensure high availability during router upgrades and planned or unplanned datacenter maintenance.
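To put the 99.9% figure in perspective, here is a rough sketch of the arithmetic behind an uptime budget. The function name and the 30-day window are assumptions made for illustration; the window we actually measure against is not stated here.

```python
# Rough error-budget arithmetic for an uptime SLO (illustrative only).

def downtime_budget_minutes(slo_percent: float, period_days: int = 30) -> float:
    """Return the minutes of downtime an uptime SLO allows over a period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100.0 - slo_percent) / 100.0


if __name__ == "__main__":
    budget = downtime_budget_minutes(99.9)
    print(f"99.9% over 30 days allows about {budget:.1f} minutes of downtime")
```

Under these assumptions the budget comes to roughly 43 minutes per 30 days, which is why every router reboot done without redundancy eats into it quickly.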
Every server serving anycast must be connected to a ToR (top-of-rack) switch over a BGP session. Please note that we still have locations where we mix iBGP and eBGP sessions for internal peering. Mixed iBGP/eBGP sessions are not always easy to work with, but they give us room to experiment and make improvements.
Before launching anycast in the Netherlands and Singapore, I made the following adjustments in all datacenters:
- I removed static routes;
- I re-bootstrapped the anycast nodes with ExaBGP installed, which is responsible for the announcements;
- I unified routing configurations across all datacenters.
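The point of replacing static routes with ExaBGP announcements is that a failed node can withdraw its own route instead of continuing to attract traffic. A minimal sketch of a health-check process that ExaBGP could run is shown here. ExaBGP reads text commands such as `announce route ... next-hop self` from the standard output of a process it starts; the prefix, the probed host and port, and all function names are hypothetical and are not taken from our real setup.

```python
import socket
import sys
import time
from typing import Iterable, Iterator

ANYCAST_PREFIX = "2001:db8:53::1/128"  # hypothetical (documentation range)
CHECK_HOST, CHECK_PORT = "::1", 53      # hypothetical local service to probe


def service_is_up(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to the service succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def exabgp_command(prefix: str, healthy: bool) -> str:
    """Build the ExaBGP text-API line for the current health state."""
    action = "announce" if healthy else "withdraw"
    return f"{action} route {prefix} next-hop self"


def commands_on_change(prefix: str, samples: Iterable[bool]) -> Iterator[str]:
    """Yield a command for the first sample and whenever the state flips."""
    last = None
    for healthy in samples:
        if healthy != last:
            yield exabgp_command(prefix, healthy)
            last = healthy


def probe_forever(interval: float = 2.0) -> Iterator[bool]:
    """Endless stream of health samples for the local service."""
    while True:
        yield service_is_up(CHECK_HOST, CHECK_PORT)
        time.sleep(interval)


def run() -> None:
    """Entry point when ExaBGP starts this script as a process."""
    for line in commands_on_change(ANYCAST_PREFIX, probe_forever()):
        sys.stdout.write(line + "\n")
        sys.stdout.flush()  # ExaBGP reads commands from our stdout
```

Calling `run()` at the bottom of the script would turn it into a long-running process. Emitting a command only on a state change keeps the BGP session quiet while the service is stable.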
As I mentioned before, we have used mixed types of peering, each of which has its own pros and cons. However, standardization and simplicity are the keys to automation.
Hence, the configuration snippet for our servers is relatively simple and looks like this:
router bgp <leaf-private-as>
 neighbor SERVERS peer-group
 neighbor SERVERS remote-as 65030
 neighbor SERVERS local-as 65534 no-prepend replace-as
 neighbor 2a02:4780:8::fefe peer-group SERVERS
leaf-private-as is a unique AS number per rack, which identifies routes and forces the use of eBGP connections. This avoids the need for additional route reflectors in the middle. We use the same remote-as for all servers in all datacenters, and we keep local-as the same as well, so that only ONE configuration matters. From the server’s point of view, it is enough to define the prefixes to announce; everything else is done automatically.
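Because only the leaf AS number and the server peer addresses vary per rack, the snippet above lends itself to templating. The sketch below is one way this could look. The numbering scheme (a base of 64600 plus the rack index) and the function names are assumptions made for illustration only; the values 65030 and 65534 come from the snippet above.

```python
from typing import List

SERVER_AS = 65030  # remote-as shared by all servers (from the snippet above)
LOCAL_AS = 65534   # local-as presented to the servers (from the snippet above)
PRIVATE_AS_RANGE = range(64512, 65535)  # 16-bit private AS numbers


def leaf_as(rack_index: int, base: int = 64600) -> int:
    """Derive a per-rack private AS number (hypothetical numbering scheme)."""
    asn = base + rack_index
    if asn not in PRIVATE_AS_RANGE or asn in (SERVER_AS, LOCAL_AS):
        raise ValueError(f"AS {asn} is not usable as a leaf AS")
    return asn


def render_leaf_config(rack_index: int, server_peers: List[str]) -> str:
    """Render the BGP stanza for one rack; only the AS and peers differ."""
    lines = [
        f"router bgp {leaf_as(rack_index)}",
        "neighbor SERVERS peer-group",
        f"neighbor SERVERS remote-as {SERVER_AS}",
        f"neighbor SERVERS local-as {LOCAL_AS} no-prepend replace-as",
    ]
    lines += [f"neighbor {peer} peer-group SERVERS" for peer in server_peers]
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_leaf_config(1, ["2a02:4780:8::fefe"]))
```

Keeping the shared values as constants and deriving the rest from the rack index means a new rack needs no hand-written configuration.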
We have expanded our anycast usage as much as we can to support our new products. At first, we launched 000webhost.com’s reverse proxies on the anycast network, and later DNS services and API services; even Redis slaves use anycast addresses to discover the master node. We are also planning to use anycast for MySQL setups.
Last week we migrated our file manager to the anycast network as well. We deployed two instances per datacentre and ran into a problem: where should we store sessions for every location? A centralised memcached/Redis would be too slow over intercontinental connections. Should we spin up a decentralized, dedicated session store per datacentre? And how do we keep a single, unified /etc/php.ini for every deployment? Not a problem, anycast to the rescue again, and we still have a single role per deployment.
This is the DNS traffic switch to anycast for any1.hostinger.com and any2.hostinger.com.
To add more glue, we use IPv6 everywhere internally (!) and almost everywhere externally. We even switched API calls and other critical parts from IPv4 to IPv6. Our new private network (10.0/16) simply became the IPv6 network. Operating both protocols also improves availability: if IPv4 is down, traffic will likely still flow over the IPv6 path, and vice versa. A different upstream along the path may even be picked for the destination.
Our roadmap includes providing a unique IPv6 address for every website or client. How cool is that? It would solve many issues. For instance, with IPv4, if an IP address is null-routed, every client pointing to that address is affected. With IPv6 this is different: we can give each website or client a unique IPv6 address and avoid this kind of problem.
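As a rough illustration of how per-site addressing could work, the sketch below maps a numeric site ID onto an address inside a single /64. The prefix is taken from the IPv6 documentation range and, like the function name and the ID scheme, is purely hypothetical; this is not a description of an implemented system.

```python
import ipaddress

# Hypothetical prefix from the documentation range (2001:db8::/32).
SITE_PREFIX = ipaddress.IPv6Network("2001:db8:abcd::/64")


def site_address(
    site_id: int, prefix: ipaddress.IPv6Network = SITE_PREFIX
) -> ipaddress.IPv6Address:
    """Map a positive site ID to its own address inside the prefix."""
    if not 1 <= site_id < prefix.num_addresses:
        raise ValueError(f"site_id {site_id} does not fit in {prefix}")
    return prefix[site_id]  # network address plus the site ID as an offset


if __name__ == "__main__":
    for sid in (1, 2, 255):
        print(sid, site_address(sid))
```

With one address per site, null-routing a single /128 would affect only that site rather than every client sharing an IPv4 address.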
Over the last few months, we have massively improved our network infrastructure and we are certain that our clients will feel the huge performance gains we have made. In the near future, we will also upgrade the network in our United States datacentre.
Yet why are we making these changes? We are moving to a fully redundant network to ensure as few interruptions as possible for our customers during upgrades.
The anycast we deployed in two new locations will speed up the DNS responses for our customers. What’s more, 000webhost.com and some Hostinger clients are now using anycast with three locations: the United States, the Netherlands, and Singapore.