Feb 25, 2022
4 min Read
Donatas A.
This article is about how we built a new, highly scalable cloud hosting solution using IPv6-only communication between commodity servers, what problems we faced with the IPv6 protocol, and how we tackled them while handling more than ten million active users.
At Hostinger we care a lot about innovative technologies, so we decided to run a new project named Awex that is built on this protocol. If we can, why not start today? Only the frontend (user-facing) services run in a dual-stack environment; everything else is IPv6-only for east-west traffic.
I don't want to dive into all the details in this post, but I will describe the crucial components needed to build this architecture.
We are using pods. A pod is a cluster that shares the same VIP (Virtual IP) addresses, announced as anycast, and handles HTTP/HTTPS requests in parallel. Hundreds of nodes per pod can serve user requests simultaneously without saturating any single one. Parallelization is done with BGP and ECMP using resilient hashing to avoid traffic scattering. Hence every edge node runs a BGP daemon to announce VIPs to the ToR switch. As the BGP daemon we run ExaBGP, using a single IPv6 session to announce both address families (IPv4 and IPv6). The BGP session is configured automatically during the server bootstrap step. Announcements differ depending on the server's role and include a /64 prefix per node plus many VIPs for north-south traffic. The /64 prefix is delegated specifically for containers: every edge node runs plenty of containers, and they communicate with containers on other nodes and with internal services.
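To make the announcement side more concrete, here is a minimal sketch of how VIPs and the per-node /64 could be pushed through ExaBGP's process API. The prefixes, the health check, and the timing are illustrative assumptions, not our production script.

```python
#!/usr/bin/env python3
"""Minimal ExaBGP "process" sketch: announce VIPs and the node's /64.

Illustrative only -- the prefixes and the health check are made up.
ExaBGP runs a script like this as a configured process and reads
announce/withdraw commands from its stdout.
"""
import socket
import sys
import time

# Hypothetical addresses: anycast VIPs plus the /64 delegated to containers.
PREFIXES = [
    "2001:db8:100::1/128",   # IPv6 VIP (anycast)
    "192.0.2.10/32",         # IPv4 VIP, carried over the same IPv6 session
    "2001:db8:beef::/64",    # per-node prefix for containers
]


def healthy() -> bool:
    """Very naive health check: is the local proxy accepting connections?"""
    try:
        with socket.create_connection(("::1", 80), timeout=1):
            return True
    except OSError:
        return False


def emit(command: str) -> None:
    # ExaBGP picks up commands written to stdout by the configured process.
    sys.stdout.write(command + "\n")
    sys.stdout.flush()


announced = False
while True:
    up = healthy()
    if up and not announced:
        for prefix in PREFIXES:
            emit(f"announce route {prefix} next-hop self")
        announced = True
    elif not up and announced:
        for prefix in PREFIXES:
            emit(f"withdraw route {prefix} next-hop self")
        announced = False
    time.sleep(5)
```

The script only decides what to announce or withdraw; the single IPv6 session to the ToR switch itself is kept by ExaBGP according to its own configuration.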
Every edge node runs a Redis slave replica to get the upstream list for a particular application, so every upstream is a list of thousands of IPv6 containers spanning the nodes in a pod. These huge lists are generated in real time using consul-template. Each edge node also has many public IPv4 (512) and global IPv6 (512) addresses to tackle DDoS attacks. We use DNS to randomize the A/AAAA records in client responses: the client points their domain to our CNAME record named route, which in turn is randomized by our custom service named Razor. We will talk about Razor in future posts.
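Razor deserves its own post, but the idea behind the randomization can be sketched in a few lines. The address pools and counts below are made up for illustration; the real service works against the pod's actual public addresses.

```python
import ipaddress
import random

# Hypothetical address pools for one pod; in reality each edge node
# carries hundreds of public IPv4 and global IPv6 addresses.
POD_IPV4 = [f"203.0.113.{i}" for i in range(1, 65)]
POD_IPV6 = [f"2001:db8:beef::{i:x}" for i in range(1, 65)]


def answer(qtype: str, count: int = 4) -> list[str]:
    """Return a randomized subset of records for an A or AAAA query."""
    pool = POD_IPV4 if qtype == "A" else POD_IPV6
    picked = random.sample(pool, count)
    # Sanity-check that we only ever hand out well-formed addresses.
    return [str(ipaddress.ip_address(a)) for a in picked]


print(answer("A"))
print(answer("AAAA"))
```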
At first, we decided to use OpenSwitch for the ToR switches, a fairly new but interesting and promising community project. We tested this OS in our lab for a few months and even contributed some changes to OpenSwitch, like this patch. We reported a number of bugs, and most of them were eventually fixed, unfortunately not as fast as we needed, so we postponed experimenting with OpenSwitch for a while and gave Cumulus a try. By the way, we are still testing OpenSwitch in our lab because we are planning to use it in the near future.
Cumulus allows us to have a fully automated network: BGP neighbors, upstreams, firewalls, bridges, and so on are all reconfigured automatically. For instance, if we add a new node, Ansible automatically picks up the change in the Chef inventory by looking at LLDP attributes and regenerates the configuration for the particular switch. If we want to add a new BGP upstream or firewall rule, we just create a pull request in our GitHub repo and everything is done automatically, including syntax checks and deploying the change to production. Every node is connected with a single 10GE interface in a Clos topology.
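As a rough illustration of that regeneration step, the logic boils down to rendering BGP neighbor stanzas from the discovered hosts. The inventory structure, hostnames, addresses, and ASNs below are made up; in production the equivalent job is done by Ansible against the Chef inventory and LLDP attributes.

```python
"""Sketch: regenerate a ToR switch's BGP neighbor config from discovered hosts."""

# Hosts discovered behind this switch, e.g. collected via LLDP (hypothetical data).
DISCOVERED = [
    {"hostname": "edge-01", "address": "2001:db8:1::11", "asn": 65101},
    {"hostname": "edge-02", "address": "2001:db8:1::12", "asn": 65102},
]


def render_bgp_neighbors(hosts: list[dict], local_asn: int = 65000) -> str:
    """Render per-node BGP neighbor stanzas (FRR/Cumulus-style syntax)."""
    lines = [f"router bgp {local_asn}"]
    for host in hosts:
        # A single IPv6 session per node carries both address families.
        lines.append(f" neighbor {host['address']} remote-as {host['asn']}")
        lines.append(f" neighbor {host['address']} description {host['hostname']}")
    return "\n".join(lines)


print(render_bgp_neighbors(DISCOVERED))
```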
Now, about the problems we ran into with the IPv6 protocol itself. One recurring annoyance is address formatting: some software requires IPv6 addresses in square brackets ([2001:dead:beef::1]), while other software does not (2001:dead:beef::1). The best ones we have come across are variants like ([2001::dead::beef::1::::1]), which is not even a valid address, and IPv4-mapped ones (::ffff:<ipv4>).
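A small normalization helper goes a long way here. The sketch below is illustrative rather than anything we ship: it accepts both bracketed and plain forms, unwraps IPv4-mapped addresses, and rejects malformed input.

```python
import ipaddress


def normalize_ipv6(value: str) -> str:
    """Accept '[2001:db8::1]' or '2001:db8::1' and return the canonical form.

    Raises ValueError for malformed input such as '2001::dead::beef::1::::1'.
    """
    candidate = value.strip()
    if candidate.startswith("[") and candidate.endswith("]"):
        candidate = candidate[1:-1]
    addr = ipaddress.IPv6Address(candidate)
    if addr.ipv4_mapped:
        # ::ffff:<ipv4> -- an IPv4-mapped address pretending to be IPv6.
        return str(addr.ipv4_mapped)
    return addr.compressed


print(normalize_ipv6("[2001:db8::1]"))       # 2001:db8::1
print(normalize_ipv6("::ffff:192.0.2.1"))    # 192.0.2.1
# normalize_ipv6("2001::dead::beef::1::::1") # raises ValueError
```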
Another problem surfaced on virtual machines running on VMware ESXi, where we noticed packets being dropped and the counter # of pkts dropped due to large hdrs:126 growing. We had to dig into the vmxnet3 driver and check vmxnet3_rx_error() to see what buffer length was hitting the queues. That was really disappointing, because the buffer size was 54 bytes and it wasn't even an IPv4 or IPv6 packet, just some VMware underlying headers. Finally, by adjusting the MTU for nodes running on ESXi, we were able to handle all packets without dropping them.
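If you want to catch this class of problem earlier, it is worth watching the NIC statistics for growing drop or error counters. The sketch below shells out to ethtool -S; the interface name and the counter pattern are assumptions, since the exact counters differ between drivers and hypervisor versions.

```python
import re
import subprocess
import time

INTERFACE = "eth0"                       # hypothetical interface name
PATTERN = re.compile(r"drop|err", re.I)  # which counters to watch; adjust to taste


def read_counters(iface: str) -> dict[str, int]:
    """Parse `ethtool -S <iface>` output ('  name: value' per line)."""
    out = subprocess.run(
        ["ethtool", "-S", iface], capture_output=True, text=True, check=True
    ).stdout
    counters = {}
    for line in out.splitlines():
        name, _, value = line.rpartition(":")
        name, value = name.strip(), value.strip()
        if name and value.lstrip("-").isdigit():
            counters[name] = int(value)
    return counters


previous = read_counters(INTERFACE)
while True:
    time.sleep(60)
    current = read_counters(INTERFACE)
    for name, value in current.items():
        if PATTERN.search(name) and value > previous.get(name, 0):
            print(f"{name} grew to {value}")
    previous = current
```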