Nov 19, 2021
7 min Read
The network is the most sensitive part of an infrastructure. To keep it running with less downtime, we need to validate the configuration before deploying new changes to the production environment.
This article provides insights on how we test and validate network changes, and how this evolution contributes to the trust that we and our customers have in Hostinger’s services.
Back in 2015, we didn’t have any network automation at Hostinger. Instead, there were a couple of core routers (Cisco 6500 series) per datacenter and plenty of rather unmanaged HP switches to provide basic L2 connectivity. Pretty simple, no high availability, a huge failure domain, no spare devices, and so on.
No version control existed at that time at Hostinger, meaning configurations were kept somewhere or on someone’s computer. This raises the following question: how did we manage to run this network without automation, validation, or a deployment process?
How did we validate whether the config was good or not? We didn’t. Just some eyeballing, pushing changes directly via CLI, and praying. Even a config rollback feature wasn’t available, and no out-of-band network existed. If you cut the connection, you were lost, much like what happened to Facebook recently, and only physical access could bring it back.
The design and implementation lived in two people’s heads. Changes to the production network were painful and prone to human error. We had poor monitoring with basic alerting rules and a couple of traffic and error graphs. There was no centralized logging system, but what we had was definitely better than nothing. It’s not an exaggeration to say that even nowadays, small companies use this kind of simple method to monitor the network. If it’s working well and is good enough to function – don’t touch it.
The less you know about the state of the network, the fewer problems you have. Overall, we fundamentally lacked any internal or external tools to inspect the state of the network.
In 2016 we started building Awex, an IPv6-only service. Since it was being built from scratch, we began shipping automation from day 0. As soon as we noticed the positive impact of automation, we started building new data centers using Cumulus Linux, automating them with Ansible, and deploying the changes using Jenkins.
The simplified workflow was:
The drawback of this scheme is that configuration changes are automated but not validated or even tested before deployment. That can cause failures with a substantial blast radius. For instance, if a wrong loopback address or route-map is deployed, it can cause BGP sessions to flap or send the whole network into chaos.
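To illustrate the kind of check that could catch a wrong loopback address before it reaches a device, here is a minimal sketch. It is not taken from our tooling: the function name, the input shape (hostname mapped to a CIDR string), and the sample addresses are assumptions made for the example.

```python
import ipaddress
from collections import defaultdict


def loopback_problems(loopbacks):
    """Return a list of problems found in a {hostname: "address/prefix"} mapping.

    Hypothetical helper: flags unparsable addresses, loopbacks that are not
    host routes (/32 for IPv4, /128 for IPv6), and addresses used twice.
    """
    problems = []
    owners = defaultdict(list)
    for host, cidr in sorted(loopbacks.items()):
        try:
            iface = ipaddress.ip_interface(cidr)
        except ValueError:
            problems.append(f"{host}: {cidr!r} is not a valid address")
            continue
        if iface.network.prefixlen != iface.max_prefixlen:
            problems.append(f"{host}: {cidr} is not a host route")
        # Key by string so IPv4 and IPv6 entries can be sorted together.
        owners[str(iface.ip)].append(host)
    for ip, hosts in sorted(owners.items()):
        if len(hosts) > 1:
            problems.append(f"{ip} is assigned to {', '.join(hosts)}")
    return problems


if __name__ == "__main__":
    # Sample data only; these are not real device names or addresses.
    for problem in loopback_problems({
        "leaf1": "10.0.0.1/32",
        "leaf2": "10.0.0.1/32",   # duplicate of leaf1
        "spine1": "10.0.0.300/32",  # typo: not a valid IPv4 address
    }):
        print(problem)
```

A check like this is cheap to run on every change because it only needs the inventory data, not a running network.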
The main reason for adding automation and validation is to save time debugging real problems in production, reduce downtime, and make end-users happier. However, you always have to ask yourself: where do you draw the line for automation? When does automating things stop adding value? How do you take automation just to the point where it makes sense?
Since then, we have focused on how to improve this process even more. When your network is growing and you build more and more data centers, maintenance gets harder, slowing down the process of pushing changes to production.
As always, you have to trade off deployment speed against safety. At Hostinger, we are customer-obsessed, which clearly means that we must prefer a slower process that leads to less unplanned downtime.
Every failure teaches you a new lesson on improving things and avoiding the same losses in the future. That’s why validation is a must for a modern network.
While most of the changes basically involve testing layers 2, 3, 4, and 7 of the OSI model, there are always requests that should be tested by Layer 8, which is beyond the scope of this blog post.
A couple of years later, we already have a few fully automated data centers. Over that time, we started using CumulusVX + Vagrant for pre-deployment testing. Now, catching bugs before clients report them is the primary goal.
Basically, this is a real-life testing scenario where you virtually build a fresh data center almost identical to what we use in production, except that the hardware part (ASIC) can’t be simulated (programmed). Everything else can be tested quite well, and that saves hundreds of debugging hours in production. More sleep for engineers:)
So, when a Pull Request is created on GitHub, the pre-deployment phase launches a full-scale virtual data center and runs a bunch of unit tests. And, of course, some integration tests to see how the switches interact with each other, or to simulate other real-life scenarios, like connecting a server to EVPN and seeing whether two hosts on the same L2VNI can communicate across two separate racks. That takes around 30 minutes. Since we don’t push tens of changes every day, it’s good enough.
In addition, we also run tests on production devices during the pre-deployment and post-deployment phases. This allows us to spot the difference when production was green before the merge and something is suddenly wrong after the changes.
Problems can lurk in production for months, and without proper monitoring, you can’t spot them. Or even worse – something can be behaving incorrectly even though you thought it was fine.
To achieve that, we use Suzieq together with the PyTest framework. Suzieq is an open-source multi-vendor network observability platform/application used for planning, designing, monitoring, and troubleshooting networks. It supports all the major router and bridge vendors used in the data center.
It provides multiple ways to use it, from a network operator-friendly CLI to a GUI to a REST server and a Python API. We primarily leverage the Python API to write our tests. Suzieq normalizes the data across multiple vendors and presents the information in an easy, vendor-neutral format. It allows us to focus on writing tests rather than on gathering the data (and on keeping abreast of vendor-related changes to their network OSs). We find the developers helpful and the community active, which is very important to get the fixes as fast as possible.
We currently use only Cumulus Linux, but you never know what’s going to change in the future, so abstraction is key.
Below are good examples of checking whether EVPN fabric links are properly connected, with the correct MTU and link speeds.
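As a rough illustration of such a check, here is a minimal PyTest-style sketch. It deliberately does not call Suzieq: the `Link` record, the `find_bad_links` helper, and the expected MTU and speed values are all assumptions made for the example, and in a real test the link list would come from the observability data rather than a literal.

```python
from dataclasses import dataclass

# Hypothetical expectations for fabric (leaf <-> spine) links.
EXPECTED_MTU = 9216
EXPECTED_SPEED_MBPS = 100_000


@dataclass(frozen=True)
class Link:
    hostname: str
    ifname: str
    state: str        # "up" or "down"
    mtu: int
    speed_mbps: int


def find_bad_links(links, mtu=EXPECTED_MTU, speed=EXPECTED_SPEED_MBPS):
    """Return one human-readable message per expectation a link fails."""
    problems = []
    for link in links:
        where = f"{link.hostname}:{link.ifname}"
        if link.state != "up":
            problems.append(f"{where} is {link.state}")
        if link.mtu != mtu:
            problems.append(f"{where} has MTU {link.mtu}, expected {mtu}")
        if link.speed_mbps != speed:
            problems.append(
                f"{where} runs at {link.speed_mbps} Mbps, expected {speed}"
            )
    return problems


def test_fabric_links_are_healthy():
    # Sample data standing in for what the collector would return.
    links = [
        Link("spine1", "swp1", "up", 9216, 100_000),
        Link("leaf1", "swp51", "up", 9216, 100_000),
    ]
    assert find_bad_links(links) == []
```

Returning a list of messages instead of a bare boolean keeps the failure output readable when several links are wrong at once.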
Or, check that the routing table didn’t drop below the expected size and that it keeps a consistent state between builds. For instance, expect more than 10k IPv4 routes and more than 10k IPv6 routes per spine switch. Otherwise, something is wrong in the wild: neighbors are down, a wrong filter is applied, an interface is down, etc.
We’ve just started this kind of testing and are looking forward to extending it more in the future. Additionally, we run more pre-deployment checks. We use Ansible for pushing changes to the network, and we should validate Ansible playbooks, roles, and attributes carefully.
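The route-count expectation can be sketched in the same spirit. Again, this is an illustration rather than our actual test: the `route_shortfalls` helper and the input shape, a mapping of (hostname, address family) to a route count, are assumptions, and only the 10k threshold comes from the text.

```python
MIN_ROUTES = 10_000  # "more than 10k routes" per address family, per spine


def route_shortfalls(route_counts, minimum=MIN_ROUTES):
    """Return the entries whose route count is not above the minimum.

    route_counts maps (hostname, family) -> number of routes.
    """
    return {
        key: count
        for key, count in sorted(route_counts.items())
        if count <= minimum
    }


def test_spines_have_full_tables():
    # Sample counts; a real test would read them from the collected data.
    counts = {
        ("spine1", "ipv4"): 12_500,
        ("spine1", "ipv6"): 11_200,
        ("spine2", "ipv4"): 12_480,
        ("spine2", "ipv6"): 11_190,
    }
    assert route_shortfalls(counts) == {}
```

A threshold is a blunt instrument, but it is enough to notice a lost neighbor or an overly strict filter before customers do.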
Pre-deployment is crucial, and even during the testing phase, you can realize that you are making absolutely wrong decisions, which eventually lead to over-engineered, complex disasters. And fixing that later is more than awful. Fundamental things must remain fundamental, like basic arithmetic: addition and subtraction. You can’t have complex stuff in your head if you want to operate at scale. This is valid for any software engineering and, of course, for networks too.
It’s also worth mentioning that we evaluated Batfish for configuration analysis. But, from what we tested, it wasn’t mature enough for Cumulus Linux, and we just dropped it until better times. We hit unexpected parsing failures such as "Parse warning This syntax is unrecognized". Hence, we will go back to Batfish next year to double-check if everything is fine with our configuration.
Deployment is mostly the same as in the initial automation journey. Jenkins pushes changes to production if all pre-deployment validation is green and the Pull Request is merged into the master branch.
To speed up the deployment, we use multiple Jenkins slaves to split runs between regions and distribute them to nearby devices. We use an out-of-band (OOB) network that is separated from the main control plane, which allows us to easily change even the most critical parts of the network gear. For resiliency, we keep the OOB network highly available to avoid a single point of failure and keep it running. This network is even connected to multiple ISPs.
If we lose both the OOB network and core network reachability, that probably points to a data center issue. Unfortunately, we don’t run console servers or console networks because they are too expensive and rather security-critical.
Every Pull Request checks whether the Ansible inventory is parsed correctly and the syntax is valid, and runs ansible-lint to enforce our standards. We also rely a lot on Git.
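These Pull Request checks could be wired together roughly as follows. This is a sketch under assumptions, not our Jenkins job: the file names `inventory/hosts` and `site.yml` are placeholders, and the `build_checks` and `run_checks` helpers are invented for the example. The commands themselves (`ansible-inventory --list`, `ansible-playbook --syntax-check`, `ansible-lint`) are the standard Ansible tools.

```python
import subprocess


def build_checks(inventory, playbook):
    """Return the commands to run, in order, for one Pull Request."""
    return [
        # Fails if the inventory cannot be parsed.
        ["ansible-inventory", "-i", inventory, "--list"],
        # Parses the playbook without executing any task.
        ["ansible-playbook", "-i", inventory, "--syntax-check", playbook],
        # Style and best-practice rules.
        ["ansible-lint", playbook],
    ]


def run_checks(checks, runner=subprocess.run):
    """Run each command; return the first failing command, or None."""
    for cmd in checks:
        result = runner(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            return cmd
    return None


if __name__ == "__main__":
    # Placeholder paths; adjust to the repository layout.
    failed = run_checks(build_checks("inventory/hosts", "site.yml"))
    raise SystemExit(f"check failed: {' '.join(failed)}" if failed else 0)
```

Passing the runner in as a parameter keeps the logic testable without Ansible installed.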
Every commit is strictly validated, and we use additional tags like Deploy-Tags: cumulus_frr, which says to run only the Ansible tasks that have this tag. It’s there to explicitly state what to run instead of running everything.
We also have the Deploy-Info: kitchen Git tag, which spawns virtual data centers in a Vagrant environment using the kitchen framework, so you can check the state in the pre-deployment stage. As mentioned before, Git is at the core: the commit itself declares what to test or run for that change.
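Reading such tags out of a commit message can be sketched like this. The parsing rules are assumptions for the example (tags sit in the last paragraph of the message as "Key: value" lines, and several Deploy-Tags values are comma-separated), and the helper names are hypothetical. Only the tag names Deploy-Tags and Deploy-Info and the `--tags` option of `ansible-playbook` are taken as given.

```python
def parse_trailers(message):
    """Collect 'Key: value' lines from the last paragraph of a commit message."""
    paragraphs = [p for p in message.strip().split("\n\n") if p.strip()]
    if len(paragraphs) < 2:
        return {}  # a subject line alone carries no trailers
    trailers = {}
    for line in paragraphs[-1].splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        if sep and key and " " not in key:
            trailers[key] = value.strip()
    return trailers


def ansible_tag_args(message):
    """Translate Deploy-Tags into ansible-playbook arguments (may be empty)."""
    raw = parse_trailers(message).get("Deploy-Tags", "")
    names = [name.strip() for name in raw.split(",") if name.strip()]
    return ["--tags", ",".join(names)] if names else []


def wants_kitchen(message):
    """True when the commit asks for the virtual data center to be spawned."""
    return parse_trailers(message).get("Deploy-Info", "") == "kitchen"


if __name__ == "__main__":
    # Invented commit message, for illustration only.
    commit = (
        "frr: tighten route-map on spines\n\n"
        "Longer explanation of the change.\n\n"
        "Deploy-Tags: cumulus_frr\n"
        "Deploy-Info: kitchen\n"
    )
    print(ansible_tag_args(commit), wants_kitchen(commit))
```

Keeping the run instructions in the commit means the Pull Request review covers not only the change but also how it will be tested and deployed.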
Post-deployment validation is done after deploying changes to the network, to check whether they had the intended impact. Errors can still make it to the production network, but the duration of their impact is shortened. Hence, when the changes are pushed to the devices, we instantly run the same pre-deployment Suzieq tests to double-check that the network is still in the desired state.
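Comparing the results of the same checks before and after a deployment can be sketched as a simple diff. The snapshot shape (check name mapped to pass or fail) and the `diff_state` helper are assumptions made for illustration and do not describe how our pipeline stores results.

```python
def diff_state(before, after):
    """Compare two {check_name: passed} snapshots of the same test suite.

    A check counts as a regression if it passed before the deployment and
    fails, or is missing, afterwards.
    """
    regressions = sorted(
        name for name, ok in before.items()
        if ok and not after.get(name, False)
    )
    fixed = sorted(
        name for name, ok in after.items()
        if ok and not before.get(name, False)
    )
    return {"regressions": regressions, "fixed": fixed}


if __name__ == "__main__":
    # Invented check names and results.
    pre = {"bgp_sessions": True, "fabric_mtu": True, "route_count": False}
    post = {"bgp_sessions": False, "fabric_mtu": True, "route_count": True}
    print(diff_state(pre, post))
```

Separating regressions from checks that were already red makes it clear whether the change just deployed is the likely cause.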
We are still learning, as it’s a never-ending process. For now, we can push changes to production more safely because we have a validation layer that gives us some trust in those changes. If we trust our network, why shouldn’t our clients? At Hostinger, we always try to build a service, network, and software with failure in mind. That means always assuming that your software or network will fail someday, and being prepared, or at least ready to fix it as soon as you can.
Part 2 of this article details how Hostinger uses Suzieq to perform network validation and provides a more detailed overview of our Batfish evaluation.