Friday October 15, 2021
Network Validation Evolution at Hostinger
A network is the most sensitive part of an infrastructure. To keep it running with as little downtime as possible, its configuration needs to be validated before new changes are deployed to the production environment.
This article provides insights into how we test and validate network changes and how the evolution of this process contributes to the level of trust we and our customers have in Hostinger’s services.
The Reasons Behind It
Back in 2015, we didn’t have any network automation at Hostinger. Instead, there were a couple of core routers (Cisco 6500 series) per data center, and plenty of rather unmanaged HP switches to provide basic L2 connectivity. Pretty simple – no high availability, a huge failure domain, no spare devices, etc.
No version control existed at that time at Hostinger, which meant that configurations were kept somewhere on someone’s computer. This raises the question – how did we manage to run this network without automation, validation, or a deployment process?
How did we check whether the config was good or not? Well, we didn’t. There was only some eyeballing, pushing changes directly via CLI, and praying. Even a config rollback feature wasn’t available. An out-of-band network did not exist. If you cut the connection, you were lost, just like Facebook recently was, and only physical access could help you bring it back.
Two people were responsible for the design and implementation. Any changes in the production network were painful and prone to human error. We had poor monitoring with basic alerting rules and a couple of traffic and error graphs, and no centralized logging system. That said, this was better than nothing. It’s not an exaggeration to say that small companies still use such simple methods to monitor their networks today. The rule was: don’t touch it if it works well enough.
The less you know about the state of the network, the fewer problems you’ll have. Overall, we didn’t have any internal or external tools that could properly tell us what state the network was in.
Hostinger’s Solution To Network Validation
In 2016, we started building Awex, an IPv6-only service. Since it was being built from scratch, we began shipping automation from day 0. As soon as we noticed the positive impact of automation, we started building new data centers using Cumulus Linux, automating them with Ansible, and deploying the changes using Jenkins.
The simplified workflow was:
- Make changes.
- Commit them by creating a pull request on GitHub.
- Wait for a review from others.
- Merge the pull request.
- Wait for the changes to be deployed by Jenkins to the switches.
The drawback of this process was that configuration changes were automated but not validated or even tested before deployment. This could cause failures with a substantial blast radius. For instance, if a wrong loopback address or route-map were deployed, it could cause BGP sessions to flap or send the whole network into chaos.
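As an illustration of the kind of mistake described above, here is a minimal sketch of a check that could run before deployment. It assumes loopback addresses are available as a simple hostname-to-address mapping; the function name and the sample data are hypothetical and are not taken from our actual tooling.

```python
import ipaddress
from collections import defaultdict


def find_loopback_problems(loopbacks):
    """Return human-readable problems found in a hostname -> loopback mapping."""
    problems = []
    owners = defaultdict(list)
    for host, raw in sorted(loopbacks.items()):
        try:
            iface = ipaddress.ip_interface(raw)
        except ValueError:
            problems.append(f"{host}: invalid loopback address {raw!r}")
            continue
        # A loopback should be a single host address: /32 for IPv4, /128 for IPv6.
        if iface.network.prefixlen != iface.ip.max_prefixlen:
            problems.append(f"{host}: loopback {raw} is not a host address")
        owners[iface.ip].append(host)
    # The same address on two devices is a classic cause of flapping sessions.
    for ip, hosts in sorted(owners.items(), key=lambda item: str(item[0])):
        if len(hosts) > 1:
            problems.append(f"{ip} is assigned to multiple devices: {', '.join(hosts)}")
    return problems


# Hypothetical inventory data, for illustration only.
EXAMPLE_LOOPBACKS = {
    "spine1": "10.0.0.1/32",
    "spine2": "10.0.0.2/32",
    "leaf1": "10.0.0.2/32",    # duplicate of spine2
    "leaf2": "10.0.0.300/32",  # not a valid IPv4 address
}

for problem in find_loopback_problems(EXAMPLE_LOOPBACKS):
    print(problem)
```

A check like this is cheap enough to run on every pull request, long before any switch is touched.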
The main reason for adding automation and validation is to save time debugging real problems in production, reduce downtime, and make end-users happier. However, you always have to ask yourself where to draw the line for automation. When does automating things stop adding value? How do you implement automation only up to the point where it still makes sense?
Since then, we have focused on improving this process even more. When your network is growing and you build more and more data centers, maintenance becomes more difficult, which slows down the process of pushing changes to production.
As always, there is a trade-off between faster and safer deployment. At Hostinger, we are customer-obsessed, which means we prefer a slower process that leads to less unplanned downtime.
Every failure teaches you a new lesson about improving things and preventing the same thing from happening in the future. That’s why validation is a must for a modern network.
While most of the changes involve testing Layers 2, 3, 4, or 7 of the OSI model, there are always requests that should be tested by Layer 8, which is beyond the scope of this blog post.
A couple of years later, we already had a few fully automated data centers, and we started using Cumulus VX + Vagrant for pre-deployment testing. Now, the primary goal is to catch any bugs faster than a client could report them.
This is a real-life testing scenario where you virtually build a fresh data center almost identical to what we use in production, except for the hardware part (the ASIC), which can’t be simulated (programmed). Everything else can be tested quite well, which saves hundreds of debugging hours in production. More sleep for engineers 🙂
So, when a pull request is created on GitHub, the pre-deployment phase launches a full-scale virtual data center and runs unit tests and, of course, some integration tests. These either show how the switches interact with each other or simulate other real-life scenarios, like connecting a server to an EVPN to see if two hosts on the same L2VNI can communicate between two separate racks. The whole run takes around 30 minutes. Since we don’t push tens of changes every day, that’s good enough.
In addition, we run tests on production devices during pre-deployment and post-deployment. This allows us to spot any differences.
Problems can lurk in production for months, and without proper monitoring, you can’t spot them correctly. Or even worse – something may be behaving incorrectly even if you thought it was fine.
To catch such problems, we use Suzieq together with the PyTest framework. Suzieq is an open-source multi-vendor network observability platform/application used for planning, designing, monitoring, and troubleshooting networks. It supports all major router and bridge vendors used in our data centers.
There are multiple ways to use it – from a network operator-friendly CLI to a GUI to a REST server and a Python API. We primarily leverage the Python API to write our tests. Suzieq normalizes the data across multiple vendors and presents the information in an easy, vendor-neutral format. It allows us to focus on writing tests rather than gathering data (and keeping abreast of vendor-related changes to their network OSs). We’ve found the developers helpful and the community active, which is very important for getting fixes as fast as possible.
We currently only use Cumulus Linux, but you never know what will change in the future, meaning that abstraction is key.
For example, we check that EVPN fabric links are properly connected, with the correct MTU and link speeds.
We also check that the routing table hasn’t shrunk below the expected size and stays consistent between builds. For instance, we expect more than 10k IPv4 routes and more than 10k IPv6 routes on each spine switch. Otherwise, there may be some problems in the wild – neighbors are down, a wrong filter has been applied, an interface is down, etc.
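The route-count check can be sketched in a similar way. Again, this is an illustration under assumptions rather than our actual test: the routes are assumed to be available as dictionaries with hostname and ipvers fields, and the helper name and sample data are hypothetical.

```python
from collections import Counter

MIN_ROUTES = 10_000  # from the text: more than 10k routes per address family


def spines_below_route_threshold(routes, spines, min_routes=MIN_ROUTES):
    """Return {(spine, ipvers): count} for every spine/family at or below the threshold."""
    counts = Counter((route["hostname"], route["ipvers"]) for route in routes)
    shortfalls = {}
    for spine in spines:
        for ipvers in (4, 6):
            seen = counts[(spine, ipvers)]  # Counter returns 0 for missing keys
            if seen <= min_routes:
                shortfalls[(spine, ipvers)] = seen
    return shortfalls


# Tiny hypothetical data set; the threshold is lowered so the example stays readable.
SAMPLE_ROUTES = [
    {"hostname": "spine1", "ipvers": 4, "prefix": "10.1.0.0/24"},
    {"hostname": "spine1", "ipvers": 4, "prefix": "10.2.0.0/24"},
    {"hostname": "spine1", "ipvers": 6, "prefix": "2001:db8:1::/48"},
]

print(spines_below_route_threshold(SAMPLE_ROUTES, ["spine1"], min_routes=1))
```

Counting per spine and per address family matters here: a healthy IPv4 table can otherwise hide an IPv6 table that has silently lost most of its routes.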
We’ve just started this kind of testing and are looking forward to extending it more in the future. Additionally, we run more pre-deployment checks. We use Ansible for pushing changes to the network, so we should validate Ansible playbooks, roles, and attributes carefully.
Pre-deployment is crucial – even during the testing phase, you can realize that you are making the wrong decisions, which lead to over-engineered, complex disasters. Fixing that later is more than awful. Fundamental elements must remain fundamental, like basic arithmetic (adding and subtracting). You can’t keep complex stuff in your head if you want to operate at scale. This holds for software engineering in general and, of course, for networks.
It’s worth mentioning that we’ve also evaluated Batfish for configuration analysis. But, from what we’ve tested, it wasn’t mature enough for Cumulus Linux – we encountered unexpected parsing failures like “Parse warning: This syntax is unrecognized”. Hence, we plan to come back to Batfish next year and use it to double-check that everything is fine with our configuration.
The deployment stage is mostly the same as in the initial automation journey. Jenkins pushes changes to production if all pre-deployment validation is green and the pull request is merged into the master branch.
To speed up deployment, we use multiple Jenkins slaves and split runs between regions so that each slave deploys to nearby devices. We use an out-of-band (OOB) network that is separated from the main control plane, which allows us to easily change even the most critical parts of the network gear. For resiliency, we keep the OOB network highly available to avoid single points of failure. This network is even connected to multiple ISPs.
If we lose both the OOB network and core network reachability, it probably means the data center itself has issues. Unfortunately, we don’t run console servers or console networks because they’re too expensive and security-critical.
Every pull request is checked to make sure that the Ansible inventory parses correctly and that the syntax is valid. We run ansible-lint to comply with standardization. We also rely on Git a lot.
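The checks above could be wired together along these lines. This is a sketch, not our Jenkins job: the file names are hypothetical, and the helper functions are illustrative wrappers around the ansible-inventory, ansible-playbook, and ansible-lint command-line tools.

```python
import subprocess


def build_checks(inventory, playbook):
    """Return the commands to run for one pull request, in order."""
    return [
        ["ansible-inventory", "-i", inventory, "--list"],                   # inventory parses
        ["ansible-playbook", "-i", inventory, "--syntax-check", playbook],  # syntax is valid
        ["ansible-lint", playbook],                                         # style rules
    ]


def run_checks(commands, runner=subprocess.run):
    """Run every command and return the ones that exited with a non-zero status."""
    failed = []
    for cmd in commands:
        result = runner(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            failed.append(" ".join(cmd))
    return failed


# Hypothetical paths, for illustration only; nothing is executed here.
for command in build_checks("inventory/production", "site.yml"):
    print(" ".join(command))
```

Relying on exit codes alone is a simplification: depending on the Ansible configuration, an inventory that cannot be parsed may only produce a warning, so a stricter job might also inspect the command output.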
Every commit is strictly validated, and we use additional tags in the commit message, like Deploy-Tags: cumulus_fr, which tells Jenkins to run only the Ansible tasks with this tag. It’s there to state explicitly what to run instead of running everything.
We also have the Deploy-Info: kitchen tag, which spawns virtual data centers in a Vagrant environment using the kitchen framework so that you can check the state in the pre-deployment stage.
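Reading such tags from a commit message can be sketched as below. The parsing rules are assumptions made for illustration (one tag per line, optional comma-separated values), and the function names are hypothetical; only the Deploy-Tags and Deploy-Info names come from our workflow.

```python
import re

# Matches lines such as "Deploy-Tags: cumulus_fr" or "Deploy-Info: kitchen".
TRAILER_RE = re.compile(r"^(Deploy-[A-Za-z]+):\s*(.+)$")


def parse_deploy_trailers(commit_message):
    """Return {"Deploy-Tags": [...], "Deploy-Info": [...]} for the tags found."""
    trailers = {}
    for line in commit_message.splitlines():
        match = TRAILER_RE.match(line.strip())
        if match:
            key, value = match.groups()
            # Assumption: several values may be given as a comma-separated list.
            values = [item.strip() for item in value.split(",") if item.strip()]
            trailers.setdefault(key, []).extend(values)
    return trailers


def ansible_tag_args(trailers):
    """Build the --tags argument for ansible-playbook, or nothing if no tags are set."""
    tags = trailers.get("Deploy-Tags", [])
    return ["--tags", ",".join(tags)] if tags else []


EXAMPLE_MESSAGE = """Raise MTU on fabric links

Deploy-Tags: cumulus_fr
Deploy-Info: kitchen
"""

parsed = parse_deploy_trailers(EXAMPLE_MESSAGE)
print(parsed)
print(ansible_tag_args(parsed))
```

Keeping the deployment scope in the commit itself means the reviewer sees exactly what will run before approving the pull request.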
Post-deployment validation is done after deploying changes to the network to check if they had the intended impact. Errors can still make it to the production network, but their impact is usually smaller. That’s why, as soon as the changes are pushed to the devices, we run the same Suzieq tests as in pre-deployment to double-check that the network is still in the desired state.
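Comparing the state before and after a deployment boils down to diffing two snapshots. The sketch below assumes each snapshot is a plain mapping from a key, such as a (hostname, peer) pair, to a state string; the function name and the sample BGP data are hypothetical.

```python
def diff_state(before, after):
    """Return {key: (old, new)} for every entry that changed, appeared, or vanished."""
    changes = {}
    for key in sorted(set(before) | set(after)):
        old, new = before.get(key), after.get(key)
        if old != new:
            changes[key] = (old, new)
    return changes


# Hypothetical BGP session snapshots taken before and after a deployment.
PRE_DEPLOY = {
    ("leaf1", "spine1"): "Established",
    ("leaf1", "spine2"): "Established",
}
POST_DEPLOY = {
    ("leaf1", "spine1"): "Established",
    ("leaf1", "spine2"): "NotEstd",
}

for (host, peer), (old, new) in diff_state(PRE_DEPLOY, POST_DEPLOY).items():
    print(f"{host} -> {peer}: {old} became {new}")
```

An empty result means the deployment left the observed state untouched; anything else is either the intended effect of the change or a reason to roll back.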
What Have We Learned?
We are still learning, as it’s a never-ending process. For now, we can push changes to production more safely because we have a validation layer that gives us some trust in those changes. If we trust our network, why shouldn’t our clients? At Hostinger, we always try to build services, networks, and software anticipating possible errors. That means always being prepared for the day your software or network fails, so you can fix it as soon as possible.
Part 2 of this article is here. It details how Hostinger uses Suzieq to perform network validation and provides a more detailed overview of our Batfish evaluation.