Network Validation Evolution at Hostinger [Part 2]

In our previous post, we discussed how Hostinger started using network validation before going live. By implementing network validation for our core network, we have maintained complete control over running the network at scale. 

Among other things, that post summarized how we use Suzieq to validate key aspects of the network. This time, we’ll go into more detail on how Hostinger uses Suzieq for network validation and take a closer look at Batfish.

To give you some numbers, we have 9 data centers (DCs) around the globe, with more coming soon. Each DC is different in terms of size: it can span from a couple to tens of racks. With automation on top, those size differences matter little, however quickly changes are pushed to production. For the end customer, using services from a company that continuously contributes to and performs network validation helps build a foundation of trust in the reliability of Hostinger products.

Suzieq

Continuously Running Poller Vs Snapshot

One of the first decisions we had to make, whichever tool we used for network validation, was whether to run the poller in snapshot (run-once) mode or as a continuously running service.

A continuously running poller has a higher engineering cost, no matter the tool, though it is the correct approach. The poller has to be running all the time and it must be highly available, i.e. it must recover from failures on its own.

Running the poller in snapshot mode is trivial from a maintainability perspective. It can be run independently in any environment, on a local machine (workstation) or in CI/CD, without depending on any long-running service. In our case, we poll the data once and then run the Python tests. At Hostinger, we have deployments spread across many geographic regions (Asia, Europe, US), and we have multiple DCs in each of these regions. We use Jenkins for our CI/CD pipeline. To ensure we run the same tests across all regions, we launch multiple Jenkins agents. Had we used a continuously running poller, it would have cost more engineering effort to set up and maintain.

Here’s an example of running sq-poller (running in a loop for each DC or region):

for DC in "${DATACENTERS[@]}"
do
  python generate_hosts_for_suzieq.py --datacenter "$DC"
  ../bin/sq-poller --devices-file "hosts-$DC.yml" \
    --ignore-known-hosts \
    --run-once gather \
    --exclude-services devconfig
  ../bin/sq-poller --input-dir ./sqpoller-output
  python -m pytest -s -v --no-header "test_$DC.py" || exit 5
done

You might be asking whether this combination of commands is necessary.

generate_hosts_for_suzieq.py is a wrapper that generates the hosts file from the Ansible inventory, with some extra sugar such as skipping specific hosts and setting ansible_host dynamically (our OOB network is highly available, so there are several doors into it).

The generated file looks similar to:

- namespace: xml
  hosts:
    - url: ssh://root@xml-oob.example.org:2232 keyfile=~/.ssh/id_rsa
    - url: ssh://root@xml-oob.example.org:2223 keyfile=~/.ssh/id_rsa
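To make the idea concrete, here is a minimal Python sketch of such a wrapper. It is not our actual generate_hosts_for_suzieq.py: the function name, the shape of the inventory dictionary, the device names, and port 2224 are all made up for illustration, and the real script reads the Ansible inventory instead.

```python
def build_suzieq_hosts(namespace, inventory, skip=(), user="root",
                       keyfile="~/.ssh/id_rsa"):
    """Render a hosts file like the one above from a simplified inventory.

    `inventory` maps a device name to an (oob_host, ssh_port) pair.
    Both this function and the inventory shape are hypothetical.
    """
    lines = [f"- namespace: {namespace}", "  hosts:"]
    for name in sorted(inventory):
        if name in skip:
            continue  # e.g. a device that is still being bootstrapped
        host, port = inventory[name]
        lines.append(f"    - url: ssh://{user}@{host}:{port} keyfile={keyfile}")
    return "\n".join(lines) + "\n"


# Illustrative data only: names and port 2224 are placeholders.
inventory = {
    "leaf1": ("xml-oob.example.org", 2232),
    "leaf2": ("xml-oob.example.org", 2223),
    "leaf3": ("xml-oob.example.org", 2224),
}
hosts_yaml = build_suzieq_hosts("xml", inventory, skip={"leaf3"})
print(hosts_yaml)
```

Skipping hosts at generation time keeps the poller from timing out on devices we already know are unreachable.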

Why run sq-poller twice, once with --run-once and once on the gathered output? There is already an open issue that is going to solve this problem. Eventually, it will only require adding a single --snapshot option, and that’s it.

Workflow for Validating Changes

Every new pull request (PR) creates a fresh, clean Python virtual environment (Pyenv) and starts the tests. The same happens when a PR is merged. 

The simplified workflow is:

  1. Make changes.
  2. Commit the changes and create a PR on GitHub.
  3. Poll and run the pytest tests with Suzieq (/tests/run-tests.sh <region|all>).
  4. Wait for the tests to turn green; a PR cannot be merged until they are.
  5. Merge the PR.
  6. Roll out to all our DCs one by one: deploy, then run the post-deployment pytest tests again.

The pre-merge part of the Jenkins pipeline looks something like this:

stage('Run pre-flight production tests') {
  when {
    expression {
      env.BRANCH_NAME != 'master' && !(env.DEPLOY_INFO ==~ /skip-suzieq/)
    }
  }
  parallel {
    stage('EU') {
      steps {
        sh './tests/prepare-tests-env.sh && ./tests/run-tests.sh ${EU_DC}'
      }
    }
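    stage('Asia') {
      agent {
        label 'deploy-sg'
      }
      // steps omitted in this excerpt; they mirror the EU stage
    }
    // remaining regional stages omitted
  }
}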
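The last step of the workflow, deploying DC by DC and testing after each one, can be sketched in a few lines of Python. This is only an illustration of the control flow: rollout, deploy and run_tests are hypothetical names, and the lambdas stand in for the real deployment and pytest steps.

```python
def rollout(datacenters, deploy, run_tests):
    """Deploy to one DC at a time and stop at the first failing test run.

    Returns (deployed, failed), where `failed` is the DC whose
    post-deployment tests failed, or None if every DC passed.
    """
    deployed = []
    for dc in datacenters:
        deploy(dc)
        deployed.append(dc)
        if not run_tests(dc):
            return deployed, dc  # do not touch the remaining DCs
    return deployed, None


# Stand-ins for the real deployment and test steps.
log = []
result = rollout(
    ["phx", "imm", "xml"],
    deploy=lambda dc: log.append(f"deploy {dc}"),
    run_tests=lambda dc: dc != "imm",  # pretend the imm tests fail
)
print(result)
```

Stopping at the first red DC limits the blast radius of a bad change to a single data center.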

Handling False Positives

Every test has a chance of a false positive, i.e. the test reveals a problem that is not real. This is true whether it’s a test for a disease or a test that verifies a change. At Hostinger, we assume that false positives will happen, and that’s normal. So, how do we handle them, and when?

In our environment, false positives occur mostly due to timeouts, connection errors during the scraping phase (poller), or when bootstrapping a new device. In such cases, we re-run the tests until they pass (green in the Jenkins pipeline). But if the failure is permanent (most likely a real one), the tests stay red. This means the PR does not get merged, and the changes are not deployed.

However, when a false positive persists, we add a Deploy-Info: skip-suzieq line to the Git commit message to tell the Jenkins pipeline to skip the tests (as you may have noticed in the DEPLOY_INFO check in the pipeline file above).
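For illustration, here is how such a commit-message line could be parsed in Python. How DEPLOY_INFO is actually populated in our Jenkins setup is not shown in this post, so treat the function names and the parsing rules below as assumptions rather than our implementation.

```python
import re

# Matches a "Deploy-Info: <value>" line anywhere in the commit message.
TRAILER = re.compile(r"^Deploy-Info:[ \t]*(?P<value>.+?)[ \t]*$", re.MULTILINE)


def deploy_info(commit_message):
    """Return the value of the last Deploy-Info line, or an empty string."""
    values = [m.group("value") for m in TRAILER.finditer(commit_message)]
    return values[-1] if values else ""


def should_skip_suzieq(commit_message):
    # Mirrors the full-match check `env.DEPLOY_INFO ==~ /skip-suzieq/`.
    return deploy_info(commit_message) == "skip-suzieq"


message = "Fix leaf template\n\nDeploy-Info: skip-suzieq\n"
print(should_skip_suzieq(message))
```

Keeping the override in the commit message means every skipped test run leaves a trace in the Git history.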

Adding New Tests

We run new or modified tests locally before they land in the Git repository. Unless a test is really trivial, it has to be run multiple times before we can call it useful. For example:

from suzieq.sqobjects import get_sqobject


def bgp_sessions_are_up(self):
    # Test if all BGP sessions are UP, i.e. none is in the NotEstd state
    assert (
        get_sqobject("bgp")().get(namespace=self.namespace, state="NotEstd").empty
    )

But a less trivial test looks more like this:

def uniq_asn_per_fabric(self):
    # Test if we have a unique ASN per fabric
    asns = {}
    for spine in self.spines.keys():
        for asn in (
            get_sqobject("bgp")()
            .get(hostname=[spine], query_str="afi == 'ipv4' and safi == 'unicast'")
            .peerAsn
        ):
            if asn == 65030:
                continue
            if asn not in asns:
                asns[asn] = 1
            else:
                asns[asn] += 1
    assert len(asns) > 0
    for asn in asns:
        assert asns[asn] == len(self.spines.keys())

A test like this needs to be reviewed carefully. It checks that AS numbers are unique within a DC: every peer ASN must show up as many times as there are spines, i.e. once per spine. ASN 65030 is skipped because it is shared by the routing-on-the-host instances that announce anycast services like DNS, load balancers, etc. This is a snippet of the test output (summary):

test_phx.py::test_bgp_sessions_are_up PASSED
test_phx.py::test_loopback_ipv4_is_uniq_per_device PASSED
test_phx.py::test_loopback_ipv6_is_uniq_per_device PASSED
test_phx.py::test_uniq_asn_per_fabric PASSED
test_phx.py::test_upstream_ports_are_in_correct_state PASSED
test_phx.py::test_evpn_fabric_links PASSED
test_phx.py::test_default_route_ipv4_from_upstreams PASSED
test_phx.py::test_ipv4_host_routes_received_from_hosts PASSED
test_phx.py::test_ipv6_host_routes_received_from_hosts PASSED
test_phx.py::test_evpn_fabric_bgp_sessions PASSED
test_phx.py::test_vlan100_assigned_interfaces PASSED
test_phx.py::test_evpn_fabric_arp PASSED
test_phx.py::test_no_failed_interface PASSED
test_phx.py::test_no_failed_bgp PASSED
test_phx.py::test_no_active_critical_alerts_firing PASSED
test_imm.py::test_bgp_sessions_are_up PASSED
test_imm.py::test_loopback_ipv4_is_uniq_per_device PASSED
test_imm.py::test_loopback_ipv6_is_uniq_per_device PASSED
test_imm.py::test_uniq_asn_per_fabric FAILED
test_imm.py::test_upstream_ports_are_in_correct_state PASSED
test_imm.py::test_default_route_ipv4_from_upstreams PASSED
test_imm.py::test_ipv4_host_routes_received_from_hosts PASSED
test_imm.py::test_ipv6_host_routes_received_from_hosts PASSED
test_imm.py::test_no_failed_bgp PASSED
test_imm.py::test_no_active_critical_alerts_firing PASSED
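The counting rule behind uniq_asn_per_fabric can be tried offline, without Suzieq or a live network. The sketch below uses collections.Counter on hand-written data; asn_violations and the sample ASNs 65001 and 65002 are hypothetical and only illustrate the logic.

```python
from collections import Counter

ROH_ASN = 65030  # shared by routing-on-the-host peers, so it is ignored


def asn_violations(peer_asns_by_spine, ignored=(ROH_ASN,)):
    """Return {asn: count} for every ASN not seen exactly once per spine."""
    counts = Counter(
        asn
        for asns in peer_asns_by_spine.values()
        for asn in asns
        if asn not in ignored
    )
    expected = len(peer_asns_by_spine)
    return {asn: n for asn, n in counts.items() if n != expected}


# Made-up peer ASNs as seen from two spines.
healthy = {"spine1": [65001, 65002, 65030], "spine2": [65001, 65002, 65030]}
broken = {"spine1": [65001, 65001, 65030], "spine2": [65001, 65001, 65030]}
print(asn_violations(healthy))  # {}
print(asn_violations(broken))   # {65001: 4}
```

Returning the offending ASNs, rather than only asserting, makes a failed run easier to read.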

In the output above, we notice that test_imm.py::test_uniq_asn_per_fabric has failed for that DC. Since we use an auto-derived ASN per switch (no static AS numbers in the Ansible inventory), a race can leave two switches with the same ASN, which is bad. Another example:

def loopback_ipv6_is_uniq_per_device(self):
    # Test if we don't have duplicate IPv6 loopback address
    addresses = get_sqobject("address")().unique(
        namespace=[self.namespace],
        columns=["ip6AddressList"],
        count=True,
        type="loopback",
    )
    addresses = addresses[addresses.ip6AddressList != "::1/128"]
    assert (addresses.numRows == 1).all()

This checks whether any IPv6 loopback address is used by more than one device in the same data center. The rule has proven its worth at least a couple of times. Duplicates mostly occur when we bootstrap a new switch and the Ansible host file is copy-pasted.
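The same duplicate check can be expressed in plain Python, which is handy for reasoning about the rule without a poller run. The helper name, hostnames and addresses below are hypothetical (the addresses come from the 2001:db8::/32 documentation range).

```python
from collections import defaultdict


def duplicate_loopbacks(rows, ignored=("::1/128",)):
    """Map each loopback address to the hosts using it, keeping only duplicates."""
    owners = defaultdict(set)
    for hostname, address in rows:
        if address not in ignored:  # every device has ::1/128
            owners[address].add(hostname)
    return {addr: sorted(hosts) for addr, hosts in owners.items() if len(hosts) > 1}


# Made-up (hostname, address) pairs for one data center.
rows = [
    ("leaf1", "::1/128"), ("leaf1", "2001:db8::1/128"),
    ("leaf2", "::1/128"), ("leaf2", "2001:db8::2/128"),
    ("leaf3", "::1/128"), ("leaf3", "2001:db8::2/128"),  # copy-pasted host file
]
duplicates = duplicate_loopbacks(rows)
print(duplicates)
```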

New tests are mainly added after a failure occurs, so that the same kind of problem is caught quickly, or mitigated in advance, the next time. For instance, if we switch from an L3-only to EVPN design, we might be surprised when ARP/ND exhaustion hits a wall, or L3 routes drop from several thousand to just a few.

Batfish

We have already evaluated Batfish twice. The first time was an overview and dry run to see what it could offer us. The first impression was something like “What’s wrong with my configuration?” because, at that time, Batfish didn’t support some of the configuration syntax for FRR. FRR is used by Cumulus Linux and many other massive projects, and it’s becoming the de facto best open-source routing suite. That’s why Batfish includes FRR as a vendor as well. It’s just that the FRR model needs more changes before it can be used in production (at least in our environment).

Later on, a month or two ago, we began investigating the product again to see what could really be done. From the operational perspective, it’s a really cool product because it allows the operator to construct the network model by parsing configuration files. On top of that, you can create snapshots, make some changes and see how your network behaves. For example, disable a link or a BGP peer and predict the changes before they go live. 
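To show what such a what-if question looks like, here is a toy Python sketch. It is deliberately not Batfish and does not use its API: it models a tiny leaf-spine fabric as a list of links (all names are made up) and checks reachability with a breadth-first search after taking links down.

```python
from collections import deque


def reachable(links, start, down=()):
    """Return the nodes reachable from `start`, ignoring links listed in `down`."""
    down = {frozenset(link) for link in down}
    neighbors = {}
    for a, b in links:
        if frozenset((a, b)) in down:
            continue  # this link is disabled in the what-if scenario
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for peer in neighbors.get(node, ()):
            if peer not in seen:
                seen.add(peer)
                queue.append(peer)
    return seen


# Hypothetical two-leaf, two-spine fabric.
links = [("leaf1", "spine1"), ("leaf1", "spine2"),
         ("leaf2", "spine1"), ("leaf2", "spine2")]
one_down = reachable(links, "leaf1", down=[("leaf1", "spine1")])
both_down = reachable(links, "leaf1",
                      down=[("leaf1", "spine1"), ("leaf1", "spine2")])
print("leaf2" in one_down, "leaf2" in both_down)
```

Batfish answers the same kind of question on a model built from the real configuration files, including routing protocol behavior that this toy ignores.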

We also started looking at Batfish as an open-source project, so that we could push changes back to the community. Here are a few examples of behavior modeling that was missing for our cases:

https://github.com/batfish/batfish/pull/7671/commits/4fa895fd675ae60a257f1e6e10d27348ed21d4a0

https://github.com/batfish/batfish/pull/7694/commits/115a81770e8a78471d28a6a0b209eef7bc34df88

https://github.com/batfish/batfish/pull/7670/commits/10ec5a03c15c48fd46890be4da394170fa6eb03a

https://github.com/batfish/batfish/pull/7666/commits/f440c5202dd8f338661e8b6bd9711067ba8652b6

https://github.com/batfish/batfish/pull/7666/commits/974c92535ecb5eedfe8fd57fc4295e59f2d4639d

https://github.com/batfish/batfish/pull/7710/commits/a2c368ae1b0a3477ba5b5e5e8f8ebe88e4bf2342

But a lot more are missing. We are big fans of IPv6, but unfortunately, IPv6 is not (yet?) well-covered in the FRR model in Batfish. 

This is not the first time we’ve missed IPv6 support, and, we guess, not the last either. We’re looking forward to and hoping Batfish will get IPv6 support soon. 

Some Best Practice Observations on Testing

We would say: keep tests segregated rather than throwing spaghetti at the wall. Write simple, understandable tests. If you see that two tests depend on each other, it’s better to split them into independent tests.

Some tests can overlap, and if one fails, then the other will too. But that’s good because two failed tests can say more than one, even if they test similar functionality. 

To confirm that tests are useful, you have to run and use them daily. Otherwise, there isn’t much point in having them.

If you can guess what may happen in the future, covering the possibility in tests is a good idea unless it’s too noisy. 

As always, the Pareto principle is the best guide to whether testing something is worth it and how much value the tests cover. If you cover at least 20% of the critical pieces with tests, most likely, your network is in good shape.

It’s absolutely not worth automating and testing all the things you come up with. It’s just additional taxation for no reason. You have to think about the maintainability of those tests with your team and make a decision. 

What makes us happy is that Suzieq is great by default, and there is no need to write very sophisticated tests in Python. The CLI is really awesome and trivial even for beginners. If you need something exceptional, you can always write the logic in Python, which is also friendly. Because the data comes wrapped in pandas, you can manipulate your network data as much as you want; it’s very flexible.

 

Author
Donatas Abraitis