Network Validation Evolution at Hostinger [Part 2]

In our previous post, we discussed how Hostinger started using network validation before going live. By implementing network validation for our core network, we maintained complete control of running the network at scale.

Among other things, that post summarized how we use Suzieq to validate key aspects of the network. This time, we’ll go into more detail on how Hostinger uses Suzieq to perform network validation, and give a more detailed overview of our evaluation of Batfish.

To give you some numbers, we have 9 data centers (DCs) around the globe, with more coming soon. Every data center is different in size, spanning from a couple of racks to tens of racks. With automation on top, those differences don’t matter much, no matter how quickly changes are pushed to production. For the end customer, using services from a company that continuously performs network validation and contributes back adds to the foundation of trust in the reliability of Hostinger products.

Suzieq

Continuously Running Poller Vs Snapshot

One of the first decisions we had to make, regardless of the tool used for network validation, was whether to run the poller in standalone (snapshot) mode or in continuously running mode.

A continuously running poller has a higher engineering cost, no matter the tool, even though it is the correct approach. In this mode, the poller has to run all the time, and it must be highly available, i.e. it must recover from failures on its own.

Running the poller in “snapshot” mode is trivial from a maintainability perspective. It can be run independently in any environment: on a local machine (workstation) or in CI/CD, without needing to keep any long-running service in mind. In our case, we poll the data once and then run the Python tests. At Hostinger, we have deployments spread across many geographic regions: Asia, Europe, and the US, with multiple DCs in each of these regions. We use Jenkins for our CI/CD pipeline. To ensure we run the same tests across all regions, we launch multiple Jenkins slaves. If we had used a continuously running poller, the engineering cost to set it up and maintain it would have been higher.

An example of running sq-poller in a loop for each DC or region:

for DC in "${DATACENTERS[@]}"
do
  # Generate the Suzieq devices file for this DC from the Ansible inventory
  python generate_hosts_for_suzieq.py --datacenter "$DC"

  # Poll the devices once (snapshot mode), skipping the device config service
  ../bin/sq-poller --devices-file "hosts-$DC.yml" \
    --ignore-known-hosts \
    --run-once gather \
    --exclude-services devconfig

  # Process the gathered output into the Suzieq data store
  ../bin/sq-poller --input-dir ./sqpoller-output

  # Run the per-DC Pytest suite; abort with a non-zero exit code on failure
  python -m pytest -s -v --no-header "test_$DC.py" || exit 5
done

You might be asking whether this combination of commands is really necessary.

generate_hosts_for_suzieq.py is a wrapper that generates the hosts file from our Ansible inventory, with some extra sugar inside, such as skipping specific hosts and setting ansible_host dynamically (our OOB network is highly available, which means there are several doors through which to reach a device).

The generated file looks similar to:

- namespace: xml
  hosts:
    - url: ssh://root@xml-oob.example.org:2232 keyfile=~/.ssh/id_rsa
    - url: ssh://root@xml-oob.example.org:2223 keyfile=~/.ssh/id_rsa
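
For illustration only, here is a minimal sketch of what such a wrapper could look like. It assumes a plain YAML Ansible inventory with an ansible_host variable per device; the inventory.yml file name and its layout are made up, and the real script does more (skipping hosts, choosing one of several OOB doors):

#!/usr/bin/env python3
# Minimal sketch of a generate_hosts_for_suzieq.py-style wrapper (illustrative only).
# It reads a hypothetical YAML Ansible inventory and writes a Suzieq devices file.
import argparse
import yaml


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--datacenter", required=True)
    args = parser.parse_args()

    # Assumed inventory layout: {"xml": {"hosts": {"switch1": {"ansible_host": "..."}}}}
    with open("inventory.yml") as f:
        inventory = yaml.safe_load(f)

    hosts = [
        {"url": f"ssh://root@{host_vars['ansible_host']}:22 keyfile=~/.ssh/id_rsa"}
        for host_vars in inventory[args.datacenter]["hosts"].values()
    ]

    with open(f"hosts-{args.datacenter}.yml", "w") as f:
        yaml.safe_dump([{"namespace": args.datacenter, "hosts": hosts}], f)


if __name__ == "__main__":
    main()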

Why bundle --run-once gather and a second sq-poller run? There is already an open issue that is going to solve this problem. Eventually, it will require just a single --snapshot option, and that’s it.

Workflow For Validating Changes

Every new Pull Request (PR) creates a fresh, clean Python virtual environment (Pyenv) and starts the tests. The same happens when the PR is merged.

The simplified workflow was: 

  1. Make the changes.
  2. Commit the changes and create a PR on GitHub.
  3. Poll and run Pytest tests with Suzieq (/tests/run-tests.sh <region|all>).
  4. Require the tests to be green before the PR is allowed to merge.
  5. Merge the PR.
  6. Iterate over all our DCs one by one: deploy, then run the post-deployment Pytest tests again.

Something like:

stage('Run pre-flight production tests') {
  when {
    expression {
      env.BRANCH_NAME != 'master' && !(env.DEPLOY_INFO ==~ /skip-suzieq/)
    }
  }
  parallel {
    stage('EU') {
      steps {
        sh './tests/prepare-tests-env.sh && ./tests/run-tests.sh ${EU_DC}'
      }
    }
    stage('Asia') {
      agent {
        label 'deploy-sg'
      }
      // steps analogous to the EU stage; remaining stages and closing braces omitted
    }

Handling False Positives

Every test has a chance of a false positive, i.e. the test reveals a problem that is not real. This is true whether it’s a test for a disease or a test verifying a change. At Hostinger, we assume that false positives will happen, and that’s normal. So, how and when do we handle them?

In our environment, false positives occur mostly due to timeouts, connection errors during the scraping phase (poller), or when bootstrapping a new device. In such cases, we re-run the tests until they pass (green in the Jenkins pipeline). But if we have a permanent failure (most likely a real one), the tests remain in a red state. This means the PR does not get merged, and the changes are not deployed.

However, when we see this behavior and are sure it’s a false positive, we use the Git commit tag Deploy-Info: skip-suzieq to tell the Jenkins pipeline to skip the tests (as you may have noticed in the pipeline file above).
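
We won’t show our exact Jenkins helper here, but as a rough sketch of the idea, DEPLOY_INFO could be populated by reading the Deploy-Info: trailer from the last commit message with something like the following (only the trailer name comes from our setup; the rest is illustrative):

# Illustrative sketch: extract the Deploy-Info trailer from the last commit message
# so a pipeline can export it as DEPLOY_INFO and match it against /skip-suzieq/.
import re
import subprocess


def deploy_info() -> str:
    message = subprocess.run(
        ["git", "log", "-1", "--format=%B"],
        capture_output=True, text=True, check=True,
    ).stdout
    match = re.search(r"^Deploy-Info:\s*(.+)$", message, re.MULTILINE)
    return match.group(1).strip() if match else ""


if __name__ == "__main__":
    print(deploy_info())  # e.g. "skip-suzieq"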

Adding New Tests

We test new or modified tests locally first, before they land in the Git repository. Unless a test is really trivial, it needs to be exercised multiple times before we trust it. For example:

def bgp_sessions_are_up(self):
    # Test if all BGP sessions are UP
    assert (
        get_sqobject("bgp")().get(namespace=self.namespace, state="NotEstd").empty
    )

But if we are talking about something like:

def uniq_asn_per_fabric(self):
    # Test if we have a unique ASN per fabric
    asns = {}
    for spine in self.spines.keys():
        for asn in (
            get_sqobject("bgp")()
            .get(hostname=[spine], query_str="afi == 'ipv4' and safi == 'unicast'")
            .peerAsn
        ):
            if asn == 65030:
                continue
            if asn not in asns:
                asns[asn] = 1
            else:
                asns[asn] += 1
    assert len(asns) > 0
    for asn in asns:
        assert asns[asn] == len(self.spines.keys())

This needs to be carefully reviewed. Here we check whether we have a unique AS number per fabric (DC). ASN 65030 is skipped because it’s used for routing on the host, where instances announce anycast services like DNS, load balancers, etc. This is a snippet of the test output (summary):

test_phx.py::test_bgp_sessions_are_up PASSED
test_phx.py::test_loopback_ipv4_is_uniq_per_device PASSED
test_phx.py::test_loopback_ipv6_is_uniq_per_device PASSED
test_phx.py::test_uniq_asn_per_fabric PASSED
test_phx.py::test_upstream_ports_are_in_correct_state PASSED
test_phx.py::test_evpn_fabric_links PASSED
test_phx.py::test_default_route_ipv4_from_upstreams PASSED
test_phx.py::test_ipv4_host_routes_received_from_hosts PASSED
test_phx.py::test_ipv6_host_routes_received_from_hosts PASSED
test_phx.py::test_evpn_fabric_bgp_sessions PASSED
test_phx.py::test_vlan100_assigned_interfaces PASSED
test_phx.py::test_evpn_fabric_arp PASSED
test_phx.py::test_no_failed_interface PASSED
test_phx.py::test_no_failed_bgp PASSED
test_phx.py::test_no_active_critical_alerts_firing PASSED
test_imm.py::test_bgp_sessions_are_up PASSED
test_imm.py::test_loopback_ipv4_is_uniq_per_device PASSED
test_imm.py::test_loopback_ipv6_is_uniq_per_device PASSED
test_imm.py::test_uniq_asn_per_fabric FAILED
test_imm.py::test_upstream_ports_are_in_correct_state PASSED
test_imm.py::test_default_route_ipv4_from_upstreams PASSED
test_imm.py::test_ipv4_host_routes_received_from_hosts PASSED
test_imm.py::test_ipv6_host_routes_received_from_hosts PASSED
test_imm.py::test_no_failed_bgp PASSED
test_imm.py::test_no_active_critical_alerts_firing PASSED

Here, we catch that the test_imm.py::test_uniq_asn_per_fabric test failed for this DC. Since we auto-derive the ASN per switch (there are no static AS numbers in the Ansible inventory), a race can occur and produce a duplicate ASN, which is bad.

Or something like:

def loopback_ipv6_is_uniq_per_device(self):
    # Test if we don't have duplicate IPv6 loopback address
    addresses = get_sqobject("address")().unique(
        namespace=[self.namespace],
        columns=["ip6AddressList"],
        count=True,
        type="loopback",
    )
    addresses = addresses[addresses.ip6AddressList != "::1/128"]
    assert (addresses.numRows == 1).all()

This checks that we don’t have a duplicate IPv6 loopback address across devices in the same data center. This rule is valid and has proven itself at least a couple of times. Duplicates mostly happen when we bootstrap a new switch and the Ansible host file is copy-pasted.

New tests are mainly added when a failure occurs and we want to catch it quickly, or mitigate it in advance, the next time. For instance, if we switch from an L3-only to an EVPN design, we might be surprised when ARP/ND exhaustion hits the wall, or when the number of L3 routes drops from several thousand to just a few.
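
As an illustration of such a forward-looking test, the sketch below uses Suzieq’s arpnd table to assert that no device in a namespace is approaching an assumed ARP/ND table limit; the limit and the 80% margin are hypothetical numbers, not values from our fabrics:

# Sketch of a forward-looking check: fail if any device's ARP/ND table is close
# to an assumed platform limit. MAX_ARPND_ENTRIES and the 80% margin are made up.
from suzieq.sqobjects import get_sqobject

MAX_ARPND_ENTRIES = 8000  # hypothetical hardware limit


def arpnd_table_is_not_close_to_exhaustion(namespace):
    arpnd = get_sqobject("arpnd")().get(namespace=[namespace])
    entries_per_device = arpnd.groupby("hostname").size()
    assert (entries_per_device < 0.8 * MAX_ARPND_ENTRIES).all()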

Batfish

We have already evaluated Batfish twice. The first time was a kind of overview and dry run to see what it could offer us. The first impression was something like “What’s wrong with my configuration?” because, at that time, Batfish didn’t support some of the configuration syntax for FRR. FRR is used by Cumulus Linux and many other major projects, and it has become the de facto open-source routing suite. That’s also why Batfish includes FRR as a vendor. However, the FRR model needs more changes before it can be used in production (at least in our environment).

Later, a month or two ago, we began investigating the product again to see what could really be done with it. From an operational perspective, it’s a really cool product because it allows the operator to construct a model of the network by parsing configuration files. On top of that, you can create snapshots, make some changes, and see how your network behaves: for example, disable a link or a BGP peer and predict the impact before it goes live.
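
As a rough, hedged illustration of that workflow (not our production code), pybatfish can initialize a snapshot from a directory of configs, fork it with an interface deactivated, and compare BGP session status before and after; the network name, config path, hostname, and interface below are made up:

# Rough illustration of Batfish "what if" analysis with pybatfish.
# Network name, snapshot path, hostname, and interface are made up.
from pybatfish.client.session import Session
from pybatfish.datamodel import Interface

bf = Session(host="localhost")
bf.set_network("dc-fabric")
bf.init_snapshot("configs/dc1", name="base", overwrite=True)

# Fork the snapshot with one fabric link administratively shut down
bf.fork_snapshot(
    "base",
    "link-down",
    deactivate_interfaces=[Interface(hostname="spine1", interface="swp1")],
    overwrite=True,
)

# Compare BGP session status before and after the simulated failure
before = bf.q.bgpSessionStatus().answer(snapshot="base").frame()
after = bf.q.bgpSessionStatus().answer(snapshot="link-down").frame()
print(before[["Node", "Remote_Node", "Established_Status"]])
print(after[["Node", "Remote_Node", "Established_Status"]])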

We started looking at Batfish as an open-source project, too, to push changes back to the community. A couple of examples of missing behavior modeling for our cases:

https://github.com/batfish/batfish/pull/7671/commits/4fa895fd675ae60a257f1e6e10d27348ed21d4a0

https://github.com/batfish/batfish/pull/7694/commits/115a81770e8a78471d28a6a0b209eef7bc34df88

https://github.com/batfish/batfish/pull/7670/commits/10ec5a03c15c48fd46890be4da394170fa6eb03a

https://github.com/batfish/batfish/pull/7666/commits/f440c5202dd8f338661e8b6bd9711067ba8652b6

https://github.com/batfish/batfish/pull/7666/commits/974c92535ecb5eedfe8fd57fc4295e59f2d4639d

https://github.com/batfish/batfish/pull/7710/commits/a2c368ae1b0a3477ba5b5e5e8f8ebe88e4bf2342

But a lot more is still missing. We are big fans of IPv6, but unfortunately, IPv6 is not well covered (yet?) in the FRR model in Batfish.

This is not the first time we have missed IPv6 support, and, we guess, it won’t be the last. We are looking forward to, and hoping for, IPv6 support in Batfish soon.

Some Best Practice Observations on Testing

We would say: keep tests segregated, to avoid throwing spaghetti at the wall. Write simple, understandable tests. If you see that two tests depend on each other, it’s better to split them into separate tests.

Some tests can overlap, and if one fails, then the other fails too. But that’s good because two failed tests can say more than one, even if they test similar functionality. 

To confirm that tests are useful, you have to run and use them daily. Otherwise, we don’t see much point in having them.

If you can guess what might happen in the future, covering it in tests is a good idea, unless it’s too noisy.

As always, the Pareto principle is the best answer to whether it’s worth it and how much is worth covering with tests. If you cover at least the 20% that is most critical, your network is most likely in good shape.

It’s absolutely not worth automating and testing everything that comes to mind; that’s just additional taxation for no reason. Think about the maintainability of those tests with your team and decide.

What makes us happy is that Suzieq is great by default, and there is no need to write very sophisticated tests in Python. The CLI is really awesome and easy even for beginners. If you need something exceptional, you are always welcome to write the logic in Python, which is also friendly. Built on top of the pandas library, it lets you manipulate your network data as much as you want; it’s very flexible.

 

Author

Donatas Abraitis / @ton31337
