{"id":2164,"date":"2021-11-19T13:19:46","date_gmt":"2021-11-19T13:19:46","guid":{"rendered":"https:\/\/www.hostinger.com\/blog\/?p=2164"},"modified":"2022-03-29T13:15:59","modified_gmt":"2022-03-29T13:15:59","slug":"network-validation-evolution-at-hostinger-part-2","status":"publish","type":"post","link":"https:\/\/www.hostinger.com\/blog\/network-validation-evolution-at-hostinger-part-2","title":{"rendered":"Network Validation Evolution at Hostinger [Part 2]"},"content":{"rendered":"<p>In our previous <a href=\"https:\/\/www.hostinger.com\/blog\/network-validation-evolution-at-hostinger\">post<\/a>, we discussed how Hostinger started using network validation before going live. By implementing network validation for our core network, we have maintained complete control over running the network at scale.&nbsp;<\/p><p>Among other things, the post summarizes the use of <a href=\"https:\/\/suzieq.readthedocs.io\/en\/latest\" target=\"_blank\" rel=\"noopener\">Suzieq<\/a> to validate key aspects of the network. This time, we&rsquo;ll go into more detail on how Hostinger uses Suzieq to perform network validation and take a closer look at <a href=\"https:\/\/www.batfish.org\/\" target=\"_blank\" rel=\"noopener\">Batfish<\/a>.&nbsp;<\/p><p>To give you some numbers: we have 9 data centers (DCs) around the globe, with more coming soon. Each DC is a different size, spanning from a couple of racks to tens of racks. With automation on top, that difference hardly matters &ndash; changes are pushed to production just as quickly either way. 
For the end customer, using services provided by a company that continuously performs &ndash; and contributes to &ndash; network validation builds a foundation of trust in the reliability of Hostinger products.<\/p><h2 class=\"wp-block-heading\" id=\"h-suzieq\"><strong>Suzieq<\/strong><\/h2><h3 class=\"wp-block-heading\" id=\"h-continuously-running-poller-vs-snapshot\">Continuously Running Poller vs Snapshot<\/h3><p>One of the first decisions we had to make with any network validation tool was whether to run the poller in snapshot (standalone) mode or in continuously running mode.<\/p><p>A continuously running poller has a higher engineering cost, no matter the tool, though it is the more correct approach. The poller has to be running all the time, and it must be highly available, i.e. it must recover from failures.<\/p><p>Running the poller in snapshot mode is trivial from a maintainability perspective. It can be run independently in any environment &ndash; on a local machine (workstation) or in CI\/CD &ndash; without needing to keep any service running. In our case, we poll the data once and then run the Python tests. At Hostinger, we have deployments spread across many geographic regions &ndash; Asia, Europe, and the US &ndash; with multiple DCs in each of these regions. We use Jenkins for our CI\/CD pipeline. To ensure we run the same tests across all regions, we launch multiple Jenkins slaves. 
If we&rsquo;d used a continuously running poller, the engineering cost of setting it up and maintaining it would&rsquo;ve been higher.&nbsp;<\/p><p>Here&rsquo;s an example of running&nbsp;<strong>sq-poller<\/strong>&nbsp;(run in a loop for each DC or region):<\/p><pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">for DC in \"${DATACENTERS[@]}\"\ndo\n  python generate_hosts_for_suzieq.py --datacenter \"$DC\"\n  ..\/bin\/sq-poller --devices-file \"hosts-$DC.yml\" \\\n    --ignore-known-hosts \\\n    --run-once gather \\\n    --exclude-services devconfig\n  ..\/bin\/sq-poller --input-dir .\/sqpoller-output\n  python -m pytest -s -v --no-header \"test_$DC.py\" || exit 5\ndone<\/pre><p>You might be asking whether this combination of commands is necessary.<\/p><p><strong>generate_hosts_for_suzieq.py<\/strong> is a wrapper that generates the hosts file from the Ansible inventory, with some extra sugar inside, like skipping specific hosts and setting <strong>ansible_host<\/strong> dynamically (our OOB network is highly available, meaning there are several doors through which to reach a device).&nbsp;<\/p><p>The generated file looks similar to:<\/p><pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">- namespace: xml\n  hosts:\n    - url: ssh:\/\/root@xml-oob.example.org:2232 keyfile=~\/.ssh\/id_rsa\n    - url: ssh:\/\/root@xml-oob.example.org:2223 keyfile=~\/.ssh\/id_rsa<\/pre><p>Why bundle the run-once gather step and the sq-poller read step? There is already an open <a href=\"https:\/\/github.com\/netenglabs\/suzieq\/issues\/429\" target=\"_blank\" rel=\"noopener\">issue<\/a> that will solve this problem. 
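Stepping back to the hosts file for a moment: the core of that wrapper can be sketched roughly as follows. This is a minimal illustration, not the real script &ndash; the actual generate_hosts_for_suzieq.py reads the Ansible inventory and handles host skipping and OOB door selection, while here a static dict stands in for the inventory.

```python
#!/usr/bin/env python3
"""Rough sketch of a hosts-file generator for sq-poller (illustration only)."""

# Hypothetical inventory snippet: DC name -> list of (OOB endpoint, SSH port).
# The real script derives this from the Ansible inventory instead.
INVENTORY = {
    "xml": [("xml-oob.example.org", 2232), ("xml-oob.example.org", 2223)],
}


def generate_hosts(datacenter: str, keyfile: str = "~/.ssh/id_rsa") -> str:
    """Render a Suzieq devices file with one namespace per data center."""
    lines = [f"- namespace: {datacenter}", "  hosts:"]
    for host, port in INVENTORY[datacenter]:
        lines.append(f"    - url: ssh://root@{host}:{port} keyfile={keyfile}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Writes the devices file consumed by `sq-poller --devices-file`.
    with open("hosts-xml.yml", "w") as f:
        f.write(generate_hosts("xml"))
```

Run for the hypothetical "xml" DC, this produces exactly the kind of file shown above.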
Eventually, it will require just adding a single <em>&ndash;snapshot<\/em> option, and that&rsquo;s it.<\/p><h3 class=\"wp-block-heading\">Workflow for Validating Changes<\/h3><p>Every new pull request (PR) creates a fresh, clean Python virtual environment (Pyenv) and starts the tests. The same happens when a PR is merged.&nbsp;<\/p><p>The simplified workflow:&nbsp;<\/p><ol class=\"wp-block-list\"><li>Make changes.<\/li><li>Commit the changes and create a PR on GitHub.<\/li><li>Poll and run the PyTest tests with Suzieq (<em>\/tests\/run-tests.sh &lt;region|all&gt;<\/em>).<\/li><li>Require the tests to be green before the PR is allowed to merge.&nbsp;<\/li><li>Merge the PR.<\/li><li>Iterate over all our DCs one by one &ndash; deploy, then run the post-deployment PyTest suite again.<\/li><\/ol><p>Something like:<\/p><pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">stage('Run pre-flight production tests') {\n  when {\n    expression {\n      env.BRANCH_NAME != 'master' &amp;&amp; !(env.DEPLOY_INFO ==~ \/skip-suzieq\/)\n    }\n  }\n  parallel {\n    stage('EU') {\n      steps {\n        sh '.\/tests\/prepare-tests-env.sh &amp;&amp; .\/tests\/run-tests.sh ${EU_DC}'\n      }\n    }\n    stage('Asia') {\n      agent {\n        label 'deploy-sg'\n      }\n    }<\/pre><h3 class=\"wp-block-heading\">Handling False Positives<\/h3><p>Every test has a chance of a false positive, i.e. the test reveals a problem that is not real. This is as true for medical tests as for tests verifying a network change. At Hostinger, we assume that false positives will happen, and that&rsquo;s normal. So, how and when do we handle them?&nbsp;<\/p><p>In our environment, false positives occur mostly due to timeouts, connection errors during the scraping (polling) phase, or when bootstrapping a new device. 
In such a case, we re-run the tests until they pass (go green in the Jenkins pipeline). But if we have a permanent failure (most likely a real one), the tests remain red, the PR does not get merged, and the changes are not deployed.&nbsp;<\/p><p>In the case of a false positive, however, we use the Git commit tag <strong>Deploy-Info: skip-suzieq<\/strong> to tell the Jenkins pipeline to ignore the tests (as you may have noticed in the pipeline file above).&nbsp;<\/p><h3 class=\"wp-block-heading\">Adding New Tests<\/h3><p>We test new or modified tests locally before they land in the Git repository. Unless it&rsquo;s really trivial, a test needs to be exercised multiple times before we consider it useful. For example:<\/p><pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">def bgp_sessions_are_up(self):\n    # Test if all BGP sessions are UP\n    assert (\n        get_sqobject(\"bgp\")().get(namespace=self.namespace, state=\"NotEstd\").empty\n    )<\/pre><p>But if we are talking about something like:<\/p><pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">def uniq_asn_per_fabric(self):\n    # Test if we have a unique ASN per fabric\n    asns = {}\n    for spine in self.spines.keys():\n        for asn in (\n            get_sqobject(\"bgp\")()\n            .get(hostname=[spine], query_str=\"afi == 'ipv4' and safi == 'unicast'\")\n            .peerAsn\n        ):\n            if asn == 65030:\n                continue\n            if asn not in asns:\n                asns[asn] = 1\n            else:\n                asns[asn] += 1\n    assert len(asns) &gt; 0\n    for asn in asns:\n        assert asns[asn] == len(self.spines.keys())<\/pre><p>This needs to be carefully reviewed. Here we check that AS numbers are unique per DC. ASN 65030 is skipped because it&rsquo;s used by <em>routing on the host<\/em> instances to announce anycast services like DNS, load balancers, etc. This is a snippet of the tests&rsquo; output (summary):<\/p><pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">test_phx.py::test_bgp_sessions_are_up PASSED\ntest_phx.py::test_loopback_ipv4_is_uniq_per_device PASSED\ntest_phx.py::test_loopback_ipv6_is_uniq_per_device PASSED\ntest_phx.py::test_uniq_asn_per_fabric PASSED\ntest_phx.py::test_upstream_ports_are_in_correct_state PASSED\ntest_phx.py::test_evpn_fabric_links PASSED\ntest_phx.py::test_default_route_ipv4_from_upstreams PASSED\ntest_phx.py::test_ipv4_host_routes_received_from_hosts PASSED\ntest_phx.py::test_ipv6_host_routes_received_from_hosts PASSED\ntest_phx.py::test_evpn_fabric_bgp_sessions PASSED\ntest_phx.py::test_vlan100_assigned_interfaces PASSED\ntest_phx.py::test_evpn_fabric_arp PASSED\ntest_phx.py::test_no_failed_interface PASSED\ntest_phx.py::test_no_failed_bgp PASSED\ntest_phx.py::test_no_active_critical_alerts_firing PASSED\ntest_imm.py::test_bgp_sessions_are_up PASSED\ntest_imm.py::test_loopback_ipv4_is_uniq_per_device PASSED\ntest_imm.py::test_loopback_ipv6_is_uniq_per_device PASSED\ntest_imm.py::test_uniq_asn_per_fabric FAILED\ntest_imm.py::test_upstream_ports_are_in_correct_state PASSED\ntest_imm.py::test_default_route_ipv4_from_upstreams PASSED\ntest_imm.py::test_ipv4_host_routes_received_from_hosts PASSED\ntest_imm.py::test_ipv6_host_routes_received_from_hosts PASSED\ntest_imm.py::test_no_failed_bgp PASSED\ntest_imm.py::test_no_active_critical_alerts_firing PASSED<\/pre><p>Here, we notice that this 
DC&rsquo;s&nbsp;<em>test_imm.py::test_uniq_asn_per_fabric <\/em>test has failed. Since we auto-derive the ASN per switch (no static AS numbers in the Ansible inventory), a race condition could produce a duplicate ASN, which is bad.&nbsp;Or something like:<\/p><pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">def loopback_ipv6_is_uniq_per_device(self):\n    # Test if we don't have duplicate IPv6 loopback address\n    addresses = get_sqobject(\"address\")().unique(\n        namespace=[self.namespace],\n        columns=[\"ip6AddressList\"],\n        count=True,\n        type=\"loopback\",\n    )\n    addresses = addresses[addresses.ip6AddressList != \"::1\/128\"]\n    assert (addresses.numRows == 1).all()<\/pre><p>This checks whether any IPv6 loopback address is duplicated across devices in the same data center. The rule has proven its worth at least a couple of times &ndash; duplicates mostly appear when we bootstrap a new switch and the Ansible host file is copy-pasted.&nbsp;<\/p><p>New tests are mostly added after a failure occurs, so that similar problems can be caught quickly or mitigated in advance in the future. For instance, if we switch from an L3-only to an EVPN design, we might be surprised when ARP\/ND exhaustion hits a wall, or when L3 routes drop from several thousand to just a few.&nbsp;<\/p><h2 class=\"wp-block-heading\" id=\"h-batfish\">Batfish<\/h2><p>We have already evaluated <a href=\"https:\/\/www.batfish.org\" target=\"_blank\" rel=\"noopener\">Batfish<\/a> twice. The first evaluation was a high-level overview and dry run to see what it could offer us. 
The first impression was something like &ldquo;<em>What&rsquo;s wrong with my configuration?<\/em>&rdquo; because, at that time, Batfish didn&rsquo;t support some of the configuration syntax of <a href=\"https:\/\/frrouting.org\" target=\"_blank\" rel=\"noopener\">FRR<\/a>. <a href=\"https:\/\/frrouting.org\" target=\"_blank\" rel=\"noopener\">FRR<\/a> is used by Cumulus Linux and many other large projects, and it&rsquo;s becoming the de facto standard open-source routing suite &ndash; which is why Batfish includes FRR as a vendor in the first place. It&rsquo;s just that the FRR model needs more changes before it can be used in production (at least in our environment).&nbsp;<\/p><p>Later, a month or two ago, we began investigating the product again to see what could really be done with it. From an operational perspective, it&rsquo;s a really cool product because it allows the operator to construct a model of the network by parsing configuration files. On top of that, you can create snapshots, make changes, and see how your network behaves &ndash; for example, disable a link or a BGP peer and predict the impact before the change goes live.&nbsp;<\/p><p>We also started looking at Batfish as an open-source project, to push changes back to the community. 
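To make that what-if workflow concrete, here is a hedged sketch using pybatfish, the Batfish Python client. The session host, network name, snapshot directory, and the leaf1\/swp1 names are all made up, and a running Batfish service is required; the imports sit inside the function so the sketch reads standalone without pybatfish installed.

```python
def plan_link_down(config_dir: str = "snapshots/dc1/configs"):
    """Fork a Batfish snapshot with one interface down and diff BGP sessions.

    Sketch only: host, paths, and device/interface names are illustrative,
    and a Batfish service must be reachable for this to run.
    """
    # Imported lazily so the sketch can be read without pybatfish installed.
    from pybatfish.client.session import Session
    from pybatfish.datamodel import Interface

    bf = Session(host="localhost")
    bf.set_network("dc1")
    bf.init_snapshot(config_dir, name="base", overwrite=True)

    # "What if swp1 on leaf1 goes down?" - fork the snapshot with the
    # interface deactivated instead of touching any real device.
    bf.fork_snapshot(
        "base",
        "leaf1-swp1-down",
        deactivate_interfaces=[Interface(hostname="leaf1", interface="swp1")],
        overwrite=True,
    )

    # Differential answer: BGP sessions whose state differs between snapshots.
    return (
        bf.q.bgpSessionStatus()
        .answer(snapshot="leaf1-swp1-down", reference_snapshot="base")
        .frame()
    )
```

If the returned frame is non-empty, the simulated link failure would change BGP session state somewhere in the fabric &ndash; exactly the kind of prediction described above.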
Here are a couple of examples of missing behavior modeling for our cases:<\/p><p><a href=\"https:\/\/github.com\/batfish\/batfish\/pull\/7671\/commits\/4fa895fd675ae60a257f1e6e10d27348ed21d4a0\" target=\"_blank\" rel=\"noopener\">https:\/\/github.com\/batfish\/batfish\/pull\/7671\/commits\/4fa895fd675ae60a257f1e6e10d27348ed21d4a0<\/a><\/p><p><a href=\"https:\/\/github.com\/batfish\/batfish\/pull\/7694\/commits\/115a81770e8a78471d28a6a0b209eef7bc34df88\" target=\"_blank\" rel=\"noopener\">https:\/\/github.com\/batfish\/batfish\/pull\/7694\/commits\/115a81770e8a78471d28a6a0b209eef7bc34df88<\/a><\/p><p><a href=\"https:\/\/github.com\/batfish\/batfish\/pull\/7670\/commits\/10ec5a03c15c48fd46890be4da394170fa6eb03a\" target=\"_blank\" rel=\"noopener\">https:\/\/github.com\/batfish\/batfish\/pull\/7670\/commits\/10ec5a03c15c48fd46890be4da394170fa6eb03a<\/a><\/p><p><a href=\"https:\/\/github.com\/batfish\/batfish\/pull\/7666\/commits\/f440c5202dd8f338661e8b6bd9711067ba8652b6\" target=\"_blank\" rel=\"noopener\">https:\/\/github.com\/batfish\/batfish\/pull\/7666\/commits\/f440c5202dd8f338661e8b6bd9711067ba8652b6<\/a><\/p><p><a href=\"https:\/\/github.com\/batfish\/batfish\/pull\/7666\/commits\/974c92535ecb5eedfe8fd57fc4295e59f2d4639d\" target=\"_blank\" rel=\"noopener\">https:\/\/github.com\/batfish\/batfish\/pull\/7666\/commits\/974c92535ecb5eedfe8fd57fc4295e59f2d4639d<\/a><\/p><p><a href=\"https:\/\/github.com\/batfish\/batfish\/pull\/7710\/commits\/a2c368ae1b0a3477ba5b5e5e8f8ebe88e4bf2342\" target=\"_blank\" rel=\"noopener\">https:\/\/github.com\/batfish\/batfish\/pull\/7710\/commits\/a2c368ae1b0a3477ba5b5e5e8f8ebe88e4bf2342<\/a><\/p><p>But a lot more are missing. We are big fans of <a href=\"https:\/\/www.hostinger.com\/blog\/awex-ipv6\">IPv6<\/a>, but unfortunately, IPv6 is not (yet?) 
well-covered in the FRR model in Batfish.&nbsp;<\/p><p>This is not the <a href=\"https:\/\/www.hostinger.com\/blog\/proxysql-ipv6\">first time<\/a> we&rsquo;ve missed IPv6 support and, we guess, it won&rsquo;t be the last. We&rsquo;re looking forward to Batfish getting IPv6 support soon.&nbsp;<\/p><h2 class=\"wp-block-heading\" id=\"h-some-best-practice-observations-on-testing\">Some Best Practice Observations on Testing<\/h2><p>Keep tests segregated &ndash; it helps you avoid throwing spaghetti at the wall. Write simple, understandable tests, and if you see that two tests depend on each other, split them into separate tests.&nbsp;<\/p><p>Some tests can overlap, and if one fails, the other will too. But that&rsquo;s good &ndash; two failed tests can say more than one, even if they test similar functionality.&nbsp;<\/p><p>To confirm that tests are useful, you have to run and use them daily. Otherwise, there isn&rsquo;t much point in having them.<\/p><p>If you can guess what may happen in the future, covering the possibility in tests is a good idea unless it&rsquo;s too noisy.&nbsp;<\/p><p>As always, the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Pareto_principle\" target=\"_blank\" rel=\"noopener\">Pareto Principle<\/a> is the best answer to whether testing is worth it and how much to cover. If you cover at least the 20% that makes up the critical pieces, your network is most likely in good shape.&nbsp;<\/p><p>It&rsquo;s absolutely not worth automating and testing everything you come up with &ndash; that&rsquo;s just additional tax for no reason. Think about the maintainability of those tests with your team, then decide.&nbsp;<\/p><p>What makes us happy is that Suzieq is great by default, and there is no need to write very sophisticated tests in Python. The CLI is really awesome and trivial even for beginners. 
If you need something exceptional, you are always welcome to write the logic in Python, which is also beginner-friendly. Backed by the <strong>pandas<\/strong> library, you can manipulate your network data however you want &ndash; it&rsquo;s very flexible.<\/p>\n","protected":false},"author":39,"featured_media":2166,"categories":[82],"tags":[2312,2279,2285,1195,264,2286,2283,2290,2289]}