{"id":58,"date":"2016-09-26T10:37:05","date_gmt":"2016-09-26T10:37:05","guid":{"rendered":"http:\/\/blog.hostinger.io\/blog\/?p=58"},"modified":"2022-09-26T16:16:25","modified_gmt":"2022-09-26T16:16:25","slug":"the-cost-of-infrastructure-automation","status":"publish","type":"post","link":"https:\/\/www.hostinger.com\/blog\/the-cost-of-infrastructure-automation","title":{"rendered":"The Cost of Infrastructure Automation"},"content":{"rendered":"<p>This year we have started using automation for everything as much as we could, including network automation, server provisioning, and application deployments.<\/p><p>Every part has its own limitations and requirements. For network automation, we use <a href=\"https:\/\/www.ansible.com\/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">Ansible<\/a> because it has a flexible template mechanism, many third party modules, is fast, and really fits our needs. Servers are provisioned with Ansible too, but configurations are pushed using our first-class citizen <a href=\"https:\/\/www.chef.io\/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">Chef<\/a> while some parts are handled with <a href=\"https:\/\/www.consul.io\/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">Consul<\/a> plus <a href=\"https:\/\/github.com\/hashicorp\/consul-template\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">consul-template<\/a>.<\/p><p>Why not run a single tool for everything? 
Because every tool has its own limitations:<\/p><ul>\n<li><strong>Ansible<\/strong> doesn&rsquo;t have a mechanism to run itself periodically and is not as flexible as Chef<\/li>\n<li><strong>Chef<\/strong> doesn&rsquo;t have an orchestration layer<\/li>\n<li><strong>Consul<\/strong>&hellip; OK, but who would be responsible for bootstrapping Consul clusters and other dependencies around it?<\/li>\n<\/ul><p>We make changes on <strong>GitHub<\/strong> by creating a pull request, and <a href=\"https:\/\/jenkins.io\/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">Jenkins<\/a> pulls these changes. Before GitHub introduced <a href=\"https:\/\/github.com\/features\/code-review\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">Code better with Reviews<\/a>, we used :+1: or :-1: comments to approve or reject changes. Thanks to this new feature, we can do it more transparently.<\/p><p>Our most actively maintained Chef repository has the <code>Require pull request reviews before merging<\/code> protection enabled, which means code review is mandatory. After pulling changes from GitHub, Jenkins starts building and does the rest. Some jobs use multiple Jenkins slaves to execute specific builds, such as building <a href=\"https:\/\/linuxcontainers.org\/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">LXC<\/a> containers or <a href=\"https:\/\/www.docker.com\/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">Docker<\/a> containers, which require different Linux distributions.<\/p><p>As explained above, we use Jenkins for every kind of automation. 
In most cases, we have two or three different builds per GitHub repository:<\/p><ul>\n<li><strong>ansible-network<\/strong> &ndash; apply changes from the master branch in production<\/li>\n<li><strong>ansible-network-pr<\/strong> &ndash; bootstrap a development environment and apply changes from the pull request<\/li>\n<li><strong>ansible-repo<\/strong> &ndash; check the syntax of every playbook<\/li>\n<\/ul><p>Automation is the way we work at Hostinger. We don&rsquo;t SSH into servers to make changes. We make changes locally on our laptops first, then push the code to <a href=\"https:\/\/github.com\/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">GitHub<\/a> as a pull request. SSH access to a server is necessary only for ad-hoc debugging <span style=\"font-weight: 400\">&ndash; <\/span>everything else is dynamically adjusted by Chef, so there is no other reason to log in. Most of our user-facing servers with the same role are identical, so we simply pick one to verify the configuration quickly.<\/p><p>Before any new feature is deployed into our current stack, we have to think about automation <strong>first<\/strong>. We have two environments <span style=\"font-weight: 400\">&ndash; <\/span> development and production. Changes go to the development environment first, where we have more or less identical (virtual) infrastructure and changes are visible quickly after a merge. When everything is in order in the development environment, we are free to use the same versions of cookbooks in the production environment. We just open another pull request with the bumped versions.<\/p><p>When we provision a new server, it is automatically detected by our monitoring platform <a href=\"https:\/\/prometheus.io\/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">Prometheus<\/a>. Clusters are reconfigured according to appropriate cluster sizes and so on, so nothing else has to be done manually. 
If someone compromises or alters something in the configuration files, everything is reverted automatically by <code>chef-client<\/code>, which runs every 7 minutes in the background. However, some services need to react to changes faster than that, and this is where consul-template helps. We run a Consul cluster per region, and consul-template runs as a client where needed. For example, we use consul-template for regenerating upstreams for <a href=\"http:\/\/openresty.org\/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">OpenResty<\/a>, as this requires nearly real-time operation.<\/p><p>Network automation is done using Ansible as the primary tool. We use the <a href=\"http:\/\/cumulusnetworks.com\/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">Cumulus<\/a> network operating system, which allows us to have a fully automated network that we can reconfigure, including BGP neighbors, firewall rules, ports, bridges, and more. Nothing is changed directly inside the switch. Cumulus has a virtual instance called Cumulus VX, which allows us to converge all changes before pushing them to production. Jenkins builds Cumulus VX instances locally, applies the Ansible playbook, and runs the tests. If everything is in order, we are happy, too.<\/p><p>For instance, when we add a new node, Ansible automatically sees the change in the Chef inventory by looking at LLDP attributes and regenerates the network configuration for that particular switch. If we want to add a new BGP upstream or firewall rule, we simply create a pull request on our GitHub repo and everything is done automatically, including checking syntax and deploying changes to production. 
You can find more information about our network stack in the previous blog <a href=\"https:\/\/www.hostinger.com\/blog\/engineering\/awex-ipv6\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">post<\/a>.<\/p><h2 id=\"othersmallyetnicetohaveautomation\">Other Automation Processes<\/h2><p>We internally use <a href=\"https:\/\/slack.com\/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">Slack<\/a>, and it makes sense to do as much work as possible via chat. We have automated Jenkins builds and GitHub hooks; for example, <span style=\"font-weight: 400\">when a <\/span>new issue or pull request is created, a notification is sent to the Slack channel. We can also start a Jenkins build directly from the channel by typing <code>ada j b 22<\/code>, put a website into a sleeping state via <code>ada sleep &lt;url&gt;<\/code>, and so forth.<\/p><h2 id=\"takeaways\">Takeaways<\/h2><ul>\n<li>You can automate 80% of tasks, but it&rsquo;s not necessary to cover 100%<\/li>\n<li>You have more time to spend on other tasks instead of copy-pasting around a fleet of servers<\/li>\n<li>Knowledge sharing <span style=\"font-weight: 400\">&ndash; <\/span> all organization members are able to see, comment on, and make changes freely<\/li>\n<li>No secrets between teammates <span style=\"font-weight: 400\">&ndash; <\/span>infrastructure as code is visible to everyone<\/li>\n<li>Servers that are not under automation cost more time and money than automated ones<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>This year we have started using automation for everything as much as we could, including network automation, server provisioning, and application deployments.<\/p>\n<p>Every part has its 
o\u2026<\/p>\n","protected":false},"author":39,"featured_media":1940,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[82],"tags":[],"hashtags":[],"class_list":["post-58","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-engineering"],"hreflangs":[],"_links":{"self":[{"href":"https:\/\/www.hostinger.com\/blog\/wp-json\/wp\/v2\/posts\/58","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.hostinger.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.hostinger.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.hostinger.com\/blog\/wp-json\/wp\/v2\/users\/39"}],"replies":[{"embeddable":true,"href":"https:\/\/www.hostinger.com\/blog\/wp-json\/wp\/v2\/comments?post=58"}],"version-history":[{"count":9,"href":"https:\/\/www.hostinger.com\/blog\/wp-json\/wp\/v2\/posts\/58\/revisions"}],"predecessor-version":[{"id":3852,"href":"https:\/\/www.hostinger.com\/blog\/wp-json\/wp\/v2\/posts\/58\/revisions\/3852"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.hostinger.com\/blog\/wp-json\/wp\/v2\/media\/1940"}],"wp:attachment":[{"href":"https:\/\/www.hostinger.com\/blog\/wp-json\/wp\/v2\/media?parent=58"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.hostinger.com\/blog\/wp-json\/wp\/v2\/categories?post=58"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.hostinger.com\/blog\/wp-json\/wp\/v2\/tags?post=58"},{"taxonomy":"hashtags","embeddable":true,"href":"https:\/\/www.hostinger.com\/blog\/wp-json\/wp\/v2\/hashtags?post=58"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}