Monday September 26, 2016
The Cost of Infrastructure Automation
This year we started automating as much as we could, including network configuration, server provisioning, and application deployments.
Every part has its own limitations and requirements. For network automation, we use Ansible because it has a flexible template mechanism and many third-party modules, is fast, and fits our needs well. Servers are provisioned with Ansible too, but configurations are pushed using our first-class citizen, Chef, while some parts are handled with Consul plus consul-template.
Why not run a single tool for everything? Because every tool has its own limitations:
- Ansible doesn’t have a mechanism to run itself periodically and is not as flexible as Chef
- Chef doesn’t have an orchestration layer
- Consul… OK, but who would be responsible for bootstrapping the Consul clusters and the dependencies around them?
We make changes on Github by creating a pull request, and Jenkins pulls these changes. Before Github introduced Code better with Reviews, we used :+1: or :-1: in a comment to approve or reject changes. Thanks to this new feature, we can do it more transparently.
Our most maintained Chef repository has the "Require pull request reviews before merging" protection enabled, which means code review is mandatory. After pulling changes from Github, Jenkins starts building and does the rest. Some jobs use multiple Jenkins slaves dedicated to specific builds, such as LXC or Docker containers, which require different Linux distributions.
As explained above, we use Jenkins for every kind of automation. In most cases, we have two or three different builds per Github repository:
- ansible-network – apply changes in production from master branch
- ansible-network-pr – bootstrap development environment and apply changes from pull request
- ansible-repo – check the syntax for every playbook
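A syntax-check build like the last one could be sketched roughly as follows. This is a minimal illustration, not our actual Jenkins job: the function names and the assumption that playbooks are the top-level *.yml files of the repository are hypothetical. The only real interface used is the --syntax-check flag of ansible-playbook.

```python
import subprocess
from pathlib import Path
from typing import Iterable, List


def find_playbooks(repo_root: str) -> List[str]:
    """Assumption: every top-level *.yml file in the repository is a playbook."""
    return sorted(str(path) for path in Path(repo_root).glob("*.yml"))


def syntax_check_commands(playbooks: Iterable[str]) -> List[List[str]]:
    """Build one `ansible-playbook --syntax-check` command per playbook."""
    return [["ansible-playbook", "--syntax-check", playbook] for playbook in playbooks]


def run_checks(commands: Iterable[List[str]]) -> int:
    """Run every command and return how many of them exited with an error."""
    failures = 0
    for command in commands:
        if subprocess.run(command).returncode != 0:
            failures += 1
    return failures
```

A build step would then call run_checks(syntax_check_commands(find_playbooks("."))) and fail the job if the result is not zero.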
Automation is the way we work at Hostinger. We don’t SSH into servers to make changes. We make changes locally on personal laptops first, then push the code to Github as a pull request. SSH access to a server is necessary only for ad-hoc debugging; everything else is dynamically adjusted by Chef, so there is no point in logging in. Most of our user-facing servers are identical (depending on the role), so we simply choose one to verify the configuration quickly.
Before any new feature is deployed into our current stack, we have to think about automation first. We have two environments: development and production. Changes go to the development environment first, where we have more or less identical (virtual) infrastructure and changes are visible quickly after a merge. When everything is in order in the development environment, we are free to use the same cookbook versions in the production environment. We just open another pull request with the increased versions.
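The promotion step can be pictured as a comparison of version pins between the two environments. The sketch below is only an illustration under assumed data: the cookbook names, the version numbers, and the idea of holding pins in plain dictionaries are hypothetical, not a description of how our repository stores them.

```python
from typing import Dict


def versions_to_promote(development: Dict[str, str], production: Dict[str, str]) -> Dict[str, str]:
    """Return the cookbook pins that differ in production from the ones verified in development."""
    return {
        cookbook: version
        for cookbook, version in development.items()
        if production.get(cookbook) != version
    }


# Hypothetical pins, for illustration only.
development_pins = {"nginx": "1.2.0", "php": "3.0.1"}
production_pins = {"nginx": "1.1.0", "php": "3.0.1"}
pending = versions_to_promote(development_pins, production_pins)
```

The resulting dictionary is what the follow-up pull request would have to change in the production environment.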
When we provision a new server, it is automatically detected by our monitoring platform, Prometheus. Clusters are reconfigured according to decent cluster sizes and so on, so nothing else has to be done manually. If someone compromises or alters something in the configuration files, everything is reverted automatically by chef-client, which runs every 7 minutes in the background. However, some services need to respond faster than every 7 minutes, so consul-template helps here. We have Consul clusters per region, and consul-template runs as a client where needed. For example, we use consul-template to regenerate upstreams for Openresty, as this requires nearly real-time operation.
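To make the upstream regeneration concrete, here is a rough Python sketch of the rendering step alone. It is not consul-template syntax and not our template: the function name, the upstream name, and the addresses are made up for illustration. It only shows the idea of turning the current list of healthy backends into an nginx-style upstream block.

```python
from typing import Iterable, Tuple


def render_upstream(name: str, backends: Iterable[Tuple[str, int]]) -> str:
    """Render an nginx-style upstream block from (address, port) pairs."""
    servers = sorted(backends)
    if not servers:
        # An upstream block without servers is not a valid nginx configuration.
        raise ValueError("at least one backend is required")
    lines = [f"upstream {name} {{"]
    for address, port in servers:
        lines.append(f"    server {address}:{port};")
    lines.append("}")
    return "\n".join(lines) + "\n"
```

In the real setup, consul-template watches the service catalog and rewrites the file whenever the set of backends changes, which is what makes the reaction nearly real-time.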
Network automation is done using Ansible as the primary tool. We use the Cumulus network operating system, which allows us to have a fully automated network that we can reconfigure, including BGP neighbors, firewall rules, ports, bridges, and more. Nothing is changed directly on the switch. Cumulus has a virtual instance called Cumulus VX, which allows us to converge all changes before pushing them to production. Jenkins brings up Cumulus VX locally, applies the Ansible playbook, and runs tests. If everything is in order, then we are happy, too.
For instance, when we add a new node, Ansible automatically sees the change in the Chef inventory by looking at LLDP attributes and regenerates the network configuration for that particular switch. If we want to add a new BGP upstream or firewall rule, we simply create a pull request on our Github repo and everything is done automatically, including checking syntax and deploying changes to production. You can find more information about our network stack in the previous blog post.
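The LLDP-driven part can be pictured as grouping server-facing ports by switch. The sketch below assumes a simplified inventory shape (a node name plus a list of LLDP neighbors with "switch" and "port" keys). That shape and all names in it are hypothetical and are not the real Chef attribute layout.

```python
from collections import defaultdict
from typing import Dict, List


def ports_by_switch(nodes: List[dict]) -> Dict[str, List[dict]]:
    """Group server-facing ports by switch, based on each node's LLDP neighbors."""
    grouped = defaultdict(list)
    for node in nodes:
        for neighbor in node.get("lldp", []):
            grouped[neighbor["switch"]].append(
                {"port": neighbor["port"], "server": node["name"]}
            )
    # Sort ports so the generated configuration is stable between runs.
    return {
        switch: sorted(ports, key=lambda entry: entry["port"])
        for switch, ports in grouped.items()
    }


# Hypothetical inventory, for illustration only.
inventory = [
    {"name": "srv2", "lldp": [{"switch": "leaf1", "port": "swp2"}]},
    {"name": "srv1", "lldp": [{"switch": "leaf1", "port": "swp1"}]},
]
switch_ports = ports_by_switch(inventory)
```

A template for a given switch would then only need the entry for that switch, which is why adding a node touches the configuration of one particular switch and nothing else.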
Other Automation Processes
We use Slack internally, and it makes sense to do as much work as possible via chat. We have automated Jenkins builds and Github hooks; for example, when a new issue or pull request is created, a notification is sent to the Slack channel. We can also start a Jenkins build directly from the channel by typing "ada j b 22", put a website into a sleeping state via "ada sleep <url>", and so forth.
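A chat bot of this kind needs little more than a small command parser. The following sketch is not the code behind ada. It assumes that "j b" followed by one argument starts a Jenkins build and that "sleep" takes a single URL, as in the two commands above; the action names it returns are invented for illustration.

```python
from typing import Optional, Tuple


def parse_command(message: str) -> Optional[Tuple[str, str]]:
    """Map a chat message to an (action, argument) pair, or None if it is not a known command."""
    words = message.split()
    if len(words) < 2 or words[0] != "ada":
        return None
    if len(words) == 4 and words[1:3] == ["j", "b"]:
        # "ada j b <argument>": start a Jenkins build.
        return ("jenkins_build", words[3])
    if len(words) == 3 and words[1] == "sleep":
        # "ada sleep <url>": put a website into a sleeping state.
        return ("sleep", words[2])
    return None
```

Anything the parser does not recognize is ignored, so ordinary conversation in the channel never triggers a build.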
- You can automate 80% of tasks, but it’s not necessary to cover 100%
- You have more time to spend on other tasks instead of copy-pasting around a fleet of servers
- Knowledge sharing – all organization members are able to see, comment, and make changes freely
- No secrets between teammates – infrastructure as code is visible to everyone
- Servers that are not under automation cost more time and money than automated ones