From a Few Hours to a Reasonable Convergence Time

Last week we tackled a huge upgrade: we moved our current Hostinger infrastructure to full automation. What it looked like before, I will leave to your imagination.

After a few days of discussion about which tools to use, we decided to continue with Ansible. Why Ansible? Because it was the simplest solution given the short time frame and the few engineers available. Almost everyone had at least some experience with it, so we moved on. The migration took only a few sprints, which was our deadline. A goal without a deadline is merely a dream, so we did it.

In short, we converted our old infrastructure (custom compiled PHP, Nginx, Apache, etc.) to a more sustainable approach (CloudLinux CageFS, PHP selector, OpenResty and so on). Since the upgrade we have been able to do proactive monitoring, which we didn’t have before and which allows us to spot the biggest problems sooner.

The early days were hard. Ansible was better than horrible, but it is really not as flexible as Chef, which we already use for more of our internal infrastructure and projects. We got stuck on almost every second task and didn’t know how best to work around the problems, but it’s good enough for now.

As we wrote in previous posts, we use Jenkins for all sorts of deployments, and this one was no exception. When we tested Ansible convergence in a development environment, everything was smooth and fast: deploys took 3-5 minutes on average. Unfortunately, production deployments took hours, which was a huge pain.

Performance optimizations for Ansible

Pipelining + ControlPersist

This was the first thing to enable. Once it was turned on, convergence time improved roughly 2x: we went from more than 3 hours (3 hr 28 min) to under 2 hours (1 hr 48 min) per build. That was good as a first optimization, but there was still room for improvement. We could cut the time further by running a single Ansible play, but as a pre-flight check we deploy changes to the development environment first and continue with production only if that succeeds.
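As a rough illustration, both settings can be expressed as inventory variables. This is a minimal sketch, not our exact configuration: the file location and the 60-second lifetime are assumptions, and the same options are commonly set in the [ssh_connection] section of ansible.cfg instead.

```yaml
# group_vars/all.yml (hypothetical location; any inventory variable scope works)

# Pipelining: run modules over the already-open SSH session instead of
# copying them to the remote host first. Requires that sudo on the
# targets does not enforce "requiretty".
ansible_ssh_pipelining: true

# ControlPersist: keep one SSH connection per host open and reuse it
# across tasks. The 60s lifetime is an illustrative value.
ansible_ssh_common_args: "-o ControlMaster=auto -o ControlPersist=60s"
```

Pipelining removes per-task file transfers, while ControlPersist removes per-task SSH handshakes, so the two complement each other.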

Make sure to minimize ‘changed’ tasks

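Before making any improvements, I suggest turning on profiling for Ansible tasks, for example with the profile_tasks callback plugin. The per-task timings show you the most critical paths, that is, the tasks that dominate the run.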

Avoid state=latest unless you rely on Ansible to keep packages up to date. With state=present the task returns almost immediately instead of looking for newer versions.
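A minimal sketch of such a task, assuming the yum module and a hypothetical package name:

```yaml
- name: Ensure nginx is installed (hypothetical package name)
  yum:
    name: nginx
    # "present" is satisfied by any installed version;
    # "latest" would also look for newer versions on every run.
    state: present
```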

If you are using the yum module with an external source, such as a package given by URL, make sure to check whether the repository file already exists, because otherwise yum downloads this file every time.
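One way to express that check is a stat task followed by a conditional install. This is a sketch under assumptions: the repository path, the URL and the registered variable name are all hypothetical.

```yaml
- name: Check whether the repository file is already in place
  stat:
    path: /etc/yum.repos.d/example.repo   # hypothetical path
  register: example_repo

- name: Install the repository package only when the file is missing
  yum:
    name: https://repo.example.com/example-release.rpm   # hypothetical URL
    state: present
  when: not example_repo.stat.exists
```

The stat call is local to the target host and cheap, so on repeated runs the download is skipped entirely.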

Another trick: we disabled update for the git module so that it does not fetch the latest version on every run, and we added the creates option to almost every unarchive, uri and shell task, and so on.
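The tasks below sketch what this can look like. All repositories, paths and file names are hypothetical, and the shell example assumes the script itself leaves the marker file behind.

```yaml
- name: Clone the application once, without refreshing it on later runs
  git:
    repo: https://git.example.com/app.git   # hypothetical repository
    dest: /opt/app
    update: no

- name: Unpack an archive only if its target is not there yet
  unarchive:
    src: files/tool.tar.gz                  # hypothetical archive
    dest: /opt
    creates: /opt/tool

- name: Run a setup script only once
  shell: /opt/tool/setup.sh                 # hypothetical script
  args:
    # Skipped when this file exists; the script is assumed to create it.
    creates: /opt/tool/.setup_done
```

With creates in place, a task whose result already exists is skipped instead of being executed and reported as changed.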

After these changes, our build time dropped to 1 hr 0 min.


Wrapping up

  • Always use profiling tools if performance matters;
  • Agentless automation tools are, by their nature, slower than agent-based ones.
