Managing Cloud Environments with Declarative Infrastructure: Part 1

Atanas Yankov | Posted on October 27, 2021

Introduction


Infrastructure management has transformed over the last several years due to the massive migration towards cloud computing. Manually configuring servers is becoming antiquated, and administrators are shifting towards automating their day-to-day workloads.

Hyperscience recently announced that our Software-as-a-Service solution has reached general availability. To mark this milestone, we want to explore some of the technical practices that the Cloud Engineering team at Hyperscience has adopted and how they fit into the general trend. In part 1 of this blog series, we will explain what terms such as Infrastructure as Code and GitOps entail, the challenges they solve, popular tools for each, and how we fit them into our daily workflows.


Infrastructure as Code


Infrastructure as Code, also known as IaC, is a simple yet powerful concept: one’s infrastructure (whether a single virtual machine or a fleet of Kubernetes clusters) and its configuration are described in code. The code is provided as input to specialized tools that reach the desired state in an idempotent manner, meaning that applying the same configuration multiple times always produces the same result. There are two main approaches for these tools: declarative and imperative. The former centers on what should be provisioned, whereas the latter centers on how it should be achieved. Terraform and Ansible are popular examples, respectively.
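To make idempotence concrete, here is a minimal Python sketch of a declarative reconcile loop. It is not how Terraform or Ansible are implemented; the `reconcile` function and the resource names are hypothetical and exist only for illustration. The point is that the caller states what should exist, and a second run against an already converged state performs no actions.

```python
from typing import Dict, List

# A resource is modelled as a flat mapping of attribute names to values.
Resource = Dict[str, str]


def reconcile(desired: Dict[str, Resource], actual: Dict[str, Resource]) -> List[str]:
    """Mutate `actual` until it matches `desired` and return the actions taken."""
    actions: List[str] = []
    for name, spec in desired.items():
        if name not in actual:
            actual[name] = dict(spec)
            actions.append(f"create {name}")
        elif actual[name] != spec:
            actual[name] = dict(spec)
            actions.append(f"update {name}")
    # Anything that exists but is no longer declared gets removed.
    for name in list(actual):
        if name not in desired:
            del actual[name]
            actions.append(f"delete {name}")
    return actions


# Hypothetical example data.
desired = {"web": {"size": "small"}, "db": {"size": "large"}}
actual = {"web": {"size": "medium"}, "cache": {"size": "small"}}

first_run = reconcile(desired, actual)   # converges the state
second_run = reconcile(desired, actual)  # nothing left to do
print(first_run)
print(second_run)
```

The first run reports the changes needed to converge, while the second run returns an empty list because the actual state already matches the declaration.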

The two approaches also take different views on infrastructure mutability. Terraform, a tool that we use extensively here at Hyperscience, typically destroys and recreates a particular resource instead of changing or upgrading it in place. Reasoning about the infrastructure’s condition is simpler when the only two possible states are present and not present.

Traditionally, either YAML or a domain-specific language like HashiCorp’s HCL is used to describe either resources or a sequence of configuration actions. This is similar to how conventional languages, such as Python, are used to describe data and control flow. Despite being less expressive, the aforementioned languages are intended as a compromise between human and machine readability. Ultimately, simplicity is desired because it makes the codebase easier to understand and maintain.

A major benefit of IaC is that it applies the software development cycle to infrastructure management. Gone are the days when an administrator would manually connect to a server in order to run custom provisioning scripts. Now, it is common practice to have multiple people working on the same or related components, just as multiple developers work on mutually dependent features in a repository backed by a version control system (most often Git).


Dependencies


When a codebase’s size skyrockets, code reusability becomes vital to support its growth. Ansible Galaxy and the Terraform Registry allow users to consume external packages or publish their own, either publicly or privately. However, when other people’s code gets involved, versioning is crucial to achieve reproducible infrastructure that is not affected by backwards-incompatible changes from upstream. Terraform, in particular, has two moving components that should be closely monitored to maintain stability. The first is providers, which are plugins that talk to external APIs. The second is modules, which are shareable snippets of code. Pulling in a slightly different version of either could lead to a provisioning failure or, even worse, an unintended change.

Lock files are ideal for working around these issues. The idea behind them is to store the exact version of all the external code that is imported, usually using a Git hash or even a hash of the code itself. Thus, other people and CI systems will know exactly what was run and be able to recreate it. Unfortunately, Terraform currently stores only information about which providers have been used when spinning up resources. Since modules are not covered, we always pin a concrete version of their source. In our case, this is a Git tag reference.
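The lock file idea can be sketched in a few lines of Python. This is an illustration of the general technique only, not Terraform’s actual lock file format; `content_hash`, `verify`, and the module name are hypothetical. A digest of the imported code is recorded once, and anyone who later fetches the dependency can check that they received exactly the same content.

```python
import hashlib
from typing import Dict


def content_hash(files: Dict[str, bytes]) -> str:
    """Return a SHA-256 digest over file paths and contents, independent of dict order."""
    digest = hashlib.sha256()
    for path in sorted(files):
        digest.update(path.encode("utf-8"))
        digest.update(b"\0")  # separator so path and content cannot blur together
        digest.update(files[path])
        digest.update(b"\0")
    return digest.hexdigest()


def verify(lock: Dict[str, str], name: str, files: Dict[str, bytes]) -> bool:
    """Check that the fetched files match the hash recorded in the lock."""
    return lock.get(name) == content_hash(files)


# Hypothetical dependency content, recorded in the lock when first imported.
module_files = {
    "main.tf": b'resource "null_resource" "example" {}\n',
    "variables.tf": b"",
}
lock = {"example-module": content_hash(module_files)}

# Later, upstream changes the code without the consumer noticing.
tampered_files = {**module_files, "main.tf": b"changed upstream\n"}

print(verify(lock, "example-module", module_files))
print(verify(lock, "example-module", tampered_files))
```

Pinning a Git tag gives a weaker guarantee than a content hash, because a tag can be moved, but it is still far more reproducible than following a branch.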

However, this raises a different issue. How should we manage these tags?


Semantic Release


We use semantic release to automate most of the work around versioning. It relies on the conventional commits specification for commit messages and on semantic versioning for tag names. A CI job on the main branch analyzes the commits since the last release and decides whether to tag the repository with a new patch, minor, or major version. A different job is then triggered by the tagging event and takes the appropriate steps to publish the code. This may include building a Docker image and pushing it to a private registry or publishing a Python wheel to GitLab’s package registry.
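The decision that the CI job makes can be approximated as follows. This is a simplified Python sketch under stated assumptions, not the semantic release tool itself: it only recognizes `fix` (patch), `feat` (minor), and breaking changes marked with `!` or a `BREAKING CHANGE:` footer (major). The names `bump_for` and `next_version` are hypothetical.

```python
import re
from typing import Iterable, Optional

# Conventional commit header: type, optional (scope), optional "!", then ": description".
HEADER = re.compile(r"^(?P<type>[a-z]+)(\([^)]*\))?(?P<breaking>!)?: .+")

# Rank the possible bumps so the most significant one wins.
ORDER = {None: 0, "patch": 1, "minor": 2, "major": 3}


def bump_for(message: str) -> Optional[str]:
    """Map a single commit message to 'major', 'minor', 'patch', or None."""
    header = message.splitlines()[0] if message else ""
    match = HEADER.match(header)
    if match is None:
        return None
    if match.group("breaking") or "BREAKING CHANGE:" in message:
        return "major"
    if match.group("type") == "feat":
        return "minor"
    if match.group("type") == "fix":
        return "patch"
    return None


def next_version(current: str, messages: Iterable[str]) -> str:
    """Return the next semantic version given the commits since `current`."""
    major, minor, patch = (int(part) for part in current.split("."))
    bump = max((bump_for(m) for m in messages), key=lambda b: ORDER[b], default=None)
    if bump == "major":
        return f"{major + 1}.0.0"
    if bump == "minor":
        return f"{major}.{minor + 1}.0"
    if bump == "patch":
        return f"{major}.{minor}.{patch + 1}"
    return current  # nothing release-worthy, so no new tag


print(next_version("1.4.2", ["feat(vpc): add flow logs", "fix: correct typo"]))
```

Commits that do not trigger a release, such as documentation changes, leave the version untouched, which is why disciplined commit messages matter for this workflow.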

Overall, this process is extremely valuable to us for two main reasons. First, it forces developers to work on self-contained features because every merge request (pull request, if you come from GitHub) should be ready for tagging when it reaches the main branch. Hence, it can be considered stable. Second, releasing often makes upgrading easier. The automatically generated changelogs are small between versions and users get access to new features faster. Of course, that assumes they do not postpone upgrades. Going through three months’ worth of updates at once, for example, will be just as difficult as it would be without frequent releases. This goes to show that both the maintainers and their users must be aligned on the strategy.

Now that we have all of our immutable infrastructure configuration in frequently versioned Git repositories, where do we go from here?


GitOps


GitOps, much like IaC, is a set of techniques that aim to improve the infrastructure management workflow. It builds on top of the principles described in the previous section and suggests that the single source of truth for declarative infrastructure state should be the Git repository. In addition, it is typically associated with managing Kubernetes manifests. You can read more about the reasoning behind GitOps from the coiners of the term here.

Adopting this straightforward idea immediately brings us a number of advantages:

  • Git excels at preserving history. Thus, past changes are auditable and a rollback to the last known good state can happen immediately in case of emergency.
  • By using a third party like GitLab, a strict review and approval process can be enforced, which is important for our change management procedures.
  • CI is responsible for validating changes before they are made available for review and CD is responsible for applying them to the outside world after merge. The chance of an error is significantly lower because no manual action is required.

A fitting example would be our Terraform pipeline. Within GitLab, we have forbidden direct pushes to the main branch and every change has to go through the merge request (MR) process. Once an MR is open, jobs kick off to validate and plan its content. Both requesters and reviewers can view the generated plan, which is even integrated into GitLab’s UI. Only after approval and merge is the apply job executed.

To build more confidence that the actual state matches our expectations, we also run a custom Terraform drift detector, which we discussed in a previous blog post. It runs on a schedule to make sure that resources that have not been touched in a while still match what is declared in code.
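The core of drift detection is a comparison between what is declared and what actually exists. The Python sketch below illustrates that comparison only; it is not our detector, and `detect_drift` together with the resource identifiers and attributes are hypothetical. A real implementation would obtain both sides from Terraform and the cloud provider rather than from in-memory dictionaries.

```python
from typing import Dict, List

Attributes = Dict[str, str]


def detect_drift(declared: Dict[str, Attributes], observed: Dict[str, Attributes]) -> List[str]:
    """Compare declared and observed resources and describe every difference."""
    findings: List[str] = []
    for resource_id in sorted(set(declared) | set(observed)):
        if resource_id not in observed:
            findings.append(f"{resource_id}: declared but missing")
        elif resource_id not in declared:
            findings.append(f"{resource_id}: exists but is not declared")
        else:
            want, have = declared[resource_id], observed[resource_id]
            # Report attribute-level differences for resources present on both sides.
            for key in sorted(set(want) | set(have)):
                if want.get(key) != have.get(key):
                    findings.append(
                        f"{resource_id}.{key}: expected {want.get(key)!r}, found {have.get(key)!r}"
                    )
    return findings


# Hypothetical example data.
declared = {
    "role/ci": {"policy": "read-only"},
    "bucket/logs": {"versioning": "enabled"},
}
observed = {
    "role/ci": {"policy": "admin"},
    "instance/tmp": {"type": "t3.micro"},
}

for finding in detect_drift(declared, observed):
    print(finding)
```

An empty result means no drift was found; anything else is a signal that someone or something changed the infrastructure outside of the Git workflow.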

What if we want to add a bit more complexity to the system? Our objective is to provision numerous Kubernetes clusters, accompanying AWS resources, and then the YAML manifests within them. Some applications should be present in all clusters, like our Node exporter monitoring and Promtail logging daemon sets or AWS’ ELB ingress controller, to name a few. And, of course, the Hyperscience platform!

Terraform is invaluable when it comes to static resources such as AWS IAM roles or EC2 on-demand instances, but managing more dynamic environments, like Kubernetes, requires a bit more effort. Stay tuned for Part 2 of this series, where we’ll introduce you to GitOps-specific tooling in the form of ArgoCD and show how we build on top of the foundations introduced in this post.

Atanas is a DevOps Engineer located in our Sofia office. Connect with him on LinkedIn.
