Managing Cloud Environments with Declarative Infrastructure: Part 2

Atanas Yankov | Posted on December 21, 2021

In part 1 of the series, we discussed the foundations of GitOps and Infrastructure as Code (IaC) and how we utilize them at Hyperscience. Now, let’s dive into the technical details. As a refresher, our goal is to spin up multiple AWS EKS clusters and their accompanying resources, and to deploy the YAML manifests within them. These manifests include our observability and networking essentials, as well as the Hyperscience Platform.

However, in order to have visibility over the whole lifecycle of our EKS cluster, another requirement emerges. The new cluster’s monitoring stack should be immediately deployed once the underlying infrastructure has been successfully created by Terraform. It consists of Prometheus for metrics, Promtail as a log shipper, and event-exporter which enables us to add custom alerts around Kubernetes events. These vital elements give us confidence that the nodes and the soon-to-be deployed applications on top of them are healthy and working as expected. 

Since we use Terraform for all of our cloud infrastructure anyway, should we do the same for all of our Kubernetes resources, as well? Let’s find out.

Implications of Kubernetes Management with Terraform

After an initial period of discovery, we concluded that we wouldn’t be able to utilize Terraform to the fullest extent if we used it to manage Kubernetes manifests. There were a few points of friction that we found along the way.

First, Kubernetes already has its own backing store for all cluster data and state in the form of etcd. Terraform’s state files and providers would add an additional layer of abstraction that could become a source of inconsistencies with the actual cluster state – even more so if Helm, a package manager for Kubernetes manifests, is added to the mix. Second, having too many Terraform-managed resources slows down day-to-day operations. If we wanted to bump only a single Helm chart’s version, for example, we would still need to wait for a refresh of every item within the plan graph. Even if we logically separate different system components into appropriate file system directories, running `terraform plan` or `apply` still takes too much time. Also, in the common scenario where one wants to propagate a small change across multiple, if not all, clusters, going through the regular Terraform pipeline would not be feasible.

Those were some of the reasons why we decided to split the responsibility and look for a Kubernetes-native solution to run that piece of the puzzle. Luckily, these areas are exactly where GitOps-specific tools shine.

Argo CD

Argo CD is a declarative continuous delivery tool that targets Kubernetes. It enforces the desired state from Git repositories onto target clusters in an event-based manner. It also periodically checks for drift out of the box – a process called self-healing. This combination not only prevents manual edit mistakes, but also further mandates the Git workflow, because the only way to push persistent changes to the system without them getting overwritten is through the source repository. The directory structure is not strictly defined and can contain a mix of Helm charts, Kustomize or plain YAML files, and even custom templating tools, such as cdk8s.

A typical Argo CD workflow is very similar to what we already had in our IaC pipeline. To start, the user opens, gets approval on, and merges an MR against a particular branch tracked by Argo CD. Usually, this is the `main` branch, but any branching strategy can be used. The merge event then triggers synchronization of the resources from the Git repository to the Kubernetes cluster, in a similar fashion to a regular `kubectl apply`. A slick user interface is also available to visualize the deployment process, help troubleshoot issues, and provide a rollback mechanism. However, before we could utilize the interface, we first had to make Terraform enable Argo CD and let it take control of the application layer, effectively separating it from the infrastructure layer.

Naturally, Argo CD needs to be provided with credentials and the location of a cluster’s Kubernetes API endpoint before it can create or delete objects within it. This operation is called registration. To bridge the gap between the two systems, we created a lightweight Python script that handles the machinery of registering a cluster. The script also sets the necessary labels that describe it, which will come into use in the next section. We then wrapped the code into a Terraform null_resource with a local-exec provisioner, allowing us to directly invoke it during execution. This way, we received the benefits of Terraform’s state management for free, ensuring that the script is run only once, while avoiding the use of additional external providers. We also have full control over the null_resource’s lifecycle and can implement the opposite actions when it is destroyed.
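Under the hood, declaratively registering a cluster with Argo CD boils down to creating a Kubernetes Secret labelled `argocd.argoproj.io/secret-type: cluster` in the `argocd` namespace. The sketch below approximates what a registration script like ours might build; the function name and defaults are our illustration, not the actual script:

```python
import json

# Hedged sketch: build the declarative cluster Secret that Argo CD reads to
# register an external cluster. Function name and defaults are illustrative.
def build_cluster_secret(name, server, bearer_token, ca_data, labels=None):
    """Return a dict following Argo CD's cluster Secret convention."""
    metadata_labels = {"argocd.argoproj.io/secret-type": "cluster"}
    # Extra labels (e.g. logging: "on") drive ApplicationSet cluster generators.
    metadata_labels.update(labels or {})
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {
            "name": f"cluster-{name}",
            "namespace": "argocd",
            "labels": metadata_labels,
        },
        "stringData": {
            "name": name,
            "server": server,
            # Argo CD expects the connection config as a JSON string.
            "config": json.dumps({
                "bearerToken": bearer_token,
                "tlsClientConfig": {"caData": ca_data},
            }),
        },
    }
```

Serializing this dict to YAML and applying it to the cluster running Argo CD is enough for the application controller to start targeting the new cluster.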

It’s worth noting that this approach is typically considered a double-edged sword: it provides a significant amount of freedom at the cost of complexity. Because of the side effects of the commands they run, null_resources can be hard to recover from when bugs occur if they are not used carefully. Nevertheless, we have appreciated the flexibility of this resource type on a number of occasions. Ultimately, we will revisit this decision if we find that the maintenance costs outweigh the benefits.
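Despite those caveats, the wiring itself stays small. A hedged sketch of the null_resource wrapper described earlier might look like this – the script names, variables, and module outputs are hypothetical stand-ins for our actual setup:

```hcl
# Sketch only: script paths, variables, and module outputs are illustrative.
resource "null_resource" "argocd_cluster_registration" {
  # Stored in state so the script runs once per cluster; re-runs on endpoint change.
  triggers = {
    cluster_name     = var.cluster_name
    cluster_endpoint = module.eks.cluster_endpoint
  }

  # Register the cluster (and its labels) with Argo CD on create.
  provisioner "local-exec" {
    command = "python3 register_cluster.py --name ${var.cluster_name} --endpoint ${module.eks.cluster_endpoint}"
  }

  # The opposite action on destroy: deregister the cluster from Argo CD.
  # Destroy-time provisioners may only reference self, hence the triggers lookup.
  provisioner "local-exec" {
    when    = destroy
    command = "python3 deregister_cluster.py --name ${self.triggers.cluster_name}"
  }
}
```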

Applications

Now let’s take a look at how Argo CD decides what resources to deploy and where, once it can connect to an external cluster’s Kubernetes API. The most basic building block is the Application custom resource definition (CRD). It contains information such as the address of a Git repository and the directory within it to look for Kubernetes manifests, as well as the name or location of a target cluster. The application controller, a core component of the Argo CD installation, picks up the new Application object and starts applying the provided manifests.

Below is an example of an Application object. It will install our customized Promtail Helm chart in the `example-eks` cluster. A Helm parameter override is used to attach the cluster name as a label to each log message Promtail ships.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: promtail
spec:
  project: framework
  source:
    repoURL: git@${OUR_GITLAB_REPOSITORY}
    targetRevision: HEAD
    path: observability/promtail
    helm:
      valueFiles:
        - values.yaml
      parameters:
        - name: promtail.cluster_name
          value: example-eks
  destination:
    name: example-eks
    namespace: framework
```

An Application object can be created in a number of different ways. One option is to go through the Argo CD web interface. Another is to use the `argocd` command line tool. However, since we want to automate as much as possible and let Argo CD manage itself, the Application can also be defined declaratively – directly applying the simple YAML file above with `kubectl` works just fine.

However, this method of definition does not scale as the number of Applications grows. It requires either lots of copy-pasting inside our Git repo or yet more custom tooling for YAML generation, neither of which is desirable. Fortunately, the Argo developers have recently released a tool that solves exactly this problem.

ApplicationSets

The new application set controller goes one step further and abstracts away the creation of Application CRDs. It works with a new CRD, called ApplicationSet. It has two parts – a generator and a template. For every item that the generator returns, a new Application will be dynamically created, based on the template’s values. There are different types of generators. Some look for directories within a Git repository and some loop through a static, predefined list. We use a third type, the cluster generator. It takes as input a label selector and emits a subset of all registered clusters, containing only the ones that match the desired key-value pairs. For instance, the following ApplicationSet will run our Promtail Helm chart on every cluster with the `logging` label equal to `on`:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: promtail
spec:
  generators:
    - clusters:
        selector:
          matchLabels:
            logging: "on"  # quoted, so YAML does not coerce it to a boolean
  template:
    metadata:
      name: "{{name}}-promtail"
    spec:
      project: framework
      source:
        repoURL: git@${OUR_GITLAB_REPOSITORY}
        targetRevision: HEAD
        path: observability/promtail
        helm:
          valueFiles:
            - values.yaml
          parameters:
            - name: promtail.cluster_name
              value: "{{name}}"
      destination:
        name: "{{name}}"
        namespace: framework
```

Since the controller periodically re-evaluates the generator, newly registered clusters are picked up almost immediately and have Applications created for them. This essentially allows us to achieve our goal of deploying manifests as soon as a cluster becomes available.
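The selection logic the cluster generator applies can be approximated in a few lines. This is a simplified model for illustration only – the real controller handles match expressions, value interpolation, and much more:

```python
# Simplified model of the ApplicationSet cluster generator: filter registered
# clusters by matchLabels, then render one Application name per match.
def select_clusters(clusters, match_labels):
    """Keep clusters whose labels contain every selector key/value pair."""
    return [
        c for c in clusters
        if all(c.get("labels", {}).get(k) == v for k, v in match_labels.items())
    ]

def render_app_names(clusters, template="{name}-promtail"):
    """Mimic the template's "{{name}}-promtail" naming for each matched cluster."""
    return [template.format(name=c["name"]) for c in clusters]

# Example: only the cluster labelled logging: "on" gets a Promtail Application.
registered = [
    {"name": "example-eks", "labels": {"logging": "on"}},
    {"name": "sandbox-eks", "labels": {"logging": "off"}},
]
matched = select_clusters(registered, {"logging": "on"})
```

Here `render_app_names(matched)` would yield `["example-eks-promtail"]`, mirroring how the template stamps out one Application per generator item.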

Note that even the Helm release values can be customized with data produced by the ApplicationSet generator. The cluster’s `{{name}}` is used in this case, but it could also be a label value, depending on how you configure the ApplicationSet. Thus, regardless of the labelling strategy you are using, Argo CD is flexible enough to make it work for your specific use case.

Also, as previously mentioned, we want to be able to easily push changes across our whole EKS fleet. With this approach, changing the Loki endpoint to which Promtail pods send logs, for example, can be achieved by a single-line modification in the `observability/promtail/values.yaml` file. Once the commit is pushed to the `main` branch, the application controller will notice that all Promtail Applications are out of sync, because they are generated from the same template and reference the same file within the repository. To fix the drift, the new manifest will be applied to every cluster, and each Promtail deployment will gradually replace its pods to use the new configuration. Both Argo CD’s web interface and our Prometheus alerting rules are responsible for letting us know if something has gone wrong, so that we can revert any faulty commits in a timely manner. Of course, such changes can be tested before rolling them out everywhere by first selecting only targets with a dedicated `upgrade_testing` label, to make sure the application behaves as expected during and after the upgrade.
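To make the single-line change concrete, the values file might contain an entry along these lines – the key names below are hypothetical, since our customized chart defines its own schema:

```yaml
# observability/promtail/values.yaml (key names are illustrative)
promtail:
  loki_endpoint: https://loki.observability.internal/loki/api/v1/push
```

Editing just that endpoint and merging to `main` is enough for every generated Promtail Application to converge on the new configuration.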

Ultimately, we are happy with the decision to offload the Kubernetes work to Argo CD from Terraform. It was the right choice for our use case because we are now more flexible and can propagate changes faster, compared to previous iterations. Hopefully, this trend will continue as we discover better ways to automate infrastructure provisioning.

Conclusion

In this two-part series, we examined what Infrastructure as Code and GitOps practices are, why they can be beneficial and how we’ve adopted them for building and managing the Hyperscience cloud offering. In addition, we’ve explored some of the tools and technical details around our Gitlab, Terraform and Kubernetes pipelines, which enable us to have a rigorous change management procedure in place.

If you’re interested in researching similar approaches, you can check our open Engineering positions here.

A big thank you to Steve Sklar for the support, guidance and proofreading throughout the writing of this blog post!

Atanas is a DevOps Engineer located in our Sofia office. Connect with him on LinkedIn.
