Running Scheduled Jobs in AWS Using Terraform

Steve Sklar | Posted on June 29, 2021

Running scheduled tasks (cron jobs) is a critical component of almost every computing system. Linux’s cron utility is a well-known tool for this problem. It has its limitations, however, especially in distributed cloud computing environments, where the current trend is toward more ephemeral infrastructure.

This post will describe how the Hyperscience CloudOps team runs scheduled tasks in AWS. I will first outline a general set of requirements for our workloads and describe which AWS technologies we chose to run them. Next, I will introduce an example of a job to schedule and use that as context while I provide a walkthrough of a Terraform module that provisions the infrastructure required to run the task at regular intervals. Finally, I will demonstrate how to use the module in your own infrastructure.

Requirements

There are four main requirements that we want our scheduled task to meet:

1. Reliability

We need to be confident that our job will run at its scheduled interval with minimal downtime. This includes having a reasonably fast startup time. Although we don’t need second-level (let alone millisecond-level) precision when triggering the job, we also don’t want to wait several minutes to install packages or dependencies.

2. Observability  

We need to know when our job fails and easily obtain any logs or error traces from failures.

3. Simple and safe deployment 

Our job may change in the future. We need an easy and safe way to update it without breaking anything.

4. Flexible resource constraints  

Our job may run for an indeterminate amount of time, and may consume significant amounts of system resources.  We don’t want our job to be prematurely terminated before it completes execution.

Technology Preferences

Given these requirements, we decided to use AWS Elastic Container Service (ECS) on AWS Fargate as our execution environment and AWS CloudWatch to both trigger the job and collect logs. We are orchestrating all of these resources using HashiCorp Terraform [1].

ECS running on Fargate addresses the above requirements in a variety of ways. AWS Fargate is a managed Docker runtime that runs containers on ephemeral hosts, as opposed to the traditional ECS setup (the EC2 launch type), which requires you to spin up and maintain a cluster of EC2 instances.

We are choosing Fargate over the similar AWS Lambda service because Lambda’s resource quotas can pose a problem for many use cases. For example, AWS limits a Lambda job’s runtime to a maximum of 15 minutes, which can be a dealbreaker for many long-running maintenance or service-related tasks, while Fargate has no limit on task length. At the time of writing, Lambda also has other stringent resource limits, including /tmp directory storage size (512 MB) and available memory (10,240 MB) [2]. With Fargate, you can add volume mounts to increase available disk space well into the GBs or allocate up to 30 GB of RAM to your container’s runtime [3].

We are also using Fargate to offload resource management to AWS so we don’t have to worry about the overhead of managing EC2 instances or maintaining clusters of container hosts.  We simply pass AWS a Docker image with our runtime environment and provide a configuration file that specifies resource limits, volume mounts, or an entrypoint, along with other basic container orchestration settings.  Fargate will find us a suitable node to run our job and abstract away all of the infrastructure management.  If we need to change any aspect of the infrastructure that the job uses to run, we can simply modify our config file and let AWS manage the underlying resources to meet our demands. While there are certainly tradeoffs to this approach, there are a multitude of benefits to leveraging Fargate for small cron jobs.

AWS CloudWatch is used to trigger our scheduled task.  Again, we are offloading the responsibility of maintaining any infrastructure to AWS’ fleet, since their managed infrastructure is triggering the job.  There is an important tradeoff here between precision and reliability.  In order to make our job highly available, AWS sacrifices second-level precision when executing the task and only guarantees that a job will be triggered within the minute of its scheduled time.  For our purposes, this limitation is acceptable since jobs might run for tens of minutes and we aren’t triggering them more than once or twice an hour.

CloudWatch is also useful for handling our observability requirements.  ECS makes it easy to set up CloudWatch log groups that capture execution output of your containers running on Fargate.  We can also use CloudWatch event rules to set up triggers that fire when a job fails unexpectedly, delivering error notifications to email, Slack, OpsGenie, PagerDuty, or other alerting frameworks.

What Job Are We Running?

At Hyperscience, we use Terraform to manage all aspects of our AWS infrastructure. We like that we can programmatically define our infrastructure and use software development best practices like unit/integration testing and source control to build safety nets and automation into our day-to-day workflows. However, there can be periods when our real-world infrastructure differs from what is defined in Terraform state [4]. For example, a user could accidentally make a change to a Terraform-managed AWS resource in the console (instead of using our automated Terraform execution pipeline). Unless we modify our code to reflect this inadvertent change, future runs of terraform apply in the same module would undo the user’s change, causing an undesirable and unexpected infrastructure change. Since our customer base includes large enterprise organizations that run critical processes through Hyperscience’s software, we need to ensure that our infrastructure code accurately reflects what’s running in AWS so we can be confident that our systems’ architecture is working as expected. The job we schedule in this post detects this kind of drift and notifies us about it.

First Steps

The first step is writing your scheduled job logic and adding it to a Docker image.  While this is out of scope of the article, I do want to point out that Dockerizing your scheduled job allows you to write the logic in whichever language you feel comfortable.  It also allows you to customize your runtime environment by pre-installing any number of dependencies and picking your OS version.  This allows your job to be portable and executable in many different environments, such as Docker Swarm, Kubernetes, Mesos, AWS ECS, or other container orchestration systems.

In this case, we built a Docker image containing the terraform executable and added a script that pulls our infrastructure repository from source control, runs terraform plan iteratively in each directory, and sends a Slack notification to a special channel when any drift is detected (by parsing the output of the terraform plan command).

Implementing the Scheduled Job Using Terraform

To provision the scheduled job and its related resources, the Hyperscience team wrote a Terraform module that defines ECS, CloudWatch, IAM, and SNS resources.  The module accepts a set of variables including the ECR repository name (where our Docker image is stored), Docker image tag, execution schedule (in the form of a cron expression), and runtime cpu/memory limitations.  By modularizing this logic, we are able to reuse it for other scheduled jobs in the future.
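As a rough illustration of the module's interface, the variables described above could be declared along the following lines. The variable names are the ones used elsewhere in this post; the types, defaults, and descriptions are assumptions made for this sketch and may differ from the actual module.

```hcl
# variables.tf (illustrative sketch; types, defaults, and descriptions are assumptions)

variable "task_name" {
  type        = string
  description = "Name of the scheduled task; also used to name related resources."
}

variable "ecr_repo_name" {
  type        = string
  description = "Name of the existing ECR repository that stores the Docker image."
}

variable "image_tag" {
  type        = string
  description = "Tag of the Docker image to run."
}

variable "cloudwatch_schedule_expression" {
  type        = string
  description = "AWS cron() or rate() expression that triggers the task."
}

variable "ecs_cluster_name" {
  type        = string
  default     = "" # an empty string means the module creates its own cluster
  description = "Name of an existing ECS cluster to run the task on."
}

variable "subnet_ids" {
  type        = list(string)
  description = "Subnets in an existing VPC in which the task runs."
}

variable "task_role_arn" {
  type        = string
  description = "ARN of an existing IAM role passed to the task at runtime."
}

variable "task_cpu" {
  type        = number
  default     = 256 # assumed default, in CPU units (1024 units = 1 vCPU)
  description = "CPU units reserved for the task."
}

variable "task_memory" {
  type        = number
  default     = 512 # assumed default, in MiB
  description = "Memory reserved for the task."
}

variable "extra_container_defs" {
  type        = any
  default     = {}
  description = "Extra container definition settings merged into the defaults."
}
```

Keeping ecs_cluster_name optional with an empty-string default matches the conditional cluster logic described in the next section.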

To run a job in ECS, you first need an ECS cluster provisioned in your AWS account. By default, the module spins up a new ECS cluster with the same name as your scheduled job. But if you set the ecs_cluster_name variable in the module declaration to something other than an empty string, your job will be scheduled on that cluster instead. This can be seen in the logic below, where we use the count keyword to create resources conditionally: if var.ecs_cluster_name is an empty string, we create a new cluster resource; if it contains a cluster name, we instead create a data object that references the cluster on which you want to run your job. Personally, I recommend that you use a central cluster for all of your organization’s scheduled jobs for ease of management.

resource "aws_ecs_cluster" "this" {
  count = var.ecs_cluster_name == "" ? 1 : 0
  name  = var.task_name
}

data "aws_ecs_cluster" "existing" {
  count        = var.ecs_cluster_name != "" ? 1 : 0
  cluster_name = var.ecs_cluster_name
}

locals {
  ecs_cluster_arn = var.ecs_cluster_name != "" ? data.aws_ecs_cluster.existing[0].arn : aws_ecs_cluster.this[0].arn
}

We also need to add an ECS Task Definition, which will specify the resources required to run a Docker container (or group of containers) as either a service or scheduled task in an ECS Cluster. Here, we will use Terraform to create an aws_ecs_task_definition resource that is set to use the Fargate launch type, the awsvpc network mode, and the cpu/memory limits specified in the module’s variables.

resource "aws_ecs_task_definition" "this" {
  family                   = var.task_name
  container_definitions    = jsonencode(local.container_definitions)
  task_role_arn            = var.task_role_arn
  execution_role_arn       = aws_iam_role.task_execution_role.arn
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = tostring(var.task_cpu)
  memory                   = tostring(var.task_memory)
}

The container_definitions argument (as seen below) is critical to configuring your task definition. This is where you will provide details about the container that your job will run in. Here, we define a single container for our task in an HCL map object that we pass to the task definition’s container_definitions argument. We also preconfigure the awslogs driver, which creates a dedicated CloudWatch log group for the task and then pipes execution logs to CloudWatch. Finally, we merge this with an extra_container_defs variable so end users can add additional task definition configuration options such as volume mounts, secrets, a startup command, and other supported Docker-related settings in a native HCL code object [5].

locals {
  container_definitions = [
    merge({
      "name" : var.task_name,
      "image" : "${data.aws_ecr_repository.existing.repository_url}:${var.image_tag}",
      # Container-level cpu is expressed in CPU units (1024 units = 1 vCPU),
      # the same unit as the task-level cpu, so it is passed through unchanged.
      "cpu" : var.task_cpu,
      "memoryReservation" : var.task_memory,
      "essential" : true,
      "logConfiguration" : {
        "logDriver" : "awslogs",
        "options" : {
          "awslogs-region" : data.aws_region.current.name,
          "awslogs-group" : var.task_name,
          "awslogs-stream-prefix" : var.task_name,
          "awslogs-create-group" : "true"
        }
      }
    }, var.extra_container_defs)
  ]
}
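To make the merge behavior concrete, here is a small standalone sketch of how an extra_container_defs value could combine with the defaults. The local names, the script path, and the environment variable choice are hypothetical and only serve the illustration; the keys follow the ECS container definition field names.

```hcl
locals {
  # Trimmed-down stand-in for the module's default container settings.
  base_def = {
    name      = "detect-terraform-drift"
    essential = true
  }

  # What a caller might pass as extra_container_defs (hypothetical values).
  extra_def = {
    command     = ["/bin/sh", "-c", "./detect-drift.sh"]
    environment = [{ name = "TF_IN_AUTOMATION", value = "true" }]
  }

  # merge() is shallow: on a key collision, the value from the later
  # argument replaces the earlier one as a whole.
  merged_def = merge(local.base_def, local.extra_def)
}

output "merged_container_definition" {
  value = jsonencode([local.merged_def])
}
```

Because merge() only looks at top-level keys, a caller who supplies their own logConfiguration would replace the preconfigured awslogs block entirely rather than extend it.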

Next, we need to create a CloudWatch event rule to trigger our cron job and link that to our ECS task definition. The trigger will be defined with a cron-like expression and passed to the module via the cloudwatch_schedule_expression variable. It is important to note that this expression needs to conform to AWS’ cron syntax [6]. This CloudWatch rule is connected to an aws_cloudwatch_event_target that points at our ECS cluster and references the task definition we created above, so that a new task (as configured by our task definition) will be launched on our ECS cluster every time the CloudWatch event rule is triggered per the schedule_expression rules. If you need help defining your cron expression, the AWS CloudWatch Console has a handy tool that displays the next several trigger times based on your schedule expression when you create a new event rule [7].

resource "aws_cloudwatch_event_rule" "event_rule" {
  name                = var.task_name
  schedule_expression = var.cloudwatch_schedule_expression
}

resource "aws_cloudwatch_event_target" "ecs_scheduled_task" {
  rule      = aws_cloudwatch_event_rule.event_rule.name
  target_id = var.task_name
  arn       = local.ecs_cluster_arn
  role_arn  = aws_iam_role.cloudwatch_role.arn

  ecs_target {
    launch_type         = "FARGATE"
    platform_version    = "LATEST"
    task_count          = 1
    task_definition_arn = aws_ecs_task_definition.this.arn

    network_configuration {
      subnets = var.subnet_ids
    }
  }
}
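For reference, here are a few schedule expressions in AWS’ syntax, written as a sketch rather than as part of the module (the local name and map keys are made up for illustration). AWS cron expressions have six fields (minutes, hours, day-of-month, month, day-of-week, year), are evaluated in UTC, and require a ? in either the day-of-month or the day-of-week field.

```hcl
locals {
  # Illustrative values for cloudwatch_schedule_expression.
  example_schedules = {
    # Minute 0 of every hour, Monday through Friday.
    hourly_on_weekdays = "cron(0 * ? * MON-FRI *)"

    # Every day at 12:00 UTC.
    daily_at_noon_utc = "cron(0 12 * * ? *)"

    # Every 15 minutes.
    every_15_minutes = "cron(0/15 * * * ? *)"

    # Fixed-rate alternative to cron().
    every_30_minutes = "rate(30 minutes)"
  }
}
```

A rate() expression is often easier to read for simple intervals, while cron() is needed when the job must run only on certain days or at specific times.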

Alerting

In our job’s script, we make a call to a Slack webhook to notify us upon any drift detection, but how can we be notified of script failures altogether? For instance, if the script exits prematurely due to a bug, we could miss drift notifications in subdirectories that haven’t yet been evaluated at the time of failure. After a while, if we don’t get any drift notifications, how can we be sure that there’s actually no Terraform drift in our infrastructure, rather than a fatal bug in our job? To resolve this, we will add a CloudWatch event rule that captures tasks in our cluster that exit with a non-zero exit code and sends a message to an SNS Topic. We also add the topic’s ARN to our module’s output so end users can reference it in downstream infrastructure to add topic subscriptions that route messages to the best place for their organization, such as an email account or webhook.

resource "aws_cloudwatch_event_rule" "task_failure" {
  name        = "${var.task_name}_task_failure"
  description = "Watch for ${var.task_name} tasks that exit with non zero exit codes"

  # Match stopped tasks from this cluster and task definition whose
  # container exited with a non-zero exit code.
  event_pattern = <<-EOF
    {
      "source": [
        "aws.ecs"
      ],
      "detail-type": [
        "ECS Task State Change"
      ],
      "detail": {
        "lastStatus": [
          "STOPPED"
        ],
        "stoppedReason": [
          "Essential container in task exited"
        ],
        "containers": {
          "exitCode": [
            {"anything-but": 0}
          ]
        },
        "clusterArn": ["${local.ecs_cluster_arn}"],
        "taskDefinitionArn": ["${aws_ecs_task_definition.this.arn}"]
      }
    }
  EOF
}

resource "aws_sns_topic" "task_failure" {
  name = "${var.task_name}_task_failure"
}

resource "aws_cloudwatch_event_target" "sns_target" {
  rule  = aws_cloudwatch_event_rule.task_failure.name
  arn   = aws_sns_topic.task_failure.arn
  input = jsonencode({ "message" : "Task ${var.task_name} failed! Please check the logs at https://console.aws.amazon.com/cloudwatch/home#logsV2:log-groups/log-group/${var.task_name}" })
}

output "sns_topic_arn" {
  value = aws_sns_topic.task_failure.arn
}
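One detail the snippet above does not show is how the event rule is allowed to publish to the topic. An SNS topic used as an event target generally needs a resource policy that lets the events.amazonaws.com service principal call sns:Publish. The module may already handle this elsewhere; if it does not, a minimal sketch could look like this (the data source and policy resource names are assumptions):

```hcl
# Allow CloudWatch Events to publish failure notifications to the topic.
data "aws_iam_policy_document" "allow_events_publish" {
  statement {
    sid       = "AllowEventsToPublish"
    effect    = "Allow"
    actions   = ["sns:Publish"]
    resources = [aws_sns_topic.task_failure.arn]

    principals {
      type        = "Service"
      identifiers = ["events.amazonaws.com"]
    }
  }
}

resource "aws_sns_topic_policy" "task_failure" {
  arn    = aws_sns_topic.task_failure.arn
  policy = data.aws_iam_policy_document.allow_events_publish.json
}
```

Note that attaching a topic policy replaces the topic's default policy, so any other publishers or subscribers that relied on the default would need their own statements.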

IAM Permissions

Finally, we need to handle IAM permissions for our newly created resources. The module creates two IAM roles, along with an option to pass an existing IAM role to your task’s runtime in case your scheduled task accesses additional AWS resources. Since this part of the code is a little tedious, I’ll leave it to the reader to check out the source in the module’s GitHub repository. Here is a quick summary of each IAM role:

1. The first role implements several policies related to ECS permissions. It includes the built-in AmazonECSTaskExecutionRolePolicy that allows general ECS task execution and ECR repo access. We add additional policies that allow the role to create CloudWatch log groups and write to CloudWatch log streams, which are required when using the awslogs ECS log driver.

2. The second role allows CloudWatch to pass the first role to ECS. Without this permission, CloudWatch would not be able to launch the task on our ECS cluster and our job would not run. This scenario is difficult to debug, since there are no logs outside of AWS CloudTrail that can help identify the reason why your job is not running.

One of the main benefits of modularizing this logic is specifically that we only need to figure out the IAM permissioning once.  Since it can be very tricky to set up IAM permissions that abide by the principle of least privilege and even trickier to debug permissions-related issues during development, we can now leverage our previous work in future modules that import this one.
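For readers who want a rough picture before opening the repository, the two roles summarized above could be sketched as follows. This is not the module's actual source: the role names aws_iam_role.task_execution_role and aws_iam_role.cloudwatch_role come from the snippets in this post, while every other name, the resource scoping, and the exact policy statements are assumptions.

```hcl
# Role 1: assumed by ECS to pull the image and write logs.
data "aws_iam_policy_document" "ecs_tasks_assume" {
  statement {
    actions = ["sts:AssumeRole"]
    principals {
      type        = "Service"
      identifiers = ["ecs-tasks.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "task_execution_role" {
  name               = "${var.task_name}-execution"
  assume_role_policy = data.aws_iam_policy_document.ecs_tasks_assume.json
}

resource "aws_iam_role_policy_attachment" "execution_managed" {
  role       = aws_iam_role.task_execution_role.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}

# The managed policy does not cover creating log groups, which the
# awslogs-create-group option needs.
resource "aws_iam_role_policy" "create_log_group" {
  name = "${var.task_name}-create-log-group"
  role = aws_iam_role.task_execution_role.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = ["logs:CreateLogGroup"]
      Resource = "*" # assumption: scope this down in real use
    }]
  })
}

# Role 2: assumed by CloudWatch Events to launch the task.
data "aws_iam_policy_document" "events_assume" {
  statement {
    actions = ["sts:AssumeRole"]
    principals {
      type        = "Service"
      identifiers = ["events.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "cloudwatch_role" {
  name               = "${var.task_name}-events"
  assume_role_policy = data.aws_iam_policy_document.events_assume.json
}

resource "aws_iam_role_policy" "run_task" {
  name = "${var.task_name}-run-task"
  role = aws_iam_role.cloudwatch_role.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Allow"
        Action   = ["ecs:RunTask"]
        Resource = [aws_ecs_task_definition.this.arn]
      },
      {
        # Without iam:PassRole, the rule fires but the task never starts.
        Effect   = "Allow"
        Action   = ["iam:PassRole"]
        Resource = [aws_iam_role.task_execution_role.arn, var.task_role_arn]
      }
    ]
  })
}
```

The iam:PassRole statement is the piece that corresponds to the hard-to-debug failure mode described in item 2 above.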

Running the Module

You can find the complete module on GitHub.

Now it’s time to use our module in Terraform code.  Before declaring the scheduled job module, we first declare an ECR repo for our task’s Docker container and a general-use ECS cluster that we will use to run all of our scheduled jobs.

# Named terraform_drift so that it matches the reference in the module call below.
resource "aws_ecr_repository" "terraform_drift" {
  name = "terraform-drift"
}

resource "aws_ecs_cluster" "cron-jobs" {
  name = "cron-jobs"
}

We then instantiate our module and pass it the ECR repo and ECS cluster names. Assuming we already pushed our Docker image to the repo, we also add the tag of the image that we want to run. Since we’re using Fargate, the only networking piece we need is a set of subnets in an existing VPC in which AWS will run the job. We also specify a cron trigger to run this job every hour, Monday through Friday.

module "cron_job" {
  source = "git@github.com:hyperscience/tf-aws-cron-job.git"

  ecr_repo_name                  = aws_ecr_repository.terraform_drift.name
  image_tag                      = "1.0.0"
  ecs_cluster_name               = "cron-jobs"
  task_name                      = "detect-terraform-drift"
  subnet_ids                     = module.vpc.private_subnets
  task_role_arn                  = aws_iam_role.admin.arn
  cloudwatch_schedule_expression = "cron(0 0/1 ? * MON-FRI *)"
}

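Finally, we create an email subscription to the module’s SNS Topic to notify us of any task failures. In this example, the endpoint is a Slack channel’s email address, so failure notifications show up in Slack.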

resource "aws_sns_topic_subscription" "task_failure" {
  topic_arn = module.cron_job.sns_topic_arn
  protocol  = "email"
  endpoint  = var.slack_channel_email_address
}

Once you terraform plan and terraform apply, your highly available scheduled job should be up and running!

If you’re interested in solving challenges related to automating key business practices, we’re hiring across the globe. Check out our open Engineering positions here.

Author’s Note: Thanks to Atanas Yankov on the Engineering team for his work on the alerting and code review. Also thanks to the entire DevOps team for their input and suggestions while writing this article.

###

Steve Sklar is a Senior DevOps Engineer at Hyperscience based out of our New York office. Connect with Steve on LinkedIn.

  1. https://www.terraform.io/
  2. https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html
  3. https://docs.aws.amazon.com/AmazonECS/latest/developerguide/AWS_Fargate.html
  4. https://www.hashicorp.com/blog/detecting-and-managing-drift-with-terraform
  5. https://docs.aws.amazon.com/AmazonECS/latest/developerguide/create-task-definition.html
  6. https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/ScheduledEvents.htm
  7. https://console.aws.amazon.com/cloudwatch/
