Exploring Cost Insights for ML Experiments at Hyperscience

Akshay Patil | Posted on December 14, 2021

Model Training at Hyperscience

At Hyperscience, we continuously train and iterate on our models using our Machine Learning (ML) platform, with the help of 30 ML Engineers across the globe. With numerous ML experiments launched daily, we must ensure that our ML platform can support their load and scale. An ML experiment runs any process related to the model lifecycle, such as training, evaluating, and preprocessing.

The first version of our ML infrastructure consisted of AWS EC2 instances launched on demand. A Python client let you specify options for setting up your instance, such as the instance type, git commit hash, AMI ID, and command to run on launch. Once the instance was ready, users could SSH into it to perform any additional experimentation. This setup had several issues:

  • Management of the EC2 instances was an operational burden.
  • The setup left stray SSH keys behind, adding to the security burden.
  • Distributed training with PyTorch or other frameworks was difficult to run.

Our second version of ML infrastructure (called MLLAB) leverages Elastic Kubernetes Service (EKS) for running all ML experiments. EKS is an AWS-managed Kubernetes service in which the nodes (instances running ML experiments) automatically scale up and down as we launch and destroy ML experiments. For authentication, we use AWS SSO, where each user is assigned to a group and each group has a specific IAM role. Since every ML Engineer belongs to a team-specific IAM role, we leverage these roles to set up authentication to MLLAB. This allows Engineers to access resources specific to their teams. More information on users and groups can be found here. Because we were already running Kubernetes, it was simple to install Kubeflow’s PyTorch and TensorFlow operators to run distributed training.

Reporting on Cost

Allocating resources and costs was easier in our initial version of ML infrastructure, since we merely had to tag the correct EC2 instances to report cost by entity, such as team or user. With MLLAB, backed by Kubernetes and other containerization tools, the traditional process of allocating and reporting on costs no longer works. Below are some of the challenges in reporting the associated costs:

  • Resources are shared across multiple applications
    • Kubernetes will deploy multiple application containers on a single instance. It becomes hard to track which container used how many resources if there is no hard limit set on the resource usage.
  • Containers are dynamic in nature
    • Kubernetes pods, which can contain one or more containers, can be destroyed and moved at any time and should be designed to be stateless (in most cases). The dynamic nature of pods makes it hard to keep track of their costs as they move between instances.
  • Autoscaling makes it harder
    • The Kubernetes autoscaler will try to adjust the request and limit configuration to reduce overhead. This makes it hard to track costs when the configuration changes at runtime.
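
To make the shared-resource challenge concrete, here is a minimal sketch of one way to split a single node's hourly cost across the pods scheduled on it, in proportion to their CPU requests. It is a deliberately simplified illustration and not how Kubecost or AWS computes allocation; the names `PodRequest`, `split_node_cost`, and the `__idle__` key, along with all numbers, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PodRequest:
    name: str
    owner: str
    cpu_cores: float  # CPU cores requested by the pod


def split_node_cost(node_hourly_cost, node_cpu_cores, pods):
    """Split one node's hourly cost across pods by requested CPU share.

    Capacity that no pod requested is reported under the "__idle__" key.
    """
    cost_per_core = node_hourly_cost / node_cpu_cores
    allocation = {}
    requested = 0.0
    for pod in pods:
        allocation[pod.name] = pod.cpu_cores * cost_per_core
        requested += pod.cpu_cores
    idle_cores = max(node_cpu_cores - requested, 0.0)
    allocation["__idle__"] = idle_cores * cost_per_core
    return allocation


# Fictional example: an 8-core node costing $0.80/hour shared by two pods.
pods = [
    PodRequest("train-a", "alice", 4.0),
    PodRequest("eval-b", "bob", 2.0),
]
shares = split_node_cost(node_hourly_cost=0.80, node_cpu_cores=8.0, pods=pods)
```

If pods declare no requests, a split like this has nothing to work from, which is one reason the node-level tags used by traditional cost tools leave so much cost unattributed.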

Below is an example of how MLLAB costs are reported per user in the traditional AWS Cost Explorer.

[AUTHOR’S NOTE] The numbers displayed above are fictional, and not actual AWS cost incurred.

The majority of the cost is not allocated to any user. This is a cause for concern, because instances are running in our cluster without any way to correctly associate them with a user.

Given the above challenges with reporting cost in MLLAB, we had to look into different frameworks that would help us solve this cost insight problem related to Kubernetes.

Requirements

We wanted answers to cost-related questions such as:

  • How much was spent on a particular experiment?
  • How much was the total spend of a particular ML team?
  • How much was the total spend of a particular user?
  • How much are we spending on our first version of ML infrastructure (which we are hoping to deprecate soon)?

[AUTHOR’S NOTE] This does not mean we will be capping the number of experiments being run by an engineer or a team. Rather, insight into the above questions helps us have meaningful, data-driven conversations.

We determined a set of requirements for our cost insights effort: 

  • Getting cumulative costs on a weekly basis, aggregated by:
    • Team
    • Users (i.e., different ML and MLOps engineers)
    • Experiments (training, evaluating, etc., launched by a user)
    • Instance types (e.g., p3s vs. c5s)
  • Single dashboard to view these costs for both versions of our ML infrastructure.
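
The weekly aggregation requirement can be sketched in a few lines of Python. This is only an illustration of the shape of the report we wanted, assuming cost records are already available as dictionaries; the record fields, the `weekly_totals` helper, and all values are hypothetical and fictional.

```python
from collections import defaultdict
from datetime import date

# Fictional cost records; in practice these would come from a cost tool.
records = [
    {"day": date(2021, 12, 6), "team": "vision", "user": "alice",
     "experiment": "train-1", "instance_type": "p3.2xlarge", "cost": 30.0},
    {"day": date(2021, 12, 8), "team": "vision", "user": "alice",
     "experiment": "eval-1", "instance_type": "c5.2xlarge", "cost": 12.5},
    {"day": date(2021, 12, 9), "team": "nlp", "user": "bob",
     "experiment": "train-2", "instance_type": "p3.2xlarge", "cost": 7.5},
    {"day": date(2021, 12, 13), "team": "vision", "user": "alice",
     "experiment": "train-3", "instance_type": "p3.2xlarge", "cost": 5.0},
]


def weekly_totals(records, dimension):
    """Sum cost per (ISO year, ISO week, dimension value)."""
    totals = defaultdict(float)
    for record in records:
        iso_year, iso_week, _ = record["day"].isocalendar()
        totals[(iso_year, iso_week, record[dimension])] += record["cost"]
    return dict(totals)


by_team = weekly_totals(records, "team")
by_instance_type = weekly_totals(records, "instance_type")
```

The same function covers every dimension in the list above by changing the `dimension` argument to `"team"`, `"user"`, `"experiment"`, or `"instance_type"`.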

Cost Insights Framework: Kubecost

After extensive investigation and proof-of-concept (POC) work on various frameworks, we decided that Kubecost met all the requirements above and would give us an out-of-the-box dashboard to view all of our costs.

Kubecost, which started as an open source project, not only lets you monitor your costs but also provides optimization suggestions based on your usage. It has a free version that you can install for cost insights, which shows cost metrics for the last 15 days.

Below are some examples of cost insights that were provided to our internal teams after our first deployment of Kubecost:

COST PER TEAM

Cost for the last 7 days aggregated by teams.

[AUTHOR’S NOTE] The numbers displayed above are fictional, and not actual AWS cost incurred.

COST PER OWNER

Cost for the last 7 days aggregated by owners.

[AUTHOR’S NOTE] The numbers displayed above are fictional, and not actual AWS cost incurred.

COST PER EXPERIMENT

Cost for the last 7 days aggregated by experiments.

[AUTHOR’S NOTE] The numbers displayed above are fictional, and not actual AWS cost incurred.

If you look at the external column in the images above, you will see the costs incurred outside of our Kubernetes cluster. This helps us understand how much we spend on our first version of the ML infrastructure. We still have some users on the old infrastructure. As we add more features and migrate users to the new infrastructure, this column gives us insight into our activity and costs on the old infrastructure, which we want to reduce to $0.

Prerequisites for Obtaining Cost Insights 

Tagging Resources

Kubecost aggregates costs based on the labels and tags on your resources, so an important task was to correctly label and tag all the AWS resources we use for our ML experiments. The Python client that we built as a wrapper for launching experiments in both versions of our ML infrastructure helps us correctly tag any resources launched as part of an experiment. When launching resources in MLLAB, the Python client calls the Kubernetes client underneath to add the specific tags or labels to the resources so they can be used correctly in Kubecost.
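
As a rough sketch of the idea (not our actual client), the helper below derives Kubernetes labels and AWS-style tags from the same metadata so that both sides use identical keys and values. The function names and the `experiment` key are assumptions made for illustration. The sanitization follows the Kubernetes rule that label values may be at most 63 characters, may contain only alphanumerics, `-`, `_`, and `.`, and must begin and end with an alphanumeric character.

```python
import re


def to_label_value(raw):
    """Coerce a string into a valid Kubernetes label value."""
    # Replace disallowed characters, then enforce the 63-character limit.
    cleaned = re.sub(r"[^A-Za-z0-9._-]", "-", raw)[:63]
    # Label values must start and end with an alphanumeric character.
    return cleaned.strip("-_.")


def cost_metadata(team, user, experiment):
    """Return (kubernetes_labels, aws_tags) built from the same values."""
    labels = {
        "team": to_label_value(team),
        "user": to_label_value(user),
        "experiment": to_label_value(experiment),  # hypothetical key
    }
    # AWS APIs commonly accept tags as a list of Key/Value pairs.
    tags = [{"Key": key, "Value": value} for key, value in labels.items()]
    return labels, tags


labels, tags = cost_metadata("Vision Team", "alice@example.com", "train-run#7")
```

Applying the same sanitized values to the AWS tags, even though AWS tag values are less restrictive, keeps the two sides identical so they can later be matched by key and value.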

Creating Cost and Usage Reports in AWS

In order to get accurate pricing for resources used outside of Kubernetes, it is useful to set up a Cost and Usage Report in S3 that gets updated up to three times a day. Kubecost queries the S3 bucket every few hours to get the costs of the different AWS resources and the tags in your ML account. It determines how much was spent on a particular tag and matches that tag with the labels on your Kubernetes resources, so you can get the aggregate cost of a particular tag/label like `team` or `user` across both in-cluster and out-of-cluster resources.
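
Conceptually, the matching step amounts to joining two cost tables on a shared key. The sketch below illustrates that join for a single label such as `team`; it is not Kubecost's implementation, and the `combine_costs` helper and all figures are hypothetical.

```python
def combine_costs(in_cluster, external):
    """Join in-cluster and out-of-cluster cost totals on a shared tag value."""
    rows = {}
    for key in sorted(set(in_cluster) | set(external)):
        inside = in_cluster.get(key, 0.0)
        outside = external.get(key, 0.0)
        rows[key] = {
            "in_cluster": inside,
            "external": outside,
            "total": inside + outside,
        }
    return rows


# Fictional totals keyed by the value of the `team` label/tag.
in_cluster_by_team = {"vision": 120.0, "nlp": 45.0}
external_by_team = {"vision": 30.0, "research": 10.0}

report = combine_costs(in_cluster_by_team, external_by_team)
```

A team that appears only on one side still gets a row, with the missing side reported as zero, which is what makes leftover spend on the old infrastructure visible.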

Another important step is to enable cost allocation on the tags defined on your out-of-cluster resources. This causes costs for those tags to appear in the Cost and Usage Reports. You can refer to this link to enable cost allocation on tags in your out-of-cluster resources.

Future Efforts

Obtaining cost insights is the first step towards any cost optimization effort. Eventually, we will put optimizations in place based on the insights available in these dashboards. Future optimization efforts we want to concentrate on include:

  • Using the correct instance type based on the usage.
  • Tagging and naming the resources for unallocated cost.
  • Identifying idle cost and reducing it as much as possible.
  • Tagging and aggregating costs of other resources apart from EC2.

Akshay Patil is an MLOps Staff Engineer based in Austin, Texas. Connect with Akshay on LinkedIn here.
