How I cut my AKS cluster costs by 82%

With our recent announcement of the general availability of Ocean for Azure Kubernetes Service (AKS), I decided to migrate one of our production services. The service was already running on AKS, but would now be managed by Spot’s Ocean for AKS.

TL;DR: The results are pretty cool, as I was able to cut 82% out of the existing spending for this AKS cluster. You can see the results in the screen capture below.

Spot dashboard results

If you want to learn more about how I enabled Ocean for this AKS cluster in less than 5 minutes, please continue reading.

To begin with, my AKS cluster configuration was as follows:

  • Network type (plugin): Azure CNI
  • Node pools: 2 node pools (User, System)
  • Node sizes: Standard_DS2_v2, Standard_D2s_v3
  • Region: EastUS
  • Network Policy: Default
  • Cluster is using Managed Service Identities
  • Autoscaling: Enabled

This production application has a frontend and backend for internal developer use. It provides a number of utilities to create and manage test, staging, and production environments. In general, my team was unsatisfied with the utilization of the nodes in this AKS cluster, and the spending was very high. Moreover, we had cost limitations on this cluster and min/max restrictions on the number of nodes. The end result was long periods in which many pods sat unscheduled, waiting for other jobs to finish.

Below are the metrics from the AKS cluster, before being migrated to Spot Ocean.

AKS metrics before migration

As you can see, there are many pending pods from time to time, and average utilization is very low: 24% memory and 12.36% CPU.
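The pending-pod observation is easy to reproduce yourself by filtering `kubectl get pods` output on the `STATUS` column. A minimal sketch with sample output inlined so the pipeline can be dry-run; on a live cluster you would pipe `kubectl get pods --all-namespaces` (or simply use `--field-selector=status.phase=Pending`) into the same filter.

```shell
# Sample `kubectl get pods --all-namespaces` output, inlined for a dry run.
pods='NAMESPACE   NAME    READY   STATUS    RESTARTS   AGE
default     web-1   1/1     Running   0          3d
default     web-2   0/1     Pending   0          5m
jobs        job-7   0/1     Pending   0          2m'

# Count rows whose STATUS column (4th field) is Pending, skipping the header.
printf '%s\n' "$pods" | awk 'NR > 1 && $4 == "Pending" { n++ } END { print n+0 }'
# → 2
```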

Ocean claims it will optimize node utilization and reduce overall cluster cost without sacrificing availability or performance. Because this cluster has underutilized resources, pending pods, and was hitting budget limits, I decided to migrate this production cluster to Spot Ocean for AKS.

Getting Ocean configured and connected

Before connecting the AKS cluster to Ocean, I completed the following steps as outlined in the Spot Ocean documentation:

  1. Connected an Azure account to Spot.
  2. Verified access to our AKS cluster.
  3. Installed the Kubernetes command-line tool, `kubectl`, on my workstation and configured it to work with the relevant AKS cluster.
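The third prerequisite can be checked from the terminal. A dry-run sketch, assuming the Azure CLI and `kubectl` are installed; `my-rg` and `my-aks` are placeholder names, and the `echo` prefixes mean the script only prints the commands it would run.

```shell
AZ="echo az"; KUBECTL="echo kubectl"   # drop the "echo" prefixes to run for real

# Fetch cluster credentials and merge them into ~/.kube/config.
$AZ aks get-credentials --resource-group my-rg --name my-aks
# Confirm kubectl now points at the AKS cluster and can reach its API.
$KUBECTL config current-context
$KUBECTL get nodes
```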

The setup page to allow Ocean to operate on your AKS cluster is very simple and involves 3 clicks:

  1. Generating a token (note: you can reuse an existing one).
  2. Installing the controller on the cluster with a simple command.
  3. Installing a job that sends the cluster metadata from a running node to the Spot SaaS.
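After those three clicks, the install can be sanity-checked from the cluster side. A dry-run sketch; the controller deployment name and namespace below are assumptions based on Spot's defaults, so adjust them to whatever your setup page actually installs.

```shell
KUBECTL="echo kubectl"   # drop the "echo" to run for real

# The Ocean controller should be running as a deployment in kube-system
# (deployment name is an assumption -- check your setup page).
$KUBECTL get deployment spotinst-kubernetes-cluster-controller -n kube-system
# The one-off metadata job from step 3 should report as Completed.
$KUBECTL get jobs -n kube-system
```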

Below is a view of the setup page.

Ocean setup steps

After connectivity was established, I proceeded to the Compute tab to view the data imported from AKS.

For the VM types and resource limits in the cluster, Spot Ocean configures a pool of VM families that matches the architecture of the resources in the AKS configuration. I selected a wide variety of instance types, shown in blue in the screenshot below.

Note: You can always remove instance families you don’t want to use.

vm instance family selection and limits

Then, based on the node pool configuration in Azure, Ocean creates Virtual Node Groups (VNGs), a Spot component that provides a single layer of abstraction for managing different types of workloads on the same cluster. The respective VNGs inherit the labels, taints, availability zones, and disk configuration already defined in Azure.

VNG imported node pool

The rest of the configuration usually does not need to be changed; it is a snapshot of the existing configuration in the AKS cluster, and it appears in the last tab, Review, under the JSON view.

imported configuration JSON

The configuration that you should expect to see is related to Images, Networks, Cluster tags, Login, Load Balancers, Disks, Extensions, and Authentication.

Finally, after a very easy process of importing the data, it’s time for the last click – Connect Ocean!

The Ocean cluster has been created, and now manages 0 out of the 17 nodes that we have in the AKS cluster.

ocean cluster created

Migrating workloads from AKS node pools to Ocean

To complete the transition, it’s time to migrate the workloads from the AKS node pools to Ocean.

First, we need to disable the AKS node pool autoscaling. This lets Ocean take over management of newly created pods and launch virtual machines (VMs) for pending pods that have no suitable node to be placed on.

After that, we’ll need to scale down the VMs in the AKS node pools so that Ocean can spin up VMs to replace them.

Important note: In order to let Ocean control the AKS cluster, we need to:

  1. Disable Cluster Autoscaling for the node pools
  2. Scale down the node pools. For System node pools, Azure recommends keeping at least one node running, so we’ll keep 1 node in the default System pool; User pools can be scaled down to 0.

To do that, we first need to disable autoscaling in the AKS node pools, either manually via the UI or with the following command:

CLI:

$ az aks nodepool update --disable-cluster-autoscaler -g ${resourceGroupName} -n ${nodePoolName} --cluster-name ${aksClusterName}

UI:

AKS scaling node pools to zero


After the node pools autoscaling is disabled, I scaled down the node pools as described above.

This can be done manually via the Azure UI, or with the Azure CLI using the following command:

$ az aks nodepool scale --resource-group ${resourceGroupName} --cluster-name ${aksClusterName} -n ${nodePoolName} --node-count 0
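The two commands above can be folded into one small per-pool script. A sketch with placeholder resource names; `AZ` is set to echo, so the script prints what it would run instead of executing it.

```shell
AZ="echo az"                 # change to AZ="az" to execute for real
resourceGroupName="my-rg"    # placeholder
aksClusterName="my-aks"      # placeholder

takeover_pool() {
  pool="$1"; count="$2"
  # 1. Stop the AKS cluster autoscaler for this pool.
  $AZ aks nodepool update --disable-cluster-autoscaler \
    -g "$resourceGroupName" -n "$pool" --cluster-name "$aksClusterName"
  # 2. Scale the pool down so Ocean provisions replacement VMs.
  $AZ aks nodepool scale \
    -g "$resourceGroupName" -n "$pool" --cluster-name "$aksClusterName" \
    --node-count "$count"
}

takeover_pool systempool 1   # Azure recommends keeping at least 1 System node
takeover_pool userpool 0     # User pools can go to zero
```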

Now, as the pods that were scheduled on the scaled-down nodes become unscheduled, it’s time for Ocean to launch spot VMs for them.

As you can see in the screenshots below, half of the cluster is now managed by Ocean, and the other half is being cordoned and drained.

Ocean managing half the nodes

The nodes using the aks-XXXX… naming convention are managed by AKS, and the vm-XXXX… nodes are managed by Spot Ocean.
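That naming convention makes it easy to count how many nodes each system manages. A sketch with sample `kubectl get nodes -o name` output inlined so it can be dry-run; on a live cluster you would pipe the real command into the same filter.

```shell
# Sample `kubectl get nodes -o name` output, inlined for a dry run.
nodes='node/aks-nodepool1-12345678-vmss000000
node/aks-nodepool1-12345678-vmss000001
node/vm-0a1b2c3d
node/vm-4e5f6a7b
node/vm-8c9d0e1f'

# Split on "/" and classify each node by its name prefix.
printf '%s\n' "$nodes" | awk -F/ '
  $2 ~ /^aks-/ { aks++ }
  $2 ~ /^vm-/  { ocean++ }
  END { printf "AKS-managed: %d, Ocean-managed: %d\n", aks+0, ocean+0 }'
# → AKS-managed: 2, Ocean-managed: 3
```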

AKS and Ocean node names

Finally, after the first day with all of the workloads sitting on spot VMs managed by Spot Ocean, I made the following observations:

  1. My cluster reached 97% utilization at first. That is an awesome improvement! In order for new pods to get scheduled faster, I configured 5% headroom for the “User” VNG. This reduced my utilization to 92%, but workloads can scale faster with a reasonable buffer of spare capacity.
  2. My cluster has 83% savings because Ocean provisions a variety of spot VM sizes. This is more flexible than what the AKS managed node pool provides. Bin-packing is optimizing workload placement and reducing the number of nodes needed.
  3. The number of unscheduled pods was drastically reduced.
  4. There were no interruptions at all for the spot VMs on the first day. Since then, I’ve observed an average of three interruptions a week. These are being handled seamlessly by Ocean without negatively impacting the running workloads.

What does this mean for you?

If you are already running workloads on AKS, this solution could really help reduce both your cost and operational overhead. With Spot Ocean for AKS, I was able to cut more than 80% out of this cluster’s compute costs. In addition to the cost savings, workloads are performing better, with fewer pods spending time unscheduled. Do you want to cut your cluster costs by 80%? Book a demo!