AWS Spot Instances and DIY

You may have heard that Spot Instances are ideal for slashing your EC2 costs, and you’d be right. However, there are a few considerations you must be aware of before launching your Spot Instances.

But first, some brief background for those not fully familiar with EC2 pricing models.

EC2 Pricing in a Nutshell

AWS offers three primary ways to pay for and use EC2. 

  • On-Demand pricing, which is essentially pay-as-you-go and is also the most expensive option.
  • Reserved Instances and Savings Plans, which are financial commitments to 1 or 3 years of EC2 usage (as well as some other AWS services). Savings compared to On-Demand can reach roughly 70%. However, Reserved Instances and Savings Plans create lock-in, so if you don’t use what you committed to, you can end up with a negative ROI.
  • Spot Instances, which offer up to 90% cost reduction compared to On-Demand pricing. This pricing model represents AWS’ excess capacity, which they need to have available in case of surges in customer demand. To offset the loss of idle infrastructure, AWS offers this excess capacity at a massive discount to drive usage. However, this comes with the caveat that AWS can “pull the plug” and terminate Spot Instances with just a two-minute warning (they would do this in the event excess capacity is needed by Reserved Instance or On-Demand customers). Obviously, this sudden interruption of EC2 instances can result in data loss, service degradation, unavailable services and the like.
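To make the lock-in risk of commitments concrete, here is a rough sketch of the arithmetic. The hourly price is a made-up figure and the discount is the approximate one quoted above, so treat the numbers as illustrative assumptions rather than actual AWS rates.

```python
# All numbers below are illustrative assumptions, not real AWS prices.
ON_DEMAND_HOURLY = 0.10   # hypothetical On-Demand price in USD per hour
COMMIT_DISCOUNT = 0.70    # rough Reserved Instance / Savings Plan discount
HOURS_PER_YEAR = 24 * 365


def committed_cost(hours_committed: float) -> float:
    """A commitment is billed for every committed hour, used or not."""
    return hours_committed * ON_DEMAND_HOURLY * (1 - COMMIT_DISCOUNT)


def on_demand_cost(hours_used: float) -> float:
    """On-Demand is billed only for the hours actually used."""
    return hours_used * ON_DEMAND_HOURLY


def commitment_savings(utilization: float) -> float:
    """Yearly savings vs. On-Demand when only a fraction of the
    committed hours is actually used (negative means a loss)."""
    hours_used = HOURS_PER_YEAR * utilization
    return on_demand_cost(hours_used) - committed_cost(HOURS_PER_YEAR)


def break_even_utilization() -> float:
    """Below this utilization the commitment costs more than On-Demand."""
    return 1 - COMMIT_DISCOUNT


if __name__ == "__main__":
    for utilization in (1.0, 0.5, 0.2):
        print(f"{utilization:.0%} used -> savings {commitment_savings(utilization):.2f} USD")
    print(f"break-even utilization: {break_even_utilization():.0%}")
```

Under these assumptions, a commitment with a 70% discount only pays off if you use more than about 30% of the committed hours; below that, On-Demand would have been cheaper.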

DIY with AWS Spot Instances and Spot Fleet

With our introduction out of the way, let’s dive into some of the issues to consider for the do-it-yourself approach. While managing a Spot Fleet on your own might seem like an attractive option, there are some things you should know.

No SLA for Availability

While Spot Instances offer the maximum potential for EC2 savings, they are also the least reliable in terms of availability, which is why AWS only recommends them for “stateless, fault-tolerant, or flexible applications such as big data, containerized workloads, CI/CD, web servers, high-performance computing (HPC), and other test & development workloads”. In short, AWS does not provide any SLA for keeping your workload up and running, greatly limiting the cost benefit of Spot Instances for mission-critical, production workloads.

Container-Driven Autoscaling

EC2 Auto Scaling groups use traditional scaling metrics (e.g. server CPU and RAM utilization) to autoscale containerized applications. The issue with this is that instance/node metrics do not accurately represent the requirements of the application running on them. So if you are running ECS or Kubernetes workloads, even if there is a surplus of EC2 resources across the cluster, once a pod or a task requires more CPU or RAM than any single node has available, a new node won’t start because overall utilization still looks low, and the pod/task won’t run, remaining in its pending state. Moreover, even if a scale-up is triggered, it is unaware of which application triggered it, potentially resulting in an instance that still can’t accommodate the pending pods/tasks.
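The mismatch between node-level metrics and container requirements can be shown with a small sketch. The `Node` and `PodRequest` types and all figures are hypothetical; this is not how any particular scheduler is implemented, only an illustration of the gap described above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    name: str
    free_cpu_millicores: int
    free_memory_mib: int


@dataclass
class PodRequest:
    cpu_millicores: int
    memory_mib: int


def cluster_has_aggregate_room(pod: PodRequest, nodes: List[Node]) -> bool:
    """What node-level metrics suggest: is there enough free capacity
    when all nodes are summed together?"""
    total_cpu = sum(n.free_cpu_millicores for n in nodes)
    total_mem = sum(n.free_memory_mib for n in nodes)
    return total_cpu >= pod.cpu_millicores and total_mem >= pod.memory_mib


def fits_on_some_node(pod: PodRequest, nodes: List[Node]) -> bool:
    """What actually matters: a pod or task must fit on ONE node."""
    return any(
        n.free_cpu_millicores >= pod.cpu_millicores
        and n.free_memory_mib >= pod.memory_mib
        for n in nodes
    )


# Three nodes, each with 1 vCPU free; the pod asks for 1.5 vCPU.
nodes = [Node(f"node-{i}", 1000, 2048) for i in range(3)]
pod = PodRequest(cpu_millicores=1500, memory_mib=1024)

if __name__ == "__main__":
    print("aggregate room:", cluster_has_aggregate_room(pod, nodes))
    print("fits on a node:", fits_on_some_node(pod, nodes))
```

In this example the cluster has 3 vCPU free in total, so utilization-based scaling sees no reason to add a node, yet the pod cannot be placed anywhere and stays pending.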

As an alternative, you can try using the open source “Cluster Autoscaler” for EKS. However, it has some limitations, such as lack of support for Auto Scaling groups which span multiple Availability Zones. To enjoy higher availability, you will need to manage multiple Auto Scaling groups, one per AZ. For ECS, Amazon recently announced a Cluster Auto Scaling feature, a new option you may have heard about during re:Invent.

Instance Auto-Recovery, Graceful Draining and Fallback to On-Demand 

When AWS terminates a Spot Instance, it will not seek out an alternative instance type unless you have defined the Spot Fleet request type as “maintain”. In that case, Spot Fleet will replace the terminated instances with a different, available type from one of the pre-selected Spot pools. As for draining your workload in the event of a Spot Instance termination, you only receive a two-minute warning, which is often insufficient to properly drain and back up your workload. Finally, in the event that there are no available Spot pools, Spot Fleet will not fall back to On-Demand, which means there is no way to fully guarantee your workload’s uptime.
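For the do-it-yourself route, the two-minute warning has to be detected from inside the instance. The following is a minimal sketch, assuming the instance uses IMDSv2 and that you supply your own `drain` callback (a hypothetical function that deregisters the node, flushes state, and so on). It polls the instance metadata path `spot/instance-action`, which returns a 404 until an interruption is scheduled.

```python
import json
import time
import urllib.error
import urllib.request
from datetime import datetime, timezone
from typing import Callable, Optional

IMDS = "http://169.254.169.254/latest"


def fetch_instance_action(timeout: float = 1.0) -> Optional[str]:
    """Return the raw interruption notice, or None if none is scheduled."""
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_req, timeout=timeout) as resp:
        token = resp.read().decode()

    action_req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        with urllib.request.urlopen(action_req, timeout=timeout) as resp:
            return resp.read().decode()
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no interruption scheduled yet
            return None
        raise


def seconds_until_action(raw_notice: str, now: datetime) -> float:
    """Parse a notice such as {"action": "terminate", "time": "...Z"}
    and return how many seconds remain before the action time."""
    notice = json.loads(raw_notice)
    deadline = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ")
    deadline = deadline.replace(tzinfo=timezone.utc)
    return (deadline - now).total_seconds()


def watch(drain: Callable[[float], None], poll_seconds: float = 5.0) -> None:
    """Poll until a notice appears, then hand the remaining time to the
    caller-supplied (hypothetical) drain callback."""
    while True:
        raw = fetch_instance_action()
        if raw is not None:
            drain(seconds_until_action(raw, datetime.now(timezone.utc)))
            return
        time.sleep(poll_seconds)
```

Note that this only tells you an interruption is coming. Whether two minutes is enough to drain connections and back up data depends entirely on your workload, and the fallback to On-Demand capacity is still something you would have to build separately.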

Automatic Backup and Re-Attaching EBS Volumes

If you are looking to run stateful applications on Spot Instances (yes, it is possible), Spot Fleet does not provide any IP persistence, and while EBS volumes can be backed up, you will need to manually re-attach EBS volumes to any new instances. The only way to enjoy automatic EBS re-attachment is to opt for the “hibernate” or “stop” behavior for interrupted Spot Instances. In that case, the EBS volume will be automatically re-attached, but the trade-off is that you’ll need to wait for the exact same type of Spot Instance to become available once again, and this might take minutes, hours or days.
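The manual re-attachment step could be scripted along these lines. This is a sketch under several assumptions: `ec2_client` is a boto3 EC2 client created by the caller, the volume was configured to survive instance termination, the replacement instance is in the same Availability Zone as the volume, and the device name and IDs shown are placeholders.

```python
def reattach_volume(ec2_client, volume_id: str, new_instance_id: str,
                    device: str = "/dev/sdf") -> dict:
    """Attach an existing EBS volume to a replacement instance.

    `ec2_client` is expected to behave like boto3.client("ec2").
    """
    # Wait until the volume is fully detached from the terminated
    # instance and reported as "available".
    ec2_client.get_waiter("volume_available").wait(VolumeIds=[volume_id])

    # Attach it to the replacement instance. This fails if the volume
    # and the instance are in different Availability Zones.
    return ec2_client.attach_volume(
        Device=device,
        InstanceId=new_instance_id,
        VolumeId=volume_id,
    )


# Example usage (placeholder IDs, requires boto3 and AWS credentials):
#
#   import boto3
#   reattach_volume(boto3.client("ec2"), "vol-EXAMPLE", "i-EXAMPLE")
```

You would still need something to trigger this when a replacement instance launches, plus a way to mount the file system inside the instance, which is part of the custom scripting overhead of the DIY approach.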

Prioritizing Workloads to Run on Unused Reserved Instances

When using Spot Fleet, there will be no prioritized placement of workloads onto unused Reserved Instances. You will only be able to leverage the cost-savings of Reserved Instances in the event that you have a blended cluster of On-Demand and Spot Instances, and the On-Demand Instance matches an available RI. 

Native Integrations with 3rd Party Vendors

While Spot Fleet allows you to work with ECS, EKS, Elastic Beanstalk and other AWS services, it does not have native integration with products like Rancher, D2iQ (Mesosphere), Docker Swarm, Chef, and many others. As you might be working with one of these platforms, you will need to develop some sort of integration on your own, adding extra time and effort to your project.

Build vs. Buy

By now you are probably getting the sense that while Spot Fleet is a powerful tool for running some of your fault-tolerant workloads on Spot Instances, it comes with significant overhead in terms of configuration, custom scripting and even some manual intervention.

Most significantly, without any SLA for availability, you cannot consider moving production workloads to Spot Instances with just Spot Fleet alone. 

This is where Spotinst enters the picture, providing you with a “set and forget” platform for running even mission-critical workloads on Spot Instances so you can enjoy the incredible cost-savings AND sleep well at night with an enterprise-level SLA for high availability.

Moreover, Spotinst is not just about Spot Instances, but rather is an end-to-end cloud cost optimization platform which starts with comprehensive spend analysis, leading to actionable recommendations for Reserved Instances and Savings Plans alongside Spot Instances and other cost-saving measures. And the best part is that all these recommendations can be easily implemented with just the click of a button, saving you not only money, but also time and effort in managing your cloud infrastructure.

To learn more about the advantages Spotinst can provide when running production and mission-critical workloads on Spot Instances, feel free to schedule a complimentary demo with one of our solution architects.