Introducing Predictive Rebalancing: An application-driven approach for reliably utilizing spot instances

blog on new models for predicting EC2 spot instance interruptions

Here at Spot by NetApp, we continuously improve the machine learning models we use to identify and predict spot capacity usage and interruptions across all major public clouds (AWS, Azure and GCP). These proprietary algorithms expand the ability to utilize spot capacity for production and mission-critical workloads, allowing our customers to reduce cloud compute costs by up to 90%, with SLAs and SLOs that guarantee availability.

Recently we’ve rolled out a package of significant enhancements to those models, most notably our new Predictive Rebalancing feature. This new rebalancing model is application-driven, combining advanced predictive algorithms with a deep understanding of infrastructure and real-time analysis of workload requirements to automatically provision and allocate cloud infrastructure as efficiently and effectively as possible. It not only provides far earlier and more finely tuned predictions, but also takes your unique workload and application requirements into consideration, ensuring that every process has the time it needs to complete. In the event of a predicted interruption, each application can gracefully transition to new replacement instances. This alignment of predictions with application needs ensures uptime, scale and successful workload execution in any situation.

In practical terms, you can now reliably run an even broader range of applications on spot capacity, as our more accurate visibility into supply and demand drives better selection of long-living spot instances and greater proactivity in avoiding interruptions and disruptions.

Let’s peek under the hood as well as review some of the new features you’ll have access to, using AWS EC2 spot instances as an example. 

Tracking spot instance capacity 

AWS has approximately 15,000 spot instance capacity pools across the globe, each uniquely defined by its region/availability zone, instance type, size and operating system. Because availability depends on ever-changing supply and demand, determining which spot instances will have greater longevity and which are about to be terminated requires access to significant amounts of both historical and current EC2 consumption data, from which machine learning algorithms can learn to accurately predict these capacity pools’ behavior.

With billions of events collected by our platform, Spot by NetApp has access to exactly this sort of unique data. Coupling this with our brand-new algorithm, we are now able to reliably predict, to an even greater extent than before, which spot instances will be interrupted and which will enjoy greater longevity, giving our customers the lowest-cost cloud compute along with an enterprise-level SLA for high availability.

Machine learning algorithms for predicting spot instance interruptions  

Our updated algorithms for Predictive Rebalancing can predict spot instance interruptions with 85% accuracy, up to an hour ahead of an interruption during peak business hours, and replace the affected instances in advance. Additionally, when selecting desired spot instance types (see below), you can request a period of longevity, and our algorithms will deploy the instances with the highest probability of meeting the defined requirements.

These predictive algorithms sample multiple, statistically significant data sets and have been proven to accurately reflect spot instance behavior in the broader AWS EC2 environment. 

graph of EC2 spot instance interruptions across all spot pools

Predictive Rebalancing features

Spot by NetApp customers using Elastigroup and Ocean, for web applications and containerized workloads respectively, can now run a broader range of workloads, even those that are more sensitive to interruptions – for example, applications that have long draining times – on inexpensive EC2 spot instances without concern for downtime or performance degradation.

Once you have chosen your desired cluster configurations, you can define the parameters for managing your spot instances in the event of actual or potential interruptions.

Workload capacity

Here you can select either instances or vCPUs as the capacity unit and set the desired target, minimum and maximum capacity. If you wish to always run part of your workload on on-demand instances, you can either define a specific number of instances or set a percentage of on-demand vs. spot instances.
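To make the arithmetic concrete, here is a minimal sketch of how a target capacity could be split between on-demand and spot units. The function name, its parameters and the round-up behavior are assumptions made for illustration only; they are not the Spot console's or API's actual fields or logic.

```python
def split_capacity(target, minimum, maximum,
                   on_demand_count=None, on_demand_percent=None):
    """Split `target` capacity units (instances or vCPUs) into (on_demand, spot).

    Hypothetical helper for illustration; not part of any Spot by NetApp API.
    """
    if not minimum <= target <= maximum:
        raise ValueError("target must lie between minimum and maximum")
    if on_demand_count is not None and on_demand_percent is not None:
        raise ValueError("set either a count or a percentage, not both")

    if on_demand_count is not None:
        # A fixed number of on-demand units, capped at the target.
        on_demand = min(on_demand_count, target)
    elif on_demand_percent is not None:
        # Assumption: round up, so the on-demand share is never under-provisioned.
        on_demand = (target * on_demand_percent + 99) // 100
    else:
        on_demand = 0

    return on_demand, target - on_demand


print(split_capacity(10, 2, 20, on_demand_percent=30))  # (3, 7)
print(split_capacity(10, 2, 20, on_demand_count=4))     # (4, 6)
```

With a target of 10 and 30% on-demand, three units stay on on-demand capacity and the remaining seven are eligible for spot.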

selecting spot instances or desired vCPU

Optimization strategy 

This feature has three parts. First, you can select fallback to on-demand in the event there are no available spot instances. Second, you can opt to utilize any available reserved instances before spinning up new spot instances. This drives even greater cost-efficiency by using resources you have already paid for.

Finally, you can define your spot instance portfolio’s overall orientation as follows: 

  • Cost – Spot by NetApp will seek out the least expensive spot instances to run on, even at the risk of more frequent replacements. 
  • Availability – Spot by NetApp will seek out the pool of spot instances with the greatest longevity for your workload. 
  • Balanced – This is Spot by NetApp’s recommended orientation, which seeks a balance between cost and availability.
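One way to picture the three orientations is as different weightings of price against predicted longevity. The sketch below is purely illustrative: the Pool fields, the weights and the scoring formula are all assumptions, and it does not represent Spot by NetApp's proprietary selection algorithm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pool:
    """A hypothetical spot capacity pool with made-up attributes."""
    name: str
    hourly_price: float          # USD per hour
    survival_probability: float  # assumed chance of running uninterrupted


# Assumed weight given to cost; the remainder goes to availability.
COST_WEIGHT = {"cost": 1.0, "availability": 0.0, "balanced": 0.5}


def choose_pool(pools, orientation):
    """Return the pool with the best score for the given orientation."""
    cost_weight = COST_WEIGHT[orientation]
    max_price = max(p.hourly_price for p in pools)

    def score(pool):
        # Normalize price so that cheaper pools score closer to 1.
        cheapness = 1 - pool.hourly_price / max_price
        return (cost_weight * cheapness
                + (1 - cost_weight) * pool.survival_probability)

    return max(pools, key=score)


pools = [
    Pool("pool-a", 0.10, 0.60),
    Pool("pool-b", 0.20, 0.95),
    Pool("pool-c", 0.14, 0.90),
]
for orientation in ("cost", "availability", "balanced"):
    print(orientation, "->", choose_pool(pools, orientation).name)
```

With these made-up numbers, the cost orientation picks the cheapest pool, the availability orientation picks the longest-lived one, and the balanced orientation lands on a pool that is reasonably cheap and reasonably stable.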

use on-demand, reserved and spot instances in a single cluster

Continuous optimization 

If a fallback to on-demand instances has occurred, you can choose when your workload should be returned to spot instances or moved to an instance type for which you have already-purchased, available reserved capacity.

move back from on-demand to spot instance

Visibility into EC2 spot instance availability and lifespan 

Here you can select the amount of time you want your workloads to run without any interruption to their underlying instances. Spot by NetApp will seek out spot instances from your selected families, sizes and AZs that have the greatest probability of running that long.

You can also define the draining period your application requires, so that our automation starts replacing instances early enough before the predicted interruption to allow for complete and graceful draining.
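The timing involved can be sketched as simple date arithmetic: the replacement has to begin no later than the predicted interruption time minus the draining period. The function names and the extra launch buffer below are assumptions for illustration, not documented Spot behavior.

```python
from datetime import datetime, timedelta


def latest_replacement_start(predicted_interruption, draining_period,
                             launch_buffer=timedelta(minutes=5)):
    """Latest moment a replacement can begin and still finish draining.

    `launch_buffer` is an assumed allowance for the new instance to boot.
    """
    return predicted_interruption - draining_period - launch_buffer


def can_drain_gracefully(now, predicted_interruption, draining_period,
                         launch_buffer=timedelta(minutes=5)):
    """True if there is still enough time for a complete, graceful drain."""
    deadline = latest_replacement_start(
        predicted_interruption, draining_period, launch_buffer)
    return now <= deadline


predicted = datetime(2024, 1, 1, 12, 0)
drain = timedelta(minutes=20)
print(latest_replacement_start(predicted, drain))  # 2024-01-01 11:35:00
print(can_drain_gracefully(datetime(2024, 1, 1, 11, 30), predicted, drain))  # True
```

The longer the draining period an application needs, the earlier a prediction has to arrive for the transition to remain graceful.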

choosing spot instances with greatest longevity

Application-Driven Predictive Rebalancing

Using spot instances for mission-critical workloads has always carried the risk of interruptions, making their use, while financially attractive, less than ideal. Spot by NetApp has been enabling cloud consumers to use spot instances for dramatic cost savings while ensuring high availability. Today we are taking this to the next level by coupling our new Predictive Rebalancing with our advanced cloud compute automation and continuous optimization, so that all your applications and workloads always have the resources they need for high availability, performance and maximum cost efficiency.

Predictive Rebalancing is being rolled out across our customer base and is accessible via the Spot console. You will be able to configure Predictive Rebalancing through the UI, the API or your preferred IaC tool – Terraform or CloudFormation. To get started, contact us today!