Maximum availability without risk: Spot market scoring explained

Head of Product, Cloud Compute & Platform

December 13, 2021

Using spot instances for mission-critical workloads has always carried the risk of interruptions, making them financially attractive but less than ideal from a reliability perspective. Spot has made it possible for cloud consumers to use spot instances for dramatic cost savings while ensuring high availability for all kinds of workloads. Core to our cloud infrastructure offerings are Spot Availability Scores, which are used to provide maximum availability while mitigating risk.

Intelligent, continuous and automated spot selection

The AWS Spot market has approximately 15,000 spot instance capacity pools across the globe, each uniquely defined by its region/availability zone, instance type, size and operating system. With spot availability based on ever-changing supply and demand, determining which spot instances have longevity and which will be terminated requires access to significant amounts of data—both historical and current EC2 consumption—upon which machine learning algorithms can learn to accurately predict capacity pool behavior.
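To make the notion of a capacity pool concrete, the sketch below models a pool as the combination of the attributes listed above and counts how many pools a small, made-up catalog produces. The class name, field names, and sample values are illustrative assumptions, not part of any Spot or AWS API.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class CapacityPool:
    """Hypothetical model of one spot market (capacity pool)."""
    availability_zone: str  # e.g. "us-east-1a"
    instance_type: str      # family and size together, e.g. "m5.4xlarge"
    operating_system: str   # e.g. "Linux/UNIX"


def enumerate_pools(zones, instance_types, operating_systems):
    """Return every combination of zone, instance type, and OS as a pool."""
    return [
        CapacityPool(zone, instance_type, os_name)
        for zone, instance_type, os_name in product(
            zones, instance_types, operating_systems
        )
    ]


# Sample values chosen for illustration only.
pools = enumerate_pools(
    zones=["us-east-1a", "us-east-1b"],
    instance_types=["m5.large", "m5.4xlarge", "c5.xlarge"],
    operating_systems=["Linux/UNIX", "Windows"],
)
print(len(pools))  # 2 zones x 3 types x 2 operating systems = 12
```

Even this tiny catalog yields 12 separate pools, which gives a sense of how the real market reaches roughly 15,000.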

With billions of events collected by our platform, Spot has access to this unique data. Coupled with our predictive rebalancing algorithm, this lets us predict, with up to 90% accuracy, which spot instances will be interrupted and which will have greater longevity, giving our customers the lowest-cost cloud compute and an enterprise-level SLA for high availability.

Predictive rebalancing is a set of features that continuously optimize the mix of underlying instances to ensure cost optimization with high availability. One of its core components is Spot market availability scoring. Scores are determined based on capacity and behavior, as well as dynamic variables like long-term seasonality (e.g., the Black Friday period) and short-term changes (e.g., an interruption that occurred a minute ago). Rebalancing features, including minimum instance lifetime and the ability to revert to preferred configurations, rely heavily on availability scores to execute actions.

Figure 1: Choosing spot instances with the greatest longevity. Based on a scale of 0 to 100 and dynamic thresholds that change according to market trends, application behavior, and customer preferences, Spot by NetApp’s Elastigroup and Ocean decide which markets to run on at a given point in time.

Availability scoring as a data source for optimization

AWS recently announced EC2 Spot placement score, a new capability that gives customers some insight into the availability of capacity pools. AWS placement scoring uses a scale from 1 to 10, with 10 indicating that a spot request is highly likely to succeed and 1 indicating that it is not likely to succeed.

While this availability scoring improves the experience of creating and updating configurations of autoscaling groups that use spot instances, it only provides insight into launch-instance operations and does not address the market stability considerations needed for continuous optimization and automation. AWS allows users to query scores via an API, but reserves the right to limit API calls if it detects “patterns not associated with the intended use of the Spot placement score feature.”
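As a rough illustration of what querying the AWS placement score looks like, here is a minimal sketch built around the boto3 EC2 client method get_spot_placement_scores. The helper name rank_regions_by_placement_score is hypothetical, and the stub client with its scores is made up so the example runs without AWS credentials. In real use you would pass boto3.client("ec2") instead.

```python
def rank_regions_by_placement_score(ec2_client, instance_types, target_capacity, regions):
    """Return (region, score) pairs from the Spot placement score API, best first.

    ec2_client is expected to behave like a boto3 EC2 client.
    """
    response = ec2_client.get_spot_placement_scores(
        InstanceTypes=instance_types,
        TargetCapacity=target_capacity,
        RegionNames=regions,
    )
    pairs = [
        (item["Region"], item["Score"])
        for item in response["SpotPlacementScores"]
    ]
    # Scores run from 1 to 10; a higher score means the request is more likely to succeed.
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


class StubEc2Client:
    """Stand-in for a boto3 EC2 client; the scores below are invented."""

    def get_spot_placement_scores(self, **kwargs):
        return {
            "SpotPlacementScores": [
                {"Region": "us-east-1", "Score": 7},
                {"Region": "eu-west-1", "Score": 9},
            ]
        }


ranked = rank_regions_by_placement_score(
    StubEc2Client(), ["m5.4xlarge"], 10, ["us-east-1", "eu-west-1"]
)
print(ranked)  # [('eu-west-1', 9), ('us-east-1', 7)]
```

Because the score describes a single request at a single moment, a caller would need to poll repeatedly to track stability over time, which is the pattern the API usage limits discourage.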

Market scores are dynamically updated based on constant changes in market capacity and cost. Because usage varies across all AWS customers, a market’s score can differ greatly between the time a group was configured and peak times. Handling this drift is one of the significant benefits of Spot’s predictive rebalancing, which continuously monitors market scores and updates each group’s market distribution in accordance with current and predicted availability and costs.

Figure 2: An abstract overview of Spot’s machine learning algorithm for predicting market scores. For simplicity, this high-level diagram illustrates only a small fraction of the features employed. There are many more in practice.

Spot’s algorithm takes many statistics into account when calculating scores. As shown in Figure 2, the inputs to the models are as follows:

High-frequency signals, e.g., the time since the last interruption, the stability of the sampled period one hour ago, and the number of interruptions one hour ago;
Long-range seasonal signals, such as the day of the week or holidays;
Global pool statistics, such as the number of on-demand instances versus the number of spot instances currently running in the market;
Metadata, to learn the correlation between different families (e.g., is it a GPU instance?).

By combining all of these statistics, we can predict future behavior based on correlations. To this end, we employ state-of-the-art deep learning algorithms and gradient boosting models trained on years of data.
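To show how the four input groups above might be assembled into a single record for a model, here is a small sketch. It covers feature assembly only, not the models themselves. The field names, the choice of a one-hour lookback window, and the use of an on-demand-to-spot ratio are assumptions made for illustration and do not describe Spot’s actual feature set.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class MarketFeatures:
    """Hypothetical feature record for one market at one point in time."""
    seconds_since_last_interruption: float  # high-frequency signal
    interruptions_last_hour: int            # high-frequency signal
    day_of_week: int                        # seasonal signal, 0 = Monday
    on_demand_to_spot_ratio: float          # global pool statistic
    is_gpu: bool                            # metadata


def build_features(now, interruption_times, on_demand_count, spot_count, is_gpu):
    """Assemble one feature record from raw observations of a market."""
    past = [t for t in interruption_times if t <= now]
    since_last = (now - max(past)).total_seconds() if past else float("inf")
    window_start = now - timedelta(hours=1)
    recent = sum(1 for t in past if t > window_start)
    ratio = on_demand_count / spot_count if spot_count else float("inf")
    return MarketFeatures(since_last, recent, now.weekday(), ratio, is_gpu)


now = datetime(2021, 12, 13, 12, 0)  # a Monday
features = build_features(
    now,
    interruption_times=[
        now - timedelta(hours=3),
        now - timedelta(minutes=30),
        now - timedelta(minutes=1),
    ],
    on_demand_count=300,
    spot_count=100,
    is_gpu=False,
)
print(features)
```

Records like this, collected per market over time, are the kind of tabular input that a gradient boosting model could be trained on.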

Market scores are continuously monitored, and based on them rebalancing can carry out the following activities:

Replacement operations – Predictive rebalancing continuously monitors market scores and decides whether the current market for a specific instance is stable or whether the instance should be replaced.
Scale-up operations – Predictive rebalancing checks the scores of potential markets for newly launched instances, which are needed either (i) to add capacity as part of a scaling activity, or (ii) for a replacement operation triggered by the predicted availability of an instance or by user preferences, such as reverting to a preferred AZ, reverting to a preferred instance type, or reverting to RIs/SPs once available.
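The replacement check can be pictured with a minimal sketch like the following. The function name, the flat market labels, the sample scores, and the fixed threshold are all hypothetical. In the product the threshold is dynamic and more signals are involved.

```python
def instances_to_replace(running_instances, market_scores, threshold):
    """Return IDs of instances whose current market scores below the threshold.

    running_instances maps instance ID -> market label.
    market_scores maps market label -> score on the 0 to 100 scale.
    A market with no known score is treated as unstable (an assumption
    made for this sketch).
    """
    return sorted(
        instance_id
        for instance_id, market in running_instances.items()
        if market_scores.get(market, 0) < threshold
    )


# Invented sample data.
running = {
    "i-aaa": "us-east-1a/m5.4xlarge",
    "i-bbb": "us-east-1b/m5.4xlarge",
    "i-ccc": "us-east-1c/c5.xlarge",
}
scores = {
    "us-east-1a/m5.4xlarge": 35,
    "us-east-1b/m5.4xlarge": 82,
}
print(instances_to_replace(running, scores, threshold=50))  # ['i-aaa', 'i-ccc']
```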

Predictive rebalancing checks the scores and determines the best markets to run on, based on user-defined and configured preferences.

For example, suppose a user has specified a preferred AZ (e.g., us-east-1a) and a preferred instance type (e.g., m5.4xlarge). Predictive rebalancing will consider the user’s preferences and the market scores when performing the launch-instance operation:

Spot continuously calculates a market availability score for every market based on the recent behavior of all of our customers and long-term seasonality;
Based on the scoring, predictive rebalancing sorts the user’s preferred markets;
If the score is low for all the preferred markets, predictive rebalancing will try to launch the instance in one of the other configured AZs or spot instance types;
If there is no stable market (i.e., all the configured markets are below a dynamic threshold), predictive rebalancing will fall back to on-demand in order to maintain the required capacity and keep the application available.
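The launch steps above can be sketched as a short selection routine. This is a simplified illustration under stated assumptions: markets are (AZ, instance type) tuples, "low" is read as "below the threshold", the threshold is passed in as a plain number rather than computed dynamically, and the function name and sample scores are invented.

```python
ON_DEMAND = "on-demand"


def choose_launch_market(scores, preferred, configured, threshold):
    """Pick a spot market for a new instance, or fall back to on-demand.

    scores maps market -> availability score on the 0 to 100 scale.
    preferred and configured are lists of markets; preferred ones are tried first.
    """
    def best_stable(markets):
        # Keep only markets at or above the threshold, then take the top score.
        stable = [m for m in markets if scores.get(m, 0) >= threshold]
        return max(stable, key=lambda m: scores.get(m, 0)) if stable else None

    choice = best_stable(preferred)
    if choice is None:
        others = [m for m in configured if m not in preferred]
        choice = best_stable(others)
    return choice if choice is not None else ON_DEMAND


# Invented sample scores.
scores = {
    ("us-east-1a", "m5.4xlarge"): 35,
    ("us-east-1b", "m5.4xlarge"): 82,
    ("us-east-1a", "m5.2xlarge"): 64,
}
preferred = [("us-east-1a", "m5.4xlarge")]
configured = list(scores)

print(choose_launch_market(scores, preferred, configured, threshold=50))
# ('us-east-1b', 'm5.4xlarge')
print(choose_launch_market(scores, preferred, configured, threshold=90))
# on-demand
```

With a threshold of 50 the preferred market is skipped in favor of the best-scoring configured alternative. With a threshold of 90 no market qualifies, so the routine returns the on-demand fallback.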

The above example describes, in general terms, some of the decisions and considerations involved in an instance launch operation. This is only a small fraction of the entire predictive rebalancing feature set, which optimizes the group throughout its lifetime. Spot’s predictive algorithms help our customers continuously optimize their costs while ensuring availability.

To get started, learn more about Elastigroup or Ocean.