Stateful workloads: How to guarantee savings and continuity

Yarden Kesari

Product Manager

Li-Or Amir

Senior Product Marketing Manager

April 26, 2023



Stateful workloads require consistent access to specific network and disk artifacts. Because they tolerate few interruptions, it’s no surprise that these workloads can be costly to run. As a result, organizations face a challenging trade-off between consistency and cost-efficiency as their cloud estates scale and cost concerns grow.

However, running stateful workloads doesn’t have to mean overprovisioned compute or expensive, unpredictable bills. And while virtual machines (VMs) can be costly to run when compared with container architectures, there are still ways to save when running stateful VMs.

Identifying use cases for stateful workloads that can run on spot VMs (in Azure) or spot instances (in AWS) is a game changer in an organization’s cloud strategy. Here’s how you can safely run some stateful workloads and save up to 90% with spot compute.

Stateful applications: Definition, basic requirements, and use cases

In cloud computing, stateful (or “state”) refers to the consistency of the attachments that help a workload (e.g., an application or microservice) function despite the ephemeral nature of the cloud. These mission-critical workloads may be used in product development or serve a core function of the working software product.

A stateful application will save data — such as user preferences, logs, or other actions — from previous sessions. By contrast, a stateless application does not retain this data. Operations will run in a stateless application as if they are being processed for the first time.

Stateful attachments can include:

Stateful network: Consistent access to a fixed IP address
Stateful OS: Consistent operating system configurations (e.g., system files and their location, setup files, user settings and permissions, etc.)
Stateful data: Sustained access to all other non-OS disks

Tech companies, in particular, are taking advantage of stateful workloads, especially among the FinTech, data, gaming, and media streaming services categories. They might use one or more of the above attachments to run stateful workloads such as:

Data layer: e.g., Mongo, Cassandra, MySQL, Elasticsearch
Dev machines
QA and testing environments
Application servers in production
Gaming applications
Machine learning (ML) training
Media servers

Many enterprises will typically run stateful workloads on legacy architecture — monolith applications running on VMs. In cloud-native start-up and scale-up companies, however, stateful workloads may be containerized. Running stateful applications through Kubernetes is mostly done using StatefulSets.

Running stateful VMs: Considerations and costs

Stateful workloads can also run directly on VMs. Naturally, companies tend to choose cloud resources that are natively stateful. In other words, once you provision them, the continuity and availability of their attachments are guaranteed.

There are a few different pricing options for natively stateful VMs. In both Amazon Web Services (AWS) and Microsoft Azure, these are either on-demand or commitment-based machines (i.e., Reserved Instances or Savings Plans). On-demand pricing is usually the highest, while Reserved Instances (RIs) and Savings Plans can be less costly when utilized effectively.

However, the excess cloud capacity that both AWS and Azure offer can provide even more significant savings. AWS spot instances and Azure spot VMs are the cheapest form of compute, offered at discounts of up to 90% compared with on-demand pricing.

Making excess capacity truly stateful

There are four critical requirements that allow stateful workloads to run almost uninterrupted on spot machines:

1. Persistence

When attachments (e.g., data disk and network) persist through machine replacement, workload recovery time decreases considerably. From the files and network point of view, the workload is ready to pick up where it stopped. In ML workloads, you can even restore the complete training state, provided you write checkpoint files often enough.
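
As a minimal sketch of the checkpoint idea, assume the checkpoint file lives on a data disk that survives machine replacement. The file path, the state fields, and the toy training loop below are hypothetical and only illustrate the pattern: write state atomically, then resume from the last saved step after an eviction.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then atomically swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace is atomic when both paths are on the same filesystem,
    # so an eviction mid-write never leaves a half-written checkpoint.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if none exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "loss": None}


def train(path, total_steps, checkpoint_every=10, stop_at=None):
    """Toy loop that resumes from the checkpoint; stop_at simulates an eviction."""
    state = load_checkpoint(path)
    for step in range(state["step"], total_steps):
        if stop_at is not None and step == stop_at:
            return state  # machine evicted: progress since the last save is lost
        state = {"step": step + 1, "loss": 1.0 / (step + 1)}
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

Checkpointing more often shrinks the work lost per eviction at the price of more disk writes, so the interval is a tuning choice rather than a fixed rule.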

2. Longevity

This involves choosing spot markets where the probability of machine eviction is relatively low to minimize machine replacements in the first place. However, this might require operating outside of your usual regions or availability zones.

3. Rebalancing

Rebalancing helps a workload keep processing with minimal interruption. To truly minimize downtime, you need the ability to predict an impending machine eviction and spin up a replacement in advance.
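
The ordering matters here: the replacement should be up and healthy before the old machine is retired. The sketch below illustrates that ordering only. The risk score, the threshold, and the four callables are hypothetical hooks that a caller would supply; they are not part of any provider API.

```python
def rebalance(eviction_risk, launch, wait_healthy, reattach, retire,
              risk_threshold=0.5):
    """Proactively replace a machine when predicted eviction risk is high.

    launch, wait_healthy, reattach, and retire are hypothetical hooks.
    Returns the ordered list of steps that actually ran.
    """
    steps = []
    if eviction_risk < risk_threshold:
        return steps  # risk is acceptable: keep the current machine

    replacement = launch()
    steps.append("launch")

    if not wait_healthy(replacement):
        return steps  # replacement never became healthy: keep the old machine

    steps.append("healthy")
    reattach(replacement)  # move the IP and data disk to the replacement
    steps.append("reattach")
    retire()               # only now let go of the old machine
    steps.append("retire")
    return steps
```

Because the old machine is retired last, a failed launch leaves the workload where it was instead of creating a gap.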

4. Fallback

Falling back to on-demand or paid commitments can help guarantee workload continuity even in the case of complete spot unavailability. For continuous optimization, you might want to automate the transition back to spot machines once they’re available again. An enterprise-grade solution that automates the use of spot instances is a must for this task.
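
The longevity and fallback requirements can be combined into one selection rule. The following is a sketch under stated assumptions: the market names, the eviction-rate estimates, and the 0.2 cutoff are made up for illustration, and a real system would pull them from provider or vendor data.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SpotMarket:
    name: str             # hypothetical label, e.g. region plus VM size
    eviction_rate: float  # assumed estimate of eviction probability, 0..1
    available: bool       # whether spot capacity can be obtained right now


def choose_capacity(markets: List[SpotMarket],
                    max_eviction_rate: float = 0.2) -> str:
    """Pick the available spot market with the lowest eviction rate,
    or fall back to on-demand when no acceptable market exists."""
    candidates = [
        m for m in markets
        if m.available and m.eviction_rate <= max_eviction_rate
    ]
    if not candidates:
        return "on-demand"
    best = min(candidates, key=lambda m: m.eviction_rate)
    return "spot:" + best.name


def should_revert_to_spot(current: str, markets: List[SpotMarket],
                          max_eviction_rate: float = 0.2) -> bool:
    """True when running on-demand while an acceptable spot market exists again."""
    return (current == "on-demand"
            and choose_capacity(markets, max_eviction_rate) != "on-demand")
```

Running `should_revert_to_spot` on a schedule is one simple way to automate the move back to spot capacity after a fallback.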

When to use Spot VMs for a stateful workload

Spot VMs are suitable for any workload with some downtime tolerance. Internal workloads like personal dev machines or testing environments are good places to start. For production, it’s a financial risk management question: do the compute savings exceed the cost of downtime?
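
One rough way to put numbers on that question is to compare the monthly compute saving with the expected cost of eviction downtime. Every input in this sketch is a hypothetical estimate you would supply yourself; none of the figures come from AWS, Azure, or Spot.

```python
def monthly_net_saving(on_demand_hourly, spot_discount, hours_per_month,
                       evictions_per_month, downtime_minutes_per_eviction,
                       downtime_cost_per_minute):
    """Spot compute saving minus the expected cost of eviction downtime.

    spot_discount is a fraction (0.7 means spot is 70% cheaper).
    A positive result means spot still comes out ahead.
    """
    compute_saving = on_demand_hourly * spot_discount * hours_per_month
    downtime_cost = (evictions_per_month
                     * downtime_minutes_per_eviction
                     * downtime_cost_per_minute)
    return compute_saving - downtime_cost


# Made-up example: a $0.20/hour VM at a 70% discount running 730 hours,
# with 4 evictions a month, 5 minutes lost per eviction, at $2 per minute.
net = monthly_net_saving(0.20, 0.70, 730, 4, 5, 2.0)
```

With those made-up numbers, the saving of about $102 outweighs roughly $40 of downtime. Raising the downtime cost to $10 per minute flips the result, which is the point where on-demand or commitments become the safer choice.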

Therefore, organizations need to be aware of the cost-availability tradeoff with spot VMs. Notably, they can be evicted at very short notice (around two minutes on AWS and as little as 30 seconds on Azure), resulting in the workload stopping. This makes running critical workloads on these low-cost resources risky. Plus, they are not natively stateful; some hyperscalers may automate machine replacement, but without persisting the IP, OS, or data disk.

Rightsizing stateful VMs with automation

VM rightsizing is the balancing act between machine size and workload reliability. Effectively rightsizing VMs requires thinking first about the performance needs of applications running on those VMs. Consider utilization metrics and trends around CPU, memory, network, and disk use to ensure that downsizing the VM won’t affect workload performance.

Another helpful measure is to use automation to set thresholds, so you know when it’s time to rightsize VMs. As a rule of thumb, if usage regularly exceeds 80%, consider sizing up; if it stays below 20% for a sustained period, you can probably downsize. Monitoring how the application uses server CPU and memory helps determine the best instance type and size to run on.
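
The threshold rule can be sketched as a small function over utilization samples for a single metric. The 80% and 20% values come from the rule of thumb above; the `sustained` parameter is an assumption standing in for "a sustained period", and treating any sample above 80% as a reason to size up is a deliberately simple choice.

```python
def rightsizing_recommendation(samples, high=0.80, low=0.20, sustained=0.95):
    """Suggest 'upsize', 'downsize', or 'keep' from utilization samples.

    samples: utilization readings for one metric, each between 0 and 1.
    sustained: assumed share of samples that must sit below `low`
               before a downsize is suggested.
    """
    if not samples:
        return "keep"  # no data, no change
    if max(samples) > high:
        return "upsize"
    below = sum(1 for u in samples if u < low) / len(samples)
    if below >= sustained:
        return "downsize"
    return "keep"
```

In practice you would evaluate CPU, memory, network, and disk separately and only downsize when every metric agrees.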

VMs are a common culprit of overspending in the cloud, and storage is another area that can require rightsizing to keep costs down. Third-party tools can recommend and automatically deploy the most cost-efficient disk type and file system size that meets the application’s performance needs. Rightsizing storage also means locating and removing waste, such as idle disks or disks with more IOPS provisioned than they use.

Support and optimize your stateful workloads

Most cloud optimization solutions will offer to run stateful workloads while leveraging commitments/reservations. However, this will not get you the highest savings possible. Spot VMs will. The only problem: how do you keep the VM state?

Spot recently made Azure Stateful Node generally available. Stateful Node adds IP and data persistence to excess-capacity compute, making it suitable for multiple stateful use cases. This way, enterprises and scale-up organizations can instantly get more compute power from their existing budget without compromising availability.
