Stop & Resume EC2 Spot Workloads – 3 Important things to know

AWS has recently announced a new capability that adds a “Termination behavior” property for Spot Instances. This property can be set to either “Stop” or “Terminate”. “Stop” means that upon a Spot interruption (and only then) your Spot Instance will be put into the “stopped” state rather than terminated.
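As a rough illustration of where this property lives in a launch request, here is a minimal sketch using boto3's `run_instances` call with Spot market options. The helper names `build_spot_launch_params` and `launch`, and the AMI ID, instance type, region and Availability Zone values, are placeholders made up for this example. The “stop” behavior is requested together with a persistent Spot request and assumes an EBS-backed instance.

```python
def build_spot_launch_params(image_id, instance_type, availability_zone):
    """Build keyword arguments for EC2 run_instances that ask for a
    Spot Instance which is stopped (not terminated) on interruption."""
    return {
        "ImageId": image_id,
        "InstanceType": instance_type,
        "MinCount": 1,
        "MaxCount": 1,
        # The instance and its EBS volumes live in this single AZ.
        "Placement": {"AvailabilityZone": availability_zone},
        "InstanceMarketOptions": {
            "MarketType": "spot",
            "SpotOptions": {
                # "stop" is requested together with a persistent Spot request.
                "SpotInstanceType": "persistent",
                "InstanceInterruptionBehavior": "stop",
            },
        },
    }


def launch(params, region_name):
    """Send the request to EC2. Requires boto3 and AWS credentials."""
    import boto3  # imported here so the builder above works without boto3

    ec2 = boto3.client("ec2", region_name=region_name)
    return ec2.run_instances(**params)


if __name__ == "__main__":
    # Placeholder values; replace with your own AMI, type and AZ.
    params = build_spot_launch_params("ami-12345678", "m4.large", "us-east-1a")
    print(params["InstanceMarketOptions"])
    # launch(params, "us-east-1")  # uncomment to actually launch
```

Keeping the parameter builder separate from the API call makes it easy to inspect or log the request before anything is launched.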

Stop is not “Pause” – Only valid for Spot Interruption

If you have been asking yourself, “Can I really ‘pause’ my Spot Instance now?”, the answer is no. AWS has enabled this feature only in the event of a Spot interruption, meaning that you can’t just “Stop” a running Spot Instance and “pause” it for a certain amount of time.
We gave it a try and confirmed that you can’t perform a “Stop” to intentionally pause your Spot Instance.

Spot instance cannot be stopped

A “Pause” feature is available via Elastigroup – please check this out.

Single-AZ Support Only

Upon interruption, AWS will try to launch a new Spot Instance based on your Spot Instance configuration, but only in the same Availability Zone. This is because EBS volumes always exist within a particular AZ. If a Spot Instance cannot be obtained in that AZ (due to a Spot capacity shortage), your request might ‘hang’ and wait until Spot capacity becomes available again.

In case you wondered, Elastigroup supports a Multi-AZ recovery mode that uses incremental snapshots to migrate EBS volumes to another AZ upon termination (if necessary).

Downtime Considerations

If you plan to run a stateful service that makes use of checkpoints, Stop & Resume is a perfect match for you. However, if you need the capacity to be available constantly, there is a risk that capacity won’t become available in your specific AZ. To solve that, you might want to recover to a different AZ, or just use On-Demand Instances until Spot capacity becomes available.
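The trade-off above can be sketched as a small decision rule. This is only an illustration under assumed inputs, not an AWS or Elastigroup API: `FallbackPolicy`, `Action` and `next_action` are hypothetical names, and the thresholds are something you would choose based on how much downtime your service tolerates.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    KEEP_WAITING = "keep_waiting"          # stay stopped until Spot capacity returns
    RECOVER_OTHER_AZ = "recover_other_az"  # move the data and relaunch in another AZ
    USE_ON_DEMAND = "use_on_demand"        # run On-Demand until Spot is back


@dataclass
class FallbackPolicy:
    # Hypothetical knobs for this sketch.
    max_wait_seconds: int   # how long the service may stay stopped
    can_move_volumes: bool  # True if EBS data can be snapshotted to another AZ


def next_action(policy: FallbackPolicy, stopped_for_seconds: int) -> Action:
    """Decide what to do with an instance that was stopped by a Spot interruption."""
    if stopped_for_seconds < policy.max_wait_seconds:
        return Action.KEEP_WAITING
    if policy.can_move_volumes:
        return Action.RECOVER_OTHER_AZ
    return Action.USE_ON_DEMAND


if __name__ == "__main__":
    policy = FallbackPolicy(max_wait_seconds=600, can_move_volumes=False)
    print(next_action(policy, stopped_for_seconds=120))
    print(next_action(policy, stopped_for_seconds=900))
```

A checkpointing batch job could set a large `max_wait_seconds` and simply wait, while a service that needs constant capacity would use a small value so that it falls back quickly.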

Here is a great use case that makes use of Stop, Pause and Resume of Spot Instances – Running a cluster on Spot.