Deep learning is gaining in popularity for many AI use cases such as Computer Vision, Speech Recognition, Natural Language Processing, Recommendation Engines and more.
Not surprisingly, cloud computing is a major enabler for deep learning models. Cloud providers such as AWS offer a broad range of EC2 instance types with varying GPU and CPU specs, where you manage everything yourself. They also offer fully managed AI platforms such as SageMaker, where you can deploy your machine learning models without dealing with the underlying infrastructure.
In this post, however, we will focus on the do-it-yourself approach: deploying deep learning models on EC2 spot instances, so you keep control of the underlying infrastructure while dramatically reducing cost.
EC2 has it all. But at what price?
In general, developing and training deep learning neural network models doesn’t require much effort on AWS. You simply choose an instance with the right GPU or CPU for your workload, pick the relevant platform and an AWS custom ML AMI, clone your code to the instance and start running.
However, training neural networks can take hours or even days, depending on the complexity of the model and the size of the dataset. If you check EC2 pricing, you will see that GPU instances can be extremely expensive. This little financial detail, if not handled properly, can turn your machine learning project into a huge line item on your next AWS bill.
The solution is to use spot instances. Spot instances are an AWS pricing model that offers a discount of up to 90% compared to on-demand pricing for the exact same instance.
The only caveat is that AWS can pull the plug on your spot instances with just a 2-minute warning. This is clearly not ideal for deep learning models, but as we’ll see, with Spot, you can have your proverbial cake and eat it too.
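To put the "up to 90%" figure in perspective, here is a minimal arithmetic sketch. The hourly price and the run length are hypothetical numbers chosen only for illustration, not real EC2 prices, and 90% is the best case rather than a guaranteed discount.

```python
def training_cost(hourly_price, hours, discount=0.0):
    """Return the cost of running one instance for `hours` hours.

    `discount` is a fraction between 0 and 1 (0.9 means 90% off).
    """
    if not 0.0 <= discount <= 1.0:
        raise ValueError("discount must be between 0 and 1")
    return hourly_price * hours * (1.0 - discount)


# Hypothetical numbers, for illustration only (not real EC2 prices).
ON_DEMAND_HOURLY = 1.00   # assumed on-demand price in USD per hour
TRAINING_HOURS = 48       # assumed length of one training run

on_demand = training_cost(ON_DEMAND_HOURLY, TRAINING_HOURS)
spot_best_case = training_cost(ON_DEMAND_HOURLY, TRAINING_HOURS, discount=0.9)

print("on-demand: $%.2f" % on_demand)
print("spot (90%% off): $%.2f" % spot_best_case)
```

With these assumed inputs, a two-day run drops from 48 dollars to under 5 dollars in the best case, which is why handling interruptions is worth the effort.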
Getting started with training neural networks on spot instances
As mentioned before, spot instances offer a discount of up to 90% for the same instance type. However, they come with some significant challenges, which Spot handles:
- Spot instances are not persistent as AWS can interrupt them at any time. Therefore, they are not recommended for time-sensitive workloads.
- Unplanned instance termination can cause data loss if the training progress is not saved properly.
To address these issues, Spot offers Managed Instance, a solution for running stateful workloads on spot instances. It lets you train your deep learning models on spot instances with full persistence for root and data volumes as well as private and public IPs, along with automated instance recovery.
Principles: training neural networks over spot instances
Now that we have covered the basics of spot instances, let’s focus on some specific principles to keep in mind for machine learning projects over spot instances.
- Decouple compute, storage and code. By decoupling the different components, we can control each part of the training process. The compute instance will be stateless, while the data will persist throughout the training process.
- Use a dedicated volume for datasets, checkpoints, saved models and logs. This ensures that the data persists and that interruptions do not affect the training process.
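The second principle can be sketched as a small helper that keeps every piece of training state under a single root on the persistent volume. This is only an illustration: the `datasets` and `logs` directory names are assumptions, while `checkpoints` and `model` mirror the /dl/checkpoints and /dl/model paths used later in this post. It assumes Python 3.

```python
import os


def prepare_training_dirs(volume_root):
    """Create (if needed) and return the directories kept on the persistent volume."""
    dirs = {
        name: os.path.join(volume_root, name)
        # "datasets" and "logs" are assumed names; the other two follow this post.
        for name in ("datasets", "checkpoints", "model", "logs")
    }
    for path in dirs.values():
        # exist_ok=True makes the call safe to repeat after every instance recovery.
        os.makedirs(path, exist_ok=True)
    return dirs
```

On the instance, the training script would call something like `prepare_training_dirs("/dl")` at startup and read and write only under the returned paths, so nothing important lives on the stateless root disk.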
Let’s get this machine learning project started on Spot
Here is what we need to get going with Spot:
- An active Spot account connected to AWS (https://help.spot.io/managed-instances/).
- A GitHub repository with your training code.
Starting with the infrastructure:
- Go to your Spot account and start a new Managed Instance. In our case, we will persist only the data volume, since we will not make any special changes to the OS.
- Start configuring your Managed Instance
When configuring your Managed Instance you should pay attention to the following fields:
- Region: if you want to use GPUs, pay attention to the region, since not all AWS regions support GPU instances. https://docs.aws.amazon.com/dlami/latest/devguide/gpu.html
- Image: Deep Learning AMI (ami-0027dfad6168539c7) is an Amazon machine image with pre-installed deep learning frameworks.
- Key pair: a key pair must be configured in order to connect to the instance via SSH.
- Choose your market: Recommended Instance type: g2.2xlarge (Basic GPU instance type)
- As discussed before, we should maintain data persistence in order to prevent unnecessary data loss on interruption. The re-attach policy guarantees that the same EBS volume is reattached to the new spot instance when an interruption occurs.
- Configure user data as follows: data_persistency_userdata.sh
- The new instance will be created without any volumes. In order to maintain persistence, we need to create and attach a new volume. This only needs to be done the first time. Let Spot automate the rest of the process for you.
aws ec2 create-volume \
  --size 20 \
  --region <REGION> \
  --availability-zone <AZ> \
  --volume-type gp2

aws ec2 attach-volume \
  --volume-id <volume id> \
  --instance-id <instanceid> \
  --device /dev/xvdb
- Now you should see your instance connected to the volume (SSH to your machine and validate that the instance has mounted the volume correctly).
- After attaching the volume, recycle your instance (For User-Data ml-actions).
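The validation step in the list above (checking that the volume is really mounted) can also be done from the training script itself, so a run never silently writes checkpoints to the root disk. A minimal sketch follows; the helper name is hypothetical, and the mount point you pass in depends on how your user data mounts the volume.

```python
import os


def volume_is_ready(mount_point):
    """Return True if `mount_point` is a mounted filesystem that we can write to."""
    # os.path.ismount is False for plain directories and for missing paths,
    # which is exactly the failure we want to catch before training starts.
    return os.path.ismount(mount_point) and os.access(mount_point, os.W_OK)
```

A training script could call `volume_is_ready("/dl")` first and exit with an error if it returns False, assuming /dl is where the data volume is mounted.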
Deep learning example:
Our example trains a convolutional neural network on the MNIST dataset. The MNIST dataset is an image dataset of handwritten digits. It has 60,000 training images and 10,000 test images, each of which is a 28 x 28 grayscale image.
(trained model files in the /dl/model directory)
Here we will mainly focus on the code infrastructure. As we saw before, deep learning code cannot run on spot instances out of the box. Some changes to the code are needed before it can run.
The full training script can be found here: train_network.py
- Prepare the environment: when using Keras on AWS, we first need to activate the conda environment. To do so, add the following command to the user data.
sudo -H -u ubuntu bash -c "source /home/ubuntu/anaconda3/bin/activate tensorflow_p27; python train_network.py"
- Using checkpoints and callbacks
When running deep learning on spot instances, we need to save checkpoints continuously through callbacks, so that training can carry on across interruptions without any data loss.
What are Checkpoints and Callbacks?
A callback is a set of functions to be applied at given stages of the training procedure.
The relevant methods of the callback will be called at each stage or epoch of the training process.
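To make the callback idea concrete, here is a toy illustration in plain Python. It is not the Keras API: the `Callback` and `EpochRecorder` classes and the `toy_training_loop` function are hypothetical stand-ins that only mimic how a training loop hands control to callbacks at the end of every epoch.

```python
class Callback:
    """Base class: the training loop calls these hooks at fixed stages."""

    def on_epoch_end(self, epoch, logs):
        pass


class EpochRecorder(Callback):
    """Example callback that records the loss reported after each epoch."""

    def __init__(self):
        self.history = []

    def on_epoch_end(self, epoch, logs):
        self.history.append((epoch, logs["loss"]))


def toy_training_loop(epochs, callbacks):
    """Stand-in for a real training loop; no actual learning happens here."""
    for epoch in range(epochs):
        loss = 1.0 / (epoch + 1)  # placeholder value instead of a real loss
        for callback in callbacks:
            callback.on_epoch_end(epoch, {"loss": loss})


recorder = EpochRecorder()
toy_training_loop(3, [recorder])
```

A checkpoint callback follows the same pattern, except that its end-of-epoch hook writes the model to disk instead of appending to a list.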
The most important callback when training a neural network on spot instances is the checkpoint callback:
Checkpoints are snapshots of your training values at a specific time point.
The checkpoints are configured as part of the callback function and are written at every epoch.
The checkpoint file stores all the training parameters, including the current training value and weights.
The checkpoints should be saved on the persistent volume. During “volume reattach”, the model should be loaded from the latest checkpoint that was created in the checkpoints folder.
Creating checkpoint callback function:
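A minimal sketch of such a callback, using the `ModelCheckpoint` callback from Keras, is shown here. The file naming scheme and the helper names are assumptions made for illustration and may differ from the author's train_network.py; the directory is the /dl/checkpoints folder used in this post.

```python
import os

CHECKPOINT_DIR = "/dl/checkpoints"        # checkpoint folder used in this post
CHECKPOINT_NAME = "model-{epoch:03d}.h5"  # hypothetical file naming scheme


def checkpoint_pattern(checkpoint_dir=CHECKPOINT_DIR):
    """Return the path template that Keras fills in with the epoch number."""
    return os.path.join(checkpoint_dir, CHECKPOINT_NAME)


def make_checkpoint_callback(checkpoint_dir=CHECKPOINT_DIR):
    """Build a callback that saves the full model at the end of every epoch."""
    # Imported lazily so the path helper above works without Keras installed.
    from keras.callbacks import ModelCheckpoint

    return ModelCheckpoint(
        filepath=checkpoint_pattern(checkpoint_dir),
        save_weights_only=False,  # save the full model, not only the weights
        save_best_only=False,     # keep a checkpoint for every epoch
        verbose=1,
    )
```

The callback would then be passed to training, for example `model.fit(x_train, y_train, epochs=20, callbacks=[make_checkpoint_callback()])`, where `model`, `x_train`, `y_train` and the epoch count are placeholders for your own objects.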
The checkpoint directory in the middle of the training:
How the Checkpoint helps maintain persistence during interruptions:
As mentioned above, the main challenge with spot instances is that AWS can interrupt and terminate them on short notice. To overcome this issue, the checkpoint callback runs at every epoch and saves the network's statistics and weights in the checkpoint directory on the persistent volume (the /dl/checkpoints folder in the example).
That way, we always know where we stopped and can continue exactly from that point.
In the screenshot below you can see that when the checkpoints exist in the volume, the model and the epoch number will be taken from there. Otherwise the model will be built from scratch.
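The resume logic described here can be sketched as follows. It assumes checkpoints are full Keras models saved as `model-<epoch>.h5`, which is a hypothetical naming scheme, and that `build_model` is your own function returning a fresh, compiled model. The author's train_network.py may do this differently.

```python
import os
import re

CHECKPOINT_RE = re.compile(r"^model-(\d+)\.h5$")  # hypothetical naming scheme


def latest_checkpoint(checkpoint_dir):
    """Return (path, epoch) of the newest checkpoint, or None if there is none."""
    if not os.path.isdir(checkpoint_dir):
        return None
    found = []
    for name in os.listdir(checkpoint_dir):
        match = CHECKPOINT_RE.match(name)
        if match:
            found.append((int(match.group(1)), name))
    if not found:
        return None
    epoch, name = max(found)  # highest epoch number wins
    return os.path.join(checkpoint_dir, name), epoch


def load_or_build(checkpoint_dir, build_model):
    """Return (model, initial_epoch): resume if possible, else start from scratch."""
    latest = latest_checkpoint(checkpoint_dir)
    if latest is None:
        return build_model(), 0
    # Imported lazily so the helpers above work without Keras installed.
    from keras.models import load_model

    path, epoch = latest
    return load_model(path), epoch
```

Training would then continue with something like `model.fit(..., initial_epoch=initial_epoch, epochs=total_epochs)`, so that epoch numbering, and therefore checkpoint naming, picks up where the interrupted run left off.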
The full example can be found here:
I hope you have found this blog post useful in keeping your machine learning project within budget. Feel free to reach out to the Spot team with any questions or feedback you might have.
*GitHub is an open source repository and the code shown there was provided by the author of this blog. The code is subject to change and is only an example of a simple use case, intended as a baseline for deep learning projects.