Zalando Runs Mission-Critical Workloads on Spot Instances with Spot
Zalando, a European e-commerce company, follows a platform approach, offering fashion and lifestyle products to customers in 17 European markets. For a company operating at Zalando's scale, growth comes with one common caveat: increased cloud infrastructure costs. As part of their ongoing effort to optimize and control cloud costs across more than 200 cloud accounts and thousands of developers, they explored the option of using Amazon EC2 Spot Instances.
Once Zalando decided to use EC2 Spot Instances, they faced a new challenge: provisioning, operating and, more importantly, guaranteeing availability for their applications on Spot capacity. They had two main concerns:
- The two-minute warning before a Spot termination is not enough for the vast majority of their applications.
- An entire cluster could be lost at the same time, leaving critical services unavailable or running with degraded performance.
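To make the first concern concrete, here is a minimal Python sketch of the check an application would need to make when a Spot interruption notice arrives. It assumes the notice body has the JSON shape that EC2 serves at the instance metadata path `spot/instance-action` (an `action` and a UTC `time`). The function names and the drain times in the usage note are illustrative assumptions, not anything Zalando or Spot describe.

```python
import json
from datetime import datetime, timezone

# The notice body is assumed to look like the JSON EC2 returns from
# http://169.254.169.254/latest/meta-data/spot/instance-action, e.g.
# {"action": "terminate", "time": "2024-01-01T00:02:00Z"}


def seconds_until_interruption(notice_body: str, now: datetime) -> float:
    """Return the seconds left before the scheduled action (never negative)."""
    notice = json.loads(notice_body)
    when = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ")
    when = when.replace(tzinfo=timezone.utc)
    return max(0.0, (when - now).total_seconds())


def can_drain_in_time(notice_body: str, now: datetime, drain_seconds: float) -> bool:
    """True if the application could finish draining before the interruption."""
    return seconds_until_interruption(notice_body, now) >= drain_seconds
```

With a notice scheduled two minutes ahead, a hypothetical service that needs 90 seconds to drain would make it, while one that needs ten minutes would not, which is the gap Zalando were worried about.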
The SRE team at Zalando looked for ways to use EC2 Spot Instances across all of their environments. They began searching for a solution that would help them reduce costs, minimize operational work and be easy to implement across their 200+ business units. One option was to develop a solution in-house; another was to look for a third-party platform that already did this. That is when they found Elastigroup by Spot and decided to try it, mainly because of three key features:
- Termination prediction up to 15 minutes in advance, allowing them to run more complex environments without the risk of downtime
- The ability to automatically fall back to On-Demand instances
- Abstraction of the complexity of using multiple instance types and sizes, which decreases the risk of losing instances and increases the chances of obtaining capacity
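The last two features can be illustrated with a small Python sketch: spread the requested capacity across several Spot instance types and fill any shortfall with On-Demand instances. This is a simplified illustration under stated assumptions, not Elastigroup's actual algorithm. The `Placement` type, the `plan_capacity` function and the availability numbers are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Placement:
    instance_type: str
    lifecycle: str  # "spot" or "on-demand"


def plan_capacity(target: int,
                  spot_available: Dict[str, int],
                  on_demand_type: str) -> List[Placement]:
    """Spread `target` instances round-robin across Spot pools,
    then cover whatever is still missing with On-Demand instances."""
    remaining = dict(spot_available)  # copy so the caller's dict is untouched
    placements: List[Placement] = []
    while len(placements) < target:
        progressed = False
        for itype in sorted(remaining):
            if len(placements) == target:
                break
            if remaining[itype] > 0:
                remaining[itype] -= 1
                placements.append(Placement(itype, "spot"))
                progressed = True
        if not progressed:  # every Spot pool is exhausted
            break
    shortfall = target - len(placements)
    placements.extend(Placement(on_demand_type, "on-demand")
                      for _ in range(shortfall))
    return placements
```

Taking one instance from each pool in turn means that losing a single Spot pool removes only part of the cluster, and the On-Demand fallback keeps the target capacity intact when Spot capacity runs out.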
Testing the waters with Spot using a complex environment
When Zalando decided to start a POC with Spot, they deliberately selected a use case more complex than simple stateless applications behind a load balancer, in order to truly test the power of the platform. They decided to try Spot's support for stateful workloads and run Cassandra nodes on Elastigroup. “We decided to see without any investment of time on our side, just to test if they can deliver what they promise,” said Luis Mineiro, Site Reliability Engineer at Zalando.
Spot Elastigroup allows customers to persist an instance’s storage, network configuration and state. In a Cassandra cluster each node is identified by its IP address, so it is crucial that nodes keep their configuration even when a Spot instance is replaced. With this capability, and having experienced high availability and long instance lifetimes, Zalando were able to confirm that the solution was reliable for them and roll it out across their entire company.
Seamless integration with existing CI/CD tools
At Zalando, more than 200 independent teams each use different tooling for provisioning and deployments, including CloudFormation. Finding a tool that could integrate seamlessly with their existing tools, without requiring teams to change the way they work, was at the top of their list.
Spot natively supports AWS CloudFormation, which allowed Zalando to provision and manage everything programmatically in exactly the way their developers and teams were used to. With a short implementation time, they were able to adapt their CloudFormation (CFN) templates to work directly with Elastigroup.
Deploying Spot across 200 teams in production
After a successful POC in which Zalando proved the cost savings, ease of use and reliability of Elastigroup, they started rolling it out across all of their teams. Spot was able to support their scale of 200 development teams managing over 300 AWS accounts and to integrate automatically with their SSO. Thanks to Elastigroup, Zalando can focus more on creating quality applications for their customers and spend less time worrying about the underlying infrastructure.