Chegg Cuts EC2 Costs by 70% While Simplifying ECS Infrastructure Management
Challenge – Empowering Engineering Teams While Containing EC2 Costs
Chegg began their textbook rental business in 2005, running primarily as a monolithic web application hosted in a colocation data center. After a forklift migration to AWS, they quickly realized that despite all the available instance types and flexible EC2 pricing, simply “running in the cloud” was not enough to drive product agility and innovation.
Chegg embraced microservices and containers using AWS ECS to empower their various engineering teams to develop more without being held up by their centralized operations team or any other dependencies.
Steve Evans, VP of Engineering Services at Chegg, recalled that “as soon as we unlocked the gates, the number of services being launched in ECS skyrocketed, which was great for Engineer productivity, but we needed to seriously manage our growing Opex.”
While previously Chegg had managed to run most of their legacy application on just ~600 EC2 instances, they now had ~1200 EC2 instances to pay for.
Just as significantly, Chegg’s lean operations team was stretched thin managing ECS and the underlying EC2 infrastructure.
Solution – Abstracting Infra Management Affordably
With Chegg running their microservices on stateless containers, using Spot Elastigroup to automate spot instance provisioning and management for their ECS workload was a no-brainer.
Spot’s enterprise-level SLAs and “pay-as-you-save” pricing model made it an easy sell to both the Finance and Operations teams.
Chegg started using Spot in 2016 and achieved a 70% EC2 cost reduction in a short time. Their operations team, now free from “micro-managing” all the infra, could handle more critical issues, while the engineering team coded and launched containers at will.
“Spot abstracts away all the nitty-gritty details of managing spot and reserved instances for ECS, with their support team and online resources providing us with in-depth ECS expertise,” described Evans. He added, “When we want best practices and tips for standing up and managing ECS, Spot is our first call.”
In 2017, AWS conducted around 1,400 EC2 Maintenance Events for the Spectre Intel Patch. This normally would have required Chegg’s operations team to handle the scheduling and execution of those maintenance events.
However, with Spot, all of this was completely automated, with business availability and continuity unaffected and without the usual operational headache.
When it came to understanding the cost of specific ECS Services, Applications and Tasks, Spot provided unrivalled visibility with comprehensive and clear cost allocation dashboards.
Evans concluded, “Chegg’s successful adoption of microservices and containers, in large part, can be attributed to Spot keeping our infra cost and management to a bare minimum.”
Results and Benefits
Some of the benefits that were realized by Chegg include:
Infrastructure Abstraction Freed Up Time for More Important Operations
Whether handling dynamic changes in ECS Tasks and associated resource requests for CPU and Memory, or simply dealing with planned EC2 interruptions, Spot’s workload automation eliminated the need for Chegg’s operations team to manage all the infrastructure configurations.
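To make the idea of Task resource requests driving capacity concrete, here is a minimal sketch of how a lower bound on instance count could be derived from CPU and memory requests. It is an illustration only, not Spot's actual sizing logic; the function name `instances_needed` and the example figures are hypothetical. CPU is expressed in ECS CPU units (1024 units per vCPU) and memory in MiB.

```python
import math

def instances_needed(tasks, instance_cpu, instance_mem):
    """Lower bound on instances of one size needed to host the tasks.

    tasks: list of (cpu_units, memory_mib) requests, one tuple per task.
    instance_cpu / instance_mem: capacity of a single instance.
    This ignores bin-packing fragmentation, so a real scheduler may
    need more instances than the number returned here.
    """
    total_cpu = sum(cpu for cpu, _ in tasks)
    total_mem = sum(mem for _, mem in tasks)
    # The scarcer resource (CPU or memory) determines the count.
    return max(
        math.ceil(total_cpu / instance_cpu),
        math.ceil(total_mem / instance_mem),
    )

# Hypothetical example: five tasks of 0.5 vCPU and 1 GiB each,
# on instances offering 2 vCPU and 8 GiB.
count = instances_needed([(512, 1024)] * 5, instance_cpu=2048, instance_mem=8192)
```

In this example CPU is the binding resource, so the estimate is driven by the CPU total rather than by memory.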
Over 70% Savings on EC2 with Reliable Usage of Spot and Intelligent Utilization of Unused RIs
Using Spot, Chegg not only optimized their EC2 cost by leveraging spot instances, but was also able to recoup their investment in unused, but fully paid-for, Reserved Instances.
Whenever Spot identified an unused RI, it immediately moved a relevant workload from a spot instance to the RI, thereby generating even greater cost efficiency.
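The RI-first behavior described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not Spot's implementation: the `Workload` type, the `move_to_unused_reservations` function, and the matching rule (same instance type only) are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Workload:
    name: str
    instance_type: str
    lifecycle: str  # "spot" or "reserved"

def move_to_unused_reservations(workloads, unused_ris):
    """Reassign spot workloads to matching idle reservations.

    unused_ris maps an instance type to the number of idle,
    already-paid-for Reserved Instances of that type.
    Returns the names of the workloads that were moved.
    """
    remaining = dict(unused_ris)  # copy so the caller's dict is not mutated
    moved = []
    for w in workloads:
        if w.lifecycle == "spot" and remaining.get(w.instance_type, 0) > 0:
            w.lifecycle = "reserved"
            remaining[w.instance_type] -= 1
            moved.append(w.name)
    return moved

# Hypothetical fleet: one idle m4.large reservation is available.
fleet = [
    Workload("a", "m4.large", "spot"),
    Workload("b", "m4.large", "spot"),
    Workload("c", "c4.xlarge", "spot"),
]
moved = move_to_unused_reservations(fleet, {"m4.large": 1})
```

Only the first matching workload moves here, because the single idle reservation is consumed once it has been assigned.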
Unparalleled Visualization of ECS Cost and Infra Allocation
As is typical for large organizations, keeping track of what various departments and applications consume in the cloud can be a challenge. But with Spot, the Chegg finance and operations teams had comprehensive visibility into their ECS infra costs and usage broken down by Services, Applications and Tasks, making for easy showback.
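One common way to produce a per-Service breakdown like this is to split each instance's cost across the Tasks running on it in proportion to their CPU reservations. The sketch below assumes that allocation rule purely for illustration; it is not necessarily how Spot's dashboards compute cost, and the function name, service names, and price are hypothetical.

```python
from collections import defaultdict

def showback_by_service(instance_hourly_cost, tasks):
    """Split one instance's hourly cost across services by CPU share.

    tasks: list of (service_name, cpu_units) for tasks on the instance.
    Returns a dict mapping service name to its share of the cost.
    """
    total_cpu = sum(cpu for _, cpu in tasks)
    if total_cpu == 0:
        return {}
    costs = defaultdict(float)
    for service, cpu in tasks:
        costs[service] += instance_hourly_cost * cpu / total_cpu
    return dict(costs)

# Hypothetical instance costing $0.10 per hour and running three tasks.
shares = showback_by_service(
    0.10, [("search", 512), ("checkout", 256), ("search", 256)]
)
```

Here the two "search" tasks together hold three quarters of the reserved CPU, so that service is charged three quarters of the instance cost.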
Ongoing Cluster Optimization During Weekly Blue/Green Deployment
With Spot Cluster Roll, modifying AMIs, Startup Scripts, etc. can be accomplished in a single click, with actual Task workload needs driving the precise deployment of the right number and type of instances.
Quick Time-to-Value with Full Access to 24/7 Support
The Chegg team received step-by-step guidance from their Spot support and customer success managers on best practices for standing up their ECS workloads, improving Terraform templates and troubleshooting any issues along the way. This white-glove service helped Chegg achieve rapid cost optimization while erasing the irritating minutiae that their Operations team had to handle in the past.