Freshworks integrates with Spot to automate AWS Opsworks on spot instances
Since its founding, Freshworks’ product suite has expanded to 11 products, including:
Freshsales, for sales teams; Freshrelease, for project management; and Freshcaller, a solution for call centers.
Freshworks has raised $250 million, to date, and is backed by Accel, Sequoia Capital, and CapitalG.
Freshworks now has over 2,000 employees across 10 global offices and over 150,000 customers in 197 countries.
Freshworks’ Cloud Team:
The site reliability engineering (SRE) team is the group within Freshworks responsible for the availability, performance, latency, efficiency, change management, monitoring, emergency response, and capacity planning of Freshworks’ cloud infrastructure.
In the company’s early years, the SRE team focused on developing and optimizing the underlying platform infrastructure in order to support the release of new products and prepare for scale, as there was a steady increase in the number of customers.
As of today, the SRE team is managing:
- Thousands of servers on AWS (CloudFront, Route53, KMS, DynamoDB, Opsworks, etc.)
- 0.5 million requests per minute (peak traffic)
- 4 million DB reads per minute
- 15TB logs per day
Adopting a cost-efficient mindset
As the company scaled, Freshworks’ AWS bill grew too, and the company set out to optimize its cloud infrastructure costs.
“In the beginning, we had never allocated a budget for our infrastructure costs, as it was considered part of the operational costs of running our applications, but as the company grew, we realized that cost efficiency is becoming a necessity when running at scale,” said Pradeep Thangavel, Engineering Manager, Site Reliability Engineering.
To optimize cloud spending, the SRE team was given an infrastructure budget and was required to meet an uptime SLA for the Freshworks platform within that budget.
“Before we were introduced to Spot, the only main cost-saving strategy we were able to adopt was purchasing prepaid RIs (Reserved Instances),” said Pradeep Thangavel.
AWS EC2 Reserved Instances provide a significant discount (up to 75%) compared to On-Demand instances in exchange for a one- or three-year commitment.
However, purchasing RIs is a financial commitment. Moreover, the team’s RIs were not fully utilized, and RIs do not help with scaling out during peak traffic.
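The utilization point can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes a reserved instance is billed for every hour of its term whether or not it is used, and the rates and utilization figures are hypothetical, not Freshworks’ numbers.

```python
def effective_ri_cost_per_used_hour(on_demand_rate: float,
                                    ri_discount: float,
                                    utilization: float) -> float:
    """Cost per hour of actual use for an RI that is paid for around the clock.

    on_demand_rate: hourly On-Demand price (hypothetical value).
    ri_discount:    RI discount as a fraction, e.g. 0.75 for 75%.
    utilization:    fraction of reserved hours actually used, in (0, 1].
    """
    if not 0 <= ri_discount < 1:
        raise ValueError("ri_discount must be in [0, 1)")
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    ri_rate = on_demand_rate * (1 - ri_discount)
    # Unused hours are still paid for, so the cost is spread over used hours only.
    return ri_rate / utilization


def break_even_utilization(ri_discount: float) -> float:
    """Utilization below which the RI costs more per used hour than On-Demand."""
    return 1 - ri_discount


if __name__ == "__main__":
    # Hypothetical: $1.00/hour On-Demand, 75% RI discount, instance used half the time.
    print(effective_ri_cost_per_used_hour(1.00, 0.75, 0.5))
    print(break_even_utilization(0.75))
```

Under these assumptions, an RI that sits idle half the time effectively doubles its hourly rate, and a fleet sized for peak traffic spends most of its time well below full utilization.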
Spot Automation – The Challenge
Freshworks considered spot instances the most effective cost-reduction approach, with the potential to reduce infrastructure costs by up to 80%.
Spot instances are AWS’ spare EC2 instances that are offered to the market at a significant discount of up to 90% (compared to On-Demand). Spot instances may be used for a large variety of workloads, and are also useful when workloads need to scale.
However, the AWS frameworks that form the backbone of Freshworks’ architecture were not well suited to running on spot instances, and modifying the architecture was not a viable option.
One of Freshworks’ main framework components is ‘AWS Opsworks’, an AWS configuration management service that the SRE team uses to automate the configuration of Freshworks products.
“The main challenge for us was to integrate both spot instances and AWS Opsworks to work together because each has its own lifecycle,” said Pradeep.
Had the SRE team started using spot instances for their Opsworks EC2 instances on their own, they would have been burdened with the overhead of managing instance termination and launch while maintaining the same configuration.
On top of that, AWS gives only a two-minute interruption notice before terminating a spot instance, which could leave the application unavailable, breach the uptime SLA, and directly impact the company’s business.
“After thorough research, we came to realize that reliably managing spot instances is a massive automation challenge for us,” said Pradeep Thangavel.
Elastigroup by Spot – Spot Automation Solution
To address these challenges, Freshworks looked for a fully managed spot automation solution that would meet the company’s requirements.
When the SRE team was introduced to Elastigroup by Spot, they were impressed with the fact that it easily integrated with the frameworks they were using, and the migration was a ‘one-time setup’.
Elastigroup by Spot for AWS is a SaaS platform that provisions, manages, and scales compute infrastructure, saving up to 80% on cloud-compute costs by reliably leveraging spot instances for EC2 workloads.
To integrate with Opsworks, the SRE team mapped Opsworks layers directly to Elastigroups. Every Opsworks layer, which comprises hundreds of EC2 instances, is mapped to its own dedicated Elastigroup.
“Each of our R&D teams operates on a separate individual framework, and luckily the integration between Spot and our workloads on Opsworks, EKS, and Rancher, was seamless,” said Pradeep.
Handling Spot Interruptions:
“In most cases, Spot was able to predict an interruption 15 minutes prior to AWS’ notification and immediately scheduled the EC2 Spot Instance for replacement, and in edge cases, we were even notified 20 minutes in advance,” said Pradeep.
When the SRE team first designed the company’s cloud architecture, they took into consideration many factors in order to provide support for scaling EC2 instances, thus adding several layers of elasticity to their infrastructure when running on spot instances.
Freshworks’ cloud workloads were ideal candidates to run on spot instances because 90% of them are stateless applications that can tolerate interruptions.
Apart from that, the SRE team has dedicated a lot of time and effort to adjusting their startup and shutdown scripts so that they properly drain and disconnect EC2 instances from the load balancer and, in parallel, spin up a new EC2 instance with the same configuration.
This smooth scaling mechanism gave them comfortable headroom to handle and tolerate Spot interruptions.
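As a rough illustration of the drain-on-interruption pattern described above, the sketch below polls for an AWS Spot interruption notice and calls a drain hook once one appears. It is not Freshworks’ actual script. The metadata path and the JSON shape of the notice follow AWS’ documented instance-action format; `fetch` and `drain` are hypothetical callables supplied by the caller (for example, an HTTP GET against the metadata URL, with an IMDSv2 token where required, and a load-balancer deregistration routine).

```python
import json
import time
from datetime import datetime, timezone
from typing import Callable, Optional

# Documented EC2 instance metadata path for Spot interruption notices.
SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"


def parse_instance_action(body: Optional[str]) -> Optional[datetime]:
    """Return the scheduled interruption time, or None if no notice is pending."""
    if not body:
        return None  # the endpoint returns no document until a notice is issued
    notice = json.loads(body)
    if notice.get("action") not in ("terminate", "stop", "hibernate"):
        return None
    deadline = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ")
    return deadline.replace(tzinfo=timezone.utc)


def watch_for_interruption(fetch: Callable[[], Optional[str]],
                           drain: Callable[[datetime], None],
                           poll_seconds: float = 5.0,
                           max_polls: Optional[int] = None,
                           sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until a notice appears, then call drain(deadline) once.

    fetch: hypothetical callable returning the response body from
           SPOT_ACTION_URL, or None when no notice exists.
    drain: hypothetical hook that deregisters the instance from the load
           balancer and lets in-flight requests finish before the deadline.
    Returns True if a notice was handled, False if max_polls ran out first.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        deadline = parse_instance_action(fetch())
        if deadline is not None:
            drain(deadline)
            return True
        polls += 1
        sleep(poll_seconds)
    return False
```

Passing `fetch`, `drain`, and `sleep` in as parameters keeps the loop testable without a live instance. A managed service such as Elastigroup sits in front of this kind of logic by replacing the instance before the notice arrives.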
Spot was founded in 2015 and is striving to revolutionize the way companies manage and orchestrate their cloud-compute workloads.
Spot was a natural choice for Freshworks because Spot commits to a 99.9% uptime SLA, which directly addressed the SRE team’s challenge of ensuring a 99.8% uptime SLA for Freshworks’ applications.
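To put those percentages in perspective, an uptime SLA can be converted into a downtime budget. The helper below is a small illustrative calculation; the 30-day measurement window is an assumption, since the source does not state how either SLA is measured.

```python
def allowed_downtime_minutes(sla_percent: float, period_days: int = 30) -> float:
    """Minutes of downtime permitted by an uptime SLA over a period.

    period_days defaults to 30, an assumed measurement window.
    """
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


if __name__ == "__main__":
    for sla in (99.8, 99.9):
        print(f"{sla}% uptime allows {allowed_downtime_minutes(sla):.1f} minutes of downtime per 30 days")
```

Over 30 days, 99.8% uptime allows roughly 86.4 minutes of downtime and 99.9% allows roughly 43.2 minutes, so the platform’s commitment is tighter than the application SLA built on top of it.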
On top of that, Elastigroup’s ‘Fall-back to On-Demand’ feature keeps the cluster highly available by falling back to an On-Demand instance when the spot market is unstable or has no capacity for that instance type.
“Our Journey with Spot started in 2016 with a small PoC, and as our confidence in the platform grew stronger, we gradually on-boarded more AWS accounts, and today we are running hundreds of EC2 instances across several AWS accounts with Elastigroup by Spot,” said Pradeep.
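The general idea behind an On-Demand fallback can be sketched as follows. This is a conceptual illustration only and does not represent Elastigroup’s implementation or API: `try_spot` and `launch_on_demand` are hypothetical callables standing in for whatever provisioning layer is in use.

```python
from typing import Callable, Optional, Sequence, Tuple


def launch_with_fallback(spot_types: Sequence[str],
                         try_spot: Callable[[str], Optional[str]],
                         launch_on_demand: Callable[[str], str],
                         on_demand_type: str) -> Tuple[str, str]:
    """Try each candidate spot instance type in order, then fall back to On-Demand.

    try_spot:         hypothetical callable; returns an instance ID, or None
                      when no spot capacity is available for that type.
    launch_on_demand: hypothetical callable; returns an instance ID.
    Returns a (lifecycle, instance_id) pair, where lifecycle is "spot" or
    "on-demand".
    """
    for instance_type in spot_types:
        instance_id = try_spot(instance_type)
        if instance_id is not None:
            return ("spot", instance_id)
    # No spot capacity in any candidate type: pay the On-Demand price
    # rather than leave the cluster short of capacity.
    return ("on-demand", launch_on_demand(on_demand_type))
```

Spreading requests across several candidate instance types before falling back reduces how often the more expensive On-Demand path is taken.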
Immediate Cost reduction and visibility:
After the initial workload migration to Elastigroup, the SRE team immediately saw a large reduction in cloud-compute spending: an average saving of 65% on the instances migrated so far, compared with running them solely as On-Demand instances.
In addition, the Elastigroup dashboard gave them deeper visibility into their cloud-compute costs, allowing them to stay on top of their spending at any given time.