Lingk.io is a data loading, data pipeline, and integration platform built on top of Apache Spark. It serves commercial customers and has particular expertise in the education sector. Its visual interface makes it easy to load, deduplicate, and enrich data from dozens of sources, and to promote projects from development to production in a few clicks. The team wanted to migrate away from EMR to reduce their AWS costs, improve their customer experience, and streamline operational work for the data team.
By migrating from EMR to Ocean for Apache Spark, Lingk achieved all three: its customers now enjoy ~2x faster Spark applications, its AWS bill has decreased by 65%, and its team spends less time managing infrastructure and more time expanding their Spark-based data integration platform!
As a data integration platform, Lingk makes it easy for its customers to run Spark jobs, whether it’s for ad-hoc projects or for automatically scheduled production data pipelines, making Apache Spark core to Lingk’s business.
The data engineering team at Lingk had several challenges working with EMR:
Ocean for Apache Spark is deployed on a managed Kubernetes cluster (EKS) inside Lingk’s AWS account. It automatically scales the cluster up and down based on load and tunes the Spark configurations based on historical data. Instead of HDFS, S3 is used for intermediate storage, with fast access guaranteed by optimized S3 committers.
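To make the S3 committer point concrete, here is a minimal sketch of the kind of Spark settings that enable the open-source S3A committers. It is not Lingk’s actual configuration, and Ocean for Apache Spark may set equivalent options on its own. The property names are standard Spark and Hadoop settings (they need Spark’s `spark-hadoop-cloud` module on the classpath), while the choice of the "magic" committer and the helper `to_spark_submit_flags` are assumptions made for illustration.

```python
# Illustrative only: property names are standard Spark/Hadoop S3A committer
# settings; the chosen values (e.g. the "magic" committer) are assumptions,
# not Lingk's or Ocean for Apache Spark's actual configuration.
S3A_COMMITTER_CONF = {
    # Which S3A committer to use ("directory", "partitioned" or "magic").
    "spark.hadoop.fs.s3a.committer.name": "magic",
    "spark.hadoop.fs.s3a.committer.magic.enabled": "true",
    # Route s3a:// output through the S3A committer factory.
    "spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a":
        "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory",
    # Bind Spark SQL and Parquet writes to the path output committers.
    "spark.sql.sources.commitProtocolClass":
        "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol",
    "spark.sql.parquet.output.committer.class":
        "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter",
}


def to_spark_submit_flags(conf):
    """Hypothetical helper: render a config dict as spark-submit --conf flags."""
    flags = []
    for key in sorted(conf):  # sorted for a stable, reproducible order
        flags += ["--conf", f"{key}={conf[key]}"]
    return flags


if __name__ == "__main__":
    print(" ".join(to_spark_submit_flags(S3A_COMMITTER_CONF)))
```

These committers avoid the slow rename step that the default file output committer performs on object stores, which is why they matter when S3 replaces HDFS.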
Lingk’s team no longer has to manage clusters. They simply submit Dockerized Spark apps through the Ocean for Apache Spark REST API and enjoy a serverless experience. The team has control over the Docker images used by Spark, which brings three additional benefits:
The automated configuration tuning enabled several performance optimizations:
The migration from EMR to Ocean for Apache Spark was a big win:
Lingk was also able to gradually upgrade Spark to 3.0, a step made easy by the Spark-on-Kubernetes architecture and its native Dockerization. The team at Lingk can now confidently expand their data integration platform toward ambitious new use cases.
Thanks to the migration, Lingk’s AWS costs decreased by 65%, the application startup time was halved, and the average app duration decreased by 40%.