Weather2020 extracts weather data from public agencies around the globe in a variety of formats, including industry-specific formats that are poorly suited to big data tooling. They work with over 40 years of meteorological and geospatial time series data.
They needed pipelines to pull this data, clean it, enrich it, aggregate it, and store it in a cloud-based data lake in a cost-effective way. The data is then ready to be consumed by multiple downstream data products.
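The clean → enrich → aggregate flow can be sketched in miniature with plain Python. This is an illustrative example only: the station IDs, field names, and sample values are hypothetical, and Weather2020's actual pipelines run at terabyte scale on Apache Spark rather than on in-memory lists.

```python
from statistics import mean

# Hypothetical raw station records; real source formats vary by agency.
raw = [
    {"station": "KDEN", "date": "2021-07-01", "temp_c": "31.2"},
    {"station": "KDEN", "date": "2021-07-01", "temp_c": "n/a"},  # bad reading
    {"station": "KDEN", "date": "2021-07-02", "temp_c": "33.8"},
]

def clean(records):
    """Drop rows whose temperature cannot be parsed as a number."""
    out = []
    for r in records:
        try:
            out.append({**r, "temp_c": float(r["temp_c"])})
        except ValueError:
            pass
    return out

def enrich(records, station_coords):
    """Attach geospatial metadata keyed by station id."""
    return [{**r, **station_coords.get(r["station"], {})} for r in records]

def aggregate(records):
    """Compute daily mean temperature per station."""
    groups = {}
    for r in records:
        groups.setdefault((r["station"], r["date"]), []).append(r["temp_c"])
    return {key: mean(temps) for key, temps in groups.items()}

coords = {"KDEN": {"lat": 39.86, "lon": -104.67}}
daily = aggregate(enrich(clean(raw), coords))
```

Each stage maps naturally onto a Spark transformation (filter, join, groupBy) once the data outgrows a single machine.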
Weather2020’s team had solid data engineering skills and deep domain knowledge around extracting and modeling weather data, but no prior experience with Apache Spark.
“EMR required too much setup and maintenance work. We didn’t want to spend our time writing bash scripts to manage and configure it. Databricks felt like a casino. It didn’t seem like the right product for our technical team, and their steep pricing ruled them out for us.” – Max, Lead Data Engineer @ Weather2020
Spot’s Ocean for Apache Spark lowered the barrier to entry for Apache Spark by making it more developer-friendly, while minimizing infrastructure costs thanks to its autopilot features.
Enabling dynamic allocation and using i3 instances with large NVMe SSDs brought the most significant performance improvements given the scale of the pipelines (shuffle-heavy jobs processing terabytes of data).
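As a sketch, a configuration along these lines enables dynamic allocation and points shuffle spill at local SSD storage. The property names come from Spark's standard configuration reference; the values and mount path are illustrative assumptions, not Weather2020's actual settings.

```
# spark-defaults.conf (illustrative values)
spark.dynamicAllocation.enabled                 true
# Shuffle tracking allows dynamic allocation on Kubernetes
# without an external shuffle service (Spark 3.0+).
spark.dynamicAllocation.shuffleTracking.enabled true
spark.dynamicAllocation.minExecutors            2
spark.dynamicAllocation.maxExecutors            20
# Spill shuffle data to the i3 instance's NVMe SSD
# (assumed mount point shown here).
spark.local.dir                                 /mnt/nvme
```

Dynamic allocation lets the cluster release idle executors between pipeline stages, which is where much of the cost saving comes from.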
Deliver Projects Faster: It took only 3 weeks to build and productionize terabyte-scale data ingestion pipelines, without prior experience with Apache Spark.
Keep Costs In Check: Performance optimizations, encouraged by a fair pricing structure, achieved a 60% reduction in total cost of ownership compared to Databricks.
A Flexible and Scalable Architecture: Ocean Spark is deployed on a managed, autoscaled Kubernetes (EKS) cluster inside Weather2020’s AWS account, scaling up to 20 instances.