Shuffle Data Store Now Available for Ocean for Apache Spark™

NetApp is excited to announce the addition of Shuffle Data Store for Ocean for Apache Spark™. Customers running on AWS can now benefit from improved resilience and efficiency for multi-step data preparation pipelines that require Apache Spark to shuffle data. The external shuffle solution is backed by either Amazon FSx for NetApp ONTAP storage or Amazon S3 storage, each configured for Spark workloads. With Shuffle Data Store for Ocean for Apache Spark, customers can complete data engineering on massive data sets sooner and extract value from analytics faster.

Prior to the introduction of Shuffle Data Store, data preparation pipelines experienced delays when shuffle data stored locally on the nodes was lost due to node failures or spot kills. These delays wasted time and compute resources because the affected pipeline steps had to be repeated. The advent of AI applications has increased the volume of data in our customers’ data preparation pipelines, which leads to more shuffle data loss and makes an external shuffle solution more critical. Furthermore, Spark applications were unable to scale down efficiently because Spark Dynamic Allocation was less effective when data accumulated on executors.
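To make the cost of lost shuffle data concrete, the following is a rough back-of-the-envelope sketch, not a description of how Spark or Ocean for Apache Spark accounts for failures. It assumes a simplified model in which the lost node held a fixed fraction of the shuffle output of every earlier stage, and it ignores the in-flight tasks that must be rerun in either case. The function name, stage durations, and fraction are all hypothetical.

```python
from typing import List


def repeated_minutes(stage_minutes: List[float], failed_stage: int,
                     lost_fraction: float, external_shuffle: bool) -> float:
    """Estimate minutes of earlier work repeated after a node is lost.

    Simplifying assumption (not from the announcement): the lost node held
    `lost_fraction` of the shuffle output of every stage before
    `failed_stage`, and that share must be recomputed unless the shuffle
    data is stored outside the cluster.
    """
    if not 0 <= failed_stage < len(stage_minutes):
        raise ValueError("failed_stage must index an existing stage")
    if not 0.0 <= lost_fraction <= 1.0:
        raise ValueError("lost_fraction must be between 0 and 1")
    if external_shuffle:
        # Shuffle output survives the node loss, so nothing is recomputed.
        return 0.0
    return lost_fraction * sum(stage_minutes[:failed_stage])


# Hypothetical four-stage pipeline; a node is lost during the last stage.
stages = [30, 45, 20, 60]
local = repeated_minutes(stages, failed_stage=3, lost_fraction=0.25,
                         external_shuffle=False)
remote = repeated_minutes(stages, failed_stage=3, lost_fraction=0.25,
                          external_shuffle=True)
print(f"local shuffle: {local} min repeated, external shuffle: {remote} min")
```

Under these assumed numbers, a single node loss late in the pipeline repeats a quarter of the 95 minutes of earlier work when shuffle data is local, and none of it when the shuffle data is stored externally.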

With Shuffle Data Store, shuffle data is now automatically persisted outside the Spark cluster on remote storage. Because data is progressively saved to the external shuffle data store, the Spark application can recover quickly from a node failure without repeating compute steps. Additionally, NetApp has contributed to the open-source plugin to improve its interaction with the Dynamic Allocation feature of Apache Spark.
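For readers who configure Spark themselves, the sketch below shows the general shape of pairing Dynamic Allocation with a pluggable shuffle storage backend. The configuration keys are standard Apache Spark properties; the plugin class name is a hypothetical placeholder, the helper functions are illustrative, and the announcement does not state which settings the managed service applies on your behalf.

```python
from typing import Dict, List


def external_shuffle_conf(plugin_class: str, min_executors: int = 0,
                          idle_timeout_s: int = 60) -> Dict[str, str]:
    """Build Spark properties for Dynamic Allocation plus a shuffle plugin.

    `plugin_class` is whatever shuffle I/O plugin your platform provides;
    the value used in the example below is a made-up placeholder.
    """
    return {
        # Let Spark add and remove executors as the workload changes.
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        # How long an executor may sit idle before it is released.
        "spark.dynamicAllocation.executorIdleTimeout": f"{idle_timeout_s}s",
        # Spark's extension point for replacing local-disk shuffle storage.
        "spark.shuffle.sort.io.plugin.class": plugin_class,
    }


def to_spark_submit_args(conf: Dict[str, str]) -> List[str]:
    """Render the properties as `--conf key=value` arguments."""
    args: List[str] = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


# Hypothetical plugin class name, for illustration only.
conf = external_shuffle_conf("com.example.HypotheticalShuffleDataIO")
args = to_spark_submit_args(conf)
print(" ".join(args))
```

Keeping the properties in a plain dictionary makes it easy to pass the same settings either to `spark-submit` or to a session builder, whichever your deployment uses.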

“With Shuffle Data Store, analytic results are now automatically persisted on the external file system, ensuring they are not lost even when cluster nodes are lost. This approach allows customers to fully leverage the cost-saving benefits of Spark Dynamic Allocation features, as the cluster can scale down nodes when they are not in use and contain no data. Additionally, complex multi-step data preparation pipelines can now get the benefit of running on lower-cost spot instances,” said Paul Aubrey, Director of Product Management, Instaclustr by NetApp.

With Shuffle Data Store, NetApp provides an external shuffle storage solution for your Spark applications using AWS file storage (FSx for NetApp ONTAP) and object storage (S3). Azure storage and Google storage options are coming soon. The shuffle data store is suited to complex, high-volume Spark shuffle workloads, offering improved performance and reliability when running on spot instances.

Experience the benefits of NetApp’s managed Ocean for Apache Spark solution by signing up today and running your Spark applications. Benefit from our optimized infrastructure, delivering the lowest cost and highest performance for your Spark workloads. Alternatively, schedule a meeting with our team of experts to discuss your specific use case and explore how we can help optimize Spark for your organization. Take the next step towards efficient and scalable data processing by getting started with us now.