Dynamic Kubernetes PVC reuse with Ocean for Apache Spark

How to reuse dynamic Kubernetes PersistentVolumeClaims (PVCs) with Ocean for Apache Spark


One of the most exciting implementations to come out of Apache Spark™ on Kubernetes is the dynamic creation, mounting, and remounting of a PersistentVolumeClaim (PVC) within a Spark application.

We’ve spoken previously about how PVCs can be used to recover shuffle data and prevent application failure after a spot kill has occurred, negating one of the biggest drawbacks of spot instance usage in Spark workloads. Previously, a race condition existed that prevented the PVCs from being released in time to be picked up by the newly created executor node, but thanks to some improvements in Spark 3.4, we can confirm that dynamic PVC reuse is operating as expected.

In this article, we will discuss what dynamic PVC reuse is and how you can configure Ocean for Apache Spark to take advantage of this powerful feature – ensuring your data engineering applications don’t lose valuable data.


PVC is a Kubernetes abstraction for a storage resource

In Kubernetes, PVC stands for PersistentVolumeClaim. It is an abstraction that allows users to request and use storage resources in a portable manner. PVCs provide a way for applications running in Kubernetes pods to request specific storage resources, such as disks or volumes, without needing to know the details of the underlying infrastructure.

When an application requires storage, it can create a PVC object and specify the desired storage characteristics like size, access mode, and storage class. The PVC is then bound to an available PersistentVolume (PV) that satisfies the requested characteristics.
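As a rough illustration of the request described above, the sketch below builds a PVC manifest as a plain Python dict and prints it as JSON, which kubectl accepts alongside YAML. The claim name, the 100Gi size, and the helper name build_pvc_manifest are assumptions made for the example; only the 'standard' storage class name appears elsewhere in this article.

```python
import json


def build_pvc_manifest(name, size, storage_class, access_mode="ReadWriteOnce"):
    """Return a PersistentVolumeClaim manifest as a plain dict."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            # How the volume may be mounted (here: read-write by a single node).
            "accessModes": [access_mode],
            # Which class of storage should satisfy the claim.
            "storageClassName": storage_class,
            # How much storage is requested.
            "resources": {"requests": {"storage": size}},
        },
    }


# Hypothetical values, chosen only for illustration.
manifest = build_pvc_manifest("example-claim", "100Gi", "standard")
print(json.dumps(manifest, indent=2))
```

With dynamic PVCs in Spark you do not write this manifest yourself; the driver creates the equivalent claims on your behalf.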


PVCs can improve Spark application performance

Dynamic PVC reuse was introduced in Spark 3.2. When the Spark application starts, the driver creates PVCs and provisions volumes for the application's executors. The Spark driver handles creation, provisioning, mounting, and removal, so your application has the storage it needs while it is processing, and the resources are released after it finishes. You pay for storage only while the application runs, and the application stays decoupled from the details of the underlying storage.


PVC reuse can avoid data loss upon spot kill

PVC reuse becomes particularly useful when running Spark applications on spot instances. While running Spark on spot instances is a great way to reduce the overall cost of your Spark platform, there are some major drawbacks.

When your Spark application receives a spot kill, it loses all metadata and shuffle data stored on the reclaimed node. If your Spark application requires this data to complete ongoing stages or future tasks, it will be forced to recompute the data and re-execute previously completed work to make up for the loss.

Often, this recomputation is not resource intensive, and you will only notice a slight delay in your application. For heavier operations or workloads, however, it can add significant time to your application or, in certain situations, cause the application to fail because Spark cannot catch up in time and never locates the missing shuffle data. You will see a failure message similar to the image below:


Screenshot showing how Ocean for Apache Spark surfaces issue around shuffle data loss


PVC reuse can avoid these failures.

Imagine a simple Spark application with one executor (exec-0) and dynamic PVC reuse enabled that has provisioned a volume (pvc-0). Now let’s assume this application has completed a few operations that generated shuffle data and wrote that data to disk. When this application receives a spot kill and loses exec-0, PVC reuse attaches pvc-0 to exec-1, the executor provisioned to replace exec-0.

Instead of rescheduling stages and tasks to recompute the lost shuffle data, Spark will be able to identify the existing shuffle data from pvc-0 and immediately pick up where the last executor left off. Your application will have a slight delay as it acquires the new node(s) from the cloud provider, but you won’t have to repeat any previously completed work and your application won’t fail as a result of missing shuffle data!
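The exec-0 / pvc-0 scenario can be sketched as a toy model. This is not Spark's implementation; the ToyCluster class and its method names are invented for illustration, and the model only tracks which executor holds which volume and which shuffle blocks live on it.

```python
class ToyCluster:
    """Toy model of executor/PVC bookkeeping; not Spark's actual implementation."""

    def __init__(self):
        self.attached = {}    # executor name -> PVC name
        self.free_pvcs = []   # PVCs released by lost executors
        self.shuffle = {}     # PVC name -> shuffle block ids stored on it
        self.pvc_count = 0

    def start_executor(self, executor):
        # Prefer a released PVC; create a new one only if none is available.
        if self.free_pvcs:
            pvc = self.free_pvcs.pop(0)
        else:
            pvc = f"pvc-{self.pvc_count}"
            self.pvc_count += 1
            self.shuffle[pvc] = []
        self.attached[executor] = pvc
        return pvc

    def write_shuffle(self, executor, block_id):
        self.shuffle[self.attached[executor]].append(block_id)

    def spot_kill(self, executor):
        # The executor is gone, but its volume and the data on it survive.
        self.free_pvcs.append(self.attached.pop(executor))


cluster = ToyCluster()
cluster.start_executor("exec-0")
cluster.write_shuffle("exec-0", "shuffle_0_0_0")
cluster.spot_kill("exec-0")
reused = cluster.start_executor("exec-1")
print(reused, cluster.shuffle[reused])  # pvc-0 ['shuffle_0_0_0']
```

The replacement executor starts with the shuffle block already on its volume, which is why no recomputation is needed in this scenario.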


Problems with PVC Reuse in Spark 3.2

When this feature was first released in Spark 3.2, we were very excited to test it out. To enable PVC reuse in your Spark application, you’ll need to add the following config to your Spark application.


Screenshot of Ocean for Apache Spark configuration for PVC
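The screenshot's contents are not reproduced here, so the following is a sketch rather than the exact configuration used in our tests. It assembles the properties that the Spark on Kubernetes documentation lists for on-demand, driver-owned, reusable PVCs. The 100Gi size limit and the /data mount path are assumptions; the 'standard' storage class matches the one discussed later in this article.

```python
import json

# A volume whose name starts with "spark-local-dir-" is used by Spark as
# local (shuffle and spill) storage.
VOLUME = "spark-local-dir-1"
PREFIX = f"spark.kubernetes.executor.volumes.persistentVolumeClaim.{VOLUME}"


def pvc_reuse_conf(storage_class="standard", size_limit="100Gi", mount_path="/data"):
    """Return Spark properties that enable dynamic PVC creation and reuse."""
    return {
        # The driver owns the PVCs, so they can outlive individual executors.
        "spark.kubernetes.driver.ownPersistentVolumeClaim": "true",
        # The driver tries to hand PVCs of lost executors to new ones.
        "spark.kubernetes.driver.reusePersistentVolumeClaim": "true",
        # "OnDemand" asks Spark to create one claim per executor.
        f"{PREFIX}.options.claimName": "OnDemand",
        f"{PREFIX}.options.storageClass": storage_class,
        f"{PREFIX}.options.sizeLimit": size_limit,   # assumed size
        f"{PREFIX}.mount.path": mount_path,          # assumed mount path
        f"{PREFIX}.mount.readOnly": "false",
    }


conf = pvc_reuse_conf()
print(json.dumps(conf, indent=2))
```

The printed key/value pairs can be pasted into the Spark configuration section of an application, adjusting the size and mount path to your workload.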


However, we ran into an issue in the implementation. Let’s assume the same configuration as the previous step (exec-0, pvc-0).

To test this functionality, we ran a few commands that generated shuffle data, and then we removed the node that exec-0 was deployed on to simulate a spot kill. When we inspected the pods after the simulated spot kill, we saw exec-0 decommissioned and a new node spinning up for exec-1. However, pvc-0 was not being attached to exec-1 and a second PVC was created and mounted to the new node.

Upon further investigation, we realized there was a race condition where the PVC from the initial executor was not being released in time to be acquired and mounted to the new executor. In another test, we had a few more executors in the application and simulated two to three spot kills over the course of five to ten minutes. We were able to observe that pvc-0 would eventually be mounted to a new executor, but the mapping was not one-to-one, and new PVCs without the existing shuffle data were still being created.

While this functionality could still be useful for long-running applications or streaming jobs, it was neither unlocking the full potential of dynamic PVC reuse nor working as a reliable solution for handling spot kills with shuffle data.
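The timing problem can be illustrated with a small timeline model. It is a simplification of what we observed, not Spark's scheduling logic: the simulate function, the 60-second release delay, and the 5-second request delay are all hypothetical. A replacement executor reuses a PVC only if one has already been released at the moment the replacement is requested; otherwise a fresh, empty PVC is created.

```python
def simulate(kill_times, release_delay, request_delay, executors=3):
    """Replay spot kills and return the PVC each replacement executor receives."""
    holders = [f"pvc-{i}" for i in range(executors)]  # one PVC per live executor
    releasing = []        # (time the PVC becomes free, PVC name)
    next_id = executors   # index for the next newly created PVC
    assignments = []
    for t in sorted(kill_times):
        victim = holders.pop(0)                       # PVC of the killed executor
        releasing.append((t + release_delay, victim))
        now = t + request_delay                       # replacement is requested
        free = [entry for entry in releasing if entry[0] <= now]
        if free:
            releasing.remove(free[0])
            pvc = free[0][1]                          # reused, shuffle data intact
        else:
            pvc = f"pvc-{next_id}"                    # fresh claim, no shuffle data
            next_id += 1
        holders.append(pvc)
        assignments.append(pvc)
    return assignments


# Slow release: the first replacement gets a new PVC, later ones pick up old PVCs.
print(simulate([0, 120, 240], release_delay=60, request_delay=5))
# Instant release: every replacement reuses the PVC of the executor it replaces.
print(simulate([0, 120, 240], release_delay=0, request_delay=5))
```

In the slow-release run, pvc-0 is eventually reused, but by the second replacement rather than the first, which mirrors the mismatch described above.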


Working solution with Spark 3.4

Thanks to the work on this ticket, the race condition for dynamic PVC reuse has been remedied. With the inclusion of an additional configuration setting, Spark 3.4 correctly provisions, mounts, removes, and remounts the PVCs within the same Spark application. You can get the Spark 3.4 image from here and set the following configuration:

"spark.kubernetes.driver.waitToReusePersistentVolumeClaim": "true"
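Per the Spark on Kubernetes documentation, this setting depends on the driver both owning and reusing PVCs. The sketch below is a small pre-flight check you could run against an application's configuration before submitting it; the function name missing_prerequisites is an assumption, and the check is deliberately strict in that it expects the two prerequisite keys to be set explicitly rather than relying on version-specific defaults.

```python
OWN = "spark.kubernetes.driver.ownPersistentVolumeClaim"
REUSE = "spark.kubernetes.driver.reusePersistentVolumeClaim"
WAIT = "spark.kubernetes.driver.waitToReusePersistentVolumeClaim"


def is_true(conf, key):
    """Treat a key as enabled only if it is explicitly set to "true"."""
    return str(conf.get(key, "false")).lower() == "true"


def missing_prerequisites(conf):
    """Return the keys that should also be "true" when WAIT is enabled."""
    if not is_true(conf, WAIT):
        return []
    return [key for key in (OWN, REUSE) if not is_true(conf, key)]


# Hypothetical configuration that forgets to set the ownership flag.
conf = {WAIT: "true", REUSE: "true"}
print(missing_prerequisites(conf))
# ['spark.kubernetes.driver.ownPersistentVolumeClaim']
```

An empty list means the three settings are consistent with one another.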

You will also need to configure a storage class that the PVC can use. In the example above, the ‘standard’ storage class was created with the following YAML:


Screenshot of Ocean for Apache Spark configuration of a Kubernetes storage class
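The original YAML is not reproduced here, so the following is only a sketch of what a 'standard' storage class could look like, built as a Python dict and printed as JSON (kubectl accepts JSON manifests as well as YAML). The provisioner shown, the AWS EBS CSI driver, is an assumption, as are the reclaim policy and binding mode; substitute the values appropriate for your cloud provider and cluster.

```python
import json


def build_storage_class(name="standard", provisioner="ebs.csi.aws.com"):
    """Return a StorageClass manifest as a plain dict."""
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        # Assumption: AWS EBS CSI driver; use your cluster's provisioner.
        "provisioner": provisioner,
        # Assumption: delete the backing volume once its claim is removed.
        "reclaimPolicy": "Delete",
        # Assumption: delay provisioning until a pod using the claim is scheduled.
        "volumeBindingMode": "WaitForFirstConsumer",
    }


storage_class = build_storage_class()
print(json.dumps(storage_class, indent=2))
```

Whatever name you choose must match the storageClass option in the executor volume configuration.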


You must also bind the “edit” cluster role to the spark-driver service account. The command below does this. kubectl requires a name for the binding; spark-driver-edit is a placeholder that you can change:

kubectl create clusterrolebinding spark-driver-edit --clusterrole=edit --serviceaccount=spark-apps:spark-driver --namespace=default
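If you prefer to keep the binding in version control rather than run an imperative command, the same binding can be expressed as a manifest. The sketch below builds it as a Python dict and prints JSON that kubectl can apply. The binding name spark-driver-edit and the helper name are hypothetical; the role, service account, and namespace are taken from the command above.

```python
import json


def build_cluster_role_binding(binding_name, service_account, namespace, role="edit"):
    """Return a ClusterRoleBinding manifest as a plain dict."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "ClusterRoleBinding",
        "metadata": {"name": binding_name},
        # The cluster role being granted.
        "roleRef": {
            "apiGroup": "rbac.authorization.k8s.io",
            "kind": "ClusterRole",
            "name": role,
        },
        # The service account receiving the role.
        "subjects": [
            {"kind": "ServiceAccount", "name": service_account, "namespace": namespace}
        ],
    }


# "spark-driver-edit" is a placeholder binding name.
binding = build_cluster_role_binding("spark-driver-edit", "spark-driver", "spark-apps")
print(json.dumps(binding, indent=2))
```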

In our next blog post, we will examine the performance benefits of dynamic PVC reuse. Specifically, we will look at the cost and performance savings achieved through consecutive runs, as well as fix an application that was previously failing in the process of recomputation. In the meantime, sign up today and experience the incredible power of Ocean for Apache Spark for yourself.