Optimized Spark Docker Images Are Now Available


We’re excited to publicly release our optimized Docker images for Apache Spark. They can be freely downloaded from our DockerHub repository, whether you’re a Spot by NetApp customer or not.

This is the result of a lot of work from Spot’s Ocean for Apache Spark team to ensure that we can:

  • Build combinations of Docker images to serve our customers’ needs – with various versions of Spark, Python, Scala, Java, Hadoop, and all the popular data connectors
  • Automatically test them across various workloads, to ensure the included dependencies work together (in other words, to save you from “dependency hell”).

Our philosophy is to provide high-quality Docker images that come “with batteries included”, meaning you can get started and do your work with all the common data sources supported by Spark. We hope these images will just work for you, out of the box.

We will maintain this fleet of images over time, keeping them up to date with the latest versions and bug fixes of Spark and the various built-in dependencies.

Have you ever had containers blocked in production due to dependency issues? We hope to save you from this.

What’s a Docker Image for Spark?

When you run Spark on Kubernetes, the Spark driver and executors are Docker containers. These containers use an image specifically built for Spark, which contains the Spark distribution itself (Spark 2.4, 3.0, 3.1). This means that the Spark version is not a global cluster property, as it is for YARN clusters.
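Concretely, the Spark version is selected per application at submission time, because it lives inside the container image. A minimal sketch of a Kubernetes submission, where the API server address and the image tag are placeholders (pick a real tag from our DockerHub page):

```shell
# The Spark version is baked into the image, so two jobs on the same
# Kubernetes cluster can run different Spark versions side by side.
# <k8s-apiserver> and <tag> are placeholders.
spark-submit \
  --master k8s://https://<k8s-apiserver>:443 \
  --deploy-mode cluster \
  --name my-pyspark-app \
  --conf spark.kubernetes.container.image=datamechanics/spark:<tag> \
  --conf spark.executor.instances=2 \
  local:///opt/application/main.py
```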

You can also use Docker images to run Spark locally. For example, you can run Spark in a driver-only mode (in a single container), or run Spark on Kubernetes on a local minikube cluster. Many of our users choose to do this during development and testing.
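As a sketch, running a driver-only Spark shell in a single container could look like the following (the image tag is a placeholder, and /opt/spark as the Spark install path is an assumption — check the image documentation):

```shell
# Start an interactive Spark shell in one container, no cluster needed.
# <tag> is a placeholder; /opt/spark is the assumed install path.
docker run -it datamechanics/spark:<tag> \
  /opt/spark/bin/spark-shell --master "local[*]"
```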

Using Docker will speed up your development workflow and give you fast, reliable, and reproducible production deployments.

To learn more about the benefits of using Docker for Spark, and see the concrete steps to use Docker in your development workflow, check out our article: “Spark and Docker: Your development cycle just got 10x faster!”.

What’s in these optimized Docker Images?

They contain the Spark distribution itself – from open-source code, without any proprietary modifications.

They come built-in with connectors to common data sources:

  • AWS S3 (s3a:// scheme)
  • Google Cloud Storage (gs:// scheme)
  • Azure Blob Storage (wasbs:// scheme)
  • Azure Data Lake Storage Gen1 (adl:// scheme)
  • Azure Data Lake Storage Gen2 (abfss:// scheme)
  • Snowflake
  • Delta Lake

They also come built-in with Python and PySpark support, as well as pip and conda, so that it’s easy to install additional Python packages. (If you don’t need PySpark, you can use the lighter images with the tag prefix ‘jvm-only’.)

Finally, each image uses a specific combination of versions of the following components:

  • Apache Spark: 2.4.5 to 3.1.1
  • Apache Hadoop: 3.1 or 3.2
  • Java: 8 or 11
  • Scala: 2.11 or 2.12
  • Python: 3.7 or 3.8

Note that not all possible combinations exist; check out our DockerHub page to find the available ones.

Our images include connectors to GCS, S3, Azure Data Lake, Delta Lake, and Snowflake, as well as support for Python, Java, Scala, Hadoop, and Spark!

How To Use Our Spark Docker Images

Update (October 2021): See our step-by-step tutorial on how to build an image and get started with it with our boilerplate template!

You should use our Spark Docker images as a base, and then build your own images by adding your code dependencies on top. Here’s a Dockerfile example to help get you started:

Dockerfile to build a custom Spark image
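A minimal sketch of such a Dockerfile, assuming a Python application consisting of a main.py and a requirements.txt (the base image tag is a placeholder — pick a real combination from our DockerHub page):

```dockerfile
# Base image tag is a placeholder -- choose a Spark/Hadoop/Python
# combination from the datamechanics/spark DockerHub page
FROM datamechanics/spark:<tag>

# Install your application's extra Python dependencies
COPY requirements.txt .
RUN pip3 install -r requirements.txt

# Copy your application code to the path used at run time
WORKDIR /opt/application
COPY main.py .
```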

Once you’ve built your Docker image, you can run it locally by running: docker run {{image_name}} driver local:///opt/application/main.py {args}
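Putting it together, the local build-and-run loop might look like this (the image name and application arguments are placeholders):

```shell
# Build the custom image from your Dockerfile, then run the
# application locally; names and args below are placeholders.
docker build -t my-spark-app .
docker run my-spark-app driver local:///opt/application/main.py arg1 arg2
```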

Or you can push your newly built image to a Docker registry that you own, then use it on your production k8s cluster!
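For production, tagging and pushing the image to your own registry might look like the following (the registry address is a placeholder, e.g. an ECR, GCR, or ACR URL):

```shell
# Tag the locally built image for your own registry and push it;
# <registry> is a placeholder.
docker tag my-spark-app <registry>/my-spark-app:v1
docker push <registry>/my-spark-app:v1
```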

Do not pull our DockerHub images directly from your production cluster in an unauthenticated way, as you risk hitting rate limits. It’s best to push your image to your own registry, or to purchase a paid plan from DockerHub.

Spot by NetApp users can use the images directly from our documentation. They offer higher availability and a few additional capabilities exclusive to Spot, such as Jupyter support.

Conclusion – We hope these images will be useful to you

Are these images working well for you? Do you need new connectors or versions to be added? Let us know, we’d love your feedback.

Are you interested in getting a trial of the Data Mechanics platform to test the benefits of a containerized Spark platform powered by Kubernetes, deployed in your cloud account? Schedule a demo with us and we’ll show you how to get started.