Tutorial: Running PySpark inside Docker containers


In this article, we’re going to show you how to start running PySpark applications inside Docker containers by walking through a step-by-step tutorial with code examples (see GitHub).

There are multiple motivations for running Spark applications inside Docker containers; we covered them in our article “Spark & Docker – Your Dev Workflow Just Got 10x Faster”:

  • Docker containers simplify the packaging and management of dependencies like external Java libraries (jars) or Python libraries that help with data processing or connect to external data storage. Adding or upgrading a library can break your pipeline (e.g. because of conflicts). Using Docker means that you can catch this failure locally at development time, fix it, and then publish your image with the confidence that the jars and the environment will be the same wherever your code runs.
  • Docker containers are also a great way to develop and test Spark code locally, before running it at scale in production on your cluster (for example a Kubernetes cluster).

At Data Mechanics we maintain a fleet of Docker images that come built in with a series of useful libraries, such as connectors to data lakes, data warehouses, streaming data sources, and more. You can read more about these images here, and download them for free on Docker Hub.

In this tutorial, we’ll walk you through the process of building a new Docker image from one of our base images, adding new dependencies, and testing the functionality you have installed by using the Koalas library and writing some data to a Postgres database.

Requirements

For the purpose of this demonstration, we will create a simple PySpark application that reads population density data per country from the public dataset – https://registry.opendata.aws/dataforgood-fb-hrsl – applies transformations to find the median population, and writes the results to a Postgres instance. To find the median, we will use Koalas, a Spark implementation of the pandas API.

Starting with Spark 3.2, the pandas API on Spark (the successor to Koalas) is bundled with open-source Spark. In this tutorial we use Spark 3.1, so we install Koalas separately; on newer Spark versions it works out of the box.
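Only the import changes between the two setups; the rest of the API is largely the same. Here is a minimal illustration, assuming either Spark 3.1 with the Koalas package installed or Spark 3.2+:

# Use the built-in pandas API on Spark 3.2+, and fall back to the Koalas
# package on Spark 3.1.
try:
    import pyspark.pandas as ps    # Spark 3.2+
except ImportError:
    import databricks.koalas as ps  # Spark 3.1 with koalas installed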

Finding median values using plain Spark can be quite tedious, so we will use Koalas for a concise solution. (Note: to read from the public data source, you will need an AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. The bucket is public, but AWS requires you to create a user in order to read from public buckets.)
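Here is a minimal sketch of one way to pass these credentials to the Spark session through the standard Hadoop s3a settings (the application name is arbitrary):

import os
from pyspark.sql import SparkSession

# Forward the AWS credentials from the environment to the Hadoop s3a connector
# so that Spark can read from the public bucket.
spark = (
    SparkSession.builder
    .appName("population-density-demo")
    .config("spark.hadoop.fs.s3a.access.key", os.environ["AWS_ACCESS_KEY_ID"])
    .config("spark.hadoop.fs.s3a.secret.key", os.environ["AWS_SECRET_ACCESS_KEY"])
    .getOrCreate()
)

Depending on your setup, the s3a connector may also pick these variables up automatically from the environment.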

If you do not have a postgres database handy, you can create one locally using the following steps:

docker pull postgres
docker run -e POSTGRES_PASSWORD=postgres -e POSTGRES_USER=postgres -d -p 5432:5432 postgres
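With these defaults, the database is reachable from your machine at:

jdbc:postgresql://localhost:5432/postgres

using postgres as both the user and the password. If your Spark application itself runs inside a Docker container, localhost may need to be replaced, for example by host.docker.internal on Docker Desktop or by the Postgres container’s name on a shared Docker network.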

Building the Docker Image

Dockerfile

To start, we will build our Docker image from the latest Data Mechanics base image – gcr.io/datamechanics/spark:platform-3.1-latest. There are a few ways to include external libraries in your Spark application. For language-specific libraries, you can use a package manager like pip for Python or sbt for Scala to install the library directly. Alternatively, you can download the library jar into your Docker image and move it to /opt/spark/jars. Both methods achieve the same result, but certain libraries or packages may only be available through one of them.

To install Koalas, we will use pip to add the package directly to our Python environment. The standard convention is to create a requirements file that lists all of your Python dependencies and use pip to install each library into your environment. In your application repo, create a file called requirements.txt and add the following line – koalas==1.8.1. Copy this file into your Docker image and add the following command – RUN pip3 install -r requirements.txt. Now you should be able to import koalas directly in your Python code.

Next, we will use the jar method to install the JDBC driver that lets Spark write directly to Postgres. First, let’s pull the Postgres driver jar into the image. Add the following line to your Dockerfile – RUN wget https://jdbc.postgresql.org/download/postgresql-42.2.5.jar – and move the resulting jar into /opt/spark/jars. When you start your Spark application, the Postgres driver will be on the classpath, and you will be able to write to the database.
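Putting the two steps together, the Dockerfile could look something like the sketch below; the application file name (main.py) and its destination path are assumptions, and wget is assumed to be available in the base image:

FROM gcr.io/datamechanics/spark:platform-3.1-latest

# Install the Python dependencies (Koalas) listed in requirements.txt
COPY requirements.txt .
RUN pip3 install -r requirements.txt

# Download the Postgres JDBC driver and place it on Spark's classpath
RUN wget https://jdbc.postgresql.org/download/postgresql-42.2.5.jar && \
    mv postgresql-42.2.5.jar /opt/spark/jars/

# Copy the application code into the image (hypothetical destination path)
COPY main.py /opt/application/main.py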

Developing Your Application Code

Example PySpark Workflow
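What follows is a minimal sketch of that workflow rather than the exact code from the repo: the S3 path is a placeholder, the country and population column names are assumptions about the dataset, and the connection details match the local Postgres instance created earlier.

import databricks.koalas as ks  # noqa: F401  (adds .to_koalas() to Spark DataFrames)
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("population-density-demo").getOrCreate()

# Read the population density data from the public bucket (placeholder path).
df = spark.read.csv("s3a://<public-bucket>/<path-to-population-data>/", header=True)

# Keep the columns we need and cast population to a numeric type
# so that a median can be computed.
df = df.select("country", F.col("population").cast("double").alias("population"))

# Convert to Koalas and compute the median population per country,
# an aggregation that Spark 3.1 does not offer out of the box.
medians = df.to_koalas().groupby("country")["population"].median()

# Back to Spark: add a date column and write the result to Postgres over JDBC.
result = medians.reset_index().to_spark().withColumn("date", F.current_date())
result.write.jdbc(
    url="jdbc:postgresql://localhost:5432/postgres",
    table="country_median_population",  # hypothetical table name
    mode="overwrite",
    properties={"user": "postgres", "password": "postgres", "driver": "org.postgresql.Driver"},
)

If the application itself runs inside a container, adjust localhost in the JDBC URL as noted in the Requirements section.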

Finally, we’ll add our application code. In the code snippet above, we have a simple Spark application that reads a DataFrame from the public bucket source. We cast the population column to a numeric type, convert the Spark DataFrame to a Koalas DataFrame, and group by country to apply the median function (an aggregation not directly available in Spark 3.1). Lastly, we transform the result back to a Spark DataFrame and add a date column so that we can use the write.jdbc function to output our data into a SQL table.

Monitoring and Debugging

To help with running the application locally, we’ve included a justfile – a script to build and run the image locally – with a few helpful commands to get started (a sketch of it follows the list below).
  • To build your Docker image locally, run just build
  • To run the PySpark application, run just run
  • To access a PySpark shell in the Docker image, run just shell
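The justfile itself is not reproduced here, so here is a hedged sketch of what it could look like; the image name, the application path, and the spark-submit/pyspark invocations are assumptions based on the standard Spark layout of the base image:

# Hypothetical image name; adjust to your registry and repository.
image := "pyspark-docker-example"

# Build the Docker image from the Dockerfile in the current directory.
build:
    docker build -t {{image}} .

# Run the PySpark application, forwarding the AWS credentials from the host.
run:
    docker run --rm -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY {{image}} \
        /opt/spark/bin/spark-submit local:///opt/application/main.py

# Open an interactive PySpark shell inside the image.
shell:
    docker run --rm -it {{image}} /opt/spark/bin/pyspark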

You can also open a shell in the Docker container directly by running docker run -it <image name> /bin/bash. This will create an interactive shell that can be used to explore the Docker/Spark environment, as well as monitor performance and resource utilization.

Conclusion

We hope you found this tutorial useful! If you would like to dive into this code further, or explore other application demos, head to our GitHub examples repo.