Container of the Week: gettyimages/spark

This week we are looking at a container for Apache Spark. Spark is a cluster-computing framework for data processing, in particular MapReduce and more recently machine learning, graph analysis and streaming analytics. Clustered systems are sometimes difficult to run on a single machine, for example a laptop or desktop, as this use case is often not given a high priority by developers. Luckily, there is the gettyimages/spark image available for those who wish to quickly and easily explore the Spark environment.

Download the gettyimages/spark image using docker pull. Since it is a JVM-based project, the container image is quite large – 715 MB. To execute a standalone version of the Spark shell inside a container, run the following command:

$ docker run --rm -it -p 4040:4040 \
    gettyimages/spark bin/spark-shell

The docker run command brings up an interactive Spark shell attached to your terminal, and the -p 4040:4040 option exposes the Spark shell application UI as a web interface. Different aspects of the Spark environment can be viewed using the UI. Point your browser at http://localhost:4040 and have a look around.

Running Spark in a single container is handy, but if you want to try out a clustered installation, gettyimages/spark comes with a Docker Compose file. You can use this to try out a Spark cluster made up of containers. Note that you will need to clone the image’s source git repository to get the compose file in addition to the container image.

$ git clone \
    https://github.com/gettyimages/docker-spark.git
$ cd docker-spark
$ docker-compose up

This setup creates a two-node cluster with a master and a single worker, both running as containers.
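If you would rather not keep a terminal tied up, the cluster can also be run in the background and inspected afterwards. The following is a minimal sketch, not part of the gettyimages/spark project: the function names and the COMPOSE variable are hypothetical, and it assumes you run it from the docker-spark checkout with the same docker-compose binary used above.

```shell
# Assumption: the Compose binary is "docker-compose", as in the command above.
# Set COMPOSE="docker compose" instead if your install ships the plugin form.
COMPOSE=${COMPOSE:-docker-compose}

# Start the cluster in the background (run from the docker-spark checkout).
cluster_start()  { $COMPOSE up -d; }

# List the cluster's containers and their current state.
cluster_status() { $COMPOSE ps; }

# Stop and remove the cluster's containers.
cluster_stop()   { $COMPOSE down; }
```

After loading these into your shell, call cluster_start, then cluster_status to see the master and worker containers, and cluster_stop when you are finished.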

To connect to the master we can run an interactive version of gettyimages/spark using a similar command line to the standalone version above:

$ docker run --rm -it gettyimages/spark bin/spark-shell \
    --master spark://$DOCKER_IP:7077

Use the address of your Docker server for $DOCKER_IP, which will be either localhost if you are running Docker locally or your Docker bridge IP address. For Docker for Mac this is 172.17.0.1.
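If you are not sure what your bridge address is, Docker can report it. The sketch below is one way to look it up, under the assumption that your containers use Docker's default network named bridge. Both function names are hypothetical helpers introduced here for illustration, and the gateway field may be empty on unusual network setups.

```shell
# Hypothetical helper: print the gateway address of Docker's default
# "bridge" network (assumes that network exists and has a gateway set).
docker_bridge_ip() {
    docker network inspect bridge \
        --format '{{range .IPAM.Config}}{{.Gateway}}{{end}}'
}

# Hypothetical helper: build a Spark master URL from an address.
# 7077 is the master port used in the command above.
spark_master_url() {
    printf 'spark://%s:7077\n' "$1"
}
```

For example, DOCKER_IP=$(docker_bridge_ip) stores the address in the variable, and spark_master_url "$DOCKER_IP" prints the URL to hand to the Spark shell.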