Container of the Week – kaggle/python

Machine learning is a popular field at the moment, and it features heavily in the news and in geek culture. Kaggle is a machine learning competition site where you can take part in a (usually sponsored) competition to apply your skills to a real-world problem.

Putting aside the controversial nature of spec work (read no!spec and this Wired article for some background), Kaggle have put together a pretty nice container image for getting started with machine learning.

The kaggle/python container is a pre-built environment that includes the SciPy ecosystem, the foundation for most scientific computing and machine learning in Python. Alongside the SciPy library itself, the ecosystem includes a number of packages which might be familiar:

  • NumPy for numerical computation using arrays and matrices,
  • Pandas for data frame manipulation and analysis,
  • Matplotlib for generating publication-quality graphs, and
  • IPython for working interactively with data.

That sounds like a great set of tools, and it is, but the container image is enormous: 9.1GB. Pulling it is less a matter of grabbing a coffee while the command completes and more a matter of going to lunch.

$ docker pull kaggle/python

There are a couple of ways to get started using this container. To run a regular old-school Python REPL just call docker run like you would expect:

$ docker run --rm -it kaggle/python python
Python 3.6.1 |Anaconda 4.4.0 (64-bit)| (default, May 11 2017, 13:09:58)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>

As you can see from the banner, the container is based on the well-known Anaconda distribution, a Python distribution and platform for data science that ships with the conda package manager.
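
To confirm from that REPL which of the packages listed earlier are actually importable, a small check along these lines could be pasted in. This is only a sketch: it uses nothing beyond the standard library, and the helper name installed is made up for illustration.

```python
import importlib.util


def installed(names):
    """Map each top-level module name to True if Python can locate it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Import names of the packages mentioned above.
    packages = ["numpy", "scipy", "pandas", "matplotlib", "IPython"]
    for name, ok in installed(packages).items():
        print(f"{name:12} {'yes' if ok else 'MISSING'}")
```

Because find_spec only locates a module without importing it, the check stays fast even for heavy packages.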

The other way to use the kaggle/python container is to run a Jupyter notebook server. Start the server like this:

$ docker run --rm -it \
    -v "$PWD":/tmp/working -w=/tmp/working \
    -p 8888:8888 kaggle/python jupyter notebook \
    --no-browser --ip="0.0.0.0" \
    --notebook-dir=/tmp/working \
    --allow-root

Then browse to http://localhost:8888 to start using the notebook.

This is a somewhat complicated command line, but it does several useful things. Firstly, the -v and -w options mount the current working directory at /tmp/working in the container and make it the working directory there, which is useful if that is where your code and data are. Secondly, -p publishes port 8888 so the server can be reached from the host. Finally, it runs jupyter as a server inside the container, listening on all interfaces, with /tmp/working as the notebook directory.
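
If you start the server often, the same argument list can be assembled in a small script. The following is a minimal sketch rather than anything Kaggle provides: the function name jupyter_command and its host_dir and port parameters are hypothetical, and it only prints the command instead of running it.

```python
import os
import shlex


def jupyter_command(host_dir, port=8888, image="kaggle/python"):
    """Return the docker argv that serves host_dir as the notebook directory."""
    workdir = "/tmp/working"  # mount point used in the command above
    return [
        "docker", "run", "--rm", "-it",
        "-v", f"{host_dir}:{workdir}",  # bind-mount the host directory
        "-w", workdir,                  # start the process in that directory
        "-p", f"{port}:8888",           # publish the notebook port on the host
        image,
        "jupyter", "notebook",
        "--no-browser", "--ip=0.0.0.0",
        f"--notebook-dir={workdir}",
        "--allow-root",
    ]


if __name__ == "__main__":
    argv = jupyter_command(os.getcwd())
    # Quote each argument so the printed line is safe to paste into a shell.
    print(" ".join(shlex.quote(arg) for arg in argv))
```

Keeping the arguments in a list rather than a single string means a directory name containing spaces survives intact if the list is later handed to subprocess.run.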

When you first connect to the notebook you will need to enter a random token that’s generated at startup. Examine the log output from the container as it starts up, then copy and paste the value of the token.
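
Picking the token out of the log by eye gets tedious, so a short helper can do it. This sketch assumes the log contains a URL with a token= query parameter made of hexadecimal digits; the sample line and its token value are invented for illustration, and find_token is a hypothetical name.

```python
import re

# Assumption: the token appears as a "token=<hex digits>" query parameter.
TOKEN_RE = re.compile(r"[?&]token=([0-9a-f]+)")


def find_token(log_text):
    """Return the first token found in log_text, or None if there is none."""
    match = TOKEN_RE.search(log_text)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Made-up log line, not real server output.
    sample = "http://0.0.0.0:8888/?token=0123456789abcdef0123456789abcdef"
    print(find_token(sample))
```

The helper returns None rather than raising when no token is present, so it can be run against partial output while the server is still starting.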