Machine learning is a hugely popular field at the moment, featuring regularly in both the news and geek culture. Kaggle is a machine learning competition site where you can take part in a (usually sponsored) competition to apply your skills and solve a real-world problem.
Putting aside the controversial nature of spec work (read no!spec and this Wired article for some background), Kaggle have put together a pretty nice container image for getting started with machine learning.
The kaggle/python container is a pre-built environment that includes the SciPy stack, the core collection of packages for doing scientific computing and machine learning in Python. The stack is made up of a number of packages which might be familiar:
- NumPy for numerical computation using arrays and matrices,
- Pandas for data frame manipulation and analysis,
- Matplotlib to generate production-quality graphs; and
- IPython for working interactively with data.
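To give a flavour of how the pieces above fit together, here is a minimal sketch using just NumPy and Pandas (the data is made up purely for illustration; Matplotlib and IPython slot in alongside these in the same way):

```python
import numpy as np
import pandas as pd

# NumPy: fast numerical arrays.
temps = np.array([21.5, 22.0, 19.8, 23.1])

# Pandas: label the data and analyse it as a data frame.
df = pd.DataFrame({"city": ["A", "B", "C", "D"], "temp_c": temps})

# Summarise the column with a one-liner.
print(float(df["temp_c"].mean()))  # -> 21.6
```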
That sounds like a great set of tools, which it is, but the container image is absolutely enormous: 9.1GB. Pulling it is less of a grab-a-coffee-while-the-command-completes wait and more of a long lunch.
```
$ docker pull kaggle/python
```
There are a couple of ways to get started using this container. To run a regular old-school Python REPL just call docker run like you would expect:
```
$ docker run --rm -it kaggle/python python
Python 3.6.1 |Anaconda 4.4.0 (64-bit)| (default, May 11 2017, 13:09:58)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
```
As you can see the container is based on the well-known Anaconda distribution which is a package manager and platform for data science.
The other way to use the kaggle/python container is to run a Jupyter notebook server. Run the container server like this:
```
$ docker run --rm -it \
    -v $PWD:/tmp/working -w=/tmp/working \
    -p 8888:8888 kaggle/python jupyter notebook \
    --no-browser --ip="0.0.0.0" \
    --notebook-dir=/tmp/working \
    --allow-root
```
Browse to http://localhost:8888 to start using the notebook.
This is a somewhat complicated command line, but it does a few useful things. Firstly, it maps the current working directory to /tmp/working inside the container, which is handy if that's where your code and data live. Secondly, it publishes port 8888 so the notebook server is reachable from the host. Finally, it runs jupyter as a server inside the container, using /tmp/working as the notebook directory.
When you first connect to the notebook you will need to enter a random token that's generated at startup. Examine the log output from the container and copy and paste the value of the token.
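If you'd rather not hunt through the log by eye, the token can be pulled out of the URL line that Jupyter prints at startup. A small sketch (the log line and token here are made up; a real token is a longer hex string):

```python
import re

# A made-up example of the URL Jupyter logs at startup (your token will differ).
log_line = "    http://0.0.0.0:8888/?token=abcdef0123456789"

# The token is the hex string after "token=".
match = re.search(r"token=([0-9a-f]+)", log_line)
if match:
    print(match.group(1))  # -> abcdef0123456789
```

Alternatively, running `jupyter notebook list` inside the container should print the running servers along with their tokens.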