There is a rather unhealthy obsession, in my opinion, in the Docker community with building the smallest possible container images. Obviously you don’t want your container to contain hundreds of megabytes of useless junk, but perhaps we have passed the point of diminishing returns. It turns out that it is less expensive to have files in your base image that aren’t used than it is to have duplicated files in higher layers.
A while ago it was announced that Docker Inc. was moving its official Docker images to Alpine Linux. Some people have speculated that this was partly to do with lowering their hosting bill for the public Docker Hub. I can empathise with that, as I imagine that Docker Inc. serves up many terabytes of image data per month. Cutting down the size of a handful of popular images could reduce the load on the public Docker Hub significantly.
The obsession with image size seems to have spurred some friendly competition amongst base container image vendors. The various flavours of Debian and Ubuntu base images have in some cases come down in size by hundreds of megabytes.
Is an extremely small base container an obvious win though? Let’s compare the costs and benefits of a slightly larger image against a smaller one.
The primary costs of a larger image are that it takes longer to download from the Docker Hub, and it takes up more space on disk. That’s it though. Once the image is downloaded and stored locally, it doesn’t cost any more to start up, run or shut down a larger container than a smaller one.
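If you’re curious what that one-off cost looks like on your own machine, the stock Docker CLI will show you (the numbers will obviously vary from host to host):

    # How big are the images already on this host?
    docker image ls --format 'table {{.Repository}}\t{{.Tag}}\t{{.Size}}'

    # Overall disk usage for images, containers and volumes, including
    # how much of it is reclaimable.
    docker system df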
One of the really nice innovations that Docker brought to us was the use of overlay filesystems for sharing image layers between containers. On Debian and Ubuntu we have the aufs and overlay2 storage drivers, and on Red Hat and friends we have the devicemapper storage driver. By sharing common layers at the filesystem level we can make use of the Linux kernel’s extremely effective page cache to get big performance improvements across containers that are otherwise isolated from one another.
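You can check which storage driver your own daemon is using, and where it keeps its layer data, with a couple of docker info queries (the exact driver and path depend on your distribution and Docker version):

    # Which storage driver is this daemon using? overlay2 on most modern installs.
    docker info --format '{{.Driver}}'

    # Where the daemon keeps its data, usually /var/lib/docker; the layer
    # directories for the driver live underneath it.
    docker info --format '{{.DockerRootDir}}'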
Sharing common files between containers via a base layer is an obvious win. The layer only needs to be downloaded once, and Linux’s page cache improves performance when common files are accessed by multiple containers.
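You can see that sharing directly by comparing the layer digests of two images built from the same base. The image names below, myapp-a and myapp-b, are just placeholders for any two images that share a parent:

    # Print the layer digests that make up each image's filesystem.
    docker image inspect --format '{{json .RootFS.Layers}}' myapp-a
    docker image inspect --format '{{json .RootFS.Layers}}' myapp-b
    # The leading digests in both lists should match: that's the shared base
    # layer, stored once on disk and cached once by the kernel.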
However, if we have worked really hard to slim down our base container, we increase the chances that downstream users will need to install additional packages to make their containers useful. If we remove Perl from our base image but everyone using it needs to install Perl, then have we really saved anything?
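As a sketch of how that plays out, imagine two unrelated services that both need Perl because the slim base no longer ships it. The image names, file names and package choices here are made up purely for illustration:

    # Hypothetical service A: needs perl plus curl.
    printf '%s\n' \
        'FROM debian:stable-slim' \
        'RUN apt-get update && apt-get install -y --no-install-recommends perl curl && rm -rf /var/lib/apt/lists/*' \
        > Dockerfile.service-a
    docker build -t service-a -f Dockerfile.service-a .

    # Hypothetical service B: needs perl plus git.
    printf '%s\n' \
        'FROM debian:stable-slim' \
        'RUN apt-get update && apt-get install -y --no-install-recommends perl git && rm -rf /var/lib/apt/lists/*' \
        > Dockerfile.service-b
    docker build -t service-b -f Dockerfile.service-b .

Each of those RUN steps produces its own layer, and each layer contains its own copy of Perl.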
It turns out that duplicated files, that is, files that are present in multiple images but not shared as part of a common layer, have a higher cost than you might think. Not only are duplicated files stored multiple times on disk, they can also be stored multiple times in the page cache. This crowds out other, potentially more useful data and reduces the overall effectiveness of the cache.
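Continuing the hypothetical service-a and service-b example from above, docker history makes the duplication easy to see: each image carries its own layer for the apt-get step, each with its own size on disk.

    # Each image's history shows its own apt-get layer and that layer's size;
    # the perl files inside those layers are stored, and cached, once per
    # image rather than once overall.
    docker history service-a
    docker history service-b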
Weighing the two against each other, the one-off download and storage cost of unused files in a base image is lower than the ongoing cost of the same files duplicated across multiple images.
This is a good argument for having a larger base container image. While you pay more to initially download and store the larger image, there’s much more opportunity for runtime efficiency by sharing common files via Linux’s page cache. Those opportunities are lost with duplicated files, which actually make performance worse by crowding the cache.
Whether you build on Alpine or Debian, I think there’s a middle ground between a completely minimal install and an image stuffed with hundreds of megabytes of junk, and that middle ground is a larger image than most people think.