Robert's Data Science Blog

Updated Docker Images for R

In an earlier post I wrote about Docker images for R. While these images work as expected, they are (IMO) quite large: the base image for R 3.5.0, for example, is about 530 MB.

It is of course possible to install R in a smaller base image like Alpine Linux, but several popular packages have system requirements that are easy to install on the larger distributions and difficult to install on the smaller ones.

To get a list of installed packages sorted by size, run the following command:

docker run --rm r-base:3.5.0 dpkg-query -Wf '${Installed-Size}\t${Package}\n' | sort -n

The seven largest entries in this output are as follows (dpkg reports the installed size in KiB):

24375   g++-7
25997   gcc-7
31070   libicu60
40421   libopenblas-base
52155   libopenblas-dev
134299  libgl1-mesa-dri
179065  openjdk-11-jre-headless
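For a more readable list, the sizes can be converted from KiB to MiB and limited to the largest entries. This is a minimal sketch, assuming a POSIX shell with sort, head and awk on the host. The function name top_packages is made up for this example and is not part of dpkg or Docker.

```shell
# top_packages [N]: read "size<TAB>package" lines on stdin (size in KiB, as
# printed by the dpkg-query command above) and print the N largest packages
# as "MiB<TAB>package", largest first. N defaults to 7.
# Note: top_packages is a hypothetical helper, not part of dpkg or Docker.
top_packages() {
    LC_ALL=C sort -rn | head -n "${1:-7}" |
        LC_ALL=C awk -F '\t' '{ printf "%.1f\t%s\n", $1 / 1024, $2 }'
}

# Usage (requires Docker and the r-base:3.5.0 image):
# docker run --rm r-base:3.5.0 dpkg-query -Wf '${Installed-Size}\t${Package}\n' | top_packages 7
```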

Quite peculiarly, openjdk-11-jre-headless is listed as taking up a lot of space despite being uninstalled after compiling R. This figure turns out to be misleading: when openjdk-11-jre-headless is removed later, it turns out to take up approximately no space.

The package libgl1-mesa-dri is for graphics. It is not necessary if we only intend to use the image for computations, but it is needed if we later wish to use the image as the foundation for e.g. an image with Shiny server.

The libopenblas packages provide the Basic Linear Algebra Subprograms (BLAS) that underpin linear algebra in R, so they are unavoidable.

The package libicu60 handles Unicode and is therefore also needed.

Then we reach the packages for the GNU C and C++ compilers. Here the numbers are misleading, as the compilers depend on other packages that together add up to about 170 MB. The compilers are not needed at runtime, but since many modern R packages contain C++ code, they are needed for installing these packages.
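A figure like this can be checked by summing the installed sizes of a chosen set of packages. The sketch below assumes a POSIX shell with awk. The name sum_mib is a hypothetical helper, and which packages count as compiler dependencies is a judgment call, so the grep pattern in the usage comment is only an illustration.

```shell
# sum_mib: read "size<TAB>package" lines on stdin (size in KiB) and print
# the total in MiB with one decimal.
# Note: sum_mib is a hypothetical helper written for this example.
sum_mib() {
    LC_ALL=C awk -F '\t' '{ total += $1 } END { printf "%.1f\n", total / 1024 }'
}

# Usage (requires Docker; the pattern is an illustrative guess, adjust as needed):
# docker run --rm r-base:3.5.0 dpkg-query -Wf '${Installed-Size}\t${Package}\n' \
#     | grep -E 'gcc|g[+][+]|cpp|libstdc' | sum_mib
```

For the two compiler packages in the list above alone (g++-7 and gcc-7) this gives 49.2 MiB; the remainder comes from their dependencies.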

The R images

My new sequence of Docker images is illustrated in the image below.

[Figure: Docker image dependencies]

The r-minimal image contains only R and the very small remotes package for installing other packages, but no compilers. r-deps can be an actual image with the runtime dependencies, or it can be an intermediate image in a multi-stage build.

The r-base image builds on r-minimal and has C(++) and Fortran compilers.

Finally, the r-test image builds on r-base and has the covr, devtools, roxygen2 and testthat packages for testing purposes. These four packages and their dependencies take up quite a lot of space and are time-consuming to install, which is why I have a dedicated image.

Since I now use remotes instead of devtools for installing packages inside the images, devtools is no longer part of the other images. My Dockerfile for r-test therefore installs devtools explicitly, because it is needed for testing.

ARG R_VERSION
FROM r-base:${R_VERSION}

USER root

RUN apt-get update \
        && apt-get install -y --no-install-recommends \
                libxml2-dev \
                libssl-dev \
                libssh2-1-dev \
                zlib1g-dev \
        && rm -rf /var/lib/apt/lists/*

USER shiny

RUN Rscript -e 'install.packages(c("covr", "devtools", "roxygen2", "testthat"))' \
        && rm -rf /tmp/*

# Copy the test script to the path that CMD runs below
COPY --chown=shiny:shiny run_tests.R /home/shiny/run_tests.R

WORKDIR $HOME/package

CMD ["Rscript", "/home/shiny/run_tests.R"]
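The Dockerfile sets the working directory to a package folder in the home directory, which suggests that the package under test is supplied when the container runs. A minimal sketch of how the image might be used, assuming the package source is bind-mounted at /home/shiny/package and that run_tests.R (not shown in this post) tests whatever is in the working directory. The function name and the DOCKER override variable are made up for this example.

```shell
# run_package_tests PKG_DIR [R_VERSION]: run the r-test image with the
# package source in PKG_DIR mounted at the image's working directory.
# Assumptions: the mount point /home/shiny/package and the default version
# 3.5.0 are taken from the Dockerfile and listings in this post.
# DOCKER can be set to another command (e.g. echo) for a dry run.
run_package_tests() {
    ${DOCKER:-docker} run --rm \
        --volume "$1:/home/shiny/package" \
        "r-test:${2:-3.5.0}"
}

# Usage (requires Docker and a locally built r-test image):
# run_package_tests "$PWD" 3.5.0
```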

For R 3.5.0 the image summary is as follows.

$ docker image ls
REPOSITORY          TAG             SIZE
r-test              3.5.0           726MB
r-base              3.5.0           530MB
r-minimal           3.5.0           354MB
r-deps              3.5.0           293MB
ubuntu              18.04           87.5MB
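To see how much each step adds, the sizes can be subtracted pairwise. The sketch below assumes the images form a single chain (r-test on r-base on r-minimal on r-deps on ubuntu:18.04); the post only states the first two links explicitly, so the last two are an assumption. The helper name size_deltas is hypothetical.

```shell
# size_deltas: read "image size_in_MB" lines on stdin, ordered from the
# largest image down to its base, and print what each image adds on top of
# the one listed after it.
# Note: size_deltas is a hypothetical helper written for this example.
size_deltas() {
    LC_ALL=C awk '
        NR > 1 { printf "%s adds %.1f MB over %s\n", prev_name, prev_size - $2, $1 }
               { prev_name = $1; prev_size = $2 }
    '
}

# Usage with the sizes from the listing above:
# printf 'r-test 726\nr-base 530\nr-minimal 354\nr-deps 293\nubuntu 87.5\n' | size_deltas
```

With the numbers above, r-base adds 176 MB over r-minimal, which is in line with the estimate of about 170 MB for the compilers and their dependencies.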

I have updated my GitHub repository with the Dockerfiles.

Using the images

Subsequent images are based on either r-minimal or r-base, depending on how easy it should be to install new packages that need compiled code.

Say that I want an image based on r-minimal that has the jsonlite package installed. jsonlite contains compiled code but has no system requirements, so it can be installed in r-base, after which the entire folder with R packages is copied into r-minimal. This is a multi-stage build.

ARG R_VERSION
FROM r-base:${R_VERSION} as deps

RUN Rscript -e "install.packages('jsonlite')"


# ----------------------------------------------------------

ARG R_VERSION
FROM r-minimal:${R_VERSION}

COPY --from=deps --chown=shiny:shiny /usr/local/lib/R/site-library/ /usr/local/lib/R/site-library/

CMD ["R"]

Instead of tying the Dockerfile to a specific image, we specify the R version as a build argument. An image is therefore built like this:

docker build --build-arg R_VERSION=3.5.0 --tag myimage:mytag .
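Because the R version is a build argument, the same Dockerfile can be built for several versions in a loop. A minimal sketch, assuming the corresponding r-base and r-minimal images exist locally for every version passed in. The function name and the DOCKER override variable are made up for this example.

```shell
# build_for_versions IMAGE VERSION...: build the Dockerfile in the current
# directory once per R version and tag each result as IMAGE:VERSION.
# DOCKER can be set to another command (e.g. echo) for a dry run.
# Note: build_for_versions is a hypothetical helper written for this example.
build_for_versions() {
    image=$1
    shift
    for version in "$@"; do
        ${DOCKER:-docker} build \
            --build-arg "R_VERSION=$version" \
            --tag "$image:$version" . || return 1
    done
}

# Usage (requires Docker and the base images for each version):
# build_for_versions myimage 3.5.0
```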

LaTeX

None of these images include LaTeX/TeXLive, even though it is one of the build requirements. As noted in this RStudio support article, LaTeX is needed to produce PDF files from R Markdown, and it is also used for some of the documentation during installation. As I use neither in these images, I do not include LaTeX.

Comparisons with Python

Some Pythonistas have snidely pointed out to me that minimal Python Docker images are much smaller than the ones I have made here.

That is certainly true, but R comes with a lot more batteries baked into the core language and standard library.

Consider for example the Docker image built from this Dockerfile that installs numpy in a small Python distribution.

FROM python:3.7-slim

RUN python -m pip install --no-cache-dir --compile --user numpy

The resulting images have the following sizes.

$ docker image ls
REPOSITORY       TAG             SIZE
numpy            3.7             219MB
python           3.7-slim        143MB

I am honestly not sure if this discards the 17 MB that are downloaded to install numpy, but the added space would still be comparable to the installed libopenblas packages. If we were to add matplotlib for plotting and other Python packages that provide counterparts to R's built-in functionality, we would probably end up with an image of the same size as r-minimal.
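The comparison can be made concrete with the numbers from the listings in this post. This is a rough sketch: it assumes docker image ls reports decimal megabytes, while dpkg reports KiB, so the package sizes are converted before comparing. All variable names are made up for this example.

```shell
# Sizes taken from the listings in this post.
numpy_image_mb=219        # numpy image
python_image_mb=143       # python:3.7-slim
openblas_base_kib=40421   # libopenblas-base
openblas_dev_kib=52155    # libopenblas-dev

# Space added by installing numpy on top of python:3.7-slim.
numpy_added_mb=$(( numpy_image_mb - python_image_mb ))

# Combined installed size of the two libopenblas packages, converted from
# KiB to decimal MB (integer division, so the result is rounded down).
openblas_mb=$(( (openblas_base_kib + openblas_dev_kib) * 1024 / 1000000 ))

echo "numpy adds about ${numpy_added_mb} MB"
echo "libopenblas packages take about ${openblas_mb} MB"
```

So numpy adds 76 MB, against roughly 94 MB for the two libopenblas packages.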