Robert's Data Science Blog

Docker Images for R

I use Docker for various purposes, two of which are working with R and Shiny. In this post I will go through the Docker images I use for R. I will share more details on the usage in a later post. The starting point is the Rocker Project, which provides a lot of R-related Docker images.

Docker images are layered and in the context of R images I apply this in the following manner:

The rest of my images build on r-base and r-devtools.

The Dockerfiles are hosted on my GitHub profile.

The base R image

I care deeply about reproducibility. One downside of R that you can run into in daily use shows up when you return to an old project on a new computer. Simply installing the latest version of R and the necessary packages will quite likely give you different versions of both than the ones you originally used. If you are lucky, your code will still run and return the same results. Otherwise, you have to update your code or try to find and install the required packages in the versions you had when you wrote the code.

Packages like packrat and checkpoint aim to help you manage the packages you use in a project. However, you still need access to old versions of R. Actively maintained Linux distributions often only have the bleeding edge in the repositories.

An alternative is to compile R yourself – which is what happens in the Version-stable Rocker images.

As noted in the source repository, this does not guarantee very long-term reproducibility, as we still need a base image with the dependencies.

Besides the compiled R, the Version-stable Rocker images use Microsoft’s daily snapshot of the official CRAN package repository, with the date set to the last day on which a particular version of R was the most recent release. This means that any official package installed within such an r-base image (or any image that builds on it) works with that version of R.
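A minimal sketch of how such a pinning could look as a fragment of a Dockerfile, placed after the step that installs R. The snapshot date and the location of Rprofile.site are assumptions for illustration; the Rocker Dockerfiles are the place to check the actual values.

```dockerfile
# Sketch only: the date and the Rprofile.site path below are illustrative assumptions.

# Hypothetical snapshot date: the last day on which the chosen R version was current.
ARG BUILD_DATE=2018-04-22

# Point the default CRAN repository at the dated snapshot, so that every
# install.packages() call in this image (and in images built on it) uses it.
RUN echo "options(repos = c(CRAN = 'https://mran.microsoft.com/snapshot/${BUILD_DATE}'))" \
    >> /usr/local/lib/R/etc/Rprofile.site
```

Because the option is written to the site-wide profile, later images do not have to repeat the repository URL.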

Those were the broad lines. So why don’t I just use the Version-stable Rocker images? Because there are a few things I want to do differently.

The base image

The Rocker images are all based on Debian and I prefer Debian’s offspring Ubuntu. This also gives more consistency with the Docker images created with Azure Machine Learning Studio that we have used at work.

The downside is that not all dependencies are the same in Ubuntu and Debian. In the Version-stable Rocker images’ GitHub repo there are commands for finding dependencies. The dependencies needed at runtime can be found with the command

apt-cache show r-base-core | grep ^Depends

The dependencies needed for the compilation can be found with the command

apt-cache showsrc r-base-core | grep ^Build-Depends

I do not use all of the dependencies, as some of those listed in that document are for graphics (marked with X11).

It is important to remove the build dependencies afterwards: the final r-base image takes up around 440 megabytes when they are removed and a whopping 2.2 gigabytes when they are not.
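The removal only saves space if it happens in the same RUN statement as the installation, because files deleted in a later statement still exist in the earlier layer. A minimal sketch of the pattern, where the package list and the source folder /tmp/R-src are placeholders and not the actual contents of my Dockerfile:

```dockerfile
# Sketch only: BUILD_DEPS and /tmp/R-src are hypothetical placeholders.
# Install, compile, and clean up in ONE statement so that the build
# dependencies never end up in a stored layer.
RUN BUILD_DEPS="libreadline-dev libbz2-dev liblzma-dev" \
    && apt-get update \
    && apt-get install -y --no-install-recommends $BUILD_DEPS \
    && cd /tmp/R-src \
    && ./configure \
    && make \
    && make install \
    && apt-get purge -y --auto-remove $BUILD_DEPS \
    && rm -rf /tmp/R-src /var/lib/apt/lists/*
```

Note that the runtime dependencies have to be installed explicitly in such a setup, since the auto-remove step may otherwise take them along with the build dependencies.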

The user

By default, the user in a Docker container is root, but this is discouraged in the best practices for writing Dockerfiles. Docker imposes no restrictions on the non-root user in a container, so my choices aim to make life easier in a later image that does have requirements: Shiny Server.

I create a user called shiny in a group called shiny that owns a global package directory as well as all future subfolders (packages), using a sticky permission.
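One way such a setup could be written is sketched below. It assumes that the permission in question is the setgid bit, which makes new subfolders inherit the group of the parent folder, and that the global package directory is /usr/local/lib/R/site-library; both are assumptions for illustration.

```dockerfile
# Sketch only: the library path and the permission bits are assumptions.
# Create the group and user, hand the package directory over to them, and
# set the setgid bit so that new package folders inherit the shiny group.
RUN groupadd shiny \
    && useradd --gid shiny --create-home shiny \
    && mkdir -p /usr/local/lib/R/site-library \
    && chown shiny:shiny /usr/local/lib/R/site-library \
    && chmod g+ws /usr/local/lib/R/site-library

# Run the remaining statements and the container as the non-root user.
USER shiny
```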

The devtools image

The purpose of the r-devtools image is to make it easier to make images with custom packages.

Custom packages

The devtools package makes it easy to install custom packages in an image. Consider the package MyPackage's folder:

├── Dockerfile
├── man
├── MyPackage.Rproj
├── R
└── tests

Here the Dockerfile looks as follows:

FROM r-devtools:3.4.4

COPY --chown=shiny:shiny . /tmp/MyPackage

RUN Rscript -e 'devtools::install("/tmp/MyPackage")' \
    && rm -rf /tmp/*

CMD ["R"]

When we build this image from the MyPackage folder, we copy all content in the folder into /tmp/MyPackage in the image. We can then use devtools to install the package and remove the source.

Every statement in a Dockerfile results in an intermediate image. When building the same image repeatedly, this means that a successful step does not have to be rebuilt, but if an intermediate image changes, all the intermediate images after it also have to be rebuilt.

If MyPackage has many dependencies, the command devtools::install("/tmp/MyPackage") can take a long time. During experimentation, where the files in MyPackage change, the COPY step changes on every build and the installation after it has to be rerun, which results in a long build time for the final image.

To work around this I often install the dependencies in a separate RUN before the COPY statement, i.e., I include a line like

RUN Rscript -e 'install.packages(c("foo", "bar"))'
COPY --chown=shiny:shiny . /tmp/MyPackage
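Put together, the Dockerfile from above could then look like the sketch below, where "foo" and "bar" stand in for the real dependencies of MyPackage:

```dockerfile
FROM r-devtools:3.4.4

# Slow step that rarely changes: placed first so that its intermediate
# image is reused as long as the list of dependencies stays the same.
# "foo" and "bar" are placeholder package names.
RUN Rscript -e 'install.packages(c("foo", "bar"))'

# Steps that change whenever a file in MyPackage changes: placed last.
COPY --chown=shiny:shiny . /tmp/MyPackage

RUN Rscript -e 'devtools::install("/tmp/MyPackage")' \
    && rm -rf /tmp/*

CMD ["R"]
```

With this ordering, a change to the files in MyPackage only reruns the COPY and the final install, and the install is fast because the dependencies are already in place.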