Docker Cache Insights.


I have gained various insights over the time about the better way of using the feature of Docker Caching for the couple of scenarios. I have tried to detail the couple of them which might be helpful for the engineering teams to be more productive and efficient.

About the Docker Cache, Docker provides the functionality to package an application in such a way that “Build once and run anywhere”. During development and continuous deployment, someone has to deal with dozens of continuous docker builds. To facilitate faster turnarounds, docker has the ability to cache the various steps of the build process so that subsequent builds are almost instantaneous. Yes, we all know that Docker is inbuilt with the caching and here are some details about the insights.

Dealing with source control operations.

To get a copy of source code in docker container, developers prefers to use git commands in or wrap into the a script and use Docker’s ‘RUN’ step.

# Clone the git repo
RUN /bin/sh /git clone git@github.com/<account>/<repo>.git
# Git repo operations in a script
RUN /bin/sh /git-clone.sh

Docker clones the repository for that first time and then caches the RUN command executes instantaneously for subsequent builds. Great! but it causes an issue, if repository gets updated after cache and docker build new container, it uses previously cached version as an output resulting a stale codebase. The expectation here is an output of an RUN command should be updated each time of new container build. As it stands, unless the RUN command itself changes (and thus invalidates Docker’s on-host cache), Docker will reuse the previous results from the cache. In this case once ‘RUN git clone*’ gets executed once, the subsequent build uses the same copy of the code and won't try to execute each time to get latest changes.

One quick way to get rid of this is disabling the cache by building with ‘–no-cache’ flag. But it invalidate all the build steps and not a specific step of ‘RUN command’ resulting in the execution of all steps again and totally defeating the purpose of the cache. One of the way to deal with this is to generate unique RUN command each time ensuring it gets run each time

  1. We can wrap the Docker build in another script that generates a uniquely numbered mini script for the clone operation. This step would insert the invocation of that script into the Dockerfile that is generated on-the-fly just prior to build time, such that for the operation that must be run every time – the clone – its RUN statement is indeed unique.

    i.e. RUN /bin/sh /git-clone-123abc.sh

    where ‘-123abc’ is uniquely generated and appended for each build (and subsequent executions create something like ‘clone-124abc.sh’) and contains the git clone operation.

  2. Place this source control operations into the last RUN that is listed in the Dockerfile. This guarantees that Docker will run the clone during each build while having the advantages of being both fully automated and ensuring that the cache is used right up to that last unique RUN.

Appropriate usage of ‘ADD’ step in Dockerfile.

‘ADD’ step is being used most commonly in Dockerfile. But I think if it’s not been used properly, it becomes a common cause of cache busting during builds.

Let me give an example through Node.js MEAN stack docker build steps,

# 1. Add the application folder. It has package.json
        ADD . /src
# 2. Go to the package.json directory and build the app using npm package manager.
        RUN cd /src && npm install

The first step adds the current folder to container’s /src folder. The second step is very time-consuming as it installs dependencies. But at the same time package.json don’t get change often, hence during subsequent build this step should be very fast.

But here is the glitch. In the first step, when Docker does its comparison either to use cache, it compares the folder (the ‘.’) against the previously built folder, and if any file has changed from the current folder in meantime, the cache gets busted and second step of ‘npm install’ get executed even though package.json didn’t change.

This behavior has been ignored many times. This can be avoided with reordering of the steps.

# 1. Add package.json
    ADD package.json /src/package.json
#2. Run the Build with package.json
    RUN cd /src && npm install
# 3. Add the application folder.
    ADD . /src

In this case, even though any file in application folder gets change, it doesn't affect the first two steps unless package.json itself change. Docker caches the first two steps easily and gets bust third command onwards.

mTime(modified time) TimeStamp

Docker determines whether or not to use the cached version of a file is by comparing several attributes of the older and the newer version, including mtime. For example, during the MEAN stack build steps

# 1. Add package.json
    ADD package.json /src/package.json

After the first docker build, docker uses the cached copy of ‘package.json’ during subsequent builds if docker build is happening on the same project from the same directory several times in a row.

But most of the time, engineering team uses continuous delivery environments like Jenkins or CD. So to configure docker hook up with the build trigger a fresh clone of the application’s repository occurs. When this happens the mtime of a new copy of ‘package.json’ gets different than the previous cached ‘package.json’. And this is a problem for docker. As Docker uses the mtime of a file when doing its comparison and the mtime changes for our package.json file on every single clone, the cached version can never be used although technically file content never changed.

Fortunately, there is a solution! As part of our build process, after cloning the Git repository mtime of the ‘package.json’ The file needs to be changed to the time file was last changed within Git. This means that on subsequent clones if the file has not been modified according to Git, then the mtime will be consistent and thus Docker can use the cached version.Below sample script is one of the way to deal with mTime Timestamp.

#!/bin/bash
# 1. Get the git revision for package.json
    REV=$(git rev-list -n 1 HEAD 'package.json');
# 2. Get the timestamp of the last commit
    STAMP=$(git show --pretty=format:%ai --abbrev-commit "$REV" | head -n 1);
# 3. Update the file with the last commit timestamp
    touch -d "$STAMP" package.json;

Step 3 ensures mTime is always same for all the git clone until unless the updated version of the package.json file gets checked into source control.

This solution is just a workaround. There are a couple of proposed enhancements to the behavior of docker caching strategy with consideration of other attributes like sha1sum and md5sum along with a timestamp.https://github.com/docker/docker/issues/4351.If this feature gets added then workaround mention here not be needed.

Avoid installing unnecessary packages

Every docker file starts with system installation instructions like,...

RUN apt-get update && apt-get install wget -yy
RUN wget -q -O - http://pkg.jenkins-ci.org/debian/jenkins-ci.org.key | apt-key add -
RUN echo "deb http://pkg.jenkins-ci.org/debian binary/" >> /etc/apt/sources.list
RUN apt-get update && apt-get install -y -qq --no-install-recommends git jenkins curl telnet unzip openssh-client && apt-get clean

Throughout development, more and more package gets added. Same time developers generally don’t try to refactor this portion of dockerfile considering it is a very core part of the docker build and package removal might encounter unnecessary issues down the line. As per my experience working with the a big team, this section becomes the hub of unnecessary packages.

The solution is simple, just have continuous refactoring of setup steps and avoid installation of unnecessary packages. Breaking the lengthier steps into multi-line and then sorting the arguments also help to get rid of duplicates.

RUN apt-get update && apt-get install -y -qq --no-install-recommends\
git \
jenkins \
curl \
telnet \
unzip \
openssh-client \
&& apt-get clean