Are Dockerfiles good enough?

For those looking for a fast overview of containers, click here.

Containers have quickly become the favorite way to deploy software, for a lot of good reasons. They have allowed developers, for the first time, to test "as close to production" as possible. Unlike, say, VMs, containers have minimal performance and resource overhead. Almost all of the new orchestration technology, like Kubernetes, relies on them, and they are an open standard with a diverse range of corporate rulers overseeing them. In terms of the sky-high view, containers have never been in a better place.

I would argue, though, that in our haste to adopt this new workflow, we missed some steps. To be clear, this is not to say containers are bad (they aren't) or that they aren't working correctly (they mostly work as advertised). However, many of the benefits of containers aren't being used correctly by organizations, resulting in a worse situation than before. While it is possible to use containers in a stable and easy-to-replicate workflow across a fleet of servers, most businesses don't.

We're currently in a place where most organizations relying on containers don't use them correctly. At the same time, we went back 10+ years in terms of the quality of the tools Operations teams have for managing servers, defined broadly as "places where our code runs and accepts requests". There has been a major regression inside many orgs, which now tolerate risks inside containers that would never have been allowed on a fleet of virtual machines.

For me, a lot of the blame rests with Dockerfiles. They aren't opinionated enough or flexible enough: they force people into workflows where catastrophic mistakes can happen with no warning, rely too much on brittle bash scripts, and throw away a lot of the tooling Operations gained over the last decade.

What did containers replace?

In the beginning, there were shell scripts, and they were bad. The original way a fleet of servers was managed when I started was, without a doubt, terrible. There were typically two physical machines for the databases, another four physical machines for the application servers, some sort of load balancer, and then networking gear at the top of the rack. You would PXE boot a box onto an install VLAN and it would kind of go from there.

There was a user with an SSH key added, usually admin. You would then run a utility to rsync over a directory of bash scripts and run them. Very quickly, you would run into problems. Writing bash scripts is not "programming light", it's just real programming. But it's programming with both hands tied behind your back. You still need to write functions, encapsulate logic, handle errors, and so on. But bash doesn't want to help you do any of this.

You can still get thrown by undefined variables, comparison vs. assignment is a constant problem for people starting out with bash (foo=bar vs foo = bar), you might not check that bash is actually the shell you are running, and a million other things. Often you ended up with carefully composed scripts written against raw sh, just in case the small things bash does to make your life better weren't there. I have worked with people who are expert bash programmers and can do it correctly, but it is not a safer, easier, or more reliable programming environment.
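
A few lines of defensive boilerplate catch a surprising number of these mistakes. This is a minimal sketch, not a complete safety net, and the variable name is made up for illustration:

#!/usr/bin/env bash
# Fail fast: exit on errors, treat unset variables as errors, and fail a
# pipeline if any command in it fails.
set -euo pipefail

greeting="hello"                     # assignment: no spaces around =
if [ "$greeting" = "hello" ]; then   # comparison: spaces and quotes required
    echo "${greeting}, world"
fi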

Let's look at a basic example I see all the time.

for f in $(ls *.csv); do
    some-command $f
done

I wish this worked the way I assumed it did for years. But it doesn't. You can't treat the output of ls like a stable list and iterate over it. You have to account for whitespace in filenames and for glob characters, and ls itself can mangle filenames. This is just a basic example of something that everyone assumes they are doing right until it causes a massive problem.

The correct way I know to do this looks like this:

while IFS= read -r -d '' file; do   # IFS= keeps whitespace, -r keeps backslashes, -d '' reads NUL-delimited records
  some-command "$file"
done < <(find . -type f -name '*.csv' -print0)   # -print0 emits NUL-delimited names that survive any filename

Do you know what IFS is? It's ok if you don't, I didn't for a long time. (It's the shell's internal field separator, the set of characters bash splits words on; clearing it here keeps read from trimming whitespace off filenames.) My point is that this requires a lot of low-level understanding of how these commands work in conjunction with each other. But for years, around the world, we all made the same mistakes over and over. However, things began to change for the better.

As time went on, new languages became the go-to for sysadmin tasks. We started to replace bash with Python, which was superior in every imaginable way. Imagine being able to run a debugger against a business-critical bootstrapping script. This clearly emerged as the superior paradigm for Operations. Bash still has a place, but it couldn't be the first tool we reached for every time.

So we got new tools to match this new understanding. While there are a lot of tools for managing fleets of servers, I'm going to focus on the one I used the most professionally: Ansible. Ansible is a configuration management framework famous for its minimal dependencies (Python and SSH), for being lightweight enough to deploy to thousands of targets from a laptop, and for its very easy-to-use playbook structure.

Part of the value of Ansible was its flexibility. It was very simple to write playbooks that could be used across your organization, applying different configurations to different hosts depending on a variety of criteria, like which inventory they were in or what their hostnames were. There was something truly magical about being able to tag a VLAN at the same time as you stood up a new database server.

Ansible took care of the abstraction between things like different Linux distributions, but its real value was in the higher-level programming concepts you could finally use. Things like sharing playbooks between different sets of servers, writing conditionals for whether to run a task or not on that resource, even writing tests on information I could query from the system. Finally, I could have event-driven system administration code, an impossibility with bash.
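
To make that concrete, here is a hedged sketch of the kind of playbook this enables. The group name, package, and memory threshold are illustrative placeholders, not anything from a real inventory:

---
# Illustrative playbook: conditionals based on gathered facts, plus an
# assertion against information queried from the system.
- hosts: appservers
  become: true
  tasks:
    - name: Install the web server, but only on Debian-family hosts
      apt:
        name: nginx
        state: present
      when: ansible_facts['os_family'] == "Debian"

    - name: Refuse to continue on an undersized box
      assert:
        that:
          - ansible_facts['memtotal_mb'] >= 2048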

With some practice, it was possible to use tools like Ansible to do some pretty incredible stuff, like calling out to an API with lookups to populate information. It was a well-liked, stable platform that allowed for a lot of power. Tools like Ansible Tower let you run Ansible from a central platform, which made it possible to keep a massive fleet of servers in exact configuration sync. While certainly not without work, it was now possible to say with complete confidence "every server in our fleet is running the exact same software". You could even do actual rolling deploys of changes.

This change didn't eliminate all the previous sources of tension, though. Developers still could not just upgrade to a new version of a language or install random new binaries from package repositories on the system. This created a bottleneck, as changes had to be added to the existing playbooks and then rolled out. The process wasn't too terrible, but it wasn't hands-off and it couldn't be done on demand: you could not decide in the morning to have a new cool-apt-package in production by that afternoon.

Then containers appeared

When I was first introduced to Docker, I was overjoyed. It seemed like a great middle step between the two sets of demands. I could still rely on my mature tooling to manage the actual boxes, but developers would have control of, and responsibility for, what ran inside their containers. Obviously we would help, but this could be a really good middle ground. It certainly seemed superior to developers running virtual machines on their laptops.

Then I sat down and started working with containers, and the illusion was quickly shattered. I was shocked and confused: this was the future? I had to write cron jobs to clean up old images; why isn't this a config file somewhere? Why am I managing the docker user and group here? As it turns out, installing Docker would be the easy part.

Application teams began to write Dockerfiles and my heart started to sink. These were just the bash scripts of my youth all over again. The learning curve was exactly the same, which is to say a very fast start followed by a progressively more brutal arc. Here are some common problems I saw in my first week of exposure to Dockerfiles that I still see all the time:

  • FROM ubuntu:latest Already we have a problem. You can pull that down to your laptop, work for a month, deploy it to production, and be running a totally different version of Ubuntu. You shouldn't use latest, but you also shouldn't rely on other normal tags, because they can be repointed at different images at any time. The only tool Docker gives you to ensure everyone is running the exact same thing is the image digest. Please use it. FROM ubuntu@sha256:cf25d111d193288d47d20a4e5d42a68dc2af24bb962853b067752eca3914355e is less catchy, but it is likely what you intended (there's a sketch pulling these fixes together after this list). Even security updates should be deliberate.
  • apt-get is a problem. Don't run apt-get upgrade, otherwise you've just upgraded every package in the image and defeated the point: we want consistent, replicable builds. I've also seen a lot of confusion between apt and apt-get; apt is aimed at interactive use and doesn't promise a stable CLI, so stick with apt-get in Dockerfiles.
  • Running COPY yourscript.py before the RUN step that installs dependencies breaks layer caching: every code change invalidates the dependency layer and forces a full reinstall. Copy the dependency manifest first, install, then COPY the rest of the code.
  • Running everything as root. We never let your code run as root before, so why is it suddenly a good idea now? A RUN useradd --create-home cuteappusername and a USER cuteappusername should be in there.
  • Adding random Linux packages from the internet. I understand it worked for you, but please stick to the distribution's official package repositories. I have no idea what this package does or who maintains it. Looking at you, random curl in the middle of the Dockerfile.
  • Writing brittle shell scripts in the middle of the Dockerfile to handle complicated operations like database migrations or external calls, then not accounting for what happens if they fail.
  • Please stop putting secrets in ENV. I know, we all hate secrets management.
  • Running ADD against unstable URL targets. If you need the file, download it, check it into the repo, and COPY it. Stop assuming a random URL will always work.
  • Obsessing about container size over everything else. If you have a team of Operations people familiar with Debian, following Debian releases, plugging into the ecosystem, why throw all that expertise in the trash for a smaller container?
  • Less && and && \, please. This one isn't your fault, but sometimes looking at complicated Dockerfiles makes my eyes hurt.
  • Running a full Linux container for a script. Thankfully Google has already solved this one.
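
Pulling most of that advice together, a more defensive Dockerfile might look something like the sketch below. The digest is the one from the FROM example above; the packages, file names, and user name are placeholders for whatever your application actually needs:

FROM ubuntu@sha256:cf25d111d193288d47d20a4e5d42a68dc2af24bb962853b067752eca3914355e

# Install only what you need, without apt-get upgrade, and clean up in the same layer.
RUN apt-get update \
    && apt-get install -y --no-install-recommends python3 python3-pip \
    && rm -rf /var/lib/apt/lists/*

# Copy the dependency manifest first so this layer stays cached while the code churns.
COPY requirements.txt /app/requirements.txt
RUN pip3 install --no-cache-dir -r /app/requirements.txt

# Copy the application code last.
COPY yourscript.py /app/yourscript.py

# Drop root before the container ever runs.
RUN useradd --create-home cuteappusername
USER cuteappusername
WORKDIR /app

CMD ["python3", "/app/yourscript.py"]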

This is not your fault

You may be looking at this list and thinking "I know all this because I read some article or book". Or maybe you are looking at it and thinking "oh no, I do all of that". I'm not here to judge you or your life. Operations people knew this was a problem, and we had long since concluded it was not one that could be fixed by assuming people would magically discover this information.

My frustration is that we already went through this learning once. We know that the differences in how distros handle packages throw people off. We know bash scripts are hard to write and easy to mess up. The entire industry learned, through years of pain, that it is essential to be able to roll back not just your application but the entire infrastructure the application runs on. Creating endless drift in infrastructure worked until it didn't, and suddenly teams had to spend hours trying to reverse engineer which of the dozens of changes introduced with the latest update caused the problem.

In our rush to get to a place where obstacles were removed from developers, we threw away years of hard-earned experience. Hoping for the best and having absolutely no way to recover if it doesn't work isn't a plan. It isn't even really a philosophy. Saying "well the Linux part isn't the important part of my application" is fine until that is very much not the case. Then you are left in an extremely difficult position, reaching for troubleshooting skills your organization might not even have anymore.

Stuff we can do now

  • Start running a linter against your Dockerfiles: https://github.com/hadolint/hadolint (there's an example command after this list).
  • Look at alternatives to conventional Dockerfiles. Below is an example of combining Ansible with a Dockerfile.
FROM debian@sha256:47b63f4456821dcd40802ac634bd763ae2d87735a98712d475c523a49e4cc37e

# Install Ansible
RUN apt-get update && apt-get install -y wget gcc make python python-dev python-setuptools python-pip libffi-dev libssl-dev libyaml-dev
RUN pip install -U pip
RUN pip install -U ansible

# Setup environment
RUN mkdir /ansible
COPY . /ansible
ENV ANSIBLE_ROLES_PATH /ansible/roles
ENV ANSIBLE_VAULT_PASSWORD_FILE /ansible/.vaultpass

# Launch Ansible playbook
RUN cd /ansible && ansible-playbook -c local -v example.yml

# Cleanup
RUN rm -rf /ansible
RUN apt-get purge -y python-dev python-pip
RUN apt-get autoremove -y && apt-get autoclean -y && apt-get clean -y

# Final steps
ENV HOME /home/test
WORKDIR /
USER test

CMD ["/bin/bash"]
It's not perfect, but it is better.
  • Better than this would be to use Packer. It allows developers to string together Docker as a builder and Ansible or Puppet as a provisioner! It's the best of all possible worlds. Here are the details. Plus you can still run all the Dockerfile commands you want (there's a sketch of the setup below).
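
Here is a hedged sketch of what that Packer setup can look like, using Packer's older JSON template format. The playbook name and repository are placeholders, the base image digest is reused from the earlier example, and whatever image you start from needs Python inside it for Ansible modules to run:

{
  "builders": [
    {
      "type": "docker",
      "image": "debian@sha256:47b63f4456821dcd40802ac634bd763ae2d87735a98712d475c523a49e4cc37e",
      "commit": true
    }
  ],
  "provisioners": [
    {
      "type": "ansible",
      "playbook_file": "./example.yml"
    }
  ],
  "post-processors": [
    {
      "type": "docker-tag",
      "repository": "myorg/myapp",
      "tag": "0.1"
    }
  ]
}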
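
As for the linter in the first bullet, hadolint's README shows you can run it without installing anything locally by piping your Dockerfile into its container image:

docker run --rm -i hadolint/hadolint < Dockerfile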

Place I would love to get to

I would love for some blessed alternative to Dockerfiles to emerge. We don't want to break backwards compatibility, but I would love a less brittle tool to work with. Think of something like Terraform or Packer, sitting between me and the actual build. It doesn't need to be a full programming language, but some guardrails against common mistakes are desperately needed, especially as there are fewer and fewer restrictions between developers and production.

Questions/comments/does this tool already exist and I don't know about it? Hit me up on Twitter.