A better way to build container images

How to leverage distributions' packaging tools and CI/CD to build better container images, using alpine Linux and GitLab as examples.

Context

Why build images?

Since most cloud-oriented open-source projects provide their own pre-built images through highly available public registries (quay.io, gcr.io, docker.io, …), you may legitimately ask whether it’s worth spending time building and distributing your own images.

In fact, apart from lab environments where it makes sense to quickly evaluate a solution using pre-built images, I found that consuming public images in production was impractical at best, and dangerous at worst:

  1. each image comes with its own way of configuring and running the services (environment variables, volumes, secrets, user, …),

  2. the distribution, tooling, size and quality vary greatly between images,

  3. images don’t necessarily follow the upstream distribution or project’s revisions closely, integrate the options you need (compilation flags for example), apply important security patches to dependencies, run as an unprivileged user, or handle PID 1 properly,

  4. if you use containers on the edge, you will probably have difficulty finding images for your architecture (arm64).

Your cluster is a lot easier to manage if you can maintain consistency across your images: same distribution, same way to generate configurations, access secrets, and run services. This is a prerequisite for having a golden image you can use as a basis for all the others.

I wanted to share with you an alternative way of building container images, but first let’s go back in time to understand how we ended up reinventing a square wheel with docker.

The docker way

When docker came out in 2013, it revolutionized software distribution by allowing anyone to easily build, ship and run full Linux system images on top of lightweight isolation features provided by the Linux kernel: namespaces and cgroups.

Before docker, building a system image that you could use on different servers required:

  • extensive Linux system administration skills,

  • complex system building process (Linux from scratch, …),

  • complex virtualization tools (vmware ESXi, qemu, kvm, …), and big network volumes or external devices to install the images from,

  • a heavy manual procedure in between (what some refer to as ClickOps).

docker fully automated this process:

  • by using a single file (Dockerfile) describing how to build the image and how to run containers based on that image,

  • by simplifying image distribution using a network registry,

  • and by reducing every step of the container’s life-cycle to a single command line instruction (pull, build, push, run).

The installation of docker itself was reduced to copying a single big fat binary somewhere into your operating system.

People started to abuse the limited scripting expressiveness of the Dockerfile the moment they used it to build software instead of just assembling parts to create an image. In the name of repeatability and centralization of the whole build process, features were progressively added to docker to overcome problems caused by this misuse. This made the situation even worse.

Layers are useless

Each statement in a Dockerfile that has a side effect on the file system triggers the creation of a new layer in the resulting image. Having a layer built with every RUN statement was nice at the beginning because it allowed you to develop an image step by step and restart the process where it failed by quickly retrieving the previous successful layers from the cache.

If this makes sense during development, it falls short when you go to production. You end up with multi-gigabyte images made of dozens of layers containing all the steps of the build: installation of the development packages, intermediate files, and even deleted files.

People started to use tools to squash image layers, or used ugly chains of instructions separated by && in a single RUN statement to limit the creation of new layers, at the cost of reduced readability, harder debugging, and much longer rebuilds after a change. The recent introduction of a heredoc syntax, 8 years after the first release of docker, says much about the weight of bad design decisions and the time needed to mitigate them.
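
For illustration, here is a hypothetical fragment (the URL and packages are made up, not taken from a real image) showing the classic && chain next to its roughly equivalent heredoc form:

# classic form: one RUN, one layer, at the price of readability
RUN apk update \
    && apk add --no-cache curl build-base \
    && curl -sSfO https://example.com/src.tar.gz \
    && tar xzf src.tar.gz \
    && rm src.tar.gz

# heredoc form (needs BuildKit and the # syntax=docker/dockerfile:1.4 directive)
RUN <<EOF
apk update
apk add --no-cache curl build-base
curl -sSfO https://example.com/src.tar.gz
tar xzf src.tar.gz
rm src.tar.gz
EOF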

If layers seem like a good idea on paper, they are in fact pretty useless. Modern image build tools like buildah don’t enable caching by default and squash layers down to two: the base image (FROM statement), and the rest.

Multi-stage build is a hack

And what to say about the multi-stage build functionality, which seems to exist only so we don’t have to think too much about cleaning up our mess after a build, or about keeping the whole tool-chain out of the final image?

Look for instance at this Dockerfile taken from the meilisearch repository:

# Compile
FROM alpine:3.14 AS compiler

RUN apk update --quiet \
    && apk add -q --no-cache curl build-base

RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y

WORKDIR /meilisearch

COPY Cargo.lock .
COPY Cargo.toml .

COPY meilisearch-auth/Cargo.toml meilisearch-auth/
COPY meilisearch-error/Cargo.toml meilisearch-error/
COPY meilisearch-http/Cargo.toml meilisearch-http/
COPY meilisearch-lib/Cargo.toml meilisearch-lib/

ENV RUSTFLAGS="-C target-feature=-crt-static"

# Create dummy main.rs files for each workspace member to be able to compile all the dependencies
RUN find . -type d -name "meilisearch-*" | xargs -I{} sh -c 'mkdir {}/src; echo "fn main() { }" > {}/src/main.rs;'
# Use `cargo build` instead of `cargo vendor` because we need to not only download but compile dependencies too
RUN if [ "$TARGETPLATFORM" = "linux/arm64" ]; then \
        export JEMALLOC_SYS_WITH_LG_PAGE=16; \
    fi && \
    $HOME/.cargo/bin/cargo build --release
# Cleanup dummy main.rs files
RUN find . -path "*/src/main.rs" -delete

ARG COMMIT_SHA
ARG COMMIT_DATE
ENV COMMIT_SHA=${COMMIT_SHA} COMMIT_DATE=${COMMIT_DATE}

COPY . .
RUN if [ "$TARGETPLATFORM" = "linux/arm64" ]; then \
        export JEMALLOC_SYS_WITH_LG_PAGE=16; \
    fi && \
    $HOME/.cargo/bin/cargo build --release

# Run
FROM alpine:3.14

ENV MEILI_HTTP_ADDR 0.0.0.0:7700
ENV MEILI_SERVER_PROVIDER docker

RUN apk update --quiet \
    && apk add -q --no-cache libgcc tini curl

COPY --from=compiler /meilisearch/target/release/meilisearch .

EXPOSE 7700/tcp

ENTRYPOINT ["tini", "--"]
CMD ./meilisearch

It is not obvious to figure out:

  • where the different parts start and end,
  • what is constructed or shared between steps and how,
  • where every piece is coming from,
  • and where it is finally installed.

There is no contract or constraint between the different parts, and it’s even a common pattern to use different images/distributions in the FROM statements. We are simply copying blobs between stages without consistency or dependency checks. I can only imagine the kind of bugs such a construction can potentially introduce.

And yet it can work pretty well. But can’t we do better?

A better way

Back to square one

Linux distributions have accumulated decades of experience in building and packaging software in the most efficient and reliable way. Distribution packaging tool-chains prevent you from repeating the mistakes that stacking build instructions in a Dockerfile inevitably exposes you to. They can:

  • get source code from reliable sources,
  • verify check-sums and signatures,
  • patch files,
  • build and test in isolation for multiple hardware architectures,
  • optimize (strip symbols),
  • split in sub-packages (architecture independent assets, doc, dev, …),
  • cleanup,
  • trace dependencies automatically,
  • compress and package everything into simple archives you can install anywhere, even in running containers.

All of that without having to know the exhaustive list of tools needed to complete all those tasks, and with the guarantee of not skipping any step along the way. As a bonus, you get proper error and warning messages when something goes south or looks suspicious.

There is NOT a single task in the list above that a Dockerfile can do for you out of the box, and that fact alone should make you think several times before building anything inside a Dockerfile ever again.

Containerfile for assembly

Instead of building your software directly in a Dockerfile, why not use the packaging tools of the distribution you are targeting for your image (alpine Linux), and just install the package?

This is, for example, a Containerfile (the OCI name for a Dockerfile) for another meilisearch image:

FROM reg.itsufficient.me/alpine:3.15
MAINTAINER eric@itsufficient.me

RUN apk add --no-cache meilisearch=0.25.2-r0

## add s6 configuration
ADD etc /etc

# meilisearch port
EXPOSE 7700

The Containerfile is straightforward and easy to understand.

  1. No need for layers or cache: it just installs packages, copies files, and completes very quickly.

  2. The build process is now composable: you can easily use the same package in totally different contexts (even on running containers, as illustrated right after this list).

  3. It is flexible: we split the build (packages) from the assembly process (image), which we can manage in different projects with different teams and different security policies.

  4. It scales better: you can clearly see the dependency graph, implement it in the form of CI/CD pipelines, and let the system execute everything in parallel whenever possible.
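
As a small illustration of point 2, the exact same package can be installed in an already running container, for example for a quick test (the pod name is hypothetical):

# install the same apk inside a running pod
kubectl exec my-pod -- apk add --no-cache meilisearch=0.25.2-r0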

You can even build multiple image versions with the same Containerfile without modifying it:

FROM reg.itsufficient.me/alpine:3.15
MAINTAINER eric@itsufficient.me

ARG TAG
RUN apk add --no-cache meilisearch=${TAG}-r0

## add s6 configuration
ADD etc /etc

# meilisearch port
EXPOSE 7700

You just have to add a tag to your repository and configure your CI/CD pipeline script to pass the tag to the build engine (docker, podman, buildah, kaniko, …) with --build-arg TAG="${CI_COMMIT_TAG}".
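
A minimal job sketch (the job layout is illustrative, not my actual template) could look like this:

build:
  image: quay.io/buildah/stable
  rules:
    - if: $CI_COMMIT_TAG
  script:
    # pass the git tag to the Containerfile as the package version
    - buildah bud --build-arg TAG="${CI_COMMIT_TAG}" -t "meilisearch:${CI_COMMIT_TAG}" .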

One can argue that we just hid the build complexity behind:

  1. a base image that hides the configuration of the package repository (URL and public keys) and the entry point,

  2. a supervision suite (s6) that probably breaks the one process per container mantra (we will come back to that later).

On top of that, we now need a way to build and deploy packages to a private repository, and rely on CI/CD pipelines to glue everything together.

These are valid points, but given all the qualitative and functional advantages we get by moving all build logic out of the Containerfile, it is totally worth the effort, as we will now see.

Building packages

alpine Linux uses APKBUILD files (heavily inspired by Gentoo's build system) for creating packages.

This is, for instance, what is needed to build, test, package, and optimize meilisearch. It is pretty straightforward and easy to understand even if you know nothing about the APKBUILD syntax (structured sh).

# Maintainer: Eric BURGHARD <eric@itsufficient.me>
pkgname=meilisearch
pkgver=0.25.2
pkgrel=0
pkgdesc="Powerful, fast, and an easy to use search engine"
url="https://www.meilisearch.com"
arch="x86_64"
license="MIT"
makedepends="cargo"
install="$pkgname.pre-install"
source="$pkgname-$pkgver.tar.gz::https://github.com/$pkgname/$pkgname/archive/v$pkgver.tar.gz"

build() {
	cargo build --release
}

check() {
	cargo test all --release
}

package() {
	install -Dm755 target/release/"$pkgname" "$pkgdir"/usr/bin/"$pkgname"
}

sha512sums="
fb22c314b3d2dae4b46640d2ed6fd91a1e80f649c597f8b1bcb63a259173a0a54810dc0a9fe1cddcc991652cd0a590eed827d4f65754e404aa862d4de1b4fa92  meilisearch-0.25.2.tar.gz
"

build(), check() and package() are special functions (or hooks), and most of the heavy work is done automatically behind your back. Many of these functions have a default implementation and run even if not defined/overridden in the APKBUILD file (fetch(), unpack(), prepare(), …).

By just running abuild -r, you end up with software that is properly verified, tested, optimized and packaged. Dependencies are installed with the package, and additional scripts (e.g. user/group creation) can be executed at several points during the process (pre-install, post-install, …).
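
The APKBUILD above declares such a script with install="$pkgname.pre-install". Assuming we want a dedicated unprivileged user for the service, a minimal sketch of that script could be:

#!/bin/sh
# create the system group/user the service will run as (busybox adduser flags)
addgroup -S meilisearch 2>/dev/null
adduser -S -D -H -G meilisearch -s /sbin/nologin meilisearch 2>/dev/null
exit 0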

We can see immediately what is going to be built, for which architecture, in which version, and where it will be installed. More importantly, you can feel that by just replacing some values you could build a totally different rust project. In contrast, everything seems mixed up and WET in the meilisearch Dockerfile, and there is probably nothing to keep if you want to adapt it to another project.

From experience, you rarely need more than 50 lines to build anything, whatever the framework or language. You can easily find examples to copy/paste and quickly adapt to your needs.

You can follow the tutorial Building alpine Linux packages inside a container to convince yourself that the procedure is easy. What follows describes what is needed to streamline the production of packages and images for production.

Build cache

Caching is best handled at the language tool level because of the granularity of dependencies. A cache at the image level is invalidated as soon as a single dependency changes, whereas build tools are smart enough to rebuild only what is necessary. As projects often have hundreds of dependencies, no one can seriously consider image layers an effective cache system.

As the CI/CD scheduler can start a build pipeline on any available physical server (node), we need a way to share the cache over the network. Most modern tools now have connectors for S3-compatible stores, and the GitLab chart (Kubernetes) comes with a pre-configured minio instance you can use to add your own buckets. For rust, I use sccache by simply defining environment variables before running cargo:

AWS=XXXX
RUSTC_WRAPPER=/usr/bin/sccache
SCCACHE_BUCKET=sccache
SCCACHE_ENDPOINT=myendpoint:443
SCCACHE_S3_USE_SSL=true

A lot of build tools (bazel, cmake, …) support S3 out of the box or via plugins, but if nothing is available, you can still leverage the cache functionality integrated into GitLab.
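
For instance, a hypothetical job persisting a build directory between pipelines with GitLab's integrated cache (the cached path is illustrative) might look like:

build:
  cache:
    key: "$CI_PROJECT_NAME"
    paths:
      # whatever directory your build tool reuses between runs
      - target/
  script:
    - abuild -r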

CI/CD to rule them all

Multi-project pipelines to build and consume a container image

CI/CD has taken a central place in software development and is used to build, test, assess, package and deploy applications. GitLab generalizes the notion of a package as files served under a well-defined protocol, and expands the list of integrated registry types with each new version at an incredible pace.

To build software outside a Containerfile, we need at least 3 projects sharing the same name and tags but located in different namespaces (or groups). These projects are built either at each new commit (tag) or through API triggers, and publish different kinds of packages upon successful execution. Each project may have the right to make changes and commit to its downstream project, as well as to fetch packages published by its upstream project:

  1. A source project builds, tests, and publishes source releases. It is allowed to update a package project (inject the tag and update the checksum in an APKBUILD file),

  2. a package project gets and verifies the source release, then builds, tests and deploys packages to an alpine repository. It is allowed to update an image project (inject a tag in a Containerfile),

  3. an image project assembles and pushes images. It is allowed to update a manifest project (inject a tag in a YAML manifest),

  4. an optional manifest project describes how the application is deployed:

    • either a CI/CD GitOps pipeline runs when something changes in the project’s git repository (push strategy),
    • or a GitOps controller constantly monitors the repository (pull strategy) and reacts to changes.

    The manifest can be patched with kustomize to match the running environment before being submitted to the control plane (using kubectl or any other tools), which finally pulls the container image.

In some pipelines, we can test whether the tag of the downstream project already exists, and trigger a rebuild instead of committing changes (which would then trigger a build). For instance, if a package project-0.1.0-r1.apk replaces project-0.1.0-r0.apk, the already tagged image project:0.1.0 does not change: a rebuild is triggered, and the most recent package is picked up during the assembly phase. However, this does not apply to the source project, which is tightly coupled to the package project through the checksum of the source archive present in the APKBUILD file. In that case, we always need to commit the new checksum.
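
As a sketch of that logic (DOWNSTREAM_REPO, DOWNSTREAM_ID, TRIGGER_TOKEN and commit_downstream_change are placeholders, not my actual templates), such a job could look like:

#!/bin/sh
# if the tag already exists downstream, just trigger a rebuild of it
if git ls-remote --tags "$DOWNSTREAM_REPO" "refs/tags/$CI_COMMIT_TAG" | grep -q .; then
    curl -X POST \
        -F "token=$TRIGGER_TOKEN" \
        -F "ref=$CI_COMMIT_TAG" \
        "https://gitlab.example.com/api/v4/projects/$DOWNSTREAM_ID/trigger/pipeline"
else
    # otherwise commit the change, which triggers a build downstream
    commit_downstream_change
fi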

Configuring the CI/CD for each alpine package is really easy and totally DRY thanks to GitLab CI/CD templates and to the fact that projects share the same name and tags (only their groups differ). You can look at my GitLab templates if this is of any interest to you.

Here is, for instance, the .gitlab-ci.yml file I use for all my alpine package projects:

include:
  - project: templates/gitlab
    file:
      - deploy_apk.yml
      - commit_downstream_container.yml

variables:
  BUILD_VARIANT: /rust

Package repository

reposerve

GitLab has package registries (npm, maven, conan, …), a container registry (OCI), and an infrastructure registry (helm), but hasn’t tackled the problem of OS package registries (rpm, deb, apk, …) yet.

All the package registries I know of rely on simple static file servers, and alpine is no exception to the rule. Nevertheless, if you want to deploy packages from a CI/CD pipeline with a POST for instance, you need more than a static server: something to handle authorization and uploads, and to trigger actions when new packages are posted (update and sign the index). I didn’t find any software capable of handling these problems out of the box.

The generalization of microservices and the emergence of new compiled languages have changed the way we approach HTTP services. From using a generic solution with a complex configuration and scripts (nginx), we shifted to tailor-made solutions with simple configurations. Rust is a perfect candidate for these kinds of projects because it has some of the fastest HTTP frameworks and gives strong guarantees over memory and thread safety.

reposerve is a static file server over directories containing alpine packages and indexes. It offers an /upload path (access to which can be restricted to requests presenting a valid JWT token) to post one or several packages. It automatically signs the indexes with the alpine build tools when new packages are uploaded.

Server configuration

Configuring reposerve is easy.

dir: /home/packager/packages
tls:
  crt: /var/run/secrets/reposerve/tls.crt
  key: /var/run/secrets/reposerve/tls.key
jwt:
  jwks: https://gitlab.example.com/-/jwks
  claims:
    iss: 'gitlab.example.com'
    ref_protected: 'true'
    ref_type: 'tag'
    namespace_path: 'alpine'

We need to provide a JWKS URL where the public key needed to verify the signature of the JWT token can be found, as well as a list of claims the JWT token must include. Here we just ask reposerve to verify that the token was issued by gitlab.example.com. The other claims are GitLab specific and mean that the upload must come from a pipeline of a project under the alpine namespace (group), for a commit with a protected tag.
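
For reference, the relevant part of the decoded payload of such a token looks roughly like this (values are illustrative):

{
  "iss": "gitlab.example.com",
  "namespace_path": "alpine",
  "project_path": "alpine/meilisearch",
  "ref_type": "tag",
  "ref_protected": "true"
}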

Client configuration

Once reposerve is deployed under a given domain (alpine.example.com), the configuration on the consumer side is simple. Depending on the version of alpine (3.15) you are running, you just have to prepend the deployment’s URL to the list of repositories:

sed -i "1ihttps://alpine.example.com/3.15/main" /etc/apk/repositories
sudo apk update

CI/CD configuration

GitLab automatically generates a short-lived (1h) JWT token for every running pipeline, exposed in the CI_JOB_JWT environment variable. The token contains claims matching the project connected to the pipeline, and those must match the ones declared in the reposerve configuration, otherwise the upload is rejected.

I use this script (embedded in the deploy_apk.yml GitLab template mentioned above) to build and upload the packages to the repository:

# run abuild in the directory containing the APKBUILD file
/usr/bin/abuild -r -P /tmp/packages
# post the packages to the repository
apkdeploy.sh

apkdeploy.sh uses curl and the JWT token to post all the packages found in a directory to the repository, detecting architectures and versions automatically:

#!/bin/sh
. /etc/os-release
VERSION="${VERSION_ID%.*}"
REPO="$(basename "$(dirname "$(pwd)")")"
DIR="${1:-/tmp/packages}"

# deal with multi-arch builds: one APKINDEX.tar.gz per architecture directory
for arch in $(find "$DIR" -name APKINDEX.tar.gz); do
	ARCH="$(basename "$(dirname "$arch")")"
	args="-H 'Authorization: Bearer $CI_JOB_JWT_V2' -F 'version=$VERSION' -F 'repo=$REPO' -F 'arch=$ARCH' "
	# add one file= field per package of this architecture
	for file in $(find "$(dirname "$arch")" -name '*.apk'); do
		args="${args}-F file=@$(basename "$file") "
	done
	(cd "$(dirname "$arch")" && eval curl $args "https://$HOST/upload")
done

Entry point

Process supervision

Process supervision is a crucial part of any Linux system, whether it’s running on a real host, a VM, or a container.

The fact that anyone could suddenly package a Linux distribution without prior knowledge of system init, process management or software packaging was bad in terms of quality and security. A container provides a thin abstraction layer on top of the host operating system and gives the false impression that these questions are no longer relevant.

That belief is also rooted in a mantra many are still relaying: a container should run only one process. The number of projects on GitHub alone dedicated to serving as a docker entry point (tini, dumb-init, pid1, …), and the number of images that carelessly run their executable as PID 1, say a lot about the gap between theory and reality.

Process supervision is tricky. There are a lot of corner cases, and you need to dedicate a vast amount of time to understand everything correctly. I haven’t, and as a matter of fact, I thought for a long time that systemd was a superior init system because it provided a better way of describing services and their dependencies than regular sysvinit, and because it was compiled (C) and declarative instead of interpreted. I was really surprised that it was apparently not used in the container world.

In fact, systemd is so intricately tied to kernel functionalities that it’s impossible to run in a containerized environment, unless you also tie systemd to the container engine as podman did, and drill a lot of security holes by allowing the container to manage critical host resources (cgroups) just to make systemd happy.

I later discovered, thanks to the s6 supervision suite, why the systemd approach was flawed and why it will probably never be usable in musl environments due to its strong glibc dependency (which also makes systemd poorly portable).

s6

s6 is the perfect candidate for process supervision and PID 1 management in containerized environments because its scope has been limited to doing that and only that in the most efficient way. It is lightweight (5 MB all included) and has been carefully crafted by Laurent Bercot, who has extensive low-level experience in Linux process programming and supervision. At some point, s6 should even become the official service manager of alpine.

The best way to use s6 inside your container is to use s6-overlay.

Choosing a supervision suite just for correctly managing PID 1 may seem overkill, but I have at least 2 running processes inside all my containers:

  1. rconfd manages configuration files and fetches secrets. When secrets change, services are signaled (SIGHUP) and can reload their configuration. This is way better than the traditional approach: no proper secret management, or waiting for Kubernetes to kill a pod because it became unresponsive after losing its access.

  2. The main service (generally matching the name of the container), which should wait for rconfd’s startup notification (meaning configuration files have been successfully generated). For that I use the poll-free lightweight startup notification of s6, as sketched below.
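
As a sketch of that mechanism (paths and service names are assumptions, not my actual scripts): rconfd’s service directory contains a notification-fd file declaring the file descriptor it writes to when ready, and the main service’s run script blocks on that readiness with s6-svwait before starting:

#!/bin/execlineb -P
# abort unless rconfd reports up and ready (-U); the scan directory path is an assumption
if { s6-svwait -U /var/run/s6/services/rconfd }
s6-setuidgid meilisearch
/usr/bin/meilisearch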

The way an s6 container starts is consistent with the way every Linux host OS starts: by running an init script (provided by the s6-overlay package). The s6 init supervises a service tree based on structured directories. This is what I use for my golden base image:

Containerfile

FROM alpine:3.15
MAINTAINER eric@itsufficient.me

ARG TAG=${TAG}
ARG RCONFD_VERSION=0.11.2-r0

## install s6
RUN apk add --no-cache \
    s6-overlay=${TAG}-r0 \
    rconfd=${RCONFD_VERSION}

## add s6 base configuration
ADD etc /etc

## run s6 as root
ENV TERM xterm
USER root
ENTRYPOINT ["/init"]

You just have to install whatever packages you need and ADD some scripts in the /etc/services.d directory of the derived image. s6 will do the rest; no need to redefine ENTRYPOINT or CMD. You could also package the scripts in a $pkgname-s6 package and just install it, but as this is a frequently moving part of images (several images can start services differently), I add them directly in the Containerfile instead.
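
Concretely, the etc directory added by the meilisearch Containerfile shown earlier could be laid out like this (a sketch):

etc/
└── services.d/
    └── meilisearch/
        └── run    # the execline script shown below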

One key component of s6 is execline, whose goal is to replace the interpreter used by init scripts (i.e. bash) with a no-interpreter, and to reduce scripts to one-liners. It sounds completely silly, but it is in fact brilliant.

An execline script is a chain of commands + arguments. Each command consumes its own arguments, completes its task, and then replaces itself (like the exec shell command) with the remaining arguments (chain loading). The script is parsed only once at startup, no interpreter stays in memory during the process, and yet you can do everything bash can do. It looks like a Mission: Impossible script consuming itself to the end: only the part that has not yet been executed stays in memory at each step. No interpreter means fewer security risks, fewer allocated resources, and instant startup.

execline is the preferred method for defining services under s6. We could write the following script to start meilisearch. Placed in the right directory (/etc/services.d/meilisearch), it is picked up automatically by the init script. This is the last missing part for creating our meilisearch container:

/etc/services.d/meilisearch/run

#!/bin/execlineb -P
with-contenv
foreground { s6-echo start meilisearch }
cd /var/lib/meilisearch
s6-setuidgid meilisearch
/usr/bin/meilisearch

The foreground instruction looks weird, but as s6-echo accepts a variable number of arguments and doesn’t exec into anything else, we must use foreground, which forks and waits for the {}-delimited block, then execs into the remaining script (from the cd onward). This is chain loading at work: even if I used new lines for readability, it is really only one line.

Conclusion

No doubt docker has radically changed the way we build and deploy applications, and it will forever be remembered as a disruptive technology. While the simplification of every process involved in the container life cycle undeniably explains its dazzling adoption, it also became its curse after a few years. docker bet everything on an all-in-one approach, and sometimes tried to reinvent the wheel just to remain the central piece of everything. This approach also negatively impacted the quality of what was built with it.

Kubernetes won because of its modularity. Through a stable and evolving API, each component could have its own release schedule and make its own experiments, and Kubernetes quickly became the center of a staggering number of external contributions over a broad range of cloud-oriented technologies. It even ended up stripping out docker itself: by slowly replacing it piece by piece (OCI, CNI, CSI, …), it finally deprecated docker as a container engine altogether. Today the cri-o, crun, kubelet combo is faster than docker while being highly composable.

Running containers shouldn’t be approached differently from running real hosts. Concerns about efficiency and security should remain the same to really benefit from the additional isolation features. Software running on a host should run seamlessly in a container and vice versa, and you should always prefer less over more, because it leads to simple and composable instead of cluttered and tied together.

I’m grateful to the s6 author for having patiently built such a nice init system for containers, and for showing that simple works better. He recently managed to get sponsored to work full time on his project, and I think that s6 will soon be ready to challenge systemd on its own turf. Unifying host and container init systems would simplify system administration and make container-oriented distributions (flatcar) even faster, lighter and more secure.

I hope to have successfully shown that building procedures on top of decades-old tools and habits is not necessarily a bad thing, and that taking shortcuts in the name of simplified procedures or modernity is not always a good one. As always, feel free to share your impressions in the comments area below.

Éric BURGHARD

