Demystifying Containers and Docker


Introduction

Containers have swiftly taken over the software development world over the past decade, saving developers tons of time. But they are often taken for granted, as they "just work" and stay out of your way with their black-box magic.

But when the time comes that you have to mess with them, unfamiliarity can bring a lot of friction, and you may wonder why your team is introducing this seemingly unnecessary complexity. "Why can't we just run the app locally like normal people?!" you may think. Well, today I am here to tell you that containers provide a ton of value!

Summary

In this series, we will break the mystery surrounding the black box magic of containers. In this post in particular, we will start by asking ourselves: What are containers, and what is a container actually doing when it is running our application inside of it? What difference does it make to run the application locally vs. running it in a container? What problems does containerization solve?

We will focus on standard containers as defined by the Open Container Initiative, of which Docker is part. There are other types of containers that are out of scope for this post.

Application Dependence on Environment

When developing software, you are not only dealing with your application's code. There are piles of dependencies on which your code is stacked and relies to behave as expected. I am not only talking about dependencies that your code explicitly lists in your favorite package manager; there are many other dependencies that are less visible. For example:

  • Programming language runtime or compiler
  • Operating system and distribution (Debian, Alpine, CentOS, Red Hat, etc.)
  • Web server or reverse proxy (nginx, Apache, etc.)
  • Operating system libraries and applications (glibc, the shell, etc.)
  • Cryptographic dependencies (e.g. OpenSSL, GPG)
  • Database (PostgreSQL, MySQL, MongoDB, etc.)

And there are likely many others, depending on the application. Each of those has its own explicit and implicit dependencies. Consider that a difference in the versions of one of those dependencies, or in their configuration or setup, may cause a variation or failure in your application's execution. Imagine how hard it is to track down the source of this change in behavior.

This is only exacerbated if the application is running on a computer that is not even yours. How could you possibly know the environment details of a remote computer where your application is running?

Reproducibility of Environments

The natural solution to an application's reliance on its environment is to tell the user how to set up the correct environment, via a setup guide (README.md). This is harder for consumer-facing applications, as it relies on the technical competency of the user. Oftentimes, however, and especially for enterprise applications (which are the focus of this post), the "user" is another developer (or DevOps engineer) using or deploying the application.

However, even when the user is technically competent, there is still a large room for error, and it is almost impossible to anticipate every possible mistake the user can make.

There is no telling what is installed on the user's computer. With millions of software applications out there, do you really know how each of them will affect your application?

And even if you do, the constantly changing state of your application, and updates to the environment, would require constant re-evaluation of this setup guide. Will you go through setting up your environment again every time you make changes to your codebase?

Another computer may have a different operating system, a different set of pre-installed applications, or different configurations. Will your setup guide cover all those edge cases, or tell the user when their edge case is not covered?

If only there was a way to automatically set up an environment without requiring user intervention ... and somehow make this environment isolated, so that it is controlled, predictable, and does not affect the rest of the computer ... and somehow make all of this portable and very easy for someone to take and run on their own computer ...

That is precisely what containerization technology aims to do!

What Are Containers?

For ease of understanding, let's break down the individual pieces of the definition. A container:

  • is an isolated and restricted environment
  • encapsulates an application and all of its dependencies (down to OS libraries, but not the kernel)
  • has predictable behavior regardless of the underlying infrastructure
  • runs on a host Linux kernel shared with other containers

The above is what we need for this first part of the series. In the next parts, we will unpack that containers:

  • have standardized tools and methods to operate them (create, run, stop, snapshot, etc.)
  • can be uploaded, downloaded, and run by a runtime without modification

Containers are specified in greater detail in the Open Container Initiative Specification, which is an effort to standardize containerization technology, of which Docker is a part.
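
To make this concrete, here is a rough sketch of what running a container looks like in practice with Docker, using the publicly available alpine image as an arbitrary small example (output abbreviated):

# run an interactive shell inside an Alpine Linux container
[user@computer ~]$ docker run --rm -it alpine sh
/ # cat /etc/os-release
NAME="Alpine Linux"
...
/ # exit

No matter what distribution the host is running, the application inside sees Alpine's libraries and tools.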

Let's break down this definition further.

Isolation

Containers are isolated environments, and this is one of the most important concepts. Isolation helps keep the environment controlled and the application's execution predictable. What exactly does this mean? What exactly is isolated?

Container isolation occurs at multiple levels: file system isolation, processes, user IDs, network, system resources (RAM, CPU, etc.), and others.

To achieve this, containers use existing Linux features: namespaces and control groups (cgroups). cgroups allow isolation and restriction of system resources, while namespaces handle most of the rest.
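
These are features of any modern Linux system, not something containers invent. For instance, the lsns utility (from util-linux) lists the namespaces that already exist on your machine; the IDs and counts below are only illustrative:

[user@computer ~]$ sudo lsns --type pid
        NS TYPE NPROCS PID USER COMMAND
4026531836 pid     112   1 root /sbin/init

Container runtimes create additional namespaces like these and place the container's processes inside them.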

File System Isolation

Containers provide an isolated file system. In other words, the container has its own file system that is separate from the host computer's and other containers' file systems.

To better understand this, think of the file system of your current computer. On Linux, if you look at the root path, you will find something like this:

[user@computer /]$ ls /
bin dev home mnt proc run srv sys usr
boot etc lib lost+found opt root sbin tmp var

Now imagine a similar structure replicated under a subdirectory.

[user@computer ~]$ mkdir /home/user/sub-directory
[user@computer ~]$ cd /home/user/sub-directory
# create all the top-level directories
[user@computer sub-directory]$ mkdir bin dev home mnt ...

So now we have a structure that is very similar to our root, except all the folders are empty. But it is starting to resemble its own small computer inside our big one!

To take this further, we can use change-root, or chroot. chroot changes the root directory. The root directory is the directory accessible via the / path, as when you run cd / or ls / like we did above. chroot allows us to change this (temporarily) to another sub-directory in order to run certain commands.

For chroot to work, we need at least one program to be present so we can run it. We will use bash, which can usually be found under /bin/bash.

# print current working directory
[user@computer sub-directory]$ pwd
/home/user/sub-directory
[user@computer sub-directory]$ cp -a /bin/bash ./bin/bash
# copy libraries used by bash
# For convenience, we will copy the entire lib directory
[user@computer sub-directory]$ cp -a /lib ./
[user@computer sub-directory]$ cp -a /lib64 ./
# chroot syntax: chroot [path to chroot into] [command to run]
# chroot requires root privileges, hence sudo
[user@computer sub-directory]$ sudo chroot ./ /bin/bash
bash-5.1#

Now we have entered the change-rooted environment! (To exit it later, run the exit command.) It is not very useful in its current state, though. If we try to run most things, chances are they will error out:

bash-5.1# ls
bash: ls: command not found

The ls command does not exist in our isolated environment. This means that when we load our own application in here, it will only use programs it can find inside the isolated environment, which is exactly what we are looking for!

Note: While chroot is a useful analogy, it does not provide secure file system isolation, and it is possible for a malicious program to escape it and affect the host file system. Moreover, contrary to popular belief, standard containers don't use chroot anymore; they use pivot_root instead.

Note 2: Change-rooting is not the only mechanism used for file system isolation. Mount namespaces are also used. They work much like the process isolation discussed in the next section, which also relies on namespaces.
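
As a rough sketch of what a mount namespace gives you (using the unshare utility from util-linux; the prompt and paths are illustrative):

# create a new mount namespace and run a shell in it (requires root)
[user@computer ~]$ sudo unshare --mount bash
[root@computer user]# mount -t tmpfs tmpfs /mnt
[root@computer user]# ls /mnt
# the freshly mounted (empty) tmpfs is visible here,
# but /mnt on the host and in other namespaces is unaffected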

Process Isolation

A process in Linux is a running program. When a program is run, it is assigned a process ID, which helps keep track of it and perform operations like stopping or pausing it, launching spin-off processes, allocating resources, and so on.

To check the currently running processes, we can use the ps command.

[user@computer ~]$ ps -a
PID TTY TIME CMD
345 tty1 00:00:10 sway
398 tty1 00:00:02 swaybar
610 tty1 00:00:00 swaybg
1841 tty1 00:00:01 vim
2155 tty1 00:00:05 bash
3001 tty1 00:00:08 node

Containers use a Linux feature called namespaces: Linux can create a brand new process namespace for each container. You can imagine it as an isolated, blank space for the container's processes and process IDs. Processes of other containers (and the host) cannot be "seen", influenced, or manipulated by processes inside the container. From inside, it looks as if the container's processes are the only ones that exist. You can verify this by running the ps command inside the container to see which processes are visible there.
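
You can get a feel for this without a container runtime by creating a PID namespace directly with the unshare utility (from util-linux); the prompt and PIDs below are illustrative:

# start bash in a new PID namespace with its own /proc (requires root)
[user@computer ~]$ sudo unshare --pid --fork --mount-proc bash
[root@computer user]# ps
PID TTY TIME CMD
  1 tty1 00:00:00 bash
  2 tty1 00:00:00 ps

Inside the new namespace, bash is PID 1 and no host processes are visible. Container runtimes do the same thing for the containers they launch.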

Linux namespaces support more than just process namespaces! Let's talk about a few of the important ones.

Network Namespace

Containers also use network namespaces for network isolation. This helps prevent network collisions and cross-container interference. For example, suppose you have two applications that both want to run on port 3000. Since network ports are namespaced, you can avoid this collision: both containers can listen on port 3000 in their own network namespaces, which are isolated from each other.

We can also map port 3000 from each of those containers to a different port on the host, giving them access to and from the wider network.
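
With Docker, such a mapping looks like this; my-app is a hypothetical image whose application listens on port 3000:

# each container listens on port 3000 internally,
# but is reachable on a different host port
[user@computer ~]$ docker run -d -p 8080:3000 my-app
[user@computer ~]$ docker run -d -p 8081:3000 my-app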

User Namespace

Containers can create their own users and groups that are separate from other containers and the host system. I spoke extensively about users and groups in my series on traditional Unix permissions.
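
A small sketch of a user namespace in action, again using unshare from util-linux (the prompt is illustrative):

# create a user namespace and map the current user to root inside it
[user@computer ~]$ unshare --user --map-root-user bash
[root@computer ~]# whoami
root
# "root" here exists only inside the namespace;
# on the host, this is still the unprivileged user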

Control Groups

Control groups, or cgroups, are another Linux feature; they allow restricting and accounting for the system resources (CPU, memory, and so on) used by a group of processes. This allows us to limit the resource usage of containers, or ensure that they are provided sufficient resources to run.
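
Container runtimes expose cgroup limits through simple flags. For example, with Docker (my-app again being a hypothetical image):

# limit the container to 512 MB of RAM and a single CPU
[user@computer ~]$ docker run -d --memory=512m --cpus=1 my-app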

Containers Share a Single Host Kernel

Although containers strive to create isolated environments, their isolation stops at the kernel. Everything above the kernel, such as operating system libraries and userland applications, is isolated, but the kernel itself is not.

Wait a minute, what even is a kernel?
The kernel is the program that acts as the first layer interacting with the hardware. It abstracts many of the varying hardware features and makes them available to userland programs via its API.
Moreover, the kernel is what makes it possible to run multiple applications at once and schedule them; it manages resource allocation between those different processes. The kernel is very important, and most of our code is written with the assumption that there is a kernel handling those tasks.

Instead of shipping their own kernel, containers use the host's kernel directly if the host runs Linux; on other operating systems, the container runtime runs a lightweight Linux virtual machine, and all containers share that VM's kernel.
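
You can observe this sharing directly: the kernel version reported inside a container matches the host's (the version string below is just an example, and alpine is an arbitrary small image):

[user@computer ~]$ uname -r
6.6.30
[user@computer ~]$ docker run --rm alpine uname -r
6.6.30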

Virtual machines (VMs) are another tool for creating isolated environments. They virtualize the hardware, and each VM runs its own instance of a kernel, providing deeper isolation.

Why Not Use VMs Then?

VMs are much heavier, since virtualizing hardware and running a separate kernel per VM is resource-intensive. This results in subpar performance for not much added benefit: the kernel is rarely the cause of software incompatibility, and it behaves largely the same across different machines.

Moreover, containerization is a technology for software development productivity. Its main purpose is easing the software development and deployment cycle, and container tooling thus provides tons of features that make containers very easy to use.

While VMs provide a ton of added security, that is usually not why containers are used. Containers are used to make development easier, and to ensure that the execution of our application is predictable and reproducible. We don't like surprises in production!

Conclusion

Containers save developers a ton of wasted time. They allow us to keep the execution of our applications predictable, and to avoid the issues caused by differing system environments across different computers.

We saw how containers create isolated environments using Linux features. They isolate the file system, the process space, and more, in order to rule out interference from the host environment.

This demystifies what a running container looks like, and what exactly the runtime does to achieve it. In the next part of this series, we will look at how containers are created and built, and how we can ensure that the build process yields the output we expect. Stay tuned!