Building a Docker-like Container From Scratch

March 24, 2025

Assemble a Docker-like container using only stock Linux tools: unshare, mount, and pivot_root.

Prepare rootfs

Directory for container rootfs

Terminal window
CONTAINER_DIR=/opt/container-1
ROOTFS_DIR=${CONTAINER_DIR}/rootfs
sudo mkdir -p "$ROOTFS_DIR"

Extract Container Image Filesystem to borrow the rootfs files

Terminal window
crane export <image> | sudo tar -xvC $ROOTFS_DIR
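
Later steps bind-mount files over the rootfs' /etc and run exec /bin/sh from inside the new root, so it may be worth confirming that the extraction produced both. A minimal sketch, assuming a POSIX shell; check_rootfs is a hypothetical helper name, not part of crane or any other tool:

```shell
# Hypothetical helper: succeed only if the directory looks like a usable rootfs.
check_rootfs() {
    rootfs=$1
    # A shell is needed for the later `exec /bin/sh`. An absolute symlink
    # (e.g. /bin/sh -> /bin/busybox) cannot be resolved from the host side,
    # so a symlink is accepted as well as an executable file.
    if [ ! -x "$rootfs/bin/sh" ] && [ ! -L "$rootfs/bin/sh" ]; then
        echo "missing $rootfs/bin/sh" >&2
        return 1
    fi
    # /etc must exist so hosts, hostname and resolv.conf can be bind-mounted.
    if [ ! -d "$rootfs/etc" ]; then
        echo "missing $rootfs/etc" >&2
        return 1
    fi
    return 0
}

check_rootfs "${ROOTFS_DIR:-/opt/container-1/rootfs}" || echo "rootfs is incomplete"
```

If the image ships no shell at all (distroless images, for example), the check fails and a different entry point than /bin/sh would be needed.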

Prepare container-specific variants of the /etc files that will replace the generic ones from the image:

/etc/hosts
cat <<EOF | sudo tee /opt/container-1/hosts
127.0.0.1 localhost container-1
::1 localhost ip6-localhost ip6-loopback
EOF
/etc/hostname
cat <<EOF | sudo tee /opt/container-1/hostname
container-1
EOF
/etc/resolv.conf
sudo cp /etc/resolv.conf /opt/container-1/resolv.conf

Create Namespaces

Terminal window
sudo unshare --mount --pid --fork --cgroup --uts --net bash
  • --mount -> Creates a separate mount table; changes (e.g., mounts/unmounts) stay inside the namespace.
  • --pid -> Starts a new process tree; the first process inside becomes PID 1.
    • Without it, the ps command shows every process on the host.
  • --fork -> Spawns a new process so that --pid works properly (PID 1 must start in a child).
  • --cgroup -> Creates an isolated view of the cgroup hierarchy; processes can manage their own limits and resources.
  • --uts -> Gives an independent hostname and domain name; prevents container hostname changes from affecting the host.
  • --net -> Provides a new network stack; includes isolated interfaces, routing tables, and IP addresses.

Other possible namespaces:
  • --ipc -> Isolates shared memory, semaphores, and message queues between processes.
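
CONTAINER_DIR and ROOTFS_DIR were set as plain, unexported shell variables, so the shell started by unshare does not see them and they have to be declared again. The sketch below does that and then compares the shell's namespace identifiers with those of PID 1 to show which namespaces were actually created. ns_id is a hypothetical helper name; the /proc/<pid>/ns links it reads are a standard Linux interface.

```shell
# Run inside the shell started by unshare.
CONTAINER_DIR=/opt/container-1
ROOTFS_DIR=${CONTAINER_DIR}/rootfs

# Hypothetical helper: print the namespace identifier of a process,
# e.g. "uts:[4026531838]". $1 is a PID (or "self"), $2 a namespace type.
ns_id() {
    readlink "/proc/$1/ns/$2"
}

# /proc is still the host's procfs at this point, so PID 1 is the host's init.
# A namespace whose identifier differs from PID 1's was newly created.
for ns in mnt pid cgroup uts net ipc; do
    mine=$(ns_id self "$ns" 2>/dev/null || true)
    init=$(ns_id 1 "$ns" 2>/dev/null || true)
    if [ -z "$mine" ] || [ -z "$init" ]; then
        echo "$ns: cannot read (root is usually required)"
    elif [ "$mine" = "$init" ]; then
        echo "$ns: shared with the host"
    else
        echo "$ns: isolated ($mine)"
    fi
done
```

With the unshare command above, ipc is expected to be the only namespace reported as shared, since --ipc was not passed.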

Isolate new mount namespace

Disable mount propagation back to the host

Terminal window
mount --make-rslave /

Turn the root filesystem directory itself into a mount point (pivot_root requires the new root to be one)

Terminal window
mount --rbind $ROOTFS_DIR $ROOTFS_DIR

Set the propagation type of the new root mount to private

Terminal window
mount --make-private $ROOTFS_DIR
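
To see what the propagation commands did, the mount table can be inspected directly. The sketch below reads /proc/self/mountinfo, where the optional fields shared:N, master:N and unbindable record a mount's propagation type, and a mount with none of them is private. mount_propagation is a hypothetical helper name; it takes mountinfo-formatted lines on stdin so it can also be fed a saved copy.

```shell
# Hypothetical helper: print the propagation type of mount point $1,
# reading mountinfo-formatted lines from stdin.
mount_propagation() {
    awk -v target="$1" '
        $5 == target {
            prop = "private"
            # Optional fields start at field 7 and end at the "-" separator.
            for (i = 7; i <= NF && $i != "-"; i++) {
                if ($i ~ /^shared:/)
                    prop = (prop == "slave") ? "shared,slave" : "shared"
                else if ($i ~ /^master:/)
                    prop = (prop == "shared") ? "shared,slave" : "slave"
                else if ($i == "unbindable")
                    prop = "unbindable"
            }
            # If several mounts are stacked on the target, the last one wins.
            result = prop
        }
        END { if (result != "") print result; else exit 1 }
    '
}

if [ -r /proc/self/mountinfo ]; then
    mount_propagation / < /proc/self/mountinfo || echo "no entry for /"
fi
```

Inside the namespace, mount_propagation "$ROOTFS_DIR" < /proc/self/mountinfo reports on the new root. Where util-linux is installed, findmnt -o TARGET,PROPAGATION shows the same information.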

Prepare pseudo filesystem

/proc

Mount the /proc pseudo filesystem:

Terminal window
mkdir -p $ROOTFS_DIR/proc
mount -t proc proc $ROOTFS_DIR/proc
Security Caveat

Untrusted root filesystems can contain symlinks that escape $ROOTFS_DIR and allow host-filesystem corruption. Production container runtimes prevent this by using openat2() with RESOLVE_NO_SYMLINKS and by performing mount/FS operations on file descriptors to avoid TOCTTOU issues.
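
For a hand-built container, a rough pre-mount check can at least catch the obvious case where a mount target inside the rootfs is a symlink pointing elsewhere. The sketch below is only an illustration: inside_rootfs is a hypothetical helper name, and the check remains racy (the path can change between the check and the mount), so it is not a substitute for the file-descriptor-based approach described above.

```shell
# Hypothetical helper: succeed only if "$1/$2" resolves to a path inside "$1".
# The body runs in a subshell, so its variables do not leak into the caller.
inside_rootfs() (
    root=$(readlink -f -- "$1") || exit 1
    target=$(readlink -f -- "$1/$2") || exit 1
    case "$target" in
        "$root"/*) exit 0 ;;
        *) exit 1 ;;
    esac
)

# Example: report mount targets that are unusable or escape the rootfs.
ROOTFS_DIR=${ROOTFS_DIR:-/opt/container-1/rootfs}
for d in proc dev sys; do
    if ! inside_rootfs "$ROOTFS_DIR" "$d"; then
        echo "refusing to use $ROOTFS_DIR/$d: unresolvable or outside the rootfs" >&2
    fi
done
```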

/dev

Mount the /dev pseudo filesystem as a regular tmpfs:

Terminal window
mkdir -p "$ROOTFS_DIR/dev"
mount -t tmpfs -o nosuid,strictatime,mode=0755,size=65536K tmpfs "$ROOTFS_DIR/dev"

Create the standard character devices (/dev/null, /dev/zero, /dev/random, etc.):

Terminal window
mknod -m 666 "$ROOTFS_DIR/dev/null" c 1 3
mknod -m 666 "$ROOTFS_DIR/dev/zero" c 1 5
mknod -m 666 "$ROOTFS_DIR/dev/full" c 1 7
mknod -m 666 "$ROOTFS_DIR/dev/random" c 1 8
mknod -m 666 "$ROOTFS_DIR/dev/urandom" c 1 9
mknod -m 666 "$ROOTFS_DIR/dev/tty" c 5 0
chown root:root "$ROOTFS_DIR/dev/"{null,zero,full,random,urandom,tty}
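
A typo in a major or minor number produces a device node that silently behaves like a different device, so a quick verification pass can be useful. The sketch below assumes a stat that supports -c with the %t and %T format specifiers (GNU coreutils does); is_chardev is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if $1 is a character device with
# major number $2 and minor number $3 (both given in decimal).
is_chardev() {
    [ -c "$1" ] || return 1
    # %t and %T print the device major and minor numbers in hexadecimal.
    hex_major=$(stat -c '%t' "$1") || return 1
    hex_minor=$(stat -c '%T' "$1") || return 1
    [ "$((0x$hex_major))" -eq "$2" ] && [ "$((0x$hex_minor))" -eq "$3" ]
}

# Check the nodes created above against the numbers passed to mknod.
DEV_DIR=${ROOTFS_DIR:-/opt/container-1/rootfs}/dev
while read -r name major minor; do
    if is_chardev "$DEV_DIR/$name" "$major" "$minor"; then
        echo "ok   $name"
    else
        echo "BAD  $name (expected character device $major:$minor)"
    fi
done <<EOF
null 1 3
zero 1 5
full 1 7
random 1 8
urandom 1 9
tty 5 0
EOF
```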

Create the typical symlinks:

Terminal window
ln -sf /proc/self/fd "$ROOTFS_DIR/dev/fd"
ln -sf /proc/self/fd/0 "$ROOTFS_DIR/dev/stdin"
ln -sf /proc/self/fd/1 "$ROOTFS_DIR/dev/stdout"
ln -sf /proc/self/fd/2 "$ROOTFS_DIR/dev/stderr"
ln -sf /proc/kcore "$ROOTFS_DIR/dev/core"

Create subordinate filesystems: /dev/pts

Terminal window
mkdir -p "$ROOTFS_DIR/dev/pts"
mount -t devpts \
-o newinstance,ptmxmode=0666,mode=0620 devpts \
$ROOTFS_DIR/dev/pts
ln -sf /dev/pts/ptmx "$ROOTFS_DIR/dev/ptmx"

/dev/mqueue

Terminal window
mkdir -p "$ROOTFS_DIR/dev/mqueue"
mount -t mqueue \
-o nosuid,nodev,noexec mqueue \
$ROOTFS_DIR/dev/mqueue

/dev/shm

Terminal window
mkdir -p "$ROOTFS_DIR/dev/shm"
mount -t tmpfs \
-o nosuid,nodev,noexec,mode=1777,size=67108864 tmpfs \
$ROOTFS_DIR/dev/shm

/sys

Mount a read-only sys pseudo fs:

Terminal window
mkdir -p "$ROOTFS_DIR/sys"
mount -t sysfs \
-o ro,nosuid,nodev,noexec sysfs \
$ROOTFS_DIR/sys

Mount the subordinate cgroup2 filesystem as /sys/fs/cgroup

Terminal window
mkdir -p "$ROOTFS_DIR/sys/fs/cgroup"
mount -t cgroup2 \
-o ro,nosuid,nodev,noexec cgroup2 \
$ROOTFS_DIR/sys/fs/cgroup
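
Each mount above depends on the kernel supporting the filesystem type in question. A small pre-flight check against /proc/filesystems can make a failing mount easier to diagnose. This is a sketch under one assumption worth stating: a filesystem built as a module that is not loaded yet may be missing from the list even though mounting it would work. fs_supported is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if filesystem type $1 is listed in the
# file $2 (defaults to /proc/filesystems).
fs_supported() {
    # Each line is "[nodev]<TAB>name"; the name is always the last field.
    awk -v fs="$1" '$NF == fs { found = 1 } END { exit !found }' \
        "${2:-/proc/filesystems}"
}

for fs in proc tmpfs devpts mqueue sysfs cgroup2; do
    fs_supported "$fs" || echo "kernel does not list $fs"
done
```

A host that only lists cgroup (v1) and not cgroup2 would need a different mount for /sys/fs/cgroup than the one shown here.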

Bind hostname, hosts, and resolv.conf files

Bind the container-specific hostname, hosts and resolv.conf files, masking the original files in the rootfs’ /etc directory:

Terminal window
sudo mount --bind /opt/container-1/hosts /opt/container-1/rootfs/etc/hosts
sudo mount --bind /opt/container-1/hostname /opt/container-1/rootfs/etc/hostname
sudo mount --bind /opt/container-1/resolv.conf /opt/container-1/rootfs/etc/resolv.conf

Or, equivalently, as a loop that also creates any missing mount targets first:

Terminal window
for p in hostname hosts resolv.conf
do
    touch "$ROOTFS_DIR/etc/$p"
    mount --bind "$CONTAINER_DIR/$p" "$ROOTFS_DIR/etc/$p"
done

Pivot into the new rootfs

Pivot into the prepared root filesystem using pivot_root

Terminal window
cd $ROOTFS_DIR
mkdir -p .oldroot
pivot_root . .oldroot

Then spawn a new shell from the new rootfs (the old shell process was started from the old root and might be broken):

Terminal window
exec /bin/sh
Note

Real container runtimes don’t need to spawn a new shell, as they communicate with the kernel directly through syscalls instead of shell commands.

Configure the propagation type of the container’s root filesystem:

Terminal window
mount --make-rslave /
Rootfs mount propagation

We set it to slave arbitrarily; the OCI Runtime Spec also allows private, shared, and unbindable.

Detach and remove the link to the old root filesystem:

Terminal window
umount -l /.oldroot
rmdir /.oldroot
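
Once the old root is detached, no mount from the host should remain reachable under it. One way to confirm this is to count the entries in /proc/self/mountinfo that sit at or below the old root's path. mounts_under is a hypothetical helper name; it reads mountinfo-formatted lines from stdin.

```shell
# Hypothetical helper: count mountinfo entries (read from stdin) whose
# mount point is $1 or lies below it.
mounts_under() {
    awk -v prefix="$1" '
        $5 == prefix || index($5, prefix "/") == 1 { n++ }
        END { print n + 0 }
    '
}

if [ -r /proc/self/mountinfo ]; then
    left=$(mounts_under /.oldroot < /proc/self/mountinfo)
    echo "mounts still under /.oldroot: $left"
fi
```

After the lazy unmount the count is expected to be 0; anything else suggests the old root is still attached to the container's mount tree.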

Set the hostname of the container using the value from the container’s /etc/hostname:

Terminal window
hostname "$(cat /etc/hostname)"

Harden container filesystem

Set parts of /proc filesystem read-only:

Terminal window
for d in bus fs irq sys sysrq-trigger
do
    if [ -e "/proc/$d" ]; then
        mount --bind "/proc/$d" "/proc/$d"
        mount -o remount,bind,ro "/proc/$d"
    fi
done

Mask sensitive paths in /proc and /sys:

Terminal window
for p in \
    /proc/asound \
    /proc/interrupts \
    /proc/kcore \
    /proc/keys \
    /proc/latency_stats \
    /proc/timer_list \
    /proc/timer_stats \
    /proc/sched_debug \
    /proc/acpi \
    /proc/scsi \
    /sys/firmware
do
    if [ -d "$p" ]; then
        # Masking a directory
        mount -t tmpfs -o ro tmpfs "$p"
    elif [ -f "$p" ]; then
        # Masking a regular file
        mount --bind /dev/null "$p"
    fi
done
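
The masking can be spot-checked afterwards. The sketch below treats a path as masked if it is an empty directory (what the read-only tmpfs leaves behind) or if it refers to the same file as /dev/null. This is a heuristic: a directory that happens to be empty on its own would also pass. is_masked is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if $1 looks masked, i.e. it is an empty
# directory or refers to the same file as /dev/null.
is_masked() {
    if [ -d "$1" ]; then
        [ -z "$(ls -A "$1" 2>/dev/null)" ]
    else
        [ "$1" -ef /dev/null ]
    fi
}

# Report on a few of the paths from the loop above; absent paths are skipped.
for p in /proc/kcore /proc/keys /proc/timer_list /proc/acpi /proc/scsi /sys/firmware; do
    [ -e "$p" ] || continue
    if is_masked "$p"; then
        echo "masked   $p"
    else
        echo "EXPOSED  $p"
    fi
done
```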

The containerized environment is now ready to use:

Terminal window
APP=${APP:-/bin/sh}
exec $APP