Assemble a Docker-like container using only stock Linux tools: unshare, mount, and pivot_root.
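Before starting, it may help to confirm that the commands used in this walkthrough are installed. The sketch below is only an illustration: missing_tools is a hypothetical helper, not part of any package, and crane is included because the rootfs step relies on it.

```shell
# Hypothetical helper: print each named command that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Commands used in this walkthrough; no output means all were found.
missing_tools unshare mount pivot_root crane
```

On most distributions unshare, mount, and pivot_root ship with util-linux, so any name printed here most likely points to a missing crane binary or a stripped-down base system.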
Prepare rootfs
Directory for container rootfs
CONTAINER_DIR=/opt/container-1
ROOTFS_DIR=${CONTAINER_DIR}/rootfs
sudo mkdir -p $ROOTFS_DIR
Extract Container Image Filesystem to borrow the rootfs files
crane export <image> | sudo tar -xvC $ROOTFS_DIR
Prepare /etc files to replace the generic ones with container-specific variants, e.g.:
cat <<EOF | sudo tee /opt/container-1/hosts
127.0.0.1 localhost container-1
::1 localhost ip6-localhost ip6-loopback
EOF
cat <<EOF | sudo tee /opt/container-1/hostname
container-1
EOF
sudo cp /etc/resolv.conf /opt/container-1/resolv.conf
Create Namespaces
sudo unshare --mount --pid --fork --cgroup --uts --net bash
--mount -> Creates a separate mount table; changes (e.g., mounts/unmounts) stay inside the namespace
--pid -> Starts a new process tree; the first process inside becomes PID 1
- without it, the ps command will show the full list of processes on the server (ps reads /proc, so the isolated view appears only once a fresh /proc is mounted)
--fork -> Spawns a new process so that --pid works properly (since PID 1 must start in a child)
--cgroup -> Creates an isolated view of the cgroup hierarchy; the process's current cgroup appears as the root
--uts -> Gives an independent hostname and domain name; prevents container hostname changes from affecting the host
--net -> Provides a new network stack; includes isolated interfaces, routing tables, and IP addresses
Other possible namespaces
--ipc -> Isolates shared memory, semaphores, and message queues between processes
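One way to confirm that the new shell really sits in fresh namespaces is to compare the namespace links under /proc. This is a minimal sketch, assuming /proc is mounted; ns_id is a hypothetical helper name. Each link reads like uts:[4026531838], and two processes share a namespace exactly when the strings match.

```shell
# Hypothetical helper: print the namespace identifier of a process.
# Usage: ns_id TYPE [PID]   (TYPE is e.g. mnt, pid, uts, net, cgroup, ipc)
ns_id() {
  readlink "/proc/${2:-self}/ns/$1"
}

# Print the identifiers for the namespace types requested from unshare.
for ns in mnt pid uts net cgroup; do
  printf '%-7s %s\n' "$ns" "$(ns_id "$ns")"
done
```

Running the same loop in a second terminal on the host should print a different identifier for each of the five types, while ipc should match because it was not unshared.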
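Isolate new mount namespace
Run the remaining commands in the shell started by unshare. That shell does not inherit the unexported CONTAINER_DIR and ROOTFS_DIR variables, so define them again first.
Stop mounts made in the namespace from propagating back to the host
mount --make-rslave /
Make the root filesystem itself a mount point
mount --rbind $ROOTFS_DIR $ROOTFS_DIR
Set the propagation type of the root fs to private (pivot_root rejects a new root with shared propagation)
mount --make-private $ROOTFS_DIR
Prepare pseudo filesystem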
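The mounts in this section rely on several kernel filesystem types. As a rough pre-check, the kernel's list in /proc/filesystems can be consulted before mounting anything. This is a sketch only: fs_supported is a hypothetical helper, and its optional second argument exists so the parser can be tried against a saved copy of the file.

```shell
# Hypothetical helper: succeed if the kernel lists filesystem TYPE.
# Usage: fs_supported TYPE [FILE]   (FILE defaults to /proc/filesystems)
fs_supported() {
  awk -v t="$1" '$NF == t { found = 1 } END { exit !found }' \
    "${2:-/proc/filesystems}"
}

# Report any type used in this section that the kernel does not list.
for t in proc tmpfs devpts mqueue sysfs cgroup2; do
  fs_supported "$t" || echo "filesystem type not listed: $t"
done
```

A type provided by a kernel module that has not been loaded yet will not be listed, so a report here is a hint to investigate rather than proof that the mount will fail.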
/proc
Mount the /proc pseudo filesystem:
mkdir -p $ROOTFS_DIR/proc
mount -t proc proc $ROOTFS_DIR/proc
Security Caveat
Untrusted root filesystems can contain symlinks that escape $ROOTFS_DIR and allow host-filesystem corruption.
Production container runtimes prevent this by using openat2() with RESOLVE_NO_SYMLINKS and by performing mount/FS operations on file descriptors to avoid TOCTTOU issues.
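For a shell-only walkthrough, a much weaker guard is to resolve each mount target before using it and refuse anything that lands outside the rootfs. The sketch below assumes GNU realpath (for the -m flag); inside_rootfs is a hypothetical helper. It is not what production runtimes do, and it stays open to the race described above because the path can change between the check and the mount.

```shell
# Hypothetical helper: succeed only if PATH, resolved through any symlinks,
# still lies inside ROOT. Assumes GNU realpath (the -m flag).
# Usage: inside_rootfs ROOT PATH
inside_rootfs() {
  root=$(realpath -m -- "$1") || return 2
  target=$(realpath -m -- "$1/$2") || return 2
  case "$target" in
    "$root" | "$root"/*) return 0 ;;
    *) return 1 ;;
  esac
}

# Check the targets that this walkthrough mounts onto.
ROOTFS_DIR=${ROOTFS_DIR:-/opt/container-1/rootfs}
for p in proc dev sys etc/hostname etc/hosts etc/resolv.conf; do
  inside_rootfs "$ROOTFS_DIR" "$p" || echo "escapes rootfs: $p"
done
```

An image in which etc/resolv.conf is an absolute symlink would be reported here, since the host resolves that link against its own root rather than the container's.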
/dev
Mount the /dev pseudo filesystem as a regular tmpfs:
mount -t tmpfs -o nosuid,strictatime,mode=0755,size=65536K tmpfs $ROOTFS_DIR/dev
Create the standard character devices (/dev/null, /dev/zero, /dev/random, etc.):
mknod -m 666 "$ROOTFS_DIR/dev/null" c 1 3
mknod -m 666 "$ROOTFS_DIR/dev/zero" c 1 5
mknod -m 666 "$ROOTFS_DIR/dev/full" c 1 7
mknod -m 666 "$ROOTFS_DIR/dev/random" c 1 8
mknod -m 666 "$ROOTFS_DIR/dev/urandom" c 1 9
mknod -m 666 "$ROOTFS_DIR/dev/tty" c 5 0
chown root:root "$ROOTFS_DIR/dev/"{null,zero,full,random,urandom,tty}
Create the typical symlinks:
ln -sf /proc/self/fd "$ROOTFS_DIR/dev/fd"
ln -sf /proc/self/fd/0 "$ROOTFS_DIR/dev/stdin"
ln -sf /proc/self/fd/1 "$ROOTFS_DIR/dev/stdout"
ln -sf /proc/self/fd/2 "$ROOTFS_DIR/dev/stderr"
ln -sf /proc/kcore "$ROOTFS_DIR/dev/core"
Create subordinate filesystems:
/dev/pts
mkdir -p "$ROOTFS_DIR/dev/pts"
mount -t devpts \
  -o newinstance,ptmxmode=0666,mode=0620 devpts \
  $ROOTFS_DIR/dev/pts
ln -sf /dev/pts/ptmx "$ROOTFS_DIR/dev/ptmx"
/dev/mqueue
mkdir -p "$ROOTFS_DIR/dev/mqueue"
mount -t mqueue \
  -o nosuid,nodev,noexec mqueue \
  $ROOTFS_DIR/dev/mqueue
/dev/shm
mkdir -p "$ROOTFS_DIR/dev/shm"
mount -t tmpfs \
  -o nosuid,nodev,noexec,mode=1777,size=67108864 tmpfs \
  $ROOTFS_DIR/dev/shm
/sys
Mount a read-only sys pseudo fs:
mkdir -p "$ROOTFS_DIR/sys"
mount -t sysfs \
  -o ro,nosuid,nodev,noexec sysfs \
  $ROOTFS_DIR/sys
Mount the subordinate cgroup2 filesystem as /sys/fs/cgroup
mkdir -p "$ROOTFS_DIR/sys/fs/cgroup"
mount -t cgroup2 \
  -o ro,nosuid,nodev,noexec cgroup2 \
  $ROOTFS_DIR/sys/fs/cgroup
Bind hostname, hosts, and resolv.conf files
Bind the container-specific hostname, hosts and resolv.conf files, masking the original files in the rootfs’ /etc directory:
sudo mount --bind /opt/container-1/hosts /opt/container-1/rootfs/etc/hosts
sudo mount --bind /opt/container-1/hostname /opt/container-1/rootfs/etc/hostname
sudo mount --bind /opt/container-1/resolv.conf /opt/container-1/rootfs/etc/resolv.conf
Or, equivalently, in a loop that first creates any missing target file (a bind mount needs an existing target):
for p in hostname hosts resolv.conf
do
  touch $ROOTFS_DIR/etc/$p
  mount --bind "$CONTAINER_DIR/$p" $ROOTFS_DIR/etc/$p
done
Pivot into the new rootfs
Pivot into the prepared root filesystem using pivot_root
cd $ROOTFS_DIR
mkdir -p .oldroot
pivot_root . .oldroot
Then spawn a new shell from the new rootfs (the old shell was loaded from the old root and may misbehave)
exec /bin/sh
Note
Real container runtimes don't need to spawn a new shell, as they communicate with the kernel directly through syscalls instead of shell commands.
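To check that the pivot took effect, the new shell can look up what is mounted at / in /proc/self/mountinfo. This is a minimal sketch, assuming awk is available in the new rootfs; mount_info is a hypothetical helper, and its optional second argument exists only so the parser can be tried against a saved file.

```shell
# Hypothetical helper: for a mount point, print the root of the mount within
# its filesystem, the filesystem type, and the mount source.
# Usage: mount_info MOUNTPOINT [FILE]   (FILE defaults to /proc/self/mountinfo)
mount_info() {
  awk -v mp="$1" '
    $5 == mp {
      # Optional fields vary in number; the type follows the "-" separator.
      for (i = 7; i <= NF; i++) if ($i == "-") break
      print $4, $(i + 1), $(i + 2)
    }' "${2:-/proc/self/mountinfo}"
}

# Describe whatever is currently mounted at /.
mount_info /
```

After pivot_root, the first field for / is expected to be the rootfs directory's path within its backing filesystem rather than a bare /, and mount_info /.oldroot should still describe the host's root until it is unmounted.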
Configure the propagation type of the container's root filesystem
mount --make-rslave /
Rootfs mount propagation
The choice of slave here is arbitrary; the OCI Runtime Spec's rootfsPropagation setting also accepts private, shared, and unbindable.
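The propagation type that actually took effect can be read back from the optional fields of /proc/self/mountinfo, where shared:N marks a shared mount and master:N marks a slave. The sketch below assumes awk is available; propagation_of is a hypothetical helper, and the optional file argument is there only for trying it on saved data.

```shell
# Hypothetical helper: print the propagation type of a mount point.
# Usage: propagation_of MOUNTPOINT [FILE]  (FILE defaults to /proc/self/mountinfo)
propagation_of() {
  awk -v mp="$1" '
    $5 == mp {
      type = ""
      # Optional fields sit between field 6 and the "-" separator.
      for (i = 7; i <= NF && $i != "-"; i++) {
        if ($i ~ /^shared:/)         type = type (type ? "," : "") "shared"
        else if ($i ~ /^master:/)    type = type (type ? "," : "") "slave"
        else if ($i == "unbindable") type = "unbindable"
      }
      print (type ? type : "private")
    }' "${2:-/proc/self/mountinfo}"
}

# Show the propagation type of the current root mount.
propagation_of /
```

If this prints private rather than slave, the mount had no shared peer group to become a slave of: per the kernel's shared-subtree rules, marking a non-shared mount as slave leaves it unchanged.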
Get rid of the link to the old root filesystem:
umount -l .oldroot
rmdir .oldroot
Set the hostname of the container using the value from the container's /etc/hostname
hostname $(cat /etc/hostname)
Harden container filesystem
Set parts of /proc filesystem read-only:
for d in bus fs irq sys sysrq-trigger
do
  if [ -e "/proc/$d" ]; then
    mount --bind "/proc/$d" "/proc/$d"
    mount -o remount,bind,ro "/proc/$d"
  fi
done
Mask sensitive paths in /proc and /sys
for p in \
  /proc/asound \
  /proc/interrupts \
  /proc/kcore \
  /proc/keys \
  /proc/latency_stats \
  /proc/timer_list \
  /proc/timer_stats \
  /proc/sched_debug \
  /proc/acpi \
  /proc/scsi \
  /sys/firmware
do
  if [ -d "$p" ]; then
    # Masking a folder
    mount -t tmpfs -o ro tmpfs $p
  elif [ -f "$p" ]; then
    # Masking a regular file
    mount --bind /dev/null $p
  fi
done
The containerized environment is now ready to use
APP=${APP:-/bin/sh}
exec $APP
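The application started this way inherits every environment variable of the setup shell. One option is to launch it with a small, explicit environment instead. The sketch below uses env -i for that; run_clean is a hypothetical helper, and the PATH, HOME, and TERM values are assumptions rather than something this walkthrough prescribes.

```shell
# Hypothetical helper: run a command with a minimal, explicit environment.
# The variable values below are illustrative assumptions.
run_clean() {
  env -i \
    PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin \
    HOME=/root \
    TERM="${TERM:-xterm}" \
    HOSTNAME="$(uname -n)" \
    "$@"
}

# Show the environment the application would see.
run_clean env
```

Because exec cannot run a shell function, keeping the exec semantics of the last step would mean writing the same env -i invocation directly after exec, followed by $APP.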