Run a Nested OCI Runtime
Some services run their own containers — CI runners (gitea-act-runner, Forgejo Runner, Drone) execute every job inside a fresh OCI container; build daemons (buildkitd) launch sandboxes per build; emulator services pull and run arbitrary images on demand. Without a real container engine inside the service, those workloads can’t be sandboxed properly and the service can’t isolate untrusted user code.
StartOS supports running a rootless OCI engine — Podman or Docker — inside an opt-in service via manifest.nestedRuntime: true. The OS exposes /dev/fuse and /dev/net/tun so the engine can use fuse-overlayfs storage and slirp4netns/pasta networking. The service’s own LXC remains userns-mapped and AppArmor-confined; nothing about the host’s posture changes.
Solution
- Set
nestedRuntime: trueat the manifest top level. - Bake the OCI engine and its rootless prerequisites into the service image.
- Add a non-root user and
/etc/subuid//etc/subgidranges that fit inside the subcontainer’s user namespace. - (Docker only) Drop a tiny
runcwrapper into the image and pointdefault-runtimeat it viadaemon.json— Docker injects anet.ipv4.ip_unprivileged_port_startsysctl by default that runc fails to apply across the nested-userns boundary.
Podman works out of the box once the prerequisites are in place. Docker needs the wrapper.
Reference: Manifest · Project Structure
Manifest
import { setupManifest } from '@start9labs/start-sdk'
import { short, long } from './i18n'
export const manifest = setupManifest({
id: 'gitea-runner',
title: 'Gitea Actions Runner',
// ...
volumes: ['main'],
images: {
main: {
source: { dockerBuild: { workdir: '.' } },
arch: ['x86_64', 'aarch64'],
},
},
alerts: { /* ... */ },
dependencies: {},
nestedRuntime: true,
})
What StartOS provides
When nestedRuntime is set, the per-service LXC gets:
/dev/fuse— char device 10:229, world-RW. Required byfuse-overlayfsfor rootless layered storage. Kernel overlayfs-on-overlayfs is denied for unprivileged users, so fuse-overlayfs is the only viable rootless storage driver inside a userns LXC./dev/net/tun— char device 10:200, world-RW. Required byslirp4netnsandpastafor rootless container networking.
Both devices are bind-mounted from the host (via the same machinery that handles hardwareAcceleration for GPU nodes). The host’s fuse and tun kernel modules are auto-loaded at boot.
The host-level sysctls kernel.unprivileged_userns_clone=1 and user.max_user_namespaces=28633 are pinned at install time so unprivileged user-namespace creation is allowed and headroom for nested namespaces is reserved.
Image: Podman
FROM debian:trixie-slim
RUN apt-get update \
&& apt-get install -y --no-install-recommends \
podman fuse-overlayfs uidmap iproute2 iptables \
slirp4netns ca-certificates \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*
RUN mkdir -p /etc/containers && \
printf 'unqualified-search-registries = ["docker.io"]\n' \
> /etc/containers/registries.conf
# Subordinate UIDs/GIDs for nested user namespaces. The range MUST live
# inside the subcontainer's userns (mapped 0..65535) and MUST NOT overlap
# with the calling user's own UID — kernel rejects uid_map writes with
# EINVAL when outside ranges overlap.
RUN useradd --create-home --uid 1000 --shell /bin/bash app \
&& echo 'app:1001:64535' > /etc/subuid \
&& echo 'app:1001:64535' > /etc/subgid
USER app
WORKDIR /home/app
Inside setupMain, run Podman as app:
podman --root=$HOME/.local/share/containers/storage \
--runroot=$XDG_RUNTIME_DIR/containers \
--cgroup-manager=cgroupfs \
run --network=slirp4netns --rm docker.io/library/alpine echo ok
(--cgroup-manager=cgroupfs is required because there’s no user systemd session inside the subcontainer.)
Image: Docker
Docker rootless needs the same prerequisites — subuid, fuse-overlayfs, slirp4netns — plus one workaround. Docker’s container spec includes a net.ipv4.ip_unprivileged_port_start=0 sysctl by default; runc opens that proc file in the parent userns and re-opens the file descriptor inside the nested userns, where it EPERMs. The kernel itself is fine with the write — unshare -Urn from the same shell can do it — but runc’s reopen-after-pivot pattern breaks under nested userns. Setting --sysctl net.ipv4.ip_unprivileged_port_start=… on the command line doesn’t help: runc still does the reopen.
The fix is a thin runc wrapper that strips that sysctl from the OCI bundle before exec’ing real runc. Drop it in the image, register it as the default runtime in /etc/docker/daemon.json. Skipping the sysctl is harmless — ports < 1024 just stay privileged inside the container, which is the upstream Linux default anyway.
FROM debian:trixie-slim
RUN apt-get update \
&& apt-get install -y --no-install-recommends \
curl gnupg fuse-overlayfs uidmap iproute2 iptables \
slirp4netns jq ca-certificates \
&& install -m 0755 -d /etc/apt/keyrings \
&& curl -fsSL https://download.docker.com/linux/debian/gpg \
-o /etc/apt/keyrings/docker.asc \
&& chmod a+r /etc/apt/keyrings/docker.asc \
&& echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/debian trixie stable" \
> /etc/apt/sources.list.d/docker.list \
&& apt-get update \
&& apt-get install -y --no-install-recommends \
docker-ce docker-ce-cli containerd.io docker-ce-rootless-extras \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*
RUN useradd --create-home --uid 1000 --shell /bin/bash app \
&& echo 'app:1001:64535' > /etc/subuid \
&& echo 'app:1001:64535' > /etc/subgid
# runc wrapper — strips the sysctl runc can't apply across the nested
# userns boundary. See https://github.com/Start9Labs/start-os/pull/3209.
COPY runc-nested.sh /usr/local/bin/runc-nested
RUN chmod +x /usr/local/bin/runc-nested
# Tell dockerd to use it as the default runtime.
RUN mkdir -p /etc/docker
COPY daemon.json /etc/docker/daemon.json
runc-nested.sh:
#!/bin/sh
# Strip net.ipv4.ip_unprivileged_port_start from the OCI spec — runc's
# reopen of that sysctl across a nested userns boundary EPERMs in a
# StartOS service subcontainer. Skipping it is harmless; ports < 1024
# just stay privileged inside the container.
set -e
bundle=""
prev=""
for arg in "$@"; do
case "$prev" in --bundle|-b) bundle="$arg"; break;; esac
case "$arg" in --bundle=*) bundle="${arg#--bundle=}"; break;; esac
prev="$arg"
done
cfg="${bundle}/config.json"
if [ -n "$bundle" ] && [ -f "$cfg" ]; then
tmp=$(mktemp "${cfg}.XXXXXX")
jq 'del(.linux.sysctl["net.ipv4.ip_unprivileged_port_start"])' \
"$cfg" > "$tmp" && mv "$tmp" "$cfg"
fi
exec /usr/bin/runc "$@"
daemon.json:
{
"storage-driver": "fuse-overlayfs",
"default-runtime": "runc-nested",
"runtimes": {
"runc-nested": { "path": "/usr/local/bin/runc-nested" }
}
}
Once dockerd is running (rootful, since dockerd-rootless.sh requires the calling user to not be uid 0 and there’s no user systemd session inside the subcontainer), default docker run and docker build work with bridge networking and no extra flags.
Caveats
- No daemon manager. There’s no systemd-user session inside the subcontainer, so engines that prefer systemd cgroups need
--cgroup-manager=cgroupfs(Podman) or run rootful (Docker). Either is fine; just don’t expectloginctl enable-lingeror user-scoped systemd units. - Subordinate-UID range overlap.
/etc/subuid//etc/subgidranges must live inside the subcontainer’s userns (UIDs 0..65535) AND must not overlap with the calling user’s own UID. Withuseradd --uid 1000, the subordinate range must skip 1000 —app:1001:64535works,app:1:65535does not. fuse-overlayfsonly. Kernel overlayfs-on-overlayfs is denied for unprivileged users, so don’t try--storage-driver=overlay2.fuse-overlayfsis the only rootless option.- No bridge IPv6 by default. Rootless networking via slirp4netns is IPv4-only out of the box. If you need IPv6 inside nested containers, configure pasta (
--network=pasta) instead. hardwareAccelerationis independent. If a service needs both GPU access and a nested OCI engine (e.g. an LLM-driven CI runner), set both flags.