[ops] Docker on TF Grid VM needs btrfs storage driver — overlayfs fails on whiteout files #181

Closed
opened 2026-04-25 00:16:21 +00:00 by mik-tf · 2 comments
Owner

Symptom

docker pull onlyoffice/documentserver:latest (or any image with whiteout files) fails on TF Grid VMs with:

failed to extract layer ... to overlayfs as "extract-...": failed to convert whiteout file
"etc/nginx/sites-enabled/.wh.default": operation not permitted

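The default Docker storage driver, overlayfs, cannot create whiteout files here. overlayfs marks a file deleted by an upper layer with a whiteout, which is a character device with device number 0/0, and the TF Grid flist filesystem doesn't permit the mknod call that creates it.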

Root cause

TF Grid VMs boot from an Ubuntu flist (essentially a layered immutable rootfs). When Docker tries to use overlayfs on top of this, it can't perform the privileged operations needed to materialize whiteout files for image layers that delete files from a base image. OnlyOffice (and many other multi-layer images) hits this immediately.
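One way to confirm this diagnosis on a given VM is to probe a directory for whiteout support directly. The sketch below assumes the failing operation really is the mknod of a 0/0 character device, as the error suggests; the function name can_create_whiteout and the probe file name are made up for illustration.

```shell
# can_create_whiteout DIR
# Tries to create an overlayfs-style whiteout (character device 0/0) in DIR.
# Returns 0 if mknod succeeded, 1 otherwise. The probe file is always removed.
can_create_whiteout() {
    local dir=$1
    local probe="$dir/.wh-probe.$$"   # hypothetical probe file name
    if mknod "$probe" c 0 0 2>/dev/null; then
        rm -f "$probe"
        return 0
    fi
    rm -f "$probe"
    return 1
}
```

Run as root, comparing can_create_whiteout /var/lib/docker with can_create_whiteout /data should show whether the rootfs or the /data partition is the one rejecting the call.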

Workaround

Move Docker's data root to /data (a btrfs partition that supports the operations Docker needs) and switch the storage driver to btrfs:

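# Stop the running daemon, if any (there is no systemd on TF Grid VMs)
pkill dockerd || true
mkdir -p /data/docker /etc/docker
cat > /etc/docker/daemon.json <<'JSON'
{
  "data-root": "/data/docker",
  "storage-driver": "btrfs"
}
JSON
# Start dockerd in the background and give it a few seconds to come up
nohup dockerd > /var/log/dockerd.log 2>&1 &
sleep 6
docker info | grep -E 'Storage Driver|Docker Root'
# Storage Driver: btrfs
# Docker Root Dir: /data/docker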

After this, OnlyOffice's image extracts cleanly.
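The fixed sleep 6 in the workaround can be too short on a slow node. A minimal sketch of a polling alternative follows; wait_until is a hypothetical helper, not something Docker or the Hero tooling ships.

```shell
# wait_until TIMEOUT_SECONDS CMD [ARGS...]
# Re-runs CMD about once per second until it succeeds or TIMEOUT_SECONDS
# of waiting have elapsed. Returns 0 on success, 1 on timeout.
wait_until() {
    local timeout=$1
    shift
    local waited=0
    until "$@" >/dev/null 2>&1; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

With this in place, the sleep 6 line could become wait_until 30 docker info, followed by the same docker info | grep check.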

Related TF Grid quirks already documented

  • TF Grid VMs don't run systemd as PID 1 (they use zinit instead). systemctl start docker fails with "System has not been booted with systemd as init system." Workaround: nohup dockerd > /var/log/dockerd.log 2>&1 &.
  • Default rootfs is 2 GB, which fills up immediately on apt installs. Already tracked in home#161 (https://forge.ourworld.tf/lhumina_code/home/issues/161). Worked around by setting rootfs_size = 16384 in the TF Grid Terraform.
  • Mycelium route propagation can take 15+ min on fresh TF Grid nodes. Already tracked in home#165 (https://forge.ourworld.tf/lhumina_code/home/issues/165). Worked around by setting publicip = true.
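The no-systemd quirk can be handled in a script rather than by hand. The sketch below relies on the check that sd_booted(3) documents, namely whether /run/systemd/system exists; the function names are illustrative and not taken from any Hero script.

```shell
# Succeeds only when the system was booted with systemd as init.
# /run/systemd/system exists only in that case (the sd_booted(3) check).
booted_with_systemd() {
    [ -d /run/systemd/system ]
}

# Prints which start method applies on this machine: "systemd" or "direct".
docker_start_method() {
    if booted_with_systemd; then
        echo systemd
    else
        echo direct
    fi
}

# Starts Docker using whichever method applies (needs root).
start_docker() {
    case "$(docker_start_method)" in
        systemd) systemctl start docker ;;
        direct)  nohup dockerd > /var/log/dockerd.log 2>&1 & ;;
    esac
}
```

On a TF Grid VM running zinit, docker_start_method should print direct, so start_docker falls through to the nohup form used in this issue.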

Prod-level fix path

Add a "Docker on TF Grid" subsection to hero_demo/docs/ops/DEPLOYMENT_NU_HERO_OS.md:

  • apt install docker.io
  • Configure /etc/docker/daemon.json to use /data/docker + btrfs driver
  • Start with nohup dockerd > /var/log/dockerd.log 2>&1 & (no systemd)
  • This is needed for OnlyOffice (home#174) and any future Docker-based service

If hero_skills ever grows a service_docker.nu (or any service that requires Docker), it should bake this configuration into the installer.
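As a rough idea of what baking this configuration into an installer could look like, here is a shell sketch that writes the daemon.json shown above. An actual service_docker.nu would be written in nu-shell; write_docker_daemon_json and its arguments are assumptions made for illustration only.

```shell
# write_docker_daemon_json FILE DATA_ROOT [STORAGE_DRIVER]
# Writes a minimal daemon.json to FILE. If FILE already exists it is first
# copied to FILE.bak. STORAGE_DRIVER is optional; when omitted, Docker
# chooses the driver itself.
write_docker_daemon_json() {
    local file=$1 data_root=$2 driver=${3:-}
    mkdir -p "$(dirname "$file")" || return 1
    if [ -f "$file" ]; then
        cp "$file" "$file.bak" || return 1
    fi
    {
        printf '{\n'
        printf '  "data-root": "%s"' "$data_root"
        if [ -n "$driver" ]; then
            printf ',\n  "storage-driver": "%s"' "$driver"
        fi
        printf '\n}\n'
    } > "$file"
}
```

Calling write_docker_daemon_json /etc/docker/daemon.json /data/docker btrfs reproduces the file from the workaround; dropping the third argument leaves the driver choice to Docker.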

Demo state on herodemo (2026-04-24)

  • Docker installed via apt install docker.io
  • daemon.json configured with storage-driver=btrfs and data-root=/data/docker
  • dockerd running via nohup (no systemd)
  • OnlyOffice image pulling successfully (in progress)
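
Related

  • home#174 (https://forge.ourworld.tf/lhumina_code/home/issues/174): OnlyOffice Document Server (the consumer that hit this)
  • home#161 (https://forge.ourworld.tf/lhumina_code/home/issues/161): disaster recovery + rootfs-size lessons
  • home#160 (https://forge.ourworld.tf/lhumina_code/home/issues/160): consolidated demo state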

Signed-off-by: mik-tf

Author
Owner

Correction (2026-04-24)

I overstated this issue. The Hero team already solved Docker-on-TF-Grid in the legacy docker-pipeline setup script: hero_demo/deploy/single-vm/scripts/setup.sh:62-95. That script:

  • Detects /data filesystem
  • Configures data-root: /data/docker in /etc/docker/daemon.json
  • Uses native overlay2 for ext4/xfs/btrfs (with fuse-overlayfs fallback for unusual filesystems)
  • Handles the no-systemd case by starting dockerd & directly
  • Cleans up /var/lib/docker/* after switching to /data
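The driver-selection rule in that list can be sketched as follows. This is not the content of setup.sh, which is not reproduced in this issue; it only illustrates the rule as described, with hypothetical function names, and uses findmnt from util-linux to detect the filesystem.

```shell
# Prints the filesystem type backing the given path, e.g. "btrfs" or "ext4".
fs_type_of() {
    findmnt -n -o FSTYPE -T "$1"
}

# Maps a filesystem type to a storage driver, following the rule described
# in the list: native overlay2 on ext4/xfs/btrfs, fuse-overlayfs otherwise.
pick_storage_driver() {
    case "$1" in
        ext4|xfs|btrfs) echo overlay2 ;;
        *)              echo fuse-overlayfs ;;
    esac
}
```

A deployment step could then call pick_storage_driver "$(fs_type_of /data)" and write the result into daemon.json next to data-root.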

So the knowledge already exists. The real gap is that the nu-shell deployment runbook (hero_demo/docs/ops/DEPLOYMENT_NU_HERO_OS.md) doesn't reference or extract this Docker-on-TF-Grid section. Anyone running the nu-shell flow who needs Docker (e.g. for OnlyOffice per home#174) will rediscover the issue from scratch — like I did.

Reframed proposal

  • Add a "Docker on TF Grid" subsection to DEPLOYMENT_NU_HERO_OS.md
  • Either reference scripts/setup.sh:62-95 directly, or extract those lines into a standalone snippet that nu-shell deploys can source or copy
  • The actual config knowledge is correct as encoded in setup.sh — no codebase fix needed

On herodemo (2026-04-24)

I configured Docker with an explicit storage-driver: btrfs rather than the legacy script's default overlay2. Both work on btrfs in practice; my config is a harmless deviation, but it should be brought back in line with the team convention (native overlay2) when the runbook subsection is written.

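Related

  • Legacy reference: hero_demo/deploy/single-vm/scripts/setup.sh:62-95
  • home#174 (https://forge.ourworld.tf/lhumina_code/home/issues/174): OnlyOffice (the Docker consumer)
  • home#161 (https://forge.ourworld.tf/lhumina_code/home/issues/161): disaster recovery + rootfs-size lessons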

Reframing this as a docs gap, not a Docker support gap. Closing as "documented in setup.sh, just needs to land in the nu-shell runbook."

Signed-off-by: mik-tf

Author
Owner

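Resolved by lhumina_code/hero_skills@7c823d1 (https://forge.ourworld.tf/lhumina_code/hero_skills/commit/7c823d1), PR lhumina_code/hero_skills#126 (https://forge.ourworld.tf/lhumina_code/hero_skills/pulls/126).

Part of Phase 2 tracker #185 (https://forge.ourworld.tf/lhumina_code/home/issues/185).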

Reference
lhumina_code/home#181