Implement prepared shared rootfs reuse as the primary deduped runtime path #48

Open
opened 2026-03-19 19:45:43 +00:00 by thabeta · 0 comments
Owner

Problem

This builds on the broader storage-deduplication work tracked in #35.

The current branch already caches pulled image contents, but it still materializes too much state per VM. The result is avoidable disk duplication and extra startup work when multiple VMs use the same image.

Current behavior

1. The cache stops at extracted rootfs

The image cache stores layers and can extract one shared rootfs per image digest, but there is no cached prepared runtime base for the shared-root path.

That means the cache helps with pulling and extracting, but not enough with the work required to actually run many VMs efficiently from the same image.
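One way to picture the missing piece is a cache layout keyed by image digest and runtime variant. The directory names and helper below are illustrative sketches, not the real code:

```python
from pathlib import Path

# Hypothetical cache layout (names are illustrative, not the real code):
#   layers/<digest>/              -- cached layers (exists today)
#   rootfs/<digest>/              -- extracted shared rootfs (exists today)
#   prepared/<digest>/<variant>/  -- proposed prepared runtime base (missing)

def prepared_base_path(cache_root: str, image_digest: str, variant: str) -> Path:
    """Resolve the on-disk location of the prepared base for one image/variant."""
    # Digests look like "sha256:abc..."; replace ':' so the digest is a
    # single valid path component.
    safe_digest = image_digest.replace(":", "_")
    return Path(cache_root) / "prepared" / safe_digest / variant

def is_prepared(cache_root: str, image_digest: str, variant: str) -> bool:
    """A completed base is marked by a sentinel file written after the build."""
    return (prepared_base_path(cache_root, image_digest, variant) / ".complete").is_file()
```

With such a resolver in place, the launch path can check `is_prepared` and skip all per-image preparation on warm starts.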

2. Backend selection is not capability-driven

Storage backend selection currently resolves directly to block storage unless virtiofs is explicitly chosen.

So even when the runtime environment could support a shared-root design, the normal path still pushes users toward a heavier per-VM block-image workflow.

3. Storage preparation consumes the runtime rootfs source directly

The storage preparation path uses the stored rootfs source directly instead of resolving a separate immutable prepared base for image-backed VMs.

That keeps the design centered around "prepare per VM" rather than "prepare once per image, then reuse many times".

4. Shared-root plus extra filesystem mounts needs valid hypervisor argument assembly

The hypervisor argument builder emits one --fs entry for the root shared filesystem and then emits another --fs entry for each extra shared mount.

That is fragile for hypervisor argument parsing. A VM with a shared-root base plus one or more additional shared mounts needs one valid combined filesystem-share argument structure, not repeated top-level flags assembled in a way that can break launch.
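One way to make the assembly robust is to collect every share first and emit a single `--fs` flag followed by all share specifications. This assumes a cloud-hypervisor-style CLI where one `--fs` flag accepts multiple share specs; the field names are illustrative:

```python
def build_fs_args(root_share: dict, extra_shares: list[dict]) -> list[str]:
    """Assemble all filesystem shares under a single --fs flag.

    Sketch assuming a cloud-hypervisor-style CLI where one `--fs` flag
    can be followed by multiple share specifications. Field names
    ("tag", "socket") are illustrative.
    """
    shares = [root_share, *extra_shares]
    specs = [f"tag={s['tag']},socket={s['socket']}" for s in shares]
    # One top-level flag followed by every share spec, instead of a
    # separate --fs per share.
    return ["--fs", *specs]
```

Centralizing the assembly also gives one place to validate tag uniqueness and socket paths before launch.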

5. First-time prepared-base creation needs serialization

Once a prepared shared rootfs cache exists, cold starts from the same image/variant must not race each other. Without a per-image/per-variant lock, two concurrent first boots can both try to create the same prepared base.
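A per-image/per-variant lock can be sketched with an advisory file lock; the lock-file naming below is a hypothetical convention:

```python
import fcntl
import os
from contextlib import contextmanager

@contextmanager
def prepared_base_lock(lock_dir: str, image_digest: str, variant: str):
    """Serialize first-time prepared-base creation across processes.

    Sketch using an advisory file lock (fcntl.flock). The lock file name
    derives from the image digest and runtime variant, so different
    images/variants never block each other.
    """
    os.makedirs(lock_dir, exist_ok=True)
    name = f"{image_digest.replace(':', '_')}.{variant}.lock"
    fd = os.open(os.path.join(lock_dir, name), os.O_CREAT | os.O_RDWR, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until any concurrent builder finishes
        yield
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)
```

The second cold-start blocks on the lock, then finds the completed base and reuses it instead of rebuilding.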

Why this matters

  • Disk usage grows much faster than necessary when the same image is launched many times.
  • Startup latency includes repeated per-VM preparation work that should have been amortized.
  • Shared-root should be the efficient first-class path on supported kernels, not an afterthought.
  • Concurrency bugs in first-time base preparation will only show up under realistic parallel starts, which makes them easy to miss and painful to debug later.

Suggested implementation

Prepared image base

  1. Add a cached prepared rootfs per image digest and runtime variant.
  2. Build it once and treat it as immutable shared input.
  3. Include whatever static guest/runtime assets are required for the supported shared-root path.
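The "build once, then treat as immutable" step can follow the classic build-into-a-temp-dir-then-rename pattern, so a crash mid-build never leaves a half-prepared base at the final path. This is a sketch (the `.complete` sentinel and callable `build` are assumptions) and is intended to run under the per-image/per-variant lock described below:

```python
import os
import shutil
import tempfile

def ensure_prepared_base(base_dir: str, build) -> str:
    """Build the prepared base once; later calls reuse the completed result.

    Sketch: `build` is a caller-supplied callable that populates the
    directory it is given. The base only becomes visible at `base_dir`
    after a successful build, via an atomic rename on the same filesystem.
    Assumed to run while holding the per-image/per-variant lock.
    """
    done = os.path.join(base_dir, ".complete")
    if os.path.exists(done):
        return base_dir  # reuse the immutable shared input
    parent = os.path.dirname(os.path.abspath(base_dir))
    os.makedirs(parent, exist_ok=True)
    tmp = tempfile.mkdtemp(dir=parent, prefix=".building-")
    try:
        build(tmp)
        open(os.path.join(tmp, ".complete"), "w").close()
        os.rename(tmp, base_dir)  # atomic publish
    except Exception:
        shutil.rmtree(tmp, ignore_errors=True)
        raise
    return base_dir
```

After the first successful build, every later launch takes the fast path through the `.complete` check.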

Backend policy

  1. Introduce an automatic storage mode that prefers the shared-root backend when the selected kernel supports it.
  2. Keep block storage as an explicit or fallback compatibility path.

Per-VM state model

  1. Limit per-VM writable state to overlay upper/work data, sockets, logs, and metadata.
  2. Do not rebuild or recopy the full runtime filesystem for every VM from the same image.
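The per-VM state model above can be sketched as creating only the writable pieces and deriving an overlay mount from the shared read-only base. Directory names and the in-guest base path are illustrative assumptions:

```python
import os

def create_vm_state(state_root: str, vm_id: str) -> dict:
    """Create only the writable per-VM pieces; the rootfs itself is shared.

    Sketch with illustrative directory names: overlay upper/work for guest
    writes, plus sockets and logs. Nothing here copies the prepared base.
    """
    vm_dir = os.path.join(state_root, vm_id)
    paths = {name: os.path.join(vm_dir, name)
             for name in ("upper", "work", "sockets", "logs")}
    for p in paths.values():
        os.makedirs(p, exist_ok=True)
    # Example overlayfs options built from the shared base plus the
    # per-VM upper/work directories (lowerdir is the read-only shared base;
    # "/run/base" is a hypothetical in-guest mount point).
    paths["overlay_opts"] = (
        f"lowerdir=/run/base,upperdir={paths['upper']},workdir={paths['work']}"
    )
    return paths
```

Disk cost per additional VM is then proportional to what the guest actually writes, not to the image size.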

Hypervisor argument assembly

  1. Build filesystem-share arguments in a form that supports the root shared filesystem plus additional shared mounts in a single valid launch configuration.

Concurrency

  1. Serialize first-time prepared-base creation with a per-image/per-variant lock.
  2. Make later launches reuse the completed prepared base rather than rebuilding it.

Acceptance criteria

  • Two VMs from the same image can start concurrently without racing during first-time prepared-base creation.
  • Repeated VMs from the same image reuse one prepared shared base on disk.
  • Per-VM writable state is limited to overlay/runtime data rather than a full duplicated rootfs.
  • A VM using a shared-root base can also mount additional shared filesystem volumes successfully.
Related work

  • #35
Reference
geomind_code/my_hypervisor#48