Make garbage collection ownership-aware before deleting host network resources #45

Closed
opened 2026-03-19 19:45:41 +00:00 by thabeta · 3 comments
Owner

Problem

The current garbage-collection logic for host networking is too heuristic-driven. It can delete resources based on naming and subnet guesses instead of verified ownership.

On a host where other tools or manual admin work also create TAP devices or NAT rules, that is not safe enough.

Current behavior

TAP cleanup

The cleanup logic scans network interfaces, looks for names starting with tap-, extracts the suffix, and treats that suffix as a VM prefix.

If that prefix is not in the current set of running VMs, the interface is treated as orphaned and deleted.

This means the decision is based on:

  • interface name pattern only
  • current running set only
  • no persisted ownership marker
  • no check that the interface was actually created by this project

NAT rule cleanup

The iptables cleanup logic scans iptables -t nat -S, filters rules that mention the project subnet, skips broad bridge-level rules, extracts a per-VM-looking IP, and deletes any rule whose IP is not currently attached to a running VM.

Again, that is based on heuristic matching rather than ownership.

A rule can match the subnet shape and still belong to:

  • manual host networking
  • another local tool
  • a previously migrated environment
  • an intentionally persistent setup that is not tied to a currently running VM

Why this matters

A garbage-collection command must be conservative. If it can delete unrelated host networking state, it stops being a maintenance tool and becomes a host-risking operation.

The current behavior is especially risky on:

  • shared hosts
  • development machines with multiple VM/networking tools
  • systems where the same subnet range is reused intentionally

Suggested implementation

Ownership model

  1. Record ownership at resource creation time.
  2. Persist enough metadata to tie each TAP device and firewall rule back to a specific VM.
  3. Prefer explicit identifiers over name/subnet inference.

Good examples of ownership evidence would be:

  • persisted VM-to-interface mapping
  • a rule comment or equivalent metadata for firewall entries
  • stored cleanup metadata in VM runtime state

Cleanup policy

  1. Only remove resources that are both project-owned and associated with a stopped/crashed VM.
  2. Keep heuristic cleanup as a separate emergency mode, not the default behavior.
  3. Add a dry-run mode so operators can inspect proposed deletions first.

Failure handling

  1. If ownership cannot be proven, skip deletion and report why.
  2. Make cleanup output say whether a resource was deleted because it was explicitly owned, skipped because ownership was unknown, or skipped because it still belongs to a running VM.

Acceptance criteria

  • Cleanup does not remove unrelated tap-* devices created outside this project.
  • Cleanup does not remove unrelated NAT rules in the same subnet if they are not owned by this project.
  • Crash recovery still removes networking resources for known stopped/crashed VMs.
  • A dry-run mode shows what would be removed and why.
  • The default cleanup path is ownership-aware, not purely heuristic.

Affected areas

  • doctor --gc
  • TAP discovery and cleanup logic
  • iptables/NAT cleanup logic
  • persisted runtime metadata for network resources
## Problem The current garbage-collection logic for host networking is too heuristic-driven. It can delete resources based on naming and subnet guesses instead of verified ownership. On a host where other tools or manual admin work also create TAP devices or NAT rules, that is not safe enough. ## Current behavior ### TAP cleanup The cleanup logic scans network interfaces, looks for names starting with `tap-`, extracts the suffix, and treats that suffix as a VM prefix. If that prefix is not in the current set of running VMs, the interface is treated as orphaned and deleted. This means the decision is based on: - interface name pattern only - current running set only - no persisted ownership marker - no check that the interface was actually created by this project ### NAT rule cleanup The iptables cleanup logic scans `iptables -t nat -S`, filters rules that mention the project subnet, skips broad bridge-level rules, extracts a per-VM-looking IP, and deletes any rule whose IP is not currently attached to a running VM. Again, that is based on heuristic matching rather than ownership. A rule can match the subnet shape and still belong to: - manual host networking - another local tool - a previously migrated environment - an intentionally persistent setup that is not tied to a currently running VM ## Why this matters A garbage-collection command must be conservative. If it can delete unrelated host networking state, it stops being a maintenance tool and becomes a host-risking operation. The current behavior is especially risky on: - shared hosts - development machines with multiple VM/networking tools - systems where the same subnet range is reused intentionally ## Suggested implementation ### Ownership model 1. Record ownership at resource creation time. 2. Persist enough metadata to tie each TAP device and firewall rule back to a specific VM. 3. Prefer explicit identifiers over name/subnet inference. Good examples of ownership evidence would be: - persisted VM-to-interface mapping - a rule comment or equivalent metadata for firewall entries - stored cleanup metadata in VM runtime state ### Cleanup policy 1. Only remove resources that are both project-owned and associated with a stopped/crashed VM. 2. Keep heuristic cleanup as a separate emergency mode, not the default behavior. 3. Add a dry-run mode so operators can inspect proposed deletions first. ### Failure handling 1. If ownership cannot be proven, skip deletion and report why. 2. Make cleanup output say whether a resource was deleted because it was explicitly owned, skipped because ownership was unknown, or skipped because it still belongs to a running VM. ## Acceptance criteria - Cleanup does not remove unrelated `tap-*` devices created outside this project. - Cleanup does not remove unrelated NAT rules in the same subnet if they are not owned by this project. - Crash recovery still removes networking resources for known stopped/crashed VMs. - A dry-run mode shows what would be removed and why. - The default cleanup path is ownership-aware, not purely heuristic. ## Affected areas - `doctor --gc` - TAP discovery and cleanup logic - iptables/NAT cleanup logic - persisted runtime metadata for network resources
Member

Implementation Spec for Issue #45: Ownership-Aware GC for Host Network Resources

Objective

Replace the heuristic-based GC logic in doctor --gc with an ownership-aware system that persists metadata tying each TAP device and iptables rule to a specific VM. The default cleanup path must only remove resources it can prove belong to this project and are associated with a stopped/crashed VM. Heuristic cleanup becomes a separate emergency mode. A dry-run mode is added for safe inspection.

Requirements

  • Record ownership metadata (VM ID, TAP device name, list of iptables rules) at network resource creation time, persisted to disk alongside VM state
  • Default GC path only removes resources that are both (a) owned by this project and (b) associated with a stopped/crashed VM
  • GC skips deletion and reports a warning when ownership cannot be proven
  • A --dry-run flag on doctor --gc shows what would be removed and why, without actually removing anything
  • A --force-heuristic flag preserves the current heuristic cleanup as an explicit emergency option
  • Crash recovery still works for known stopped/crashed VMs using persisted ownership data
  • External tap-* devices and NAT rules in the same subnet that were not created by this project are never touched by the default GC path

Files to Modify

File Change
crates/my_hypervisor-lib/src/vm/state.rs Add NetworkOwnership struct to RuntimeState
crates/my_hypervisor-lib/src/network/traits.rs Add iptables_rules to NetworkResult
crates/my_hypervisor-lib/src/network/ops/mod.rs Change NatOps::add_port_forward return type to include rule strings
crates/my_hypervisor-lib/src/network/local/nat.rs Return created iptables rule strings
crates/my_hypervisor-lib/src/network/mosnet/nat.rs Update to match new return type
crates/my_hypervisor-lib/src/network/backends/tap_backend.rs Propagate ownership through NetworkResult
crates/my_hypervisor-lib/src/vm/manager.rs Persist ownership, add gc_owned_resources, clear on teardown
crates/my_hypervisor-cli/src/cli.rs Add --dry-run and --force-heuristic flags to DoctorArgs
crates/my_hypervisor-cli/src/commands/doctor.rs Wire up ownership-aware GC, keep heuristic as fallback

Implementation Plan

Step 1: Add NetworkOwnership struct to RuntimeState in state.rs
Step 2: Extend NetworkResult and NatOps trait to surface created iptables rules
Step 3: Update LocalOps NAT implementation to return rule strings
Step 4: Propagate ownership through TapNetwork::setup
Step 5: Persist ownership metadata in VmManager::setup_network
Step 6: Add ownership-aware GC method gc_owned_resources to VmManager
Step 7: Add CLI flags and wire up in doctor command
Step 8: Clear ownership on teardown in refresh_status, stop, remove
Step 9: Add unit tests and integration tests

Acceptance Criteria

  • Cleanup does not remove unrelated tap-* devices created outside this project
  • Cleanup does not remove unrelated NAT rules in the same subnet
  • Crash recovery still removes networking resources for known stopped/crashed VMs
  • --dry-run mode shows what would be removed and why
  • Default cleanup path is ownership-aware, not purely heuristic
  • Heuristic cleanup available via --force-heuristic flag
  • Backward compatible with old state.json files
  • All existing tests pass
## Implementation Spec for Issue #45: Ownership-Aware GC for Host Network Resources ### Objective Replace the heuristic-based GC logic in `doctor --gc` with an ownership-aware system that persists metadata tying each TAP device and iptables rule to a specific VM. The default cleanup path must only remove resources it can prove belong to this project and are associated with a stopped/crashed VM. Heuristic cleanup becomes a separate emergency mode. A dry-run mode is added for safe inspection. ### Requirements - Record ownership metadata (VM ID, TAP device name, list of iptables rules) at network resource creation time, persisted to disk alongside VM state - Default GC path only removes resources that are both (a) owned by this project and (b) associated with a stopped/crashed VM - GC skips deletion and reports a warning when ownership cannot be proven - A `--dry-run` flag on `doctor --gc` shows what would be removed and why, without actually removing anything - A `--force-heuristic` flag preserves the current heuristic cleanup as an explicit emergency option - Crash recovery still works for known stopped/crashed VMs using persisted ownership data - External `tap-*` devices and NAT rules in the same subnet that were not created by this project are never touched by the default GC path ### Files to Modify | File | Change | |------|--------| | `crates/my_hypervisor-lib/src/vm/state.rs` | Add `NetworkOwnership` struct to `RuntimeState` | | `crates/my_hypervisor-lib/src/network/traits.rs` | Add `iptables_rules` to `NetworkResult` | | `crates/my_hypervisor-lib/src/network/ops/mod.rs` | Change `NatOps::add_port_forward` return type to include rule strings | | `crates/my_hypervisor-lib/src/network/local/nat.rs` | Return created iptables rule strings | | `crates/my_hypervisor-lib/src/network/mosnet/nat.rs` | Update to match new return type | | `crates/my_hypervisor-lib/src/network/backends/tap_backend.rs` | Propagate ownership through `NetworkResult` | | `crates/my_hypervisor-lib/src/vm/manager.rs` | Persist ownership, add `gc_owned_resources`, clear on teardown | | `crates/my_hypervisor-cli/src/cli.rs` | Add `--dry-run` and `--force-heuristic` flags to `DoctorArgs` | | `crates/my_hypervisor-cli/src/commands/doctor.rs` | Wire up ownership-aware GC, keep heuristic as fallback | ### Implementation Plan **Step 1:** Add `NetworkOwnership` struct to `RuntimeState` in `state.rs` **Step 2:** Extend `NetworkResult` and `NatOps` trait to surface created iptables rules **Step 3:** Update `LocalOps` NAT implementation to return rule strings **Step 4:** Propagate ownership through `TapNetwork::setup` **Step 5:** Persist ownership metadata in `VmManager::setup_network` **Step 6:** Add ownership-aware GC method `gc_owned_resources` to `VmManager` **Step 7:** Add CLI flags and wire up in doctor command **Step 8:** Clear ownership on teardown in `refresh_status`, `stop`, `remove` **Step 9:** Add unit tests and integration tests ### Acceptance Criteria - [ ] Cleanup does not remove unrelated `tap-*` devices created outside this project - [ ] Cleanup does not remove unrelated NAT rules in the same subnet - [ ] Crash recovery still removes networking resources for known stopped/crashed VMs - [ ] `--dry-run` mode shows what would be removed and why - [ ] Default cleanup path is ownership-aware, not purely heuristic - [ ] Heuristic cleanup available via `--force-heuristic` flag - [ ] Backward compatible with old state.json files - [ ] All existing tests pass
rawan self-assigned this 2026-03-25 12:53:00 +00:00
Member

Test Results

  • Total: 167
  • Passed: 167
  • Failed: 0
  • Clippy: Clean (no warnings)
  • Format: Clean

All existing tests pass. 4 new tests added:

  • network_ownership_serialization_roundtrip
  • network_ownership_default_is_empty
  • runtime_state_backward_compat_without_network_ownership
  • runtime_state_with_network_ownership
## Test Results - **Total**: 167 - **Passed**: 167 - **Failed**: 0 - **Clippy**: Clean (no warnings) - **Format**: Clean All existing tests pass. 4 new tests added: - `network_ownership_serialization_roundtrip` - `network_ownership_default_is_empty` - `runtime_state_backward_compat_without_network_ownership` - `runtime_state_with_network_ownership`
Member

Implementation Summary

Changes Made

Files modified (9):

File Change
crates/my_hypervisor-lib/src/vm/state.rs Added NetworkOwnership struct and field on RuntimeState with serde backward compat
crates/my_hypervisor-lib/src/network/traits.rs Added iptables_rules: Vec<String> to NetworkResult
crates/my_hypervisor-lib/src/network/ops/mod.rs Changed NatOps::add_port_forward return type to Result<Vec<String>>
crates/my_hypervisor-lib/src/network/local/nat.rs Returns created iptables rule strings in delete form
crates/my_hypervisor-lib/src/network/mosnet/nat.rs Updated to match new return type (returns empty vec)
crates/my_hypervisor-lib/src/network/backends/tap_backend.rs Collects iptables rules from port forwards into NetworkResult
crates/my_hypervisor-lib/src/vm/manager.rs Persists ownership in setup_network, clears on stop/refresh_status, adds gc_owned_resources method with GcAction/GcResult/GcWarning types
crates/my_hypervisor-cli/src/cli.rs Added --dry-run and --force-heuristic flags to DoctorArgs
crates/my_hypervisor-cli/src/commands/doctor.rs Added ownership-aware gc_cleanup_owned function; default GC is now ownership-aware, heuristic available via --force-heuristic

How It Works

  1. At VM creation: setup_network records TAP device, bridge, IP, and exact iptables rules in NetworkOwnership on RuntimeState, persisted to state.json
  2. At VM stop/crash: stop() and refresh_status() clear network_ownership after successful teardown
  3. GC (doctor --gc): Default path reads ownership metadata from stopped/crashed VMs, only deletes resources it can prove it owns. Skips VMs without ownership metadata (with a warning)
  4. --dry-run: Shows what would be removed without acting
  5. --force-heuristic: Falls back to the original heuristic cleanup for emergency use
  6. Backward compatible: Old state.json files without network_ownership deserialize fine via serde(default)

Test Results

  • 167 tests passed, 0 failed
  • 4 new tests added for ownership serialization and backward compatibility
  • Clippy clean, fmt clean
## Implementation Summary ### Changes Made **Files modified (9):** | File | Change | |------|--------| | `crates/my_hypervisor-lib/src/vm/state.rs` | Added `NetworkOwnership` struct and field on `RuntimeState` with serde backward compat | | `crates/my_hypervisor-lib/src/network/traits.rs` | Added `iptables_rules: Vec<String>` to `NetworkResult` | | `crates/my_hypervisor-lib/src/network/ops/mod.rs` | Changed `NatOps::add_port_forward` return type to `Result<Vec<String>>` | | `crates/my_hypervisor-lib/src/network/local/nat.rs` | Returns created iptables rule strings in delete form | | `crates/my_hypervisor-lib/src/network/mosnet/nat.rs` | Updated to match new return type (returns empty vec) | | `crates/my_hypervisor-lib/src/network/backends/tap_backend.rs` | Collects iptables rules from port forwards into `NetworkResult` | | `crates/my_hypervisor-lib/src/vm/manager.rs` | Persists ownership in `setup_network`, clears on `stop`/`refresh_status`, adds `gc_owned_resources` method with `GcAction`/`GcResult`/`GcWarning` types | | `crates/my_hypervisor-cli/src/cli.rs` | Added `--dry-run` and `--force-heuristic` flags to `DoctorArgs` | | `crates/my_hypervisor-cli/src/commands/doctor.rs` | Added ownership-aware `gc_cleanup_owned` function; default GC is now ownership-aware, heuristic available via `--force-heuristic` | ### How It Works 1. **At VM creation**: `setup_network` records TAP device, bridge, IP, and exact iptables rules in `NetworkOwnership` on `RuntimeState`, persisted to `state.json` 2. **At VM stop/crash**: `stop()` and `refresh_status()` clear `network_ownership` after successful teardown 3. **GC (`doctor --gc`)**: Default path reads ownership metadata from stopped/crashed VMs, only deletes resources it can prove it owns. Skips VMs without ownership metadata (with a warning) 4. **`--dry-run`**: Shows what would be removed without acting 5. **`--force-heuristic`**: Falls back to the original heuristic cleanup for emergency use 6. **Backward compatible**: Old state.json files without `network_ownership` deserialize fine via `serde(default)` ### Test Results - 167 tests passed, 0 failed - 4 new tests added for ownership serialization and backward compatibility - Clippy clean, fmt clean
rawan added this to the ACTIVE project 2026-03-26 13:27:29 +00:00
rawan closed this issue 2026-04-01 10:23:11 +00:00
Sign in to join this conversation.
No labels
No milestone
No project
No assignees
2 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
geomind_code/my_hypervisor#45
No description provided.