Resource Cleanup on Panic/Crash #26

rawan · 2026-03-08T10:05:30Z

rawan commented

2026-03-08 10:05:30 +00:00

related issues: #11

Change 1: Auto-cleanup in refresh_status()

Added teardown_network + cleanup_storage() to refresh_status() when a dead VM process is detected
Previously it only marked the VM as Stopped — now it cleans up network interfaces and storage before updating state

Change 2: chvm doctor --gc

Added --gc flag to the doctor command for manual garbage collection
Level 1 (state-aware): Uses refresh_status() to detect crashed VMs and clean their resources
Level 2 (stateless orphan detection): Scans for tap-* interfaces not belonging to any running VM and removes them. Scans iptables nat rules referencing chvm-subnet IPs (192.168.200.X) not owned by a running VM and removes them

Other

Added install-global Makefile target to copy binaries to /usr/local/bin

How to test

Test 1: refresh_status() auto-cleanup (via chvm list)

1. Start a VM with TAP networking

doas chvm run -d --name test-gc alpine:latest

2. Note the cloud-hypervisor PID

doas chvm inspect test-gc | grep vmm_pid

3. Verify TAP device and iptables rules exist

ip link show | grep tap-
doas iptables -t nat -S | grep 192.168.200

4. Kill the cloud-hypervisor process directly (simulate crash)

doas kill -9 <vmm_pid>

5. Run chvm list — refresh_status should detect the dead process, tear down TAP device and iptables rules, then mark VM as Stopped

doas chvm list -a

6. Verify TAP device and iptables rules are gone

ip link show | grep tap-
doas iptables -t nat -S | grep 192.168.200

Test 2: chvm doctor --gc orphan cleanup

1. Create an orphaned TAP device manually (simulates leftover from crash)

doas ip tuntap add dev tap-deadbeef mode tap
doas ip link set tap-deadbeef up

2. Verify it exists

ip link show | grep tap-deadbeef

3. Run doctor --gc

doas chvm doctor --gc

4. Verify it was removed

ip link show | grep tap-deadbeef # should be gone

Test 3: Orphaned iptables rules (no running VMs)

1. Make sure no VMs are running

doas chvm stop $(doas chvm list -q) 2>/dev/null

2. Add a fake orphaned iptables rule

doas iptables -t nat -A PREROUTING -p tcp --dport 9999
-j DNAT --to-destination 192.168.200.99:80

3. Verify it exists

doas iptables -t nat -S | grep 192.168.200.99

4. Run doctor --gc

doas chvm doctor --gc

5. Verify it was removed

doas iptables -t nat -S | grep 192.168.200.99 # should be gone

related issues: https://forge.ourworld.tf/geomind_code/my_hypervisor/issues/11 ## Change 1: Auto-cleanup in refresh_status() Added teardown_network + cleanup_storage() to refresh_status() when a dead VM process is detected Previously it only marked the VM as Stopped — now it cleans up network interfaces and storage before updating state ## Change 2: chvm doctor --gc Added --gc flag to the doctor command for manual garbage collection Level 1 (state-aware): Uses refresh_status() to detect crashed VMs and clean their resources Level 2 (stateless orphan detection): Scans for tap-* interfaces not belonging to any running VM and removes them. Scans iptables nat rules referencing chvm-subnet IPs (192.168.200.X) not owned by a running VM and removes them ## Other Added install-global Makefile target to copy binaries to /usr/local/bin # How to test ## Test 1: refresh_status() auto-cleanup (via chvm list) #### 1. Start a VM with TAP networking doas chvm run -d --name test-gc alpine:latest #### 2. Note the cloud-hypervisor PID doas chvm inspect test-gc | grep vmm_pid #### 3. Verify TAP device and iptables rules exist ip link show | grep tap- doas iptables -t nat -S | grep 192.168.200 #### 4. Kill the cloud-hypervisor process directly (simulate crash) doas kill -9 <vmm_pid> #### 5. Run chvm list — refresh_status should detect the dead process, tear down TAP device and iptables rules, then mark VM as Stopped doas chvm list -a #### 6. Verify TAP device and iptables rules are gone ip link show | grep tap- doas iptables -t nat -S | grep 192.168.200 ## Test 2: chvm doctor --gc orphan cleanup #### 1. Create an orphaned TAP device manually (simulates leftover from crash) doas ip tuntap add dev tap-deadbeef mode tap doas ip link set tap-deadbeef up #### 2. Verify it exists ip link show | grep tap-deadbeef #### 3. Run doctor --gc doas chvm doctor --gc #### 4. Verify it was removed ip link show | grep tap-deadbeef # should be gone ## Test 3: Orphaned iptables rules (no running VMs) #### 1. Make sure no VMs are running doas chvm stop $(doas chvm list -q) 2>/dev/null #### 2. Add a fake orphaned iptables rule doas iptables -t nat -A PREROUTING -p tcp --dport 9999 \ -j DNAT --to-destination 192.168.200.99:80 #### 3. Verify it exists doas iptables -t nat -S | grep 192.168.200.99 #### 4. Run doctor --gc doas chvm doctor --gc #### 5. Verify it was removed doas iptables -t nat -S | grep 192.168.200.99 # should be gone

rawan added 2 commits

2026-03-08 10:05:30 +00:00

refactor(vm): make refresh_status async with resource cleanup f8253b87af

When refresh_status() detects a dead VM process, it now tears down
network interfaces and cleans up storage resources before marking the
VM as stopped. This prevents TAP devices and iptables rules from
leaking when a VM process crashes.

feat: added garbage cleaner for network resources

Unit and Integration Test / test (push) Successful in 2m49s

Details

aa0c42f658

rawan referenced this pull request

2026-03-08 10:06:12 +00:00

[ENHANCEMENT] Resource Cleanup on Panic/Crash #11

salmaelsoly requested changes

2026-03-08 11:22:17 +00:00

Dismissed

salmaelsoly left a comment

can we have tests may be integration test for whole flow and unit test for functions like extract_chvm_ip, can we update docs as well to reflect new flag usage?

crates/chvm-cli/src/commands/doctor.rs Outdated

					
				@ -124,0 +158,4 @@

				    let cleaned = running_before.len() - running_after.len();

				    if cleaned > 0 {

				        println!("Cleaned up {} crashed VM(s)", cleaned);

during refresh status there can be call for more running vm so running after maybe bigger thah running before len( ) returns usize in this case it can panic i think we can use saturating sub

In practice running_after <= running_before always holds since refresh_status() only moves VMs from Running to Stopped, not the other way

but i added the suturating sub for extra safety

In practice running_after <= running_before always holds since refresh_status() only moves VMs from Running to Stopped, not the other way but i added the suturating sub for extra safety

crates/chvm-cli/src/commands/doctor.rs

					
				@ -124,0 +169,4 @@

				        .map(|id| id[..std::cmp::min(8, id.len())].to_string())

				        .collect();

				    // Collect IPs of running VMs for iptables orphan detection

can we extract this 8 in a constant?

rawan marked this conversation as resolved

crates/chvm-cli/src/commands/doctor.rs Outdated

					
				@ -124,0 +273,4 @@

				            Ok(out) if out.status.success() => removed += 1,

				            _ => {}

				        }

				    }

i think we can add warn log here in case of failed deletions

rawan marked this conversation as resolved

rawan added 1 commit

2026-03-08 22:13:23 +00:00

refactor: updating docs, adding unit, integration tests, refactoring doctor

Unit and Integration Test / test (push) Successful in 1m49s

Details

f8920b8ff6

rawan requested review from salmaelsoly

2026-03-08 22:16:30 +00:00

salmaelsoly approved these changes

2026-03-09 09:07:51 +00:00

rawan merged commit af7e1e59dd into development

2026-03-09 10:10:51 +00:00

rawan deleted branch development_network_cleanup

2026-03-09 10:10:51 +00:00

rawan referenced this pull request from a commit

2026-03-09 10:10:53 +00:00

Merge pull request 'Resource Cleanup on Panic/Crash' (#26) from development_network_cleanup into development

Sign in to join this conversation.

No reviewers

No labels

No milestone

No project

No assignees

2 participants

Notifications

Due date

The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference

geomind_code/my_hypervisor!26

Rows
Columns