Resource Cleanup on Panic/Crash #26

Merged
rawan merged 3 commits from development_network_cleanup into development 2026-03-09 10:10:51 +00:00
Member

related issues: #11

Change 1: Auto-cleanup in refresh_status()

Added teardown_network + cleanup_storage() to refresh_status() when a dead VM process is detected
Previously it only marked the VM as Stopped — now it cleans up network interfaces and storage before updating state

Change 2: chvm doctor --gc

Added --gc flag to the doctor command for manual garbage collection
Level 1 (state-aware): Uses refresh_status() to detect crashed VMs and clean their resources
Level 2 (stateless orphan detection): Scans for tap-* interfaces not belonging to any running VM and removes them. Scans iptables nat rules referencing chvm-subnet IPs (192.168.200.X) not owned by a running VM and removes them

Other

Added install-global Makefile target to copy binaries to /usr/local/bin

How to test

Test 1: refresh_status() auto-cleanup (via chvm list)

1. Start a VM with TAP networking

doas chvm run -d --name test-gc alpine:latest

2. Note the cloud-hypervisor PID

doas chvm inspect test-gc | grep vmm_pid

3. Verify TAP device and iptables rules exist

ip link show | grep tap-
doas iptables -t nat -S | grep 192.168.200

4. Kill the cloud-hypervisor process directly (simulate crash)

doas kill -9 <vmm_pid>

5. Run chvm list — refresh_status should detect the dead process, tear down TAP device and iptables rules, then mark VM as Stopped

doas chvm list -a

6. Verify TAP device and iptables rules are gone

ip link show | grep tap-
doas iptables -t nat -S | grep 192.168.200

Test 2: chvm doctor --gc orphan cleanup

1. Create an orphaned TAP device manually (simulates leftover from crash)

doas ip tuntap add dev tap-deadbeef mode tap
doas ip link set tap-deadbeef up

2. Verify it exists

ip link show | grep tap-deadbeef

3. Run doctor --gc

doas chvm doctor --gc

4. Verify it was removed

ip link show | grep tap-deadbeef # should be gone

Test 3: Orphaned iptables rules (no running VMs)

1. Make sure no VMs are running

doas chvm stop $(doas chvm list -q) 2>/dev/null

2. Add a fake orphaned iptables rule

doas iptables -t nat -A PREROUTING -p tcp --dport 9999
-j DNAT --to-destination 192.168.200.99:80

3. Verify it exists

doas iptables -t nat -S | grep 192.168.200.99

4. Run doctor --gc

doas chvm doctor --gc

5. Verify it was removed

doas iptables -t nat -S | grep 192.168.200.99 # should be gone

related issues: https://forge.ourworld.tf/geomind_code/my_hypervisor/issues/11 ## Change 1: Auto-cleanup in refresh_status() Added teardown_network + cleanup_storage() to refresh_status() when a dead VM process is detected Previously it only marked the VM as Stopped — now it cleans up network interfaces and storage before updating state ## Change 2: chvm doctor --gc Added --gc flag to the doctor command for manual garbage collection Level 1 (state-aware): Uses refresh_status() to detect crashed VMs and clean their resources Level 2 (stateless orphan detection): Scans for tap-* interfaces not belonging to any running VM and removes them. Scans iptables nat rules referencing chvm-subnet IPs (192.168.200.X) not owned by a running VM and removes them ## Other Added install-global Makefile target to copy binaries to /usr/local/bin # How to test ## Test 1: refresh_status() auto-cleanup (via chvm list) #### 1. Start a VM with TAP networking doas chvm run -d --name test-gc alpine:latest #### 2. Note the cloud-hypervisor PID doas chvm inspect test-gc | grep vmm_pid #### 3. Verify TAP device and iptables rules exist ip link show | grep tap- doas iptables -t nat -S | grep 192.168.200 #### 4. Kill the cloud-hypervisor process directly (simulate crash) doas kill -9 <vmm_pid> #### 5. Run chvm list — refresh_status should detect the dead process, tear down TAP device and iptables rules, then mark VM as Stopped doas chvm list -a #### 6. Verify TAP device and iptables rules are gone ip link show | grep tap- doas iptables -t nat -S | grep 192.168.200 ## Test 2: chvm doctor --gc orphan cleanup #### 1. Create an orphaned TAP device manually (simulates leftover from crash) doas ip tuntap add dev tap-deadbeef mode tap doas ip link set tap-deadbeef up #### 2. Verify it exists ip link show | grep tap-deadbeef #### 3. Run doctor --gc doas chvm doctor --gc #### 4. Verify it was removed ip link show | grep tap-deadbeef # should be gone ## Test 3: Orphaned iptables rules (no running VMs) #### 1. Make sure no VMs are running doas chvm stop $(doas chvm list -q) 2>/dev/null #### 2. Add a fake orphaned iptables rule doas iptables -t nat -A PREROUTING -p tcp --dport 9999 \ -j DNAT --to-destination 192.168.200.99:80 #### 3. Verify it exists doas iptables -t nat -S | grep 192.168.200.99 #### 4. Run doctor --gc doas chvm doctor --gc #### 5. Verify it was removed doas iptables -t nat -S | grep 192.168.200.99 # should be gone
When refresh_status() detects a dead VM process, it now tears down
network interfaces and cleans up storage resources before marking the
VM as stopped. This prevents TAP devices and iptables rules from
leaking when a VM process crashes.
feat: added garbage cleaner for network resources
All checks were successful
Unit and Integration Test / test (push) Successful in 2m49s
aa0c42f658
salmaelsoly requested changes 2026-03-08 11:22:17 +00:00
Dismissed
salmaelsoly left a comment
Member

can we have tests may be integration test for whole flow and unit test for functions like extract_chvm_ip, can we update docs as well to reflect new flag usage?

can we have tests may be integration test for whole flow and unit test for functions like extract_chvm_ip, can we update docs as well to reflect new flag usage?
@ -124,0 +158,4 @@
let cleaned = running_before.len() - running_after.len();
if cleaned > 0 {
println!("Cleaned up {} crashed VM(s)", cleaned);
Member

during refresh status there can be call for more running vm so running after maybe bigger thah running before len( ) returns usize in this case it can panic i think we can use saturating sub

during refresh status there can be call for more running vm so running after maybe bigger thah running before len( ) returns usize in this case it can panic i think we can use saturating sub
Author
Member

In practice running_after <= running_before always holds since refresh_status() only moves VMs from Running to Stopped, not the other way

but i added the suturating sub for extra safety

In practice running_after <= running_before always holds since refresh_status() only moves VMs from Running to Stopped, not the other way but i added the suturating sub for extra safety
@ -124,0 +169,4 @@
.map(|id| id[..std::cmp::min(8, id.len())].to_string())
.collect();
// Collect IPs of running VMs for iptables orphan detection
Member

can we extract this 8 in a constant?

can we extract this 8 in a constant?
rawan marked this conversation as resolved
@ -124,0 +273,4 @@
Ok(out) if out.status.success() => removed += 1,
_ => {}
}
}
Member

i think we can add warn log here in case of failed deletions

i think we can add warn log here in case of failed deletions
rawan marked this conversation as resolved
refactor: updating docs, adding unit, integration tests, refactoring doctor
All checks were successful
Unit and Integration Test / test (push) Successful in 1m49s
f8920b8ff6
rawan merged commit af7e1e59dd into development 2026-03-09 10:10:51 +00:00
rawan deleted branch development_network_cleanup 2026-03-09 10:10:51 +00:00
Sign in to join this conversation.
No reviewers
No labels
No milestone
No project
No assignees
2 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
geomind_code/my_hypervisor!26
No description provided.