Fix false-green validation and failed-start cleanup paths #67
No reviewers
Labels
No labels
No milestone
No project
No assignees
2 participants
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
geomind_code/my_hypervisor!67
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "development_fix_test_summary"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
#44
@ -617,0 +607,4 @@info!(vm_id = %state.id, "Detected dead VM process, cleaning up resources");if state.vm_config.network.mode != NetworkMode::None {let _ = self.teardown_network(&state).await;why ignoring error here?
@ -1153,0 +1169,4 @@let _ = std::fs::remove_file(vm_dir.join("vsock.sock"));if state.vm_config.network.mode != NetworkMode::None {let _ = self.teardown_network(state).await;here too
@ -567,0 +565,4 @@if state.status == VmState::Running && state.vmm_pid > 0 && !process_running(state.vmm_pid){self.reconcile_dead_vm(state).await;race condition here:
Between process_running() returning true and the actual API socket call, the process can die. This was always possible, but the new early-return path creates a false sense of safety — callers may stop handling socket-level errors gracefully. The API call below still needs to handle ECONNREFUSED/BrokenPipe and trigger reconciliation, otherwise a process that dies in that window produces an unreconciled opaque error.@ -616,1 +603,4 @@/// Clean up resources and mark a VM as Stopped when its hypervisor process/// has died out-of-band. Used by both `stats` and `refresh_status`.async fn reconcile_dead_vm(&mut self, state: RuntimeState) {takes ownership but callers sometimes hold a borrow
both callers create their own clones and pass them to the function, the function needs ownership because it needs to mutate the stats in:
@ -656,6 +693,11 @@ impl VmManager {.unwrap_or(MyceliumHealth::Unknown)};if health != MyceliumHealth::Healthy && !process_running(state.vmm_pid) {shouldn't we check after the update?
after what update?
// Update persisted state
what current order does:
1- If unhealthy + process dead → reconcile, skip the persist, return Unknown
2- Otherwise → persist the health update (healthy or unhealthy but process still alive)
Moving the check after the persist would mean writing state that gets immediately discarded by reconciliation