Fix false-green validation and failed-start cleanup paths #67

					
				@ -617,0 +607,4 @@

				        info!(vm_id = %state.id, "Detected dead VM process, cleaning up resources");

				        if state.vm_config.network.mode != NetworkMode::None {

				            let _ = self.teardown_network(&state).await;

Why is the error ignored here?
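One way to keep the cleanup best-effort without silently dropping the failure is sketched below. This is only an illustration under assumptions: `TeardownError`, the `fail` flag, and the `warnings` vector are hypothetical stand-ins, and the real code would presumably emit a log line next to the existing `info!` call rather than push to a vector.

```rust
use std::fmt;

/// Hypothetical error type; the real error returned by `teardown_network`
/// is not shown in the diff.
#[derive(Debug)]
struct TeardownError(String);

impl fmt::Display for TeardownError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.0)
    }
}

/// Stand-in for `teardown_network`; `fail` simulates a failing teardown.
fn teardown_network(fail: bool) -> Result<(), TeardownError> {
    if fail {
        Err(TeardownError("tap device busy".to_string()))
    } else {
        Ok(())
    }
}

/// Cleanup continues either way, but a failure is recorded instead of
/// being discarded with `let _ = ...`.
fn cleanup_network(fail: bool, warnings: &mut Vec<String>) {
    if let Err(err) = teardown_network(fail) {
        warnings.push(format!("network teardown failed: {err}"));
    }
}
```

The function still returns nothing, so the dead-VM cleanup path cannot be aborted by a teardown failure; the only change is that the failure becomes visible.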

rawan marked this conversation as resolved

rawdaGastan reviewed

2026-03-24 15:41:24 +00:00

crates/my_hypervisor-lib/src/vm/manager.rs Outdated

					
				@ -1153,0 +1169,4 @@

				        let _ = std::fs::remove_file(vm_dir.join("vsock.sock"));

				        if state.vm_config.network.mode != NetworkMode::None {

				            let _ = self.teardown_network(state).await;

here too

rawan marked this conversation as resolved

rawdaGastan reviewed

2026-03-24 15:42:40 +00:00

crates/my_hypervisor-lib/src/vm/manager.rs Outdated

					
				@ -567,0 +565,4 @@

				        if state.status == VmState::Running && state.vmm_pid > 0 && !process_running(state.vmm_pid)

				        {

				            self.reconcile_dead_vm(state).await;

Race condition here: between process_running() returning true and the actual API socket call, the process can die. This was always possible, but the new early-return path creates a false sense of safety, so callers may stop handling socket-level errors gracefully. The API call below still needs to handle ECONNREFUSED/BrokenPipe and trigger reconciliation; otherwise a process that dies in that window produces an unreconciled, opaque error.
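A rough sketch of what handling that window could look like, using only `std::io`. The names `indicates_dead_peer` and `call_with_reconcile` are hypothetical, and the closures stand in for the real API socket call and for `reconcile_dead_vm`; the actual manager code is async and may classify errors differently.

```rust
use std::io::{self, ErrorKind};

/// True for the socket errors named in the comment above: the peer
/// refused the connection or the pipe broke mid-call.
fn indicates_dead_peer(err: &io::Error) -> bool {
    matches!(
        err.kind(),
        ErrorKind::ConnectionRefused | ErrorKind::BrokenPipe
    )
}

/// Runs `api_call`; if it fails in a way that suggests the hypervisor
/// process died after the liveness check, runs `reconcile` and returns
/// `Ok(None)` instead of an opaque error. Other errors pass through.
fn call_with_reconcile<T>(
    api_call: impl FnOnce() -> io::Result<T>,
    reconcile: impl FnOnce(),
) -> io::Result<Option<T>> {
    match api_call() {
        Ok(value) => Ok(Some(value)),
        Err(err) if indicates_dead_peer(&err) => {
            reconcile();
            Ok(None)
        }
        Err(err) => Err(err),
    }
}
```

With this shape the early `process_running` check becomes an optimization only; correctness rests on the error handling around the call itself.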


rawan marked this conversation as resolved

rawdaGastan reviewed

2026-03-24 15:43:04 +00:00

crates/my_hypervisor-lib/src/vm/manager.rs

					
				@ -616,1 +603,4 @@

				    /// Clean up resources and mark a VM as Stopped when its hypervisor process

				    /// has died out-of-band. Used by both `stats` and `refresh_status`.

				    async fn reconcile_dead_vm(&mut self, state: RuntimeState) {

This takes ownership of the state, but callers sometimes hold only a borrow.

Both callers create their own clones and pass them to the function. The function needs ownership because it mutates the state in:

        let mut updated = state;
        updated.status = VmState::Stopped;
        updated.stopped_at = Some(Utc::now());
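For illustration, the clone-then-pass-by-value pattern described above might look roughly like this. The struct fields and the `now` parameter are simplified assumptions: the real `RuntimeState` has more fields, the real function is an async method on the manager, and it uses `Utc::now()` rather than a passed-in timestamp.

```rust
#[derive(Clone, Debug, PartialEq)]
enum VmState {
    Running,
    Stopped,
}

/// Simplified stand-in for the real `RuntimeState`.
#[derive(Clone, Debug)]
struct RuntimeState {
    id: String,
    status: VmState,
    stopped_at: Option<u64>,
}

/// Takes the state by value so it can be mutated without touching the
/// caller's copy; `now` replaces `Utc::now()` to keep the sketch std-only.
fn reconcile_dead_vm(state: RuntimeState, now: u64) -> RuntimeState {
    let mut updated = state;
    updated.status = VmState::Stopped;
    updated.stopped_at = Some(now);
    updated
}
```

A caller that only holds a borrow passes `state.clone()`, so its own view stays unchanged until the updated state is persisted.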


rawdaGastan marked this conversation as resolved

rawan added 1 commit

2026-03-25 11:32:22 +00:00

fix: remove silent errors, fixed race condition

Unit and Integration Test / test (push) Successful in 1m48s


cd1f59522c

rawan requested review from rawdaGastan

2026-03-25 11:37:02 +00:00

rawdaGastan reviewed

2026-03-25 13:35:01 +00:00

crates/my_hypervisor-lib/src/vm/manager.rs

					
				@ -656,6 +693,11 @@ impl VmManager {

				            .unwrap_or(MyceliumHealth::Unknown)

				        };

				        if health != MyceliumHealth::Healthy && !process_running(state.vmm_pid) {

shouldn't we check after the update?

after what update?

// Update persisted state

What the current order does:

1- If unhealthy + process dead → reconcile, skip the persist, return Unknown
2- Otherwise → persist the health update (healthy, or unhealthy but process still alive)

Moving the check after the persist would mean writing state that is immediately discarded by reconciliation.
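The ordering described above, reduced to a synchronous sketch. `Store`, `check_health`, and the `Unhealthy` variant are hypothetical names for illustration; only `Healthy` and `Unknown` appear in the diff, and the real persist and reconcile steps are methods on the manager.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum MyceliumHealth {
    Healthy,
    Unhealthy, // assumed variant, not shown in the diff
    Unknown,
}

/// Records what would have been persisted or reconciled.
struct Store {
    persisted: Vec<MyceliumHealth>,
    reconciled: bool,
}

/// Checks for a dead process before persisting, so a state that
/// reconciliation is about to overwrite is never written.
fn check_health(
    health: MyceliumHealth,
    process_alive: bool,
    store: &mut Store,
) -> MyceliumHealth {
    if health != MyceliumHealth::Healthy && !process_alive {
        store.reconciled = true; // stands in for reconcile_dead_vm
        return MyceliumHealth::Unknown;
    }
    store.persisted.push(health); // stands in for the persisted-state update
    health
}
```

Swapping the two steps would add one write per dead VM that reconciliation then replaces with the Stopped state.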
