geomind_code/my_hypervisor

Fork 0

Fix false-green validation and failed-start cleanup paths #44

New issue

Closed

opened 2026-03-19 19:45:41 +00:00 by thabeta · 3 comments

thabeta commented

2026-03-19 19:45:41 +00:00

Owner

Problem

The current validation and runtime lifecycle paths can report success even when a suite or VM has already failed underneath them.

This is really three related problems in the same flow:

The top-level shell runner can produce a false green summary.
VM start is not treated as a transactional operation.
Runtime inspection can talk to stale sockets instead of first reconciling whether the process is still alive.

Current behavior

1. False-green test summary

In tests/run_tests.sh, each suite is run with:

bash "$suite" || suite_rc=$?

but the final result only reads the suite result file if that file exists and is non-empty, and the final exit code is based only on the accumulated TOTAL_FAIL count.

That means a suite can exit non-zero before it writes its summary file, disappear from the totals entirely, and the runner can still print ALL TESTS PASSED.

The helper flow makes this easy to hit:

test_end() returns non-zero when any test failed
suites commonly run with set -e
export_results() only writes the summary file if the suite reaches it

So a suite can fail, exit early, never export its counts, and be silently omitted from the final summary.

2. Failed starts leave partial state behind

In the VM start path, storage is prepared first, then the rootfs source is rewritten to point at the prepared storage, then networking and volume mounts are set up, and only after that is the hypervisor launched.

If anything in the later part of that sequence fails, the operation is not treated as a rollbackable transaction. Partial artifacts can remain behind, for example:

prepared storage state
runtime directories
helper processes
stale sockets
partially updated persisted state

That makes retries and crash recovery harder than they need to be.

3. `stats` uses persisted state without first reconciling liveness

The stats path checks whether the stored VM status says Running, then goes directly to the API socket. It does not first refresh process liveness.

If the hypervisor died out of band, users can get a low-level API/socket failure instead of a stable "not running" result.

Why this matters

The test harness becomes untrustworthy exactly when it is needed most: during regressions.
A failed or crashed start can poison the next run with leftovers from the previous attempt.
Operational commands become noisy and confusing because they surface stale runtime artifacts rather than the real VM state.

Suggested implementation

Test runner

Make every suite produce an explicit final status record even if helpers return non-zero.
Treat a missing or empty results file as a suite failure, not as "no data".
Include suite_rc in the final aggregation logic.
Fail the overall run if any suite crashed before export.

Start path

Treat start as a transactional sequence.
Track which artifacts were created during the current attempt.
On any failure after state creation, clean up storage, sockets, helper processes, and any temporary runtime metadata created during that attempt.
Make cleanup idempotent so it is safe to run during both failed starts and crash recovery.

Runtime inspection

Refresh liveness before calling the hypervisor API.
If the process is already dead, update state and return a stable not-running error.
Remove stale API/vsock sockets during crash cleanup so restarts do not collide with old runtime artifacts.

Acceptance criteria

A suite that exits non-zero before export_results() causes the top-level runner to fail.
The top-level runner cannot print ALL TESTS PASSED if any suite crashed, exited early, or failed to export its results.
A failed start leaves no stale runtime socket, helper daemon, or partially prepared VM state behind.
Restarting after a crash does not fail because of leftover API/vsock artifacts.
stats on a dead VM returns a stable not-running result instead of a raw socket/API error.

Affected areas

tests/run_tests.sh
tests/helpers.sh
VM manager start / crash-cleanup / stats paths

## Problem The current validation and runtime lifecycle paths can report success even when a suite or VM has already failed underneath them. This is really three related problems in the same flow: 1. The top-level shell runner can produce a false green summary. 2. VM start is not treated as a transactional operation. 3. Runtime inspection can talk to stale sockets instead of first reconciling whether the process is still alive. ## Current behavior ### 1. False-green test summary In `tests/run_tests.sh`, each suite is run with: ```sh bash "$suite" || suite_rc=$? ``` but the final result only reads the suite result file if that file exists and is non-empty, and the final exit code is based only on the accumulated `TOTAL_FAIL` count. That means a suite can exit non-zero before it writes its summary file, disappear from the totals entirely, and the runner can still print `ALL TESTS PASSED`. The helper flow makes this easy to hit: - `test_end()` returns non-zero when any test failed - suites commonly run with `set -e` - `export_results()` only writes the summary file if the suite reaches it So a suite can fail, exit early, never export its counts, and be silently omitted from the final summary. ### 2. Failed starts leave partial state behind In the VM start path, storage is prepared first, then the rootfs source is rewritten to point at the prepared storage, then networking and volume mounts are set up, and only after that is the hypervisor launched. If anything in the later part of that sequence fails, the operation is not treated as a rollbackable transaction. Partial artifacts can remain behind, for example: - prepared storage state - runtime directories - helper processes - stale sockets - partially updated persisted state That makes retries and crash recovery harder than they need to be. ### 3. `stats` uses persisted state without first reconciling liveness The `stats` path checks whether the stored VM status says `Running`, then goes directly to the API socket. It does not first refresh process liveness. If the hypervisor died out of band, users can get a low-level API/socket failure instead of a stable "not running" result. ## Why this matters - The test harness becomes untrustworthy exactly when it is needed most: during regressions. - A failed or crashed start can poison the next run with leftovers from the previous attempt. - Operational commands become noisy and confusing because they surface stale runtime artifacts rather than the real VM state. ## Suggested implementation ### Test runner 1. Make every suite produce an explicit final status record even if helpers return non-zero. 2. Treat a missing or empty results file as a suite failure, not as "no data". 3. Include `suite_rc` in the final aggregation logic. 4. Fail the overall run if any suite crashed before export. ### Start path 1. Treat start as a transactional sequence. 2. Track which artifacts were created during the current attempt. 3. On any failure after state creation, clean up storage, sockets, helper processes, and any temporary runtime metadata created during that attempt. 4. Make cleanup idempotent so it is safe to run during both failed starts and crash recovery. ### Runtime inspection 1. Refresh liveness before calling the hypervisor API. 2. If the process is already dead, update state and return a stable not-running error. 3. Remove stale API/vsock sockets during crash cleanup so restarts do not collide with old runtime artifacts. ## Acceptance criteria - A suite that exits non-zero before `export_results()` causes the top-level runner to fail. - The top-level runner cannot print `ALL TESTS PASSED` if any suite crashed, exited early, or failed to export its results. - A failed start leaves no stale runtime socket, helper daemon, or partially prepared VM state behind. - Restarting after a crash does not fail because of leftover API/vsock artifacts. - `stats` on a dead VM returns a stable not-running result instead of a raw socket/API error. ## Affected areas - `tests/run_tests.sh` - `tests/helpers.sh` - VM manager start / crash-cleanup / stats paths

rawdaGastan was assigned by thabeta

2026-03-19 19:56:30 +00:00

rawan self-assigned this

2026-03-24 09:07:11 +00:00

rawdaGastan was unassigned by rawan

2026-03-24 09:07:23 +00:00

rawan commented

2026-03-24 09:34:44 +00:00

Member

Implementation Spec for Issue #44: Fix False-Green Validation and Failed-Start Cleanup Paths

Objective

Fix three reliability issues: (1) the test runner can report "ALL TESTS PASSED" when suites crash before exporting results, (2) a failed VM start leaves behind partial state/sockets/processes that poison retries, and (3) stats on a VM whose hypervisor died out-of-band produces a raw socket error instead of a clean "not running" message.

Requirements

A suite that exits non-zero before calling export_results() must cause the top-level runner to fail.
The runner must never print ALL TESTS PASSED if any suite crashed, exited early, or failed to export results.
A failed start must leave no stale API socket, vsock socket, virtiofsd process, or partially-updated persisted state.
Restarting a VM after a crash must not fail due to leftover artifacts from the prior attempt.
stats on a VM whose hypervisor process has died must return a stable "not running" result (not a raw socket/API error).

Files to Modify

File	Purpose
`tests/run_tests.sh`	Fix false-green aggregation logic
`tests/helpers.sh`	Make `export_results` robust against early exits
`crates/my_hypervisor-lib/src/vm/manager.rs`	Transactional start cleanup; liveness-aware `stats`
`crates/my_hypervisor-cli/src/commands/stats.rs`	Adjust for `&mut self` signature change on `stats()`

Implementation Plan

Step 1: Fix false-green test runner (`tests/run_tests.sh`)

Files: tests/run_tests.sh

Track SUITE_CRASHED counter alongside TOTAL_FAIL
When results file is missing/empty AND suite_rc != 0, increment TOTAL_FAIL and SUITE_CRASHED
When suite_rc != 0 but results file shows 0 failures, trust suite_rc and increment fail count
In final verdict, fail if TOTAL_FAIL > 0 OR SUITE_CRASHED > 0
Dependencies: none

Step 2: Make `export_results` fire on exit via trap (`tests/helpers.sh`)

Files: tests/helpers.sh

Add export_results call inside existing trap_cleanup function
This ensures results are always written, even when set -e kills the suite early
Explicit export_results calls at end of suites become redundant but harmless (idempotent)
Dependencies: none

Step 3: Make VM start transactional with cleanup on failure (`manager.rs`)

Files: crates/my_hypervisor-lib/src/vm/manager.rs

Create private cleanup_failed_start() helper that cleans up: storage, sockets (api.sock + vsock.sock), virtiofsd processes, network
On any failure after prepare_storage, call cleanup_failed_start before returning error
At start of start(), pre-clean stale sockets from prior crashed attempts
Make cleanup idempotent (ignore "not found" errors)
Dependencies: none

Step 4: Make `stats` liveness-aware (`manager.rs`)

Files: crates/my_hypervisor-lib/src/vm/manager.rs, crates/my_hypervisor-cli/src/commands/stats.rs

Before checking API socket, verify hypervisor process is alive via process_running()
If process is dead: reconcile state to Stopped, clean up network/storage, return VmNotRunning
Change method signature from &self to &mut self if needed
Update CLI caller to pass mutable reference
Dependencies: Step 3 (uses cleanup helpers)

Step 5: Extend `cleanup_storage` to handle vsock sockets and mount dirs

Files: crates/my_hypervisor-lib/src/vm/manager.rs

Add vsock.sock removal to stop path (api.sock already removed)
Scan vm_dir for mount-* subdirectories and kill virtiofsd pid files found there
Dependencies: none

Acceptance Criteria

A suite that exits non-zero before export_results() causes the top-level runner to report failure
The runner cannot print ALL TESTS PASSED if any suite crashed/exited early/failed to export
export_results is called automatically via EXIT trap even when a suite dies from set -e
A failed start leaves no stale API socket, vsock socket, virtiofsd process, or partially prepared storage
Restarting after a crash does not fail due to leftover artifacts (stale sockets removed at start)
stats on a dead VM returns VmNotRunning error instead of a raw socket/API error
stats reconciles persisted state when it detects a dead hypervisor process
All existing unit tests pass (cargo test --workspace)
cargo clippy -- -D warnings passes

Notes

The stats method signature change from &self to &mut self may require updating callers
setup_network failure currently warns and continues (non-fatal). This behavior is preserved unless the reviewer requests it to be fatal.
cleanup_failed_start must be idempotent: safe to call multiple times, safe if some artifacts were never created
Steps 1 & 2 (test runner fixes) are complementary — the trap ensures results are written, and the runner is defensive about missing results as a safety net

## Implementation Spec for Issue #44: Fix False-Green Validation and Failed-Start Cleanup Paths ### Objective Fix three reliability issues: (1) the test runner can report "ALL TESTS PASSED" when suites crash before exporting results, (2) a failed VM start leaves behind partial state/sockets/processes that poison retries, and (3) `stats` on a VM whose hypervisor died out-of-band produces a raw socket error instead of a clean "not running" message. ### Requirements - A suite that exits non-zero before calling `export_results()` must cause the top-level runner to fail. - The runner must never print `ALL TESTS PASSED` if any suite crashed, exited early, or failed to export results. - A failed `start` must leave no stale API socket, vsock socket, virtiofsd process, or partially-updated persisted state. - Restarting a VM after a crash must not fail due to leftover artifacts from the prior attempt. - `stats` on a VM whose hypervisor process has died must return a stable "not running" result (not a raw socket/API error). ### Files to Modify | File | Purpose | |------|---------| | `tests/run_tests.sh` | Fix false-green aggregation logic | | `tests/helpers.sh` | Make `export_results` robust against early exits | | `crates/my_hypervisor-lib/src/vm/manager.rs` | Transactional start cleanup; liveness-aware `stats` | | `crates/my_hypervisor-cli/src/commands/stats.rs` | Adjust for `&mut self` signature change on `stats()` | ### Implementation Plan #### Step 1: Fix false-green test runner (`tests/run_tests.sh`) Files: `tests/run_tests.sh` - Track `SUITE_CRASHED` counter alongside `TOTAL_FAIL` - When results file is missing/empty AND `suite_rc != 0`, increment `TOTAL_FAIL` and `SUITE_CRASHED` - When `suite_rc != 0` but results file shows 0 failures, trust `suite_rc` and increment fail count - In final verdict, fail if `TOTAL_FAIL > 0` OR `SUITE_CRASHED > 0` Dependencies: none #### Step 2: Make `export_results` fire on exit via trap (`tests/helpers.sh`) Files: `tests/helpers.sh` - Add `export_results` call inside existing `trap_cleanup` function - This ensures results are always written, even when `set -e` kills the suite early - Explicit `export_results` calls at end of suites become redundant but harmless (idempotent) Dependencies: none #### Step 3: Make VM start transactional with cleanup on failure (`manager.rs`) Files: `crates/my_hypervisor-lib/src/vm/manager.rs` - Create private `cleanup_failed_start()` helper that cleans up: storage, sockets (api.sock + vsock.sock), virtiofsd processes, network - On any failure after `prepare_storage`, call `cleanup_failed_start` before returning error - At start of `start()`, pre-clean stale sockets from prior crashed attempts - Make cleanup idempotent (ignore "not found" errors) Dependencies: none #### Step 4: Make `stats` liveness-aware (`manager.rs`) Files: `crates/my_hypervisor-lib/src/vm/manager.rs`, `crates/my_hypervisor-cli/src/commands/stats.rs` - Before checking API socket, verify hypervisor process is alive via `process_running()` - If process is dead: reconcile state to `Stopped`, clean up network/storage, return `VmNotRunning` - Change method signature from `&self` to `&mut self` if needed - Update CLI caller to pass mutable reference Dependencies: Step 3 (uses cleanup helpers) #### Step 5: Extend `cleanup_storage` to handle vsock sockets and mount dirs Files: `crates/my_hypervisor-lib/src/vm/manager.rs` - Add vsock.sock removal to stop path (api.sock already removed) - Scan vm_dir for `mount-*` subdirectories and kill virtiofsd pid files found there Dependencies: none ### Acceptance Criteria - [ ] A suite that exits non-zero before `export_results()` causes the top-level runner to report failure - [ ] The runner cannot print `ALL TESTS PASSED` if any suite crashed/exited early/failed to export - [ ] `export_results` is called automatically via EXIT trap even when a suite dies from `set -e` - [ ] A failed start leaves no stale API socket, vsock socket, virtiofsd process, or partially prepared storage - [ ] Restarting after a crash does not fail due to leftover artifacts (stale sockets removed at start) - [ ] `stats` on a dead VM returns `VmNotRunning` error instead of a raw socket/API error - [ ] `stats` reconciles persisted state when it detects a dead hypervisor process - [ ] All existing unit tests pass (`cargo test --workspace`) - [ ] `cargo clippy -- -D warnings` passes ### Notes - The `stats` method signature change from `&self` to `&mut self` may require updating callers - `setup_network` failure currently warns and continues (non-fatal). This behavior is preserved unless the reviewer requests it to be fatal. - `cleanup_failed_start` must be idempotent: safe to call multiple times, safe if some artifacts were never created - Steps 1 & 2 (test runner fixes) are complementary — the trap ensures results are written, and the runner is defensive about missing results as a safety net

rawan commented

2026-03-24 11:40:54 +00:00

Member

Test Results

Total: 186
Passed: 186
Failed: 0
Clippy: Clean (0 warnings)

All unit tests pass. Clippy with -D warnings is clean.

## Test Results - **Total**: 186 - **Passed**: 186 - **Failed**: 0 - **Clippy**: Clean (0 warnings) All unit tests pass. Clippy with `-D warnings` is clean.

rawan commented

2026-03-24 11:41:12 +00:00

Member

Implementation Summary

Changes Made

tests/run_tests.sh — Fixed false-green test aggregation:

Added SUITE_CRASHED counter to detect suites that exit before exporting results
Missing/empty results with non-zero exit code now counted as failures
Non-zero suite exit with 0 reported failures also counted as crash
Final verdict fails if any suite crashed or had failures

tests/helpers.sh — Made export_results fire automatically on exit:

Added export_results call inside trap_cleanup (runs on EXIT trap)
Results are now written even when set -e aborts a suite early
Idempotent: explicit calls at end of suites remain harmless

crates/my_hypervisor-lib/src/vm/manager.rs — Transactional start + liveness-aware stats:

Pre-clean stale sockets (api.sock, vsock.sock) at start of start() so retries after crashes work
New cleanup_failed_start() helper: tears down storage, sockets, network on any failure after prepare_storage
prepare_volume_mounts and hypervisor launch failures now trigger full cleanup before returning
stop() now also removes vsock.sock (previously only removed api.sock)
stats() now checks process liveness before querying API socket — if hypervisor died out-of-band, reconciles state to Stopped and returns VmNotRunning

crates/my_hypervisor-cli/src/commands/stats.rs + main.rs — Updated stats caller for &mut self signature change

## Implementation Summary ### Changes Made **tests/run_tests.sh** — Fixed false-green test aggregation: - Added `SUITE_CRASHED` counter to detect suites that exit before exporting results - Missing/empty results with non-zero exit code now counted as failures - Non-zero suite exit with 0 reported failures also counted as crash - Final verdict fails if any suite crashed or had failures **tests/helpers.sh** — Made `export_results` fire automatically on exit: - Added `export_results` call inside `trap_cleanup` (runs on EXIT trap) - Results are now written even when `set -e` aborts a suite early - Idempotent: explicit calls at end of suites remain harmless **crates/my_hypervisor-lib/src/vm/manager.rs** — Transactional start + liveness-aware stats: - Pre-clean stale sockets (`api.sock`, `vsock.sock`) at start of `start()` so retries after crashes work - New `cleanup_failed_start()` helper: tears down storage, sockets, network on any failure after `prepare_storage` - `prepare_volume_mounts` and hypervisor launch failures now trigger full cleanup before returning - `stop()` now also removes `vsock.sock` (previously only removed `api.sock`) - `stats()` now checks process liveness before querying API socket — if hypervisor died out-of-band, reconciles state to Stopped and returns VmNotRunning **crates/my_hypervisor-cli/src/commands/stats.rs** + **main.rs** — Updated `stats` caller for `&mut self` signature change

rawan referenced this issue

2026-03-24 14:34:20 +00:00

Fix false-green validation and failed-start cleanup paths #67

rawdaGastan added this to the ACTIVE project

2026-03-24 15:17:46 +00:00

rawan closed this issue

2026-03-26 12:48:27 +00:00

No labels

No milestone

No project

No assignees

2 participants

Notifications

Due date

The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference

geomind_code/my_hypervisor#44

No description provided.

Rows
Columns

Fix false-green validation and failed-start cleanup paths #44

Problem

Current behavior

1. False-green test summary

2. Failed starts leave partial state behind

3. stats uses persisted state without first reconciling liveness

Why this matters

Suggested implementation

Test runner

Start path

Runtime inspection

Acceptance criteria

Affected areas

Implementation Spec for Issue #44: Fix False-Green Validation and Failed-Start Cleanup Paths

Objective

Requirements

Files to Modify

Implementation Plan

Step 1: Fix false-green test runner (tests/run_tests.sh)

Step 2: Make export_results fire on exit via trap (tests/helpers.sh)

Step 3: Make VM start transactional with cleanup on failure (manager.rs)

Step 4: Make stats liveness-aware (manager.rs)

Step 5: Extend cleanup_storage to handle vsock sockets and mount dirs

Acceptance Criteria

Notes

Test Results

Implementation Summary

Changes Made

3. `stats` uses persisted state without first reconciling liveness

Step 1: Fix false-green test runner (`tests/run_tests.sh`)

Step 2: Make `export_results` fire on exit via trap (`tests/helpers.sh`)

Step 3: Make VM start transactional with cleanup on failure (`manager.rs`)

Step 4: Make `stats` liveness-aware (`manager.rs`)

Step 5: Extend `cleanup_storage` to handle vsock sockets and mount dirs