Fix false-green validation and failed-start cleanup paths #44

Closed
opened 2026-03-19 19:45:41 +00:00 by thabeta · 3 comments
Owner

Problem

The current validation and runtime lifecycle paths can report success even when a suite or VM has already failed underneath them.

This is really three related problems in the same flow:

  1. The top-level shell runner can produce a false green summary.
  2. VM start is not treated as a transactional operation.
  3. Runtime inspection can talk to stale sockets instead of first reconciling whether the process is still alive.

Current behavior

1. False-green test summary

In tests/run_tests.sh, each suite is run with:

bash "$suite" || suite_rc=$?

but the final result only reads the suite result file if that file exists and is non-empty, and the final exit code is based only on the accumulated TOTAL_FAIL count.

That means a suite can exit non-zero before it writes its summary file, disappear from the totals entirely, and the runner can still print ALL TESTS PASSED.

The helper flow makes this easy to hit:

  • test_end() returns non-zero when any test failed
  • suites commonly run with set -e
  • export_results() only writes the summary file if the suite reaches it

So a suite can fail, exit early, never export its counts, and be silently omitted from the final summary.

2. Failed starts leave partial state behind

In the VM start path, storage is prepared first, then the rootfs source is rewritten to point at the prepared storage, then networking and volume mounts are set up, and only after that is the hypervisor launched.

If anything in the later part of that sequence fails, the operation is not treated as a rollbackable transaction. Partial artifacts can remain behind, for example:

  • prepared storage state
  • runtime directories
  • helper processes
  • stale sockets
  • partially updated persisted state

That makes retries and crash recovery harder than they need to be.

3. stats uses persisted state without first reconciling liveness

The stats path checks whether the stored VM status says Running, then goes directly to the API socket. It does not first refresh process liveness.

If the hypervisor died out of band, users can get a low-level API/socket failure instead of a stable "not running" result.

Why this matters

  • The test harness becomes untrustworthy exactly when it is needed most: during regressions.
  • A failed or crashed start can poison the next run with leftovers from the previous attempt.
  • Operational commands become noisy and confusing because they surface stale runtime artifacts rather than the real VM state.

Suggested implementation

Test runner

  1. Make every suite produce an explicit final status record even if helpers return non-zero.
  2. Treat a missing or empty results file as a suite failure, not as "no data".
  3. Include suite_rc in the final aggregation logic.
  4. Fail the overall run if any suite crashed before export.

Start path

  1. Treat start as a transactional sequence.
  2. Track which artifacts were created during the current attempt.
  3. On any failure after state creation, clean up storage, sockets, helper processes, and any temporary runtime metadata created during that attempt.
  4. Make cleanup idempotent so it is safe to run during both failed starts and crash recovery.

Runtime inspection

  1. Refresh liveness before calling the hypervisor API.
  2. If the process is already dead, update state and return a stable not-running error.
  3. Remove stale API/vsock sockets during crash cleanup so restarts do not collide with old runtime artifacts.

Acceptance criteria

  • A suite that exits non-zero before export_results() causes the top-level runner to fail.
  • The top-level runner cannot print ALL TESTS PASSED if any suite crashed, exited early, or failed to export its results.
  • A failed start leaves no stale runtime socket, helper daemon, or partially prepared VM state behind.
  • Restarting after a crash does not fail because of leftover API/vsock artifacts.
  • stats on a dead VM returns a stable not-running result instead of a raw socket/API error.

Affected areas

  • tests/run_tests.sh
  • tests/helpers.sh
  • VM manager start / crash-cleanup / stats paths
## Problem The current validation and runtime lifecycle paths can report success even when a suite or VM has already failed underneath them. This is really three related problems in the same flow: 1. The top-level shell runner can produce a false green summary. 2. VM start is not treated as a transactional operation. 3. Runtime inspection can talk to stale sockets instead of first reconciling whether the process is still alive. ## Current behavior ### 1. False-green test summary In `tests/run_tests.sh`, each suite is run with: ```sh bash "$suite" || suite_rc=$? ``` but the final result only reads the suite result file if that file exists and is non-empty, and the final exit code is based only on the accumulated `TOTAL_FAIL` count. That means a suite can exit non-zero before it writes its summary file, disappear from the totals entirely, and the runner can still print `ALL TESTS PASSED`. The helper flow makes this easy to hit: - `test_end()` returns non-zero when any test failed - suites commonly run with `set -e` - `export_results()` only writes the summary file if the suite reaches it So a suite can fail, exit early, never export its counts, and be silently omitted from the final summary. ### 2. Failed starts leave partial state behind In the VM start path, storage is prepared first, then the rootfs source is rewritten to point at the prepared storage, then networking and volume mounts are set up, and only after that is the hypervisor launched. If anything in the later part of that sequence fails, the operation is not treated as a rollbackable transaction. Partial artifacts can remain behind, for example: - prepared storage state - runtime directories - helper processes - stale sockets - partially updated persisted state That makes retries and crash recovery harder than they need to be. ### 3. `stats` uses persisted state without first reconciling liveness The `stats` path checks whether the stored VM status says `Running`, then goes directly to the API socket. It does not first refresh process liveness. If the hypervisor died out of band, users can get a low-level API/socket failure instead of a stable "not running" result. ## Why this matters - The test harness becomes untrustworthy exactly when it is needed most: during regressions. - A failed or crashed start can poison the next run with leftovers from the previous attempt. - Operational commands become noisy and confusing because they surface stale runtime artifacts rather than the real VM state. ## Suggested implementation ### Test runner 1. Make every suite produce an explicit final status record even if helpers return non-zero. 2. Treat a missing or empty results file as a suite failure, not as "no data". 3. Include `suite_rc` in the final aggregation logic. 4. Fail the overall run if any suite crashed before export. ### Start path 1. Treat start as a transactional sequence. 2. Track which artifacts were created during the current attempt. 3. On any failure after state creation, clean up storage, sockets, helper processes, and any temporary runtime metadata created during that attempt. 4. Make cleanup idempotent so it is safe to run during both failed starts and crash recovery. ### Runtime inspection 1. Refresh liveness before calling the hypervisor API. 2. If the process is already dead, update state and return a stable not-running error. 3. Remove stale API/vsock sockets during crash cleanup so restarts do not collide with old runtime artifacts. ## Acceptance criteria - A suite that exits non-zero before `export_results()` causes the top-level runner to fail. - The top-level runner cannot print `ALL TESTS PASSED` if any suite crashed, exited early, or failed to export its results. - A failed start leaves no stale runtime socket, helper daemon, or partially prepared VM state behind. - Restarting after a crash does not fail because of leftover API/vsock artifacts. - `stats` on a dead VM returns a stable not-running result instead of a raw socket/API error. ## Affected areas - `tests/run_tests.sh` - `tests/helpers.sh` - VM manager start / crash-cleanup / stats paths
rawan self-assigned this 2026-03-24 09:07:11 +00:00
Member

Implementation Spec for Issue #44: Fix False-Green Validation and Failed-Start Cleanup Paths

Objective

Fix three reliability issues: (1) the test runner can report "ALL TESTS PASSED" when suites crash before exporting results, (2) a failed VM start leaves behind partial state/sockets/processes that poison retries, and (3) stats on a VM whose hypervisor died out-of-band produces a raw socket error instead of a clean "not running" message.

Requirements

  • A suite that exits non-zero before calling export_results() must cause the top-level runner to fail.
  • The runner must never print ALL TESTS PASSED if any suite crashed, exited early, or failed to export results.
  • A failed start must leave no stale API socket, vsock socket, virtiofsd process, or partially-updated persisted state.
  • Restarting a VM after a crash must not fail due to leftover artifacts from the prior attempt.
  • stats on a VM whose hypervisor process has died must return a stable "not running" result (not a raw socket/API error).

Files to Modify

File Purpose
tests/run_tests.sh Fix false-green aggregation logic
tests/helpers.sh Make export_results robust against early exits
crates/my_hypervisor-lib/src/vm/manager.rs Transactional start cleanup; liveness-aware stats
crates/my_hypervisor-cli/src/commands/stats.rs Adjust for &mut self signature change on stats()

Implementation Plan

Step 1: Fix false-green test runner (tests/run_tests.sh)

Files: tests/run_tests.sh

  • Track SUITE_CRASHED counter alongside TOTAL_FAIL
  • When results file is missing/empty AND suite_rc != 0, increment TOTAL_FAIL and SUITE_CRASHED
  • When suite_rc != 0 but results file shows 0 failures, trust suite_rc and increment fail count
  • In final verdict, fail if TOTAL_FAIL > 0 OR SUITE_CRASHED > 0
    Dependencies: none

Step 2: Make export_results fire on exit via trap (tests/helpers.sh)

Files: tests/helpers.sh

  • Add export_results call inside existing trap_cleanup function
  • This ensures results are always written, even when set -e kills the suite early
  • Explicit export_results calls at end of suites become redundant but harmless (idempotent)
    Dependencies: none

Step 3: Make VM start transactional with cleanup on failure (manager.rs)

Files: crates/my_hypervisor-lib/src/vm/manager.rs

  • Create private cleanup_failed_start() helper that cleans up: storage, sockets (api.sock + vsock.sock), virtiofsd processes, network
  • On any failure after prepare_storage, call cleanup_failed_start before returning error
  • At start of start(), pre-clean stale sockets from prior crashed attempts
  • Make cleanup idempotent (ignore "not found" errors)
    Dependencies: none

Step 4: Make stats liveness-aware (manager.rs)

Files: crates/my_hypervisor-lib/src/vm/manager.rs, crates/my_hypervisor-cli/src/commands/stats.rs

  • Before checking API socket, verify hypervisor process is alive via process_running()
  • If process is dead: reconcile state to Stopped, clean up network/storage, return VmNotRunning
  • Change method signature from &self to &mut self if needed
  • Update CLI caller to pass mutable reference
    Dependencies: Step 3 (uses cleanup helpers)

Step 5: Extend cleanup_storage to handle vsock sockets and mount dirs

Files: crates/my_hypervisor-lib/src/vm/manager.rs

  • Add vsock.sock removal to stop path (api.sock already removed)
  • Scan vm_dir for mount-* subdirectories and kill virtiofsd pid files found there
    Dependencies: none

Acceptance Criteria

  • A suite that exits non-zero before export_results() causes the top-level runner to report failure
  • The runner cannot print ALL TESTS PASSED if any suite crashed/exited early/failed to export
  • export_results is called automatically via EXIT trap even when a suite dies from set -e
  • A failed start leaves no stale API socket, vsock socket, virtiofsd process, or partially prepared storage
  • Restarting after a crash does not fail due to leftover artifacts (stale sockets removed at start)
  • stats on a dead VM returns VmNotRunning error instead of a raw socket/API error
  • stats reconciles persisted state when it detects a dead hypervisor process
  • All existing unit tests pass (cargo test --workspace)
  • cargo clippy -- -D warnings passes

Notes

  • The stats method signature change from &self to &mut self may require updating callers
  • setup_network failure currently warns and continues (non-fatal). This behavior is preserved unless the reviewer requests it to be fatal.
  • cleanup_failed_start must be idempotent: safe to call multiple times, safe if some artifacts were never created
  • Steps 1 & 2 (test runner fixes) are complementary — the trap ensures results are written, and the runner is defensive about missing results as a safety net
## Implementation Spec for Issue #44: Fix False-Green Validation and Failed-Start Cleanup Paths ### Objective Fix three reliability issues: (1) the test runner can report "ALL TESTS PASSED" when suites crash before exporting results, (2) a failed VM start leaves behind partial state/sockets/processes that poison retries, and (3) `stats` on a VM whose hypervisor died out-of-band produces a raw socket error instead of a clean "not running" message. ### Requirements - A suite that exits non-zero before calling `export_results()` must cause the top-level runner to fail. - The runner must never print `ALL TESTS PASSED` if any suite crashed, exited early, or failed to export results. - A failed `start` must leave no stale API socket, vsock socket, virtiofsd process, or partially-updated persisted state. - Restarting a VM after a crash must not fail due to leftover artifacts from the prior attempt. - `stats` on a VM whose hypervisor process has died must return a stable "not running" result (not a raw socket/API error). ### Files to Modify | File | Purpose | |------|---------| | `tests/run_tests.sh` | Fix false-green aggregation logic | | `tests/helpers.sh` | Make `export_results` robust against early exits | | `crates/my_hypervisor-lib/src/vm/manager.rs` | Transactional start cleanup; liveness-aware `stats` | | `crates/my_hypervisor-cli/src/commands/stats.rs` | Adjust for `&mut self` signature change on `stats()` | ### Implementation Plan #### Step 1: Fix false-green test runner (`tests/run_tests.sh`) Files: `tests/run_tests.sh` - Track `SUITE_CRASHED` counter alongside `TOTAL_FAIL` - When results file is missing/empty AND `suite_rc != 0`, increment `TOTAL_FAIL` and `SUITE_CRASHED` - When `suite_rc != 0` but results file shows 0 failures, trust `suite_rc` and increment fail count - In final verdict, fail if `TOTAL_FAIL > 0` OR `SUITE_CRASHED > 0` Dependencies: none #### Step 2: Make `export_results` fire on exit via trap (`tests/helpers.sh`) Files: `tests/helpers.sh` - Add `export_results` call inside existing `trap_cleanup` function - This ensures results are always written, even when `set -e` kills the suite early - Explicit `export_results` calls at end of suites become redundant but harmless (idempotent) Dependencies: none #### Step 3: Make VM start transactional with cleanup on failure (`manager.rs`) Files: `crates/my_hypervisor-lib/src/vm/manager.rs` - Create private `cleanup_failed_start()` helper that cleans up: storage, sockets (api.sock + vsock.sock), virtiofsd processes, network - On any failure after `prepare_storage`, call `cleanup_failed_start` before returning error - At start of `start()`, pre-clean stale sockets from prior crashed attempts - Make cleanup idempotent (ignore "not found" errors) Dependencies: none #### Step 4: Make `stats` liveness-aware (`manager.rs`) Files: `crates/my_hypervisor-lib/src/vm/manager.rs`, `crates/my_hypervisor-cli/src/commands/stats.rs` - Before checking API socket, verify hypervisor process is alive via `process_running()` - If process is dead: reconcile state to `Stopped`, clean up network/storage, return `VmNotRunning` - Change method signature from `&self` to `&mut self` if needed - Update CLI caller to pass mutable reference Dependencies: Step 3 (uses cleanup helpers) #### Step 5: Extend `cleanup_storage` to handle vsock sockets and mount dirs Files: `crates/my_hypervisor-lib/src/vm/manager.rs` - Add vsock.sock removal to stop path (api.sock already removed) - Scan vm_dir for `mount-*` subdirectories and kill virtiofsd pid files found there Dependencies: none ### Acceptance Criteria - [ ] A suite that exits non-zero before `export_results()` causes the top-level runner to report failure - [ ] The runner cannot print `ALL TESTS PASSED` if any suite crashed/exited early/failed to export - [ ] `export_results` is called automatically via EXIT trap even when a suite dies from `set -e` - [ ] A failed start leaves no stale API socket, vsock socket, virtiofsd process, or partially prepared storage - [ ] Restarting after a crash does not fail due to leftover artifacts (stale sockets removed at start) - [ ] `stats` on a dead VM returns `VmNotRunning` error instead of a raw socket/API error - [ ] `stats` reconciles persisted state when it detects a dead hypervisor process - [ ] All existing unit tests pass (`cargo test --workspace`) - [ ] `cargo clippy -- -D warnings` passes ### Notes - The `stats` method signature change from `&self` to `&mut self` may require updating callers - `setup_network` failure currently warns and continues (non-fatal). This behavior is preserved unless the reviewer requests it to be fatal. - `cleanup_failed_start` must be idempotent: safe to call multiple times, safe if some artifacts were never created - Steps 1 & 2 (test runner fixes) are complementary — the trap ensures results are written, and the runner is defensive about missing results as a safety net
Member

Test Results

  • Total: 186
  • Passed: 186
  • Failed: 0
  • Clippy: Clean (0 warnings)

All unit tests pass. Clippy with -D warnings is clean.

## Test Results - **Total**: 186 - **Passed**: 186 - **Failed**: 0 - **Clippy**: Clean (0 warnings) All unit tests pass. Clippy with `-D warnings` is clean.
Member

Implementation Summary

Changes Made

tests/run_tests.sh — Fixed false-green test aggregation:

  • Added SUITE_CRASHED counter to detect suites that exit before exporting results
  • Missing/empty results with non-zero exit code now counted as failures
  • Non-zero suite exit with 0 reported failures also counted as crash
  • Final verdict fails if any suite crashed or had failures

tests/helpers.sh — Made export_results fire automatically on exit:

  • Added export_results call inside trap_cleanup (runs on EXIT trap)
  • Results are now written even when set -e aborts a suite early
  • Idempotent: explicit calls at end of suites remain harmless

crates/my_hypervisor-lib/src/vm/manager.rs — Transactional start + liveness-aware stats:

  • Pre-clean stale sockets (api.sock, vsock.sock) at start of start() so retries after crashes work
  • New cleanup_failed_start() helper: tears down storage, sockets, network on any failure after prepare_storage
  • prepare_volume_mounts and hypervisor launch failures now trigger full cleanup before returning
  • stop() now also removes vsock.sock (previously only removed api.sock)
  • stats() now checks process liveness before querying API socket — if hypervisor died out-of-band, reconciles state to Stopped and returns VmNotRunning

crates/my_hypervisor-cli/src/commands/stats.rs + main.rs — Updated stats caller for &mut self signature change

## Implementation Summary ### Changes Made **tests/run_tests.sh** — Fixed false-green test aggregation: - Added `SUITE_CRASHED` counter to detect suites that exit before exporting results - Missing/empty results with non-zero exit code now counted as failures - Non-zero suite exit with 0 reported failures also counted as crash - Final verdict fails if any suite crashed or had failures **tests/helpers.sh** — Made `export_results` fire automatically on exit: - Added `export_results` call inside `trap_cleanup` (runs on EXIT trap) - Results are now written even when `set -e` aborts a suite early - Idempotent: explicit calls at end of suites remain harmless **crates/my_hypervisor-lib/src/vm/manager.rs** — Transactional start + liveness-aware stats: - Pre-clean stale sockets (`api.sock`, `vsock.sock`) at start of `start()` so retries after crashes work - New `cleanup_failed_start()` helper: tears down storage, sockets, network on any failure after `prepare_storage` - `prepare_volume_mounts` and hypervisor launch failures now trigger full cleanup before returning - `stop()` now also removes `vsock.sock` (previously only removed `api.sock`) - `stats()` now checks process liveness before querying API socket — if hypervisor died out-of-band, reconciles state to Stopped and returns VmNotRunning **crates/my_hypervisor-cli/src/commands/stats.rs** + **main.rs** — Updated `stats` caller for `&mut self` signature change
rawdaGastan added this to the ACTIVE project 2026-03-24 15:17:46 +00:00
rawan closed this issue 2026-03-26 12:48:27 +00:00
Sign in to join this conversation.
No labels
No milestone
No project
No assignees
2 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
geomind_code/my_hypervisor#44
No description provided.