Fix false-green validation and failed-start cleanup paths #44
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Problem
The current validation and runtime lifecycle paths can report success even when a suite or VM has already failed underneath them.
This is really three related problems in the same flow:
Current behavior
1. False-green test summary
In
tests/run_tests.sh, each suite is run with:but the final result only reads the suite result file if that file exists and is non-empty, and the final exit code is based only on the accumulated
TOTAL_FAILcount.That means a suite can exit non-zero before it writes its summary file, disappear from the totals entirely, and the runner can still print
ALL TESTS PASSED.The helper flow makes this easy to hit:
test_end()returns non-zero when any test failedset -eexport_results()only writes the summary file if the suite reaches itSo a suite can fail, exit early, never export its counts, and be silently omitted from the final summary.
2. Failed starts leave partial state behind
In the VM start path, storage is prepared first, then the rootfs source is rewritten to point at the prepared storage, then networking and volume mounts are set up, and only after that is the hypervisor launched.
If anything in the later part of that sequence fails, the operation is not treated as a rollbackable transaction. Partial artifacts can remain behind, for example:
That makes retries and crash recovery harder than they need to be.
3.
statsuses persisted state without first reconciling livenessThe
statspath checks whether the stored VM status saysRunning, then goes directly to the API socket. It does not first refresh process liveness.If the hypervisor died out of band, users can get a low-level API/socket failure instead of a stable "not running" result.
Why this matters
Suggested implementation
Test runner
suite_rcin the final aggregation logic.Start path
Runtime inspection
Acceptance criteria
export_results()causes the top-level runner to fail.ALL TESTS PASSEDif any suite crashed, exited early, or failed to export its results.statson a dead VM returns a stable not-running result instead of a raw socket/API error.Affected areas
tests/run_tests.shtests/helpers.shImplementation Spec for Issue #44: Fix False-Green Validation and Failed-Start Cleanup Paths
Objective
Fix three reliability issues: (1) the test runner can report "ALL TESTS PASSED" when suites crash before exporting results, (2) a failed VM start leaves behind partial state/sockets/processes that poison retries, and (3)
statson a VM whose hypervisor died out-of-band produces a raw socket error instead of a clean "not running" message.Requirements
export_results()must cause the top-level runner to fail.ALL TESTS PASSEDif any suite crashed, exited early, or failed to export results.startmust leave no stale API socket, vsock socket, virtiofsd process, or partially-updated persisted state.statson a VM whose hypervisor process has died must return a stable "not running" result (not a raw socket/API error).Files to Modify
tests/run_tests.shtests/helpers.shexport_resultsrobust against early exitscrates/my_hypervisor-lib/src/vm/manager.rsstatscrates/my_hypervisor-cli/src/commands/stats.rs&mut selfsignature change onstats()Implementation Plan
Step 1: Fix false-green test runner (
tests/run_tests.sh)Files:
tests/run_tests.shSUITE_CRASHEDcounter alongsideTOTAL_FAILsuite_rc != 0, incrementTOTAL_FAILandSUITE_CRASHEDsuite_rc != 0but results file shows 0 failures, trustsuite_rcand increment fail countTOTAL_FAIL > 0ORSUITE_CRASHED > 0Dependencies: none
Step 2: Make
export_resultsfire on exit via trap (tests/helpers.sh)Files:
tests/helpers.shexport_resultscall inside existingtrap_cleanupfunctionset -ekills the suite earlyexport_resultscalls at end of suites become redundant but harmless (idempotent)Dependencies: none
Step 3: Make VM start transactional with cleanup on failure (
manager.rs)Files:
crates/my_hypervisor-lib/src/vm/manager.rscleanup_failed_start()helper that cleans up: storage, sockets (api.sock + vsock.sock), virtiofsd processes, networkprepare_storage, callcleanup_failed_startbefore returning errorstart(), pre-clean stale sockets from prior crashed attemptsDependencies: none
Step 4: Make
statsliveness-aware (manager.rs)Files:
crates/my_hypervisor-lib/src/vm/manager.rs,crates/my_hypervisor-cli/src/commands/stats.rsprocess_running()Stopped, clean up network/storage, returnVmNotRunning&selfto&mut selfif neededDependencies: Step 3 (uses cleanup helpers)
Step 5: Extend
cleanup_storageto handle vsock sockets and mount dirsFiles:
crates/my_hypervisor-lib/src/vm/manager.rsmount-*subdirectories and kill virtiofsd pid files found thereDependencies: none
Acceptance Criteria
export_results()causes the top-level runner to report failureALL TESTS PASSEDif any suite crashed/exited early/failed to exportexport_resultsis called automatically via EXIT trap even when a suite dies fromset -estatson a dead VM returnsVmNotRunningerror instead of a raw socket/API errorstatsreconciles persisted state when it detects a dead hypervisor processcargo test --workspace)cargo clippy -- -D warningspassesNotes
statsmethod signature change from&selfto&mut selfmay require updating callerssetup_networkfailure currently warns and continues (non-fatal). This behavior is preserved unless the reviewer requests it to be fatal.cleanup_failed_startmust be idempotent: safe to call multiple times, safe if some artifacts were never createdTest Results
All unit tests pass. Clippy with
-D warningsis clean.Implementation Summary
Changes Made
tests/run_tests.sh — Fixed false-green test aggregation:
SUITE_CRASHEDcounter to detect suites that exit before exporting resultstests/helpers.sh — Made
export_resultsfire automatically on exit:export_resultscall insidetrap_cleanup(runs on EXIT trap)set -eaborts a suite earlycrates/my_hypervisor-lib/src/vm/manager.rs — Transactional start + liveness-aware stats:
api.sock,vsock.sock) at start ofstart()so retries after crashes workcleanup_failed_start()helper: tears down storage, sockets, network on any failure afterprepare_storageprepare_volume_mountsand hypervisor launch failures now trigger full cleanup before returningstop()now also removesvsock.sock(previously only removedapi.sock)stats()now checks process liveness before querying API socket — if hypervisor died out-of-band, reconciles state to Stopped and returns VmNotRunningcrates/my_hypervisor-cli/src/commands/stats.rs + main.rs — Updated
statscaller for&mut selfsignature change