[Linux]: Deploy VM fails with 'Permission denied (os error 13)' — my_hypervisor VM state dir owned by root #98
Reference
lhumina_code/hero_compute#98
Summary
Deploying a VM through the hero_compute UI (or directly via ComputeService.deploy_vm RPC) fails instantly. Every deploy_vm job ends in phase=failed, exit_code=1 and the VM state ends up as error.
Reproduction
provisioning, job id assigned).
list_vms shows:
Out of all past deploy_vm jobs in my environment (ids 72, 73, 82, 83, 96–100), every single one has phase: failed, exit_code: 1; the feature has been broken end-to-end.
Root cause
The job built by crates/hero_compute_server/src/cloud/rpc.rs:1403-1407 is:
hero_proc runs this as the rawan user, but my_hypervisor immediately tries to take a file lock on every existing VM state directory under ~/.my_hypervisor/vms/<id>/state.lock, and those directories are owned by root:root:
strace confirms the call that fails:
Running the same my_hypervisor create … or even my_hypervisor list from the rawan shell reproduces the same error. my_hypervisor doctor still reports “All checks passed”, which masks the problem.
Impact
- error with an opaque message.
- Permission denied (os error 13) with no indication that the cause is a filesystem ownership mismatch in ~/.my_hypervisor/vms/.
- my_hypervisor invocation via sudo/doas leaves root-owned state dirs behind, and subsequent user-mode runs are broken forever.
Suggested fix (for discussion)
Several options, not mutually exclusive:
1. my_hypervisor list (or stat ~/.my_hypervisor/vms/*) and surface a specific error like "my_hypervisor VM state dir is not writable by the current user — check ownership of ~/.my_hypervisor/vms/" instead of relaying the raw os error 13.
2. my_hypervisor under the same user consistently. Document in the hero_compute install/setup scripts that my_hypervisor must always be invoked as the hero_compute service user, and extend scripts/configure.sh/setup to chown -R the ~/.my_hypervisor/ tree to that user.
3. doctor should also check that ~/.my_hypervisor/vms/**/state.lock are writable by the current user, and list should skip (not hard-fail on) directories it can’t lock.
Environment
- development, commit 13cac45
- target/x86_64-unknown-linux-musl/release/hero_compute_server
- /home/rawan/hero/var/sockets/hero_proc/rpc.sock
- /home/rawan/.cargo/bin/my_hypervisor
Implementation Specification: Pre-flight my_hypervisor Probe for deploy_vm
Issue: #98, "[Linux]: Deploy VM fails with 'Permission denied (os error 13)' — my_hypervisor VM state dir owned by root"
Repository: hero_compute
Scope: hero_compute-side fix only (suggestion #1 from the issue, plus trivial setup doc/script touch-ups from #2). Suggestion #3 (my_hypervisor doctor / list robustness) lives in a different repo and is out of scope.
Objective
When ComputeService.deploy_vm is called, run a short, bounded pre-flight probe of my_hypervisor list before enqueueing the hero_proc deploy job. If the probe fails with a permission-related error (typical symptom: root-owned state dirs under ~/.my_hypervisor/vms/ that the service user cannot lock), return a specific, user-friendly error to the RPC caller instead of letting the deploy job fail later with an opaque Permission denied (os error 13). For healthy systems the probe must be cheap enough not to noticeably slow deployments.
Requirements
- my_hypervisor list (or an equivalent quick-exit subcommand) with a bounded timeout (default 3 seconds).
- deploy_vm, so a failed probe leaves no stray VM(error) records or InUse slices behind.
- Permission denied, os error 13, or EACCES.
- ComputeServiceError::Internal(message) where message is a specific, actionable string that includes the ~/.my_hypervisor/vms/ path (resolved from $HOME at runtime when possible). Example message: my_hypervisor VM state dir is not writable by the current user — check ownership of /home/<user>/.my_hypervisor/vms/ (often caused by a prior sudo/doas invocation). Run: sudo chown -R <user>:<user> /home/<user>/.my_hypervisor
- vm_log_fail) remains the source of truth for those cases. This keeps the probe strictly additive for diagnosability.
- deploy_lock mutex is acquired after the probe).
- StubDriver (non-Linux / driver init failed). In that environment my_hypervisor is not on PATH and the existing stub error path is preserved.
- HERO_COMPUTE_SKIP_HYPERVISOR_PROBE (set to 1 to skip). Useful for CI and for users whose my_hypervisor binary lives outside $PATH.
- vm_log_fail messaging for jobs that fail after enqueue remains unchanged (so running VMs' error log format is stable).
- scripts/configure.sh (Step 6, install_my_hypervisor) chown -R "$SUDO_USER:$SUDO_USER" "$HOME/.my_hypervisor" when running under sudo, to prevent root-ownership drift across runs.
Files to Modify / Create
Modify
crates/hero_compute_server/src/cloud/rpc.rs
- hypervisor_probe submodule (inline or #[path]-referenced) exposing:
  - fn probe_my_hypervisor(timeout: Duration) -> ProbeResult
  - enum ProbeResult { Ok, PermissionDenied { path: String, raw: String }, OtherFailure(String), BinaryMissing, Skipped, Timeout }
  - fn classify_probe_output(exit_code: i32, stdout: &str, stderr: &str) -> ProbeResult (pure, testable).
  - fn hypervisor_state_dir() -> String (returns $HOME/.my_hypervisor/vms/ with fallback).
- deploy_vm (around line 1253, before let _lock = self.deploy_lock.lock()...), insert:
  - Ok if self.hypervisor is StubDriver (use a downcast_ref or add a marker method is_stub() on the HypervisorDriver trait).
  - Ok if HERO_COMPUTE_SKIP_HYPERVISOR_PROBE=1.
  - probe_my_hypervisor(Duration::from_secs(3)).
  - ProbeResult::PermissionDenied { path, .. }: tracing::error! + return ComputeServiceError::Internal(format!("my_hypervisor VM state dir is not writable by the current user — check ownership of {} (often caused by a prior sudo/doas invocation). Run: sudo chown -R $USER:$USER {}", path, parent_of(path))).
  - tracing::warn! and continue (deploy proceeds as today).
crates/hero_compute_server/src/cloud/constants.rs
- pub const HYPERVISOR_PROBE_TIMEOUT_SECS: u64 = 3;
- pub const HYPERVISOR_PROBE_SKIP_ENV: &str = "HERO_COMPUTE_SKIP_HYPERVISOR_PROBE";
crates/hero_compute_server/src/cloud/hypervisor.rs
- HypervisorDriver trait with a default fn is_stub(&self) -> bool { false }; override StubDriver::is_stub to return true. This lets deploy_vm skip the probe cleanly without adding conditional compilation.
crates/hero_compute_server/src/cloud/tests.rs
- #[test] fn test_classify_probe_output_permission_denied() that feeds synthetic (exit_code=1, stderr="Error: IO error: Permission denied (os error 13)") into classify_probe_output and asserts it returns ProbeResult::PermissionDenied { .. }. Add a second case for Ok (exit 0) and a third for OtherFailure (exit 1 with an unrelated stderr).
- Tests must go through the super::server::* re-export path used by the existing tests in that file.
scripts/configure.sh
- install_my_hypervisor (around line 312), after mkdir -p "$HOME/.cargo/bin" "$HOME/.my_hypervisor/bin" and again at the end of check_hypervisor_deps, add a chown_hypervisor_dirs helper that, when SUDO_USER is set and non-empty, runs chown -R "$SUDO_USER:$SUDO_USER" "/home/$SUDO_USER/.my_hypervisor" 2>/dev/null || true. Emits ok "Corrected ownership of ~/.my_hypervisor for $SUDO_USER".
Create
No new files. The probe module is kept inline inside rpc.rs (pattern matches the existing hypervisor_probe peers there: hardware.rs, hero_proc_jobs.rs, stats.rs are separate files, but the probe is small enough, about 80 lines, that inline keeps the diff tight and avoids a new #[path] include).
Optional (only if the reviewer prefers a dedicated file):
crates/hero_compute_server/src/cloud/hypervisor_probe.rs wired in via #[path = "hypervisor_probe.rs"] mod hypervisor_probe; at the top of rpc.rs next to the other mod declarations (line 14–25). In that case, tests.rs imports via use super::server::rpc::hypervisor_probe::*;.
Step-by-Step Implementation Plan
Step 1: Extend the HypervisorDriver trait with is_stub()
Files: crates/hero_compute_server/src/cloud/hypervisor.rs
Depends on: none
- fn is_stub(&self) -> bool { false } to the HypervisorDriver trait with a default impl.
- impl HypervisorDriver for StubDriver to return true.
- MyHypervisorDriver (the default is correct).
- cargo check -p hero_compute_server must pass.
Step 2: Add probe constants
Files: crates/hero_compute_server/src/cloud/constants.rs
Depends on: none
- HYPERVISOR_PROBE_TIMEOUT_SECS and HYPERVISOR_PROBE_SKIP_ENV constants documented with the same comment style as existing entries.
Step 3: Add the probe module (classification + subprocess)
Files: crates/hero_compute_server/src/cloud/rpc.rs
Depends on: Step 2
- mod hypervisor_probe { ... } block near the top of rpc.rs (after the existing thread_local! / helper definitions, before ComputeServiceHandler). It exposes:
  - pub enum ProbeResult { Ok, PermissionDenied { path: String, raw: String }, OtherFailure(String), BinaryMissing, Timeout, Skipped }
  - pub fn hypervisor_state_dir() -> String: reads $HOME, returns format!("{}/.my_hypervisor/vms/", home); falls back to "~/.my_hypervisor/vms/" if $HOME is unset.
  - pub fn classify_probe_output(exit_code: i32, stdout: &str, stderr: &str) -> ProbeResult: pure function. Returns Ok on exit_code 0; returns PermissionDenied when exit_code != 0 AND the concatenated stdout + stderr (case-insensitive) contains any of "permission denied", "os error 13", "eacces". Otherwise OtherFailure(stderr.trim().to_string()).
  - pub fn probe_my_hypervisor(timeout: Duration) -> ProbeResult: uses std::process::Command::new("my_hypervisor").arg("list"), matching the pattern at rpc.rs:947–972. Implements timeout via Command::spawn() + a watchdog thread that calls child.kill() if elapsed > timeout (simple, no extra deps; mirrors how bash -c 'timeout 5 …' is done in the health check). On ErrorKind::NotFound, return BinaryMissing. On timeout, return Timeout.
Step 4: Wire the probe into deploy_vm
Files: crates/hero_compute_server/src/cloud/rpc.rs
Depends on: Step 1, Step 3
- fn deploy_vm (line 1197), immediately after the input-validation block that ends at line 1252 and before let _lock = self.deploy_lock.lock()... at line 1256, insert the probe:
Step 5: Unit test for the classifier
Files: crates/hero_compute_server/src/cloud/tests.rs
Depends on: Step 3
- classify_probe_output directly (pure function, no I/O). Cover:
  - exit_code=0, "", "" → Ok
  - exit_code=1, "", "Error: IO error: Permission denied (os error 13)" → PermissionDenied
  - exit_code=1, "", "EACCES: ~/.my_hypervisor/vms/abc/state.lock" → PermissionDenied
  - exit_code=1, "", "No kernels found" → OtherFailure
- hypervisor_state_dir() that sets HOME via std::env::set_var (within a #[serial]-style single-thread block, or just assert the returned string ends with /.my_hypervisor/vms/).
- use super::server::*; pattern. The probe module must be pub(crate) visible through that path.
Step 6: Setup script hygiene (chown)
Files: scripts/configure.sh
Depends on: none
- install_my_hypervisor (line 312), add a trailing call to a new chown_hypervisor_dirs helper placed just below check_hypervisor_deps:
- install_my_hypervisor, after check_hypervisor_deps.
Step 7: Manual verification (scripted smoke)
Depends on: Steps 1-5
- cargo build --workspace from the repo root.
- sudo my_hypervisor list once to create a root-owned state dir, then rm -rf ~/.my_hypervisor/vms/* && sudo mkdir -p ~/.my_hypervisor/vms/junk && sudo chown root:root ~/.my_hypervisor/vms/junk && sudo touch ~/.my_hypervisor/vms/junk/state.lock && sudo chown root:root ~/.my_hypervisor/vms/junk/state.lock.
- ComputeService.deploy_vm via the same curl from the issue. Expect the JSON-RPC response to contain the new friendly error message and no new Vm record in error state.
- sudo chown -R $USER:$USER ~/.my_hypervisor) and rerun; expect the deploy to start normally.
Acceptance Criteria
- HypervisorDriver::is_stub() exists and returns true for StubDriver, false for MyHypervisorDriver.
- hypervisor_probe::classify_probe_output is a pure function that maps (exit_code, stdout, stderr) to a ProbeResult.
- classify_probe_output returns PermissionDenied for all of: "Permission denied (os error 13)", "Error: IO error: Permission denied (os error 13)", "EACCES"; returns Ok on exit_code 0; returns OtherFailure for other non-zero exits.
- probe_my_hypervisor respects HYPERVISOR_PROBE_TIMEOUT_SECS and kills the child on timeout.
- deploy_vm calls the probe before acquiring deploy_lock, allocating slices, or creating a Vm record.
- ComputeServiceError::Internal whose message contains the literal substring my_hypervisor VM state dir is not writable and the resolved ~/.my_hypervisor/vms/ path.
- Vm rows, no InUse slices, no hero_proc jobs.
- HERO_COMPUTE_SKIP_HYPERVISOR_PROBE=1 short-circuits the probe and restores the pre-change deploy_vm behaviour exactly.
- StubDriver (e.g. on macOS dev machines), the probe is skipped.
- tracing::warn!.
- cargo test -p hero_compute_server --features cloud passes, including the three new classify_probe_output cases.
- vm_log_fail with identical wording to today (no behavioural change for deploys that succeed the probe but fail later).
- scripts/configure.sh runs chown -R $SUDO_USER:$SUDO_USER ~/.my_hypervisor when executed under sudo by a non-root user.
- my_hypervisor list subprocess call (~tens of milliseconds) per deploy_vm invocation.
Notes
- Why Internal and not a new variant? ComputeServiceError in crates/hero_compute_server/src/cloud/rpc_generated.rs is generated from the cloud oschema; adding a new variant there would require regenerating the schema and all clients. Internal(String) is already used for analogous "server-side precondition failed" conditions (e.g. "No node registered. Call node_register first." at rpc.rs:1303) and the UI surfaces its payload verbatim. If a future iteration wants a typed PreflightFailed variant, that is a separate schema-regen change outside this fix.
- Why my_hypervisor list rather than stat the directory? list exercises the exact same state.lock acquisition path that create / start use, so it catches precisely the failure mode from the issue. A stat on the directory would miss permission problems on per-VM state.lock files.
- Command::spawn() + a background thread::spawn(move || { sleep(timeout); let _ = child.kill(); }) pattern. No tokio::process needed because the trait method is synchronous. This matches the style already in use in rpc.rs:947–972.
- doctor check, list skipping unreadable dirs): tracked in the my_hypervisor repo per issue suggestion #3.
- ComputeServiceError or the generated cloud schema.
- start_vm / stop_vm / restart_vm / delete_vm: they share the same underlying hazard, but the issue specifically scopes the fix to deploy_vm. A follow-up issue should consider hoisting the probe into a shared preflight_hypervisor() helper reused across all four entry points.
- hero_compute_ui: the new error message will surface automatically because the UI already renders error.message on RPC failure.
- deploy_vm. It adds one subprocess exec per call on the healthy path and zero persistent state. The only observable behavioural change for a healthy system is +30-80ms latency on deploy_vm. All other RPCs are untouched.
Test Results
New tests for pre-flight probe (all passing)
- test_classify_probe_output_ok
- test_classify_probe_output_permission_denied_os_error_13
- test_classify_probe_output_permission_denied_eacces
- test_classify_probe_output_other_failure
- test_hypervisor_state_dir_ends_with_expected_suffix
Build
cargo build --workspace: pass (no errors)
Summary
All existing tests pass, 5 new unit tests cover the probe classifier and state-dir helper.
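For readers who want to see what those five tests exercise, here is a minimal sketch of the classifier and the state-dir helper, written from the behaviour described in the specification above. The type and function names follow the spec; the bodies are an assumption about how the rules could be coded and may differ from what actually landed in rpc.rs.

```rust
/// Outcome of the pre-flight probe (variant names taken from the spec).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ProbeResult {
    Ok,
    PermissionDenied { path: String, raw: String },
    OtherFailure(String),
    BinaryMissing,
    Timeout,
    Skipped,
}

/// Resolve the VM state directory from $HOME, with a literal "~" fallback
/// when $HOME is unset or empty.
pub fn hypervisor_state_dir() -> String {
    match std::env::var("HOME") {
        Ok(home) if !home.is_empty() => format!("{}/.my_hypervisor/vms/", home),
        _ => "~/.my_hypervisor/vms/".to_string(),
    }
}

/// Pure classification of a finished `my_hypervisor list` run.
/// Exit code 0 is always Ok; a non-zero exit is a permission failure only
/// if the combined output mentions one of the known markers.
pub fn classify_probe_output(exit_code: i32, stdout: &str, stderr: &str) -> ProbeResult {
    if exit_code == 0 {
        return ProbeResult::Ok;
    }
    // Lower-case once so the marker match is case-insensitive.
    let combined = format!("{}\n{}", stdout, stderr).to_lowercase();
    const MARKERS: [&str; 3] = ["permission denied", "os error 13", "eacces"];
    if MARKERS.iter().any(|m| combined.contains(m)) {
        ProbeResult::PermissionDenied {
            path: hypervisor_state_dir(),
            raw: stderr.trim().to_string(),
        }
    } else {
        ProbeResult::OtherFailure(stderr.trim().to_string())
    }
}
```

Keeping the classifier free of I/O is what lets the unit tests feed it synthetic exit codes and stderr strings without spawning my_hypervisor.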
Implementation summary
Added a pre-flight my_hypervisor probe to ComputeService.deploy_vm so the
"Permission denied (os error 13)" failure reported in this issue now surfaces
as a specific, actionable error message instead of an opaque failure after the
hero_proc job has already been enqueued.
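The decision this adds in front of deploy_vm can be sketched roughly as follows. This is an illustration under assumptions, not the code in rpc.rs: preflight_decision and parent_of are hypothetical helpers, the probe is passed in as a closure so the sketch stays free of subprocess calls, and the message wording is simplified (only the substring "my_hypervisor VM state dir is not writable" and the path are taken from the spec).

```rust
/// Probe outcome, with variant names as listed in the spec.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ProbeResult {
    Ok,
    PermissionDenied { path: String, raw: String },
    OtherFailure(String),
    BinaryMissing,
    Timeout,
    Skipped,
}

/// Hypothetical helper: "/home/u/.my_hypervisor/vms/" -> "/home/u/.my_hypervisor".
fn parent_of(path: &str) -> String {
    let trimmed = path.trim_end_matches('/');
    match trimmed.rfind('/') {
        Some(idx) if idx > 0 => trimmed[..idx].to_string(),
        _ => trimmed.to_string(),
    }
}

/// Hypothetical gate run before any lock, slice, or VM record is touched.
/// Err(message) rejects the deploy; Ok(result) lets it proceed as before.
pub fn preflight_decision(
    is_stub: bool,
    skip_env: Option<&str>,
    probe: impl FnOnce() -> ProbeResult,
) -> Result<ProbeResult, String> {
    // Stub driver or explicit opt-out: never spawn the probe.
    if is_stub || skip_env == Some("1") {
        return Ok(ProbeResult::Skipped);
    }
    match probe() {
        // Only a permission failure blocks the deploy.
        ProbeResult::PermissionDenied { path, .. } => Err(format!(
            "my_hypervisor VM state dir is not writable by the current user: \
             check ownership of {} (often caused by a prior sudo/doas invocation). \
             Run: sudo chown -R $USER:$USER {}",
            path,
            parent_of(&path)
        )),
        // Timeout, missing binary, and other failures are only logged by the caller.
        other => Ok(other),
    }
}
```

In the real handler the Err branch would map to ComputeServiceError::Internal and the non-permission outcomes to a tracing::warn!, as listed under Changes.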
Changes
crates/hero_compute_server/src/cloud/hypervisor.rs
- fn is_stub(&self) -> bool { false } to the HypervisorDriver trait with a default implementation.
- impl HypervisorDriver for StubDriver to return true so the new probe is skipped when the stub driver is in use (non-Linux / driver init failed).
crates/hero_compute_server/src/cloud/constants.rs
- HYPERVISOR_PROBE_TIMEOUT_SECS: u64 = 3
- HYPERVISOR_PROBE_SKIP_ENV: &str = "HERO_COMPUTE_SKIP_HYPERVISOR_PROBE"
crates/hero_compute_server/src/cloud/rpc.rs
- pub(crate) mod hypervisor_probe with:
  - ProbeResult { Ok, PermissionDenied { path, raw }, OtherFailure, BinaryMissing, Timeout, Skipped }
  - classify_probe_output(exit_code, stdout, stderr): pure, unit-tested. Returns PermissionDenied when the exit code is non-zero and the concatenated output contains any of "permission denied", "os error 13", or "eacces" (case-insensitive).
  - hypervisor_state_dir(): resolves $HOME/.my_hypervisor/vms/.
  - probe_my_hypervisor(timeout): spawns my_hypervisor list with a bounded wall-clock timeout and a watchdog thread that kills the child on timeout. Dependency-free.
- ComputeServiceHandler::deploy_vm before deploy_lock is acquired, slices are allocated, or a VM record is created. On PermissionDenied the caller gets a ComputeServiceError::Internal whose message names the offending path and shows the exact chown command to fix it. Non-permission failures (binary missing, timeout, other) only emit a tracing::warn! and let the deploy proceed; the existing post-enqueue failure path is untouched.
- HERO_COMPUTE_SKIP_HYPERVISOR_PROBE=1 for CI / opt-out.
crates/hero_compute_server/src/cloud/server/mod.rs
- #[cfg(test)] pub(crate) use rpc::hypervisor_probe; so unit tests can reach the probe module through the existing super::server::* path.
crates/hero_compute_server/src/cloud/tests.rs
- classify_probe_output (Ok / PermissionDenied via os error 13 / PermissionDenied via EACCES / OtherFailure) and hypervisor_state_dir.
scripts/configure.sh
- chown_hypervisor_dirs helper. When the script is invoked under sudo by a non-root user, it runs chown -R $SUDO_USER:$SUDO_USER $HOME/.my_hypervisor to prevent the root-ownership drift that triggers this bug in the first place. Called from both exit paths of install_my_hypervisor.
Test results
- cargo test -p hero_compute_server: 30 passed, 0 failed, 2 ignored (doctests).
- cargo build --workspace: pass, no warnings.
- hypervisor_probe tests pass:
  - test_classify_probe_output_ok
  - test_classify_probe_output_permission_denied_os_error_13
  - test_classify_probe_output_permission_denied_eacces
  - test_classify_probe_output_other_failure
  - test_hypervisor_state_dir_ends_with_expected_suffix
Notes / caveats
- my_hypervisor list subprocess call per deploy_vm invocation on the healthy path (~tens of ms). All other RPCs are untouched.
- deploy_vm. start_vm / stop_vm / restart_vm / delete_vm share the same underlying hazard; hoisting the probe into a shared preflight_hypervisor() helper is left as a follow-up.
- doctor state.lock check and list skipping unreadable dirs) live in the my_hypervisor repo and are out of scope here.
- vm_log_fail) is unchanged, so running VMs' error log format is stable.
Follow-up: end-to-end deploy fix
After running the chown from this fix, the original Permission denied (os error 13) is gone, but deploys then hit a second failure that blocks the same UX: my_hypervisor create / start / stop need root to set up loop devices and mount rootfs images. Setting file capabilities (cap_sys_admin,cap_net_admin+ep) on the binary did not unblock the loop-device setup on a stock Ubuntu host, so the PR wraps those invocations with doas -n instead.
Additional changes in this PR
crates/hero_compute_server/src/cloud/constants.rs
- HYPERVISOR_CMD_PREFIX_ENV = "HERO_COMPUTE_HYPERVISOR_CMD_PREFIX"
- HYPERVISOR_CMD_PREFIX_DEFAULT = "doas -n "
crates/hero_compute_server/src/cloud/rpc.rs
- hypervisor_cmd_prefix() reads the env var and returns either the override or the default. The literal strings "none" or "" disable the wrapper entirely (direct invocation).
- my_hypervisor invocations in the cloud service (deploy_vm, start_vm, stop_vm, restart_vm) now prepend the prefix. The pre-flight probe still runs my_hypervisor list directly since list does not need elevated privileges.
scripts/configure.sh
- install_hypervisor_doas_rule helper. When doas is installed and the script can write /etc/doas.conf, it appends a minimal rule: permit nopass <service_user> cmd <path/to/my_hypervisor>. Idempotent: it skips if the exact rule is already present. When doas is missing or the script is not running with write access, it prints guidance instead.
- install_my_hypervisor, alongside chown_hypervisor_dirs.
End-to-end verification
On the affected machine, after rebuilding, restarting the hero_compute
service, and issuing the same curl from the original report:
Deploy completes. VM is reachable (Cloud Hypervisor started, TAP networking
configured).
Opt-out
On hosts that grant my_hypervisor elevated privileges another way (e.g.
systemd AmbientCapabilities, or a dedicated service user that is
effectively root), set HERO_COMPUTE_HYPERVISOR_CMD_PREFIX=none (or the
empty string) in hero_compute's environment to invoke my_hypervisor
directly without a wrapper.
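As an illustration of the opt-out rule, the prefix lookup could be sketched like this. The two constants and the name hypervisor_cmd_prefix come from the change list above; resolve_cmd_prefix is a hypothetical pure helper added here for testability, and appending a trailing space to a custom override is an assumption rather than confirmed behaviour.

```rust
pub const HYPERVISOR_CMD_PREFIX_ENV: &str = "HERO_COMPUTE_HYPERVISOR_CMD_PREFIX";
pub const HYPERVISOR_CMD_PREFIX_DEFAULT: &str = "doas -n ";

/// Hypothetical pure core: map the raw env value (None = unset) to the
/// string that gets prepended to each my_hypervisor invocation.
pub fn resolve_cmd_prefix(raw: Option<&str>) -> String {
    match raw.map(str::trim) {
        // Unset: fall back to the doas wrapper.
        None => HYPERVISOR_CMD_PREFIX_DEFAULT.to_string(),
        // "none" or empty: invoke my_hypervisor directly.
        Some("") | Some("none") => String::new(),
        // Any other value is used as the wrapper (assumed: trailing space added).
        Some(custom) => format!("{} ", custom),
    }
}

/// Reads the override from the process environment.
pub fn hypervisor_cmd_prefix() -> String {
    resolve_cmd_prefix(std::env::var(HYPERVISOR_CMD_PREFIX_ENV).ok().as_deref())
}
```

Splitting the lookup from the decision keeps the rule testable without mutating process-wide environment variables.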
Test results
- cargo test -p hero_compute_server: 30 passed, 0 failed, 2 ignored (doctests).
- cargo build --workspace: pass.
Final fix: chown-back after doas-elevated calls
One more layer on top of the doas wrapper. Running my_hypervisor under
doas creates per-VM state directories (~/.my_hypervisor/vms/<id>/)
owned by root:root, which then breaks the post-deploy resolve_external_vm
step (runs as the service user and needs to read those directories): the VM
ends up running but with hypervisor_id=None. Subsequent start / stop / delete
on that VM fail with "VM has no hypervisor ID — it may not have been provisioned successfully. Delete and redeploy."
Change
crates/hero_compute_server/src/cloud/rpc.rs: new helper
wrap_with_chown_back(body) wraps a shell body so that, after the inner
commands finish, ownership of $HOME/.my_hypervisor is restored to the
calling user (chown -R "$(id -u):$(id -g)"). The wrapper preserves the
inner exit code. All four cloud-service invocations (deploy_vm, start_vm,
stop_vm, restart_vm) now go through it. When the privilege wrapper is
disabled (HERO_COMPUTE_HYPERVISOR_CMD_PREFIX=none), the helper
is a no-op and returns the command unchanged.
Verified end-to-end
After rebuilding and restarting the hero_compute service:
Ownership of ~/.my_hypervisor/vms/* stays at rawan:rawan after the
deploy: no drift, no follow-up chown needed. delete_vm on the running VM
returned true. Pre-flight probe continues to pass.
Tests
- cargo test -p hero_compute_server: 30 passed, 0 failed, 2 ignored.
- cargo build --workspace: pass.