Livekit service not starting successfully #153
Reference: lhumina_code/hero_skills#153
The `service_livekit` config files are not written, or are inconsistent with each other: `livekit.yaml` and `runtime.json` should have the same key.
Implementation Spec for Issue #153
Objective

Make `service_livekit start` reliably produce `livekit.yaml`, `backend.env`, and `runtime.json` whose `api_key`/`api_secret` values are mutually consistent, so the upstream `livekit-server` and the `lk-backend` JWT signer agree on credentials and the service starts successfully on first boot, on re-runs, and on upgrade from a previously broken state.

Root-cause summary
Two coupled defects produced the symptom "config files not written or inconsistent":
1. Upstream supervisor (already fixed in `hero_livekit` commit `64c5711` on 2026-04-28):
   - `livekitservice.install` only downloaded the binary; it did not write `livekit.yaml`/`backend.env`/`runtime.json`. If the user ran `start` before `configure`, the supervisor errored "configs missing — call configure first" instead of self-healing.
   - `livekitservice.configure` minted a fresh `api_secret` only when the in-memory secret was empty / placeholder, so an in-memory cfg loaded from a partial `runtime.json` (no `api_secret`) could write a YAML with one key while leaving stale state on disk. The result: `livekit.yaml` keyed `K1: S1` while `runtime.json` had `api_key=K2, api_secret=S2` (or `S2` empty). `livekit-server` accepted tokens signed with `S1`; `lk-backend` (which loads `LIVEKIT_API_SECRET` from `backend.env`, derived from `runtime.json`) signed with `S2`. Browsers got `401` and the service appeared "not started".

2. hero_skills bootstrap (`tools/modules/services/service_livekit.nu`) — the orchestration path that drives `install → configure → start` against the supervisor:
   - With the `hero_livekit` binary cached in `~/hero/bin`, `service_livekit start` (without `--update`) silently reuses the broken binary even after pulling new skills.
   - Nothing verifies that the `keys:` map in `livekit.yaml` actually matches `api_key`/`api_secret` in `runtime.json`. Stale, mismatched files left over from a prior broken run survive because `start`'s self-heal in the supervisor only triggers when `livekit.yaml` or `backend.env` is missing — not when they are inconsistent with `runtime.json`.
   - `svx_bootstrap_livekit` calls `livekitservice.configure` with all-empty params (`api_key: ""`, `api_secret: ""`, `domain: ""`, `redis_address: ""`) and trusts the supervisor to do the right thing. When the supervisor is the broken pre-fix version, this is what produces the inconsistency.

The fix in this repo is to (a) force a rebuild of `hero_livekit` at the fixed commit on `start`, and (b) add a defensive pre-flight that wipes mismatched config files so the (now-fixed) supervisor's self-heal path is forced to regenerate a consistent triple.

Requirements
- After `service_livekit start` finishes successfully, all three files in `~/hero/var/hero_livekit/` must be mutually consistent: the single key in `livekit.yaml`'s `keys:` map must equal `runtime.json.api_key`, its value must equal `runtime.json.api_secret`, and `backend.env`'s `LIVEKIT_API_KEY`/`LIVEKIT_API_SECRET` must equal the same pair.
- The `start` command must detect an inconsistent on-disk state from a prior broken run and force regeneration (rather than reusing stale files).
- The skill must run against a build of `hero_livekit` containing the supervisor-side fix (`64c5711`, "always make sure secrets are set"). On hosts with stale local binaries, `service_livekit start` must either rebuild or print an actionable error instructing the operator to pass `--update`.
- Handle `livekitservice.configure` failures: if configure fails or the post-write consistency check fails, the skill must print a clear error and return non-zero.
- Re-running `service_livekit start` against an already-consistent state must be a no-op (no secret rotation, since that breaks tokens already issued).
- `tools/modules/services/service_livekit.nu` — Add a pre-flight consistency check that inspects `livekit.yaml` + `runtime.json` + `backend.env` and deletes them if mismatched; tighten `svx_bootstrap_livekit` to fail loudly on configure errors and validate post-write consistency; bump the install path to force `--update` when the local binary predates the supervisor fix.
- `tools/modules/services/lib.nu` — (read-only inspection) confirm `svc_install` `--update` semantics and whether a "minimum-commit-required" guard belongs here or in `service_livekit.nu`. No change expected unless we want a reusable helper for "rebuild if binary older than Git ref".
- `claude/skills/hero_running/SKILL.md` — Add a row to the "Common failure modes" table describing the symptom ("livekit-server 401s / `service_collab` huddles never connect") and the one-line fix (`service_livekit start --reset --update`).

No new files are required.
Implementation Plan
Step 1: Add a config-consistency pre-flight helper in `service_livekit.nu`

Files: `tools/modules/services/service_livekit.nu`

- Add `svx_lk_configs_consistent [root: bool] -> bool` that:
  - Computes `cfg_dir = $"(svc_home $root)/var/hero_livekit"`.
  - Returns `true` (vacuously consistent) if none of `livekit.yaml`, `runtime.json`, `backend.env` exist (clean install path — the supervisor will populate them).
  - Returns `false` if some-but-not-all exist (partial state — must be wiped).
  - Parses `runtime.json` with `open` to get `api_key` (string) and `api_secret` (string).
  - Reads `livekit.yaml` as text and extracts the single `keys:` mapping line with a regex like `^\s+(\S+):\s*"([^"]+)"\s*$` immediately after the `keys:` line. (Avoid pulling in a full YAML parser — the file is generated by `render_livekit_yaml` and has a fixed shape.)
  - Reads `backend.env` as text and extracts the `LIVEKIT_API_KEY=…` and `LIVEKIT_API_SECRET=…` values.
  - Returns `true` iff all four pairs are equal: yaml-key == runtime.api_key == env.LIVEKIT_API_KEY, and yaml-value == runtime.api_secret == env.LIVEKIT_API_SECRET.
  - On any read/parse error, returns `false` so callers wipe and regenerate.
- Add `svx_lk_wipe_stale_configs [root: bool]` that removes the three files (`livekit.yaml`, `backend.env`, `runtime.json`) from `~/hero/var/hero_livekit/`. Leaves `data/` and `logs/` alone. Uses `^rm -f` and tolerates missing files.

Dependencies: none. This step adds new helpers; nothing else in the file changes yet.
Step 2: Wire the pre-flight into `svx_bootstrap_livekit`

Files: `tools/modules/services/service_livekit.nu`

- In `svx_bootstrap_livekit`, before the `livekitservice.install` RPC call (i.e. between the `rpc.sock` wait loop and the `print "→ livekitservice.install …"` line):
  - Call `svx_lk_configs_consistent $root`. If `false`, print `"WARNING stale/inconsistent livekit configs detected — wiping so they will be regenerated"`, then call `svx_lk_wipe_stale_configs $root`.
  - This guarantees the supervisor's `start` self-heal path (lines 615-626 of `crates/hero_livekit_server/src/livekit/rpc.rs`) is exercised, since now both `livekit.yaml` and `backend.env` are absent and `write_config_artifacts` will be called.
- In the `configure` failure handling (currently lines 261-266): wipe the configs (`svx_lk_wipe_stale_configs $root`) before returning `false`, so a retry starts from clean state instead of half-written files.
- Add a post-`configure` consistency assertion: after the `if not ($runtime | path exists)` guard succeeds (around line 270), call `svx_lk_configs_consistent $root` and abort with a clear error (`print "FAIL post-configure consistency check failed — refusing to start livekit-server with mismatched keys"; return false`) if it returns `false`. This catches a still-broken supervisor binary in the field.

Dependencies: Step 1.
Step 3: Force a rebuild path that picks up the supervisor fix

Files: `tools/modules/services/service_livekit.nu`

- In `export def start`, change the `install --root=$root --update=$update --reset=$reset` call (currently line 336) so that when `--reset` is not passed and the local `hero_livekit_server` binary already exists, the skill still pulls + rebuilds the supervisor source. Two acceptable approaches — pick (a) for minimum diff:
  - (a) If neither `--update` nor `--reset` was passed AND `~/hero/var/hero_livekit/runtime.json` does not exist OR the consistency pre-flight from Step 1 returned `false`, internally promote `$update = true` for the `install` call so `forge merge` runs and the supervisor binary is rebuilt from the dev branch (which contains commit `64c5711`).
  - (b) Probe the binary for the fix (e.g. run `hero_livekit_server --version` or shell out to `strings $bin | grep ensure_secret` as a heuristic) and force `--reset` if absent.
- Update the doc comment on `export def start` (the comment that begins on line 305): mention that on a detected inconsistent state the skill auto-rebuilds the supervisor.

Dependencies: Step 1 (uses the consistency helper).
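Approach (a) reduces to a small flag-promotion rule, sketched here in Python (function name and dict return shape are illustrative, not the Nushell signature):

```python
def effective_install_flags(update: bool, reset: bool,
                            runtime_json_exists: bool,
                            configs_consistent: bool) -> dict:
    """Promote --update when the on-disk state suggests the cached
    supervisor binary may predate fix 64c5711 (approach (a))."""
    if not update and not reset:
        if not runtime_json_exists or not configs_consistent:
            update = True   # forces forge merge + rebuild from the dev branch
    return {"update": update, "reset": reset}
```

An explicit `--reset` or `--update` from the operator is always respected as-is; promotion only fires on the default, flag-less invocation.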
Step 4: Surface the failure mode in the operator-facing skill

Files: `claude/skills/hero_running/SKILL.md`

Add a row to the "Common failure modes" table:
- Symptom: huddles silently fail / livekit-server logs "invalid signature" / 401
- Cause: stale `livekit.yaml` + `runtime.json` from a pre-fix supervisor (issue #153)
- Fix: `service_livekit start --reset --update`

Dependencies: none. Can run in parallel with Steps 1-3.
Step 5: Manual verification on a host with the bug

Files: none (operational check)

- Broken-state path:
  - Run `service_livekit stop`.
  - Edit `~/hero/var/hero_livekit/runtime.json` to set `"api_key": "wrongkey"` (mismatched against `livekit.yaml`).
  - Run `service_livekit start`. Expected: pre-flight prints "stale/inconsistent livekit configs detected — wiping…", the supervisor regenerates all three files, the post-configure consistency check passes, `livekit-server` comes up on `:7880`, and `proc service status hero_livekit` reports running.
- Clean-install path (empty `~/hero/var/hero_livekit/`):
  - Run `service_livekit start`. Expected: pre-flight is a no-op (vacuous), the supervisor's `install` writes a consistent triple, `configure` re-writes a consistent triple, `start` spawns `livekit-server` + `lk-backend`, status is running.

Dependencies: Steps 1-3 merged.
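The mismatch injection in the broken-state path can be simulated off-host; this Python sketch writes a consistent triple with the file names and shapes from this spec, corrupts `runtime.json` the same way the manual check does, and shows the core comparison the pre-flight must flip on (helper names are hypothetical):

```python
import json
import re
import tempfile
from pathlib import Path

d = Path(tempfile.mkdtemp())
(d / "livekit.yaml").write_text('keys:\n  K1: "S1"\n')
(d / "backend.env").write_text("LIVEKIT_API_KEY=K1\nLIVEKIT_API_SECRET=S1\n")
(d / "runtime.json").write_text(json.dumps({"api_key": "K1", "api_secret": "S1"}))

def yaml_key(p: Path) -> str:
    # single indented entry under keys: — same fixed shape the spec relies on
    return re.search(r'^\s+(\S+):', (p / "livekit.yaml").read_text(), re.M).group(1)

def runtime_key(p: Path) -> str:
    return json.loads((p / "runtime.json").read_text())["api_key"]

assert yaml_key(d) == runtime_key(d)      # consistent: start must be a no-op

# Step 5's injected fault: api_key no longer matches the YAML key
(d / "runtime.json").write_text(json.dumps({"api_key": "wrongkey", "api_secret": "S1"}))
assert yaml_key(d) != runtime_key(d)      # mismatch: pre-flight must wipe
```

This only exercises the detection condition; the regenerate-and-start half of the check still requires a real host.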
Acceptance Criteria
- `livekit.yaml` and `runtime.json` share consistent key/secret values: the single entry under `keys:` in the YAML equals `<runtime.api_key>: "<runtime.api_secret>"`.
- `backend.env`'s `LIVEKIT_API_KEY` and `LIVEKIT_API_SECRET` equal `runtime.api_key` and `runtime.api_secret`.
- All three files exist in `~/hero/var/hero_livekit/` after `service_livekit start` completes successfully.
- The `livekit-server` process is alive (`proc service status hero_livekit` reports `running`) and accepts WebSocket connections on `ws://<node_ip>:7880/rtc/v1`.
- On a host with stale mismatched files, `service_livekit start` (no flags) auto-detects, wipes, and regenerates a consistent triple — the operator does not need to manually `rm` files.
- On a clean host, `service_livekit start` succeeds without ever creating an inconsistent triple at any point during the sequence.
- Re-running `service_livekit start` against a consistent state does not rotate `api_secret` (token-stability guarantee from the upstream `ensure_secret` idempotency).
- If `livekitservice.configure` fails, the skill returns non-zero, prints a clear error, and leaves no half-written files behind.

Notes
- The upstream supervisor fix (commit `64c5711` in `lhumina_code/hero_livekit`, branch `development`) is a hard prerequisite for this issue to truly close. The hero_skills changes specified above are necessary because (a) operators on stale local binaries will never pick up the fix without a rebuild trigger, and (b) inconsistent files left over from a previous broken run survive even after the supervisor is upgraded — the supervisor's self-heal only triggers when files are missing, not when they disagree with each other.
- Do not rotate `api_secret` on every `start`: tokens already issued (e.g. by `service_collab` for an in-progress huddle) would silently be invalidated. The pre-flight wipe path means rotation only happens after a detected inconsistency, which is exactly the case where the existing tokens are already broken.
- The YAML key extraction relies on the fixed shape emitted by `render_livekit_yaml` (lines 887-918 of `crates/hero_livekit_server/src/livekit/rpc.rs`). A full YAML dependency in nushell is unnecessary; if upstream `render_livekit_yaml` ever grows multi-key support, this regex needs to be revisited (currently the renderer hard-codes a single `keys:` entry, so single-line extraction is safe).
- Consider asking upstream `hero_livekit` to add a `livekitservice.health()` RPC that returns `(in_memory_cfg_hash, on_disk_yaml_hash)` so future skills can do this consistency check over RPC instead of by file inspection.
- No changes are needed in `tools/modules/services/service_collab.nu`. It already chains `service_livekit start` correctly and reads `runtime.json` defensively.