hero_livekit: hero_proc doesn't inject LIVEKIT_VERSION -> SFU never launches (nothing binds :7880) #50
Reference
lhumina_code/hero_livekit#50
Symptom
Joining a huddle fails in the browser: `WebSocket to ws://<node_ip>:7880/rtc/v1?... failed: ERR_CONNECTION_REFUSED`. On the box, nothing is listening on 7880 (the LiveKit signaling/SFU port); `livekit-server` is installed (`/root/hero/bin/livekit-server`), but no process is bound to 7880-7882.

Root cause
`hero_livekit_server` is the orchestrator that supervises the `livekit-server` SFU child, gated on `LIVEKIT_VERSION`. Its log shows:

The secret store DOES have `LIVEKIT_VERSION=v1.12.0` (plus `LIVEKIT_URL`/`API_KEY`/`API_SECRET`/`NODE_IP`), but hero_proc spawns supervised children with a clean env and only injects vars the service declares in its `service.toml [[env]]`. hero_livekit's `service.toml` does NOT declare `LIVEKIT_VERSION` (nor the `LIVEKIT_*` creds), so the hero_proc-supervised `hero_livekit_server` starts WITHOUT `LIVEKIT_VERSION` → daemon-only → the SFU is never launched → 7880 is dead. (A `hero_proc service restart hero_livekit_server` reproduces it: fresh start, same 'not set' log.)

Fix
Declare the LiveKit env vars as `[[env]]` in `crates/hero_livekit_server/service.toml` (mirroring how hero_collab declares `COLLAB_AUTH_MODE`) so hero_proc injects them from the secret store into the supervised action env:

Then `hero_proc service restart hero_livekit_server` should auto-provision and launch `livekit-server` on 7880.

Note (dev/tunnel)
Even once the SFU is up, a browser reaching collab via an `ssh -L 9988` tunnel cannot reach `ws://<public-ip>:7880` directly. The dev huddle path also needs `ssh -L 7880:127.0.0.1:7880` and `LIVEKIT_URL=ws://localhost:7880`, or 7880 firewall-open plus a routable URL.

Unrelated to the hero_collab oschema migration (collab generates a valid join token and returns the configured ws_url correctly; verified). Surfaced while testing collab huddles.
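To check the "nothing is listening on 7880" symptom without going through the browser, a plain TCP connect probe can help. The following is a minimal sketch using only the Rust standard library; `port_is_open` and `open_ports` are hypothetical helper names, not part of hero_livekit or hero_proc. It only covers TCP, so it says nothing about any UDP media port.

```rust
use std::net::{IpAddr, SocketAddr, TcpStream};
use std::time::Duration;

/// Returns true if a TCP connection to `addr` succeeds within `timeout`.
/// A refused or timed-out connection yields false.
fn port_is_open(addr: SocketAddr, timeout: Duration) -> bool {
    TcpStream::connect_timeout(&addr, timeout).is_ok()
}

/// Probes each port on `host` and returns the ones that accepted a connection.
fn open_ports(host: IpAddr, ports: &[u16], timeout: Duration) -> Vec<u16> {
    ports
        .iter()
        .copied()
        .filter(|&port| port_is_open(SocketAddr::new(host, port), timeout))
        .collect()
}
```

Run on the box, `open_ports("127.0.0.1".parse().unwrap(), &[7880], Duration::from_secs(1))` returning an empty vector would match the observation above; run from the browser's machine, it also separates "SFU is down" from "SFU is up but not reachable through the tunnel".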
Update: the original root cause is invalidated; corrected diagnosis below (work in progress on `development`).

The issue body's root cause ("hero_proc spawns supervised children with a clean env and only injects `[[env]]`-declared vars; declare `LIVEKIT_*` in `service.toml`") is wrong on two counts, verified against the code and the running box:

hero_proc does inject the secrets into the child env.
`hero_proc_server`'s executor (`supervisor/executor.rs`) builds the child env by injecting every secret in the job's context first, then the spec `[[env]]`. The running `hero_livekit_server` already has `LIVEKIT_VERSION=v1.12.0`, `LIVEKIT_NODE_IP`, `LIVEKIT_API_SECRET`, etc. in `/proc/<pid>/environ`, with no `[[env]]` declaration needed. So the `[[env]]` fix would be a no-op (and risky: a declared `default=""` is appended after the context secret and would clobber it to empty).

provision doesn't read env at all; it reads via the `secret_get` RPC.
`provision.rs::get_secret_opt` calls `hero_proc_sdk::secret_get(key, context="core")`. The real bug is that every failure was collapsed to `None` by `_ => None`: hero_proc unreachable, transport error, and (critically) sdk/daemon wire-shape drift all looked identical to "secret not set". The deployed `hero_livekit_server` (built 2026-06-08) links a `hero_proc_sdk@main` a week older than the running `hero_proc` daemon (rebuilt 2026-06-15); the drifted `secret_get` response failed to decode → `None` → "LIVEKIT_VERSION not set — serving daemon only" → SFU never launches → nothing binds :7880 → huddle `ERR_CONNECTION_REFUSED`. Reproduced deterministically: a fresh `hero_proc service restart hero_livekit_server` logs "not set" even though `hero_proc secret get LIVEKIT_VERSION --context core` returns `v1.12.0` reliably.

Fix (keeping the `secret_get` RPC; per the no-env-vars-config rule, switching to `std::env::var` is disallowed):

- Distinguish not-set (`NotSet`) from failure (`Error`). Only hero_proc's secret-not-found code (`-32002`) counts as "not set"; everything else is an error. A renamed/dropped method (`-32601 Method not found`) is drift: surfaced loudly, never swallowed as an opt-out.
- Rebuild against the current `hero_proc_sdk` to clear the present drift.

Unit test asserts `-32002` → opt-out while `-32601`/transport/decode errors are not treated as opt-out (the regression guard).

Targeting `development` (migration stack), per the post-migration-fixes-go-to-development policy. Verification on the box in progress.

Definitive root cause (supersedes my "rebuild to clear drift" note above) + relationship to #51.
My added loud logging surfaced the exact failure on a fresh, current build (`hero_livekit_server` build #9, development stack):

Probing the running hero_proc confirms it is fully migrated to the canonical hero_sockets surface and has dropped the legacy `/rpc` route:

- `POST /rpc` (what frozen hero_livekit's hero_proc_sdk calls)
- `GET /openrpc.json`
- `GET /api/ping`: `"pong"`
- `GET /api/domains.json`
- `GET /health.json`

So the secret RPC now lives at `POST /api/secrets/rpc`; the pre-migration sdk posting to `/rpc` can never reach it. This is not a stale-SHA drift fixable by a `cargo update`: `hero_livekit` on `development` is pinned to the frozen pre-migration stack (`hero_rpc2`/`hero_rpc_openrpc` from `hero_macros_previous@osis_oldserving`), so it speaks the legacy dialect as both client and server.

Relationship to #51 (same root cause, two faces):

- […] `/rpc`, not canonical `/api/{domain}/rpc`).
- […] `/rpc` route → 404.
- […] `hero_proc_sdk` that targets `/api/secrets/rpc`). They are complementary, not conflicting.

What my patch on this issue does (and doesn't): the provision hardening (distinguish absent vs error via `-32002`; retry transient errors; log loudly; only daemon-only on confirmed-unset) is what revealed this 404 instead of silently mis-reporting "LIVEKIT_VERSION not set". It is worth keeping as robustness/observability and should ride on the migrated stack, but it does not by itself restore connectivity. The connectivity fix requires the migration. Holding for a decision on sequencing the hero_livekit migration vs. landing the hardening standalone.

Parked (will resume after in-flight hero_collab issues).
`development_sameh_livekit50` (pushed, not merged): distinguishes secret-read failure from not-set, retries transient errors, logs loudly. 4/4 provision unit tests pass on the box.

Migration plan written (turnkey): `hero_livekit/docs/superpowers/plans/2026-06-16-livekit-canonical-migration.md` on branch `development_sameh_livekit_migration`. Target = canonical herolib serving (`herolib_macros::openrpc_server!` + serve_domains), like hero_collab/hero_proc, NOT the stale `livekit_migration` hero_rpc approach. The oschema already has a `service LiveKitService { }` block (close to the collab pattern). Remaining: deps flip (`osis_oldserving` → `herolib@development`), main.rs rewrite (drop rpc2_adapter/OsisLivekit), provision port (fold #50 hardening from `development_sameh_livekit50` `f30c9b6`), SDK/admin canonical updates, box verify. SFU runs meanwhile via manual `LiveKitService.start` (survives until hero_livekit_server restart). Collab regression sweep (the blocker) is DONE+merged; livekit migration is the focused next push.

Fixed + closed in `63f80f5` (development), as part of the hero_livekit → canonical herolib serving migration.

Root cause (as re-diagnosed): the frozen pre-migration stack read `LIVEKIT_VERSION` via a `hero_proc_sdk` whose `secret_get` hit hero_proc's dead legacy `/rpc` (404), so provision silently fell back to daemon-only and the SFU never launched. Compounded by `get_secret_opt` swallowing the error.

Fix: migrated to `hero_proc_sdk@development` (reaches the canonical `/api/secrets/rpc`); provision now reads the version via the new `factory.secrets().secret_get({sid})` API plus the hardened path (`SecretRead` Found/NotSet/Error, retry+backoff, loud-on-error, `-32002`-only opt-out); and provision's self-RPC was repointed from legacy `/rpc` → canonical `/api/main/rpc`.

Verified on box (build #11, dev HEAD, lock==HEAD): `hero_proc service restart hero_livekit_server` → provision auto-reads the version → "livekit stack ensured" → :7880 binds in ~4s with NO manual start, and it survives a 2nd restart. node_ip is the routable IP (no loopback). Huddle path confirmed: browser reaches the SFU (`/rtc/v1/validate` → 401) and `huddle_start` mints a room. Also: credential files are now `0600`/dirs `0700` and the api_secret uses a CSPRNG (security review).
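The hardened read path summarized in the fix above can be sketched roughly as follows. This is an illustration under assumptions, not the code in `63f80f5`: only the `SecretRead` Found/NotSet/Error split and the `-32002` code come from this thread, while `RpcFailure`, `classify`, and `read_with_retry` are hypothetical names and the real `hero_proc_sdk` error type may look different. For simplicity the sketch retries every error, including ones a real implementation might treat as non-transient.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Error code this thread reports for hero_proc's "secret not found".
const SECRET_NOT_FOUND: i64 = -32002;

/// Three-way outcome: only `NotSet` may be treated as an opt-out.
#[derive(Debug, PartialEq)]
enum SecretRead {
    Found(String),
    NotSet,
    Error(String),
}

/// Hypothetical failure shape for one secret_get call (NOT the real SDK type).
#[derive(Debug)]
enum RpcFailure {
    Rpc { code: i64, message: String },
    Transport(String),
    Decode(String),
}

/// Maps one call result to `SecretRead`. Every failure other than the
/// secret-not-found code stays an error instead of collapsing to "not set".
fn classify(result: Result<String, RpcFailure>) -> SecretRead {
    match result {
        Ok(value) => SecretRead::Found(value),
        Err(RpcFailure::Rpc { code: SECRET_NOT_FOUND, .. }) => SecretRead::NotSet,
        Err(RpcFailure::Rpc { code, message }) => {
            SecretRead::Error(format!("rpc error {code}: {message}"))
        }
        Err(RpcFailure::Transport(msg)) => SecretRead::Error(format!("transport: {msg}")),
        Err(RpcFailure::Decode(msg)) => SecretRead::Error(format!("decode: {msg}")),
    }
}

/// Calls `call` up to `max_attempts` times, doubling the delay after each
/// error and logging it to stderr. `Found` and `NotSet` return immediately.
fn read_with_retry<F>(mut call: F, max_attempts: u32, initial_delay: Duration) -> SecretRead
where
    F: FnMut() -> Result<String, RpcFailure>,
{
    let mut delay = initial_delay;
    let mut last_error = String::from("no attempts made");
    for attempt in 1..=max_attempts {
        match classify(call()) {
            SecretRead::Error(reason) => {
                eprintln!("secret read failed (attempt {attempt}/{max_attempts}): {reason}");
                last_error = reason;
                if attempt < max_attempts {
                    sleep(delay);
                    delay = delay.saturating_mul(2);
                }
            }
            final_answer => return final_answer,
        }
    }
    SecretRead::Error(last_error)
}
```

With this shape, a 404 from a dropped route or a `-32601 Method not found` ends up as `SecretRead::Error` after the retries, so the caller can refuse to fall back to daemon-only unless it sees `SecretRead::NotSet`.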