hero_livekit: hero_proc doesn't inject LIVEKIT_VERSION -> SFU never launches (nothing binds :7880) #50

Closed
opened 2026-06-16 10:28:23 +00:00 by sameh-farouk · 5 comments
Member

Symptom

Joining a huddle fails in the browser: WebSocket to ws://<node_ip>:7880/rtc/v1?... failed: ERR_CONNECTION_REFUSED. On the box, nothing is listening on 7880 (the LiveKit signaling/SFU port); livekit-server is installed (/root/hero/bin/livekit-server) but no process is bound to 7880-7882.
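
A quick way to confirm the "nothing is listening" symptom from code is a plain TCP connect probe against loopback. This is a minimal sketch using only the Rust standard library; `is_listening` is a hypothetical helper and not part of hero_livekit, and it only covers the TCP ports, not any UDP media ports.

```rust
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

/// Hypothetical helper: returns true if some process accepts TCP
/// connections on 127.0.0.1:<port> within a short timeout.
pub fn is_listening(port: u16) -> bool {
    let addr = SocketAddr::from(([127, 0, 0, 1], port));
    // A refused or timed-out connect means nothing is bound (or reachable) there.
    TcpStream::connect_timeout(&addr, Duration::from_millis(500)).is_ok()
}
```

On the affected box, `is_listening(7880)` would be expected to return `false` until the SFU is actually running.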

Root cause

hero_livekit_server is the orchestrator that supervises the livekit-server SFU child, gated on LIVEKIT_VERSION. Its log shows:

hero_livekit_server::provision: provision: LIVEKIT_VERSION not set — serving daemon only (no auto-provision)

The secret store DOES have LIVEKIT_VERSION=v1.12.0 (+ LIVEKIT_URL/API_KEY/API_SECRET/NODE_IP), but hero_proc spawns supervised children with a clean env and only injects vars the service declares in its service.toml [[env]]. hero_livekit's service.toml does NOT declare LIVEKIT_VERSION (nor the LIVEKIT_* creds), so the hero_proc-supervised hero_livekit_server starts WITHOUT LIVEKIT_VERSION → daemon-only → the SFU is never launched → 7880 is dead. (A hero_proc service restart hero_livekit_server reproduces it: fresh start, same 'not set' log.)

Fix

Declare the LiveKit env vars as [[env]] in crates/hero_livekit_server/service.toml (mirroring how hero_collab declares COLLAB_AUTH_MODE) so hero_proc injects them from the secret store into the supervised action env:

[[env]]
var = "LIVEKIT_VERSION"
# + LIVEKIT_URL, LIVEKIT_API_KEY, LIVEKIT_API_SECRET, LIVEKIT_NODE_IP

Then hero_proc service restart hero_livekit_server should auto-provision + launch livekit-server on 7880.

Note (dev/tunnel)

Even once the SFU is up, a browser reaching collab via an ssh -L 9988 tunnel cannot reach ws://<public-ip>:7880 directly — the dev huddle path also needs ssh -L 7880:127.0.0.1:7880 and LIVEKIT_URL=ws://localhost:7880, or 7880 firewall-open + a routable URL.

Unrelated to the hero_collab oschema migration (collab generates a valid join token + returns the configured ws_url correctly — verified). Surfaced testing collab huddles.

Author
Member

Update — original root cause is invalidated; corrected diagnosis below (work in progress on development).

The issue body's root cause ("hero_proc spawns supervised children with a clean env and only injects [[env]]-declared vars; declare LIVEKIT_* in service.toml") is wrong on two counts, verified against the code and the running box:

  1. hero_proc does inject the secrets into the child env. hero_proc_server's executor (supervisor/executor.rs) builds the child env by injecting every secret in the job's context first, then the spec [[env]]. The running hero_livekit_server already has LIVEKIT_VERSION=v1.12.0, LIVEKIT_NODE_IP, LIVEKIT_API_SECRET, etc. in /proc/<pid>/environ — no [[env]] declaration needed. So the [[env]] fix would be a no-op (and risky: a declared default="" is appended after the context secret and would clobber it to empty).

  2. provision doesn't read env at all — it reads via the secret_get RPC. provision.rs::get_secret_opt calls hero_proc_sdk::secret_get(key, context="core"). The real bug is that every failure was collapsed to None by _ => None: hero_proc-unreachable, transport error, and — critically — sdk/daemon wire-shape drift all looked identical to "secret not set". The deployed hero_livekit_server (built 2026-06-08) links a hero_proc_sdk@main a week older than the running hero_proc daemon (rebuilt 2026-06-15); the drifted secret_get response failed to decode → None → "LIVEKIT_VERSION not set — serving daemon only" → SFU never launches → nothing binds :7880 → huddle ERR_CONNECTION_REFUSED. Reproduced deterministically: a fresh hero_proc service restart hero_livekit_server logs "not set" even though hero_proc secret get LIVEKIT_VERSION --context core returns v1.12.0 reliably.
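
The clobbering risk in point 1 follows from "last write wins" when the child env is assembled in two passes. The sketch below is an illustration only, assuming the env is built by applying context secrets first and spec `[[env]]` entries second; `build_child_env` is a hypothetical name and not the actual code in `supervisor/executor.rs`.

```rust
use std::collections::BTreeMap;

/// Hypothetical model of the two-pass env build: context secrets are
/// applied first, spec `[[env]]` entries second, so a spec entry with
/// the same key overrides the secret.
pub fn build_child_env(
    context_secrets: &[(&str, &str)],
    spec_env: &[(&str, &str)],
) -> BTreeMap<String, String> {
    let mut env = BTreeMap::new();
    for (key, value) in context_secrets.iter().chain(spec_env.iter()) {
        // Later inserts replace earlier ones for the same key.
        env.insert(String::from(*key), String::from(*value));
    }
    env
}
```

Under that assumption, a secret `LIVEKIT_VERSION=v1.12.0` combined with a declared `LIVEKIT_VERSION` whose default is the empty string ends up empty in the child, which is why the `[[env]]` declaration is described as risky rather than merely redundant.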

Fix (keeping the secret_get RPC — per the no-env-vars-config rule, switching to std::env::var is disallowed):

  • Distinguish absent (NotSet) from failure (Error). Only hero_proc's secret-not-found code (-32002) counts as "not set"; everything else is an error. A renamed/dropped method (-32601 Method not found) is drift — surfaced loudly, never swallowed as an opt-out.
  • Retry transient errors with linear backoff (covers the startup race: provision fires immediately after rpc.sock serves).
  • On a persistent read failure, log loudly (Always-Die) and serve daemon-only — never silently pretend the secret is unset.
  • Rebuild against current hero_proc_sdk to clear the present drift.
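
The absent-versus-failure distinction from the first bullet can be sketched as a small classifier. This is an illustration under stated assumptions, not the code on the branch: `ReadError` and `classify` are hypothetical names, and the real `hero_proc_sdk` error type is not shown in this thread. Only the `Found`/`NotSet`/`Error` outcomes and the `-32002` code come from the discussion.

```rust
/// Hypothetical error shape for a failed secret read.
#[derive(Debug, PartialEq)]
pub enum ReadError {
    Rpc { code: i64, message: String },
    Transport(String),
    Decode(String),
}

/// Outcome of a secret read: present, confirmed absent, or failed.
#[derive(Debug, PartialEq)]
pub enum SecretRead {
    Found(String),
    NotSet,
    Error(String),
}

/// hero_proc's secret-not-found code, as quoted in this issue.
pub const SECRET_NOT_FOUND: i64 = -32002;

/// Only the secret-not-found code maps to `NotSet`; every other
/// failure stays an `Error` instead of collapsing to "not set".
pub fn classify(result: Result<String, ReadError>) -> SecretRead {
    match result {
        Ok(value) => SecretRead::Found(value),
        Err(ReadError::Rpc { code: SECRET_NOT_FOUND, .. }) => SecretRead::NotSet,
        Err(ReadError::Rpc { code, message }) => {
            SecretRead::Error(format!("rpc error {code}: {message}"))
        }
        Err(ReadError::Transport(e)) => SecretRead::Error(format!("transport error: {e}")),
        Err(ReadError::Decode(e)) => SecretRead::Error(format!("decode error: {e}")),
    }
}
```

The point of the design is that there is no catch-all arm returning "not set": a new failure mode has to be handled explicitly and lands in `Error` by construction.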

Unit test asserts -32002 → opt-out while -32601/transport/decode errors are not treated as opt-out (the regression guard).
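
The retry bullet above could look roughly like the following. It is a generic sketch, assuming a caller-supplied predicate decides what counts as transient; `retry_linear` is a hypothetical helper, and the attempt count and delay used on the branch are not stated here.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical helper: runs `op` up to `max_attempts` times (at least
/// once), sleeping `base_delay * attempt` between tries (linear backoff).
/// Errors that `is_transient` rejects are returned immediately.
pub fn retry_linear<T, E>(
    max_attempts: u32,
    base_delay: Duration,
    is_transient: impl Fn(&E) -> bool,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt: u32 = 1;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(err) => {
                // Give up on the last attempt or on a non-transient error.
                if attempt >= max_attempts || !is_transient(&err) {
                    return Err(err);
                }
                sleep(base_delay * attempt);
                attempt += 1;
            }
        }
    }
}
```

With this shape, a transport error during the startup race is retried, while a definitive answer such as "Method not found" fails fast and can be logged loudly.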

Targeting development (migration stack), per the post-migration-fixes-go-to-development policy. Verification on the box in progress.

Author
Member

Definitive root cause (supersedes my "rebuild to clear drift" note above) + relationship to #51.

My added loud logging surfaced the exact failure on a fresh, current build (hero_livekit_server build #9, development stack):

provision: secret read failed ... error=Transport error: HTTP 404 Not Found from /rpc   (×5 retries)
provision: could NOT read LIVEKIT_VERSION from hero_proc after retries — ... Serving daemon only.

Probing the running hero_proc confirms it is fully migrated to the canonical hero_sockets surface, and has dropped the legacy /rpc route:

| route | result |
|---|---|
| POST /rpc (what frozen hero_livekit's hero_proc_sdk calls) | 404 |
| GET /openrpc.json | 404 |
| GET /api/ping | 200 "pong" |
| GET /api/domains.json | 200 (domains: jobs, logs, secrets, …) |
| GET /health.json | 200 |

So the secret RPC now lives at POST /api/secrets/rpc; the pre-migration sdk posting to /rpc can never reach it. This is not a stale-SHA drift fixable by a cargo update: hero_livekit on development is pinned to the frozen pre-migration stack (hero_rpc2/hero_rpc_openrpc from hero_macros_previous@osis_oldserving), so it speaks the legacy dialect as both client and server.
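
The route change can be summarised as a path-building rule. The helpers below are hypothetical and only encode the two shapes quoted in this thread (legacy `/rpc` versus canonical `/api/{domain}/rpc`); they are not taken from `hero_proc_sdk`.

```rust
/// Legacy single-endpoint route used by the pre-migration stack.
pub fn legacy_rpc_path() -> &'static str {
    "/rpc"
}

/// Canonical per-domain route, e.g. the `secrets` domain.
pub fn canonical_rpc_path(domain: &str) -> String {
    format!("/api/{domain}/rpc")
}
```

A client built on the first helper gets a 404 from a migrated hero_proc regardless of how fresh its lockfile is, because the path itself no longer exists.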

Relationship to #51 (same root cause, two faces):

  • #51 = hero_livekit's server surface is legacy-only (serves /rpc, not canonical /api/{domain}/rpc).
  • #50's connectivity = hero_livekit's client call to hero_proc uses the legacy /rpc route → 404.
  • Both stem from hero_livekit being unmigrated. The hero_livekit migration to the canonical stack fixes both (canonical server surface + migrated hero_proc_sdk that targets /api/secrets/rpc). They are complementary, not conflicting.

What my patch on this issue does (and doesn't): the provision hardening (distinguish absent vs error via -32002; retry transient errors; log loudly; only daemon-only on confirmed-unset) is what revealed this 404 instead of silently mis-reporting "LIVEKIT_VERSION not set". It is worth keeping as robustness/observability and should ride on the migrated stack — but it does not by itself restore connectivity. The connectivity fix requires the migration. Holding for a decision on sequencing the hero_livekit migration vs. landing the hardening standalone.

Author
Member

Parked (will resume after in-flight hero_collab issues).

  • Provision hardening committed to branch development_sameh_livekit50 (pushed, not merged): distinguishes secret-read failure from not-set, retries transient errors, logs loudly. 4/4 provision unit tests pass on the box.
  • The connectivity fix is the hero_livekit canonical migration (same root cause as #51) — I am taking that migration; this hardening will fold onto the migrated stack.
  • Meanwhile the SFU is manually restored on the box (livekit-server up on :7880/:7881) so huddles keep working until auto-provision lands with the migration.
Author
Member

Migration plan written (turnkey): hero_livekit/docs/superpowers/plans/2026-06-16-livekit-canonical-migration.md on branch development_sameh_livekit_migration. Target = canonical herolib serving (herolib_macros::openrpc_server! + serve_domains), like hero_collab/hero_proc — NOT the stale livekit_migration hero_rpc approach. The oschema already has a service LiveKitService { } block (close to the collab pattern). Remaining: deps flip (osis_oldserving→herolib@development), main.rs rewrite (drop rpc2_adapter/OsisLivekit), provision port (fold #50 hardening from development_sameh_livekit50 f30c9b6), SDK/admin canonical updates, box verify. SFU runs meanwhile via manual LiveKitService.start (survives until hero_livekit_server restart). Collab regression sweep (the blocker) is DONE+merged; livekit migration is the focused next push.

Author
Member

Fixed + closed in 63f80f5 (development) — part of the hero_livekit → canonical herolib serving migration.

Root cause (as re-diagnosed): the frozen pre-migration stack read LIVEKIT_VERSION via a hero_proc_sdk whose secret_get hit hero_proc's dead legacy /rpc route (404), so provision silently fell back to daemon-only and the SFU never launched. This was compounded by get_secret_opt swallowing the error.

Fix: migrated to hero_proc_sdk@development (reaches the canonical /api/secrets/rpc); provision now reads the version via the new factory.secrets().secret_get({sid}) API + the hardened path (SecretRead Found/NotSet/Error, retry+backoff, loud-on-error, -32002-only opt-out); and provision's self-RPC was repointed from legacy /rpc → canonical /api/main/rpc.

Verified on box (build #11, dev HEAD, lock==HEAD): hero_proc service restart hero_livekit_server → provision auto-reads the version → "livekit stack ensured" → :7880 binds in ~4s with NO manual start, and it survives a 2nd restart. node_ip is the routable IP (no loopback). Huddle path confirmed: browser reaches the SFU (/rtc/v1/validate → 401) and huddle_start mints a room. Also: credential files now 0600/dirs 0700 and the api_secret uses a CSPRNG (security review).

Reference
lhumina_code/hero_livekit#50