service_collab restart does not re-read livekit.secret — stale JWT signing key after rotation #36

Open
opened 2026-04-29 03:32:56 +00:00 by sameh-farouk · 0 comments
Member

Observed

On dev box (138.201.206.39), after rotating the livekit shared secret in ~/hero/cfg/livekit.secret:

  • livekit-server restarted, picked up the new secret cleanly
  • hero_collab_server kept signing JWTs with the old secret
  • All huddle JWT verifications returned 401 Unauthorized against livekit
  • proc service restart hero_collab_server did not fix it
  • Only after pkill -9 + manual respawn did the new secret load
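
The 401s follow directly from how a shared-secret JWT works. The sketch below assumes the tokens are HS256-signed with the shared secret as the HMAC key. The helper names `sign_hs256` and `verify_hs256` are illustrative only and are not taken from hero_collab or livekit.

```python
import base64
import hashlib
import hmac
import json


def _b64url(data: bytes) -> str:
    # JWT uses unpadded URL-safe base64.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def sign_hs256(claims: dict, secret: str) -> str:
    # Hypothetical signer: header.payload.signature, HMAC-SHA256 over the first two parts.
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url(json.dumps(claims).encode())
    signing_input = f"{header}.{payload}".encode("ascii")
    sig = hmac.new(secret.encode(), signing_input, hashlib.sha256).digest()
    return f"{header}.{payload}.{_b64url(sig)}"


def verify_hs256(token: str, secret: str) -> bool:
    # Recompute the signature with the verifier's secret and compare in constant time.
    signing_input, _, sig = token.rpartition(".")
    expected = hmac.new(secret.encode(), signing_input.encode("ascii"), hashlib.sha256).digest()
    return hmac.compare_digest(_b64url(expected), sig)
```

A token signed with the pre-rotation secret verifies only against that same secret, so a verifier that has already loaded the rotated value rejects every token the stale signer produces.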

Hypothesis

hero_collab_server reads --livekit-api-secret-file once at startup into in-memory state. A graceful restart via proc service restart should re-read it (new process = fresh memory), but it didn't here.

Likely interaction with hero_proc orphan-procs bug

This may be a downstream effect of the orphan-procs issue filed against hero_proc. If proc service restart does not fully terminate the old hero_collab_server process, the new one may silently fail to bind the Unix domain socket (UDS), which the old process still owns, and traffic continues to reach the old PID with the stale secret.
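
For illustration, this is roughly what the failure mode looks like at the socket level, and how a server could refuse to start instead of failing silently. It is a sketch only: `bind_uds_or_exit` is a hypothetical helper, and it assumes the server does not unlink the socket path before binding (a server that unlinks first would not hit this error).

```python
import errno
import socket
import sys


def bind_uds_or_exit(path: str) -> socket.socket:
    # Bind a listening Unix domain socket; exit loudly if the path is already taken.
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        sock.bind(path)
    except OSError as exc:
        sock.close()
        if exc.errno == errno.EADDRINUSE:
            # The path already exists, e.g. because an old process still owns it.
            sys.exit(f"{path} is already bound; is an old process still running?")
        raise
    sock.listen()
    return sock
```

Exiting with a non-zero status here would surface the problem to the supervisor rather than leaving the old process to keep serving.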

Diagnostic to confirm: after a restart that fails to refresh the secret, run pgrep -af hero_collab_server — if there are two PIDs for one user, the orphan is the one serving stale traffic.
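
The check can be scripted. A minimal sketch, assuming `pgrep` is on PATH and prints one `PID command-line` pair per line when called with `-af`; the function names are illustrative:

```python
import subprocess


def pids_from_pgrep(output: str) -> list[int]:
    # Each `pgrep -a` line starts with the PID, followed by the command line.
    pids = []
    for line in output.splitlines():
        parts = line.split(maxsplit=1)
        if parts and parts[0].isdigit():
            pids.append(int(parts[0]))
    return pids


def running_pids(pattern: str = "hero_collab_server") -> list[int]:
    # pgrep exits non-zero when nothing matches, so do not raise on that.
    result = subprocess.run(
        ["pgrep", "-af", pattern], capture_output=True, text=True, check=False
    )
    return pids_from_pgrep(result.stdout)


# Usage: more than one PID after a restart suggests an orphan.
#   if len(running_pids()) > 1: print("possible orphan")
```

Because `-f` matches against full command lines, unrelated processes that mention the name (an editor, a log tail) can also appear, so the output still needs a quick look.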

Suggested fixes (independent of root cause)

Even if the orphan-procs root cause is fixed in hero_proc, defense-in-depth here is cheap and matches typical Unix daemon convention:

  • SIGHUP handler that re-reads the secret file (cleanest)
  • Periodic re-read with mtime check (no extra signal needed)
  • Move secret read to per-request (acceptable cost — JWT signing isn't a hot path)

The SIGHUP approach is the most idiomatic and lets ops scripts trigger a refresh without going through the full proc lifecycle.
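
A rough sketch of the first two options, written in Python for brevity rather than in the server's own language. `SecretStore` and its methods are hypothetical names, not existing hero_collab code. The signal handler only sets a flag, and the file is re-read on the next `get()`, which keeps file I/O out of the handler.

```python
import os
import signal


class SecretStore:
    """Holds the signing secret and re-reads it on request or on mtime change."""

    def __init__(self, path: str) -> None:
        self._path = path
        self._secret = ""
        self._mtime_ns = -1
        self._reload_requested = False
        self.reload()

    def request_reload(self, signum=None, frame=None) -> None:
        # Safe to call from a signal handler: a plain attribute assignment.
        self._reload_requested = True

    def reload(self) -> None:
        # Stat before reading: if the file changes in between, the next
        # mtime check sees a newer mtime and reloads again.
        mtime_ns = os.stat(self._path).st_mtime_ns
        with open(self._path, encoding="utf-8") as fh:
            secret = fh.read().strip()
        self._secret, self._mtime_ns = secret, mtime_ns

    def reload_if_changed(self) -> bool:
        # Option 2: call this from a periodic timer.
        if os.stat(self._path).st_mtime_ns == self._mtime_ns:
            return False
        self.reload()
        return True

    def get(self) -> str:
        # Option 1: honour a pending SIGHUP before handing out the secret.
        if self._reload_requested:
            self._reload_requested = False
            self.reload()
        return self._secret


def install_sighup_reload(store: SecretStore) -> None:
    # Must be called from the main thread (a Python signal-module requirement).
    signal.signal(signal.SIGHUP, store.request_reload)
```

With something like this in place, an ops script could send `kill -HUP <pid>` after rewriting the secret file, and the next signed JWT would use the new key without a restart.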

Repro

  1. Note current livekit.secret contents and hero_collab_server PID
  2. Rewrite livekit.secret with a new value
  3. proc service restart hero_collab_server
  4. Issue a huddle join → JWT verification fails at livekit
  5. pgrep -af hero_collab_server — check if the old PID is still alive (it usually is on this box)
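
Step 5 can also be checked directly against the PID noted in step 1. A minimal sketch (the helper name `pid_alive` is illustrative) that uses signal 0, which performs the existence and permission checks without delivering a signal:

```python
import os


def pid_alive(pid: int) -> bool:
    # Signal 0 delivers nothing; the call only reports whether the PID exists.
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True
```

If `pid_alive(old_pid)` is still true after the restart in step 3, the old process was not terminated. PIDs can be reused, so this is only meaningful shortly after the restart.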

Why this matters

Secret rotation is exactly the kind of operation where you expect graceful restart to suffice. The current behavior forces operators to use pkill -9, which masks the underlying supervisor bug and risks data loss if the process is mid-write.

Reference
lhumina_code/hero_collab#36