[bug] service_livekit start: hardcoded Redis port 6379 + claude CLI fallback dependency #185

Open
opened 2026-04-30 17:52:26 +00:00 by mik-tf · 0 comments
Owner

Symptoms

service_livekit start --reset --update aborts with one of two errors depending on the host:

  • Host A — Hetzner deploy as root, no redis-cli installed (reported by Kristof):

    /home/driver/hero/code/hero_skills/tools/modules/services/service_livekit.nu:132:24
    131 │ def svx_ensure_redis [] {
    132 │     let probe = (do { ^redis-cli -h 127.0.0.1 -p 6379 ping } | complete)
        ·                        ────┬────
        ·                            ╰── Command redis-cli not found
    
  • Host B — herodemo (TF Grid) as driver, redis-cli installed but Hero's DB on port 6378:

    → Redis not responding — attempting to start...
    agent.nu:104:14
    104 │             ^claude ...$CLAUDE_FLAGS --model haiku ...$iargs
        ·              ^^^|^^
        ·                 `-- Command `claude` not found
    

Both are bugs in service_livekit.nu's start path.

Root causes

1. Hardcoded port 6379 in svx_ensure_redis

do { ^redis-cli -h 127.0.0.1 -p 6379 ping } | complete

6379 is upstream Redis's default port, not Hero's. hero_db (the Hero Redis-compatible service) defaults to 6378 and is configurable. Probing 6379 will always miss Hero's DB on a normal deploy, regardless of platform.

2. redis-cli is a hard external dependency

Whichever Hero service needs to verify the DB is responding should not need a separate Redis client tool. install_core does not (and should not) install redis-cli. Probing via TCP (or trusting that hero_db is in the dependency list and let livekit's own startup fail loudly if it isn't) avoids the external dep entirely.

3. "Auto-start Redis" fallback path goes through ^claude (Claude Code CLI)

When the redis probe fails, the code falls into agent.nu's flow which eventually invokes ^claude. This is wrong on multiple levels:

  • An LLM has no business in a service-start hot path
  • claude isn't documented as a runtime prerequisite
  • It silently makes service_livekit unstartable on any host without Claude Code installed for the invoking user

Proposed fix (small, ~20 lines)

In tools/modules/services/service_livekit.nu:

  1. Read the Redis port from env:

    let redis_port = ($env | get -o HERO_DB_PORT | default "6378")
    

    Default to 6378 (Hero's default). Document the env var.

  2. Probe via plain TCP, not redis-cli:

    def svx_ensure_redis [] {
        let port = ($env | get -o HERO_DB_PORT | default "6378")
        let probe = (do { ^bash -c $"timeout 2 bash -c '</dev/tcp/127.0.0.1/($port)'" } | complete)
        if $probe.exit_code == 0 {
            return  # port open, hero_db is up
        }
        error make {msg: $"hero_db not responding on 127.0.0.1:($port). Run `hero_proc service status hero_db` and `service_db start --reset` if needed."}
    }
    

    No external CLI dependency. Clear error if hero_db is down — operator drives recovery.

  3. Remove the agent.nu / claude fallback path entirely from this flow. service_livekit should NOT try to auto-start hero_db; that's hero_proc's job via the dependency list.

Acceptance criteria

  • service_livekit start --reset --update succeeds on:
    • TF Grid root deploy (hero_db on 6378)
    • Hetzner root deploy (no redis-cli installed)
    • Generic Ubuntu driver-user deploy (no redis-cli, no claude)
  • service_livekit.nu no longer references redis-cli or claude
  • HERO_DB_PORT documented as the override knob in the service module's docstring
  • If hero_db is genuinely down, the error message points the operator at service_db start --reset

References

  • Observed during the herodemo sweep on 2026-04-30 — sweep stopped at livekit phase 2 (hero_demo#46)
  • Kristof's Hetzner deploy hit the same service_livekit.nu:132 line
  • agent.nu:104 shows the unwanted ^claude invocation in the fallback path

Signed-off-by: mik-tf

## Symptoms `service_livekit start --reset --update` aborts with one of two errors depending on the host: - **Host A — Hetzner deploy as root, no `redis-cli` installed** (reported by Kristof): ``` /home/driver/hero/code/hero_skills/tools/modules/services/service_livekit.nu:132:24 131 │ def svx_ensure_redis [] { 132 │ let probe = (do { ^redis-cli -h 127.0.0.1 -p 6379 ping } | complete) · ────┬──── · ╰── Command redis-cli not found ``` - **Host B — herodemo (TF Grid) as `driver`, redis-cli installed but Hero's DB on port 6378**: ``` → Redis not responding — attempting to start... agent.nu:104:14 104 │ ^claude ...$CLAUDE_FLAGS --model haiku ...$iargs · ^^^|^^ · `-- Command `claude` not found ``` Both are bugs in `service_livekit.nu`'s start path. ## Root causes ### 1. Hardcoded port 6379 in `svx_ensure_redis` ``` do { ^redis-cli -h 127.0.0.1 -p 6379 ping } | complete ``` `6379` is **upstream Redis**'s default port, not Hero's. `hero_db` (the Hero Redis-compatible service) defaults to **6378** and is configurable. Probing 6379 will always miss Hero's DB on a normal deploy, regardless of platform. ### 2. `redis-cli` is a hard external dependency Whichever Hero service needs to verify the DB is responding should not need a separate Redis client tool. `install_core` does not (and should not) install `redis-cli`. Probing via TCP (or trusting that hero_db is in the dependency list and let livekit's own startup fail loudly if it isn't) avoids the external dep entirely. ### 3. "Auto-start Redis" fallback path goes through `^claude` (Claude Code CLI) When the redis probe fails, the code falls into `agent.nu`'s flow which eventually invokes `^claude`. This is wrong on multiple levels: - An LLM has no business in a service-start hot path - `claude` isn't documented as a runtime prerequisite - It silently makes service_livekit unstartable on any host without Claude Code installed for the invoking user ## Proposed fix (small, ~20 lines) In `tools/modules/services/service_livekit.nu`: 1. **Read the Redis port from env**: ```nu let redis_port = ($env | get -o HERO_DB_PORT | default "6378") ``` Default to **6378** (Hero's default). Document the env var. 2. **Probe via plain TCP, not redis-cli**: ```nu def svx_ensure_redis [] { let port = ($env | get -o HERO_DB_PORT | default "6378") let probe = (do { ^bash -c $"timeout 2 bash -c '</dev/tcp/127.0.0.1/($port)'" } | complete) if $probe.exit_code == 0 { return # port open, hero_db is up } error make {msg: $"hero_db not responding on 127.0.0.1:($port). Run `hero_proc service status hero_db` and `service_db start --reset` if needed."} } ``` No external CLI dependency. Clear error if hero_db is down — operator drives recovery. 3. **Remove the agent.nu / claude fallback path** entirely from this flow. service_livekit should NOT try to auto-start hero_db; that's hero_proc's job via the dependency list. ## Acceptance criteria - [ ] `service_livekit start --reset --update` succeeds on: - TF Grid root deploy (hero_db on 6378) - Hetzner root deploy (no redis-cli installed) - Generic Ubuntu driver-user deploy (no redis-cli, no claude) - [ ] `service_livekit.nu` no longer references `redis-cli` or `claude` - [ ] `HERO_DB_PORT` documented as the override knob in the service module's docstring - [ ] If hero_db is genuinely down, the error message points the operator at `service_db start --reset` ## References - Observed during the herodemo sweep on 2026-04-30 — sweep stopped at livekit phase 2 (`hero_demo#46`) - Kristof's Hetzner deploy hit the same `service_livekit.nu:132` line - `agent.nu:104` shows the unwanted `^claude` invocation in the fallback path Signed-off-by: mik-tf
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
lhumina_code/hero_skills#185
No description provided.