[bug] service_livekit start: hardcoded Redis port 6379 + claude CLI fallback dependency #185

New issue

Open

opened 2026-04-30 17:52:26 +00:00 by mik-tf · 0 comments

mik-tf commented

2026-04-30 17:52:26 +00:00

Owner

Symptoms

service_livekit start --reset --update aborts with one of two errors depending on the host:

Host A — Hetzner deploy as root, no redis-cli installed (reported by Kristof):

/home/driver/hero/code/hero_skills/tools/modules/services/service_livekit.nu:132:24
131 │ def svx_ensure_redis [] {
132 │     let probe = (do { ^redis-cli -h 127.0.0.1 -p 6379 ping } | complete)
    ·                        ────┬────
    ·                            ╰── Command redis-cli not found

Host B — herodemo (TF Grid) as driver, redis-cli installed but Hero's DB on port 6378:

→ Redis not responding — attempting to start...
agent.nu:104:14
104 │             ^claude ...$CLAUDE_FLAGS --model haiku ...$iargs
    ·              ^^^|^^
    ·                 `-- Command `claude` not found

Both are bugs in service_livekit.nu's start path.

Root causes

1. Hardcoded port 6379 in `svx_ensure_redis`

do { ^redis-cli -h 127.0.0.1 -p 6379 ping } | complete

6379 is upstream Redis's default port, not Hero's. hero_db (the Hero Redis-compatible service) defaults to 6378 and is configurable. Probing 6379 will always miss Hero's DB on a normal deploy, regardless of platform.

2. `redis-cli` is a hard external dependency

Whichever Hero service needs to verify the DB is responding should not need a separate Redis client tool. install_core does not (and should not) install redis-cli. Probing via TCP (or trusting that hero_db is in the dependency list and let livekit's own startup fail loudly if it isn't) avoids the external dep entirely.

3. "Auto-start Redis" fallback path goes through `^claude` (Claude Code CLI)

When the redis probe fails, the code falls into agent.nu's flow which eventually invokes ^claude. This is wrong on multiple levels:

An LLM has no business in a service-start hot path
claude isn't documented as a runtime prerequisite
It silently makes service_livekit unstartable on any host without Claude Code installed for the invoking user

Proposed fix (small, ~20 lines)

In tools/modules/services/service_livekit.nu:

Read the Redis port from env:
```
let redis_port = ($env | get -o HERO_DB_PORT | default "6378")
```
Default to 6378 (Hero's default). Document the env var.

Probe via plain TCP, not redis-cli:

def svx_ensure_redis [] {
    let port = ($env | get -o HERO_DB_PORT | default "6378")
    let probe = (do { ^bash -c $"timeout 2 bash -c '</dev/tcp/127.0.0.1/($port)'" } | complete)
    if $probe.exit_code == 0 {
        return  # port open, hero_db is up
    }
    error make {msg: $"hero_db not responding on 127.0.0.1:($port). Run `hero_proc service status hero_db` and `service_db start --reset` if needed."}
}

No external CLI dependency. Clear error if hero_db is down — operator drives recovery.

Remove the agent.nu / claude fallback path entirely from this flow. service_livekit should NOT try to auto-start hero_db; that's hero_proc's job via the dependency list.

Acceptance criteria

service_livekit start --reset --update succeeds on:
- TF Grid root deploy (hero_db on 6378)
- Hetzner root deploy (no redis-cli installed)
- Generic Ubuntu driver-user deploy (no redis-cli, no claude)
service_livekit.nu no longer references redis-cli or claude
HERO_DB_PORT documented as the override knob in the service module's docstring
If hero_db is genuinely down, the error message points the operator at service_db start --reset

References

Observed during the herodemo sweep on 2026-04-30 — sweep stopped at livekit phase 2 (hero_demo#46)
Kristof's Hetzner deploy hit the same service_livekit.nu:132 line
agent.nu:104 shows the unwanted ^claude invocation in the fallback path

Signed-off-by: mik-tf

## Symptoms `service_livekit start --reset --update` aborts with one of two errors depending on the host: - **Host A — Hetzner deploy as root, no `redis-cli` installed** (reported by Kristof): ``` /home/driver/hero/code/hero_skills/tools/modules/services/service_livekit.nu:132:24 131 │ def svx_ensure_redis [] { 132 │ let probe = (do { ^redis-cli -h 127.0.0.1 -p 6379 ping } | complete) · ────┬──── · ╰── Command redis-cli not found ``` - **Host B — herodemo (TF Grid) as `driver`, redis-cli installed but Hero's DB on port 6378**: ``` → Redis not responding — attempting to start... agent.nu:104:14 104 │ ^claude ...$CLAUDE_FLAGS --model haiku ...$iargs · ^^^|^^ · `-- Command `claude` not found ``` Both are bugs in `service_livekit.nu`'s start path. ## Root causes ### 1. Hardcoded port 6379 in `svx_ensure_redis` ``` do { ^redis-cli -h 127.0.0.1 -p 6379 ping } | complete ``` `6379` is **upstream Redis**'s default port, not Hero's. `hero_db` (the Hero Redis-compatible service) defaults to **6378** and is configurable. Probing 6379 will always miss Hero's DB on a normal deploy, regardless of platform. ### 2. `redis-cli` is a hard external dependency Whichever Hero service needs to verify the DB is responding should not need a separate Redis client tool. `install_core` does not (and should not) install `redis-cli`. Probing via TCP (or trusting that hero_db is in the dependency list and let livekit's own startup fail loudly if it isn't) avoids the external dep entirely. ### 3. "Auto-start Redis" fallback path goes through `^claude` (Claude Code CLI) When the redis probe fails, the code falls into `agent.nu`'s flow which eventually invokes `^claude`. This is wrong on multiple levels: - An LLM has no business in a service-start hot path - `claude` isn't documented as a runtime prerequisite - It silently makes service_livekit unstartable on any host without Claude Code installed for the invoking user ## Proposed fix (small, ~20 lines) In `tools/modules/services/service_livekit.nu`: 1. **Read the Redis port from env**: ```nu let redis_port = ($env | get -o HERO_DB_PORT | default "6378") ``` Default to **6378** (Hero's default). Document the env var. 2. **Probe via plain TCP, not redis-cli**: ```nu def svx_ensure_redis [] { let port = ($env | get -o HERO_DB_PORT | default "6378") let probe = (do { ^bash -c $"timeout 2 bash -c '</dev/tcp/127.0.0.1/($port)'" } | complete) if $probe.exit_code == 0 { return # port open, hero_db is up } error make {msg: $"hero_db not responding on 127.0.0.1:($port). Run `hero_proc service status hero_db` and `service_db start --reset` if needed."} } ``` No external CLI dependency. Clear error if hero_db is down — operator drives recovery. 3. **Remove the agent.nu / claude fallback path** entirely from this flow. service_livekit should NOT try to auto-start hero_db; that's hero_proc's job via the dependency list. ## Acceptance criteria - [ ] `service_livekit start --reset --update` succeeds on: - TF Grid root deploy (hero_db on 6378) - Hetzner root deploy (no redis-cli installed) - Generic Ubuntu driver-user deploy (no redis-cli, no claude) - [ ] `service_livekit.nu` no longer references `redis-cli` or `claude` - [ ] `HERO_DB_PORT` documented as the override knob in the service module's docstring - [ ] If hero_db is genuinely down, the error message points the operator at `service_db start --reset` ## References - Observed during the herodemo sweep on 2026-04-30 — sweep stopped at livekit phase 2 (`hero_demo#46`) - Kristof's Hetzner deploy hit the same `service_livekit.nu:132` line - `agent.nu:104` shows the unwanted `^claude` invocation in the fallback path Signed-off-by: mik-tf