service start returns 'Started' even when the process fails to bind; status then reports 'inactive' (misleading) #92

Open
opened 2026-05-03 18:48:34 +00:00 by zaelgohary · 0 comments
Member

Repro

  1. Have a process bound to a port outside hero_proc supervision (e.g. an old hero_router instance on :9988)
  2. hero_proc service restart hero_router (or start)
  3. Output: Started: hero_router
  4. Run hero_proc service status hero_router → reports State: ○ inactive
  5. The new process actually died immediately with Error: Address already in use (os error 98) — the bind failed, but start never surfaced the error
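The bind failure in step 5 is easy to reproduce in isolation. The sketch below simulates the collision with two listeners on the same address (any free port stands in for :9988; this is not hero_proc code, just the underlying OS behavior):

```rust
use std::net::TcpListener;

fn main() {
    // First listener plays the role of the stale, unsupervised hero_router.
    let stale = TcpListener::bind("127.0.0.1:0").expect("first bind");
    let addr = stale.local_addr().expect("local addr");

    // Second bind is what the freshly started instance attempts. It fails
    // immediately; this is the error `start` currently swallows.
    match TcpListener::bind(addr) {
        Ok(_) => println!("unexpectedly bound twice"),
        Err(e) => println!("Error: {e}"), // on Linux: "Address already in use (os error 98)"
    }
}
```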

Why

rpc/service.rs:489 derives state purely from the DB:

let running = service_running_jobs(db, context, &name);   // active jobs in DB
let base_state = if !running.is_empty() { "running" }
    else { service_last_terminal_state(...) };            // returns "inactive" with no terminal jobs

That logic is correct given the DB view. The issue is upstream: the start command returns synchronously after fork(), before the spawned process has a chance to bind/health-check. When the spawn fails immediately (port collision, missing dep, panic-on-startup, missing config), the user sees Started: <name> and only finds out it's broken if they think to check status — and even then, "inactive" is ambiguous: did it never start, or did it run and exit cleanly?
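One way to remove that ambiguity is a distinct terminal state. A minimal sketch, assuming a hypothetical `ServiceState` enum and `derive_state` helper (hero_proc's real status values are plain strings in rpc/service.rs; the extra `died_before_first_health` flag is the bit the current code lacks):

```rust
/// Hypothetical state model; not hero_proc's actual types.
#[derive(Debug, PartialEq)]
enum ServiceState {
    Running,
    Inactive,              // ran and exited cleanly, or was never started
    FailedToStart(String), // spawned but died before its first health probe
}

/// Derive a state from the DB view plus the last job's exit record.
fn derive_state(
    has_running_jobs: bool,
    last_exit_error: Option<String>,
    died_before_first_health: bool,
) -> ServiceState {
    if has_running_jobs {
        ServiceState::Running
    } else if died_before_first_health {
        ServiceState::FailedToStart(last_exit_error.unwrap_or_else(|| "unknown".into()))
    } else {
        ServiceState::Inactive
    }
}

fn main() {
    // The repro case: no running jobs, and the job died on bind before any probe.
    let s = derive_state(
        false,
        Some("Address already in use (os error 98)".into()),
        true,
    );
    println!("{s:?}"); // FailedToStart("Address already in use (os error 98)")
}
```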

Suggested fix (one of)

  1. Make start/restart block briefly — wait until the first health probe completes (or a configurable timeout) before returning. Surface the failure in the response.
  2. Add a failed-to-start state distinct from inactive. Status would then say State: ✗ failed-to-start (Address already in use) instead of an indistinguishable "inactive".
  3. Both: start waits for first health, AND status differentiates fail-to-start from never-ran/clean-exit.

Either way, the user-visible contract should be: a successful Started: <name> means the process is actually serving, not just that fork() succeeded.
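The blocking variant of that contract could look like the sketch below: after spawning, poll until the service answers or a timeout expires, and only then report Started. This is a standalone illustration using a plain TCP connect as the "health probe"; `wait_for_first_health` is a hypothetical helper, and hero_proc's real probe machinery may differ:

```rust
use std::net::TcpStream;
use std::time::{Duration, Instant};

/// Poll the service's address until it accepts a connection or the
/// timeout expires. Hypothetical helper, not hero_proc's actual API.
fn wait_for_first_health(addr: &str, timeout: Duration) -> Result<(), String> {
    let deadline = Instant::now() + timeout;
    loop {
        match TcpStream::connect(addr) {
            Ok(_) => return Ok(()),
            Err(e) if Instant::now() >= deadline => {
                return Err(format!("failed to start: {e}"));
            }
            Err(_) => std::thread::sleep(Duration::from_millis(100)),
        }
    }
}

fn main() {
    // After fork/spawn, block briefly instead of reporting success right
    // away. ":9988" mirrors the port from the repro above.
    match wait_for_first_health("127.0.0.1:9988", Duration::from_secs(5)) {
        Ok(()) => println!("Started: hero_router"),
        Err(e) => eprintln!("Error: hero_router {e}"),
    }
}
```

With a sketch like this, the EADDRINUSE case surfaces in the start/restart response itself instead of requiring a follow-up status call.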

Today's example

Earlier in the session: restart hero_router returned Started: hero_router, but the new instance failed to bind because PID 682562 from yesterday was holding port 9988. service status hero_router then reported inactive even though something was serving HTTP 200 on 9988 (the unsupervised PID 682562). The whole picture only became visible by running the binary foreground and seeing the EADDRINUSE.

Filing for owner input — no patch yet.

Reference
lhumina_code/hero_proc#92