service start returns 'Started' even when the process fails to bind; status then reports 'inactive' (misleading) #92
## Repro

1. A stale `hero_router` instance is already bound on :9988.
2. Run `hero_proc service restart hero_router` (or `start`) → prints `Started: hero_router` ✓
3. Run `hero_proc service status hero_router` → reports `State: ○ inactive`, `Error: Address already in use (os error 98)` — the bind failed, but `start` never surfaced the error.

## Why
The state logic at `rpc/service.rs:489` derives state purely from the DB. That logic is correct given the DB view. The issue is upstream: the `start` command returns synchronously after fork(), before the spawned process has a chance to bind or pass a health check. When the spawn fails immediately (port collision, missing dependency, panic on startup, missing config), the user sees `Started: <name>` and only finds out it's broken if they think to check status — and even then, "inactive" is ambiguous: did it never start, or did it run and exit cleanly?

## Suggested fix (one of)
1. Make `start`/`restart` block briefly — wait until the first health probe completes (or a configurable timeout) before returning, and surface the failure in the response.
2. Add a `failed-to-start` state distinct from `inactive`. Status would then say `State: ✗ failed-to-start (Address already in use)` instead of an indistinguishable "inactive".
3. Both: `start` waits for the first health probe, AND status differentiates fail-to-start from never-ran/clean-exit.

Either way, the user-visible contract should be: a successful `Started: <name>` means the process is actually serving, not just that fork() succeeded.

## Today's example
Earlier in the session:
- `restart hero_router` returned `Started: hero_router`, but the new instance failed to bind because PID 682562 from yesterday was holding port 9988.
- `service status hero_router` then reported `inactive` even though something was serving HTTP 200 on 9988 (the unsupervised PID 682562).
- The whole picture only became visible by running the binary in the foreground and seeing the EADDRINUSE.

Filing for owner input — no patch yet.
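For concreteness, option 1 could look roughly like the sketch below. This is not a patch and none of these names (`StartOutcome`, `wait_until_healthy`, the probe closure) are real hero_proc APIs; it only illustrates the shape of blocking after spawn until the first health probe resolves, with a fail-to-start outcome distinct from a clean exit:

```rust
use std::process::{Child, Command};
use std::thread::sleep;
use std::time::{Duration, Instant};

// Hypothetical outcome type: "Started" only once the process is
// actually healthy, never merely because the spawn succeeded.
#[derive(Debug, PartialEq)]
enum StartOutcome {
    Started,               // passed its first health probe
    FailedToStart(String), // distinct from `inactive`: never became healthy
}

/// Poll the child and a health probe until the probe passes, the
/// child dies, or a (configurable) timeout expires.
fn wait_until_healthy(
    child: &mut Child,
    probe: impl Fn() -> bool,
    timeout: Duration,
) -> StartOutcome {
    let deadline = Instant::now() + timeout;
    loop {
        // If the child already exited (e.g. EADDRINUSE on bind),
        // surface that instead of reporting "Started".
        if let Ok(Some(status)) = child.try_wait() {
            return StartOutcome::FailedToStart(format!(
                "process exited during startup: {status}"
            ));
        }
        if probe() {
            return StartOutcome::Started;
        }
        if Instant::now() >= deadline {
            return StartOutcome::FailedToStart(
                "health probe did not pass before timeout".into(),
            );
        }
        sleep(Duration::from_millis(100));
    }
}

fn main() {
    // Demo: a child that exits immediately can never become healthy,
    // so `start` would report failed-to-start rather than Started.
    let mut child = Command::new("false").spawn().expect("spawn");
    let outcome = wait_until_healthy(&mut child, || false, Duration::from_secs(2));
    assert!(matches!(outcome, StartOutcome::FailedToStart(_)));
}
```

The probe here stands in for whatever readiness signal the service already has (a TCP connect to its port would be enough for the :9988 case); the important part is that the CLI response is derived from the probe result, not from the fork.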