lab service destructively deletes hero_proc rpc.sock on false-negative liveness probe #255
Reference
lhumina_code/hero_skills#255
Observed (session 95 / hero_code sweep, lab build #50469)

While bootstrapping the hero_proc → hero_db → hero_code dep chain via `lab service`, lab decided the running hero_proc daemon was "not running" (the probe failed for an unrelated reason: `screen` was not installed yet, a separate issue), then proceeded with its cleanup-and-relaunch sequence.

The "removing leftover socket" step deleted the live `rpc.sock` BEFORE confirming a replacement could be launched. With `screen` not installed, the launch failed but the socket was already gone, so existing hero_proc clients (e.g. `~/hero/bin/hero_proc service list`) started failing with `Connection error: No such file or directory`.

`lab service resetall` cleanly recovered (wipes hero_proc DB + sockets + restarts), but the immediate post-cleanup state was broken until that recovery.

Also: `pkill` matches process names truncated to 15 chars (kernel limit). The literal pattern `hero_proc_server` is 16 chars and never matches by name, so the result is a silent zero-kill. `pkill -f hero_proc_server` matches the command line; that's the right flag.

Suggested fixes

- Probe the socket with `connect()` (not just file existence) before deciding "not running". If a process is `accept()`ing on it, the daemon is alive.
- Before removing the socket, run an `lsof`/`fuser` check, or `connect()` + timeout. Don't delete if owned.
- Use `pkill -f` (or the long form, `pkill --full`) for >15-char process names.

Why this matters for the sweep

Every repo's smoke gate via `lab service … --start` will go through this dep-bootstrap path. A flaky liveness probe on a developer's machine becomes destructive instead of merely failing to start.

Refs: lhumina_code/hero_proc#102 (sweep tracker), lhumina_code/hero_code#15 (where this surfaced).
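The three suggested fixes could look roughly like the Python sketch below. It is not lab's actual code: `socket_is_live`, `remove_if_stale` and `pkill_argv` are hypothetical helper names, and the sketch uses `connect()` rather than an `lsof`/`fuser` check. Only "file missing" and "connection refused" count as dead; any ambiguous error (timeout, permissions) is treated as alive so the socket is never deleted on doubt.

```python
import os
import socket

COMM_MAX = 15  # Linux truncates the process name (comm) to 15 characters


def socket_is_live(path, timeout=1.0):
    """Return True unless we are sure nobody listens on the Unix socket."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        try:
            s.connect(path)
        except (FileNotFoundError, ConnectionRefusedError):
            # No socket file, or a stale file with no listener behind it.
            return False
        except OSError:
            # Timeout, permission error, etc.: ambiguous, so assume alive.
            return True
    return True


def remove_if_stale(path, timeout=1.0):
    """Unlink the socket file only if no daemon answers. True if removed."""
    if socket_is_live(path, timeout):
        return False
    try:
        os.unlink(path)
    except FileNotFoundError:
        return False
    return True


def pkill_argv(pattern):
    """Build a pkill command line, adding -f when the literal name is too
    long to ever match the 15-char comm field."""
    if len(pattern) > COMM_MAX:
        # -f matches the full command line instead of the truncated name.
        return ["pkill", "-f", pattern]
    return ["pkill", pattern]
```

With these helpers, `remove_if_stale("rpc.sock")` leaves a socket alone while a daemon is listening on it, and `pkill_argv("hero_proc_server")` yields `["pkill", "-f", "hero_proc_server"]`. Note that `-f` matches anywhere in the command line, so it can also hit unrelated processes whose arguments contain the pattern.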
Fix in PR #257 — awaiting squash-merge gate.
Verified under lab build #54729 during testing on the hero_proc#102 sweep:

- `lab service hero_code --start` now starts both server and admin via a single invocation (8/8 smoke checks).
- `start_hero_proc` no longer destroys a live hero_proc socket on a false-negative liveness probe.
- A missing `screen` fails fast with a clear pointer to `lab install base` BEFORE any state cleanup.

mik-tf referenced this issue from lhumina_code/hero_demo 2026-05-16 00:33:48 +00:00
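The ordering verified above (prerequisites first, then the liveness probe, and cleanup only once both say it is safe) can be illustrated with a small Python sketch. This is an assumption about the shape of the fix, not the code from PR #257: `start_daemon`, `MissingPrerequisite` and the `is_live`, `cleanup` and `launch` callbacks are all hypothetical names.

```python
import shutil


class MissingPrerequisite(RuntimeError):
    """Raised before any state is touched when a required tool is absent."""


def start_daemon(socket_path, is_live, cleanup, launch,
                 required_tools=("screen",)):
    """Start a daemon without destroying state that is still in use.

    is_live(path) -> bool, cleanup(path) and launch() are supplied by the
    caller; they are placeholders for whatever the real tool does.
    """
    # 1. Preflight: fail fast while nothing has been modified yet.
    missing = [tool for tool in required_tools if shutil.which(tool) is None]
    if missing:
        raise MissingPrerequisite(
            "missing tools: " + ", ".join(missing) + "; try `lab install base`"
        )

    # 2. Probe: a live daemon means there is nothing to clean up or start.
    if is_live(socket_path):
        return "already-running"

    # 3. Only now is it safe to remove leftovers and launch a replacement.
    cleanup(socket_path)
    launch()
    return "started"
```

Because the preflight raises before step 3, a machine without `screen` keeps its existing `rpc.sock` and existing clients keep working.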