[hero_codescalers] Fix duplicate instance appearing in hero_router #18
Problem
A second, unexpected hero_codescalers instance appears in the hero_router
dashboard. This is caused by a registration issue.
Steps to fix
instance N uses socket dir `hero_codescalers_serverN/`
Acceptance Criteria
Implementation Spec for Issue #18
Objective
Eliminate the duplicate/ghost `hero_codescalers` entry in the hero_router dashboard by detecting and pruning stale legacy socket directories (`hero_codescalers_server[/_N]`) left over from before PR #21, and by hardening `--start` so a fresh start cannot leave such directories behind in the future.
Root Cause
(A) Stale-on-disk legacy directory is the dominant root cause. Ranked analysis:
(A) Confirmed: stale legacy directory. Before commit `7edb4e3` (PR #16, "fix(cli): align CLI socket dir default with the server's"), the CLI's `sock_dir_name(0)` returned `"hero_codescalers_server"` and exported it via `HERO_CODESCALERS_SOCK_NAME` to the spawned server/UI children, causing them to bind their UDS at `$HERO_SOCKET_DIR/hero_codescalers_server/{rpc,ui}.sock`. After the fix, the canonical directory is `hero_codescalers/` (see `crates/hero_codescalers/src/main.rs:42-51`). On any host that previously ran the old binary and was upgraded after PR #16/#21, `$HERO_SOCKET_DIR/hero_codescalers_server/` still exists with the old socket files. `hero_router` scans every subdirectory of `$HERO_SOCKET_DIR` (per `hero_sockets` skill section 4 and `hero_router` skill probing strategy), so it discovers BOTH `hero_codescalers/` (live) and `hero_codescalers_server/` (ghost). The ghost shows up as Inactive (no listener) but still appears in the dashboard as a second entry. For non-zero instances the legacy names were `hero_codescalers_server1`, `hero_codescalers_server2`, etc. (no underscore between `server` and the number), so any of those may also be lingering.
(B) Not confirmed. `self_start` calls `hp.restart_service(&svc_name, ...)` (`crates/hero_codescalers/src/main.rs:406`). The hero_proc service is keyed by `svc_name`, so calling `--start` repeatedly is idempotent: it replaces the existing service definition and does not register a second copy.
(C) Not active. `sock_dir_name(instance)` (lines 42-51) yields a unique directory per instance (`hero_codescalers`, `hero_codescalers_1`, `hero_codescalers_2`, …). Server/UI actions for the same instance share that one directory but bind different filenames (`rpc.sock`, `ui.sock`), which is by design.
(D) Cosmetic only. Both `crates/hero_codescalers_server/heroservice.json` and `crates/hero_codescalers_ui/heroservice.json` are static, embedded at compile time, and always declare `"name": "hero_codescalers"`, even for instance 1, instance 2, etc. `hero_router` keys services by their socket directory name (URL routing prefix `/<service>/...`), so two running instances are distinguished correctly. The static `name` field is only informational. Out of scope for this fix unless the dashboard renders the manifest name as the primary label.
Requirements
- Detect directories under `$HERO_SOCKET_DIR` whose names match the legacy pattern (`hero_codescalers_server`, `hero_codescalers_server<N>`, `hero_codescalers_server_<N>`).
- Emit an `eprintln!` warning during `--start` showing the offending path(s).
- Make pruning opt-in (via a flag, `--prune-legacy-sockets`) so an operator can confirm the action.
- Run the check on `--start` only: never at runtime in the server or UI binary, and never in any subcommand path.
- Keep `--start` idempotent (already in place via `restart_service`).
Files to Modify/Create
- `crates/hero_codescalers/src/main.rs`: add a legacy-directory scan/warn (and optional prune) helper, invoke it from `self_start`, add a CLI flag to opt into pruning.
- `README.md`: add a brief "Migrating from pre-PR#16 layout" subsection under "Unix socket directories" pointing operators at the warning and the optional prune flag.
(The `heroservice.json` files do not need to change for this fix.)
Implementation Plan
Step 1: Add a legacy-directory detector helper
Files: `crates/hero_codescalers/src/main.rs`
Changes:
- Add a helper, `legacy_sock_dirs(base: &str) -> Vec<PathBuf>`, that:
  - Reads the entries of `base = socket_base()`.
  - Returns the directories whose name starts with `"hero_codescalers_server"` (covers `hero_codescalers_server`, `hero_codescalers_server1`, `hero_codescalers_server_1`, etc.). Use `starts_with("hero_codescalers_server")` AND require the remainder of the name to be empty, a run of ASCII digits, or an underscore followed by ASCII digits. A check on the next character alone is not enough: it would accept an underscore and so still match a hypothetical future `hero_codescalers_server_admin`.
  - Never returns canonical names (`hero_codescalers`, `hero_codescalers_<N>`). Since the legacy names all start with `hero_codescalers_server`, no canonical name will match, but document this invariant in a comment.
- Add unit tests for the name classifier (a pure function over `&str`) under `#[cfg(test)] mod tests` near the existing `parse_duration_ms` tests.
Dependencies: none.
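The detector described in Step 1 could be sketched as follows. This is a minimal, std-only illustration and not the code from the repository: `legacy_sock_dirs` takes its name and signature from the spec, while `LEGACY_PREFIX` and `is_legacy_sock_dir_name` are hypothetical names introduced here. The classifier checks the whole remainder of the name (empty, digits, or an underscore followed by digits), which is what keeps a name such as `hero_codescalers_server_admin` from matching.

```rust
use std::fs;
use std::path::PathBuf;

/// Prefix shared by every legacy socket directory name (hypothetical constant).
const LEGACY_PREFIX: &str = "hero_codescalers_server";

/// Returns true for `hero_codescalers_server`, `hero_codescalers_server<N>`
/// and `hero_codescalers_server_<N>`, where `<N>` is one or more ASCII digits.
///
/// Invariant: canonical names (`hero_codescalers`, `hero_codescalers_<N>`)
/// never start with `LEGACY_PREFIX`, so they can never be flagged.
fn is_legacy_sock_dir_name(name: &str) -> bool {
    let Some(rest) = name.strip_prefix(LEGACY_PREFIX) else {
        return false;
    };
    if rest.is_empty() {
        return true; // instance 0: the bare prefix
    }
    // Allow a single optional underscore before the instance number.
    let digits = rest.strip_prefix('_').unwrap_or(rest);
    !digits.is_empty() && digits.bytes().all(|b| b.is_ascii_digit())
}

/// Lists legacy socket directories directly under `base`, sorted by path.
/// An unreadable or missing `base` yields an empty list instead of an error,
/// so a clean host produces no warnings.
fn legacy_sock_dirs(base: &str) -> Vec<PathBuf> {
    let Ok(entries) = fs::read_dir(base) else {
        return Vec::new();
    };
    let mut found: Vec<PathBuf> = entries
        .filter_map(Result::ok)
        // `DirEntry::file_type` does not follow symlinks; only real directories count.
        .filter(|e| e.file_type().map(|t| t.is_dir()).unwrap_or(false))
        .filter(|e| e.file_name().to_str().map_or(false, is_legacy_sock_dir_name))
        .map(|e| e.path())
        .collect();
    found.sort();
    found
}
```

Keeping the name check separate from the directory scan means the unit tests can cover the classifier without touching the filesystem.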
Step 2: Wire the detector into `self_start` (warn by default)
Files: `crates/hero_codescalers/src/main.rs`
Changes:
- In `self_start` (line 402), before calling `hp.restart_service(...)`:
  - Run the legacy-directory detector from Step 1.
  - For each legacy directory found, emit an `eprintln!("warning: ...")` block: name the legacy directory, explain that it is from a pre-PR#16 layout, suggest running `hero_codescalers --start --prune-legacy-sockets` (or `rm -rf <path>`) to remove it, and warn that it will appear as a ghost service in `hero_router` until removed.
Dependencies: Step 1.
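As an illustration of the warning step, the sketch below builds and prints one message per legacy directory. The function names `legacy_warning` and `warn_legacy_dirs` and the exact wording are assumptions made for this example; only the content of the message (path, pre-PR#16 origin, the two removal options, the ghost entry in hero_router) comes from the spec.

```rust
use std::path::{Path, PathBuf};

/// Builds the warning text for one legacy directory (illustrative wording).
fn legacy_warning(path: &Path) -> String {
    let p = path.display();
    format!(
        "warning: {p} is a legacy socket dir from the pre-PR#16 layout; \
         it will appear as a ghost service in hero_router until removed. \
         Run `hero_codescalers --start --prune-legacy-sockets` (or `rm -rf {p}`) to remove it."
    )
}

/// Prints one warning per legacy directory to stderr and returns how many
/// were printed. An empty slice prints nothing, which covers the clean-host case.
fn warn_legacy_dirs(dirs: &[PathBuf]) -> usize {
    for dir in dirs {
        eprintln!("{}", legacy_warning(dir));
    }
    dirs.len()
}
```

Returning the message as a `String` from a separate function is a design choice for testability: the text can be asserted on without capturing stderr.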
Step 3: Add an opt-in `--prune-legacy-sockets` flag
Files: `crates/hero_codescalers/src/main.rs`
Changes:
- Add a `#[arg(long)] prune_legacy_sockets: bool` field on `Cli` (around lines 119-147), global so it can pair with `--start`. Document it in the long help.
- In `self_start` (signature change to accept the bool, plumb through from `main`), after the warn step, if `prune_legacy_sockets == true`: call `std::fs::remove_dir_all(path)` on each legacy directory. On success, log `eprintln!("removed legacy socket dir: {path}")`. On failure, log a warning but continue.
Dependencies: Steps 1, 2.
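The prune step might look like the sketch below, assuming the list of legacy directories has already been computed and the flag value has been plumbed in as a plain `bool`. `prune_legacy_dirs` is a hypothetical name, and the CLI parsing itself is left out so that the example depends only on the standard library.

```rust
use std::fs;
use std::path::PathBuf;

/// Removes each legacy directory when `prune` is true; does nothing otherwise.
/// A failed removal is logged and skipped so that `--start` can still proceed.
/// Returns the paths that were actually removed.
fn prune_legacy_dirs(dirs: &[PathBuf], prune: bool) -> Vec<PathBuf> {
    let mut removed = Vec::new();
    if !prune {
        return removed; // opt-in only: without the flag nothing is deleted
    }
    for dir in dirs {
        match fs::remove_dir_all(dir) {
            Ok(()) => {
                eprintln!("removed legacy socket dir: {}", dir.display());
                removed.push(dir.clone());
            }
            Err(err) => {
                eprintln!("warning: could not remove {}: {err}", dir.display());
            }
        }
    }
    removed
}
```

Returning the removed paths rather than a count lets a caller or a test confirm exactly which directories were deleted.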
Step 4: Document the migration in README
Files: `README.md`
Changes:
- Note that older versions used `hero_codescalers_server[N]/` as the socket dir.
- Tell operators to run `hero_codescalers --start` (it will warn) and optionally pass `--prune-legacy-sockets` to clean up.
Dependencies: Step 3 (so the flag name in the doc is accurate).
Step 5: Add a regression test
Files: `crates/hero_codescalers/src/main.rs` (test module)
Changes:
- Unit tests asserting that `hero_codescalers_server`, `hero_codescalers_server1`, `hero_codescalers_server_2` are flagged as legacy; that `hero_codescalers`, `hero_codescalers_1`, `hero_codescalers_99`, and unrelated names like `hero_router`, `hero_codescalers_audit` are NOT flagged.
- Optionally, a `tempfile`-based integration test that creates a fake `$HERO_SOCKET_DIR` with a mix of canonical and legacy directories, runs the helper, and asserts the returned list. Skip if the crate does not already pull `tempfile`.
Dependencies: Step 1.
Acceptance Criteria
- Only the canonical `hero_codescalers[/_N]` entry/entries appear in `hero_router` after running `hero_codescalers --start --prune-legacy-sockets` once.
- `hero_codescalers --start` on a clean host (no legacy dir) is a no-op for the new code path: no spurious warnings.
- `hero_codescalers --start` (without the flag) on a host with a legacy dir prints a clear, actionable warning naming the offending path(s) and exits successfully.
- `cargo test -p hero_codescalers` passes (legacy-name classifier unit tests).
- No regression in `--start`/`--stop` semantics or the `restart_service` idempotency.
Notes
- This work overlaps with branch `development_fix_codescalers_service_name` (PR #21, unmerged). Either rebase this work on top of #21 once merged, or accept that the two PRs touch overlapping regions of `crates/hero_codescalers/src/main.rs` and may conflict. Both fixes target the same root cause from different angles: PR #21 catches a stale `HERO_CODESCALERS_SOCK_NAME` env var; this fix catches a stale on-disk directory.
- A possible follow-up is to make `heroservice.json` dynamic (write it from the server at startup using `args.sockname` instead of a static `include_str!`). That is a separate, larger change and is out of scope for issue #18.
- The `hero_proc` action names (`hero_codescalers_server[_N]`) are unchanged and are not the cause of any duplication: they live inside hero_proc, not on the socket filesystem, and the hero_router dashboard does not key off them.
- Pruning is opt-in because an unconditional `remove_dir_all` on `$HERO_SOCKET_DIR/hero_codescalers_server` would silently delete any orphaned socket files even if the operator is running an old version of the binary in another window. The opt-in flag puts the operator in control.