transport.rs: check_socket_or_die should return Result instead of eprintln! + exit(0) #95
Labels
No labels
prio_critical
prio_low
type_bug
type_contact
type_issue
type_lead
type_question
type_story
type_task
No milestone
No project
No assignees
1 participant
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
lhumina_code/hero_blueprint#95
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Summary
hero_rpc/openrpc/transport.rs::check_socket_or_dieis a library function that:eprintln!std::process::exit(0)on missing socketAll three together make it unsafe for use in any reusable library function. Every caller (lab, hero_router scanner, hero_proxy, hero_db, hero_slides — anything using the openrpc SDK) loses the ability to:
Where it bites users
A representative symptom: customer-fresh
lab user initon a machine where hero_proc isn't running yet hits the secrets-into-hero_proc import step. The SDK call goes throughcheck_socket_or_die→ loud ERROR appears in the middle of an otherwise-successful install →exit(0)silently terminateslab user init→ parent (lab_install.sh) sees exit 0 and reports success. Operators see contradictory signals (red error + green exit code) and lose trust in the install.Reproduced today on
kristof5with the customer-install flow.Why each of the three points is independently wrong
1. Library functions shouldn't print errors to stderr directly
The library has no idea who's calling it. The caller might be:
By
eprintln!ing, the library forces its choice on every caller, who can't intercept, redirect, or suppress the message.Right pattern: return
Result::Err(...). Let the caller decide how to display.2. Library functions shouldn't terminate the calling process
std::process::exit()immediately stops the program. It skips:Consider a script that tries hero_proc and, if down, falls back to a remote API. With
exit(), the script just dies — the fallback never runs.Right pattern: return an error and let the caller decide whether to retry, fall back, or give up.
3. Exit code 0 means "success" — but the operation FAILED
Exit code 0 = "everything worked"; non-zero = "something failed".
check_socket_or_diecallsexit(0)when the socket DOESN'T exist — i.e., when the operation failed. So:if cmd; then ...thinks the command succeededlab_install.sh(the customer bootstrap) sees$? == 0and continues normallyThis is actively lying to every layer above. Worst of the three — it's not just bad ergonomics, it actively masks failures.
The function name
_or_dieeven self-documents the design defect —_or_diesemantics belong in a binary'smain(), not in a library function shared across the ecosystem.Workarounds in lab today (forced by this defect)
lab::service::hero_proc_exception::ping_hero_proc— a custom raw-socket+JSON-RPC ping that bypasses the SDK entirely just to avoid triggeringcheck_socket_or_die. Comment at line 517-518 spells it out:lab::secrets::sync::ensure_daemon_reachable(Mahmoud's9461c05in hero_skills) — pre-checks socket existence before calling the SDK to avoid the loud exit.lab::service::hero_proc_exception::is_hero_proc_alive(today's hero_skills7eb1c29) — consolidates the pattern across lab.Every Hero service that uses the openrpc SDK against another Hero service has to invent its own pre-check or accept the loud-exit behaviour. The right fix is upstream — here.
Proposed fix shape
Change
check_socket_or_die(path: &str)to:And update
unix_socket()/unix_socket_with_path()to propagate the Err via the existingResult<Self, OpenRpcError>return type. No moreeprintln!, no moreexit(0).Callers (SDK consumers) then handle the missing-socket case themselves — either matching on the new variants or letting
?propagate.The error-text-display can move to callers that want loud output (e.g.
labcould continue printing its current formatted message, but only when it chooses to, after receiving the Err).Migration
Related
lhumina_code/hero_skills@9461c05— Mahmoud's first lab-side workaroundlhumina_code/hero_skills@7eb1c29— today's consolidation of all lab-side workarounds viais_hero_proc_alive()lhumina_code/hero_proc#61— orphan-process accumulation issue, partially mitigated by reducing lab's restart frequency inhero_skills@2c25f2c— same family of "lab forced to work around hero_proc/hero_rpc behaviour"Wider impact audit completed — 33 downstream consumers, all trivially-migrating.
Audited every Hero repo (17 local + 16 remote-only via Forgejo API) for code that depends on the current
check_socket_or_diebehavior:OpenRpcTransport::unix_socket(...)callershero_router,hero_lib, derive macros, lab).await?— pick up newErrfor freehero_proc_sdk,hero_office_sdk,hero_osis_sdk,hero_memory_sdk, ...)?propagation; no SDK regen required, no pattern-match on variantsexit(0)Key finding — the existing integration test at
crates/openrpc/tests/openrpc_transport.rs:333(test_unix_socket_connection_error, authored by @despiegk in710381e, 2026-05-01) ALREADY asserts exactly the Result-returning contract this issue proposes:The
exit(0)introduced incaa028f(2026-05-15) silently broke this test — cargo reports it as "running 2 tests" then the test binary terminates after the first one without printing "test result: ok". So the fix not only restores correct library semantics but also un-breaks the existing test infrastructure.Rollout impact, beyond the fix itself:
The code change is non-breaking and trivial in
hero_rpc(~20 LOC). But end-user behavior only flips once each downstream binary is rebuilt against the newhero_rpc. ~10 of the 33 repos have @mik2'slab-publish.yamlCI workflow installed (auto-republish on push to development); the other ~23 don't. Propagating that workflow is a separate, valuable follow-up.Concrete fix plan I'll attach as a PR:
check_socket_or_die→check_socket_reachablereturningResult<(), OpenRpcError>(reuses existingOpenRpcError::Connection(String)variant — no new enum variant, no breaking change)unix_socket()andunix_socket_with_path()to?-propagate, including the innerUnixStream::connectmap_errthat also currently callsexit(0)Will open a PR once verified end-to-end on a dev box.