transport.rs: check_socket_or_die should return Result instead of eprintln! + exit(0) #95

Closed
opened 2026-05-20 16:15:44 +00:00 by sameh-farouk · 1 comment
Member

Summary

hero_rpc/openrpc/transport.rs::check_socket_or_die is a library function that:

  1. Prints multi-line ERROR output directly to stderr via eprintln!
  2. Kills the calling process via std::process::exit(0) on missing socket
  3. Exits with code 0 — pretending the failure was a clean stop

All three together make it unsafe for use in any reusable library function. Every caller (lab, hero_router scanner, hero_proxy, hero_db, hero_slides — anything using the openrpc SDK) loses the ability to:

  • Handle the missing-socket case gracefully
  • Format their own error message
  • Retry, fall back, or report-and-continue
  • Distinguish "succeeded" from "failed-and-died" via exit code

Where it bites users

A representative symptom: customer-fresh lab user init on a machine where hero_proc isn't running yet hits the secrets-into-hero_proc import step. The SDK call goes through check_socket_or_die → loud ERROR appears in the middle of an otherwise-successful install → exit(0) silently terminates lab user init → parent (lab_install.sh) sees exit 0 and reports success. Operators see contradictory signals (red error + green exit code) and lose trust in the install.

Reproduced today on kristof5 with the customer-install flow.

Why each of the three points is independently wrong

1. Library functions shouldn't print errors to stderr directly

The library has no idea who's calling it. The caller might be:

  • A CLI showing errors directly (fine — but their choice)
  • A test capturing stderr for assertions
  • A daemon logging structured JSON to a file
  • A wrapper that wants to suppress / reformat / contextualise the error

By eprintln!ing, the library forces its choice on every caller, who can't intercept, redirect, or suppress the message.

Right pattern: return Result::Err(...). Let the caller decide how to display.

2. Library functions shouldn't terminate the calling process

std::process::exit() immediately stops the program. It skips:

  • Destructors (file flushes, lock releases, temp-file cleanup)
  • The caller's error-handling logic
  • Retry / fallback / report-and-continue logic the caller might want

Consider a script that tries hero_proc and, if down, falls back to a remote API. With exit(), the script just dies — the fallback never runs.

Right pattern: return an error and let the caller decide whether to retry, fall back, or give up.

3. Exit code 0 means "success" — but the operation FAILED

Exit code 0 = "everything worked"; non-zero = "something failed". check_socket_or_die calls exit(0) when the socket DOESN'T exist — i.e., when the operation failed. So:

  • A parent shell if cmd; then ... thinks the command succeeded
  • A CI pipeline marks the step as passing
  • A supervisor sees a clean exit and logs success
  • lab_install.sh (the customer bootstrap) sees $? == 0 and continues normally

This is actively lying to every layer above. Worst of the three — it's not just bad ergonomics, it actively masks failures.

The function name _or_die even self-documents the design defect — _or_die semantics belong in a binary's main(), not in a library function shared across the ecosystem.

Workarounds in lab today (forced by this defect)

  • lab::service::hero_proc_exception::ping_hero_proc — a custom raw-socket+JSON-RPC ping that bypasses the SDK entirely just to avoid triggering check_socket_or_die. Comment at line 517-518 spells it out:

    "Never uses the generated SDK — hero_proc_factory calls check_socket_or_die which calls std::process::exit(0), killing lab."

  • lab::secrets::sync::ensure_daemon_reachable (Mahmoud's 9461c05 in hero_skills) — pre-checks socket existence before calling the SDK to avoid the loud exit.
  • lab::service::hero_proc_exception::is_hero_proc_alive (today's hero_skills 7eb1c29) — consolidates the pattern across lab.

Every Hero service that uses the openrpc SDK against another Hero service has to invent its own pre-check or accept the loud-exit behaviour. The right fix is upstream — here.

Proposed fix shape

Change check_socket_or_die(path: &str) to:

pub fn check_socket(path: &str) -> Result<(), TransportError> {
    use std::os::unix::fs::FileTypeExt;
    match std::fs::metadata(path) {
        Err(e) if e.kind() == std::io::ErrorKind::NotFound => {
            Err(TransportError::SocketNotFound(path.to_string()))
        }
        Err(e) => Err(TransportError::SocketAccessFailed { path: path.to_string(), source: e }),
        Ok(meta) if !meta.file_type().is_socket() => {
            Err(TransportError::NotASocket { path: path.to_string(), kind: ... })
        }
        Ok(_) => Ok(()),
    }
}

And update unix_socket() / unix_socket_with_path() to propagate the Err via the existing Result<Self, OpenRpcError> return type. No more eprintln!, no more exit(0).

Callers (SDK consumers) then handle the missing-socket case themselves — either matching on the new variants or letting ? propagate.

The error-text-display can move to callers that want loud output (e.g. lab could continue printing its current formatted message, but only when it chooses to, after receiving the Err).

Migration

  • Bump hero_rpc minor version
  • Update direct SDK consumers (hero_proc_sdk, hero_router transport, etc.) to handle the new Result return type
  • Remove ad-hoc pre-checks in lab (we'll keep them as defense-in-depth for now, but they become redundant once this lands)
  • lhumina_code/hero_skills@9461c05 — Mahmoud's first lab-side workaround
  • lhumina_code/hero_skills@7eb1c29 — today's consolidation of all lab-side workarounds via is_hero_proc_alive()
  • lhumina_code/hero_proc#61 — orphan-process accumulation issue, partially mitigated by reducing lab's restart frequency in hero_skills@2c25f2c — same family of "lab forced to work around hero_proc/hero_rpc behaviour"
## Summary `hero_rpc/openrpc/transport.rs::check_socket_or_die` is a library function that: 1. **Prints multi-line ERROR output directly to stderr** via `eprintln!` 2. **Kills the calling process** via `std::process::exit(0)` on missing socket 3. **Exits with code 0** — pretending the failure was a clean stop All three together make it unsafe for use in any reusable library function. Every caller (lab, hero_router scanner, hero_proxy, hero_db, hero_slides — anything using the openrpc SDK) loses the ability to: - Handle the missing-socket case gracefully - Format their own error message - Retry, fall back, or report-and-continue - Distinguish "succeeded" from "failed-and-died" via exit code ## Where it bites users A representative symptom: customer-fresh `lab user init` on a machine where hero_proc isn't running yet hits the secrets-into-hero_proc import step. The SDK call goes through `check_socket_or_die` → loud ERROR appears in the middle of an otherwise-successful install → `exit(0)` silently terminates `lab user init` → parent (`lab_install.sh`) sees exit 0 and reports success. Operators see contradictory signals (red error + green exit code) and lose trust in the install. Reproduced today on `kristof5` with the customer-install flow. ## Why each of the three points is independently wrong ### 1. Library functions shouldn't print errors to stderr directly The library has no idea who's calling it. The caller might be: - A CLI showing errors directly (fine — but their choice) - A test capturing stderr for assertions - A daemon logging structured JSON to a file - A wrapper that wants to suppress / reformat / contextualise the error By `eprintln!`ing, the library forces its choice on every caller, who can't intercept, redirect, or suppress the message. **Right pattern**: return `Result::Err(...)`. Let the caller decide how to display. ### 2. Library functions shouldn't terminate the calling process `std::process::exit()` immediately stops the program. It skips: - Destructors (file flushes, lock releases, temp-file cleanup) - The caller's error-handling logic - Retry / fallback / report-and-continue logic the caller might want Consider a script that tries hero_proc and, if down, falls back to a remote API. With `exit()`, the script just dies — the fallback never runs. **Right pattern**: return an error and let the caller decide whether to retry, fall back, or give up. ### 3. Exit code 0 means "success" — but the operation FAILED Exit code 0 = "everything worked"; non-zero = "something failed". `check_socket_or_die` calls `exit(0)` when the socket DOESN'T exist — i.e., when the operation **failed**. So: - A parent shell `if cmd; then ...` thinks the command succeeded - A CI pipeline marks the step as passing - A supervisor sees a clean exit and logs success - `lab_install.sh` (the customer bootstrap) sees `$? == 0` and continues normally This is **actively lying** to every layer above. Worst of the three — it's not just bad ergonomics, it actively masks failures. The function name `_or_die` even self-documents the design defect — `_or_die` semantics belong in a binary's `main()`, not in a library function shared across the ecosystem. ## Workarounds in lab today (forced by this defect) - `lab::service::hero_proc_exception::ping_hero_proc` — a custom raw-socket+JSON-RPC ping that bypasses the SDK entirely just to avoid triggering `check_socket_or_die`. Comment at line 517-518 spells it out: > *"Never uses the generated SDK — `hero_proc_factory` calls `check_socket_or_die` which calls `std::process::exit(0)`, killing lab."* - `lab::secrets::sync::ensure_daemon_reachable` (Mahmoud's `9461c05` in hero_skills) — pre-checks socket existence before calling the SDK to avoid the loud exit. - `lab::service::hero_proc_exception::is_hero_proc_alive` (today's hero_skills `7eb1c29`) — consolidates the pattern across lab. Every Hero service that uses the openrpc SDK against another Hero service has to invent its own pre-check or accept the loud-exit behaviour. The right fix is upstream — here. ## Proposed fix shape Change `check_socket_or_die(path: &str)` to: ```rust pub fn check_socket(path: &str) -> Result<(), TransportError> { use std::os::unix::fs::FileTypeExt; match std::fs::metadata(path) { Err(e) if e.kind() == std::io::ErrorKind::NotFound => { Err(TransportError::SocketNotFound(path.to_string())) } Err(e) => Err(TransportError::SocketAccessFailed { path: path.to_string(), source: e }), Ok(meta) if !meta.file_type().is_socket() => { Err(TransportError::NotASocket { path: path.to_string(), kind: ... }) } Ok(_) => Ok(()), } } ``` And update `unix_socket()` / `unix_socket_with_path()` to propagate the Err via the existing `Result<Self, OpenRpcError>` return type. No more `eprintln!`, no more `exit(0)`. Callers (SDK consumers) then handle the missing-socket case themselves — either matching on the new variants or letting `?` propagate. The error-text-display can move to callers that want loud output (e.g. `lab` could continue printing its current formatted message, but only when it chooses to, after receiving the Err). ## Migration - Bump hero_rpc minor version - Update direct SDK consumers (hero_proc_sdk, hero_router transport, etc.) to handle the new Result return type - Remove ad-hoc pre-checks in lab (we'll keep them as defense-in-depth for now, but they become redundant once this lands) ## Related - `lhumina_code/hero_skills@9461c05` — Mahmoud's first lab-side workaround - `lhumina_code/hero_skills@7eb1c29` — today's consolidation of all lab-side workarounds via `is_hero_proc_alive()` - `lhumina_code/hero_proc#61` — orphan-process accumulation issue, partially mitigated by reducing lab's restart frequency in `hero_skills@2c25f2c` — same family of "lab forced to work around hero_proc/hero_rpc behaviour"
Author
Member

Wider impact audit completed — 33 downstream consumers, all trivially-migrating.

Audited every Hero repo (17 local + 16 remote-only via Forgejo API) for code that depends on the current check_socket_or_die behavior:

Group Count Migration
Direct OpenRpcTransport::unix_socket(...) callers a handful (hero_router, hero_lib, derive macros, lab) Already use .await? — pick up new Err for free
Indirect via generated SDKs (hero_proc_sdk, hero_office_sdk, hero_osis_sdk, hero_memory_sdk, ...) ~33 All use ? propagation; no SDK regen required, no pattern-match on variants
Code that load-bears on the silent exit(0) 0 None found

Key finding — the existing integration test at crates/openrpc/tests/openrpc_transport.rs:333 (test_unix_socket_connection_error, authored by @despiegk in 710381e, 2026-05-01) ALREADY asserts exactly the Result-returning contract this issue proposes:

let result = OpenRpcTransport::unix_socket("/nonexistent/path/test.sock").await;
assert!(result.is_err());
match result.unwrap_err() {
    OpenRpcError::Connection(msg) => { ... }
    other => panic!("Expected Connection error, got: {:?}", other),
}

The exit(0) introduced in caa028f (2026-05-15) silently broke this test — cargo reports it as "running 2 tests" then the test binary terminates after the first one without printing "test result: ok". So the fix not only restores correct library semantics but also un-breaks the existing test infrastructure.

Rollout impact, beyond the fix itself:

The code change is non-breaking and trivial in hero_rpc (~20 LOC). But end-user behavior only flips once each downstream binary is rebuilt against the new hero_rpc. ~10 of the 33 repos have @mik2's lab-publish.yaml CI workflow installed (auto-republish on push to development); the other ~23 don't. Propagating that workflow is a separate, valuable follow-up.

Concrete fix plan I'll attach as a PR:

  1. Rename check_socket_or_diecheck_socket_reachable returning Result<(), OpenRpcError> (reuses existing OpenRpcError::Connection(String) variant — no new enum variant, no breaking change)
  2. Convert both unix_socket() and unix_socket_with_path() to ?-propagate, including the inner UnixStream::connect map_err that also currently calls exit(0)
  3. TDD: fix the broken unit test + extend integration coverage (missing socket / stale file / unconnectable socket)
  4. Update doc comments to match new semantics

Will open a PR once verified end-to-end on a dev box.

**Wider impact audit completed — 33 downstream consumers, all trivially-migrating.** Audited every Hero repo (17 local + 16 remote-only via Forgejo API) for code that depends on the current `check_socket_or_die` behavior: | Group | Count | Migration | |---|---|---| | Direct `OpenRpcTransport::unix_socket(...)` callers | a handful (`hero_router`, `hero_lib`, derive macros, lab) | Already use `.await?` — pick up new `Err` for free | | Indirect via generated SDKs (`hero_proc_sdk`, `hero_office_sdk`, `hero_osis_sdk`, `hero_memory_sdk`, ...) | ~33 | All use `?` propagation; no SDK regen required, no pattern-match on variants | | Code that load-bears on the silent `exit(0)` | **0** | None found | **Key finding** — the existing integration test at `crates/openrpc/tests/openrpc_transport.rs:333` (`test_unix_socket_connection_error`, authored by @despiegk in `710381e`, 2026-05-01) ALREADY asserts exactly the Result-returning contract this issue proposes: ```rust let result = OpenRpcTransport::unix_socket("/nonexistent/path/test.sock").await; assert!(result.is_err()); match result.unwrap_err() { OpenRpcError::Connection(msg) => { ... } other => panic!("Expected Connection error, got: {:?}", other), } ``` The `exit(0)` introduced in `caa028f` (2026-05-15) silently broke this test — cargo reports it as "running 2 tests" then the test binary terminates after the first one without printing "test result: ok". So the fix not only restores correct library semantics but also un-breaks the existing test infrastructure. **Rollout impact, beyond the fix itself:** The code change is non-breaking and trivial in `hero_rpc` (~20 LOC). But end-user behavior only flips once each downstream binary is rebuilt against the new `hero_rpc`. ~10 of the 33 repos have @mik2's `lab-publish.yaml` CI workflow installed (auto-republish on push to development); the other ~23 don't. Propagating that workflow is a separate, valuable follow-up. **Concrete fix plan** I'll attach as a PR: 1. Rename `check_socket_or_die` → `check_socket_reachable` returning `Result<(), OpenRpcError>` (reuses existing `OpenRpcError::Connection(String)` variant — no new enum variant, no breaking change) 2. Convert both `unix_socket()` and `unix_socket_with_path()` to `?`-propagate, including the inner `UnixStream::connect` `map_err` that also currently calls `exit(0)` 3. TDD: fix the broken unit test + extend integration coverage (missing socket / stale file / unconnectable socket) 4. Update doc comments to match new semantics Will open a PR once verified end-to-end on a dev box.
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
lhumina_code/hero_blueprint#95
No description provided.