transport: check_socket_or_die calls std::process::exit on socket errors #109

Closed
opened 2026-05-21 11:33:42 +00:00 by omarz · 1 comment
Member

Problem

crates/openrpc/src/transport.rs::check_socket_or_die calls std::process::exit(0) whenever a Unix socket cannot be opened — file missing, EACCES on the path, or the entry is not a socket. It is invoked from unix_socket (line 149) and unix_socket_with_path (line 178) on commit f17dcd7.

A library calling process::exit is a footgun for any long-running consumer: a single failed connect() for one target's socket kills the entire host process. We hit this in hero_codescalers, where the daemon needs to talk to each user's hero_proc socket — when a user's home is 0700 or their daemon isn't running, the codescalers daemon dies with exit code 0 and hero_proc restarts it. Every in-flight RPC drops; the UI shows "RPC server unavailable" for a few seconds.

Reproduction

From any consumer that opens an OpenRpcClient against a missing or unreachable socket:

let client = OpenRpcClient::unix_socket("/path/that/does/not/exist").await;
// Process exits 0 here — no Err returned, no Result<> branch reachable.

Impact

  • Long-running daemons cannot survive a per-target socket failure even though the call is fallible by signature (Result<Self, OpenRpcError>).
  • The behaviour is undocumented at the call site — callers expecting Err are terminated instead.
  • Workarounds (pre-stat with tokio::net::UnixStream::connect + timeout) push responsibility back to every consumer, defeating the purpose of returning a Result.

Suggested fix

Return an OpenRpcError from check_socket_or_die (and rename it) so the existing Result signature is honoured. Move the eprintln! block to consumers that genuinely want CLI-friendly output — most servers want the structured error.

fn check_socket(path: &str) -> Result<(), OpenRpcError> {
    use std::os::unix::fs::FileTypeExt;
    match std::fs::metadata(path) {
        Err(e) if e.kind() == std::io::ErrorKind::NotFound =>
            Err(OpenRpcError::SocketNotFound(path.into())),
        Err(e) =>
            Err(OpenRpcError::SocketAccess { path: path.into(), source: e }),
        Ok(meta) if !meta.file_type().is_socket() =>
            Err(OpenRpcError::NotASocket(path.into())),
        Ok(_) => Ok(()),
    }
}

CLI tools that rely on the friendly exit-on-fail behaviour today can wrap OpenRpcClient::unix_socket(...)? themselves and print before bailing.

Context

  • Commit observed: f17dcd71aa18ce9527ddf7a2ce19ea607ff41d65 (branch development)
  • Workaround landed in hero_codescalers: pre-stat each per-user socket before calling the SDK (see crates/hero_codescalers_server/src/workspace.rs::ensure_user_daemon_reachable). The workaround can be removed once this lands.
## Problem `crates/openrpc/src/transport.rs::check_socket_or_die` calls `std::process::exit(0)` whenever a Unix socket cannot be opened — file missing, EACCES on the path, or the entry is not a socket. It is invoked from `unix_socket` (line 149) and `unix_socket_with_path` (line 178) on commit `f17dcd7`. A library calling `process::exit` is a footgun for any long-running consumer: a single failed `connect()` for one target's socket kills the entire host process. We hit this in `hero_codescalers`, where the daemon needs to talk to each user's hero_proc socket — when a user's home is `0700` or their daemon isn't running, the codescalers daemon dies with exit code 0 and hero_proc restarts it. Every in-flight RPC drops; the UI shows "RPC server unavailable" for a few seconds. ## Reproduction From any consumer that opens an `OpenRpcClient` against a missing or unreachable socket: ```rust let client = OpenRpcClient::unix_socket("/path/that/does/not/exist").await; // Process exits 0 here — no Err returned, no Result<> branch reachable. ``` ## Impact - Long-running daemons cannot survive a per-target socket failure even though the call is fallible by signature (`Result<Self, OpenRpcError>`). - The behaviour is undocumented at the call site — callers expecting `Err` are terminated instead. - Workarounds (pre-stat with `tokio::net::UnixStream::connect` + timeout) push responsibility back to every consumer, defeating the purpose of returning a `Result`. ## Suggested fix Return an `OpenRpcError` from `check_socket_or_die` (and rename it) so the existing `Result` signature is honoured. Move the `eprintln!` block to consumers that genuinely want CLI-friendly output — most servers want the structured error. ```rust fn check_socket(path: &str) -> Result<(), OpenRpcError> { use std::os::unix::fs::FileTypeExt; match std::fs::metadata(path) { Err(e) if e.kind() == std::io::ErrorKind::NotFound => Err(OpenRpcError::SocketNotFound(path.into())), Err(e) => Err(OpenRpcError::SocketAccess { path: path.into(), source: e }), Ok(meta) if !meta.file_type().is_socket() => Err(OpenRpcError::NotASocket(path.into())), Ok(_) => Ok(()), } } ``` CLI tools that rely on the friendly exit-on-fail behaviour today can wrap `OpenRpcClient::unix_socket(...)?` themselves and print before bailing. ## Context - Commit observed: `f17dcd71aa18ce9527ddf7a2ce19ea607ff41d65` (branch `development`) - Workaround landed in `hero_codescalers`: pre-stat each per-user socket before calling the SDK (see `crates/hero_codescalers_server/src/workspace.rs::ensure_user_daemon_reachable`). The workaround can be removed once this lands.
Member

Already fixed in dc83817 (closes #95). You cited commit f17dcd7 which predates the fix by one day.

Current state of crates/openrpc/src/transport.rs (HEAD 2b2aeb6):

$ grep -n "std::process::exit\|check_socket_or_die\|check_socket_reachable" crates/openrpc/src/transport.rs
23:fn check_socket_reachable(path: &str) -> Result<(), OpenRpcError>
141:        check_socket_reachable(path)?;
168:        check_socket_reachable(path)?;

check_socket_or_die was renamed → check_socket_reachable, returns Result<(), OpenRpcError> instead of calling exit(0). Both unix_socket() and unix_socket_with_path() propagate via ?. The inner UnixStream::connect .map_err blocks that also called exit(0) are gone too.

Audit at the time confirmed all 33 downstream consumers use ?-propagation through the generated SDKs — non-breaking. See #95 for the full architectural write-up.

Closing as duplicate of #95. If you see the exit(0) behavior on the latest binary, let me know — that would mean the fix got reverted somewhere.

**Already fixed in `dc83817` (closes #95).** You cited commit `f17dcd7` which predates the fix by one day. Current state of `crates/openrpc/src/transport.rs` (HEAD `2b2aeb6`): ``` $ grep -n "std::process::exit\|check_socket_or_die\|check_socket_reachable" crates/openrpc/src/transport.rs 23:fn check_socket_reachable(path: &str) -> Result<(), OpenRpcError> 141: check_socket_reachable(path)?; 168: check_socket_reachable(path)?; ``` `check_socket_or_die` was renamed → `check_socket_reachable`, returns `Result<(), OpenRpcError>` instead of calling `exit(0)`. Both `unix_socket()` and `unix_socket_with_path()` propagate via `?`. The inner `UnixStream::connect` `.map_err` blocks that also called `exit(0)` are gone too. Audit at the time confirmed all 33 downstream consumers use `?`-propagation through the generated SDKs — non-breaking. See [#95](https://forge.ourworld.tf/lhumina_code/hero_rpc/issues/95) for the full architectural write-up. Closing as duplicate of #95. If you see the exit(0) behavior on the latest binary, let me know — that would mean the fix got reverted somewhere.
Sign in to join this conversation.
No milestone
No project
No assignees
2 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
lhumina_code/hero_blueprint#109
No description provided.