service.define hardcodes Interpreter::Exec, breaks any caller passing shell-shaped exec strings (uc12-16 root cause) #113

Closed
opened 2026-05-21 12:57:56 +00:00 by sameh-farouk · 1 comment
Member

Root cause for uc12-16 (and likely uc17+, etc.) — confirmed by isolated repro

When service.define RPC is called with a shell-shaped exec string (e.g. "while true; do sleep 1; done"), the spawned job fails immediately with spawn error: No such file or directory (os error 2) and the service never transitions to running.

Repro

# Clean PATH_ROOT sandbox, start hero_proc_server, then:
cargo run -p hero_proc_test -- --filter functional::uc_12_16::uc12_define_service_start_stop
# → FAIL "service 'svc12' running=false timeout after 10s"

The job record:

phase: failed
script: "while true; do sleep 1; done"
spec.interpreter: "exec"
error: "spawn error: No such file or directory (os error 2)"

Daemon log:

ERROR spawn error: No such file or directory (os error 2)  job_id=1

Root cause

crates/hero_proc_server/src/service/define.rs:143 (and lines 167, 188) hardcodes interpreter: Interpreter::Exec on the main / stop / check actions. The executor (supervisor/executor.rs:655-663) interprets Interpreter::Exec as literal execve: script.split_whitespace() → first token is the binary, rest are args. So "while true; do sleep 1; done" becomes execve("while", ["true;", "do", "sleep", "1;", "done"]), which fails with ENOENT because there's no binary literally named while.

The spec's own example (crates/hero_proc_server/src/service/SPECS.md:96):

service_define(&db, "nginx", "nginx -g 'daemon off;'", ...)

would also be broken under Interpreter::Exec semantics — the single quotes aren't stripped, so nginx would receive literal 'daemon and off;' as separate argv entries. So this defect is not test-only; it's a UX flaw of the high-level convenience API.

Affects

  • All service.define callers passing anything more shell-shaped than "binary arg1 arg2"
  • crates/hero_proc_test/src/tests/functional/uc_12_16_service_lifecycle.rs (uc12, uc13, uc14, uc16 — confirmed)
  • Likely many other uc_*_* tests that use service_define (haven't fully checked)
  • Likely real production services anyone defines via the convenience API

Proposed fix

Replace the hardcoded Interpreter::Exec with crate::db::actions::model::detect_interpreter(exec) in all 3 sites in define.rs. The existing detect_interpreter (crates/hero_proc_server/src/db/actions/model.rs:124-148) already exists and defaults to Bash when no shebang is present — so:

exec value detect_interpreter result Behavior
"while true; do sleep 1; done" Bash bash -c "while true; do sleep 1; done"
"#!/bin/sh\nrm -rf foo" Sh sh -c "#!/bin/sh\nrm -rf foo"
"sleep 100" Bash bash -c "sleep 100"
"nginx -g 'daemon off;'" Bash shell quoting works ✓

Interpreter::Exec would still be accessible via the lower-level db.actions.set(...) for users who want literal execve semantics.

Verification

After the fix, cargo run -p hero_proc_test --filter functional::uc_12_16::uc12_define_service_start_stop should pass on a clean PATH_ROOT sandbox. Expect 4 tests in the uc_12_16 cluster + possibly several others to flip from FAIL to PASS.

Origin

Introduced in cc836a7 (despiegk, 2026-05-14, refactor: consolidate crates, rewrite logging/scheduler/service/secrets modules) — the rewrite that created the service/define.rs module. The Interpreter::Exec choice was likely an oversight given the spec's own example would also fail under those semantics.

Surfacing rather than auto-patching since service/define.rs is hot Kristof surface (cc836a7 was 7 days ago).

cc @despiegk

## Root cause for uc12-16 (and likely uc17+, etc.) — confirmed by isolated repro When `service.define` RPC is called with a shell-shaped `exec` string (e.g. `"while true; do sleep 1; done"`), the spawned job fails immediately with `spawn error: No such file or directory (os error 2)` and the service never transitions to `running`. ### Repro ```bash # Clean PATH_ROOT sandbox, start hero_proc_server, then: cargo run -p hero_proc_test -- --filter functional::uc_12_16::uc12_define_service_start_stop # → FAIL "service 'svc12' running=false timeout after 10s" ``` The job record: ``` phase: failed script: "while true; do sleep 1; done" spec.interpreter: "exec" error: "spawn error: No such file or directory (os error 2)" ``` Daemon log: ``` ERROR spawn error: No such file or directory (os error 2) job_id=1 ``` ### Root cause `crates/hero_proc_server/src/service/define.rs:143` (and lines 167, 188) **hardcodes** `interpreter: Interpreter::Exec` on the main / stop / check actions. The executor (`supervisor/executor.rs:655-663`) interprets `Interpreter::Exec` as **literal execve**: `script.split_whitespace()` → first token is the binary, rest are args. So `"while true; do sleep 1; done"` becomes `execve("while", ["true;", "do", "sleep", "1;", "done"])`, which fails with ENOENT because there's no binary literally named `while`. The spec's own example (`crates/hero_proc_server/src/service/SPECS.md:96`): ```rust service_define(&db, "nginx", "nginx -g 'daemon off;'", ...) ``` would also be broken under `Interpreter::Exec` semantics — the single quotes aren't stripped, so `nginx` would receive literal `'daemon` and `off;'` as separate argv entries. So this defect is not test-only; it's a UX flaw of the high-level convenience API. ### Affects - All `service.define` callers passing anything more shell-shaped than `"binary arg1 arg2"` - `crates/hero_proc_test/src/tests/functional/uc_12_16_service_lifecycle.rs` (uc12, uc13, uc14, uc16 — confirmed) - Likely many other `uc_*_*` tests that use `service_define` (haven't fully checked) - Likely real production services anyone defines via the convenience API ### Proposed fix Replace the hardcoded `Interpreter::Exec` with `crate::db::actions::model::detect_interpreter(exec)` in all 3 sites in `define.rs`. The existing `detect_interpreter` (`crates/hero_proc_server/src/db/actions/model.rs:124-148`) already exists and defaults to `Bash` when no shebang is present — so: | `exec` value | detect_interpreter result | Behavior | |---|---|---| | `"while true; do sleep 1; done"` | Bash | `bash -c "while true; do sleep 1; done"` ✓ | | `"#!/bin/sh\nrm -rf foo"` | Sh | `sh -c "#!/bin/sh\nrm -rf foo"` ✓ | | `"sleep 100"` | Bash | `bash -c "sleep 100"` ✓ | | `"nginx -g 'daemon off;'"` | Bash | shell quoting works ✓ | `Interpreter::Exec` would still be accessible via the lower-level `db.actions.set(...)` for users who want literal execve semantics. ### Verification After the fix, `cargo run -p hero_proc_test --filter functional::uc_12_16::uc12_define_service_start_stop` should pass on a clean PATH_ROOT sandbox. Expect 4 tests in the uc_12_16 cluster + possibly several others to flip from FAIL to PASS. ### Origin Introduced in `cc836a7` (despiegk, 2026-05-14, `refactor: consolidate crates, rewrite logging/scheduler/service/secrets modules`) — the rewrite that created the `service/define.rs` module. The `Interpreter::Exec` choice was likely an oversight given the spec's own example would also fail under those semantics. Surfacing rather than auto-patching since `service/define.rs` is hot Kristof surface (cc836a7 was 7 days ago). cc @despiegk
Author
Member

Verified the proposed fix works. Patched service/define.rs lines 143/167/188 to use detect_interpreter() (the proposal in the OP), rebuilt, ran against clean PATH_ROOT sandbox:

Test Before patch After patch
uc12_define_service_start_stop FAIL (svc12 running=false 10s) PASS
uc13_restart_produces_new_job_ids FAIL (svc13 running=false 10s) PASS
uc14_stop_with_remove_jobs_deletes_records FAIL (svc14 running=false 10s) PASS at the running check — fails later at "expected at least one child job" (separate service.children defect, not this issue's scope)
uc16_system_class_excluded_from_stop_all FAIL (svc16-sys running=false 10s) PASS

So 3/4 of the uc12-16 cluster flip from FAIL→PASS with the 3-line change, and uc14 reaches a second-stage assertion exposing a separate service.children defect (worth its own ticket; not blocking this fix).

Patch reverted from my working tree (this verification was repro-only — leaving the actual change to whoever takes the issue, since service/define.rs is your hot surface).

**Verified the proposed fix works.** Patched `service/define.rs` lines 143/167/188 to use `detect_interpreter()` (the proposal in the OP), rebuilt, ran against clean PATH_ROOT sandbox: | Test | Before patch | After patch | |---|---|---| | `uc12_define_service_start_stop` | FAIL (svc12 running=false 10s) | **PASS** | | `uc13_restart_produces_new_job_ids` | FAIL (svc13 running=false 10s) | **PASS** | | `uc14_stop_with_remove_jobs_deletes_records` | FAIL (svc14 running=false 10s) | **PASS at the running check** — fails later at "expected at least one child job" (separate `service.children` defect, not this issue's scope) | | `uc16_system_class_excluded_from_stop_all` | FAIL (svc16-sys running=false 10s) | **PASS** | So 3/4 of the uc12-16 cluster flip from FAIL→PASS with the 3-line change, and uc14 reaches a second-stage assertion exposing a separate `service.children` defect (worth its own ticket; not blocking this fix). Patch reverted from my working tree (this verification was repro-only — leaving the actual change to whoever takes the issue, since `service/define.rs` is your hot surface).
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
lhumina_code/hero_proc#113
No description provided.