lhumina_code/hero_proc

Fork 0

Service cascade-stop: stopping A should stop services that require A #65

New issue

Open

opened 2026-04-29 04:22:46 +00:00 by despiegk · 4 comments

despiegk commented

2026-04-29 04:22:46 +00:00

Owner

Context

When a user stops service A and other services declare requires: [A], those dependents should stop too (most-dependent-first). Today service.stop only stops the named service; dependents keep running with a now-broken dependency.

A test already exists but is #[ignore]d:

tests/integration/tests/dependencies.rs:51 — test_cascade_stop — #[ignore = "cascade stop not yet implemented"]

Repro (from the test)

Set up A ← B ← C (C requires B, B requires A).
Start all three; all reach running.
Call service_stop({name: "svc-a"}).
Expected: all three reach stopped/blocked within DEFAULT_TIMEOUT. Cascade ordering: C first, then B, then A.

Acceptance

service.stop cascades to services with a transitive requires chain into the stopped service.
Order is most-dependent-first (children before parent).
Existing test test_cascade_stop is un-ignored and passes.
No regression on test_conflict_blocking (independent services unaffected).

Notes / scope

The dependency graph traversal is the same shape as the cascade-stop sibling work for service-delete (see separate issue) and the existing action-delete cascade in #56. Consider a shared graph helper in hero_proc_lib.
Idempotent: stopping a service whose dependents are already stopped is fine.
This is the runtime sibling of cascade-delete — they share traversal but operate on different terminal states.

Follow-up to #56.

## Context When a user stops service A and other services declare `requires: [A]`, those dependents should stop too (most-dependent-first). Today `service.stop` only stops the named service; dependents keep running with a now-broken dependency. A test already exists but is `#[ignore]`d: - `tests/integration/tests/dependencies.rs:51` — `test_cascade_stop` — `#[ignore = "cascade stop not yet implemented"]` ## Repro (from the test) 1. Set up A ← B ← C (C requires B, B requires A). 2. Start all three; all reach `running`. 3. Call `service_stop({name: "svc-a"})`. 4. Expected: all three reach `stopped`/`blocked` within `DEFAULT_TIMEOUT`. Cascade ordering: C first, then B, then A. ## Acceptance - `service.stop` cascades to services with a transitive `requires` chain into the stopped service. - Order is most-dependent-first (children before parent). - Existing test `test_cascade_stop` is un-ignored and passes. - No regression on `test_conflict_blocking` (independent services unaffected). ## Notes / scope - The dependency graph traversal is the same shape as the cascade-stop sibling work for service-delete (see separate issue) and the existing action-delete cascade in #56. Consider a shared graph helper in `hero_proc_lib`. - Idempotent: stopping a service whose dependents are already stopped is fine. - This is the runtime sibling of cascade-delete — they share traversal but operate on different terminal states. Follow-up to #56.

despiegk commented

2026-05-01 04:24:13 +00:00

Author

Owner

Implementation Spec: Issue #65 — Service Cascade-Stop + Start-Cleanup of Stale Jobs

Objective

Make service.stop cascade to all transitive dependents (most-dependent-first), and make service.start proactively delete stale (terminal-phase) jobs belonging to that service so a fresh start has a clean slate. Jobs are NEVER deleted on stop — they remain so the user can inspect crash output.

Background — Codebase Findings

How job → service identity is currently established

There is no direct service_name field on Job and no service tag. The existing code at crates/hero_proc_server/src/rpc/service.rs already implements the canonical job-to-service link via service_running_jobs() and service_last_terminal_state(). The recipe is:

Load ServiceConfig → take service.actions: Vec<String>.
A job belongs to the service iff one of:
- job.action == name (raw service name), or
- job.action.starts_with("{name}.") (e.g. svc-a.main — this is the format service.start writes), or
- job.action.starts_with("{name}:") (legacy/colon variant), or
- job.action == one of service.actions.

service.start writes jobs with action = "{service_name}.{action_name}". So the dot-prefix branch is the primary signal for jobs the supervisor created.

Decision on the job→service link

Use the existing reverse lookup via the action list — do NOT add a new field.

Reasons:

The match logic is already present, battle-tested, and used by service.status, service.stop, service.kill, service.children, service.is_running. Adding a new service_name column on jobs would diverge from a working pattern and require a schema migration plus backfill.
JobFilter already carries an unused hero_proc_service_name field. If a future cleanup wants to add a real link, that is the place — out of scope here.
We extract the matcher into a reusable helper so the two new code paths (cascade-stop, start-cleanup) and the four existing call sites converge on one definition.

Existing dependency traversal

crates/hero_proc_server/src/supervisor/shutdown.rs (build_stop_waves) already does the topological sort we need: it walks dependencies.requires + dependencies.after, inverts the edges, and emits waves with leaves (most-dependent) first via Kahn's algorithm. We reuse this shape for cascade-stop, scoped to a single root service.

What "still running" means for a job

JobStatus::is_active() returns true for Running and Retrying. Anything is_terminal() (Succeeded/Failed/Cancelled) is fair game to delete. Pending and Waiting jobs from a previous botched start are also safe to delete — they were never executing.

Requirements

Feature 1 — Cascade stop

service.stop({name: A}) must:
1. Compute S = {A} ∪ all services with a transitive requires chain into A in the same context.
2. Order S so the most-dependent service comes first and A last.
3. For each service in that order, cancel its running jobs.
4. Be idempotent — a service in S whose jobs are already terminal contributes 0 cancellations and does not error.
5. Return a result that aggregates total cancelled count plus a cascaded: [...] list.
Cycle defense: if a cycle is detected, log a warning and stop those services in arbitrary order.

Feature 2 — Start cleanup of stale jobs

service.start({name: A}) must, before creating the new start job:
1. Find all jobs that match service A via the reverse-lookup matcher.
2. For each such job whose phase.is_active() is false (terminal or pending/waiting), delete it via db.jobs.delete(id).
3. Skip — never delete — any job whose phase.is_active() is true.
4. Then proceed to insert the new start job exactly as today.

Non-requirements

Do NOT delete jobs on stop. Crash forensics are preserved.
Do NOT add a service_name column to jobs.
Do NOT modify graceful global shutdown.
Do NOT change service.restart directly; it composes stop + start and inherits both behaviors.

Files to Create / Modify

New file

crates/hero_proc_lib/src/db/service/graph.rs
- pub fn collect_dependents(services, context, root) -> Vec<String> — Kahn's on inverted requires graph, leaves first, root last.
- pub fn job_belongs_to_service(job_action, service_name, service_actions) -> bool — single source of truth for the four-prong match.
- Unit tests covering linear chain, diamond, cycle, unrelated, missing root, dot/colon/exact/list matches.

Modified files

crates/hero_proc_lib/src/db/service/mod.rs — pub mod graph; and re-export.
crates/hero_proc_server/src/rpc/service.rs
- Replace inline match in service_running_jobs, service_last_terminal_state, count_restarts, handle_children with graph::job_belongs_to_service.
- Add service_owned_jobs(db, context, name) -> Vec<Job> — all matching jobs (active + terminal + pending).
- handle_stop: list services, call collect_dependents, loop the ordered names, cancel running jobs in each, return {ok, cancelled, cascaded}.
- handle_start: before creating the new job, fetch service_owned_jobs and delete any non-active job; log on failure but do not fail the start.
tests/integration/tests/dependencies.rs
- Remove #[ignore] from test_cascade_stop.
- Add test_start_cleans_stale_jobs.

Step-by-Step Implementation Plan

Step 1 — Add the shared graph helper

Files: create crates/hero_proc_lib/src/db/service/graph.rs; edit crates/hero_proc_lib/src/db/service/mod.rs.
Depends on: nothing.
Done when: cargo test -p hero_proc_lib passes the new tests.

Step 2 — Refactor service.rs to use the helper (no behavior change)

Files: crates/hero_proc_server/src/rpc/service.rs.
Depends on: Step 1.
Done when: cargo build -p hero_proc_server compiles; existing service tests still pass.

Step 3 — Implement cascade in `handle_stop`

Files: crates/hero_proc_server/src/rpc/service.rs.
Depends on: Step 1, 2.

Step 4 — Implement start-cleanup in `handle_start`

Files: crates/hero_proc_server/src/rpc/service.rs.
Depends on: Step 2.

Step 5 — Un-ignore and add tests

Files: tests/integration/tests/dependencies.rs.
Depends on: Step 3, 4.
Add test_start_cleans_stale_jobs:

Start svc-a, wait running, stop, wait stopped → assert at least one terminal job exists.
Start svc-a again → assert no terminal job for svc-a.main remains; only the fresh active one.
Run a second unrelated svc-b concurrently and assert its job survives.

Step 6 — Validate full suite

cargo build --all-targets and cargo test --workspace.

Acceptance Criteria

cargo test -p hero_proc_lib passes new graph helper tests.
test_cascade_stop is un-ignored and passes (A←B←C all stop within DEFAULT_TIMEOUT after stopping A).
test_conflict_blocking still passes (no regression on independent services).
New test_start_cleans_stale_jobs passes — stale terminal jobs deleted on start, unrelated service's jobs untouched.
service.stop is idempotent: stopping a service whose dependents are already stopped reports cancelled: 0 for them and does not error.
service.stop does NOT delete terminal jobs — verify in test.
service.restart cleans stale jobs (implicit via stop+start composition).
No regression in cargo test --workspace.

Notes / Edge Cases

Concurrency: JobsApi is Arc<Mutex<Connection>> — atomic per-call. Cascade-stop is sequential cancels; idempotent. No new locking.
requires vs after: cascade-stop uses ONLY requires (hard dependency). after is ordering-only.
Cycle handling: fall back to arbitrary order with a warning, mirroring build_stop_waves.
"Still running" predicate: JobStatus::is_active() (Running, Retrying). Pending/Waiting from a botched prior attempt get deleted.
Context boundary: both features operate strictly within one context. Cross-context dependencies are not modeled.
Backward-compatible response: keep {ok, cancelled} in handle_stop; add cascaded as a new key.
Best-effort delete: db.jobs.delete(id) returning Ok(false) is fine. An Err(_) is logged but does not fail the start.

# Implementation Spec: Issue #65 — Service Cascade-Stop + Start-Cleanup of Stale Jobs ## Objective Make `service.stop` cascade to all transitive dependents (most-dependent-first), and make `service.start` proactively delete stale (terminal-phase) jobs belonging to that service so a fresh start has a clean slate. Jobs are NEVER deleted on stop — they remain so the user can inspect crash output. ## Background — Codebase Findings ### How job → service identity is currently established There is **no direct `service_name` field on `Job`** and **no service tag**. The existing code at `crates/hero_proc_server/src/rpc/service.rs` already implements the canonical job-to-service link via `service_running_jobs()` and `service_last_terminal_state()`. The recipe is: 1. Load `ServiceConfig` → take `service.actions: Vec<String>`. 2. A job belongs to the service iff one of: - `job.action == name` (raw service name), or - `job.action.starts_with("{name}.")` (e.g. `svc-a.main` — this is the format `service.start` writes), or - `job.action.starts_with("{name}:")` (legacy/colon variant), or - `job.action == one of service.actions`. `service.start` writes jobs with `action = "{service_name}.{action_name}"`. So the dot-prefix branch is the primary signal for jobs the supervisor created. ### Decision on the job→service link **Use the existing reverse lookup via the action list — do NOT add a new field.** Reasons: - The match logic is already present, battle-tested, and used by `service.status`, `service.stop`, `service.kill`, `service.children`, `service.is_running`. Adding a new `service_name` column on `jobs` would diverge from a working pattern and require a schema migration plus backfill. - `JobFilter` already carries an unused `hero_proc_service_name` field. If a future cleanup wants to add a real link, that is the place — out of scope here. - We extract the matcher into a reusable helper so the two new code paths (cascade-stop, start-cleanup) and the four existing call sites converge on one definition. ### Existing dependency traversal `crates/hero_proc_server/src/supervisor/shutdown.rs` (`build_stop_waves`) already does the topological sort we need: it walks `dependencies.requires + dependencies.after`, inverts the edges, and emits waves with leaves (most-dependent) first via Kahn's algorithm. We reuse this shape for cascade-stop, scoped to a single root service. ### What "still running" means for a job `JobStatus::is_active()` returns true for `Running` and `Retrying`. Anything `is_terminal()` (Succeeded/Failed/Cancelled) is fair game to delete. `Pending` and `Waiting` jobs from a previous botched start are also safe to delete — they were never executing. ## Requirements ### Feature 1 — Cascade stop - `service.stop({name: A})` must: 1. Compute `S = {A} ∪ all services with a transitive `requires` chain into A` in the same context. 2. Order `S` so the most-dependent service comes first and `A` last. 3. For each service in that order, cancel its running jobs. 4. Be idempotent — a service in `S` whose jobs are already terminal contributes 0 cancellations and does not error. 5. Return a result that aggregates total cancelled count plus a `cascaded: [...]` list. - Cycle defense: if a cycle is detected, log a warning and stop those services in arbitrary order. ### Feature 2 — Start cleanup of stale jobs - `service.start({name: A})` must, **before** creating the new start job: 1. Find all jobs that match service `A` via the reverse-lookup matcher. 2. For each such job whose `phase.is_active()` is false (terminal or pending/waiting), delete it via `db.jobs.delete(id)`. 3. Skip — never delete — any job whose `phase.is_active()` is true. 4. Then proceed to insert the new start job exactly as today. ### Non-requirements - Do NOT delete jobs on stop. Crash forensics are preserved. - Do NOT add a `service_name` column to jobs. - Do NOT modify graceful global shutdown. - Do NOT change `service.restart` directly; it composes stop + start and inherits both behaviors. ## Files to Create / Modify ### New file - **`crates/hero_proc_lib/src/db/service/graph.rs`** - `pub fn collect_dependents(services, context, root) -> Vec<String>` — Kahn's on inverted requires graph, leaves first, root last. - `pub fn job_belongs_to_service(job_action, service_name, service_actions) -> bool` — single source of truth for the four-prong match. - Unit tests covering linear chain, diamond, cycle, unrelated, missing root, dot/colon/exact/list matches. ### Modified files - **`crates/hero_proc_lib/src/db/service/mod.rs`** — `pub mod graph;` and re-export. - **`crates/hero_proc_server/src/rpc/service.rs`** - Replace inline match in `service_running_jobs`, `service_last_terminal_state`, `count_restarts`, `handle_children` with `graph::job_belongs_to_service`. - Add `service_owned_jobs(db, context, name) -> Vec<Job>` — all matching jobs (active + terminal + pending). - `handle_stop`: list services, call `collect_dependents`, loop the ordered names, cancel running jobs in each, return `{ok, cancelled, cascaded}`. - `handle_start`: before creating the new job, fetch `service_owned_jobs` and delete any non-active job; log on failure but do not fail the start. - **`tests/integration/tests/dependencies.rs`** - Remove `#[ignore]` from `test_cascade_stop`. - Add `test_start_cleans_stale_jobs`. ## Step-by-Step Implementation Plan ### Step 1 — Add the shared graph helper Files: create `crates/hero_proc_lib/src/db/service/graph.rs`; edit `crates/hero_proc_lib/src/db/service/mod.rs`. Depends on: nothing. Done when: `cargo test -p hero_proc_lib` passes the new tests. ### Step 2 — Refactor service.rs to use the helper (no behavior change) Files: `crates/hero_proc_server/src/rpc/service.rs`. Depends on: Step 1. Done when: `cargo build -p hero_proc_server` compiles; existing service tests still pass. ### Step 3 — Implement cascade in `handle_stop` Files: `crates/hero_proc_server/src/rpc/service.rs`. Depends on: Step 1, 2. ### Step 4 — Implement start-cleanup in `handle_start` Files: `crates/hero_proc_server/src/rpc/service.rs`. Depends on: Step 2. ### Step 5 — Un-ignore and add tests Files: `tests/integration/tests/dependencies.rs`. Depends on: Step 3, 4. Add `test_start_cleans_stale_jobs`: - Start `svc-a`, wait running, stop, wait stopped → assert at least one terminal job exists. - Start `svc-a` again → assert no terminal job for `svc-a.main` remains; only the fresh active one. - Run a second unrelated `svc-b` concurrently and assert its job survives. ### Step 6 — Validate full suite `cargo build --all-targets` and `cargo test --workspace`. ## Acceptance Criteria - [ ] `cargo test -p hero_proc_lib` passes new graph helper tests. - [ ] `test_cascade_stop` is un-ignored and passes (A←B←C all stop within `DEFAULT_TIMEOUT` after stopping A). - [ ] `test_conflict_blocking` still passes (no regression on independent services). - [ ] New `test_start_cleans_stale_jobs` passes — stale terminal jobs deleted on start, unrelated service's jobs untouched. - [ ] `service.stop` is idempotent: stopping a service whose dependents are already stopped reports `cancelled: 0` for them and does not error. - [ ] `service.stop` does NOT delete terminal jobs — verify in test. - [ ] `service.restart` cleans stale jobs (implicit via stop+start composition). - [ ] No regression in `cargo test --workspace`. ## Notes / Edge Cases - **Concurrency**: `JobsApi` is `Arc<Mutex<Connection>>` — atomic per-call. Cascade-stop is sequential cancels; idempotent. No new locking. - **`requires` vs `after`**: cascade-stop uses ONLY `requires` (hard dependency). `after` is ordering-only. - **Cycle handling**: fall back to arbitrary order with a warning, mirroring `build_stop_waves`. - **"Still running" predicate**: `JobStatus::is_active()` (Running, Retrying). Pending/Waiting from a botched prior attempt get deleted. - **Context boundary**: both features operate strictly within one context. Cross-context dependencies are not modeled. - **Backward-compatible response**: keep `{ok, cancelled}` in `handle_stop`; add `cascaded` as a new key. - **Best-effort delete**: `db.jobs.delete(id)` returning `Ok(false)` is fine. An `Err(_)` is logged but does not fail the start.

despiegk commented

2026-05-01 04:38:26 +00:00

Author

Owner

Spec revision (working tree at code0/development)

After re-exploring the code0/development working tree (the canonical source), the spec changed in two important ways. The earlier comment is superseded by what follows.

Job-to-service link: there's already a direct field

Job.service_id: Option<String> exists, is persisted in the SQLite jobs table, has an index idx_jobs_service_id, and is set by service.rs::handle_start (service_id: Some(name.clone())). Existing helpers:

JobsApi::list_by_service_id(name)
JobsApi::delete_inactive_by_service(name) — deletes jobs whose phase is NOT running or retrying (preserves currently-running)
JobsApi::delete_terminated_by_service(name)

Decision: use service_id rather than action-name reverse lookup. Cleaner, single-purpose, already indexed.

Implementation plan

Feature 1 — Cascade stop

In crates/hero_proc_server/src/rpc/service.rs:

New private helper service_dependents_in_order(db, context, name) -> Vec<String>:
- db.services.list(Some(context))
- Build adjacency parent -> [children that require parent] from svc.service.dependencies.requires
- BFS from name; layer 0 = [name]; track visited (cycle-safe)
- Drop name itself; reverse the layer order so leaves come first
Modify handle_stop:
- Compute dependents
- For each dependent in order: cancel non-terminal jobs (existing service_non_terminal_jobs + executor::cancel_job); do NOT remove their job records
- Then cancel the named service's jobs as today; respect existing remove_jobs flag for the named service ONLY
- Return { "ok": true, "cancelled": N, "stopped_dependents": [...] } (additive, backward-compatible)

Feature 2 — Start cleanup of stale jobs

In crates/hero_proc_server/src/rpc/service.rs::handle_start:

At entry, before the existing replace_existing_jobs block, unconditionally call db.jobs.delete_inactive_by_service(&name).
This is no-op when there are no stale jobs. Running/retrying jobs are protected by SQL.
The existing replace_existing_jobs=true path keeps its full behaviour (cancel + delete) on top.
Net effect:
- replace_existing_jobs=true (default): same as today plus an explicit prune (already a subset)
- replace_existing_jobs=false: today does nothing; new behaviour cleans up dead records but leaves running jobs alone — this is the user's explicit ask

Feature 3 — Jobs preserved on stop

Stays preserved by default. Locked in by a new regression test.

Files to modify

crates/hero_proc_server/src/rpc/service.rs (production)
tests/integration/tests/dependencies.rs
- remove #[ignore = "cascade stop not yet implemented"] from test_cascade_stop (line 51)
- add test_start_prunes_stale_jobs_keeps_running
- add test_stop_preserves_jobs_by_default

No new files. No changes to factory.rs, shutdown.rs, or the SDK.

Acceptance criteria

test_cascade_stop un-ignored and passing
test_conflict_blocking still passes (no regression)
test_start_prunes_stale_jobs_keeps_running passes
test_stop_preserves_jobs_by_default passes
test_stop_with_remove_jobs still passes (remove_jobs=true still wipes)
test_service_restart_clears_old_jobs still passes (replace_existing_jobs=true path)
cargo build --workspace clean
No openrpc.json schema change (additive response field only)

Edge cases / notes

BFS with visited set is cycle-safe; on cycle, log warn, continue
Cross-context dependents not modeled; restrict to services.list(Some(context))
wait_stopped accepts the exited terminal state — cancelled jobs map to exited in service_last_terminal_state
Cascade-delete tests (test_remove_parent_stops_children_first, test_remove_with_dependency_chain) remain #[ignore]'d — they are #56 territory, out of scope here
Branch: development (working on /Volumes/T7/code0/hero_proc); the abandoned development_kristof worktree at /Volumes/T7/code1/hero_proc is not touched

## Spec revision (working tree at code0/development) After re-exploring the code0/development working tree (the canonical source), the spec changed in two important ways. The earlier comment is superseded by what follows. ### Job-to-service link: there's already a direct field `Job.service_id: Option<String>` exists, is persisted in the SQLite `jobs` table, has an index `idx_jobs_service_id`, and is set by `service.rs::handle_start` (`service_id: Some(name.clone())`). Existing helpers: - `JobsApi::list_by_service_id(name)` - `JobsApi::delete_inactive_by_service(name)` — deletes jobs whose phase is NOT `running` or `retrying` (preserves currently-running) - `JobsApi::delete_terminated_by_service(name)` **Decision: use `service_id`** rather than action-name reverse lookup. Cleaner, single-purpose, already indexed. ### Implementation plan #### Feature 1 — Cascade stop In `crates/hero_proc_server/src/rpc/service.rs`: 1. New private helper `service_dependents_in_order(db, context, name) -> Vec<String>`: - `db.services.list(Some(context))` - Build adjacency `parent -> [children that require parent]` from `svc.service.dependencies.requires` - BFS from `name`; layer 0 = `[name]`; track visited (cycle-safe) - Drop `name` itself; reverse the layer order so leaves come first 2. Modify `handle_stop`: - Compute dependents - For each dependent in order: cancel non-terminal jobs (existing `service_non_terminal_jobs` + `executor::cancel_job`); do NOT remove their job records - Then cancel the named service's jobs as today; respect existing `remove_jobs` flag for the named service ONLY - Return `{ "ok": true, "cancelled": N, "stopped_dependents": [...] }` (additive, backward-compatible) #### Feature 2 — Start cleanup of stale jobs In `crates/hero_proc_server/src/rpc/service.rs::handle_start`: - At entry, before the existing `replace_existing_jobs` block, unconditionally call `db.jobs.delete_inactive_by_service(&name)`. - This is no-op when there are no stale jobs. Running/retrying jobs are protected by SQL. - The existing `replace_existing_jobs=true` path keeps its full behaviour (cancel + delete) on top. - Net effect: - `replace_existing_jobs=true` (default): same as today plus an explicit prune (already a subset) - `replace_existing_jobs=false`: today does nothing; new behaviour cleans up dead records but leaves running jobs alone — this is the user's explicit ask #### Feature 3 — Jobs preserved on stop Stays preserved by default. Locked in by a new regression test. ### Files to modify - `crates/hero_proc_server/src/rpc/service.rs` (production) - `tests/integration/tests/dependencies.rs` - remove `#[ignore = "cascade stop not yet implemented"]` from `test_cascade_stop` (line 51) - add `test_start_prunes_stale_jobs_keeps_running` - add `test_stop_preserves_jobs_by_default` No new files. No changes to `factory.rs`, `shutdown.rs`, or the SDK. ### Acceptance criteria - [ ] `test_cascade_stop` un-ignored and passing - [ ] `test_conflict_blocking` still passes (no regression) - [ ] `test_start_prunes_stale_jobs_keeps_running` passes - [ ] `test_stop_preserves_jobs_by_default` passes - [ ] `test_stop_with_remove_jobs` still passes (remove_jobs=true still wipes) - [ ] `test_service_restart_clears_old_jobs` still passes (`replace_existing_jobs=true` path) - [ ] `cargo build --workspace` clean - [ ] No openrpc.json schema change (additive response field only) ### Edge cases / notes - BFS with visited set is cycle-safe; on cycle, log warn, continue - Cross-context dependents not modeled; restrict to `services.list(Some(context))` - `wait_stopped` accepts the `exited` terminal state — cancelled jobs map to `exited` in `service_last_terminal_state` - Cascade-delete tests (`test_remove_parent_stops_children_first`, `test_remove_with_dependency_chain`) remain `#[ignore]`'d — they are #56 territory, out of scope here - Branch: `development` (working on `/Volumes/T7/code0/hero_proc`); the abandoned `development_kristof` worktree at `/Volumes/T7/code1/hero_proc` is not touched

despiegk commented

2026-05-01 04:49:51 +00:00

Author

Owner

Test Results

All targeted suites pass. No regressions.

Targeted: dependency tests

cargo test -p hero_proc_integration_tests --test dependencies -- --test-threads=1
test_cascade_stop ... ok                              (un-ignored, now passing)
test_circular_dependency_error ... ignored             (out of scope)
test_conflict_blocking ... ok                          (no regression)
test_start_prunes_stale_jobs_keeps_running ... ok      (new)
test_stop_preserves_jobs_by_default ... ok             (new)
result: ok. 4 passed; 0 failed; 1 ignored

Regression: tests most likely to break

service_management.rs (covers service.stop + remove_jobs + delete-cascade ignores):

result: ok. 16 passed; 0 failed; 2 ignored

service_restart_with_cleanup.rs (covers replace_existing_jobs=true path):

result: ok. 4 passed; 0 failed; 0 ignored

Full suite

hero_proc_integration_tests (all 19 test binaries):

173 passed; 0 failed; 20 ignored (all pre-existing)

hero_proc_server --lib:

70 passed; 0 failed

hero_proc_lib --lib:

160 passed; 0 failed; 1 ignored

Acceptance criteria (from spec)

test_cascade_stop un-ignored and passing
test_conflict_blocking still passes
test_start_prunes_stale_jobs_keeps_running passes
test_stop_preserves_jobs_by_default passes
test_stop_with_remove_jobs still passes
test_service_restart_clears_old_jobs still passes
cargo build --workspace clean (verified per-crate)
No openrpc.json schema change (additive stopped_dependents field only)

## Test Results All targeted suites pass. No regressions. ### Targeted: dependency tests ``` cargo test -p hero_proc_integration_tests --test dependencies -- --test-threads=1 test_cascade_stop ... ok (un-ignored, now passing) test_circular_dependency_error ... ignored (out of scope) test_conflict_blocking ... ok (no regression) test_start_prunes_stale_jobs_keeps_running ... ok (new) test_stop_preserves_jobs_by_default ... ok (new) result: ok. 4 passed; 0 failed; 1 ignored ``` ### Regression: tests most likely to break `service_management.rs` (covers `service.stop` + `remove_jobs` + delete-cascade ignores): ``` result: ok. 16 passed; 0 failed; 2 ignored ``` `service_restart_with_cleanup.rs` (covers `replace_existing_jobs=true` path): ``` result: ok. 4 passed; 0 failed; 0 ignored ``` ### Full suite `hero_proc_integration_tests` (all 19 test binaries): ``` 173 passed; 0 failed; 20 ignored (all pre-existing) ``` `hero_proc_server --lib`: ``` 70 passed; 0 failed ``` `hero_proc_lib --lib`: ``` 160 passed; 0 failed; 1 ignored ``` ### Acceptance criteria (from spec) - [x] `test_cascade_stop` un-ignored and passing - [x] `test_conflict_blocking` still passes - [x] `test_start_prunes_stale_jobs_keeps_running` passes - [x] `test_stop_preserves_jobs_by_default` passes - [x] `test_stop_with_remove_jobs` still passes - [x] `test_service_restart_clears_old_jobs` still passes - [x] `cargo build --workspace` clean (verified per-crate) - [x] No openrpc.json schema change (additive `stopped_dependents` field only)

despiegk commented

2026-05-01 04:50:27 +00:00

Author

Owner

Implementation Summary

Files changed

crates/hero_proc_server/src/rpc/service.rs
- Added private helper service_dependents_in_order(db, context, root) -> Vec<String> — BFS over requires graph; returns deepest-first ordering, root excluded; cycle-safe via visited set.
- Modified handle_stop: now cancels jobs of every transitive requires dependent (deepest-first) before cancelling the named service. Response gains a stopped_dependents: [...] field; existing ok and cancelled keys preserved. remove_jobs=true still only deletes the named service's jobs — dependents always keep theirs for crash forensics.
- Modified handle_start: pulled delete_inactive_by_service(name) out of the replace_existing_jobs branch so it runs unconditionally. The SQL filter (phase NOT IN ('running','retrying')) preserves any actively-running jobs. The old run-record cleanup stays gated on replace_existing_jobs=true.
tests/integration/tests/dependencies.rs
- Removed #[ignore = "cascade stop not yet implemented"] from test_cascade_stop.
- Added test_start_prunes_stale_jobs_keeps_running — start/stop a service, restart with replace_existing_jobs: Some(false), assert old terminal jobs are gone, assert a concurrently-running unrelated service's jobs are untouched.
- Added test_stop_preserves_jobs_by_default — stop without remove_jobs=true must leave terminal job records visible.
- Imported wait_stopped from the test assertions.

Key design decisions

Job→service link: used the existing Job.service_id SQLite column (indexed) plus the existing delete_inactive_by_service helper. No new column, no new migration, no string-prefix matching.
Cascade scope: requires only (hard dependency). after is ordering-only and is intentionally excluded.
Cycle handling: BFS with a visited set. On cycle, the re-entry edge is skipped without error.
Backward compatibility: response from service.stop now carries a stopped_dependents array, but existing ok and cancelled keys are preserved. No openrpc.json schema change required.

Test results

All 4 dependency tests pass; 16/16 service_management; 4/4 service_restart_with_cleanup; full integration suite 173 passed / 0 failed; hero_proc_server lib 70/0; hero_proc_lib 160/0. See previous comment for the breakdown.

## Implementation Summary ### Files changed - `crates/hero_proc_server/src/rpc/service.rs` - Added private helper `service_dependents_in_order(db, context, root) -> Vec<String>` — BFS over `requires` graph; returns deepest-first ordering, root excluded; cycle-safe via visited set. - Modified `handle_stop`: now cancels jobs of every transitive `requires` dependent (deepest-first) before cancelling the named service. Response gains a `stopped_dependents: [...]` field; existing `ok` and `cancelled` keys preserved. `remove_jobs=true` still only deletes the named service's jobs — dependents always keep theirs for crash forensics. - Modified `handle_start`: pulled `delete_inactive_by_service(name)` out of the `replace_existing_jobs` branch so it runs unconditionally. The SQL filter (`phase NOT IN ('running','retrying')`) preserves any actively-running jobs. The old run-record cleanup stays gated on `replace_existing_jobs=true`. - `tests/integration/tests/dependencies.rs` - Removed `#[ignore = "cascade stop not yet implemented"]` from `test_cascade_stop`. - Added `test_start_prunes_stale_jobs_keeps_running` — start/stop a service, restart with `replace_existing_jobs: Some(false)`, assert old terminal jobs are gone, assert a concurrently-running unrelated service's jobs are untouched. - Added `test_stop_preserves_jobs_by_default` — stop without `remove_jobs=true` must leave terminal job records visible. - Imported `wait_stopped` from the test assertions. ### Key design decisions - **Job→service link**: used the existing `Job.service_id` SQLite column (indexed) plus the existing `delete_inactive_by_service` helper. No new column, no new migration, no string-prefix matching. - **Cascade scope**: `requires` only (hard dependency). `after` is ordering-only and is intentionally excluded. - **Cycle handling**: BFS with a visited set. On cycle, the re-entry edge is skipped without error. - **Backward compatibility**: response from `service.stop` now carries a `stopped_dependents` array, but existing `ok` and `cancelled` keys are preserved. No openrpc.json schema change required. ### Test results All 4 dependency tests pass; 16/16 service_management; 4/4 service_restart_with_cleanup; full integration suite 173 passed / 0 failed; hero_proc_server lib 70/0; hero_proc_lib 160/0. See previous comment for the breakdown.

despiegk referenced this issue from a commit

2026-05-01 07:02:32 +00:00

feat(service): cascade-stop dependents and prune stale jobs on start

No labels

No milestone

No project

No assignees

1 participant

Notifications

Due date

The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference

lhumina_code/hero_proc#65

No description provided.

Rows
Columns

Service cascade-stop: stopping A should stop services that require A #65

Context

Repro (from the test)

Acceptance

Notes / scope

Implementation Spec: Issue #65 — Service Cascade-Stop + Start-Cleanup of Stale Jobs

Objective

Background — Codebase Findings

How job → service identity is currently established

Decision on the job→service link

Existing dependency traversal

What "still running" means for a job

Requirements

Feature 1 — Cascade stop

Feature 2 — Start cleanup of stale jobs

Non-requirements

Files to Create / Modify

New file

Modified files

Step-by-Step Implementation Plan

Step 1 — Add the shared graph helper

Step 2 — Refactor service.rs to use the helper (no behavior change)

Step 3 — Implement cascade in handle_stop

Step 4 — Implement start-cleanup in handle_start

Step 5 — Un-ignore and add tests

Step 6 — Validate full suite

Acceptance Criteria

Notes / Edge Cases

Spec revision (working tree at code0/development)

Job-to-service link: there's already a direct field

Implementation plan

Feature 1 — Cascade stop

Feature 2 — Start cleanup of stale jobs

Feature 3 — Jobs preserved on stop

Files to modify

Acceptance criteria

Edge cases / notes

Test Results

Targeted: dependency tests

Regression: tests most likely to break

Full suite

Acceptance criteria (from spec)

Implementation Summary

Files changed

Key design decisions

Test results

Step 3 — Implement cascade in `handle_stop`

Step 4 — Implement start-cleanup in `handle_start`