Service cascade-stop: stopping A should stop services that require A #65
Labels
No labels
prio_critical
prio_low
type_bug
type_contact
type_issue
type_lead
type_question
type_story
type_task
No milestone
No project
No assignees
1 participant
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
lhumina_code/hero_proc#65
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Context
When a user stops service A and other services declare
requires: [A], those dependents should stop too (most-dependent-first). Todayservice.stoponly stops the named service; dependents keep running with a now-broken dependency.A test already exists but is
#[ignore]d:tests/integration/tests/dependencies.rs:51—test_cascade_stop—#[ignore = "cascade stop not yet implemented"]Repro (from the test)
running.service_stop({name: "svc-a"}).stopped/blockedwithinDEFAULT_TIMEOUT. Cascade ordering: C first, then B, then A.Acceptance
service.stopcascades to services with a transitiverequireschain into the stopped service.test_cascade_stopis un-ignored and passes.test_conflict_blocking(independent services unaffected).Notes / scope
hero_proc_lib.Follow-up to #56.
Implementation Spec: Issue #65 — Service Cascade-Stop + Start-Cleanup of Stale Jobs
Objective
Make
service.stopcascade to all transitive dependents (most-dependent-first), and makeservice.startproactively delete stale (terminal-phase) jobs belonging to that service so a fresh start has a clean slate. Jobs are NEVER deleted on stop — they remain so the user can inspect crash output.Background — Codebase Findings
How job → service identity is currently established
There is no direct
service_namefield onJoband no service tag. The existing code atcrates/hero_proc_server/src/rpc/service.rsalready implements the canonical job-to-service link viaservice_running_jobs()andservice_last_terminal_state(). The recipe is:ServiceConfig→ takeservice.actions: Vec<String>.job.action == name(raw service name), orjob.action.starts_with("{name}.")(e.g.svc-a.main— this is the formatservice.startwrites), orjob.action.starts_with("{name}:")(legacy/colon variant), orjob.action == one of service.actions.service.startwrites jobs withaction = "{service_name}.{action_name}". So the dot-prefix branch is the primary signal for jobs the supervisor created.Decision on the job→service link
Use the existing reverse lookup via the action list — do NOT add a new field.
Reasons:
service.status,service.stop,service.kill,service.children,service.is_running. Adding a newservice_namecolumn onjobswould diverge from a working pattern and require a schema migration plus backfill.JobFilteralready carries an unusedhero_proc_service_namefield. If a future cleanup wants to add a real link, that is the place — out of scope here.Existing dependency traversal
crates/hero_proc_server/src/supervisor/shutdown.rs(build_stop_waves) already does the topological sort we need: it walksdependencies.requires + dependencies.after, inverts the edges, and emits waves with leaves (most-dependent) first via Kahn's algorithm. We reuse this shape for cascade-stop, scoped to a single root service.What "still running" means for a job
JobStatus::is_active()returns true forRunningandRetrying. Anythingis_terminal()(Succeeded/Failed/Cancelled) is fair game to delete.PendingandWaitingjobs from a previous botched start are also safe to delete — they were never executing.Requirements
Feature 1 — Cascade stop
service.stop({name: A})must:S = {A} ∪ all services with a transitiverequireschain into Ain the same context.Sso the most-dependent service comes first andAlast.Swhose jobs are already terminal contributes 0 cancellations and does not error.cascaded: [...]list.Feature 2 — Start cleanup of stale jobs
service.start({name: A})must, before creating the new start job:Avia the reverse-lookup matcher.phase.is_active()is false (terminal or pending/waiting), delete it viadb.jobs.delete(id).phase.is_active()is true.Non-requirements
service_namecolumn to jobs.service.restartdirectly; it composes stop + start and inherits both behaviors.Files to Create / Modify
New file
crates/hero_proc_lib/src/db/service/graph.rspub fn collect_dependents(services, context, root) -> Vec<String>— Kahn's on inverted requires graph, leaves first, root last.pub fn job_belongs_to_service(job_action, service_name, service_actions) -> bool— single source of truth for the four-prong match.Modified files
crates/hero_proc_lib/src/db/service/mod.rs—pub mod graph;and re-export.crates/hero_proc_server/src/rpc/service.rsservice_running_jobs,service_last_terminal_state,count_restarts,handle_childrenwithgraph::job_belongs_to_service.service_owned_jobs(db, context, name) -> Vec<Job>— all matching jobs (active + terminal + pending).handle_stop: list services, callcollect_dependents, loop the ordered names, cancel running jobs in each, return{ok, cancelled, cascaded}.handle_start: before creating the new job, fetchservice_owned_jobsand delete any non-active job; log on failure but do not fail the start.tests/integration/tests/dependencies.rs#[ignore]fromtest_cascade_stop.test_start_cleans_stale_jobs.Step-by-Step Implementation Plan
Step 1 — Add the shared graph helper
Files: create
crates/hero_proc_lib/src/db/service/graph.rs; editcrates/hero_proc_lib/src/db/service/mod.rs.Depends on: nothing.
Done when:
cargo test -p hero_proc_libpasses the new tests.Step 2 — Refactor service.rs to use the helper (no behavior change)
Files:
crates/hero_proc_server/src/rpc/service.rs.Depends on: Step 1.
Done when:
cargo build -p hero_proc_servercompiles; existing service tests still pass.Step 3 — Implement cascade in
handle_stopFiles:
crates/hero_proc_server/src/rpc/service.rs.Depends on: Step 1, 2.
Step 4 — Implement start-cleanup in
handle_startFiles:
crates/hero_proc_server/src/rpc/service.rs.Depends on: Step 2.
Step 5 — Un-ignore and add tests
Files:
tests/integration/tests/dependencies.rs.Depends on: Step 3, 4.
Add
test_start_cleans_stale_jobs:svc-a, wait running, stop, wait stopped → assert at least one terminal job exists.svc-aagain → assert no terminal job forsvc-a.mainremains; only the fresh active one.svc-bconcurrently and assert its job survives.Step 6 — Validate full suite
cargo build --all-targetsandcargo test --workspace.Acceptance Criteria
cargo test -p hero_proc_libpasses new graph helper tests.test_cascade_stopis un-ignored and passes (A←B←C all stop withinDEFAULT_TIMEOUTafter stopping A).test_conflict_blockingstill passes (no regression on independent services).test_start_cleans_stale_jobspasses — stale terminal jobs deleted on start, unrelated service's jobs untouched.service.stopis idempotent: stopping a service whose dependents are already stopped reportscancelled: 0for them and does not error.service.stopdoes NOT delete terminal jobs — verify in test.service.restartcleans stale jobs (implicit via stop+start composition).cargo test --workspace.Notes / Edge Cases
JobsApiisArc<Mutex<Connection>>— atomic per-call. Cascade-stop is sequential cancels; idempotent. No new locking.requiresvsafter: cascade-stop uses ONLYrequires(hard dependency).afteris ordering-only.build_stop_waves.JobStatus::is_active()(Running, Retrying). Pending/Waiting from a botched prior attempt get deleted.{ok, cancelled}inhandle_stop; addcascadedas a new key.db.jobs.delete(id)returningOk(false)is fine. AnErr(_)is logged but does not fail the start.Spec revision (working tree at code0/development)
After re-exploring the code0/development working tree (the canonical source), the spec changed in two important ways. The earlier comment is superseded by what follows.
Job-to-service link: there's already a direct field
Job.service_id: Option<String>exists, is persisted in the SQLitejobstable, has an indexidx_jobs_service_id, and is set byservice.rs::handle_start(service_id: Some(name.clone())). Existing helpers:JobsApi::list_by_service_id(name)JobsApi::delete_inactive_by_service(name)— deletes jobs whose phase is NOTrunningorretrying(preserves currently-running)JobsApi::delete_terminated_by_service(name)Decision: use
service_idrather than action-name reverse lookup. Cleaner, single-purpose, already indexed.Implementation plan
Feature 1 — Cascade stop
In
crates/hero_proc_server/src/rpc/service.rs:service_dependents_in_order(db, context, name) -> Vec<String>:db.services.list(Some(context))parent -> [children that require parent]fromsvc.service.dependencies.requiresname; layer 0 =[name]; track visited (cycle-safe)nameitself; reverse the layer order so leaves come firsthandle_stop:service_non_terminal_jobs+executor::cancel_job); do NOT remove their job recordsremove_jobsflag for the named service ONLY{ "ok": true, "cancelled": N, "stopped_dependents": [...] }(additive, backward-compatible)Feature 2 — Start cleanup of stale jobs
In
crates/hero_proc_server/src/rpc/service.rs::handle_start:replace_existing_jobsblock, unconditionally calldb.jobs.delete_inactive_by_service(&name).replace_existing_jobs=truepath keeps its full behaviour (cancel + delete) on top.replace_existing_jobs=true(default): same as today plus an explicit prune (already a subset)replace_existing_jobs=false: today does nothing; new behaviour cleans up dead records but leaves running jobs alone — this is the user's explicit askFeature 3 — Jobs preserved on stop
Stays preserved by default. Locked in by a new regression test.
Files to modify
crates/hero_proc_server/src/rpc/service.rs(production)tests/integration/tests/dependencies.rs#[ignore = "cascade stop not yet implemented"]fromtest_cascade_stop(line 51)test_start_prunes_stale_jobs_keeps_runningtest_stop_preserves_jobs_by_defaultNo new files. No changes to
factory.rs,shutdown.rs, or the SDK.Acceptance criteria
test_cascade_stopun-ignored and passingtest_conflict_blockingstill passes (no regression)test_start_prunes_stale_jobs_keeps_runningpassestest_stop_preserves_jobs_by_defaultpassestest_stop_with_remove_jobsstill passes (remove_jobs=true still wipes)test_service_restart_clears_old_jobsstill passes (replace_existing_jobs=truepath)cargo build --workspacecleanEdge cases / notes
services.list(Some(context))wait_stoppedaccepts theexitedterminal state — cancelled jobs map toexitedinservice_last_terminal_statetest_remove_parent_stops_children_first,test_remove_with_dependency_chain) remain#[ignore]'d — they are #56 territory, out of scope heredevelopment(working on/Volumes/T7/code0/hero_proc); the abandoneddevelopment_kristofworktree at/Volumes/T7/code1/hero_procis not touchedTest Results
All targeted suites pass. No regressions.
Targeted: dependency tests
Regression: tests most likely to break
service_management.rs(coversservice.stop+remove_jobs+ delete-cascade ignores):service_restart_with_cleanup.rs(coversreplace_existing_jobs=truepath):Full suite
hero_proc_integration_tests(all 19 test binaries):hero_proc_server --lib:hero_proc_lib --lib:Acceptance criteria (from spec)
test_cascade_stopun-ignored and passingtest_conflict_blockingstill passestest_start_prunes_stale_jobs_keeps_runningpassestest_stop_preserves_jobs_by_defaultpassestest_stop_with_remove_jobsstill passestest_service_restart_clears_old_jobsstill passescargo build --workspaceclean (verified per-crate)stopped_dependentsfield only)Implementation Summary
Files changed
crates/hero_proc_server/src/rpc/service.rsservice_dependents_in_order(db, context, root) -> Vec<String>— BFS overrequiresgraph; returns deepest-first ordering, root excluded; cycle-safe via visited set.handle_stop: now cancels jobs of every transitiverequiresdependent (deepest-first) before cancelling the named service. Response gains astopped_dependents: [...]field; existingokandcancelledkeys preserved.remove_jobs=truestill only deletes the named service's jobs — dependents always keep theirs for crash forensics.handle_start: pulleddelete_inactive_by_service(name)out of thereplace_existing_jobsbranch so it runs unconditionally. The SQL filter (phase NOT IN ('running','retrying')) preserves any actively-running jobs. The old run-record cleanup stays gated onreplace_existing_jobs=true.tests/integration/tests/dependencies.rs#[ignore = "cascade stop not yet implemented"]fromtest_cascade_stop.test_start_prunes_stale_jobs_keeps_running— start/stop a service, restart withreplace_existing_jobs: Some(false), assert old terminal jobs are gone, assert a concurrently-running unrelated service's jobs are untouched.test_stop_preserves_jobs_by_default— stop withoutremove_jobs=truemust leave terminal job records visible.wait_stoppedfrom the test assertions.Key design decisions
Job.service_idSQLite column (indexed) plus the existingdelete_inactive_by_servicehelper. No new column, no new migration, no string-prefix matching.requiresonly (hard dependency).afteris ordering-only and is intentionally excluded.service.stopnow carries astopped_dependentsarray, but existingokandcancelledkeys are preserved. No openrpc.json schema change required.Test results
All 4 dependency tests pass; 16/16 service_management; 4/4 service_restart_with_cleanup; full integration suite 173 passed / 0 failed; hero_proc_server lib 70/0; hero_proc_lib 160/0. See previous comment for the breakdown.