service manager in router #90
Story: Replace Nu/Bash/Make Service Scripts with Hero Service Manager
Purpose
We need to replace the current collection of Nu shell scripts, Bash scripts, and Makefiles with one Rust-based service management server, following the Hero architecture.
The goal is to make service lifecycle management reproducible, typed, inspectable, and centrally controlled.
Goal
Build a Hero Service Manager that can:
Functional Requirements
Service Definition Requirements
Each service definition should include:
Do this as code (Rust, as part of the _server).
Core Operations
- service.build
- service.install
- service.start
- service.stop
- service.restart
- service.status
- service.health
- service.logs
- service.list
- service.upgrade
- service.verify
Deliverables
Implementation plan — Hero Service Manager inside hero_router
Aligning before coding. PR will land on development_mik_service_manager off development. Squash-merge only with explicit go-ahead.
Architecture
New third socket under $HERO_SOCKET_DIR/hero_router/: service.sock serves POST /rpc, GET /openrpc.json (separate spec), GET /health, GET /.well-known/heroservice.json (distinct service_id so the scanner indexes it as its own service).
No new crate: single binary stays single. New module tree:
ServiceDefinition — pure data with typed extension points
Custom per-service behavior is encoded as enum-typed extension points, not trait impls — so the agent can serialize and reason about every service uniformly.
Extra::OnnxRuntime and BindStrategy::Mycelium compile in but are stubbed for now; they are exercised in follow-up PRs (hero_voice / hero_router itself).
RPC surface (service.sock, namespace service.*)
- service.list → [{name, status, version_installed, healthy}]
- service.inspect {name} → ServiceDefinition + runtime status + last 5 ops
- service.status {name} → {state, uptime_ms, restart_count} (delegates to hp.service_status)
- service.health {name}
- service.build {name, mode, version?} → {installed_paths, duration_ms}
- service.install {name, mode, version?, reset?}
- service.start {name, reset?, version?} → {started_at}
- service.stop {name} → {stopped_at}
- service.restart {name} → {restarted_at}
- service.delete {name, purge_binaries?}
- service.upgrade {name, version?}
- service.verify {name}
- service.logs {name, lines?} → hp.logs_tail
- service.troubleshoot {name}
OpenRPC spec hand-written in crates/hero_router/static/service_manager.openrpc.json, the same approach as the existing static/openrpc.json.
Reuse from existing crates
- hero_proc_sdk::HeroProcFactory: start_service / stop_service / restart_service + service_status / service_list / logs_tail / job_list
- hero_proc_sdk::ServiceBuilder / ActionBuilder: convert ServiceDefinition → ServiceBuildResult
- hero_proc_sdk::socket::{socket_base_dir, service_socket_dir}
- crate::probe::fetch_openrpc for service.health
- crate::log_bridge for tracing → herolib_core file logger
- server/rpc.rs::dispatch shape (envelope helpers ~30 lines duplicated; not abstracting yet)
Source vs download (both first-class)
InstallPolicy::Either { asset_suffix: "linux-amd64-musl" } is the default.
- cargo build --release under $CARGO_TARGET_DIR/hero_service_manager/<name>/src/, copy named binaries into ~/hero/bin/, last-200-lines-on-error.
- svc_install_download (lib.nu:585): resolve tag (/api/v1/repos/<forge_loc>/releases/latest), fetch each <bin>-<asset_suffix>, ELF-verify (\x7fELF magic), chmod +x, touch mtime fix (lib.nu:619), uses FORGEJO_TOKEN if present.
- Freshness check (svc_verify_binaries_fresh from lib.nu:784) → install::verify_fresh().
CLI subcommand
hero_router service <op> [name] [flags]: same code path as the agent (connects to local service.sock over UDS).
First-PR scope: 2 services
hero_db and hero_books: both server+ui pairs, no mycelium quirks, no ONNX. Exercises every code path without dragging in extras yet to design.
Out of scope this PR: hero_router (mycelium auto-detect), hero_voice / hero_embedder / hero_editor (ONNX overlay), hero_proc (circular), and the other ~17. Existing nu modules keep working in parallel: this PR adds capability, doesn't remove anything.
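
The ELF-verify step in the download path above (checking for the \x7fELF magic) guards against saving something that is not a binary, for example an HTML error page returned in place of a release asset. A minimal sketch of such a check follows, using only the standard library. The function names are hypothetical and are not taken from install.rs or build.rs.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

/// The four-byte magic number that starts every ELF file.
const ELF_MAGIC: [u8; 4] = [0x7f, b'E', b'L', b'F'];

/// Returns true if `bytes` begins with the ELF magic number.
/// (Hypothetical helper name, for illustration only.)
fn has_elf_magic(bytes: &[u8]) -> bool {
    bytes.starts_with(&ELF_MAGIC)
}

/// Reads the first four bytes of `path` and compares them with the ELF magic.
/// A file shorter than four bytes is reported as "not ELF" rather than as an
/// I/O error, so a truncated download is rejected the same way as a wrong one.
fn is_elf_file(path: &Path) -> io::Result<bool> {
    let mut header = [0u8; 4];
    let mut file = File::open(path)?;
    match file.read_exact(&mut header) {
        Ok(()) => Ok(has_elf_magic(&header)),
        Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => Ok(false),
        Err(e) => Err(e),
    }
}
```

A caller would typically run `is_elf_file` on the downloaded temp file before the chmod and rename steps, so that a bad download never replaces a working binary.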
Tests
- definition.rs (to_proc_service_build_result), build.rs (tag-resolution URL, ELF magic, asset-name composition).
- tests/service_manager_e2e.rs: tempdir socket + stub hero_proc, drives service.list/start/status/stop against a fake echo service.
- cargo run --bin hero_router -- service list prints the two registered services after starting hero_proc locally.
Verification
- cargo fmt --check && cargo clippy --workspace --all-targets -- -D warnings && cargo build --workspace --release.
- curl --unix-socket .../service.sock http://x/openrpc.json; hero_router service list.
- service install hero_db --mode download --version latest.
- start → status running → health ok → stop.
- rpc.sock, ui.sock unchanged: covered by cargo test --workspace.
Files modified / created
Modified:
- crates/hero_router/src/main.rs: bind third socket, service CLI subcommand
- crates/hero_router/src/lib.rs: re-export service_manager
- crates/hero_router/Cargo.toml: hero_proc_sdk dep already present; reuse hyperlocal + hyper for HTTPS download (no new dep)
- crates/hero_router/CLAUDE.md: document third socket
Created (14 files):
- service_manager/{mod,definition,registry,error,build,install,lifecycle,health,inspect,ops_log,rpc}.rs
- service_manager/services/{mod,hero_db,hero_books}.rs
- static/service_manager.openrpc.json
- tests/service_manager_e2e.rs
Out-of-scope follow-ups (separate issues filed after merge)
- ui.sock (/services page).
- hero_skills/nutools/modules/services/*.nu once all ports complete.
Starting implementation now on development_mik_service_manager.
mik-tf referenced this issue 2026-05-07 16:58:46 +00:00
Status update — code-complete, operationally unverified
PR #91 lands the framework + all 33 service ports + documentation. Workspace gate green (fmt + clippy -D warnings + release build + 125 tests). However: zero of this has been run against a live hero_proc. Closing this META is gated on a smoke session that proves the manager actually drives services end-to-end.
This comment captures everything needed to finish the work.
What's in PR #91 (verified to compile, not to work)
Framework under crates/hero_router/src/service_manager/:
- definition.rs: ServiceDefinition + typed extension points (BindStrategy, Extra, InstallPolicy, ArgSource, EnvSource, Resolver, HealthSpec, TimingPolicy)
- registry.rs: compile-time registry over services::all()
- build.rs: source (cargo build --release) + download (Forgejo Releases via curl) dispatcher
- install.rs: atomic-rename binary placement, ELF magic verify, freshness check
- lifecycle.rs: hero_proc_sdk wrappers (start/stop/restart/status/list)
- health.rs: HTTP probe over UDS via hyperlocal
- inspect.rs: inspect (def + status + last 5 ops) and troubleshoot (+ log tail) composites
- ops_log.rs: per-service ring buffer (32 entries cap)
- rpc.rs: JSON-RPC 2.0 dispatcher + Axum router for service.sock
- error.rs: ServiceError enum with stable JSON-RPC error codes (-32001..-32009)
Socket layout: third UDS service.sock alongside existing rpc.sock and ui.sock, separate OpenRPC domain (static/service_manager.openrpc.json), separate service_id so the router scanner indexes it as a distinct service.
RPC methods (14): service.list / inspect / status / health / build / install / start / stop / restart / delete / upgrade / verify / logs / troubleshoot + rpc.discover + rpc.health.
CLI: hero_router service <op> connects to service.sock over UDS, the same dispatcher path agents use.
Service ports (33): every service_<name>.nu module under hero_skills/nutools/modules/services/ is represented as a Rust file under services/:
Documented exclusions (in services/mod.rs):
- hero_proc: circular, the manager is a hero_proc client.
- hero_onlyoffice: Docker container; engine doesn't issue docker run actions yet.
- hero_do: installer-only nu module, no daemon.
- service_core.nu: empty meta-module.
Documentation (#90 deliverables):
- crates/hero_router/docs/service_manager/README.md: developer guide
- crates/hero_router/docs/service_manager/migration.md: 4-phase nu → manager
- crates/hero_router/docs/service_manager/removal.md: bottom-up deletion order
What is NOT verified (the important part)
The framework has never been run against a live hero_proc. None of these have happened:
- hero_router service list against a running router: output unverified
- hero_router service install hero_db --mode download --version latest: forge fetch + ELF verify + freshness check pipeline never exercised against real Forgejo URLs
- hero_router service install hero_db --mode source: cargo build --release shell-out never run against a real checkout
- hero_router service start hero_db: hero_proc_sdk action-spec translation may be subtly wrong; would surface here
- hero_router service health hero_db: UDS HTTP probe path untested against a real service
- hero_router service troubleshoot hero_db: composite output never inspected on a real misconfigured service
- hero_router service stop hero_db then service status hero_db showing exited
Per-service translation accuracy is unproven. I read 33 nu modules and translated them into Rust by hand. Almost certainly some have bugs (wrong env key, missing arg, wrong socket subdir). The only way to find them is to run them.
Known structural gaps (data model declares, engine doesn't honor)
These are intentionally deferred — the data field is correct so the agent can reason about the intent, but the engine treats them as no-ops:
- LiveKit auto-bootstrap (hero_collab): Extra::LiveKit variant + auto-bootstrap of hero_livekit
- ONNX install (hero_voice, hero_embedder, hero_editor): Extra::OnnxRuntime declared; install-side support TODO
- Cascade (--split) (hero_aibroker): ServiceDefinition sharing binaries[]
- Multi-instance (hero_codescalers, mycelium)
- Mycelium auto-detect (hero_router): BindStrategy::Mycelium engine support via mycelium_sdk
- Docker actions (hero_onlyoffice): exec
- FromHeroProcSecret resolver
How to actually finish this issue
Step 1 — Local smoke (≤30 min)
On the workstation, after source ~/hero/cfg/env/env.sh:
If any of these fail, that's the bug list to fix before closing the issue.
Step 2 — Walk ≥3 services with different shapes
Pick services exercising different code paths:
- hero_db (musl, RESP TCP port in kill_other): covers download path + port reclaim
- hero_books (gnu, env wiring for HERO_BOOKS_DATA + HERO_EMBEDDER_URL): covers Resolver::HeroHomePath + Resolver::SocketPath
- hero_browser (musl, plain server+ui): sanity baseline
Each: install → start → health → stop. File a fix-up commit for any translation bugs found.
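
The install → start → health → stop walk above can be scripted rather than typed by hand. The sketch below builds the four `hero_router service …` argument lists for one service and runs them in order. The subcommand spellings are the ones quoted elsewhere in this thread; the helper names are hypothetical, and treating a non-zero exit status as failure is an assumption about the CLI, not something confirmed here.

```rust
use std::io;
use std::process::Command;

/// Converts a slice of string literals into owned CLI arguments.
fn to_args(parts: &[&str]) -> Vec<String> {
    parts.iter().map(|s| s.to_string()).collect()
}

/// Argument lists for one smoke walk of `service`, in execution order.
fn smoke_invocations(service: &str) -> Vec<Vec<String>> {
    vec![
        to_args(&["service", "install", service, "--mode", "download", "--version", "latest"]),
        to_args(&["service", "start", service]),
        to_args(&["service", "health", service]),
        to_args(&["service", "stop", service]),
    ]
}

/// Runs the walk against the `hero_router` binary found on PATH.
/// Returns Ok(false) at the first step that exits unsuccessfully.
/// ASSUMPTION: the CLI signals failure through its exit status.
fn run_smoke(service: &str) -> io::Result<bool> {
    for args in smoke_invocations(service) {
        let status = Command::new("hero_router").args(&args).status()?;
        if !status.success() {
            eprintln!("smoke step failed: hero_router {}", args.join(" "));
            return Ok(false);
        }
    }
    Ok(true)
}
```

Calling `run_smoke` once each for hero_db, hero_books and hero_browser would cover the three shapes listed above; the first failing step names the subcommand to investigate.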
Step 3 — Heroci validation
Per feedback_no_direct_push_except_hero_demo.md, this is an L2 PR change and needs explicit go-ahead. Once given:
Step 4 — Catalog per-service translation bugs
Run hero_router service install <name> --mode download for each service that has a published release. Any failure is either:
- services/<name>.rs (wrong action shape): fix in PR.
- Limitation:
Track the verified-vs-broken matrix in a follow-up comment here.
Step 5 — Close this META
Only when:
Follow-up issues to file before closing
These are explicit out-of-scope for this PR and should each become their own tracked issue:
- hero_router#9X: LiveKit auto-bootstrap (Extra::LiveKit)
- hero_router#9X: ONNX Runtime auto-install (Extra::OnnxRuntime engine support)
- hero_router#9X: BindStrategy::Mycelium engine support
- hero_router#9X: Multi-instance suffix support (action/socket templating)
- hero_router#9X: Cascade --split mother/child as separate ServiceDefinition
- hero_router#9X: Docker-action interpreter for hero_onlyoffice
- hero_router#9X: Admin UI dashboard pane on ui.sock (driven by service.sock)
- hero_router#9X: service.deploy composite (build + install + restart + verify)
- hero_router#9X: Per-service nu module deletions (one PR per service after verified parity)
Each follow-up references this META.
TL;DR
PR #91 is code-complete: framework, 33 ports, docs, gate green. Not operationally validated — zero live hero_proc runs. Steps above (1–5) are the path to closing this issue. Realistically: one focused 30-60 min smoke session covers steps 1–2; heroci validation is one more session.
Direction change — switching from data-schema to code-per-service
Per Kristof's feedback (paraphrased):
He's right. PR #91 built a ServiceDefinition schema + interpreter on top of hero_proc_sdk, adding a meta-layer where one wasn't needed. The 6 "documented gaps" (LiveKit auto-bootstrap, ONNX install, cascade --split, multi-instance, mycelium auto-detect, Docker actions) are all cases where a service didn't fit the schema, exactly the rigidity he called out.
New direction
Each service is Rust code (a small module with install / start / stop / health functions) that calls hero_proc_sdk::ServiceBuilder + ActionBuilder directly, plus shared helpers ported from the existing nu lib.nu. No schema, no interpreter. Same model as the existing nu modules, just in Rust.
What survives from #91
Roughly 60-65% of the PR is reusable:
- service.sock + Axum router + JSON-RPC dispatcher (the call surface is correct; only the handler bodies change)
- service.* method namespace + OpenRPC spec + CLI subcommand (hero_router service <op>)
- ops_log ring buffer
- install.rs / build.rs / health.rs helpers: these are exactly the "primitives" Kristof referenced; refactored into service_manager::lib
- README.md, migration.md, removal.md: mostly valid; updated to describe the trait-not-schema model
What gets thrown out
- definition.rs: ServiceDefinition struct + 7 typed-extension-point enums + the interpreter to_proc_service_build_result() method (~400 lines)
- lifecycle.rs interpreter
What gets rewritten
Each services/<name>.rs becomes a HeroService trait impl that calls hero_proc_sdk directly. Volume is similar (~50 lines per service); flexibility is much higher: every nu-module wrinkle (LiveKit, ONNX, cascade, multi-instance, mycelium) becomes "just write the code in this start() method", not "extend the schema".
For a service with a quirk (LiveKit auto-bootstrap, ONNX install, cascade variant), the quirk lives inline in that service's install() or start() body; no schema extension is needed.
Plan
PR #91 is being closed; v2 is in flight on development_mik_service_manager_v2 off development. Same scope (33 services + framework + docs) but the per-service files are Rust code, not data literals. New PR will follow.
Closing checklist remains the same as my previous comment: operational verification (live hero_proc smoke + heroci validation + per-service translation accuracy check) is what gates closing this META, regardless of v1 vs v2.
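
To make the code-per-service model concrete, here is a self-contained sketch of a trait with default method bodies, one plain service that relies on the defaults, and one quirky service that overrides start(). It is illustrative only. The `Supervisor` trait stands in for the hero_proc_sdk client, whose real API is not shown in this thread, and the action names (`hero_browser_server`, `hero_livekit_server`, `hero_collab_server`) and method signatures are assumptions rather than the actual HeroService definition.

```rust
/// ASSUMPTION: stand-in for the supervisor client; the real code calls hero_proc_sdk.
trait Supervisor {
    fn start(&mut self, action: &str) -> Result<(), String>;
    fn stop(&mut self, action: &str) -> Result<(), String>;
}

/// Hypothetical shape of the per-service trait: defaults cover the common case.
trait HeroService {
    fn name(&self) -> &'static str;

    /// Default: a single action named `<name>_server`.
    fn actions(&self) -> Vec<String> {
        vec![format!("{}_server", self.name())]
    }

    /// Default start: start every action in order.
    fn start(&self, sup: &mut dyn Supervisor) -> Result<(), String> {
        for action in self.actions() {
            sup.start(&action)?;
        }
        Ok(())
    }

    /// Default stop: stop every action in order.
    fn stop(&self, sup: &mut dyn Supervisor) -> Result<(), String> {
        for action in self.actions() {
            sup.stop(&action)?;
        }
        Ok(())
    }
}

/// Plain service: the defaults are enough.
struct HeroBrowser;
impl HeroService for HeroBrowser {
    fn name(&self) -> &'static str {
        "hero_browser"
    }
}

/// Quirky service: the dependency bootstrap lives inline in start().
struct HeroCollab;
impl HeroService for HeroCollab {
    fn name(&self) -> &'static str {
        "hero_collab"
    }
    fn start(&self, sup: &mut dyn Supervisor) -> Result<(), String> {
        sup.start("hero_livekit_server")?; // hypothetical dependency action name
        sup.start("hero_collab_server")
    }
}

/// Test double that records every call instead of talking to a supervisor.
#[derive(Default)]
struct Recorder {
    calls: Vec<String>,
}
impl Supervisor for Recorder {
    fn start(&mut self, action: &str) -> Result<(), String> {
        self.calls.push(format!("start {action}"));
        Ok(())
    }
    fn stop(&mut self, action: &str) -> Result<(), String> {
        self.calls.push(format!("stop {action}"));
        Ok(())
    }
}
```

The point of the shape is that the dispatcher only ever sees `dyn HeroService`; adding a new quirk means editing one method body, with no shared type to extend.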
v2 status — code-not-data design landed in PR #92
#91 is closed; #92 is the live PR. The pivot rationale is in comment 30721 — Kristof's redirection captured.
What v2 changes from v1
- pub const DEF: ServiceDefinition (data + 7 enums) → pub struct X; impl HeroService for X (Rust code)
- start() / install()
- ServiceDefinition data
- build.rs / install.rs / lifecycle.rs / health.rs → service_manager::lib (Rust port of nu lib.nu)
The v2 framework matches the existing nu-modules-under-hero_skills pattern exactly, just in Rust. Kristof's "metadata = less flexibility" concern is structurally addressed: the engine has zero per-service knowledge, every quirk is just code in the service's file.
What survived the pivot
About 60% of v1 carried over unchanged:
- service.sock + Axum router + JSON-RPC dispatcher
- service.* method namespace + OpenRPC spec + CLI subcommand
- ops_log ring buffer
- README.md / migration.md / removal.md: README rewritten to describe trait-not-schema
- lib.rs
Coverage — same 33 services
Documented exclusions unchanged: hero_proc (circular), hero_onlyoffice (Docker), hero_do (installer-only), service_core.nu (empty).
Verification status
What's verified (CI-level):
- cargo fmt --check
- cargo clippy -p hero_router --all-targets -- -D warnings
- cargo build -p hero_router --release
- cargo test -p hero_router: 119 tests (9 dispatcher e2e + 5 lib unit + 105 pre-existing)
- list, scan, spec, markdown, html, add, remove, start, stop, access
What's NOT verified (operational):
- start / health / stop against any real service
- services/<name>.rs against its service_<name>.nu source-of-truth
Closing checklist (unchanged)
The path to closing this META is the same regardless of v1 vs v2:
- service_proc start → start hero_router → walk hero_db / hero_books / hero_browser through install --mode download → start → health → stop
- services/<name>.rs impl
- feedback_no_direct_push_except_hero_demo.md
- --split, multi-instance, mycelium auto-detect, Docker-action support: file before closing this META
Attempting local smoke on workstation now.
Local smoke ✅ — manager drives a real service end-to-end (with one translation bug found and fixed)
Ran v2 (PR #92) against live hero_proc_server on the workstation. Full lifecycle proven: install → start → health → stop → status.
Trace
Translation bug caught + fixed
The first attempt failed health probe — exposed a real bug:
The v0.3.2 hero_db_server binary binds sockets directly under $HERO_SOCKET_DIR/ with flat naming (hero_db_server.sock / hero_db_resp.sock / hero_db_ui.sock), NOT under a per-service subdirectory like hero_db/rpc.sock as I'd coded (and as the upstream nu module's kill_other paths also incorrectly listed).
Fixed in PR #92 by overriding health() to probe the actual socket name and updating kill_other.socket to match what the binary actually binds. The fix is the kind of change v2's "code-not-data" architecture makes trivial: just edit the Rust function body, no schema extension needed.
This is exactly the value of running real smoke tests vs. trusting a translated-from-nu schema: the unit tests + clippy + release build all passed before this fix, but the binary's actual runtime behavior diverged from what both the nu module and my v2 port assumed.
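
The difference between the assumed and the observed socket layout is small enough to show directly. The sketch below computes both paths for a given socket directory; the function names are hypothetical and are not the helpers used in PR #92.

```rust
use std::path::{Path, PathBuf};

/// Per-service subdirectory layout the port originally assumed:
/// `<socket_dir>/<service>/rpc.sock`.
fn nested_socket(socket_dir: &Path, service: &str) -> PathBuf {
    socket_dir.join(service).join("rpc.sock")
}

/// Flat layout observed for hero_db_server v0.3.2:
/// `<socket_dir>/<binary>.sock`.
fn flat_socket(socket_dir: &Path, binary: &str) -> PathBuf {
    socket_dir.join(format!("{binary}.sock"))
}
```

With a socket directory of `/tmp/sockets`, the first yields `/tmp/sockets/hero_db/rpc.sock` and the second `/tmp/sockets/hero_db_server.sock`, so a health probe aimed at the nested path finds no socket even though the service is running.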
Verified surface
- service.sock binds at startup (third socket alongside rpc/ui)
- rpc.discover returns the OpenRPC document (16 methods)
- service.list returns all 33 services with correct installed / registered flags
- service.inspect returns identity + binaries + status
- service.install --mode download --version latest against real Forgejo release
- service.start registers with hero_proc + starts; sockets get bound
- service.health does live UDS probe and returns the body + status
- service.status reflects hero_proc supervisor state (running / exited)
- service.stop clean shutdown
- service.troubleshoot composite with status + recent_ops + log_tail
- hero_router features unaffected (scanner indexes the new socket as a distinct service)
Still-not-tested
- start(), re-run) but each one is its own audit.
- --mode source
- --split, multi-instance, mycelium auto-detect, Docker
Updated closing checklist
- Local smoke for hero_db ✅ done
- start() / health() overrides
The framework is operationally validated. What remains is per-service translation accuracy: every additional service that smokes green moves us closer to closing this META.
Aligned to hero_skills@371138f convention (_ui → _admin)
Per direction: don't chase per-binary smoke fixes. Mirror the upstream nu modules' canonical naming. Done in commit 92a947a on PR #92:
What changed
Sweeping rename mirroring hero_skills@371138f:
- hero_<service>_ui → hero_<service>_admin (and consequently in binaries[], ActionBuilder names, kill_other.socket paths, health probes)
- <service>/ui.sock → <service>/admin.sock
- <service>_ui → <service>_admin
Affected: 29 of 33 service files.
Special cases
- hero_collab_web, not _admin: reverted my prior smoke-driven misfix
- hero_shrimp
- hero_planner
- hero_db: hero_db/admin.sock
- hero_mail: hero_mail_cli → hero_mail; _ui → _admin
- hero_books_ui and the dev _admin binary into one
- hero_router: ui.sock is the router's admin dashboard socket, not a managed service
- hero_code_web (no _admin in canonical)
Trade-off
Published Forgejo releases for some services still ship the pre-rename _ui asset names (e.g. hero_db v0.3.2 ships hero_db_ui, not hero_db_admin). The --mode download path fails for those services until upstream cuts new releases with renamed assets. Source-build path works regardless.
This is the explicit choice: align with the canonical convention now, accept that downloads fail until releases catch up. The alternative (chase observed binary behaviour per-service) was the churn we're avoiding.
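
The trade-off above follows mechanically from how asset names are composed. The sketch below applies the generic _ui → _admin rename, builds the `<bin>-<asset_suffix>` asset name mentioned earlier in this thread, and checks it against a list of published assets. All function names are hypothetical, the sketch ignores the special cases listed above (for example hero_collab_web), and the published list in the usage note is an assumed example rather than real release metadata.

```rust
/// Maps a pre-rename `_ui` binary name to its `_admin` form.
/// Names without the `_ui` suffix are returned unchanged.
/// NOTE: generic rule only; per-service special cases are not modelled.
fn canonical_binary(name: &str) -> String {
    match name.strip_suffix("_ui") {
        Some(stem) => format!("{stem}_admin"),
        None => name.to_string(),
    }
}

/// Composes a release asset name as `<bin>-<asset_suffix>`.
fn asset_name(binary: &str, asset_suffix: &str) -> String {
    format!("{binary}-{asset_suffix}")
}

/// True if the release's published asset list contains the asset for `binary`.
fn release_has_asset(published: &[&str], binary: &str, asset_suffix: &str) -> bool {
    let wanted = asset_name(binary, asset_suffix);
    published.iter().any(|a| *a == wanted.as_str())
}
```

For a release that only published `hero_db_server-linux-amd64-musl` and `hero_db_ui-linux-amd64-musl`, looking up `canonical_binary("hero_db_ui")` with the `linux-amd64-musl` suffix finds nothing, which is the download failure described above.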
Workspace gate
- cargo fmt --check ✅
- cargo clippy -p hero_router --all-targets -- -D warnings ✅
- cargo build -p hero_router --release ✅
- cargo test -p hero_router: 119 tests pass ✅
Closing checklist (re-stated)
The framework is canonically aligned. Remaining work to close #90:
- _admin asset names: verify install/start/health/stop end-to-end on those
- --split, multi-instance, mycelium auto-detect, Docker actions
PR #92 is ready for review against the new convention.
🤝 Handoff for next agent — everything you need to continue
This is a complete state dump for whoever picks this up next. Read top-to-bottom; everything below is operational, not aspirational.
TL;DR
PR #92 is open against development_mik_service_manager_v2 → development. Code is convention-aligned with hero_skills@371138f, gate green (fmt + clippy -D warnings + release build + 119 tests). Awaiting review + squash-merge OK.
PR #91 (v1, schema-based) is closed, superseded.
What this work delivered
A Rust-based Hero Service Manager inside hero_router:
- service.sock alongside the existing rpc.sock and ui.sock
- service.* JSON-RPC methods + own OpenRPC document
- hero_router service <op>
- HeroService trait with default impls; 33 per-service Rust modules use hero_proc_sdk + shared helpers directly (Kristof's "code-not-data" model, like nu modules in Rust)
- crates/hero_router/docs/service_manager/{README,migration,removal}.md
Architecture rationale
This work pivoted mid-session from a ServiceDefinition schema + interpreter (v1, PR #91) to free-form Rust per service (v2, PR #92). See comment 30721 for Kristof's redirection. Don't re-introduce the schema: quirks (LiveKit bootstrap, ONNX install, cascade --split) live as Rust code in each service file, not as schema extensions.
Verified-working state
- cargo fmt --check, clippy -D warnings, release build, 119 tests
- service.sock binds at startup alongside rpc.sock + ui.sock
- curl --unix-socket service.sock /openrpc.json returns 16 methods
- service.list
- service install hero_db --mode download --version latest (BEFORE the convention alignment commit)
- start hero_db → health 200 OK from real hero_db_server v0.3.2 → stop
Critical: after the convention alignment commit 92a947a, the --mode download path will fail for services whose published releases still ship the pre-rename _ui asset names. This is a known, deliberate trade-off: the code is forward-aligned, releases need to catch up. Source-build path works regardless.
Known limitations (each tracked as a follow-up issue)
- --split mother variant
- BindStrategy::Mycelium engine support
- FromHeroProcSecret resolver (low priority)
- ui.sock
Limitation: markers in the corresponding services/<name>.rs doc comments make the per-service caveats discoverable from the source.
Documented exclusions (4 services intentionally NOT in the registry)
In services/mod.rs doc comment:
- hero_proc: circular, the manager IS a hero_proc client.
- hero_onlyoffice: Docker container; #98 will add support.
- hero_do: installer-only nu module, no daemon.
- service_core.nu: empty meta-module.
How to continue
Option A: Wait for upstream releases, then validate
_admin asset names.
Option B: Source-build smoke (works regardless of release naming)
- export CODEROOT=$HOME/Documents/temp/hero_work (or wherever you cloned).
- git clone https://forge.ourworld.tf/lhumina_code/<svc>.git $CODEROOT/lhumina_code/<svc> for each service to test.
- hero_router service install <svc> --mode source then proceed with start / health / stop.
Option C: Heroci validation
Per feedback_no_direct_push_except_hero_demo.md: needs the user's explicit go-ahead (an L2 PR change). Don't push to heroci unprompted.
How to run the smoke (reference)
If any service install/start fails with the new convention, the fix is one of:
- services/<name>.rs::start() or health() (no schema change needed)
What NOT to do
- ServiceDefinition schema or interpreter. Kristof was clear on this. Each service is code, not config.
- hero_proc_meta, the canonical pattern is daemon-side runtime resolution. FromCallerEnv for env-passthrough from operator shell is fine.
- feedback_squash_merge_gate.md.
- migration.md Phase 1; deletion only in Phase 4 once parity is verified.
Workstation state at handoff
Local box has the smoke session leftovers:
- ~/hero/bin/: hero_proc, hero_proc_server, hero_proc_admin, hero_db, hero_db_server, hero_db_ui (note: pre-rename binary; would be hero_db_admin post-release-catchup)
- ~/hero/var/sockets/: hero_router/{rpc,ui,service}.sock, hero_proc/rpc.sock, plus hero_db_*.sock leftovers
- hero_proc_server, hero_router --port 0, hero_db_server, hero_db_ui serve
Stop with:
The ~/hero/bin/ binaries are intentionally left in place so a follow-up agent can re-test without reinstalling.
Documents to read first (in this order)
- crates/hero_router/docs/service_manager/README.md: developer guide
- crates/hero_router/docs/service_manager/migration.md: 4-phase nu→manager plan
- crates/hero_router/docs/service_manager/removal.md: bottom-up deletion order
Closing checklist (what gates closing #90)
When all 4 above are checked, this META can close.