[META] Hero OS demo Phase 3 - complete admin and tester UX #238
lhumina_code/home#238
s173 close (2026-05-28): Code-only ship of the four operator-and-tester UX issues filed during the s172d live walk. Two squash-merges landed on origin/development across hero_os_tfgrid_deployer and hero_cockpit. hero_os_tfgrid_deployer@a6fc6a4 adds a small polling loop to the admin per-user VMs table (new `GET /users/{u}/vms.json` endpoint; rows refresh in place every five seconds; full page reload only when install_state crosses a boundary that reshapes the action set) and appends `/hero_cockpit/web/` to every rendered cockpit URL on the admin UI so the link lands on the cockpit rather than hero_proxy's service-discovery dashboard. hero_cockpit@ba02baa turns the cockpit Services page into the sandbox's service catalog: a new catalog module defines fourteen canonical service entries, two new RPCs (list_catalog and install_service) validate requests against the catalog so a hand-crafted call cannot trigger lab build on an arbitrary repo string, the page renders a greyed-out row per uninstalled catalog entry with a per-row Install button, and the URL column gains clickable cockpit-relative links derived from service.name suffix conventions for every running service with a web UI. Pre-merge gate green on both commits (fmt + clippy `--workspace --all-targets -- -D warnings` + 72 deployer server-lib tests + 22 cockpit server-lib tests with six new in catalog.rs + 16 cockpit web tests with four new for the URL deriver + workspace release build + `--info` smoke on both deployer binaries and all four cockpit binaries). Closes hero_cockpit#11, hero_cockpit#12, hero_os_tfgrid_deployer#19, and the install-side polling half of hero_os_tfgrid_deployer#18 (Provision-side async conversion deferred as a focused follow-up). Operator authorized a code-only autonomous shape with admin VM 0069 + alice123 + all QA live state explicitly off-limits, so the actual live verify + e2e_checklist row flips + Forge issue closes carry to the next operator-driven smoke + deploy cycle once CI republishes the binaries. No D-NN or L-NN minted. End-to-end checklist counts unchanged from s172d: 63 Have / 18 Need / 2 Blocked across 83 rows.
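The refresh rule in the s173 note (update rows in place every five seconds, reload the whole page only when install_state crosses a boundary that changes the available actions) can be sketched as a small decision function. This is a hypothetical Python illustration, not the deployer's code: the state names `none`, `installing` and `ready` come from the 172c summary in this issue, while the `ACTIONS` table and the function name are assumptions made for the example.

```python
# Hypothetical mapping from install_state to the buttons shown on a VM row.
# The state names appear in this issue; the action sets are assumed.
ACTIONS = {
    "none": {"install", "delete"},
    "installing": {"delete"},
    "ready": {"open_cockpit", "delete"},
}

POLL_INTERVAL_SECONDS = 5  # refresh cadence described in the s173 note


def refresh_mode(previous_state: str, current_state: str) -> str:
    """Return 'reload' when the action set changes, otherwise 'in_place'."""
    if ACTIONS.get(previous_state) != ACTIONS.get(current_state):
        return "reload"
    return "in_place"
```

Under these assumptions a row that stays in `installing` is patched in place, while the `installing` to `ready` transition triggers one full reload so the new buttons render.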
Carry to next session (operator-driven, ~1-2h): `lab build hero_os_tfgrid_deployer --download --install` + `lab build hero_cockpit --download --install` on admin VM 0069, restart the four touched services via hero_proc, browser-walk the four UX behaviours, flip the rows that live evidence supports, post comments + close the four Forge issues. Then a following session picks up the deferred welcome-email pipeline (A-18 + B-1, with operator selection of resend.com vs SendGrid amending or superseding D-20), the BYO-key auto-start cascade in hero_cockpit, and the Provision-async conversion for the remaining half of hero_os_tfgrid_deployer#18.
s172d close (2026-05-28): Per-tester Forgejo OAuth apps replace the workspace-shared model. Every tester VM now gets its own OAuth application minted on `forge.ourworld.tf` at provision time and reaped at delete time; each app's redirect_uris allowlist contains exactly one URI bound to that tester's own URL. A real Forge user (alice172d) walked the full SSO loop end-to-end on her own cockpit URL: an anonymous request returned 302 to Forge OAuth with her per-tester client_id, she set a permanent password on her first Forge login, saw the consent screen display her tester app name, accepted, and landed on her own Hero Cockpit. Three distinct OAuth client identities proven live across three URLs (admin plus two testers), each with a single-URI redirect_uris allowlist private to that host. Six squash-merges shipped on hero_os_tfgrid_deployer/development closing every gap surfaced live: per-tester OAuth app create-and-delete wire with a new schema migration for the per-VM client_id and client_secret triple; the missing hero_proxy domain.add call that registers each tester URL as OAuth-gated; a health-poll loop replacing a fixed sleep that wasn't enough for a cold-start hero_proxy on a fresh VM; the hero_proxy allowlist secret being pushed at the wrong hero_proc context; and the empirical discovery that Forgejo rotates the OAuth client_secret as a side effect of every PATCH on an OAuth application, even when the body does not request rotation, which had been silently invalidating the admin VM's own SSO session config on every tester operation. Three Forge issues filed for the demo's remaining UX polish: hero_cockpit#11 (services page Install button for components not yet installed), hero_cockpit#12 (clickable URL column for services with a web UI), hero_os_tfgrid_deployer#18 (admin UI progress indicator for Provision symmetric with Install). The admin VM deployment runbook gained a new caveat plus a one-shot OAuth-secret-rotation recovery appendix.
The full tester onboarding flow was also live-walked by an operator through the admin UI without any curl scripts: register on Forge, set permanent password, upload SSH key, return to admin UI, see SSH key badge appear, click Provision (VM minted in 55s), click Install (9-minute cascade). End-to-end checklist rows: A-31 (per-tester hero_proxy allowlist) Need to Have; B-40 (tester opens cockpit URL on their own VM) caveat dropped. Counts moved to 63 Have / 18 Need / 2 Blocked across 83 rows. Carries to next session: A-18 plus B-1 welcome-email pipeline (operator selects resend.com vs SendGrid as the email provider), hero_books BYO-key auto-start cascade in the cockpit (when tester pastes an AI key in Settings, hero_aibroker plus hero_books start automatically), the three cockpit-polish issues above. Estimated nine to thirteen hours.
s172c close (2026-05-28): The install pipeline now works LIVE end-to-end. A freshly provisioned tester VM walks through `deployer.install_hero_stack` to install_state=ready in roughly eight minutes; all twelve canonical components are running inside the tester VM; the tester's TFGrid Web Gateway URL serves the cockpit publicly (HTTP 200 on `/`, HTTP 303 on `/hero_cockpit/web/` to the welcome page). Admin SSH co-injection verified live (workstation SSH into the tester's root via mycelium succeeded). Five squash-merges shipped on hero_os_tfgrid_deployer/development closing every install-pipeline gap surfaced live (mycelium IP not persisted, webgateway not cleaned on delete, PEM trailing newline lost on secret seed, missing curl on bare base image, listener seed not propagated via bash environment files, and a port-default drift between hero_proxy builds). The empirical lesson, that runtime config must flow through hero_proc's secret store rather than bash environment files, was codified as a locked architectural decision (workspace-private). A structural follow-up issue was filed: hero_os_tfgrid_deployer#17 to replace the bash install runner with a Rust crate consuming a typed manifest. A post-v1 vision issue was filed: hero_demo#68 for a service catalog UI that lets testers self-install any published Hero service. End-to-end checklist rows A-30 and B-40 flipped to Have-with-caveat (install pipeline live; full browser SSO walks pending propagation of the OAuth client secrets to tester VMs which is the next session's primary task, the same SSH-push pattern this session shipped). Counts moved to 62 Have / 19 Need / 2 Blocked across 83 rows. Carries to next session: propagate the four cockpit OAuth secrets to tester VMs via the same SSH-push pattern, then walk two testers simultaneously (alice plus bob) with browser SSO walks demonstrating per-tester isolation plus the admin symmetric trust verified across both cockpits; estimated three to five hours.
s172b close (2026-05-27): The Forgejo OAuth callback URL list is now managed automatically by the deployer.
Every time a new tester VM is provisioned, the deployer adds that tester's callback URL to the workspace OAuth application; every time a tester VM is deleted, the URL is removed. This eliminates the manual sixty-second step the operator previously did per tester through the Forge admin UI. Code shipped at hero_os_tfgrid_deployer@ec27241 (+354 LOC, +7 tests, pre-merge gate clean). The cutover required pivoting the production OAuth application from a site-admin app to a user-owned app because Forgejo only exposes OAuth applications through the per-user API path (confirmed by reading the Forgejo source for redirect URI matching). Browser walk by operator verified login still lands. The first attempted end-to-end install walk on a fresh tester surfaced four latent bugs in the install pipeline code that shipped earlier (none are design issues with the trust model; all are wire-up gaps), captured in the next-session plan with concrete fix recipes. The install live walk and the end-to-end checklist row flips carry to the next session, estimated three to four hours.
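The add-on-provision, remove-on-delete bookkeeping described in the s172b note can be sketched as two pure list operations. This is a minimal illustration under stated assumptions: the function names, the placeholder hosts and the `/callback` path are hypothetical, and the sketch deliberately leaves out the Forgejo API call that would persist the list.

```python
def add_redirect_uri(redirect_uris: list[str], callback_url: str) -> list[str]:
    """Return a new allowlist with callback_url appended once (idempotent)."""
    if callback_url in redirect_uris:
        return list(redirect_uris)
    return [*redirect_uris, callback_url]


def remove_redirect_uri(redirect_uris: list[str], callback_url: str) -> list[str]:
    """Return a new allowlist without callback_url; absent URLs are ignored."""
    return [uri for uri in redirect_uris if uri != callback_url]


# Example with placeholder hosts and a made-up callback path.
uris = ["https://admin.example/callback"]
uris = add_redirect_uri(uris, "https://alice.example/callback")
uris = add_redirect_uri(uris, "https://alice.example/callback")  # no duplicate
after_delete = remove_redirect_uri(uris, "https://alice.example/callback")
```

Keeping both operations idempotent means a retried provision or delete cannot corrupt the allowlist. The s172d note in this issue adds a caveat for any real implementation: Forgejo was observed to rotate the client_secret on every PATCH, so whatever writes the list back also has to capture and store the new secret.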
s171 close (2026-05-27): A-12 (deployer.provision_vm calls deploy_webgateway after deploy_vm, persists daemon-returned fqdn, surfaces on admin user_detail.html) shipped at hero_os_tfgrid_deployer@15e5473 (+492/-6 across 8 files: schema M3 webgateway_fqdn column via the canonical recreate-with-FK dance, ComputeAdapter.deploy_webgateway JSON-RPC wrapper, handle_provision_vm extension, admin Cockpit URL column with copy-to-clipboard, new TFGRID_GATEWAY_NODE_SID env block). Pre-merge gate clean: fmt + clippy `-D warnings` + 32 server-lib tests (+6 new) + `--info` smoke on both deployer binaries. Live walk on admin VM 0069 surfaced an API asymmetry (deploy_webgateway.node_sid takes raw TFGrid node_id, not the daemon-local catalog sid that deploy_vm.node_sid takes); pivoted secret to TFGRID_GATEWAY_NODE_SID=2 and retried, then three subsequent provision attempts each hit the daemon's 300s inline-await timeout (consistent QA substrate finalization slowness today; not a transient flake). Daemon-side rollback ran cleanly each time (2 orphan contracts cancelled per attempt). Filed hero_compute#131 requesting deploy_webgateway 300s timeout bump + env var + per-chain differentiation. A-12 flipped in `docs/hero_os/free/e2e_checklist.md` from Need to Have-with-caveat (code path complete + gateway-node selection live-verified through daemon logs; live URL gated on QA substrate window or hero_compute#131). 60 Have / 20 Need / 2 Blocked across 82 rows. Carries to s172: A-30 Hero stack auto-install on tester VM (design-locked + minimal vertical slice; SSH-and-run vs cloud-init vs pre-baked image decision needed up-front).
Current state (s168 close, 2026-05-27)
Code shipped across three repos: hero_os_tfgrid_deployer@8c640cd (provisioning fixes, SSH-key readiness in admin UI, closes hero_os_tfgrid_deployer#11), hero_cockpit@a52b784 (admin scaffold redirect, Books card, hero-voice-bar), and hero_lib@2f46f8f5 (upstream tools/src/forge/client.rs fix). All deployed on the public QA admin VM at hcockpit.gent01.qa.grid.tf. Browser walk partial: cockpit admin redirect and Books card on tester landing confirmed; voice-bar, tester creation, default-image provision, regenerate-password, and Books navigation carry to the next session. Three follow-up issues filed: hero_router#113, hero_os_tfgrid_deployer#12, hero_voice#36.
Session 167 handoff, 2026-05-27: flow contract confirmed for the free-testing channel. Runbook creates the admin VM; allowlisted admins use `/hero_tfgrid_deployer/admin/` to create/select Forge testers and provision child VMs through existing deployer/compute; testers use Forge to change password and upload SSH keys, then enter `/hero_cockpit/web/` through SSO. Normal cockpit use must not require pasting a Forge API token; token paste remains fallback/headless. Paid onboarding/billing/KYC stays out of scope for this issue.
Phase 3 completes the hand-off-ready Hero testing environment after Phase 2 closed the SSO/auth substrate. The auth perimeter is already correct: cockpit and deployer paths on `https://hcockpit.gent01.qa.grid.tf` are restricted until Forge login, and the QA admin allowlist is `mik-tf`, `scott`, `despiegk`. This issue owns the post-login product UX and the final `home/docs/hero_os/free/e2e_checklist.md` walk.
Admin target UX: an allowlisted admin opens `https://hcockpit.gent01.qa.grid.tf/hero_tfgrid_deployer/admin/`, logs in through `forge.ourworld.tf` if needed, and lands on a real deployer admin dashboard, not scaffold text. The dashboard lets the admin list testers, see each tester's VM status, create a new tester, provision a VM, watch provisioning state (`provisioning`, `starting`, `running`, `failed`), regenerate a one-time password, destroy and redeploy a VM, delete or disable users where allowed, and see useful event/log/error details when something fails. The admin should not need CLI knowledge. Implementation should reuse `hero_tfgrid_deployer` for orchestration and `hero_compute` for VM lifecycle, gateway and state information.
Tester target UX: a tester receives the cockpit URL plus Forge credentials out of band, opens `https://hcockpit.gent01.qa.grid.tf/hero_cockpit/web/`, signs in through Forge SSO, grants consent once, and lands in Hero Cockpit as their personal Hero computer. The cockpit home should be clear and non-admin: app/services launcher, Books visible and openable, Settings visible, Manual/help visible, service status available without overwhelming the user, non-production demo warning visible, and no normal-user paste-token flow. Voice is in scope through local `lhumina_code/hero_voice` using `hero_voice_widget` / `<hero-voice-bar>`. Slides, Whiteboard and Call should only be presented as working if they are actually reachable; otherwise their checklist rows stay Need or Blocked.
Completion rule: `e2e_checklist.md` is the acceptance contract. A row flips to Have only after live browser verification on the SSO-gated QA URL. Code existing, RPC working, or a CLI-only workaround is not enough. Definition of done: a real admin can operate the tester and VM lifecycle from the browser, and a real tester can log in and use the core Hero cockpit without us standing next to them.
One implementation guardrail for this issue: do not reinvent VM lifecycle or dashboard foundations. The admin side should reuse `hero_compute` for deploy, delete, gateway and state information, and `hero_tfgrid_deployer` should present that cleanly to operators. The tester side should connect the existing `hero_cockpit` surfaces and use `hero_voice` through `hero_voice_widget` for voice. This is primarily a wiring, deployment and UX verification pass over existing Hero components, with new code only where the checklist exposes a real gap.
Additional implementation alignment for the next session: pull `development` before coding, use `hero_ui_dashboard_admin` for admin UX shape, `hero_ui_dashboard_implementation` for Rust/Askama/admin-lib wiring, and `hero_service_implementation` with the current `lhumina_code/hero_service` template as the service reference. First inventory the existing Hero stack, especially `hero_compute`, `hero_tfgrid_deployer`, `hero_cockpit`, and `hero_voice`, then connect what is already there. Do not reinvent VM lifecycle, service lifecycle, dashboard chrome, API docs, logs/jobs widgets, or voice integration.
Session 167 planning/handoff note: the implementation target is the free-testing channel from `docs/hero_os/overview.md`, grounded in `docs/hero_os/free/admin-vm-deployment-runbook.md`.
Flow contract for s168:
- `/hero_tfgrid_deployer/admin/`: create/select Forge tester, verify the tester has SSH keys, provision the tester child VM through existing `hero_tfgrid_deployer` + `hero_compute`, see state/errors/logs, and hand the child cockpit URL plus credentials to the tester out of band.
- `/hero_cockpit/web/`: non-admin personal Hero cockpit with Books, services, settings, manual/help, feedback, demo warning, and `hero_voice_widget` visible/usable. Normal browser use should not depend on pasting a Forge API token; token paste remains fallback/headless only. BYO AI provider keys remain settings-page material.
Known s168 first checks: remove or route around the `hero_cockpit_admin` scaffold, fix deployer provisioning defaults/env (`DEFAULT_IMAGE`, `HERO_COMPUTE_NODE_ADDR`), add Books + voice discoverability, deploy, then live browser-walk `e2e_checklist.md` before flipping rows.
Post-handoff SSH key safety clarification: provisioning must never silently create a tester VM without the tester's SSH public key injected.
Canonical custody remains Forge: the tester's public key is stored under the tester's Forge account. If cockpit offers an SSH-key helper, it should upload the public key to Forge under the tester identity; cockpit/deployer should not become the private-key custody system.
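The safety rule above (never provision a tester VM unless the tester has an SSH public key on Forge) can be sketched as a pre-flight gate. This is a hedged Python illustration, not the server's Rust code: `RpcError` and `require_ssh_keys` are hypothetical names, and the only fixed fact is that -32602 is the JSON-RPC 2.0 "Invalid params" code, which this issue reports the server returning when no keys exist.

```python
INVALID_PARAMS = -32602  # JSON-RPC 2.0 "Invalid params"


class RpcError(Exception):
    """Hypothetical error type carrying a JSON-RPC error code."""

    def __init__(self, code: int, message: str) -> None:
        super().__init__(message)
        self.code = code
        self.message = message


def require_ssh_keys(username: str, forge_keys: list[str]) -> list[str]:
    """Return the keys to inject, or refuse to provision when there are none."""
    keys = [key.strip() for key in forge_keys if key.strip()]
    if not keys:
        raise RpcError(
            INVALID_PARAMS,
            f"user '{username}' has no SSH public key on Forge; "
            "upload one before provisioning",
        )
    return keys
```

Failing before any VM is requested keeps the error actionable for the operator and leaves nothing to roll back.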
Existing server behavior already has the right invariant in `hero_tfgrid_deployer_server/src/web.rs::handle_provision_vm`: it calls `forge.list_user_ssh_keys(username)`, returns an actionable `-32602` error if none exist, and passes `ssh_keys` inline to `ComputeService.deploy_vm` when provisioning.
s168 acceptance should make this visible and verified in UX/checklist:
- `ssh_key_count > 0` / equivalent evidence.
Update: code for the admin and tester UX work landed this session across three repos.
hero_os_tfgrid_deployer (8c640cd) ships the provisioning fixes (HERO_COMPUTE_NODE_ADDR is now actually read, default image is `Ubuntu 24.04` so the resolver accepts it, and the admin user-detail page shows a Forge SSH-key badge and gates the Provision button when the tester has zero keys). It closes hero_os_tfgrid_deployer#11.
hero_cockpit (a52b784) routes the cockpit admin scaffold to the deployer admin via a 302 redirect, adds a Books card as the first card on the tester landing page, and wires the hero-voice-bar widget into the navbar with the canonical voice-widget asset links per the hero_voice_widget skill.
hero_lib (2f46f8f5) fixes an upstream regression in tools/src/forge/client.rs that was breaking downstream workspace builds.
All three are deployed to the public QA admin VM hcockpit.gent01.qa.grid.tf. The admin and cockpit binaries were installed via manual SCP because the Forgejo Actions release pipeline was wedged on missing token scope. The token scope and stale repo-level overrides were also cleaned up during the session, so future deploys should go through the canonical lab build pipeline.
The `deployer/FORGE_TOKEN` on the VM was rotated to a site-admin Forgejo token because the original token was non-admin and forge.create_user was failing with 403.
Browser-walk so far: the admin URL correctly lands on the deployer admin instead of the scaffold, and the Books card is visible on the tester landing. Voice-bar render, tester creation, provision_vm with the default image, regenerate-password, and Books navigation carry to the next session. Three follow-up issues were filed: hero_router#113 (a prefix-doubling bug that breaks /hero_books/web/), hero_os_tfgrid_deployer#12 (the admin navbar shows the OS username instead of the SSO user), and hero_voice#36 (operator note about redeploying hero_voice_admin when a host UI adopts the widget embed).
s169 closed 2026-05-27 — verify-and-close walk + per-tester-VM arc made explicit + 5 UX squash-merges + multi-session roadmap to home#238 closure
End-to-end admin + tester SSO browser walk on the public QA admin VM `0069`. Verified A-20 (admin user list), A-25 (regenerate password), A-28 (SSH-key readiness pre-flight in both states), cockpit landing with Books card + voice-bar rendered, all tester pages render. A-21 blocked by P0 hero_os_tfgrid_deployer#13: `my_compute_zos_server` not running on admin VM (operational fix queued for s170).
5 UX issues filed + fixed + closed via squash-merges:
- hero_os_tfgrid_deployer#12: SSO username instead of OS username in admin navbar
- hero_os_tfgrid_deployer#14: Create-user success panel rewritten as SSO-first walk
- hero_os_tfgrid_deployer#15: Node SID help text matches reality
- hero_os_tfgrid_deployer#16: Bootstrap modal dialogs replace browser confirm()
- hero_cockpit#10: Dropped `table-light` thead so dark theme is honored
Squash-merges (live on origin/development; redeploy queued for s170):
- hero_os_tfgrid_deployer@c649d76
- hero_cockpit@08e7788
- home@a0dd2f3
Per-tester-VM arc made explicit in `e2e_checklist.md` with 4 new Need rows mapping the gap from today's state to executive summary lines 27/28/29/31-40:
- `hero_proxy` allowlist on tester VM (exec line 28)
Multi-session arc to home#238 closure: s170 (compute daemon + UX redeploy + first Provision) → s171 (A-12 deploy_webgateway integration) → s172 (A-30 Hero stack auto-install) → s172-bis (A-31 per-tester allowlist) → s173 (full e2e walk + close).
Counts after s169: 57 Have / 23 Need / 2 Blocked across 82 rows (was 54/20/2 across 76).
See `sessions/169.yml` (local pipeline artifact) for full per-step record.
mik-tf referenced this issue from lhumina_code/hero_compute 2026-05-27 18:17:22 +00:00
s171 close — A-12 deploy_webgateway after deploy_vm shipped
Code shipped at hero_os_tfgrid_deployer@15e5473 (+492/-6 across 8 files).
What landed:
- Schema M3 adds `webgateway_fqdn TEXT NOT NULL DEFAULT ''` to the `vms` table via the canonical recreate-with-FK dance (preserves the M2 ON DELETE RESTRICT FK).
- `ComputeAdapter` gained a typed `Webgateway` struct + `deploy_webgateway` method wrapping the JSON-RPC envelope.
- `handle_provision_vm` calls `deploy_webgateway(name={user}-demo, kind=Name, fqdn="", backends=["http://[mycelium_ip]:9988"], tls_passthrough=false, secret=vm_secret, node_sid=<env>)` after `db.insert_vm`, then persists the daemon-returned `Webgateway.fqdn` via the new `update_vm_webgateway` setter.
- `webgateway_error` in the JSON response for operator retry.
- `user_detail.html` adds a "Cockpit URL" column on the VMs table plus a Cockpit URL row in the post-Provision alert, both with a copy-to-clipboard button.
- New `[[env]] TFGRID_GATEWAY_NODE_SID` block on `hero_tfgrid_deployer_server/service.toml` with `default=""`.
Pre-merge gate: fmt + clippy `-D warnings` + workspace release build + 32 server-lib tests pass (+6 new); `--info` smoke clean on both deployer binaries.
Live walk on admin VM 0069: surfaced an API asymmetry where `deploy_webgateway.node_sid` takes the raw TFGrid `node_id` (e.g. `"2"`) while `deploy_vm.node_sid` takes the daemon-local catalog sid (e.g. `"0001"`); pivoted the operator secret to `TFGRID_GATEWAY_NODE_SID=2` and retried. Three subsequent attempts each hit the daemon's 300s inline-await timeout on the substrate write (consistent QA substrate finalization slowness today, not a transient flake). Daemon-side rollback ran cleanly each time, cancelling 2 orphan contracts per attempt. Filed hero_compute#131 requesting the 300s timeout be bumped, exposed as an env var, and differentiated per chain.
Phase B.5 adversarial review caught two protocol fixes before the code shipped: the deployer must read the daemon-returned fqdn (never compute it locally), and backends must carry an `http://` scheme prefix.
A-12 row in `docs/hero_os/free/e2e_checklist.md` (renamed mid-session by another maintainer from `docs/hero_os/free/`) flipped from Need to Have-with-caveat. Code path complete and the gateway-node selection live-verified through daemon logs; live URL gated on QA substrate window opening or hero_compute#131 landing. Counts: 60 Have / 20 Need / 2 Blocked across 82 rows.
Cleanup: 4 throwaway VMs deleted, the throwaway Forge user purged via admin DELETE (verified 404), ephemeral SSH key scrubbed from `/tmp`. QA twin 703 RentContract 84983 and the admin VM `0069` gateway contracts untouched. Zero TFT cost.
Next session (s172) = A-30 Hero stack auto-install on tester VM (design-lock between SSH-and-run, cloud-init, and pre-baked image, then ship a minimal vertical slice with `hero_proxy` + `hero_router` + `hero_proc` + `hero_cockpit` running on a fresh tester VM).
Design lock for the next session: after `provision_vm` mints the cockpit URL, the deployer installs the Hero stack on the tester VM over SSH using a stable installer keypair held in the admin VM's secret store. At provision time the deployer co-injects three pubkey sets into the new VM's `authorized_keys`: the tester's own Forge SSH key, the deployer's installer key, and the workspace admin SSH keys, so workspace admins keep standing root access to every tester VM for ops and debugging. The tester VM's `hero_proxy` is configured with a symmetric web allowlist: the tester's Forge identity plus the workspace admin Forge identities (`deployer/ADMIN_FORGE_USERS`), so SSH access and cockpit web access converge on the same identity set. The workspace registers one shared Forgejo OAuth app and the deployer patches its `redirect_uris` per Provision (append on provision, remove on delete). This is the sandbox trust model and is explicitly bounded to the Hero OS Tester Sandbox; the future paid-tier sovereign deploy inherits none of these defaults (no admin SSH co-injection, no admin in tester web allowlist, no shared installer key, no shared OAuth app). The A-30 canonical stack list also grows from 11 to 12 components with `hero_biz` joining; B-41 caption updated to match. Live walk and row flips for A-30 + A-31 land later in the session after the implementation phases.
Session 172c close summary
What shipped, live
The install pipeline works end-to-end on a freshly provisioned tester VM. `deployer.install_hero_stack` advances the new VM through install_state none → installing → ready in about eight minutes. All twelve canonical components run on the tester. The tester's TFGrid Web Gateway URL serves the cockpit publicly: external HTTPS curl returns 200 on `/` and 303 on `/hero_cockpit/web/` to the welcome page. Admin SSH co-injection verified live (operator workstation SSH into the tester's root over mycelium succeeded).
Code
Five squash-merges on hero_os_tfgrid_deployer/development closed every install-pipeline gap surfaced live. The commit chain is `319cf68 → ce9b9e4 → cab2f16 → 794da22 → 483c8b8 → 541d9d5` (cumulative deployer server md5 `9173e330ab6ddff5849118e5edc51a88` on admin VM). Pre-merge gate green on every commit: fmt, clippy `--workspace --all-targets -D warnings`, 55 server-lib tests (six new) + 2 SDK tests, workspace release build, `--info` smoke on both deployer binaries.
Architectural lesson
A new locked decision codifies that tester VM runtime configuration flows exclusively through hero_proc's secret store via service.toml env blocks. Bash environment files (`/root/app.env`) bypass hero_proc-managed daemons and never reach the services that actually need the values. This was empirically demonstrated through three iterations during the session: setting `HERO_PROXY_SEED_GATEWAY_LISTENER=1` in app.env produced exactly the same 502 Bad Gateway as setting nothing, because the managed daemon reads from hero_proc, not from bash. Only after extending the deployer's SSH payload to run `hero_proc secret set --quiet --context core HERO_PROXY_SEED_GATEWAY_LISTENER 1` and restart hero_proxy did the listener actually bind a TCP socket.
Structural follow-up filed
hero_os_tfgrid_deployer#17 — promote the tester-VM install runner from the bash script in hero_demo to a Rust crate in the deployer workspace that consumes a typed install manifest. Retires the impedance boundary between the deployer's typed Rust shape and what daemons actually see.
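One way to picture the combination of the architectural lesson and this follow-up is a typed list of secrets rendered into the commands that the SSH payload runs. The sketch below is a hypothetical Python illustration, not the planned Rust crate: `SecretEntry` and `secret_set_argv` are invented names, and the only grounded detail is the `hero_proc secret set --quiet --context core NAME VALUE` form quoted in this summary. It only builds command lines and executes nothing.

```python
import shlex
from dataclasses import dataclass


@dataclass(frozen=True)
class SecretEntry:
    """Hypothetical manifest row: one secret in one hero_proc context."""

    context: str
    name: str
    value: str


def secret_set_argv(entry: SecretEntry) -> list[str]:
    """Build the argument list for the command form quoted in this summary."""
    return ["hero_proc", "secret", "set", "--quiet",
            "--context", entry.context, entry.name, entry.value]


manifest = [SecretEntry("core", "HERO_PROXY_SEED_GATEWAY_LISTENER", "1")]
commands = [secret_set_argv(entry) for entry in manifest]
# shlex.join quotes each argument safely for use in a remote shell payload.
payload_lines = [shlex.join(argv) for argv in commands]
```

Building argument lists first and quoting them in one place avoids hand-assembled shell strings, which is the kind of boundary the follow-up issue aims to retire.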
Post-v1 vision filed
hero_demo#68: `hero_store` service catalog UI for tester VMs. Browsable list of every Hero service published via `lab-publish.yaml`, one-click install or uninstall onto the tester's own VM. Depends on the install-runner cleanup landing first. Post-v1-sandbox polish.
Checklist row flips
A-30 (Hero stack present + running on a freshly provisioned tester VM) → Have-with-caveat. B-40 (tester opens cockpit URL on their own provisioned VM) → Have-with-caveat. The caveat in both cases is that the install runner is still the bash script in hero_demo per the structural follow-up above, and full browser SSO walks need the next session's OAuth-secret propagation work to complete. Counts moved 60 Have / 21 Need / 2 Blocked → 62 / 19 / 2 across 83 rows.
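A quick tally shows the counts above are internally consistent: both the before and after figures add up to the stated 83 rows, and the two flipped rows (A-30 and B-40) account for the movement from Need to Have. The helper name is an assumption made for the example.

```python
def total_rows(have: int, need: int, blocked: int) -> int:
    """Sum the three checklist status counts."""
    return have + need + blocked


before = total_rows(60, 21, 2)
after = total_rows(62, 19, 2)
flipped_to_have = 62 - 60
left_need = 21 - 19
```

The s171 note earlier in this issue reports 60 / 20 / 2 across 82 rows, so the starting figure used here includes one additional Need row compared with that count.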
Cleanup at /stop
Tester VM and Forge user deleted (Gap 3 webgateway cleanup verified live four times across the session; every delete returned `webgateway_error: ""`). Workstation and admin VM temp files shredded. No orphan VMs or contracts on QA. Admin VM stays up at the public URL.
Carries to next session (estimated three to five hours)
Propagate the four cockpit OAuth secrets to tester VMs via the same SSH-push pattern. Walk two testers simultaneously (alice plus bob) with browser SSO walks demonstrating per-tester isolation plus admin symmetric trust across both cockpits. Flip the SSO-dependent checklist rows (A-31, B-41, several B-1x where the live walk surfaces evidence). After that the substrate supports any number of testers — provisioning more becomes mechanical.
Next session plan (v1 demo close)
The s172d live walk surfaced four narrow UI / UX gaps that are the last things blocking the v1 demo from feeling clickable end-to-end. The architectural substrate (per-tester OAuth, install pipeline, admin allowlist) all works; what is left is operator and tester polish:
P0: Admin UI auto-refresh on Install and Provision state
hero_os_tfgrid_deployer#18. Today the admin must manually refresh the user-detail page to see the Install state transition from `installing` to `ready`, and Provision has no visible progress at all (looks like a hung browser for the full 55 to 60 seconds). Single fix is a small polling script on the VMs table plus a new provision-state column symmetric with the existing install-state column.
P0.5: Cockpit URL column in admin UI is missing the /hero_cockpit/web/ path
hero_os_tfgrid_deployer#19. The admin sees `https://<tester>.<gateway>` and clicks it, but lands on Hero Proxy's own service-discovery dashboard, not on the cockpit. One-line template fix to append `/hero_cockpit/web/` to the displayed link and href.
P1a: Cockpit Services page: Install button for uninstalled components
hero_cockpit#11. The cockpit currently only lists services already known to Hero Proc. Components in the canonical demo stack that are not yet started (because their dependency is not met, e.g. Books needs an AI key) are invisible from the cockpit, so the tester has no UI path to bring them up. Unified list with greyed-out rows for "available but not installed" plus a per-row Install button that fires `lab build` then `lab service start` server-side.
P1b: Cockpit Services page: clickable URL column
hero_cockpit#12. The Services page has a URL column but it currently shows an em-dash for every row. Every service with an `_admin` or `_web` binary has a publicly reachable URL the cockpit can compute from its own service.toml. Render those URLs as clickable links and the tester gets an obvious affordance to open any service.
P2: BYO-key auto-start cascade in cockpit Settings
When a tester pastes an AI provider key in the cockpit Settings page, the cockpit's save handler should automatically fire the lab-service-start cascade for the AI-dependent components (Hero AI Broker plus Hero Books plus Hero Agent). Today the tester pastes the key and nothing visible happens; they have to know to SSH into their VM and run a command to bring up Books. After this cascade, Books just works the moment the key is saved.
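The cascade described above can be sketched as a save handler that starts the AI-dependent services once a non-empty key is stored. This is a hypothetical Python illustration: `hero_aibroker` and `hero_books` are names used elsewhere in this issue, `hero_agent` is an assumed identifier for Hero Agent, the broker-first ordering is an assumption, and `start_service` stands in for whatever server-side wiring actually starts a service.

```python
from typing import Callable

# Assumed start order: the broker first, then the services that rely on it.
AI_DEPENDENT_SERVICES = ["hero_aibroker", "hero_books", "hero_agent"]


def on_ai_key_saved(key: str, start_service: Callable[[str], bool]) -> dict[str, bool]:
    """Start each AI-dependent service after a non-empty key is saved."""
    if not key.strip():
        return {}  # nothing saved, nothing to start
    return {name: start_service(name) for name in AI_DEPENDENT_SERVICES}


# Usage with a fake starter that records what it was asked to start.
started: list[str] = []


def fake_start(name: str) -> bool:
    started.append(name)
    return True


result = on_ai_key_saved("sk-example", fake_start)
```

Returning a per-service result lets the Settings page show which components came up, so the tester gets visible feedback instead of silence after saving the key.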
After s173, the v1 tester loop is end-to-end clickable: admin creates user via admin UI, tester registers on Forge and uploads SSH key, admin clicks Provision and sees real-time progress, admin clicks Install and sees real-time state transitions, admin shares the cockpit URL with the tester out-of-band (Slack / email, manually for v1), tester clicks URL and signs in via SSO, sees cockpit Services with both installed components (Books, Slides, Whiteboard, etc) and uninstalled components (greyed out with Install buttons), pastes an AI key in Settings and Books starts working automatically, clicks any service's URL to open it.
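The clickable-URL step in this loop relies on the `_admin` / `_web` suffix convention from P1b. A minimal sketch of such a deriver follows, assuming the path shape seen in this issue's URLs (for example `/hero_cockpit/web/` and `/hero_tfgrid_deployer/admin/`). The function name and the exact mapping are assumptions, not the cockpit's implementation.

```python
from typing import Optional

UI_SUFFIXES = ("_admin", "_web")


def cockpit_relative_url(service_name: str) -> Optional[str]:
    """Map 'hero_books_web' to '/hero_books/web/'; None when there is no UI suffix."""
    for suffix in UI_SUFFIXES:
        if service_name.endswith(suffix) and len(service_name) > len(suffix):
            base = service_name[: -len(suffix)]
            return f"/{base}/{suffix[1:]}/"
    return None
```

Services without a UI suffix return `None`, which a template could render as a placeholder rather than a link.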
s174 (v2 polish) adds the automatic welcome-email pipeline so the admin no longer shares URLs out-of-band: A-18 (welcome email at user-create time with cockpit URL plus initial password plus 4-step onboarding) and B-1 ("your VM is ready" email at install-ready time). Selecting the email provider (resend.com or SendGrid) is the first decision in s174.
Estimated effort: s173 about 6 to 9 hours, s174 about 4 to 6 hours, both stay inside home#238.
Scope refinement on the cockpit Services polish (hero_cockpit#11)
The previous comment described the Install button work as covering the 12 auto-installed demo components. After thinking through the catalog model more carefully, the right scope is broader: the cockpit Services page becomes the platform's service catalog, full stop. So hero_cockpit#11 ships:
This collapses what was previously planned as a future "service catalog UI" arc (filed at hero_demo#68) into the same page. The cockpit IS the catalog for the free / sandbox tier. A separate searchable marketplace surface is only worth building when paid services and onboarding flows exist, which is a future paid-tier concern outside home#238's scope.
Net for s173 V1 close: same four issues (hero_os_tfgrid_deployer#18 admin UI auto-refresh, hero_os_tfgrid_deployer#19 cockpit URL path-suffix, hero_cockpit#11 Install button with the full-catalog scope, hero_cockpit#12 clickable URL column) plus the BYO-key auto-start cascade. After s173 the cockpit Services page is the complete platform catalog for the sandbox demo.
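Because the Services page doubles as the catalog, install requests have to be checked against it; the s173 close note in this issue describes exactly that guard, so a hand-crafted call cannot trigger a build on an arbitrary repo string. The sketch below is a hypothetical Python illustration of the allowlist check: the two catalog entries are placeholders (the real catalog module is reported to define fourteen), and the names `CATALOG`, `UnknownService` and `validate_install_request` are assumptions.

```python
# Placeholder subset; the real catalog is reported to hold fourteen entries.
CATALOG = {
    "hero_books": "Hero Books",
    "hero_aibroker": "Hero AI Broker",
}


class UnknownService(ValueError):
    """Raised when a request names something outside the catalog."""


def validate_install_request(service_name: str) -> str:
    """Return the name unchanged if it is a catalog key, otherwise refuse."""
    if service_name not in CATALOG:
        raise UnknownService(f"'{service_name}' is not a catalog entry")
    return service_name
```

Matching on exact catalog keys, rather than sanitising free-form input, keeps the set of buildable repos fixed by the catalog itself.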
Closing as the visible UX surface is complete. The work shipped across two stretches:
The admin path landed first: tfgrid_deployer admin UI with per-user VMs table, install state machine, provisioning flow, per-tester Forgejo OAuth apps replacing the workspace-shared model. Real Forge users walked through the admin VM end-to-end (alice172d, alice123).
The tester path landed second: cockpit Services page with install-from-catalog flow, Bootstrap modals replacing every browser confirm/alert leak, dark-mode contrast fixes on Disable button and Logs drawer, log_tail rendered in the install result modal so dependency-cascade failures surface inline, Manual with 17 entries split into Core infrastructure (4) and Apps (13), About data locations table covering all 16 catalog services that store user data, Feedback page secondary sections wrapped in Bootstrap cards, landing page CTAs unified, Settings cleanup (Public exposure base domain section removed alongside the Expose/Unexpose UI hiding), and a connection-status dot fix in hero_admin_lib that now paints green when connected and stays steady (used to be grey-when-connected with a constant pulse on the wrong state). About 23 commits across hero_cockpit, hero_os_tfgrid_deployer, hero_website_framework, and hero_demo.
What this arc does NOT cover: functional verification of the catalog apps themselves. A tester can click Install on hero_books and the install cascade completes, but whether hero_books actually renders a library, indexes a document, and answers a grounded question is unverified. That work moves to the new arc: home#239.
Signed-by: mik-tf <mik-tf@noreply.invalid>