fix: 2026-05-01 herodemo deploy hotfixes (onlyoffice, mycelium 0.7.5, env overrides) #192

Closed
mik-tf wants to merge 4 commits from development_mik_demo_hotfixes_2026_05_01 into development
Owner

Summary

Bundle of four hotfixes from the 2026-05-01 herodemo redeploy. None of these are speculative — each is a fix for a problem we hit live and verified working before this branch went up.

| Commit | What | Tracked in |
|---|---|---|
| b8835b5 | OnlyOffice: inject host.docker.internal:<host-ip> add-host, IP detected at install time (HERO_HOST_IP override) | hero_demo#57 |
| b61bf1c | nginx: bypass basic auth on /hero_office/ui/{files,callback} (JWT is the actual auth) | hero_demo#57 |
| d76cd7e | mycelium: update daemon flags for 0.7.5+ (--no-tun, --socket-dir); old flags removed | n/a (caught at deploy) |
| 6710a08 | lib.nu: HERO_KEEP_BINARIES + HERO_CARGO_RELEASE_ALWAYS env overrides for fast iteration | n/a |

Why

The OnlyOffice editor was broken at three layers (JWT mismatch → DNS unresolved → nginx 401), and the mycelium daemon refused to start at all on 0.7.5. Combined, these blocked the photos archipelago and all office-app islands. The two env-var escape hatches in lib.nu came out of debugging the deploy itself: service_X start --reset was triggering full debug rebuilds even when release binaries were already in place.

All four are minimal and additive:

  • The OnlyOffice + nginx changes only affect the /hero_office/ui/* traffic path; everything else is untouched.
  • The mycelium change is forced — old flags don't work on 0.7.5+.
  • The env overrides are no-ops when unset; default behaviour is unchanged.

Verified live

  • OnlyOffice: .xlsx + .pptx open, edit, save end-to-end. Container resolves host.docker.internal to 10.1.2.2 (eth0). /hero_office/ui/files/... returns 200 (was 401).
  • mycelium: daemon up, photos archipelago re-enabled.
  • HERO_KEEP_BINARIES + HERO_CARGO_RELEASE_ALWAYS used during the deploy itself; collapsed a 5–10 min debug rebuild into a 3–5 second register+start.

Test plan

  • nu -c 'use tools/modules/services *; service_onlyoffice install --update; service_onlyoffice start --reset' on a Linux Docker host with ONLYOFFICE_JWT_SECRET set; container should come up healthy and the launcher should print the resolved host IP.
  • nu -c 'use tools/modules/installers *; basic_auth_setup --user admin --pass test' and verify the generated /etc/nginx/sites-enabled/hero_demo contains the new location ~ ^/hero_office/ui/(files|callback)(/|$) block.
  • nu -c 'use tools/modules/services *; service_mycelium install --update; service_mycelium start --reset' on a non-root TF Grid VM; daemon starts in messages-only mode (no EPERM).
  • Set HERO_KEEP_BINARIES=1 and run service_X start --reset against any service whose binaries are pre-installed; confirm purge is skipped and start is fast.

Cross-refs

  • hero_demo#57: OnlyOffice three-layer postmortem.
  • hero_skills#191: load_init_sh doesn't follow source directives. Not fixed by this PR, but it was the upstream cause of the JWT secret missing from the environment, which triggered the whole OnlyOffice chase.
  • hero_agent#17: UTF-8 panic in agent.rs (separate, code-side bug surfaced during the same deploy).

🤖 Generated with Claude Code

b8835b5
OnlyOffice container hits a callback URL like
http://host.docker.internal:9988/hero_office/ui/... given by hero_office_server.
On Linux Docker the magic hostname is not auto-provided (only on Docker
Desktop), and the host-gateway alias resolves to docker0 (172.17.0.1) which
has no listener on :9988 (nginx is on the host's eth0; hero_router is on
loopback).

This patch:
- Adds oo_host_alias_ip helper that detects the host's primary non-loopback,
  non-bridge IPv4 (preferring 10.x / 192.168.x / 172.x) at install time, with
  HERO_HOST_IP env override for explicit operator control. Falls back to the
  literal "host-gateway" if detection fails.
- Threads host_alias through oo_launcher_script and embeds
  --add-host=host.docker.internal:<ip> into the docker run command.

Without this, the OnlyOffice editor surfaces "The file cannot be accessed
right now" / "Download failed" because the container can't resolve the
callback hostname. Tracked in hero_demo#57.
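
The detection order described above (env override, then a private non-loopback and non-bridge IPv4, then the literal "host-gateway") can be sketched roughly as follows. This is an illustrative Python model, not the Nushell helper itself: `pick_host_alias`, `add_host_flag`, the bridge-name prefixes, and the reading of "172.x" as the RFC 1918 range 172.16.0.0/12 are all assumptions.

```python
import ipaddress
import os

# Assumed naming of bridge-like interfaces; the real helper may filter differently.
BRIDGE_PREFIXES = ("docker", "br-", "veth", "virbr")

# Preferred private ranges. "172.x" in the commit text is read here as RFC 1918.
PREFERRED_NETS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("192.168.0.0/16"),
    ipaddress.ip_network("172.16.0.0/12"),
]


def pick_host_alias(interfaces, env=None):
    """Return the value to pair with host.docker.internal.

    interfaces: iterable of (name, address) pairs, e.g. parsed from `ip addr`.
    """
    env = os.environ if env is None else env
    override = env.get("HERO_HOST_IP", "").strip()
    if override:
        return override  # explicit operator control wins

    candidates = []
    for name, addr in interfaces:
        ip = ipaddress.ip_address(addr)
        if ip.version != 4 or ip.is_loopback or name.startswith(BRIDGE_PREFIXES):
            continue
        preferred = any(ip in net for net in PREFERRED_NETS)
        candidates.append((0 if preferred else 1, addr))

    if not candidates:
        return "host-gateway"  # fallback when detection finds nothing
    # min() keeps the first candidate among equals, so interface order is preserved.
    return min(candidates, key=lambda c: c[0])[1]


def add_host_flag(alias):
    """Build the docker run argument that the launcher embeds."""
    return f"--add-host=host.docker.internal:{alias}"
```

With interfaces lo (127.0.0.1), docker0 (172.17.0.1) and eth0 (10.1.2.2), the sketch picks 10.1.2.2, which matches the value reported for the herodemo VM.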

Verified live on herodemo VM 2026-05-01: detection picked 10.1.2.2 (eth0),
container resolves host.docker.internal, .xlsx and .pptx open + save.

See lhumina_code/hero_demo#57

Signed-off-by: mik-tf

b61bf1c
hero_office_server gives OnlyOffice URLs of the form
/hero_office/ui/files/<ctx>/<file> and /hero_office/ui/callback/<ctx>. Both
are JWT-signed (HMAC-SHA256 with ONLYOFFICE_JWT_SECRET) and validated by
hero_office_server, so the JWT is the actual auth — but nginx's auth_basic
"Hero OS Demo" block returns 401 to the docker-internal traffic, surfacing
in the editor as "Download failed".
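
For readers unfamiliar with why a valid signature can stand in for the password prompt, here is a minimal standard-library sketch of HS256 signing and verification. It is a generic illustration of the JWT mechanism, not hero_office_server's code: the function names and the payload fields are hypothetical, and how the token travels with each request is not shown in this PR.

```python
import base64
import hashlib
import hmac
import json


def _b64url(data: bytes) -> str:
    # JWTs use URL-safe base64 without the trailing '=' padding.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def sign_hs256(payload: dict, secret: str) -> str:
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload)
    )
    sig = hmac.new(secret.encode("utf-8"), signing_input.encode("utf-8"), hashlib.sha256).digest()
    return f"{signing_input}.{_b64url(sig)}"


def verify_hs256(token: str, secret: str) -> bool:
    try:
        signing_input, sig = token.rsplit(".", 1)
    except ValueError:
        return False  # not even header.payload.signature shaped
    expected = hmac.new(secret.encode("utf-8"), signing_input.encode("utf-8"), hashlib.sha256).digest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(_b64url(expected).encode("ascii"), sig.encode("utf-8"))
```

A token signed with one secret fails verification under any other, which is also why a secret missing from the environment produced the JWT mismatch in the first place.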

This patch adds a location block that bypasses basic auth on those exact
paths (and only those — the rest of /hero_office/ui still requires the
htpasswd gate, since the human-facing iframe view is auth-gated).
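
To make "those exact paths" concrete, the location regex quoted in the test plan, `^/hero_office/ui/(files|callback)(/|$)`, can be exercised against sample request paths. The Python below only mirrors the pattern for illustration. nginx evaluates it with PCRE against the normalized URI, and `bypasses_basic_auth` is a hypothetical name.

```python
import re

# Same pattern as the nginx `location ~` block from the test plan.
BYPASS = re.compile(r"^/hero_office/ui/(files|callback)(/|$)")


def bypasses_basic_auth(path: str) -> bool:
    """True if the path would match the JWT-only location block."""
    return BYPASS.search(path) is not None


# Sample paths are illustrative, not taken from live traffic.
for path in (
    "/hero_office/ui/files/ctx/report.xlsx",
    "/hero_office/ui/callback/ctx",
    "/hero_office/ui/",
    "/hero_office/ui/filesystem",
):
    print(path, bypasses_basic_auth(path))
```

The trailing `(/|$)` group is what keeps a sibling path such as /hero_office/ui/filesystem behind the htpasswd gate.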

Verified live on herodemo VM 2026-05-01: file fetch returns 200 (was 401),
callback returns 415 on empty body (was 401). xlsx + pptx open + save end
to end.

See lhumina_code/hero_demo#57

Signed-off-by: mik-tf

d76cd7e
mycelium 0.7.5 reorganized its CLI flags. The pre-0.7.5 invocation
(--uds-only --rpc-socket --tun-name --tcp-listen-port --quic-listen-port)
fails to start with "unrecognized argument" errors on the new binary.

Changes:
- --uds-only removed: UDS exposure is now default behaviour, no flag.
- --rpc-socket <path> → --socket-dir <base>: the binary now auto-creates
  <socket-dir>/mycelium/rpc.sock under the given base.
- --tun-name <iface> → --tun <iface> (rename).
- --tcp-listen-port and --quic-listen-port: moved into peers.toml config
  and are no longer CLI flags.

Linux/TF-Grid additionally switches to --no-tun: TUN device creation
requires CAP_NET_ADMIN (or root). On a TF Grid VM running as a non-root
service user, mycelium can't create a TUN and fails with EPERM. The
messages-only mode (no TUN) is sufficient for the Hero stack's internal
use of mycelium (UDS RPC for messaging, peer discovery, topic subs).

To re-enable TUN routing in future, swap --no-tun for --tun ($tun) and
grant CAP_NET_ADMIN to the binary (setcap cap_net_admin+ep) or run as
root.

macOS path (which already requires root for TUN creation) keeps --tun.
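
The flag mapping above can be summarized as a small argument translator. This is a Python sketch of the mapping as stated in this commit message only. It is not taken from the mycelium source, `migrate_args` is a hypothetical name, and it assumes each flag and its value arrive as separate argv tokens.

```python
from pathlib import PurePosixPath

# Per the commit text, these moved into peers.toml and are no longer CLI flags.
DROPPED_WITH_VALUE = {"--tcp-listen-port", "--quic-listen-port"}


def migrate_args(old_args, no_tun=False):
    """Translate a pre-0.7.5 argument list into the 0.7.5+ form."""
    new_args = []
    it = iter(old_args)
    for arg in it:
        if arg == "--uds-only":
            continue  # UDS exposure is now the default
        if arg in DROPPED_WITH_VALUE:
            next(it, None)  # discard the port value as well
            continue
        if arg == "--rpc-socket":
            sock = PurePosixPath(next(it))
            # The new binary creates <socket-dir>/mycelium/rpc.sock itself, so
            # only an old path of that shape maps cleanly onto a base directory.
            if sock.parts[-2:] != ("mycelium", "rpc.sock"):
                raise ValueError(f"cannot derive --socket-dir from {sock}")
            new_args += ["--socket-dir", str(sock.parent.parent)]
            continue
        if arg == "--tun-name":
            iface = next(it)
            # Linux / TF Grid: messages-only mode; macOS keeps the TUN device.
            new_args += ["--no-tun"] if no_tun else ["--tun", iface]
            continue
        new_args.append(arg)
    return new_args
```

For example, an old invocation using `--rpc-socket /run/hero/mycelium/rpc.sock --tun-name myc0` (both values hypothetical) becomes `--socket-dir /run/hero --no-tun` when `no_tun=True`.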

Verified live on herodemo VM 2026-05-01: mycelium daemon starts cleanly,
photos archipelago re-enabled.

Signed-off-by: mik-tf
feat(services/lib): HERO_KEEP_BINARIES + HERO_CARGO_RELEASE_ALWAYS env overrides
All checks were successful
Build and Publish Skills / build-and-publish (pull_request) Successful in 3s
6710a08072
Two operator-facing env-var escape hatches for the deploy-day pain points
hit during the 2026-05-01 herodemo redeploy:

1. HERO_KEEP_BINARIES=1 — skip svc_purge_binaries and the matching
   svc_verify_binaries_fresh check. Lets `service_X start --reset`
   reuse already-installed release binaries instead of triggering a
   full debug rebuild via the start→install→cargo path. Eliminates
   the "release binaries get purged then immediately rebuilt as debug"
   pathology when start has no --release flag of its own.

2. HERO_CARGO_RELEASE_ALWAYS=1 — force --release in svc_cargo_install
   even when the caller passes $release=false. Useful when
   target/release/ is already populated and you want a cargo no-op
   instead of a debug rebuild from scratch.

Together these collapse `service_X start --reset` from "purge → cargo
rebuild → install" (~5–10 min for hero_books) to "register + start"
(~3–5 sec) when the caller has already placed the release binaries.

Both are no-ops when unset; existing behaviour is unchanged.
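
The effect of the two overrides on `service_X start --reset` can be modeled as a simple step planner. This is a deliberately simplified Python sketch, not the lib.nu logic: the step names and `plan_start_reset` are hypothetical, the freshness check is omitted, and treating only the value "1" as "set" is an assumption based on the examples above.

```python
def _on(env, name):
    # Assumption: only the literal value "1" enables an override.
    return env.get(name, "") == "1"


def plan_start_reset(env, release=False, binaries_installed=True):
    """Return the ordered steps a reset-start would perform in this model."""
    steps = []
    if not _on(env, "HERO_KEEP_BINARIES"):
        steps.append("purge")
        binaries_installed = False  # purged binaries must be rebuilt
    if not binaries_installed:
        use_release = release or _on(env, "HERO_CARGO_RELEASE_ALWAYS")
        profile = "release" if use_release else "debug"
        steps += [f"cargo build ({profile})", "install"]
    steps += ["register", "start"]
    return steps


print(plan_start_reset({}))
print(plan_start_reset({"HERO_KEEP_BINARIES": "1"}))
print(plan_start_reset({"HERO_CARGO_RELEASE_ALWAYS": "1"}))
```

With neither variable set, the plan is the full purge, debug build, and install path, so default behaviour is unchanged in the model as well.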

Signed-off-by: mik-tf
despiegk closed this pull request 2026-05-06 03:23:43 +00:00

Reference
lhumina_code/hero_skills!192