[onlyoffice][hotfixes-applied] document-server unbreakage 2026-05-01 — three-layer postmortem + prod-level follow-ups #57

Open
opened 2026-05-01 22:46:02 +00:00 by mik-tf · 0 comments
Owner

Summary

The OnlyOffice document editor on the herodemo VM was completely broken after the 2026-04-30 redeploy. After working through three stacked failures, we have it working again. This issue documents the symptom chain, the layered hotfixes that landed, and the prod-level follow-ups needed before the next deploy.

Symptom chain (as observed)

  1. First symptom: "ONLYOFFICE_JWT_SECRET is not set on the server" (red banner in editor).
  2. Fix #1 applied → second symptom: "The file cannot be accessed right now" (modal in editor; spreadsheet/presentation never loads).
  3. Fix #2 applied → third symptom: "Download failed" (red error dialog in editor).
  4. Fix #3 applied → working. Editor opens, loads, saves.

Each fix unblocked the next layer underneath.

Root causes & hotfixes

Layer 1 — JWT secret missing

Cause: ONLYOFFICE_JWT_SECRET was not in ~/hero/cfg/env/env.sh at all. Even when added, env.sh exports don't propagate to nu deploy scripts (see hero_skills#191: load_init_sh doesn't follow source directives). Result: service_onlyoffice.nu registered the action spec with the hardcoded placeholder OO_DEFAULT_SECRET = "hero-demo-jwt-secret-change-in-prod", baked into the docker run command. hero_office_server validated incoming JWTs against the (correct) env.sh secret while the OnlyOffice container signed them with the placeholder → JWT mismatch, every callback rejected.

Hotfix applied:

  • Generated 64-char hex secret (openssl rand -hex 32).
  • Added to ~/hero/cfg/env/env.sh and to hero_proc secret store (hero_proc secret set ONLYOFFICE_JWT_SECRET ...).
  • Set in nu env explicitly (because of #191) and ran service_onlyoffice start --reset to re-register the action spec — only --reset regenerates the docker run command embedded in the action.

Prod-level fix needed:

  • Remove OO_DEFAULT_SECRET = "hero-demo-jwt-secret-change-in-prod" from service_onlyoffice.nu. Fail-closed if ONLYOFFICE_JWT_SECRET is unset, with a clear error message pointing to init.sh and env.sh.
  • Move secret to hero_proc secret store (the pattern hero_aibroker already uses); service_onlyoffice should read via proc secret get instead of env.
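
The fail-closed behaviour asked for above could look roughly like the following. This is a minimal Python sketch of the logic only, not the Nushell code in service_onlyoffice.nu; the function name require_jwt_secret, the MissingSecretError type, and the choice to treat the old placeholder as "unset" are all assumptions made for illustration.

```python
import os

# Placeholder string quoted in this issue. Treating it the same as "unset"
# is an assumption of this sketch, not confirmed behaviour of any Hero tool.
PLACEHOLDER = "hero-demo-jwt-secret-change-in-prod"


class MissingSecretError(RuntimeError):
    """Raised instead of silently falling back to a default secret."""


def require_jwt_secret(env=None):
    """Return ONLYOFFICE_JWT_SECRET, or raise; never substitute a default."""
    env = os.environ if env is None else env
    secret = env.get("ONLYOFFICE_JWT_SECRET", "").strip()
    if not secret or secret == PLACEHOLDER:
        raise MissingSecretError(
            "ONLYOFFICE_JWT_SECRET is unset or still the demo placeholder; "
            "set it via init.sh / env.sh before starting the service."
        )
    return secret
```

The point of the design is that the launcher refuses to register anything when the secret is missing, so a misconfigured deploy fails loudly at start time rather than as a JWT mismatch in the editor.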

Layer 2 — host.docker.internal doesn't resolve on Linux Docker

Cause: hero_office_server gives OnlyOffice a callback/download URL of the form http://host.docker.internal:9988/hero_office/ui/.... On Docker Desktop (macOS/Windows), host.docker.internal is auto-provided. On Linux Docker (this VM, prod), it is not — and the Docker host-gateway magic value resolves to the docker0 bridge IP (172.17.0.1), which has no listener on :9988 (nginx is bound to 10.1.2.2:9988, hero_router to 127.0.0.1:9988).

Hotfix applied: patched service_onlyoffice.nu to detect the host IP at install time (prefers private LAN IP from non-loopback, non-bridge interfaces; here 10.1.2.2) and inject --add-host=host.docker.internal:<detected-ip> into the docker run command. Operator override: HERO_HOST_IP env var.

Prod-level fix needed:

  • The detection logic is heuristic (it picks the first 10.x/192.168.x/172.x IP). On hosts with multiple private interfaces, it can pick the wrong one. Better: explicitly probe which IP responds to :9988 from a docker container's perspective at install time, and use that.
  • A cleaner long-term alternative: configure host-gateway-ip in /etc/docker/daemon.json so --add-host=host.docker.internal:host-gateway works portably without per-host detection. This trades one daemon config edit at provisioning time for cross-host portability of the launcher script.
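
The "probe instead of guess" idea from the first bullet can be sketched as below. This is an illustrative Python version, not the oo_host_alias_ip helper itself; the function names and the candidate list are hypothetical. Note that it probes from wherever it runs, so to match the issue's requirement it would have to be executed inside a container rather than on the host.

```python
import os
import socket


def first_reachable(candidates, port, timeout=1.0):
    """Return the first IP in `candidates` that accepts a TCP connection on `port`."""
    for ip in candidates:
        try:
            with socket.create_connection((ip, port), timeout=timeout):
                return ip
        except OSError:
            # Refused, unreachable, or timed out: try the next candidate.
            continue
    return None


def resolve_host_alias_ip(candidates, port=9988, env=None):
    """HERO_HOST_IP wins if set; otherwise probe the candidates in order."""
    env = os.environ if env is None else env
    override = env.get("HERO_HOST_IP", "").strip()
    if override:
        return override
    ip = first_reachable(candidates, port)
    if ip is None:
        raise RuntimeError(
            f"no candidate IP accepts connections on :{port}; set HERO_HOST_IP"
        )
    return ip
```

With the addresses from this incident, a call such as resolve_host_alias_ip(["172.17.0.1", "10.1.2.2"]) would be expected to skip the docker0 address (no listener on :9988) and settle on 10.1.2.2, provided the probe sees the same network as the container.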

Layer 3 — nginx basic auth blocks the OnlyOffice callback/download paths

Cause: /etc/nginx/sites-enabled/hero_demo enforces auth_basic "Hero OS Demo" on every path except ^/hero_[a-z_]+/rpc(/|$). The OnlyOffice container requests GET /hero_office/ui/files/<ctx>/<file> and POST /hero_office/ui/callback/<ctx>. Both are UI paths and both were gated by basic auth, so each request got a 401, which surfaced as "Download failed" in the editor. The container has no way to send basic-auth credentials, since those are held on the operator side.

Hotfix applied: patched tools/modules/installers/auth.nu to add an additional location ~ ^/hero_office/ui/(files|callback)(/|$) { auth_basic off; ... } block. Both endpoints are JWT-signed (HMAC-SHA256 with ONLYOFFICE_JWT_SECRET) and validated by hero_office_server, so the JWT is the actual auth. The live nginx config was patched manually and reloaded; a full installer re-run will reproduce the block on the next deploy.
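
To sanity-check which paths the two exemptions actually cover, the patterns can be exercised outside nginx. The sketch below uses Python's re module on the two regexes quoted in this issue; nginx uses PCRE, and for patterns this simple the two engines should agree. The helper name basic_auth_required is hypothetical, and the sketch assumes every other path stays behind basic auth.

```python
import re

# Patterns copied from this issue.
RPC_EXEMPT = re.compile(r"^/hero_[a-z_]+/rpc(/|$)")
OO_EXEMPT = re.compile(r"^/hero_office/ui/(files|callback)(/|$)")


def basic_auth_required(path):
    """True if `path` matches neither exemption and so stays behind basic auth."""
    return not (RPC_EXEMPT.search(path) or OO_EXEMPT.search(path))


for p in (
    "/hero_office/ui/files/ctx/doc.xlsx",
    "/hero_office/ui/callback/ctx",
    "/hero_office/rpc",
    "/hero_office/ui/editor",
    "/hero_office/ui/filesystem",
):
    print(p, "basic auth" if basic_auth_required(p) else "exempt")
```

The (/|$) tail is what keeps a lookalike such as /hero_office/ui/filesystem behind basic auth: the exemption only applies when files or callback is a complete path segment.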

Prod-level fix needed:

  • Add a regression test: assert that hero_office_server rejects any callback/files request without a valid JWT. The nginx bypass is only safe if JWT validation is bulletproof. If validation regresses to "missing secret = allow", we'd silently expose /hero_office/ui/files/* to unauthenticated access.
  • Consider scoping further: /files/<ctx>/<file> should only be readable by an OnlyOffice container request (presence of Authorization: Bearer <jwt> from the OO container), not by random clients.
  • Document the auth model in docs_hero so the next operator who adds an OnlyOffice-like component knows the JWT-vs-basic-auth boundary.

Where the patches landed

  • lhumina_code/hero_skills/tools/modules/services/service_onlyoffice.nu
    • New oo_host_alias_ip helper (HERO_HOST_IP env var or auto-detect).
    • oo_launcher_script accepts host_alias and embeds --add-host=host.docker.internal:<ip>.
    • oo_install_launcher calls the helper and prints the resolved IP.
  • lhumina_code/hero_skills/tools/modules/installers/auth.nu
    • New nginx location block exempting /hero_office/ui/(files|callback)/* from basic auth.

Local-only on the VM at time of writing (under /home/driver/hero/code/hero_skills/); upstream PR pending.

Cross-references

  • hero_skills#191: load_init_sh doesn't follow source directives. This compounds the JWT-secret problem at deploy time and is the broader fix that would have prevented Layer 1 entirely.
  • hero_aibroker — already uses hero_proc secret store. The pattern to copy.

Demo posture

For tomorrow's demo: working as of 2026-05-01 22:38 UTC. Don't trigger another service_onlyoffice start --reset without ONLYOFFICE_JWT_SECRET in nu env (would re-bake the placeholder secret into the action spec). The current launcher script + docker container are correctly paired.

Signed-off-by: mik-tf

Reference
lhumina_code/hero_demo#57