[nu-demo] hero_aibroker_server crash-loops on missing var/hero_aibroker/modelsconfig.yml — agent gets 401, Settings → Environment page broken, AI Assistant 'network error' #198

Closed
opened 2026-04-27 14:43:05 +00:00 by mik-tf · 1 comment
Owner

Symptom (user-visible on herodemo, 2026-04-27)

  1. Settings → Environment tab: Error: expected value at line 1 column 1 — endpoint returns empty/HTML, frontend JSON parse fails. No API keys visible.
  2. AI Assistant chat: Stream read error: JsValue(TypeError: network error TypeError: network error) after sending a message. Worked one minute earlier (different LLM provider routed direct).
  3. hero_agent log shows: LLM streaming API error 401 Unauthorized: <html> for every Claude call (claude-3-5-sonnet-latest, claude-haiku-4.5).

Root cause

hero_aibroker_server crash-loop:
  Error: Failed to read config file: /home/driver/hero/var/hero_aibroker/modelsconfig.yml
     No such file or directory (os error 2)

The _server binary has been down on herodemo since 2026-04-24. Only hero_aibroker_ui is running (the UI alone can't relay LLM calls, hence the 401s).

The config file does exist in the source repo at /home/driver/hero/code0/hero_aibroker/modelsconfig.yml, and in the cargo registry checkout, but it was never seeded into /home/driver/hero/var/hero_aibroker/ — that directory doesn't exist at all on herodemo.

Downstream fan-out

  • aibroker_server crash → no LLM relay
  • → hero_agent (configured to route Claude through aibroker) → 401 from upstream
  • → AI Assistant UI: "Stream read error: network error" (the 401 happens mid-stream)
  • → Settings → Environment page: backend returns no JSON (broker doesn't serve env-list when down) → frontend "expected value at line 1 column 1"

Fix options

  1. hero_skills installer side (cleanest): service_aibroker.nu should cp the bundled modelsconfig.yml from the repo into /home/driver/hero/var/hero_aibroker/ on first install, with idempotent skip-if-present.
  2. hero_aibroker_server side (defensive): fall back to the bundled repo-relative modelsconfig.yml if the var/ path is missing, and log a warning rather than failing startup.
  3. Documentation side (interim for herodemo): document the mkdir -p /home/driver/hero/var/hero_aibroker && cp /home/driver/hero/code0/hero_aibroker/modelsconfig.yml /home/driver/hero/var/hero_aibroker/ recovery step in the demo runbook.
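Option 1's skip-if-present seeding can be sketched as a small shell function. This is a sketch only: the function name and the parameterized paths are illustrative, and the real implementation would be Nushell inside service_aibroker.nu.

```shell
# Sketch of the idempotent seed step (illustrative names; real code is
# Nushell in service_aibroker.nu). Copies the bundled config only when the
# target is absent, so a locally edited modelsconfig.yml survives reinstalls.
seed_modelsconfig() {
  src="$1"
  dst="$2"
  mkdir -p "$(dirname "$dst")"
  if [ -f "$dst" ]; then
    echo "modelsconfig already present at $dst, skipping"
  else
    cp "$src" "$dst"
    echo "seeded $dst from $src"
  fi
}
```

On herodemo this would be invoked with the repo copy /home/driver/hero/code0/hero_aibroker/modelsconfig.yml as src and /home/driver/hero/var/hero_aibroker/modelsconfig.yml as dst, which is also exactly the interim recovery step in option 3.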

Companion to #137

#137 covers the content of modelsconfig.yml (groq-only models with no fallbacks). This issue is one layer below: the file isn't present at the expected path at all on herodemo, so hero_aibroker can't even load any routing config.

Verification path

ssh root@herodemo  # i.e. 185.69.166.153
ls -la /home/driver/hero/var/hero_aibroker/   # → No such file or directory
ps -eo pid,cmd | grep hero_aibroker_server   # → no _server process, only _ui
su - driver -c 'cd ~/hero/code/hero_skills && nu -c "use tools/modules/clients/proc.nu *; proc logs get hero_aibroker.hero_aibroker_server --lines 30"'
# → repeated 'Failed to read config file: /home/driver/hero/var/hero_aibroker/modelsconfig.yml'

Filed during the rc.12 herodemo validation, surfaced when user tested AI Assistant. Pre-existing bug, ~3 days old; NOT caused by the rc.12 work.

Signed-off-by: mik-tf


Root cause + fix landed

The seeding logic in service_aibroker.nu was correct all along — it copies modelsconfig.yml from the cloned source into $(svc_home $root)/var/hero_aibroker/. It silently no-op'd on herodemo because CODEROOT was empty in the SSH login shell that drove the install.

Why: Debian's default ~/.profile doesn't source ~/.bashrc. install.sh only writes the [ -f ~/hero/cfg/init.sh ] && source … snippet to ~/.bashrc. SSH login shells (su - driver -c '…', cron, hero_proc child processes) read .profile and skip .bashrc, so CODEROOT/ROOTDIR/BUILDDIR never reach them. forge_ensure_local's ensure_env_vars errors out, the seed step is skipped, but the action is still registered and crash-loops on the missing config.

install.sh's patch_shell_rc_files already adds a .profile → .bashrc bridge on first bootstrap (lines 741-758). Hosts upgraded via git pull only — like herodemo — don't pick up later additions to that logic.

Fix landed

hero_skills 7ce11b3 on development:

  • New ensure_shell_init function in installers.nu that mirrors the install.sh .profile bridge logic.
  • Called from install_core so every install_core run is self-healing on existing hosts (idempotent).

Pragmatic patch on herodemo (used during today's session): service_aibroker start --reset with ~/hero/cfg/init.sh explicitly sourced — broker is up and serving 42 models.

Long-term verification

Next herodemo redeploy from clean development branches will exercise the install_core → ensure_shell_init path. If .profile already has a hero bridge or sources .bashrc, the function logs "already bridges to hero init" and skips. If neither, it appends the bash bridge snippet.
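The skip/append behavior above can be sketched in plain shell. This is a sketch only: the real ensure_shell_init is Nushell in installers.nu, and the function name, grep pattern, snippet text, and log wording here are assumptions, not the shipped code.

```shell
# Sketch of the self-healing .profile bridge (assumed shape; real logic is
# ensure_shell_init in installers.nu). The profile path is a parameter here
# purely for illustration; the real code targets ~/.profile.
ensure_profile_bridge() {
  profile="$1"
  # Already bridged, either directly to hero init or via .bashrc? Then skip.
  if grep -qE 'hero/cfg/init\.sh|\.bashrc' "$profile" 2>/dev/null; then
    echo "already bridges to hero init"
    return 0
  fi
  cat >> "$profile" <<'EOF'
# hero: login shells read .profile, not .bashrc; bridge so CODEROOT et al. are set
if [ -n "$BASH_VERSION" ] && [ -f "$HOME/.bashrc" ]; then
  . "$HOME/.bashrc"
fi
EOF
  echo "appended bash bridge to $profile"
}
```

Because the append only happens when the grep finds no existing bridge, repeated install_core runs stay idempotent, which is what makes it safe to call on every install on existing hosts.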

Closing.

Signed-off-by: mik-tf
