[infra] Live smoke loop against herodemo — automated detection of broken services before users see it #201

Open
opened 2026-04-30 20:02:43 +00:00 by mik-tf · 0 comments
Owner

Symptom

We have no automated detection of broken services on herodemo. Every failure mode we've hit recently was discovered by a human looking at the demo:

  • "Contexts island shows Root × 5" — discovered by user
  • "Photos render alt-text only, no images" — discovered by user
  • "Biz dashboard shows 0 records" — discovered by user
  • "hero_os UI banner: Socket 'ui.sock' not found" — discovered by user
  • "hero_foundry rpc.sock accepts but doesn't respond" — discovered by user

In each case the symptom was a 5-second curl away from being detectable. We just don't run the curls.
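For instance, the "Biz dashboard shows 0 records" case reduces to one request. A minimal sketch, assuming a plain JSON-RPC envelope for the business.persons.list call (the exact request shape is an assumption, not something confirmed here):

```bash
# Hypothetical one-shot check for the biz-dashboard failure: call the persons
# list and fail loudly if the result array is empty. The JSON-RPC envelope is
# an assumption; only the method name comes from the probe panel below.
curl -fsS -X POST https://herodemo.gent01.grid.tf/hero_osis_business/rpc \
  -H 'Content-Type: application/json' \
  -d '{"jsonrpc":"2.0","id":1,"method":"business.persons.list","params":{}}' \
  | jq -e '.result | length > 0' >/dev/null \
  || echo "PROBE FAIL: biz dashboard has 0 records"
```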

Proposal

A live smoke loop that runs every 5–15 min against https://herodemo.gent01.grid.tf/ and exercises one URL per critical user path. On failure: log it, optionally page someone, optionally auto-restart the service via hero_proc.

Initial probe panel

| Probe | URL | Expected | Catches |
|-------|-----|----------|---------|
| Photos download | `/hero_foundry/rpc/api/files/<ctx>/Photos/<known-file>.jpg` | 200 image/jpeg | foundry rpc.sock half-broken; webdav storage misconfigured |
| Per-context routing | `POST /hero_osis_base/rpc` with `X-Hero-Context: geomind`, body `base.configuration.list` | result differs from root context | hero_osis context-fallback regression (#42) |
| Biz dashboard data | `POST /hero_osis_business/rpc`, `business.persons.list` | non-empty | seed-data drift; OSIS down |
| hero_os UI | `GET /hero_os/ui/` | 200 + has expected nav element | ui.sock missing, hero_os_ui dead |
| RPC health | `GET /hero_<svc>/rpc/health` for every supervised service | 200 | half-broken handler |
| Embedder queue | `GET /hero_embedder/health` | 200 + queue depth ok | OOM, runaway batch |

The panel grows as we hit more failure modes — each new bug we discover gets a probe so it can never sneak past us silently again.
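The per-context routing probe is the only one that needs a comparison rather than a single assertion. A minimal sketch, assuming the same JSON-RPC envelope as above and that `geomind` is a valid value for the context header (both are assumptions):

```bash
# Per-context routing probe (sketch): the same RPC issued with and without
# X-Hero-Context must return different results; identical responses mean the
# header is being ignored and everything falls back to the root context.
BASE=https://herodemo.gent01.grid.tf/hero_osis_base/rpc
BODY='{"jsonrpc":"2.0","id":1,"method":"base.configuration.list","params":{}}'
root=$(curl -fsS -X POST "$BASE" -H 'Content-Type: application/json' -d "$BODY")
geo=$(curl -fsS -X POST "$BASE" -H 'Content-Type: application/json' \
      -H 'X-Hero-Context: geomind' -d "$BODY")
[ "$root" != "$geo" ] \
  || echo "PROBE FAIL: context header ignored (fallback to root context?)"
```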

Implementation options

  1. Lightweight cron + curl + jq — runs on the VM itself, posts failures to forge or a Slack webhook. Fastest to ship. Obvious downside: blind to network issues between the VM and the public gateway.
  2. External monitor — runs from a different host (or GitHub Actions cron). Catches gateway/DNS/TLS issues too. Slower, more setup.
  3. Both — internal probe gives fine-grained "which service" answers, external probe gives "is the demo reachable at all" answers.

Recommend starting with (1) for shippability, then layering (2) on top once stable. A sketch of (1) follows.
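A minimal shape for option (1), assuming probes live in a flat `name|url|jq-assertion` file; the file location and format are made up for illustration, and POST probes (biz data, per-context routing) would need dedicated entries or functions rather than this generic GET loop:

```bash
#!/usr/bin/env bash
# smoke_loop.sh -- sketch of option (1): cron + curl + jq on the VM itself.
# Probe file, log path and line format are illustrative, not decided yet.
set -u
PROBES=/etc/herodemo/probes.txt   # hypothetical: name|url|jq assertion per line
LOG=/var/log/herodemo-smoke.log

while IFS='|' read -r name url assert; do
  [ -z "$name" ] && continue                  # skip blank lines
  case $name in "#"*) continue ;; esac        # skip comment lines
  if ! body=$(curl -fsS --max-time 10 "$url"); then
    echo "$(date -u +%FT%TZ) FAIL $name (curl)" >>"$LOG"
    continue
  fi
  if ! printf '%s' "$body" | jq -e "$assert" >/dev/null 2>&1; then
    echo "$(date -u +%FT%TZ) FAIL $name (assertion: $assert)" >>"$LOG"
  fi
done <"$PROBES"
```

Posting each FAIL line to a forge issue or Slack webhook would slot in where the log writes happen.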

Acceptance

  • Cron job (or hero_proc-managed timer) runs the panel every 5–15 min.
  • Failure path: log to file + (optional) post to forge issue or Slack webhook.
  • Add a "blast radius" probe per known regression so it never recurs silently:
    • photos: file download
    • per-context routing: distinct response per context header
    • biz: non-empty list
    • ui.sock: GET /<svc>/ui/ returns expected HTML element
  • Document the panel in hero_demo/docs/ops/SMOKE_LOOP.md so adding a new probe is a 3-line change.
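To make the cron item and the "3-line change" item concrete, a cron.d entry plus a new-probe addition could look like this, assuming the probes.txt format sketched above (schedule, paths and the embedder assertion are illustrative):

```bash
# /etc/cron.d/herodemo-smoke (illustrative): run the panel every 10 minutes.
*/10 * * * * root /usr/local/bin/smoke_loop.sh

# Adding a probe is then one commented line in probes.txt, e.g.:
#   # embedder queue: catches OOM / runaway batch
#   embedder-health|https://herodemo.gent01.grid.tf/hero_embedder/health|.status == "ok"
```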

Cross-references

  • Today's session discovered three half-broken services in 4 hours, all of which would have been caught by this loop within a single tick.
  • Sibling: lhumina_code/hero_proc#83 (per-service in-process probes catch single-service degradation; this loop catches end-to-end including gateway/router/DNS).
  • Related: lhumina_code/hero_demo#46 step 5 (snapshot of known-good state) — a smoke loop is what tells you "this state is good" before you snapshot it.

Signed-off-by: mik-tf

mik-tf self-assigned this 2026-04-30 20:02:43 +00:00