[infra] Live smoke loop against herodemo — automated detection of broken services before users see it #201

Open
opened 2026-04-30 20:02:43 +00:00 by mik-tf · 0 comments
Owner

Symptom

We have no automated detection of broken services on herodemo. Every failure mode we've hit recently was discovered by a human looking at the demo:

  • "Contexts island shows Root × 5" — discovered by user
  • "Photos render alt-text only, no images" — discovered by user
  • "Biz dashboard shows 0 records" — discovered by user
  • "hero_os UI banner: Socket 'ui.sock' not found" — discovered by user
  • "hero_foundry rpc.sock accepts but doesn't respond" — discovered by user

In each case the symptom was a 5-second curl away from being detectable. We just don't run the curls.
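For instance, the "Biz dashboard shows 0 records" case reduces to one request. A minimal sketch, assuming a plain JSON-RPC envelope for the business.persons.list call (the exact request shape is an assumption, not something confirmed here):

```bash
# Hypothetical one-shot check for the biz-dashboard failure: call the persons
# list and fail loudly if the result array is empty. The JSON-RPC envelope is
# an assumption; only the method name comes from the probe panel below.
curl -fsS -X POST https://herodemo.gent01.grid.tf/hero_osis_business/rpc \
  -H 'Content-Type: application/json' \
  -d '{"jsonrpc":"2.0","id":1,"method":"business.persons.list","params":{}}' \
  | jq -e '.result | length > 0' >/dev/null \
  || echo "PROBE FAIL: biz dashboard has 0 records"
```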

Proposal

A live smoke loop that runs every 5–15 min against https://herodemo.gent01.grid.tf/ and exercises one URL per critical user path. On failure: log it, optionally page someone, optionally auto-restart the service via hero_proc.

Initial probe panel

| Probe | URL | Expected | Catches |
|-------|-----|----------|---------|
| Photos download | `/hero_foundry/rpc/api/files/<ctx>/Photos/<known-file>.jpg` | 200 image/jpeg | foundry rpc.sock half-broken; webdav storage misconfigured |
| Per-context routing | `POST /hero_osis_base/rpc` with `X-Hero-Context: geomind`, body `base.configuration.list` | result differs from root context | hero_osis context-fallback regression (#42) |
| Biz dashboard data | `POST /hero_osis_business/rpc`, `business.persons.list` | non-empty | seed-data drift; OSIS down |
| hero_os UI | `GET /hero_os/ui/` | 200 + has expected nav element | ui.sock missing, hero_os_ui dead |
| RPC health | `GET /hero_<svc>/rpc/health` for every supervised service | 200 | half-broken handler |
| Embedder queue | `GET /hero_embedder/health` | 200 + queue depth ok | OOM, runaway batch |

The panel grows as we hit more failure modes — each new bug we discover gets a probe so it can never sneak past us silently again.
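The per-context routing probe is the only one that needs a comparison rather than a single assertion. A minimal sketch, assuming the same JSON-RPC envelope as above and that `geomind` is a valid value for the context header (both are assumptions):

```bash
# Per-context routing probe (sketch): the same RPC issued with and without
# X-Hero-Context must return different results; identical responses mean the
# header is being ignored and everything falls back to the root context.
BASE=https://herodemo.gent01.grid.tf/hero_osis_base/rpc
BODY='{"jsonrpc":"2.0","id":1,"method":"base.configuration.list","params":{}}'
root=$(curl -fsS -X POST "$BASE" -H 'Content-Type: application/json' -d "$BODY")
geo=$(curl -fsS -X POST "$BASE" -H 'Content-Type: application/json' \
      -H 'X-Hero-Context: geomind' -d "$BODY")
[ "$root" != "$geo" ] \
  || echo "PROBE FAIL: context header ignored (fallback to root context?)"
```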

Implementation options

  1. Lightweight cron + curl + jq — runs on the VM itself, posts failures to forge or a Slack webhook. Fastest to ship. Obvious downside: blind to network issues between the VM and the public gateway.
  2. External monitor — runs from a different host (or GitHub Actions cron). Catches gateway/DNS/TLS issues too. Slower, more setup.
  3. Both — internal probe gives fine-grained "which service" answers, external probe gives "is the demo reachable at all" answers.

Recommend starting with (1) for shippability, then layering (2) on top once stable. A sketch of (1) follows.
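A minimal shape for option (1), assuming probes live in a flat `name|url|jq-assertion` file; the file location and format are made up for illustration, and POST probes (biz data, per-context routing) would need dedicated entries or functions rather than this generic GET loop:

```bash
#!/usr/bin/env bash
# smoke_loop.sh -- sketch of option (1): cron + curl + jq on the VM itself.
# Probe file, log path and line format are illustrative, not decided yet.
set -u
PROBES=/etc/herodemo/probes.txt   # hypothetical: name|url|jq assertion per line
LOG=/var/log/herodemo-smoke.log

while IFS='|' read -r name url assert; do
  [ -z "$name" ] && continue                  # skip blank lines
  case $name in "#"*) continue ;; esac        # skip comment lines
  if ! body=$(curl -fsS --max-time 10 "$url"); then
    echo "$(date -u +%FT%TZ) FAIL $name (curl)" >>"$LOG"
    continue
  fi
  if ! printf '%s' "$body" | jq -e "$assert" >/dev/null 2>&1; then
    echo "$(date -u +%FT%TZ) FAIL $name (assertion: $assert)" >>"$LOG"
  fi
done <"$PROBES"
```

Posting each FAIL line to a forge issue or Slack webhook would slot in where the log writes happen.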

Acceptance

  • Cron job (or hero_proc-managed timer) runs the panel every 5–15 min.
  • Failure path: log to file + (optional) post to forge issue or Slack webhook.
  • Add a "blast radius" probe per known regression so it never recurs silently:
    • photos: file download
    • per-context routing: distinct response per context header
    • biz: non-empty list
    • ui.sock: GET /<svc>/ui/ returns expected HTML element
  • Document the panel in hero_demo/docs/ops/SMOKE_LOOP.md so adding a new probe is a 3-line change.
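To make the cron item and the "3-line change" item concrete, a cron.d entry plus a new-probe addition could look like this, assuming the probes.txt format sketched above (schedule, paths and the embedder assertion are illustrative):

```bash
# /etc/cron.d/herodemo-smoke (illustrative): run the panel every 10 minutes.
*/10 * * * * root /usr/local/bin/smoke_loop.sh

# Adding a probe is then one commented line in probes.txt, e.g.:
#   # embedder queue: catches OOM / runaway batch
#   embedder-health|https://herodemo.gent01.grid.tf/hero_embedder/health|.status == "ok"
```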

Cross-references

  • Today's session discovered three half-broken services in 4 hours, all of which would have been caught by this loop within a single tick.
  • Sibling: lhumina_code/hero_proc#83 (per-service in-process probes catch single-service degradation; this loop catches end-to-end including gateway/router/DNS).
  • Related: lhumina_code/hero_demo#46 step 5 (snapshot of known-good state) — a smoke loop is what tells you "this state is good" before you snapshot it.

Signed-off-by: mik-tf

mik-tf self-assigned this 2026-04-30 20:02:43 +00:00