[infra] Live smoke loop against herodemo — automated detection of broken services before users see it #201
Symptom
We have no automated detection of broken services on herodemo. Every failure mode we've hit recently was discovered by a human looking at the demo:
In each case the symptom was a 5-second curl away from being detectable. We just don't run the curls.
Proposal
A live smoke loop that runs every 5–15 min against https://herodemo.gent01.grid.tf/ and exercises one URL per critical user path. On failure: log it, optionally page someone, optionally auto-restart the service via hero_proc.
Initial probe panel
- /hero_foundry/rpc/api/files/<ctx>/Photos/<known-file>.jpg
- POST /hero_osis_base/rpc with X-Hero-Context: geomind, body base.configuration.list
- POST /hero_osis_business/rpc (business.persons.list)
- GET /hero_os/ui/
- GET /hero_<svc>/rpc/health for every supervised service
- GET /hero_embedder/health

The panel grows as we hit more failure modes: each new bug we discover gets a probe so it can never sneak past us silently again.
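To make the panel concrete, here is a minimal Python sketch of a data-driven probe list and runner. It is an illustration under stated assumptions, not the agreed implementation: the names `Probe`, `PANEL`, `build_request`, `http_status` and `run_panel` are hypothetical, the RPC endpoints are assumed to accept a JSON-RPC 2.0 envelope, and any 2xx status is assumed to mean "healthy". The templated paths (`<ctx>`, `<svc>`, `<known-file>`) are left out because their concrete values are not given here.

```python
import json
import urllib.error
import urllib.request
from dataclasses import dataclass, field
from typing import Optional

BASE_URL = "https://herodemo.gent01.grid.tf"


@dataclass
class Probe:
    """One entry in the probe panel (hypothetical structure)."""
    name: str
    method: str
    path: str
    headers: dict = field(default_factory=dict)
    rpc_method: Optional[str] = None  # RPC method name for POST probes


# Subset of the panel; each new probe is one more entry in this list.
PANEL = [
    Probe("osis_base", "POST", "/hero_osis_base/rpc",
          {"X-Hero-Context": "geomind"}, "base.configuration.list"),
    Probe("osis_business", "POST", "/hero_osis_business/rpc",
          {}, "business.persons.list"),
    Probe("os_ui", "GET", "/hero_os/ui/"),
    Probe("embedder_health", "GET", "/hero_embedder/health"),
]


def build_request(probe, base_url=BASE_URL):
    """Turn a Probe into a urllib Request without sending it."""
    headers = dict(probe.headers)
    data = None
    if probe.rpc_method is not None:
        # ASSUMPTION: the RPC endpoints speak JSON-RPC 2.0.
        payload = {"jsonrpc": "2.0", "id": 1,
                   "method": probe.rpc_method, "params": {}}
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(base_url + probe.path, data=data,
                                  headers=headers, method=probe.method)


def http_status(request, timeout=5.0):
    """Send the request; return the HTTP status, or 0 if unreachable."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # server answered with a 4xx/5xx status
    except OSError:
        return 0  # DNS failure, refused connection, timeout, ...


def run_panel(panel, fetch=http_status):
    """Run every probe; return (name, status) for each non-2xx result."""
    failures = []
    for probe in panel:
        status = fetch(build_request(probe))
        if not 200 <= status < 300:
            failures.append((probe.name, status))
    return failures
```

Calling `run_panel(PANEL)` from a scheduler every 5–15 min would return an empty list when all probes pass. The `fetch` parameter is injectable so the panel logic can be exercised without touching the live demo. A status-only check would miss a 200 response that carries an RPC error in its body, so a real probe would likely need to inspect the payload as well.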
Implementation options
Recommend starting with (1) for shippability, then layering (2) once stable.
Acceptance
hero_demo/docs/ops/SMOKE_LOOP.md so adding a new probe is a 3-line change.
Cross-references
Signed-off-by: mik-tf