Zinit service watchdog hotfix (deployed) #24
Zinit service watchdog hotfix
Context
During herodemo2 deployment, hero_embedder_server went inactive (crashed) and stayed down, with no auto-restart and no alerting. Investigation revealed this affects all services: zinit health checks are no-ops and there is no restart-on-failure behavior in the current config format.
Root cause: hero_services_server generates zinit configs using the old TOML format ([service] + exec + oneshot). This format has no restart policy. The health check TOMLs are dummy scripts that always succeed (echo "No health check configured"). When a service crashes, zinit marks it inactive and nothing recovers it.
Fix
Added a watchdog loop to docker/entrypoint.sh that checks zinit list every 60 seconds and restarts any service that went inactive unexpectedly.
What it covers:
What it won't catch:
Exclusions:
.health, .install, .test oneshots + hero_cloud (intentionally inactive)
Verification
Tested on both environments by killing hero_embedder_server and confirming the watchdog restarted it:
Deployed
docker/entrypoint.sh in hero_services
:dev2 image rebuilt and deployed to herodev2 (verified)
:dev2 promoted to :demo2, deployed to herodemo2 (verified)
Image digest: sha256:44de4cff1e31b70f2aa68ad9981092b9d18f35cfeb955492464042c5dece0321
Follow-up
The proper fix is migrating hero_services_server to use the zinit 0.4.0 job model with restart policies and real socket-based health checks. Tracked separately — see linked issue.
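Until that migration lands, the interim watchdog behaviour described under Fix can be sketched roughly as below. This is an illustration, not the code in docker/entrypoint.sh: it assumes that zinit list prints one service per line as "name: state" and that zinit start brings an inactive service back up. The function names and the WATCHDOG_INTERVAL variable are hypothetical.

```shell
#!/bin/sh
# Watchdog sketch. ASSUMPTIONS (not confirmed by the issue):
#   - `zinit list` prints one "<name>: <state>" pair per line
#   - `zinit start <name>` restarts an inactive service
# Function names and WATCHDOG_INTERVAL are made up for illustration.

# Return 0 for names that must never be restarted: the .health, .install
# and .test oneshots, plus hero_cloud (intentionally inactive).
is_excluded() {
    case "$1" in
        *.health|*.install|*.test|hero_cloud) return 0 ;;
        *) return 1 ;;
    esac
}

# Read list output on stdin; print the names of inactive, non-excluded services.
find_restart_candidates() {
    while IFS= read -r line; do
        # Skip lines that do not look like "<name>: <state>".
        case "$line" in
            *:*) ;;
            *) continue ;;
        esac
        name=$(printf '%s' "${line%%:*}" | tr -d '[:space:]')
        state=$(printf '%s' "${line#*:}" | tr -d '[:space:]')
        [ "$state" = "inactive" ] || continue
        is_excluded "$name" && continue
        printf '%s\n' "$name"
    done
}

# Poll every 60 seconds (overridable) and restart whatever went inactive.
watchdog_loop() {
    while true; do
        zinit list | find_restart_candidates | while IFS= read -r svc; do
            echo "watchdog: restarting $svc"
            zinit start "$svc" || echo "watchdog: failed to restart $svc" >&2
        done
        sleep "${WATCHDOG_INTERVAL:-60}"
    done
}
```

An entrypoint script could start this in the background with watchdog_loop & before handing over to its main process. Keeping the filtering in find_restart_candidates, separate from the loop, lets the exclusion rules be checked against captured list output without a running zinit.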
Related
hero_services/docker/entrypoint.sh: watchdog location
Title changed from "Zinit service resilience: watchdog hotfix + proper migration" to "Zinit service watchdog hotfix (deployed)"