Zinit service watchdog hotfix (deployed) #24

Closed
opened 2026-03-13 12:19:56 +00:00 by mik-tf · 0 comments

Zinit service watchdog hotfix

Context

During herodemo2 deployment, hero_embedder_server went inactive (crashed) and stayed down — no auto-restart, no alerting. Investigation revealed this affects all services: zinit health checks are no-ops and there is no restart-on-failure behavior in the current config format.

Root cause: hero_services_server generates zinit configs using the old TOML format ([service] + exec + oneshot). This format has no restart policy. The health check TOMLs are dummy scripts that always succeed (echo "No health check configured"). When a service crashes, zinit marks it inactive and nothing recovers it.
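As a sketch of the problem (service name and path are illustrative; only the [service] table and the exec/oneshot keys are taken from the description above), note that the old format simply has nowhere to express a restart policy:

```toml
# Hypothetical old-format zinit service config (illustrative values).
# There is no key for a restart policy — if the process crashes,
# zinit marks it inactive and nothing brings it back.
[service]
name = "hero_embedder_server"
exec = "/usr/local/bin/hero_embedder_server"
oneshot = false
```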

Fix

Added a watchdog loop to docker/entrypoint.sh that checks zinit list every 60 seconds and restarts any service that went inactive unexpectedly.

```bash
# Service watchdog — restart crashed services every 60 seconds
(while true; do
    sleep 60
    for svc in $(zinit list --socket "$ZINIT_SOCK" 2>/dev/null \
        | grep 'inactive' \
        | grep -v 'health\|install\|test\|cloud' \
        | awk '{print $2}'); do
        echo "[watchdog] Restarting $svc"
        zinit start "$svc" --socket "$ZINIT_SOCK" 2>/dev/null || true
    done
done) &
```
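The filter pipeline can be exercised in isolation against canned output. This sketch assumes `zinit list` prints one `<state>: <service>` pair per line, so the service name lands in field 2 — which is what the `awk '{print $2}'` above implies; the sample lines themselves are invented:

```shell
# Exercise the watchdog's filter pipeline against canned zinit list output.
# Assumption: each line is "<state>: <service>", service name in field 2.
sample='Running: hero_proxy_server
inactive: hero_embedder_server
inactive: hero_books_server.health
inactive: hero_cloud'

crashed=$(printf '%s\n' "$sample" \
    | grep 'inactive' \
    | grep -v 'health\|install\|test\|cloud' \
    | awk '{print $2}')

# Only hero_embedder_server survives: the .health oneshot and hero_cloud
# are excluded, and the running proxy never matched 'inactive'.
echo "$crashed"
```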

What it covers:

  • All non-oneshot services (embedder, indexer, books, auth, osis, proxy, etc.)
  • Auto-restarts within 60 seconds of crash

What it won't catch:

  • Hung processes (alive but not responding) — tracked in follow-up issue
  • Tight crash loops — no backoff logic, though the 60-second interval caps restart attempts and prevents system overload

Exclusions: .health, .install, .test oneshots + hero_cloud (intentionally inactive)

Verification

Tested on both environments by killing hero_embedder_server and confirming the watchdog restarted it:

  • herodev2: killed PID 189 → went inactive → watchdog restarted as PID 1333
  • herodemo2: killed PID 7476 → went inactive → watchdog restarted as PID 12308

Deployed

  • Watchdog tested manually on herodemo2
  • Watchdog added to docker/entrypoint.sh in hero_services
  • :dev2 image rebuilt and deployed to herodev2 — verified
  • :dev2 promoted to :demo2, deployed to herodemo2 — verified
  • Both environments running with watchdog active

Image digest: sha256:44de4cff1e31b70f2aa68ad9981092b9d18f35cfeb955492464042c5dece0321

Follow-up

The proper fix is migrating hero_services_server to use the zinit 0.4.0 job model with restart policies and real socket-based health checks. Tracked separately — see linked issue.
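For orientation only, the target shape would pair a restart policy with a real health probe. The key names below are hypothetical — they are not checked against zinit 0.4.0 documentation and exist only to illustrate the intent:

```yaml
# Hypothetical sketch of the target config — key names are illustrative,
# not taken from zinit 0.4.0 docs.
hero_embedder_server:
  exec: /usr/local/bin/hero_embedder_server
  restart:
    policy: on-failure        # restart automatically instead of staying inactive
    backoff_seconds: 5        # avoid tight crash loops
  health_check:
    socket: /var/run/hero_embedder.sock   # real probe, not an echo no-op
    interval_seconds: 30
```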

Related

  • #23 (Hero OS UI polish — parent)
  • hero_services/docker/entrypoint.sh — watchdog location
mik-tf changed title from Zinit service resilience: watchdog hotfix + proper migration to Zinit service watchdog hotfix (deployed) 2026-03-13 12:39:47 +00:00