[bug][P1] Log-store runaway: 58 GB in 24h from one chatty service wedged hero_proc + filled /data on herodemo #87
Production incident: 58 GB of logs in 24 hours, /data 100% full, hero_proc wedged
On 2026-04-30, the hero_proc partitioned log SQLite for day-of-year 120 grew to 58 GB in a single day on herodemo. Combined with the prior 6 days of normal-rate logs (~2.3 GB/day), `/home/driver/hero/var/logs/` reached 71 GB on a 200 GB partition, eventually filling `/data` to 100%. Once full, every `hero_proc service ...` RPC failed with `Database error: database or disk is full`, because hero_proc itself can't write to `hero_proc.db` without journal/WAL headroom.

This is the materialized form of the architectural risk filed at hero_proc#85. Not a "could happen": it actually happened.
Diagnosis (captured before deletion)
The runaway source was hero_collab.hero_collab_server writing 3.7 million log lines / 528 MB in ~5 hours before it died around 00:28 UTC on 2026-05-01. Sustained rate: ~200 lines / second.
Sample content (last log line written):
`os error 24` is `EMFILE`: file descriptor exhaustion. Every iteration of the accept loop fails immediately and logs at full speed, with no backoff. Filed separately at hero_collab as a coding bug; this issue is about hero_proc's log store responding correctly to a chatty source.
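For context, a minimal sketch of the failure pattern (not hero_collab's actual code): once the process exhausts its fd limit, every `accept()` returns `EMFILE` immediately, so the error arm of the loop becomes a full-speed log flood unless it backs off.

```rust
use std::net::TcpListener;
use std::time::Duration;

fn serve(listener: TcpListener) {
    loop {
        match listener.accept() {
            Ok((stream, addr)) => {
                // Hand the connection off to a worker (elided).
                let _ = (stream, addr);
            }
            Err(e) => {
                // Without this backoff, a persistent EMFILE turns the loop
                // into a sustained ~200 lines/second flood, as in the incident.
                eprintln!("accept failed: {e}");
                std::thread::sleep(Duration::from_millis(100));
            }
        }
    }
}
```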
Day-by-day partition sizes (before remediation)

Baseline volume: ~2.3 GB/day from 24 supervised services. Spike: ~25× baseline in one day from one service.
Remediation taken (live, by mik-tf)
- Deleted `/home/driver/hero/var/logs/core/2026/120/logs.sqlite` (-58 GB).
- Deleted `/home/driver/hero/var/logs/core/2026/114/logs.sqlite` (-597 MB; past 7-day retention).
- `hero_proc service start hero_books` now succeeds.

Root cause analysis (hero_proc-side)
Three things failed simultaneously in the hero_proc log store:
1. No total-size cap. Retention is purely time-based (7 days, hard-coded at `main.rs:118`). One bad day at 25× baseline ate the whole disk, and once the disk was full, the time-based reaper itself couldn't write the DELETE journal needed to enforce retention. This is the wedged-by-fullness pattern: the cleaner can't clean because the disk is too full to clean from.
2. No per-source rate limit. A single service writing 200 lines/second has the same priority as every other source. There is no admission control between hero_proc's log API and the SQLite writer, so a chatty source can starve the whole node of disk.
3. No drop-on-write-failure path. When the SQLite write started failing, hero_proc didn't degrade gracefully: log calls block or return errors that propagate up. There is no "log to stderr and continue" fallback for when persistence is broken.
Proposed long-term fix (architectural)
Order of importance:
A. Total-size cap with eager enforcement

Add a configurable `HERO_PROC_LOG_MAX_BYTES` (default e.g. 10 GB). Whenever a partition rolls (or every N writes), check the total log-store size; if over the cap, drop the oldest partitions outright (not via SQLite DELETE; `unlink` the files). This works even when the disk is fully wedged, because `unlink` doesn't need write headroom. A sketch follows.
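A minimal sketch of eager enforcement, assuming the day-partition layout visible in the incident paths (`.../core/<year>/<doy>/logs.sqlite`); the names and structure are illustrative, not hero_proc's actual code:

```rust
use std::fs;
use std::path::{Path, PathBuf};

const DEFAULT_MAX_BYTES: u64 = 10 * 1024 * 1024 * 1024; // proposed 10 GB default

/// Collect (partition file, size) pairs, oldest first. Assumes day-of-year
/// directory names sort chronologically (zero-padded, as in 114/120).
fn partitions(year_root: &Path) -> std::io::Result<Vec<(PathBuf, u64)>> {
    let mut parts = Vec::new();
    for entry in fs::read_dir(year_root)? {
        let db = entry?.path().join("logs.sqlite");
        if let Ok(meta) = fs::metadata(&db) {
            parts.push((db, meta.len()));
        }
    }
    parts.sort();
    Ok(parts)
}

/// Unlink oldest partitions until the store fits under the cap.
/// unlink needs no journal/WAL headroom, so this works on a full disk.
fn enforce_cap(year_root: &Path, max_bytes: u64) -> std::io::Result<()> {
    let parts = partitions(year_root)?;
    let mut total: u64 = parts.iter().map(|(_, size)| size).sum();
    for (db, size) in parts {
        if total <= max_bytes {
            break;
        }
        fs::remove_file(&db)?; // plain unlink, not a SQLite DELETE
        total -= size;
    }
    Ok(())
}

// e.g.: enforce_cap(Path::new("/home/driver/hero/var/logs/core/2026"), DEFAULT_MAX_BYTES)?;
```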
B. Per-source rate limit / sampling

Token bucket per `src` field. Default e.g. 100 lines/second sustained, 1000-line burst. Excess gets sampled (1-in-N) or dropped, with a drop counter that is itself logged at a low rate. Prevents one runaway service from monopolizing the store (sketch below).
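A sketch of the token bucket, assuming each log line carries a `src` string; the rate and burst values are the defaults proposed above, and the API shape is illustrative:

```rust
use std::collections::HashMap;
use std::time::Instant;

struct Bucket {
    tokens: f64,
    last: Instant,
}

struct RateLimiter {
    rate: f64,  // tokens per second, e.g. 100.0
    burst: f64, // bucket capacity, e.g. 1000.0
    buckets: HashMap<String, Bucket>,
}

impl RateLimiter {
    fn new(rate: f64, burst: f64) -> Self {
        Self { rate, burst, buckets: HashMap::new() }
    }

    /// Returns true if this line should be persisted.
    fn admit(&mut self, src: &str) -> bool {
        let now = Instant::now();
        let b = self.buckets.entry(src.to_string()).or_insert(Bucket {
            tokens: self.burst,
            last: now,
        });
        // Refill proportionally to elapsed time, capped at the burst size.
        b.tokens = (b.tokens + now.duration_since(b.last).as_secs_f64() * self.rate)
            .min(self.burst);
        b.last = now;
        if b.tokens >= 1.0 {
            b.tokens -= 1.0;
            true
        } else {
            false // caller samples 1-in-N or bumps a dropped-line counter
        }
    }
}
```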
C. Atomic partition-level reaper

Today's retention loop must DELETE individual rows inside a SQLite file, which requires journal space. Change to "drop oldest partition file" semantics: partitioning is already by day, so just `unlink` files older than `MAX_DAYS`. Idempotent, needs no write headroom, and runs even on a full disk. See the sketch below.
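A sketch under the same layout assumption as the size-cap example; year-rollover handling is omitted for brevity:

```rust
use std::fs;
use std::path::Path;

const MAX_DAYS: u32 = 7; // the existing retention window

/// Drop whole day-partitions older than the retention window.
fn reap(year_root: &Path, today_doy: u32) -> std::io::Result<()> {
    for entry in fs::read_dir(year_root)? {
        let entry = entry?;
        if let Some(doy) = entry.file_name().to_str().and_then(|s| s.parse::<u32>().ok()) {
            if today_doy.saturating_sub(doy) > MAX_DAYS {
                // Idempotent unlink: no journal space needed, so the
                // reaper keeps working even on a 100% full disk.
                let _ = fs::remove_file(entry.path().join("logs.sqlite"));
            }
        }
    }
    Ok(())
}
```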
D. Drop-on-write-failure fallback

When the SQLite write returns "database or disk is full" (or any other errno), log to stderr and increment a dropped-line counter exposed via `logs.stats`. Don't propagate the error up. Don't wedge. A sketch follows.
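A sketch of the degrade-don't-wedge write path; `write_line` stands in for the real SQLite insert, and the counter is what a `logs.stats` call could expose. All names here are illustrative:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

static DROPPED_LINES: AtomicU64 = AtomicU64::new(0);

fn persist_or_drop(line: &str, write_line: impl Fn(&str) -> std::io::Result<()>) {
    if let Err(e) = write_line(line) {
        // Persistence is broken (e.g. "database or disk is full"):
        // fall back to stderr, count the drop, and keep serving.
        DROPPED_LINES.fetch_add(1, Ordering::Relaxed);
        eprintln!("log store write failed ({e}); line dropped: {line}");
        // Deliberately returns nothing: callers must not wedge on logging.
    }
}
```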
E. Operator alerting

Surface log-store fill % on the hero_proc UI dashboard. Alarm at 70%, page at 90%. Today the only signal is the demo going down. A sketch of the fill computation follows.
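One way the fill % could be computed against the proposed cap (std-only directory walk; `alarm` and `page` are hypothetical hooks into whatever alerting layer exists):

```rust
use std::fs;
use std::path::Path;

/// Recursive size of the log store; unreadable entries count as 0.
fn dir_size(path: &Path) -> u64 {
    let mut total = 0;
    if let Ok(entries) = fs::read_dir(path) {
        for entry in entries.flatten() {
            let p = entry.path();
            total += if p.is_dir() {
                dir_size(&p)
            } else {
                fs::metadata(&p).map(|m| m.len()).unwrap_or(0)
            };
        }
    }
    total
}

fn fill_percent(log_root: &Path, max_bytes: u64) -> u64 {
    dir_size(log_root) * 100 / max_bytes
}

// Hypothetical wiring:
// let pct = fill_percent(Path::new("/home/driver/hero/var/logs"), max_bytes);
// if pct >= 90 { page() } else if pct >= 70 { alarm() }
```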
Severity
Critical (P1). Took the live demo offline and several services with it. Recurs by default: the same loop in hero_collab is still latent, and any other service with a chatty error path will reproduce it.
Cross-refs
- Related hero_proc bug: `hero_proc service list` reported every service as `failed` while many were actually running.
- Spotted during docs_hero Phase 1 UX gate (session 52). Reconciliation memo: `memory/investigation_roadmap_reconciliation.md`.
- Will be resolved in #89.
Update: architectural fix already exists in HEAD
After filing this issue I pulled `lhumina_code/hero_proc` and found that the SQLite log backend has already been replaced with `hero_log` per #89, commit a2eff7c (`refactor(logs): replace SQLite log backend with hero_log`). The runaway-resilience properties this issue asked for (no SQLite, presumably a size-capped append store, no DB-full wedge) are likely in `hero_log` already, pending review of the actual semantics.

The 58 GB / 24h incident on herodemo therefore reflects a deployment lag, not a missing fix. The herodemo binary was built before a2eff7c and still carries the SQLite log backend that filled `/data`.

Reframing this issue:
1. Deploy `hero_proc` on herodemo to pick up a2eff7c and the related reliability fixes that landed in the same window (among them the fix for `hero_proc service list` reporting all-failed while pgrep showed most services running; see `specs/issue_86_plan.md`).
2. Verify hero_log under runaway conditions. The hero_collab#42 loop (200 lines/sec on EMFILE) is a great natural stress test once that bug is staging-reproducible: confirm `hero_log` size-caps and doesn't wedge under sustained ~25× baseline load. A throwaway flood generator is sketched after this list.
3. Operator alerting. The original recommendation in this issue body still applies regardless of backend: surface log-store fill % on the hero_proc UI dashboard with thresholds (warn at 70%, page at 90%). Today the only signal is the demo going down.
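For step 2, a throwaway flood generator mimicking the hero_collab#42 pattern; `send_log` stands in for whatever API ships lines to the log store and is an assumption, not a real hero_proc call:

```rust
use std::thread::sleep;
use std::time::Duration;

/// Emit an EMFILE-style error line at roughly the incident rate.
fn flood(send_log: impl Fn(&str)) {
    loop {
        send_log("accept failed: Too many open files (os error 24)");
        sleep(Duration::from_millis(5)); // ~200 lines/second
    }
}
```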
If `hero_log` already covers (1) and (2), this issue can probably close as a duplicate of #89 once herodemo is redeployed and the size-cap behavior is verified under load.

Apologies for the duplicate filing; I checked the open issues list at the start of session 52 but missed #89 (and the rapid succession of merges in the last few hours).