[bug][P1] Log-store runaway: 58 GB in 24h from one chatty service wedged hero_proc + filled /data on herodemo #87

Closed
opened 2026-05-01 04:39:56 +00:00 by mik-tf · 2 comments
Owner

Production incident: 58 GB of logs in 24 hours, /data 100% full, hero_proc wedged

On 2026-04-30, the hero_proc partitioned log SQLite for day-of-year 120 grew to 58 GB in a single day on herodemo. Combined with the prior 6 days of normal-rate logs (~2.3 GB/day), /home/driver/hero/var/logs/ reached 71 GB out of a 200 GB partition, eventually filling /data to 100%. Once full, every hero_proc service ... RPC failed with Database error: database or disk is full, because hero_proc itself can't write to hero_proc.db without journal/WAL headroom.

This is the materialized form of the architectural risk filed at hero_proc#85. Not a "could happen" — actually happened.

Diagnosis (captured before deletion)

The runaway source was hero_collab.hero_collab_server writing 3.7 million log lines / 528 MB in ~5 hours before it died around 00:28 UTC on 2026-05-01. Sustained rate: ~200 lines / second.

Sample content (last log line written):

2026-05-01T00:28:31.752605Z ERROR hero_collab_server: Socket accept error: Too many open files (os error 24)

os error 24 is EMFILE — file descriptor exhaustion. Every iteration of the accept loop fails immediately and logs at full speed, with no backoff. Filed separately at hero_collab as a coding bug; this issue is about hero_proc's log-store responding correctly to a chatty source.

Day-by-day partition sizes (before remediation)

DOY 114 (Apr 24): 597 MB    (partial — service started 14:35)
DOY 115 (Apr 25): 2.4 GB
DOY 116 (Apr 26): 2.3 GB
DOY 117 (Apr 27): 2.3 GB
DOY 118 (Apr 28): 2.3 GB
DOY 119 (Apr 29): 2.3 GB
DOY 120 (Apr 30): 58  GB   ← runaway
DOY 121 (May 1):  1.1 GB    (current, partial; hero_collab died 00:28)

Baseline volume: ~2.3 GB/day from 24 supervised services. Spike: ~25× baseline in one day from one service.

Remediation taken (live, by mik-tf)

  • Deleted /home/driver/hero/var/logs/core/2026/120/logs.sqlite (-58 GB).
  • Deleted /home/driver/hero/var/logs/core/2026/114/logs.sqlite (-597 MB; past 7-day retention).
  • Verified hero_proc DB writes resumed — hero_proc service start hero_books now succeeds.
  • hero_collab_server is left stopped until the FD-leak / no-backoff issue is fixed at hero_collab. Restarting it without those fixes would re-trigger the same loop.

Root cause analysis (hero_proc-side)

Three things failed simultaneously in the hero_proc log store:

  1. No total-size cap. Retention is purely time-based (7 days, hard-coded at main.rs:118). One bad day at 25× baseline ate the whole disk, and once the disk was full, the time-based reaper itself couldn't write the DELETE journal it needs to enforce retention. This is the wedged-by-fullness pattern: the cleaner can't clean because cleaning itself needs free space.

  2. No per-source rate limit. A single service writing 200 lines/second has the same priority as every other source. There is no admission control between hero_proc's log API and the SQLite writer. A chatty source can starve disk for the whole node.

  3. No drop-on-write-failure path. When the SQLite write started failing, hero_proc didn't degrade gracefully — log accumulation just blocks or returns errors that propagate up. There's no "log to stderr and continue" fallback when persistence is broken.

Proposed long-term fix (architectural)

Order of importance:

A. Total-size cap with eager enforcement

Add a configurable HERO_PROC_LOG_MAX_BYTES (default e.g. 10 GB). Whenever a partition rolls (or every N writes), check total log-store size; if over cap, drop oldest partitions outright (not via SQLite DELETE — unlink the file). This works even when the disk is fully wedged because unlink doesn't need write headroom.
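A minimal sketch of how eager enforcement could work, assuming the on-disk layout seen during the incident (<log_root>/core/<year>/<day-of-year>/logs.sqlite) and the proposed HERO_PROC_LOG_MAX_BYTES variable; the helper names are illustrative, not hero_proc's actual code:

```rust
// Sketch only: assumes the layout <log_root>/core/<year>/<doy>/logs.sqlite
// observed on herodemo, and the proposed HERO_PROC_LOG_MAX_BYTES variable.
use std::fs;
use std::path::{Path, PathBuf};
use std::time::SystemTime;

const DEFAULT_MAX_BYTES: u64 = 10 * 1024 * 1024 * 1024; // 10 GB default cap

/// Walk the log store and record (path, size, mtime) of every partition file.
fn collect_partitions(
    dir: &Path,
    out: &mut Vec<(PathBuf, u64, SystemTime)>,
) -> std::io::Result<()> {
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        let meta = entry.metadata()?;
        if meta.is_dir() {
            collect_partitions(&entry.path(), out)?;
        } else if entry.file_name().to_str() == Some("logs.sqlite") {
            out.push((entry.path(), meta.len(), meta.modified()?));
        }
    }
    Ok(())
}

/// If the store exceeds the cap, unlink whole partition files oldest-first.
/// fs::remove_file needs no journal/WAL headroom, so this still works when
/// the disk is already 100% full.
fn enforce_size_cap(log_root: &Path, max_bytes: u64) -> std::io::Result<()> {
    let mut partitions = Vec::new();
    collect_partitions(log_root, &mut partitions)?;

    let mut total: u64 = partitions.iter().map(|(_, size, _)| *size).sum();
    partitions.sort_by_key(|(_, _, mtime)| *mtime); // oldest first

    for (path, size, _) in partitions {
        if total <= max_bytes {
            break;
        }
        fs::remove_file(&path)?;
        eprintln!("log cap: dropped {} ({size} bytes)", path.display());
        total = total.saturating_sub(size);
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    let max_bytes = std::env::var("HERO_PROC_LOG_MAX_BYTES")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(DEFAULT_MAX_BYTES);
    enforce_size_cap(Path::new("/home/driver/hero/var/logs"), max_bytes)
}
```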

B. Per-source rate limit / sampling

Token bucket per src field. Default e.g. 100 lines/second sustained, 1000-line burst. Excess gets sampled (1-in-N) or dropped with a counter that's logged at low rate. Prevents one runaway service from monopolizing the store.
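A sketch of the admission check, using the defaults suggested above; the LogLimiter type and the src-keyed map are hypothetical, since hero_proc has no such layer today:

```rust
// Sketch only: the LogLimiter type, the `src` key, and the default numbers
// are illustrative; hero_proc currently has no admission layer at all.
use std::collections::HashMap;
use std::time::Instant;

const RATE_PER_SEC: f64 = 100.0; // sustained lines/second per source
const BURST: f64 = 1000.0;       // burst allowance per source

struct Bucket {
    tokens: f64,
    last_refill: Instant,
}

#[derive(Default)]
struct LogLimiter {
    buckets: HashMap<String, Bucket>,
}

impl LogLimiter {
    /// Returns true if a line from `src` may be written, false if it should
    /// be dropped or sampled because the source is over budget.
    fn admit(&mut self, src: &str) -> bool {
        let now = Instant::now();
        let bucket = self.buckets.entry(src.to_string()).or_insert(Bucket {
            tokens: BURST,
            last_refill: now,
        });
        // Refill proportionally to elapsed time, capped at the burst size.
        let elapsed = now.duration_since(bucket.last_refill).as_secs_f64();
        bucket.tokens = (bucket.tokens + elapsed * RATE_PER_SEC).min(BURST);
        bucket.last_refill = now;

        if bucket.tokens >= 1.0 {
            bucket.tokens -= 1.0;
            true
        } else {
            // A real implementation would also count the drop here and emit a
            // low-rate "N lines dropped from <src>" summary line.
            false
        }
    }
}

fn main() {
    let mut limiter = LogLimiter::default();
    // A runaway source like the EMFILE accept loop is cut off after the burst.
    let admitted = (0..5_000).filter(|_| limiter.admit("hero_collab_server")).count();
    println!("admitted {admitted} of 5000 back-to-back lines"); // ~1000 (the burst)
}
```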

C. Atomic partition-level reaper

Today's retention loop must DELETE individual rows inside a SQLite file, requiring journal space. Change to "drop oldest partition file" semantics — partitioning is already by day, so just unlink files older than MAX_DAYS. Idempotent, doesn't need write headroom, runs even on a full disk.
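A sketch of a partition-level reaper under the same layout assumption as in A; abs_day and the hard-coded MAX_DAYS are illustrative:

```rust
// Sketch only: same layout assumption as above
// (<log_root>/core/<year>/<doy>/logs.sqlite); abs_day and MAX_DAYS are illustrative.
use std::fs;
use std::path::Path;

const MAX_DAYS: i64 = 7;

fn is_leap(y: i64) -> bool {
    (y % 4 == 0 && y % 100 != 0) || y % 400 == 0
}

/// Absolute day count for (year, day-of-year), so partitions compare cleanly
/// across a year boundary.
fn abs_day(year: i64, doy: i64) -> i64 {
    (1970..year).map(|y| if is_leap(y) { 366 } else { 365 }).sum::<i64>() + doy
}

fn dir_as_num(p: &Path) -> Option<i64> {
    p.file_name()?.to_str()?.parse().ok()
}

/// Unlink whole partition files older than MAX_DAYS. No row-level DELETE,
/// no journal space needed, idempotent, safe to run on a 100%-full disk.
fn reap_old_partitions(log_root: &Path, today_year: i64, today_doy: i64) -> std::io::Result<()> {
    let cutoff = abs_day(today_year, today_doy) - MAX_DAYS;
    for year_entry in fs::read_dir(log_root.join("core"))? {
        let year_dir = year_entry?.path();
        let Some(year) = dir_as_num(&year_dir) else { continue };
        for day_entry in fs::read_dir(&year_dir)? {
            let day_dir = day_entry?.path();
            let Some(doy) = dir_as_num(&day_dir) else { continue };
            if abs_day(year, doy) <= cutoff {
                let _ = fs::remove_file(day_dir.join("logs.sqlite"));
                let _ = fs::remove_dir(&day_dir); // best-effort directory cleanup
            }
        }
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    // In the real service, "today" would come from the clock / scheduler tick.
    reap_old_partitions(Path::new("/home/driver/hero/var/logs"), 2026, 121)
}
```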

D. Drop-on-write-failure fallback

When a SQLite write fails with "database or disk is full" (or any other error), log to stderr and increment a dropped-line counter exposed via logs.stats. Don't propagate up. Don't wedge.
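A sketch of the fallback path; write_line_to_sqlite stands in for whatever the real writer is, and the stderr-plus-counter behavior is the proposal, not current behavior:

```rust
// Sketch only: write_line_to_sqlite is a placeholder for hero_proc's real
// writer; the fallback behaviour around it is what proposal D asks for.
use std::sync::atomic::{AtomicU64, Ordering};

static DROPPED_LINES: AtomicU64 = AtomicU64::new(0);

fn write_line_to_sqlite(_src: &str, _line: &str) -> Result<(), String> {
    // Placeholder: imagine this fails the way it did during the incident.
    Err("database or disk is full".to_string())
}

/// Persist a log line if possible; never propagate storage errors upward.
fn store_log_line(src: &str, line: &str) {
    if let Err(e) = write_line_to_sqlite(src, line) {
        // Degrade instead of wedging: count the drop, note the failure on
        // stderr at a low rate, and keep serving RPCs.
        let dropped = DROPPED_LINES.fetch_add(1, Ordering::Relaxed) + 1;
        if dropped == 1 || dropped % 10_000 == 0 {
            eprintln!("log store degraded ({e}); {dropped} lines dropped so far");
        }
    }
}

/// What a logs.stats-style endpoint could expose.
fn dropped_line_count() -> u64 {
    DROPPED_LINES.load(Ordering::Relaxed)
}

fn main() {
    store_log_line("hero_collab_server", "Socket accept error: ...");
    println!("dropped so far: {}", dropped_line_count());
}
```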

E. Operator alerting

Surface log-store fill % on the hero_proc UI dashboard. Alarm at 70%, page at 90%. Today the only signal is the demo going down.
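For completeness, a sketch of the threshold mapping only (the dashboard and paging wiring are out of scope here):

```rust
// Sketch only: maps fill percentage to the proposed alert levels.
#[derive(Debug, PartialEq)]
enum AlertLevel {
    Ok,
    Warn, // alarm at 70%
    Page, // page at 90%
}

fn alert_level(used_bytes: u64, total_bytes: u64) -> AlertLevel {
    let pct = used_bytes as f64 / total_bytes as f64 * 100.0;
    if pct >= 90.0 {
        AlertLevel::Page
    } else if pct >= 70.0 {
        AlertLevel::Warn
    } else {
        AlertLevel::Ok
    }
}

fn main() {
    assert_eq!(alert_level(71, 200), AlertLevel::Ok);    // ~35% full
    assert_eq!(alert_level(150, 200), AlertLevel::Warn); // 75% -> alarm
    assert_eq!(alert_level(185, 200), AlertLevel::Page); // ~93% -> page
    println!("thresholds behave as described");
}
```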

Severity

Critical (P1). Took the live demo offline and several services with it. Recurs by default — same loop in hero_collab is still latent, and any other service with a chatty error path will reproduce.

Cross-refs

  • hero_proc#85 — the prior architectural-follow-up filing for log volume risk; this issue is the concrete incident.
  • hero_proc#86 — reliability META.
  • hero_collab — separate issue (this session) for the FD leak + no-backoff accept loop.
  • hero_proc#83 — separate issue: status drift during this incident (hero_proc service list reported every service as failed while many were actually running).

Spotted during docs_hero Phase 1 UX gate (session 52). Reconciliation memo: memory/investigation_roadmap_reconciliation.md.

Owner

will be resolved in #89

Author
Owner

Update: architectural fix already exists in HEAD

After filing this issue I pulled lhumina_code/hero_proc and found the SQLite log backend has already been replaced with hero_log per #89, commit a2eff7c — refactor(logs): replace SQLite log backend with hero_log. The runaway-resilience properties this issue asked for (no SQLite, presumably size-capped append store, no DB-full wedge) are likely in hero_log already — pending review of the actual semantics.

The 58 GB / 24h incident on herodemo therefore reflects a deployment lag, not a missing fix. The herodemo binary was built before a2eff7c and still carries the SQLite log backend that filled /data.

Reframing this issue:

  1. Deploy hero_proc on herodemo to pick up a2eff7c and the related reliability fixes that landed in the same window:
       • #85 — log volume / SQLite removal (closed)
       • #83 — status drift (closed) — saw this live during the incident: hero_proc service list reported all-failed while pgrep showed most services running.
       • #80, #81 (closed)
       • #86 — reliability META (closed); see the detailed plan now committed at specs/issue_86_plan.md.

  2. Verify hero_log under runaway conditions. The hero_collab#42 loop (200 lines/sec on EMFILE) is a great natural stress test once that bug is staging-reproducible — confirm hero_log size-caps and doesn't wedge under sustained ~25× baseline load; a minimal synthetic load generator is sketched after this list.

  3. Operator alerting — the original recommendation in this issue body still applies regardless of backend: surface log-store fill % on the hero_proc UI dashboard with thresholds (warn at 70%, page at 90%). Today the only signal is the demo going down.
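A minimal synthetic load generator for step 2, approximating the hero_collab#42 failure mode at ~200 lines/second; where the lines are sent (stdout here) would be replaced by whatever path actually feeds hero_log on staging:

```rust
// Sketch only: not the actual reproduction harness. Emits bursts of 200
// error lines once per second, i.e. ~200 lines/s sustained, mirroring the
// runaway rate observed during the incident.
use std::thread::sleep;
use std::time::{Duration, Instant};

fn main() {
    let rate_per_sec = 200;
    let run_for = Duration::from_secs(60 * 60); // adjust to approximate 25x daily baseline
    let start = Instant::now();
    let mut emitted: u64 = 0;

    while start.elapsed() < run_for {
        for _ in 0..rate_per_sec {
            // Mirrors the real runaway line so on-disk sizes are comparable.
            println!(
                "ERROR hero_collab_server: Socket accept error: Too many open files (os error 24)"
            );
            emitted += 1;
        }
        sleep(Duration::from_secs(1));
    }
    eprintln!("emitted {emitted} synthetic error lines");
}
```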

If hero_log already covers (1) and (2), this issue can probably close as a duplicate of #89 once herodemo is redeployed and the size-cap behavior is verified under load.

Apologies for the duplicate filing — I checked the open issues list at the start of session 52 but missed #89 (and the rapid succession of merges in the last few hours).
