[bug] log_batcher unbounded channel — hero_proc_server leaks unbounded memory under sqlite write pressure #80

Closed
opened 2026-04-30 13:34:13 +00:00 by mik-tf · 3 comments
Owner

Symptom

Observed on herodemo (TF Grid VM, 32 GB RAM, 16 CPU) on 2026-04-30:

  • hero_proc_server accumulated 26 GB resident / 31 GB virtual after only 2h 13m of uptime.
  • State S (sleeping), VmData ≈ 30 GB (so ~all heap, not code or shared mappings).
  • Coincided with a long cargo build --release -j 16 running on the same VM, which pegged btrfs I/O for ~50 minutes.
  • Demo became unresponsive (load avg 80+ on a 16-CPU VM); WASM shell sat at a blank page because every API call queued behind hero_proc's CPU/memory pressure.
  • kill -9 on the daemon + clean restart dropped memory to 27 MB RSS within seconds. All supervised services kept running unchanged (supervisor model held — children are not killed when the daemon dies).

Root cause

crates/hero_proc_server/src/log_batcher.rs:42:

```rust
let (tx, rx) = mpsc::unbounded_channel::<LogEntry>();
```

Producer side: every Hero service that calls logs.insert (heartbeats, info messages, supervisor pings, RPC tracing, etc.) pushes a LogEntry onto this channel via batcher.send(entry). There is no backpressure.

Consumer side: the batcher_task flushes every 100 ms or when 10 000 entries accumulate, writing in a single SQLite transaction.
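For orientation, a plausible shape of that consumer loop — a sketch reconstructed from the description above, not the actual log_batcher.rs source; LogEntry and flush are stand-ins:

```rust
// Sketch of batcher_task as described above (assumption: the real
// log_batcher.rs differs in detail; LogEntry and flush are stand-ins).
use std::time::Duration;
use tokio::sync::mpsc::UnboundedReceiver;

struct LogEntry;

const FLUSH_INTERVAL: Duration = Duration::from_millis(100);
const FLUSH_THRESHOLD: usize = 10_000;

async fn batcher_task(mut rx: UnboundedReceiver<LogEntry>) {
    let mut buf: Vec<LogEntry> = Vec::with_capacity(FLUSH_THRESHOLD);
    let mut tick = tokio::time::interval(FLUSH_INTERVAL);
    loop {
        tokio::select! {
            // Time-based flush: every 100 ms, whatever has accumulated.
            _ = tick.tick() => flush(&mut buf).await,
            // Size-based flush: at 10 000 buffered entries.
            Some(entry) = rx.recv() => {
                buf.push(entry);
                if buf.len() >= FLUSH_THRESHOLD {
                    flush(&mut buf).await;
                }
            }
        }
    }
}

// One SQLite transaction per flush. When disk I/O is pegged, this await
// stalls, and nothing upstream stops producers from growing the channel.
async fn flush(buf: &mut Vec<LogEntry>) {
    if buf.is_empty() {
        return;
    }
    // BEGIN IMMEDIATE; INSERT ...; COMMIT;  (elided)
    buf.clear();
}
```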

Failure mode: when SQLite writes slow down for any reason (disk I/O contention, a slow consumer query holding a lock, etc.) the channel grows without limit. Today's trigger was cargo dominating btrfs I/O — the writer task fell behind, producers kept appending, ~26 GB of buffered LogEntry accumulated over 2h.

At steady-state failure rate today: ~26 GB / 2h13m ≈ 3 MB/s of buffered logs that never reached disk.

Why this is a real bug, not just "unhappy VM"

  • Unbounded queue + steady producers + any temporarily-slow consumer = guaranteed memory blowup. The cargo build was the trigger this time; any future I/O pressure event (large export, snapshot, fsync stall, even another service doing heavy disk work) recreates the conditions.
  • The bug is on origin/development today — not fixed in the 4 commits the deployed VM is behind.
  • Lost log entries are not load-bearing. Out-of-memory daemon restart is load-bearing (it took out the demo for ~30 s while we diagnosed and recovered, and with a less attentive operator it could have run all the way to the OOM killer).

Proposed fix (~10 lines)

  1. Bounded channel with a sane capacity (proposed: 100 000 entries — ~30 MB max buffered at typical entry size).
  2. Producers use try_send and drop log entries when the channel is full (see the sketch below).
  3. Optional but cheap: a rate-limited tracing::warn! (or an AtomicU64 counter) when drops happen, so we have telemetry if it ever fires. Out of scope for the minimal fix.

Dropping log entries when the system is under heavy I/O pressure is the correct trade-off — those logs would not have been useful, whereas an OOM-killed supervisor would have been catastrophic.
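A minimal sketch of steps 1–3 under those assumptions (LOG_CAPACITY, DROPPED_TOTAL, and send_log are illustrative names, not the actual hero_proc identifiers):

```rust
// Minimal sketch of the proposed fix (illustrative names, not the
// actual hero_proc identifiers).
use std::sync::atomic::{AtomicU64, Ordering};
use tokio::sync::mpsc::{self, error::TrySendError};

struct LogEntry;

// ~30 MB max buffered at typical entry size.
const LOG_CAPACITY: usize = 100_000;

static DROPPED_TOTAL: AtomicU64 = AtomicU64::new(0);

fn make_log_channel() -> (mpsc::Sender<LogEntry>, mpsc::Receiver<LogEntry>) {
    // Bounded: in-channel memory is now capped regardless of disk speed.
    mpsc::channel::<LogEntry>(LOG_CAPACITY)
}

// Producer-side send path: never blocks, never buffers past the cap.
fn send_log(tx: &mpsc::Sender<LogEntry>, entry: LogEntry) {
    match tx.try_send(entry) {
        Ok(()) => {}
        // Channel full under I/O pressure: drop the entry and count it.
        Err(TrySendError::Full(_)) => {
            DROPPED_TOTAL.fetch_add(1, Ordering::Relaxed);
        }
        // Batcher task gone; nothing useful to do with the entry.
        Err(TrySendError::Closed(_)) => {}
    }
}
```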

Reproduction

Not yet attempted in isolation. The trigger today was incidental (a heavy cargo build on the same disk as hero_proc's sqlite). To reproduce in a controlled way: throttle I/O on the disk hosting ~/hero/var/hero_proc.db (e.g. via the cgroup-v2 io.max controller or a dm-delay device; tc qdisc only shapes network traffic, not block I/O) while running a normal Hero deployment with chatty services, then monitor ps -p <hero_proc_pid> -o rss,vsz.
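Alternatively, the growth pattern itself can be reproduced in-process. A self-contained sketch (assumption: a deliberately stalled consumer stands in for SQLite stuck behind btrfs contention; rates and sizes are illustrative):

```rust
// In-process repro sketch: fast producer, stalled consumer, unbounded channel.
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::time::Duration;
use tokio::sync::mpsc;

struct LogEntry([u8; 256]); // stand-in for one buffered log row

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::unbounded_channel::<LogEntry>();
    let queued = Arc::new(AtomicU64::new(0));

    // Producer: chatty services (heartbeats, pings, RPC tracing).
    let q = queued.clone();
    tokio::spawn(async move {
        loop {
            if tx.send(LogEntry([0; 256])).is_err() {
                break;
            }
            q.fetch_add(1, Ordering::Relaxed);
            tokio::time::sleep(Duration::from_micros(100)).await;
        }
    });

    // Consumer: a writer stalled on disk; drains one entry per second.
    loop {
        tokio::time::sleep(Duration::from_secs(1)).await;
        if rx.recv().await.is_some() {
            queued.fetch_sub(1, Ordering::Relaxed);
        }
        // This number (and the process RSS) grows without bound.
        println!("buffered entries: ~{}", queued.load(Ordering::Relaxed));
    }
}
```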

Severity

Medium-high. Doesn't crash the daemon by itself, but causes total demo unresponsiveness and risks OOM-kill on smaller VMs (the 8 GB OOM that prompted the herodemo rebuild from heronu would have been hit much faster with this leak active under load).

Cross-references

  • Diagnosis from session on herodemo, 2026-04-30, while preparing tier-by-tier service updates: cargo build on the same VM exposed the unbounded-channel growth.
  • Related operational pattern (also worth fixing once we touch the supervisor restart path): #78 (dangling rpc.sock after restart).

Signed-off-by: mik-tf

Author
Owner

Retracting log_batcher as the cause of the herodemo memory observation today. Actual leak tracked to sysmon /proc fd retention — see #81. Keeping this issue open as a separate hardening idea (bounded channel is still good practice) but lower priority.

Signed-off-by: mik-tf

Author
Owner

Confirmed real on herodemo, ~13h after the original retraction.

Earlier today I retracted this issue as the cause of the 26 GB OOM event because the dominant leak turned out to be sysmon /proc fd retention (#81). I did say at the time that the unbounded-channel pattern is still wrong and worth fixing.

After today's full sweep + 5h of supervisor activity (service_complete --update --release cycle), with #81's fix in place, hero_proc reached 8 GB RSS with:

  • fd count: 151 (no /proc retention regression — #81 holding firm)
  • sqlite WAL: 4 MB (sqlite is draining fine)
  • DB main file: 425 KB (cold)

The 8 GB is heap, not files, not fds. Process state at sample: sleeping. Ratio of heap growth (~8 GB) to disk-write growth (~4 MB) is the textbook pattern of an unbounded buffer queue accumulating ahead of a fast-enough but capped disk writer — exactly what an mpsc::unbounded_channel<LogEntry> produces under sustained log volume.

The local fix branch I drafted earlier (mpsc::channel(100_000) + try_send + drop counter) is the right patch. Will push + ship as a proper PR — should have shipped it the first time around regardless of which leak was dominant.

Re-classifying this from "bounded-channel hardening" to "real second leak, ship the fix".

Signed-off-by: mik-tf

Author
Owner

Closed by 4218c94 on development.

This is the safety-net fix — caps in-channel memory and stops the OOM cascade observed on herodemo (7.9 GB after ~5h uptime). Under sustained overload, log entries are now dropped at the producer side and counted; a tracing::warn! surfaces drops once per flush tick when delta > 0.

The architectural fix — measure who's producing the log volume, size SQLite for it, or move logs out of SQLite entirely — is tracked separately in #85. Until that lands, the bounded channel is the safety floor; in a healthy steady state dropped_total should be zero.
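For concreteness, a hedged sketch of that drop telemetry (DROPPED_TOTAL and DropReporter are hypothetical names; the code actually shipped in 4218c94 may differ):

```rust
// Sketch of the drop telemetry described above (hypothetical names;
// the shipped 4218c94 code may differ).
use std::sync::atomic::{AtomicU64, Ordering};
use tracing::warn;

// Incremented by the producer-side try_send path when the channel is full.
static DROPPED_TOTAL: AtomicU64 = AtomicU64::new(0);

struct DropReporter {
    last_seen: u64,
}

impl DropReporter {
    fn new() -> Self {
        Self { last_seen: 0 }
    }

    // Called once per 100 ms flush tick: warns at most once per tick,
    // and only when the drop counter actually advanced (delta > 0).
    fn report(&mut self) {
        let total = DROPPED_TOTAL.load(Ordering::Relaxed);
        let delta = total - self.last_seen;
        if delta > 0 {
            warn!(delta, total, "log entries dropped (channel full under I/O pressure)");
            self.last_seen = total;
        }
    }
}
```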

Same pattern as the sister HERO_CONTEXTS env fix in lhumina_code/hero_osis#42 — this stops today's bleeding without claiming to be the architecture answer.

Signed-off-by: mik-tf
