[bug] log_batcher unbounded channel — hero_proc_server leaks unbounded memory under sqlite write pressure #80
Symptom
Observed on herodemo (TF Grid VM, 32 GB RAM, 16 CPU) on 2026-04-30:
- `hero_proc_server` accumulated 26 GB resident / 31 GB virtual after only 2h 13m of uptime.
- Process state `S` (sleeping), VmData ≈ 30 GB (so ~all heap, not code or shared mappings).
- `cargo build --release -j 16` was running on the same VM, which pegged btrfs I/O for ~50 minutes.
- `kill -9` on the daemon + clean restart dropped memory to 27 MB RSS within seconds. All supervised services kept running unchanged (the supervisor model held: children are not killed when the daemon dies).

Root cause
`crates/hero_proc_server/src/log_batcher.rs:42`: the batcher channel is created here as an unbounded mpsc channel (a minimal sketch of the pattern follows after the list below).

- Producer side: every Hero service that calls `logs.insert` (heartbeats, info messages, supervisor pings, RPC tracing, etc.) pushes a `LogEntry` onto this channel via `batcher.send(entry)`. There is no backpressure.
- Consumer side: the `batcher_task` flushes every 100 ms or when 10 000 entries accumulate, writing in a single SQLite transaction.
- Failure mode: when SQLite writes slow down for any reason (disk I/O contention, a slow consumer query holding a lock, etc.) the channel grows without limit. Today's trigger was cargo dominating btrfs I/O: the writer task fell behind, producers kept appending, and ~26 GB of buffered `LogEntry` accumulated over 2h.

At today's steady-state failure rate: ~26 GB / 2h 13m ≈ 3 MB/s of buffered logs that never reached disk.
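For illustration, a minimal sketch of the shape described above, assuming the batcher is built on tokio's mpsc; `LogEntry`, `spawn_batcher`, and `flush_batch` are stand-in names, not the actual items in `log_batcher.rs`:

```rust
// Sketch only: illustrative shape of the current (buggy) pattern, not the
// actual contents of log_batcher.rs. Assumes tokio's mpsc.
use tokio::sync::mpsc;
use tokio::time::{interval, Duration};

// Stand-in for the real log entry type.
struct LogEntry {
    service: String,
    message: String,
}

fn spawn_batcher() -> mpsc::UnboundedSender<LogEntry> {
    // Unbounded: every producer `send` succeeds immediately, so buffered
    // entries grow without limit whenever the writer falls behind.
    let (tx, mut rx) = mpsc::unbounded_channel::<LogEntry>();

    tokio::spawn(async move {
        let mut tick = interval(Duration::from_millis(100));
        let mut batch: Vec<LogEntry> = Vec::new();
        loop {
            tokio::select! {
                Some(entry) = rx.recv() => {
                    batch.push(entry);
                    // Early flush once 10 000 entries accumulate.
                    if batch.len() >= 10_000 {
                        flush_batch(&mut batch);
                    }
                }
                _ = tick.tick() => {
                    // Periodic flush every 100 ms.
                    if !batch.is_empty() {
                        flush_batch(&mut batch);
                    }
                }
            }
        }
    });

    tx
}

fn flush_batch(batch: &mut Vec<LogEntry>) {
    // Stand-in for the single SQLite transaction; when this stalls on I/O,
    // the channel above keeps absorbing entries with no backpressure.
    batch.clear();
}
```

The key property is that `send` on an unbounded sender always succeeds immediately, so nothing pushes back on producers while `flush_batch` is stalled on SQLite.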
Why this is a real bug, not just "unhappy VM"
Proposed fix (~10 lines)
- Replace the unbounded channel with a bounded one, use `try_send`, and drop log entries when the channel is full (see the sketch after this list).
- Emit a rate-limited `tracing::warn!` message (or bump an `AtomicU64` counter) when drops happen, so we have telemetry if dropping ever occurs. Out of scope for the minimal fix.

Dropping log entries when the system is under heavy I/O pressure is the correct trade-off: those logs would not have been useful, and an OOM-killed supervisor would have been catastrophic.
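For concreteness, a minimal sketch of the bounded-channel shape, assuming the batcher uses tokio's mpsc. The 100_000 capacity is taken from the fix branch described in a later comment; `LogEntry`, `make_channel`, and `send_log` are illustrative names:

```rust
// Sketch only: bounded channel + try_send + drop counter, assuming tokio's
// mpsc. Capacity and names are illustrative.
use std::sync::atomic::{AtomicU64, Ordering};
use tokio::sync::mpsc;

struct LogEntry {
    service: String,
    message: String,
}

// Total entries dropped because the channel was full.
static DROPPED_TOTAL: AtomicU64 = AtomicU64::new(0);

fn make_channel() -> (mpsc::Sender<LogEntry>, mpsc::Receiver<LogEntry>) {
    // Bounded: at most 100_000 entries can ever sit in the channel,
    // capping in-channel memory regardless of how slow SQLite gets.
    mpsc::channel(100_000)
}

fn send_log(tx: &mpsc::Sender<LogEntry>, entry: LogEntry) {
    // try_send never blocks: a full channel means the entry is dropped
    // and counted instead of growing the heap.
    if let Err(mpsc::error::TrySendError::Full(_)) = tx.try_send(entry) {
        DROPPED_TOTAL.fetch_add(1, Ordering::Relaxed);
    }
}
```

The design choice is deliberate: `try_send` returns immediately, so a stalled SQLite writer can never translate into producer-side heap growth; it only increments the drop counter.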
Reproduction
Not yet attempted in isolation. The trigger today was incidental (heavy cargo build on the same disk as hero_proc's sqlite). To reproduce in a controlled way:
- `tc qdisc` rate-limit the disk hosting `~/hero/var/hero_proc.db` while running a normal Hero deployment with chatty services, then monitor `ps -p <hero_proc_pid> -o rss,vsz`.

Severity
Medium-high. Doesn't crash the daemon by itself, but causes total demo unresponsiveness and risks OOM-kill on smaller VMs (the 8 GB OOM that prompted the herodemo rebuild from heronu would have been hit much faster with this leak active under load).
Cross-references
Signed-off-by: mik-tf
Retracting log_batcher as the cause of the herodemo memory observation today. Actual leak tracked to sysmon /proc fd retention — see #81. Keeping this issue open as a separate hardening idea (bounded channel is still good practice) but lower priority.
Signed-off-by: mik-tf
Confirmed real on herodemo, ~13h after the original retraction.
Earlier today I retracted this issue as the cause of the 26 GB OOM event because the dominant leak turned out to be sysmon /proc fd retention (#81). I did say at the time that the unbounded-channel pattern is still wrong and worth fixing.
After today's full sweep + 5h of supervisor activity (`service_complete --update --release` cycle), with #81's fix in place, hero_proc reached 8 GB RSS.

The 8 GB is heap, not files, not fds. Process state at sample: sleeping. The ratio of heap growth (~8 GB) to disk-write growth (~4 MB) is the textbook pattern of an unbounded buffer queue accumulating ahead of a fast-enough but capped disk writer: exactly what an `mpsc::unbounded_channel<LogEntry>` produces under sustained log volume.

The local fix branch I drafted earlier (`mpsc::channel(100_000)` + `try_send` + drop counter) is the right patch. Will push + ship as a proper PR; it should have shipped the first time around regardless of which leak was dominant.

Re-classifying this from "bounded-channel hardening" to "real second leak, ship the fix".
Signed-off-by: mik-tf
Closed by 4218c94 on `development`.

This is the safety-net fix: it caps in-channel memory and stops the OOM cascade observed on herodemo (7.9 GB after ~5h uptime). Under sustained overload, log entries are now dropped at the producer side and counted; a `tracing::warn!` surfaces the drops once per flush tick when the delta > 0 (see the sketch below).

The architectural fix (measure who is producing the log volume, size SQLite for it, or move logs out of SQLite entirely) is tracked separately in #85. Until that lands, the bounded channel is the safety floor; in a healthy steady state `dropped_total` should be zero.

Same pattern as the HERO_CONTEXTS env fix, sister to lhumina_code/hero_osis#42: this stops today's bleeding without claiming to be the architecture answer.
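For reference, a sketch of what the per-flush-tick drop reporting can look like, reusing the `DROPPED_TOTAL` counter from the fix sketch earlier in this thread; `DropReporter` is an illustrative name and not necessarily how 4218c94 structures it:

```rust
// Sketch only: once-per-flush-tick drop reporting, reusing the same
// DROPPED_TOTAL counter as the producer-side sketch. Names are illustrative.
use std::sync::atomic::{AtomicU64, Ordering};

static DROPPED_TOTAL: AtomicU64 = AtomicU64::new(0);

struct DropReporter {
    // Counter value at the previous flush tick.
    last_seen: u64,
}

impl DropReporter {
    fn report_on_flush(&mut self) {
        let total = DROPPED_TOTAL.load(Ordering::Relaxed);
        let delta = total - self.last_seen;
        if delta > 0 {
            // One warning per flush tick, not one per dropped entry.
            tracing::warn!(
                dropped = delta,
                dropped_total = total,
                "log_batcher dropped entries since last flush"
            );
        }
        self.last_seen = total;
    }
}
```

Warning on the delta rather than the running total keeps the noise to at most one line per 100 ms flush tick.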
Signed-off-by: mik-tf