[arch] Log pipeline performance — measure producers, size SQLite, or move logs out (root cause behind #80) #85
Reference
lhumina_code/hero_proc#85
Premise

#80 was the symptom: `log_batcher`'s unbounded mpsc channel grew without limit under SQLite write pressure, and `hero_proc_server` OOM'd at 8 GB on herodemo. The bounded-channel fix that closes #80 caps in-channel memory at ~10 MB and drops entries gracefully under overload. That is durable coping, not a root-cause fix. This issue tracks the architectural follow-up: understand and fix why the channel fills in the first place.
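For reference, the drop-on-overflow behaviour the #80 fix introduces amounts to a bounded queue that counts what it sheds. The server is Rust; this Python sketch is purely illustrative, and the capacity, the `send` helper, and the entry format are hypothetical stand-ins, not the real API:

```python
import queue

CAPACITY = 4  # the real cap is sized to ~10 MB of entries, not a count

channel = queue.Queue(maxsize=CAPACITY)
dropped_total = 0  # the metric the #80 fix exposes

def send(entry):
    """Non-blocking send: on a full channel, drop and count instead of growing memory."""
    global dropped_total
    try:
        channel.put_nowait(entry)  # analogous to a Rust try_send on a bounded channel
    except queue.Full:
        dropped_total += 1  # graceful drop under overload

for i in range(10):
    send(f"log line {i}")

print(channel.qsize(), dropped_total)  # prints: 4 6
```

The point of the sketch: `dropped_total` rising tells us overload happened, but nothing about which producer caused it, which is exactly the gap this issue is about.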
What we don't know yet
Producer sites, i.e. every path that feeds the channel:

- `crates/hero_proc_server/src/rpc/log.rs:168`, the `logs.insert` RPC (callers: sysmon, external services)
- `crates/hero_proc_server/src/supervisor/executor.rs:687`, PTY stdout read line by line for every supervised process
- `crates/hero_proc_server/src/supervisor/executor.rs:710`, the same for stderr
- `crates/hero_proc_server/src/supervisor/executor.rs:882`, the PTY tee path for the WS-streamed scrollback

On the SQLite side: is the `logs` table indexed for the access patterns the UI uses? Is the journal mode WAL? Is `synchronous=NORMAL` set? Would per-partition tables help?

The bounded-channel fix masks all of this. The new `dropped_total` metric tells us that we are overloaded, but not who the producers are or why SQLite can't keep up.

Proposal
Three workstreams that can ship independently:
1. Per-source instrumentation
Annotate each `batcher.send(...)` call with a source tag (`sysmon` / `pty:<service>` / `rpc`). Track per-source send counts and per-source drop counts in atomic counters, and surface them periodically (or via a new `log_batcher.stats` RPC) so we can answer: during overload, who was producing?

Estimate: half a day. Output: actionable data.
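A minimal sketch of what that instrumentation could look like, in illustrative Python (the real implementation would be Rust atomic counters; the source tags match the proposal, but the `stats()` shape is an assumption, not the actual RPC):

```python
import queue
from collections import Counter

channel = queue.Queue(maxsize=2)  # tiny capacity to force drops in the demo
sends = Counter()   # per-source send counts
drops = Counter()   # per-source drop counts

def send(source, entry):
    """Tagged send: attribute both the attempt and any drop to its producer."""
    sends[source] += 1
    try:
        channel.put_nowait((source, entry))
    except queue.Full:
        drops[source] += 1

def stats():
    """Per-source snapshot, the kind of data a log_batcher.stats RPC could return."""
    return {s: {"sent": sends[s], "dropped": drops[s]} for s in sends}

for _ in range(5):
    send("pty:herodemo", "stdout line")  # noisy PTY producer fills the channel
send("sysmon", "cpu sample")             # latecomer gets dropped

print(stats())
```

With per-source numbers like these, "during overload, who was producing?" becomes a lookup rather than a guess.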
2. SQLite write benchmarking
Stand up a microbench against the real schema with realistic `LogEntry` payloads. Measure peak sustained inserts/sec across journal modes (WAL vs DELETE) and batch sizes, both with and without competing read load from the UI. Document the headroom.
Estimate: 1–2 days. Output: numbers we can compare against the measured producer rate.
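As a starting point, a microbench skeleton along these lines could work (illustrative Python using the stdlib `sqlite3` module; the real bench must run against the actual schema on the demo VM, and the table layout, payload size, and iteration counts here are placeholders):

```python
import sqlite3
import time

def bench(journal_mode: str, batch_size: int, total: int = 2000) -> float:
    """Return sustained inserts/sec for one journal-mode/batch-size combination."""
    # Use a real file path in practice: ":memory:" skips fsync cost and largely
    # ignores journal_mode, so absolute numbers from this sketch are optimistic.
    conn = sqlite3.connect(":memory:")
    conn.execute(f"PRAGMA journal_mode={journal_mode}")
    conn.execute("PRAGMA synchronous=NORMAL")
    conn.execute("CREATE TABLE logs (ts INTEGER, source TEXT, line TEXT)")
    payload = "x" * 200  # stand-in for a realistic LogEntry line
    start = time.perf_counter()
    for first in range(0, total, batch_size):
        with conn:  # one transaction per batch; batch size is the key variable
            conn.executemany(
                "INSERT INTO logs VALUES (?, ?, ?)",
                ((ts, "pty:svc", payload) for ts in range(first, first + batch_size)),
            )
    rate = total / (time.perf_counter() - start)
    conn.close()
    return rate

for mode in ("WAL", "DELETE"):
    for batch in (1, 100):
        print(f"{mode:6} batch={batch:3}: {bench(mode, batch):10,.0f} inserts/s")
```

Adding a competing reader thread (simulating the UI's queries) to the same database file would complete the matrix the proposal asks for.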
3. Reduce or move logs
Decide based on (1) and (2):
Each option is a separate decision, and none can be made without the data from (1) and (2).
Acceptance
- Per-source send and drop counters in `log_batcher`, exposed via an RPC or a periodic log line.
- A benchmark write-up (`docs/dev/LOG_BATCHER_BENCH.md`) with numbers from the demo VM hardware profile.
- `dropped_total` from #80's bounded-channel fix is zero under steady-state load. If it isn't, the structural fix didn't work.

Cross-references
Signed-off-by: mik-tf
Will be fixed with lhumina_code/hero_lib#133.