[arch] Log pipeline performance — measure producers, size SQLite, or move logs out (root cause behind #80) #85

Closed
opened 2026-04-30 22:33:50 +00:00 by mik-tf · 1 comment

Premise

#80 was the symptom: under SQLite write pressure, log_batcher's unbounded mpsc channel grew until hero_proc_server hit 8 GB and OOM'd on herodemo. The bounded-channel fix that closes #80 caps in-channel memory at ~10 MB and drops entries gracefully under overload — durable coping, not root cause.
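
For context, the #80 coping layer is roughly this shape. A minimal sketch assuming a tokio bounded mpsc channel; the capacity, names, and dropped_total wiring are illustrative, not the actual hero_proc code:

```rust
// Bounded channel plus non-blocking try_send: a stalled SQLite writer now
// costs dropped lines instead of unbounded RAM.
use std::sync::atomic::{AtomicU64, Ordering};
use tokio::sync::mpsc;

static DROPPED_TOTAL: AtomicU64 = AtomicU64::new(0);

fn send_log(tx: &mpsc::Sender<String>, entry: String) {
    // Capacity is fixed at channel creation (e.g. mpsc::channel(10_000)),
    // which caps in-channel memory no matter how far the writer falls behind.
    if tx.try_send(entry).is_err() {
        DROPPED_TOTAL.fetch_add(1, Ordering::Relaxed);
    }
}
```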

This issue tracks the architectural follow-up: understand and fix why the channel fills in the first place.

What we don't know yet

  1. Producer rate — four call sites push into the batcher, never measured separately:
    • crates/hero_proc_server/src/rpc/log.rs:168 — logs.insert RPC (callers: sysmon, external services)
    • crates/hero_proc_server/src/supervisor/executor.rs:687 — PTY stdout line-by-line for every supervised process
    • crates/hero_proc_server/src/supervisor/executor.rs:710 — same for stderr
    • crates/hero_proc_server/src/supervisor/executor.rs:882 — PTY tee path for the WS-streamed scrollback
  2. SQLite write capacity — never benchmarked against the schema and storage layout we actually run on the demo VM. We don't know the peak sustained insert rate or the contention profile against other writes.
  3. Schema fitness for log volume — is the logs table indexed for the access patterns the UI uses? Is journal mode WAL? Synchronous=NORMAL? Per-partition tables? (A quick PRAGMA check is sketched after this list.)
  4. Whether SQLite is the right place at all — observability data (sysmon samples, PTY tail) might fit a circular file buffer or per-service rotating logs better than a transactional store.
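
Item 3 is cheap to answer empirically. A hedged sketch using rusqlite, where the database path and the logs table name are assumptions about the real layout:

```rust
// Inspect journal mode, synchronous level, and the indexes on the logs
// table. Path and table name are placeholders, not the demo VM layout.
use rusqlite::Connection;

fn main() -> rusqlite::Result<()> {
    let conn = Connection::open("hero_proc.db")?; // placeholder path
    let journal: String = conn.query_row("PRAGMA journal_mode", [], |r| r.get(0))?;
    let sync: i64 = conn.query_row("PRAGMA synchronous", [], |r| r.get(0))?;
    // WAL + synchronous=1 (NORMAL) is the usual write-heavy baseline.
    println!("journal_mode={journal} synchronous={sync}");

    let mut stmt = conn.prepare("PRAGMA index_list('logs')")?;
    let indexes: Vec<String> = stmt
        .query_map([], |r| r.get::<_, String>(1))? // column 1 = index name
        .collect::<Result<_, _>>()?;
    println!("indexes on logs: {indexes:?}");
    Ok(())
}
```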

The bounded-channel fix masks all of this. The new dropped_total metric tells us that we're overloaded but not who the producers are or why SQLite can't keep up.

Proposal

Three workstreams, each of which can ship independently:

1. Per-source instrumentation

Annotate each batcher.send(...) call with a source tag (sysmon / pty:<service> / rpc). Track per-source send counts and per-source drop counts in atomic counters. Surface them periodically (or via a new log_batcher.stats RPC) so we can answer: "during overload, who was producing?"
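
A minimal sketch of those counters. LogSource, record, and snapshot are illustrative names; the pty:<service> tag is flattened to per-call-site counters here (a map keyed by tag would give per-service granularity):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

#[derive(Clone, Copy)]
enum LogSource {
    Rpc = 0,       // rpc/log.rs logs.insert (sysmon, external services)
    PtyStdout = 1, // executor.rs stdout path
    PtyStderr = 2, // executor.rs stderr path
    PtyTee = 3,    // executor.rs WS-scrollback tee path
}

const SOURCES: usize = 4;
static SENT: [AtomicU64; SOURCES] =
    [AtomicU64::new(0), AtomicU64::new(0), AtomicU64::new(0), AtomicU64::new(0)];
static DROPPED: [AtomicU64; SOURCES] =
    [AtomicU64::new(0), AtomicU64::new(0), AtomicU64::new(0), AtomicU64::new(0)];

// Wrap every batcher.send(...) call site with the matching tag.
fn record(source: LogSource, dropped: bool) {
    SENT[source as usize].fetch_add(1, Ordering::Relaxed);
    if dropped {
        DROPPED[source as usize].fetch_add(1, Ordering::Relaxed);
    }
}

// Snapshot of (sent, dropped) pairs for a periodic log line or a
// log_batcher.stats RPC.
fn snapshot() -> [(u64, u64); SOURCES] {
    std::array::from_fn(|i| {
        (SENT[i].load(Ordering::Relaxed), DROPPED[i].load(Ordering::Relaxed))
    })
}
```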

Estimate: half a day. Output: actionable data.

2. SQLite write benchmarking

Stand up a microbench against the real schema with realistic LogEntry payloads. Measure peak sustained inserts/sec at different journal modes (WAL vs DELETE), batch sizes, with and without competing read load from the UI. Document the headroom.
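
A skeleton for that bench, under stated assumptions: rusqlite, a simplified logs table, and synthetic ~200-byte lines. The real run should use the actual schema and a LogEntry-shaped payload:

```rust
use rusqlite::{params, Connection};
use std::time::Instant;

// Returns sustained inserts/sec for one (journal_mode, batch_size) cell.
fn bench(journal_mode: &str, batch_size: usize) -> rusqlite::Result<f64> {
    let conn = Connection::open("bench_logs.db")?; // throwaway file
    conn.pragma_update(None, "journal_mode", journal_mode)?; // "WAL" or "DELETE"
    conn.pragma_update(None, "synchronous", "NORMAL")?;
    conn.execute(
        "CREATE TABLE IF NOT EXISTS logs (ts INTEGER, source TEXT, line TEXT)",
        [],
    )?;

    let line = "x".repeat(200); // stand-in for a realistic payload
    let total = 100_000;
    let start = Instant::now();
    let mut stmt = conn.prepare("INSERT INTO logs VALUES (?1, 'pty:demo', ?2)")?;
    for batch in 0..(total / batch_size) {
        conn.execute_batch("BEGIN")?;
        for i in 0..batch_size {
            stmt.execute(params![(batch * batch_size + i) as i64, line])?;
        }
        conn.execute_batch("COMMIT")?;
    }
    Ok(total as f64 / start.elapsed().as_secs_f64())
}
```

Sweep journal modes and batch sizes, then rerun with the UI's read queries going in parallel to get the contention profile.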

Estimate: 1–2 days. Output: numbers we can compare against measured producer rate.

3. Reduce or move logs

Decide based on (1) and (2):

  • Reduce volume: rate-limit the chatty PTY tail per service (one possible shape is sketched below); cap sysmon at a sample rate the demo actually uses; coalesce duplicate lines.
  • Speed up SQLite: WAL, indexes, partition-by-time-window, periodic VACUUM, or move to a dedicated log DB file separated from the action/job state DB.
  • Move logs out of SQLite: per-service rotating files plus an indexer for the UI; or a circular ring buffer in memory with periodic flushes; or LMDB if we still want transactional semantics but faster writes.

Each option is a separate decision; we can't make it without (1) and (2).
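
For the rate-limit bullet specifically, one possible shape is a per-service token bucket. All names below are illustrative, not existing hero_proc APIs:

```rust
use std::time::Instant;

// Per-service token bucket for the PTY tail: sustained `per_sec` lines/sec
// with a `burst` allowance, counting suppressed lines so nothing vanishes
// silently.
struct PtyRateLimit {
    tokens: f64,
    burst: f64,
    per_sec: f64,
    last: Instant,
    suppressed: u64,
}

impl PtyRateLimit {
    fn new(per_sec: f64, burst: f64) -> Self {
        Self { tokens: burst, burst, per_sec, last: Instant::now(), suppressed: 0 }
    }

    // Some(n) = let the line through, n lines were suppressed before it
    // (emit "... n lines dropped" alongside); None = drop this line too.
    fn allow(&mut self) -> Option<u64> {
        let now = Instant::now();
        let refill = now.duration_since(self.last).as_secs_f64() * self.per_sec;
        self.tokens = (self.tokens + refill).min(self.burst);
        self.last = now;
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            Some(std::mem::take(&mut self.suppressed))
        } else {
            self.suppressed += 1;
            None
        }
    }
}
```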

Acceptance

  • Per-source send/drop counters wired into log_batcher, exposed via an RPC or periodic log line.
  • SQLite log-write benchmark documented (docs/dev/LOG_BATCHER_BENCH.md) with numbers from the demo VM hardware profile.
  • Decision committed on the volume/SQLite/elsewhere axis, with a follow-up implementation issue if the chosen path is non-trivial.
  • On the running demo: dropped_total from #80's bounded-channel fix should be zero under steady-state load. If it isn't, the structural fix didn't work.

Cross-references

  • Closes the architectural angle of #80 (the bounded-channel fix is the safety net; this is the upstream answer).
  • Related to #84 (service readiness contract) — both are about replacing safety-net coping layers with proper observable signals.
  • Pattern reference: same shape as lhumina_code/hero_osis#43 (contexts as data) where #42's env-var fix is the deploy-side coping and #43 is the architectural answer.

Signed-off-by: mik-tf

mik-tf self-assigned this 2026-04-30 22:33:50 +00:00
Owner

will be fixed with lhumina_code/hero_lib#133
