[bug] sysmon /proc fd leak — hero_proc_server retains thousands of /proc/<pid>/stat fds, drives multi-GB memory growth on long-running daemons #81

Closed
opened 2026-04-30 14:02:25 +00:00 by mik-tf · 1 comment
Owner

Symptom

Observed on herodemo (TF Grid VM, 32 GB / 16 CPU) on 2026-04-30:

  • After a fresh restart, hero_proc_server ballooned to ~5 GB RSS in ~20 minutes of normal operation (much faster under operator activity).
  • Direct evidence: ls /proc/<pid>/fd | wc -l showed 1917 open file descriptors on the daemon, almost all pointing to /proc/<other_pid>/stat and /proc/<pid>/task/<tid>/stat.
  • Earlier the same day, the prior daemon reached 26 GB RSS / 31 GB virtual in 2h 13m of uptime under the same conditions; clean restart freed everything.

Root cause — documented behaviour of sysinfo we did not opt out of

crates/hero_proc_server/src/sysmon.rs holds a singleton sysinfo::System and calls refresh_processes(ProcessesToUpdate::Some(&pids), true) on it every 5 seconds (background task in main.rs:140). The default ProcessRefreshKind of refresh_processes includes .with_tasks().

From the sysinfo 0.37.2 docs (src/common/system.rs:293-298):

⚠️ On Linux, sysinfo keeps the stat files open by default. You can change this behaviour by using set_open_files_limit.

⚠️ On Linux, if you don't need the tasks of each process, you can use refresh_processes_specifics with ProcessRefreshKind::everything().without_tasks(). Refreshing all processes and their tasks can be quite expensive.

With ~50 supervised jobs each having multiple async-runtime threads, every 5-second refresh opens fds for the whole pid + tid set and retains them. Without set_open_files_limit, there is no cap. The fd count and the kernel-side / userspace memory backing those fds grow without bound. Operator activity (every service list, service restart, every job reattach) accelerates the rate by triggering more refreshes against changing pid sets.
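The retained-fd pattern itself is easy to demonstrate with pure std (this is an illustration of the mechanism, not sysinfo's actual internals): each "refresh" opens a /proc stat file and keeps the handle alive, so the fd table grows by one entry per cycle with no upper bound.

```rust
// Minimal pure-std illustration (Linux) of the retained-fd pattern:
// each simulated refresh opens /proc/self/stat and caches the handle,
// analogous to sysinfo's default per-PID stat-file cache.
use std::fs::{read_dir, File};

/// Count this process's open file descriptors via /proc/self/fd.
fn open_fd_count() -> usize {
    read_dir("/proc/self/fd").unwrap().count()
}

fn main() {
    let before = open_fd_count();

    // Simulate 100 refresh cycles that each retain a stat handle.
    let mut cache: Vec<File> = Vec::new();
    for _ in 0..100 {
        cache.push(File::open("/proc/self/stat").unwrap());
    }

    let after = open_fd_count();
    // One fd per retained handle; without a cap, a long-running daemon
    // accumulates these indefinitely across pid/tid churn.
    assert!(after >= before + 100);
    println!("fds before: {before}, after: {after}");

    drop(cache); // releasing the handles returns the fds to the table
    assert!(open_fd_count() < after);
}
```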

When the leaky pattern was introduced

  • 7ecb271 feat: add pid, cpu_percent, memory_bytes fields to job model and sysmon docs (2026-03-20)
  • 6aa3e52 feat(jobs): live cpu/mem/uptime stats in job.list via background sysmon cache (2026-03-20)

The leak has been latent in the codebase for ~6 weeks but only becomes visible on long-running daemons with many supervised jobs and steady operator activity.

Why this is the actual root cause (and not log_batcher)

In parallel I had hypothesised the unbounded-channel growth in log_batcher.rs (issue #80) as the cause. Direct measurement disproved that:

  • hero_proc.db-wal was growing (~4 MB after 21 minutes) while RSS grew ~5 GB in the same window — a roughly 1200:1 ratio between memory growth and disk-write rate. If the leaked memory were log entries buffered ahead of the WAL, memory growth would track log volume; the WAL shows log volume was tiny.
  • Heap detail (/proc/<pid>/smaps_rollup): 5.7 GB Pss, almost all anonymous private dirty pages. fd count 1917, almost all /proc/.../stat.

The log_batcher channel IS still an unbounded-buffer anti-pattern worth hardening separately (see issue #80), but it is not the leak that hit herodemo today.

Proposed fix (~5 lines)

In crates/hero_proc_server/src/sysmon.rs:

  1. Call sysinfo::set_open_files_limit(0) once at module / process startup to disable fd caching entirely.
  2. Replace refresh_processes(ProcessesToUpdate::Some(&pids), true) with refresh_processes_specifics(...) passing ProcessRefreshKind::nothing().with_memory().with_cpu() (omit .with_tasks() — we expose only PID-level CPU/memory in job.list, never per-thread).
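Sketched against the sysinfo 0.37 API (set_open_files_limit, refresh_processes_specifics, ProcessRefreshKind); the pid plumbing below is illustrative, not the actual sysmon.rs contents:

```rust
// Sketch of the proposed sysmon change. Assumes the sysinfo crate,
// version 0.37; pid selection here is a stand-in for the real
// supervised-job pid set.
use sysinfo::{Pid, ProcessRefreshKind, ProcessesToUpdate, System};

fn main() {
    // 1. Once at startup: disable the per-process stat-file fd cache,
    //    so no /proc/<pid>/stat handles are retained between refreshes.
    sysinfo::set_open_files_limit(0);

    let mut sys = System::new();
    let pids: Vec<Pid> = vec![Pid::from_u32(std::process::id())];

    // 2. Refresh only PID-level CPU + memory; no .with_tasks(), so no
    //    per-thread /proc/<pid>/task/<tid>/stat files are touched at all.
    sys.refresh_processes_specifics(
        ProcessesToUpdate::Some(&pids),
        true, // drop dead processes from the internal map
        ProcessRefreshKind::nothing().with_memory().with_cpu(),
    );

    if let Some(p) = sys.process(pids[0]) {
        println!("rss bytes: {}", p.memory());
    }
}
```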

Trade-off: per-refresh CPU goes up slightly because each call re-opens stat files. With a 5-second cadence and ~50 PIDs that's negligible. Correctness over premature optimisation.

Optional follow-up: replace the singleton with a fresh System::new() per refresh, so dead PID entries also can't accumulate in the internal HashMap. Probably unnecessary once set_open_files_limit(0) + no tasks is in.

Plan to verify

  1. Apply the fix on a feature branch.
  2. Build + deploy on herodemo (gentle cargo).
  3. Watch ps -o rss and ls /proc/<pid>/fd | wc -l for several hours under both idle and active operator load.
  4. Pass criterion: fd count steady (< ~50 expected — sockets, db files, log files, etc.), RSS growth bounded over hours.
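The watch loop in steps 3-4 can be scripted; the default PID below is the current shell, a stand-in for the daemon's PID:

```shell
#!/bin/sh
# Sample fd count and RSS (kB) for a PID; pass the daemon's PID as $1
# (defaults to this shell, for illustration only).
pid="${1:-$$}"

sample() {
    fds=$(ls "/proc/$pid/fd" | wc -l)
    rss_kb=$(ps -o rss= -p "$pid")
    echo "$(date -u +%FT%TZ) fds=$fds rss_kb=$rss_kb"
}

sample
# For a multi-hour soak, loop instead:
# while sleep 60; do sample; done
```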

Cross-references

  • Issue #80 (the log_batcher hypothesis I filed before measuring carefully — retracted as the cause of today's symptom but kept as a separate hardening idea).

Signed-off-by: mik-tf


Fixed in 9ab99ad. Verified on herodemo:

  • before: 1917 open fds, thousands of /proc/<pid>/stat retained, 16 GB RSS at 39 min, 26 GB at 2h13m
  • after: 13 open fds, zero /proc/<pid>/stat fds, 8 MB RSS holding flat at 4+ min through ~50 sysmon refresh cycles + 10 stress-test RPC calls

Leaving the issue open for a few hours of additional observation under operator load before the final close.

Signed-off-by: mik-tf
