[bug] sysmon /proc fd leak — hero_proc_server retains thousands of /proc/<pid>/stat fds, drives multi-GB memory growth on long-running daemons #81

Closed
opened 2026-04-30 14:02:25 +00:00 by mik-tf · 1 comment
Owner

Symptom

Observed on herodemo (TF Grid VM, 32 GB / 16 CPU) on 2026-04-30:

  • After a fresh restart, hero_proc_server ballooned to ~5 GB RSS in ~20 minutes of normal operation (much faster under operator activity).
  • Direct evidence: ls /proc/<pid>/fd | wc -l showed 1917 open file descriptors on the daemon, almost all pointing to /proc/<other_pid>/stat and /proc/<pid>/task/<tid>/stat.
  • Earlier the same day, the prior daemon reached 26 GB RSS / 31 GB virtual in 2h 13m of uptime under the same conditions; clean restart freed everything.

Root cause — documented behaviour of sysinfo we did not opt out of

crates/hero_proc_server/src/sysmon.rs holds a singleton sysinfo::System and calls refresh_processes(ProcessesToUpdate::Some(&pids), true) on it every 5 seconds (background task in main.rs:140). The default ProcessRefreshKind of refresh_processes includes .with_tasks().

From the sysinfo 0.37.2 docs (src/common/system.rs:293-298):

⚠️ On Linux, sysinfo keeps the stat files open by default. You can change this behaviour by using set_open_files_limit.

⚠️ On Linux, if you don't need the tasks of each process, you can use refresh_processes_specifics with ProcessRefreshKind::everything().without_tasks(). Refreshing all processes and their tasks can be quite expensive.

With ~50 supervised jobs each having multiple async-runtime threads, every 5-second refresh opens fds for the whole pid + tid set and retains them. Without set_open_files_limit, there is no cap. The fd count and the kernel-side / userspace memory backing those fds grow without bound. Operator activity (every service list, service restart, every job reattach) accelerates the rate by triggering more refreshes against changing pid sets.
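The retained-fd pattern itself is easy to demonstrate with pure std (this is an illustration of the mechanism, not sysinfo's actual internals): each "refresh" opens a /proc stat file and keeps the handle alive, so the fd table grows by one entry per cycle with no upper bound.

```rust
// Minimal pure-std illustration (Linux) of the retained-fd pattern:
// each simulated refresh opens /proc/self/stat and caches the handle,
// analogous to sysinfo's default per-PID stat-file cache.
use std::fs::{read_dir, File};

/// Count this process's open file descriptors via /proc/self/fd.
fn open_fd_count() -> usize {
    read_dir("/proc/self/fd").unwrap().count()
}

fn main() {
    let before = open_fd_count();

    // Simulate 100 refresh cycles that each retain a stat handle.
    let mut cache: Vec<File> = Vec::new();
    for _ in 0..100 {
        cache.push(File::open("/proc/self/stat").unwrap());
    }

    let after = open_fd_count();
    // One fd per retained handle; without a cap, a long-running daemon
    // accumulates these indefinitely across pid/tid churn.
    assert!(after >= before + 100);
    println!("fds before: {before}, after: {after}");

    drop(cache); // releasing the handles returns the fds to the table
    assert!(open_fd_count() < after);
}
```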

When the leaky pattern was introduced

  • 7ecb271 feat: add pid, cpu_percent, memory_bytes fields to job model and sysmon docs (2026-03-20)
  • 6aa3e52 feat(jobs): live cpu/mem/uptime stats in job.list via background sysmon cache (2026-03-20)

The leak has been latent in the codebase for ~6 weeks but only becomes visible on long-running daemons with many supervised jobs and steady operator activity.

Why this is the actual root cause (and not log_batcher)

In parallel I had hypothesised the unbounded-channel growth in log_batcher.rs (issue #80) as the cause. Direct measurement disproved that:

  • hero_proc.db-wal was growing (~4 MB after 21 minutes) while RSS grew ~5 GB in the same window — a roughly 1200:1 ratio between memory growth and disk-write rate. If the leaked memory were log entries buffered ahead of the WAL, memory growth would track log volume; the WAL shows log volume was tiny.
  • Heap detail (/proc/<pid>/smaps_rollup): 5.7 GB Pss, almost all anonymous private dirty pages. fd count 1917, almost all /proc/.../stat.

The log_batcher channel IS still an unbounded-buffer anti-pattern worth hardening separately (see issue #80), but it is not the leak that hit herodemo today.

Proposed fix (~5 lines)

In crates/hero_proc_server/src/sysmon.rs:

  1. Call sysinfo::set_open_files_limit(0) once at module / process startup to disable fd caching entirely.
  2. Replace refresh_processes(ProcessesToUpdate::Some(&pids), true) with refresh_processes_specifics(...) passing ProcessRefreshKind::nothing().with_memory().with_cpu() (omit .with_tasks() — we expose only PID-level CPU/memory in job.list, never per-thread).
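Sketched against the sysinfo 0.37 API (set_open_files_limit, refresh_processes_specifics, ProcessRefreshKind); the pid plumbing below is illustrative, not the actual sysmon.rs contents:

```rust
// Sketch of the proposed sysmon change. Assumes the sysinfo crate,
// version 0.37; pid selection here is a stand-in for the real
// supervised-job pid set.
use sysinfo::{Pid, ProcessRefreshKind, ProcessesToUpdate, System};

fn main() {
    // 1. Once at startup: disable the per-process stat-file fd cache,
    //    so no /proc/<pid>/stat handles are retained between refreshes.
    sysinfo::set_open_files_limit(0);

    let mut sys = System::new();
    let pids: Vec<Pid> = vec![Pid::from_u32(std::process::id())];

    // 2. Refresh only PID-level CPU + memory; no .with_tasks(), so no
    //    per-thread /proc/<pid>/task/<tid>/stat files are touched at all.
    sys.refresh_processes_specifics(
        ProcessesToUpdate::Some(&pids),
        true, // drop dead processes from the internal map
        ProcessRefreshKind::nothing().with_memory().with_cpu(),
    );

    if let Some(p) = sys.process(pids[0]) {
        println!("rss bytes: {}", p.memory());
    }
}
```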

Trade-off: per-refresh CPU goes up slightly because each call re-opens stat files. With a 5-second cadence and ~50 PIDs that's negligible. Correctness over premature optimisation.

Optional follow-up: replace the singleton with a fresh System::new() per refresh, so dead PID entries also can't accumulate in the internal HashMap. Probably unnecessary once set_open_files_limit(0) + no tasks is in.

Plan to verify

  1. Apply the fix on a feature branch.
  2. Build + deploy on herodemo (gentle cargo).
  3. Watch ps -o rss and ls /proc/<pid>/fd | wc -l for several hours under both idle and active operator load.
  4. Pass criterion: fd count steady (< ~50 expected — sockets, db files, log files, etc.), RSS growth bounded over hours.
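The watch loop in steps 3-4 can be scripted; the default PID below is the current shell, a stand-in for the daemon's PID:

```shell
#!/bin/sh
# Sample fd count and RSS (kB) for a PID; pass the daemon's PID as $1
# (defaults to this shell, for illustration only).
pid="${1:-$$}"

sample() {
    fds=$(ls "/proc/$pid/fd" | wc -l)
    rss_kb=$(ps -o rss= -p "$pid")
    echo "$(date -u +%FT%TZ) fds=$fds rss_kb=$rss_kb"
}

sample
# For a multi-hour soak, loop instead:
# while sleep 60; do sample; done
```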

Cross-references

  • Issue #80 (the log_batcher hypothesis I filed before measuring carefully — retracted as the cause of today's symptom but kept as a separate hardening idea).

Signed-off-by: mik-tf


Fixed in 9ab99ad. Verified on herodemo:

  • before: 1917 open fds, thousands of /proc/<pid>/stat retained, 16 GB RSS at 39 min, 26 GB at 2h13m
  • after: 13 open fds, zero /proc/<pid>/stat fds, 8 MB RSS holding flat at 4+ min through ~50 sysmon refresh cycles + 10 stress-test RPC calls

Leaving the issue open for a few hours of additional observation under operator load before the final close.

Signed-off-by: mik-tf
