[bug] sysmon /proc fd leak — hero_proc_server retains thousands of /proc/<pid>/stat fds, drives multi-GB memory growth on long-running daemons #81
Reference: `lhumina_code/hero_proc#81`
## Symptom

Observed on herodemo (TF Grid VM, 32 GB / 16 CPU) on 2026-04-30:

- `hero_proc_server` ballooned to ~5 GB RSS in ~20 minutes of normal operation (much faster under operator activity).
- `ls /proc/<pid>/fd | wc -l` showed 1917 open file descriptors on the daemon, almost all pointing to `/proc/<other_pid>/stat` and `/proc/<pid>/task/<tid>/stat`.

## Root cause — documented behaviour of `sysinfo` we did not opt out of

`crates/hero_proc_server/src/sysmon.rs` holds a singleton `sysinfo::System` and calls `refresh_processes(ProcessesToUpdate::Some(&pids), true)` on it every 5 seconds (background task in `main.rs:140`). The default `ProcessRefreshKind` used by `refresh_processes` includes `.with_tasks()`, and `sysinfo` caches the `/proc` stat file descriptors it opens. This behaviour is documented in the `sysinfo` 0.37.2 sources (`src/common/system.rs:293-298`).

With ~50 supervised jobs, each with multiple async-runtime threads, every 5-second refresh opens fds for the whole pid + tid set and retains them. Without `set_open_files_limit`, there is no cap: the fd count, and the kernel-side and userspace memory backing those fds, grow without bound. Operator activity (every `service list`, every `service restart`, every job reattach) accelerates the rate by triggering more refreshes against changing pid sets.

## When the leaky pattern was introduced

- `7ecb271` feat: add pid, cpu_percent, memory_bytes fields to job model and sysmon docs (2026-03-20)
- `6aa3e52` feat(jobs): live cpu/mem/uptime stats in job.list via background sysmon cache (2026-03-20)

The leak has been latent in the codebase for ~6 weeks but only becomes visible on long-running daemons with many supervised jobs and steady operator activity.
## Why this is the actual root cause (and not log_batcher)
In parallel I had hypothesised the unbounded-channel growth in `log_batcher.rs` (issue #80) as the cause. Direct measurement disproved that:

- `hero_proc.db-wal` was growing (~4 MB after 21 minutes) while RSS grew ~5 GB in the same window — a 1200:1 ratio between memory growth and disk-write rate. If the leak were buffered log entries, the WAL would not have been keeping up with the log stream. It was.
- Memory profile (`/proc/<pid>/smaps_rollup`): 5.7 GB Pss, almost all anonymous private dirty pages. fd count 1917, almost all `/proc/.../stat`.

The log_batcher channel IS still an unbounded-buffer anti-pattern worth hardening separately (see issue #80), but it is not the leak that hit herodemo today.
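The fd evidence above can be gathered programmatically as well. A std-only sketch (no `sysinfo` dependency; `fd_targets` is a helper name invented here, not from the codebase) that does the equivalent of `ls -l /proc/<pid>/fd` and counts how many descriptors point back into `/proc`:

```rust
use std::fs;

/// List (fd, target) pairs for a pid: the programmatic equivalent of the
/// `ls /proc/<pid>/fd` check used in the diagnosis above.
fn fd_targets(pid: u32) -> std::io::Result<Vec<(String, String)>> {
    let mut out = Vec::new();
    for entry in fs::read_dir(format!("/proc/{pid}/fd"))? {
        let entry = entry?;
        // Each entry in /proc/<pid>/fd is a symlink to the fd's target.
        let target = fs::read_link(entry.path())
            .map(|p| p.display().to_string())
            .unwrap_or_else(|_| "?".into());
        out.push((entry.file_name().to_string_lossy().into_owned(), target));
    }
    Ok(out)
}

fn main() -> std::io::Result<()> {
    let fds = fd_targets(std::process::id())?;
    let into_proc = fds.iter().filter(|(_, t)| t.starts_with("/proc/")).count();
    println!("open fds: {}, pointing into /proc: {}", fds.len(), into_proc);
    Ok(())
}
```

Running this against the daemon's pid attributes the leak in one shot: a healthy process has a handful of fds, the leaky one had 1917, almost all resolving to `/proc/.../stat`.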
## Proposed fix (~5 lines)
In `crates/hero_proc_server/src/sysmon.rs`:

- Call `sysinfo::set_open_files_limit(0)` once at module / process startup to disable fd caching entirely.
- Replace `refresh_processes(ProcessesToUpdate::Some(&pids), true)` with `refresh_processes_specifics(...)`, passing `ProcessRefreshKind::nothing().with_memory().with_cpu()` (omit `.with_tasks()` — we expose only PID-level CPU/memory in `job.list`, never per-thread).

Trade-off: per-refresh CPU goes up slightly because each call re-opens the stat files. With a 5-second cadence and ~50 PIDs that is negligible. Correctness over premature optimisation.
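A sketch of the two changes together, assuming `sysinfo` 0.37's `refresh_processes_specifics` signature and call-site names taken from this issue (not verified against the actual `sysmon.rs`):

```rust
// Sketch only: mirrors the refresh described in this issue, not the real module.
use sysinfo::{Pid, ProcessRefreshKind, ProcessesToUpdate, System};

fn init_sysmon() -> System {
    // Fix 1: disable sysinfo's per-pid /proc fd cache entirely.
    // Must run once, before the first refresh.
    sysinfo::set_open_files_limit(0);
    System::new()
}

fn refresh(sys: &mut System, pids: &[Pid]) {
    // Before (leaky): the default ProcessRefreshKind includes .with_tasks(),
    // so every refresh also walks /proc/<pid>/task/<tid>/stat:
    //     sys.refresh_processes(ProcessesToUpdate::Some(pids), true);
    //
    // Fix 2: refresh only the PID-level CPU + memory exposed in job.list.
    sys.refresh_processes_specifics(
        ProcessesToUpdate::Some(pids),
        true, // drop entries for dead pids
        ProcessRefreshKind::nothing().with_memory().with_cpu(),
    );
}
```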
Optional follow-up: replace the singleton with a fresh `System::new()` per refresh, so dead-PID entries also can't accumulate in the internal HashMap. Probably unnecessary once `set_open_files_limit(0)` + the no-tasks refresh is in.

## Plan to verify

Watch `ps -o rss` and `ls /proc/<pid>/fd | wc -l` for several hours under both idle and active operator load.

## Cross-references

- Issue #80: log_batcher unbounded channel (related hardening, not this leak).
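The soak-test watch under "Plan to verify" can also be automated with a small std-only sampler (the `sample` helper is a name made up for this sketch) that records the same two metrics on a fixed cadence:

```rust
use std::{fs, thread, time::Duration};

/// One sample of the two soak-test metrics: RSS in kB (from
/// /proc/<pid>/status) and open-fd count (from /proc/<pid>/fd).
fn sample(pid: u32) -> (u64, usize) {
    let status = fs::read_to_string(format!("/proc/{pid}/status")).unwrap_or_default();
    let rss_kb = status
        .lines()
        .find_map(|l| l.strip_prefix("VmRSS:"))
        .and_then(|v| v.split_whitespace().next())
        .and_then(|n| n.parse().ok())
        .unwrap_or(0);
    let fds = fs::read_dir(format!("/proc/{pid}/fd"))
        .map(|d| d.count())
        .unwrap_or(0);
    (rss_kb, fds)
}

fn main() {
    // In practice, point this at the hero_proc_server pid instead of our own.
    let pid = std::process::id();
    for _ in 0..3 {
        let (rss_kb, fds) = sample(pid);
        println!("rss_kb={rss_kb} fds={fds}");
        thread::sleep(Duration::from_secs(1));
    }
}
```

With the fix in place, both numbers should stay flat over hours; before it, the fd count climbs by roughly the supervised pid + tid set on every 5-second refresh.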
Signed-off-by: mik-tf
Fixed in `9ab99ad`, verified on herodemo. Leaving the issue open for a few hours of additional observation under operator load before final close.
Signed-off-by: mik-tf