[bug][P1] hero_collab accept loop has no backoff on EMFILE — runaway log spam on FD exhaustion #42
Summary
`hero_collab_server` ran out of file descriptors (EMFILE, errno 24), then entered a tight accept loop with no backoff, logging the same error ~200 lines/second until either the process died or the disk filled. On herodemo this generated 58 GB of logs in 24 hours on 2026-04-30, which took down the live demo by filling `/data`. The detailed log-store analysis is in the parallel hero_proc incident issue.

The runaway pattern is two layered bugs:

1. FD leak — something in `hero_collab_server` leaks file descriptors until the process hits its FD limit and `accept()` starts returning `EMFILE`.
2. No backoff on `EMFILE` — when `accept()` fails, the loop logs the error and immediately retries, spinning at 100% CPU and emitting hundreds of log lines per second.

Bug 1 alone causes connection failures. Bug 2 turns it into a disk-eating runaway.
Source data
Sample log lines captured before remediation are dominated by a single message:

`hero_collab.hero_collab_server Socket accept error: Too many open files (os error 24)`

Note the timestamp deltas between consecutive lines: 4 µs, 16 ms, 6 µs, 4 µs — the loop is firing as fast as the kernel can return `EMFILE` and the SQLite log batcher can persist, with no `sleep()` or yield between iterations.

Volume metrics on day 121 (today, before `hero_collab_server` died at 00:28 UTC): ~200 lines/second sustained, 58 GB in 24 hours; the detailed breakdown is in the parallel hero_proc incident issue.

Proposed fixes
Immediate (low-risk)
In the accept loop, on `EMFILE` (or any `accept()` error), `tokio::time::sleep(Duration::from_millis(100)).await` before the next iteration. This caps the runaway at ~10 lines/second — manageable for the log store and CPU. Better — exponential backoff with a cap (100 ms → 200 ms → 400 ms → ... → 5 s), reset to 100 ms after the next successful accept; see the sketch below.
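A minimal sketch of the backoff variant, assuming a tokio `TcpListener`; the listener wiring and the connection handler are placeholders, not the actual `hero_collab_server` code:

```rust
use std::time::Duration;
use tokio::net::TcpListener;

async fn accept_loop(listener: TcpListener) {
    let base = Duration::from_millis(100);
    let cap = Duration::from_secs(5);
    let mut backoff = base;

    loop {
        match listener.accept().await {
            Ok((stream, peer)) => {
                // A successful accept resets the backoff window.
                backoff = base;
                tokio::spawn(async move {
                    // Placeholder for the real connection handler.
                    let _ = (stream, peer);
                });
            }
            Err(e) => {
                // EMFILE (os error 24) lands here, along with any other
                // accept() failure. Log once, then sleep so the loop emits
                // at most ~10 lines/second instead of hundreds.
                eprintln!("Socket accept error: {e}");
                tokio::time::sleep(backoff).await;
                // 100ms -> 200ms -> 400ms -> ... capped at 5s.
                backoff = cap.min(backoff * 2);
            }
        }
    }
}
```

The same shape handles every `accept()` error, and the cap keeps recovery latency bounded once FDs free up.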
Medium-term
Find and fix the FD leak. Likely candidates:

- streams whose `into_inner()` / `drop()` is never reached, so the underlying socket never closes
- tasks that hold an `Arc<Connection>` but never finish, so the descriptor is never released
- `hero_collab` may use a third-party crate that leaks FDs (e.g. some Yjs / OT library; check upstream issues)

The per-process FD limit on Linux is usually 1024 (default `ulimit -n`). Long-running daemons should either raise it or aggressively drop idle connections; a sketch of raising it follows.
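A minimal sketch of raising the soft `RLIMIT_NOFILE` to the hard limit at startup, assuming Linux and the `libc` crate; this helper is illustrative, not existing `hero_collab` code:

```rust
/// Raise the soft NOFILE limit to the hard limit so the daemon is not
/// stuck at the common soft default of 1024. Linux-only sketch.
fn raise_nofile_limit() -> std::io::Result<u64> {
    unsafe {
        let mut rl = libc::rlimit { rlim_cur: 0, rlim_max: 0 };
        if libc::getrlimit(libc::RLIMIT_NOFILE, &mut rl) != 0 {
            return Err(std::io::Error::last_os_error());
        }
        rl.rlim_cur = rl.rlim_max; // soft := hard
        if libc::setrlimit(libc::RLIMIT_NOFILE, &rl) != 0 {
            return Err(std::io::Error::last_os_error());
        }
        Ok(rl.rlim_cur as u64)
    }
}
```

Raising the limit only buys headroom; the leak itself still needs the fix.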
Long-term
Add an FD watchdog that polls the `/proc/self/fd` count every N seconds and emits a metric. When the count exceeds 80% of `RLIMIT_NOFILE`, log a warning and refuse new connections (rather than wait for hard exhaustion).
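A sketch of that watchdog, again assuming Linux, tokio, and the `libc` crate; the 80% threshold wiring and the `ACCEPTING` flag are illustrative names, not existing code:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::time::Duration;

/// Illustrative flag the accept loop would check before accepting.
static ACCEPTING: AtomicBool = AtomicBool::new(true);

fn open_fd_count() -> std::io::Result<usize> {
    // Each entry in /proc/self/fd is one open descriptor (the read_dir
    // handle itself adds one; close enough for a ratio check).
    Ok(std::fs::read_dir("/proc/self/fd")?.count())
}

async fn fd_watchdog(interval: Duration) {
    loop {
        // Error handling omitted for brevity in this sketch.
        let limit = unsafe {
            let mut rl = libc::rlimit { rlim_cur: 0, rlim_max: 0 };
            libc::getrlimit(libc::RLIMIT_NOFILE, &mut rl);
            rl.rlim_cur as f64
        };
        if let Ok(used) = open_fd_count() {
            // Emit `used` as a metric here (metrics pipeline omitted).
            let over = (used as f64) >= 0.8 * limit;
            ACCEPTING.store(!over, Ordering::Relaxed);
            if over {
                eprintln!("fd watchdog: {used} FDs in use (limit {limit}); refusing new connections");
            }
        }
        tokio::time::sleep(interval).await;
    }
}
```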
State on herodemo right now

`hero_collab_server` is down (not running). Left down deliberately until the backoff and FD leak are fixed. `hero_collab_ui` (the admin/iframe UI binary) is still running.

Severity
P1 by impact (took the demo down via the hero_proc log incident). P0 if the FD leak is reproducible — it will do this again on the next restart.
Cross-refs
- The `hero_proc` log store also failed to absorb the runaway gracefully (no rate limit, no total-size cap).
- Spotted during the docs_hero Phase 1 UX gate (session 52). Reconciliation memo: `memory/investigation_roadmap_reconciliation.md`.