[bug][P1] hero_collab accept loop has no backoff on EMFILE — runaway log spam on FD exhaustion #42

Open
opened 2026-05-01 04:39:58 +00:00 by mik-tf · 0 comments
Owner

Summary

hero_collab_server ran out of file descriptors (EMFILE, errno 24), then entered a tight accept-loop with no backoff, logging the same error ~200 lines/second until either the process died or the disk filled. On herodemo this generated 58 GB of logs in 24 hours on 2026-04-30, which took down the live demo by filling /data. The detailed log-store analysis is at the parallel hero_proc incident issue.

The runaway pattern is two layered bugs:

  1. FD leak somewhere in hero_collab — connections, sockets, or files not being closed. Eventually exhausts the per-process FD limit. (Root cause — diagnosis pending.)
  2. Accept loop has no backoff on EMFILE — when accept() fails, the loop logs the error and immediately retries, spinning at 100% CPU and emitting hundreds of log lines per second.

Bug 1 alone causes connection failures. Bug 2 turns it into a disk-eating runaway.

Source data

Sample log lines captured before remediation:

2026-05-01T00:28:31.752605Z ERROR hero_collab_server: Socket accept error: Too many open files (os error 24)
2026-05-01T00:28:31.752609Z ERROR hero_collab_server: Socket accept error: Too many open files (os error 24)
2026-05-01T00:28:31.768457Z ERROR hero_collab_server: Socket accept error: Too many open files (os error 24)
2026-05-01T00:28:31.768463Z ERROR hero_collab_server: Socket accept error: Too many open files (os error 24)

Note the timestamp deltas between consecutive lines: 4 µs, ~16 ms, 6 µs — the loop is firing as fast as the kernel can return EMFILE and the SQLite log batcher can persist, with no sleep() or yield between iterations.

Volume metrics on day 121 (today, before hero_collab_server died at 00:28 UTC):

  • 3,747,633 log lines from hero_collab.hero_collab_server
  • 528 MB total content
  • ~200 lines/second sustained
  • All same template (Socket accept error: Too many open files (os error 24))

Proposed fixes

Immediate (low-risk)

In the accept loop, on EMFILE (or any accept() error), tokio::time::sleep(Duration::from_millis(100)).await before the next iteration. This caps the runaway at ~10 lines/second — manageable for the log store and CPU. Pseudocode:

use std::time::Duration;

loop {
    match listener.accept().await {
        Ok((conn, _)) => { tokio::spawn(handle(conn)); }
        Err(e) => {
            tracing::error!("Socket accept error: {}", e);
            // Backoff to avoid runaway on EMFILE / EINTR / ENFILE
            tokio::time::sleep(Duration::from_millis(100)).await;
        }
    }
}

Better — exponential backoff with a cap (100ms → 200ms → 400ms → ... → 5s), reset to 100ms after the next successful accept.
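The capped doubling is easiest to keep correct as a tiny pure helper, so the accept loop itself stays readable. A minimal sketch — the `next_backoff` name and the constants are illustrative, not existing hero_collab code:

```rust
use std::time::Duration;

const INITIAL: Duration = Duration::from_millis(100);
const CAP: Duration = Duration::from_secs(5);

/// Double the delay after each consecutive accept() failure, capped at 5 s.
fn next_backoff(current: Duration) -> Duration {
    (current * 2).min(CAP)
}

fn main() {
    let mut delay = INITIAL;
    let mut schedule = vec![delay];
    for _ in 0..7 {
        delay = next_backoff(delay);
        schedule.push(delay);
    }
    // Sequence: 100ms, 200ms, 400ms, 800ms, 1.6s, 3.2s, 5s, 5s
    println!("{:?}", schedule);
}
```

In the accept loop this becomes `delay = next_backoff(delay)` on each consecutive `Err`, and `delay = INITIAL` after the next successful accept.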

Medium-term

Find and fix the FD leak. Likely candidates:

  • WebSocket or HTTP connection handlers that don't drop the socket on error/disconnect
  • File handles opened per-request without an into_inner() / drop()
  • Spawned tasks that hold an Arc<Connection> but never finish
  • hero_collab may use a third-party crate that leaks FDs (e.g. some Yjs / OT library; check upstream issues)

Per-process FD limit on Linux is usually 1024 (default ulimit -n). Long-running daemons should either raise this or aggressively drop connections.
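When hunting the leak, counting entries in /proc/self/fd before and after a suspect code path is often enough to confirm it. A Linux-only sketch of the pattern (the leak here is simulated by deliberately keeping handles alive; nothing below is hero_collab code):

```rust
use std::fs;

/// Number of open descriptors in this process (Linux-specific;
/// includes the fd that read_dir itself opens while counting).
fn open_fd_count() -> usize {
    fs::read_dir("/proc/self/fd").map(|d| d.count()).unwrap_or(0)
}

fn main() {
    let before = open_fd_count();

    // Simulate a leak: hold 50 handles in a Vec instead of dropping them.
    let leaked: Vec<fs::File> = (0..50)
        .map(|_| fs::File::open("/proc/self/status").unwrap())
        .collect();
    let during = open_fd_count();

    drop(leaked); // the "fix": release the handles
    let after = open_fd_count();

    println!("before={} during={} after={}", before, during, after);
}
```

Running this around each candidate from the list above (WebSocket handler, per-request file opens, long-lived tasks) should make the leaking path obvious: the count climbs across requests and never comes back down.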

Long-term

  • Add a self-monitor that polls /proc/self/fd count every N seconds and emits a metric. When the count exceeds 80% of RLIMIT_NOFILE, log a warning and refuse new connections (rather than wait for hard exhaustion).
  • Surface FD-count on the hero_collab admin UI.
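The self-monitor needs two Linux-specific reads: the soft RLIMIT_NOFILE (parseable from /proc/self/limits without any extra crate) and the current count from /proc/self/fd. A sketch of the 80% check — function names and the threshold are illustrative, and the /proc layout assumed here is the standard Linux one:

```rust
use std::fs;

/// Soft RLIMIT_NOFILE, parsed from /proc/self/limits (Linux-specific).
/// Line layout: "Max open files    1024    524288    files"
fn nofile_soft_limit() -> Option<u64> {
    let limits = fs::read_to_string("/proc/self/limits").ok()?;
    let line = limits.lines().find(|l| l.starts_with("Max open files"))?;
    line.split_whitespace().nth(3)?.parse().ok()
}

/// Current number of open descriptors in this process.
fn open_fd_count() -> usize {
    fs::read_dir("/proc/self/fd").map(|d| d.count()).unwrap_or(0)
}

fn main() {
    let limit = nofile_soft_limit().unwrap_or(1024);
    let used = open_fd_count();
    let pressure = used as f64 / limit as f64;
    if pressure > 0.8 {
        // At this point the server should also stop accepting new connections.
        eprintln!(
            "WARN: fd pressure {:.0}% ({used}/{limit}), refusing new connections",
            pressure * 100.0
        );
    } else {
        println!("fd usage: {used}/{limit} ({:.1}%)", pressure * 100.0);
    }
}
```

Polling this pair every N seconds and exporting both numbers as metrics gives the admin UI its FD-count for free.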

State on herodemo right now

  • hero_collab_server is down (not running). Left down deliberately until backoff + FD leak are fixed.
  • hero_collab_ui (the admin/iframe UI binary) is still running.
  • The runaway loop is no longer producing logs.

Severity

P1 by impact (took the demo down via hero_proc log incident). P0 if the FD leak is reproducible — it'll do this again on the next restart.

Cross-refs

  • hero_proc parallel issue (this session) — the hero_proc log store also failed to absorb the runaway gracefully (no rate limit, no total-size cap)
  • hero_proc#86 — reliability META (https://forge.ourworld.tf/lhumina_code/hero_proc/issues/86)
  • hero_demo#52 — vision (https://forge.ourworld.tf/lhumina_code/hero_demo/issues/52)

Spotted during docs_hero Phase 1 UX gate (session 52). Reconciliation memo: memory/investigation_roadmap_reconciliation.md.

Reference: lhumina_code/hero_collab#42