LogEventBus leaks entries on executor task panic — switch to Drop-guarded RAII #123
Labels
No labels
prio_critical
prio_low
type_bug
type_contact
type_issue
type_lead
type_question
type_story
type_task
No milestone
No project
No assignees
1 participant
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
lhumina_code/hero_proc#123
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Symptom
LogEventBus(crates/hero_proc_server/src/logging/sse/mod.rs) holds onebroadcast::Sender<LogEvent>per job, inserted byopen(job_id)and removed byclose(job_id). The supervisor's executor spawn-closure does the pairing:If
executor::run_jobpanics — for any reason: a.unwrap()deep in the spawn chain, a stack overflow, an OOM in the log writer, anassert!in the PTY path — the spawned task terminates withJoinError::Panicand neitherclose()nor theactive.lock().await.remove(...)lines run. The entry leaks in the LogEventBus'sHashMap<u32, broadcast::Sender>AND in the supervisor'sactivemap.Over a long-running daemon (the kind of "several hours uptime" that #122 reports), every panic accumulates. With long enough uptime + non-trivial job churn the HashMap grows unbounded and any subscriber attaching to a leaked
job_idgets aSenderwhose correspondingReceiverwill never see a terminator.Why a Drop guard is the right architectural fix
The current API is "manual lifecycle pairing" —
open()andclose()are explicit; the caller is responsible for ensuring both run. This is a Rust footgun because:close()after the first is a silent no-op (the entry's already gone) so there's no signal when something gets dropped wrongRAII via a Drop guard moves the lifecycle into the type system. Standard Rust pattern, used widely (
MutexGuard,JoinHandle's drop-abort behavior,tokio::task::AbortHandle).Proposed fix shape
Switch the internal HashMap from
tokio::sync::RwLock→std::sync::RwLock. The map operations are O(1) and finish in nanoseconds; there's no reason for the async-aware lock. Doing this makesopen/publish/subscribesynchronous, AND makes the Drop body trivial (notokio::spawnfrom insideDroprequired):Call-site change in
supervisor/mod.rs:Three things this gets:
open/publish/subscribeno longer need.await; ~6 call sites inexecutor.rsandweb.rsdrop the.awaitsuffixTests
open(jid)returns guard; drop guard; verify entry is gone frominnerand a finallog.doneevent went to any active subscriberRecvError::Closedfunctional::uc_23_26_logs_sseshould keep passingNot in scope
activemap leak on panic — separate issue, same root cause family, different fix surface#122futex wedge — this fix REDUCES surface area for that bug class but doesn't claim to fix #122 itselfWill pick this up autonomously unless someone wants to take it.
Verified end-to-end before push. Beyond the 6 new unit tests + the existing SSE integration suite:
hero_proc_serverin a freshPATH_ROOTsandboxGET /events?job_id=<id>over SSElog.lineevents as the job ranjob.cancellog.doneevent with{"job_id":N}payload, then the SSE stream closed cleanlyno active log stream for job N— the bus entry was reclaimedCancel and panic both unwind the spawn closure stack and drop locals → same Drop path. The unit test
panic_inside_spawned_task_still_drops_guardexercises the panic case; the manual verify above exercises the cancel case.Wire surface (
GET /events,log.line/log.doneJSON payloads) unchanged — no SDK regen, no consumer rebuild.