cockpit.list_services times out: serialized N+1 RPC fan-out to hero_proc #6
Found while running a verification pass against a local cockpit install. The `cockpit.list_services` handler at `crates/hero_cockpit_server/src/main.rs:460` calls `service.list_full`, then for each returned service it awaits a serialized `service_status` followed by a serialized `service_stats`. With around 90 services registered in hero_proc on a realistic VM, that is roughly 180 sequential RPC round-trips per page load, and the request reliably exceeds the router's default 10-second upstream timeout.

The visible symptom is that opening `/services` (the primary cockpit page after login) returns the plaintext string `upstream timeout` instead of the services table, which makes every cockpit lifecycle button unreachable.

Likely fix is to fan the per-service `service_status` and `service_stats` calls out concurrently with `futures::future::join_all` (or to extend `service.list_full` to return state and stats inline, then drop the secondary calls entirely). Reproduced locally with 91 services discovered by hero_router. Happy to open a PR once a preferred shape is confirmed.

Partial fix landed in c0a2a10 on `development`. The per-row `service.status` and `service.stats` calls now fire concurrently via `tokio::join!` plus `futures::future::join_all` instead of the serialized loop, collapsing the cockpit-side fan-out from 200s+ to about 22s on a 101-service local stack. The remaining gap above the hero_router 10s upstream timeout is hero_proc-side: a single `service.status` call averages 1.8s under concurrent load and the daemon caps effective parallelism at roughly 9x. Filed as a separate follow-up so this one can close on the N+1 fix that was its title.

Follow-up filed at hero_proc#121 for the residual daemon-side latency.
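For readers skimming the thread, here is a minimal std-only sketch of the difference between the serialized loop and a concurrent fan-out. It is not the cockpit code: `fake_rpc`, `list_serial`, and `list_concurrent` are hypothetical names, the RPC is simulated with a sleep, and scoped threads stand in for the `tokio::join!` plus `join_all` combination used in the actual fix.

```rust
use std::thread;
use std::time::Duration;

// Hypothetical stand-in for one hero_proc RPC; the sleep simulates round-trip latency.
fn fake_rpc(service: &str, method: &str, latency: Duration) -> String {
    thread::sleep(latency);
    format!("{service}:{method}")
}

// Serialized N+1 shape: two round-trips per service, all back to back.
fn list_serial(services: &[String], latency: Duration) -> Vec<(String, String)> {
    services
        .iter()
        .map(|s| {
            let status = fake_rpc(s, "status", latency);
            let stats = fake_rpc(s, "stats", latency);
            (status, stats)
        })
        .collect()
}

// Concurrent shape: one scoped thread per service, results collected in input order.
// Each worker still issues status then stats in sequence; the real fix also
// overlaps those two calls, which this sketch does not model.
fn list_concurrent(services: &[String], latency: Duration) -> Vec<(String, String)> {
    thread::scope(|scope| {
        let handles: Vec<_> = services
            .iter()
            .map(|s| {
                scope.spawn(move || {
                    let status = fake_rpc(s, "status", latency);
                    let stats = fake_rpc(s, "stats", latency);
                    (status, stats)
                })
            })
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("worker thread panicked"))
            .collect()
    })
}
```

Calling both functions on the same service list returns identical rows; only the wall-clock time differs, since the serialized version pays for every round-trip in sequence while the concurrent one is bounded by the slowest worker (and, in the real system, by whatever parallelism the daemon allows).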
Fully closed by 722ace2: `handle_list_services` now uses the new `service.status_all` bulk RPC from `lhumina_code/hero_proc@e833dc9`.

The s147 partial fix (c0a2a10) parallelized the cockpit-side fan-out with `tokio::join!` + `join_all`, but the daemon-side per-call cost still dominated (1.8 s/call at ~9x effective parallelism works out to ~22 s on 101 services, still over the hero_router 10 s upstream timeout). The new bulk RPC eliminates the per-call sysinfo mutex and the 3x-redundant SQL chain per call.

Local smoke through hero_router on 105 services:
No 504. Page renders the full table with state, pid, mem_rss_bytes, cpu_percent, restarts, current_run_id, enabled for every supervised service.
Also drops `futures` from the workspace deps; it was only used by the now-removed `join_all`.
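As a sanity check on the ~22 s figure quoted for the partial fix, the numbers in this thread are consistent with a rough "wave" model. This is an assumption for illustration, not a measured profile: it supposes that calls complete in batches of the effective parallelism and counts only the `service.status` calls, since that is the only per-call latency reported here. `estimated_secs` and `exceeds_timeout` are hypothetical helper names.

```rust
// Upstream timeout reported for hero_router in this thread.
const UPSTREAM_TIMEOUT_SECS: f64 = 10.0;

// Rough wave model (an assumption): with `parallelism` calls in flight at once,
// `calls` calls finish in ceil(calls / parallelism) waves of `per_call_secs` each.
fn estimated_secs(calls: u32, parallelism: u32, per_call_secs: f64) -> f64 {
    assert!(parallelism > 0, "parallelism must be at least 1");
    let waves = (calls + parallelism - 1) / parallelism; // integer ceiling division
    f64::from(waves) * per_call_secs
}

// True when the estimate is over the upstream timeout.
fn exceeds_timeout(calls: u32, parallelism: u32, per_call_secs: f64) -> bool {
    estimated_secs(calls, parallelism, per_call_secs) > UPSTREAM_TIMEOUT_SECS
}
```

With the figures from the thread, `estimated_secs(101, 9, 1.8)` gives 12 waves of 1.8 s, about 21.6 s, which is in line with the observed ~22 s and well over the 10 s timeout. Under this model, parallelizing the client cannot get under the timeout while the per-call cost and the daemon-side parallelism cap stay where they are, which is why a single bulk call was the shape that closed the issue.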