lhumina_code/hero_cockpit

cockpit.list_services times out: serialized N+1 RPC fan-out to hero_proc #6

Closed

opened 2026-05-23 04:11:59 +00:00 by mik-tf · 3 comments

mik-tf commented

2026-05-23 04:11:59 +00:00

Owner

Found while running a verification pass against a local cockpit install.

The `cockpit.list_services` handler at `crates/hero_cockpit_server/src/main.rs:460` calls `service.list_full`, then for each returned service awaits `service_status` followed by `service_stats`, one call at a time. With around 90 services registered in hero_proc on a realistic VM, that is roughly 180 sequential RPC round-trips per page load, and the request reliably exceeds the router's default 10-second upstream timeout. The visible symptom is that opening `/services` (the primary cockpit page after login) returns the plaintext string `upstream timeout` instead of the services table, which makes every cockpit lifecycle button unreachable.

The likely fix is to fan the per-service `service_status` and `service_stats` calls out concurrently with `futures::future::join_all`, or to extend `service.list_full` to return state and stats inline and drop the secondary calls entirely. Reproduced locally with 91 services discovered by hero_router. Happy to open a PR once a preferred shape is confirmed.
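To make the serialized versus concurrent shapes concrete, here is a minimal sketch. It is not the cockpit handler: `Row`, `fake_status`, and `fake_stats` are hypothetical stand-ins that sleep to simulate one RPC round-trip, and it uses `std::thread::scope` rather than the `futures::future::join_all` approach named above, purely so the example has no external dependencies.

```rust
use std::thread;
use std::time::Duration;

/// Hypothetical per-service row; the real cockpit types are not shown in this issue.
#[derive(Debug, Clone, PartialEq)]
pub struct Row {
    pub name: String,
    pub state: String,
    pub mem_rss_bytes: u64,
}

/// Stand-in for the `service_status` RPC: sleeps to simulate one round-trip.
fn fake_status(name: &str, latency: Duration) -> String {
    thread::sleep(latency);
    format!("running:{name}")
}

/// Stand-in for the `service_stats` RPC: sleeps, then returns a made-up RSS value.
fn fake_stats(name: &str, latency: Duration) -> u64 {
    thread::sleep(latency);
    name.len() as u64 * 1024
}

/// Serialized N+1 shape: two round-trips per service, one after another.
pub fn list_serial(names: &[String], latency: Duration) -> Vec<Row> {
    names
        .iter()
        .map(|n| Row {
            name: n.clone(),
            state: fake_status(n, latency),
            mem_rss_bytes: fake_stats(n, latency),
        })
        .collect()
}

/// Concurrent shape: every row is fetched on its own scoped thread, and the
/// status and stats calls for a row also run side by side.
pub fn list_concurrent(names: &[String], latency: Duration) -> Vec<Row> {
    thread::scope(|scope| {
        let handles: Vec<_> = names
            .iter()
            .map(|n| {
                scope.spawn(move || {
                    let (state, mem_rss_bytes) = thread::scope(|inner| {
                        let status = inner.spawn(|| fake_status(n, latency));
                        let stats = inner.spawn(|| fake_stats(n, latency));
                        (status.join().unwrap(), stats.join().unwrap())
                    });
                    Row { name: n.clone(), state, mem_rss_bytes }
                })
            })
            .collect();
        // Joining in spawn order keeps the rows in list order.
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}
```

Both functions return the same rows in the same order; only the elapsed time differs. In the serial version the total grows with twice the number of services, while in the concurrent version it stays close to a single round-trip as long as the callee can actually serve the calls in parallel.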

mik-tf referenced this issue from a commit

2026-05-23 04:13:56 +00:00

docs(channels/free): s146 verification pass — flip statuses against live local cockpit

mik-tf referenced this issue from a commit

2026-05-23 05:09:11 +00:00

fix(handle_list_services): parallelize per-row status+stats RPC fan-out

mik-tf closed this issue

2026-05-23 05:09:11 +00:00

mik-tf referenced this issue from lhumina_code/hero_proc

2026-05-23 05:09:37 +00:00

service.status RPC latency under concurrent load limits effective parallelism to ~9x #121

mik-tf commented

2026-05-23 05:09:38 +00:00

Author

Owner

Partial fix landed in [`c0a2a10`](https://forge.ourworld.tf/lhumina_code/hero_cockpit/commit/c0a2a10) on `development`. The per-row `service.status` and `service.stats` calls now fire concurrently via `tokio::join!` plus `futures::future::join_all` instead of the serialized loop, collapsing the cockpit-side fan-out from 200s+ to about 22s on a 101-service local stack. The remaining gap above the hero_router 10s upstream timeout is hero_proc-side: a single `service.status` call averages 1.8s under concurrent load and the daemon caps effective parallelism at roughly 9x. Filed as a separate follow-up so this one can close on the N+1 fix that was its title.
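One way to read the numbers above, offered as an assumption rather than a measured model: if the daemon effectively serves about 9 rows at a time and each wave costs one `service.status` latency, the wall-clock time is roughly the number of waves times the per-call latency. The function name and the wave model below are illustrative only.

```rust
/// Rough wall-clock estimate, assuming rows complete in waves of `parallelism`
/// and each wave costs one per-call latency. This is a simplification: it
/// treats the joined status + stats pair for a row as a single unit of work.
pub fn estimated_wall_clock_secs(rows: u32, parallelism: u32, per_call_secs: f64) -> f64 {
    assert!(parallelism > 0, "parallelism must be at least 1");
    // Ceiling division: a partially filled last wave still costs a full latency.
    let waves = (rows + parallelism - 1) / parallelism;
    f64::from(waves) * per_call_secs
}
```

With the figures from this comment, `estimated_wall_clock_secs(101, 9, 1.8)` gives 12 waves at 1.8 s each, about 21.6 s, which lines up with the observed ~22 s and is still well above the 10 s upstream timeout.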

mik-tf commented

2026-05-23 05:10:03 +00:00

Author

Owner

Follow-up filed at [hero_proc#121](https://forge.ourworld.tf/lhumina_code/hero_proc/issues/121) for the residual daemon-side latency.

mik-tf referenced this issue from a commit

2026-05-23 05:13:00 +00:00

docs(channels/free): s147 fix pass — flip e2e_checklist rationales after hero_cockpit#6 + hero_router#110 closure

mik-tf referenced this issue from a commit

2026-05-23 17:26:48 +00:00

fix(handle_list_services): adopt service.status_all bulk RPC

mik-tf commented

2026-05-23 17:41:52 +00:00

Author

Owner

Fully closed by [`722ace2`](https://forge.ourworld.tf/lhumina_code/hero_cockpit/commit/722ace2): `handle_list_services` now uses the new `service.status_all` bulk RPC from [lhumina_code/hero_proc@e833dc9](https://forge.ourworld.tf/lhumina_code/hero_proc/commit/e833dc9).

The s147 partial fix ([`c0a2a10`](https://forge.ourworld.tf/lhumina_code/hero_cockpit/commit/c0a2a10)) parallelized the cockpit-side fan-out with `tokio::join!` + `join_all`, but the daemon-side per-call cost still dominated: 1.8 s per call at only ~9x effective parallelism comes to ~22 s on 101 services, still over the hero_router 10 s upstream timeout. The new bulk RPC eliminates the per-call sysinfo mutex and the 3x-redundant SQL chain per call.

Local smoke through hero_router on 105 services:

| call | wall-clock |
|---|---|
| 1 | 14 ms |
| 2 | 15 ms |
| 3 | 14 ms |

No 504. Page renders the full table with `state`, `pid`, `mem_rss_bytes`, `cpu_percent`, `restarts`, `current_run_id`, and `enabled` for every supervised service.

Also drops `futures` from the workspace deps, since it was only used by the now-removed `join_all`.
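As a rough illustration of why the bulk shape removes the fan-out, the sketch below joins a service list against a single bulk reply in memory. It is not the code from `722ace2`: the field names are taken from the comment above, but the `ServiceStatus` struct, its field types, the map-keyed-by-name reply shape, and `build_rows` are all assumptions made for the example.

```rust
use std::collections::HashMap;

/// Hypothetical shape of one entry in a `service.status_all` reply.
/// Field names come from the issue comment; the types are guesses.
#[derive(Debug, Clone, PartialEq)]
pub struct ServiceStatus {
    pub state: String,
    pub pid: Option<u32>,
    pub mem_rss_bytes: u64,
    pub cpu_percent: f32,
    pub restarts: u32,
    pub current_run_id: Option<u64>,
    pub enabled: bool,
}

/// Join the service list with one bulk reply, keeping list order.
/// A service missing from the reply yields `None` instead of triggering a
/// per-service fallback call, so the RPC count stays constant.
pub fn build_rows(
    names: &[String],
    bulk: &HashMap<String, ServiceStatus>,
) -> Vec<(String, Option<ServiceStatus>)> {
    names
        .iter()
        .map(|name| (name.clone(), bulk.get(name).cloned()))
        .collect()
}
```

Under these assumptions the handler needs one round-trip for the list and one for the bulk status, regardless of how many services are registered, and the per-row work is a hash lookup.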
