[infra] Forge bulk downloads to TFGrid VMs crawl at ~150 KB/s, wedging CI image pulls and slowing sandbox installs

mik-tf commented

2026-06-11 20:28:24 +00:00

Owner

Bulk transfers from forge.ourworld.tf to TFGrid-hosted VMs are degraded to roughly 140 to 310 KB/s right now, while small API requests stay fast and the same VMs download from GitHub at 3 to 5 MB/s, so it is specific to the Forge path. Measured today from two different sandbox VMs: a 16.6 MB release asset cannot finish inside 60 seconds, while the identical download from an office machine completes in 9 seconds at 1.9 MB/s. This is what is currently wedging CI: lab release runs hang for hours on the docker pull of the 3.14 GB lhumina_code/lab-builder image (for example https://forge.ourworld.tf/lhumina_code/hero_skills/actions/runs/1397), and the builder image workflow fails while pushing its 1.3 GB layer even when the image content is unchanged (for example https://forge.ourworld.tf/lhumina_code/hero_skills/actions/runs/1396). Sandbox tester installs and service updates still complete, just very slowly. Could someone check the Forge server, its reverse proxy, or the container registry for bandwidth limits or congestion on bulk transfers toward TFGrid-hosted IPs, or restart that path? I can rerun the measurements right after any change.

Signed-by: mik-tf <mik-tf@noreply.invalid>
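For anyone who wants to repeat the measurement, here is a minimal Python sketch of a capped download timer. It is not the script used for the numbers above; the function names, the 64 KiB chunk size, and the 30 s socket timeout are assumptions chosen for illustration, and the URL to test is left to the caller.

```python
import time
import urllib.request


def rate_kb_per_s(num_bytes: int, seconds: float) -> float:
    """Average throughput in KB/s, with 1 KB = 1000 bytes."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return num_bytes / 1000 / seconds


def timed_download(url: str, limit_s: float = 60.0, chunk_size: int = 64 * 1024):
    """Stream `url` and discard the body, giving up once `limit_s` has elapsed.

    Returns (bytes_read, elapsed_seconds, finished). The cap is checked
    between chunks, so a stalled read can overshoot it by up to the
    socket timeout below.
    """
    start = time.monotonic()
    total = 0
    finished = False
    with urllib.request.urlopen(url, timeout=30) as resp:
        while time.monotonic() - start < limit_s:
            data = resp.read(chunk_size)
            if not data:  # an empty read means the body is complete
                finished = True
                break
            total += len(data)
    return total, time.monotonic() - start, finished


def report(url: str) -> str:
    """One-line summary suitable for pasting back into the issue."""
    total, elapsed, finished = timed_download(url)
    state = "finished" if finished else "cut off at the time cap"
    rate = rate_kb_per_s(total, elapsed)
    return f"{total / 1e6:.1f} MB in {elapsed:.1f} s, {rate:.0f} KB/s ({state})"
```

Running `report(...)` against the same asset from a TFGrid VM and from a non-TFGrid machine gives two directly comparable lines. The 60-second default mirrors the cap mentioned in the measurement above.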


mik-tf commented

2026-06-11 20:51:49 +00:00

Author

Owner

Update: the path recovered on its own around 20:28 UTC today. The same 16.6 MB asset that could not finish in 60 seconds from the sandbox tester now downloads in 0.8 seconds at about 21 MB/s, and the wedged CI runs resumed and are progressing normally (hero_skills run 1399 and hero_cockpit run 94 both moved past the previously hanging steps). Degradation window we observed: roughly 15:50 to 20:28 UTC, during which bulk transfers to TFGrid-hosted VMs ran at 140 to 310 KB/s while transfers to a non-TFGrid machine stayed fast. Leaving this open so whoever runs the Forge host can correlate that window with server, proxy, or uplink saturation; if nothing actionable turns up, fine to close as a transient incident record.

Signed-by: mik-tf <mik-tf@noreply.invalid>
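As a rough cross-check of the figures in this thread, the arithmetic can be sketched in Python. The assumptions are decimal units (1 GB = 10^9 bytes), a constant sustained rate, and a single sequential transfer; a real `docker pull` fetches layers in parallel, so this only gives an order of magnitude.

```python
KB = 1_000
MB = 1_000_000
GB = 1_000_000_000


def transfer_seconds(size_bytes: float, rate_bytes_per_s: float) -> float:
    """Time to move size_bytes at a constant rate."""
    return size_bytes / rate_bytes_per_s


image = 3.14 * GB  # lab-builder image size quoted in the issue

# Degraded range reported for TFGrid-hosted VMs.
for rate in (140 * KB, 310 * KB):
    hours = transfer_seconds(image, rate) / 3600
    print(f"{rate // KB} KB/s -> {hours:.1f} h")
# 140 KB/s -> 6.2 h
# 310 KB/s -> 2.8 h

# Observed degradation window, 15:50 to 20:28 UTC.
window_hours = (20 * 60 + 28 - (15 * 60 + 50)) / 60
print(f"window: {window_hours:.2f} h")
# window: 4.63 h

# Recovered rate: the 16.6 MB asset in 0.8 s.
print(f"recovered: {16.6 * MB / 0.8 / MB:.2f} MB/s")
# recovered: 20.75 MB/s
```

Under these assumptions a full image pull at the low end of the range would not have fit inside the degradation window, which is consistent with the runs hanging for hours rather than failing outright.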


mik-tf commented

2026-06-11 22:09:03 +00:00

Author

Owner

One part of this is NOT transient and has a precise signature: pushing large docker layers to the Forge container registry fails consistently after about 6 minutes, even now that bulk download speeds are healthy again. Today's build-lab-builder runs 1394, 1396, 1398, and 1400 on lhumina_code/hero_skills all died in the Push step at 5m55s to 6m on the ~1.3 GB rust toolchain layer, while small layers in the same push complete fine (example: https://forge.ourworld.tf/lhumina_code/hero_skills/actions/runs/1400). A cutoff at the same elapsed time on every run matches a fixed reverse proxy request/body timeout on the registry upload route rather than network congestion; the limit is probably somewhere around 300s to 360s, given that the step dies just under the 6-minute mark. Could whoever runs the Forge host check the proxy timeout and max body settings for the /v2/ registry endpoints? Until then the lab-builder image cannot be refreshed, so release workflows keep running an outdated prebaked lab.

Signed-by: mik-tf <mik-tf@noreply.invalid>
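To illustrate why a fixed timeout would hit only the large layer, here is a small Python sketch of the arithmetic. It assumes the layer goes up in one long-lived request and that the proxy cuts the request at a fixed elapsed time; the 300 s value is the guess from the comment above, and 360 s is an alternative matching the observed 6 minutes. Neither is a confirmed setting of the Forge host.

```python
def min_rate_mb_per_s(layer_bytes: float, timeout_s: float) -> float:
    """Sustained upload rate (MB/s, decimal) needed to finish before the timeout."""
    return layer_bytes / timeout_s / 1e6


def max_layer_mb(rate_mb_per_s: float, timeout_s: float) -> float:
    """Largest layer (MB, decimal) that fits in one request at the given rate."""
    return rate_mb_per_s * timeout_s


layer = 1.3e9  # ~1.3 GB rust toolchain layer from the failing runs

for timeout in (300, 360):  # hypothetical proxy limits, in seconds
    print(f"{timeout} s -> needs {min_rate_mb_per_s(layer, timeout):.2f} MB/s")
# 300 s -> needs 4.33 MB/s
# 360 s -> needs 3.61 MB/s
```

Any runner whose sustained upload rate to the registry is below those values would fail on this layer every time, while small layers finish well inside the limit, which is the pattern seen in runs 1394 to 1400.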


mik-tf referenced this issue

2026-06-12 21:55:23 +00:00

[META] Hero OS sandbox demo, functional readiness: onboarding pipeline + per-app verification #239


[infra] Forge bulk downloads to TFGrid VMs crawl at ~150 KB/s, wedging CI image pulls and slowing sandbox installs #280