[infra] Forge bulk downloads to TFGrid VMs crawl at ~150 KB/s, wedging CI image pulls and slowing sandbox installs #280
lhumina_code/home#280
Bulk transfers from forge.ourworld.tf to TFGrid-hosted VMs are currently degraded to roughly 140 to 310 KB/s. Small API requests stay fast, and the same VMs download from GitHub at 3 to 5 MB/s, so the problem is specific to the Forge path.

Measured today from two different sandbox VMs: a 16.6 MB release asset cannot finish within 60 seconds, while the identical download from an office machine completes in 9 seconds at 1.9 MB/s.

This is what is currently wedging CI. Lab release runs hang for hours on the docker pull of the 3.14 GB lhumina_code/lab-builder image (for example https://forge.ourworld.tf/lhumina_code/hero_skills/actions/runs/1397), and the builder image workflow fails while pushing its 1.3 GB layer even when the image content is unchanged (for example https://forge.ourworld.tf/lhumina_code/hero_skills/actions/runs/1396). Sandbox tester installs and service updates still complete, just very slowly.

Could someone check the Forge server, its reverse proxy, or the container registry for bandwidth limits or congestion on bulk transfers toward TFGrid-hosted IPs, or restart that path? I can rerun the measurements right after any change.
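
For anyone who wants to repeat the comparison from another VM, here is a minimal Python sketch of a time-capped download measurement. It is not the script behind the numbers above. The helper names (`measure_stream`, `rate_kb_per_s`, `measure_url`) are made up for illustration, the 60 second cap simply mirrors the test described above, and KB is assumed to mean 1000 bytes.

```python
import time
import urllib.request


def measure_stream(chunks, time_limit_s=60.0, clock=time.monotonic):
    """Consume an iterable of byte chunks.

    Returns (bytes_read, elapsed_s, finished). `finished` is False when the
    time limit was exceeded before the stream ended.
    """
    start = clock()
    total = 0
    for chunk in chunks:
        total += len(chunk)
        if clock() - start > time_limit_s:
            return total, clock() - start, False
    return total, clock() - start, True


def rate_kb_per_s(total_bytes, elapsed_s):
    """Average rate in KB/s (assumption: 1 KB = 1000 bytes)."""
    return total_bytes / 1000 / elapsed_s if elapsed_s > 0 else float("inf")


def measure_url(url, time_limit_s=60.0, chunk_size=64 * 1024):
    """Download `url` in chunks and stop once the time limit is exceeded."""
    with urllib.request.urlopen(url, timeout=time_limit_s) as resp:
        # iter(callable, sentinel) keeps reading until read() returns b"".
        chunks = iter(lambda: resp.read(chunk_size), b"")
        return measure_stream(chunks, time_limit_s)
```

Calling `measure_url` with the release asset URL from both a TFGrid VM and a non-TFGrid machine, then passing the first two return values to `rate_kb_per_s`, gives directly comparable figures.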
Signed-by: mik-tf mik-tf@noreply.invalid
Update: the path recovered on its own around 20:28 UTC today. The same 16.6 MB asset that could not finish in 60 seconds from the sandbox tester now downloads in 0.8 seconds at about 21 MB/s, and the wedged CI runs resumed and are progressing normally (hero_skills run 1399 and hero_cockpit run 94 both moved past the previously hanging steps). Degradation window we observed: roughly 15:50 to 20:28 UTC, during which bulk transfers to TFGrid-hosted VMs ran at 140 to 310 KB/s while transfers to a non-TFGrid machine stayed fast. Leaving this open so whoever runs the Forge host can correlate that window with server, proxy, or uplink saturation; if nothing actionable turns up, fine to close as a transient incident record.
Signed-by: mik-tf mik-tf@noreply.invalid
One part of this is NOT transient and has a precise signature: pushing large docker layers to the Forge container registry fails consistently after about 6 minutes, even now that bulk download speeds are healthy again. Today's build-lab-builder runs 1394, 1396, 1398, and 1400 on lhumina_code/hero_skills all died in the Push step at 5m55s to 6m on the ~1.3 GB rust toolchain layer, while small layers in the same push complete fine (example: https://forge.ourworld.tf/lhumina_code/hero_skills/actions/runs/1400). That pattern matches a reverse proxy request/body timeout on the registry upload route rather than network congestion. The cutoff lands at about 355 to 360 s, so I would look for a limit somewhere in the 300 to 360 s range. Could whoever runs the Forge host check the proxy timeout and max body settings for the /v2/ registry endpoints? Until then the lab-builder image cannot be refreshed, so release workflows keep running an outdated prebaked lab.
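
To put numbers on the timeout hypothesis, here is a rough Python sketch of the sustained upload rate a single layer needs to finish before a fixed cutoff. It assumes the whole layer goes up in one request that the proxy cuts off after a fixed time, which is unconfirmed. The function names are made up for illustration, and the two cutoffs are the guessed ~300 s and the observed ~6 minutes.

```python
def min_rate_mb_per_s(layer_bytes, timeout_s):
    """Lowest average rate (MB/s, 1 MB = 1e6 bytes) that finishes in time."""
    return layer_bytes / 1e6 / timeout_s


def fits(layer_bytes, rate_mb_per_s, timeout_s):
    """True if the upload would complete before the cutoff at this rate."""
    return layer_bytes / (rate_mb_per_s * 1e6) <= timeout_s


LAYER_BYTES = 1.3e9  # the ~1.3 GB rust toolchain layer

for timeout_s in (300, 360):
    print(timeout_s, round(min_rate_mb_per_s(LAYER_BYTES, timeout_s), 2))
# prints:
# 300 4.33
# 360 3.61
```

Under that assumption, the layer only gets through if the runner sustains roughly 3.6 to 4.3 MB/s of upload to the registry, which would explain why small layers in the same push succeed while this one never does.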
Signed-by: mik-tf mik-tf@noreply.invalid