deploy_webgateway 300s timeout too tight for QA substrate latency #131

Closed
opened 2026-05-27 18:17:22 +00:00 by mik-tf · 1 comment
Owner

What I saw

On the QA chain (twin 703, gateway node 2, zone gent01.qa.grid.tf), ComputeService.deploy_webgateway(name, kind=Name, ...) consistently hits the 300 second inline-await timeout. Three consecutive attempts from the admin VM today each ran for about 300 seconds, and each was followed by a clean rollback of 2 orphan contracts.

Daemon log shape:

deploy_webgateway: selected gateway node name=alice-demo node_id=2 twin_id=9 zone=gent01.qa.grid.tf
deploy_webgateway: starting on-chain deploy gateway_sid=001a name=alice-demo kind=Name node_id=2 twin_id=9
... 300 seconds later ...
TFGrid webgateway deploy timed out after 300s; attempting inline orphan-contract rollback gateway_sid=001a elapsed_ms=300001 node_id=2 operator_twin=703

Rollback cancels 2 contracts cleanly. Same behaviour on three consecutive attempts (about 14 minutes total) so this is not a transient flake right now.
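
The await-then-rollback shape in the log can be sketched as follows. This is a minimal illustration using std threads and a channel as a stand-in for whatever async timeout the daemon actually uses; `deploy_with_timeout`, `DeployOutcome`, and both closures are hypothetical names, not the hero_compute code.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Hypothetical outcome type for this sketch.
#[derive(Debug, PartialEq)]
enum DeployOutcome {
    /// The deploy finished inside the window; carries whatever it returned.
    Deployed(String),
    /// The window elapsed (or the worker died); orphan contracts were cancelled.
    TimedOutRolledBack { cancelled: usize },
}

/// Runs `deploy` on a worker thread and waits at most `timeout` for its result.
/// If no result arrives in time, runs `rollback` inline and reports how many
/// contracts it cancelled.
fn deploy_with_timeout<D, R>(deploy: D, rollback: R, timeout: Duration) -> DeployOutcome
where
    D: FnOnce() -> String + Send + 'static,
    R: FnOnce() -> usize,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The send fails only if the receiver was dropped after a timeout.
        let _ = tx.send(deploy());
    });

    match rx.recv_timeout(timeout) {
        Ok(result) => DeployOutcome::Deployed(result),
        // Covers both a timeout and a worker that exited without sending.
        // Note: a timed-out worker keeps running detached in this sketch;
        // real code would also need to cancel or ignore the late deploy.
        Err(_) => DeployOutcome::TimedOutRolledBack { cancelled: rollback() },
    }
}
```

With a 300 second window and a substrate ack that takes longer than that, every call lands in the rollback arm, which matches the three attempts described above.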

Compare to two days ago

Two days ago, deploy_webgateway on the same QA chain and the same gateway node 2 succeeded in 49 seconds, so the call shape works. QA substrate finalization latency has either degraded since then or is bursty under load.

Suggested directions

  1. Bump the 300 second constant. The original sizing was against the deploy_vm reference. Gateway contracts may legitimately need more headroom on QA, and mainnet would benefit too. The substrate finalization budget is asymmetric across chains.
  2. Expose the timeout as an env var so operators can tune per deployment without a code change.
  3. Differentiate the timeout per chain. QAnet and mainnet have different SLAs. One number is unlikely to fit both.

Surfaced while wiring the deployer side of the gateway URL flow (hero_os_tfgrid_deployer@15e5473). That caller is now in place. The gateway URL becomes reachable as soon as the substrate ack fits inside the configured window on QA.

Author
Owner

Shipped at hero_compute@720eedb (https://forge.ourworld.tf/lhumina_code/hero_compute/commit/720eedb).

Replaces the hardcoded 300 second constant at both deploy_vm (rpc.rs:1357) and deploy_webgateway (rpc.rs:2248) with a single tunable read from TFGRID_DEPLOY_TIMEOUT_SECS. Default raised to 600 seconds. The operator-facing override flows through hero_proc's core/-context secret store. This matches the existing TFGRID_NODE_IDS / TFGRID_NETWORK pattern in this repo; there is no service.toml [[env]] block, per the no-env-blocks decision after the secret-store refactor. Empty, unparseable, or zero values fall back to the default to avoid an instant-timeout footgun.
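
For readers without the commit open, the fallback rule can be sketched roughly like this. Only the variable name TFGRID_DEPLOY_TIMEOUT_SECS, the 600 second default, and the helper name parse_deploy_timeout_secs come from the comment above; the signature, the constant name, the `deploy_timeout` wrapper, and reading via `std::env::var` are assumptions made for illustration.

```rust
use std::time::Duration;

/// Default from the comment above; the constant name is an assumption.
const DEFAULT_DEPLOY_TIMEOUT_SECS: u64 = 600;

/// Pure parse helper (hypothetical signature): takes the raw value, if any,
/// and returns the effective timeout in seconds.
/// Unset, empty, whitespace-only, unparseable, negative, and zero inputs
/// all fall back to the default.
fn parse_deploy_timeout_secs(raw: Option<&str>) -> u64 {
    raw.map(str::trim) // assumption: surrounding whitespace is tolerated
        .and_then(|s| s.parse::<u64>().ok()) // "-5" and "abc" fail to parse as u64
        .filter(|&secs| secs > 0) // zero would mean an instant timeout
        .unwrap_or(DEFAULT_DEPLOY_TIMEOUT_SECS)
}

/// Hypothetical wrapper: reads the tunable from the process environment.
fn deploy_timeout() -> Duration {
    let raw = std::env::var("TFGRID_DEPLOY_TIMEOUT_SECS").ok();
    Duration::from_secs(parse_deploy_timeout_secs(raw.as_deref()))
}
```

Keeping the parse step free of environment access is what makes it testable as a pure function: each case is a plain call such as `parse_deploy_timeout_secs(Some("0"))`.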

Tests: 53 library tests pass, of which 5 are new for the pure parse helper (parse_deploy_timeout_secs) covering unset, empty, whitespace, unparseable, negative, zero, and positive integer cases. 16 integration tests pass. Pre-merge gate clean (fmt on the touched region, clippy --workspace --all-targets -- -D warnings, workspace release build, --info smoke on the built binary).

Closes this issue. The per-chain differentiation suggested in option 3 of the original report can layer on top if it turns out useful, but a single tunable with a roomier default is enough to unblock the home#238 admin/tester arc.
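
If per-chain differentiation does turn out to be useful, one way it could layer on top is a per-network override that takes precedence over the global tunable. Everything here is hypothetical: the `TFGRID_DEPLOY_TIMEOUT_SECS_<NETWORK>` naming, the function names, and the precedence order are illustrative assumptions, and nothing like this exists in the shipped commit as far as this thread shows.

```rust
/// Default from the shipped change; the constant name is an assumption.
const DEFAULT_DEPLOY_TIMEOUT_SECS: u64 = 600;

/// Accepts only strictly positive integers; anything else counts as "not set".
fn positive_secs(raw: Option<&str>) -> Option<u64> {
    raw.and_then(|s| s.trim().parse::<u64>().ok())
        .filter(|&secs| secs > 0)
}

/// Hypothetical per-network variable name, e.g. "qa" -> "..._QA".
fn per_chain_var_name(network: &str) -> String {
    format!("TFGRID_DEPLOY_TIMEOUT_SECS_{}", network.to_ascii_uppercase())
}

/// Precedence: per-chain override, then global override, then the default.
fn resolve_deploy_timeout_secs(per_chain: Option<&str>, global: Option<&str>) -> u64 {
    positive_secs(per_chain)
        .or_else(|| positive_secs(global))
        .unwrap_or(DEFAULT_DEPLOY_TIMEOUT_SECS)
}
```

An invalid per-chain value falls through to the global one rather than to the default, so a typo in one override does not silently discard the other.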

Reference
lhumina_code/hero_compute#131