deploy_webgateway 300s timeout too tight for QA substrate latency #131
lhumina_code/hero_compute#131
What I saw
On the QA chain (twin 703, gateway node 2, zone gent01.qa.grid.tf), ComputeService.deploy_webgateway(name, kind=Name, ...) consistently hits the 300 second inline-await timeout. Three consecutive attempts from the admin VM today each ran for about 300 seconds, and each was followed by a clean rollback of 2 orphan contracts.
Daemon log shape:
Rollback cancels 2 contracts cleanly. Same behaviour on three consecutive attempts (about 14 minutes total) so this is not a transient flake right now.
Compare to two days ago
Two days ago, deploy_webgateway on the same QA chain and the same gateway node 2 succeeded in 49 seconds. So the call shape works. QA substrate finalization latency has degraded since then, or is bursty under load.
Suggested directions
Surfaced while wiring the deployer side of the gateway URL flow (hero_os_tfgrid_deployer@15e5473). That caller is now in place. The gateway URL becomes reachable as soon as the substrate ack fits inside the configured window on QA.
Shipped at hero_compute@720eedb.
Replaces the hardcoded 300 second constant at both deploy_vm (rpc.rs:1357) and deploy_webgateway (rpc.rs:2248) with a single tunable read from TFGRID_DEPLOY_TIMEOUT_SECS. Default raised to 600 seconds. The operator-facing override flows through hero_proc's core/-context secret store (matches the existing TFGRID_NODE_IDS / TFGRID_NETWORK pattern in this repo, no service.toml [[env]] block per the no-env-blocks decision after the secret-store refactor). Empty / unparseable / zero values fall back to the default to avoid an instant-timeout footgun.
Tests: 53 library pass, of which 5 are new for the pure parse helper (parse_deploy_timeout_secs) covering unset, empty, whitespace, unparseable, negative, zero, and positive integer cases. 16 integration pass. Pre-merge gate clean (fmt on the touched region, clippy --workspace --all-targets -- -D warnings, workspace release build, --info smoke on the built binary).
Closes this issue. The per-chain differentiation suggested in option 3 of the original report can layer on top if it turns out useful, but a single tunable with a roomier default is enough to unblock the home#238 admin/tester arc.
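For readers without the diff open, the fallback rule can be sketched roughly as below. This is a minimal illustration, not the code shipped at hero_compute@720eedb: the signature taking Option<&str>, the constant name DEFAULT_DEPLOY_TIMEOUT_SECS, the deploy_timeout wrapper, and the choice to trim surrounding whitespace are all assumptions made for the example. Only the helper name, the 600 second default, and the fall-back-on-empty/unparseable/zero behaviour come from the comment above. The lookup through the secret store is deliberately left out, since its API is not shown here.

```rust
use std::time::Duration;

/// Default inline-await window in seconds (600, per the comment above).
/// The constant name is hypothetical.
const DEFAULT_DEPLOY_TIMEOUT_SECS: u64 = 600;

/// Pure helper: map an optional raw setting to a timeout in seconds.
/// Unset, empty, whitespace-only, unparseable, negative, and zero inputs
/// all fall back to the default, so a bad value never yields an instant timeout.
/// Assumption: surrounding whitespace is trimmed before parsing.
fn parse_deploy_timeout_secs(raw: Option<&str>) -> u64 {
    raw.map(str::trim)
        // u64 parsing rejects "", "abc", and "-5" alike.
        .and_then(|s| s.parse::<u64>().ok())
        // Zero would time out immediately, so treat it as invalid.
        .filter(|&secs| secs > 0)
        .unwrap_or(DEFAULT_DEPLOY_TIMEOUT_SECS)
}

/// Hypothetical wrapper: the value a deploy call would hand to its timeout.
fn deploy_timeout(raw: Option<&str>) -> Duration {
    Duration::from_secs(parse_deploy_timeout_secs(raw))
}
```

Keeping the parse step free of I/O is what makes the seven listed cases cheap to unit test: the caller fetches the raw TFGRID_DEPLOY_TIMEOUT_SECS string however it likes and passes it in.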