Shared (non dedicated) node support for tester VMs: badge, per network placement policy, drift detection, orphan contract reaper #24

Open
opened 2026-06-12 15:08:41 +00:00 by mik-tf · 2 comments
Owner

In practice the deployer already places tester VMs on shared grid nodes: QA node 2 hosts two live testers today with no rent contract, since shared nodes accept per-deployment contracts from anyone. The UI and the placement model, however, assume every managed node is a dedicated rented node, and the Nodes page labels everything as dedicated. Proposal: show a Dedicated or Shared badge per node based on the live rent status; add a per-network placement policy (shared allowed on QA and testnet; mainnet dedicated only unless a shared node is explicitly opted in, because shared mainnet deployments spend real TFT per deployment and their capacity can be taken by other users at any time); and wire the existing Find Nodes search so that adding a found node registers it as shared.
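A minimal sketch of the badge and the per-network policy described above, in Python for illustration only. The `Node` type, its fields, and the function names are hypothetical and do not come from the deployer's code; the network names and the opt-in rule follow the proposal text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    node_id: int
    network: str                 # "qa", "testnet" or "mainnet"
    rented_by_us: bool           # hypothetical: live rent status read from chain
    shared_opt_in: bool = False  # hypothetical: explicit operator opt-in


def badge(node: Node) -> str:
    """Label derived from live rent status, not from a stored flag."""
    return "Dedicated" if node.rented_by_us else "Shared"


def placement_allowed(node: Node) -> bool:
    """Per-network policy: shared is fine on QA and testnet, opt-in on mainnet."""
    if node.rented_by_us:
        return True
    if node.network in ("qa", "testnet"):
        return True
    if node.network == "mainnet":
        return node.shared_opt_in
    return False  # unknown network: refuse rather than guess
```

Under these assumptions, a shared mainnet node is refused until `shared_opt_in` is set, while a shared QA node is accepted as-is.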

Two maintenance features belong with this: the Nodes page should flag registry entries that drift from the chain (for example, a node we no longer rent), and the deployer should periodically diff its twin's on-chain contracts against the VMs and gateways it tracks and offer to cancel orphans. While auditing this we found about twenty orphaned gateway and name contracts from old provisioning attempts billing the ops wallet on mainnet; they were cancelled manually today.
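The orphan check is essentially a set difference between what the twin owns on chain and what the deployer tracks. A rough sketch, assuming the contract list has already been fetched; the `Contract` type and `find_orphans` are hypothetical names, and the `kind` values are illustrative:

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class Contract:
    contract_id: int
    kind: str  # illustrative, e.g. "node", "name" or "rent"


def find_orphans(on_chain: Iterable[Contract], tracked_ids: Set[int]) -> List[Contract]:
    """Contracts owned by our twin that no tracked VM or gateway refers to.

    The result is only a proposal for the operator to review; nothing
    is cancelled here.
    """
    orphans = [c for c in on_chain if c.contract_id not in tracked_ids]
    return sorted(orphans, key=lambda c: c.contract_id)
```

Returning a list instead of cancelling directly matches the "offer to cancel" wording: the operator confirms before anything is removed.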

Not scheduled for now. The current focus is proving the tester onboarding flow end to end on the two-node setup (QA node 5 and mainnet node 8072).

Author
Owner

Bumping this with findings from the zos-light work. This is no longer just a UI and maintenance item; it is the unlock for using light nodes at all. Across mainnet and QA there is exactly one live zos-light node per network (mainnet 8072, QA 62) and both are non-dedicated; every dedicated zos-light node on the grid is registered as rentable but is actually dead (offline for hours to months). So we cannot prove or use light deployment on real hardware without shared-node support, unless a farmer brings a dedicated light node online, which we do not control (a rented standby node only wakes if the farm runs a healthy farmerbot, and the one we tried never woke).

One requirement to add to the placement work here: filter candidate nodes by liveness (healthy and a recent last-seen timestamp), not by the rentable or status flags. A dead node still advertises rentable=true; we rented one this week that had not been seen in about seven months and it never came online.

Good news for scope: the daemon already deploys on a shared node without a rent contract (confirmed again this week by registering the live shared node and getting a real on-chain deployment). The gap is therefore the surrounding model (shared-aware capacity from live free resources rather than a fixed exclusive catalog, skipping the rent step, contention handling) and not the deploy itself. Doing this also gives us a live light node to finish debugging the light deploy, which currently times out waiting for the network workload; see lhumina_code/hero_compute#135. The generation-aware selection layer for that already landed on hero_compute integration.
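The liveness rule can be sketched as a simple predicate. This is an illustration only: the `Candidate` fields and the 30-minute freshness window are assumptions, not values from the deployer or the grid.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List


@dataclass(frozen=True)
class Candidate:
    node_id: int
    healthy: bool
    last_seen: datetime
    rentable: bool  # deliberately ignored below


MAX_AGE = timedelta(minutes=30)  # assumed freshness window, tune as needed


def is_live(c: Candidate, now: datetime, max_age: timedelta = MAX_AGE) -> bool:
    """Live means healthy AND seen recently; the rentable flag plays no part."""
    return c.healthy and (now - c.last_seen) <= max_age


def live_candidates(cands: Iterable[Candidate], now: datetime) -> List[Candidate]:
    return [c for c in cands if is_live(c, now)]
```

With this predicate, a node that advertises `rentable=True` but was last seen months ago is dropped before placement ever considers it.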

Author
Owner

Plan locked with the operator. Sequenced across a few sessions so each one lands something proven.

This session: shared node support, proven on standard (classic) zos where deploys already work, kept separate from the light deploy work.

  1. Placement filters candidate nodes by liveness (healthy and recently seen), not by the rentable or status flags. A dead node still advertises rentable=true; we recently rented one that never woke, so liveness is the load-bearing rule.
  2. Auto placement prefers dedicated nodes: we pay for the whole rented node, so testers should fill those first and overflow to shared nodes only when dedicated capacity is full. A master switch keeps shared placement off until enabled. The per-network policy is unchanged: shared allowed on QA and testnet, mainnet dedicated only unless a shared node is explicitly opted in.
  3. The skip-rent path for shared nodes becomes a real supported path. Shared capacity is read live from the node's free resources and treated as best effort, with per-node serialization and a capacity recheck at deploy time so two concurrent provisions cannot both think they fit.
  4. Nodes page reframed from dedicated only to one table with a Dedicated or Shared badge, a liveness column that greys out dead nodes, and capacity labelled by type (reserved room on dedicated; approximate free and not reserved on shared). Adopt actions split into "Rent and adopt" versus "Use shared", each with a one-line consequence.
  5. Gateway name availability. If the web address field is left blank we auto-pick a unique name and silently add a suffix if it ever collides. If a custom name is typed and it is already taken on chain, we say so immediately and block the deploy with a "choose another name" message, rather than silently changing what was asked for. The daemon already has the on-chain name lookup; we expose it so the form can check the name, and keep the authoritative check at deploy time so a race cannot strand a VM with no URL.
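Items 2 and 3 above can be sketched together: order candidates dedicated first, then reserve under a per-node lock with a recheck. This is a sketch under stated assumptions, not the daemon's code: `ManagedNode`, `placement_order`, `try_reserve` and `place` are hypothetical names, and capacity is reduced to memory only.

```python
import threading
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ManagedNode:
    node_id: int
    dedicated: bool
    free_mem_mb: int  # for shared nodes this would be re-read live from the node
    lock: threading.Lock = field(default_factory=threading.Lock)


def placement_order(nodes: List[ManagedNode], shared_enabled: bool) -> List[ManagedNode]:
    """Dedicated nodes first; shared only as overflow and only if the switch is on."""
    dedicated = [n for n in nodes if n.dedicated]
    shared = [n for n in nodes if not n.dedicated] if shared_enabled else []
    return dedicated + shared


def try_reserve(node: ManagedNode, mem_mb: int) -> bool:
    """Recheck capacity while holding the node's lock, then reserve."""
    with node.lock:
        if node.free_mem_mb < mem_mb:
            return False
        node.free_mem_mb -= mem_mb
        return True


def place(nodes: List[ManagedNode], mem_mb: int, shared_enabled: bool = False) -> Optional[int]:
    """Return the chosen node id, or None if nothing fits under the policy."""
    for node in placement_order(nodes, shared_enabled):
        if try_reserve(node, mem_mb):
            return node.node_id
    return None
```

Because the check and the decrement happen under the same lock, two concurrent provisions on one node cannot both pass the capacity test. On a shared node the reservation is still best effort, since other grid users can consume the capacity outside this process.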

Also folded in from this issue: the Dedicated or Shared badge, per network placement policy, registry drift detection, and the orphan contract reaper.
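Registry drift detection can be illustrated as a comparison between what the registry claims and what the chain reports. The data shapes here are assumptions: a registry mapping node id to `"dedicated"` or `"shared"`, and a set of node ids our twin currently rents on chain.

```python
from typing import Dict, Set


def detect_drift(registry: Dict[int, str], rented_on_chain: Set[int]) -> Dict[int, str]:
    """Return node_id -> human-readable reason for every entry that disagrees with chain."""
    drift: Dict[int, str] = {}
    for node_id, kind in registry.items():
        rented = node_id in rented_on_chain
        if kind == "dedicated" and not rented:
            drift[node_id] = "registered as dedicated but no rent contract on chain"
        elif kind == "shared" and rented:
            drift[node_id] = "registered as shared but rented by our twin"
    return drift
```

The Nodes page could then show the reason string next to the flagged row.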

Next session: finish the zos light deploy. With shared support giving a stable live light node to test against, we diff our light workload against what the dashboard submits and add path and timing logging plus a clear deploy error message map. Tracked at lhumina_code/hero_compute#135.

Later: a pre-warm pool, so that a node already has Hero and a VM deployed and onboarding only sets the user login and gateway name. This makes onboarding much faster and moves the fresh VM network lag off the critical path. The gateway name resolution built this session is kept in one place so the pre-warm flow can reuse it unchanged. Tracked at lhumina_code/home#266.
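A single-function sketch of the gateway name rule, kept in one place so both flows could call it. All names are hypothetical: `is_taken` stands in for the daemon's on-chain name lookup, and the random hex suffix is just one possible way to make a generated name unique.

```python
import secrets
from typing import Callable, Optional


class NameTaken(Exception):
    """Raised when no acceptable gateway name can be returned."""


def resolve_gateway_name(
    requested: Optional[str],
    default_base: str,
    is_taken: Callable[[str], bool],
    max_tries: int = 5,
) -> str:
    # A typed name is never altered: it is either free or the deploy is blocked.
    if requested:
        if is_taken(requested):
            raise NameTaken(f"'{requested}' is already taken, choose another name")
        return requested
    # A blank field gets a generated name, with a silent suffix on collision.
    candidates = [default_base] + [
        f"{default_base}{secrets.token_hex(2)}" for _ in range(max_tries)
    ]
    for name in candidates:
        if not is_taken(name):
            return name
    raise NameTaken("could not find a free generated name")
```

The form can call this for early feedback, and the deploy path calls it again as the authoritative check, so a name claimed in between still results in a clear error rather than a VM without a URL.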

Signed-by: mik-tf <mik-tf@noreply.invalid>

Reference
lhumina_code/hero_os_tfgrid_deployer#24