[placement] Node feature set (zos light vs classic) is invisible to placement; light-only node accepted then rejects every deploy #25
Provisioning a tester VM onto mainnet node 8072 fails after contract submission. The node advertises only the light workload feature set (zmachine-light, network-light, per Grid Proxy /nodes/8072), while the compute daemon deploys classic network and zmachine workloads. The ZOS daemon therefore rejects the deployment, and the network contract is left for manual cancellation (contract 2097718, cancelled). The node registry and capacity board show such a node as healthy and fitting, so placement selects a node that can never run our workloads. Suggested fix: have the placement/registration guard also compare the node's advertised features against the workload types the daemon deploys, refuse registration or placement with a clear message when they do not match, and surface the node generation on the Nodes page. Until then, the rented mainnet node 8072 cannot serve tester VMs; we need either a mainnet node with the classic feature set or light workload support in the compute daemon.
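A minimal sketch of the guard suggested above, to make the intended check concrete. It is not the deployer's actual code: `Node`, `IncompatibleNodeError`, `check_placement`, and the `features` field are hypothetical names, and the sketch assumes the advertised feature names have already been fetched (for example from Grid Proxy) into a set of strings.

```python
from dataclasses import dataclass

# Workload types the compute daemon deploys today, per the report above.
DAEMON_WORKLOADS = frozenset({"zmachine", "network"})


@dataclass(frozen=True)
class Node:
    """Hypothetical view of a registered node."""
    node_id: int
    features: frozenset  # assumed: feature names the node advertises


class IncompatibleNodeError(Exception):
    """Raised when a node cannot run the workload types the daemon deploys."""


def check_placement(node, required=DAEMON_WORKLOADS):
    """Refuse a node that does not advertise every required workload type."""
    missing = sorted(required - node.features)
    if missing:
        advertised = ", ".join(sorted(node.features)) or "nothing"
        raise IncompatibleNodeError(
            f"node {node.node_id} does not advertise {', '.join(missing)}; "
            f"it advertises: {advertised}"
        )


# Usage: a light-only node such as 8072 is refused before any contract exists.
light_only = Node(8072, frozenset({"zmachine-light", "network-light"}))
try:
    check_placement(light_only)
except IncompatibleNodeError as err:
    print(f"placement refused: {err}")
```

Running the check at registration or placement time, before contract submission, would avoid leaving a network contract behind for manual cancellation.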
Important reframe after talking to ops (projectmycelium/circle_ops#837): light nodes are not an anomaly to avoid, they are the direction. New ops-provisioned nodes for the sandbox are v3light on purpose, and a dashboard deployment on node 8072 succeeds because the dashboard submits the light workload types. So the resolution here is not "prefer classic nodes": the compute daemon needs to submit zmachine-light and network-light on nodes that advertise the light feature set. The feature-aware placement check described above is still wanted, but as a compatibility router (pick the workload family per node) rather than only a guard that refuses light nodes.
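The compatibility router described above could look roughly like the sketch below. It is illustrative only: `WORKLOAD_FAMILIES` and `pick_workload_family` are hypothetical names, the feature strings are the ones quoted in this issue, and the preference order for a node that advertises both families is an assumption, not a decision made in this thread.

```python
# Workload families keyed by name. Dict order doubles as preference order
# for nodes that advertise both families (an assumption for this sketch).
WORKLOAD_FAMILIES = {
    "classic": frozenset({"zmachine", "network"}),
    "light": frozenset({"zmachine-light", "network-light"}),
}


def pick_workload_family(advertised_features):
    """Return the first workload family the node fully supports, or None."""
    advertised = set(advertised_features)
    for family, required in WORKLOAD_FAMILIES.items():
        if required <= advertised:  # every required type is advertised
            return family
    return None


# Usage: a light-only node routes to the light family instead of being refused.
family = pick_workload_family(["zmachine-light", "network-light"])
print(family)  # light
```

With this shape, a `None` result is the only case where placement refuses a node; every other node is routed to a workload family it can actually run.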
Folded into the compute daemon issue at lhumina_code/hero_compute#135. The right fix is for the daemon to deploy on any node generation: detect the node's advertised features and use the light workload path on light-only nodes. Once that lands, deployer placement no longer needs to block these nodes and only has to surface the node generation on the Nodes page. Closing here in favor of that issue.