deploy_vm consistently rejected at ZOS workload phase on FreeFarm mainnet (regression since 2026-05-23) #125
ComputeService.deploy_vm against TFGrid mainnet consistently fails with the ZOS-side error "vm deployment entered error state", independent of the chosen node or image. The last known good run was the deploy_vm round trip on 2026-05-23 16:25 UTC (vm sid 000t on tfnode-12, twin 6905). Today (2026-05-24) every deploy attempt fails at the ZOS workload phase even though TFChain accepts both contracts and Grid Proxy shows ample free capacity.

Reproduction
Workstation operator: hero_compute self-hosted from the core/TFGRID_MNEMONIC (twin 6905) on TFGrid mainnet. Stack: hero_router + hero_tfgrid_deployer_server + my_compute_zos_server, all freshly built from origin/development HEAD as of this filing. cargo test --workspace is clean.

deployer.provision_vm (which calls ComputeService.deploy_vm internally) was tried with:
- Nodes: tfnode-1 (sid 000u), tfnode-12 (sid 000q)
- Images: ubuntu-22.04 (deployer default), Ubuntu 24.04 (catalog default), https://hub.grid.tf/tf-official-vms/ubuntu-24.04-latest.flist (explicit flist URL)

All four attempts produced the same "vm deployment entered error state" error.
Timing: each attempt creates 2 contracts on TFChain (network + VM, ~30s apart), then ZOS rejects the workload ~30-60s later. Total elapsed: ~60-90s.
Grid state checks
- status=up, total MRU 188 GB, used 73 GB, free 115.6 GB.
- All 26 slices status=free per ComputeService.list_slices.
- my_compute_zos_server --cancel-contracts <vm> <net> (4 pairs cleaned this session: 2095131, 2095132, 2095133, 2095134, 2095135, 2095136, 2095137, 2095138).

Hypothesis
The "vm deployment entered error state" text originates from tfgrid_sdk_rust's GridClient when ZOS rejects the workload on the daemon side. Possible causes:
- the SDK pin b8774c34 is now incompatible with the current TFChain or ZOS protocol

What was verified working today: list_nodes, list_slices, list_images, node_register, node_status, deploy_vm filter / param validation (everything pre-contract-submission). Contract creation on TFChain also works (we created 8 contracts, all reached state Created, and we then cancelled all 8). So the operator wallet, network connectivity to TFChain, and the GridClient are all healthy. The break is specifically at the post-contract ZOS workload stage.
Asks
zoscompute.gent01.qa.grid.tf topology) to confirm whether mainnet is uniquely affected.

This is the gating issue for the demo-deployer arc closure: home#235 closure depends on at least one full successful round trip through deployer.provision_vm.
Investigation update from a 3-deploy probe at s157a close (2026-05-24 22:30-22:40Z), now that hero_compute has the deploy_vm Err-path orphan rollback landed at 8be3294 (closes #119 reopened):

1. Repro is NOT node-specific. Two consecutive provision_vm attempts via hero_tfgrid_deployer_server against FreeFarm node 1 (tfnode-1, sid 000u) AND one attempt against FreeFarm node 12 (tfnode-12, sid 000q) all failed identically at the ZOS workload phase, ~110-115s wall clock per attempt. Both nodes report status=online on Grid Proxy. The error shape is bit-for-bit identical across nodes. Six contracts were created on twin 6905 (2095139+2095140 on node 1, 2095141+2095142 on node 1 with patched SDK, 2095143+2095144 on node 12). All six were auto-cancelled by the new s157a rollback within ~6s of the SDK Err.
2. The SDK is not hiding the reason; ZOS itself provides none. We patched tfgrid_sdk_rust/src/grid_client/mod.rs at lines 1063-1068 and 1219-1224 locally to surface vm_workload.result.error: String (the field is already there in zos::ResultData, just discarded by GridError::backend("vm deployment entered error state")). With the patched binary, the surfaced error has the format workload_name=error_string, and the error string is empty. ZOS marks the VM workload STATE_ERROR but writes no message into result.error. This holds across all three deploys on both nodes.

3. Only the VM workload errors; the network workload provisions OK. Every failed deploy minted exactly two contracts on chain: one network contract (e.g. 2095139, deployment_data.type=network, name=rust_net_<ts>) and one VM contract (e.g. 2095140, deployment_data.type=vm, name=001t). The vm_changes.iter().filter(state==Error) in the SDK matches only the VM workload, never the network workload. So the substrate side and the network workload are healthy; the ZOS daemon rejects the VM workload specifically, and silently.

4. Window of regression. s149 (2026-05-23) successfully deployed and round-tripped a VM on node 12 with the same flow. s156 (2026-05-24 ~16:44Z), s157 (2026-05-24 ~20:00Z), and now s157a (2026-05-24 ~22:30Z) all fail with the identical opaque pattern. The regression window is approximately 24h, on the TFGrid side (we ruled out our flist URL form, image variant, and SDK rev; the SDK is on the same b8774c34 mainnet pin throughout).

5. Action items.
a. We cannot diagnose further from off-node: result.error is empty, so the SDK has no more to give. The next step needs node-side ZOS logs on FreeFarm node 1 OR node 12 around the workload errors (e.g. zinit log zos.boot, journalctl -u zos, or the ZOS state.json for the failed deployments). If anyone has node-side access on FreeFarm, the workloads we just tried are: contract 2095140 (vm 001t on node 1), 2095142 (vm 001t on node 1), 2095144 (vm 001u on node 12), all under twin 6905, all in state=Deleted now after rollback.

b. The SDK-side gap (result.error discarded) is worth a small upstream PR to threefoldtech/tfgrid-sdk-rust so future investigations don't need a local fork. The diff is ~10 lines split across two near-identical spots in grid_client/mod.rs. It is not needed for #125 itself (the error is empty), but it is useful infrastructure for any future "what did ZOS reject" debugging.

c. With s157a's rollback landed, this failure mode no longer leaks contracts: the daemon cleans up automatically on every Err. The bug remains gating in that no real VM can be provisioned on mainnet right now, but the cost is now bounded to substrate fees on the (instantly cancelled) network+vm contracts per attempt.
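A minimal sketch of the SDK-side change described in items 2 and 5b, under stated assumptions. The types below are simplified, hypothetical stand-ins (the real zos::ResultData has more fields), and the message prefix is an assumption; the only details taken from this thread are that the workload result carries an error: String and that the patched output uses the workload_name=error_string format.

```rust
/// Hypothetical stand-ins for the SDK's workload types.
#[derive(Debug, Clone, PartialEq)]
pub enum WorkloadState {
    Init,
    Ok,
    Error,
}

#[derive(Debug, Clone)]
pub struct WorkloadResult {
    pub state: WorkloadState,
    /// Whatever ZOS wrote as the failure reason (may be empty).
    pub error: String,
}

#[derive(Debug, Clone)]
pub struct Workload {
    pub name: String,
    pub result: WorkloadResult,
}

/// Build an error message that keeps `result.error` instead of discarding it.
/// Returns `None` when no workload is in the error state.
pub fn describe_errored_workloads(workloads: &[Workload]) -> Option<String> {
    let details: Vec<String> = workloads
        .iter()
        .filter(|w| w.result.state == WorkloadState::Error)
        .map(|w| format!("{}={}", w.name, w.result.error))
        .collect();
    if details.is_empty() {
        None
    } else {
        Some(format!(
            "vm deployment entered error state: {}",
            details.join(", ")
        ))
    }
}
```

With an empty result.error, as observed on both nodes, the message would end in a bare "001t=", which is why the patch alone could not explain the rejection.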
Reproduction inputs (in case someone wants to retry):
- provision_vm via deployer (port 9988 → /hero_tfgrid_deployer/rpc) with node_sid=000u or 000q, default image ubuntu-22.04, default slice_count 2
- hero_compute build at 8be3294 (origin/development)

ROOT CAUSE IDENTIFIED: rootfs_size_bytes massively overflows SSD capacity
Following the user's tip ("hero_demo with OpenTofu works"), I dug into the differential between our SDK path and the canonical Go/Terraform path. The bug is on OUR side, not TFGrid.
The smoking gun: the slice catalog on FreeFarm node 1 (sid 000u) right now reports disk_gb=5128 per slice (24 slices total). Node 12 reports disk_gb=182926 per slice (2 slices total). For a default 2-slice deploy the resulting rootfs_size_bytes is ~10 TB on node 1 and ~365 TB on node 12. ZOS rejects silently with state=Error and an empty error string because rootfs is SSD-backed and the request is orders of magnitude larger than the available SSD.

The bug chain:
- crates/my_compute_zos_server/src/cloud/node_capacity.rs:381 computes total_disk_gb from SSD plus HDD combined. For node 1, Grid Proxy reports sru=1863 GB and hru=134112 GB (134 TB of spinning disk), so total_disk_gb = ~136,000 GB.
- cloud/node_capacity.rs:171 size_node_catalog then divides that by slice_count with headroom, yielding ~5128 GB per slice on a 24-slice node 1 and ~182926 GB per slice on a 2-slice node 12.
- cloud/rpc.rs:1120 then sums those at deploy time, and cloud/rpc.rs:1253 passes that sum to the SDK as rootfs bytes.
- The Rust SDK serializes this as the workload's "size" field in the zmachine workload data. ZOS allocates rootfs from SSD only. The request is for ~10 TB of rootfs against ~1.3 TB of free SSD, so ZOS bails before the VM ever boots and sets state=Error.
- The Rust SDK's result.error reads empty (confirmed via the local SDK patch surfacing vm_workload.result.error: String in grid_client/mod.rs:1063-1068). ZOS does not currently write a reason into the workload result for this failure mode.

Why s149 worked and s156+ does not: most likely the s149 node had nominal HDD (or none), so the math produced a sane rootfs size. The first FreeFarm node with substantial HDD that came online and got registered tipped the catalog past what SSD can satisfy. Both currently-registered nodes (1 + 12) have substantial HDD, so both fail identically.
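To make the arithmetic in the bug chain concrete, here is a minimal sketch. The function names, the 1.10 headroom divisor, and the use of decimal gigabytes are assumptions for illustration; the actual code at node_capacity.rs:381, node_capacity.rs:171 and rpc.rs:1120 is not reproduced in this thread, so the result lands near, not exactly on, the reported 5128 GB per slice.

```rust
/// Assumed from the "110% headroom" figure mentioned in this thread.
const HEADROOM: f64 = 1.10;
/// Assumption: decimal gigabytes. The real code may use GiB.
const BYTES_PER_GB: u64 = 1_000_000_000;

/// Per-slice disk in GB when the catalog is sized from `total_disk_gb`.
/// Hypothetical stand-in for the division done in size_node_catalog.
pub fn per_slice_disk_gb(total_disk_gb: u64, slice_count: u64) -> u64 {
    if slice_count == 0 {
        return 0;
    }
    (total_disk_gb as f64 / (slice_count as f64 * HEADROOM)) as u64
}

/// Rootfs request in bytes for a deploy that spans `slices` slices.
pub fn rootfs_size_bytes(per_slice_gb: u64, slices: u64) -> u64 {
    per_slice_gb * slices * BYTES_PER_GB
}

/// True when the rootfs request cannot fit in the node's free SSD.
pub fn overflows_ssd(rootfs_bytes: u64, free_ssd_gb: u64) -> bool {
    rootfs_bytes > free_ssd_gb * BYTES_PER_GB
}
```

With the node 1 figures (sru=1863, hru=134112, 24 slices), sizing from the combined total gives roughly 5 TB per slice and a 2-slice rootfs request of about 10 TB, far beyond ~1.3 TB of free SSD. For comparison, sizing from sru alone gives about 70 GB per slice, which fits.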
Why OpenTofu (hero_demo/deploy/single-vm/tf/main.tf) works: the OpenTofu config sets rootfs_size = 2048 MB (or 16384 MB for nu-shell setups) by hand, and the data disk is a separate workload mounted at /data. The rootfs is sized to fit on SSD, and the data disk can land on HDD via the mount. The Rust path conflates rootfs and total disk into one field, then sources both from sru+hru.

Proposed fix (small, structural):
a. node_capacity.rs:381 should size total_disk_gb from sru only, not sru+hru. HDD is irrelevant to rootfs and should not feed the slice catalog's disk dimension. With this fix, node 1 yields ~70 GB rootfs per slice (1863 GB SSD across 24 slices with 110% headroom), which is healthy.

b. Optional but desirable: have cloud/rpc.rs:deploy_vm cap rootfs at a fixed sane value (e.g., 20-50 GB) and pass any larger disk request through a separate volumes field on VmSpec (which the SDK supports: VmSpec.volumes: Vec<VolumeMountSpec>). This mirrors the Go SDK / Terraform pattern of disks + mounts and makes rootfs vs data-disk semantics explicit.

c. Until either lands: existing registrations on nodes with substantial HDD must be reset.
ComputeService.node_unregister + node_register for nodes 1 and 12 will re-size the catalog after the node_capacity fix lands.

Triangulation done in this session (3 live deploys today, all auto-rolled-back by s157a 8be3294, zero residue on Grid Proxy):
- ubuntu-22.04 image, node 1 → state=Error, empty SDK error, 2 orphans cancelled (2095139, 2095140)
- patched SDK surfacing vm_workload.result.error, node 1 → state=Error, error string EMPTY, 2 orphans cancelled (2095141, 2095142)
- https://hub.grid.tf/tf-official-vms/ubuntu-24.04-latest.flist image, node 1 → identical failure mode. Flist URL form is not the cause.

The auto-rollback (grid_driver::scan_orphan_contracts_since + cancel_one_on_tfgrid) worked four-for-four across both nodes, on every Err path. Zero new state=Created contracts under twin 6905 since session start.

Next: implement (a) above on a small development_mik branch, re-register nodes 1 and 12, retry one deploy, and expect Running state with mycelium_ip populated. Will pick this up in s157b after a session boundary so the rollback fix lands clean first.

s157c update: 2 SDK challenge bugs found + failure-mode shift
Context for Mahmoud: s157c picked this up overnight. Three things you need to know.
1. We can rent dedicated nodes from the twin 14199 ops wallet, confirmed with a successful rent_node(3467) on farm 646 (JimboTFT, Canada). Discovery: the substrate gate for public rent is extraFee > 0 on the node (not rentable: True alone; that flag alone fires OnlyTwinAdminCanDeploy). FreeFarm rentable big-class nodes (2010-2025) are all status: down right now (~25 minute timing observed); that's why the FreeFarm dedicated path looked open at s157b /stop and was closed by the time s157c started.

2. Wire-payload diff against tfgrid-sdk-go found two workload_challenge bugs for ZMACHINE_TYPE in tfgrid-sdk-rust HEAD b8774c34, the same class of bug as commit 74d9ed2 (fix(gateway): correct field order in workload_challenge to match ZOS) that just landed for gateway workloads:
- MachineNetwork field-order swap: Rust emits Interfaces before Mycelium in the challenge; Go SDK struct order is PublicIP → Planetary → Mycelium → Interfaces. Swapping the order in tfgrid-sdk-rust/src/grid_client/deployment.rs:680-683 is the primary fix.
- corex field: the ZMachineData struct includes corex: bool but the challenge code does NOT include it. The Go SDK struct has Corex between Env and GPU. Insert at deployment.rs:702 to match.

3. After applying both patches against the rented node, the failure mode changed: from "vm deployment entered error state [deploy-phase=zos-workload]" in ~110s (ZOS signature reject) to "TFGrid deploy timed out after 300s" (substrate-accept, ZOS provisioning, just not Running by the D-27 5-min timeout). Bumping the timeout to 900s now; this looks like the patches are correct and the VM needs more time to come up on a fresh node.

Direct ask: Can you confirm those two challenge fixes look right? And: what changed on mainnet ZOS between 2026-05-23 and 2026-05-24? Was it a coordinated SDK + ZOS protocol bump where these challenge fields were added? Will continue probing while you sleep; if we land a successful deploy we will report back here.
State preserved on TFChain: twin 14199 baseline = 1 active RentContract on node 3467 (contract_id 2095159, billing hourly), all probe-orphan node contracts auto-cancelled by the s157a rollback. Twin 6905 treasury untouched at 40 contracts.
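The Err-path rollback that keeps the contract baseline clean can be sketched roughly as follows. This is an illustration under assumptions, not the hero_compute implementation: the in-memory Contract list stands in for TFChain, and scan_orphans_since and deploy_with_rollback are hypothetical names loosely modelled on the scan_orphan_contracts_since + cancel_one_on_tfgrid pair mentioned in this thread.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum ContractState {
    Created,
    Deleted,
}

/// Hypothetical in-memory stand-in for an on-chain node contract.
#[derive(Debug, Clone)]
pub struct Contract {
    pub id: u64,
    pub created_at: u64,
    pub state: ContractState,
}

/// IDs of contracts still in `Created` state that were minted at or after `since`.
pub fn scan_orphans_since(contracts: &[Contract], since: u64) -> Vec<u64> {
    contracts
        .iter()
        .filter(|c| c.state == ContractState::Created && c.created_at >= since)
        .map(|c| c.id)
        .collect()
}

/// Run `deploy`; on Err, cancel every contract created since `started_at`
/// and hand the original error back to the caller unchanged.
pub fn deploy_with_rollback<F>(
    contracts: &mut Vec<Contract>,
    started_at: u64,
    deploy: F,
) -> Result<(), String>
where
    F: FnOnce(&mut Vec<Contract>) -> Result<(), String>,
{
    match deploy(&mut *contracts) {
        Ok(()) => Ok(()),
        Err(e) => {
            for id in scan_orphans_since(contracts.as_slice(), started_at) {
                if let Some(c) = contracts.iter_mut().find(|c| c.id == id) {
                    // "Cancel" here only flips local state; the real path submits a chain call.
                    c.state = ContractState::Deleted;
                }
            }
            Err(e)
        }
    }
}
```

Scanning by start timestamp rather than tracking IDs returned by the SDK means contracts minted before the Err surfaced are still caught, while older contracts on the same twin are left alone.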
s157c session close: null results that narrow the investigation
Correction to my earlier comment 36619: the two SDK fixes I suggested were wrong. After reading the zosbase canonical Challenge() directly (https://github.com/threefoldtech/zosbase/blob/master/pkg/gridtypes/zos/zmachine.go, ZMachine.Challenge() line 191 and MachineNetwork.Challenge() line 30):
- MachineNetwork.Challenge() canonical order IS PublicIP → Planetary → Interfaces → Mycelium (matches the ORIGINAL Rust SDK at tfgrid-sdk-rust@b8774c34/src/grid_client/deployment.rs:673-683). The Go struct order has Mycelium before Interfaces, but the Challenge method does NOT follow struct order. My swap patch was wrong and actively regressed deploys (the workload never appeared on the node when it was applied).
- ZMachine.Challenge() does NOT include corex either. So inserting corex was neutral (or also wrong).

6 deploy probes ran tonight on rented dedicated node 3467 (farm 646, JimboTFT, Canada). All failed identically with the "vm deployment entered error state [deploy-phase=zos-workload]" empty-error rejection. Variables we ruled out:
- ssh_keys empty vs populated: same failure
- SDK rev b8774c34: no newer commits on master to bump to

State preserved on TFChain: RentContract 2095159 cancelled (substrate Deleted, ~32s); twin 14199 active contracts = 0; all probe-orphan node contracts auto-cancelled by the s157a rollback path; twin 6905 treasury untouched at 40 contracts.
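Why field order matters here: a challenge is a byte string that both sides build independently and then sign or verify, so any reordering changes the bytes. The sketch below is illustrative only. The struct and the byte encoding of each field are invented for the example; only the two field orders (the canonical Challenge() order and the Go struct order) come from this thread.

```rust
/// Hypothetical, simplified view of a machine network.
pub struct MachineNetwork {
    pub public_ip: String,
    pub planetary: bool,
    pub interfaces: Vec<String>,
    pub mycelium: Option<String>,
}

fn push(buf: &mut Vec<u8>, s: &str) {
    buf.extend_from_slice(s.as_bytes());
}

/// Order reported for the canonical Challenge():
/// PublicIP, Planetary, Interfaces, Mycelium.
pub fn challenge_canonical(n: &MachineNetwork) -> Vec<u8> {
    let mut buf = Vec::new();
    push(&mut buf, &n.public_ip);
    push(&mut buf, &n.planetary.to_string());
    for iface in &n.interfaces {
        push(&mut buf, iface);
    }
    if let Some(m) = &n.mycelium {
        push(&mut buf, m);
    }
    buf
}

/// Go struct declaration order (PublicIP, Planetary, Mycelium, Interfaces),
/// which is what the retracted swap patch produced.
pub fn challenge_struct_order(n: &MachineNetwork) -> Vec<u8> {
    let mut buf = Vec::new();
    push(&mut buf, &n.public_ip);
    push(&mut buf, &n.planetary.to_string());
    if let Some(m) = &n.mycelium {
        push(&mut buf, m);
    }
    for iface in &n.interfaces {
        push(&mut buf, iface);
    }
    buf
}
```

As soon as both Interfaces and Mycelium are populated the two byte strings differ, so a signature computed over one will not verify against the other. That is consistent with the swap patch making things worse rather than better.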
Direct ask for you, Mahmoud: the empty result.error from ZOS is the worst possible signal; we have ruled out the obvious hypotheses but have no positive direction. Two specific questions for when you wake:

A. Is zoscompute.gent01.qa.grid.tf (QA chain) the only place a deploy currently works for you? Has anything on mainnet ZOS daemons changed between 2026-05-23 and 2026-05-24 that requires SDK / wire-payload changes we don't yet have?

B. result.error from the workload rejection? Or a known-good wire-payload from a successful recent mainnet deploy we can diff against?

If the answer to Question A is "mainnet broken for everyone, awaiting upstream fix", that's actionable: we hold. If you have a working mainnet recipe, that's a 1-session-to-arc-closure unblock.
Session artefacts: Forge comment 36619 above carries the full s157c chain; the full session manifest is in our workspace at sessions/157c.yml. We will pick up here when you reply or when we have new signal.

s157d close: fixed, on our side
Found and fixed. Mahmoud, you do NOT need to investigate this: the bug was entirely on our side, and was never reachable from the empty result.error ZOS returns.

Root cause: hero_compute's deploy_vm passed the user-supplied image string straight to the TFGrid SDK as the zmachine workload's flist field. When a caller used the friendly name from list_images (e.g. "Ubuntu 24.04"), ZOS received the literal string "Ubuntu 24.04" as its flist URL, set the workload state to Error with an empty result.error, and we surfaced the same opaque "vm deployment entered error state". s149 worked because that loop happened to pass the URL directly.

How we found it: grepping the tfgrid_sdk_rust source for tracing macros returned ZERO hits. But grepping for workload.result.state led to a function called trace_step() at src/grid_client/mod.rs:2361. The SDK has a built-in debug system gated on TFGRID_DEBUG=1, never documented anywhere visible. Setting it produced lines like "workload states for contract 2095178: data=init, 0052=init, data=ok, 0052=error" and the full workload JSON payload, immediately showing flist: "Ubuntu 24.04" (the name) where ZOS expected a URL.

Fix landed: hero_compute@1f59151 on development. It adds a 5-entry IMAGE_REFERENCE_MAP mirroring list_images plus a resolve_image_reference() helper called once at the top of deploy_vm. Pass-through for https:// and http:// URLs; lookup by name for known entries; friendly InvalidInput error for anything else (lists the valid names plus the URL form).

Live verification on rented dedicated node 3467 (Canada, farm 646 JimboTFT, RentContract 2095174 under twin 14199 ops):
- https://hub.grid.tf/tf-official-vms/ubuntu-24.04-latest.flist → VM sid 0053 state=running, contracts 2095179 + 2095180 persist on chain (the first successful deploy_vm of the arc since s149).
- "Ubuntu 24.04" (the original failing input) → daemon resolves to URL internally → VM sid 0054 state=running, contracts 2095181 + 2095182 persist. Same rented node, distinct slice, distinct secret. Multi-tenant pattern proven.
- cargo fmt --check + cargo clippy --workspace --all-targets -- -D warnings + cargo build --workspace --release all clean.

Bonus substrate insight (s157c): node.rentable: True ALONE does NOT allow non-admin-twin rents on mainnet. The substrate gate is node.extraFee > 0 on the node (the farmer's opt-in to allow public dedicated rent). Node 7609 (extraFee=0) was rejected with OnlyTwinAdminCanDeploy; node 3467 (extraFee=10000 mUSD) was accepted immediately. This goes into the workspace decisions/D-29-...md (Hero demo target = any rentable+extraFee>0+up dedicated node on mainnet; not specifically FreeFarm).

Follow-up hero_compute polish (separate concern, not arc-blocking): hero_compute#121. wait_until_running returns before mycelium_ip is populated in the workload result, so get_vm returns state=running with an empty mycelium_ip and SSH-from-our-workstation is blocked. ZOS-side the VM IS reachable; this is a daemon bookkeeping gap, not a deploy gap.

Closing this one. Thanks for the QA-instance reference + OpenRPC spec; confirming code parity helped narrow the search.
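For reference, the shape of the image-resolution fix described above can be sketched as below. This is a minimal illustration, not the code at hero_compute@1f59151: the DeployError type and the error wording are hypothetical, and the map holds a single entry whose name-to-URL pairing is an assumption based on the image names and flist URL quoted in this thread (the real IMAGE_REFERENCE_MAP is said to have five entries).

```rust
/// Friendly name to flist URL. Only one illustrative entry is listed here.
const IMAGE_REFERENCE_MAP: &[(&str, &str)] = &[(
    "Ubuntu 24.04",
    "https://hub.grid.tf/tf-official-vms/ubuntu-24.04-latest.flist",
)];

/// Hypothetical error type standing in for the daemon's InvalidInput error.
#[derive(Debug, PartialEq)]
pub enum DeployError {
    InvalidInput(String),
}

/// Turn a user-supplied image reference into an flist URL before it reaches the SDK.
pub fn resolve_image_reference(image: &str) -> Result<String, DeployError> {
    // Explicit URLs pass through untouched.
    if image.starts_with("https://") || image.starts_with("http://") {
        return Ok(image.to_string());
    }
    // Known friendly names are looked up in the map.
    if let Some((_, url)) = IMAGE_REFERENCE_MAP.iter().find(|(name, _)| *name == image) {
        return Ok(url.to_string());
    }
    // Anything else fails early with a message listing the accepted forms.
    let valid: Vec<&str> = IMAGE_REFERENCE_MAP.iter().map(|(name, _)| *name).collect();
    Err(DeployError::InvalidInput(format!(
        "unknown image {:?}; use one of [{}] or an http(s):// flist URL",
        image,
        valid.join(", ")
    )))
}
```

Failing early with InvalidInput keeps an unresolvable name from ever reaching ZOS, where it would otherwise come back as the same empty-error rejection.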