delete_vm returns Ok before the spawned on-chain cancel finishes; can leave orphaned mainnet contracts #119

Closed

opened 2026-05-23 16:03:53 +00:00 by mik-tf · 1 comment

mik-tf commented

2026-05-23 16:03:53 +00:00

Owner

`ComputeService.delete_vm` (at `crates/my_compute_zos_server/src/cloud/rpc.rs:1262-1324`) returns success as soon as the VM record is marked `Deleting` locally, but the actual `cancelContract` substrate calls run in a `tokio::spawn`'d task whose result is neither awaited nor propagated back to the caller.

On 2026-05-23 a `deploy_vm` + `delete_vm` round-trip against FreeFarm node 12 left two active contracts (2095003 and 2095004, under twin 6905) on TFChain mainnet, even though the daemon logged `VM 000t deleted (contracts cancelled, slices freed)` at the local-state-cleanup step. The orphans were detected only by querying Grid Proxy directly, and were recovered out of band via `my_compute_zos_server --cancel-contracts 2095003 2095004`.

The fix is to await the spawned cancel task and either return Ok only after each contract is confirmed `Deleted` on-chain, or return a typed per-contract error so the caller knows which contracts still need cleanup. Until this lands, every successful `delete_vm` return value is an unverified promise.
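A minimal sketch of the "confirm each contract, else return a typed per-contract error" shape proposed above. Everything named here (`ContractCanceller`, `DeleteVmError`, `CancelFailure`, `delete_vm_confirmed`, `MockChain`) is hypothetical and not taken from the crate; the sketch is synchronous and std-only so it stays self-contained, whereas the real handler would presumably `.await` the substrate calls instead of spawning them.

```rust
use std::collections::BTreeMap;

/// Hypothetical view of a contract's on-chain state.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ContractState {
    Created,
    Deleted,
}

/// One contract that could not be confirmed as cancelled (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub struct CancelFailure {
    pub contract_id: u64,
    pub reason: String,
}

/// Typed error: tells the caller which contracts still need cleanup.
#[derive(Debug, PartialEq, Eq)]
pub struct DeleteVmError {
    pub cancelled: Vec<u64>,
    pub failed: Vec<CancelFailure>,
}

/// Assumed minimal chain interface; NOT the real TFChain client API.
pub trait ContractCanceller {
    fn cancel_contract(&mut self, id: u64) -> Result<(), String>;
    fn contract_state(&self, id: u64) -> Option<ContractState>;
}

/// Returns Ok only after every contract is observed as `Deleted`.
pub fn delete_vm_confirmed<C: ContractCanceller>(
    chain: &mut C,
    contract_ids: &[u64],
) -> Result<(), DeleteVmError> {
    let mut cancelled = Vec::new();
    let mut failed = Vec::new();
    for &id in contract_ids {
        // Submit the cancel, then verify the resulting state instead of trusting the submit.
        let outcome = match chain.cancel_contract(id) {
            Err(e) => Err(e),
            Ok(()) => match chain.contract_state(id) {
                Some(ContractState::Deleted) => Ok(()),
                other => Err(format!("state after cancel: {:?}", other)),
            },
        };
        match outcome {
            Ok(()) => cancelled.push(id),
            Err(reason) => failed.push(CancelFailure { contract_id: id, reason }),
        }
    }
    if failed.is_empty() {
        Ok(())
    } else {
        Err(DeleteVmError { cancelled, failed })
    }
}

/// In-memory stand-in for the chain, for illustration only.
pub struct MockChain {
    pub states: BTreeMap<u64, ContractState>,
    pub refuse: Vec<u64>,
}

impl MockChain {
    /// `ids` start as `Created`; cancels for any id in `refuse` fail.
    pub fn new(ids: &[u64], refuse: &[u64]) -> Self {
        let states = ids.iter().map(|&id| (id, ContractState::Created)).collect();
        MockChain { states, refuse: refuse.to_vec() }
    }
}

impl ContractCanceller for MockChain {
    fn cancel_contract(&mut self, id: u64) -> Result<(), String> {
        if self.refuse.contains(&id) {
            return Err("cancel rejected".to_string());
        }
        match self.states.get_mut(&id) {
            Some(state) => {
                *state = ContractState::Deleted;
                Ok(())
            }
            None => Err("unknown contract".to_string()),
        }
    }

    fn contract_state(&self, id: u64) -> Option<ContractState> {
        self.states.get(&id).copied()
    }
}
```

With a mock that refuses one of two cancels, the caller gets an `Err` naming exactly the contract that is still live, rather than an `Ok` that has to be double-checked against Grid Proxy.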

mik-tf referenced this issue

2026-05-23 16:04:22 +00:00

deploy_vm returns Ok before the spawned substrate submission finishes #120

mik-tf referenced this issue from lhumina_code/home

2026-05-24 03:06:27 +00:00

[META] Hero OS demo-deployer arc tracker (cockpit + proxy + content + deployer + manifest + integration) #235

mik-tf referenced this issue from a commit

2026-05-24 05:12:11 +00:00

zos: await substrate ack inline on deploy_vm + delete_vm (D-27, #119, #120)

mik-tf closed this issue

2026-05-24 05:12:11 +00:00

mik-tf commented

2026-05-24 16:58:15 +00:00

Author

Owner

The D-27 fix at 39d9b8a covers the success path correctly (`deploy_vm` now awaits the substrate task before returning Ok), but the error path still leaks contracts.

Reproduction today on a fresh daemon at PID 3527196 against TFGrid mainnet node 1 (tfnode-1): four `deploy_vm` calls returned `'vm deployment entered error state'` after roughly 70 seconds each. Each error response created TWO on-chain contracts under twin 6905 (one network plus one vm), so eight orphan contracts in total appeared on Grid Proxy at IDs 2095121 through 2095128, all created between 16:44:54Z and 16:49:48Z, all in state `Created` with no deployer-side record. In other words, an Ok response after the fix is now substrate-confirmed, but an Err response leaks two contracts every time. Recovery took one `my_compute_zos_server --cancel-contracts 2095121 2095122 ... 2095128` run (144 seconds) to flip all eight to state `Deleted`.

Suggested fix: wrap the `deploy_vm` body so that on any error after contract creation, the partial state is rolled back by cancelling whatever contract IDs the substrate task has already published, before returning Err. The same applies symmetrically to `delete_vm`. The recovery binary path can stay as a debug tool, but it should not be the only path that prevents charge leaks on a normal failure mode.

Signed-by: mik-tf <mik-tf@noreply.invalid>
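A rough sketch of the rollback-on-error wrapper suggested above, for discussion only. The names (`ContractLedger`, `DeployError`, `deploy_vm_with_rollback`, `MemoryLedger`) are hypothetical and do not come from the crate, the "one network plus one vm" creation order is taken from the reproduction, and the code is synchronous and std-only where the real path would await substrate calls.

```rust
use std::collections::BTreeSet;

/// Assumed minimal chain interface; NOT the real TFChain client API.
pub trait ContractLedger {
    fn create_contract(&mut self, kind: &str) -> Result<u64, String>;
    fn cancel_contract(&mut self, id: u64) -> Result<(), String>;
}

/// Hypothetical typed error: reports what the rollback did and did not undo.
#[derive(Debug, PartialEq, Eq)]
pub struct DeployError {
    pub cause: String,
    pub rolled_back: Vec<u64>,
    pub still_active: Vec<u64>,
}

/// Creates the contracts, recording each ID as soon as it exists,
/// then runs the caller-supplied provisioning step.
fn try_deploy<L, F>(ledger: &mut L, created: &mut Vec<u64>, provision: F) -> Result<(), String>
where
    L: ContractLedger,
    F: FnOnce(&[u64]) -> Result<(), String>,
{
    created.push(ledger.create_contract("network")?);
    created.push(ledger.create_contract("vm")?);
    provision(created.as_slice())
}

/// On any failure after contract creation, cancels whatever was already
/// created (newest first) before returning Err.
pub fn deploy_vm_with_rollback<L, F>(ledger: &mut L, provision: F) -> Result<Vec<u64>, DeployError>
where
    L: ContractLedger,
    F: FnOnce(&[u64]) -> Result<(), String>,
{
    let mut created = Vec::new();
    match try_deploy(&mut *ledger, &mut created, provision) {
        Ok(()) => Ok(created),
        Err(cause) => {
            let mut rolled_back = Vec::new();
            let mut still_active = Vec::new();
            for id in created.into_iter().rev() {
                match ledger.cancel_contract(id) {
                    Ok(()) => rolled_back.push(id),
                    // Surface IDs that survived rollback so the caller can retry cleanup.
                    Err(_) => still_active.push(id),
                }
            }
            Err(DeployError { cause, rolled_back, still_active })
        }
    }
}

/// In-memory stand-in for the chain, for illustration only.
pub struct MemoryLedger {
    pub next_id: u64,
    pub active: BTreeSet<u64>,
}

impl MemoryLedger {
    pub fn new(first_id: u64) -> Self {
        MemoryLedger { next_id: first_id, active: BTreeSet::new() }
    }
}

impl ContractLedger for MemoryLedger {
    fn create_contract(&mut self, _kind: &str) -> Result<u64, String> {
        let id = self.next_id;
        self.next_id += 1;
        self.active.insert(id);
        Ok(id)
    }

    fn cancel_contract(&mut self, id: u64) -> Result<(), String> {
        if self.active.remove(&id) {
            Ok(())
        } else {
            Err("contract not active".to_string())
        }
    }
}
```

The design point is that the list of created IDs lives outside the fallible step, so it is still available when that step fails; an `Err` then carries either an empty `still_active` list or the exact IDs that need a manual `--cancel-contracts` run.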

mik-tf referenced this issue

2026-05-24 16:58:16 +00:00

deploy_vm returns Ok before the spawned substrate submission finishes #120

mik-tf referenced this issue

2026-05-24 16:58:43 +00:00

ComputeService.deploy_vm returns opaque 'vm deployment entered error state' on substrate failure #124

mik-tf referenced this issue from a commit

2026-05-24 19:41:03 +00:00

fix(cloud/rpc): resolve node by sid not hostname + classify deploy failures

mik-tf referenced this issue from a commit

2026-05-24 22:25:39 +00:00

fix(cloud): roll back orphan TFGrid contracts on deploy_vm error (D-27 part 2, #119)