delete_vm returns Ok before the spawned on-chain cancel finishes; can leave orphaned mainnet contracts #119

Closed
opened 2026-05-23 16:03:53 +00:00 by mik-tf · 1 comment
Owner

`ComputeService.delete_vm` (at `crates/my_compute_zos_server/src/cloud/rpc.rs:1262-1324`) returns success as soon as the VM record is marked `Deleting` locally, but the actual `cancelContract` substrate calls run in a `tokio::spawn`'d task whose result is neither awaited nor propagated back to the caller. On 2026-05-23 a `deploy_vm` + `delete_vm` round-trip against FreeFarm node 12 left two active contracts (2095003 and 2095004, under twin 6905) on TFChain mainnet, even though the daemon logged `VM 000t deleted (contracts cancelled, slices freed)` at the local-state-cleanup step. The orphans were only detected by querying Grid Proxy directly and were recovered out of band via `my_compute_zos_server --cancel-contracts 2095003 2095004`. The fix is to await the spawned cancel task and either return `Ok` only after each contract is confirmed `Deleted` on-chain, or return a typed per-contract error so the caller knows which contracts still need cleanup. Until this lands, every successful `delete_vm` return value is an unverified promise.
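A minimal sketch of the proposed shape, not the actual `rpc.rs` code. Everything named here (`Chain`, `ContractState`, `DeleteVmError`, `MockChain`) is hypothetical, and `std::thread::spawn` plus `join` stands in for `tokio::spawn` plus awaiting the `JoinHandle` so that the example compiles with the standard library alone. The point it illustrates is that the caller gets `Ok` only when every contract reads back as `Deleted`, and otherwise gets the list of contract IDs that still need cleanup.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical on-chain contract states; only `Deleted` counts as cancelled.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ContractState {
    Created,
    Deleted,
}

/// Hypothetical chain client, a stand-in for the real substrate client.
pub trait Chain: Send + Sync {
    fn cancel_contract(&self, id: u64) -> Result<(), String>;
    fn contract_state(&self, id: u64) -> Result<ContractState, String>;
}

/// Typed error: every contract that is not confirmed `Deleted`, with a reason.
#[derive(Debug, PartialEq, Eq)]
pub struct DeleteVmError {
    pub failed: Vec<(u64, String)>,
}

/// Cancels all contracts in a spawned worker and waits for its result.
pub fn delete_vm(chain: Arc<dyn Chain>, contract_ids: Vec<u64>) -> Result<(), DeleteVmError> {
    let ids_for_worker = contract_ids.clone();
    let worker = thread::spawn(move || {
        let mut failed = Vec::new();
        for id in ids_for_worker {
            // Cancel, then read the state back instead of trusting the call.
            let outcome = chain
                .cancel_contract(id)
                .and_then(|()| chain.contract_state(id));
            match outcome {
                Ok(ContractState::Deleted) => {}
                Ok(other) => failed.push((id, format!("still in state {:?}", other))),
                Err(e) => failed.push((id, e)),
            }
        }
        failed
    });

    // Join rather than fire-and-forget. If the worker panicked, nothing is
    // confirmed, so every contract is reported as needing cleanup.
    let failed = worker.join().unwrap_or_else(|_| {
        contract_ids
            .iter()
            .map(|id| (*id, "cancel task panicked".to_string()))
            .collect()
    });

    if failed.is_empty() {
        Ok(())
    } else {
        Err(DeleteVmError { failed })
    }
}

/// In-memory chain for demonstration; `stuck` lists IDs whose cancel fails.
pub struct MockChain {
    states: Mutex<HashMap<u64, ContractState>>,
    stuck: Vec<u64>,
}

impl MockChain {
    pub fn new(ids: &[u64], stuck: &[u64]) -> Self {
        let states = ids.iter().map(|id| (*id, ContractState::Created)).collect();
        MockChain { states: Mutex::new(states), stuck: stuck.to_vec() }
    }
}

impl Chain for MockChain {
    fn cancel_contract(&self, id: u64) -> Result<(), String> {
        if self.stuck.contains(&id) {
            return Err("cancel call failed".to_string());
        }
        match self.states.lock().unwrap().get_mut(&id) {
            Some(state) => {
                *state = ContractState::Deleted;
                Ok(())
            }
            None => Err("unknown contract".to_string()),
        }
    }

    fn contract_state(&self, id: u64) -> Result<ContractState, String> {
        let states = self.states.lock().unwrap();
        states.get(&id).copied().ok_or_else(|| "unknown contract".to_string())
    }
}
```

With this shape, a caller that receives `Err(DeleteVmError { failed })` can retry or surface exactly the listed IDs rather than discovering orphans later through Grid Proxy.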
Author
Owner

The D-27 fix at 39d9b8a covers the success path correctly (`deploy_vm` now awaits the substrate task before returning `Ok`), but the error path still leaks contracts. Reproduction today on a fresh daemon at PID 3527196 against TFGrid mainnet node 1 (tfnode-1): four `deploy_vm` calls returned `'vm deployment entered error state'` after roughly 70 seconds each. Each error response created two on-chain contracts under twin 6905 (one network plus one vm), so eight orphan contracts in total appeared on Grid Proxy at IDs 2095121 through 2095128, all created between 16:44:54Z and 16:49:48Z, all in state `Created` with no deployer-side record. In other words, an `Ok` response after the fix is now substrate-confirmed, but an `Err` response leaks two contracts every time. Recovery took one `my_compute_zos_server --cancel-contracts 2095121 2095122 ... 2095128` run (144 seconds) to flip all eight to state `Deleted`. Suggested fix: wrap the `deploy_vm` body so that on any error after contract creation, the partial state is rolled back by cancelling whatever contract IDs the substrate task has already published, before returning `Err`. The same applies symmetrically to `delete_vm`. The recovery binary path can stay as a debug tool, but it should not be the only path that prevents charge leaks on a normal failure mode.

Signed-by: mik-tf <mik-tf@noreply.invalid>
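The suggested rollback could look roughly like the sketch below. It is synchronous and uses only the standard library; the real `deploy_vm` is async and its internals are not shown in this issue, so `Chain`, `DeployError`, `create_and_provision`, the `provision` callback and `MockChain` are all hypothetical names introduced for illustration. The idea is to record each contract ID as soon as it exists, and on any later error to cancel the recorded IDs before returning `Err`, reporting any that could not be cancelled.

```rust
/// Hypothetical chain client; the real substrate client will differ.
pub trait Chain {
    fn create_contract(&mut self, kind: &str) -> Result<u64, String>;
    fn cancel_contract(&mut self, id: u64) -> Result<(), String>;
}

/// Error that carries the original cause plus any contract the rollback
/// itself failed to cancel, so nothing is leaked silently.
#[derive(Debug, PartialEq, Eq)]
pub struct DeployError {
    pub cause: String,
    pub leaked: Vec<u64>,
}

/// Creates the network and vm contracts, recording each ID in `created`
/// immediately, then runs the provisioning step (assumed to be fallible).
fn create_and_provision<C: Chain>(
    chain: &mut C,
    created: &mut Vec<u64>,
    provision: impl FnOnce(&[u64]) -> Result<(), String>,
) -> Result<(), String> {
    created.push(chain.create_contract("network")?);
    created.push(chain.create_contract("vm")?);
    provision(created.as_slice())
}

/// Deploys, and on any failure cancels whatever was already created.
pub fn deploy_vm<C: Chain>(
    chain: &mut C,
    provision: impl FnOnce(&[u64]) -> Result<(), String>,
) -> Result<Vec<u64>, DeployError> {
    let mut created = Vec::new();
    match create_and_provision(chain, &mut created, provision) {
        Ok(()) => Ok(created),
        Err(cause) => {
            // Roll back in reverse creation order; keep IDs whose cancel failed.
            let leaked: Vec<u64> = created
                .into_iter()
                .rev()
                .filter(|id| chain.cancel_contract(*id).is_err())
                .collect();
            Err(DeployError { cause, leaked })
        }
    }
}

/// In-memory chain for demonstration: hands out sequential IDs.
pub struct MockChain {
    pub next_id: u64,
    pub active: Vec<u64>,
}

impl Chain for MockChain {
    fn create_contract(&mut self, _kind: &str) -> Result<u64, String> {
        let id = self.next_id;
        self.next_id += 1;
        self.active.push(id);
        Ok(id)
    }

    fn cancel_contract(&mut self, id: u64) -> Result<(), String> {
        let before = self.active.len();
        self.active.retain(|c| *c != id);
        if self.active.len() < before {
            Ok(())
        } else {
            Err(format!("contract {} not active", id))
        }
    }
}
```

A non-empty `leaked` list in the error is the case where an out-of-band tool such as `--cancel-contracts` would still be needed; in the common failure mode the list is empty and no contract outlives the `Err`.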
Reference
lhumina_code/hero_compute#119