delete_vm returns Ok before the spawned on-chain cancel finishes; can leave orphaned mainnet contracts #119
Reference
lhumina_code/hero_compute#119
`ComputeService.delete_vm` (at `crates/my_compute_zos_server/src/cloud/rpc.rs:1262-1324`) returns success as soon as the VM record is marked `Deleting` locally, but the actual `cancelContract` substrate calls run in a `tokio::spawn`'d task whose result is neither awaited nor propagated back to the caller. On 2026-05-23 a `deploy_vm` + `delete_vm` round-trip against FreeFarm node 12 left two active contracts (2095003 and 2095004, under twin 6905) on TFChain mainnet, even though the daemon logged `VM 000t deleted (contracts cancelled, slices freed)` at the local-state-cleanup step. The orphans were only detected by querying Grid Proxy directly, and were recovered out of band via `my_compute_zos_server --cancel-contracts 2095003 2095004`. The fix is to await the spawned cancel task and either return Ok only after each contract is confirmed `Deleted` on-chain, or return a typed per-contract error so the caller knows which contracts still need cleanup. Until this lands, every successful `delete_vm` return value is an unverified promise.

The D-27 fix at `39d9b8a` covers the success path correctly (`deploy_vm` now awaits the substrate task before returning Ok), but the error path still leaks contracts. Reproduction today on a fresh daemon at PID 3527196 against TFGrid mainnet node 1 (tfnode-1): four `deploy_vm` calls returned `'vm deployment entered error state'` after roughly 70 seconds each. Each failed call created two on-chain contracts under twin 6905 (one network plus one vm), so eight orphan contracts in total appeared on Grid Proxy at IDs 2095121 through 2095128, all created between 16:44:54Z and 16:49:48Z, all in state Created with no deployer-side record. In other words, an Ok response after the fix is now substrate-confirmed, but an Err response leaks two contracts every time. Recovery took one `my_compute_zos_server --cancel-contracts 2095121 2095122 ... 2095128` run (144 seconds) to flip all eight to state Deleted.

Suggested fix: wrap the `deploy_vm` body so that on any error after contract creation, the partial state is rolled back by cancelling whatever contract IDs the substrate task has already published, before returning Err. The same applies symmetrically to `delete_vm`. The recovery binary path can stay as a debug tool, but should not be the only path that prevents charge leaks on a normal failure mode.

Signed-by: mik-tf mik-tf@noreply.invalid
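A rough sketch of the shape the suggested fix could take, to make the "typed per-contract error" and "roll back before returning Err" ideas concrete. Everything here is hypothetical and for illustration only: the `Chain` trait, `CancelError`, `DeployError`, `cancel_all`, `finish_deploy` and `FakeChain` are not names from the hero_compute codebase, and the sketch is synchronous and std-only so it stays self-contained. In the daemon itself the equivalent step would presumably be awaiting the `JoinHandle` returned by `tokio::spawn` rather than dropping it.

```rust
/// Hypothetical stand-in for the substrate client (not the real API).
trait Chain {
    /// Returns Ok only once the contract is confirmed Deleted on-chain.
    fn cancel_contract(&mut self, id: u64) -> Result<(), String>;
}

/// Typed per-contract result, so the caller knows what still needs cleanup.
#[derive(Debug, PartialEq)]
struct CancelError {
    /// Contracts confirmed Deleted.
    cancelled: Vec<u64>,
    /// Contracts still live, with the reason each cancel failed.
    remaining: Vec<(u64, String)>,
}

/// Tries to cancel every contract; Ok only if all of them were confirmed.
fn cancel_all<C: Chain>(chain: &mut C, ids: &[u64]) -> Result<(), CancelError> {
    let mut cancelled = Vec::new();
    let mut remaining = Vec::new();
    for &id in ids {
        match chain.cancel_contract(id) {
            Ok(()) => cancelled.push(id),
            Err(reason) => remaining.push((id, reason)),
        }
    }
    if remaining.is_empty() {
        Ok(())
    } else {
        Err(CancelError { cancelled, remaining })
    }
}

#[derive(Debug, PartialEq)]
enum DeployError {
    /// Deployment failed; every contract created so far was cancelled.
    RolledBack(String),
    /// Deployment failed and some contracts are still live on-chain.
    Leaked { cause: String, cleanup: CancelError },
}

/// Error-path wrapper: `created` holds the contract IDs already published
/// when `outcome` became known. On Err, roll them back before returning.
fn finish_deploy<C: Chain>(
    chain: &mut C,
    created: &[u64],
    outcome: Result<(), String>,
) -> Result<(), DeployError> {
    match outcome {
        Ok(()) => Ok(()),
        Err(cause) => match cancel_all(chain, created) {
            Ok(()) => Err(DeployError::RolledBack(cause)),
            Err(cleanup) => Err(DeployError::Leaked { cause, cleanup }),
        },
    }
}

/// In-memory fake used only to exercise the sketch.
struct FakeChain {
    live: Vec<u64>,
    fail_on: Option<u64>,
}

impl Chain for FakeChain {
    fn cancel_contract(&mut self, id: u64) -> Result<(), String> {
        if self.fail_on == Some(id) {
            return Err(format!("cancel of {id} not confirmed"));
        }
        self.live.retain(|&c| c != id);
        Ok(())
    }
}
```

With this shape, `delete_vm` could return the result of `cancel_all` directly, and the `deploy_vm` error path could go through `finish_deploy`, so a caller that receives `Leaked` gets the exact contract IDs that still need the `--cancel-contracts` recovery run.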