[deployer] Multi-network provisioning can fail when two networks assign the same VM id #26

Open
opened 2026-06-13 20:37:09 +00:00 by mik-tf · 1 comment
Owner

When the deployer manages compute daemons on more than one TFGrid network, each daemon assigns VM short ids from its own independent sequence, so two daemons can hand out the same id (for example a QA VM and a mainnet VM both numbered 005o). The deployer stores every VM in a single table whose vm id column is globally unique, so the second insert is rejected and the provision fails even though the VM has already been deployed on chain. That also leaves a running VM the deployer never recorded.

This was hit live while provisioning a mainnet VM whose id collided with an existing QA tester (mainnet 005o vs an existing QA 005o), and it becomes more likely as a multi-network fleet grows.

The fix is to make the deployer's stored VM identity unique per daemon (composite uniqueness on daemon plus vm id, keeping the raw daemon id for calls back to that daemon) via a schema migration that leaves existing single-network VMs unchanged. Because the deployer keys on the vm id throughout the provision and lookup paths and the admin UI, this is a contained but real identity change plus a migration on live data, so it should land as its own reviewed change rather than a quick patch.

Until then, multi-network fleets can hit intermittent provision failures and leave orphaned compute VMs that must be cleaned up by hand.
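A minimal sketch of the collision and of the proposed composite uniqueness, for illustration only. It assumes an SQLite-style store; the table name `vms`, the column names `daemon` and `vm_id`, and the helper `insert_two_networks` are hypothetical and are not taken from the deployer's actual schema.

```python
import sqlite3

# Hypothetical schemas: the real deployer table and column names are not known here.
# Current behaviour as described: vm_id is unique across the whole table.
GLOBAL_UNIQUE = (
    "CREATE TABLE vms (daemon TEXT NOT NULL, vm_id TEXT NOT NULL UNIQUE)"
)
# Proposed behaviour: uniqueness on the (daemon, vm_id) pair.
PER_DAEMON_UNIQUE = (
    "CREATE TABLE vms (daemon TEXT NOT NULL, vm_id TEXT NOT NULL, "
    "UNIQUE (daemon, vm_id))"
)


def insert_two_networks(create_sql):
    """Record the same short id for two daemons; return True if both inserts succeed."""
    conn = sqlite3.connect(":memory:")
    conn.execute(create_sql)
    try:
        conn.execute("INSERT INTO vms (daemon, vm_id) VALUES ('qa', '005o')")
        conn.execute("INSERT INTO vms (daemon, vm_id) VALUES ('mainnet', '005o')")
        return True
    except sqlite3.IntegrityError:
        # The second insert violated a uniqueness constraint.
        return False
    finally:
        conn.close()


if __name__ == "__main__":
    print("globally unique vm_id:", insert_two_networks(GLOBAL_UNIQUE))
    print("unique per daemon:", insert_two_networks(PER_DAEMON_UNIQUE))
```

Under the first schema the mainnet insert is rejected, which matches the failure described above; under the second, both records coexist because the pair differs.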

Signed-by: mik-tf <mik-tf@noreply.invalid>

Author
Owner

Fixed on the integration branch (hero_os_tfgrid_deployer 275cdbc). Stored VM identity is now composite on (owning network, vm id) instead of the vm id alone:

- A schema migration recreates the table with a uniqueness constraint on the pair.
- The provision insert writes the owning network in the same statement, so two networks that pick the same id no longer collide.
- Every read, update, and delete is keyed on the pair, so a write can never touch the wrong network's record.
- A lookup by id alone now returns an error when that id exists on more than one network rather than guessing.
- Delete, install, update, and check-updates gained an optional network field so a caller can disambiguate.

Proven live on the admin VM: the existing database migrated with all 8 records preserved, and on a copy of the migrated data a second record reusing an existing id under a different network inserts cleanly while an exact duplicate is still rejected. Tests are green (172 server, 13 SDK). Not yet promoted to main.
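The migration and the stricter lookup described in this comment can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in 275cdbc: it assumes an SQLite-style store, and the table `vms`, the columns `network`, `vm_id`, and `name`, the `AmbiguousVmId` exception, the `migrate` and `find_vm` helpers, and the backfill of existing rows with a single default network are all hypothetical.

```python
import sqlite3


class AmbiguousVmId(Exception):
    """Raised when an id-only lookup matches records on more than one network."""


def migrate(conn, default_network):
    """Recreate vms with uniqueness on (network, vm_id), copying existing rows.

    Assumption: pre-migration rows all belong to one network, passed in here.
    """
    conn.execute(
        "CREATE TABLE vms_new (network TEXT NOT NULL, vm_id TEXT NOT NULL, "
        "name TEXT, UNIQUE (network, vm_id))"
    )
    conn.execute(
        "INSERT INTO vms_new (network, vm_id, name) "
        "SELECT ?, vm_id, name FROM vms",
        (default_network,),
    )
    conn.execute("DROP TABLE vms")
    conn.execute("ALTER TABLE vms_new RENAME TO vms")
    conn.commit()


def find_vm(conn, vm_id, network=None):
    """Return the matching (network, vm_id, name) row, or None if absent.

    Without a network, refuse to guess when the id exists on several networks.
    """
    sql = "SELECT network, vm_id, name FROM vms WHERE vm_id = ?"
    params = [vm_id]
    if network is not None:
        sql += " AND network = ?"
        params.append(network)
    rows = conn.execute(sql, params).fetchall()
    if len(rows) > 1:
        raise AmbiguousVmId(f"vm id {vm_id!r} exists on {len(rows)} networks")
    return rows[0] if rows else None


def build_demo():
    """Build an old-style table, migrate it, then add a colliding id on another network."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE vms (vm_id TEXT PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO vms VALUES ('005o', 'qa-tester')")
    migrate(conn, default_network="qa")
    conn.execute("INSERT INTO vms VALUES ('mainnet', '005o', 'mainnet-vm')")
    return conn
```

Passing the optional `network` argument mirrors the optional network field mentioned above: `find_vm(conn, "005o", network="mainnet")` resolves to a single record, while `find_vm(conn, "005o")` raises because the id is present on two networks.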

Reference
lhumina_code/hero_os_tfgrid_deployer#26