[deployer] Multi-network provisioning can fail when two networks assign the same VM id

mik-tf commented

2026-06-13 20:37:09 +00:00

Owner

When the deployer manages compute daemons on more than one TFGrid network, each daemon assigns VM short ids from its own independent sequence, so two daemons can hand out the same id (for example, a QA VM and a mainnet VM both numbered 005o). The deployer stores every VM in a single table whose vm id column is globally unique, so the second insert is rejected and the provision fails even though the VM has already been deployed on chain. That leaves a running VM the deployer never recorded.

This was hit live while provisioning a mainnet VM whose id collided with an existing QA tester (mainnet 005o vs an existing QA 005o), and it will become more likely as a multi-network fleet grows.

The fix is to make the deployer's stored VM identity unique per daemon (composite uniqueness on daemon plus vm id, keeping the raw daemon id for calls back to that daemon) via a schema migration that leaves existing single-network VMs unchanged. The deployer keys on the vm id throughout the provision and lookup paths and in the admin UI, so this is a contained but real identity change plus a migration on live data. It should land as its own reviewed change rather than a quick patch. Until then, multi-network fleets can hit intermittent provision failures and leave orphaned compute VMs that must be cleaned up by hand.

Signed-by: mik-tf <mik-tf@noreply.invalid>
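
The collision and the proposed composite uniqueness can be reproduced in a few lines. This is a minimal sketch, not the deployer's actual schema: it assumes an SQLite store purely for illustration, the table name `vms` and the helper names are hypothetical, and the column names `daemon_label` and `vm_sid` are borrowed from the title of the commit that references this issue.

```python
import sqlite3

# Hypothetical schemas: the real deployer table has more columns.
GLOBAL_UNIQUE = """
CREATE TABLE vms (
    daemon_label TEXT NOT NULL,
    vm_sid       TEXT NOT NULL UNIQUE      -- vm id unique across all daemons
)"""

PER_DAEMON_UNIQUE = """
CREATE TABLE vms (
    daemon_label TEXT NOT NULL,
    vm_sid       TEXT NOT NULL,
    UNIQUE (daemon_label, vm_sid)          -- vm id unique per daemon only
)"""


def make_db(schema):
    """Create an in-memory database with the given table definition."""
    conn = sqlite3.connect(":memory:")
    conn.execute(schema)
    return conn


def try_insert(conn, daemon_label, vm_sid):
    """Record a VM; return False if a uniqueness constraint rejects it."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO vms (daemon_label, vm_sid) VALUES (?, ?)",
                (daemon_label, vm_sid),
            )
        return True
    except sqlite3.IntegrityError:
        return False


# Globally unique vm id: the second network's 005o is rejected.
old = make_db(GLOBAL_UNIQUE)
old_results = [
    try_insert(old, "qa", "005o"),
    try_insert(old, "mainnet", "005o"),
]

# Composite uniqueness: both networks can hold 005o, exact duplicates cannot.
new = make_db(PER_DAEMON_UNIQUE)
new_results = [
    try_insert(new, "qa", "005o"),
    try_insert(new, "mainnet", "005o"),
    try_insert(new, "mainnet", "005o"),
]
```

With the globally unique column the mainnet insert fails after the QA one succeeds, which matches the failure described above. With the composite constraint only the exact (daemon, vm id) repeat is refused.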


~~mik-tf referenced this issue from lhumina_code/home 2026-06-14 00:16:04 +00:00~~

[META] Hero OS sandbox demo, functional readiness: onboarding pipeline + per-app verification #239

mik-tf referenced this issue from lhumina_code/home

2026-06-14 00:16:16 +00:00

[META] Hero OS sandbox demo, functional readiness: onboarding pipeline + per-app verification #239

mik-tf referenced this issue from a commit

2026-06-14 01:04:51 +00:00

fix(deployer): composite VM identity (daemon_label + vm_sid) so two networks can share a vm id

mik-tf commented

2026-06-14 01:11:42 +00:00

Author

Owner

Fixed on the integration branch (hero_os_tfgrid_deployer 275cdbc). Stored VM identity is now composite on (owning network, vm id) instead of the vm id alone:

- A schema migration recreates the table with a uniqueness constraint on the pair.
- The provision insert writes the owning network in the same statement, so two networks that pick the same id no longer collide.
- Every read, update, and delete is keyed on the pair, so a write can never touch the wrong network's record.
- A lookup by id alone now returns an error when that id exists on more than one network, rather than guessing.
- Delete, install, update, and check-updates gained an optional network field so a caller can disambiguate.

Proven live on the admin VM: the existing database migrated with all 8 records preserved, and on a copy of the migrated data a second record reusing an existing id under a different network inserts cleanly while an exact duplicate is still rejected. Tests are green (172 server, 13 SDK). Not yet promoted to main.
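
The migration and the stricter lookup described in this comment might look roughly like the sketch below. It is an illustration under stated assumptions, not the code in 275cdbc: it assumes an SQLite store, and the table name `vms`, the functions `migrate` and `find_vm`, the exception `AmbiguousVmId`, and the choice to label all pre-existing rows with one network are hypothetical. How the real migration assigns the owning network to existing rows is not stated here.

```python
import sqlite3


class AmbiguousVmId(Exception):
    """Raised when a vm id exists on more than one network."""


def migrate(conn, default_label):
    """Recreate the table with uniqueness on (daemon_label, vm_sid).

    Existing rows are copied over and tagged with default_label
    (an assumption made for this sketch).
    """
    with conn:
        conn.execute(
            "CREATE TABLE vms_new ("
            " daemon_label TEXT NOT NULL,"
            " vm_sid TEXT NOT NULL,"
            " UNIQUE (daemon_label, vm_sid))"
        )
        conn.execute(
            "INSERT INTO vms_new (daemon_label, vm_sid) "
            "SELECT ?, vm_sid FROM vms",
            (default_label,),
        )
        conn.execute("DROP TABLE vms")
        conn.execute("ALTER TABLE vms_new RENAME TO vms")


def find_vm(conn, vm_sid, daemon_label=None):
    """Look up a VM; refuse to guess when the bare id is ambiguous."""
    if daemon_label is not None:
        rows = conn.execute(
            "SELECT daemon_label, vm_sid FROM vms "
            "WHERE daemon_label = ? AND vm_sid = ?",
            (daemon_label, vm_sid),
        ).fetchall()
    else:
        rows = conn.execute(
            "SELECT daemon_label, vm_sid FROM vms WHERE vm_sid = ?",
            (vm_sid,),
        ).fetchall()
    if len(rows) > 1:
        raise AmbiguousVmId(vm_sid)
    return rows[0] if rows else None


# Old-style table with a globally unique vm id and two existing records.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vms (vm_sid TEXT NOT NULL UNIQUE)")
conn.executemany("INSERT INTO vms (vm_sid) VALUES (?)", [("005o",), ("0071",)])

migrate(conn, "qa")
migrated_count = conn.execute("SELECT COUNT(*) FROM vms").fetchone()[0]

# A second network can now reuse an id that already exists.
conn.execute("INSERT INTO vms (daemon_label, vm_sid) VALUES ('mainnet', '005o')")

try:
    find_vm(conn, "005o")
    ambiguous = False
except AmbiguousVmId:
    ambiguous = True

mainnet_vm = find_vm(conn, "005o", daemon_label="mainnet")
unique_vm = find_vm(conn, "0071")
```

Raising on an ambiguous bare id, instead of returning the first match, is what keeps a delete or update from landing on the wrong network's record when the caller omits the network.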



[deployer] Multi-network provisioning can fail when two networks assign the same VM id #26