[deployer] Pre-warm a pool of tester VMs so onboarding is fast and reliable

mik-tf commented

2026-06-07 15:59:25 +00:00

Owner

The demo runs on a dedicated node we already pay for, so leaving idle virtual machines on it costs nothing extra. Today we create a tester VM on demand each time we add someone, which makes that person wait while a brand-new machine boots and joins the network. Sometimes the network route never comes up in time and the install fails outright. Instead, we could pre-provision a pool of tester VMs up front, each already booted with the admin SSH keys and left ready. A periodic health check would ping each pool machine to confirm it is reachable, then tear down and recreate any that are unresponsive, so the pool stays known-good. Adding a tester then becomes preparing their account and running the Hero stack setup on a machine that is already booted and reachable, which moves the slow and flaky part away from the moment someone is actually waiting. A natural follow-up is to also pre-install the Hero binaries on the pool machines so that only the per-user configuration runs at assignment, which makes onboarding both reliable and fast.

This needs the machine records to carry a pool-and-assignment model rather than one machine created per user, and the provision step to split into two operations: create a pool machine, and assign a machine to a user. Capacity should be sized on real placement rather than raw slice counts, and the recreate path needs to handle teardown reliably.
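To make the pool-and-assignment split and the health check concrete, here is a minimal sketch. Every name in it (`Pool`, `PoolMachine`, `create_pool_machine`, `assign_machine`, `health_check`) is hypothetical, and the `create`, `destroy`, and `is_reachable` callables are stand-ins for whatever the deployer really does. It also assumes the health check only recycles machines that are still unassigned, which the proposal does not spell out.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class State(Enum):
    READY = "ready"        # booted with admin SSH keys, not yet handed out
    ASSIGNED = "assigned"  # handed to a tester


@dataclass
class PoolMachine:
    machine_id: str
    state: State = State.READY
    user: Optional[str] = None


class Pool:
    def __init__(self,
                 create: Callable[[], str],
                 destroy: Callable[[str], None],
                 is_reachable: Callable[[str], bool]):
        # Injected stand-ins for the real deployer operations (assumptions).
        self._create = create
        self._destroy = destroy
        self._is_reachable = is_reachable
        self.machines: dict[str, PoolMachine] = {}

    def create_pool_machine(self) -> PoolMachine:
        """Step 1 of the split: create a machine nobody owns yet."""
        machine = PoolMachine(self._create())
        self.machines[machine.machine_id] = machine
        return machine

    def assign_machine(self, user: str) -> Optional[PoolMachine]:
        """Step 2 of the split: hand a ready machine to a user, or None if empty."""
        for machine in self.machines.values():
            if machine.state is State.READY:
                machine.state = State.ASSIGNED
                machine.user = user
                return machine
        return None

    def health_check(self) -> list[str]:
        """Tear down and recreate unreachable ready machines; return replaced ids."""
        replaced = []
        for machine in list(self.machines.values()):
            if machine.state is State.READY and not self._is_reachable(machine.machine_id):
                self._destroy(machine.machine_id)
                del self.machines[machine.machine_id]
                self.create_pool_machine()
                replaced.append(machine.machine_id)
        return replaced
```

When `assign_machine` returns `None`, the caller would fall back to the existing on-demand path, so an empty pool degrades to the current behaviour instead of failing.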

Pool size should depend on the kind of node. On a dedicated node we pay for the whole node whether or not it is full, so the pool should simply fill the node to its real capacity (for example, if a node fits five tester machines, keep all five pre-warmed). There is no reason to leave paid capacity idle. On a shared node each machine we hold ready costs money, so the pool size should be a number the operator sets and the deployer keeps ready at all times. Zero means no pre-warming on shared nodes (create on demand, slower but no standing cost). A small number like three or five means always keep that many ready, so onboarding is quick without holding a large paid pool. A higher number trades more standing cost for more instant onboarding. As pool machines get assigned to users, the deployer tops the pool back up to the target. So the rule is one target per node type: on dedicated nodes it defaults to the node capacity (free to max out), and on shared nodes the operator chooses it to balance onboarding speed against spend.
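The sizing rule can be sketched as a single top-up calculation. The function name, its parameters, and the choice to cap the shared-node target at the node's free capacity are assumptions made for illustration, not something the deployer is known to do.

```python
from typing import Literal

NodeKind = Literal["dedicated", "shared"]


def machines_to_create(kind: NodeKind,
                       real_capacity: int,
                       assigned: int,
                       ready: int,
                       shared_target: int = 0) -> int:
    """How many pool machines to create now to bring the pool back to target.

    real_capacity: tester machines the node can really place (not raw slices).
    assigned:      machines already handed to users.
    ready:         pre-warmed machines still waiting in the pool.
    shared_target: operator-chosen number to keep ready on shared nodes.
    """
    free = max(real_capacity - assigned - ready, 0)
    if kind == "dedicated":
        # The node is paid for in full, so fill whatever room is left.
        return free
    # Shared node: keep exactly the operator's target ready, never more
    # than the node can place (the cap is an assumption of this sketch).
    deficit = max(shared_target - ready, 0)
    return min(deficit, free)


# Dedicated node that fits five: two assigned, one ready, so create two more.
print(machines_to_create("dedicated", real_capacity=5, assigned=2, ready=1))
# Shared node with a target of three and one ready: create two more.
print(machines_to_create("shared", real_capacity=10, assigned=0, ready=1, shared_target=3))
```

With `shared_target=0` the function always returns zero on shared nodes, which matches the "create on demand" setting.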

This pool target also feeds bulk account creation (lhumina_code/home#288): a warm pool of the right size lets a group of testers be stood up quickly, so the target should be set with both single onboarding and group creation in mind. This is part of the composable provisioning product tracked at lhumina_code/home#285.
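As a rough illustration of why the target matters for group creation, the sketch below splits a batch of new testers into those served from the warm pool and those that still need a machine created on demand. `plan_batch` is a hypothetical helper, not part of the deployer.

```python
def plan_batch(batch_size: int, ready: int) -> tuple[int, int]:
    """Return (served_from_pool, created_on_demand) for a batch of testers."""
    if batch_size < 0 or ready < 0:
        raise ValueError("batch_size and ready must be non-negative")
    from_pool = min(batch_size, ready)
    return from_pool, batch_size - from_pool


# A batch of eight against a pool of five: five get a warm machine,
# three wait for a fresh one.
print(plan_batch(8, 5))  # (5, 3)
```

Any tester in the second number pays the slow boot and network wait, so a target sized only for single onboarding leaves most of a large batch on the slow path.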

Signed-by: mik-tf mik-tf@noreply.invalid


mik-tf commented

2026-06-07 16:01:23 +00:00

Author

Owner

One refinement on the golden image follow-up mentioned above: pre-installing the binaries on pool machines is probably not worth it. The binaries account for only about two minutes of the install, and main and development rebuild often, so a pre-baked image would go stale quickly for marginal gain. The warm pool on its own already takes onboarding from around twenty minutes down to a few minutes by removing the brand-new machine boot and network wait. So let's do the warm pool first and only revisit pre-baking if we find a clean way to keep the image fresh.

Signed-by: mik-tf mik-tf@noreply.invalid
