[deployer] Explore removing the single-admin-VM single point of failure #276
lhumina_code/home#276
Today the whole deployer fleet runs on one admin VM: the control database, the admin dashboard, the shared embedder and voice engines, and the per-network compute workers. If that VM is lost, every tester on every network loses the control plane and the shared engines at once. This is acceptable for the current sandbox and investor demos, but the single point of failure should be removed before any wider use.

Options to explore when we get there (not a priority now): (1) keep the durable control state (the database and secrets) on a resilient backing, such as an admin account or repository on forge.ourworld.tf, or a replicated store; (2) run the control plane as a small clustered service across more than one node; (3) run a second admin VM on another dedicated node that shares the same database.

Filing so it is tracked. Context: part of the multi-node and multi-chain build-out at #264.
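To make the failure mode concrete, here is a minimal sketch that models which services survive the loss of one node. It is an illustration only, not the deployer's code. The service names are taken from the description above. The node names (`admin-vm-1`, `admin-vm-2`, `replicated-store`) and the specific layout of the second topology are assumptions made for the example.

```python
from typing import Dict, Set

# Service names follow the issue text; they are labels for this sketch,
# not identifiers from the deployer codebase.
SERVICES = ["control-db", "admin-dashboard", "embedder", "voice", "compute-workers"]


def surviving(placement: Dict[str, Set[str]], failed: str) -> Set[str]:
    """Return the services that still have at least one replica on a live node."""
    return {svc for svc, nodes in placement.items() if nodes - {failed}}


# Current layout: every service lives only on the single admin VM.
today = {svc: {"admin-vm-1"} for svc in SERVICES}

# Hypothetical layout combining two of the options: a second admin VM on
# another node, with the control database moved to a resilient backing.
two_vms = {svc: {"admin-vm-1", "admin-vm-2"} for svc in SERVICES}
two_vms["control-db"] = {"replicated-store"}

print(sorted(surviving(today, "admin-vm-1")))    # prints []
print(sorted(surviving(two_vms, "admin-vm-1")))  # prints all five services
```

In this model, losing `admin-vm-1` takes out everything in the current layout, while the second layout keeps every service reachable. The second layout still depends on `replicated-store` for the control database, so that store would need its own redundancy.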