[P0] my_compute_zos_server not running on admin VM 0069 blocks deployer.provision_vm #13
Found during s169 end-to-end admin walk on the public QA admin VM.
`deployer.provision_vm` returns:

State on VM 0069:

- `my_compute_*` binaries are installed under `/home/driver/hero/bin/` (timestamps from May 26 23:17), including `my_compute_zos_server`.
- `service.toml` files exist on disk under `/home/driver/hero/code/hero_compute/crates/*/service.toml`.
- `hero_proc service list` does not list any compute service; only `hero_tfgrid_deployer_admin` and `hero_tfgrid_deployer_server`.
- `/home/driver/hero/var/sockets/hero_compute*`.
- `hero_tfgrid_deployer_server` env has `HERO_COMPUTE_NODE_ADDR=127.0.0.1:9988`, so the deployer correctly tries to reach a local compute daemon that nobody ever started.
- `core/TFGRID_MNEMONIC` and `core/TFGRID_NETWORK=qa` are already set in the secret store.

History: s158 admin VM 0062 had `my_compute_zos_server` running locally. The s164 redeploy onto VM 0069 included `setup-binaries.sh` but did not register or start any compute service. That gap was not surfaced until the s169 verification walk because the s168 admin UX code was the first thing to exercise the path.

Fix shape (operational, no code change):

1. `hero_proc service add` for `my_compute_zos_server` using the existing `service.toml`, with the secrets it expects already in place.
2. `hero_proc service start my_compute_zos_server` and confirm it binds 127.0.0.1:9988.
3. `compute.list_nodes` or `compute.deploy_vm` smoke from the VM.
4. Update `lhumina_code/home/docs/channels/free/admin-vm-deployment-runbook.md` to include the compute daemon bring-up explicitly, and update `setup-binaries.sh` if there is a deterministic way to register the service automatically on fresh installs.

Per-session scope: surfaced and reproduced in s169, not fixed in s169 because bringing up a substrate-talking daemon mid-session adds orphan-contract risk on QA twin 703 and was outside the medium-effort budget for this verification session.
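The "confirm it binds 127.0.0.1:9988" check in the fix shape could be scripted with a plain TCP probe. The following is a minimal sketch using only the Rust standard library; the helper name `is_listening` is hypothetical and is not part of hero_proc or the deployer. It only shows that something accepts connections on the port, not that the listener is actually `my_compute_zos_server`.

```rust
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

/// Hypothetical helper: returns true if a TCP listener accepts a
/// connection at `addr` (an `ip:port` string) within `timeout`.
/// An unparsable address is treated as "not listening".
fn is_listening(addr: &str, timeout: Duration) -> bool {
    match addr.parse::<SocketAddr>() {
        // connect_timeout fails on refusal or timeout, which is what
        // an unstarted daemon would look like from the deployer side.
        Ok(sock) => TcpStream::connect_timeout(&sock, timeout).is_ok(),
        Err(_) => false,
    }
}
```

On the VM this would be called as `is_listening("127.0.0.1:9988", Duration::from_secs(1))` after the service start; a `false` result matches the failure mode described in this issue.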
mik-tf referenced this issue from lhumina_code/home 2026-05-27 16:11:07 +00:00
Closed at s170.
`my_compute_zos_server` is now registered + running on admin VM `0069` under hero_proc.

Bring-up steps:

- `lab service my_compute_zos_server --start` on the VM as the `driver` user.
- `ComputeService.tfgrid_twin_id` returned 703 (QA), `register_node(5)` assigned local sid `0001`.
- `lhumina_code/home/docs/channels/free/admin-vm-deployment-runbook.md` updated to include the daemon registration and the per-daemon-fresh local node sid dance.

End-to-end Provision verified: `deployer.provision_vm` for a freshly minted Forge user with an uploaded SSH key returned a running VM (`vm_sid=000m`, mycelium IP populated), and `ssh root@[mycelium_ip]` returned the Ubuntu 24.04 banner.

Two related deployer fixes shipped alongside (development @ `0ffaa0e`):

- `COMPUTE_RPC_PATH` in `compute.rs` was stale at `/hero_compute_zos/...` past the my_compute workspace rename; fixed to `/my_compute_zos/...`.
- `HERO_COMPUTE_NODE_ADDR` is now wired through `crates/hero_tfgrid_deployer_server/service.toml` so the hero_proc-set secret reaches the supervised process env. The matching secret on the VM is set to `127.0.0.1:9988`.
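Since the second fix depends on `HERO_COMPUTE_NODE_ADDR` actually reaching the supervised process, a startup-time validation step can make a missing or malformed value fail loudly instead of surfacing later as a provisioning error. This is an illustrative sketch only: the function `compute_node_addr` is hypothetical and is not taken from `compute.rs`; the only names from this issue are the variable name and the `127.0.0.1:9988` value.

```rust
use std::net::SocketAddr;

/// Hypothetical validator for the raw value of HERO_COMPUTE_NODE_ADDR.
/// `raw` is None when the variable is not present in the process env.
fn compute_node_addr(raw: Option<&str>) -> Result<SocketAddr, String> {
    let value = raw.ok_or_else(|| "HERO_COMPUTE_NODE_ADDR is not set".to_string())?;
    // Only literal ip:port values parse here; hostnames are rejected.
    value
        .trim()
        .parse::<SocketAddr>()
        .map_err(|e| format!("invalid HERO_COMPUTE_NODE_ADDR {value:?}: {e}"))
}
```

A caller would pass `std::env::var("HERO_COMPUTE_NODE_ADDR").ok().as_deref()` at startup and abort on `Err`. Taking the value as an argument rather than reading the environment inside the function keeps the check easy to test.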