[P0] my_compute_zos_server not running on admin VM 0069 blocks deployer.provision_vm #13

Closed
opened 2026-05-27 15:13:59 +00:00 by mik-tf · 1 comment
Owner

Found during s169 end-to-end admin walk on the public QA admin VM.

deployer.provision_vm returns:

RPC error -32000: compute.deploy_vm: compute RPC error:
{"code":-32000,"message":"Socket 'rpc.sock' not found for 'hero_compute_zos' — daemon not running"}
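A minimal sketch of how a caller could distinguish this failure from other compute errors. It assumes the deployer error string embeds the compute error as a JSON object, as in the output above. The helper names `parse_compute_error` and `is_daemon_down` are hypothetical and are not part of the deployer.

```python
import json

# Sample copied from the error above (\u2014 is the dash in the original message).
SAMPLE = '''RPC error -32000: compute.deploy_vm: compute RPC error:
{"code":-32000,"message":"Socket 'rpc.sock' not found for 'hero_compute_zos' \u2014 daemon not running"}'''


def parse_compute_error(text):
    """Return the first JSON object embedded in an error string, or None."""
    start = text.find("{")
    if start == -1:
        return None
    try:
        # raw_decode stops at the end of the first JSON value and ignores trailing text.
        obj, _ = json.JSONDecoder().raw_decode(text[start:])
    except json.JSONDecodeError:
        return None
    return obj if isinstance(obj, dict) else None


def is_daemon_down(text):
    """True when the embedded error message reports that the daemon is not running."""
    err = parse_compute_error(text)
    return bool(err) and "daemon not running" in err.get("message", "")
```

Matching on the message substring is brittle. If the compute service exposes a dedicated error code for a missing socket, checking that code would be the better choice.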

State on VM 0069:

  • All my_compute_* binaries are installed under /home/driver/hero/bin/ (timestamps from May 26 23:17), including my_compute_zos_server.
  • service.toml files exist on disk under /home/driver/hero/code/hero_compute/crates/*/service.toml.
  • hero_proc service list does not list any compute service; only hero_tfgrid_deployer_admin and hero_tfgrid_deployer_server.
  • No sockets under /home/driver/hero/var/sockets/hero_compute*.
  • hero_tfgrid_deployer_server env has HERO_COMPUTE_NODE_ADDR=127.0.0.1:9988, so the deployer correctly tries to reach a local compute daemon that nobody ever started.
  • core/TFGRID_MNEMONIC and core/TFGRID_NETWORK=qa are already set in the secret store.
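The socket and address checks in the list above can be sketched as a small script. This is an illustration only. The path pattern and address come from this issue, and the function names are hypothetical. It does not query hero_proc.

```python
import glob
import socket


def parse_addr(addr):
    """Split a 'host:port' string such as HERO_COMPUTE_NODE_ADDR into (host, port)."""
    host, _, port = addr.rpartition(":")
    return host, int(port)


def tcp_port_open(host, port, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def matching_sockets(pattern):
    """Return the sorted paths matching a glob pattern (empty list if none)."""
    return sorted(glob.glob(pattern))


def compute_daemon_state(addr, socket_pattern):
    """Collect the two observations from the list above into one dict."""
    host, port = parse_addr(addr)
    return {
        "listening": tcp_port_open(host, port),
        "sockets": matching_sockets(socket_pattern),
    }

# Example call with the values reported in this issue:
# compute_daemon_state("127.0.0.1:9988",
#                      "/home/driver/hero/var/sockets/hero_compute*")
```

On VM 0069 in the state described here, this would be expected to report no listener and no sockets.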

History: in s158, admin VM 0062 had my_compute_zos_server running locally. The s164 redeploy onto VM 0069 included setup-binaries.sh but did not register or start any compute service. That gap was not surfaced until the s169 verification walk, because the s168 admin UX code was the first thing to exercise this path.

Fix shape (operational, no code change):

  1. hero_proc service add for my_compute_zos_server using the existing service.toml, with the secrets it expects already in place.
  2. hero_proc service start my_compute_zos_server and confirm it binds 127.0.0.1:9988.
  3. Verify with a direct compute.list_nodes or compute.deploy_vm smoke from the VM.
  4. Update lhumina_code/home/docs/channels/free/admin-vm-deployment-runbook.md to include the compute daemon bring-up explicitly, and update setup-binaries.sh if there is a deterministic way to register the service automatically on fresh installs.
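For step 2, confirming the bind can be sketched as a polling check run after the service start. This only tests that something accepts TCP connections on the address. It does not verify that the listener is my_compute_zos_server, and the function name and default timings are assumptions.

```python
import socket
import time


def wait_for_listener(host, port, timeout=10.0, interval=0.5):
    """Poll until host:port accepts a TCP connection or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Not listening yet (or refused): retry until the deadline passes.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)

# Example: wait_for_listener("127.0.0.1", 9988) after starting the service.
```

A True result is only a precondition for step 3. The compute.list_nodes or compute.deploy_vm smoke is still what shows the daemon is usable.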

Per-session scope: surfaced and reproduced in s169, but not fixed in s169. Bringing up a substrate-talking daemon mid-session adds orphan-contract risk on QA twin 703, and the fix was outside the medium-effort budget for this verification session.

Author
Owner

Closed at s170. my_compute_zos_server is now registered and running on admin VM 0069 under hero_proc.

Bring-up steps:

  • lab service my_compute_zos_server --start on the VM as the driver user.
  • Smoke: ComputeService.tfgrid_twin_id returned 703 (QA), register_node(5) assigned local sid 0001.
  • Runbook step 10 in lhumina_code/home/docs/channels/free/admin-vm-deployment-runbook.md updated to include the daemon registration and the local node sid registration, which has to be redone for each fresh daemon.

End-to-end Provision verified: deployer.provision_vm for a freshly minted Forge user with an uploaded SSH key returned a running VM (vm_sid=000m, mycelium IP populated), and ssh root@[mycelium_ip] returned the Ubuntu 24.04 banner.

Two related deployer fixes shipped alongside (development @ 0ffaa0e):

  • COMPUTE_RPC_PATH in compute.rs still pointed at /hero_compute_zos/... after the my_compute workspace rename; it now points at /my_compute_zos/....
  • HERO_COMPUTE_NODE_ADDR is now wired through crates/hero_tfgrid_deployer_server/service.toml so the hero_proc-set secret reaches the supervised process env. The matching secret on the VM is set to 127.0.0.1:9988.
Reference
lhumina_code/hero_os_tfgrid_deployer#13