coopcloud_code/home

hero_compute auto-heartbeat — keep nodes online (F14) #70

Closed

opened 2026-04-08 23:07:45 +00:00 by mik-tf · 1 comment

mik-tf commented

2026-04-08 23:07:45 +00:00

Member

Context

hero_compute nodes register with the explorer and send heartbeats to stay online. However, the heartbeat TTL expires and the node goes offline, requiring manual intervention to bring it back.

This means farmer nodes periodically disappear from the catalog until someone manually restarts the heartbeat.

Current behavior

hero_compute_server registers with explorer on startup
Sends initial heartbeat
Heartbeat TTL expires (exact TTL TBD — needs investigation)
Node shows as offline in explorer
Manual heartbeat required to bring it back online

Options to investigate

Option A: Background task in marketplace backend

Periodic task (tokio::spawn) that sends heartbeat for all registered nodes
Pro: centralized, easy to monitor
Con: marketplace shouldn't own node health

Option B: Fix in hero_compute_server itself

hero_compute_server should auto-renew its own heartbeat on a timer
Pro: correct ownership — the node keeps itself alive
Con: may require upstream changes to hero_compute
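
A minimal sketch of what Option B could look like, assuming nothing about the real hero_compute_server code. The function name `run_heartbeat_loop`, the `send` callback, and the `max_beats` bound are all hypothetical; a real server would presumably loop until shutdown rather than for a fixed count. The point it illustrates is that a failed send is retried on the next tick instead of ending the loop.

```rust
use std::thread;
use std::time::Duration;

/// Hypothetical sender loop (not the hero_compute_server implementation).
/// Calls `send` once per `interval`, up to `max_beats` times.
/// A failed send is logged and retried on the next tick instead of
/// terminating the loop. Returns the number of heartbeats that succeeded.
pub fn run_heartbeat_loop<F>(interval: Duration, max_beats: usize, mut send: F) -> usize
where
    F: FnMut() -> Result<(), String>,
{
    let mut delivered = 0;
    for beat in 0..max_beats {
        match send() {
            Ok(()) => delivered += 1,
            Err(err) => eprintln!("heartbeat {beat} failed, retrying next tick: {err}"),
        }
        // Sleep between beats, but not after the final one.
        if beat + 1 < max_beats {
            thread::sleep(interval);
        }
    }
    delivered
}
```

In this sketch the caller supplies the transport, so `send` could wrap whatever call actually reaches the explorer. Keeping the loop free of transport details also makes it easy to exercise with a fake sender and a millisecond interval.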

Option C: Systemd timer on each compute node

Simple cron/systemd timer that calls the heartbeat API
Pro: works immediately, no code changes
Con: fragile, not infrastructure-as-code

Investigation needed before implementation

What is the heartbeat TTL in hero_compute_explorer?
Does hero_compute_server already have a heartbeat loop that's just not working?
What's the heartbeat API call? (endpoint, auth)
Is this a bug in hero_compute or a missing feature?

Repos

Likely hero_compute (upstream) or projectmycelium_marketplace_backend (workaround)

Priority

Medium — affects farmer reliability but has a manual workaround.

— mik-tf

mik-tf referenced this issue

2026-04-08 23:08:06 +00:00

Launch Checklist — Remaining Items Until Production #55

mik-tf commented

2026-04-10 19:11:52 +00:00

Author

Member

Investigation Complete — Not a Marketplace Issue

Root Cause

The heartbeat is a push-based system entirely within hero_compute:

hero_compute_server (on each node)
  → sends heartbeat every 60s (HERO_COMPUTE_HEARTBEAT_INTERVAL_SECS)
  → to explorer(s) listed in EXPLORER_ADDRESSES env var

hero_compute_explorer (master)
  → marks node offline after 600s without heartbeat (HERO_COMPUTE_OFFLINE_THRESHOLD_SECS)

marketplace (passive consumer)
  → queries explorer, filters status=="online"
  → zero role in keeping nodes alive
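
The split of responsibilities above can be sketched roughly as follows. This is an illustration only: the `Node` struct, `status`, and `online_ids` are hypothetical names, and treating a node as offline only when the silence strictly exceeds the threshold is an assumption about the boundary case, not something confirmed from the explorer source.

```rust
/// Default taken from the thread: HERO_COMPUTE_OFFLINE_THRESHOLD_SECS = 600.
pub const OFFLINE_THRESHOLD_SECS: u64 = 600;

/// Hypothetical record of a registered node (not the explorer's real type).
#[derive(Debug, Clone, PartialEq)]
pub struct Node {
    pub id: u32,
    /// Timestamp (in seconds) of the last heartbeat the explorer received.
    pub last_heartbeat_secs: u64,
}

/// Explorer side: derive a status from the time since the last heartbeat.
/// Assumption: a node goes offline only once the silence exceeds the threshold.
pub fn status(node: &Node, now_secs: u64, threshold_secs: u64) -> &'static str {
    let silent_for = now_secs.saturating_sub(node.last_heartbeat_secs);
    if silent_for > threshold_secs {
        "offline"
    } else {
        "online"
    }
}

/// Marketplace side: a read-only filter over what the explorer reports.
/// It never sends heartbeats, so it cannot keep a node alive.
pub fn online_ids(nodes: &[Node], now_secs: u64, threshold_secs: u64) -> Vec<u32> {
    nodes
        .iter()
        .filter(|n| status(n, now_secs, threshold_secs) == "online")
        .map(|n| n.id)
        .collect()
}
```

With the defaults quoted in this thread (60s interval, 600s threshold), the threshold spans ten heartbeat intervals, so a node that drops out of the catalog has been silent for far longer than a single missed beat.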

Key Files (upstream in hero_compute)

| File | Role |
|------|------|
| `hero_compute_server/src/heartbeat_sender.rs` | Sends heartbeat every 60s |
| `hero_compute_explorer/src/explorer/heartbeat.rs` | Monitors nodes, marks offline after 600s TTL |
| `hero_compute_explorer/src/explorer/rpc.rs` | Receives `node_heartbeat` RPC calls |

Env Vars That Control It

| Var | Default | Purpose |
|-----|---------|---------|
| `HERO_COMPUTE_HEARTBEAT_INTERVAL_SECS` | 60 | How often server sends heartbeat |
| `HERO_COMPUTE_OFFLINE_THRESHOLD_SECS` | 600 | TTL before explorer marks node offline |
| `EXPLORER_ADDRESSES` | — | Where server sends heartbeats (must be set) |
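
One way to make a missing `EXPLORER_ADDRESSES` visible at startup, instead of having heartbeats silently go nowhere, is to fail fast while loading configuration. The sketch below is hypothetical: `HeartbeatConfig`, `load_config`, and `load_config_from_env` are made-up names, and the comma-separated format for `EXPLORER_ADDRESSES` is an assumption, since the thread does not say how multiple explorers are listed.

```rust
use std::env;

/// Hypothetical sender-side settings (not the hero_compute_server type).
#[derive(Debug, PartialEq)]
pub struct HeartbeatConfig {
    pub interval_secs: u64,
    pub explorer_addresses: Vec<String>,
}

/// Builds the config from a key lookup so it can be tested without
/// touching the real process environment.
pub fn load_config<F>(lookup: F) -> Result<HeartbeatConfig, String>
where
    F: Fn(&str) -> Option<String>,
{
    // Default of 60 seconds, as listed in the table above.
    let interval_secs = match lookup("HERO_COMPUTE_HEARTBEAT_INTERVAL_SECS") {
        Some(raw) => raw.trim().parse::<u64>().map_err(|e| {
            format!("invalid HERO_COMPUTE_HEARTBEAT_INTERVAL_SECS {raw:?}: {e}")
        })?,
        None => 60,
    };

    // Assumption: multiple explorers are given as a comma-separated list.
    let explorer_addresses: Vec<String> = lookup("EXPLORER_ADDRESSES")
        .unwrap_or_default()
        .split(',')
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .map(String::from)
        .collect();

    // No default exists for this variable, so refuse to start without it.
    if explorer_addresses.is_empty() {
        return Err("EXPLORER_ADDRESSES is not set; heartbeats have no destination".to_string());
    }

    Ok(HeartbeatConfig { interval_secs, explorer_addresses })
}

/// Convenience wrapper that reads from the process environment.
pub fn load_config_from_env() -> Result<HeartbeatConfig, String> {
    load_config(|key| env::var(key).ok())
}
```

A check like this would turn the first suspected cause on dev (the variable not being set on the compute server) into an explicit startup error rather than a node that quietly goes offline.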

Likely Root Cause on Dev

EXPLORER_ADDRESSES may not be set on the compute server running on node 50
Or the hero_compute_server process crashes/restarts and the heartbeat loop doesn't survive
This is an upstream fix in hero_compute_server or MOS config, not marketplace code

Resolution

Closing as not-a-marketplace-issue. The fix belongs in:

hero_compute_server — ensure heartbeat loop is resilient to restarts
MOS config (mos_config) — ensure EXPLORER_ADDRESSES is baked into node config
hero_proc — ensure hero_compute_server auto-restarts if it crashes
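
For the auto-restart item, the general idea can be sketched as a small supervisor loop. This is not how hero_proc works (its behaviour is not described in this thread); `supervise`, `max_restarts`, and `backoff` are hypothetical names chosen for illustration.

```rust
use std::thread;
use std::time::Duration;

/// Hypothetical supervisor (not hero_proc): reruns `run` whenever it
/// returns an error, waiting `backoff` between attempts.
/// Returns Ok(restart count) once `run` exits cleanly, or Err after
/// `max_restarts` restarts have been used up.
pub fn supervise<F>(max_restarts: u32, backoff: Duration, mut run: F) -> Result<u32, String>
where
    F: FnMut() -> Result<(), String>,
{
    let mut restarts = 0;
    loop {
        match run() {
            Ok(()) => return Ok(restarts),
            Err(err) if restarts < max_restarts => {
                restarts += 1;
                eprintln!("task failed ({err}); restart {restarts} of {max_restarts}");
                thread::sleep(backoff);
            }
            Err(err) => return Err(format!("giving up after {restarts} restarts: {err}")),
        }
    }
}
```

Whatever mechanism is used in practice, the property that matters here is that the heartbeat loop starts again with the process, so a crash costs at most a short gap in heartbeats rather than a node that stays offline until someone intervenes.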

Next step: MOS bare-metal deployment testing (see https://forge.ourworld.tf/mycelium_code/home/issues/55 for production infra)

Signed: mik-tf

mik-tf closed this issue

2026-04-10 19:12:03 +00:00

mik-tf referenced this issue

2026-04-10 19:21:24 +00:00

MOS bare-metal deployment + full stack integration test #71