hero_compute auto-heartbeat — keep nodes online (F14) #70

Closed
opened 2026-04-08 23:07:45 +00:00 by mik-tf · 1 comment
Member

Context

hero_compute nodes register with the explorer and send heartbeats to stay online. However, once the heartbeat TTL expires the explorer shows the node as offline, and bringing it back requires manual intervention.

This means farmer nodes periodically disappear from the catalog until someone manually restarts the heartbeat.

Current behavior

  1. hero_compute_server registers with explorer on startup
  2. Sends initial heartbeat
  3. Heartbeat TTL expires (exact TTL TBD — needs investigation)
  4. Node shows as offline in explorer
  5. Manual heartbeat required to bring it back online

Options to investigate

Option A: Background task in marketplace backend

  • Periodic task (tokio::spawn) that sends heartbeat for all registered nodes
  • Pro: centralized, easy to monitor
  • Con: marketplace shouldn't own node health

Option B: Fix in hero_compute_server itself

  • hero_compute_server should auto-renew its own heartbeat on a timer
  • Pro: correct ownership — the node keeps itself alive
  • Con: may require upstream changes to hero_compute
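Option B amounts to a timer-driven loop inside the node process. The sketch below is one way such a loop could look, using only the Rust standard library. It is not the code in hero_compute: `run_heartbeat_loop`, the `rounds` bound, and the `send` callback (standing in for the real heartbeat call) are hypothetical names chosen for illustration. The point it shows is that a failed send is logged and retried on the next tick instead of ending the loop.

```rust
use std::thread;
use std::time::Duration;

/// Calls `send` once per `interval` for `rounds` rounds and returns how many
/// sends succeeded. A failed send does not stop the loop; the next tick
/// simply tries again. (`send` is a hypothetical stand-in for the real
/// heartbeat call; a long-running server would loop without a round limit.)
pub fn run_heartbeat_loop<F>(interval: Duration, rounds: u32, mut send: F) -> u32
where
    F: FnMut() -> Result<(), String>,
{
    let mut succeeded = 0;
    for round in 0..rounds {
        match send() {
            Ok(()) => succeeded += 1,
            Err(err) => eprintln!("heartbeat {round} failed, retrying next tick: {err}"),
        }
        // Sleep between rounds, but not after the last one.
        if round + 1 < rounds {
            thread::sleep(interval);
        }
    }
    succeeded
}
```

The `rounds` limit exists only so the sketch terminates and can be exercised in a test; passing a closure keeps the loop independent of whatever transport the real server uses.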

Option C: Systemd timer on each compute node

  • Simple cron/systemd timer that calls the heartbeat API
  • Pro: works immediately, no code changes
  • Con: fragile, not infrastructure-as-code

Investigation needed before implementation

  • What is the heartbeat TTL in hero_compute_explorer?
  • Does hero_compute_server already have a heartbeat loop that's just not working?
  • What's the heartbeat API call? (endpoint, auth)
  • Is this a bug in hero_compute or a missing feature?

Repos

  • Likely hero_compute (upstream) or projectmycelium_marketplace_backend (workaround)

Priority

Medium — affects farmer reliability but has a manual workaround.

— mik-tf

Author
Member

Investigation Complete — Not a Marketplace Issue

Root Cause

The heartbeat is a push-based system entirely within hero_compute:

hero_compute_server (on each node)
  → sends heartbeat every 60s (HERO_COMPUTE_HEARTBEAT_INTERVAL_SECS)
  → to explorer(s) listed in EXPLORER_ADDRESSES env var

hero_compute_explorer (master)
  → marks node offline after 600s without heartbeat (HERO_COMPUTE_OFFLINE_THRESHOLD_SECS)

marketplace (passive consumer)
  → queries explorer, filters status=="online"
  → zero role in keeping nodes alive
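The explorer-side rule in the diagram can be sketched as a pure function of the time since the last heartbeat. This is an illustration under assumptions, not the code in `heartbeat.rs`: the `NodeStatus` enum and both function names are hypothetical, and whether the real check treats exactly 600s as online or offline is not stated in this issue (the sketch treats it as online).

```rust
use std::time::Duration;

/// Node status as an explorer might report it (illustrative type).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum NodeStatus {
    Online,
    Offline,
}

/// Classifies a node from the time elapsed since its last heartbeat.
/// A node that has never sent a heartbeat (`None`) counts as offline.
/// Assumption: elapsed time equal to the threshold still counts as online.
pub fn node_status(since_last_heartbeat: Option<Duration>, offline_threshold: Duration) -> NodeStatus {
    match since_last_heartbeat {
        Some(elapsed) if elapsed <= offline_threshold => NodeStatus::Online,
        _ => NodeStatus::Offline,
    }
}

/// How many whole heartbeat intervals fit into the offline threshold.
/// Returns `None` for a zero interval to avoid dividing by zero.
pub fn intervals_per_threshold(interval: Duration, offline_threshold: Duration) -> Option<u64> {
    match interval.as_secs() {
        0 => None,
        secs => Some(offline_threshold.as_secs() / secs),
    }
}
```

With the defaults quoted above (60s interval, 600s threshold) the threshold spans 10 heartbeat intervals, so a node that goes offline has been silent for far longer than a single missed heartbeat.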

Key Files (upstream in hero_compute)

| File | Role |
|------|------|
| hero_compute_server/src/heartbeat_sender.rs | Sends heartbeat every 60s |
| hero_compute_explorer/src/explorer/heartbeat.rs | Monitors nodes, marks offline after 600s TTL |
| hero_compute_explorer/src/explorer/rpc.rs | Receives node_heartbeat RPC calls |

Env Vars That Control It

| Var | Default | Purpose |
|-----|---------|---------|
| HERO_COMPUTE_HEARTBEAT_INTERVAL_SECS | 60 | How often server sends heartbeat |
| HERO_COMPUTE_OFFLINE_THRESHOLD_SECS | 600 | TTL before explorer marks node offline |
| EXPLORER_ADDRESSES | (none) | Where server sends heartbeats (must be set) |
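Since a missing `EXPLORER_ADDRESSES` is one of the suspected causes, a startup check that fails loudly would make that case easy to spot. The sketch below shows one possible shape for it. It is not taken from hero_compute: `HeartbeatConfig` and its methods are hypothetical, and the comma-separated format for `EXPLORER_ADDRESSES` is an assumption, since the issue does not say how multiple explorers are encoded.

```rust
use std::env;
use std::time::Duration;

/// Server-side heartbeat settings (illustrative type, not from hero_compute).
#[derive(Debug, PartialEq)]
pub struct HeartbeatConfig {
    pub interval: Duration,
    pub explorers: Vec<String>,
}

impl HeartbeatConfig {
    /// Builds the config from any key lookup, so it can be tested without
    /// touching the process environment.
    pub fn from_lookup<F>(lookup: F) -> Result<Self, String>
    where
        F: Fn(&str) -> Option<String>,
    {
        // Default of 60s as listed in the table above.
        let interval_secs = match lookup("HERO_COMPUTE_HEARTBEAT_INTERVAL_SECS") {
            Some(raw) => raw
                .trim()
                .parse::<u64>()
                .map_err(|e| format!("invalid HERO_COMPUTE_HEARTBEAT_INTERVAL_SECS {raw:?}: {e}"))?,
            None => 60,
        };

        // Assumption: multiple explorers are separated by commas.
        let explorers: Vec<String> = lookup("EXPLORER_ADDRESSES")
            .unwrap_or_default()
            .split(',')
            .map(str::trim)
            .filter(|s| !s.is_empty())
            .map(str::to_owned)
            .collect();

        // Refuse to start with no explorer rather than silently sending nothing.
        if explorers.is_empty() {
            return Err("EXPLORER_ADDRESSES is unset or empty; no heartbeats would be sent".to_string());
        }

        Ok(Self { interval: Duration::from_secs(interval_secs), explorers })
    }

    /// Reads the same settings from the process environment.
    pub fn from_env() -> Result<Self, String> {
        Self::from_lookup(|key| env::var(key).ok())
    }
}
```

Calling `HeartbeatConfig::from_env()` at startup and exiting on `Err` would turn a misconfigured node into an immediate, visible failure instead of a node that registers and then drops out of the catalog.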

Likely Root Cause on Dev

  • EXPLORER_ADDRESSES may not be set on the compute server running on node 50
  • Or the hero_compute_server process crashes/restarts and the heartbeat loop doesn't survive
  • This is an upstream fix in hero_compute_server or MOS config, not marketplace code

Resolution

Closing as not-a-marketplace-issue. The fix belongs in:

  • hero_compute_server — ensure heartbeat loop is resilient to restarts
  • MOS config (mos_config) — ensure EXPLORER_ADDRESSES is baked into node config
  • hero_proc — ensure hero_compute_server auto-restarts if it crashes

Next step: MOS bare-metal deployment testing (see mycelium_code/home#55 for production infra)

Signed: mik-tf

Reference
coopcloud_code/home#70