feat: Upgrade-aware install script — preserve running state across binary updates #77

Open
opened 2026-04-07 11:34:48 +00:00 by mahmoud · 0 comments
Owner

Problem

The install script replaces binaries and restarts hero_proc/mycelium, but does not gracefully handle an already-running hero_compute deployment. After upgrading:

  • Services are killed (not gracefully stopped)
  • hero_compute is not restarted — user must manually run hero_compute --start --mode <mode>
  • The UI shows "Register Node" even though the database (~/hero/var/compute/data/) still has the node record and all VMs
  • The user doesn't know what mode the node was running in (local/master/worker) or what master IP was configured

No data is lost (the sled database survives), but the user experience is confusing and error-prone.

Objective

Make install.sh seamlessly upgrade a running deployment — stop gracefully, replace binaries, restart with the same configuration.

Current Behavior

  1. Downloads binaries (skips if version matches)
  2. Starts hero_proc_server and mycelium
  3. Prints "run hero_compute --start"
  4. User must remember their mode and flags

Desired Behavior

  1. Detect if hero_compute is currently running
  2. Read ~/hero/var/hero_compute.env to capture current mode, master IP, and other config
  3. Gracefully stop: hero_compute --stop (not pkill)
  4. Download and replace binaries
  5. Restart hero_proc_server and mycelium
  6. Auto-restart hero_compute with the saved mode and flags
  7. Verify services are healthy

Implementation Plan

1. Pre-upgrade state capture

Before stopping anything, read ~/hero/var/hero_compute.env and extract:

  • HERO_COMPUTE_UI_MODE → the running mode (local/master/worker)
  • MASTER_IP → master IP if worker mode
  • EXPLORER_ADDRESSES → explorer config

Store these in shell variables for post-upgrade restart.
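A minimal sketch of this capture step (the env-file path and key names are the ones listed above; treating the file as plain KEY=value lines, and the SAVED_* variable names, are assumptions):

```bash
# Capture pre-upgrade state from the env file, if present.
ENV_FILE="$HOME/hero/var/hero_compute.env"
SAVED_MODE=""
SAVED_MASTER_IP=""
SAVED_EXPLORER=""
if [ -f "$ENV_FILE" ]; then
    # Pull only the keys we need; take the last occurrence of each.
    SAVED_MODE=$(grep -E '^HERO_COMPUTE_UI_MODE=' "$ENV_FILE" | tail -n1 | cut -d= -f2-)
    SAVED_MASTER_IP=$(grep -E '^MASTER_IP=' "$ENV_FILE" | tail -n1 | cut -d= -f2-)
    SAVED_EXPLORER=$(grep -E '^EXPLORER_ADDRESSES=' "$ENV_FILE" | tail -n1 | cut -d= -f2-)
fi
```

Grepping per key (rather than sourcing the whole file) avoids executing anything unexpected from the env file and leaves the script's own variables untouched.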

2. Graceful stop

Replace raw pkill with:

hero_compute --stop 2>/dev/null || true

This tells hero_proc to stop all hero_compute services cleanly, giving VMs time to shut down properly.
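Since the stop command may return before the services have actually exited, a bounded wait after it keeps the script from replacing binaries mid-shutdown (the process name and the 30-second budget are assumptions):

```bash
# Wait for hero_compute to exit after `hero_compute --stop` has been issued.
# Gives up after ~30s; the timeout value is an assumption.
for _ in $(seq 1 30); do
    pgrep -x hero_compute >/dev/null 2>&1 || break
    sleep 1
done
```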

3. Auto-restart with saved config

After binary replacement:

if [ "$SAVED_MODE" = "master" ]; then
    hero_compute --start --mode master
elif [ "$SAVED_MODE" = "worker" ] && [ -n "$SAVED_MASTER_IP" ]; then
    hero_compute --start --mode worker --master-ip "$SAVED_MASTER_IP"
else
    hero_compute --start
fi

4. Post-upgrade health check

After restart, verify:

  • Server socket exists at ~/hero/var/sockets/hero_compute/rpc.sock
  • UI responds on port 9001: curl -s http://localhost:9001/health
  • Explorer socket exists (master mode only)
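The checks above could be sketched as follows (the socket path and port come from this issue; the retry budget and warning message are assumptions):

```bash
# Post-upgrade health check: socket must exist and UI must answer /health.
SOCK="$HOME/hero/var/sockets/hero_compute/rpc.sock"
healthy=0
for _ in $(seq 1 10); do
    if [ -S "$SOCK" ] && curl -sf http://localhost:9001/health >/dev/null; then
        healthy=1
        break
    fi
    sleep 1
done
if [ "$healthy" -ne 1 ]; then
    echo "warning: hero_compute did not report healthy after upgrade" >&2
fi
```

In master mode the same pattern would also test the explorer socket before declaring the node healthy.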

5. Skip-start flag

Add --no-start flag for cases where the user wants to upgrade binaries without restarting:

curl -sfL .../install.sh | bash -s -- --no-start
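Flag handling in the script might look like this (only the new flag is shown; the NO_START variable name is an assumption):

```bash
# Parse install flags; unrecognized arguments are ignored here.
NO_START=0
for arg in "$@"; do
    case "$arg" in
        --no-start) NO_START=1 ;;
    esac
done
# Later: skip the auto-restart step when NO_START=1.
```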

Edge Cases

  • First install (no env file): Current behavior — just install, don't auto-start hero_compute
  • Services not running (env file exists but stopped): Install binaries, don't auto-start unless --start is passed
  • Mode changed: If user wants to switch from worker to master, they should stop, upgrade, and start with new flags manually
  • hero_proc not running: Start it before attempting hero_compute --start
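For the hero_proc case, a guard along these lines could run before the restart step (the pgrep check and the launch form are assumptions about how hero_proc_server is started):

```bash
# Ensure hero_proc_server is up before attempting `hero_compute --start`.
if ! pgrep -x hero_proc_server >/dev/null 2>&1; then
    hero_proc_server >/dev/null 2>&1 &
    sleep 1   # brief settle time; an assumed value
fi
```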

Related

  • #76 — UI should auto-detect registered node instead of showing "Register Node" prompt
  • #68 — Install script base issue

Acceptance Criteria

  • Running install.sh on a live master node upgrades binaries and restarts in master mode automatically
  • Running install.sh on a live worker node restarts with the correct master IP
  • VMs and node registration survive the upgrade
  • No manual hero_compute --start needed after upgrade
  • --no-start flag available to skip auto-restart
  • First-time install behavior unchanged