central source for configs #40

Closed
opened 2026-03-19 12:20:50 +00:00 by thabeta · 2 comments
Owner

e.g., the explorers that the project works against

something similar to https://github.com/threefoldtech/zos-config

mahmoud added this to the ACTIVE project 2026-03-25 09:45:56 +00:00
mahmoud added this to the now milestone 2026-03-25 09:45:58 +00:00
Owner

Analysis & Implementation Plan (v2)

Current State

  • All config flows through environment variables loaded from .env files
  • Two things are effectively hardcoded (with env overrides):
    • Image registry URL → https://forge.ourworld.tf/.../images.toml (in constants.rs)
    • Mycelium bootstrap peers → 10 TCP addresses (in constants.rs)
  • Everything else (socket paths, heartbeat intervals, slice sizes, ports) is per-node env vars

What needs to be done

1. Create a new config repo (e.g., forge.ourworld.tf/lhumina_code/hero_compute_config)

Use TOML (not JSON) — consistent with the rest of the hero ecosystem (images.toml, Cargo.toml, zinit configs). CI validation can still be done with a TOML linter/schema check step.

Structure:

hero_compute_config/
├── development.toml            # Dev environment config
├── testing.toml                # Testing/QA config  
├── production.toml             # Production config
└── .forgejo/workflows/
    └── validate.yaml           # CI to validate TOML files

Config file format — include a version field for future schema evolution:

version = "1"
environment = "production"

[network]
explorer_addresses = ["tcp://explorer.hero.tf:9002"]
registry_url = "https://forge.ourworld.tf/lhumina_code/hero_compute_registry/raw/branch/main/images.toml"
mycelium_peers = [
    "tcp://188.40.132.242:9651",
    "tcp://136.243.47.186:9651",
]

[tuning]
heartbeat_interval_secs = 300
offline_threshold_secs = 600
monitor_interval_secs = 60

2. Decide what goes in these config files — values shared across all nodes in an environment:

| Config Key | Current Location | Why centralize |
|---|---|---|
| registry_url | HERO_COMPUTE_REGISTRY_URL / hardcoded default | Same for all nodes in an env |
| mycelium_peers | DEFAULT_MYCELIUM_PEERS / hardcoded | Same for all nodes, changes over time |
| explorer_addresses | EXPLORER_ADDRESSES env var | Per-environment master address |
| heartbeat_interval_secs | env var, default 60 | Should be consistent per env |
| offline_threshold_secs | env var, default 600 | Should be consistent per env |

Things that should stay local (per-node .env): socket paths, ports, MASTER_IP, RUST_LOG, slice size (hardware-dependent).

3. Add a config fetcher to hero_compute

  • On startup, fetch the appropriate {environment}.toml from the raw git URL
  • Determine environment from a single env var like HERO_COMPUTE_ENV=development
  • Merge remote config with local .env overrides (local wins, so operators can still override)
  • Cache the config locally so nodes can start even if the git server is unreachable
  • Reuse the existing pattern from image_registry.rs which already does HTTP GET → parse TOML → return struct. Refactor it into a generic remote_config.rs that both the image registry and central config use.
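
The merge rule in step 3 (remote values as the baseline, local .env values winning) could look roughly like the sketch below. It is only an illustration under assumptions: the SharedConfig struct, the config_url helper, the base URL, and the HERO_COMPUTE_MYCELIUM_PEERS and HEARTBEAT_INTERVAL_SECS variable names are hypothetical; only HERO_COMPUTE_ENV and HERO_COMPUTE_REGISTRY_URL appear in the plan above. TOML parsing is left out so the example needs nothing beyond the standard library.

```rust
use std::collections::HashMap;

/// Shared settings as they might look after parsing `{environment}.toml`.
/// The struct and its field selection are illustrative, not the real schema.
#[derive(Debug, Clone, PartialEq)]
struct SharedConfig {
    registry_url: String,
    mycelium_peers: Vec<String>,
    heartbeat_interval_secs: u64,
}

/// Builds the URL of the per-environment file under a raw git base URL.
fn config_url(base_url: &str, environment: &str) -> String {
    format!("{}/{}.toml", base_url.trim_end_matches('/'), environment)
}

/// Applies local overrides on top of the remote config. Local wins.
fn apply_local_overrides(mut cfg: SharedConfig, env: &HashMap<String, String>) -> SharedConfig {
    if let Some(url) = env.get("HERO_COMPUTE_REGISTRY_URL") {
        cfg.registry_url = url.clone();
    }
    // Hypothetical variable name: a comma-separated peer list.
    if let Some(peers) = env.get("HERO_COMPUTE_MYCELIUM_PEERS") {
        cfg.mycelium_peers = peers
            .split(',')
            .map(|p| p.trim().to_string())
            .filter(|p| !p.is_empty())
            .collect();
    }
    // Hypothetical variable name: unparsable values are ignored, remote value stays.
    if let Some(raw) = env.get("HEARTBEAT_INTERVAL_SECS") {
        if let Ok(secs) = raw.parse::<u64>() {
            cfg.heartbeat_interval_secs = secs;
        }
    }
    cfg
}

fn main() {
    let env: HashMap<String, String> = std::env::vars().collect();
    let environment = env
        .get("HERO_COMPUTE_ENV")
        .cloned()
        .unwrap_or_else(|| "development".to_string());

    // Placeholder base URL; the real config repo location is still to be decided.
    let url = config_url("https://example.invalid/hero_compute_config/raw/branch/main", &environment);
    println!("would fetch {url}");

    // Stand-in for the parsed remote file.
    let remote = SharedConfig {
        registry_url: "https://example.invalid/images.toml".to_string(),
        mycelium_peers: vec!["tcp://188.40.132.242:9651".to_string()],
        heartbeat_interval_secs: 300,
    };
    println!("{:?}", apply_local_overrides(remote, &env));
}
```

Passing the environment in as a map rather than reading std::env inside the merge function keeps the precedence rule easy to unit test.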

4. Update constants.rs to remove hardcoded defaults for mycelium peers and registry URL — these become part of the central config.

5. Add CI validation in the config repo. Important security concern: production config changes must require approval from a specific maintainer before merging. If anyone can open a PR to production.toml and it auto-deploys to all nodes, that's a significant attack surface. Use branch protection + required reviewers.
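
As a rough idea of what the CI step might check once a file has been parsed, here is a minimal sketch. The CentralConfig struct and the individual rules are assumptions for illustration (in particular the rule that offline_threshold_secs must exceed heartbeat_interval_secs); the actual schema is meant to emerge later, and TOML parsing is omitted so the example stays standard-library only.

```rust
/// Fields taken from the example config above; the struct itself is hypothetical.
#[derive(Debug)]
struct CentralConfig {
    version: String,
    environment: String,
    mycelium_peers: Vec<String>,
    heartbeat_interval_secs: u64,
    offline_threshold_secs: u64,
}

/// Returns a list of human-readable problems; an empty list means the file passes.
/// `expected_environment` would come from the file name, e.g. "production".
fn validate(cfg: &CentralConfig, expected_environment: &str) -> Vec<String> {
    let mut errors = Vec::new();
    if cfg.version != "1" {
        errors.push(format!("unsupported version: {}", cfg.version));
    }
    if cfg.environment != expected_environment {
        errors.push(format!(
            "environment is {:?} but the file is {}.toml",
            cfg.environment, expected_environment
        ));
    }
    if cfg.mycelium_peers.is_empty() {
        errors.push("mycelium_peers must not be empty".to_string());
    }
    for peer in &cfg.mycelium_peers {
        if !peer.starts_with("tcp://") {
            errors.push(format!("peer is not a tcp:// address: {peer}"));
        }
    }
    if cfg.heartbeat_interval_secs == 0 {
        errors.push("heartbeat_interval_secs must be greater than 0".to_string());
    }
    // Assumed rule: a node should miss at least one heartbeat before it counts as offline.
    if cfg.offline_threshold_secs <= cfg.heartbeat_interval_secs {
        errors.push("offline_threshold_secs must exceed heartbeat_interval_secs".to_string());
    }
    errors
}

fn main() {
    let cfg = CentralConfig {
        version: "1".to_string(),
        environment: "production".to_string(),
        mycelium_peers: vec![
            "tcp://188.40.132.242:9651".to_string(),
            "tcp://136.243.47.186:9651".to_string(),
        ],
        heartbeat_interval_secs: 300,
        offline_threshold_secs: 600,
    };
    let errors = validate(&cfg, "production");
    for e in &errors {
        eprintln!("error: {e}");
    }
    // A non-zero exit code fails the CI job.
    if !errors.is_empty() {
        std::process::exit(1);
    }
    println!("production.toml: ok");
}
```

Checks like these complement branch protection and required reviewers but do not replace them: they catch malformed values, not malicious ones.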

Recommended order of work

  1. Refactor image_registry.rs into a generic remote_config.rs module that both the image registry and the central config can use. This avoids duplication.
  2. Create the config repo with a development.toml containing the current hardcoded values, using the versioned TOML format above.
  3. Build the config fetcher using the refactored remote_config.rs — get it working with a minimal config first.
  4. Wire it into startup in hero_compute/src/main.rs — fetch config before spawning services, with local .env overrides.
  5. Add fallback logic — use cached config if fetch fails.
  6. Remove hardcoded defaults from constants.rs, source them from the central config instead.
  7. Define the TOML schema/validation once you see what fields naturally emerge from actual usage.
  8. Add CI validation workflow + branch protection for production config.
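
The fallback step (use the cached config if the fetch fails) can be sketched independently of the HTTP client by passing the fetch in as a closure. Everything here is a hypothetical illustration: load_with_fallback, ConfigSource, and the cache location are made-up names, and a real implementation would probably parse and validate the body before overwriting the cache.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Where the returned config text came from.
#[derive(Debug, PartialEq)]
enum ConfigSource {
    Remote,
    Cache,
}

/// Tries the remote fetch first and refreshes the cache on success.
/// On fetch failure, falls back to the cached copy; if there is none,
/// the original fetch error is returned.
fn load_with_fallback<F>(fetch: F, cache_path: &Path) -> io::Result<(String, ConfigSource)>
where
    F: FnOnce() -> io::Result<String>,
{
    match fetch() {
        Ok(body) => {
            // Best-effort cache refresh: a read-only disk must not block startup.
            if let Some(parent) = cache_path.parent() {
                let _ = fs::create_dir_all(parent);
            }
            let _ = fs::write(cache_path, &body);
            Ok((body, ConfigSource::Remote))
        }
        Err(fetch_err) => match fs::read_to_string(cache_path) {
            Ok(cached) => Ok((cached, ConfigSource::Cache)),
            Err(_) => Err(fetch_err),
        },
    }
}

fn main() -> io::Result<()> {
    // Hypothetical cache location, used only for this demo.
    let cache = std::env::temp_dir()
        .join("hero_compute_config_demo")
        .join("development.toml");

    // First start: the "server" answers, so the cache gets populated.
    let (body, source) = load_with_fallback(|| Ok("version = \"1\"\n".to_string()), &cache)?;
    println!("{:?}: {}", source, body.trim());

    // Later start: the fetch fails, so the cached copy is used instead.
    let (body, source) = load_with_fallback(
        || Err(io::Error::new(io::ErrorKind::Other, "git server unreachable")),
        &cache,
    )?;
    println!("{:?}: {}", source, body.trim());
    Ok(())
}
```

Returning the source alongside the body lets startup log a warning when a node is running on cached, possibly stale, configuration.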

Blocker

The explorer_addresses field in the central config is only useful once nodes can actually connect to remote explorers over TCP. Until the TCP transport issue is resolved, centralizing that field has no practical effect — nodes in worker mode still need the direct MASTER_IP env var.

Owner

Note: the explorer_addresses field will be populated but ignored until the native TCP transport issue (https://forge.ourworld.tf/lhumina_code/hero_compute/issues/30) is resolved. Until then, nodes continue using MASTER_IP plus a socat bridge. The field is included in the schema now so that the format is already stable when TCP lands.