Production hardening, security, stability, and code quality #51

Closed
opened 2026-03-31 09:56:40 +00:00 by mahmoud · 1 comment
Owner

Description

Production readiness audit and fixes for hero_compute before public deployment. Covers security vulnerabilities, stability improvements, build reproducibility, and code quality cleanup.

Changes

Security (Phase 1)

  • Console sessions: added max 20 session limit, 1-hour idle timeout, automatic cleanup task
  • Proxy connections: added 10s connect timeout and 30s read timeout to prevent hung nodes from blocking the cluster
  • SSH key injection: validate key format (prefix, no newlines, no shell metacharacters) before embedding in scripts
  • IPv6 validation: reject malformed IPs before interpolating into shell commands
  • CORS: removed wildcard Access-Control-Allow-Origin: * from all endpoints (same-origin only)
  • TCP bridge: added HERO_COMPUTE_BRIDGE_TOKEN auth, HERO_COMPUTE_BIND_ADDRESS config, max 100 concurrent connections via semaphore
  • WebSocket: added Origin header validation before console upgrade
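
The SSH key and IPv6 checks listed above could look roughly like the following. This is a minimal sketch, not the code from this change: the function names, the list of accepted key prefixes, and the exact set of rejected shell metacharacters are all assumptions made for illustration.

```rust
use std::net::Ipv6Addr;

/// Key-type prefixes accepted by this sketch (an assumption, not the project's list).
const ALLOWED_KEY_PREFIXES: [&str; 3] = ["ssh-ed25519 ", "ssh-rsa ", "ecdsa-sha2-nistp256 "];

/// Characters rejected because a POSIX shell treats them specially (assumed set).
/// Base64 characters (+ / =) and '@' stay allowed so normal keys and comments pass.
const SHELL_METACHARACTERS: &str = "`$;&|<>(){}\\\"'*?!#~";

/// Returns true only if the key has a known prefix, is a single line,
/// and contains none of the rejected shell metacharacters.
pub fn is_safe_ssh_public_key(key: &str) -> bool {
    let has_known_prefix = ALLOWED_KEY_PREFIXES.iter().any(|p| key.starts_with(p));
    let has_line_break = key.contains('\n') || key.contains('\r');
    let has_metachar = key.chars().any(|c| SHELL_METACHARACTERS.contains(c));
    has_known_prefix && !has_line_break && !has_metachar
}

/// Parses the input with the standard library instead of pattern-matching it.
/// Anything that is not a well-formed IPv6 address yields None.
pub fn parse_ipv6(raw: &str) -> Option<Ipv6Addr> {
    raw.parse::<Ipv6Addr>().ok()
}
```

Parsing into `Ipv6Addr` and formatting the parsed value back out means the string that reaches a shell command is produced by the standard library rather than copied from user input.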

Stability (Phase 2)

  • Replaced 25 silent let _ = on critical DB operations (vm_set, slice_set, vm_delete) with proper error logging
  • Added escapeAttr() for all dynamic values in onclick handlers to prevent XSS
  • Added JS console session limit (max 5 parked sessions, evicts oldest)
  • Removed silent failure in Makefile binary copy step
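
As an illustration of the first item above, a discarded result (`let _ = vm_set(...)`) can be routed through a small helper that logs the failure. The helper name `log_if_err` and the stand-in `vm_set` below are hypothetical, and `eprintln!` stands in for whatever logging facility the project actually uses.

```rust
use std::fmt::Display;

/// Hypothetical stand-in for a DB write such as vm_set; fails on an empty id.
pub fn vm_set(id: &str) -> Result<(), String> {
    if id.is_empty() {
        Err("empty vm id".to_string())
    } else {
        Ok(())
    }
}

/// Logs a failed result instead of silently discarding it.
/// Returns Some(value) on success and None on failure, so the caller
/// can still decide whether to continue.
pub fn log_if_err<T, E: Display>(op: &str, result: Result<T, E>) -> Option<T> {
    match result {
        Ok(value) => Some(value),
        Err(err) => {
            // eprintln! is a placeholder for the real logger (assumption).
            eprintln!("{} failed: {}", op, err);
            None
        }
    }
}
```

A call site that previously read `let _ = vm_set(id);` would become `log_if_err("vm_set", vm_set(id));`, which keeps the non-fatal behavior but leaves a trace when the write fails.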

Reliability (Phase 3)

  • Replaced panic!() and .expect() on socket bind with graceful ? error returns
  • Added reconciliation lock (AtomicBool + Drop guard) to prevent concurrent runs
  • Added 5-minute idle timeout on vsock background reader — closes session if no clients attached
  • Added socket path validation in explorer proxy (reject .. traversal)
  • Increased broadcast channel capacity from 1024 to 4096; replay the buffer when a receiver lags
  • Moved hardcoded dependency versions from configure.sh to buildenv.sh (configurable via env vars)
  • Replaced detailed error messages in HTTP responses with generic messages (details logged server-side only)
  • Added 5s timeout on /version handler (git + my_hypervisor commands)
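
The reconciliation lock mentioned above (AtomicBool + Drop guard) can be sketched as follows. The type and function names are assumptions for illustration; only the combination of an atomic flag with a guard that clears it on drop comes from the list.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Guard that marks a reconciliation run as active while it is alive.
pub struct ReconcileGuard<'a> {
    flag: &'a AtomicBool,
}

impl<'a> ReconcileGuard<'a> {
    /// Atomically flips the flag from false to true.
    /// Returns None if another run already holds it.
    pub fn try_acquire(flag: &'a AtomicBool) -> Option<Self> {
        flag.compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
            .ok()
            .map(|_| ReconcileGuard { flag })
    }
}

impl Drop for ReconcileGuard<'_> {
    /// Clears the flag on every exit path, including early returns and panics.
    fn drop(&mut self) {
        self.flag.store(false, Ordering::Release);
    }
}

/// Runs one reconciliation pass unless one is already in progress.
/// Returns true if this call did the work, false if it was skipped.
pub fn reconcile_once(flag: &AtomicBool) -> bool {
    let _guard = match ReconcileGuard::try_acquire(flag) {
        Some(guard) => guard,
        None => return false,
    };
    // The actual reconciliation work would go here.
    true
}
```

Because the flag is released in `Drop`, an error or panic in the middle of a run cannot leave the lock stuck in the "running" state.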

Build & CI (Phase 4)

  • Removed cargo update from CI pipelines (Cargo.lock ensures reproducibility)
  • Added RELEASE_ID numeric validation in CI release workflow
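
The RELEASE_ID check presumably lives in the CI workflow itself, which is not shown here. Expressed in Rust for illustration only, a strict numeric validation might look like this; the function name is hypothetical.

```rust
/// Returns the release id only if `raw` is a non-empty run of ASCII digits.
pub fn parse_release_id(raw: &str) -> Option<u64> {
    if raw.is_empty() || !raw.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    // parse() can still fail on overflow, which also maps to None.
    raw.parse().ok()
}
```

The explicit digit check matters because Rust's integer parsing accepts a leading `+`, so `"+5".parse::<u64>()` succeeds even though `+5` is not purely numeric.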

Code Quality (Phase 5)

  • Eliminated all 3 shell-outs to my_hypervisor CLI — replaced with direct my_hypervisor-lib API calls:
    • resolve_external_vm() → state_store.resolve_fresh() (no more my_hypervisor inspect)
    • set_vm_hostname() → hypervisor.vm_exec() (no more Command::new)
    • inject_ssh_keys_to_vm() → hypervisor.vm_exec() (no more Command::new)
  • Updated my_hypervisor-lib dependency to v0.1.3 (adds resolve_fresh() support)
  • Fixed build_lib.sh comment (said "hero_redis")
  • Narrowed broad #![allow(dead_code, unused_imports, unused_variables)] in explorer
  • Cleaned up unsafe env access in tests
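
The move from `Command::new` to `hypervisor.vm_exec()` can be pictured with the sketch below. The real signature of `vm_exec` in my_hypervisor-lib is not shown in this issue, so the `VmExec` trait, the argument-vector form, the guest command, and the hostname rules are all assumptions; `RecordingHypervisor` is a test double, not part of the library.

```rust
use std::cell::RefCell;

/// Hypothetical stand-in for the library handle; the real vm_exec may differ.
pub trait VmExec {
    fn vm_exec(&self, vm_id: &str, argv: &[&str]) -> Result<String, String>;
}

/// Conservative hostname check (assumed rules: 1 to 63 chars, ASCII
/// alphanumerics and '-', no leading or trailing '-').
pub fn is_valid_hostname(name: &str) -> bool {
    let len_ok = !name.is_empty() && name.len() <= 63;
    let chars_ok = name.bytes().all(|b| b.is_ascii_alphanumeric() || b == b'-');
    len_ok && chars_ok && !name.starts_with('-') && !name.ends_with('-')
}

/// Sets the hostname through the library call, passing arguments as a
/// vector so no shell ever interprets the hostname string.
pub fn set_vm_hostname(hv: &dyn VmExec, vm_id: &str, hostname: &str) -> Result<(), String> {
    if !is_valid_hostname(hostname) {
        return Err(format!("invalid hostname: {:?}", hostname));
    }
    // "hostname <name>" as the guest command is an assumption.
    hv.vm_exec(vm_id, &["hostname", hostname]).map(|_| ())
}

/// Test double that records every argv it receives.
#[derive(Default)]
pub struct RecordingHypervisor {
    pub calls: RefCell<Vec<Vec<String>>>,
}

impl VmExec for RecordingHypervisor {
    fn vm_exec(&self, _vm_id: &str, argv: &[&str]) -> Result<String, String> {
        let recorded = argv.iter().map(|s| s.to_string()).collect();
        self.calls.borrow_mut().push(recorded);
        Ok(String::new())
    }
}
```

Putting the call behind a trait also makes functions like `set_vm_hostname` testable without a running hypervisor.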
mahmoud self-assigned this 2026-03-31 09:56:43 +00:00
mahmoud added this to the ACTIVE project 2026-03-31 09:56:45 +00:00
mahmoud added this to the now milestone 2026-03-31 09:56:47 +00:00
Author
Owner

Should be done
Reference
lhumina_code/hero_compute#51