SUPER URGENT BASE PLATFORM #234
STEP 1: the bootstrap
STEP 2: simplify the paths used
STEP 3: rework lab to do above
STEP 4: we have 2 agents now, soon 3 (hero_shrimp)
STEP 5: build & publish the core
STEP 6a: review & fix hero_proc
STEP 6b: review ai_client_direct in herolib
STEP 7: review & fix hero_router
STEP X:
Update — daemon singleton + this session's follow-throughs
Closing the loop on the hero_proc daemon-singleton work and a few related items from this week.
Daemon singleton enforcement in hero_proc (5abe664)
4-state PID machine (Inactive → Starting → Running → Success/Failed), 13 integration tests. A duplicate `hero_proc_server` now refuses to start with `AcquireError::PeerAlive` instead of silently coexisting and corrupting state. Per-host singleton is now a daemon-side concern.
Secrets preserved across schema upgrades (e7b9b1e)
Wired `runs_model::auto_migrate_or_wipe` into `factory.rs::HeroProcDb::new`. Schema bumps no longer destroy user secrets.
Follow-throughs this session (hero_skills):
- Republished hero_proc binaries on Forge. Until today the two fixes above were on `development` but not in the customer-install binary: release `latest` was 2+ days stale. New install now actually gets the singleton enforcement + secrets-preserve. Did it via `lab build --upload`, which validates the lab-driven publish path for any Hero repo.
- Removed lab's `pkill -f hero_proc` prelude in `start_hero_proc` (4488d1f): lab used to broad-substring-kill any process with "hero_proc" in argv before launching the daemon. With singleton enforcement landed, that prelude was redundant AND counter-productive: it masked the new `AcquireError::PeerAlive` diagnostic. It was also a serious user-side regression: it killed colleagues' SSH sessions whose bash had `cd hero_proc` in argv. Linux now relies on the precise `/proc/exe` walk + the daemon-side singleton.
- `hero_os_hosted` step ordering (6840ebd): bridge config moved before service start so per-user hero_router binds to its own mycelium IPv6 instead of colliding on `127.0.0.1:9988`. Each hosted user now spawns its own hero_proc on its own UID with the singleton intact, no cross-user collisions.
- Mycelium opt-in setup + preflight (6829d09): `lab install mycelium --start` for the host-level daemon; `hero_os_hosted` now preflights for it and dies loud with the exact install command if missing.
Verified end-to-end on kristof5: customer install → user provisioned → hero_proc + admin + router up with healthy sockets and singleton enforcement active.
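To make the `pkill -f hero_proc` problem concrete, here is a minimal sketch contrasting a substring match over the command line with an exact match on the executable's file name. The function names, paths, and sample processes are hypothetical and are not taken from lab's code; resolving a process's executable path from `/proc` is deliberately left out.

```rust
use std::path::Path;

/// Roughly what `pkill -f <needle>` matches on: the needle appearing
/// anywhere in the full command line (argv joined by spaces).
fn matches_cmdline_substring(argv: &[&str], needle: &str) -> bool {
    argv.join(" ").contains(needle)
}

/// Stricter check: the executable's file name must equal the expected
/// binary name exactly. `exe` stands in for the resolved executable path
/// of a process; resolving it from /proc is not part of this sketch.
fn matches_exe_name(exe: &Path, binary: &str) -> bool {
    exe.file_name().and_then(|name| name.to_str()) == Some(binary)
}

fn main() {
    // Hypothetical processes: a shell that only mentions the directory
    // name, and the daemon itself.
    let shell_argv = ["bash", "-c", "cd hero_proc && ls"];
    let shell_exe = Path::new("/usr/bin/bash");
    let daemon_argv = ["/home/user/hero/bin/hero_proc_server"];
    let daemon_exe = Path::new("/home/user/hero/bin/hero_proc_server");

    // Substring matching hits both processes, including the innocent shell.
    println!(
        "substring: shell={} daemon={}",
        matches_cmdline_substring(&shell_argv, "hero_proc"),
        matches_cmdline_substring(&daemon_argv, "hero_proc")
    );

    // Exact executable-name matching hits only the daemon.
    println!(
        "exe name:  shell={} daemon={}",
        matches_exe_name(shell_exe, "hero_proc_server"),
        matches_exe_name(daemon_exe, "hero_proc_server")
    );
}
```

The substring check flags the shell as a kill target even though it only mentions the directory name, which is the failure mode described above; the exact-name check does not.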
Step 5 (Build & publish core) is current for `lab` (was 76 days stale), `hero_router` (7), `hero_proc` (2). Step 6a (Review hero_proc) advanced via the two fixes above landing in the customer binary.
Update: `hero_proc_test` integration suite run + 13-issue triage (Step 6a items 1+3 progress)
Following up on Step 6a items 1 ("service start not good enough") + 3 ("integration tester still finds bugs"):
Step 6a item 3: `hero_proc_test` run against `hero_proc` 719ba10
Ran `--basic --functional` against a freshly built `hero_proc_server`. 20 tests fail. Categorized:
Real defects in the daemon (need fixing):
- `uc09_failed_parent_blocks_child`: `child must NOT be succeeded when parent failed, got succeeded` → hero_proc#64 reopened (DAG-gating logic in `supervisor/mod.rs:660-742` looks right but doesn't actually cascade-cancel end-to-end)
- `uc07_retry_succeeds_on_second_attempt`: `expected exactly 2 attempts, got 1` → hero_proc#106 reopened (Kristof's 03e7ed8 retry fix not working end-to-end; likely related to HP-04 restart-counter reset)
- `uc31_action_cascade_delete`: `action delete should cascade to >=3 jobs, got 0` → relates to hero_proc#66 (cascade-delete unimplemented)
- `uc41_job_timeout_ms`: `job stuck in running after 15s` → timeout enforcement not firing
- `uc32_process_stats_while_running` / `uc33_job_why_waiting`: RPC -32100 "job not found" mid-test; race or DB lookup bug
- `uc12-16` (service lifecycle: start/stop/restart/remove/system-class): multiple, need individual triage
Infra setup issues (not real bugs):
- `basic::daemon_singleton::*` (5 tests): try to spawn from `/home/sameh/hero/bin/hero_proc_server` (not installed via `lab user init` on the test host); not a defect of singleton itself.
Full output saved as artifacts in `errors/` per test.
13-issue triage on hero_proc: 8 closed, 5 kept open
Closed with citation comments (verified empirically — code review + admin UI Playwright + unit tests):
SPECS.md:35)
Kept open after empirical re-verification (with explanation comments):
- `uc09` and `uc07` failed
- `editAction` and `editSecret` work, but `editService` is silently no-op
- `development` and x86_64-Linux binary refreshed, but arm64+macOS assets still 4 days stale on `latest` (`lab-publish.yaml` is x86_64-only by design)
Step 6a item 1 status
hero_proc#92 (bind-race readiness await) still open: Step 6a item 1 verbatim. `rpc/service.rs:317-385` returns synchronously after job-row insert with no readiness probe.
Other Step 6a items
hero_proc#95 open. Wire hero_db into `server.rs::run()` init phase.
Step 5 (Build & publish core): hero_proc
`lab-publish.yaml` (Mik's f0ba646, 2026-05-19) is auto-republishing x86_64-Linux on every push to development. Latest x86_64 asset is from today 09:20 UTC. arm64+macOS need their own publish workflows.
Addendum to my previous comment, sandbox caveat on the `hero_proc_test` 20-failures finding:
I ran the test against a `hero_proc_server` I started by invoking `./target/release/hero_proc_server` directly, NOT via `lab service hero_proc --start --force --build` (the canonical path each `errors/*.md` re-run instruction prints). My local box also isn't `lab user init`'d. So:
- `basic::daemon_singleton::*` failures are confirmed-not-bugs (try to spawn from missing `/home/sameh/hero/bin/hero_proc_server`).
The Step 6a item 3 progress claim ("integration tester still finds bugs") is right in spirit (there are real bugs there), but the specific list of 15 needs re-run under `lab service --start` on kristof5 (or equivalent) before treating each as a separate ticket.
Re-run on kristof5 under canonical setup: confirmed/corrected findings.
Per the addendum I posted earlier flagging the sandbox caveat, I re-ran `hero_proc_test --basic --functional` on kristof5 (PATH_ROOT=/home/sameh/hero, hero_proc auto-started via lab, `latest` binary from 2026-05-20T13:29). Result: 23 failures / 234 passes.
Confirmed real defects (fail on BOTH sandbox AND kristof5)
- `uc07_retry_succeeds_on_second_attempt`: `expected exactly 2 attempts, got 1` → hero_proc#106
- `uc09_failed_parent_blocks_child`: `child must NOT be succeeded when parent failed, got succeeded` → hero_proc#64
- `uc12_define_service_start_stop`: `running=false` after 10s
- `uc13_restart_produces_new_job_ids`: `running=false` after 10s
- `uc14_stop_with_remove_jobs_deletes_records`: `running=false` after 10s
- `uc16_system_class_excluded_from_stop_all`: `running=false` after 10s
- `uc20_cron_scheduled_job_fires_within_75s`
- `uc32_process_stats_while_running` / `uc33_job_why_waiting`
- `uc41_job_timeout_ms`
- `functional::runs::structured_logs_all_levels`
- `functional::runs::logs_query_by_service_src_wildcard`
That's 12 confirmed daemon-internal defects, several closely related: the `uc12-16` cluster suggests services aren't transitioning to `running` state properly, which is a fundamental supervisor regression.
Big new finding: scheduler tests all fail on kristof5
10 `functional::schedule::*` tests fail on kristof5 with the same shape: `timeout waiting for 1 jobs for action '<sched-...>'`. None of these failed in my local sandbox. Looking at the affected tests:
- `test_interval_with_large_duration_works`
- `test_time_window_within_range_allows_scheduling`
- `test_time_window_transition_at_start_time` + `_end_time`
- `test_nr_of_instances_default_is_one`
- `test_scheduled_job_has_scheduled_tag`
- `test_cron_scheduled_job_tagged_as_cron`
- `test_scheduled_job_output_captured`
- `test_schedule_policy_removal_triggers_cleanup`
- `test_very_long_interval_works`
The scheduler appears to be entirely non-functional on the published `latest` binary (2026-05-20). This is a production regression: customer scheduled jobs aren't firing. Worth filing as its own P0 issue separate from #92/#95/#106 (Step 6a items 1/2). My local sandbox (binary built today from 719ba10) passed many of these, so the fix may already exist on `development`; needs re-publish to verify.
Sandbox-only failures (fail locally, pass on kristof5)
These I correctly classified as sandbox artifacts earlier — kristof5 confirmation:
- `basic::daemon_singleton::*` (need installed lab binaries to spawn from)
- `basic::logging::*` (need installed lab paths)
- `uc31_action_cascade_delete`: passes on kristof5 (so hero_proc#66 cascade-delete may already work in production)
- `uc19_interval_scheduled_job_fires_automatically`: passes on kristof5
Updated triage on #64 and #106
Both addenda on those issues edited to confirm real-defect status (was provisional, now confirmed by kristof5 canonical run).
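For reference, the behaviour that `uc09_failed_parent_blocks_child` asserts for hero_proc#64 can be sketched as a cascade-cancel over a dependency graph. This is a hypothetical, simplified model (the `JobState` enum, numeric job ids, and the `children` map are all assumptions made for illustration) and is not the code in `supervisor/mod.rs`.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum JobState {
    Pending,
    Failed,
    Cancelled,
}

/// Mark every still-pending descendant of a failed job as cancelled, so
/// that no child can run (and succeed) after its parent has failed.
/// `children` maps a job id to the ids of the jobs that depend on it.
fn cascade_cancel(
    states: &mut HashMap<u32, JobState>,
    children: &HashMap<u32, Vec<u32>>,
    failed: u32,
) {
    let mut stack = vec![failed];
    while let Some(id) = stack.pop() {
        if let Some(kids) = children.get(&id) {
            for &kid in kids {
                if states.get(&kid) == Some(&JobState::Pending) {
                    states.insert(kid, JobState::Cancelled);
                    // Keep walking: grandchildren are blocked as well.
                    stack.push(kid);
                }
            }
        }
    }
}

fn main() {
    // Job 1 failed; 2 depends on 1, 3 depends on 2; 4 is unrelated.
    let mut states = HashMap::from([
        (1u32, JobState::Failed),
        (2, JobState::Pending),
        (3, JobState::Pending),
        (4, JobState::Pending),
    ]);
    let children = HashMap::from([(1u32, vec![2u32]), (2, vec![3])]);

    cascade_cancel(&mut states, &children, 1);

    for id in 1u32..=4 {
        println!("job {id}: {:?}", states[&id]);
    }
}
```

In this model jobs 2 and 3 end up `Cancelled` while the unrelated job 4 stays `Pending`. An end-to-end check would assert the same outcome against the daemon rather than against an in-memory map.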
Revised recommendation for Step 6a
- `latest`. File as own issue; rebuild + republish to see if `development` already has the fix.
- `uc12-16` cluster: service-state-transition regression (4 lifecycle tests all show services stuck `running=false`). Possibly the same root cause across all 4. Investigate first.
Correction to my earlier scheduler-regression claim: published binary deployed on kristof5 + 3-way diff.
Sameh rightly pushed back on my "scheduler regression on `latest` binary" framing. I deployed today's published binary (`hero_proc_server-linux-musl-x86_64`, md5 `11029c562b6ef3cb085548f820afd3f1`, built from 719ba10, on `latest` since 2026-05-21T09:18) to kristof5, restarted hero_proc, and re-ran the full suite. Then I 3-way-diffed:
What the binary update actually accomplished
`uc20_cron_scheduled_job_fires_within_75s`, `uc32_process_stats_while_running` (failed R2, pass R3)
My scheduler-regression claim was wrong
The 10 `functional::schedule::*` tests that failed on kristof5 with the OLD binary STILL fail with the NEW binary on kristof5 (and additional schedule tests now fail too), but the SAME tests with the SAME new binary pass on a clean local sandbox. So the failure cause is environmental on the loaded kristof5 box (2154 background jobs + supervisor under load → scheduler can't meet test tick deadlines), not a code regression. The scheduler works; it just times out under load.
11 stable real defects (fail in ALL 3 runs, independent of env or binary version)
The `uc12-16` cluster (4 tests, all "service stuck running=false after 10s") is the most concerning shared-root-cause candidate. `uc09` + `uc07` confirm hero_proc#64 and hero_proc#106 are real defects.
Other clarifications
`uc31_action_cascade_delete` is FLAKY, not a clean pass: failed R1+R3, passed R2. Cannot confirm hero_proc#66 either way; needs deeper investigation.
Retracting my earlier "P0 scheduler regression" framing. Task list updated accordingly.
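The 3-way diff used above boils down to set arithmetic over the failing-test lists of each run. A minimal sketch, assuming each run is available as a set of failing test names: the `classify` helper is hypothetical, and the sample sets are abbreviated stand-ins loosely based on the results quoted in this thread, not the actual run output.

```rust
use std::collections::BTreeSet;

/// Split failing tests into those that fail in every run ("stable") and
/// those that fail in only some runs ("unstable": flaky or env-dependent).
fn classify<'a>(runs: &[BTreeSet<&'a str>]) -> (BTreeSet<&'a str>, BTreeSet<&'a str>) {
    // Every test that failed in at least one run.
    let union: BTreeSet<&'a str> = runs.iter().flatten().copied().collect();
    // Tests that failed in all runs.
    let stable: BTreeSet<&'a str> = union
        .iter()
        .copied()
        .filter(|test| runs.iter().all(|run| run.contains(test)))
        .collect();
    // Everything else failed in some runs but not in others.
    let unstable: BTreeSet<&'a str> = union.difference(&stable).copied().collect();
    (stable, unstable)
}

fn main() {
    // Abbreviated, illustrative failure sets for runs R1, R2, R3.
    let r1 = BTreeSet::from(["uc07", "uc09", "uc20", "uc31"]);
    let r2 = BTreeSet::from(["uc07", "uc09", "uc20"]);
    let r3 = BTreeSet::from(["uc07", "uc09", "uc31"]);

    let (stable, unstable) = classify(&[r1, r2, r3]);
    println!("stable:   {:?}", stable);
    println!("unstable: {:?}", unstable);
}
```

With these sample sets, `uc07` and `uc09` land in the stable bucket, while `uc20` and `uc31` land in the unstable one. The unstable bucket still needs manual reading, since it does not distinguish a test fixed by a binary update from a genuinely flaky one.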