job.steer always fails — loop never writes task_state.json (the file the steer gate requires) #28
Reference
lhumina_code/hero_shrimp#28
`job.steer` is a dead surface. It returns `No active autonomy state found in <workspace>` for every job, including jobs that are visibly still running and updating their plan.
Repro
- Start a job (for example: "Build a Python CLI tool wxcli that fetches weather, with tests and a Makefile").
- While the job is mid-flight (`p1: in_progress`, `p2: pending`), call `job.steer`.
- The call fails: `JSON-RPC job.steer failed: No active autonomy state found in /home/<user>/hero/var/shrimp/workspace/jobs/<job_id>. ...`
Root cause
The operator path that backs `job.steer` (`crates/hero_shrimp_engine/src/orchestration/autonomy/operator.rs:84-93`) loads state from `<workspace>/.agent/task_state.json` (`state.rs:21`).
But the running agent loop never writes that file. The loop writes `.agent/job_plan.json` (via `update_plan` / `complete_phase`), which is a different file. `persist_goal_and_state` (the only writer of `task_state.json`) is called only from operator-driven sites:
- `mod.rs:167`: inside `activate_autonomy_job_state` (operator-triggered)
- `promote.rs:122/125/220`: plan promotion (operator-triggered)
- `operator.rs:68`: `action == "activate_run"` (operator-triggered)
There are zero call sites inside the loop body. The file the steer gate reads only exists if an operator action created it first, but the UI provides no such action surface (no `job.resume` RPC, no Resume button; see also the misleading recommendation in the error message itself).
Evidence from a live job
Live job `rpc_job_..._3489030_4`, mid-flight: the `.agent/` contents include no `task_state.json`, so `job.steer` fails.
Fix options
- Option 1: have the loop write `task_state.json` alongside `job_plan.json` on every plan update / iteration boundary. Steer then has a file to read. This matches the operator path's expectation.
- Option 2: have steer read `job_plan.json` + DB rows instead of requiring a separate `task_state.json`. This removes the duplicate persistence surface.
- Option 3: fix the error message, which recommends a resume surface that does not exist (no `job.resume` RPC, no Resume UI). Either build that surface or change the message to something actionable (e.g. "send a follow-up message via job.follow_up").
Related
Same pattern as #27 (plan-approval UI persists a decision the engine doesn't consume). Both are operator-control surfaces wired to UIs/RPCs that the engine doesn't honor.
Spec: fix `job.steer` by having the agent loop write `task_state.json` alongside `job_plan.json`
Objective
Make `job.steer` (and any other operator action that reads `<workspace>/.agent/task_state.json`) succeed for a live, mid-flight autonomy job, including jobs whose `.agent/` directory currently contains only `job_plan.json` and `plan_versions/`. We do this by adding a minimal "task_state mirror" write to the same two writers that the loop already drives (`update_plan` and `complete_phase`), keyed off the `job_plan.json` we just wrote. We also fix the steer error message so it stops directing operators to a Resume button that does not exist.
Chosen approach
Option 1 + Option 3.
Option 1 (loop writes `task_state.json` alongside `job_plan.json`) is the right answer because:
- The operator path reads `STATE_PATH = ".agent/task_state.json"` (`operator.rs:84-93`) and the in-prompt `OperatorGuidanceProvider` reads the same file (`job_context.rs:206`), so without a loop-side writer, both `job.steer` and the prompt-side guidance injection are broken for any job that wasn't started via the explicit operator activation path.
- The loop has two plan writers, `handle_update_plan` (`plan_ops.rs:88`) and `handle_complete_phase` (`verify/mod.rs:615`), both of which already write `job_plan.json` next to the future `task_state.json`. Mirroring there is one helper call per site.
- Option 2 (steer reads `job_plan.json` + DB) would force us to re-derive every field of `ExecutionState` (`timeline`, `coverage`, `recovery_ladder`, `blocked_reason`, `operator_guidance`, etc.) from a plan file that doesn't have them, and would still leave the prompt-side `operator_guidance_from_workspace` broken or force a second rewrite. It is a larger blast radius than the bug warrants.
Option 3 (fix the misleading error message) is additive and cheap, and stops sending operators after a non-existent button.
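To illustrate what "one helper call per site" could look like, here is a minimal std-only sketch. It assumes the mirror is treated as a sidecar whose failures are logged and swallowed. The names `try_mirror`, `mirror_best_effort` and `handle_update_plan_sketch` are hypothetical stand-ins, not the engine's functions, and `eprintln!` stands in for whatever logging the real handlers use.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical stand-in for the mirror helper: it simply copies the plan
/// file into the state file. The real helper would build an `ExecutionState`.
fn try_mirror(workspace_dir: &Path) -> io::Result<()> {
    let agent_dir = workspace_dir.join(".agent");
    let plan = fs::read_to_string(agent_dir.join("job_plan.json"))?;
    fs::write(agent_dir.join("task_state.json"), plan)
}

/// Best-effort wrapper: any failure is reported and then dropped, so the
/// caller never sees an error from the sidecar write.
fn mirror_best_effort(workspace_dir: &Path) {
    if let Err(err) = try_mirror(workspace_dir) {
        // Assumption: the real code would log through its own logger here.
        eprintln!("task_state mirror skipped: {err}");
    }
}

/// Stand-in for a plan writer such as `handle_update_plan`.
fn handle_update_plan_sketch(workspace_dir: &Path, plan_json: &str) -> io::Result<()> {
    let agent_dir = workspace_dir.join(".agent");
    fs::create_dir_all(&agent_dir)?;
    // Primary write: a failure here fails the whole call.
    fs::write(agent_dir.join("job_plan.json"), plan_json)?;
    // Sidecar write: the single extra call per site; it cannot fail the call.
    mirror_best_effort(workspace_dir);
    Ok(())
}
```

The design point is that the plan write stays the only step that can fail the tool call; the mirror is added after it and has no return value to propagate.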
Requirements
- `job.steer` against a mid-flight autonomy job whose only on-disk artifact is `.agent/job_plan.json` must succeed (action `steer_job` writes `operator_guidance` and action `clear_steer` clears it).
- The mirrored `task_state.json` must have `job_id` set to the run's artifact job id so the existing identity check at `operator.rs:94` passes.
- If `job_plan.json` is missing or unreadable, or if writing `task_state.json` fails, the originating `update_plan` / `complete_phase` call must still succeed (the mirror is a best-effort sidecar, not a hard gate).
- If `task_state.json` already exists for the workspace and its `job_id` matches the loop's `job_id`, the loop's write must preserve `operator_guidance`, `operator_force_replan`, `operator_pause_requested`, `blocked_reason`, and `timeline`; i.e. we must not stomp operator-set fields when refreshing the plan/phase mirror.
- The error message at `operator.rs:87-91` must no longer instruct the user to use a "Resume button on the run page" (no such surface exists). Replace it with an actionable message that names the real follow-up surface (`job.follow_up` / sending a new prompt) and explains why state may legitimately not exist (the job has not yet emitted `update_plan` and so has no plan to steer against).
- The existing operator-driven paths stay as they are (`activate_autonomy_job_state`, `fork_autonomy_job`, the `resume_job` action): they still write `task_state.json` through `persist_goal_and_state`. We are adding a second writer, not replacing the existing one.
Files to modify
- `crates/hero_shrimp_engine/src/orchestration/autonomy/persistence.rs`: add a new public helper `mirror_task_state_from_plan(workspace_dir: &Path, job_id: &str)` that reads `.agent/job_plan.json`, merges it onto an existing `task_state.json` (preserving operator fields and timeline) or synthesizes a minimal `ExecutionState` if none exists, and writes back to `.agent/task_state.json`. Best-effort: never panics, never propagates errors past a `tracing::warn!`.
- `crates/hero_shrimp_engine/src/orchestration/autonomy/mod.rs`: re-export the new helper next to the other `persistence::*` re-exports (lines 66-70) so the tool handlers can call it through `crate::autonomy::mirror_task_state_from_plan`.
- `crates/hero_shrimp_engine/src/tools/tool_catalog/verify/plan_ops.rs`: at the end of `handle_update_plan` (after the kanban sync, before the success return), call the mirror helper using `context.workspace_dir` and `context.job_id`. Best-effort; ignore the result.
- `crates/hero_shrimp_engine/src/tools/tool_catalog/verify/mod.rs`: at the end of the "Passed — advance the phase in the plan." block in `handle_complete_phase` (after the plan-versions copy, before the `emit_scoped`), call the same mirror helper. Best-effort; ignore the result.
- `crates/hero_shrimp_engine/src/orchestration/autonomy/operator.rs`: replace the misleading error message at lines 86-92 with a message that does not reference a "Resume button". Suggested text: `"No autonomy state at {workspace}/.agent/task_state.json yet. The job either hasn't published a plan via update_plan / complete_phase, or it never entered the autonomy path. Send a follow-up prompt instead — direct steer requires an active plan."`.
Implementation plan
Each step is self-contained. Steps 1 and 2 are setup; steps 3 and 4 are the two writer hooks; step 5 is the error-message fix. Steps 3, 4, and 5 can be done in any order after step 1+2. Step 6 is the test layer.
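The preservation requirement above (refresh plan-derived fields, keep operator-set ones, start fresh on a `job_id` mismatch) can be sketched as follows. This is only an illustration under stated assumptions: `StateSketch` is a hypothetical, heavily trimmed stand-in for `ExecutionState`, `mirror_from_plan` is not the real helper's signature, and the `"pending"` default for newly seen phases is an assumption.

```rust
use std::collections::HashMap;

/// Hypothetical subset of the engine's `ExecutionState`.
#[derive(Clone, Debug, Default, PartialEq)]
struct StateSketch {
    job_id: Option<String>,
    status: String,
    /// (phase id, phase status) pairs.
    phases: Vec<(String, String)>,
    operator_guidance: Option<String>,
    operator_force_replan: bool,
}

/// Mirror the phase ids of a freshly written plan into the state.
fn mirror_from_plan(
    existing: Option<StateSketch>,
    plan_phase_ids: &[&str],
    job_id: &str,
) -> StateSketch {
    // Merge only when the snapshot belongs to this job; otherwise synthesize.
    let mut state = match existing {
        Some(s) if s.job_id.as_deref() == Some(job_id) => s,
        _ => StateSketch {
            job_id: Some(job_id.to_string()),
            ..Default::default()
        },
    };

    // Refresh phases, matching by id so known phases keep their status.
    let old: HashMap<String, String> = state.phases.drain(..).collect();
    state.phases = plan_phase_ids
        .iter()
        .map(|id| {
            // Assumption: phases not seen before start as "pending".
            let status = old.get(*id).cloned().unwrap_or_else(|| "pending".to_string());
            (id.to_string(), status)
        })
        .collect();

    // Promote "" / "planned" to "running"; any other status is left alone.
    if state.status.is_empty() || state.status == "planned" {
        state.status = "running".to_string();
    }

    // Operator-set fields are deliberately not touched on the merge path.
    state
}
```

Because the merge branch reuses the loaded value and only overwrites the plan-derived fields, any field the sketch does not mention survives by construction, which is the property the requirements ask for.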
Step 1: add the `mirror_task_state_from_plan` helper to `persistence.rs`
- File: `crates/hero_shrimp_engine/src/orchestration/autonomy/persistence.rs`
- Read `<workspace>/.agent/job_plan.json` (returns early if absent / unparseable).
- If `<workspace>/.agent/task_state.json` exists AND its `job_id == job_id`, load it and refresh only `plan`, `phases` (matching by phase id to preserve per-phase status), `status` (only promote `""` / `"planned"` → `"running"`; never demote a `"running"` state), and `updated_at`. Preserve `operator_guidance`, `operator_force_replan`, `operator_pause_requested`, `blocked_reason`, `timeline`.
- Otherwise, synthesize `ExecutionState::new(plan, AutonomyMode::Execute, "running")` with `job_id = Some(job_id.to_string())` from the on-disk plan.
- Write via `save_json_file` (no DB upsert, no goal-doc writes; those belong to operator-driven `persist_goal_and_state`).
- Errors are logged via `tracing::warn!`, never propagated.
Step 2: re-export the helper from `autonomy/mod.rs`
- File: `crates/hero_shrimp_engine/src/orchestration/autonomy/mod.rs`
- Add `mirror_task_state_from_plan` to the existing `pub use self::persistence::{ … }` block so it is reachable as `crate::autonomy::mirror_task_state_from_plan`.
Step 3: hook `update_plan`
- File: `crates/hero_shrimp_engine/src/tools/tool_catalog/verify/plan_ops.rs`
- In `handle_update_plan`, after the kanban sync and before the success return, call the mirror helper with `context.workspace_dir` and `context.job_id`, ignoring the result.
- Effect: the first `update_plan` call already populates `task_state.json`.
Step 4: hook `complete_phase`
- File: `crates/hero_shrimp_engine/src/tools/tool_catalog/verify/mod.rs`
- In `handle_complete_phase`, after the plan-versions file is written and before the `emit_scoped` event, insert the same mirror call as in Step 3.
- Effect: this keeps `task_state.json` in sync as phases tick over the life of the run.
Step 5: fix the misleading error message
- File: `crates/hero_shrimp_engine/src/orchestration/autonomy/operator.rs` (lines 86-92)
- Replace the `anyhow!(...)` body with the suggested text given for `operator.rs` under "Files to modify".
Step 6: tests
- Write `.agent/job_plan.json` with two phases → call the helper → assert that `task_state.json` exists, deserialises into `ExecutionState`, has `job_id == Some("rpc_job_xyz")`, has 2 phases with matching ids, and has `status == "running"`.
- After the mirror has run, call `apply_autonomy_operator_action` with `action="steer_job", message="be careful"` and assert it returns `Ok` (no longer the "No active autonomy state" error). Verify the resulting `task_state.json` has `operator_guidance` set.
- Pre-seed `task_state.json` with `operator_guidance = Some("be quick")` and `operator_force_replan = true`, then call the mirror helper, then assert those two fields are still set on disk.
- Call `handle_update_plan` against a workspace with a `ToolContext { workspace_dir, job_id: Some(...) }` and assert `.agent/task_state.json` exists after the call.
Acceptance criteria
- `job.steer { job_id, message }` against a job whose workspace contains only `.agent/job_plan.json` and `.agent/plan_versions/` returns success and sets `operator_guidance` in `.agent/task_state.json`.
- `job.steer { job_id, clear: true }` against the same job returns success and clears `operator_guidance`.
- `OperatorGuidanceProvider` (`job_context.rs:204`) sees the steering text on the next loop iteration via `operator_guidance_from_workspace`; i.e. the steering reaches the LLM prompt without going through the DB-only `pending_operator_guidance` fallback.
- Operator-set fields (`operator_guidance`, `operator_force_replan`, `operator_pause_requested`, `blocked_reason`) survive subsequent `update_plan` and `complete_phase` calls.
- A job that never calls `update_plan` (Tier-0 trivial path) does NOT crash: the mirror helper is a no-op when `job_plan.json` doesn't exist.
- The error message shown when `task_state.json` genuinely cannot be located no longer mentions a "Resume button".
Notes
- Why not call `persist_goal_and_state` directly? That helper does a synchronous DB upsert (`upsert_job_state_snapshot`) and writes two markdown files (archived + canonical goal docs). Doing all of that on every single `update_plan` and `complete_phase` invocation would add per-tick DB writes and per-tick goal-doc rewrites for a side-channel mirror. The loop already writes to the DB via execution-control and to `job_plan.json` directly; the mirror only needs to satisfy the operator path's "is there an `ExecutionState` at `STATE_PATH`?" check. A direct `save_json_file` is the minimum-blast-radius write.
- What if the existing `task_state.json`'s `job_id` doesn't match `context.job_id`? The merge branch only fires when they match. On mismatch, we fall through to synthesise a fresh `ExecutionState` with the loop's `job_id`. That is correct, because the operator's identity check at `operator.rs:94` keys off the loop's `job_id`, not a stale snapshot's.
- The mirror is a file-only write next to the existing `job_plan.json` writes. `JobStateSnapshotRow` continues to be authoritative for callers that go through `persist_goal_and_state` / the DB-first path of `load_state`.
- Adding a `job.resume` RPC or a Resume button is explicitly NOT part of this fix. The error message rewording is enough to stop the wild-goose chase; the new surface, if needed, is a separate issue.
Test Results
Summary
- The remaining full-suite failures also occur on `development` (not introduced by this change).
New tests added (`crates/hero_shrimp_engine/src/orchestration/autonomy/tests.rs`)
- `mirror_task_state_from_plan_creates_state_when_none_exists`
- `mirror_task_state_from_plan_is_noop_when_plan_missing`
- `mirror_task_state_from_plan_preserves_operator_fields`
- `mirror_task_state_from_plan_replaces_state_on_job_id_mismatch`
- `mirror_task_state_from_plan_ignores_empty_job_id`
- `steer_error_message_no_longer_mentions_resume_button`
These directly assert:
- The helper synthesises an `ExecutionState` with the correct `job_id` and phase shells.
- The helper is a no-op when `job_plan.json` is missing (Tier-0 / non-autonomy jobs do not crash).
- `operator_guidance`, `operator_force_replan`, and `blocked_reason` survive subsequent `update_plan` / `complete_phase` ticks.
- A `job_id` mismatch causes a fresh state to be synthesised rather than a stale merge.
- The `job.steer` error message no longer points operators at a non-existent "Resume button".
Autonomy submodule test results (`orchestration::autonomy::tests::`)
Pre-existing test failures (verified unrelated to this change)
The following 9 tests fail under `cargo test -p hero_shrimp_engine --lib` BOTH on this branch and on `development`. They were all confirmed to pass when run in isolation, and the autonomy backend test fails only because `bubblewrap` is installed on this dev machine (environmental, not code-related):
Running the same 9 tests in isolation:
So no regressions are introduced by this change.
Build
- `cargo check -p hero_shrimp_engine`: clean (no warnings, no errors).
- `cargo build -p hero_shrimp_engine --tests`: clean.
Implementation Summary
Changes Made
Implemented Option 1 + Option 3 from the issue: the agent loop now mirrors `<workspace>/.agent/job_plan.json` into `<workspace>/.agent/task_state.json` on every `update_plan` and `complete_phase`, so `job.steer` and `OperatorGuidanceProvider` have the state file they require for any live autonomy job. The misleading error message that pointed operators at a non-existent "Resume button" was also rewritten.
Files modified
- `crates/hero_shrimp_engine/src/orchestration/autonomy/persistence.rs`: added `mirror_task_state_from_plan(workspace_dir, job_id)`. Best-effort sidecar writer: reads `.agent/job_plan.json`, merges over an existing `task_state.json` (preserving `operator_guidance`, `operator_force_replan`, `operator_pause_requested`, `blocked_reason`, `timeline`, and per-phase status keyed by phase id) or synthesizes a fresh `ExecutionState` if none exists, then writes to `.agent/task_state.json`. Any failure is logged at `warn!` and swallowed so the caller's primary write succeeds.
- `crates/hero_shrimp_engine/src/orchestration/autonomy/mod.rs`: re-exported `mirror_task_state_from_plan` alongside the other `persistence::*` entries.
- `crates/hero_shrimp_engine/src/tools/tool_catalog/verify/plan_ops.rs`: `handle_update_plan` calls the mirror helper after writing `job_plan.json`. Now the first `update_plan` populates `task_state.json` and `job.steer` becomes usable.
- `crates/hero_shrimp_engine/src/tools/tool_catalog/verify/mod.rs`: `handle_complete_phase` calls the mirror helper after the post-verification plan rewrite, so phase progression continues to refresh the mirror.
- `crates/hero_shrimp_engine/src/orchestration/autonomy/operator.rs`: rewrote the `No active autonomy state found` error message to drop the reference to a "Resume button" that does not exist and point operators at the real surface (`update_plan` / follow-up prompt).
- `crates/hero_shrimp_engine/src/orchestration/autonomy/tests.rs`: added 6 unit tests covering creation, no-op-on-missing-plan, operator-field preservation, `job_id` mismatch resynthesis, empty `job_id` rejection, and the new error message.
Deliberately NOT changed
- `persist_goal_and_state` and its existing callers (`activate_autonomy_job_state`, `fork_autonomy_job`, `resume_job`). The mirror is additive: those paths still own DB upsert + goal-doc writes for operator-driven activations.
- The `JobStateSnapshotRow` schema and `load_state`'s DB-first behaviour.
- The `job_plan.json` format itself.
Test Results
- The failures under `cargo test -p hero_shrimp_engine --lib` are pre-existing on `development` (the bubblewrap-detection test plus shell/e2e tests that pass in isolation but fail under the full-suite harness). Confirmed by running them on `development` directly.
Acceptance criteria
- `job.steer { job_id, message }` against a job whose workspace contains only `.agent/job_plan.json` + `.agent/plan_versions/` succeeds and sets `operator_guidance` (validated by `mirror_task_state_from_plan_creates_state_when_none_exists` + manual trace through `apply_autonomy_operator_action`).
- `job.steer { job_id, clear: true }` clears `operator_guidance`: same code path, now reachable.
- `OperatorGuidanceProvider` will see steering text on the next loop iteration (it already reads `STATE_PATH`; the file now exists).
- `operator_guidance`, `operator_force_replan`, `operator_pause_requested`, `blocked_reason` survive subsequent `update_plan` / `complete_phase` ticks (validated by `mirror_task_state_from_plan_preserves_operator_fields`).
- Tier-0 jobs (no `update_plan`) do not crash (validated by `mirror_task_state_from_plan_is_noop_when_plan_missing`).
- The error message no longer mentions a "Resume button" (validated by `steer_error_message_no_longer_mentions_resume_button`).
Notes
- This change does not add a `job.resume` RPC or Resume button. The error message rewording stops the wild-goose chase; building that surface, if desired, is a separate issue.
Follow-up: extended scope after testing
After verifying the original fix, two additional issues (A and B) surfaced and were addressed on the same branch. The fix for Issue A in turn required a small follow-up change, described as Issue C.
Issue A: `job.steer` still errored for jobs that never entered the autonomy/plan path
The original fix wrote `task_state.json` from `update_plan` / `complete_phase`, so steer worked for autonomy jobs that had published a plan. But jobs running in the fast/Tier-0 path (or autonomy jobs that hadn't yet called `update_plan`) never created the file, so `job.steer` still failed with the new error message.
Fix: `apply_autonomy_operator_action` now accepts the "instructional" actions (`steer_job`, `clear_steer`, `force_replan`, `pause_job`) even when `task_state.json` doesn't exist. A minimal `ExecutionState` is synthesized in memory, the field is set, and the state is persisted. Phase-specific actions (`retry_phase`, `skip_phase`) still error when no plan exists, because they legitimately need a phase id.
- `crates/hero_shrimp_engine/src/orchestration/autonomy/operator.rs`: added `synthesize_minimal_state_for_operator(job_id)`; restructured the state-load to fall back to synthesis for the four instructional actions.
Issue B: `[WARN] reconciled timed-out autonomy run from completed subagents` spammed the logs
A pre-existing idempotency bug in `stamp_subagent_reconciliation_details`: `failure_kind` and `run_timeout` were only cleared from `details_json` when `summary.status == "completed"`. If subagents finished but reconciliation produced any other verdict (e.g. contract verification failed), `failure_kind: "run_timeout"` was left in place, so `row_failed_due_to_run_timeout` matched on the very next UI poll and re-fired the reconciliation (and the warn) for the same row forever.
Fix: clear `failure_kind` and `run_timeout` after every reconciliation regardless of verdict, since the original "merely timed out" classification is no longer accurate once subagent reconciliation has run.
- `crates/hero_shrimp_server/src/rpc/methods/job/contract.rs`: hoisted the two `details.remove(...)` calls out of the `summary.status == "completed"` branch.
Issue C: message.send "queued while starting" UX preserved
The Issue A change inadvertently broke `message_send_queues_guidance_when_active_state_is_not_ready`: `steer_existing_job_from_message` in `session_autonomy.rs` previously triggered the `queue_pending_operator_guidance` fallback only on steer failure. With steer now succeeding via synthesized state, the friendly `"Queued guidance while the active job finishes starting."` message and the DB-side `pending_operator_guidance` backstop both stopped firing.
Fix: detect `task_state.json` absence before the steer call and route through the queue helper when the state had to be synthesized. The user-facing message stays accurate ("queued while starting", not "applied") and the DB backstop continues to write `pending_operator_guidance` so the autonomy loop can read it via either the file mirror or `pending_operator_guidance_from_db`.
- `crates/hero_shrimp_server/src/rpc/methods/session_autonomy.rs`: pre-steer `state_existed` check; route through `queue_pending_operator_guidance` when state was synthesized.
New tests
- `steer_job_succeeds_when_no_state_file_exists`
- `clear_steer_succeeds_when_no_state_file_exists`
- `retry_phase_still_errors_when_no_state_and_no_resume_button_mentioned`
- `stamp_clears_failure_kind_even_when_summary_failed`
- `stamp_clears_failure_kind_when_summary_completed`
Plus the previously failing `message_send_queues_guidance_when_active_state_is_not_ready` is green again after the UX preservation fix in Issue C.
Build status
- `cargo check -p hero_shrimp_engine`: clean.
- `cargo check -p hero_shrimp_server`: clean.
- `cargo test -p hero_shrimp_engine --lib orchestration::autonomy::tests::`: 29 passed, 0 failed.
- `cargo test -p hero_shrimp_server --lib rpc::methods::job::contract::reconciliation_tests`: 2 passed, 0 failed.
Summary of behavior change
`job.steer` and `job.clear_steer` (and `force_replan`, `pause_job`) now succeed for any running job, autonomy or fast-path, regardless of whether a plan has been published. The guidance is delivered via:
- `OperatorGuidanceProvider` reading `task_state.json` (file path), and
- `pending_operator_guidance_from_db` reading `autonomy_jobs.details_json` when the message.send code path queued it.
The reconcile-loop log spam is also fixed.
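The Issue A fallback (synthesize state for instructional actions, keep erroring for phase-specific ones) can be sketched roughly as below. `MiniState`, `is_instructional` and `apply_action` are hypothetical names for illustration only, and the mapping of `force_replan` and `pause_job` onto `operator_force_replan` and `operator_pause_requested` is an assumption inferred from the field names, not confirmed by the source.

```rust
/// Hypothetical, trimmed-down state; the real `ExecutionState` has more fields.
#[derive(Debug, Default)]
struct MiniState {
    job_id: Option<String>,
    operator_guidance: Option<String>,
    operator_force_replan: bool,
    operator_pause_requested: bool,
}

/// The four actions that only set a flag or message and need no phase id.
fn is_instructional(action: &str) -> bool {
    matches!(action, "steer_job" | "clear_steer" | "force_replan" | "pause_job")
}

/// Apply an operator action to the state loaded from disk, if any.
fn apply_action(
    on_disk: Option<MiniState>,
    job_id: &str,
    action: &str,
    message: &str,
) -> Result<MiniState, String> {
    let mut state = match on_disk {
        Some(s) => s,
        // No state file yet: instructional actions get a synthesized state.
        None if is_instructional(action) => MiniState {
            job_id: Some(job_id.to_string()),
            ..Default::default()
        },
        // Phase-specific actions still need a published plan.
        None => return Err(format!("{action} needs a published plan for job {job_id}")),
    };

    match action {
        "steer_job" => state.operator_guidance = Some(message.to_string()),
        "clear_steer" => state.operator_guidance = None,
        // Assumed field mapping, see the note above.
        "force_replan" => state.operator_force_replan = true,
        "pause_job" => state.operator_pause_requested = true,
        other => return Err(format!("not covered by this sketch: {other}")),
    }
    Ok(state)
}
```

In this shape the caller can also tell whether the state had to be synthesized by checking `on_disk.is_none()` before the call, which is the kind of pre-steer check Issue C relies on to keep the "queued while starting" message accurate.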