Port Tier 1/2/3 learnings from xmoncode/shrimp #90

Merged
thabeta merged 16 commits from port-from-xmoncode-shrimp into integration 2026-06-11 23:37:43 +00:00

Summary

Ports a batch of ideas from xmonader's personal shrimp agent (~/xmoncode/shrimp) into hero_shrimp: all of Tier 1/2/3 from the comparison write-up. Each item is adapted to hero_shrimp's architecture (not a copy, since upstream uses a different crate layout), wired into the engine/runtime/CLI, and unit-tested.

Full details: docs/ports-from-xmoncode-shrimp.md.

What's included

New tools (registered + routed):

  • repo_wiki — drift-tracked ARCHITECTURE.md from the repo map
  • find_clones — near-duplicate function bodies (token-bag cosine)
  • impacted_tests — tests depending on changed files (via blast_radius)
  • ast_edit — tree-sitter symbol replacement (Rust) with a post-parse rollback gate
  • expand_context — retrieve the full text of an elided tool output on demand
  • fork — best-of-N candidate race in isolated git worktrees
  • mcp_search — BM25 ranking + name-resolve over MCP tools
  • skill_evolve — deterministic skill minting from recurring success patterns
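
For readers unfamiliar with the "token-bag cosine" measure named for find_clones, here is a minimal sketch. It assumes a naive tokenizer that splits on non-identifier characters; the function names `token_bag` and `cosine` are illustrative and not taken from the hero_shrimp source.

```rust
use std::collections::HashMap;

/// Count how often each identifier-like token occurs in a function body.
/// The tokenizer here is an assumption; the real one is not shown in this PR.
fn token_bag(src: &str) -> HashMap<String, f64> {
    let mut bag = HashMap::new();
    for tok in src
        .split(|c: char| !(c.is_alphanumeric() || c == '_'))
        .filter(|t| !t.is_empty())
    {
        *bag.entry(tok.to_string()).or_insert(0.0) += 1.0;
    }
    bag
}

/// Cosine similarity of two token bags: dot(a, b) / (|a| * |b|).
/// Returns 0.0 when either bag is empty.
fn cosine(a: &HashMap<String, f64>, b: &HashMap<String, f64>) -> f64 {
    let dot: f64 = a
        .iter()
        .filter_map(|(k, va)| b.get(k).map(|vb| va * vb))
        .sum();
    let norm = |m: &HashMap<String, f64>| m.values().map(|v| v * v).sum::<f64>().sqrt();
    let denom = norm(a) * norm(b);
    if denom == 0.0 { 0.0 } else { dot / denom }
}
```

Two bodies that differ only in token order score close to 1.0 under this measure, which is why it catches near-duplicates that a line diff would miss.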

Behavior / hot paths:

  • loop-detection cold-start exploration grace
  • Anthropic prompt cache anchored on the last stable (assistant) message
  • per-server MCP circuit breaker
  • RRF + MMR diversity re-rank in memory recall
  • per-segment shell grant keys in the session approval cache
  • conversational approve-over-chat (Telegram) + reject-with-feedback
  • declarative file-defined crews (dependency-wave DAG + typed handoffs)
  • macOS Seatbelt (sandbox-exec) shell backend
  • typed llm:delta → MessagePartial at the client edge
  • council raised 3 → 4 members (MAX_COUNCILORS + tier clamp)
  • new tools wired into tool_routing groups
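
The "RRF + MMR diversity re-rank" item combines two standard techniques. The sketch below shows both in their textbook form, assuming string ids, 1-based ranks, and a caller-supplied similarity function; the constant `k`, the `lambda` weight, and all names are illustrative and may differ from the memory-recall code in this PR.

```rust
use std::collections::HashMap;

/// Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank), rank 1-based.
fn rrf(lists: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for list in lists {
        for (i, id) in list.iter().enumerate() {
            *scores.entry(id.to_string()).or_insert(0.0) += 1.0 / (k + i as f64 + 1.0);
        }
    }
    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    // Highest score first; ties broken by id so the order is deterministic.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    fused
}

/// Maximal Marginal Relevance: greedily pick the candidate that maximizes
/// lambda * relevance - (1 - lambda) * (max similarity to anything already picked).
fn mmr(
    cands: &[(String, f64)],
    sim: impl Fn(&str, &str) -> f64,
    lambda: f64,
    n: usize,
) -> Vec<String> {
    let mut remaining: Vec<&(String, f64)> = cands.iter().collect();
    let mut picked: Vec<String> = Vec::new();
    while picked.len() < n && !remaining.is_empty() {
        let mut best = (0usize, f64::NEG_INFINITY);
        for (i, cand) in remaining.iter().enumerate() {
            let redundancy = picked
                .iter()
                .map(|p| sim(cand.0.as_str(), p.as_str()))
                .fold(0.0, f64::max);
            let score = lambda * cand.1 - (1.0 - lambda) * redundancy;
            if score > best.1 {
                best = (i, score);
            }
        }
        picked.push(remaining.remove(best.0).0.clone());
    }
    picked
}
```

RRF needs only ranks, so it can merge keyword and vector results whose raw scores are not comparable; MMR then trades a little relevance for variety among the top results.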

Harnesses:

  • 5 new behavioral eval scenarios that assert real on-disk effects
  • eval/fromscratch/ — held-out-oracle capability harness (ported)

Verification

  • cargo build --workspace clean, 0 warnings on changed crates.
  • Unit suite: 1717 passed (2 failures are a sandbox artifact — /tmp is itself a git repo in CI; they pass under a non-git TMPDIR).
  • Behavioral eval: 16/16 through the real agent loop (scripted LLM). The 5 new scenarios assert real file effects (ast_edit rewrites a file, repo_wiki writes the doc, etc.) — this is what caught a real routing bug where the new tools were registered but never offered to the model.
  • From-scratch harness, run live: deepseek-v4-flash built a complete bencode encoder/decoder from a spec; the held-out acceptance test passed 5/5.
  • Live multi-model verification: executor deepseek-v4-flash + a 4-model council (deepseek-v4-pro, z-ai/glm-5.1, minimax/minimax-m3, moonshotai/kimi-k2.6) — all confirmed responding (authoritative: cost ledger + council_positions table). ~$0.14 total.

Notes for the reviewer

  • Council cap 3 → 4 is included intentionally (raises council size, ~33% more cost per consult). Easy to revert to config-only if undesired.
  • Not yet exercised end-to-end (unit-tested + isolated, low blast radius): fork, declarative crews, watch, expand_context, conversational approvals. Prompt-cache anchoring and MMR recall are hot-path changes validated by structure/unit tests but not against live external behavior.
  • Adds 3 dependencies (tree-sitter, tree-sitter-rust, streaming-iterator) for ast_edit.

Test plan

  • cargo build --workspace
  • cargo test --workspace (engine 1717 pass; 2 env-only)
  • make eval → 16/16
  • eval/fromscratch/run.sh bencode (live) → 5/5 held-out
  • live executor + 4-model council run
Merge pull request 'update main' (#83) from development into main
All checks were successful
Build Linux / build-linux (push) Successful in 12m16s
Verify / verify (push) Successful in 38m10s
7da7d6f587
Reviewed-on: #83
chore: build on main — hero_lifecycle factor-out + herolib_openrpc, CI 1.96
All checks were successful
Build Linux / build-linux (push) Successful in 4m59s
Verify / verify (push) Successful in 32m12s
5644285cce
feat: port Tier 1/2/3 learnings from xmoncode/shrimp
Some checks failed
Verify / verify (push) Failing after 21s
bf6c279992
Adapts a batch of ideas from xmonader's personal `shrimp` agent into
hero_shrimp, each wired into the engine/runtime/CLI and unit-tested.
Workspace builds clean (0 warnings); behavioral eval 16/16; live-verified
against deepseek-v4-flash (executor) + a 4-model council (deepseek-v4-pro,
z-ai/glm-5.1, minimax/minimax-m3, moonshotai/kimi-k2.6).

New tools (registered + routed):
- repo_wiki        drift-tracked ARCHITECTURE.md from the repo map
- find_clones      near-dup function bodies (token-bag cosine)
- impacted_tests   tests depending on changed files (blast_radius)
- ast_edit         tree-sitter symbol replacement (Rust) + rollback gate
- expand_context   retrieve full elided tool output on demand
- fork             best-of-N candidate race in git worktrees
- mcp_search       BM25 ranking + name-resolve over MCP tools
- skill_evolve     deterministic skill minting from success patterns

Behavior / hot paths:
- loop-detection cold-start exploration grace
- Anthropic prompt cache anchored on the last stable (assistant) message
- per-server MCP circuit breaker
- RRF+MMR diversity re-rank in memory recall
- per-segment shell grant keys in the session approval cache
- conversational approve-over-chat (Telegram) + reject-with-feedback
- declarative file-defined crews (dependency-wave DAG + typed handoffs)
- macOS Seatbelt (sandbox-exec) shell backend
- typed llm:delta -> MessagePartial at the client edge
- council raised 3 -> 4 members (MAX_COUNCILORS + tier clamp)
- new tools wired into tool_routing groups
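
As an illustration of the "per-server MCP circuit breaker" item, the following is a minimal sketch of the general pattern, not the code from this commit. It assumes a consecutive-failure threshold and a fixed cooldown, and takes the current time as a parameter so it can be tested without a clock; every name here is hypothetical.

```rust
use std::collections::HashMap;

#[derive(Default)]
struct ServerState {
    consecutive_failures: u32,
    opened_at: Option<u64>, // seconds; None means the breaker is closed
}

struct Breaker {
    threshold: u32,
    cooldown_secs: u64,
    servers: HashMap<String, ServerState>,
}

impl Breaker {
    fn new(threshold: u32, cooldown_secs: u64) -> Self {
        Breaker { threshold, cooldown_secs, servers: HashMap::new() }
    }

    /// A call is allowed if the server's breaker is closed, or if it has been
    /// open for at least the cooldown (a trial call is then let through).
    fn allow(&self, server: &str, now: u64) -> bool {
        match self.servers.get(server).and_then(|s| s.opened_at) {
            Some(opened) => now.saturating_sub(opened) >= self.cooldown_secs,
            None => true,
        }
    }

    /// Record a call outcome. Success closes the breaker; reaching the
    /// failure threshold opens (or re-opens) it at `now`.
    fn record(&mut self, server: &str, ok: bool, now: u64) {
        let state = self.servers.entry(server.to_string()).or_default();
        if ok {
            state.consecutive_failures = 0;
            state.opened_at = None;
        } else {
            state.consecutive_failures += 1;
            if state.consecutive_failures >= self.threshold {
                state.opened_at = Some(now);
            }
        }
    }
}
```

Keeping the state keyed by server name means one failing MCP server stops being called while the others remain available.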

Harnesses:
- 5 new behavioral eval scenarios that assert real on-disk effects
- eval/fromscratch/ held-out-oracle capability harness (ported)

Docs: docs/ports-from-xmoncode-shrimp.md

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
thabeta changed title from port-from-xmoncode-shrimp to Port Tier 1/2/3 learnings from xmoncode/shrimp 2026-06-04 23:23:31 +00:00
chore(ci): green the Verify action + add recipe workflows tool
All checks were successful
Verify / verify (push) Successful in 13m48s
208d2127b3
CI hygiene so the forgejo "Verify" action passes end to end:
- cargo fmt --all
- fix 2 clippy lints under `-D warnings`:
  - fork_ops: `.filter_map` that always returns Some -> `.map`
  - auto_evolve: `contains_key` + `insert` -> `HashSet::insert`
- update tests for the intentionally-changed behavior:
  - chaos_testing: cold-start exploration grace now exempts the opening
    read-only burst, so the loop-detector tests first do a real edit
  - council: tier clamp raised 3 -> 4 (matches the 4-member council)

New idea ported from xmoncode/shrimp (better coding agent + assistant):
- `recipe` tool — canned expert workflows: ask, tour, spec (spec-driven
  dev), adr, harden, audit, learn. The agent adopts a battle-tested
  instruction prompt for the turn. Wired into tool routing (always-on).

Verified locally with the CI toolchain (rust 1.96): cargo fmt --check,
cargo clippy --workspace --all-targets -D warnings, make smoke-features,
cargo test --workspace (TMPDIR non-git), make eval (16/16), cargo deny
check, cargo audit --deny warnings — all green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
harden ports + ast_edit multi-language (adversarial review)
Some checks failed
Verify / verify (push) Has been cancelled
b2b04baaa5
A self-review of the ported code surfaced real boundary regressions and
edge cases; this fixes them and lands the biggest capability upgrade.

Security:
- shell_grant: non-subcommand programs now key on their FULL args, and
  redirections / command-substitution / newlines never coalesce — so
  approving `rm -rf ./build` can't auto-approve `rm -rf /`, and a benign
  `git log` can't auto-approve `git log > /etc/passwd`.
- ast_edit + repo_wiki: confine writes to the workspace via path_policy
  (were write-anywhere). repo_wiki drops its unconfined `path` arg and is
  re-tagged cap:fs.write.
- telegram conversational approval: only the unambiguous yes/no/always
  vocabulary resolves a pending request — arbitrary free text no longer
  silently cancels a tool. `/approve` and `/deny` are scoped to the
  chat's own session (no cross-operator/cross-session resolution).
- auto_evolve: sanitize values interpolated into skill YAML frontmatter;
  guard an empty name slug.
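
The shell_grant rule above can be sketched as a key-derivation function. This is an illustrative reduction under stated assumptions: it handles one command segment, `SUBCOMMAND_PROGRAMS` is a hypothetical allow-list, and the checks for redirection and substitution are simple substring tests rather than a real shell parser.

```rust
/// Hypothetical list of programs whose first argument is a subcommand.
const SUBCOMMAND_PROGRAMS: &[&str] = &["git", "cargo"];

/// Derive the approval-cache key for one shell segment.
/// - Segments with redirection, command substitution, or newlines never
///   coalesce: the key is the full text, so only an identical command matches.
/// - Subcommand programs key on "program subcommand".
/// - Everything else keys on the full argument list.
fn grant_key(segment: &str) -> String {
    let seg = segment.trim();
    let risky = seg.contains('>')
        || seg.contains('<')
        || seg.contains("$(")
        || seg.contains('`')
        || seg.contains('\n');
    let mut words = seg.split_whitespace();
    let program = words.next().unwrap_or("");
    if !risky && SUBCOMMAND_PROGRAMS.contains(&program) {
        if let Some(sub) = words.next() {
            return format!("{program} {sub}");
        }
    }
    seg.to_string()
}
```

With keys like these, an approval for `git log` covers `git log -n 5`, while `rm -rf ./build` and `rm -rf /` map to different keys and must each be approved.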

Correctness:
- watch: dedup on (file, instruction) not line, so a shifted marker isn't
  re-dispatched in a loop; comment detection is quote-aware, so `//`
  inside a string or URL no longer false-fires.
- ast_edit: regression test that replacing a middle item keeps the
  following item on its own line.
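
The two watch fixes can be illustrated with a short sketch. It assumes `//` line comments, double-quoted strings with backslash escapes, and URLs recognizable by a `:` directly before the slashes; `comment_start` and `should_dispatch` are hypothetical names, not the functions in this commit.

```rust
use std::collections::HashSet;

/// Byte offset of the first `//` that starts a real comment, skipping
/// occurrences inside double-quoted strings and after a URL scheme colon.
fn comment_start(line: &str) -> Option<usize> {
    let b = line.as_bytes();
    let mut in_string = false;
    let mut i = 0;
    while i < b.len() {
        if in_string {
            if b[i] == b'\\' {
                i += 1; // skip the escaped character
            } else if b[i] == b'"' {
                in_string = false;
            }
        } else if b[i] == b'"' {
            in_string = true;
        } else if b[i] == b'/' && b.get(i + 1) == Some(&b'/') {
            if i == 0 || b[i - 1] != b':' {
                return Some(i);
            }
            i += 1; // "://" belongs to a URL; skip the second slash too
        }
        i += 1;
    }
    None
}

/// Dedup on (file, instruction) rather than line number, so a marker that
/// moves when lines shift is not dispatched a second time.
fn should_dispatch(seen: &mut HashSet<(String, String)>, file: &str, instruction: &str) -> bool {
    seen.insert((file.to_string(), instruction.trim().to_string()))
}
```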

Capability:
- ast_edit now supports Python, JavaScript, TypeScript/TSX, and Go
  (tree-sitter grammars + per-language item queries), not just Rust.

Performance:
- find_clones: length-banded sweep + a hard cap on functions compared,
  so the all-pairs scan can't hang on a large monorepo.
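
One way to read "length-banded sweep + a hard cap" is sketched below: sort functions by token count, compare each only with neighbours whose length is within a ratio, and bound the number of functions considered. The band ratio, the cap policy (keeping the shortest functions), and the name `banded_pairs` are assumptions made for illustration.

```rust
/// Return index pairs worth comparing. `lens[i]` is the token count of
/// function `i`; a pair is kept only if the longer is at most `band` times
/// the shorter. At most `max_fns` functions are considered at all.
fn banded_pairs(lens: &[usize], band: f64, max_fns: usize) -> Vec<(usize, usize)> {
    let mut idx: Vec<usize> = (0..lens.len()).collect();
    idx.sort_by_key(|&i| lens[i]);
    idx.truncate(max_fns); // hard cap; which functions to keep is a policy choice
    let mut pairs = Vec::new();
    for a in 0..idx.len() {
        let limit = lens[idx[a]] as f64 * band;
        for b in (a + 1)..idx.len() {
            if lens[idx[b]] as f64 > limit {
                break; // sorted by length, so every later function is too long as well
            }
            pairs.push((idx[a], idx[b]));
        }
    }
    pairs
}
```

Because the list is sorted, the inner loop stops at the first function outside the band, so the scan does far less work than all-pairs when lengths are spread out.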

Green with the CI toolchain (rust 1.96): fmt, clippy -D warnings,
smoke-features, cargo test --workspace, make eval, cargo deny, cargo audit.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Ports five high-impact features from kimi-cli to achieve feature parity:

**Think Tool** (`tools/tool_catalog/think_tool.rs`)
- Explicit reasoning tool for structured deliberation
- Model calls `think` with reasoning, logged and returned for later reference
- Added to new "cognition" toolset

**Ralph Mode** (`commands/ralph.rs`)
- Iterative execution until agent emits STOP signal
- Feeds task repeatedly; sends "Continue." nudge between iterations
- Configurable `--max-iterations` (default: 10) with safety bound
- Added `ralph` CLI subcommand and dispatch wiring
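
The Ralph loop described above can be sketched as a small driver. This is one plausible reading, assuming the task is sent first and the literal "Continue." nudge afterwards, and that the stop signal is the substring `STOP` in the reply; the closure stands in for a full agent turn and is hypothetical.

```rust
/// Run `step` until its reply contains "STOP" or `max_iterations` is reached.
/// Returns (iterations used, whether the agent stopped on its own).
fn ralph<F: FnMut(&str) -> String>(task: &str, max_iterations: usize, mut step: F) -> (usize, bool) {
    let mut prompt = task.to_string();
    for i in 1..=max_iterations {
        let reply = step(&prompt);
        if reply.contains("STOP") {
            return (i, true);
        }
        prompt = "Continue.".to_string(); // nudge between iterations
    }
    (max_iterations, false) // safety bound hit
}
```

The iteration cap is what keeps an agent that never emits the signal from running indefinitely.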

**Dynamic Injection Providers** (`agent_core/agent/dynamic_injection.rs`)
- Runtime-state-driven system reminders injected into agent loop
- `plan_mode_provider`: reminds model write tools are blocked in PlanOnly mode
- `yolo_mode_provider`: reminds model to be conservative in auto-approve mode
- Injected as trailing user notes (not system-prompt edits) to preserve cache prefix
- Fires every 8 iterations so model doesn't drift
- Wired into `llm_loop.rs` between context compression and LLM invocation
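
A minimal sketch of the injection idea follows, assuming messages are plain strings and that the reminder fires when the iteration count is a multiple of 8. The `Mode` enum, the reminder wording, and `build_request` are illustrative stand-ins for the providers named above.

```rust
#[derive(Clone, Copy)]
enum Mode {
    Normal,
    PlanOnly,
    Yolo,
}

const REMIND_EVERY: usize = 8;

/// Pick a reminder from runtime state, or None when nothing should fire.
fn reminder(mode: Mode, iteration: usize) -> Option<&'static str> {
    if iteration % REMIND_EVERY != 0 {
        return None;
    }
    match mode {
        Mode::PlanOnly => Some("Reminder: write tools are blocked in PlanOnly mode."),
        Mode::Yolo => Some("Reminder: auto-approve is on; prefer conservative actions."),
        Mode::Normal => None,
    }
}

/// Append the reminder as a trailing note. The existing history is copied
/// unchanged, so the cached prompt prefix stays byte-identical.
fn build_request(history: &[String], mode: Mode, iteration: usize) -> Vec<String> {
    let mut msgs = history.to_vec();
    if let Some(note) = reminder(mode, iteration) {
        msgs.push(format!("[user note] {note}"));
    }
    msgs
}
```

Appending at the tail rather than editing the system prompt is the design point: everything before the note is unchanged from the previous request, so a prefix cache can still hit.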

**Multi-Scope Skill Discovery** (`skills/markdown_skills.rs`)
- Brand-compatible skill directories: `.shrimp/`, `.agents/`, `.kimi/`, `.claude/`, `.codex/`
- Four-tier hierarchy: project → config → user → extra (`HERO_SHRIMP_EXTRA_SKILL_DIRS`)
- Earlier brand names win when same skill exists in multiple dirs
- Backward-compatible with legacy `.agents/` and `.shrimp/` flat dirs
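
The "earlier brand names win" rule can be sketched as follows. The brand order is taken from the list above; the input shape (pairs of brand directory and skill name within a single tier) and the function name `resolve` are assumptions for illustration, and the four-tier hierarchy is left out.

```rust
use std::collections::BTreeMap;

/// Brand directories in precedence order, highest first.
const BRANDS: [&str; 5] = [".shrimp", ".agents", ".kimi", ".claude", ".codex"];

/// Given (brand_dir, skill_name) pairs found on disk, map each skill name
/// to the highest-precedence brand directory that provides it.
fn resolve<'a>(found: &[(&'a str, &'a str)]) -> BTreeMap<&'a str, &'a str> {
    let mut chosen = BTreeMap::new();
    for brand in BRANDS {
        for &(dir, skill) in found {
            if dir == brand {
                // First writer wins, and brands are visited in precedence order.
                chosen.entry(skill).or_insert(dir);
            }
        }
    }
    chosen
}
```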

**Agent Tracing Visualizer** (`web/src/routes/vis.rs`)
- Standalone `/vis` route serving debug HTML page
- Connects to `/api/events` SSE stream
- Color-coded timeline: LLM calls (blue), tool starts/finishes (green), phases (orange), council (purple), errors (red)
- Auto-scroll, collapsible payloads, clear/controls

**Integration**
- All features compile (`cargo check --workspace` passes)
- CLI tests pass (32/32)
- No new `.unwrap()` calls introduced
- Follows existing code style and patterns
feat: SEARCH/REPLACE edit format, three-tier compaction, fix critical bugs
Some checks failed
Verify / verify (push) Failing after 9s
f29af5c3b9
Implemented:
- SEARCH/REPLACE edit format recovery (aider-style) in agent loop
- Three-tier compaction (warn/auto/hard thresholds) for context management
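
A sketch of what a warn/auto/hard threshold scheme might look like is shown below. The commit does not state the threshold values, so the ratios 0.70, 0.85, and 0.95 are placeholders, and the `Compaction` enum and `tier` function are hypothetical names.

```rust
#[derive(Debug, PartialEq)]
enum Compaction {
    Idle, // plenty of room
    Warn, // surface a warning only
    Auto, // compact automatically
    Hard, // compact aggressively before the next request
}

/// Map context usage to a compaction tier. Threshold ratios are placeholders.
fn tier(used_tokens: usize, window: usize) -> Compaction {
    let ratio = used_tokens as f64 / window as f64;
    if ratio >= 0.95 {
        Compaction::Hard
    } else if ratio >= 0.85 {
        Compaction::Auto
    } else if ratio >= 0.70 {
        Compaction::Warn
    } else {
        Compaction::Idle
    }
}
```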

Fixed critical bugs:
- Removed transmute UB in agency/dispatch.rs, runtime_state.rs, config_cache.rs
  by moving subagent mutexes to static LazyLock variables
- Fixed SQL injection in db/backup.rs (VACUUM INTO path validation)
- Fixed SQL injection in db/state.rs (scrub_table identifier validation)
- Fixed Dockerfile rust version (1.85 → 1.96)
- Removed committed node_modules (99MB, 4771 files) and fixed .gitignore typo
- Removed orphan hero_shrimp_executor dependency from Cargo.toml
- Updated README workspace layout to reflect actual crates
- Enhanced file_edit tool to support batch edits via  array parameter
- Added unified diff preview in tool results using  crate
- Maintains backward compatibility with single-edit / params
- Uses fuzzy matching from file_multi_edit for whitespace-tolerant edits
- Verification gate was already opt-in:  defaults to proof=false,
   defaults to false in AgentOptions
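
For the identifier-validation fix listed among the critical bugs, a common approach is to allow only a strict character set before interpolating a name into SQL, because identifiers cannot be bound as parameters. The sketch below shows that approach; the statement text and both function names are hypothetical and are not copied from db/state.rs.

```rust
/// Accept only [A-Za-z_][A-Za-z0-9_]* as a table or column name.
fn is_safe_identifier(name: &str) -> bool {
    let mut chars = name.chars();
    match chars.next() {
        Some(c) if c.is_ascii_alphabetic() || c == '_' => {
            chars.all(|c| c.is_ascii_alphanumeric() || c == '_')
        }
        _ => false,
    }
}

/// Build a statement for a validated table name, or refuse.
/// The DELETE statement is a placeholder for whatever the real code runs.
fn scrub_sql(table: &str) -> Result<String, String> {
    if !is_safe_identifier(table) {
        return Err(format!("refusing unsafe table name: {table:?}"));
    }
    Ok(format!("DELETE FROM \"{table}\""))
}
```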
Core Editing:
- Added syntax validation gate to file_write, file_edit, file_multi_edit
  using tree-sitter (supports Rust, Python, JS, TS, Go)
- Validates code parses correctly before committing to disk
- Rejects edits that would introduce syntax errors

Edit Formats:
- Added apply_patch tool for unified diff application
- Supports git-diff format with file creation/deletion/rename
- Uses git apply --reject for safe partial application

Context Management:
- Added image stripping during compaction (base64 data URIs)
- Saves tokens by replacing images with [image: stripped] placeholder
- Applied to both DefaultCompactionStrategy and AggressiveTrimStrategy

Safety & Approval:
- Enhanced confirmation prompts to show batch edit diffs
- Supports new  array parameter in file_edit tool
- Shows per-edit diff with numbered headers

Reliability:
- Made repair module functions pub(crate) for reuse
- Added strip_images function with regex matching
- All changes compile and tests pass
Implements importance scoring for tool results during context compaction:
- Critical: errors, failures, compilation issues, panics (preserved in full)
- High: test output, linter warnings (prefer keeping)
- Normal: standard tool output (eligible for summarization)
- Low: ambient reads like file_list, grep, git_status (first to prune)

Modifies both prune_tool_results_internal and prune_to_water_mark to sort
by importance ascending before pruning, ensuring failures survive longer
than successes during context window pressure.

Adds 3 new tests verifying smart preservation behavior.
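
The ordering described above can be sketched with a derived `Ord` on an importance enum. The keyword checks and the tool names used for the High tier are naive assumptions for illustration; only the Low-tier tool names come from the commit message.

```rust
/// Variants are declared from least to most important, so the derived
/// ordering sorts the first-to-prune results first.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Importance {
    Low,
    Normal,
    High,
    Critical,
}

/// Naive classifier: keyword match on the output, then tool name.
fn classify(tool: &str, output: &str) -> Importance {
    let lower = output.to_lowercase();
    if lower.contains("error") || lower.contains("panic") || lower.contains("failed") {
        Importance::Critical
    } else if matches!(tool, "test" | "lint") {
        Importance::High // hypothetical tool names
    } else if matches!(tool, "file_list" | "grep" | "git_status") {
        Importance::Low
    } else {
        Importance::Normal
    }
}

/// Tool names in the order they would be pruned (least important first).
fn prune_order<'a>(results: &[(&'a str, &'a str)]) -> Vec<&'a str> {
    let mut scored: Vec<(Importance, &str)> =
        results.iter().map(|&(tool, out)| (classify(tool, out), tool)).collect();
    scored.sort_by_key(|&(importance, _)| importance); // stable: ties keep input order
    scored.into_iter().map(|(_, tool)| tool).collect()
}
```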
Implements multiple competitive features:

1. Auto-lint feedback loop (coding/auto_lint.rs):
   - Runs project linter automatically after file edits
   - Supports cargo clippy, eslint, ruff, flake8, pylint, golangci-lint
   - 10-second timeout, best-effort — never blocks edit result
   - Appends lint errors to tool result so model can self-correct
   - Integrated into augment_with_lsp_diagnostics post-edit hook

2. RepoMap with ctags (runtime/repo_map.rs):
   - try_ctags_extraction() runs universal-ctags when available
   - JSON output parsing for accurate symbol extraction
   - Falls back to regex extractors when ctags is unavailable
   - Handles Rust, Python, JS/TS, Go via ctags

3. Chat history search API:
   - message.search RPC method using existing FTS5 (messages_fts)
   - /api/messages/search endpoint in web server
   - Returns ranked results with relevance scores

4. Unwrap lint prevention:
   - Added #![cfg_attr(not(test), warn(clippy::unwrap_used))]
   - Warns on new unwrap additions in production code
   - Tests are exempt (test code can panic)

5. Updated REEVALUATION.md with corrected score (~72%)
Comprehensive technical capability comparison:
- kimi-cli wins: simplicity, shell mode, VS Code extension, ACP
- qwen-code wins: IDE integration (VS Code/Zed/JetBrains), SDKs, daemon mode
- aider wins: 14 edit formats, ctags repo map, architect/editor split, auto-lint/test
- claude-code wins: polish, GitHub integration, plugins, commercial support

hero_shrimp wins: verification gate, signed wire log, budget caps, sandboxes,
council voting, parallel execution, smart compaction, syntax validation

Identified picoclaw as not found in public repos.
New edit format dialects for tool-call recovery:
- whole_file: ### File: path +  (→ file_write)
- wdiff: [-old-]{+new+} inline markers (→ file_edit with batch edits)
- hunk_diff: @@ -l,s +l,s @@ with context lines (→ file_edit with batch edits)

All 3 formats integrated into lift_recovered_tool_calls with proper tests.
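
To make the wdiff dialect concrete, here is a small sketch that extracts (old, new) replacement pairs from `[-old-]{+new+}` markers. It is an illustration only: it treats a `[-old-]` with no following `{+new+}` as a deletion, ignores standalone insertions, and `parse_wdiff` is a hypothetical name rather than the recovery code in this commit.

```rust
/// Extract (old, new) pairs from inline wdiff markers.
fn parse_wdiff(text: &str) -> Vec<(String, String)> {
    let mut edits = Vec::new();
    let mut rest = text;
    while let Some(start) = rest.find("[-") {
        let after = &rest[start + 2..];
        let Some(end_old) = after.find("-]") else { break };
        let old = &after[..end_old];
        let tail = &after[end_old + 2..];
        // A replacement needs "{+new+}" immediately after the deletion marker.
        if let Some(new_body) = tail.strip_prefix("{+") {
            if let Some(end_new) = new_body.find("+}") {
                edits.push((old.to_string(), new_body[..end_new].to_string()));
                rest = &new_body[end_new + 2..];
                continue;
            }
        }
        edits.push((old.to_string(), String::new())); // pure deletion
        rest = tail;
    }
    edits
}
```

Each pair maps naturally onto one entry of a batch edit, which matches the "file_edit with batch edits" target named for this dialect.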

Autofix loop (auto_lint.rs):
- After detecting lint errors, attempts automatic fix via:
  - cargo clippy --fix --allow-dirty --allow-staged
  - ruff check --fix / black
  - eslint --fix / biome lint --write
  - gofmt -w
- Reports fix results alongside lint diagnostics
- 15-second timeout on fix attempt
- Skips fix for non-fixable linters (flake8, pylint)

1745 tests passing (2 environmental failures)
Detailed source-level analysis of two competitive coding agents:

**mistral-vibe wins at:**
- Trust folder system (trusted/untrusted paths)
- Agent safety tiers (Safe/Neutral/Destructive/YOLO)
- Middleware pipeline (compaction, price limits, read-only)
- ACP protocol (full IDE integration)
- Skills system with subagent delegation

**soulforge wins at:**
- AST surgical editing (65+ operations via ts-morph)
- SQLite repo map with PageRank + git co-change + clone detection
- Memory system with embeddings + deduplication
- Per-prompt git checkpoints with undo/redo
- Task router (different models per slot)
- 33 languages supported

Identified 5 quick wins (low effort, high impact) and 5 long-term
investments for hero_shrimp to close competitive gaps.
feat: port soulforge/mistral-vibe features and fix time-travel hermeticity
Some checks failed
Verify / verify (push) Failing after 1m35s
1b5592e5ac
This commit closes four competitive gaps from the adversarial review and
stabilizes a long-standing flaky test pair.

Features:
- Git co-change weighting for PageRank (runtime/repo_map.rs)
  Parse git log --name-only, derive pairwise co-change edge weights, and
  blend them into the identifier-frequency reference graph so files that
  change together rank together.

- Per-prompt git checkpoints with /back and /forward (agent/git_checkpoint.rs)
  Auto-commit workspace mutations per iteration as shrimp-ck/<session>/<n>.
  Users can rewind/advance via slash commands; history persists across tags.

- AST surgical editing beyond whole-definition replacement (tools/ast_edit_ops.rs)
  Add ast_insert and ast_delete tools, keeping the same safety model:
  exact-one definition match, syntax-tree span, post-parse rollback.

- Memory with auto-embedding on save (tools/tool_catalog/memory_ops.rs)
  manage_memory save now best-effort embeds key+value so memories are
  immediately discoverable via the existing vector/hybrid search path.
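
The co-change weighting in the first feature can be sketched as pair counting over commits. The sketch assumes the `git log --name-only` output has already been parsed into one file list per commit, and the linear blend with factor `alpha` is a placeholder, since the commit does not give the actual blending formula.

```rust
use std::collections::HashMap;

/// Count, for every unordered file pair, how many commits touched both.
/// Keys are ordered (a, b) with a < b so each pair has one entry.
fn co_change(commits: &[Vec<&str>]) -> HashMap<(String, String), u32> {
    let mut weights = HashMap::new();
    for files in commits {
        let mut sorted: Vec<&str> = files.clone();
        sorted.sort();
        sorted.dedup();
        for i in 0..sorted.len() {
            for j in (i + 1)..sorted.len() {
                *weights
                    .entry((sorted[i].to_string(), sorted[j].to_string()))
                    .or_insert(0) += 1;
            }
        }
    }
    weights
}

/// Blend a reference-graph edge weight with the co-change count.
/// The linear form and `alpha` are assumptions, not the commit's formula.
fn blended(ref_weight: f64, co_changes: u32, alpha: f64) -> f64 {
    ref_weight + alpha * co_changes as f64
}
```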

Stability:
- Fix is_git_workspace() in orchestration/time_travel.rs
  Require the directory itself to be the git worktree root instead of
  merely being inside any git repo. This made tests under /tmp fail when
  /tmp happened to be a git repo because git add -A walked outside the
  temp workspace.

All workspace tests pass: 1757 passed, 0 failed, 1 ignored.
The shrimp_home tests mutated process-wide HOME/SHRIMP_HOME env vars,
which leaked into concurrently-running tests that resolve config paths.
This produced spurious 'Permission denied' failures when the suite ran
in parallel (the default).

Refactor shrimp_home to delegate to a pure helper that takes the env
values as arguments, and test the helper directly without touching the
process environment.
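
The refactor described above follows a common pattern: keep environment reads in a thin wrapper and put the logic in a pure function. The sketch below assumes `SHRIMP_HOME` takes precedence and that the fallback is a `.shrimp` directory under `HOME`; the fallback name and the function `resolve_shrimp_home` are assumptions, not confirmed by the commit.

```rust
use std::path::PathBuf;

/// Pure helper: takes the env values as arguments, touches no global state.
fn resolve_shrimp_home(shrimp_home: Option<&str>, home: Option<&str>) -> Option<PathBuf> {
    match shrimp_home.filter(|s| !s.is_empty()) {
        Some(dir) => Some(PathBuf::from(dir)),
        None => home
            .filter(|s| !s.is_empty())
            .map(|h| PathBuf::from(h).join(".shrimp")), // fallback dir name is assumed
    }
}

/// Thin wrapper: the only place that reads the process environment.
fn shrimp_home() -> Option<PathBuf> {
    let explicit = std::env::var("SHRIMP_HOME").ok();
    let home = std::env::var("HOME").ok();
    resolve_shrimp_home(explicit.as_deref(), home.as_deref())
}
```

Tests can then call the helper with literal values and run in parallel, since none of them needs to set or restore environment variables.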
thabeta merged commit e9299f9207 into integration 2026-06-11 23:37:43 +00:00
Reference
lhumina_code/hero_shrimp!90