feat(mcp): defer MCP tool schemas to slash per-turn context cost #3

Closed
thabeta wants to merge 0 commits from feat/deferred-mcp-tools into development
Owner

MCP tools were registered into the toolset and sent to the model on every
turn with full JSON schemas plus a boilerplate description prefix. With a few
servers that is thousands of tokens per turn, re-sent each step, occupying the
context window and pushing real content toward compaction.

Defer them instead (qwen-code / shrimp's idea): MCP tools are kept out of the
tool list and replaced by a compact, per-server index appended to the system
prompt. The model calls a new `tool_search` tool with a keyword to activate the
tools a task needs; their schemas re-enter the prompt from the next step on.
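
A minimal sketch of the deferral idea, assuming a much simpler shape than the real `KimiToolset`. Tools are reduced to name strings, the substring matcher stands in for the PR's matcher, and `activate_matching` / `visible_tools` are hypothetical names used only for illustration:

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Illustrative stand-in for the PR's DeferState: a catalog of deferred
/// tools (exposed name -> description) plus the set activated so far.
#[derive(Default)]
pub struct DeferState {
    pub catalog: BTreeMap<String, String>,
    pub activated: BTreeSet<String>,
}

impl DeferState {
    /// Activate every deferred tool whose name or description contains the
    /// keyword (case-insensitive) and return the names that matched.
    pub fn activate_matching(&mut self, keyword: &str) -> Vec<String> {
        let needle = keyword.to_lowercase();
        let hits: Vec<String> = self
            .catalog
            .iter()
            .filter(|(name, desc)| {
                name.to_lowercase().contains(&needle) || desc.to_lowercase().contains(&needle)
            })
            .map(|(name, _)| name.clone())
            .collect();
        self.activated.extend(hits.iter().cloned());
        hits
    }
}

/// Names sent to the model on a step: built-in tools, `tool_search` only
/// when something is deferred, then whatever has been activated.
pub fn visible_tools(builtin: &[&str], state: &DeferState) -> Vec<String> {
    let mut out: Vec<String> = builtin.iter().map(|s| s.to_string()).collect();
    if !state.catalog.is_empty() {
        out.push("tool_search".to_string());
    }
    out.extend(state.activated.iter().cloned());
    out
}
```

With an empty catalog the list is just the built-ins, which mirrors the "no change for non-MCP users" property claimed below.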

- KimiToolset gains a shared DeferState (catalog + activated set) so tool_search
  can activate tools from inside a spawned tool-call task without re-locking the
  toolset mutex held by the in-flight step.
- tools() sends only non-deferred tools plus activated ones, and hides
  tool_search entirely when nothing is deferred (no change for non-MCP users).
- deferred_hint() groups mcp__<server>__<tool> by server, one line each up to a
  24-server cap, then a single summary line — O(1) prompt cost at fleet scale.
- MCP tools are now exposed as mcp__<server>__<tool> (collision-safe, enables
  per-server grouping); the server-local name is still sent in tools/call.
- The 16-token boilerplate description prefix becomes a 5-token [MCP <server>] tag.
- The deferred hint is appended to the system prompt at the step site; empty when
  nothing is deferred, so the cached system prefix stays stable.
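
The bounded index from the list above could look roughly like this. It is a sketch, not the PR's `deferred_hint()`: the line format, the summary wording, and the helper `split_exposed` are assumptions, and only the `mcp__<server>__<tool>` naming and the 24-server cap come from the description:

```rust
use std::collections::BTreeMap;

/// Cap taken from the "24-server cap" in the description.
const MAX_SERVER_LINES: usize = 24;

/// Split `mcp__<server>__<tool>` into (server, tool). Returns None for
/// names that do not follow the convention. Assumes the server part
/// itself contains no double underscore.
pub fn split_exposed(name: &str) -> Option<(&str, &str)> {
    let rest = name.strip_prefix("mcp__")?;
    let (server, tool) = rest.split_once("__")?;
    if server.is_empty() || tool.is_empty() {
        None
    } else {
        Some((server, tool))
    }
}

/// One line per server up to the cap, then a single summary line.
/// Returns an empty string when nothing is deferred, so the system
/// prompt is left untouched in that case.
pub fn deferred_hint(deferred: &[&str]) -> String {
    let mut per_server: BTreeMap<&str, usize> = BTreeMap::new();
    for &name in deferred {
        if let Some((server, _tool)) = split_exposed(name) {
            *per_server.entry(server).or_insert(0) += 1;
        }
    }
    if per_server.is_empty() {
        return String::new();
    }
    let mut lines: Vec<String> = per_server
        .iter()
        .take(MAX_SERVER_LINES)
        .map(|(server, count)| format!("- {server}: {count} tool(s)"))
        .collect();
    if per_server.len() > MAX_SERVER_LINES {
        let extra = per_server.len() - MAX_SERVER_LINES;
        lines.push(format!("- ...and {extra} more server(s); use tool_search"));
    }
    lines.join("\n")
}
```

Because the output never exceeds 25 lines, the hint's size stops growing once the cap is reached, which is the sense in which the prompt cost is bounded.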

Measured on three real MCP servers (github/everything/filesystem, 53 tools): the
per-turn tools[] payload drops from ~7,400 to ~295 tokens at idle (25x smaller,
96% less). Proven by three test layers:
- deferred_tests: matcher, bounded/O(1) index, hide-until-activated.
- token_proof: drives the real tools() + kosong's real convert_tool encoder over
  captured schemas, asserts >5x smaller (regression guard).
- live_proof: connects real MCP servers over stdio and captures the literal
  tools[] + system prompt the agent transmits through kosong::step; gated behind
  KIMI_LIVE_MCP_PROOF so the default suite stays offline.

kosong::convert_tool is made pub so tests measure the real wire encoding.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Author
Owner

Some other ideas could be:

  • large file offloading, e.g. instead of returning large content we return a preview plus info on how to access the full file if more is needed
  • O(1) prompt indexing (probably only worthwhile with a VERY large number of MCP servers and tools)
  • embedder integration for semantic search
Replace the per-tool deferral+activation model with a generic dispatch
surface: the entire MCP toolset now lives behind two fixed tools,
`mcp_search` and `mcp_call`, so the per-turn tool list is O(1) in the
number of connected tools/servers instead of growing with each one.

Why: tool *definitions* are re-serialized into every turn, so the old
`tool_search` activation model let an activated tool cost tokens for the
rest of the session — a broad query could activate 100+ tools and undo
the savings. Moving schema knowledge into ephemeral tool *results* pays
for a schema once (on the turn it's searched) rather than every turn.

- `McpRegistry` replaces `DeferState`: holds the catalog + an
  exposed_name -> callable map; MCP tools are never placed in `tools()`.
- `mcp_search(query, limit?)` ranks the catalog, returns up to N (default
  5) matches' name + description + argument schema as a result.
- `mcp_call(name, arguments)` dispatches by name and validates arguments
  against the target tool's real schema, preserving schema-aware errors.
- The bounded per-server index hint is retained (O(1) at 1000 servers).
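
To make the two-tool dispatch surface concrete, here is a small sketch under simplifying assumptions: schemas and arguments are plain strings, search is a substring match rather than real ranking, and no schema validation is performed. The `Handler` type and the `register` method are hypothetical; only the `McpRegistry` name and the search/call split come from the commit message:

```rust
use std::collections::BTreeMap;

/// Hypothetical handler: raw argument text in, result text or error out.
pub type Handler = Box<dyn Fn(&str) -> Result<String, String>>;

struct Entry {
    description: String,
    schema: String,
    handler: Handler,
}

/// Catalog plus an exposed-name -> callable map. Nothing stored here is
/// ever added to the per-turn tool list.
#[derive(Default)]
pub struct McpRegistry {
    tools: BTreeMap<String, Entry>,
}

impl McpRegistry {
    pub fn register(&mut self, name: &str, description: &str, schema: &str, handler: Handler) {
        let entry = Entry {
            description: description.to_string(),
            schema: schema.to_string(),
            handler,
        };
        self.tools.insert(name.to_string(), entry);
    }

    /// Stand-in for `mcp_search`: returns name, description and schema for
    /// up to `limit` matches, as text that lives in a tool result.
    pub fn search(&self, query: &str, limit: usize) -> Vec<String> {
        let needle = query.to_lowercase();
        self.tools
            .iter()
            .filter(|(name, e)| {
                name.to_lowercase().contains(&needle)
                    || e.description.to_lowercase().contains(&needle)
            })
            .take(limit)
            .map(|(name, e)| format!("{name}: {} | args: {}", e.description, e.schema))
            .collect()
    }

    /// Stand-in for `mcp_call`: dispatch by exposed name.
    pub fn call(&self, name: &str, arguments: &str) -> Result<String, String> {
        match self.tools.get(name) {
            Some(entry) => (entry.handler)(arguments),
            None => Err(format!("unknown MCP tool: {name}")),
        }
    }
}
```

The design point is that the model only ever sees two fixed tool definitions; everything server-specific travels through results, which are not re-serialized as tool definitions on later turns.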

Proof (token_proof, real tools() + kosong wire encoder over 53 captured
schemas): tool list is flat: 1173 bytes after 1 MCP tool or all 53
(delta 0), vs 34541 bytes sending every schema each turn (~29x smaller).
Verified live end-to-end against hero_router MCP (hero_proc +
hero_whiteboard, deepseek-v4-flash): search -> dispatch chain creates a
workspace/board/object and reads it back. All 40 lib tests pass.

Trade-off: PreToolUse/PostToolUse hooks now fire for `mcp_call` rather
than the specific `mcp__server__tool` name.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Generic dispatch made the *input* (tool schemas) O(1), but MCP tool
*results* were still injected raw — a single `object_list`, log dump, or
`rpc_discover` can return 100k+ characters and blow the context window.

Head+tail truncate an `mcp_call` result's text to ~12k chars
(MCP_RESULT_MAX_CHARS) with a notice telling the model how to fetch just
the part it needs (filter, pagination, smaller limit, specific id). Head
and tail are both kept so the start and the shape/end of structured
output survive; media parts pass through untouched.
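
A head+tail cap can be sketched as follows. This is an illustration, not the PR's code: the even head/tail split, the notice wording, and the function name `cap_output_sketch` are assumptions, and the 12,000 default is taken from the "~12k chars" figure above. Counting in `char`s rather than bytes keeps the cut on a valid UTF-8 boundary:

```rust
/// Assumed default, from the "~12k chars" figure in the commit message.
pub const MCP_RESULT_MAX_CHARS: usize = 12_000;

/// Keep the first and last halves of an oversized result and replace the
/// middle with a notice. Text at or under the cap is returned unchanged.
pub fn cap_output_sketch(text: &str, max_chars: usize) -> String {
    let total = text.chars().count();
    if total <= max_chars {
        return text.to_string();
    }
    let head_len = max_chars / 2;
    let tail_len = max_chars - head_len;
    let omitted = total - max_chars;
    let head: String = text.chars().take(head_len).collect();
    let tail: String = text.chars().skip(total - tail_len).collect();
    format!(
        "{head}\n[... {omitted} chars omitted; narrow the request with a filter, \
         pagination, a smaller limit, or a specific id ...]\n{tail}"
    )
}
```

For a 153,528-character input and the default cap, this reports 141,528 characters omitted, matching the arithmetic in the live check below.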

Verified live against hero_proc `rpc_discover` (153,528 chars): trimmed
to ~12k with 141,528 omitted, and the model still read both the head
(`components.schemas`) and tail (`debug.process_tree`). Small results
(service_list, object_list) are unaffected. Adds two unit tests; 42 lib
tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Four low-overhead refinements to the MCP efficiency surface:

- Cap per-tool descriptions kept in the catalog (default 300 chars) so a
  verbose server can't bloat the index / mcp_search results; the full
  per-argument docs still live in the schema mcp_search returns.
- mcp_search prints "(no arguments)" for a no-arg tool instead of an
  empty schema blob.
- mcp_search does not re-emit a tool's schema if it already returned it
  earlier this session — a repeat search shows the tool by name only
  (new McpRegistry.searched set).
- Make all the context caps tunable per deployment via env vars (with the
  existing consts as defaults): KIMI_MCP_RESULT_MAX_CHARS,
  KIMI_MCP_SEARCH_RESULT_CAP, KIMI_MCP_DESC_MAX_CHARS.

Verified live against hero_proc: a no-arg tool is labelled "(no
arguments)", a repeat search reports "shown earlier this session", and
dispatch still works. Adds 3 unit tests; 45 lib tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Five context/ergonomics refinements to the MCP surface:

- mcp_search now ranks with BM25 over tokenized name+description (plural
  stemming + small synonym map + name-token boost) instead of raw
  substring scoring. "make a board" / "rm file" / "enumerate services"
  find the create/delete/list tools without exact word overlap. Stays
  offline and deterministic — no embeddings.
- mcp_call auto-resolves a dropped mcp__<server>__ prefix (the common
  case of a bare tool name) and returns "did you mean" suggestions on a
  typo instead of failing silently.
- Read-only tools (server readOnlyHint: true) get their results cached by
  (resolved-name, args); an identical repeat call is free and re-injects
  nothing.
- Oversized JSON-array results are summarized item-wise (whole leading
  items + "N of M shown") rather than byte-truncated.
- Per-server connect logs tools count + ~bytes of schema kept out of each
  turn (telemetry); the result caps are env-tunable
  (KIMI_MCP_RESULT_MAX_CHARS / _SEARCH_RESULT_CAP / _DESC_MAX_CHARS).
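
For readers unfamiliar with BM25, the ranking in the first bullet can be approximated with the standard formula over tokenized name+description. This is a sketch with assumed constants (k1 = 1.2, b = 0.75), a deliberately tiny synonym map, and crude plural stripping; it omits the name-token boost and is not the PR's implementation:

```rust
use std::collections::HashMap;

const K1: f64 = 1.2;
const B: f64 = 0.75;

/// Strip a trailing plural "s" and fold a few synonyms (illustrative map).
fn canonical(token: &str) -> String {
    let stem = if token.len() > 3 && token.ends_with('s') && !token.ends_with("ss") {
        &token[..token.len() - 1]
    } else {
        token
    };
    match stem {
        "make" | "new" | "add" => "create",
        "rm" | "remove" => "delete",
        "enumerate" | "show" => "list",
        other => other,
    }
    .to_string()
}

/// Lowercase, split on non-alphanumerics (so `board_create` becomes two
/// tokens), then canonicalize each token.
pub fn tokenize(text: &str) -> Vec<String> {
    text.to_lowercase()
        .split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(canonical)
        .collect()
}

/// Rank (name, description) pairs for a query. Returns the indices of
/// documents with a positive score, best first.
pub fn rank(docs: &[(&str, &str)], query: &str) -> Vec<usize> {
    let tokenized: Vec<Vec<String>> =
        docs.iter().map(|(n, d)| tokenize(&format!("{n} {d}"))).collect();
    if tokenized.is_empty() {
        return Vec::new();
    }
    let n_docs = tokenized.len() as f64;
    let avg_len = tokenized.iter().map(|t| t.len()).sum::<usize>() as f64 / n_docs;
    // Document frequency: in how many documents each token appears.
    let mut df: HashMap<&str, f64> = HashMap::new();
    for toks in &tokenized {
        let mut seen: Vec<&str> = toks.iter().map(|s| s.as_str()).collect();
        seen.sort_unstable();
        seen.dedup();
        for t in seen {
            *df.entry(t).or_insert(0.0) += 1.0;
        }
    }
    let q = tokenize(query);
    let mut scored: Vec<(usize, f64)> = Vec::new();
    for (i, toks) in tokenized.iter().enumerate() {
        let len = toks.len() as f64;
        let mut score = 0.0;
        for term in &q {
            let tf = toks.iter().filter(|t| *t == term).count() as f64;
            if tf == 0.0 {
                continue;
            }
            let n_t = df.get(term.as_str()).copied().unwrap_or(0.0);
            let idf = ((n_docs - n_t + 0.5) / (n_t + 0.5) + 1.0).ln();
            score += idf * tf * (K1 + 1.0) / (tf + K1 * (1.0 - B + B * len / avg_len));
        }
        if score > 0.0 {
            scored.push((i, score));
        }
    }
    scored.sort_by(|a, b| b.1.total_cmp(&a.1).then(a.0.cmp(&b.0)));
    scored.into_iter().map(|(i, _)| i).collect()
}
```

The synonym fold is what lets "make a board" reach a tool named `board_create` with no literal word overlap on the verb, while the whole thing stays offline and deterministic.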

Verified live against hero_proc/hero_whiteboard: "make a board" finds
board_create, and a bare "board_create" (prefix dropped) resolves and
creates the board. Adds 6 unit tests; 49 lib tests pass.
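
The bare-name resolution exercised above might be sketched like this. The return shape (`Ok` for a resolved name, `Err` carrying suggestions), the edit-distance threshold of 2, and the function names are all assumptions made for illustration:

```rust
/// Classic Levenshtein distance over chars, using two rolling rows.
fn edit_distance(a: &str, b: &str) -> usize {
    let b_chars: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..=b_chars.len()).collect();
    for (i, ca) in a.chars().enumerate() {
        let mut cur = vec![i + 1];
        for (j, cb) in b_chars.iter().enumerate() {
            let subst = prev[j] + usize::from(ca != *cb);
            let best = subst.min(prev[j + 1] + 1).min(cur[j] + 1);
            cur.push(best);
        }
        prev = cur;
    }
    prev[b_chars.len()]
}

/// The `<tool>` part of `mcp__<server>__<tool>`, or the input unchanged
/// when it does not carry the prefix.
fn tool_part(exposed: &str) -> &str {
    exposed
        .strip_prefix("mcp__")
        .and_then(|rest| rest.split_once("__"))
        .map(|(_, tool)| tool)
        .unwrap_or(exposed)
}

/// Resolve a requested name: exact match first, then a unique bare-name
/// match, otherwise return "did you mean" candidates in the error.
pub fn resolve_name(known: &[&str], requested: &str) -> Result<String, Vec<String>> {
    if known.contains(&requested) {
        return Ok(requested.to_string());
    }
    let bare: Vec<&str> = known
        .iter()
        .copied()
        .filter(|k| tool_part(k) == requested)
        .collect();
    if bare.len() == 1 {
        return Ok(bare[0].to_string());
    }
    if bare.len() > 1 {
        // Ambiguous across servers: let the caller pick.
        return Err(bare.iter().map(|s| s.to_string()).collect());
    }
    let wanted = tool_part(requested);
    let mut near: Vec<(usize, &str)> = known
        .iter()
        .copied()
        .map(|k| (edit_distance(tool_part(k), wanted), k))
        .filter(|(d, _)| *d <= 2)
        .collect();
    near.sort();
    Err(near.into_iter().map(|(_, k)| k.to_string()).collect())
}
```

Returning the ambiguous case as an error rather than guessing a server keeps a dropped prefix from silently dispatching to the wrong tool.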

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
readOnlyHint: true means a tool doesn't *modify* state — not that its
answer is *stable*. A status/list result changes over time, so caching it
for the whole session (no expiry) could serve a stale value indefinitely
and mislead a polling model. That was a correctness footgun.

Caching is now OFF by default. It activates only when
KIMI_MCP_CACHE_TTL_SECS is set to a positive value, and each cached entry
is stamped and treated as a miss once older than the TTL — bounding how
stale a served answer can be. Mutating tools are still never cached.

Cache value is now (Instant, ToolReturnValue); TTL lives on McpRegistry
(set from env at construction) so the behaviour is unit-testable without
env races. Test now asserts both the ON (deduped) and OFF (every call
executes) paths. 49 lib tests pass.
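
A sketch of the opt-in TTL cache described above, simplified so that the cached value is a `String` instead of `ToolReturnValue` and the clock is passed in explicitly to keep tests deterministic. The type name `ReadOnlyCache` and its methods are hypothetical:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Cache keyed by (resolved name, args). `ttl == None` means caching is
/// off, which is the default described in the commit message.
pub struct ReadOnlyCache {
    ttl: Option<Duration>,
    entries: HashMap<(String, String), (Instant, String)>,
}

impl ReadOnlyCache {
    /// `ttl_secs` would come from KIMI_MCP_CACHE_TTL_SECS; a missing or
    /// zero value leaves caching off.
    pub fn new(ttl_secs: Option<u64>) -> Self {
        let ttl = ttl_secs.filter(|s| *s > 0).map(Duration::from_secs);
        Self { ttl, entries: HashMap::new() }
    }

    /// Serve an entry only while its age is within the TTL.
    pub fn get(&self, name: &str, args: &str, now: Instant) -> Option<&str> {
        let ttl = self.ttl?;
        let (stamp, value) = self.entries.get(&(name.to_string(), args.to_string()))?;
        if now.duration_since(*stamp) <= ttl {
            Some(value.as_str())
        } else {
            None
        }
    }

    /// Store a result with its timestamp; a no-op while caching is off.
    pub fn put(&mut self, name: &str, args: &str, value: &str, now: Instant) {
        if self.ttl.is_some() {
            self.entries
                .insert((name.to_string(), args.to_string()), (now, value.to_string()));
        }
    }
}
```

Holding the TTL on the struct rather than reading the environment inside `get` is what makes both the ON and OFF paths testable without env races.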

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A single green run isn't proof of reliable execution. Add a deterministic,
offline (CI-able) reliability module that establishes the properties a
happy path can't:

- pure_functions_never_panic_on_adversarial_input: match_deferred,
  resolve_mcp_name, lex_tokens, cap_mcp_output, summarize_json_array fed
  empty/null/huge/unicode/malformed input — never panic, invariants hold
  (bounded output, only-known names).
- read_only_cache_entry_expires_after_ttl: a cached read-only result stops
  being served once older than the TTL (staleness is bounded).
- a_panicking_tool_is_caught_not_fatal: handle()'s catch_unwind turns a
  panicking tool into an error result, not a crash.
- unknown_tool_name_is_a_graceful_error: bad names degrade to a structured
  error.
- concurrent_dispatch_does_not_deadlock: 200 concurrent search+dispatch
  tasks all complete within a timeout — proving no lock is held across an
  await point.
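
The panic-containment property can be illustrated with `std::panic::catch_unwind`. This sketch wraps a synchronous closure, whereas the real `handle()` presumably wraps tool execution in its own way; `run_guarded` and the error text are hypothetical, and the approach only works when the binary is built with unwinding panics:

```rust
use std::panic::{self, AssertUnwindSafe};

/// Run a tool body and convert a panic into an ordinary error result so
/// the caller can report it to the model instead of crashing.
pub fn run_guarded<F>(tool: F) -> Result<String, String>
where
    F: FnOnce() -> Result<String, String>,
{
    match panic::catch_unwind(AssertUnwindSafe(tool)) {
        Ok(result) => result,
        Err(payload) => {
            // Panic payloads are usually a &str or a String.
            let msg = payload
                .downcast_ref::<&str>()
                .map(|s| s.to_string())
                .or_else(|| payload.downcast_ref::<String>().cloned())
                .unwrap_or_else(|| "unknown panic".to_string());
            Err(format!("tool panicked: {msg}"))
        }
    }
}
```

The default panic hook still prints the panic message to stderr; only the unwinding is stopped at the boundary.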

54 lib tests pass. Live repeatability (deepseek-v4-flash + real
hero_proc): the search→dispatch path executed correctly 8/8 runs.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add a reproducible, real-data benchmark for the MCP context savings, plus
the capture harness that feeds it.

mcp_bench_capture.py connects (via the real MCP stdio/HTTP handshake) to a
list of popular servers — Notion, GitHub, Playwright, filesystem, memory,
git, sqlite, redis, puppeteer, time, fetch, context7, desktop-commander,
the hero_router servers, and more — and writes their *real* tools/list
schemas to a fixture. No schema is fabricated; servers that need
credentials are simply recorded as failed.

The gated `mcp_bench` test (KIMI_MCP_BENCH=1) loads that fixture and
measures, through the real build_mcp_tool_base + KimiToolset::tools() +
kosong::convert_tool encoder, the per-turn tools[] payload BEFORE (every
schema each turn) vs AFTER (the two dispatch tools + bounded index),
printing a per-server table and dumping JSON for independent tokenization.

Measured result (20 servers connected, 403 real tools, cl100k_base):
  BEFORE  59,671 tokens/turn (269,689 bytes)
  AFTER      946 tokens/turn (3,594 bytes; 2 tools + index)
  -> 63x fewer tokens, 98.4% less, and flat regardless of server count.

Run:
  python3 crates/hero_kimi_agent/tests/fixtures/mcp_bench_capture.py
  KIMI_MCP_BENCH=1 cargo test -p hero_kimi_agent --lib mcp_bench -- --nocapture

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
omarz closed this pull request 2026-06-04 18:07:34 +00:00

Reference
lhumina_code/hero_kimi_rust!3