feat(mcp): defer MCP tool schemas to slash per-turn context cost

thabeta commented

2026-06-03 20:58:26 +00:00

Owner

MCP tools were registered into the toolset and sent to the model on every
turn with full JSON schemas plus a boilerplate description prefix. With a few
servers that is thousands of tokens per turn, re-sent each step, occupying the
context window and pushing real content toward compaction.

Defer them instead (qwen-code / shrimp's idea): MCP tools are kept out of the
tool list and replaced by a compact, per-server index appended to the system
prompt. The model calls a new `tool_search` tool with a keyword to activate the
tools a task needs; their schemas re-enter the prompt from the next step on.

- KimiToolset gains a shared DeferState (catalog + activated set) so tool_search
  can activate tools from inside a spawned tool-call task without re-locking the
  toolset mutex held by the in-flight step.
- tools() sends only non-deferred tools plus activated ones, and hides
  tool_search entirely when nothing is deferred (no change for non-MCP users).
- deferred_hint() groups mcp__<server>__<tool> by server, one line each up to a
  24-server cap, then a single summary line, for O(1) prompt cost at fleet scale.
- MCP tools are now exposed as mcp__<server>__<tool> (collision-safe, enables
  per-server grouping); the server-local name is still sent in tools/call.
- The 16-token boilerplate description prefix becomes a 5-token [MCP <server>] tag.
- The deferred hint is appended to the system prompt at the step site; empty when
  nothing is deferred, so the cached system prefix stays stable.

Measured on three real MCP servers (github/everything/filesystem, 53 tools): the
per-turn tools[] payload drops from ~7,400 to ~295 tokens at idle (25x smaller,
96% less). Proven by three test layers:

- deferred_tests: matcher, bounded/O(1) index, hide-until-activated.
- token_proof: drives the real tools() + kosong's real convert_tool encoder over
  captured schemas, asserts >5x smaller (regression guard).
- live_proof: connects real MCP servers over stdio and captures the literal
  tools[] + system prompt the agent transmits through kosong::step; gated behind
  KIMI_LIVE_MCP_PROOF so the default suite stays offline.

kosong::convert_tool is made pub so tests measure the real wire encoding.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

thabeta added 1 commit

2026-06-03 20:58:26 +00:00

feat(mcp): defer MCP tool schemas to slash per-turn context cost 1c6107c0ae

MCP tools were registered into the toolset and sent to the model on every
turn with full JSON schemas plus a boilerplate description prefix. With a few
servers that is thousands of tokens per turn, re-sent each step, occupying the
context window and pushing real content toward compaction.

Defer them instead (qwen-code / shrimp's idea): MCP tools are kept out of the
tool list and replaced by a compact, per-server index appended to the system
prompt. The model calls a new `tool_search` tool with a keyword to activate the
tools a task needs; their schemas re-enter the prompt from the next step on.

- KimiToolset gains a shared DeferState (catalog + activated set) so tool_search
  can activate tools from inside a spawned tool-call task without re-locking the
  toolset mutex held by the in-flight step.
- tools() sends only non-deferred tools plus activated ones, and hides
  tool_search entirely when nothing is deferred (no change for non-MCP users).
- deferred_hint() groups mcp__<server>__<tool> by server, one line each up to a
  24-server cap, then a single summary line — O(1) prompt cost at fleet scale.
- MCP tools are now exposed as mcp__<server>__<tool> (collision-safe, enables
  per-server grouping); the server-local name is still sent in tools/call.
- The 16-token boilerplate description prefix becomes a 5-token [MCP <server>] tag.
- The deferred hint is appended to the system prompt at the step site; empty when
  nothing is deferred, so the cached system prefix stays stable.
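
The deferral rules in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `Tool` struct, its `deferred` flag, the field layout of `DeferState`, and the hint line format are all assumptions made for the example. Only the behaviour (hide deferred tools until activated, hide `tool_search` when nothing is deferred, one hint line per server up to a 24-server cap) comes from the commit message.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical stand-in for a tool definition; the real type is not shown in this PR.
#[derive(Clone, Debug, PartialEq)]
pub struct Tool {
    pub name: String,
    pub deferred: bool,
}

/// Illustrative catalog + activated set, loosely modelled on the DeferState described above.
#[derive(Default)]
pub struct DeferState {
    pub catalog: Vec<Tool>,
    pub activated: BTreeSet<String>,
}

/// Server cap taken from the commit message.
pub const SERVER_CAP: usize = 24;

impl DeferState {
    /// Names sent on a turn: non-deferred tools plus deferred ones that were activated.
    /// `tool_search` is listed only while at least one tool is deferred.
    pub fn tools(&self) -> Vec<String> {
        let any_deferred = self.catalog.iter().any(|t| t.deferred);
        let mut out: Vec<String> = self
            .catalog
            .iter()
            .filter(|t| !t.deferred || self.activated.contains(&t.name))
            .map(|t| t.name.clone())
            .collect();
        if any_deferred {
            out.push("tool_search".to_string());
        }
        out
    }

    /// One line per server up to SERVER_CAP, then a single summary line.
    /// Returns an empty string when nothing is deferred.
    pub fn deferred_hint(&self) -> String {
        let mut by_server: BTreeMap<&str, usize> = BTreeMap::new();
        for t in self.catalog.iter().filter(|t| t.deferred) {
            // Exposed names look like mcp__<server>__<tool>.
            if let Some(rest) = t.name.strip_prefix("mcp__") {
                if let Some((server, _tool)) = rest.split_once("__") {
                    *by_server.entry(server).or_insert(0) += 1;
                }
            }
        }
        let mut lines: Vec<String> = by_server
            .iter()
            .take(SERVER_CAP)
            .map(|(server, n)| format!("{}: {} tools", server, n))
            .collect();
        if by_server.len() > SERVER_CAP {
            lines.push(format!("(+{} more servers)", by_server.len() - SERVER_CAP));
        }
        lines.join("\n")
    }
}
```

Because the hint never grows past `SERVER_CAP + 1` lines, its size is bounded regardless of how many servers are connected, which is the property the bullet list calls O(1).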

Measured on three real MCP servers (github/everything/filesystem, 53 tools): the
per-turn tools[] payload drops from ~7,400 to ~295 tokens at idle (25x smaller,
96% less). Proven by three test layers:
- deferred_tests: matcher, bounded/O(1) index, hide-until-activated.
- token_proof: drives the real tools() + kosong's real convert_tool encoder over
  captured schemas, asserts >5x smaller (regression guard).
- live_proof: connects real MCP servers over stdio and captures the literal
  tools[] + system prompt the agent transmits through kosong::step; gated behind
  KIMI_LIVE_MCP_PROOF so the default suite stays offline.

kosong::convert_tool is made pub so tests measure the real wire encoding.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

omarz was assigned by thabeta

2026-06-03 20:58:34 +00:00

thabeta commented

2026-06-03 21:02:18 +00:00

Author

Owner

Some other ideas could be:

- Large-file offloading: instead of returning large content, return a preview plus information on how to access the full file if more is needed.
- O(1) prompt indexing (likely worthwhile only with a VERY large number of MCP servers and tools).
- Embedder integration for semantic search.

thabeta added 1 commit

2026-06-04 10:25:28 +00:00

handle cap gracefully 5290820828

thabeta added 3 commits

2026-06-04 10:55:58 +00:00

feat(mcp): front MCP tools with generic dispatch for O(1) per-turn cost 3832ce2add

Replace the per-tool deferral+activation model with a generic dispatch
surface: the entire MCP toolset now lives behind two fixed tools,
`mcp_search` and `mcp_call`, so the per-turn tool list is O(1) in the
number of connected tools/servers instead of growing with each one.

Why: tool *definitions* are re-serialized into every turn, so the old
`tool_search` activation model let an activated tool cost tokens for the
rest of the session — a broad query could activate 100+ tools and undo
the savings. Moving schema knowledge into ephemeral tool *results* pays
for a schema once (on the turn it's searched) rather than every turn.

- `McpRegistry` replaces `DeferState`: holds the catalog + an
  exposed_name -> callable map; MCP tools are never placed in `tools()`.
- `mcp_search(query, limit?)` ranks the catalog, returns up to N (default
  5) matches' name + description + argument schema as a result.
- `mcp_call(name, arguments)` dispatches by name and validates arguments
  against the target tool's real schema, preserving schema-aware errors.
- The bounded per-server index hint is retained (O(1) at 1000 servers).
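
As a rough sketch of the generic dispatch idea in the list above: the registry below keeps every MCP tool in a map and exposes only two fixed tool names per turn. The `Handler` type, the `CatalogEntry` fields, and the substring matching are assumptions for illustration; the real `mcp_call` also validates arguments against the tool's schema, which this sketch leaves out.

```rust
use std::collections::HashMap;

/// Hypothetical handler type: takes raw argument text, returns a result or an error message.
pub type Handler = Box<dyn Fn(&str) -> Result<String, String> + Send + Sync>;

/// Illustrative catalog entry; field names are assumptions.
pub struct CatalogEntry {
    pub description: String,
    /// JSON schema text, kept out of the per-turn tool list.
    pub schema: String,
    pub handler: Handler,
}

#[derive(Default)]
pub struct McpRegistry {
    pub entries: HashMap<String, CatalogEntry>,
}

impl McpRegistry {
    /// The per-turn tool list is always these two names, however many tools are registered.
    pub fn tools(&self) -> Vec<&'static str> {
        vec!["mcp_search", "mcp_call"]
    }

    /// Naive substring match (a placeholder for real ranking); returns
    /// name + description + schema for up to `limit` matches.
    pub fn mcp_search(&self, query: &str, limit: usize) -> Vec<String> {
        let q = query.to_lowercase();
        let mut hits: Vec<(&String, &CatalogEntry)> = self
            .entries
            .iter()
            .filter(|(name, e)| {
                name.to_lowercase().contains(&q) || e.description.to_lowercase().contains(&q)
            })
            .collect();
        // Sort by name so the output is deterministic.
        hits.sort_by(|a, b| a.0.cmp(b.0));
        hits.into_iter()
            .take(limit)
            .map(|(n, e)| format!("{}: {} | schema: {}", n, e.description, e.schema))
            .collect()
    }

    /// Dispatch by exposed name; unknown names become an error value, not a panic.
    pub fn mcp_call(&self, name: &str, arguments: &str) -> Result<String, String> {
        match self.entries.get(name) {
            Some(e) => (e.handler)(arguments),
            None => Err(format!("unknown MCP tool: {}", name)),
        }
    }
}
```

The schema text only appears in the string returned by `mcp_search`, so it is paid for as a tool result on the turn it is searched rather than as a tool definition on every turn.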

Proof (token_proof, real tools() + kosong wire encoder over 53 captured
schemas): tool list is flat — 1173 bytes after 1 MCP tool or all 53
(delta 0), vs 34541 bytes sending every schema each turn (~29x smaller).
Verified live end-to-end against hero_router MCP (hero_proc +
hero_whiteboard, deepseek-v4-flash): search -> dispatch chain creates a
workspace/board/object and reads it back. All 40 lib tests pass.

Trade-off: PreToolUse/PostToolUse hooks now fire for `mcp_call` rather
than the specific `mcp__server__tool` name.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

feat(mcp): cap MCP tool result size to bound output context 2cb5d6eb30

Generic dispatch made the *input* (tool schemas) O(1), but MCP tool
*results* were still injected raw — a single `object_list`, log dump, or
`rpc_discover` can return 100k+ characters and blow the context window.

Head+tail truncate an `mcp_call` result's text to ~12k chars
(MCP_RESULT_MAX_CHARS) with a notice telling the model how to fetch just
the part it needs (filter, pagination, smaller limit, specific id). Head
and tail are both kept so the start and the shape/end of structured
output survive; media parts pass through untouched.
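
A head+tail cap of this kind might look like the sketch below. The function name `cap_head_tail`, the even head/tail split, and the wording of the notice are assumptions; only the ~12k limit and the idea of keeping both ends come from this commit message.

```rust
/// Default cap; the ~12k figure is from the commit message.
pub const MCP_RESULT_MAX_CHARS: usize = 12_000;

/// Keep the first and last halves of an oversized text and replace the middle
/// with a notice. The 50/50 split is an assumption for this sketch.
pub fn cap_head_tail(text: &str, max_chars: usize) -> String {
    // Count characters, not bytes, so multi-byte text is never cut mid-character.
    let total = text.chars().count();
    if total <= max_chars {
        return text.to_string();
    }
    let head_len = max_chars / 2;
    let tail_len = max_chars - head_len;
    let omitted = total - max_chars;
    let head: String = text.chars().take(head_len).collect();
    let tail: String = text.chars().skip(total - tail_len).collect();
    format!(
        "{}\n[... {} chars omitted; narrow the request with a filter, pagination, a smaller limit, or a specific id ...]\n{}",
        head, omitted, tail
    )
}
```

With a 153,528-character input and the 12,000 cap, this reports 141,528 characters omitted.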

Verified live against hero_proc `rpc_discover` (153,528 chars): trimmed
to ~12k with 141,528 omitted, and the model still read both the head
(`components.schemas`) and tail (`debug.process_tree`). Small results
(service_list, object_list) are unaffected. Adds two unit tests; 42 lib
tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

feat(mcp): trim mcp_search output and make context caps tunable 213cc0fc4d

Four low-overhead refinements to the MCP efficiency surface:

- Cap per-tool descriptions kept in the catalog (default 300 chars) so a
  verbose server can't bloat the index / mcp_search results; the full
  per-argument docs still live in the schema mcp_search returns.
- mcp_search prints "(no arguments)" for a no-arg tool instead of an
  empty schema blob.
- mcp_search does not re-emit a tool's schema if it already returned it
  earlier this session — a repeat search shows the tool by name only
  (new McpRegistry.searched set).
- Make all the context caps tunable per deployment via env vars (with the
  existing consts as defaults): KIMI_MCP_RESULT_MAX_CHARS,
  KIMI_MCP_SEARCH_RESULT_CAP, KIMI_MCP_DESC_MAX_CHARS.
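
One way to read an env-tunable cap with a const default is sketched below. The helper names `parse_cap` and `env_cap` are hypothetical, and falling back to the default on an unparsable or zero value is an assumption rather than documented behaviour.

```rust
use std::env;

/// Defaults named in the commit messages (~12k result chars, 300 description chars).
pub const DEFAULT_RESULT_MAX_CHARS: usize = 12_000;
pub const DEFAULT_DESC_MAX_CHARS: usize = 300;

/// Pure parsing step, separated out so it can be tested without touching the environment.
/// Assumption: missing, unparsable, or zero values fall back to the default.
pub fn parse_cap(raw: Option<&str>, default: usize) -> usize {
    raw.and_then(|s| s.trim().parse::<usize>().ok())
        .filter(|n| *n > 0)
        .unwrap_or(default)
}

/// Read a cap from the environment, e.g. env_cap("KIMI_MCP_RESULT_MAX_CHARS", DEFAULT_RESULT_MAX_CHARS).
pub fn env_cap(var: &str, default: usize) -> usize {
    parse_cap(env::var(var).ok().as_deref(), default)
}
```

Keeping the parse step pure avoids tests racing on process-wide environment state.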

Verified live against hero_proc: a no-arg tool is labelled "(no
arguments)", a repeat search reports "shown earlier this session", and
dispatch still works. Adds 3 unit tests; 45 lib tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

thabeta added 3 commits

2026-06-04 11:32:33 +00:00

feat(mcp): BM25 search, prefix-resolve, read-only cache, array summary cc57e8c2c1

Five context/ergonomics refinements to the MCP surface:

- mcp_search now ranks with BM25 over tokenized name+description (plural
  stemming + small synonym map + name-token boost) instead of raw
  substring scoring. "make a board" / "rm file" / "enumerate services"
  find the create/delete/list tools without exact word overlap. Stays
  offline and deterministic — no embeddings.
- mcp_call auto-resolves a dropped mcp__<server>__ prefix (the common
  case of a bare tool name) and returns "did you mean" suggestions on a
  typo instead of failing silently.
- Read-only tools (server readOnlyHint: true) get their results cached by
  (resolved-name, args); an identical repeat call is free and re-injects
  nothing.
- Oversized JSON-array results are summarized item-wise (whole leading
  items + "N of M shown") rather than byte-truncated.
- Per-server connect logs tools count + ~bytes of schema kept out of each
  turn (telemetry); the result caps are env-tunable
  (KIMI_MCP_RESULT_MAX_CHARS / _SEARCH_RESULT_CAP / _DESC_MAX_CHARS).
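
For readers unfamiliar with BM25, the ranking step in the first bullet can be sketched as below. This is a textbook BM25 with a crude plural stemmer, using the common defaults k1 = 1.2 and b = 0.75; the synonym map and name-token boost from the commit are not reproduced, and all function names are hypothetical.

```rust
use std::collections::HashMap;

/// Lowercase, split on non-alphanumerics, and strip a trailing "s" from longer tokens.
pub fn tokenize(text: &str) -> Vec<String> {
    text.to_lowercase()
        .split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| {
            // Crude plural stemming: "boards" -> "board".
            if t.len() > 3 && t.ends_with('s') {
                t[..t.len() - 1].to_string()
            } else {
                t.to_string()
            }
        })
        .collect()
}

/// Rank (name, description) pairs against a query; returns matches with score > 0, best first.
pub fn bm25_rank(docs: &[(&str, &str)], query: &str) -> Vec<(String, f64)> {
    let k1 = 1.2;
    let b = 0.75;
    let tokenized: Vec<Vec<String>> = docs
        .iter()
        .map(|(name, desc)| tokenize(&format!("{} {}", name, desc)))
        .collect();
    if tokenized.is_empty() {
        return Vec::new();
    }
    let n_docs = tokenized.len() as f64;
    let avg_len = tokenized.iter().map(|t| t.len()).sum::<usize>() as f64 / n_docs;

    // Document frequency: in how many documents each token appears.
    let mut df: HashMap<&str, f64> = HashMap::new();
    for toks in &tokenized {
        let mut seen: Vec<&str> = toks.iter().map(|s| s.as_str()).collect();
        seen.sort();
        seen.dedup();
        for t in seen {
            *df.entry(t).or_insert(0.0) += 1.0;
        }
    }

    let q = tokenize(query);
    let mut scored: Vec<(String, f64)> = Vec::new();
    for (i, toks) in tokenized.iter().enumerate() {
        let len = toks.len() as f64;
        let mut score = 0.0;
        for term in &q {
            let tf = toks.iter().filter(|t| *t == term).count() as f64;
            if tf == 0.0 {
                continue;
            }
            let n_t = df.get(term.as_str()).copied().unwrap_or(0.0);
            let idf = ((n_docs - n_t + 0.5) / (n_t + 0.5) + 1.0).ln();
            score += idf * tf * (k1 + 1.0) / (tf + k1 * (1.0 - b + b * len / avg_len));
        }
        if score > 0.0 {
            scored.push((docs[i].0.to_string(), score));
        }
    }
    // Highest score first; ties broken by name so the order is deterministic.
    scored.sort_by(|x, y| y.1.partial_cmp(&x.1).unwrap().then_with(|| x.0.cmp(&y.0)));
    scored
}
```

Everything here is plain arithmetic over token counts, so the ranking is offline and deterministic, as the bullet requires.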

Verified live against hero_proc/hero_whiteboard: "make a board" finds
board_create, and a bare "board_create" (prefix dropped) resolves and
creates the board. Adds 6 unit tests; 49 lib tests pass.
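
The prefix resolution and "did you mean" behaviour verified above could work along these lines. This is a hedged sketch: `resolve_name`, `tool_part`, and the edit-distance threshold of 2 are all assumptions, not the PR's `resolve_mcp_name`.

```rust
/// The tool segment of an exposed name: "mcp__wb__board_create" -> "board_create".
fn tool_part(name: &str) -> &str {
    name.rsplit("__").next().unwrap_or(name)
}

/// Classic Levenshtein distance over characters, using two rolling rows.
fn edit_distance(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for i in 1..=a.len() {
        let mut cur = vec![i; b.len() + 1];
        for j in 1..=b.len() {
            let cost = if a[i - 1] == b[j - 1] { 0 } else { 1 };
            cur[j] = (prev[j] + 1).min(cur[j - 1] + 1).min(prev[j - 1] + cost);
        }
        prev = cur;
    }
    prev[b.len()]
}

/// Ok(full name) on an exact or unambiguous bare-name match,
/// Err(suggestions) otherwise (possibly empty).
pub fn resolve_name(known: &[&str], requested: &str) -> Result<String, Vec<String>> {
    if known.contains(&requested) {
        return Ok(requested.to_string());
    }
    // A bare tool name resolves if exactly one exposed name ends with it.
    let by_suffix: Vec<&str> = known
        .iter()
        .copied()
        .filter(|n| tool_part(n) == requested)
        .collect();
    if by_suffix.len() == 1 {
        return Ok(by_suffix[0].to_string());
    }
    if by_suffix.len() > 1 {
        // Ambiguous across servers: let the caller pick.
        return Err(by_suffix.iter().map(|s| s.to_string()).collect());
    }
    // Otherwise suggest near misses (threshold of 2 edits is an assumption).
    let bare = tool_part(requested);
    let mut near: Vec<String> = known
        .iter()
        .filter(|n| edit_distance(tool_part(n), bare) <= 2)
        .map(|n| n.to_string())
        .collect();
    near.sort();
    Err(near)
}
```

Returning the suggestions in the error value gives the model something actionable instead of a silent failure.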

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

fix(mcp): make read-only result cache opt-in with a TTL (was stale-prone) ceb5e8f1e1

readOnlyHint: true means a tool doesn't *modify* state — not that its
answer is *stable*. A status/list result changes over time, so caching it
for the whole session (no expiry) could serve a stale value indefinitely
and mislead a polling model. That was a correctness footgun.

Caching is now OFF by default. It activates only when
KIMI_MCP_CACHE_TTL_SECS is set to a positive value, and each cached entry
is stamped and treated as a miss once older than the TTL — bounding how
stale a served answer can be. Mutating tools are still never cached.

Cache value is now (Instant, ToolReturnValue); TTL lives on McpRegistry
(set from env at construction) so the behaviour is unit-testable without
env races. Test now asserts both the ON (deduped) and OFF (every call
executes) paths. 49 lib tests pass.
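
The opt-in TTL behaviour described here can be illustrated with a small cache keyed by (name, args). The `ResultCache` type and its `put_at`/`get_at` methods are hypothetical, and a plain `String` stands in for `ToolReturnValue`; passing `now` explicitly is one way to test expiry without sleeping.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Illustrative TTL cache; a String stands in for the real ToolReturnValue.
pub struct ResultCache {
    /// None means caching is off, which is the default described above.
    ttl: Option<Duration>,
    entries: HashMap<(String, String), (Instant, String)>,
}

impl ResultCache {
    /// `ttl_secs` would come from KIMI_MCP_CACHE_TTL_SECS; only a positive value enables caching.
    pub fn new(ttl_secs: Option<u64>) -> Self {
        ResultCache {
            ttl: ttl_secs.filter(|s| *s > 0).map(Duration::from_secs),
            entries: HashMap::new(),
        }
    }

    /// Store a result stamped with `now`; a no-op when caching is off.
    pub fn put_at(&mut self, name: &str, args: &str, value: String, now: Instant) {
        if self.ttl.is_some() {
            self.entries
                .insert((name.to_string(), args.to_string()), (now, value));
        }
    }

    /// Return the cached value unless caching is off or the entry is older than the TTL.
    pub fn get_at(&self, name: &str, args: &str, now: Instant) -> Option<String> {
        let ttl = self.ttl?;
        let (stamp, value) = self.entries.get(&(name.to_string(), args.to_string()))?;
        if now.saturating_duration_since(*stamp) <= ttl {
            Some(value.clone())
        } else {
            None
        }
    }
}
```

Holding the TTL on the struct rather than reading the environment on each call is what makes both the ON and OFF paths testable in one process.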

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

test(mcp): reliability suite — fuzz, faults, cache expiry, concurrency 47387de703

A single green run isn't proof of reliable execution. Add a deterministic,
offline (CI-able) reliability module that establishes the properties a
happy path can't:

- pure_functions_never_panic_on_adversarial_input: match_deferred,
  resolve_mcp_name, lex_tokens, cap_mcp_output, summarize_json_array fed
  empty/null/huge/unicode/malformed input — never panic, invariants hold
  (bounded output, only-known names).
- read_only_cache_entry_expires_after_ttl: a cached read-only result stops
  being served once older than the TTL (staleness is bounded).
- a_panicking_tool_is_caught_not_fatal: handle()'s catch_unwind turns a
  panicking tool into an error result, not a crash.
- unknown_tool_name_is_a_graceful_error: bad names degrade to a structured
  error.
- concurrent_dispatch_does_not_deadlock: 200 concurrent search+dispatch
  tasks all complete within a timeout — proving no lock is held across an
  await point.
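
The panic-containment property in the list above relies on `std::panic::catch_unwind`. A minimal synchronous sketch follows; the wrapper name `call_guarded` is hypothetical, and the real `handle()` is presumably async and more involved.

```rust
use std::panic::{self, AssertUnwindSafe};

/// Run a tool closure and turn a panic into an error result instead of unwinding further.
pub fn call_guarded<F>(tool: F) -> Result<String, String>
where
    F: FnOnce() -> Result<String, String>,
{
    match panic::catch_unwind(AssertUnwindSafe(tool)) {
        Ok(result) => result,
        Err(payload) => {
            // Panic payloads are usually a &str or a String; anything else gets a generic message.
            let msg = payload
                .downcast_ref::<&str>()
                .map(|s| s.to_string())
                .or_else(|| payload.downcast_ref::<String>().cloned())
                .unwrap_or_else(|| "unknown panic".to_string());
            Err(format!("tool panicked: {}", msg))
        }
    }
}
```

Note that `catch_unwind` only works when the binary is built with unwinding panics; with `panic = "abort"` the process still terminates.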

54 lib tests pass. Live repeatability (deepseek-v4-flash + real
hero_proc): the search→dispatch path executed correctly 8/8 runs.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

thabeta added 1 commit

2026-06-04 11:59:03 +00:00

test(mcp): benchmark generic dispatch over 20 real popular MCP servers fc98cf0d89

Add a reproducible, real-data benchmark for the MCP context savings, plus
the capture harness that feeds it.

mcp_bench_capture.py connects (via the real MCP stdio/HTTP handshake) to a
list of popular servers — Notion, GitHub, Playwright, filesystem, memory,
git, sqlite, redis, puppeteer, time, fetch, context7, desktop-commander,
the hero_router servers, and more — and writes their *real* tools/list
schemas to a fixture. No schema is fabricated; servers that need
credentials are simply recorded as failed.

The gated `mcp_bench` test (KIMI_MCP_BENCH=1) loads that fixture and
measures, through the real build_mcp_tool_base + KimiToolset::tools() +
kosong::convert_tool encoder, the per-turn tools[] payload BEFORE (every
schema each turn) vs AFTER (the two dispatch tools + bounded index),
printing a per-server table and dumping JSON for independent tokenization.

Measured result (20 servers connected, 403 real tools, cl100k_base):
  BEFORE  59,671 tokens/turn (269,689 bytes)
  AFTER      946 tokens/turn (3,594 bytes; 2 tools + index)
  -> 63x fewer tokens, 98.4% less, and flat regardless of server count.

Run:
  python3 crates/hero_kimi_agent/tests/fixtures/mcp_bench_capture.py
  KIMI_MCP_BENCH=1 cargo test -p hero_kimi_agent --lib mcp_bench -- --nocapture

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

thabeta added 1 commit

2026-06-04 12:28:41 +00:00

Add MCP_IMPROVEMENTS.md doc 57757df8c3

omarz closed this pull request

2026-06-04 18:07:34 +00:00

mik-tf referenced this pull request from lhumina_code/home

2026-06-04 19:05:25 +00:00

Kimi assistant: trim the MCP tool surface so chat actions stay fast #249

Pull request closed

This pull request cannot be reopened because the branch was deleted.

feat(mcp): defer MCP tool schemas to slash per-turn context cost #3
