[nu-demo] hero_books re-runs LLM Q&A extraction despite pre-shipped .ai/<page>.toml cache — content_hash mismatch #26
## Symptom

On heronu, adding the 4 demo libraries (`docs_hero`, `docs_geomind`, `docs_mycelium`, `docs_owh`) to `libraries.txt` triggers hero_books to re-extract Q&A via LLM calls to hero_aibroker for every page in every collection. With ~300 pages across geomind/mycelium/ourworld, this takes 20-40 minutes and burns API tokens, even though each library already ships with pre-extracted Q&A.

Every cloned library has, per page:

- a `.ai/<page>.toml` with pre-extracted Q&A pairs and a recorded `content_hash`
- a `.vectors.bin` sidecar with pre-computed embeddings
Example: `docs_mycelium/collections/ai_platform_tech/.ai/0_tech_overview.toml` (~10 KB). This is exactly the format hero_books would produce; the extraction work has already been done upstream and committed.
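For orientation, a sketch of the rough shape such a file has. The real file is ~10 KB and the field names below are assumptions for illustration, not copied from it:

```toml
# Rough shape only; field names are assumed, not the verified schema.
content_hash = "<hash recorded by the upstream extraction pipeline>"

[[qa]]
question = "What problem does Mycelium solve?"
answer = "<extracted answer text>"

# ...more [[qa]] entries...
```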
## Current behavior (confirmed on heronu 2026-04-24)
The hero_books log shows a fresh LLM extraction for each page; the `[Q&A cached]` marker appears ONLY for pages in docs_hero. Every page in geomind / mycelium / ourworld is re-extracted.

## Root cause (hypothesis)
`hero_books_lib/src/ai/book.rs:241` computes `content_hash = compute_content_hash(&content)` on the exported `.md` file, i.e. the version hero_books writes into `/home/driver/hero/var/books/<ns>/books/<collection>/<page>.md` after its own post-processing step.

The pre-shipped `.ai/<page>.toml` was generated by an upstream pipeline (likely the heroscript → markdown conversion tool). Its `content_hash` was computed on a different input: probably the raw markdown before any normalization, OR the `.hero` source file before conversion, OR the rendered HTML output.

Result: hero_books's hash ≠ stored hash → cache miss → full LLM re-extraction.
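If the hypothesis is right, the failure mode is byte-level: any divergence between the upstream input and hero_books's exported `.md` flips the hash. A minimal sketch, assuming `compute_content_hash` is a plain SHA-256 over the bytes (the real implementation at `book.rs:241` may differ):

```rust
use sha2::{Digest, Sha256};

// Stand-in for compute_content_hash; assumed to be SHA-256 over the bytes.
fn compute_content_hash(content: &str) -> String {
    format!("{:x}", Sha256::digest(content.as_bytes()))
}

fn main() {
    // What the upstream pipeline may have hashed: the raw markdown as
    // committed in the library repo...
    let raw = "# Tech Overview\r\n\r\nMycelium is an overlay network.\r\n";
    // ...versus what hero_books hashes: its post-processed export, where
    // even a trivial normalization (here: CRLF -> LF) changes the bytes.
    let exported = raw.replace("\r\n", "\n");

    // Different input bytes => different hash => cache miss.
    assert_ne!(compute_content_hash(raw), compute_content_hash(&exported));
}
```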
The Q&A pairs in the stored TOML are still valid and well-formed. The `.vectors.bin` sidecars are likewise valid: they encode real embeddings. Throwing all of that away to call Claude again is pure waste.

## Proposed fixes (pragmatic → proper)
### 1. Trust-if-present mode

Add a config flag (default on for libraries cloned from remote repos).
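A hypothetical sketch of the flag and the resulting cache check; `trust_prefetched_qa`, `QaCache`, and `is_well_formed` are illustrative names, not existing hero_books APIs:

```rust
/// Hypothetical config knob; `trust_prefetched_qa` is an illustrative
/// name, not an existing hero_books option.
struct Config {
    /// When true, reuse a pre-shipped .ai/<page>.toml even if its stored
    /// content_hash does not match the freshly computed one.
    trust_prefetched_qa: bool,
}

/// Minimal stand-in for the parsed .ai/<page>.toml cache.
struct QaCache {
    content_hash: String,
    qa_pairs: Vec<(String, String)>, // (question, answer)
}

impl QaCache {
    /// "Well-formed" here just means the cache parsed and has Q&A pairs.
    fn is_well_formed(&self) -> bool {
        !self.qa_pairs.is_empty()
    }
}

fn should_reuse_cache(cfg: &Config, cached: &QaCache, current_hash: &str) -> bool {
    if cached.content_hash == current_hash {
        return true; // ordinary cache hit, unchanged behavior
    }
    // Trust-if-present: on a hash mismatch, prefer stale-but-valid Q&A
    // over a 20-40 minute LLM re-extraction.
    cfg.trust_prefetched_qa && cached.is_well_formed()
}
```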
### 2. Normalize content_hash computation

Match whatever the upstream pipeline does. If the upstream hashes the raw source (before rendering), hero_books should do the same. Read through how `docs_mycelium` was built (probably a `build.sh` or similar in the repo) and align; see the sketch below.
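If the audit confirms that upstream hashes the committed source bytes, the hero_books side would hash the file before post-processing rather than after. A minimal sketch under that (unverified) assumption:

```rust
use std::fs;
use sha2::{Digest, Sha256};

/// Hash the raw bytes of the source file as committed, before any
/// hero_books post-processing touches them. Assumes (unverified) that
/// this matches the upstream pipeline's rule.
fn source_content_hash(path: &str) -> std::io::Result<String> {
    Ok(format!("{:x}", Sha256::digest(fs::read(path)?)))
}
```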
### 3. Audit & document the `.ai/` contract

Define a canonical spec for what `.ai/<page>.toml` must contain and how `content_hash` is computed, and publish it as part of the `docs_hero` library contributor guide. Both the upstream generator and hero_books' consumer must follow the same rule; right now they're drifting. A sketch of what such a spec could pin down follows.
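For illustration only, a contract sketch; every field name and the hash rule here are assumptions to be replaced by whatever the audit establishes:

```toml
# Contract sketch only; field names and the hash rule are assumptions,
# not a published spec.
schema_version = 1

# One canonical rule shared by generator and consumer, e.g.:
# sha256 over the raw bytes of the committed source file.
content_hash = "sha256:<hex digest>"

[[qa]]
question = "<required, non-empty>"
answer = "<required, non-empty>"
```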
### 4. Reuse `.vectors.bin`

Even if Q&A extraction needs to re-run for some reason (e.g. schema migration), the pre-computed embeddings in `.vectors.bin` should always be reused: upload them directly to hero_embedder instead of re-computing. See the reader sketch below.
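The on-disk layout of `.vectors.bin` is not documented in this issue, so the reader below ASSUMES a hypothetical layout (u32 count, u32 dim, then row-major little-endian f32s) purely to illustrate the reuse path; verify the real format before handing the vectors to hero_embedder.

```rust
use std::fs::File;
use std::io::{self, Read};

/// Read a .vectors.bin sidecar, ASSUMING a hypothetical layout of two
/// little-endian u32 headers (vector count, dimension) followed by
/// count * dim little-endian f32 values. Verify the real format first.
fn read_vectors(path: &str) -> io::Result<Vec<Vec<f32>>> {
    let mut buf = Vec::new();
    File::open(path)?.read_to_end(&mut buf)?;

    let count = u32::from_le_bytes(buf[0..4].try_into().unwrap()) as usize;
    let dim = u32::from_le_bytes(buf[4..8].try_into().unwrap()) as usize;

    let mut out = Vec::with_capacity(count);
    for i in 0..count {
        let start = 8 + i * dim * 4;
        let row: Vec<f32> = buf[start..start + dim * 4]
            .chunks_exact(4)
            .map(|b| f32::from_le_bytes(b.try_into().unwrap()))
            .collect();
        out.push(row);
    }
    Ok(out)
}
```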
## Impact on demo

Every fresh deployment of the demo spends 20-40 minutes (and the corresponding API tokens) re-extracting Q&A after the libraries are added to `libraries.txt`, during which the AI Assistant can't ground answers in their content.

## Verification
After the fix, re-adding the libraries on a clean heronu should show:

- the `[Q&A cached]` marker for every page in all four libraries, not just docs_hero
- no LLM extraction calls to hero_aibroker for pages shipping a `.ai/<page>.toml`
- ingestion finishing in well under the current 20-40 minutes
## Related

- home#158 (original filing of this issue)

Signed-off-by: mik-tf

Originally filed as home#158 on 2026-04-24 by mik-tf; moved to hero_demo as part of consolidating issue tracking.