Voice AI Phase 2: cross-browser wake word + local Whisper STT #78
Context
Phase 1 (issue #74) delivered wake word detection using the browser `webkitSpeechRecognition` API, which is Chrome/Edge only. This phase makes wake word and STT work on ALL browsers by moving detection server-side.
Architecture
Level 1: Server-side wake word via Rustpotter
Rustpotter is a pure Rust wake word engine (~500KB model). Detects keywords without full STT.
- `rustpotter` crate to hero_voice
- `{"type": "wake_word", "word": "hero"}` via WebSocket
Level 2: Local Whisper STT via ONNX
Replace cloud Groq dependency with local Whisper inference. We already have `ort` (ONNX Runtime) in hero_embedder.
- `ort` dependency to hero_voice
- `HERO_VOICE_STT_LOCAL=true` env var to prefer local
Level 3: Client-side WASM wake word (future)
For fully offline client-side detection (no WebSocket needed):
Key decisions
Existing infrastructure we reuse
- `ort` crate (already in hero_embedder)
Phasing update
This is next-round work. The current round focuses on fixing basics first:
Once basics work → Phase 2 (this issue) for cross-browser wake word + local Whisper.
Starting now — server-side audio stack
#80 delivered conversation CRUD, voice input, auto-scroll, and the web-sys foundation. TTS playback was deferred to this issue because server-side audio is the production solution (no browser gesture fights).
Deliverables
Rustpotter wake word (hero_voice)
- `{"type":"wake_word"}` to client → triggers conversation mode
Local Whisper STT (hero_voice)
- `ort` crate + Whisper tiny ONNX model
Server TTS via WebSocket (hero_agent + hero_voice)
AudioWorkletNode (hero_archipelagos)
Repos
Build
`make dist-clean-wasm` (island changes) + model downloads in Docker
Signed-off-by: mik-tf
Ready to start — bundled scope
Current state (v0.7.5-dev on herodev)
Working: SSE chat, multiple messages, voice transcription, conversations CRUD, error dismiss, transcribing text reset, MCP 62 tools, 21/22 build
Broken: auto-scroll (#84), read aloud/TTS (#78), wake word Firefox (#78)
Deliverables for #78 (includes #84 fix)
1. Auto-scroll fix (#84) — 15 min
- `hero_archipelagos/archipelagos/intelligence/ai/src/views/message_list.rs`
- `set_scroll_top` fires before DOM update
- `request_animation_frame` callback via `web_sys::window().request_animation_frame()` + `Closure::once`
2. Server-side TTS via WebSocket (hero_agent + hero_voice)
- `event: done`, generate audio via aibroker (gpt-4o-mini-tts) and send base64 audio
3. Rustpotter wake word (hero_voice)
- `rustpotter` crate to hero_voice
- `.rpw` model for "Hero" / "Hey Hero"
- `{"type":"wake_word"}` to client
4. Local Whisper STT via ONNX (hero_voice)
- `ort` crate + Whisper tiny ONNX model
Architecture
Files to modify
- `hero_archipelagos/archipelagos/intelligence/ai/src/views/message_list.rs`: auto-scroll
- `hero_archipelagos/archipelagos/intelligence/ai/src/island.rs`: TTS audio playback from SSE
- `hero_archipelagos/archipelagos/intelligence/ai/src/voice.rs`: play received audio
- `hero_agent/crates/hero_agent_server/src/routes.rs`: add TTS to done event
- `hero_voice/crates/hero_voice/src/`: Rustpotter, Whisper ONNX
- `hero_services/docker/build-local.sh`: model downloads
Build
`make dist-clean-wasm` (island changes) + model downloads in Docker
Pipeline
branch → code → build → `make test-local` 20/20 → squash merge → deploy → verify
Signed-off-by: mik-tf
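Deliverable 2 above sends base64 audio to the client once a reply is done. A minimal sketch of how such an event could be framed is shown here, assuming an SSE event named `audio` (the name appears in a later comment's key decisions) and a hand-rolled encoder so the example needs no crates. The function names `base64_encode` and `sse_audio_event` are hypothetical and not taken from hero_agent.

```rust
/// Standard base64 alphabet (RFC 4648).
const B64: &[u8; 64] = b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/// Encode raw bytes as padded base64.
fn base64_encode(data: &[u8]) -> String {
    let mut out = String::with_capacity((data.len() + 2) / 3 * 4);
    for chunk in data.chunks(3) {
        // Pack up to three bytes into a 24-bit group; missing bytes count as zero.
        let b0 = chunk[0] as u32;
        let b1 = *chunk.get(1).unwrap_or(&0) as u32;
        let b2 = *chunk.get(2).unwrap_or(&0) as u32;
        let n = (b0 << 16) | (b1 << 8) | b2;
        out.push(B64[(n >> 18) as usize & 63] as char);
        out.push(B64[(n >> 12) as usize & 63] as char);
        // Emit '=' padding for the positions that had no input byte.
        out.push(if chunk.len() > 1 { B64[(n >> 6) as usize & 63] as char } else { '=' });
        out.push(if chunk.len() > 2 { B64[n as usize & 63] as char } else { '=' });
    }
    out
}

/// Frame audio bytes as one SSE event. Assumption: the event is called
/// `audio` and carries the base64 payload on a single `data:` line.
fn sse_audio_event(audio: &[u8]) -> String {
    format!("event: audio\ndata: {}\n\n", base64_encode(audio))
}
```

In a real service a maintained base64 crate would likely replace the hand-written encoder; it is inlined here only to keep the sketch self-contained.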
Implementation Design — Decided & In Progress
Full voice AI pipeline for v0.7.5-dev. All 6 deliverables coded, pending build + test.
Architecture
Deliverables
Files per deliverable:
- `message_list.rs`
- `routes.rs`, `ai_service.rs`, `voice.rs`, `island.rs`
- `wakeword.rs` (new), `ws.rs`, `island.rs`
- `local_transcriber.rs` (new), `ws.rs`, `build-local.sh`
- `island.rs`, `voice.rs`
- `voice.rs`, `island.rs`
Key Decisions
- `.rpw` on disk
- `HERO_VOICE_STT_LOCAL=true` enables local
- `event: audio`
- `requestAnimationFrame` wrapper
Repos Touched
- `hero_archipelagos`: 4 files (island, voice, ai_service, message_list)
- `hero_agent`: 1 file (routes.rs)
- `hero_voice`: 5 files + 2 new (wakeword.rs, local_transcriber.rs)
- `hero_services`: 1 file (build-local.sh)
Build
`make dist-clean-wasm` required (island changes + new server modules).
Signed-off-by: mik-tf
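The `HERO_VOICE_STT_LOCAL=true` decision above could be read as a simple backend switch. The sketch below is one possible interpretation, not the hero_voice implementation: the `SttBackend` enum and both function names are hypothetical, and it assumes that only the literal value `true` (trimmed, case-insensitive) selects local Whisper while anything else keeps the cloud path.

```rust
/// Which speech-to-text backend to use (hypothetical type for illustration).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum SttBackend {
    LocalWhisper,
    CloudGroq,
}

/// Decide the backend from the raw value of HERO_VOICE_STT_LOCAL.
/// Assumption: only "true" enables local STT; unset or any other value
/// falls back to the cloud backend.
fn stt_backend_from(value: Option<&str>) -> SttBackend {
    match value.map(|v| v.trim().to_ascii_lowercase()) {
        Some(v) if v == "true" => SttBackend::LocalWhisper,
        _ => SttBackend::CloudGroq,
    }
}

/// Read the environment variable and apply the rule above.
fn stt_backend() -> SttBackend {
    stt_backend_from(std::env::var("HERO_VOICE_STT_LOCAL").ok().as_deref())
}
```

Keeping the decision in a pure function that takes `Option<&str>` makes the rule testable without mutating the process environment.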
Updated Design — Wake Word UX (Industry Standard Pattern)
Two distinct voice modes
Wake command flow
Key decisions
Server protocol (ws.rs)
- `{type: "listen"}`: passive mode, VAD + transcribe, only wake detection
- `{type: "wake_word", command: "what services are running"}`
- `{type: "wake_word", command: null}`
Implementation status
Build: 21/22 (only hero_compute fails, pre-existing #83)
Signed-off-by: mik-tf
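The server protocol in this comment distinguishes a wake word followed by a command from a bare wake word (`command: null`). A rough sketch of that split is shown below. It assumes the transcript begins with the wake phrase and that "hero" and "hey hero" are the phrases; `parse_wake` and `wake_message` are hypothetical names, and the JSON is built by hand with minimal escaping only to avoid external crates (a JSON library would be the safer choice in practice).

```rust
/// Wake phrases, longest first so "hey hero" wins over "hero".
const WAKE_PHRASES: [&str; 2] = ["hey hero", "hero"];

/// Outer None: no wake phrase at the start of the transcript.
/// Some(None): wake phrase only. Some(Some(cmd)): wake phrase plus command.
/// Note: the returned command is lowercased because matching is done on a
/// lowercased copy of the transcript.
fn parse_wake(transcript: &str) -> Option<Option<String>> {
    let lower = transcript.trim().to_lowercase();
    for &phrase in WAKE_PHRASES.iter() {
        if let Some(rest) = lower.strip_prefix(phrase) {
            // Require a word boundary so "heroic" does not trigger.
            if rest.chars().next().map_or(true, |c| !c.is_alphanumeric()) {
                // Drop punctuation and spaces around the command.
                let cmd = rest.trim_matches(|c: char| !c.is_alphanumeric());
                return Some(if cmd.is_empty() { None } else { Some(cmd.to_string()) });
            }
        }
    }
    None
}

/// Build the wake_word message. Only backslashes and double quotes are
/// escaped here; control characters are not handled.
fn wake_message(command: Option<&str>) -> String {
    match command {
        Some(cmd) => {
            let escaped = cmd.replace('\\', "\\\\").replace('"', "\\\"");
            format!(r#"{{"type":"wake_word","command":"{}"}}"#, escaped)
        }
        None => r#"{"type":"wake_word","command":null}"#.to_string(),
    }
}
```

For example, `parse_wake("Hey Hero, what services are running?")` yields the command `what services are running`, which `wake_message` then wraps in the message shape listed above.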
v0.7.6-dev Status — Voice Pipeline
Deployed on herodev
What's built and compiles
What's broken — TTS runtime
hero_agent calls Groq Orpheus TTS via `reqwest` but gets 401 Unauthorized inside Docker. The same API key works with `curl` from the same container. Root cause: `reqwest` uses the `hyper-rustls` TLS backend, which behaves differently from OpenSSL/curl for the Groq API auth.
Fix options for next session:
- `native-tls` (uses OpenSSL)
What's blocked: dependency conflicts
Next steps
Signed-off-by: mik-tf
Updated Plan — Voice Pipeline Final Architecture
Settings UX
API keys stay ONLY in Environment tab (already has Groq). Voice tab picks which provider to USE. No duplication.
Voice & Audio tab layout
TTS priority: local first, cloud fallback
OS-wide voice service
hero_voice_ui becomes the voice gateway. Any island calls `POST /hero_voice_ui/api/tts` with text + provider preference → gets audio back.
Implementation steps
Signed-off-by: mik-tf
Consolidated Plan — Complete Voice Pipeline
Root cause: `ort` version alignment
One dependency fix unblocks BOTH Kokoro local TTS and Rustpotter server-side wake word:
Fix: Fork `voice_activity_detector`, pin to `ort 2.0.0-rc.11`. This unblocks Kokoro immediately. Rustpotter needs a separate fork (candle-core → 0.8+).
Full local voice pipeline (target)
Zero API calls for voice processing. Only the LLM requires cloud.
Settings page — Voice & Audio tab (NEW)
Added between Appearance and Environment tabs:
API keys stay in Environment tab only (already has Groq). No duplication.
Implementation steps (ordered)
Related issues
Signed-off-by: mik-tf
Session End Status — 2026-03-24
What's done (code written, compiles, on disk)
What's NOT done
Known build issues
Voice names — need fixing
Current: OpenAI names (Alloy/Echo/Fable/Shimmer) mapped to Groq (diana/austin/hannah/autumn)
Target: Dynamic dropdown from active provider:
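The current and target behaviour for voice names can be contrasted in a small sketch. It assumes the mapping pairs the two lists positionally (Alloy with diana, Echo with austin, and so on), which the comment does not state explicitly. The `Provider` enum, `voices_for`, the fallback voice, and the signature given to `orpheus_voice` are all hypothetical.

```rust
/// Hypothetical provider enum for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Provider {
    OpenAi,
    Groq,
}

/// Current behaviour: translate an OpenAI-style name to a Groq voice.
/// Assumption: the pairing is positional; unknown names fall back to "diana".
fn orpheus_voice(openai_name: &str) -> &'static str {
    match openai_name.to_ascii_lowercase().as_str() {
        "alloy" => "diana",
        "echo" => "austin",
        "fable" => "hannah",
        "shimmer" => "autumn",
        _ => "diana", // hypothetical default
    }
}

/// Target behaviour: the dropdown lists the active provider's own voices,
/// so no cross-provider translation is needed.
fn voices_for(provider: Provider) -> &'static [&'static str] {
    match provider {
        Provider::OpenAi => &["Alloy", "Echo", "Fable", "Shimmer"],
        Provider::Groq => &["diana", "austin", "hannah", "autumn"],
    }
}
```

With a per-provider list, the settings UI can populate the dropdown from `voices_for(active_provider)` and pass the chosen name through unchanged.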
Next session should
Signed-off-by: mik-tf
Kokoro Unblocked — No Fork Needed
The `ort` version conflict is resolved by replacing the VAD crate, not forking anything.
Root cause
Solution: `earshot 1.0.0` (pure Rust VAD)
Replace `voice_activity_detector` with `earshot`:
- `ort` dependency
`silero-vad-rust` was considered but uses `ort 2.0.0-rc.10`, which still conflicts. `earshot` is the only conflict-free option.
Verified compatibility
Full local pipeline (no API calls for voice)
Signed-off-by: mik-tf
Phase 1-2 Progress: v0.7.0-dev deployed to herodev
Code review + fixes across 4 repos
hero_archipelagos:
- (`binary.bytes()` → `binary.chars()` for bytes > 127)
hero_agent:
- `orpheus_voice()` helper to deduplicate voice mapping
hero_voice:
- (`tokio::spawn` → `spawn_blocking`)
hero_services:
Build & deploy
Next: browser verification
Signed-off-by: mik-tf
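The `binary.bytes()` → `binary.chars()` fix listed under hero_archipelagos above matters whenever a Rust `String` holds a "binary string", meaning one character per byte with code points 0 to 255. The sketch below assumes that is the situation (for example, text returned by a browser base64 decoder); the comment does not say where `binary` comes from, and the helper name is hypothetical.

```rust
/// Convert a "binary string" (one char per byte, code points 0..=255)
/// into raw bytes. Code points above 255 would be truncated by `as u8`,
/// so this is only valid for input that really is a binary string.
fn binary_string_to_bytes(binary: &str) -> Vec<u8> {
    binary.chars().map(|c| c as u8).collect()
}

/// The buggy variant for comparison: `.bytes()` yields the UTF-8 encoding,
/// so every character above 127 turns into two bytes.
fn binary_string_to_bytes_buggy(binary: &str) -> Vec<u8> {
    binary.bytes().collect()
}
```

For the input `"\u{00FF}A"`, the correct variant returns `[0xFF, 0x41]`, while the buggy one returns `[0xC3, 0xBF, 0x41]` because U+00FF is stored as two UTF-8 bytes.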
Phase 1-2 Complete — v0.7.0-dev released
Repos squash-merged to development
8a7a4fedf95b38ada6867b9e6e3d3b89040 (lockfile fix for #83)
Release
https://forge.ourworld.tf/lhumina_code/hero_services/releases/tag/v0.7.0-dev
Verified working
Remaining (tracked in #87)
Signed-off-by: mik-tf
v0.7.1-dev deployed to herodev
Phase 3 complete:
Repos: hero_voice, hero_agent, hero_archipelagos, hero_os, hero_services
Tests: 112 smoke + 20 integration, 0 failures
Signed-off-by: mik-tf
Complete in v0.7.1-dev: earshot VAD (pure Rust), kokoro-micro TTS (54+ voices), local Whisper STT, 3-tier routing (Kokoro→Groq→aibroker), sentence-level streaming, trackbar with pause/play/stop, Settings Voice & Audio tab. 164 tests, 0 failures.
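The 3-tier routing (Kokoro→Groq→aibroker) can be pictured as trying each provider in order and returning the first audio that succeeds. The sketch below is an assumption about the control flow only: `TtsProvider`, `TTS_ORDER`, and `synthesize_with_fallback` are hypothetical names, and the actual synthesis call is passed in as a closure so the example stays free of network and model code.

```rust
/// Providers in the priority order named in the summary above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TtsProvider {
    Kokoro,
    Groq,
    AiBroker,
}

const TTS_ORDER: [TtsProvider; 3] =
    [TtsProvider::Kokoro, TtsProvider::Groq, TtsProvider::AiBroker];

/// Try each provider in order; return the first success together with the
/// provider that produced it, or all collected errors if every tier fails.
fn synthesize_with_fallback<F>(text: &str, mut synth: F) -> Result<(TtsProvider, Vec<u8>), String>
where
    F: FnMut(TtsProvider, &str) -> Result<Vec<u8>, String>,
{
    let mut errors = Vec::new();
    for &provider in TTS_ORDER.iter() {
        match synth(provider, text) {
            Ok(audio) => return Ok((provider, audio)),
            Err(e) => errors.push(format!("{:?}: {}", provider, e)),
        }
    }
    Err(errors.join("; "))
}
```

Returning the provider alongside the audio lets the caller log or display which tier actually served the request.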
Signed-off-by: mik-tf